Clinical Natural Language Processing Workshop at COLING 2016

Resources

One of the main obstacles to NLP research in the clinical domain is data access. On this page, we will assemble links to existing data sets (both raw and annotated) that are currently available to the general public.

MIMIC-III
MIMIC is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with more than 40,000 critical care patients. In addition to structured clinical data (demographics, vital signs, laboratory tests, medications, etc.), it contains over 2 million free-text notes written by nurses, physicians, specialists, and other care providers.
i2b2 NLP Research Data Sets
In an effort to provide annotated data for a variety of NLP tasks in the clinical domain, the i2b2 (Informatics for Integrating Biology and the Bedside) project has organized a yearly series of shared tasks, starting in 2006. Each year, several hundred fully deidentified notes from the Research Patient Data Repository at Partners HealthCare are annotated for that year's task and released to the research community. To date, these efforts have covered de-identification, named entity and relation extraction, negation and modality, co-reference resolution, temporal information extraction, and other tasks. The notes from each i2b2 shared task are released to the research community at large, under the appropriate data use agreements, on the one-year anniversary of the task's completion. The data from previous shared tasks up through 2014 are available as i2b2 NLP Research Data Sets from the i2b2 project website.
ShARe/CLEF eHealth
The Sharing Annotated Resources (ShARe) / Conference and Labs of the Evaluation Forum (CLEF) eHealth evaluations have included shared tasks on disease/disorder named entity recognition, normalization of named entities to the Unified Medical Language System (UMLS), and disease/disorder template filling.
- ShARe/CLEF 2013
- ShARe/CLEF 2014
- CLEF Health 2015
- CLEF Health 2016
- A more comprehensive list can be found at: CLEF Data Sets

SemEval
Several shared tasks in the clinical domain have been organized as a part of the yearly SemEval competitions. These include:
The data for SemEval shared tasks is typically available after the tasks complete.
MedNLPDoc
Medical Natural Language Processing for Clinical Document (MedNLPDoc) has run three shared tasks on processing Japanese clinical records. The tasks included named entity recognition, term normalization, and International Classification of Diseases (ICD) disease name identification.

Osaka, Japan. December 2016.
