Clinical Natural Language Processing Workshop at COLING 2016

Osaka, Japan. December 2016.


One of the main obstacles to NLP research in the clinical domain is data access. On this page, we will assemble links to existing data sets (both raw and annotated) that are currently available to the general public.


  • MIMIC

    MIMIC is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with more than 40,000 critical care patients. In addition to structured clinical data (demographics, vital signs, laboratory tests, medications, etc.), it contains over 2 million free-text notes written by nurses, physicians, specialists, and other care providers.

  • i2b2 NLP Research Data Sets

    In an effort to provide annotated data for a variety of NLP tasks in the clinical domain, the i2b2 (Informatics for Integrating Biology and the Bedside) project has organized a yearly series of shared tasks, starting in 2006. Each year, several hundred fully deidentified notes from the Research Patient Data Repository at Partners HealthCare are annotated for that year's task and released to the research community. To date, these efforts have covered a variety of tasks, including de-identification, named entity and relation extraction, negation and modality, co-reference resolution, temporal information extraction, and others. The notes from each i2b2 shared task are released to the research community at large, under the appropriate data use agreements, on the one-year anniversary of the task's completion. The data from previous shared tasks up through 2014 are available as i2b2 NLP Research Data Sets from the i2b2 project website.

  • ShARe/CLEF eHealth

    The Sharing Annotated Resources (ShARe) / Conference and Labs of the Evaluation Forum (CLEF) eHealth lab included shared tasks on disease/disorder named entity recognition, normalization of named entities to the Unified Medical Language System (UMLS), and disease/disorder template filling.

  • SemEval

    Several shared tasks in the clinical domain have been organized as a part of the yearly SemEval competitions. These include:

    The data for SemEval shared tasks typically becomes available once the tasks are complete.

  • MedNLPDoc

    Medical Natural Language Processing for Clinical Document (MedNLPDoc) has run three shared tasks on processing Japanese clinical records. The tasks included named entity recognition, term normalization, and International Classification of Diseases (ICD) disease name identification.