The free-form portions of clinical notes are a significant source of information for research. One path for protecting patient privacy is to fully de-identify this information before sharing it for research purposes. De-identification efforts have focused on named entities and other known identifier types (names, ages, dates, addresses, IDs, etc.). However, a note may contain residual "Demographic Traits" (DTs) that are unique enough to identify the patient when combined with other facts. While we believe that re-identification is not possible with these demographic traits alone, we hope that giving healthcare organizations the option to remove them will strengthen the privacy standards of automatic de-identification systems and bolster these organizations' confidence in such systems.
More specifically, this dataset was used to evaluate the pipeline described in our paper, "Active Deep Learning to Detect Demographic Traits in Free-Form Clinical Notes." We evaluated the pipeline on a subset of the i2b2 2006 and MIMIC-III datasets. See the "Annotations Guide" file for more information.