The acquisition of diagnostic, treatment, and outcomes information on cancer cases for population-based cancer surveillance currently involves a tremendous amount of manual data abstraction and information processing by expert staff. A majority (an estimated 65%) of the clinical data elements needed to characterize cancer patients come from unstructured sources (e.g., pathology reports, radiology notes, treatment summaries, clinical visit notes). Many hospital-based cancer registries that abstract and report cancer cases to central cancer registries at the state level rely on manual data abstraction from document-based medical records. Central cancer registry staff also perform manual data processing to find additional cases, consolidate records, correct data errors, and fill gaps. Manual processes inherently limit the volume and types of information that registries can collect. Furthermore, with the increasing complexity of cancer care, staff may not have the resources to gather data such as biomarkers, treatment details, and disease progression and recurrence. To improve the overall efficiency and quality of data abstraction and processing for cancer registries, and to enable acquisition of more detailed clinical data that may not be reported currently, the NCI Surveillance Research Program (SRP) is piloting the use of natural language processing (NLP) and deep learning tools and methods.
In the cancer surveillance context, NLP is the application of linguistics and computer science to extract and interpret linguistic information from health care documents (e.g., pathology reports, radiology reports, treatment summaries, clinical notes) that are created in electronic medical records systems where patients receive cancer care. NLP engineers or data scientists can train computer algorithms to complete tasks such as de-identification (e.g., finding data in text documents that could potentially identify a person and removing it or replacing it with realistic but artificial data), classification (e.g., determining whether a pathology case contains a reportable cancer diagnosis), or information extraction (e.g., pulling out results of tissue biomarker assays from pathology reports). Practically, data abstraction, data quality assurance, and/or searches for low-frequency data may be assisted and improved through application of NLP tools and methods.
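The three tasks named above (de-identification, classification, and information extraction) can be illustrated with a deliberately simple sketch. It uses hand-written regular expressions and keyword matching rather than trained models, so it is a stand-in for the idea and not the approach used by SRP or its partners. Every pattern, keyword list, function name, and the sample report text below is a hypothetical assumption made for illustration.

```python
import re

# Hypothetical patterns, for illustration only. Real de-identification must
# cover many more identifier types (names, addresses, phone numbers, etc.).
MRN_PATTERN = re.compile(r"\bMRN[:\s]*\d+\b")
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

# Hypothetical keyword list, not an official list of reportable diagnoses.
REPORTABLE_TERMS = ("carcinoma", "melanoma", "lymphoma", "sarcoma")

# Hypothetical pattern for a few breast cancer biomarkers and their status.
BIOMARKER_PATTERN = re.compile(
    r"\b(ER|PR|HER2)\s*[:\-]?\s*(positive|negative|equivocal)\b",
    re.IGNORECASE,
)


def deidentify(text):
    """Replace record numbers and dates with artificial placeholder values."""
    text = MRN_PATTERN.sub("MRN: 0000000", text)
    return DATE_PATTERN.sub("01/01/1900", text)


def is_reportable(text):
    """Flag a report if it mentions any keyword (naive: ignores negation)."""
    lowered = text.lower()
    return any(term in lowered for term in REPORTABLE_TERMS)


def extract_biomarkers(text):
    """Return a mapping of biomarker name to reported status."""
    return {
        name.upper(): status.lower()
        for name, status in BIOMARKER_PATTERN.findall(text)
    }


# Made-up sample text, not drawn from any real record.
report = (
    "MRN: 1234567. Collected 03/14/2021. Invasive ductal carcinoma. "
    "ER: positive, PR: negative, HER2 negative."
)

clean = deidentify(report)
flag = is_reportable(clean)
markers = extract_biomarkers(clean)
```

Rules like these are brittle: the keyword check would also flag a phrase such as "no evidence of carcinoma", which is one reason trained models are of interest for these tasks.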
The NCI SRP Surveillance Informatics Branch is working with multiple partners to set up a scalable platform for training, validating, and enabling the use of NLP to enhance the efficacy, efficiency, and quality of data extraction for cancer registries and cancer surveillance.