Toward Precision Cancer Surveillance

An integrated team met on March 28–30, 2017, to continue its work on the NCI-DOE Pilot 3 collaboration. The team included NCI’s Surveillance, Epidemiology, and End Results (SEER) Program; Information Management Systems (IMS); four SEER registries; and four Department of Energy (DOE) labs: Oak Ridge National Laboratory (ORNL), Lawrence Livermore National Laboratory, Los Alamos National Laboratory, and Argonne National Laboratory. This partnership will enhance cancer research by combining the DOE’s expertise in high-performance computing with SEER’s expertise in cancer surveillance. The meeting focused on the progress made on Aim 1 and Aim 2 of the pilot, as well as future goals for Aim 3.

The goal of Aim 1 is to create natural language processing (NLP) and machine learning tools that can accurately capture information from unstructured clinical text for expanded cancer surveillance data reporting. The collaboration team has completed development of a Clinical Document Annotation and Processing (CDAP) pipeline, which will help generate the annotated data sets used to train the NLP tools. To date, the DOE labs have created a prototype NLP tool using a de-identified set of pathology reports representing four cancer sites: lung, prostate, colorectal, and breast. Annotation of data elements that could be abstracted from pathology reports, such as clinical biomarkers for lung cancer, is underway to support development of additional NLP tools. During the meeting, the team continued to identify and define additional data elements within pathology reports that will broaden the scope of the NLP tools.

The goal of Aim 2 is to collect cancer surveillance data from multiple sources (e.g., insurance claims, pharmacy records, and electronic health records) to create detailed longitudinal patient trajectories. The trajectories will be used to identify patterns in patients’ diagnoses and treatment and thereby increase the understanding of cancer outcomes. The Aim 2 team is exploring potential data models for scalable processing, visualization, and analysis of patient trajectories, and is developing a list of variables that need to be acquired from multiple data sources. The team is also preparing a systematic review of literature sources to further develop a common data model. Timelines and visualization tools for patient trajectories were highlighted during the meeting as potential scalable methods to illustrate and explore patterns in care. Work to develop definitions for cancer “recurrence” and “progression” was also discussed.

The goal of Aim 3 is to build statistical models that can predict the clinical course and treatment outcomes for each cancer type. NCI is facilitating development of research questions that are currently being reviewed by cancer experts. These questions will be used to create disease-specific use cases for scalable predictive modeling and analytics using a variety of integrated data sets, including information extracted from unstructured clinical documents.

Many of NCI’s Pilot 3 team members, including Tanmoy Bhattacharya, Jessica Boten, and Donna Rivera, will present on the Pilot 3 collaboration at the North American Association of Central Cancer Registries (NAACCR) Conference in Albuquerque, New Mexico, from June 16 to 23, 2017.