The National Cancer Institute’s (NCI) Surveillance, Epidemiology, and End Results (SEER) Program is collaborating with the Department of Energy (DOE) on a 5-year pilot project that focuses on the use of high-performance computing to support cancer surveillance. Pilot 3 of the NCI-DOE Collaboration applies advanced computational capabilities and deep learning methods to population-based cancer data to understand the impact of new diagnostics, treatments, and other factors affecting patient outcomes.
Members and stakeholders of the collaboration came together for a two-day hackathon at the DOE’s Oak Ridge National Laboratory in Knoxville, Tennessee on September 10-11, 2019. The hackathon included a hands-on review of algorithms to improve efficiency of cancer registries and discussions of the next steps for implementation of the new tools in the registry workflow. The event also involved in-depth discussions on project focus areas, including privacy-aware computing, model sharing, recurrence capture, and clinical trials. Find more information about each of these topic areas below:
Hands-on Review of Tools
The hackathon included a hands-on review of the algorithms created by the Pilot 3 DOE team. Two algorithms have been developed: one to capture recurrence (Recurrence algorithm), and another to help distinguish reportable and non-reportable cancer cases (Reportability algorithm). These algorithms are in their initial stages. They will go through rigorous testing, both to incorporate more registry data and to ensure they are robust and adaptable enough to meet the needs of all registries. Because cancer registrars must currently read pathology documents and manually identify data from various reports, these tools will reduce their burden. The tools will allow registrars to focus on the pathology reports that the algorithms flag for manual or further review. Registrars will also be able to focus their attention on abstracting more complex variables that the algorithms are not yet trained on.
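As a rough illustration of the flag-for-review idea described above, the following minimal sketch routes pathology reports into queues based on a model score. It is not the Pilot 3 implementation: the `Prediction` type, the `p_reportable` score, and the `low` and `high` thresholds are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Prediction:
    report_id: str
    p_reportable: float  # hypothetical model score in [0, 1]


def triage(predictions: Iterable[Prediction],
           low: float = 0.1,
           high: float = 0.9) -> Dict[str, List[str]]:
    """Split reports into confident calls and a manual-review queue.

    The thresholds are illustrative assumptions, not values from the project.
    """
    queues: Dict[str, List[str]] = {
        "reportable": [],
        "non_reportable": [],
        "manual_review": [],
    }
    for p in predictions:
        if p.p_reportable >= high:
            queues["reportable"].append(p.report_id)
        elif p.p_reportable <= low:
            queues["non_reportable"].append(p.report_id)
        else:
            # Uncertain scores are the ones a registrar would look at.
            queues["manual_review"].append(p.report_id)
    return queues


if __name__ == "__main__":
    sample = [
        Prediction("r1", 0.97),
        Prediction("r2", 0.03),
        Prediction("r3", 0.55),
    ]
    print(triage(sample))
```

In a setup like this, widening the gap between the two thresholds sends more reports to registrars and fewer to automatic handling, which is the kind of trade-off a workflow study could help tune.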
Tool Implementation in the Registry Workflow
Hackathon participants discussed testing and implementation of the tools that have been created through this collaboration. The tools will capture site, histology, laterality, behavior, and grade variables. The proposed next steps include developing a study to capture the interaction between registrars and the tools. This study will inform the most effective, accurate, and powerful way to integrate the tools into the pathology-reporting workflow in the cancer registry environment.
Promoting Privacy-Aware Computing and Model Sharing
One key presentation addressed privacy-aware computing. Although collaboration among cancer registries is essential to fully exploit the promise of deep learning, privacy and confidentiality challenges are the main obstacles to data and model sharing across cancer registries. Pilot 3 is developing different privacy-preserving approaches that protect data and models. These approaches will offer secure distribution of deep learning models and enable collaboration between registries and researchers without sharing sensitive data.
Capturing Cancer Recurrence
A significant discussion at the hackathon involved the capture of cancer recurrence, since it requires a broad and multidisciplinary approach and represents one of the most challenging problems for the field of cancer surveillance. This challenge is due to the complexity of how recurrence is defined and diagnosed, the variation in time interval to recurrence, the broad set of specialties and health care settings in which recurrence can be diagnosed, and the longitudinal component required to understand recurrence.
Pilot 3 is taking a multi-pronged approach to capture recurrence, using current datasets to begin developing multiple algorithms. The initial algorithm from the DOE labs utilizes a classifier to identify malignant cases and clinical domain knowledge to capture recurrence cases. Registrars are reviewing validation results from this initial algorithm to determine accuracy and next steps. The Pilot will expand recurrence efforts by leveraging ongoing data capture from registrars and annotating new data elements.
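The combination of a classifier with a domain rule, as described above, can be sketched in a few lines. This is only an illustration under stated assumptions, not the DOE labs' algorithm: the `PathReport` type, the `p_malignant` score, the score threshold, and the 180-day gap are hypothetical placeholders and carry no clinical meaning.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass
class PathReport:
    report_date: date
    p_malignant: float  # hypothetical classifier score in [0, 1]


def flag_possible_recurrence(reports: Iterable[PathReport],
                             threshold: float = 0.5,
                             min_gap_days: int = 180) -> List[date]:
    """Return dates of malignant reports that fall at least `min_gap_days`
    after the patient's first malignant report.

    Step 1 (classifier): keep reports scored as malignant.
    Step 2 (placeholder domain rule): treat a later malignant report,
    separated by a long enough gap, as a candidate for registrar review.
    """
    malignant_dates = sorted(
        r.report_date for r in reports if r.p_malignant >= threshold
    )
    if not malignant_dates:
        return []
    first = malignant_dates[0]
    return [d for d in malignant_dates[1:] if (d - first).days >= min_gap_days]


if __name__ == "__main__":
    history = [
        PathReport(date(2018, 1, 10), 0.95),
        PathReport(date(2018, 2, 1), 0.90),   # close to the first report
        PathReport(date(2019, 3, 1), 0.80),   # well after the first report
        PathReport(date(2019, 6, 1), 0.10),   # scored as not malignant
    ]
    print(flag_possible_recurrence(history))
```

The output of a rule like this would be candidates for registrar validation rather than confirmed recurrences, in line with the review step mentioned above.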
Improving the Capture of Clinical Trials
The hackathon included multiple streams of discussion on ways in which high-performance computing could improve the conduct and analysis of clinical trials. One focus of conversation was how to integrate our deep learning tools into rapid case ascertainment for clinical trials, cancer survivorship initiatives, and other research studies and interventions. Rapid case ascertainment enables research studies and clinical trials to identify eligible patients in a timely manner, so that they may be able to participate in trials shortly after diagnosis or after a recurrence.
The second focus of discussion was on how to utilize a knowledge graph to assess the feasibility of a clinical trial’s success. To determine the potential value of this method, the team ran research pilots to make a preliminary evaluation of this process on several clinical trials. The NCI-DOE team will continue to work on this effort in collaboration with NCI personnel from the Division of Cancer Treatment and Diagnosis (DCTD). This effort is important in meeting the goals of the National Clinical Trials Network (NCTN), which NCI launched in 2014. The NCTN aims to facilitate the rapid initiation and completion of cancer clinical trials based on improvements in data management infrastructure, the development of a standardized process for prioritizing new studies, consolidation of its component research groups to improve efficiency, and the implementation of a unified system of research subject protection at over 3,000 clinical trial sites. Initial efforts have begun to explore areas of collaboration with the Clinical Trials Working Group to find avenues for integrating Pilot 3’s efforts into the NCTN.
The third focus of discussion was expanding the Pilot 3 natural language processing (NLP) algorithms from cancer surveillance data to other clinical trial data. The team discussed whether the algorithm can be modified and applied to automatically extract structured data from the unstructured text documents that currently exist in ClinicalTrials.gov. This website, a part of NIH’s National Library of Medicine, hosts a database of publicly and privately funded clinical studies (including those sponsored by the NIH) from all 50 states and around the world. The database includes a wide range of diseases and conditions. Structured data representing key eligibility criteria is necessary to facilitate automated matching and more rapid accrual to trials. This is a challenging goal for the NCI and one that will be an area of focus for the DOE collaboration in Pilot 3.
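To show why structured eligibility criteria matter for automated matching, here is a minimal sketch of matching one case against several trials once the criteria are available as fields. Everything in it is assumed for illustration: the `Criteria` and `Case` types, the trial identifiers, and the choice of age and primary site as the only criteria. It does not reflect the ClinicalTrials.gov data model or the Pilot 3 NLP output.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Criteria:
    trial_id: str               # made-up identifier for this example
    min_age: int
    max_age: Optional[int]      # None means no upper age limit
    sites: FrozenSet[str]       # accepted primary site codes


@dataclass(frozen=True)
class Case:
    age: int
    site: str                   # primary site code for the case


def is_eligible(case: Case, crit: Criteria) -> bool:
    """Check a case against one trial's structured criteria."""
    if case.age < crit.min_age:
        return False
    if crit.max_age is not None and case.age > crit.max_age:
        return False
    return case.site in crit.sites


def match_trials(case: Case, trials: Iterable[Criteria]) -> List[str]:
    """Return the IDs of all trials whose criteria the case satisfies."""
    return [t.trial_id for t in trials if is_eligible(case, t)]


if __name__ == "__main__":
    trials = [
        Criteria("TRIAL-A", 18, None, frozenset({"C50"})),
        Criteria("TRIAL-B", 18, 65, frozenset({"C34", "C50"})),
        Criteria("TRIAL-C", 40, 75, frozenset({"C34"})),
    ]
    print(match_trials(Case(age=70, site="C50"), trials))
```

With free-text criteria, each of these checks would require a person to read the trial record. Once the criteria are extracted into fields, the comparison becomes a simple filter that can run as soon as a case is ascertained.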
The hackathon helped collaborators discuss important issues relating to not only the collaboration but also broader issues within cancer surveillance. The NCI-DOE team looks forward to integrating these efforts to enhance the SEER Program and support SEER registries and the cancer surveillance community.