Clinical stage is among the most important metrics in oncology care and research, as it determines treatment pathways and prognosis.

However, stage is rarely recorded in structured form in the electronic health record (EHR). There is an emerging catalog of AI tools for extracting meaningful insights from sparse and heterogeneous EHR datasets. The Automated PHenotype Routine for Observational Definition, Identification, Training and Evaluation (APHRODITE) is an R package that builds phenotype classifiers using structured data in the OMOP (Observational Medical Outcomes Partnership) Common Data Model. Our goal was to predict cancer stage among patients seen at our academic medical center between 2005 and 2016. We have developed an EHR-based prostate cancer cohort that is linked with the state cancer registry. We combined structured EHR data, including patient demographics, diagnosis and procedure codes, lab results and prescribed medications, with manually curated staging data for 5861 prostate cancer patients. We used the APHRODITE framework to build models that predict stage at diagnosis using only structured EHR data. This paper outlines our approach to feature selection, time binning and model selection. This work is significant because accurate and up-to-date staging information in the EHR is a critical component of observational research in oncology, as well as of care coordination and clinical trial recruitment. In addition, AI methods that extract relevant data from the EHR could serve as a basis for automatically populating registries from local EHRs in a timely fashion.
Author: Martin Seneviratne

Coauthor(s): Martin Seneviratne, Michelle Ferrari, James Brooks, Tina Hernandez-Boussard

Status: Work In Progress

Funding Acknowledgment: AstraZeneca-Stanford collaborative research grant 2017-18