In the era of electronic health records (EHRs), there is an opportunity to create data-driven clinical decision support systems that leverage the aggregate expertise of many healthcare providers and automatically adapt to the ongoing stream of practice data.

This would fulfill the vision of a health system that continuously learns from real-world data and translates them into usable, point-of-care information for clinicians. The “wisdom of the crowd” phenomenon holds that indiscriminately learning from data generated by all clinicians can be more robust than learning from a subset of cherry-picked “experts.” However, effective medical decision making may be compromised if patterns are aggregated from clinicians whose systematically biased decision making yields poorer patient outcomes. In this study, we stratify clinicians in a hospital practice based on observed vs. expected 30-day patient mortality rates. We then compare clinical order practice patterns machine-learned from “low-mortality” and “high-mortality” clinician cohorts against an unfiltered clinician “crowd” and clinical practice guidelines.

The authors defined a clinician performance score based on two-sided P-values quantifying the deviation of a clinician’s observed vs. expected 30-day patient mortality rate. Clinicians at the extremes of the score distribution were stratified into low-mortality and high-mortality cohorts. Using structured, deidentified EHR data from 2010-2013, the authors curated three patient cohorts: patients seen by low-mortality clinicians, patients seen by high-mortality clinicians, and patients seen by the unfiltered crowd of all clinicians. Predicted order lists for 6 common admission diagnoses (pneumonia, chest pain, heart failure, etc.) were generated from association rule episode mining recommender systems trained on each patient cohort and evaluated against manually curated clinical practice reference standards using ROC analysis, precision and recall, and rank-biased overlap (RBO).
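To make the rank-biased overlap comparison concrete, the following is a minimal sketch of a truncated RBO between two ranked order lists. It is not the authors' implementation: the function name, the persistence parameter p = 0.9, the truncation depth, and the example order names are all assumptions chosen for illustration. The score weights agreement near the top of the lists more heavily than agreement further down.

```python
def rank_biased_overlap(ranking_a, ranking_b, p=0.9, depth=None):
    """Truncated rank-biased overlap between two ranked lists.

    Computes (1 - p) * sum_{d=1..k} p**(d - 1) * A_d, where A_d is the
    fraction of items shared by the top-d prefixes of both lists.
    Assumes each list has no duplicate items. The persistence p and the
    depth k are illustrative choices, not values taken from the study.
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    max_depth = min(len(ranking_a), len(ranking_b))
    if depth is None or depth > max_depth:
        depth = max_depth

    total = 0.0
    for d in range(1, depth + 1):
        # Agreement at depth d: shared items in the two top-d prefixes.
        shared = len(set(ranking_a[:d]) & set(ranking_b[:d]))
        total += (p ** (d - 1)) * (shared / d)
    return (1 - p) * total


# Hypothetical order lists for a single admission diagnosis.
crowd_orders = ["chest_xray", "blood_culture", "cbc", "ceftriaxone"]
cohort_orders = ["blood_culture", "chest_xray", "cbc", "azithromycin"]
score = rank_biased_overlap(crowd_orders, cohort_orders, p=0.9)
```

Because the sum is truncated at a finite depth k, two identical lists score 1 - p^k rather than exactly 1; published RBO values typically use an extrapolated variant that corrects for this, so this sketch is not directly comparable to reported figures.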

Out of 1,822 total clinicians assessed, 397 (21.8%) and 110 (6.0%) were stratified into low- and high-mortality extremes, respectively. Patients treated by low-mortality, high-mortality, or the unfiltered crowd of clinicians were sorted into propensity score matched cohorts of size 1,046, 1,046, and 5,230. For all 6 admission diagnoses, ranked lists of associated clinical orders learned from low-mortality and unfiltered crowd clinician cohorts showed “substantial overlap,” with RBO values between 0.67 and 0.79. Order lists learned from the unfiltered crowd showed the most robust alignment with clinical practice guideline references (ROC AUC between 0.86 and 0.91), performing as well as or better than those learned from the low-mortality cohort (0.79 to 0.84, p<10^-5) or manually authored hospital order sets (0.65 to 0.77, p<10^-5).

Conclusion: Whether machine learning models are better trained on all available cases or on “cherry-picked” favored subsets of clinical decision makers illustrates a bias-variance tradeoff in data usage. Learning decision support from data generated by all clinicians is as robust as, or more robust than, attempting to select a subgroup of clinicians favored by patient outcomes data when evaluated against clinical practice guidelines. In the absence of gold standards to define “good” medical decisions, defining reusable metrics to assess quality based on external reference standards (e.g. practice guidelines) or relation to hard outcomes (e.g. patient mortality) is critical for assessing decision support content.



Author: Jason Ku Wang

Coauthor(s): Jason Hom MD, Alejandro Schuler, Nigam H. Shah MBBS PhD, Mary K. Goldstein MD, Michael T.M. Baiocchi PhD, Jonathan H. Chen MD PhD

Status: Completed Work

Funding Acknowledgment: This research was supported by the NIH Big Data 2 Knowledge initiative via the National Institute of Environmental Health Sciences under Award Number K01ES026837, and the Stanford Clinical and Translational Science Award (CTSA) to Spectrum (UL1 TR001085). The CTSA program is led by the National Center for Advancing Translational Sciences at the National Institutes of Health. Additional support is from the Stanford NIH/National Center for Research Resources CTSA award number UL1 RR025744. Patient data were extracted and de-identified by Stanford Medicine’s Research IT department as part of the Stanford Medicine Research Data Repository (StaRR) project. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or Stanford Healthcare.