Sepsis is the single most expensive disease process in U.S. hospitalizations, accounting for 5% of all national hospital expenditures. Internationally, there are an estimated 31.5 million cases of sepsis per year, leading to up to 5.3 million deaths. Sepsis that arises in the community is referred to as community-acquired sepsis, while sepsis that develops during a hospitalization is referred to as hospital-acquired sepsis (HAS). HAS accounts for 11-15% of all cases of sepsis but is associated with higher mortality (19.2% versus 8.6%), higher costs ($38,400 versus $8,800), and longer length of stay (17 days versus 6 days) than community-acquired sepsis. Early detection and treatment of HAS is vital, as each 1-hour delay in starting antibiotics equates to a 7.6% decrease in survival. Because of this, hospitals have tried developing simple rule-based screening algorithms to detect cases of sepsis early. However, these algorithms suffer from poor accuracy, high false positive rates, and significant alert fatigue. Recently, researchers have turned to machine learning models to identify patients with HAS more accurately. These models have improved accuracy, but several limitations restrict their generalizability and clinical utility. These limitations include:

1) Using the old definition of sepsis, rather than the new Sepsis-3 definition
2) Depending on ICD coding to identify positive cases of sepsis for model training (specific, but not sensitive)
3) Using small training sets
4) Taking models trained on one unique population or location and generalizing them to multiple different populations/locations
5) Deploying models in locations where they might not be as clinically helpful (e.g., in the ICU or ED)

Our aim is to create a supervised machine learning model for predicting HAS that can address these previous limitations.

Purpose of model: To predict which hospital patients in the acute floor setting (non-ED, non-ICU) are at high risk of developing HAS.

Data source: We propose to train and validate our model on a dataset that includes 5 years of EHR data from an academic medical center. We expect this will equate to approximately 100,000 encounters.

Identification of positive cases of HAS in data: We will use the Sepsis-3 definition of sepsis. Positive cases are patients with both suspected infection (given IV antibiotics and had a sample sent for culture) and new organ dysfunction (Sequential Organ Failure Assessment score change of ≥ 2 from baseline).
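
The labeling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's actual pipeline: the `Encounter` record and its field names are hypothetical, "change of ≥ 2" is read as an increase over baseline, and any timing windows between antibiotics, cultures, and SOFA measurements are left out because the text does not specify them.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Encounter:
    """Hypothetical, simplified view of one hospital encounter."""
    iv_antibiotics_given: bool
    culture_sent: bool
    baseline_sofa: int
    sofa_scores: Sequence[int]  # SOFA scores recorded during the stay


def suspected_infection(enc: Encounter) -> bool:
    # Suspected infection: IV antibiotics given AND a sample sent for culture.
    return enc.iv_antibiotics_given and enc.culture_sent


def new_organ_dysfunction(enc: Encounter, min_increase: int = 2) -> bool:
    # Assumption: "change of >= 2 from baseline" means an increase of at
    # least 2 points at any time during the stay.
    return any(s - enc.baseline_sofa >= min_increase for s in enc.sofa_scores)


def is_positive_case(enc: Encounter) -> bool:
    # A positive case requires both criteria.
    return suspected_infection(enc) and new_organ_dysfunction(enc)
```

For example, an encounter with IV antibiotics, a culture sent, a baseline SOFA of 1, and a later SOFA of 3 would be labeled positive, while the same encounter without a culture would not.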

Feature selection: We intend to include a variety of data types, such as demographics, environmental information, past medical history, comorbidities, labs, vitals, and medications.

Feature engineering: We intend to test previously untested features that are created with clinical insight, such as slope, standard deviation, trend, monotonic increase or decrease, and duration.
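
As a rough illustration of how such features might be derived from a single vital sign or lab series, the sketch below computes a least-squares slope, the sample standard deviation, monotonicity flags, and a duration. The function names are hypothetical, and "duration" is interpreted here as the time elapsed between the first and last measurement, which is only one possible reading of the feature named above.

```python
from statistics import mean, stdev
from typing import Dict, Sequence, Union


def slope(times: Sequence[float], values: Sequence[float]) -> float:
    """Ordinary least-squares slope of values against times."""
    t_bar, v_bar = mean(times), mean(values)
    num = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


def is_increasing(values: Sequence[float]) -> bool:
    # Strictly increasing from each measurement to the next.
    return all(b > a for a, b in zip(values, values[1:]))


def is_decreasing(values: Sequence[float]) -> bool:
    # Strictly decreasing from each measurement to the next.
    return all(b < a for a, b in zip(values, values[1:]))


def summarize(times: Sequence[float],
              values: Sequence[float]) -> Dict[str, Union[float, bool]]:
    """Summary features for one series; needs at least two measurements."""
    return {
        "slope": slope(times, values),
        "std": stdev(values),
        "increasing": is_increasing(values),
        "decreasing": is_decreasing(values),
        # Assumption: duration = time between first and last measurement.
        "duration": times[-1] - times[0],
    }
```

Calling `summarize([0, 1, 2, 3], [80, 90, 100, 110])` on an hourly heart-rate series, for instance, would yield a slope of 10 per hour and flag the series as monotonically increasing.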

Machine learning models to be tested: Extreme Gradient Boosting (XGBoost), Logistic Regression, Classification and Regression Trees (CART), Generalized Linear Models, K-Nearest Neighbors, Linear Regression, Multi-Layer Perceptron, Bayesian Regression Trees

Current status: We have IRB approval and are in the process of completing the paperwork necessary to approve collaboration between UW, KenSci, and Microsoft.



Author: Xinran Liu

Coauthor(s): Greg McKelvy, MD, MPH; Muhammad A. Ahmad, PhD; Rosemary Grant, BSN, RN, CPHQ; David Carlbom, MD