The myriad problems preventing machine learning from acting on the wealth of electronic health record (EHR) data will one day give way to a fantastic revolution in digital health.

Approximately 80% of healthcare data is still unstructured and housed in silos across different health systems because of complicated regulatory requirements. The data variables themselves are spread across multiple reporting areas of numerous software systems, including clinical notes and laboratory results generated during various hospital visits.

This is a problem for the future of digital medicine. Machine learning (ML) offers a potential solution, but not without its own set of unique challenges.

Mapping and validating these data fields for ML applications requires a significant amount of preprocessing. Most of the models that have been studied to validate ML use a limited number of variables, either because of limitations of the data sets or because of the amount of preprocessing required to curate them.

The Medical Information Mart for Intensive Care (MIMIC) data set has been widely used because it is one of the largest publicly available EHR data sets, although it is limited to intensive care unit patient encounters [2].

Interoperability, along with access, is also a big challenge. Despite the development of various applications, it remains unclear whether they can be applied to other data sets and still deliver the desired results.

Fast Healthcare Interoperability Resources (FHIR), a standard created by the HL7 organization, is a step in the right direction to address some of the interoperability issues [3].

In a recent article, Rajkomar et al. highlight some of these key challenges and showcase an innovative data pipeline that takes raw EHR data and delivers output in the FHIR standard without the need for manual feature harmonization [4]. The investigators used data sets from two different health systems, which is a unique effort.

They were able to demonstrate superior patient outcome predictions such as mortality, length of hospital stay, readmissions and discharge diagnoses compared to some of the existing models.
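To make the idea of a FHIR-based pipeline concrete, here is a minimal sketch in Python of how a bundle of FHIR-style resources could be flattened into one time-ordered sequence of events for a single patient. This is not the code used by Rajkomar and colleagues. The function name `bundle_to_events`, the two lookup tables and the placeholder codes are assumptions made purely for illustration, and only two resource types are handled.

```python
from datetime import datetime

# Illustrative assumption: the field that carries the timestamp and the
# field that carries the coded concept for each resource type we handle.
TIME_FIELDS = {
    "Observation": "effectiveDateTime",
    "MedicationRequest": "authoredOn",
}
CODE_FIELDS = {
    "Observation": "code",
    "MedicationRequest": "medicationCodeableConcept",
}


def bundle_to_events(bundle):
    """Flatten a FHIR-style Bundle (a dict) into time-ordered events.

    Each event is a tuple: (timestamp, resource type, code, numeric value).
    Resources of other types, or without a timestamp or code, are skipped.
    """
    events = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        rtype = resource.get("resourceType")
        time_field = TIME_FIELDS.get(rtype)
        if time_field is None or time_field not in resource:
            continue
        codings = resource.get(CODE_FIELDS[rtype], {}).get("coding", [])
        if not codings:
            continue
        when = datetime.fromisoformat(resource[time_field])
        value = resource.get("valueQuantity", {}).get("value")
        events.append((when, rtype, codings[0].get("code"), value))
    # Sort on the timestamp only, so missing values are never compared.
    events.sort(key=lambda event: event[0])
    return events


# Hypothetical example data; the codes are placeholders, not real codes.
example_bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {
            "resourceType": "MedicationRequest",
            "authoredOn": "2018-01-01T10:00:00+00:00",
            "medicationCodeableConcept": {"coding": [{"code": "EXAMPLE-RX"}]},
        }},
        {"resource": {
            "resourceType": "Observation",
            "effectiveDateTime": "2018-01-01T08:00:00+00:00",
            "code": {"coding": [{"code": "EXAMPLE-LAB"}]},
            "valueQuantity": {"value": 1.4, "unit": "mg/dL"},
        }},
        {"resource": {"resourceType": "Patient", "id": "example"}},
    ],
}

events = bundle_to_events(example_bundle)
for when, rtype, code, value in events:
    print(when.isoformat(), rtype, code, value)
```

The point of such a representation is that every data element, whatever its source system, ends up in the same shape and on the same timeline, so a model can consume it without a site-specific, hand-built feature table.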

A highlight of this study was the use of “big data”, with 46 billion individual data points exploited for development of the prediction models. Deep learning models can be improved by increased data availability and better feature engineering. Various prior studies and prediction models have used fewer features and much smaller amounts of data.

Clearly, Rajkomar and colleagues demonstrated an innovative approach to building better data preprocessing techniques which can in turn be applied to real world data sets.

What is unclear is whether this was the result of the increased amount of data made available, improved feature extraction, or a combination of both. Another question for the future is whether similar models built on retrospective data using batch processing techniques will work well with real-time data. Data architecture in EHRs is not fixed either, with upgrades coming online constantly.

Although this is a step in the right direction, many more rigorous efforts are needed to translate these studies into real-time applications.

While the focus of the above study was on demonstrating a successful EHR data processing technique for deep learning prediction models, the accompanying hype that “Google can predict when you will die” was unnecessary and far from the truth [5].

It is true that available data can be used for predictions, but interpretation and application have not yet matured enough. For example, a significant number of deaths in hospital settings follow from patients’ right to choose an acceptable treatment course and from what clinicians determine to be futile care. Retrospective data sets do not capture these two significant confounders, which limits the interpretability and application of the results. Additionally, social factors that lie outside EHR data have been shown to have significant implications for patients’ length of stay and readmissions.

Healthcare systems have spent significant resources building data warehouses locally, and face a constantly growing demand to maintain, secure and expand them. Most healthcare systems lack expertise in this area and cannot manage these data systems by themselves. With constant breaches of hospital data systems, a standard secure architecture is needed to keep patient data safe.

Cloud computing presents a solution for synchronizing real-time data with analytical power. Local or integrated clouds need to be considered, given ever-increasing processing needs and the demand for cost-effective solutions.

We have to overcome the technical challenges of “big unstructured data” across different silos if we are to make any real progress in this direction:

  • Firstly, similar techniques need to be developed for real-time data analysis.
  • Secondly, these then need to be harmonized for clinical interpretation.
  • Last but certainly not least, the many known and unknown data silos need to be disrupted and opened up for research and development.

The National Institutes of Health’s (NIH) strategic data plan is a welcome step in making large data sets available to communities for research and development [6].

If governments and organizations are providing funding for research, it is logical that the resulting de-identified data sets be made openly available.

Patient privacy is very important. But, in line with the basic principles of informed consent, patients need to be better informed about the value of providing their data in a secure, de-identified manner to enhance quality of care and make healthcare delivery cost-effective. As anticipated, there are already concerns about ownership of this data, including its ethical and legal use.

Organizations should not be able to simply monetize this data for their own gain. Europe has embraced stringent data privacy rules in the General Data Protection Regulation (GDPR), which brings its own challenges around data sharing and access. A recent publication by McLennan and coworkers, aptly titled “The challenge of local consent requirements for global critical care databases”, highlights some of these issues [7]. Clearly, some governance is required to ensure effective and appropriate use of global databases.

There is no doubt that healthcare data is increasing exponentially and will continue to do so. In the future with the availability of mobiles and wearables, the need for integration and processing is also going to increase. ML techniques are the present and future modalities to handle this data and need to be incorporated within and outside of the EHRs. From passive data repositories, EHRs need to be provided with “brains” which will decrease clinicians’ data processing burden and provide for improved and safer care to our patients.

BrainX takes on the challenge of machine learning in EHRs

BrainX is our competing team in the IBM Watson Artificial Intelligence XPrize competition, trying to overcome the big data challenge using novel machine learning approaches [8].

As one of only 62 teams left in the second round of this competition, which originally started with 683 teams from all over the world, we have now conclusively demonstrated how to use healthcare data in an analytic manner for clinician interpretation.

This is certainly just the beginning of discovering the right solutions and intelligent answers on how best to handle the many zettabytes of precious healthcare record data that will ultimately help us discover better treatment solutions for our patients the world over.

This article originally appeared in AIMed Magazine issue 04.



1. International Data Corporation (IDC) Health Insights. September 2013.


2. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. “MIMIC-III, a freely accessible critical care database”. Scientific Data (2016). doi: 10.1038/sdata.2016.35.


4. Rajkomar A, et al. “Scalable and accurate deep learning with electronic health records”. npj Digital Medicine (2018) 1:18. doi: 10.1038/s41746-018-0029-1.


7. McLennan S, Shaw D, Celi LA. “The challenge of local consent requirements for global critical care databases”. Intensive Care Med. 2018 Jun 19. doi: 10.1007/s00134-018-5257-y. PMID: 29922844.



Piyush Mathur, MD, FCCM – Staff Anesthesiologist/Intensivist/Quality Improvement Officer – Anesthesiology Institute, Cleveland Clinic + Founder – BrainX Community and team.





Ashish Khanna, MD, FCCP, FCCM – Staff Anesthesiologist & Intensivist and Vice-Chief for Research – Center for Critical Care, Cleveland Clinic + Member – BrainX team. Twitter handle: @KhannaAshishCCF