By May D. Wang, PhD, Li Tong, Hang Wu
Delivering predictive, precise, participatory, preventive, and personalized health, abbreviated as p-Health, is the primary goal of future healthcare systems to significantly improve care quality while reducing cost.
To accomplish this goal, an individual’s conditions captured by various biomedical data modalities, such as electronic health records (EHRs), high-throughput -omics, behavioral data, medical images, mobile wearable biosensors, and personal health records (PHRs), should be jointly evaluated by translational data analytics (Figure 1) to assist in medical and health decision making. The analytics pipeline includes multiple steps with interactive feedback: multi-modality data collection, data quality control, feature extraction, knowledge modeling, decision making, and feasible actions.
The state-of-the-art biomedical integration strategy analyzes data or information at raw data, feature, or decision levels [1-4]. Raw data integration aims at harmonizing various biomedical data sources.
For example, to harmonize EHRs from different vendors, HL7 is developing a new resource-centric standard, FHIR (Fast Healthcare Interoperability Resources) [5], which provides APIs (application programming interfaces) for exchanging data. To integrate -omic data from different high-throughput technologies (covering, e.g., the genome, proteome, metabolome, microbiome, and lipidome), normalization is used to map each data set to the same dynamic range, followed by concatenation. However, this may produce very large raw data sets that are computationally challenging to analyze.
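The normalize-then-concatenate step for -omic data can be illustrated with a minimal sketch. It assumes min-max scaling as the normalization (the text does not name a specific method), and the modality names and values are hypothetical toy data.

```python
from typing import Dict, List

Matrix = List[List[float]]  # rows = patients, columns = features


def min_max_scale(block: Matrix) -> Matrix:
    """Rescale every column of one modality to the range [0, 1]."""
    n_cols = len(block[0])
    lows = [min(row[j] for row in block) for j in range(n_cols)]
    highs = [max(row[j] for row in block) for j in range(n_cols)]
    scaled = []
    for row in block:
        scaled.append([
            # A constant column carries no information; map it to 0.0.
            (row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_cols)
        ])
    return scaled


def integrate_raw(blocks: Dict[str, Matrix]) -> Matrix:
    """Normalize each modality separately, then concatenate per patient."""
    scaled_blocks = [min_max_scale(b) for b in blocks.values()]
    n_patients = len(scaled_blocks[0])
    if any(len(b) != n_patients for b in scaled_blocks):
        raise ValueError("every modality must cover the same patients")
    return [
        [value for block in scaled_blocks for value in block[i]]
        for i in range(n_patients)
    ]


# Toy data: 3 patients, hypothetical values on very different scales.
omics = {
    "transcriptome": [[1200.0, 30.0], [800.0, 90.0], [1000.0, 60.0]],
    "metabolome": [[0.2], [0.6], [0.4]],
}
combined = integrate_raw(omics)
```

Each patient row in `combined` now holds all features on a shared [0, 1] range. The row width grows with every added modality, which is the data-size concern raised above.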
Feature-level integration aims at combining data at the feature extraction step, where features are either reduced by L1-norm regularization or minimum-redundancy maximum-relevancy (mRMR) selection before concatenation [6], or are transformed into a joint feature embedding space for combination by methods such as multiple kernel learning (MKL) [7], canonical correlation analysis (CCA) [8], probabilistic graphical models (PGMs), and auto-encoders.
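As an illustration of the reduce-then-concatenate route, the sketch below applies a greedy mRMR-style selection within each modality and then concatenates the surviving features. It is a simplified variant that uses absolute Pearson correlation for both relevance and redundancy, not a reference implementation of any published pipeline. The function names and toy data are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def mrmr_select(columns: List[List[float]], target: List[float], k: int) -> List[int]:
    """Greedily pick k column indices by relevance minus mean redundancy."""
    relevance = [abs(pearson(col, target)) for col in columns]
    selected: List[int] = []
    while len(selected) < min(k, len(columns)):
        best, best_score = -1, float("-inf")
        for j in range(len(columns)):
            if j in selected:
                continue
            redundancy = (
                sum(abs(pearson(columns[j], columns[s])) for s in selected) / len(selected)
                if selected else 0.0
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected


def fuse(modalities: Dict[str, List[List[float]]], target: List[float],
         k_per_modality: Dict[str, int]) -> List[List[float]]:
    """Select features per modality, then concatenate them per patient."""
    kept = []
    for name, columns in modalities.items():
        for j in mrmr_select(columns, target, k_per_modality[name]):
            kept.append(columns[j])
    return [list(row) for row in zip(*kept)]


# Toy data: 6 patients; each modality is stored as a list of feature columns.
outcome = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
modalities = {
    "omics": [
        [0.0, 1.0, 2.0, 2.0, 3.0, 4.0],  # relevant
        [0.0, 2.0, 2.0, 2.0, 2.0, 4.0],  # relevant but redundant with the first
        [3.0, 1.0, 0.0, 4.0, 3.0, 1.0],  # less relevant, adds new information
    ],
    "sensor": [
        [5.0, 5.0, 6.0, 6.0, 5.0, 6.0],
        [10.0, 11.0, 12.0, 20.0, 21.0, 22.0],
    ],
}
omics_choice = mrmr_select(modalities["omics"], outcome, 2)
fused = fuse(modalities, outcome, {"omics": 2, "sensor": 1})
```

In this toy case, ranking by relevance alone would keep the first two omics columns. The redundancy penalty instead swaps the second column for the third, which is the behavior the mRMR criterion is designed to produce.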
Decision-level integration aims at independently developing base models for each data modality, followed by combining the individual models with methods such as majority voting, weighted majority voting [9], and ensemble learning methods like bagging [10] and boosting [11]. Thus, building accurate base models and preventing overfitting as more parameters are added are critical. There are four opportunities in biomedical big data integration.
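The voting schemes named for decision-level integration can be sketched in a few lines. The example assumes each modality already has a trained base model that emits a class label. The modality names, labels, and weights are hypothetical; in practice a weight might come from each base model's validation performance.

```python
from collections import defaultdict
from typing import Dict, Optional


def weighted_vote(predictions: Dict[str, str],
                  weights: Optional[Dict[str, float]] = None) -> str:
    """Combine per-modality labels; with no weights this is a plain majority vote."""
    totals: Dict[str, float] = defaultdict(float)
    for modality, label in predictions.items():
        totals[label] += 1.0 if weights is None else weights[modality]
    # On a tie, the label that was encountered first wins.
    return max(totals, key=totals.get)


# Hypothetical outputs of three base models for one patient.
predictions = {"ehr": "high_risk", "omics": "low_risk", "imaging": "low_risk"}

majority = weighted_vote(predictions)
weighted = weighted_vote(predictions, {"ehr": 0.9, "omics": 0.3, "imaging": 0.4})
```

Here the unweighted vote follows the two modalities that agree, while the weighted vote follows the single EHR model because its assumed weight exceeds the other two combined.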
The first opportunity is data harmonization. Currently, due to the lack of data standards for many biomedical devices and health IT infrastructures, most collected biomedical data are vendor-dependent.
For example, there are 186 commercial EHR system vendors in the US market, and each has its own data standard and format. Harmonizing data from different vendors and multiple modalities (i.e., EHR, -omic, imaging, sensor) to enable cross-talk and secondary data analysis requires joint research and development efforts from all stakeholders.
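To make the vendor-dependence problem concrete, the sketch below maps two invented vendor record layouts onto one common record through per-vendor adapter functions. Both vendor formats and all field values are hypothetical. The target field names `gender` and `birthDate` are borrowed from the FHIR Patient resource, but the output is a simplified dictionary and not a conformant FHIR resource.

```python
from datetime import datetime
from typing import Any, Callable, Dict

Record = Dict[str, Any]


def from_vendor_a(rec: Record) -> Record:
    """Hypothetical vendor A: letter sex codes and US-style dates."""
    return {
        "id": rec["pt_id"],
        "gender": {"F": "female", "M": "male"}.get(rec["sex"], "unknown"),
        "birthDate": datetime.strptime(rec["dob"], "%m/%d/%Y").date().isoformat(),
    }


def from_vendor_b(rec: Record) -> Record:
    """Hypothetical vendor B: numeric sex codes and ISO timestamps."""
    return {
        "id": str(rec["patientNumber"]),
        "gender": {1: "male", 2: "female"}.get(rec["gender_code"], "unknown"),
        "birthDate": datetime.fromisoformat(rec["birth"]).date().isoformat(),
    }


# One adapter per source system; adding a vendor means adding one entry.
ADAPTERS: Dict[str, Callable[[Record], Record]] = {
    "vendor_a": from_vendor_a,
    "vendor_b": from_vendor_b,
}


def harmonize(source: str, rec: Record) -> Record:
    """Convert a vendor-specific record into the common layout."""
    return ADAPTERS[source](rec)


record_a = harmonize("vendor_a", {"pt_id": "A-17", "sex": "F", "dob": "03/25/1980"})
record_b = harmonize(
    "vendor_b",
    {"patientNumber": 4411, "gender_code": 1, "birth": "1975-11-02T00:00:00"},
)
```

Once both records share the same keys, value vocabularies, and date format, downstream analytics can treat them uniformly. The number of adapters grows with the number of vendors, which is why a shared standard is preferable to pairwise mappings.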
The second opportunity is data quality control for all modalities. For example, EHRs are well known for having a large percentage of missing data and errors; high-throughput sequencing data contain biases caused by the platforms, protocols, and bioinformatics pipelines; and mobile wearable biosensor and imaging data contain noise and artifacts. Because of the “garbage in, garbage out” principle, it is essential to develop quality control protocols and metrics for all types of data before integrative data analysis [12-15].
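One of the simplest quality control metrics for EHR data is the per-field missing rate. The sketch below is a minimal illustration of that single metric. The field names, the toy records, and the 30% threshold are assumptions chosen for the example, not recommended values.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def missing_rates(records: List[Record], fields: List[str]) -> Dict[str, float]:
    """Fraction of records in which each field is absent or None."""
    n = len(records)
    return {f: sum(1 for r in records if r.get(f) is None) / n for f in fields}


def flag_fields(rates: Dict[str, float], max_missing: float = 0.3) -> List[str]:
    """Fields whose missing rate exceeds the (assumed) acceptable threshold."""
    return sorted(f for f, rate in rates.items() if rate > max_missing)


# Toy EHR extract with hypothetical measurements.
records: List[Record] = [
    {"heart_rate": 72.0, "hba1c": None, "weight_kg": 80.5},
    {"heart_rate": 88.0, "weight_kg": 64.0},  # hba1c never recorded
    {"heart_rate": None, "hba1c": 6.1, "weight_kg": 71.2},
    {"heart_rate": 65.0, "hba1c": None, "weight_kg": 90.0},
]
rates = missing_rates(records, ["heart_rate", "hba1c", "weight_kg"])
flagged = flag_fields(rates)
```

A report like this only covers completeness. Checks for plausibility, platform bias, or signal artifacts would need modality-specific metrics.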
The third opportunity is advanced data analytics to extract knowledge from different data modalities. In clinical informatics, traditional data analytics relies on hand-crafted features to construct meaningful representations of patients, which limits predictive power.
In genomic medicine, the “curse of dimensionality” (i.e., the feature dimension is significantly larger than the patient sample size) requires effective feature reduction to filter out irrelevant genomic variants. One promising data analytics direction is to use deep learning methods for feature engineering.
The fourth opportunity is actionable decision making. The lack of model interpretability is one of the major barriers to taking action based on analytics.
For example, predictive models built from EHRs have shown potential in predicting hospital length of stay, readmission probability, and other outcomes, but deploying them in clinical practice is challenging because most models present only a final prediction without explaining the patient conditions driving it. Likewise, making sense of genomic data is a bottleneck for translating genomic discoveries into clinical practice [16].
To improve reliability in final decision making, the rationale and justification, plus the causal relationships among genomic variations, medical events, treatment options, and target diseases, are needed. Thus, research on explainable AI methods and on causal inference beyond correlation studies is becoming increasingly active.
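The difference between a bare prediction and an explained one can be shown with a small sketch. It uses a logistic model, which is interpretable by construction, and reports each feature's contribution to the log-odds alongside the predicted probability. The coefficients, feature names, and patient values are hypothetical and not fitted to any data, and the contributions describe associations in the model rather than causal effects.

```python
from math import exp
from typing import Dict, List, Tuple

# Hypothetical coefficients for illustration only, not fitted to real data.
INTERCEPT = -2.0
COEFFICIENTS = {"prior_admissions": 0.6, "age_over_65": 0.8, "on_insulin": 0.5}


def explain_prediction(patient: Dict[str, float]) -> Tuple[float, List[Tuple[str, float]]]:
    """Return the predicted probability and per-feature log-odds contributions."""
    contributions = {
        name: coef * patient.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    }
    log_odds = INTERCEPT + sum(contributions.values())
    probability = 1.0 / (1.0 + exp(-log_odds))
    # Rank features by how strongly they moved this particular prediction.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return probability, ranked


patient = {"prior_admissions": 3, "age_over_65": 1, "on_insulin": 0}
probability, drivers = explain_prediction(patient)
```

For this patient, `drivers` lists the number of prior admissions first, so a clinician can see which condition dominates the readmission estimate. More complex models need dedicated explanation methods, and establishing causality requires analysis beyond what this sketch provides.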
Overall, to enable evidence-based p-Health, we need to integrate multimodality data at the raw data level, feature level, and decision level.
There are enormous research and development opportunities in data harmonization, quality control, feature engineering, causality modeling, and explainable AI for actionable decision making.
By embracing these opportunities and solving the challenges, biomedical data integration will reshape medicine towards p-Health.
- Hang Wu: Ph.D. Student, Department of Biomedical Engineering, Georgia Institute of Technology and Emory University
- Li Tong: Ph.D. Student, Department of Biomedical Engineering, Georgia Institute of Technology and Emory University
- May D. Wang, PhD: Director of Biomedical Data Initiative, Kavli Fellow, Petit Institute Fellow, Fellow of AIMBE, IEEE Senior Member, Professor of Departments of Biomedical Engineering, Computational Science and Engineering, Electrical and Computer Engineering, Winship Cancer Institute, IBB, IPaT, Georgia Institute of Technology and Emory University
[1] L. Wong, “Technologies for integrating biological data,” Brief Bioinform, vol. 3, pp. 389-404, Dec 2002.
[2] C. Goble and R. Stevens, “State of the nation in data integration for bioinformatics,” J Biomed Inform, vol. 41, pp. 687-93, Oct 2008.
[3] D. Gomez-Cabrero, I. Abugessaisa, D. Maier, A. Teschendorff, M. Merkenschlager, A. Gisel, et al., “Data integration in the era of omics: current and future challenges,” BMC Syst Biol, vol. 8 Suppl 2, p. I1, 2014.
[4] M. D. Ritchie, E. R. Holzinger, R. Li, S. A. Pendergrass, and D. Kim, “Methods of integrating data to uncover genotype-phenotype interactions,” Nat Rev Genet, vol. 16, pp. 85-97, Feb 2015.
[5] D. Bender and K. Sartipi, “HL7 FHIR: An Agile and RESTful approach to healthcare information exchange,” in 2013 IEEE 26th International Symposium on Computer-Based Medical Systems (CBMS), 2013, pp. 326-331.
[6] C. Ding and H. Peng, “Minimum redundancy feature selection from microarray gene expression data,” J Bioinform Comput Biol, vol. 3, pp. 185-205, Apr 2005.
[7] M. Gonen and E. Alpaydin, “Multiple Kernel Learning Algorithms,” Journal of Machine Learning Research, vol. 12, pp. 2211-2268, Jul 2011.
[8] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Computation, vol. 16, pp. 2639-2664, Dec 2004.
[9] T. G. Dietterich, “Ensemble methods in machine learning,” Multiple Classifier Systems, vol. 1857, pp. 1-15, 2000.
[10] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, pp. 123-140, 1996.
[11] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of Statistics, pp. 1189-1232, 2001.
[12] SEQC/MAQC-III Consortium, “A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium,” Nat Biotechnol, vol. 32, pp. 903-14, Sep 2014.
[13] N. G. Weiskopf and C. Weng, “Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research,” J Am Med Inform Assoc, vol. 20, pp. 144-51, Jan 2013.
[14] D. Onder, S. Zengin, and S. Sarioglu, “A review on color normalization and color deconvolution methods in histopathology,” Appl Immunohistochem Mol Morphol, vol. 22, pp. 713-9, Nov-Dec 2014.
[15] S. Kumar, W. Nilsen, M. Pavel, and M. Srivastava, “Mobile Health: Revolutionizing Healthcare Through Transdisciplinary Research,” Computer, vol. 46, pp. 28-35, Jan 2013.
[16] L. Chin, J. N. Andersen, and P. A. Futreal, “Cancer genomics: from discovery science to personalized medicine,” Nature Medicine, vol. 17, pp. 297-303, Mar 2011.