“You can have data without information, but you cannot have information without data.”

Daniel Keys Moran, American writer and programmer

The current ability to combine large clinical data stores (roughly more than 10,000 patients and/or more than 100,000 samples) with data science is a potential game changer in pediatric research and can even provide more accurate information. In the near future, synthetic data and federated/swarm learning technologies may remove the need to share raw data for some (though not all) multi-site clinical research projects.

Synthetic data is data that has the statistical characteristics of real-world data (“realistic but not real”) while preserving the privacy of the data. It is estimated that more than half of the data used in ML/AI projects in certain sectors will be synthetic data by the end of the decade. This type of data can be helpful for research in which real-world data is particularly difficult and/or expensive to gather for analysis.
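As a minimal sketch of the "realistic but not real" idea, the Python example below fits a mean and standard deviation to each column of a small numeric table and then samples new rows from those summaries. The toy values, the function names, and the choice of independent Gaussian columns are all assumptions made for illustration. Production synthetic data generators model the joint distribution of the variables, and this simple approach offers no formal privacy guarantee.

```python
import random
import statistics

def fit_gaussian(values):
    """Summarize one numeric column by its mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def synthesize(real_rows, n_samples, seed=0):
    """Draw synthetic rows that mimic each column's mean and spread.

    Columns are sampled independently, so correlations between columns
    are NOT preserved (a simplifying assumption for this sketch).
    """
    rng = random.Random(seed)
    columns = list(zip(*real_rows))
    params = [fit_gaussian(col) for col in columns]
    return [
        tuple(rng.gauss(mu, sigma) for mu, sigma in params)
        for _ in range(n_samples)
    ]

# Hypothetical toy "real" data: (age in years, weight in kg)
real_rows = [(2.0, 12.1), (5.0, 18.4), (8.0, 26.0), (11.0, 36.5), (14.0, 50.2)]
synthetic_rows = synthesize(real_rows, n_samples=1000)
```

The synthetic rows share the summary statistics of the original table, yet none of them corresponds to an actual patient record.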

There are two innovative decentralized approaches to ML/AI collaboration: 

  • Federated learning involves training the ML models at each site and collecting the model parameters in a central location, but without a traditional centralized repository of raw data.
  • Swarm learning is even more decentralized in that not even the parameters of the models are collected in a central location; it is, in essence, “distributed” ML or AI.
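The federated approach described above can be sketched in a few lines of Python. In this illustrative example, each site runs gradient descent on its own data for a one-parameter model (y ≈ weight × x), and only the resulting weight leaves the site; a central step then averages the weights in proportion to each site's sample count. The site data, function names, learning rate, and number of rounds are hypothetical choices, not taken from any specific framework.

```python
def local_update(weight, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps for y ~ weight * x on one site's data."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight

def federated_average(site_weights, site_sizes):
    """Combine site weights, weighting each site by its number of samples."""
    total = sum(site_sizes)
    return sum(w * n for w, n in zip(site_weights, site_sizes)) / total

# Hypothetical data held at two sites; the underlying relationship is y = 3x.
sites = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)],
]

global_weight = 0.0
for _ in range(50):
    # Raw (x, y) pairs never leave a site; only the updated weight is shared.
    local_weights = [local_update(global_weight, data) for data in sites]
    global_weight = federated_average(local_weights, [len(d) for d in sites])
```

In a swarm learning setup, the central averaging step would be replaced by the sites exchanging and merging parameters among themselves.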

Various machine learning methodologies, such as logistic regression, support vector machines, and decision trees, can be useful tools for classification, but ensemble learning methodologies like bagging (Random Forest) or boosting (XGBoost) can be even more useful than some single models. These “meta models” are, however, vulnerable to overfitting, a phenomenon in which a model provides accurate predictions for the training data but not for new data.
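To make bagging and the training-versus-new-data comparison concrete, here is a minimal Python sketch. It trains several one-threshold classifiers ("decision stumps") on bootstrap resamples of a noisy toy dataset and combines them by majority vote, then measures accuracy on both the training set and a held-out set. This is a simplified stand-in for the idea behind Random Forest, not that algorithm itself; the data generator, noise level, and all names are assumptions for illustration.

```python
import random

def fit_stump(samples):
    """Pick the threshold t that minimizes errors for 'predict 1 if x > t'."""
    best_t, best_errors = None, len(samples) + 1
    for t, _ in samples:
        errors = sum((x > t) != bool(y) for x, y in samples)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

def fit_bagged_stumps(samples, n_models=25, seed=0):
    """Train each stump on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(samples, k=len(samples)))
            for _ in range(n_models)]

def predict(thresholds, x):
    """Majority vote across all stumps."""
    votes = sum(x > t for t in thresholds)
    return int(votes * 2 > len(thresholds))

def accuracy(thresholds, samples):
    return sum(predict(thresholds, x) == y for x, y in samples) / len(samples)

def make_data(n, noise, rng):
    """Hypothetical data: label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

rng = random.Random(42)
train = make_data(200, noise=0.2, rng=rng)
test = make_data(200, noise=0.2, rng=rng)

model = fit_bagged_stumps(train)
train_acc = accuracy(model, train)
test_acc = accuracy(model, test)  # a large drop from train_acc signals overfitting
```

Comparing `train_acc` with `test_acc` is the basic check for overfitting: the wider the gap, the less the model's training performance can be trusted on new patients.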

While supervised machine learning can be used for classification, unsupervised learning remains under-leveraged in research. It can be very useful for discovering new disease phenotypes and therapeutic responses, or even for generating biological theories, and can therefore contribute to the paradigm of precision medicine. Rather than relying on the traditional classification schema for diseases, one can analyze the data with unsupervised learning and let the data provide novel insights.
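One common unsupervised technique is k-means clustering, sketched below in plain Python as an illustration of letting the data suggest its own groupings. The two-dimensional "patient" measurements are made up, and the naive initialization (using the first k points as starting centers) is an assumption chosen for brevity rather than a recommended practice.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iter=100):
    """Group points into k clusters without using any labels."""
    centers = list(points[:k])  # naive initialization, for brevity only
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = mean_point(members)
    return labels, centers

# Hypothetical measurements for six patients (two made-up biomarkers each).
patients = [
    (1.0, 1.2), (1.5, 0.8), (0.9, 1.1),
    (8.0, 8.1), (7.6, 8.4), (8.3, 7.9),
]
labels, centers = k_means(patients, k=2)
```

No diagnosis labels are supplied here; the algorithm separates the patients into two groups purely from the structure of the measurements, which is the sense in which unsupervised learning can surface candidate phenotypes.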

Deep learning, a subset of machine learning that uses neural networks, can better accommodate nonlinear situations but requires larger amounts of data and demands more computation. While the performance of traditional machine learning can plateau as the quantity of data increases, deep learning accuracy can continue to improve with more data. Deep learning also has the advantage of requiring less manual feature extraction than traditional machine learning.

These observations about artificial intelligence in healthcare will be among the topics of discussion at the in-person AIMed Global Summit 2023, scheduled for June 4-7, 2023, in San Diego. The remainder of the week will feature other exciting AI in medicine events, such as the Stanford AIMI Symposium on June 8th. Book your place now!

We at AIMed believe in changing healthcare one connection at a time. If you are interested in discussing the contents of this article or connecting, please drop me a line – [email protected]