“Most benchmarks provide a fixed set of data and invite researchers to iterate on the code… perhaps it is time to hold the code fixed and invite researchers to improve the data.”

Andrew Ng, Stanford AI expert

 

Last week, we reviewed healthcare data and the nuances that render the data difficult to access and share. This week, we review some of the possible solutions to our healthcare data conundrum…

 

Less Cumbersome Data Labeling. Accurately labeled data is exceedingly difficult to obtain, so any methodology that can increase the amount of labeled data without involving humans is helpful. One such strategy is semi-supervised learning. This methodology explores how an algorithm can learn from a small labeled data set combined with a larger pool of unlabeled data, rather than from the entirely labeled data sets that supervised learning requires.

Semi-supervised learning is, in short, a combination of supervised and unsupervised learning and is used, for example, in text document classification. While this methodology resembles how humans learn from both labeled and unlabeled data, its disadvantage is that the accuracy of the labels the algorithm assigns is difficult to ascertain.
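
One common flavor of semi-supervised learning is self-training (also called pseudo-labeling), and a minimal sketch of it follows. The example is ours rather than a reference implementation: the nearest-centroid classifier, the function names, the `min_margin` threshold, and the made-up single-feature "lab values" are all assumptions chosen to keep the idea visible. A classifier is fit on the few labeled points, it labels only the unlabeled points it is confident about, and then it is refit on the enlarged set.

```python
from statistics import mean

def fit_centroids(values, labels):
    """Nearest-centroid 'model': the mean feature value for each label."""
    return {lab: mean(v for v, l in zip(values, labels) if l == lab)
            for lab in set(labels)}

def predict_with_margin(centroids, value):
    """Return (closest label, margin); a larger margin means more confidence."""
    ranked = sorted(centroids, key=lambda lab: abs(value - centroids[lab]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(value - centroids[runner_up]) - abs(value - centroids[best])
    return best, margin

def self_train(labeled_x, labeled_y, unlabeled_x, min_margin=1.0, max_rounds=10):
    """Grow the labeled set with confident pseudo-labels, then refit."""
    x, y, pool = list(labeled_x), list(labeled_y), list(unlabeled_x)
    for _ in range(max_rounds):
        centroids = fit_centroids(x, y)
        confident, uncertain = [], []
        for value in pool:
            label, margin = predict_with_margin(centroids, value)
            if margin >= min_margin:
                confident.append((value, label))
            else:
                uncertain.append(value)
        if not confident:
            break  # nothing new was learned in this round
        for value, label in confident:
            x.append(value)
            y.append(label)
        pool = uncertain
    return fit_centroids(x, y), pool

# Hypothetical data: four human-labeled values and six unlabeled ones.
labeled_x = [1.0, 2.0, 8.0, 9.0]
labeled_y = ["normal", "normal", "abnormal", "abnormal"]
unlabeled_x = [1.5, 2.5, 3.0, 7.0, 8.5, 5.1]

centroids, still_unlabeled = self_train(labeled_x, labeled_y, unlabeled_x)
```

In this toy run the borderline value 5.1 never clears the confidence threshold and stays unlabeled. The sketch also shows the disadvantage noted above: nothing in the loop verifies that a pseudo-label is correct, so a confident mistake would be absorbed into the training set.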

Independent Trustworthy Data Sharing. While there are data privacy mechanisms such as differential privacy and homomorphic encryption, many institutions remain reluctant to share their health data. A possible solution to overcome the regulatory restrictions, ethical dilemmas, and privacy concerns with sharing data is a data trust. This novel independent entity, with the appropriate legal and ethical framework, serves as a fiduciary for the data owners (who are then the beneficiaries, with the data trust acting as trustee, as in a conventional trust) and has oversight of the collection, curation, access, and use of the data. It is possible for such a data trust to have both public and private involvement in a dyadic structure.

Using Relevant Synthetic Data. There is some discussion of using synthetic data, generated by computer simulations or algorithms, as a means to have more data (albeit not real-world in the truest sense) for healthcare AI projects. While data augmentation and anonymization are not always considered synthesized data, a generative adversarial network (GAN) is a deep learning methodology that creates new labeled images from a labeled data set by training two competing neural networks: a generator that produces candidate images and a discriminator that tries to tell them apart from real ones. There are limitations with this methodology, however, especially with rare cases in biomedicine. Variational autoencoders can also generate synthetic data.
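
To make "synthetic data generated by an algorithm" concrete, here is a deliberately simple sketch. It is not a GAN or a variational autoencoder: it swaps in a much cruder technique, fitting a mean and standard deviation to each column of a small table and sampling new rows from independent normal distributions. The blood-pressure-style numbers and the function names are hypothetical.

```python
import random
from statistics import mean, stdev

def fit_column_stats(rows):
    """Summarize each column of the real table by its (mean, standard deviation)."""
    return [(mean(col), stdev(col)) for col in zip(*rows)]

def sample_synthetic(stats, n, seed=0):
    """Draw n synthetic rows, sampling every column independently.

    Assumption: columns are treated as independent and normally distributed,
    so correlations between columns are lost. A GAN or VAE would aim to
    capture that joint structure; this sketch does not.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [[rng.gauss(mu, sigma) for mu, sigma in stats] for _ in range(n)]

# Made-up "real" rows: (systolic, diastolic) readings for four patients.
real_rows = [[120.0, 80.0], [130.0, 85.0], [110.0, 70.0], [140.0, 90.0]]

stats = fit_column_stats(real_rows)
synthetic_rows = sample_synthetic(stats, n=1000)
```

The synthetic rows resemble the originals only in their per-column averages and spread. A rare case that never appears in the real rows cannot appear in the synthetic ones either, which is one way to see the limitation with rare cases mentioned above.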

Share Models, Not Data. A novel methodology for collaborating without sharing data is federated learning, in which a shared AI model gathers the learning from all the peripheral clients by exchanging weights and parameters (without the clients having to share their data). In short, the model is brought to the data, not the data to the model, so the data stays local. The future of such a decentralized strategy looks promising given the upcoming proliferation of wearable technology in medicine and healthcare, which can serve as local AI data sources. This methodology, however, may not be ideal for all healthcare situations (such as rare diseases with a very small number of patients).
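
The "model to the data" idea can be sketched with federated averaging, one widely used federated learning scheme. This is an illustrative toy under our own assumptions, not a production protocol: the three "hospital" data sets are invented, the model is a one-feature linear fit, and the function names and learning-rate settings are hypothetical. Each client trains on its own data and returns only weights, and the server combines them in proportion to each client's number of records.

```python
def local_update(weights, data, lr=0.01, epochs=20):
    """Train y ~ w*x + b on one client's private data; return only the weights."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error over this client's records.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def federated_average(client_weights, client_sizes):
    """Server step: average client weights, weighted by local data set size."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)

# Invented private data sets held by three clients; all follow y = 2x + 1.
clients = [
    [(1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0), (5.0, 11.0)],
    [(0.0, 1.0), (6.0, 13.0)],
]

global_weights = (0.0, 0.0)
for _ in range(200):  # communication rounds
    # Only model weights cross the network; raw (x, y) records never leave a client.
    updates = [local_update(global_weights, data) for data in clients]
    global_weights = federated_average(updates, [len(d) for d in clients])
```

After the rounds complete, `global_weights` sits close to the slope of 2 and intercept of 1 shared by all three data sets, even though the server never saw a single record. Real deployments add pieces this sketch omits, such as secure aggregation and handling for clients whose data differ markedly from one another.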

In short, we need innovation in how we manage and share data, and the four aforementioned methodologies are just highlights of the approaches that can potentially mitigate the problems with data sharing and AI in healthcare.