Statistician Dr. Genevera Allen of Rice University in Houston called it a “crisis in science” as more scientists turn to machine learning (ML) techniques to analyze their data. Speaking at the American Association for the Advancement of Science (AAAS) meeting in Washington earlier this month, Dr. Allen warned that ML is “wasting both time and money” of scientists because it tends to single out patterns in the noise of existing data, patterns that may not be representative of the real world or be reproduced in another experiment.

Dr. Allen believes the problem of reproducibility is especially significant when scientists apply ML to genomic data to identify groups of patients with similar genomic profiles, a common approach in precision medicine, which aims to develop drugs that target the specific genomic features of a disease. At the moment, however, ML fails to yield consistent results: one scientist may single out a particular cluster that turns out to be entirely different from what other scientists find in similar experiments.
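To see why such cluster “discoveries” can fail to reproduce, consider the following toy sketch, which is not Dr. Allen’s analysis; the cohort size, feature count and number of clusters are invented for illustration. Two hypothetical “labs” each cluster their own half of a structure-free data set, and their groupings barely agree.

```python
# A toy illustration (not Dr. Allen's actual analysis): two hypothetical "labs"
# cluster their own halves of a structure-free data set and barely agree.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Pure noise standing in for a genomic feature matrix:
# 200 "patients" x 50 "genes", with no real subgroups hidden inside.
X = rng.normal(size=(200, 50))

# Each lab clusters its own random half of the cohort into 3 subgroups.
idx = rng.permutation(200)
lab_a = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X[idx[:100]])
lab_b = KMeans(n_clusters=3, n_init=10, random_state=2).fit(X[idx[100:]])

# Each lab then assigns every patient in the full cohort to its own clusters.
labels_a = lab_a.predict(X)
labels_b = lab_b.predict(X)

# An adjusted Rand index near 0 means the two "discoveries" barely agree.
print("agreement between labs:", adjusted_rand_score(labels_a, labels_b))
```

Because the data contain no genuine subgroups, the agreement score typically lands near zero: each lab reports confident-looking clusters that the other cannot reproduce.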

A problem with data rather than ML 

The question thus becomes: can we really trust the discoveries that are currently being made using machine-learning techniques applied to large data sets? Dr. Allen suggested it may be fruitful to develop algorithms that assess the reliability and validity of these predictions, something she is working on right now with biomedical researchers from Baylor College of Medicine in Houston.
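One generic way to vet such results, offered here only as an illustration and not necessarily the approach Dr. Allen’s team is developing, is a stability check: re-cluster bootstrap resamples of the data and measure how often the same partition comes back.

```python
# A generic stability check, offered only as an illustration (not necessarily
# the method being developed with Baylor College of Medicine): re-cluster
# bootstrap resamples and see how often the same partition returns.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def cluster_stability(X, n_clusters=3, n_boot=20, seed=0):
    rng = np.random.default_rng(seed)
    # Reference partition computed on the full data set.
    reference = KMeans(n_clusters=n_clusters, n_init=10,
                       random_state=seed).fit_predict(X)
    scores = []
    for _ in range(n_boot):
        # Draw a bootstrap resample and re-cluster it.
        sample = rng.choice(len(X), size=len(X), replace=True)
        model = KMeans(n_clusters=n_clusters, n_init=10,
                       random_state=seed).fit(X[sample])
        # Compare the resample's clustering, projected back onto the full
        # data, against the reference partition.
        scores.append(adjusted_rand_score(reference, model.predict(X)))
    return float(np.mean(scores))  # near 1.0: reproducible; near 0.0: suspect
```

A score near 1.0 suggests the clusters persist across resamples; a score near 0.0 suggests they are artifacts of the particular sample.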

Indeed, reproducibility is a “crisis”. In a survey conducted by Nature three years ago, up to 70% of researchers reported that they had been unable to replicate other scientists’ earlier findings. Only researchers in chemistry, physics and engineering, and earth and environmental science expressed confidence in the published literature. It is unclear whether any of the surveyed researchers were actively using ML in their work.

As such, although ML may have aggravated the problem of reproducibility in some ways, it is certainly not the sole cause, and it would be myopic to have ML shoulder the blame alone. Like any form of artificial intelligence (AI), ML feeds on data, and the amount of data it can work with determines its capability. Dr. Allen herself admitted that it is expensive to obtain large amounts of data. This is especially true in medicine, a field in which most data remains unstructured and is not freely shared.

Time to liberate data and consider collaborations 

In dermatology, for example, most dermatologists do not keep photographic documentation of their patients’ conditions, so it is nearly impossible to build a prediction algorithm that determines a person’s skin condition. Likewise, at AIMed North America 2018, pediatric cardiology experts questioned the possibility of working with small data.

While small data alone cannot produce a fully functional ML algorithm, the proposed solution was to create a partially trained algorithm and pass it around, so that it can be further taught by different groups of clinicians and researchers, each armed with their own limited data set. That brought up the importance of collaboration.
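As a rough sketch of that hand-off idea, an incrementally trainable model can be updated site by site. The site names, features and data below are invented purely for illustration; this is not a prescription for any particular clinical workflow.

```python
# A minimal sketch of the hand-off idea: one shared model is partially trained
# at a site, then passed on and updated with the next site's small data set.
# The site names, features and data are invented for illustration only.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
classes = np.array([0, 1])  # e.g. "condition present" vs "condition absent"

def local_data(n=50):
    # Stand-in for a small local data set with 10 clinical features.
    X = rng.normal(size=(n, 10))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    return X, y

model = SGDClassifier(random_state=0)

# The same model object is passed from site to site and updated incrementally,
# so no site ever has to share its raw data.
for site in ["site_A", "site_B", "site_C"]:
    X_local, y_local = local_data()
    model.partial_fit(X_local, y_local, classes=classes)
    print(f"after {site}: local accuracy {model.score(X_local, y_local):.2f}")
```

Each group improves the shared model without having to pool its raw data, which is what makes this kind of collaboration practical for sites holding only limited data sets.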

In addition, the AIMed community also emphasized the need to regard data as an asset rather than a commodity. This would prevent tycoons, or anyone with sufficient resources, from privatizing data and developing models that monopolize a particular sector. Ultimately, if there really is a “real world” problem for ML to address, it is not just reproducibility but data too.

Author Bio

Hazel Tang

A science writer with a data background and an interest in current affairs, culture and the arts; a non-med from an (almost) all-med family. Follow on Twitter.