Statistician Dr. Genevera Allen of Rice University in Houston described a “crisis in science” as more scientists turn to machine learning (ML) techniques to analyze their data. Speaking at the American Association for the Advancement of Science (AAAS) in Washington earlier this month, Dr. Allen warned that ML is “wasting both time and money” for scientists because it often singles out patterns that are noise within the existing data, which may not represent the real world or be reproduced by another experiment.
Dr. Allen believes the problem of reproducibility is especially significant when scientists employ ML on genome data to identify patients with similar genomic profiles, a common approach in precision medicine, which aims to develop drugs that target the specific genome of a disease. However, ML fails to yield consistent results at the moment. Often, the cluster one scientist identifies is entirely different from the clusters found by other scientists running similar experiments.
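To make the idea of "entirely different clusters" concrete, here is a minimal sketch of one common way to compare two groupings of the same patients: the Rand index, the fraction of patient pairs on which both groupings agree. This is an illustration only, not Dr. Allen's method; the function name `rand_index` and the lab assignments are made up for the example.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two
    samples in the same cluster, or both put them in different clusters.
    The cluster names themselves do not matter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same samples")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)


# Hypothetical cluster assignments for the same six patients (made-up data).
lab_one = ["A", "A", "A", "B", "B", "B"]
lab_two = ["x", "x", "y", "y", "z", "z"]
lab_three = ["q", "q", "q", "p", "p", "p"]  # same grouping as lab_one, renamed

print(rand_index(lab_one, lab_three))  # 1.0: identical grouping
print(rand_index(lab_one, lab_two))    # 10 of the 15 pairs agree
```

A score of 1.0 means two labs recovered the same grouping. The plain Rand index is not corrected for chance agreement, so in practice a chance-adjusted variant is usually preferred.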
A problem with data rather than ML
The question thus becomes, “Can we really trust the discoveries that are currently being made using machine-learning techniques applied to large data sets?” Dr. Allen suggested it may be fruitful to develop algorithms that assess the reliability and validity of these predictions, something she is working on right now with biomedical researchers from Baylor College of Medicine in Houston.
Indeed, reproducibility is a “crisis”. In a survey conducted by
As such, although ML may have aggravated the problem of reproducibility in some ways, it is definitely not the sole cause, so it is myopic to make ML shoulder the blame. Like any form of artificial intelligence (AI), ML feeds on data, and the amount of data it can work with determines its capability. Dr. Allen herself has admitted that it is expensive to obtain a large amount of data. This is especially so for medicine, a field in which most of the data remains
Time to liberate data and consider collaborations
Dermatology, for example, most dermatologists do not necessarily
While it is impossible to develop a fully functional ML algorithm from a small data set, the proposed solution is to create a partially trained algorithm. It is then passed around so that different groups of clinicians and researchers, each armed with their own limited data sets, can train it further. This underlines the importance of collaboration.
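The idea of passing a partially trained model between groups can be sketched as incremental training: each group continues from the weights it receives instead of starting over. The sketch below is an assumption-laden illustration, not the system discussed above; the class `OnlineLogisticModel`, its `partial_fit` method, and both data sets are hypothetical.

```python
import math


class OnlineLogisticModel:
    """Tiny logistic regression trained by stochastic gradient descent."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.samples_seen = 0

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, rows, labels, epochs=200):
        """Continue training from the current weights; nothing is reset."""
        for _ in range(epochs):
            for x, y in zip(rows, labels):
                error = self.predict_proba(x) - y
                self.weights = [w - self.learning_rate * error * xi
                                for w, xi in zip(self.weights, x)]
                self.bias -= self.learning_rate * error
        self.samples_seen += len(rows)
        return self


# Made-up data: two features per case, label 1 = condition present.
site_a_rows = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
site_a_labels = [0, 0, 1, 1]
site_b_rows = [[0.3, 0.2], [0.2, 0.4], [0.7, 0.9], [0.9, 0.6]]
site_b_labels = [0, 0, 1, 1]

model = OnlineLogisticModel(n_features=2)
model.partial_fit(site_a_rows, site_a_labels)   # first group trains
weights_after_a = list(model.weights)
model.partial_fit(site_b_rows, site_b_labels)   # second group continues

print(model.samples_seen)
print(model.predict_proba([0.1, 0.1]), model.predict_proba([0.9, 0.9]))
```

Only the model's weights travel between groups in this sketch, so each group's raw data stays where it was collected.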
In addition, the AIMed community emphasizes the need to regard data as an asset rather than a commodity. This will prevent tycoons, or anyone with sufficient resources, from privatizing data and developing models to monopolize a particular sector. Ultimately, if there really is a “real world” problem that ML could address, it should not just be reproducibility but data too.
A science writer with a data background and an interest in current affairs, culture and arts; a non-med from an (almost) all-med family.