By Crystal Valentine, PhD

The unique power of machine learning is in its ability to gain insights from noisy or incomplete data, making it an ideal tool for medical applications.

At the most basic level, machine learning is a technique used to distill patterns or insights from complex data sets in order to automate expert decision making.

As interest in machine learning for medical research and practice grows, we see a concomitant emphasis on the collection and analysis of medical and health data sets.

In other words, as machine learning gains popularity, medical data grows in importance.

Across industries, data is seen as the leverage point for competitive advantage and for scientific breakthroughs.

The “big data” era began in the mid-2000s with the popularization of software frameworks that took advantage of large, scale-out clusters of commodity hardware to parallelize massive computations and data storage at a low cost.

As a result, the amount of data being generated and processed globally is growing at about a 35% compound annual growth rate [1].

Indeed, data has been at the center of disruption within industries ranging from financial services to telecommunications to manufacturing for over a decade, and that trend is likely to continue.


In the context of medical applications, the data that can be combined and used to train machine learning models is quite varied, including digital pathology slides, smart medical devices, electronic health records, doctors’ notes, and more.

The volume and variety of medical data suggests a wealth of opportunities for leveraging machine learning.

In my work with physicians and medical researchers, however, I have found an undercurrent of scepticism when it comes to data. It is not that data isn’t seen as valuable; rather, there is a fear that medical data is particularly noisy, messy, and incomplete, and that these characteristics may undermine the ability of machine learning models to learn and then make inferences.

On the topic of data, the conversation with physicians often turns toward a discussion of the rigid and cumbersome commercial electronic health records systems and the onerous task of recording patient data after visits. The question ultimately arises: do we have a data problem?

There is a common fear among physicians that the messiness and inherent heterogeneities within medical data sets will preclude the effective use of machine learning models for many applications within the field.

With “bad data”, how can we ensure accuracy? How can we have faith in the insights machine learning yields?

In fact, a unique strength of machine learning is that the techniques are robust to noise and can therefore generalize to situations in which the data is imperfect.

This means that if medical data is inherently heterogeneous or inconsistent, it’s likely that machine learning models can learn to anticipate and overcome the fundamental inconsistencies within the data.
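As a rough illustration of this point, consider fitting a simple linear model to simulated measurements that are heavily corrupted by noise. This is a minimal, hypothetical sketch in Python with NumPy; the feature values, coefficients, and noise level are all invented for illustration:

```python
import numpy as np

# Hypothetical illustration: fit a linear model to data whose targets are
# corrupted by heavy measurement noise, and check that the learned
# coefficients still approximate the true underlying relationship.
rng = np.random.default_rng(0)

true_coefs = np.array([2.0, -1.0, 0.5])      # the "true" underlying structure
X = rng.normal(size=(5000, 3))               # simulated patient features
noise = rng.normal(scale=2.0, size=5000)     # noisy, messy measurements
y = X @ true_coefs + noise

# Ordinary least squares recovers the structure despite the noise.
learned_coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(learned_coefs, 2))  # close to [ 2.  -1.   0.5]
```

Even with noise whose standard deviation is several times larger than some of the true coefficients, the fitted model lands close to the underlying structure, because the noise averages out across many samples.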


In machine learning research, “accuracy” is a multi-dimensional concept; models are evaluated not just on their predictive capabilities for a single validation instance but also on the basis of their generalizability and robustness.

Robustness, in particular, refers to the ability of an algorithm to perform consistently on varied data. Specifically, if an algorithm’s empirical loss does not change dramatically for perturbed samples, it is said to be robust.

What’s remarkable is that some of the so-called inconsistencies within training data sets can actually prove to be interesting markers or features of the data that improve a model’s predictive capabilities.

In other words, what a human might regard as “noise” or “error” within the data might actually be a fundamental characteristic of the data that can be learned and interpreted.

The ability of machine learning to overcome noisy data and distill the true structure of the underlying model is the essence of its power.

Unlike rules-based programmatic approaches to problem solving, which are rigid and require consistent data inputs, machine learning approaches have been shown to be useful in real-world scenarios where there is inherent noise and variation in the data.


Dr. Crystal Valentine presenting at AIMed 2017

As an example, consider a machine learning model running onboard an autonomous driving vehicle. Suppose the model in question is performing the pedestrian detection function for the car by analyzing high-definition video from a camera mounted on the dashboard.

That model must be able to identify all pedestrians during day or night, in all weather conditions, in a city or a rural area, and regardless of the relative speeds and orientations of the vehicle and pedestrian to one another.

The model needs to be robust so that, regardless of the driving conditions and scenery, pedestrians are consistently identified. In other words, as the pedestrian-detection model is trained, it learns that residual variation in lighting conditions, scenery, distance to the pedestrian, and so on is not fundamental to the underlying structure that represents a pedestrian.

As a result, a well-trained model can correctly identify pedestrians in all scenarios and driving conditions, including those that are not represented in the training data set.

Deep learning networks, in particular, have proven to be highly effective in these complex, real-world situations. It stands to reason, then, that machine learning models can also be trained to be robust to the different data collection practices, annotations, and equipment manufacturers between hospitals, for example.
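As a hypothetical sketch of what such robustness might involve, imagine two hospitals recording the same lab value on differently calibrated equipment. A simple per-site standardization step, one of many possible techniques, makes the measurements comparable before a shared model sees them. All of the numbers below are invented for illustration:

```python
import numpy as np

# Hypothetical sketch: two "hospitals" record the same lab value with
# different equipment offsets and scales. Per-site standardization lets a
# single model treat the measurements consistently -- one simple way a
# pipeline can be made robust to differing collection practices.
rng = np.random.default_rng(2)

underlying = rng.normal(loc=5.0, scale=1.0, size=(2, 400))  # true biology
hospital_a = 1.00 * underlying[0] + 0.0   # well-calibrated device
hospital_b = 1.10 * underlying[1] + 3.0   # different device, systematic shift

def standardize(x):
    """Centre and scale measurements within a single site."""
    return (x - x.mean()) / x.std()

a_std = standardize(hospital_a)
b_std = standardize(hospital_b)

# After per-site standardization, the two distributions are comparable:
# both have mean ~0 and standard deviation ~1.
print(a_std.mean(), b_std.mean(), a_std.std(), b_std.std())
```

In practice, of course, inter-hospital variation is far messier than an offset and a scale factor, but the principle is the same: a model can be trained, or its inputs preprocessed, so that site-specific artifacts are not mistaken for signal.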


As an aside, one of the strengths of nature is in the achievement of robustness in the face of variation. For instance, individual biological organisms contain redundant systems (like having two kidneys), the DNA replication process features error-correcting mechanisms, and populations evolve according to changes in their environments.

Biological systems require robustness in order to grow and thrive under complex and varied circumstances. It’s not surprising, therefore, that computer scientists have endeavoured to develop algorithms based on this principle, so they can perform well on noisy, real-world input data.

Indeed, there are many examples where computer scientists and machine learning experts explicitly draw their inspiration from biology in large part because they recognize that biological systems have this robustness characteristic – hence neural networks, evolutionary algorithms, and genetic algorithms.

In the end, the potential for machine learning to revolutionize medicine will not be limited by our lack of perfect data. The power of machine learning is in its ability to overcome noisy data.

The machine learning renaissance we’re seeing across industries today is fuelled by the tremendous growth in real-world, unadulterated data sets and our drive to gain insights and make better decisions using that data.

In medicine, perhaps even more than in other fields, it’s the complexity and imperfection of our data that necessitates a robust, machine learning approach.



[1] McKinsey Global Institute, “The Age of Analytics: Competing in a Data-Driven World,” 2016.




Crystal Valentine, PhD, served as the VP of Technology Strategy at MapR Technologies for the past two years.  She has nearly two decades’ experience in big data and machine learning research and practice.  A former professor of computer science at Amherst College, she is the author of several academic publications in the areas of big data, algorithms, computational biology, and high-performance computing, and she holds a patent for Extreme Virtual Memory.  Dr. Valentine was named a Person to Watch in 2018 by Datanami and was awarded the Silver Stevie Award in 2017 for Female Executive of the Year in the computer software category.  She is a member of the Forbes Tech Council and is a frequent contributor to industry journals.  She has consulted extensively with Fortune 500 companies to design and implement high-throughput, mission-critical applications and with equity investors as a technical expert on competing technologies and market trends.  Dr. Valentine was a Fulbright Scholar to Italy.