Access to high-quality medical and healthcare data remains limited due to privacy concerns. But the van der Schaar Lab aims to change that by creating their own


Professor Mihaela van der Schaar, the John Humphrey Plummer Professor of ML, AI and Medicine at the University of Cambridge, has worked in the field of AI and machine learning for the past 17 years. “I began after receiving the NSF (National Science Foundation) Career Award to develop new multi-agent learning methods with limited information and a need to compete for scarce resources,” she says. “At that time, I was probably the only researcher working on this. Now, developing strategic learning methods for specific scenarios has finally become a hot topic but I have since moved on. For the last few years, I worked predominantly on AI and medicine.”

Professor van der Schaar set up a laboratory bearing her name when she moved to Cambridge in 2018. She aims to develop cutting-edge AI and machine learning methods to improve healthcare and medical knowledge. “Machine learning has already achieved very impressive results in areas where problems are easily defined and solutions are verifiable,” Professor van der Schaar explains.

“Unfortunately, medicine and healthcare are domains that do not have well-posed problems or easily verified solutions. Researchers are very keen to leverage machine learning to solve many real-world problems like, for instance, COVID-19. This is probably what makes the domains so exciting for anyone interested in exploring the boundaries of AI. It feels like other industries had left a map for us to uncover an entirely new way to navigate.”

However, access to high-quality data remains limited due to privacy concerns. “Medicine and healthcare are not domains where one can just venture into the wild and collect new data at will,” she continues. “It’s valid and we must develop new ways that are capable of ensuring the privacy of the provided datasets is not compromised and can communicate their efficacy convincingly to numerous groups of stakeholders, including the general public.”

Professor van der Schaar believes that keeping data safe requires differential privacy or a data-sharing system that is immune to post-processing or manipulation. To overcome the challenge, she and her lab members started examining synthetic data techniques as a potential way to revolutionize how medical and healthcare datasets are harnessed, interacted with and shared. In brief, synthetic data is the outcome of a complex process: artificially generated information that retains the statistical properties of the original information.

Generating realistic synthetic data is far from straightforward, since patient records are high-dimensional and derived from complex distributions. Besides, there is no consensus on how to define or measure the quality of synthetic data, and creating different data types to satisfy various purposes can be technically challenging. In medicine and healthcare, specific synthetic data and parameters are required to satisfy different needs, from predictions and analyses to decision-making, inferences and so on.

Professor van der Schaar and her lab members chose ADS-GANs to develop the needed datasets. ADS-GANs are a modified version of GANs (generative adversarial networks), a promising framework for simulating complex distributions that is widely used in artificial image generation. She hopes that synthetic data can be made publicly available for research while considerably lowering the risk of breaching patient confidentiality. “I think machine learning for medicine and healthcare is a delicate risk-benefit balance.”

“On one hand, we have the data users or people who need these high-quality data to apply to their research, whether or not it involves AI. While on the other hand, many data guardians do not trust data users; they amplify the risks, rather than benefits that data-empowered machine learning would provide, like more personalized treatments for patients, more efficiently and affordably delivered. It also encourages reproducibility as future researchers would be armed with the same training datasets to verify the developed model.

“Information stored in electronic health records is sensitive and abuse could lead to great harm,” she adds. “Synthetic data is part of a wider viable solution to this data logjam that the medicine and healthcare are facing right now. There is also a need for frequent and close collaborations across the board with patients, clinical counterparts, and stakeholders throughout the medical community. This is something our lab is already doing. We want to be the change we want to see.”