The University of Florida’s academic health center, UF Health, has teamed up with NVIDIA to develop a neural network that generates synthetic clinical data – a powerful resource that researchers can use to train other AI models in healthcare. Trained on a decade of data representing more than 2 million patients, SynGatorTron is a language model that can create synthetic patient profiles that mimic the health records it has learned from. The 5-billion-parameter model is the largest language generator in healthcare.

Duane Mitchell, assistant vice president for research and director of the UF Clinical and Translational Science Institute, said:

“Synthetic data isn’t actually linked to a real human being, but it has similar characteristics to real patients. SynGatorTron can, for example, create health records of digital diabetes patients that have features just like a real population.”

Using this synthetic data, researchers can build tools and models without exposing real patient information. These can then be applied to real data to ask clinical questions, look for associations and even explore patient outcomes.

Working with synthetic data also makes it easier for different research institutions to collaborate and share models. And since the amount of data that can be synthesized is virtually limitless, researchers can use SynGatorTron-generated data to augment small datasets of rare disease patients or minority populations to reduce model bias.

“SynGatorTron’s generative capability is a great enabler of natural language processing for medicine,” said Mona Flores, global head of medical AI at NVIDIA. “Synthesizing different types of clinical records will democratize the ability to create all sorts of applications dependent on such data by addressing data sparsity and privacy.”

Once the model is available, research institutions outside UF Health could fine-tune the pretrained SynGatorTron model with their own localized data and apply it to their AI projects. For example, if a given condition or a patient population is underrepresented in a health system’s clinical data, SynGatorTron can be prompted to generate additional data with characteristics of that disease or population. These AI-generated records could then be used to supplement and balance out real healthcare datasets used to train other neural networks, so that they better represent the population.
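To make the balancing idea concrete, here is a minimal Python sketch. It is not SynGatorTron's interface, which this article does not describe. The function `balance_with_synthetic`, the `placeholder_generator` stand-in and the record fields (`condition`, `note`, `source`) are all hypothetical names chosen for illustration. The sketch simply tops up each underrepresented group with generated records until it matches the largest group.

```python
from collections import Counter
from typing import Callable, Dict, List

Record = Dict[str, str]


def balance_with_synthetic(
    records: List[Record],
    group_key: str,
    generate: Callable[[str], Record],
) -> List[Record]:
    """Return the real records plus enough synthetic ones to equalize group sizes.

    `generate` is a hypothetical callable standing in for a text generator
    prompted for a given group; it is an assumption, not a real API.
    """
    counts = Counter(r[group_key] for r in records)
    if not counts:
        return []
    target = max(counts.values())  # size of the largest group
    balanced = list(records)  # copy, so the real dataset is left unchanged
    for group, n in counts.items():
        for _ in range(target - n):
            balanced.append(generate(group))
    return balanced


def placeholder_generator(group: str) -> Record:
    """Stand-in generator: returns a labeled dummy record, not real model output."""
    return {
        "condition": group,
        "note": f"[synthetic note for {group}]",
        "source": "synthetic",
    }


if __name__ == "__main__":
    real = [
        {"condition": "diabetes", "note": "note A", "source": "real"},
        {"condition": "diabetes", "note": "note B", "source": "real"},
        {"condition": "diabetes", "note": "note C", "source": "real"},
        {"condition": "rare_disease", "note": "note D", "source": "real"},
    ]
    balanced = balance_with_synthetic(real, "condition", placeholder_generator)
    print(Counter(r["condition"] for r in balanced))
```

Tagging each generated record with a `source` field, as the sketch does, is one way to keep synthetic and real records distinguishable after they are mixed, which matters when a model is later evaluated on real data only.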

Since synthetic training datasets mimic real medical notes without being associated with specific patients, they can also be more readily shared across research institutions without raising privacy concerns.

“When you have the ability to mimic population characteristics without being tethered to real patients, it opens the imagination to see if we can generate realistic datasets that allow us to answer questions we couldn’t otherwise, due to constraints on access to data or limited information on patients of interest,” Mitchell said.

One potential application is in clinical trials, which often divide patients into treatment and control groups to measure the effectiveness of a new medication. An application derived from SynGatorTron-generated data could parse real records and create digital twins of those patient records. These digital twins could then serve as the control group in a clinical trial, instead of a control group formed by giving real patients a placebo treatment.

Researchers developing a deep learning model to study a rare disease, or the effects of a treatment on a specific population, could also use SynGatorTron for data augmentation, generating more training data to supplement the limited amount of real medical records available.