“Multimodal generative AI has the potential to drastically accelerate progress toward precision health by supercharging the structuring of health data and scaling population-level insight generation.”

Hoifung Poon, in the paper “Multimodal Generative AI for Precision Health”


Last week, we briefly discussed foundation models and their potential to deliver both higher performance and lower deployment cost than prior AI models. This week, we turn to the other exciting tool in this new era of AI: the multimodal AI model.

Multimodal Generative AI


Multimodal generative AI takes in multiple data modalities (enabled by multidimensional embeddings) and can generate outputs in those same modalities, building a more accurate understanding of the raw data. Its main technological challenge is integrating and processing a myriad of different data types. Multimodal generative AI leverages self-supervision (perhaps even more than foundation models do) and is considered the AI tool capable of delivering multi-sensory immersive experiences, which places it a step closer to artificial general intelligence, or AGI. Contrastive learning, which trains a model to place matching pairs of inputs (such as an image and its description) close together in embedding space and mismatched pairs far apart, can be used to improve these models.
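For readers who want to see the idea concretely, here is a minimal sketch of a symmetric contrastive loss of the kind commonly used to align two modalities. It is an illustration under stated assumptions, not the method of any particular paper or product: the function names, the temperature value of 0.07, and the random vectors standing in for image and report embeddings are all hypothetical, and a real system would obtain its embeddings from trained encoders.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Contrastive loss over a batch where row i of emb_a matches row i of emb_b.

    The temperature is an assumed, illustrative value.
    """
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    logits = a @ b.T / temperature  # pairwise similarity matrix, shape (batch, batch)
    idx = np.arange(len(a))         # matching pairs sit on the diagonal
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)

# Hypothetical stand-ins for encoder outputs: 4 paired samples, 8 dimensions each.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
aligned_report_emb = image_emb + 0.01 * rng.normal(size=(4, 8))  # near-matching pairs
shuffled_report_emb = aligned_report_emb[[1, 2, 3, 0]]           # deliberately mismatched

aligned_loss = symmetric_contrastive_loss(image_emb, aligned_report_emb)
shuffled_loss = symmetric_contrastive_loss(image_emb, shuffled_report_emb)
print("aligned pairs score lower loss:", aligned_loss < shuffled_loss)
```

The loss is low when each sample is most similar to its own partner in the other modality and high when partners are mismatched, which is the signal that pulls the two modalities into a shared embedding space during training.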


Examples of multimodal generative AI applications include self-driving cars, social media analysis, and new medical diagnostic tools (which combine data streams such as medical image scans, electronic medical records, and genetic test results).

Multimodal AI in Healthcare

The hope is that multimodal generative AI will accelerate the trajectory of precision health for an entire population by creating a continual health learning system (see a prior AIMed newsletter). The many modes of health data include EHR data, imaging data, genomic data, wearable and sensor data, and other -omic data. Such a health learning system will accommodate multimodal health data in a longitudinally structured format that can make use of self-supervision (similar to the aforementioned foundation model). This AI strategy would have been very helpful for a public health crisis such as the current COVID pandemic.

In conclusion, health data is multimodal in format as well as longitudinal in timeline, so the advent of both foundation models and multimodal AI is timely for the future of healthcare. These nascent AI systems will be difficult to assess, especially in complex healthcare scenarios involving patients with multisystem diseases and chronic medical conditions. These AI tools, enabled in particular by the robust transformer architecture, will realize the dream of many healthcare stakeholders: precision health with a continuous, real-time, real-world learning system that achieves real impact on population health.

As discussed previously, three key elements will need to be part of any future AI model: biomedical publications in their entirety; patient electronic medical records in large numbers; and finally, perhaps most important, the collective wisdom of many seasoned clinicians (especially what has not been published). Perhaps it is not at all surprising that the most valuable asset for AI in clinical medicine and healthcare remains the humans who actually deliver healthcare: the clinicians.

These insights on AI, including both foundation models and multimodal AI, will be discussed at the in-person AI-Med Global Summit 2024, currently scheduled for May 29-31, 2024 in Orlando, Florida. Book your place now!

See you there!