“Better than a thousand hollow words is one word that brings peace”


Pre-trained language models have rapidly expanded their capabilities in natural language processing (NLP) over the past few years. Bidirectional Encoder Representations from Transformers (BERT) already has biomedical variants (such as BioBERT and PubMedBERT), and comparable biomedical variants of Generative Pre-trained Transformers (GPT) are now emerging. GPT-3 and, more recently, ChatGPT have featured in the mainstream media and in ongoing discussions of artificial intelligence in health.

The authors of the paper “BioGPT: generative pre-trained transformer for biomedical text generation and mining” introduce BioGPT, a domain-specific generative transformer language model pre-trained on large-scale biomedical literature for biomedical text generation and mining. They adopted GPT-2 as the backbone model and pre-trained it on 15 million PubMed abstracts.

The BioGPT model was assessed on six biomedical NLP tasks, and the authors report that it outperformed previous models on most of them. It achieved an impressive 78% accuracy on PubMedQA, the highest reported at the time. PubMedQA is a question-answering dataset built from PubMed abstracts (PubMed holds over 30 million and growing) and was first introduced in 2019. It was presented as the first QA dataset that requires reasoning over biomedical research texts to answer its questions, each of which takes a yes/no/maybe answer. The larger version, BioGPT-Large, scored an even higher PubMedQA accuracy of 81%. Other benchmarks include BC5CDR (1,500 PubMed articles annotated with 4,409 chemicals and 5,818 diseases), KD-DTI (a drug-target interaction dataset), and DDI (an annotated corpus of pharmacological substances and drug-drug interactions), all used for end-to-end relation extraction.
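To make the headline metric concrete, here is a minimal sketch of how accuracy on a yes/no/maybe task can be computed. It is not the authors' evaluation code; the function name `accuracy` and the toy labels are hypothetical and chosen purely for illustration.

```python
# Allowed answer labels for a PubMedQA-style question.
LABELS = ("yes", "no", "maybe")


def accuracy(predictions, references):
    """Return the fraction of questions whose predicted label matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    for label in list(predictions) + list(references):
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label!r}")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical toy data (not taken from PubMedQA).
references = ["yes", "no", "maybe", "yes", "no"]
predictions = ["yes", "no", "yes", "yes", "maybe"]

score = accuracy(predictions, references)
print(f"accuracy = {score:.0%}")  # 3 of 5 correct
```

A single accuracy figure treats every question equally, so it says nothing about which kinds of questions a model gets wrong.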

Several issues regarding these large language models remain. Even published manuscripts and reviews contain erroneous and/or outdated information, so “accuracy” measured against published materials may not adequately reflect real-world situations and real-world patient data (especially given the substantial body of unpublished experience and observation). For instance, many children with Kawasaki disease never fulfill the entire set of diagnostic criteria, yet these “top-down” criteria are widely published and can lead to under-diagnosis.

In addition, one should consider how clinicians, as well as patients, interpret the answers to biomedical queries. The public may interpret a positive diagnosis (such as premature atrial contractions) differently from clinicians, and this discrepancy may lead to confusion and unnecessary worry.

One should also assess outcome variables from the application of such a tool, for clinicians as well as for patients and families. An “accuracy arms race” does not necessarily lead to improved patient outcomes or satisfaction, as we have observed with prior efforts to improve the accuracy of convolutional neural networks in medical image interpretation.

Perhaps someday this kind of language model will be multi-dimensional and combined with real-world, real-time data, giving us the best of both worlds.


These fascinating topics, along with others, will be discussed at the annual AIMed Global Summit 2023. Book your place now!

We at AIMed believe in changing healthcare one connection at a time. If you are interested in discussing the contents of this article or connecting, please drop me a line – [email protected]