“We balance probabilities and choose the most likely. It is the scientific use of the imagination.”

Sherlock Holmes in The Hound of the Baskervilles

 

Undiagnosed diseases (rare diseases that are difficult to identify, atypical presentations of known disorders, and diseases yet to be described), sometimes with manifestations since birth, can lead to long and costly diagnostic odysseys as well as morbidity and even mortality. It is estimated that 25-30 million individuals in the United States alone live with a rare disease. This manuscript describes the use of machine learning models to help triage applications to the Undiagnosed Diseases Network (UDN), an NIH-sponsored national network of 12 clinical sites designed to facilitate diagnosis of difficult cases.

This retrospective and prospective prognostic study gathered applications to the UDN over a 5-year period (July 2014 to June 2019). The classifier was trained on information extracted from application forms, referral letters from health care professionals, and the semantic similarity between referral letters and textual descriptions of known Mendelian disorders.
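To make the idea of a semantic-similarity feature concrete, here is a minimal sketch using bag-of-words cosine similarity. This is an assumption for illustration only: the summary does not say how the study computed similarity, and the referral letter, the disorder descriptions, and the names `tokenize` and `cosine_similarity` are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    # Cosine similarity between the word-count vectors of two texts.
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical referral letter and disorder descriptions (not from the study).
letter = "Child with progressive muscle weakness and seizures since birth"
descriptions = {
    "disorder_a": "Progressive muscle weakness with onset in infancy",
    "disorder_b": "Recurrent kidney stones in adults",
}

# One possible feature: the best match against any known disorder description.
similarities = {
    name: cosine_similarity(letter, text) for name, text in descriptions.items()
}
max_similarity = max(similarities.values())
```

A score like `max_similarity` could then sit alongside the structured fields from the application form as one input to the classifier. A real system would likely use a richer text representation than raw word counts.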

The primary outcome was admission (vs no admission) to the UDN program (1212 admitted vs 1210 not admitted). The classifier (logistic regression with a linear kernel and a grid search to optimize the hyperparameters) was assessed by ranking applications by their predicted likelihood of being admitted to the UDN program and comparing those predictions with the actual UDN admission decisions.
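The general pattern described here (fit a logistic regression, pick a hyperparameter by grid search, then rank applications by predicted probability) can be sketched in a few lines of plain Python. This is not the study's implementation: the two features, the toy data, the grid of L2 penalties, and the use of a single hold-out set for selection are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, l2, epochs=500, lr=0.1):
    # Batch gradient descent on L2-regularized logistic loss.
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def log_loss(model, X, y):
    # Mean negative log-likelihood, clipped to avoid log(0).
    eps, total = 1e-12, 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(model, xi), eps), 1 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)


# Hypothetical features per application: [text similarity, record completeness].
X_train = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.2, 0.1], [0.3, 0.2], [0.1, 0.4]]
y_train = [1, 1, 1, 0, 0, 0]  # 1 = admitted, 0 = not admitted
X_val, y_val = [[0.85, 0.7], [0.15, 0.2]], [1, 0]

# Grid search: keep the L2 penalty with the lowest validation loss.
grid = [0.0, 0.01, 0.1, 1.0]
models = {l2: fit_logistic(X_train, y_train, l2) for l2 in grid}
best_l2 = min(grid, key=lambda l2: log_loss(models[l2], X_val, y_val))
best_model = models[best_l2]

# Rank new (hypothetical) applications from most to least likely to be admitted.
new_apps = {"app_1": [0.2, 0.2], "app_2": [0.9, 0.9], "app_3": [0.5, 0.5]}
scores = {name: predict_proba(best_model, x) for name, x in new_apps.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In practice a library implementation with cross-validated grid search would replace the hand-written training loop. The ranking step stays the same: sort applications by predicted probability and review the top of the list first.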

In addition, clinical BERT (bidirectional encoder representations from transformers) was trained on the clinical text. The classifier achieved a sensitivity of 0.843, a specificity of 0.738, and an area under the ROC curve of 0.844 for predicting admission to the UDN program. The mean processing time of accepted applications was also measured with and without the classifier; using the classifier reduced processing time from 3.29 to 1.05 months (interestingly, the latter is still about a month even with a classifier-enabled strategy).
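For readers less familiar with the three metrics reported above, the following sketch shows how sensitivity, specificity, and area under the ROC curve are computed from true labels and predicted scores. The labels, scores, and the 0.5 decision threshold are made-up toy values, not data from the study.

```python
def confusion_counts(y_true, y_score, threshold=0.5):
    # Count true/false positives and negatives at a given score threshold.
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_auc(y_true, y_score):
    # AUC as the probability that a randomly chosen positive case scores
    # higher than a randomly chosen negative case (ties count as half).
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: 1 = admitted, 0 = not admitted; scores are predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

tp, fp, tn, fn = confusion_counts(y_true, y_score)
sens = tp / (tp + fn)  # share of admitted applications correctly flagged
spec = tn / (tn + fp)  # share of non-admitted applications correctly passed over
auc = roc_auc(y_true, y_score)

print(f"sensitivity={sens:.3f} specificity={spec:.3f} auc={auc:.3f}")
# prints: sensitivity=0.750 specificity=0.750 auc=0.875
```

Sensitivity and specificity depend on the chosen threshold, while the AUC summarizes ranking quality across all thresholds, which is why it suits a classifier used to prioritize applications.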

Therefore, the classification system can distinguish, prioritize, and expedite admission to the UDN for patients with difficult-to-diagnose diseases. This computational approach could bring clinical efficacy and efficiency as well as cost savings to difficult-to-diagnose situations.

Whether this algorithmic strategy will be accepted as part of the overall UDN process remains to be seen, but perhaps this approach can be applied to difficult-to-diagnose patients in general, in order to prioritize those to be sequenced and to minimize the cost burden on the health system.

What is absolutely necessary in the endeavor to diagnose the undiagnosed, however, is a synergy between machine intelligence and human cognition.

 

The full article can be read here