A lack of generalizability is a major hurdle to clinical adoption of artificial intelligence (AI). A group of researchers in Korea found that as of 17 August 2018, only 6% of 516 studies on radiology AI performed an external or independent validation. Without concrete metrics to assess generalizability, AI developers will have to ensure that the data used for developing an algorithm and the data used for validation come from different clinical sites; otherwise there is no real understanding of how well the algorithm can actually perform.
Consider MRNet, an algorithm created by a group of Stanford University researchers to detect tears of the anterior cruciate ligament on knee MRI scans. The researchers divided a dataset of 1,370 knee MRI scans obtained from Stanford University Hospital into three groups to train, fine-tune, and validate MRNet, and arrived at an impressive Area Under the Curve (AUC) of 0.96.
The researchers then performed an external validation using data from a Turkish study containing 917 MRI scans, and the AUC fell to 0.82. This drop points to compromised generalizability, although, according to the researchers, MRNet improved significantly after it was retrained with the new data. Nevertheless, external and/or independent validation is only part of the process; there is more to do when it comes to validating radiological AI, and below are some of AIMed’s thoughts.
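To make the AUC comparison concrete, here is a minimal, pure-Python sketch of the rank-based AUC: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The labels and scores below are invented for illustration, not MRNet’s actual data; the point is only that overlapping scores on a new site drag the AUC down.

```python
def auc(labels, scores):
    """Rank-based AUC: fraction of positive/negative pairs the model orders correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores: the model separates same-site cases cleanly,
# while new-site positives and negatives overlap, so the AUC drops.
internal_labels, internal_scores = [1, 1, 1, 0, 0, 0], [0.95, 0.90, 0.60, 0.40, 0.20, 0.10]
external_labels, external_scores = [1, 1, 1, 0, 0, 0], [0.80, 0.55, 0.45, 0.50, 0.30, 0.15]

auc(internal_labels, internal_scores)  # 1.0 on the hypothetical same-site data
auc(external_labels, external_scores)  # ~0.89 on the hypothetical new-site data
```

Tooling such as scikit-learn’s `roc_auc_score` computes the same quantity in practice; the hand-rolled version is shown here only to keep the example dependency-free.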
Developing ground truth
Often, there is a need for corresponding datasets to cultivate a so-called objective ground truth. In radiology, data have to be analogous to test the performance of an algorithm: chest x-rays correspond with chest CT scans, while CT scans correspond with biopsies, and so forth. This requires developers to “clean” and “refine” all unstructured data into structured data. Some studies have multiple human radiologists arrive at a consensus on whether certain findings (i.e., the presence of a tumor) truly exist in the scans, and use that as the basis to test how far the AI models deviate from this established ground truth. The importance of maintaining a high level of objectivity is what makes this the most painstaking step in the whole validation process.
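The consensus step described above can be sketched as a majority vote across readers. The snippet below is a simplified illustration with invented scan IDs and labels; real studies use more elaborate adjudication protocols, but the shape of the logic is the same.

```python
from collections import Counter

def consensus_label(annotations):
    """Majority vote across radiologists; ties are flagged for adjudication."""
    top = Counter(annotations).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "adjudicate"  # no majority: escalate to a senior reader
    return top[0][0]

# Hypothetical reads from three radiologists per scan
reads = {
    "scan_001": ["tumor", "tumor", "no tumor"],
    "scan_002": ["no tumor", "no tumor", "no tumor"],
}
ground_truth = {scan: consensus_label(r) for scan, r in reads.items()}
# ground_truth == {"scan_001": "tumor", "scan_002": "no tumor"}
```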
Looking out for false positive and false negative
To check an algorithm’s false positive rate, there is a need for datasets comprising mostly negative cases (i.e., cases without the abnormalities the algorithm was built to detect or predict) to find out how frequently the AI calls a truly negative case positive. The same logic applies when checking the false negative rate.
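These two error rates fall straight out of the confusion matrix. A minimal sketch, with hypothetical labels and predictions weighted heavily toward negatives as the paragraph suggests:

```python
def error_rates(labels, predictions):
    """False positive rate = FP / (FP + TN); false negative rate = FN / (FN + TP)."""
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    return fp / (fp + tn), fn / (fn + tp)

# A mostly-negative test set stresses the false positive rate
labels      = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predictions = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
fpr, fnr = error_rates(labels, predictions)  # fpr = 0.125, fnr = 0.5
```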
Thereafter, a deep examination of the false positives and false negatives takes place, whereby a radiologist sits side by side with his/her data scientist counterpart to go through all the mistakes the algorithm made. Together they come up with hypotheses to explain why the algorithm made those mistakes, assess the implications of the failures, and rank the shortcomings for retraining purposes.
Some researchers introduce the idea of a “dynamic threshold”, a threshold value that changes with the given clinical context, during retraining. By taking clinical context into account, AI models have been found to make fewer errors. For example, a healthy individual coming into the hospital for a routine health check should have a very high threshold for detecting a collapsed lung, or pneumothorax, compared with an Intensive Care Unit (ICU) patient whose oxygen saturation is constantly dropping.
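A dynamic threshold can be as simple as a lookup from clinical context to cut-off. The context names and threshold values below are hypothetical, chosen only to mirror the routine-checkup versus ICU example above:

```python
# Hypothetical context-dependent thresholds for calling a pneumothorax:
# a routine outpatient needs stronger evidence than a deteriorating ICU patient.
THRESHOLDS = {"routine_checkup": 0.90, "icu_desaturating": 0.30}

def flag_pneumothorax(model_score, context, default=0.60):
    """Compare the model's score against a threshold chosen by clinical context."""
    return model_score >= THRESHOLDS.get(context, default)

# The same model score triggers an alert only in the high-risk context
flag_pneumothorax(0.45, "routine_checkup")   # False
flag_pneumothorax(0.45, "icu_desaturating")  # True
```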
At the recent AIMed NHS (National Health Service) AI Lab Virtual Conference, Haris Shuaib, AI Transformation Lead and Topol Digital Health Fellow at Guy’s and St. Thomas’ NHS Foundation Trust, said that one will never have full knowledge of one’s AI models until the data are run through the technology. As such, real-world deployment is an indispensable part of the validation process.
Strategies to facilitate deployment vary, but fundamentally, AI ought to present findings in a radiologist-friendly manner. For example, an abnormality-finding AI model should automatically classify chest x-rays as normal or abnormal within the workflow. Ideally, as long as the AI finds no abnormality in the scans, a “normal” report template is generated, and radiologists can amend or approve the report after reviewing the chest x-ray.
On the other hand, a retrospective quality check can be done by asking an AI to go through the thousands of chest x-rays previously labeled “normal” in radiology reports and assign each an abnormality score. X-rays with higher abnormality scores are then re-read by a human radiologist, and the AI is likely to surface findings that radiologists missed or chose not to comment on.
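This retrospective triage amounts to filtering the archive by abnormality score and re-reading the worst offenders first. A minimal sketch, with a made-up cut-off and invented scan IDs and scores:

```python
def triage_for_reread(reports, cutoff=0.7):
    """Return previously 'normal' scans whose AI abnormality score exceeds the cutoff,
    highest score first, so a radiologist re-reads the most suspicious ones."""
    flagged = [(scan, score) for scan, score in reports.items() if score > cutoff]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Hypothetical abnormality scores for x-rays previously reported as normal
archive = {"cxr_101": 0.12, "cxr_102": 0.85, "cxr_103": 0.40, "cxr_104": 0.91}
worklist = triage_for_reread(archive)
# worklist == [("cxr_104", 0.91), ("cxr_102", 0.85)]
```

The cut-off itself is a tunable trade-off: lower it and the re-read worklist grows; raise it and more missed findings stay missed.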
In sum, validation is an ongoing process that requires constant collaboration between developers, clinicians, and users. AIMed will be hosting the AIMed Radiology virtual conference again this November, in association with the American College of Radiology (ACR). Registration is now open. You may register or obtain a copy of the agenda here to find out more about how AI is impacting radiology, including challenges in deployment and validation, as well as regulatory issues worth noting.