Mammogram interpretation is often subject to variability, which can produce false positives or false negatives and makes it challenging to screen for breast cancer at an early stage. On New Year’s Day, Nature published the findings on an artificial intelligence (AI) driven breast cancer screening tool developed by Google Health and DeepMind. The tool is reported to have reduced the rates of both false positives and false negatives in mammogram readings across two distinct samples of patients from the US and the UK. In an independent study, it also outperformed six human radiologists and reduced the workload of the second reader by 88%.
AI generalization has been overlooked
Most major media outlets picked up the finding, but once again their attention was mostly drawn to the idea of human clinicians being “undermined” by a system of algorithms. As the researchers from Google Health and DeepMind acknowledge in the article itself, this is neither new nor one of a kind: past and ongoing research has shown AI matching or surpassing human expert performance in screening for retinal disease, lung cancer, skin cancer and diabetic retinopathy.
In breast cancer detection, a group of six deep learning experts from Korea founded Lunit and trained its INSIGHT algorithm on chest x-rays and mammograms. They beat Microsoft and IBM in the 2016 Tumor Proliferation Assessment Challenge with a 97% accurate detection rate for both lung and breast cancers. Last October, researchers from New York University also demonstrated that their AI system can improve radiologists’ performance in assessing breast cancer. What makes the Google Health and DeepMind breast cancer screening tool special is that its ability to generalize across populations and screening settings was evaluated.
Thus far, most AI prediction or screening studies have trained, validated and tested their algorithms on data drawn from the same, or a very similar, data pool. There is little evidence on whether an AI system capable of “outperforming” its human counterparts will work equally well on an entirely different sample of patients and settings without being fed additional training and validation data. Here, researchers from Google Health and DeepMind trained their AI system using only the UK dataset and applied it to the US test set.
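To make the idea of testing on a different data pool concrete, here is a minimal sketch in Python. It is not the method used in the study: the data is synthetic, the "model" is a single score cut-off, and every name (`make_pool`, `fit_threshold`, the `shift` parameter that mimics a population difference) is a hypothetical chosen for illustration. It only shows the evaluation pattern: fit on pool A, then measure performance on a held-out part of pool A and on an untouched pool B.

```python
import random

def make_pool(n, shift, seed):
    """Synthetic (score, label) pairs; `shift` mimics a population difference."""
    rng = random.Random(seed)
    pool = []
    for _ in range(n):
        label = 1 if rng.random() < 0.2 else 0       # roughly 20% positive cases
        score = rng.gauss(1.5 * label + shift, 1.0)  # positives score higher on average
        pool.append((score, label))
    return pool

def fit_threshold(train):
    """Pick the score cut-off that maximises accuracy on the training data."""
    def accuracy(t):
        return sum((score >= t) == bool(label) for score, label in train) / len(train)
    return max((score for score, _ in train), key=accuracy)

def sensitivity_specificity(pool, threshold):
    """Return (sensitivity, specificity) of the cut-off on a labelled pool."""
    tp = sum(1 for s, y in pool if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pool if y == 1 and s < threshold)
    tn = sum(1 for s, y in pool if y == 0 and s < threshold)
    fp = sum(1 for s, y in pool if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Pool A plays the role of the training population, pool B the unseen one.
pool_a = make_pool(1000, shift=0.0, seed=1)
pool_b = make_pool(1000, shift=0.4, seed=2)

train, internal_test = pool_a[:700], pool_a[700:]
threshold = fit_threshold(train)

internal = sensitivity_specificity(internal_test, threshold)  # same pool, held out
external = sensitivity_specificity(pool_b, threshold)         # different pool

print("internal sensitivity/specificity:", internal)
print("external sensitivity/specificity:", external)
```

A gap between the internal and external figures is what a generalization check is meant to reveal; evaluating only on the internal split would hide it.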
Significance of AI generalization
They found that the AI system still showed an improvement in specificity and sensitivity compared with human radiologists, reducing false positives and false negatives by 3.5% and 8.1% respectively. When the AI system was trained using both UK and US samples, false positives and false negatives decreased by 5.7% and 9.4% in US patients, and by 1.2% and 2.7% in UK patients.
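For readers less familiar with these terms, the short Python sketch below shows how sensitivity, specificity and a reduction in false positives or false negatives relate to one another. The counts are invented for illustration and are not taken from the study, and the sketch assumes the reductions are read as absolute percentage-point changes in the false positive and false negative rates. The `Confusion` class and `absolute_reduction` helper are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Confusion:
    """Counts of screening outcomes against the confirmed diagnosis."""
    tp: int  # cancer present, flagged
    fp: int  # no cancer, flagged (false positive)
    tn: int  # no cancer, not flagged
    fn: int  # cancer present, missed (false negative)

    @property
    def sensitivity(self):
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self):
        return self.tn / (self.tn + self.fp)

    @property
    def false_positive_rate(self):
        return self.fp / (self.fp + self.tn)  # equals 1 - specificity

    @property
    def false_negative_rate(self):
        return self.fn / (self.fn + self.tp)  # equals 1 - sensitivity

def absolute_reduction(before, after):
    """Difference between two rates, expressed in percentage points."""
    return (before - after) * 100

# Hypothetical counts for the same 1,100 cases read two ways.
reader = Confusion(tp=80, fp=100, tn=900, fn=20)
ai = Confusion(tp=85, fp=80, tn=920, fn=15)

fp_drop = absolute_reduction(reader.false_positive_rate, ai.false_positive_rate)
fn_drop = absolute_reduction(reader.false_negative_rate, ai.false_negative_rate)

print(f"false positives reduced by {fp_drop:.1f} percentage points")
print(f"false negatives reduced by {fn_drop:.1f} percentage points")
```

Under this reading, a drop in the false positive rate is the same size as the gain in specificity, and a drop in the false negative rate is the same size as the gain in sensitivity.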
These results mean the Google Health and DeepMind AI system has overcome a rather big challenge: it did not require additional training and validation data to be used on patients from a different data pool. Still, caution is needed when AI is generalized. First, although the UK and the US are two different countries, both sit within a Western context, so their patients may have overlapping demographics and traits. It remains unknown whether the AI system will stay effective should the data come from two or more vastly distinct populations.
Next, the study uses a retrospective comparison: biopsy-confirmed breast cancer outcomes were used to evaluate the performance of the AI system. This is a very common way of assessing AI, but it cannot forecast the prognosis of a medical condition or its survival rate, which are equally important in cancer treatment.