A new peer-reviewed study testing the coverage, accuracy and safety of the eight most popular online symptom assessment apps has found that the performance of apps varies widely, with only a handful performing close to the levels of human general practitioners.

Published in BMJ Open, the study is the first of its kind to be published since 2015 and was conducted by a team of doctors and scientists led by global digital health company Ada Health.

Key Findings

Coverage 

The study looked at how comprehensively the apps covered possible conditions and user types, and found that just a few of the most popular apps are configured to cover all patients. The most comprehensive app was Ada, which provided a condition suggestion 99% of the time. The other apps tested provided a suggestion 69.5% of the time on average, with the lowest-scoring app doing so just 51% of the time. The least comprehensive apps were unable to suggest conditions for a significant number of cases, including key groups such as children, patients with a mental health condition, and those who were pregnant. Human GPs provided 100% coverage.

Accuracy

The study also considered the accuracy of each symptom assessment app by comparing the conditions suggested with what was deemed to be the ‘gold standard’ answer for each case as determined by a panel of doctors.

The study found that the apps’ clinical accuracy was also highly variable. Ada was rated as the most accurate app, suggesting the right condition in its top three suggestions 71% of the time. The average across all the other apps was just 38%, with scores ranging from 23.5% to 43%. This means that, with the exception of Ada, the apps did not correctly identify the possible conditions in the majority of cases. Human GPs were the most accurate overall, with 82% accuracy.

Safety 

The study also assessed the safety of the apps’ advice by examining whether the guidance they provided – such as staying at home to manage symptoms, or going to see a doctor – was considered to have the appropriate level of urgency.

While most apps gave safe advice in the majority of cases, only three apps performed close to the level of human GPs: Ada, Babylon, and Symptomate. Although all the apps assessed scored above 80% on safety, compared with 97% for human GPs, even a small disparity in the safety of advice could have a major impact on patient outcomes if the apps are deployed at scale.

Dr. Claire Novorol, co-founder and Chief Medical Officer, Ada Health, said: “Symptom assessment apps have seen rapid uptake by users in recent years as they are easy to use, convenient and can provide invaluable guidance and peace of mind. When used in a clinical setting to support – rather than replace – doctors, they also have huge potential to reduce the burden on strained healthcare systems and improve outcomes. This peer-reviewed study provides important new insights into the development and performance of these tools. In particular, it shows that there is still much work to be done to make sure that these technologies are being built to be inclusive and to cover all patients. We believe this is vital if symptom assessment apps are to fulfil their potential: human doctors don’t have the luxury of cherry-picking which patients they help and digital health must be held to the same standard.”

Dr. Hamish S F Fraser, Associate Professor of Medical Science, Brown Center for Biomedical Informatics, commented: “Compared to a similar study from five years ago, this larger and more rigorous study shows improved performance with results closer to those of physicians. It also demonstrates the importance of knowing when apps cannot handle certain conditions. While this is a preclinical study, the one-third of clinical vignettes based on real NHS 111 helpline consultations provide an important link to real urgent care challenges. Notably, both the GPs and the apps tended to perform somewhat worse when tested on those cases.

“These results should help to determine which apps are ready for clinical testing in observational studies and then randomized controlled trials. The study design could form a model for future evaluations of symptom checker apps, and as part of assessment for regulatory approval.”

The full paper can be read on BMJ Open.