“Computation is cheap. Labels are expensive.”


I often speak about the data conundrum in health care and its impact on AI in health care (“It’s the data, stupid.”). The manuscript from Duke University Health System is laudable for its effort to provide a curated, annotated, and publicly available data set of 3-dimensional digital breast tomosynthesis examinations from over 5,000 patients, with over 22,000 reconstructed studies. In addition, the authors have shared the algorithm with the journal.

The data set comprised 4 groups of studies: 1) normal (91%); 2) actionable studies without biopsy (5%); 3) benign biopsy studies (2%); and 4) cancer (2%). This is the familiar class imbalance problem: cancer cases make up a small minority of the data. A deep learning algorithm (a CNN with DenseNet architecture) was developed for breast cancer detection and achieved a sensitivity of 65% on the test set of 460 examinations. The authors were forthright about the limitations of this study; an important one is the lack of longitudinal follow-up of this cohort (a gap shared by almost all such cohorts in AI imaging studies). In addition, any publicly available data set will need to address patient privacy and data security.
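
To make the class imbalance point concrete, here is a minimal sketch using the proportions quoted above. The cohort size of 10,000 and the "always negative" model are hypothetical illustrations, not anything from the paper. The sketch shows why accuracy alone is misleading when only 2% of studies are cancer, and why sensitivity is the more telling number.

```python
# Class proportions as quoted in the commentary above.
proportions = {"normal": 0.91, "actionable": 0.05, "benign": 0.02, "cancer": 0.02}


def always_negative_metrics(n_studies, cancer_fraction):
    """Accuracy and sensitivity of a hypothetical model that never predicts cancer."""
    positives = round(n_studies * cancer_fraction)  # cancer studies
    negatives = n_studies - positives               # everything else
    true_positives = 0                              # the model never flags cancer
    true_negatives = negatives                      # so every non-cancer study is "correct"
    accuracy = (true_positives + true_negatives) / n_studies
    sensitivity = true_positives / positives if positives else float("nan")
    return accuracy, sensitivity


# Hypothetical cohort size, chosen only to keep the arithmetic simple.
accuracy, sensitivity = always_negative_metrics(10_000, proportions["cancer"])
print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}")
```

A model that misses every cancer still scores 98% accuracy in this setting, which is why the reported 65% sensitivity, rather than overall accuracy, is the figure to watch.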

Although some efforts have been made in data sharing, the overall data sharing strategy among institutions remains marginally effective at best. The pandemic accelerated this effort to some degree, but the momentum appears to have waned after an initial period of enthusiasm. This data set and many others in the future will be essential for building a library of health care data for generations to come. We should also continue to explore all possible solutions to this data conundrum, including methodologies that do not require sharing data, such as federated and swarm learning.
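
For readers unfamiliar with federated learning, the core idea can be sketched in a few lines. This is a simplified illustration of the aggregation step in federated averaging only, and it is not taken from the paper. The function name, the site parameters, and the study counts are all hypothetical. Each institution trains locally and shares only model parameters, which a coordinator combines, weighted by how much data each site holds.

```python
def federated_average(site_weights, site_sizes):
    """Combine per-site model parameters into one global set.

    site_weights: one list of parameters per site (all the same length).
    site_sizes:   number of local training studies at each site.
    Only parameters leave each site; the imaging data stays local.
    """
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(site_weights, site_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical example: two hospitals with different amounts of data.
site_weights = [[0.2, 1.0], [0.4, 3.0]]  # locally trained parameters
site_sizes = [100, 300]                  # local study counts
global_weights = federated_average(site_weights, site_sizes)
print(global_weights)
```

The larger site pulls the global parameters toward its own values. A real deployment would repeat this over many training rounds and add safeguards such as secure aggregation.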

The full paper can be read here.