Multi-site Evaluation of a Study-Level Classifier for Mammography Using Deep Learning
We present a computer-aided diagnosis algorithm for mammography trained and validated on studies acquired from six clinical sites. We hold out the full dataset from a seventh hospital for testing to assess the algorithm’s ability to generalize to new sites. Our classifiers are convolutional neural networks that take multiple input images from a mammography study and produce classifications for the study. The studies are globally labeled as normal, biopsy benign, high risk, or biopsy malignant. We report experimental results from several network variants, including study-level and breast-level models, single- and multiple-output models, and a novel model architecture that incorporates prior studies. Each model variant includes an image-level classifier that is pre-trained with per-image labels and is used as a feature extractor in our study-level models. Our best study-level model achieves an area under the ROC curve of 0.85 for normal vs. malignant classification on the held-out test site. Compared with other recent work, we achieve a similar level of classification sensitivity and specificity on a dataset with greater site and vendor variation. Additionally, we report test performance on a held-out site, which more accurately reflects how the model would perform when deployed in the field.
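The held-out-site evaluation described above can be illustrated with a minimal sketch. It is not our implementation: the record layout, the site identifiers, the function names `split_by_site` and `normal_vs_malignant_auc`, and all scores are hypothetical and chosen only for illustration. The sketch places every study from one site in the test set and computes the normal vs. malignant AUC as the fraction of (malignant, normal) pairs in which the malignant study receives the higher score.

```python
from typing import List, Sequence, Tuple

# Hypothetical record layout: (site_id, study_label, malignancy_score).
Study = Tuple[str, str, float]


def split_by_site(
    studies: Sequence[Study], held_out_site: str
) -> Tuple[List[Study], List[Study]]:
    """Send every study from one site to the test set, the rest to development."""
    dev = [s for s in studies if s[0] != held_out_site]
    test = [s for s in studies if s[0] == held_out_site]
    return dev, test


def normal_vs_malignant_auc(studies: Sequence[Study]) -> float:
    """AUC as the probability that a malignant study outscores a normal one.

    Ties count as half a win. Studies labeled biopsy benign or high risk
    are ignored for this binary comparison.
    """
    pos = [score for _, label, score in studies if label == "biopsy malignant"]
    neg = [score for _, label, score in studies if label == "normal"]
    if not pos or not neg:
        raise ValueError("need at least one normal and one malignant study")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data with made-up scores; not taken from the dataset in this paper.
STUDIES: List[Study] = [
    ("site_1", "normal", 0.10),
    ("site_1", "biopsy malignant", 0.90),
    ("site_2", "biopsy benign", 0.40),
    ("site_7", "normal", 0.20),
    ("site_7", "biopsy malignant", 0.80),
    ("site_7", "normal", 0.85),
    ("site_7", "high risk", 0.50),
    ("site_7", "biopsy malignant", 0.60),
]

dev_studies, test_studies = split_by_site(STUDIES, held_out_site="site_7")
test_auc = normal_vs_malignant_auc(test_studies)
```

Splitting by site rather than by study keeps scanner, vendor, and population characteristics of the test site entirely unseen during training, which is the property the held-out evaluation is meant to probe. The pairwise AUC above is quadratic in the number of studies; a rank-based computation would be preferable at realistic dataset sizes.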