Artificial Intelligence Versus Radiologist False-Positives on Digital Breast Tomosynthesis Examinations in a Population-Based Screening Program.

Journal: AJR. American journal of roentgenology
Published Date:

Abstract

BACKGROUND. Insights into the nature of false-positive findings flagged by contemporary mammography artificial intelligence (AI) systems could inform the potential use of AI to reduce false-positive recall rates.

OBJECTIVE. The purpose of this study was to compare AI and radiologists in terms of characteristics of false-positive digital breast tomosynthesis (DBT) examinations in a breast cancer screening population.

METHODS. This retrospective study included 2977 women (mean age, 55 years) who were participating in an observational population-based screening study and underwent 3183 screening DBT examinations from January 2013 to June 2017. A commercial AI tool analyzed DBT examinations. Positive examinations were defined as having an elevated-risk result for AI and as having been assigned BI-RADS category 0 for interpreting radiologists. False-positive examinations were defined as the absence of a breast cancer diagnosis within 1 year. Radiologists rereviewed the imaging for AI-flagged false-positive findings.

RESULTS. The false-positive rate was 10% for both AI (304/3183) and radiologists (308/3183). Of 541 total false-positive examinations, 233 (43%) were false-positives for AI only, 237 (44%) were false-positives for radiologists only, and 71 (13%) were false-positives for both. AI-only versus radiologist-only false-positives were associated with greater mean patient age (60 vs 52 years, p < .001), lower frequency of dense breasts (24% vs 57%, p < .001), and greater frequencies of a personal history of breast cancer (13% vs 2%, p < .001), prior breast imaging studies (95% vs 78%, p < .001), and prior breast surgical procedures (37% vs 11%, p < .001). The false-positive examinations included 932 AI-only flagged findings, 315 radiologist-only flagged findings, and 49 flagged findings concordant between AI and radiologists. AI-only flagged findings were most commonly benign calcifications (40%), asymmetries (13%), and benign postsurgical change (12%); radiologist-only flagged findings were most commonly masses (47%), asymmetries (19%), and indeterminate calcifications (15%). Of 18 concordant flagged findings that were biopsied, 44% yielded high-risk lesions.

CONCLUSION. Imaging and patient-level differences were observed between AI and radiologist false-positive DBT examinations. Although only a small fraction of false-positive examinations overlapped between AI and radiologists, concordant flagged findings had a high rate of representing high-risk lesions.

CLINICAL IMPACT. The findings may help guide strategies for using AI to improve DBT recall specificity. In particular, concordant findings may represent an enriched subset of actionable abnormalities.
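
The examination-level counts reported in RESULTS can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the study's analysis; it uses only the counts stated in the abstract, and the variable and function names (such as `pct` and `summary`) are made up for this example.

```python
# Counts as reported in the RESULTS paragraph of the abstract.
TOTAL_EXAMS = 3183    # screening DBT examinations
AI_ONLY_FP = 233      # false-positive for AI only
RAD_ONLY_FP = 237     # false-positive for radiologists only
BOTH_FP = 71          # false-positive for both AI and radiologists


def pct(numerator: int, denominator: int) -> int:
    """Return a percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


# Per-reader false-positive counts are the reader-only count plus the overlap.
ai_fp = AI_ONLY_FP + BOTH_FP
rad_fp = RAD_ONLY_FP + BOTH_FP

# Examinations that were false-positive for at least one reader.
any_fp = AI_ONLY_FP + RAD_ONLY_FP + BOTH_FP

summary = {
    "ai_fp": ai_fp,
    "rad_fp": rad_fp,
    "any_fp": any_fp,
    "ai_fp_rate_pct": pct(ai_fp, TOTAL_EXAMS),
    "rad_fp_rate_pct": pct(rad_fp, TOTAL_EXAMS),
    "ai_only_share_pct": pct(AI_ONLY_FP, any_fp),
    "rad_only_share_pct": pct(RAD_ONLY_FP, any_fp),
    "both_share_pct": pct(BOTH_FP, any_fp),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The derived totals (304 for AI, 308 for radiologists, 541 overall) and the rounded percentages agree with the figures given in the abstract.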

Authors
