Accuracy of Large Language Model-based Automatic Calculation of Ovarian-Adnexal Reporting and Data System MRI Scores from Pelvic MRI Reports.
Journal: Radiology
PMID: 40167432
Abstract
Background The Ovarian-Adnexal Reporting and Data System (O-RADS) for MRI helps assign malignancy risk, but radiologist adoption is inconsistent. Automatic assignment of O-RADS scores from reports could increase adoption and accuracy.
Purpose To evaluate the accuracy of large language models (LLMs), after strategic optimization, for automatically calculating O-RADS scores from reports.
Materials and Methods This retrospective single-center study from a large quaternary care cancer center included consecutive gadolinium chelate-enhanced pelvic MRI reports with at least one assigned O-RADS score from July 2021 to October 2023. Reports from January 2018 to October 2019 (before O-RADS MRI implementation) were randomly selected for additional testing. Reference standard O-RADS scores were determined by radiologists interpreting reports. After prompt optimization using a subset of reports, two LLM-based strategies were evaluated: few-shot learning with GPT-4 (version 0613; OpenAI) prompted with O-RADS rules ("LLM only") and a hybrid strategy leveraging GPT-4 to classify features fed into a deterministic formula ("hybrid"). The accuracy of each model and of the originally reported scores was calculated and compared using the McNemar test.
Results A total of 284 reports from 284 female patients (mean age, 53.2 years ± 16.3 [SD]) with 372 adnexal lesions were included: 10 reports in the training set (16 lesions), 134 reports in internal test set 1 (173 lesions; 158 with an O-RADS score originally assigned), and 140 reports in internal test set 2 (183 lesions). For assigning O-RADS MRI scores, the hybrid model accuracy (97%; 168 of 173) exceeded that of the LLM-only model (90%; 155 of 173; P = .006). For lesions with an originally reported O-RADS score, hybrid model accuracy exceeded that of reporting radiologists (97% [153 of 158] vs 88% [139 of 158]; P = .004). The hybrid model also outperformed the LLM-only model for 183 lesions from before O-RADS implementation (95% [173 of 183] vs 87% [159 of 183], respectively; P = .01).
Conclusion A hybrid LLM-based application, combining LLM feature classification with deterministic elements, accurately assigned O-RADS MRI scores from report descriptions, with accuracy exceeding that of both an LLM-only strategy and the original reporting radiologists. © RSNA, 2025
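The reported percentages can be recomputed from the lesion counts given in the Results, and the paired comparison can be illustrated with an exact McNemar test. The following is a minimal sketch, not the authors' analysis code: the helper names `accuracy_pct` and `mcnemar_exact` are hypothetical, and because the abstract does not report the discordant-pair counts that the McNemar test requires, the counts passed to `mcnemar_exact` are invented purely to show the calculation and will not reproduce the published P values.

```python
from math import comb


def accuracy_pct(correct, total):
    """Accuracy as a whole-number percentage."""
    return round(100 * correct / total)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar P value.

    b: lesions scored correctly by method A but not by method B
    c: lesions scored correctly by method B but not by method A
    Under the null hypothesis, the smaller discordant count follows
    Binomial(b + c, 0.5); concordant pairs do not enter the test.
    """
    n = b + c
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Counts taken from the Results section: (correct, total lesions).
reported = {
    "hybrid, test set 1": (168, 173),
    "LLM only, test set 1": (155, 173),
    "hybrid, lesions with reported score": (153, 158),
    "radiologists, lesions with reported score": (139, 158),
    "hybrid, test set 2": (173, 183),
    "LLM only, test set 2": (159, 183),
}

for name, (correct, total) in reported.items():
    print(f"{name}: {accuracy_pct(correct, total)}% ({correct} of {total})")

# ASSUMPTION: hypothetical discordant counts, not taken from the study.
print(f"illustrative exact McNemar P = {mcnemar_exact(1, 14):.4f}")
```

The McNemar test suits this design because both strategies score the same lesions, so only the lesions on which they disagree carry information about which is more accurate.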