Accuracy of Large Language Model-based Automatic Calculation of Ovarian-Adnexal Reporting and Data System MRI Scores from Pelvic MRI Reports.

Journal: Radiology
PMID:

Abstract

Background The Ovarian-Adnexal Reporting and Data System (O-RADS) for MRI helps assign malignancy risk, but radiologist adoption is inconsistent. Automatic assignment of O-RADS scores from reports could increase adoption and accuracy.
Purpose To evaluate the accuracy of large language models (LLMs), after strategic optimization, for automatically calculating O-RADS scores from reports.
Materials and Methods This retrospective single-center study from a large quaternary care cancer center included consecutive gadolinium chelate-enhanced pelvic MRI reports with at least one assigned O-RADS score from July 2021 to October 2023. Reports from January 2018 to October 2019 (before O-RADS MRI implementation) were randomly selected for additional testing. Reference standard O-RADS scores were determined by radiologists interpreting reports. After prompt optimization using a subset of reports, two LLM-based strategies were evaluated: few-shot learning with GPT-4 (version 0613; OpenAI) prompted with O-RADS rules ("LLM only") and a hybrid strategy leveraging GPT-4 to classify features fed into a deterministic formula ("hybrid"). The accuracy of each model and of the originally reported scores was calculated and compared using the McNemar test.
Results A total of 284 reports from 284 female patients (mean age, 53.2 years ± 16.3 [SD]) with 372 adnexal lesions were included: 10 reports in the training set (16 lesions), 134 reports in internal test set 1 (173 lesions; 158 with an originally assigned O-RADS score), and 140 reports in internal test set 2 (183 lesions). For assigning O-RADS MRI scores, the hybrid model accuracy (97%; 168 of 173) exceeded that of the LLM-only model (90%; 155 of 173; P = .006). For lesions with an originally reported O-RADS score, hybrid model accuracy exceeded that of reporting radiologists (97% [153 of 158] vs 88% [139 of 158]; P = .004). The hybrid model also outperformed the LLM-only model for 183 lesions from before O-RADS implementation (95% [173 of 183] vs 87% [159 of 183], respectively; P = .01).
Conclusion A hybrid LLM-based application, combining LLM feature classification with deterministic elements, accurately assigned O-RADS MRI scores from report descriptions, exceeding both an LLM-only strategy and the original reporting radiologist. © RSNA, 2025
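
The accuracy figures above can be checked directly from the reported counts. The sketch below is illustrative only: it recomputes each proportion and includes a two-sided exact McNemar test for paired classifications. The abstract does not report the discordant-pair counts or say which variant of the McNemar test was used, so `mcnemar_exact` and the names `accuracy` and `reported` are assumptions made for illustration, not the authors' code, and the P values above cannot be reproduced from the abstract alone.

```python
from math import comb


def accuracy(correct: int, total: int) -> float:
    """Proportion of lesions assigned the reference-standard score."""
    return correct / total


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value.

    b and c are the discordant-pair counts: lesions that method A scored
    correctly and method B did not, and vice versa. These counts are NOT
    given in the abstract; any values passed in are hypothetical.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# (correct, total) pairs exactly as reported in the Results.
reported = {
    "hybrid, test set 1": (168, 173),
    "LLM only, test set 1": (155, 173),
    "hybrid, originally scored lesions": (153, 158),
    "radiologists, originally scored lesions": (139, 158),
    "hybrid, test set 2": (173, 183),
    "LLM only, test set 2": (159, 183),
}

for name, (correct, total) in reported.items():
    print(f"{name}: {accuracy(correct, total):.1%}")
```

Because the McNemar test depends only on the lesions where the two methods disagree, two methods with the same marginal accuracies can yield different P values; the per-lesion paired results would be needed to apply `mcnemar_exact` to this study.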

Authors

  • Rajesh Bhayana
    University Medical Imaging Toronto, Joint Department of Medical Imaging, University Health Network, Mount Sinai Hospital and Women's College Hospital, Department of Medical Imaging, University of Toronto, Toronto General Hospital, 200 Elizabeth St, Peter Munk Building, 1st Fl, Toronto, ON, Canada M5G 2C4.
  • Ankush Jajodia
    University Medical Imaging Toronto, Joint Department of Medical Imaging, University Health Network, Mount Sinai Hospital and Women's College Hospital, Department of Medical Imaging, University of Toronto, Toronto General Hospital, 200 Elizabeth St, Peter Munk Building, 1st Fl, Toronto, ON, Canada M5G 2C4.
  • Tanya Chawla
    University Medical Imaging Toronto, Joint Department of Medical Imaging, University Health Network, Mount Sinai Hospital and Women's College Hospital, Department of Medical Imaging, University of Toronto, Toronto General Hospital, 200 Elizabeth St, Peter Munk Building, 1st Fl, Toronto, ON, Canada M5G 2C4.
  • Yangqing Deng
    Department of Biostatistics, University Health Network, Toronto, Canada.
  • Genevieve Bouchard-Fortier
    Department of Obstetrics and Gynecology, University of Toronto, Toronto, Canada.
  • Masoom Haider
    Department of Biostatistics, University Health Network, Toronto, Canada.
  • Satheesh Krishna
    Department of Biostatistics, University Health Network, Toronto, Canada.