Data Extraction from Oncology Imaging Reports by Large Language Models: A Comparative Accuracy Study

Journal: medRxiv
Published Date:

Abstract

Manual data extraction from clinical text is resource intensive. Locally hosted large language models (LLMs) may offer a privacy-preserving solution, but their performance on non-English data remains unclear. This study investigated whether the classification accuracy of locally hosted LLMs is non-inferior to human accuracy when determining metastasis status and treatment response from German radiology reports.

In this retrospective comparative accuracy study, five locally hosted LLMs (llama3.3:70b, mistral-small:24b, qwq:32b, qwen3:32b, and gpt-oss:120b) were compared against humans. To calculate accuracy, a ground truth was established via duplicate human extraction, with discrepancies adjudicated by a senior oncologist. Both the initial human extraction and the LLM outputs were compared against this ground truth. The study was conducted at a tertiary referral hospital in Switzerland; data processing and analyses took place inside the hospital network. The sample comprised 400 randomly sampled radiology reports (CT, MRI, PET) from adult cancer patients, generated between January 2023 and May 2025. The LLMs classified metastasis status and treatment response automatically using a standardized prompt pipeline, and this was compared with manual human review.

Primary outcomes were non-inferiority (margin of 5 percentage points [pp]) of LLM classification accuracy compared with human accuracy for metastasis status (presence/absence by anatomical site) and for treatment response categories. Secondary outcomes included accuracy for primary tumor diagnosis, accuracy for radiological absence of tumor, and extraction time per report.

The analysis included 400 reports from 317 patients (mean age 63 years, 32% women). On the test set (n=300), human accuracy for metastasis status was 98.4% (95% CI 98.0%–98.8%). All LLMs were non-inferior; gpt-oss:120b performed best (97.6% accuracy; difference −0.8 pp [90% CI, −1.3 to −0.3 pp]). For response to treatment, human accuracy was 86.0% (95% CI 83.2%–88.8%). All LLMs were inferior; the most accurate model, gpt-oss:120b, achieved 78.3% (difference −7.7 pp [90% CI, −11.6 to −3.8 pp]). Mean human time per report was 120 seconds vs 11–63 seconds for LLMs.

In this study, LLMs were non-inferior to human accuracy for classification of metastasis status but were inferior for assessment of response to treatment. gpt-oss:120b was the most accurate among the tested LLMs. Registration: OSF 45PVQ.

Can locally hosted large language models (LLMs) match human performance when extracting sites of metastases and response to treatment from radiology reports of cancer patients? In this preregistered, single-center study of 300 German radiology reports, all evaluated LLMs were non-inferior to humans in extracting the presence or absence of metastasis by organ site, but LLMs were inferior to humans in classifying response to treatment. LLMs can be suitable for classification of metastasis status, whereas more caution is warranted for more complex tasks where additional clinical reasoning may be required.
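
The non-inferiority reading of the reported intervals can be illustrated with a short sketch. It uses only the 90% CI bounds and the 5 pp margin stated in the abstract. The function name `classify_noninferiority` is hypothetical, and the rule for labeling a result "inferior" (the lower bound crosses the margin and the whole interval lies below zero) is an assumed convention; the abstract does not describe the statistical method the authors actually used.

```python
def classify_noninferiority(ci_lower_pp, ci_upper_pp, margin_pp=5.0):
    """Label a result from the CI of the accuracy difference (LLM minus human).

    All values are in percentage points. Assumed convention, not taken
    from the study's analysis plan:
      - "non-inferior": the lower CI bound lies above -margin
      - "inferior":     not non-inferior, and the whole CI lies below zero
      - "inconclusive": anything else
    """
    if ci_lower_pp > ci_upper_pp:
        raise ValueError("lower bound must not exceed upper bound")
    if ci_lower_pp > -margin_pp:
        return "non-inferior"
    if ci_upper_pp < 0:
        return "inferior"
    return "inconclusive"


# 90% CI bounds reported in the abstract for gpt-oss:120b vs humans
reported = {
    "metastasis status": (-1.3, -0.3),
    "response to treatment": (-11.6, -3.8),
}

for outcome, (lower, upper) in reported.items():
    print(f"{outcome}: {classify_noninferiority(lower, upper)}")
# prints:
# metastasis status: non-inferior
# response to treatment: inferior
```

Under this convention, the metastasis result clears the margin because −1.3 pp is above −5 pp, while the treatment-response interval extends to −11.6 pp and excludes zero, which matches the abstract's wording.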

Authors

  • Lea P. Passweg; Johannes M. Schwenke; Christof M. Schönenberger; Flavio Locher; Julia Picker; Manuel Dieterle; Benjamin Thiele; Dimitri Hasler; Alessia Danelli; Andreas M. Schmitt; Tobias Heye; Thomas Stojanov; Matthias Briel; Benjamin Kasenda