Performance of GPT-based large language models in hepatocellular carcinoma stratification: liver function assessment, BCLC staging, and treatment recommendations.
Journal:
Scientific Reports
Published Date:
Jun 12, 2026
Abstract
Large language models (LLMs) such as GPT have been proposed to support complex clinical decision-making. This study evaluated the performance of GPT-based LLMs in analyzing clinical, radiological, and laboratory data from patients with hepatocellular carcinoma (HCC) to assess liver function, assign a BCLC stage, and recommend treatment. Data from 106 HCC patients (82% male, median age 65 [22-86]) were compiled into anonymized integrated reports. Four GPT versions (4, o1, o3, 5.4) were prompted, using both short and long instructions, to calculate MELD, ALBI, and Child-Pugh scores, assign a BCLC stage, and generate treatment recommendations based on current guidelines. Outputs were compared with expert consensus and tumor board decisions. Errors were categorized by type and source. Time and cost analyses compared GPT with clinical staff. All GPT versions achieved high accuracy (> 85%) in liver function assessment, with MELD calculation being the most error-prone. BCLC staging accuracy ranged from 46.2% (version 4) to 84.0% (o3), and misclassification of radiological reports was the main source of error. Reasoning-optimized models (o1, o3) performed best for treatment recommendations, achieving an overall accuracy (correct suggestions plus acceptable alternatives) of up to 90.6%. In 9-14% of cases, GPT suggestions were retrospectively more guideline-concordant than the tumor board decisions. GPT processing was significantly faster and reduced costs by approximately 300- to 1300-fold compared with clinical staff. GPT-based LLMs show potential as decision-support tools for liver function assessment, BCLC staging, and treatment guidance in HCC. Particularly with reasoning-optimized models and detailed prompting, LLMs may serve as valuable adjuncts in multidisciplinary HCC workflows. However, the non-negligible error rate means that expert oversight and further model refinement remain necessary.
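For readers unfamiliar with the liver function scores named in the abstract, the following is a minimal Python sketch of two of them. It is not taken from the study: the abstract does not state which formula variants or rounding rules were used, so the sketch assumes the commonly published ALBI formula (bilirubin in µmol/L, albumin in g/L) and the original laboratory MELD formula with the usual bounds (inputs below 1.0 set to 1.0, creatinine capped at 4.0 mg/dL, score limited to 6-40). The function names are illustrative. Child-Pugh is omitted because it also depends on clinically graded ascites and encephalopathy.

```python
import math


def albi_score(bilirubin_umol_l: float, albumin_g_l: float) -> float:
    """ALBI score; assumes bilirubin in umol/L and albumin in g/L."""
    return 0.66 * math.log10(bilirubin_umol_l) - 0.085 * albumin_g_l


def albi_grade(score: float) -> int:
    """Map an ALBI score to grade 1-3 using the commonly cited cut-offs."""
    if score <= -2.60:
        return 1
    if score <= -1.39:
        return 2
    return 3


def meld_score(bilirubin_mg_dl: float, inr: float, creatinine_mg_dl: float) -> int:
    """Original laboratory MELD (assumed variant, not confirmed by the source)."""
    # Values below 1.0 are raised to 1.0 so the logarithms never go negative.
    bili = max(bilirubin_mg_dl, 1.0)
    inr_bounded = max(inr, 1.0)
    # Creatinine is additionally capped at 4.0 mg/dL.
    crea = min(max(creatinine_mg_dl, 1.0), 4.0)
    raw = (
        3.78 * math.log(bili)
        + 11.2 * math.log(inr_bounded)
        + 9.57 * math.log(crea)
        + 6.43
    )
    # Round to the nearest integer and clamp to the conventional 6-40 range.
    return int(min(max(round(raw), 6), 40))


if __name__ == "__main__":
    score = albi_score(bilirubin_umol_l=10.0, albumin_g_l=40.0)
    print(f"ALBI {score:.2f}, grade {albi_grade(score)}")
    print("MELD", meld_score(bilirubin_mg_dl=2.0, inr=1.5, creatinine_mg_dl=1.0))
```

The bounding and unit conventions in MELD are a plausible place for calculation slips, which would fit the abstract's observation that MELD was the most error-prone score, although the abstract itself does not attribute the errors to any specific step.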