Claim-Level Transparency Analysis of LLM-Generated Diagnostic Reports: A Metabolic and Endocrine Biomarker Study

Journal: bioRxiv
Published Date:

Abstract

Large language models are increasingly deployed in clinical decision-support contexts, yet systematic evaluation of their factual reliability in generating patient-specific diagnostic reports remains sparse, particularly for laboratory interpretation tasks. This study presents a controlled transparency experiment in which four frontier LLMs (Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.2, and Gemini 3.1 Pro) each generated diagnostic reports for 36 patients (29 female, 7 male; aged 27 to 64) with biomarker profiles spanning metabolic, endocrine, and nutritional markers. A transparency engine extracted up to 50 claims per report (3,035 total), searched for supporting scientific evidence, and classified each claim as supported by science, plausible, or unsupported. (The transparency engine is a proprietary mechanistic reasoning system developed by Diadia. It decomposes biomedical claims into directed graphs of physiological mechanisms, independently verifies each mechanistic step against the scientific literature, and derives support classifications through deterministic graph analysis. The engine underpins every root cause analysis in Diadia's clinical platform, providing full traceability from each diagnostic conclusion back to the specific evidence that supports or contradicts it.) Unsupported claims were uncommon: the transparency engine classified 2.7% of claims as unsupported (hereafter, the pipeline-measured hallucination rate; naive claim-level 95% Wilson CI: 2.2% to 3.4%), with GPT-5.2 at the lowest observed rate (1.7%) and Claude Opus 4.6 at the highest (3.6%). However, mechanistic verification revealed a much larger plausibility gap: 915 claims (30.2%) were biologically reasonable but lacked a fully verified evidence chain, bringing the share of claims not fully supported by direct evidence to 32.9%. Gemini 3.1 Pro produced the highest plausible proportion (39.6%), suggesting a more conservative but less fully grounded reasoning profile.
Although coarse support-level distributions were broadly similar across models (Cramér's V = 0.081), claim-level analysis revealed substantial narrative divergence: 61.2% of claims were unique to a single model, and matched-claim agreement was low (Cohen's kappa = 0.233), indicating that models generate substantively different clinical narratives for the same patient data despite comparable aggregate support profiles. These findings show that hallucination metrics alone understate the share of claims not fully verified under this protocol, and that claim-level mechanistic verification is needed to distinguish the proven from the merely plausible in metabolic and endocrine laboratory interpretation. Generalizability to other clinical domains requires further study.
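
As an illustration of how the claim-level Wilson interval quoted above can be computed, the following is a minimal sketch and not the authors' analysis code. The abstract reports only the rounded rate (2.7% of 3,035 claims), so the raw count of 83 unsupported claims used here is an assumption chosen to be consistent with that rate; the function name `wilson_interval` is likewise hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Total claims, as stated in the abstract.
N_CLAIMS = 3035
# ASSUMPTION: the abstract gives only "2.7%"; 83 is one count consistent with it.
N_UNSUPPORTED_ASSUMED = 83

low, high = wilson_interval(N_UNSUPPORTED_ASSUMED, N_CLAIMS)
print(f"{low:.1%} to {high:.1%}")
```

This calculation treats all claims as independent draws, which is presumably why the abstract labels the interval "naive": claims from the same report or patient are likely correlated, so an interval that accounted for that clustering would be expected to be wider.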

Authors

  • Yasinetsky, A.
  • Ikonomovska, E.
  • Geniesse, C.
  • Yasinetsky, A.