Diagnostic Accuracy and Clinical Reasoning of Multiple Large Language Models in Psychiatry
Journal: medRxiv
Published Date: Feb 9, 2026
Abstract
Importance: Large language models (LLMs) have demonstrated diagnostic potential in several medical specialties, but their application to psychiatry, where diagnosis relies heavily on clinical judgment, narrative interpretation, and reasoning under uncertainty, remains insufficiently evaluated. Objective: To evaluate the diagnostic accuracy and clinician-judged reasoning quality of multiple large language models using psychiatric case vignettes. Design: Mixed-methods evaluation study of diagnostic accuracy across four LLMs using 196 psychiatric case vignettes (135 published and 61 novel). Clinical reasoning quality was evaluated on a randomly selected subset of 30 vignettes using structured clinician ratings along two reasoning dimensions. The highest-performing model was illustratively compared with psychiatry trainees on the same subset. Diagnostic correctness for the full vignette set was assessed by a separate adjudicator LLM. Setting: Publicly available model interfaces, December 2025. Participants: Five board-certified psychiatrists evaluated model-generated clinical reasoning. Two psychiatry residents served as the illustrative human comparison. Main Outcomes and Measures: Diagnostic accuracy and clinician-rated clinical reasoning quality. Diagnostic accuracy was assessed using top-1 accuracy, top-5 accuracy, recall@5, and mean reciprocal rank, all computed from ranked lists of five differential diagnoses per vignette. Clinical reasoning quality was assessed using two 5-point Likert scales adapted from the Accreditation Council for Graduate Medical Education (ACGME) Psychiatry Residency Milestones, evaluating data extraction and diagnostic reasoning. Results: Across 196 psychiatric case vignettes, Claude Opus 4.5 (Anthropic) achieved the highest diagnostic accuracy (top-1 accuracy, 0.638; top-5 accuracy, 0.801; recall@5, 0.731; mean reciprocal rank, 0.710) and the highest clinician-rated reasoning scores.
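The four ranking metrics named above can be illustrated with a minimal sketch. The abstract does not give the exact definitions used in the study, so the following are assumptions: each vignette has a set of one or more reference diagnoses; top-k accuracy counts a vignette as correct when any reference diagnosis appears in the first k entries; recall@5 is the fraction of reference diagnoses recovered in the five-item list, averaged over vignettes; and reciprocal rank is 1 divided by the rank of the first correct entry (0 if none). Matching by normalized string equality is a simplification, since the study used an adjudicator LLM to judge correctness. The function name `ranking_metrics` and the toy labels are hypothetical.

```python
from typing import Dict, Sequence, Set


def _norm(label: str) -> str:
    """Normalize a diagnosis label for exact-match comparison (an assumption;
    the study judged correctness with an adjudicator LLM instead)."""
    return label.strip().lower()


def ranking_metrics(
    ranked: Sequence[Sequence[str]],
    gold: Sequence[Set[str]],
    k: int = 5,
) -> Dict[str, float]:
    """Compute top-1 accuracy, top-k accuracy, recall@k, and mean reciprocal rank.

    ranked[i] is the ranked differential for vignette i (best guess first);
    gold[i] is the non-empty set of reference diagnoses for vignette i.
    """
    if not ranked or len(ranked) != len(gold):
        raise ValueError("ranked and gold must be non-empty and the same length")

    top1 = topk = recall = mrr = 0.0
    for preds, truth in zip(ranked, gold):
        truth_n = {_norm(t) for t in truth}
        if not truth_n:
            raise ValueError("each vignette needs at least one reference diagnosis")
        preds_n = [_norm(p) for p in preds[:k]]

        # Zero-based positions of predictions that match a reference diagnosis.
        hits = [i for i, p in enumerate(preds_n) if p in truth_n]
        if hits:
            topk += 1.0
            mrr += 1.0 / (hits[0] + 1)  # reciprocal rank of the first correct entry
            if hits[0] == 0:
                top1 += 1.0
        # Fraction of reference diagnoses recovered within the top k.
        recall += len(set(preds_n) & truth_n) / len(truth_n)

    n = float(len(ranked))
    return {
        "top1": top1 / n,
        f"top{k}": topk / n,
        f"recall@{k}": recall / n,
        "mrr": mrr / n,
    }


# Made-up toy data for illustration only; not taken from the study.
toy_ranked = [
    ["MDD", "GAD", "PTSD", "Bipolar I", "ADHD"],
    ["GAD", "MDD", "OCD", "PTSD", "ADHD"],
    ["Schizophrenia", "Bipolar I", "MDD", "GAD", "OCD"],
]
toy_gold = [{"MDD"}, {"MDD", "PTSD"}, {"Anorexia nervosa"}]

toy_metrics = ranking_metrics(toy_ranked, toy_gold)
```

Under these assumed definitions, top-5 accuracy and recall@5 coincide when every vignette has exactly one reference diagnosis. They diverge when some vignettes have several, which is one possible reading of the gap between the reported 0.801 and 0.731.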
Higher clinician-rated diagnostic reasoning quality was strongly associated with diagnostic correctness in mixed-effects logistic regression analyses (β = 1.80; p < 0.001), corresponding to an approximately six-fold increase in odds of a correct diagnosis per 1-point increase in reasoning score. In an illustrative comparison, diagnostic accuracy of Claude Opus 4.5 fell within the range observed for psychiatry trainees. Conclusions and Relevance: LLMs demonstrated high diagnostic accuracy and generated clinical reasoning that clinicians judged to be largely coherent and safe. Diagnostic reasoning quality was more strongly associated with diagnostic correctness than data extraction quality, underscoring the importance of evaluating reasoning alongside accuracy when assessing LLMs for clinical decision support in psychiatry.
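The step from the reported coefficient to the "approximately six-fold" figure is the usual conversion of a logistic regression coefficient to an odds ratio by exponentiation. The sketch below shows that arithmetic. The helper names and the 0.50 baseline probability are hypothetical and chosen only for illustration; the abstract reports no baseline.

```python
import math


def odds_ratio(beta: float, delta: float = 1.0) -> float:
    """Multiplicative change in odds for a `delta`-unit increase in a predictor
    whose logistic regression coefficient is `beta`."""
    return math.exp(beta * delta)


def shifted_probability(p_baseline: float, beta: float, delta: float = 1.0) -> float:
    """Probability after multiplying the baseline odds by the odds ratio."""
    if not 0.0 < p_baseline < 1.0:
        raise ValueError("p_baseline must lie strictly between 0 and 1")
    new_odds = (p_baseline / (1.0 - p_baseline)) * odds_ratio(beta, delta)
    return new_odds / (1.0 + new_odds)


BETA_REASONING = 1.80  # coefficient reported in the abstract

or_per_point = odds_ratio(BETA_REASONING)  # exp(1.80), roughly 6.05

# Hypothetical baseline of 0.50, used only to show odds versus probability.
p_after_one_point = shifted_probability(0.50, BETA_REASONING)
```

A six-fold change in odds is not a six-fold change in probability: from the hypothetical 0.50 baseline, a 1-point increase moves the probability to roughly 0.86. Because the estimate comes from a mixed-effects model, the odds ratio is conditional on the random effects being held fixed.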