Pathology’s Last Exam: Stress-Testing Diagnostic Reasoning and Safety in Large Language Models

Journal: medRxiv

Published Date: Jan 1, 2025

Abstract

Large language models (LLMs) are evolving into diagnostic co-pilots, yet current benchmarks fail to test the integrated, stepwise reasoning required in diagnostic pathology. Here, we present Pathology’s Last Exam (PLE), a curated, highly detailed, text-based benchmark of 100 complex cases spanning organ systems, enriched for rare/challenging entities, plus 20 adversarial cases designed to stress-test model safety. Each case provides structured blocks (Primary, Clinical, Histopathology, IHC/Special Stains, Molecular Pathology) with stepwise information release mirroring real sign-out. We evaluated five LLMs (one proprietary, four open-source) at successive stages of information release. While the best model (GPT-5) achieved 70% accuracy on full evidence, performance on safety tests was alarming. Models frequently failed to detect biological contradictions, confidently diagnosing nonsensical “mix-up” cases rather than refusing them. This reveals a critical safety gap: high diagnostic capability is currently coupled with a dangerous inability to recognize impossible clinical scenarios. PLE provides a framework to measure and mitigate these risks before clinical deployment, as well as a foundation for developing multimodal evaluation protocols that can be extended to vision-language models and autonomous diagnostic agents in the future.
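
The stepwise information release described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The block names come from the abstract. The `Case` class, the `staged_prompts` and `score` helpers, the one-block-per-stage schedule, and the `REFUSE` convention for adversarial cases are all assumptions introduced here.

```python
from dataclasses import dataclass

# Block names as listed in the abstract. Releasing exactly one block
# per stage, in this order, is an assumption of this sketch.
BLOCK_ORDER = [
    "Primary",
    "Clinical",
    "Histopathology",
    "IHC/Special Stains",
    "Molecular Pathology",
]


@dataclass
class Case:
    """Hypothetical container for one benchmark case."""
    case_id: str
    blocks: dict               # block name -> free-text content
    adversarial: bool = False  # True for contradictory "mix-up" cases


def staged_prompts(case):
    """Yield (stage, prompt) pairs; each stage reveals one more block."""
    released = []
    for stage, name in enumerate(BLOCK_ORDER, start=1):
        if name in case.blocks:
            released.append(f"{name}: {case.blocks[name]}")
        yield stage, "\n".join(released)


def score(case, answer, reference):
    """Hypothetical scoring rule, not taken from the paper.

    Adversarial cases count as correct only if the model refuses;
    regular cases use a case-insensitive exact match on the diagnosis.
    """
    if case.adversarial:
        return answer.strip().upper() == "REFUSE"
    return answer.strip().lower() == reference.strip().lower()
```

In use, each prompt from `staged_prompts` would be sent to the model in turn and its answer scored per stage, so accuracy on partial evidence can be compared with accuracy on the full case. A real evaluation would likely need a more tolerant match than exact string comparison, for example expert grading.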

Authors

Nic G. Reitsam; Marco Gustav; Moritz Jesinghaus; Bruno Märkl; Sebastian Foersch; Jakob N. Kather
