Pathology’s Last Exam: Stress-Testing Diagnostic Reasoning and Safety in Large Language Models
Journal:
medRxiv
Published Date:
Jan 1, 2025
Abstract
Large language models (LLMs) are evolving into diagnostic co-pilots, yet current benchmarks fail to test the integrated, stepwise reasoning required in diagnostic pathology. Here, we present Pathology’s Last Exam (PLE), a curated, highly detailed, text-based benchmark of 100 complex cases spanning organ systems and enriched for rare or challenging entities, plus 20 adversarial cases designed to stress-test model safety. Each case provides structured blocks (Primary, Clinical, Histopathology, IHC/Special Stains, Molecular Pathology) with stepwise information release mirroring real sign-out. We evaluated five LLMs (one proprietary, four open-source) at successive stages of information release. While the best model (GPT-5) achieved 70% accuracy on full evidence, performance on the safety tests was alarming: models frequently failed to detect biological contradictions, confidently diagnosing nonsensical “mix-up” cases rather than refusing them. This reveals a critical safety gap: high diagnostic capability is currently coupled with a dangerous inability to recognize impossible clinical scenarios. PLE provides a framework to measure and mitigate these risks before clinical deployment. It also offers a foundation for developing multimodal evaluation protocols that can be extended to vision-language models and autonomous diagnostic agents in the future.
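The stepwise information release described above can be illustrated with a minimal sketch. This is not the authors' implementation: the block names are taken from the abstract, but the release order, the `Case` class, the `staged_prompts` function, and the plain-text prompt layout are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Block names come from the abstract; treating this as the release order
# is an assumption.
BLOCK_ORDER = [
    "Primary",
    "Clinical",
    "Histopathology",
    "IHC/Special Stains",
    "Molecular Pathology",
]


@dataclass
class Case:
    """Hypothetical container for one benchmark case."""
    case_id: str
    blocks: dict = field(default_factory=dict)  # block name -> free text
    adversarial: bool = False  # True for a safety stress-test case


def staged_prompts(case):
    """Return one cumulative prompt per stage.

    Stage k contains every available block up to and including the k-th,
    so the final prompt corresponds to the full-evidence setting.
    Blocks missing from the case are skipped.
    """
    revealed = []
    prompts = []
    for name in BLOCK_ORDER:
        if name not in case.blocks:
            continue
        revealed.append(f"{name}: {case.blocks[name]}")
        prompts.append("\n".join(revealed))
    return prompts


# Placeholder text only; no real case content is shown here.
example = Case(
    case_id="demo-001",
    blocks={
        "Primary": "(placeholder primary information)",
        "Clinical": "(placeholder clinical history)",
        "Histopathology": "(placeholder microscopic description)",
    },
)
prompts = staged_prompts(example)
```

Under these assumptions, each prompt in `prompts` would be sent to a model in turn and the answer at each stage scored separately, with the last prompt giving the full-evidence result. For an adversarial case, the expected correct behavior would be a refusal or a flagged contradiction rather than a diagnosis.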