Sequential Diagnosis with Language Models
Journal:
arXiv
Published Date:
Jun 27, 2025
Abstract
Artificial intelligence holds great promise for expanding access to expert
medical knowledge and reasoning. However, most evaluations of language models
rely on static vignettes and multiple-choice questions that fail to reflect the
complexity and nuance of evidence-based medicine in real-world settings. In
clinical practice, physicians iteratively formulate and revise diagnostic
hypotheses, adapting each subsequent question and test to what they've just
learned, and weigh the evolving evidence before committing to a final
diagnosis. To emulate this iterative process, we introduce the Sequential
Diagnosis Benchmark, which transforms 304 diagnostically challenging New
England Journal of Medicine clinicopathological conference (NEJM-CPC) cases
into stepwise diagnostic encounters. A physician or AI begins with a short case
abstract and must iteratively request additional details from a gatekeeper
model that reveals findings only when explicitly queried. Performance is
assessed not just by diagnostic accuracy but also by the cost of physician
visits and tests performed. We also present the MAI Diagnostic Orchestrator
(MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians,
proposes likely differential diagnoses and strategically selects high-value,
cost-effective tests. When paired with OpenAI's o3 model, MAI-DxO achieves 80%
diagnostic accuracy--four times higher than the 20% average of generalist
physicians. MAI-DxO also reduces diagnostic costs by 20% compared to
physicians, and 70% compared to off-the-shelf o3. When configured for maximum
accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO
generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and
Llama families. We highlight how AI systems, when guided to think iteratively
and act judiciously, can advance diagnostic precision and cost-effectiveness in
clinical care.