Second Opinion Matters: Towards Adaptive Clinical AI via the Consensus of Expert Model Ensemble
Journal:
arXiv
Published Date:
May 29, 2025
Abstract
Despite the growing clinical adoption of large language models (LLMs),
current approaches heavily rely on single model architectures. To overcome
risks of obsolescence and rigid dependence on single model systems, we present
a novel framework, termed the Consensus Mechanism. Mimicking clinical triage
and multidisciplinary clinical decision-making, the Consensus Mechanism
implements an ensemble of specialized medical expert agents enabling improved
clinical decision making while maintaining robust adaptability. This
architecture enables the Consensus Mechanism to be optimized for cost, latency,
or performance, purely based on its interior model configuration.
To rigorously evaluate the Consensus Mechanism, we employed three medical
evaluation benchmarks: MedMCQA, MedQA, and MedXpertQA Text, and the
differential diagnosis dataset, DDX+. On MedXpertQA, the Consensus Mechanism
achieved an accuracy of 61.0% compared to 53.5% and 45.9% for OpenAI's O3 and
Google's Gemini 2.5 Pro. Improvement was consistent across benchmarks with an
increase in accuracy on MedQA
($\Delta\mathrm{Accuracy}_{\mathrm{consensus\text{-}O3}} = 3.4\%$) and MedMCQA
($\Delta\mathrm{Accuracy}_{\mathrm{consensus\text{-}O3}} = 9.1\%$). These
accuracy gains extended to differential diagnosis generation, where our system
demonstrated improved recall and precision (F1$_\mathrm{consensus}$ = 0.326 vs.
F1$_{\mathrm{O3\text{-}high}}$ = 0.2886) and a higher top-1 accuracy for DDX
(Top1$_\mathrm{consensus}$ = 52.0% vs. Top1$_{\mathrm{O3\text{-}high}}$ =
45.2%).