The accuracy and repeatability of OpenEvidence on complex medical subspecialty scenarios: a pilot study

Journal: medRxiv
Published Date:

Abstract

OpenEvidence is a popular artificial intelligence (AI)-based medical search engine that generates evidence-based answers. Its quick search mode (OE) responds within seconds and cites a limited number of references. In mid-2025, the platform introduced “Deep Consult” (DC), which takes several minutes to respond and provides more comprehensive answers with additional references. OpenEvidence scored 100% on USMLE-type multiple-choice questions, but it has not been tested on more complex medical scenarios. We tested the OE and DC models using questions derived primarily from medical specialty board exams, specifically the MedXpertQA dataset. In a previously published study, eleven large language models (LLMs) were evaluated on this dataset, and all showed poor accuracy (14-46%). We evaluated OpenEvidence on a sample of 100 medical subspecialty scenarios from MedXpertQA, using two independent evaluators. The highest accuracy was 41% for DC and 34% for OE. Repeatability testing revealed an evaluator concordance rate of 77% for OE and 72% for DC.
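
The two statistics reported above, accuracy and evaluator concordance, can be illustrated with a minimal sketch. It assumes that accuracy is the fraction of responses matching the answer key and that concordance is simple percent agreement between the two evaluators' runs on the same questions. The abstract does not state the exact computation, and the function names and toy data below are hypothetical, not study data.

```python
from typing import Sequence


def accuracy(answers: Sequence[str], answer_key: Sequence[str]) -> float:
    """Fraction of answers that match the answer key."""
    if len(answers) != len(answer_key):
        raise ValueError("answers and answer_key must have the same length")
    matches = sum(a == k for a, k in zip(answers, answer_key))
    return matches / len(answer_key)


def concordance(run_a: Sequence[str], run_b: Sequence[str]) -> float:
    """Fraction of questions on which two runs returned the same answer.

    Assumption: concordance is plain percent agreement, regardless of
    whether the shared answer is correct.
    """
    if len(run_a) != len(run_b):
        raise ValueError("both runs must cover the same questions")
    agreements = sum(a == b for a, b in zip(run_a, run_b))
    return agreements / len(run_a)


# Hypothetical toy data: five multiple-choice questions, two evaluators.
answer_key = ["A", "C", "B", "D", "E"]
evaluator_1 = ["A", "C", "D", "D", "B"]
evaluator_2 = ["A", "C", "D", "D", "E"]

acc_1 = accuracy(evaluator_1, answer_key)
acc_2 = accuracy(evaluator_2, answer_key)
agreement = concordance(evaluator_1, evaluator_2)

print(f"Evaluator 1 accuracy: {acc_1:.0%}")
print(f"Evaluator 2 accuracy: {acc_2:.0%}")
print(f"Concordance: {agreement:.0%}")
```

Under this reading, two runs can agree on a question while both being wrong, so a concordance rate measures repeatability and not correctness.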

Authors

  • Jawahar Jagarapu; Kikelomo Babata; Surya Chamarthi; Robert Hoyt