An automated framework for assessing how well LLMs cite relevant medical references.

Journal: Nature Communications
PMID:

Abstract

As large language models (LLMs) are increasingly used to address health-related queries, it is crucial that they support their conclusions with credible references. While models can cite sources, the extent to which these sources support their claims remains unclear. To address this gap, we introduce SourceCheckup, an automated agent-based pipeline that evaluates the relevance and supportiveness of the sources cited in LLM responses. We evaluate seven popular LLMs on a dataset of 800 questions, representative of common medical queries, and 58,000 pairs of statements and sources. Our findings reveal that between 50% and 90% of LLM responses are not fully supported, and are sometimes contradicted, by the sources they cite. Even for GPT-4o with Web Search, approximately 30% of individual statements are unsupported, and nearly half of its responses are not fully supported. Independent assessments by doctors further validate these results. Our research underscores significant limitations in the ability of current LLMs to produce trustworthy medical references.

Authors

  • Kevin Wu
    Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
  • Eric Wu
    Department of Electrical Engineering, Stanford University, Stanford, CA, USA.
  • Kevin Wei
    Keck Medicine of USC, Los Angeles, CA, USA.
  • Angela Zhang
    Department of Genetics, Stanford University, Stanford, CA, USA.
  • Allison Casasola
    Department of Computer Science, Stanford University, Stanford, CA, USA.
  • Teresa Nguyen
    Department of Anesthesiology, Stanford University, Stanford, CA, USA.
  • Sith Riantawan
    Keck Medicine of USC, Los Angeles, CA, USA.
  • Patricia Shi
    Loma Linda University School of Medicine, Loma Linda, CA, USA.
  • Daniel Ho
    Stanford Law School, Stanford, CA, USA.
  • James Zou
Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.