Large language models encode clinical knowledge.

Journal: Nature
Published Date:

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion-parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

Authors

  • Karan Singhal
    Google Research, Mountain View, CA, USA. karansinghal@google.com.
  • Shekoofeh Azizi
  • Tao Tu
    Google Research, Mountain View, CA, USA.
  • S Sara Mahdavi
    Google Research, Mountain View, CA, USA.
  • Jason Wei
    Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire.
  • Hyung Won Chung
    Google Research, Mountain View, CA, USA.
  • Nathan Scales
    Google Research, Mountain View, CA, USA.
  • Ajay Tanwani
    Google Research, Mountain View, CA, USA.
  • Heather Cole-Lewis
    ICF International, Rockville, MD, United States.
  • Stephen Pfohl
  • Perry Payne
    Google Research, Mountain View, CA, USA.
  • Martin Seneviratne
    Stanford Center for Biomedical Informatics Research, Stanford, California 94305, USA.
  • Paul Gamble
    Google Health, Google, Palo Alto, CA, USA.
  • Chris Kelly
    Google Research, Mountain View, CA, USA.
  • Abubakr Babiker
    Google Research, Mountain View, CA, USA.
  • Nathanael Schärli
    Google Research, Mountain View, CA, USA.
  • Aakanksha Chowdhery
    Google Research, Mountain View, CA, USA.
  • Philip Mansfield
    Google Research, Mountain View, CA, USA.
  • Dina Demner-Fushman
    Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD.
  • Blaise Agüera Y Arcas
    Google Research, Mountain View, CA, USA.
  • Dale Webster
    Google Health, Palo Alto, CA, USA.
  • Greg S Corrado
    Google Health, Palo Alto, CA, USA.
  • Yossi Matias
    Google Research, Google LLC, 1600 Amphitheatre Parkway, Mountain View, CA, USA.
  • Katherine Chou
    Google Research, San Jose, CA, USA.
  • Juraj Gottweis
    Google Research, Mountain View, CA, USA.
  • Nenad Tomasev
    DeepMind, London, EC4A 3TW, UK.
  • Yun Liu
    Google Health, Palo Alto, CA, USA.
  • Alvin Rajkomar
    Google LLC, Mountain View, CA, USA.
  • Joelle Barral
    Google Research, Mountain View, CA, USA.
  • Christopher Semturs
    Google Health, Google LLC, Mountain View, California.
  • Alan Karthikesalingam
    Department of Outcomes Research, St George's Vascular Institute, London, SW17 0QT, United Kingdom.
  • Vivek Natarajan
    Google, Mountain View, CA, USA.