Large language models encode clinical knowledge.

Journal: Nature

Published Date: Jul 12, 2023

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

Authors

Karan Singhal

Google Research, Mountain View, CA, USA. karansinghal@google.com.
Shekoofeh Azizi
Tao Tu

Google Research, Mountain View, CA, USA.
S Sara Mahdavi

Google Research, Mountain View, CA, USA.
Jason Wei

Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire.
Hyung Won Chung

Google Research, Mountain View, CA, USA.
Nathan Scales

Google Research, Mountain View, CA, USA.
Ajay Tanwani

Google Research, Mountain View, CA, USA.
Heather Cole-Lewis

ICF International, Rockville, MD, United States.
Stephen Pfohl
Perry Payne

Google Research, Mountain View, CA, USA.
Martin Seneviratne

Stanford Center for Biomedical Informatics Research, Stanford, California 94305, USA.
Paul Gamble

Google Health, Google, Palo Alto, CA, USA.
Chris Kelly

Google Research, Mountain View, CA, USA.
Abubakr Babiker

Google Research, Mountain View, CA, USA.
Nathanael Schärli

Google Research, Mountain View, CA, USA.
Aakanksha Chowdhery

Google Research, Mountain View, CA, USA.
Philip Mansfield

Google Research, Mountain View, CA, USA.
Dina Demner-Fushman

Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD.
Blaise Agüera Y Arcas

Google Research, Mountain View, CA, USA.
Dale Webster

Google Health, Palo Alto, CA, USA.
Greg S Corrado

Google Health, Palo Alto, CA, USA.
Yossi Matias

Google Research, Google LLC, 1600 Amphitheatre Parkway, Mountain View, CA, USA.
Katherine Chou

Google Research, San Jose, CA, USA.
Juraj Gottweis

Google Research, Mountain View, CA, USA.
Nenad Tomasev

DeepMind, London, EC4A 3TW, UK.
Yun Liu

Google Health, Palo Alto, CA, USA.
Alvin Rajkomar

Google LLC, Mountain View, CA, USA.
Joelle Barral

Google Research, Mountain View, CA, USA.
Christopher Semturs

Google Health, Google LLC, Mountain View, California.
Alan Karthikesalingam

Department of Outcomes Research, St George's Vascular Institute, London, SW17 0QT, United Kingdom.
Vivek Natarajan

Google, Mountain View, CA, USA.

Keywords

Benchmarking; Bias; Clinical Competence; Comprehension; Computer Simulation; Datasets as Topic; Knowledge; Licensure; Medicine; Natural Language Processing; Patient Safety; Physicians

External Resources

PubMed ID: 37438534
