Human-AI collectives most accurately diagnose clinical vignettes.

Journal: Proceedings of the National Academy of Sciences of the United States of America

Published Date: Jun 17, 2025

Abstract

AI systems, particularly large language models (LLMs), are increasingly being employed in high-stakes decisions that impact both individuals and society at large, often without adequate safeguards to ensure safety, quality, and equity. Yet LLMs hallucinate, lack common sense, and are biased; these shortcomings may reflect LLMs' inherent limitations and thus may not be remedied by more sophisticated architectures, more data, or more human feedback. Relying solely on LLMs for complex, high-stakes decisions is therefore problematic. Here, we present a hybrid collective intelligence system that mitigates these risks by leveraging the complementary strengths of human experience and the vast information processed by LLMs. We apply our method to open-ended medical diagnostics, combining 40,762 differential diagnoses made by physicians with the diagnoses of five state-of-the-art LLMs across 2,133 text-based medical case vignettes. We show that hybrid collectives of physicians and LLMs outperform both single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result holds across a range of medical specialties and professional experience and can be attributed to humans' and LLMs' complementary contributions that lead to different kinds of errors. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains like medical diagnostics.
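The abstract does not state how the physician and LLM differential diagnoses are combined. As a minimal sketch only, assuming a simple reciprocal-rank vote (an illustrative stand-in, not necessarily the authors' method), the following shows how several ranked differentials could be pooled into one collective ranking. The function name `aggregate_differentials`, the `top_k` cutoff, and the example diagnoses are all hypothetical.

```python
from collections import defaultdict


def aggregate_differentials(differentials, top_k=5):
    """Pool ranked diagnosis lists into one collective ranking.

    Assumption for illustration: each diagnosis earns 1/rank points from
    every list it appears in (rank 1 = most likely), and points are summed
    across all contributors, human or LLM.
    """
    scores = defaultdict(float)
    for ranked in differentials:
        seen = set()
        for rank, diagnosis in enumerate(ranked[:top_k], start=1):
            # Crude normalization; real free-text diagnoses would need
            # mapping to a shared vocabulary before they can be compared.
            key = diagnosis.strip().lower()
            if key in seen:
                continue  # count each diagnosis once per contributor
            seen.add(key)
            scores[key] += 1.0 / rank
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical example: two physicians and one LLM on the same vignette.
physicians = [
    ["pneumonia", "bronchitis"],
    ["pulmonary embolism", "pneumonia"],
]
llms = [["Pneumonia", "pulmonary embolism"]]

collective = aggregate_differentials(physicians + llms)
print(collective)
```

In this toy case the diagnosis that appears in all three lists rises to the top even though one physician ranked it second, which illustrates how contributors who make different kinds of errors can offset one another.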

Authors

Nikolas Zöller
Center for Adaptive Rationality, Max Planck Institute for Human Development, Berlin 14195, Germany.

Julian Berger
Center for Adaptive Rationality, Max Planck Institute for Human Development, Berlin 14195, Germany.

Irving Lin
The Human Diagnosis Project, San Francisco, CA 94110.

Nathan Fu
The Human Diagnosis Project, San Francisco, CA 94110.

Jayanth Komarneni
The Human Diagnosis Project, San Francisco, CA 94110.

Gioele Barabucci
Department of Digital Humanities, University of Cologne, Cologne 50931, Germany.

Kyle Laskowski
The Human Diagnosis Project, San Francisco, CA 94110.

Victor Shia
Harvey Mudd College, Claremont, CA 91711.

Benjamin Harack
Department of Politics and International Relations, Oxford University, Oxford OX1 3UQ, United Kingdom.

Eugene A Chu
Kaiser Permanente, Downey, CA 90242.

Vito Trianni
Institute of Cognitive Sciences and Technologies (ISTC), National Research Council (CNR), Rome, Italy.

Ralf H J M Kurvers
Center for Adaptive Rationality, Max Planck Institute for Human Development, Berlin 14195, Germany.

Stefan M Herzog
Center for Adaptive Rationality, Max Planck Institute for Human Development, Berlin 14195, Germany.

Keywords

Artificial Intelligence; Diagnosis, Differential; Humans; Physicians

External Resources

PubMed ID: 40512795
