Human-AI collectives produce the most accurate differential diagnoses
Journal:
arXiv
Published Date:
Jun 21, 2024
Abstract
Artificial intelligence systems, particularly large language models (LLMs),
are increasingly being employed in high-stakes decisions that impact both
individuals and society at large, often without adequate safeguards to ensure
safety, quality, and equity. Yet LLMs hallucinate, lack common sense, and are
biased - shortcomings that may reflect LLMs' inherent limitations and thus may
not be remedied by more sophisticated architectures, more data, or more human
feedback. Relying solely on LLMs for complex, high-stakes decisions is
therefore problematic. Here we present a hybrid collective intelligence system
that mitigates these risks by leveraging the complementary strengths of human
experience and the vast information processed by LLMs. We apply our method to
open-ended medical diagnostics, combining 40,762 differential diagnoses made by
physicians with the diagnoses of five state-of-the art LLMs across 2,133
medical cases. We show that hybrid collectives of physicians and LLMs
outperform both single physicians and physician collectives, as well as single
LLMs and LLM ensembles. This result holds across a range of medical specialties
and professional experience, and can be attributed to humans' and LLMs'
complementary contributions that lead to different kinds of errors. Our
approach highlights the potential for collective human and machine intelligence
to improve accuracy in complex, open-ended domains like medical diagnostics.