Human-AI Collaboration in Clinical Reasoning: A UK Replication and Interaction Analysis

Journal: medRxiv
Published Date:

Abstract

Goh et al. found that a large language model (LLM) working alone outperformed American clinicians assisted by the same LLM in diagnostic reasoning tests [1]. We aimed to replicate this result in a UK setting and to explore how interactions with the LLM might explain the observed gap in performance. This was a within-subjects study of UK physicians. Twenty-two participants answered structured questions on four clinical vignettes. For two of the cases, physicians had access to an LLM via a custom-built web application. Results were analysed using a mixed-effects model accounting for case difficulty and for baseline variability between clinicians. Qualitative analysis involved coding participant-LLM interaction logs and evaluating the rate of LLM use per question. Physicians with LLM assistance scored significantly lower than the LLM alone (mean difference 21.3 percentage points, p < 0.001). Access to the LLM was associated with improved physician performance compared with conventional resources (73.7% vs 66.3%, p = 0.001). There was significant heterogeneity in the degree of LLM-assisted improvement (SD 10.4%). Qualitative analysis revealed that only 30% of case questions were posed directly to the LLM, which suggests that under-utilisation of the LLM contributed to the observed performance gap. While access to an LLM can improve diagnostic accuracy, realising the full potential of human-AI collaboration may require training clinicians to integrate these tools into their cognitive workflows and designing systems that make this integration the default rather than an optional extra.
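
The within-subjects logic described above (each physician serves as their own control, with case difficulty adjusted for) can be illustrated with a minimal sketch. This is not the authors' analysis: it swaps the mixed-effects model for a simpler paired comparison on case-centred scores, and every number, the alternating case allocation, and the names `simulate` and `paired_llm_effect` are hypothetical, chosen only to generate toy data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the toy data are reproducible

N_PHYSICIANS = 22
CASES = ["A", "B", "C", "D"]
# Hypothetical simulation parameters; none are taken from the study data.
CASE_DIFFICULTY = {"A": 0.0, "B": -5.0, "C": 4.0, "D": -2.0}
MEAN_LLM_GAIN = 7.0


def simulate():
    """Return toy rows of (physician, case, llm_access, score)."""
    rows = []
    for p in range(N_PHYSICIANS):
        baseline = random.gauss(66.0, 8.0)        # physician's baseline ability
        gain = random.gauss(MEAN_LLM_GAIN, 3.0)   # physician-specific LLM benefit
        # Assumed counterbalancing: even-numbered physicians get the LLM on
        # cases A and B, odd-numbered physicians on cases C and D.
        llm_cases = {"A", "B"} if p % 2 == 0 else {"C", "D"}
        for case in CASES:
            llm = case in llm_cases
            score = baseline + CASE_DIFFICULTY[case] + random.gauss(0.0, 4.0)
            if llm:
                score += gain
            rows.append((p, case, llm, score))
    return rows


def paired_llm_effect(rows):
    """Estimate the LLM effect from within-physician differences."""
    # Centre each score on its case mean to remove case difficulty.
    case_mean = {
        case: statistics.mean(s for _, c, _, s in rows if c == case)
        for case in CASES
    }
    diffs = []
    for p in range(N_PHYSICIANS):
        with_llm = [s - case_mean[c] for q, c, llm, s in rows if q == p and llm]
        without = [s - case_mean[c] for q, c, llm, s in rows if q == p and not llm]
        # Differencing within a physician cancels that physician's baseline.
        diffs.append(statistics.mean(with_llm) - statistics.mean(without))
    return statistics.mean(diffs), statistics.stdev(diffs), diffs


mean_gain, sd_gain, per_physician = paired_llm_effect(simulate())
print(f"mean LLM-associated gain: {mean_gain:.1f} points")
print(f"SD of per-physician gains: {sd_gain:.1f} points")
```

In this simplified version the SD of per-physician gains mixes genuine between-physician heterogeneity with residual noise. A mixed-effects model with a physician-level random effect, as reported in the abstract, is the usual way to separate the two.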

Authors

  • J Healy; J Kossoff; M Lee; C Hasford