Large Language Model Hallucinations in Spine Surgery: A Comparative Analysis of Clinician vs Patient-Level Prompts.

Journal: Neurosurgery Practice
Published Date:

Abstract

BACKGROUND AND OBJECTIVES: Large language models (LLMs) are increasingly used by healthcare professionals and patients to synthesize medical information. Their tendency to generate fabricated information ("hallucinations") poses a significant risk. This study quantifies and compares the citation accuracy of 5 prominent LLMs on common neurosurgical spine topics.

METHODS: Five LLMs (ChatGPT, Gemini, Claude, Microsoft Copilot, and OpenEvidence) were evaluated. Ten clinician-level questions derived from North American Spine Society guidelines were posed using a prompt instructing the LLM to act as an "experienced spine surgeon." Ten corresponding patient-level questions were posed using a "patient" persona prompt. Both prompts requested a concise answer and 3 to 5 peer-reviewed references. Each citation was manually verified and scored as 2 (accurate), 1 (real but misrepresented), or 0 (fabricated).

RESULTS: Performance varied significantly across models and question types. OpenEvidence, which directly indexes PubMed and requires user verification as a healthcare professional, achieved perfect accuracy. Among general-purpose LLMs, Claude demonstrated the highest accuracy (average score 1.67, 78.0% accurate citations). Gemini had the lowest accuracy (average score 0.97) and the highest rate of fabrications (26.2%). All general-purpose LLMs performed worse on patient-level queries. ChatGPT's accuracy score dropped from 1.83 to 1.08 for patient questions, while its fabrication rate increased from 2.1% to 20.0%. Error profiles differed by model. Copilot frequently misrepresented article metadata (55.1% score-1 citations) and cited non-peer-reviewed websites (31.6%), particularly for patient questions.

CONCLUSION: General-purpose LLMs exhibit substantial and variable citation errors when queried on spine surgery topics. Accuracy is substantially lower for patient-facing prompts, which often yield fabricated references or nonscientific sources. While specialized, access-restricted platforms such as OpenEvidence provide high accuracy, the performance of widely accessible models highlights the need for careful verification of all LLM-generated medical information to ensure patient safety.
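The summary statistics reported above (an average score and the share of citations at each rubric level) follow directly from the 0/1/2 scoring scheme. The following is a minimal sketch of that arithmetic, not the authors' analysis code. The function name `summarize` and the example scores are hypothetical and are not data from the study.

```python
from collections import Counter

# Rubric levels as described in the abstract.
ACCURATE, MISREPRESENTED, FABRICATED = 2, 1, 0
VALID_SCORES = {ACCURATE, MISREPRESENTED, FABRICATED}


def summarize(scores):
    """Return the mean score and the percentage of citations at each level."""
    if not scores:
        raise ValueError("no citations were scored")
    if any(s not in VALID_SCORES for s in scores):
        raise ValueError("scores must be 0, 1, or 2")
    counts = Counter(scores)
    n = len(scores)
    return {
        "mean_score": sum(scores) / n,
        "pct_accurate": 100 * counts[ACCURATE] / n,
        "pct_misrepresented": 100 * counts[MISREPRESENTED] / n,
        "pct_fabricated": 100 * counts[FABRICATED] / n,
    }


# Hypothetical scores for ten citations, for illustration only.
example_scores = [2, 2, 1, 0, 2, 1, 2, 2, 0, 2]
summary = summarize(example_scores)
print(summary)
```

Note that the mean score and the fabrication rate carry different information: a model can have a moderate mean because many citations are real but misrepresented (score 1) while still fabricating few, which is why the abstract reports both.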
