Self-Logical Consistency Assessment of Large Language Models for Patient Feedback Classification: Algorithm Development and Validation Study
Journal: medRxiv
Published Date: Jan 1, 2025
Abstract
Patient satisfaction feedback is crucial for hospital service quality, but manual review is impractical because it is time-consuming, and traditional natural language processing methods remain inadequate. Large Language Models (LLMs) show promise but are prone to logical hallucinations (fabricated or illogical outputs) that limit their reliability (consistency of performance across repeated uses) and validity (explainability in clinical contexts) in healthcare. This study aimed to evaluate the Self-Logical Consistency Assessment (SLCA), an original method designed to enhance the reliability of LLM feedback classification by enforcing a logically structured chain of thought. SLCA uses two validation steps: self-consistency (identifying the most coherent response) and logical consistency (ensuring alignment with the original statement and expert classifications). We evaluated SLCA using GPT-4 and Llama-3.1 405B on 12,600 classifications from 100 patient feedback samples to assess logical hallucinations, and tested its performance on a 49,140-classification benchmark derived from 1,170 feedback samples. SLCA reduced logical hallucinations among detected categories from 15.80% (168/1063) to 0.51% (4/786) with GPT-4 and from 7.17% (51/711) to 1.67% (10/599) with Llama-3.1, with residual errors confined to the emergency feedback category. On the benchmark, SLCA achieved a precision of 0.86 and a recall of 0.78 with GPT-4, and a precision of 0.84 and a recall of 0.58 with Llama-3.1. These results demonstrate SLCA’s ability to achieve human-level performance across LLMs. SLCA offers a zero-shot, scalable, explainable solution for improving LLM classification reliability in healthcare. Its capacity to enhance performance without fine-tuning positions it as a valuable tool for analyzing patient feedback and supporting hospital service quality improvement.
Labeling free-text patient feedback is crucial for healthcare system improvement, but classification by hand is impractical because it is time-consuming. Large language models (LLMs) can outperform traditional NLP for this task, but their clinical use is limited by inconsistent predictions and “logical hallucinations” that undermine explainability and trust. The Self-Logical Consistency Assessment (SLCA) framework, which couples self-consistency with a novel logical-consistency check, almost eradicates hallucinations (from 15.8% to 0.5% with GPT-4) while reaching human-level precision (86%) and better exhaustiveness (recall +14%). SLCA offers a scalable, explainable and data-sovereign pathway for hospitals and regulators to adopt LLMs in routine patient-experience monitoring, and it provides a transferable template for evaluating AI safety in other clinical-text tasks.
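The two validation steps named in the abstract can be illustrated with a minimal sketch. The abstract does not give the prompts, the number of repeated samples, or the exact form of the logical-consistency check, so everything below is an assumption made for illustration: the `Sample` structure, the majority vote used for self-consistency, the quoted-evidence test used as a stand-in for logical consistency, and the default of three samples are all hypothetical and should not be read as the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Sample:
    """One sampled model response for a (feedback, category) pair (hypothetical)."""
    label: bool    # does the model say the category applies?
    evidence: str  # text the model quotes from the feedback to justify its answer


def self_consistent(samples: List[Sample]) -> Optional[Sample]:
    """Assumed self-consistency step: majority vote over repeated samples.

    Returns a sample carrying the majority label, or None if the list is
    empty or the vote is tied.
    """
    if not samples:
        return None
    ranked = Counter(s.label for s in samples).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no coherent majority
    winner = ranked[0][0]
    return next(s for s in samples if s.label == winner)


def logically_consistent(feedback: str, sample: Sample) -> bool:
    """Stand-in logical check: a positive label must quote text found in the feedback."""
    if not sample.label:
        return True  # nothing to justify for a negative label
    evidence = sample.evidence.strip().lower()
    return bool(evidence) and evidence in feedback.lower()


def classify(
    feedback: str,
    category: str,
    sampler: Callable[[str, str], Sample],
    n: int = 3,  # arbitrary illustrative default
) -> bool:
    """Accept a category only if it survives both validation steps."""
    samples = [sampler(feedback, category) for _ in range(n)]
    chosen = self_consistent(samples)
    if chosen is None:
        return False
    return chosen.label and logically_consistent(feedback, chosen)
```

In practice `sampler` would wrap a call to an LLM; in this sketch it can be any function that returns a `Sample`, which keeps the two checks testable without a model. Rejecting ties and unsupported positive labels is a conservative design choice that favors precision over recall.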