Multi-LLM Disagreement as a Scalable Detector of Human Annotation Errors in Structured Data from Clinical Free-Text

Journal: medRxiv

Published Date: May 6, 2026

Abstract

Objective: Structured extraction from clinical free-text depends on human annotators, whose labels are susceptible to errors and knowledge-driven mistakes; exhaustive quality control is impractical at scale. We evaluate whether disagreement among multiple locally hosted large language models (LLMs) can prioritize human annotations for targeted review.

Methods: Multiple LLMs independently extract the same set of structured variables annotated by a human reviewer. For each annotation, an agreement score counts the LLMs whose output matches the human label. Using four locally hosted LLMs (Gemma 3 27B, DeepSeek-R1 70B, GPT-OSS 120B, Mistral Large 3), we evaluated this approach on 910 German-language colonoscopy reports describing endoscopic mucosal resection, with five structured variables per case (anatomical location, two diameters, resection technique, multiple polyps), yielding 4,550 annotations and a 377-case adjudication sample. A stratified sample oversampling low-agreement strata was adjudicated blinded by an experienced reviewer and analyzed with prevalence-adjusted estimates.

Results: Human error rates rose as LLM agreement fell, from 0% at scores 3 and 4 to 76% at score 0. The lowest-agreement stratum comprised only 6.5% of annotations yet contained an estimated 80% of errors. The multi-LLM disagreement score achieved a prevalence-adjusted AUC-ROC of 0.991 (95% CI 0.987 to 0.994) and AUC-PR of 0.893 (95% CI 0.851 to 0.929) for error detection.

Discussion: Multi-LLM disagreement outperformed single models and provided graded operating points for risk-stratified review.

Conclusion: Multi-LLM disagreement provides a scalable quality control signal for targeted review of the highest-yield cases. Because all models run locally, the framework is GDPR compliant; its language- and task-agnostic design supports application across clinical domains.
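
The agreement score described in the Methods can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `normalize` rule (trim whitespace, ignore case), the record layout, and the inverse-sampling-fraction weighting used for the error-share estimate are all assumptions made for illustration, and the toy labels are not study data.

```python
def normalize(label):
    """Hypothetical normalizer: trim whitespace and ignore case."""
    return str(label).strip().casefold()


def agreement_score(human_label, llm_labels):
    """Count the LLM labels that match the human label (0 to len(llm_labels))."""
    target = normalize(human_label)
    return sum(1 for label in llm_labels if normalize(label) == target)


def review_queue(annotations):
    """Return annotations with a score attached, lowest agreement first."""
    scored = [
        dict(a, score=agreement_score(a["human"], a["llms"]))
        for a in annotations
    ]
    return sorted(scored, key=lambda a: a["score"])


def estimated_error_share(strata):
    """Estimate each stratum's share of all errors from a stratified sample.

    strata maps score -> (n_total, n_sampled, n_errors_in_sample).
    Sampled errors are scaled by n_total / n_sampled, an assumed weighting
    that may differ from the prevalence adjustment used in the study.
    """
    estimated = {
        score: n_total * n_errors / n_sampled
        for score, (n_total, n_sampled, n_errors) in strata.items()
        if n_sampled > 0
    }
    total = sum(estimated.values())
    return {s: (e / total if total else 0.0) for s, e in estimated.items()}


# Toy annotations (illustrative values only, four hypothetical LLM outputs each).
annotations = [
    {"id": "a", "human": "Rectum", "llms": ["rectum", "Rectum", "rectum ", "RECTUM"]},
    {"id": "b", "human": "EMR", "llms": ["ESD", "ESD", "ESD", "ESD"]},
    {"id": "c", "human": "20", "llms": ["20", "25", "20", "20"]},
]

for item in review_queue(annotations):
    print(item["id"], item["score"])
```

Annotations at the front of the queue (score 0) are the ones a reviewer would check first; the threshold at which review stops is a design choice that trades reviewer workload against the share of errors caught.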

Authors

Wittlinger, S.; Meerjansen, J.; Wolf, F.; Wiest, I. C.; Ebert, M. P.; Siegel, F.; Belle, S.

