How Agent Role Structure Alters Operating Characteristics of Large Language Model Clinical Classifiers: A Comparative Study of Specialist and Deliberative Multi-Agent Protocols
Journal: medRxiv
Published Date: Feb 24, 2026
Abstract
Large language models (LLMs) are increasingly evaluated for structured clinical decision support tasks, often using multi-agent architectures. Prior work has compared single-agent and multi-agent inference, but the effect of internal role structure within multi-agent systems on classification behavior remains underexplored. We evaluate two multi-agent prompting protocols, a Generic Deliberative (GD) protocol and a Feature-Specialist (FS) protocol, each implemented as a deterministic directed acyclic graph (DAG) system, on tabular clinical heart disease data from the UCI Cleveland dataset. Structured variables are rendered into primarily text-based feature descriptions while preserving clinically relevant numeric values. The two protocols differ only in their prompt-level role decomposition and information routing; the base model, model weights, deterministic decoding (temperature set to 0), computational budget, and aggregation logic are held constant. The results show systematic differences in predictive behavior attributable solely to prompt-level role structure. Relative to the GD protocol, the FS protocol improves overall accuracy by 0.07 and macro-F1 by 0.06. However, this improvement is accompanied by a marked operating-point shift: specificity increases by 0.22 while sensitivity decreases by 0.13, with a corresponding redistribution of class-wise precision. The increase in specificity corresponds to a reduction in false positive classifications, indicating decreased over-diagnosis under the FS configuration, while the decrease in sensitivity means that more true cases are missed. These findings indicate that multi-agent role decomposition introduces a structured inductive bias in deterministic LLM-based classification. Prompt protocol and agent role design should therefore be regarded as core modeling decisions, because they measurably influence performance tradeoffs, particularly in safety-sensitive deployment contexts.
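To make the reported quantities concrete, the following minimal sketch shows how accuracy, sensitivity, specificity, and macro-F1 can be derived from a binary confusion matrix, and how an operating-point shift between two protocols can be expressed as metric deltas. The `Confusion` class, the function names, and the counts for `protocol_a` and `protocol_b` are all hypothetical and chosen only for illustration; they are not the study's data and do not reproduce its results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion-matrix counts (positive class = disease present)."""
    tp: int
    fp: int
    tn: int
    fn: int


def _safe_div(num: float, den: float) -> float:
    # Return 0.0 instead of raising when a denominator is empty.
    return num / den if den else 0.0


def metrics(c: Confusion) -> dict:
    """Compute the metrics named in the abstract from raw counts."""
    sensitivity = _safe_div(c.tp, c.tp + c.fn)   # recall of the positive class
    specificity = _safe_div(c.tn, c.tn + c.fp)   # recall of the negative class
    accuracy = _safe_div(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)
    precision_pos = _safe_div(c.tp, c.tp + c.fp)
    precision_neg = _safe_div(c.tn, c.tn + c.fn)
    f1_pos = _safe_div(2 * precision_pos * sensitivity, precision_pos + sensitivity)
    f1_neg = _safe_div(2 * precision_neg * specificity, precision_neg + specificity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision_pos": precision_pos,
        "precision_neg": precision_neg,
        "macro_f1": (f1_pos + f1_neg) / 2,  # unweighted mean of per-class F1
    }


def delta(baseline: dict, candidate: dict) -> dict:
    """Per-metric difference, candidate minus baseline."""
    return {name: candidate[name] - baseline[name] for name in baseline}


# Hypothetical counts for illustration only (NOT taken from the study).
protocol_a = Confusion(tp=40, fp=20, tn=30, fn=10)
protocol_b = Confusion(tp=35, fp=8, tn=42, fn=15)

shift = delta(metrics(protocol_a), metrics(protocol_b))
```

In this toy example `shift["specificity"]` is positive while `shift["sensitivity"]` is negative, the same qualitative pattern the abstract describes: fewer false positives at the cost of more missed cases, even though overall accuracy rises. Reporting the per-metric deltas alongside accuracy keeps that tradeoff visible.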