SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence
Journal: arXiv
Published Date: Mar 13, 2025
Abstract
Integration of Vision-Language Models (VLMs) in surgical intelligence is
hindered by hallucinations, domain knowledge gaps, and limited understanding of
task interdependencies within surgical scenes, undermining clinical
reliability. While recent VLMs demonstrate strong general reasoning and
thinking capabilities, they still lack the domain expertise and task-awareness
required for precise surgical scene interpretation. Although Chain-of-Thought
(CoT) can structure reasoning more effectively, current approaches rely on
self-generated CoT steps, which often exacerbate inherent domain gaps and
hallucinations. To overcome this, we present SurgRAW, a CoT-driven multi-agent
framework that delivers transparent, interpretable insights for most tasks in
robotic-assisted surgery. By employing specialized CoT prompts across five
tasks (instrument recognition, action recognition, action prediction, patient
data extraction, and outcome assessment), SurgRAW mitigates hallucinations
through structured, domain-aware reasoning. Retrieval-Augmented Generation
(RAG) is also integrated with external medical knowledge to bridge domain gaps
and improve response reliability. Most importantly, a hierarchical agentic
system ensures that CoT-embedded VLM agents collaborate effectively while
understanding task interdependencies, while a panel discussion mechanism
promotes logical consistency. To evaluate our method, we introduce
SurgCoTBench, the first reasoning-based dataset with structured frame-level
annotations. With comprehensive experiments, we demonstrate the effectiveness
of the proposed SurgRAW, with a 29.32% accuracy improvement over baseline VLMs on 12
robotic procedures, achieving state-of-the-art performance and advancing
explainable, trustworthy, and autonomous surgical assistance.
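
The hierarchical routing idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: only the five task names come from the abstract. The keyword router, the prompt wording, and the `vlm` and `retrieve` callables are hypothetical stand-ins for a coordinator agent, a task-specific CoT prompt, a vision-language model, and a RAG retriever. The panel discussion mechanism is not modeled.

```python
from typing import Callable, Dict, List

# Task names are taken from the abstract; the keywords are illustrative only.
TASK_KEYWORDS: Dict[str, List[str]] = {
    "instrument recognition": ["instrument", "tool"],
    "action recognition": ["doing", "performing"],
    "action prediction": ["next", "upcoming"],
    "patient data extraction": ["patient", "record"],
    "outcome assessment": ["outcome", "complication"],
}


def route(question: str) -> str:
    """Hypothetical coordinator: pick a task agent by keyword match."""
    q = question.lower()
    for task, keywords in TASK_KEYWORDS.items():
        if any(k in q for k in keywords):
            return task
    raise ValueError("no task agent matches this question")


def build_prompt(task: str, question: str, context: List[str]) -> str:
    """Assemble a task-specific CoT prompt; the wording is a placeholder."""
    lines = [f"Task: {task}", "Reason step by step before answering."]
    if context:
        # Retrieved passages (the RAG step) are appended as reference knowledge.
        lines.append("Reference knowledge:")
        lines.extend(f"- {c}" for c in context)
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def answer(
    question: str,
    vlm: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
) -> Dict[str, str]:
    """Route the question, build the prompt, and query the (stubbed) model."""
    task = route(question)
    prompt = build_prompt(task, question, retrieve(question))
    return {"task": task, "prompt": prompt, "answer": vlm(prompt)}


if __name__ == "__main__":
    # Stub callables stand in for a real VLM and retriever.
    result = answer(
        "Which tool is visible in this frame?",
        vlm=lambda prompt: "stub answer",
        retrieve=lambda question: [],
    )
    print(result["task"])
```

In a real system the keyword match would presumably be replaced by a model-based coordinator, and `vlm` would receive the surgical frame along with the prompt; both are left out here to keep the sketch self-contained.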