Vision-Language Models vs Autonomous AI Agents for Anterior Capsular Radial Folds: A Diagnostic Study
Journal:
medRxiv
Published Date:
Jan 16, 2026
Abstract
Importance: Vision-language models (VLMs) enable generalist multimodal reasoning, but their ability to resolve brief, low-contrast cues in surgical video without task-specific training is uncertain. Autonomous artificial intelligence (AI) agents offer an alternative paradigm by generating supervised classifiers tailored to specific visual tasks.
Objective: To benchmark the performance of VLMs against supervised classifiers engineered by autonomous AI agents for detecting anterior capsular radial folds during continuous curvilinear capsulorhexis (CCC), and to compare both approaches with human graders.
Design, Setting, and Participants: This retrospective diagnostic study used a multicenter dataset of 537 CCC videos collected from Beijing Tongren Hospital (China), National University Hospital (Singapore), and the OphNet-APTOS public dataset.
Exposure: Presence or absence of anterior capsular radial folds during CCC, defined according to established expert consensus, was annotated at both clip and frame levels by senior glaucoma surgeons. Two analytic paradigms were evaluated: (1) direct zero-shot and few-shot inference using 11 generalist and medical-specific VLMs on single frames and frame sequences; and (2) autonomous code generation by 4 AI agents to construct supervised image classifiers from labeled frames. Human comparison included 7 graders with varying levels of ophthalmic experience.
Main Outcomes and Measures: Discrimination of fold-positive versus fold-negative cases was assessed using macro-averaged precision, recall, and F1-score at the clip and frame levels. Secondary outcomes included comparisons with human graders.
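As an illustration of the primary metrics, the following is a minimal sketch of macro-averaged precision, recall, and F1-score for a binary fold-positive (1) versus fold-negative (0) task. The function name `macro_prf`, the label encoding, and the toy labels are assumptions made for this example; they are not taken from the study's code or data.

```python
from typing import Sequence, Tuple


def macro_prf(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    labels: Tuple[int, ...] = (0, 1),
) -> Tuple[float, float, float]:
    """Return macro-averaged (precision, recall, F1) over the given labels.

    Each class is scored one-vs-rest, then the per-class scores are
    averaged with equal weight, regardless of class frequency.
    """
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Undefined ratios (empty denominators) are treated as 0.0 here.
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical toy labels (1 = fold-positive, 0 = fold-negative).
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
precision, recall, f1 = macro_prf(y_true, y_pred)
print(f"macro precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Because macro-averaging weights both classes equally, a model that labels every case as the majority (fold-negative) class scores poorly on the fold-positive class and is penalized accordingly, which matters when the classes are imbalanced.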
Results: Among 537 video clips (7.32 ± 3.35 seconds), 156 (29.1%) were fold-positive. VLM performance was limited; the top-performing model, Qwen2.5-VL, achieved a mean F1-score of 0.519. Few-shot prompting improved GPT-4.1 performance (mean F1-score from 0.177 to 0.480) but remained unstable. In contrast, an agent-engineered classifier achieved an F1-score of 0.869 and an area under the receiver operating characteristic curve of 0.958. In comparison with human graders, the top agent-generated model (F1-score, 0.835) matched junior specialists (mean F1-score, 0.829), whereas fine-tuned VLMs (maximum F1-score, 0.606) underperformed all human graders.
Conclusions and Relevance: Generalist VLMs showed limited capacity to detect subtle intraoperative cues, whereas autonomous AI agents successfully constructed task-specific classifiers with near-clinical-level performance. These findings support agent-driven supervised modeling as a more effective strategy for fine-grained surgical video interpretation.
Key Points
Question: How do generalist vision-language models (VLMs) and autonomous AI agents compare with human graders in detecting brief, low-contrast intraoperative cues in surgical video?
Findings: In this retrospective, multicenter diagnostic benchmarking study of 537 phacoemulsification video clips, VLMs showed limited discrimination of anterior capsular radial folds, even with few-shot prompting, whereas autonomous AI agents successfully generated supervised classifiers with substantially higher performance, approaching that of junior glaucoma specialists.
Meaning: For fine-grained intraoperative video interpretation, task-specific classifiers engineered by autonomous agents currently demonstrate greater clinical relevance than generalist VLMs.