Evaluating Artificial Intelligence-Assisted Current Procedural Terminology Coding in Vascular Surgery: A Comparison of ChatGPT Plus and Perplexity Pro Against Finance Department.
Journal:
Annals of Vascular Surgery
Published Date:
Nov 13, 2025
Abstract
BACKGROUND: This study evaluates how accurately ChatGPT Plus and Perplexity Pro assign Current Procedural Terminology (CPT) codes to vascular surgery cases at Tufts Medical Center, using the finance department's CPT coding as the reference standard.

METHODS: A total of 120 vascular surgery cases from April 2024 were analyzed. Each case was documented in two formats: a full operative note with a detailed procedural description and a brief operative summary. Both artificial intelligence (AI) models were tested on each format, and their CPT code outputs were compared with the finance department's codes. Performance was assessed at two levels: CPT-level accuracy (matching of individual codes) and case-level accuracy (correctness of the coding for an entire case). At the case level, AI-generated codes were categorized as exact match, partial match, or no match. Cohen's kappa was then used to measure inter-rater agreement.

RESULTS: Accuracy varied between the two models. ChatGPT tended to over-report CPT codes, while Perplexity Pro under-reported with full operative notes but over-reported with brief summaries. Using full operative notes, ChatGPT matched 45.2% of CPT codes and Perplexity Pro matched 43.7%. At the case level, ChatGPT exactly matched 29.2% of cases, partially matched 49.2%, and had no match in 21.6%. Perplexity Pro had a higher exact-match rate (39.2%) but also more cases with no match (27.5%). With brief operative summaries, CPT-level accuracy improved for both models. ChatGPT's CPT-level match rate increased to 64.75% (a 43% relative improvement), and its case-level exact-match rate rose to 51.6% (a 77% relative increase). Perplexity Pro's CPT-level accuracy improved to 60.2% (a 38% relative increase), but its case-level exact-match rate dropped slightly to 37.5% (a 4% relative decrease). Overall, brief operative summaries improved agreement for both AI models, with ChatGPT achieving the highest agreement (κ = 0.45 at the CPT level, κ = 0.341 at the case level). However, agreement remained only fair to moderate, indicating that human oversight remains essential in AI-assisted CPT coding when the AI models have not been trained for the task.

CONCLUSION: Both AI models generally performed better with brief operative summaries than with full operative notes, highlighting the importance of structured documentation for AI-assisted coding. ChatGPT showed significant improvements at both the CPT level and the case level, whereas Perplexity Pro improved at the CPT level but declined slightly in case-level accuracy. These findings suggest that AI benefits from concise, structured input, though human oversight remains essential, as neither model consistently achieved high accuracy. The models were used without prior training on these specific tasks, which presents a significant opportunity to improve accuracy through the use of training data.
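The metrics named in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions, since the abstract does not give the study's exact definitions: a case is treated here as a "partial match" when the AI and reference code sets overlap without being identical, the reported percentage changes are taken to be relative changes, and kappa is computed in its standard unweighted two-rater form. The function names (classify_case, relative_change, cohen_kappa) and the placeholder code labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def classify_case(ai_codes, reference_codes):
    """Label one case as 'exact', 'partial', or 'none'.

    Assumption: 'partial' means the two code sets share at least one
    code but are not identical. The abstract does not define it.
    """
    ai, ref = set(ai_codes), set(reference_codes)
    if ai == ref:
        return "exact"
    if ai & ref:
        return "partial"
    return "none"


def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Placeholder code labels, not real CPT codes.
print(classify_case(["CODE_A", "CODE_B"], ["CODE_B", "CODE_A"]))  # exact
print(classify_case(["CODE_A", "CODE_C"], ["CODE_A", "CODE_B"]))  # partial
print(classify_case(["CODE_C"], ["CODE_A"]))                      # none

# Rates reported in the abstract (full notes vs. brief summaries).
print(round(relative_change(45.2, 64.75)))  # 43
print(round(relative_change(29.2, 51.6)))   # 77
print(round(relative_change(43.7, 60.2)))   # 38
print(round(relative_change(39.2, 37.5)))   # -4
```

Under the relative-change reading, the rounded values reproduce the percentages quoted in the abstract (43%, 77%, 38%, and a 4% decrease), which suggests those figures are relative rather than percentage-point differences.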