Retrieval-Augmented Claude Opus 4.7 and GPT-5.5 Surpass Human Performance on the Nuclear Cardiology Board Preparation Exam (and Claude Drafts a Paper About It)

Journal: medRxiv
Published Date:

Abstract

Background - Previous studies evaluated large language model (LLM) performance on the American Society of Nuclear Cardiology (ASNC) Board Preparation Exam. Without domain-specific context, the best model (GPT-4o) achieved 63.1%, below both the estimated 65% passing threshold and the 78% mean score of human fellows-in-training (FITs). Providing textbook context raised GPT-4o's accuracy to 73.8% on text-only questions, which still fell short of human trainees. Whether next-generation LLMs with retrieval-augmented generation (RAG) can close this gap is unknown.

Methods - Claude Opus 4.7 and GPT-5.5 were each administered all 168 questions (141 text-only, 27 image-based) from the 2023 ASNC Board Preparation Exam across 5 iterations, using RAG over a nuclear cardiology textbook, its companion atlas, and ASNC clinical guidelines. Claude Opus 4.7 used local FAISS-based semantic retrieval; GPT-5.5 used an Azure cloud-hosted vector store. Performance was compared with prior LLM results and with 13 human FITs.

Results - Across 5 iterations, Claude Opus 4.7 achieved a mean accuracy of 86.3% ± 1.4% (text 88.8%, image 73.3%). GPT-5.5 achieved 86.7% ± 2.2% (text 88.5%, image 77.0%) but refused a mean of 12.2 questions (7.3%) per iteration because of safety filters. Both models surpassed the human FIT mean (78.0%) and the estimated passing threshold. Relative to GPT-4o without context (63.1%), both models scored roughly 23 percentage points higher, a gain realized in 18 months.

Conclusion - Next-generation LLMs with RAG now surpass average human trainee performance on nuclear cardiology board preparation questions, suggesting significant potential as educational tools and knowledge-reference aids in cardiovascular imaging.
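The headline figures in the abstract can be cross-checked with simple arithmetic. The sketch below is not part of the study's analysis; it assumes the overall accuracy is the question-weighted mean of the text-only and image-based accuracies (141 and 27 questions respectively), and the function and variable names are illustrative only. How refused questions were scored is not stated in the abstract, so that is left out of the calculation.

```python
# Consistency check on the figures reported in the abstract.
# Assumption: overall accuracy = question-weighted mean of the two subsets.
N_TEXT, N_IMAGE = 141, 27          # question counts from the abstract
N_TOTAL = N_TEXT + N_IMAGE         # 168 questions in total
GPT4O_NO_CONTEXT = 63.1            # prior best result without context (%)


def overall_accuracy(text_pct: float, image_pct: float) -> float:
    """Weight each subset accuracy (in percent) by its number of questions."""
    return (text_pct * N_TEXT + image_pct * N_IMAGE) / N_TOTAL


claude_overall = overall_accuracy(88.8, 73.3)
gpt_overall = overall_accuracy(88.5, 77.0)

# Mean refusals per iteration, expressed as a share of all questions.
refusal_pct = 100 * 12.2 / N_TOTAL

# Gain over GPT-4o without context, in percentage points.
claude_gain = claude_overall - GPT4O_NO_CONTEXT
gpt_gain = gpt_overall - GPT4O_NO_CONTEXT

print(f"Claude Opus 4.7 overall: {claude_overall:.1f}%")
print(f"GPT-5.5 overall:         {gpt_overall:.1f}%")
print(f"GPT-5.5 refusal share:   {refusal_pct:.1f}%")
print(f"Gain over GPT-4o:        {claude_gain:.1f} and {gpt_gain:.1f} points")
```

Under that weighting assumption, the subset accuracies reproduce the reported overall means to one decimal place, and the gain over GPT-4o without context comes out between 23 and 24 percentage points for both models.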

Authors

  • Killekar, A.
  • Shanbhag, A.
  • Miller, R. J.
  • Dey, D.
  • Bourque, J.
  • Phillips, L.
  • Chareonthaitawee, P.
  • Slomka, P.