Anatomical Accuracy of Generative AI for Congenital Heart Disease Illustrations: Gemini NanoBanana Versus ChatGPT Models in a Blinded Comparative Study
Journal:
medRxiv
Published Date:
Feb 23, 2026
Abstract
Background Generative artificial intelligence (AI) systems are increasingly used to produce medical illustrations for education; however, their anatomical accuracy in complex domains such as congenital heart disease (CHD) remains insufficiently validated.
Methods In an assessor-blinded comparative study, we evaluated AI-generated CHD illustrations from two contemporary text-to-image platforms (ChatGPT-5/ChatGPT-Images and Gemini NanoBanana) against human-modified educational images. Twenty different CHD types were included, yielding 147 images that were assessed by 20 physicians (10 CHD experts and 10 non-specialists). Images were rated across four domains: anatomical accuracy, label usefulness, visual attractiveness, and suitability for medical education (total score range, 4-12).
Results Among 2,940 total image evaluations, the human-modified images demonstrated the highest anatomical accuracy (48.3% rated accurate), followed by NanoBanana (22.7%), while ChatGPT-generated images were predominantly rated as fabricated or incorrect (86.3% for ChatGPT-5 and 85.2% for ChatGPT-Images; p<0.001). Educational usability "as is" was highest for the human-modified images (37.9%) compared with NanoBanana (13.1%) and the ChatGPT platforms (≤2.1%; p<0.001). Median overall quality scores were 8 for both the human-modified images and NanoBanana, versus 4 for both ChatGPT systems (p<0.001). In multivariable analysis, NanoBanana images were the closest to the human-modified images in quality (95% CI, 0.91-0.98), while ChatGPT-Images (95% CI, 0.58-0.63) and ChatGPT-5 (95% CI, 0.55-0.59) showed marked quality reductions.
Conclusions Current generative AI systems produced visually compelling but frequently anatomically inaccurate CHD illustrations, falling substantially short of current educational standards. Model choice strongly influences performance, with Gemini NanoBanana outperforming ChatGPT-based systems yet remaining inferior to expert-designed human-modified images. AI-generated cardiac imagery should be used only within expert-reviewed educational workflows rather than as independent instructional resources.
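The counts quoted in the Methods and Results can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' analysis code. It assumes every physician rated every image, and it assumes each of the four domains is scored from 1 to 3; the abstract does not state the per-domain scale, which is inferred here from the reported total range of 4-12. All function and constant names are hypothetical.

```python
# Figures quoted in the abstract.
N_IMAGES = 147
N_RATERS = 20
N_DOMAINS = 4

# Assumption (not stated in the abstract): each domain is scored 1-3,
# which is the simplest scale consistent with a total range of 4-12.
DOMAIN_MIN, DOMAIN_MAX = 1, 3


def total_evaluations(n_images: int, n_raters: int) -> int:
    """Number of image evaluations if every rater scores every image."""
    return n_images * n_raters


def total_score_range(n_domains: int, lo: int, hi: int) -> tuple[int, int]:
    """Minimum and maximum possible summed score across all domains."""
    return n_domains * lo, n_domains * hi


def total_score(domain_ratings: list[int]) -> int:
    """Sum one rater's per-domain ratings after validating them."""
    if len(domain_ratings) != N_DOMAINS:
        raise ValueError(f"expected {N_DOMAINS} domain ratings")
    if any(r < DOMAIN_MIN or r > DOMAIN_MAX for r in domain_ratings):
        raise ValueError(f"ratings must lie in {DOMAIN_MIN}-{DOMAIN_MAX}")
    return sum(domain_ratings)


if __name__ == "__main__":
    print(total_evaluations(N_IMAGES, N_RATERS))                   # 2940
    print(total_score_range(N_DOMAINS, DOMAIN_MIN, DOMAIN_MAX))    # (4, 12)
    print(total_score([2, 2, 2, 2]))                               # 8
```

Under these assumptions, 147 images times 20 raters reproduces the 2,940 evaluations reported, and a median total of 8 corresponds to a mid-scale rating in every domain, while a median of 4 corresponds to the lowest rating in every domain.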