Enhancing AI's ability to interpret histological images with prompt engineering: An evaluation of GPT-4o performance.
Journal:
Anatomical Sciences Education
Published Date:
Apr 1, 2026
Abstract
This study examines how different prompting strategies affect ChatGPT's ability to interpret histological images across tissue types and levels of question complexity. GPT-4o's performance was assessed under three prompting techniques, P1 (zero-shot), P2 (few-shot with examples), and P3 (chain-of-thought with reasoning explanations), on 120 histological images of four tissue types (epithelial, connective, muscular, and neural). Three standardized questions assessed tissue recognition, structural identification, and functional assessment. GPT-4o's performance varied significantly across tissue types (p < 0.001, η² = 0.042). Question complexity also significantly affected performance (p < 0.001, η² = 0.022), revealing a hierarchical pattern in which structural identification proved most challenging across all conditions. Error dependency analysis showed that 98% of functional assessment errors co-occurred with structural identification errors, indicating strong cascading effects. Inter-rater reliability remained consistently high across all conditions (ICC = 0.95-0.96). Across prompting approaches, answer accuracy ranged from 58.9% (P1) to 63.6% (P3), and the effect of prompt type was small and not statistically significant (p = 0.116, η² = 0.004). These findings show that GPT-4o's performance in histological image interpretation varies significantly with tissue type and question complexity. Tissue-specific approaches and a focus on structural identification accuracy are essential for effective educational integration, whereas prompt engineering alone yields limited performance gains.
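The error dependency figure reported above (the share of functional assessment errors that co-occur with structural identification errors) can be read as a conditional proportion. The following is a minimal sketch of that calculation, assuming each image has a paired correct/incorrect flag for the two questions; the function name `cooccurring_error_share` and the toy flags are hypothetical illustrations and are not the study's code or data.

```python
from typing import Optional, Sequence


def cooccurring_error_share(
    structural_correct: Sequence[bool],
    functional_correct: Sequence[bool],
) -> Optional[float]:
    """Share of functional-assessment errors that also have a structural error.

    Both sequences hold one flag per image (True = answered correctly),
    in the same image order. Returns None when there are no functional errors.
    """
    if len(structural_correct) != len(functional_correct):
        raise ValueError("inputs must be paired, one flag per image")

    # Keep the structural flag for every image whose functional answer was wrong.
    structural_given_functional_error = [
        s for s, f in zip(structural_correct, functional_correct) if not f
    ]
    if not structural_given_functional_error:
        return None

    # Count how many of those images also had a structural error.
    both_wrong = sum(1 for s in structural_given_functional_error if not s)
    return both_wrong / len(structural_given_functional_error)


# Toy flags for five images (illustrative only, not data from the study).
structural = [True, False, False, True, False]
functional = [True, False, False, False, True]
share = cooccurring_error_share(structural, functional)
print(share)
```

A value close to 1 would indicate that functional errors rarely occur without a preceding structural error, which is the cascading pattern the abstract describes.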