Affective-ROPTester: Capability and Bias Analysis of LLMs in Predicting Retinopathy of Prematurity
Journal:
arXiv
Published Date:
Jul 8, 2025
Abstract
Despite the remarkable progress of large language models (LLMs) across
various domains, their capacity to predict retinopathy of prematurity (ROP)
risk remains largely unexplored. To address this gap, we introduce a novel
Chinese benchmark dataset, termed CROP, comprising 993 admission records
annotated with low, medium, and high-risk labels. To systematically examine the
predictive capabilities and affective biases of LLMs in ROP risk
stratification, we propose Affective-ROPTester, an automated evaluation
framework incorporating three prompting strategies: Instruction-based,
Chain-of-Thought (CoT), and In-Context Learning (ICL). The Instruction scheme
assesses LLMs' intrinsic knowledge and associated biases, whereas the CoT and
ICL schemes leverage external medical knowledge to enhance predictive accuracy.
Crucially, we integrate emotional elements at the prompt level to investigate
how different affective framings influence the model's ability to predict ROP
and its bias patterns. Empirical results derived from the CROP dataset yield
two principal observations. First, LLMs demonstrate limited efficacy in ROP
risk prediction when operating solely on intrinsic knowledge, yet exhibit
marked performance gains when augmented with structured external inputs.
Second, affective biases are evident in the model outputs, with a consistent
inclination toward overestimating medium- and high-risk cases. Third, compared
to negative emotions, positive emotional framing contributes to mitigating
predictive bias in model outputs. These findings highlight the critical role of
affect-sensitive prompt engineering in enhancing diagnostic reliability and
emphasize the utility of Affective-ROPTester as a framework for evaluating and
mitigating affective bias in clinical language modeling systems.