Bias Evaluation and Mitigation in Retrieval-Augmented Medical Question-Answering Systems
Journal: arXiv
Published Date: Mar 19, 2025
Abstract
Medical question-answering (QA) systems based on Retrieval-Augmented Generation
(RAG) are promising for clinical decision support because they can integrate
external knowledge, thus reducing inaccuracies inherent in standalone large
language models (LLMs). However, these systems may unintentionally propagate or amplify
biases associated with sensitive demographic attributes like race, gender, and
socioeconomic factors. This study systematically evaluates demographic biases
within medical RAG pipelines across multiple QA benchmarks, including MedQA,
MedMCQA, MMLU, and EquityMedQA. We quantify disparities in retrieval
consistency and answer correctness by generating and analyzing queries
sensitive to demographic variations. We further implement and compare several
bias mitigation strategies to address identified biases, including Chain of
Thought reasoning, Counterfactual filtering, Adversarial prompt refinement, and
Majority Vote aggregation. Experimental results reveal significant demographic
disparities and show that Majority Vote aggregation notably improves
accuracy and fairness metrics. Our findings underscore the critical need for
explicitly fairness-aware retrieval methods and prompt engineering strategies
to develop truly equitable medical QA systems.
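
To make the evaluation idea concrete, the following is a minimal sketch of how retrieval consistency, a per-group accuracy gap, and Majority Vote aggregation might be computed. The abstract does not specify the exact metrics, so everything here is an assumption for illustration: the function names are hypothetical, retrieval consistency is taken to be the mean pairwise Jaccard overlap of retrieved document IDs across demographic variants of the same question, and the vote is taken over the answers produced for those variants.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two collections of document IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty retrievals are treated as identical
    return len(a & b) / len(a | b)


def retrieval_consistency(retrieved_by_variant):
    """Mean pairwise Jaccard overlap across demographic variants.

    Assumed metric: 1.0 means every variant retrieved the same documents.
    """
    pairs = list(combinations(retrieved_by_variant.values(), 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def majority_vote(answers):
    """Most frequent answer; ties go to the earliest answer in the list."""
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer


def accuracy_gap(correct_by_group):
    """Difference between the best and worst per-group accuracy."""
    accuracies = [sum(v) / len(v) for v in correct_by_group.values()]
    return max(accuracies) - min(accuracies)


# Hypothetical toy data: document IDs retrieved for three variants of one query.
retrieved = {
    "neutral": ["d1", "d2", "d3"],
    "female": ["d1", "d2", "d4"],
    "male": ["d1", "d2", "d3"],
}
consistency = retrieval_consistency(retrieved)

# Hypothetical answers produced for four variants of the same question.
final_answer = majority_vote(["B", "A", "B", "C"])

# Hypothetical per-group correctness flags (1 = correct, 0 = incorrect).
gap = accuracy_gap({"group_1": [1, 1, 0, 1], "group_2": [1, 0, 0, 1]})
```

Under these assumptions, a low consistency score or a large accuracy gap would flag a question whose retrieval or answer shifts with the demographic wording, and the vote collapses the variant answers into a single prediction.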