Understanding Dataset Bias in Medical Imaging: A Case Study on Chest X-rays
Journal: arXiv
Published Date: Jul 10, 2025
Abstract
Recent work has revisited the infamous "Name That Dataset" task and established
that non-medical datasets carry an underlying bias, achieving high accuracy on
the dataset-origin task. In this work, we revisit the same task
applied to popular open-source chest X-ray datasets. Medical images are
naturally more difficult to release as open source due to their sensitive
nature, which has led to certain open-source datasets being extremely popular
for research purposes. By performing the same task, we wish to explore whether
dataset bias also exists in these datasets. We apply simple
transformations to the datasets to try to identify bias. Given the importance
of AI applications in medical imaging, it's vital to establish whether modern
methods are taking shortcuts or are focused on the relevant pathology. We
implement a range of different network architectures on the datasets: NIH,
CheXpert, MIMIC-CXR and PadChest. We hope this work will encourage more
explainable research in medical imaging and the creation of
more open-source datasets in the medical domain. The corresponding code will be
released upon acceptance.
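
To make the dataset-origin task concrete, the following is a minimal sketch of how its labels and baseline could be set up. It is not the authors' code, which has not yet been released. Only the four dataset names come from the abstract; the helper names (build_origin_task, accuracy, chance_level), the balanced sampling, and the placeholder image identifiers are assumptions made for illustration.

```python
import random
from collections import Counter

# Dataset names are taken from the abstract; everything else here is hypothetical.
DATASETS = ["NIH", "CheXpert", "MIMIC-CXR", "PadChest"]


def build_origin_task(images_by_dataset, per_dataset, seed=0):
    """Return shuffled (image, label) pairs, where label indexes the source dataset.

    Assumption: the same number of images is drawn from every dataset, so that
    chance accuracy equals 1 / len(DATASETS).
    """
    rng = random.Random(seed)
    samples = []
    for label, name in enumerate(DATASETS):
        pool = images_by_dataset[name]
        for image in rng.sample(pool, per_dataset):
            samples.append((image, label))
    rng.shuffle(samples)
    return samples


def accuracy(predictions, labels):
    """Fraction of predictions that match the true dataset label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def chance_level(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Toy usage with placeholder image identifiers instead of real X-rays.
toy_images = {name: [f"{name}_{i}" for i in range(10)] for name in DATASETS}
samples = build_origin_task(toy_images, per_dataset=5)
labels = [label for _, label in samples]
print(len(samples), chance_level(labels))
```

A classifier that scores well above the chance level on held-out images can tell the datasets apart, which is the sign of dataset bias the abstract describes. Whether that signal comes from pathology or from acquisition shortcuts is a separate question that this sketch does not address.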