Semantically redundant training data removal and deep model classification performance: A study with chest X-rays.

Journal: Computerized medical imaging and graphics : the official journal of the Computerized Medical Imaging Society

Published Date: Apr 9, 2024

Abstract

Deep learning (DL) has demonstrated the capacity to learn hierarchical features independently from complex, multi-dimensional data. It is commonly understood that DL performance scales with the amount of training data; however, the data must also be varied for learning to improve. In medical imaging, semantic redundancy, the presence of similar or repetitive information, can arise when multiple images show highly similar presentations of the disease of interest. In addition, the augmentation methods commonly used to generate variety in DL training could limit performance when applied indiscriminately to such data. We hypothesize that semantic redundancy therefore tends to lower performance and limit generalizability to unseen data, and we examine its impact on classifier performance even when the training set is large. We propose an entropy-based sample scoring approach to identify and remove semantically redundant training data. Using the publicly available NIH chest X-ray dataset, we demonstrate that a model trained on the resulting informative subset significantly outperforms a model trained on the full training set in both internal testing (recall: 0.7164 vs 0.6597, p<0.05) and external testing (recall: 0.3185 vs 0.2589, p<0.05). Our findings emphasize the importance of information-oriented training sample selection over the conventional practice of using all available training data.
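
The abstract does not state how the entropy score is computed or how the cutoff is chosen. As a rough illustration only, the sketch below assumes one plausible reading: each training sample is scored by the Shannon entropy of a model's predicted class probabilities, and the highest-entropy fraction is kept as the informative subset. The names `shannon_entropy`, `select_informative`, and `keep_fraction`, as well as the sample probabilities, are hypothetical and not taken from the paper.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_informative(samples, keep_fraction=0.5):
    """Rank samples by predictive entropy and keep the top fraction.

    samples: list of (sample_id, class_probabilities) pairs.
    keep_fraction: hypothetical cutoff; the paper's criterion may differ.
    """
    scored = [(shannon_entropy(probs), sid) for sid, probs in samples]
    # Highest entropy first: these are treated as the most informative.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = max(1, round(len(scored) * keep_fraction))
    return [sid for _, sid in scored[:n_keep]]


# Made-up predicted probabilities for four training images (assumption).
predictions = [
    ("img_a", [0.98, 0.02]),
    ("img_b", [0.55, 0.45]),
    ("img_c", [0.99, 0.01]),
    ("img_d", [0.70, 0.30]),
]

kept = select_informative(predictions, keep_fraction=0.5)
print(kept)
```

Under this assumption, confidently predicted samples score low and are dropped first, while samples the model is uncertain about are retained. Other scoring choices, such as entropy over image features rather than predictions, would fit the abstract's description equally well.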

Authors

Sivaramakrishnan Rajaraman

Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
Ghada Zamzmi

Computer Science and Engineering Department, University of South Florida, Tampa, FL, USA.
Feng Yang
Zhaohui Liang

Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
Zhiyun Xue

Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
Sameer Antani

Computational Health Research Branch, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Keywords

Deep Learning; Humans; Radiography, Thoracic; Semantics

External Resources

PubMed ID: 38608333
