The Social Construction of Categorical Data: Mixed Methods Approach to Assessing Data Features in Publicly Available Datasets.

Journal: JMIR Medical Informatics
PMID:

Abstract

BACKGROUND: In data-scarce areas such as health care, computer scientists aim to leverage as much available information as possible to increase the accuracy of their machine learning models' outputs. As standard practice, categorical data, such as patients' gender, socioeconomic status, or skin color, are used to train models in fusion with other data types, such as medical images and text-based medical information. However, the effects of including categorical data features for model training in such data-scarce areas are underexamined, particularly regarding models intended to serve individuals equitably in a diverse population.

Authors

  • Theresa Willem
    Institute of History and Ethics in Medicine, Department of Preclinical Medicine, TUM School of Medicine and Health, Technical University of Munich, Ismaninger Straße 22, 81675, Munich, Germany. theresa.willem@tum.de.
  • Alessandro Wollek
    Munich Institute of Biomedical Engineering, Technical University of Munich, Garching near Munich, Germany.
  • Theodor Cheslerean-Boghiu
    Munich Institute of Biomedical Engineering, School of Computation, Information, and Technology, Technical University of Munich, Munich, Germany.
  • Martha Kenney
    Women & Gender Studies, San Francisco State University, San Francisco, CA, United States.
  • Alena Buyx
    Institute for History and Ethics of Medicine, Technical University of Munich School of Medicine, Technical University of Munich, Munich, Germany.