Tailoring task arithmetic to address bias in models trained on multi-institutional datasets.

Journal: Journal of biomedical informatics
Published Date:

Abstract

OBJECTIVE: Multi-institutional datasets are widely used for machine learning from clinical data, to increase dataset size and improve generalization. However, deep learning models in particular may learn to recognize the source of a data element, leading to biased predictions. For example, deep learning models for image recognition trained on chest radiographs with COVID-19 positive and negative examples drawn from different data sources can respond to indicators of provenance (e.g., radiological annotations outside the lung area per institution-specific practices) rather than pathology, generalizing poorly beyond their training data. Bias of this sort, called confounding by provenance, is of concern in natural language processing (NLP) because provenance indicators (e.g., institution-specific section headers, or region-specific dialects) are pervasive in language data. Prior work on addressing such bias has focused on statistical methods, without providing a solution for deep learning models for NLP.

Authors

  • Xiruo Ding
    Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA.
  • Zhecheng Sheng
    Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA.
  • Brian Hur
    Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA.
  • Justin Tauscher
    Department of Psychiatry and Behavioral Sciences, BRiTE Center, University of Washington, Seattle, WA, USA.
  • Dror Ben-Zeev
    BRiTE Center, Psychiatry and Behavioral Sciences, University of Washington, Seattle, WA, United States.
  • Meliha Yetisgen
    Departments of Biomedical and Health Informatics, University of Washington Medical Center, Seattle; Departments of Linguistics, University of Washington Medical Center, Seattle.
  • Serguei Pakhomov
    Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA.
  • Trevor Cohen
    University of Washington, Seattle, WA.