Tailoring task arithmetic to address bias in models trained on multi-institutional datasets.
Journal:
Journal of Biomedical Informatics
Published Date:
Jun 8, 2025
Abstract
OBJECTIVE: Multi-institutional datasets are widely used for machine learning from clinical data to increase dataset size and improve generalization. However, deep learning models in particular may learn to recognize the source of a data element, leading to biased predictions. For example, deep learning models for image recognition trained on chest radiographs, with COVID-19 positive and negative examples drawn from different data sources, can respond to indicators of provenance (e.g., radiological annotations outside the lung area that reflect institution-specific practices) rather than to pathology, and consequently generalize poorly beyond their training data. Bias of this sort, called confounding by provenance, is of concern in natural language processing (NLP) because provenance indicators (e.g., institution-specific section headers or region-specific dialects) are pervasive in language data. Prior work on addressing such bias has focused on statistical methods, without providing a solution for deep learning NLP models.
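For context on the technique named in the title: task arithmetic edits a model by adding or subtracting "task vectors", i.e., parameter-wise differences between a fine-tuned model and its base model. The sketch below is a minimal, hedged illustration in PyTorch of that general idea, assuming three models that share one architecture; the specific debiasing recipe shown in the comments (subtracting a hypothetical "provenance" task vector) is an assumption for illustration only, not necessarily the procedure developed in this paper.

```python
# Minimal sketch of task arithmetic on model weights (PyTorch state_dicts).
# Assumption for illustration: a "provenance" task vector is subtracted to
# suppress source-identifying behavior; this is not claimed to be the
# paper's exact method.

import torch


def task_vector(finetuned_sd, base_sd):
    """Task vector = fine-tuned weights minus base weights, parameter-wise."""
    return {k: finetuned_sd[k] - base_sd[k] for k in base_sd}


def apply_vector(base_sd, vector, scale=1.0):
    """Return new weights: base weights plus a scaled task vector."""
    return {k: base_sd[k] + scale * vector[k] for k in base_sd}


# Hypothetical usage (names are placeholders):
#   base_sd       - pretrained model weights
#   clinical_sd   - weights fine-tuned on the clinical prediction task
#   provenance_sd - weights fine-tuned to predict the data source (institution)
#
# tau_task = task_vector(clinical_sd, base_sd)
# tau_prov = task_vector(provenance_sd, base_sd)
#
# Add the clinical task vector, then subtract a scaled provenance vector;
# the scaling factor would be tuned on held-out data.
# debiased_sd = apply_vector(apply_vector(base_sd, tau_task, 1.0), tau_prov, -0.5)
# model.load_state_dict(debiased_sd)
```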