Implicit data crimes: Machine learning bias arising from misuse of public data.

Journal: Proceedings of the National Academy of Sciences of the United States of America

Abstract

Significance

Public databases are an important resource for machine learning research, but their growing availability sometimes leads to "off-label" usage, where data published for one task are used for another. This work reveals that such off-label usage can lead to biased, overly optimistic results from machine-learning algorithms. The underlying cause is that public data are processed with hidden processing pipelines that alter the data features. Here we study three well-known algorithms developed for image reconstruction from magnetic resonance imaging measurements and show that they can produce biased results, with up to 48% artificial improvement, when applied to public databases. We refer to the publication of such results as implicit "data crimes" to raise community awareness of this growing big data problem.
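The mechanism described above can be illustrated with a toy sketch (not the authors' experimental code): a hypothetical hidden pipeline low-pass filters a 1D signal before publication; when the published data are then retrospectively undersampled in the Fourier domain and reconstructed, the error metric looks artificially better than on the raw data. All names and parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D "image": a piecewise-constant signal with noise-like texture.
n = 256
x_raw = np.zeros(n)
x_raw[64:192] = 1.0
x_raw += 0.2 * rng.standard_normal(n)

# Hypothetical hidden pipeline: low-pass filtering applied before the data
# were published (standing in for interpolation/compression in a database).
kernel = np.ones(9) / 9
x_proc = np.convolve(x_raw, kernel, mode="same")

def zero_fill_recon(x):
    """Retrospectively undersample k-space (keep only the central 25% of
    frequencies) and reconstruct by zero-filled inverse FFT."""
    k = np.fft.fftshift(np.fft.fft(x))
    mask = np.zeros(n, dtype=bool)
    mask[n // 2 - n // 8 : n // 2 + n // 8] = True
    k_under = np.where(mask, k, 0)
    return np.real(np.fft.ifft(np.fft.ifftshift(k_under)))

def nrmse(ref, est):
    """Normalized root-mean-square reconstruction error."""
    return np.linalg.norm(ref - est) / np.linalg.norm(ref)

err_raw = nrmse(x_raw, zero_fill_recon(x_raw))
err_proc = nrmse(x_proc, zero_fill_recon(x_proc))

print(f"NRMSE on raw data:       {err_raw:.3f}")
print(f"NRMSE on processed data: {err_proc:.3f}")
```

Because the hidden filtering concentrates the signal's energy in low frequencies, the processed data survive undersampling almost unscathed and report a lower NRMSE, even though nothing about the reconstruction method improved. This is the flavor of "artificial improvement" the abstract warns about.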

Authors

  • Efrat Shimron
    Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.
  • Jonathan I. Tamir
    Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712.
  • Ke Wang
    Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.
  • Michael Lustig
    Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.