Implicit data crimes: Machine learning bias arising from misuse of public data.

Journal: Proceedings of the National Academy of Sciences of the United States of America

Abstract

Significance

Public databases are an important resource for machine learning research, but their growing availability sometimes leads to "off-label" usage, where data published for one task are used for another. This work reveals that such off-label usage can lead to biased, overly optimistic results from machine-learning algorithms. The underlying cause is that public data are processed with hidden processing pipelines that alter the data features. Here we study three well-known algorithms developed for image reconstruction from magnetic resonance imaging measurements and show that they can produce biased results, with up to 48% artificial improvement, when applied to public databases. We refer to the publication of such results as implicit "data crimes" to raise community awareness of this growing big data problem.
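The mechanism described above can be illustrated with a toy sketch (not the authors' experimental code): a hypothetical hidden pipeline low-pass filters a 1D signal before publication; when the published data are then retrospectively undersampled in the Fourier domain and reconstructed, the error metric looks artificially better than on the raw data. All names and parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D "image": a piecewise-constant signal with noise-like texture.
n = 256
x_raw = np.zeros(n)
x_raw[64:192] = 1.0
x_raw += 0.2 * rng.standard_normal(n)

# Hypothetical hidden pipeline: low-pass filtering applied before the data
# were published (standing in for interpolation/compression in a database).
kernel = np.ones(9) / 9
x_proc = np.convolve(x_raw, kernel, mode="same")

def zero_fill_recon(x):
    """Retrospectively undersample k-space (keep only the central 25% of
    frequencies) and reconstruct by zero-filled inverse FFT."""
    k = np.fft.fftshift(np.fft.fft(x))
    mask = np.zeros(n, dtype=bool)
    mask[n // 2 - n // 8 : n // 2 + n // 8] = True
    k_under = np.where(mask, k, 0)
    return np.real(np.fft.ifft(np.fft.ifftshift(k_under)))

def nrmse(ref, est):
    """Normalized root-mean-square reconstruction error."""
    return np.linalg.norm(ref - est) / np.linalg.norm(ref)

err_raw = nrmse(x_raw, zero_fill_recon(x_raw))
err_proc = nrmse(x_proc, zero_fill_recon(x_proc))

print(f"NRMSE on raw data:       {err_raw:.3f}")
print(f"NRMSE on processed data: {err_proc:.3f}")
```

Because the hidden filtering concentrates the signal's energy in low frequencies, the processed data survive undersampling almost unscathed and report a lower NRMSE, even though nothing about the reconstruction method improved. This is the flavor of "artificial improvement" the abstract warns about.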

Authors

  • Efrat Shimron
    Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.
  • Jonathan I. Tamir
    Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712.
  • Ke Wang
    Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.
  • Michael Lustig
    Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.