Data Heterogeneity Modeling for Trustworthy Machine Learning
Journal: arXiv
Published Date: Jun 1, 2025
Abstract
Data heterogeneity plays a pivotal role in determining the performance of
machine learning (ML) systems. Traditional algorithms, which are typically
designed to optimize average performance, often overlook the intrinsic
diversity within datasets. This oversight can lead to a myriad of issues,
including unreliable decision-making, inadequate generalization across
different domains, unfair outcomes, and false scientific inferences. Hence, a
nuanced approach to modeling data heterogeneity is essential for the
development of dependable, data-driven systems. In this survey paper, we
present a thorough exploration of heterogeneity-aware machine learning, a
paradigm that systematically integrates considerations of data heterogeneity
throughout the entire ML pipeline -- from data collection and model training to
model evaluation and deployment. By applying this approach to a variety of
critical fields, including healthcare, agriculture, finance, and recommendation
systems, we demonstrate the substantial benefits and potential of
heterogeneity-aware ML. These applications underscore how a deeper
understanding of data diversity can enhance model robustness, fairness, and
reliability, and can support model diagnosis and improvement. Finally, we discuss
future directions and outline research opportunities for the broader data mining
community, aiming to promote the development of heterogeneity-aware ML.