Overrepresentation Bias Leads to Performance Overestimation in Blood-Brain Barrier Permeability Prediction Models: Characterization and Mitigation.

Journal: Journal of Chemical Information and Modeling
Published Date:

Abstract

Recent advances in blood-brain barrier permeability (BBBP) prediction for drug compounds have highlighted the growing role of machine learning, particularly deep learning. While considerable attention has been given to feature engineering and model design, model evaluation often receives insufficient attention despite its fundamental role in model credibility. In this work, we study a phenomenon we term overrepresentation bias, which is likely to occur in drug property databases and is characterized by the presence of near-identical compounds with the same or nearly identical property values. Our findings reveal that overrepresentation bias leads to overly optimistic performance estimates in BBBP prediction models by significantly inflating test evaluation metrics: by 13.3% on average for the area under the curve and by 16.44% on average for the macro F1-score. To address this bias, we propose (i) an automatic detection algorithm and (ii) a bias-aware data handling procedure. We recommend adopting this approach to ensure more reliable model evaluations. Given that overrepresentation bias can affect performance estimation more than feature selection, model architecture, or even training data, we urge both academic and industrial communities to acknowledge its significance and take proactive measures to identify and address this bias in future studies.
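The abstract does not specify the detection algorithm or the data handling procedure, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes each compound is represented by a hypothetical fingerprint (a set of integer feature IDs), treats two compounds as near-identical when their Tanimoto similarity reaches an assumed threshold of 0.9 and their labels match, and then keeps each cluster of near-identical compounds on one side of the train/test split. All function names, the threshold, and the toy data are assumptions made for illustration.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of feature IDs."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(fingerprints, labels, threshold=0.9):
    """Link compounds that share a label and have similarity >= threshold.

    Returns one cluster ID per compound (union-find representative).
    The threshold value is an assumption, not taken from the paper.
    """
    parent = list(range(len(fingerprints)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(fingerprints)), 2):
        same_label = labels[i] == labels[j]
        if same_label and tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
            parent[find(j)] = find(i)

    return [find(i) for i in range(len(fingerprints))]


def group_aware_split(cluster_ids, test_fraction=0.2):
    """Assign whole clusters to the test set until the target size is reached."""
    clusters = {}
    for idx, cid in enumerate(cluster_ids):
        clusters.setdefault(cid, []).append(idx)

    target = test_fraction * len(cluster_ids)
    train, test = [], []
    # Visit the smallest clusters first; a cluster is never divided.
    for members in sorted(clusters.values(), key=len):
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)


# Toy data (hypothetical): compounds 0 and 1 are near-identical, same label.
fingerprints = [
    set(range(1, 11)),   # compound 0
    set(range(1, 12)),   # compound 1, Tanimoto 10/11 with compound 0
    {20, 21, 22},        # compound 2
    {30, 31},            # compound 3
    {40, 41, 42},        # compound 4
]
labels = [1, 1, 0, 1, 0]

cluster_ids = cluster_near_duplicates(fingerprints, labels)
train_idx, test_idx = group_aware_split(cluster_ids, test_fraction=0.2)
```

In this sketch, near-identical compounds can never appear on both sides of the split, which is the leakage that would otherwise inflate test metrics. In practice, fingerprints would come from a cheminformatics toolkit, and the similarity threshold would need to be chosen and validated for the dataset at hand.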

Authors
