Deep Learning-Based Genetic Perturbation Models Do Outperform Uninformative Baselines on Well-Calibrated Metrics
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
Single-cell genetic perturbation modeling involves predicting the effects of unobserved genetic manipulations, enabling scalable in silico screens for target discovery. Recent reports have claimed that deep learning-based perturbation models fail to outperform uninformative baselines, raising doubts about their utility. Here, we show that these conclusions largely stem from limitations of benchmarking metrics, not from the models themselves. We introduce a framework for evaluating benchmark metric calibration using positive and negative controls, including a new positive control baseline (the interpolated duplicate) and a quantitative calibration measure (the dynamic range fraction). Across 14 perturbation datasets and 13 evaluation metrics, we find that conventional metrics such as mean squared error (MSE) and control-referenced delta correlation (Pearson(Δctrl)) are often poorly calibrated, whereas weighted and rank-based alternatives exhibit consistent calibration. Under well-calibrated metrics, deep learning models outperform mean, control, and linear baselines, and in some cases even surpass the additive baseline in combination-prediction tasks. Calibrated evaluation thus explains prior reports of model underperformance, revealing that deep learning models do outperform uninformative baselines.
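
For readers unfamiliar with the two conventional metrics named above, the following is a minimal sketch of how MSE and Pearson(Δctrl) are commonly computed for a single perturbation. It assumes each input is a per-gene mean expression vector; the function names and the toy values are hypothetical and are not taken from the paper, whose exact implementation is not described in the abstract.

```python
import math

def mse(pred, truth):
    """Mean squared error between predicted and observed expression vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

def pearson(x, y):
    """Plain Pearson correlation; undefined if either vector is constant."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pearson_delta_ctrl(pred, truth, ctrl):
    """Pearson correlation of expression changes relative to the control mean."""
    pred_delta = [p - c for p, c in zip(pred, ctrl)]
    true_delta = [t - c for t, c in zip(truth, ctrl)]
    return pearson(pred_delta, true_delta)

# Toy per-gene means for four genes (invented for illustration only).
ctrl = [1.0, 2.0, 3.0, 4.0]    # unperturbed control cells
truth = [1.5, 2.0, 2.0, 4.5]   # observed perturbed cells
pred = [1.4, 2.1, 2.2, 4.4]    # hypothetical model prediction

print(mse(pred, truth))
print(pearson_delta_ctrl(pred, truth, ctrl))
```

In this sketch, a prediction equal to the control vector has an all-zero delta, so its Pearson(Δctrl) is undefined (division by zero). Any real evaluation has to choose how to handle that case.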