Robust Root Cause Diagnosis using In-Distribution Interventions
Journal: arXiv
Published Date: May 2, 2025
Abstract
Diagnosing the root cause of an anomaly in a complex interconnected system is
a pressing problem in today's cloud services and industrial operations. We
propose In-Distribution Interventions (IDI), a novel algorithm that predicts
root causes as the nodes that meet two criteria: 1) **Anomaly:** root cause nodes
should take on anomalous values; 2) **Fix:** had the root cause nodes assumed
usual values, the target node would not have been anomalous. Prior methods of
assessing the fix condition rely on counterfactuals inferred from a Structural
Causal Model (SCM) trained on historical data. But since anomalies are rare and
fall outside the training distribution, the fitted SCMs yield unreliable
counterfactual estimates. IDI overcomes this by relying on interventional
estimates obtained by probing the fitted SCM solely at in-distribution inputs.
We present a theoretical analysis comparing and bounding the errors in
assessing the fix condition using interventional and counterfactual estimates.
We then conduct experiments by systematically varying the SCM's complexity to
demonstrate the cases where IDI's interventional approach outperforms the
counterfactual approach and vice versa. Experiments on both synthetic and
PetShop RCD benchmark datasets demonstrate that IDI consistently identifies
true root causes more accurately and robustly than nine existing
state-of-the-art RCD baselines. Code is released at
https://github.com/nlokeshiisc/IDI_release.
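
To make the two criteria concrete, the following is a minimal sketch on a hypothetical three-node linear chain `x1 -> x2 -> y`. It is not the released IDI implementation. The toy SCM, the residual-based anomaly score, the threshold of 3 standard deviations, and the names `diagnose` and `is_anomalous_target` are all assumptions made for illustration. The fix condition is checked by setting a candidate node to a usual value and propagating it through the fitted mechanisms, so that those mechanisms are queried near the historical data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy SCM (not from the paper): x1 -> x2 -> y, linear, small noise.
n = 5000
x1 = rng.normal(0.0, 1.0, n)
x2 = 2.0 * x1 + rng.normal(0.0, 0.1, n)
y = 1.5 * x2 + rng.normal(0.0, 0.1, n)

# Fit each mechanism by least squares on historical (in-distribution) data.
a = (x1 @ x2) / (x1 @ x1)  # x2 ~ a * x1
b = (x2 @ y) / (x2 @ x2)   # y  ~ b * x2
x1_mean = x1.mean()
resid_std = {"x1": x1.std(), "x2": (x2 - a * x1).std()}
y_mean, y_std = y.mean(), y.std()


def is_anomalous_target(y_val, thresh=3.0):
    """Flag the target when it lies more than `thresh` std devs from its mean."""
    return abs(y_val - y_mean) / y_std > thresh


def diagnose(sample, thresh=3.0):
    """Return the nodes that satisfy both the anomaly and the fix condition."""
    if not is_anomalous_target(sample["y"], thresh):
        return []
    resid = {
        "x1": sample["x1"] - x1_mean,
        "x2": sample["x2"] - a * sample["x1"],
    }
    roots = []
    for node in ("x1", "x2"):
        # Anomaly (assumed score): the node's residual is an outlier.
        if abs(resid[node]) / resid_std[node] <= thresh:
            continue
        # Fix: intervene with a usual value, then propagate downstream
        # using the expected (zero) noise of the fitted mechanisms.
        if node == "x1":
            x2_do = a * x1_mean        # x1 set to its historical mean
        else:
            x2_do = a * sample["x1"]   # x2 set to its usual value given x1
        y_do = b * x2_do
        if not is_anomalous_target(y_do, thresh):
            roots.append(node)
    return roots


# A shock injected at x2: x1 is normal, x2 and y are far from usual values.
print(diagnose({"x1": 0.2, "x2": 8.4, "y": 12.6}))
```

In this sketch, a shock at `x2` leaves `x1` with a normal value, so only `x2` passes the anomaly check, and resetting `x2` to its usual value brings `y` back into its normal range. The sketch assumes a single anomalous mechanism; handling several simultaneous root causes would need more care than shown here.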