Learning dynamic binary treatment policies under treatment selection bias: A Conservative Q-Learning approach with representation balancing.
Journal:
Journal of Biomedical Informatics
Published Date:
Feb 10, 2026
Abstract
OBJECTIVE: To learn safe, robust dynamic treatment regimes (DTRs) from observational trajectories that exhibit treatment selection bias, using an offline reinforcement learning (RL) approach.
METHODS: We propose CQL-RB, which augments Conservative Q-Learning (CQL) with a representation-balancing penalty based on an integral probability metric (IPM), instantiated as either a maximum mean discrepancy (MMD) or an energy-distance penalty. The penalty aligns latent patient representations across treatment groups to reduce action-conditioned distribution shift while preserving CQL's conservative policy estimation. We evaluate CQL-RB on two clinically realistic simulators, EpiCare (eight environments) and AhnChemo from DTR-Bench, both of which model longitudinal healthcare decisions with a binary action at each stage. To emulate selection bias, we implement clinician-like behavior policies that assign treatment as a function of patient covariates. Baselines are BOWL, ACWL, T-RL, RL-NN, and standard CQL. Outcomes are expected return and adverse-event counts from simulator rollouts; model selection uses weighted importance sampling off-policy evaluation on held-out data. Ablations vary both the IPM weight β and the choice of IPM.
RESULTS: Across all eight EpiCare environments and the challenging AhnChemo task, CQL-RB with either the MMD or the energy-distance penalty consistently achieves higher returns than competing methods while yielding lower (or comparable) adverse-event rates. Removing the balancing term degrades both return and safety, confirming its contribution. Performance is robust for moderate penalty weights (e.g., β∈{1,10,100}) and degrades only at overly large values (e.g., β≥1000 for MMD or β=10000 for energy distance).
CONCLUSION: Representation balancing materially strengthens conservative offline RL for DTR learning under treatment selection bias. By aligning patient representations without altering CQL's safety mechanics, CQL-RB delivers policies that are both more effective (higher returns) and safer (fewer adverse events) in realistic healthcare simulations. These findings underscore the importance of addressing treatment selection bias when learning robust and safe dynamic treatment policies.
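As a rough illustration of the balancing penalty described in METHODS, the sketch below computes the two IPM instantiations named in the abstract (MMD and energy distance) between the latent representations of treated and untreated patients, and adds the result to a loss with weight β. The abstract does not specify the kernel, the bandwidth, the estimator, or exactly how the penalty enters the objective; the RBF kernel, the biased (V-statistic) estimators, the additive combination, and all function names here (`mmd2_rbf`, `energy_distance`, `balanced_loss`) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def _pairwise_sq_dists(a, b):
    """Squared Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def mmd2_rbf(z_treated, z_control, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD with an RBF kernel.

    The kernel and bandwidth are illustrative assumptions.
    """
    gamma = 1.0 / (2.0 * bandwidth ** 2)
    k_tt = np.exp(-gamma * _pairwise_sq_dists(z_treated, z_treated))
    k_cc = np.exp(-gamma * _pairwise_sq_dists(z_control, z_control))
    k_tc = np.exp(-gamma * _pairwise_sq_dists(z_treated, z_control))
    return k_tt.mean() + k_cc.mean() - 2.0 * k_tc.mean()


def energy_distance(z_treated, z_control):
    """Biased (V-statistic) estimate of the energy distance."""
    d_tc = np.sqrt(_pairwise_sq_dists(z_treated, z_control))
    d_tt = np.sqrt(_pairwise_sq_dists(z_treated, z_treated))
    d_cc = np.sqrt(_pairwise_sq_dists(z_control, z_control))
    return 2.0 * d_tc.mean() - d_tt.mean() - d_cc.mean()


def balanced_loss(cql_loss, z, actions, beta=10.0, metric=mmd2_rbf):
    """Add beta * IPM(treated, control) to a precomputed CQL loss value.

    Assumes an additive penalty; z holds latent representations (one row per
    patient state) and actions holds the binary treatment taken in each.
    """
    treated = z[actions == 1]
    control = z[actions == 0]
    if len(treated) == 0 or len(control) == 0:
        # The penalty is undefined when a batch contains only one group.
        return cql_loss
    return cql_loss + beta * metric(treated, control)
```

Both estimators return approximately zero when the two groups have identical representations and grow as the groups separate, so under this sketch a larger β pushes the encoder harder toward treatment-invariant representations.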