Towards Superhuman Imitation Learning for Sequential Head-and-Neck Cancer Treatment Decisions
Journal:
medRxiv
Published Date:
Jan 1, 2025
Abstract
We propose a simulator-driven imitation learning framework for sequential decision making in head and neck cancer (HNC) treatment. Our method, Superhuman Policy Gradient Optimization (SPGO), integrates inverse reinforcement learning principles with policy gradient updates to derive three-stage treatment policies directly from recorded physician decisions. It relies on a pre-trained clinical simulator, which combines a variational autoencoder with gradient boosting models, to generate complete, temporally consistent patient trajectories, enabling safe and reproducible training. Unlike conventional behavior cloning, SPGO optimizes a subdominance loss that explicitly rewards surpassing the expert across multiple clinical outcomes, including relapse at year three and patient-reported toxicities at multiple follow-up times. We systematically compare six subdominance configurations (absolute vs. relative differences, sum vs. max aggregation, per-feature vs. max-only α updates) to assess how loss design affects convergence and treatment quality. Our best configuration (relative differences with sum aggregation and per-feature α updates) achieves over 70% superhuman dominance across clinically relevant features on held-out patients. The learned policies reproduce expert decisions on acute measures while significantly reducing predicted late toxicities and relapse risk, demonstrating generalization beyond the training distribution.
CCS Concepts:
• Applied computing → Health informatics; • Computing methodologies → Reinforcement learning; Learning from demonstrations.
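To make the loss-design choices named in the abstract concrete, the following is a minimal sketch of a hinge-style subdominance term, not the authors' implementation. It assumes that each clinical outcome is expressed as a cost where lower is better, that the loss takes the common margin form max(α·difference + β, 0) per feature, and that "relative" means dividing the learner-minus-expert difference by the magnitude of the expert's value. The function name `subdominance`, the margin `beta`, and the stabilizer `eps` are hypothetical and do not come from the abstract.

```python
import numpy as np


def subdominance(learner, expert, alpha, beta=1.0,
                 relative=True, aggregate="sum", eps=1e-8):
    """Hinge-style subdominance of one learner trajectory vs. one expert.

    learner, expert: per-feature costs (assumed: lower is better).
    alpha: per-feature positive scales; beta: margin (assumed form).
    """
    learner = np.asarray(learner, dtype=float)
    expert = np.asarray(expert, dtype=float)
    alpha = np.asarray(alpha, dtype=float)

    # Positive difference means the learner is worse than the expert.
    diff = learner - expert
    if relative:
        # Assumed meaning of "relative": scale by the expert's magnitude.
        diff = diff / (np.abs(expert) + eps)

    # A feature contributes zero only when the learner beats the expert
    # by at least the margin beta / alpha.
    per_feature = np.maximum(alpha * diff + beta, 0.0)

    if aggregate == "sum":
        return float(per_feature.sum())
    if aggregate == "max":
        return float(per_feature.max())
    raise ValueError("aggregate must be 'sum' or 'max'")


# Hypothetical two-feature example (e.g. relapse risk, one toxicity score).
learner_costs = [0.1, 0.5]
expert_costs = [0.2, 0.5]
alpha = [1.0, 1.0]

loss_rel_sum = subdominance(learner_costs, expert_costs, alpha)
loss_rel_max = subdominance(learner_costs, expert_costs, alpha, aggregate="max")
loss_abs_sum = subdominance(learner_costs, expert_costs, alpha, relative=False)
```

Under these assumptions, sum aggregation penalizes every feature on which the learner has not yet cleared the margin, whereas max aggregation tracks only the worst feature. The α update rules compared in the abstract (per-feature vs. max-only) are not modeled here.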