Statistically Valid Post-Deployment Monitoring Should Be Standard for AI-Based Digital Health
Journal: arXiv
Published Date: Jun 6, 2025
Abstract
This position paper argues that post-deployment monitoring in clinical AI is
underdeveloped and proposes statistically valid and label-efficient testing
frameworks as a principled foundation for ensuring reliability and safety in
real-world deployment. A recent review found that only 9% of FDA-registered
AI-based healthcare tools include a post-deployment surveillance plan. Existing
monitoring approaches are often manual, sporadic, and reactive, making them
ill-suited for the dynamic environments in which clinical models operate. We
contend that post-deployment monitoring should be grounded in label-efficient
and statistically valid testing frameworks, offering a principled alternative
to current practices. We use the term "statistically valid" to refer to methods
that provide explicit guarantees on error rates (e.g., Type I/II error), enable
formal inference under pre-defined assumptions, and support
reproducibility, all features that align with regulatory requirements.
Specifically, we propose that detecting changes in the data and detecting
degradation in model performance should be framed as distinct statistical
hypothesis testing problems. Grounding monitoring in statistical rigor ensures a
reproducible and scientifically sound basis for maintaining the reliability of
clinical AI systems. Importantly, it also opens new research directions for the
technical community, spanning theory, methods, and tools for statistically
principled detection, attribution, and mitigation of post-deployment model
failures in real-world settings.
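
To make the idea of two distinct hypothesis tests concrete, the following is a minimal sketch and not the authors' method. It assumes a single numeric input feature, i.i.d. samples, and a small labeled audit set drawn after deployment. The function names, the mean-difference statistic, and the parameters are all hypothetical choices made for illustration. The first function tests for a change in the data using a permutation test, which needs no labels. The second tests for performance degradation using a one-sided exact binomial test, which needs only a modest number of labels.

```python
import math
import random
from statistics import fmean


def permutation_shift_pvalue(reference, deployed, n_perm=2000, seed=0):
    """Data-change test (no labels needed).

    H0: reference and deployed feature values are exchangeable
        (drawn from the same distribution).
    Statistic: absolute difference in means, an illustrative choice that
    is only sensitive to shifts in the mean.
    """
    rng = random.Random(seed)
    n_ref = len(reference)
    observed = abs(fmean(reference) - fmean(deployed))
    pooled = list(reference) + list(deployed)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = abs(fmean(pooled[:n_ref]) - fmean(pooled[n_ref:]))
        if stat >= observed:
            exceed += 1
    # The +1 correction keeps the p-value valid for a finite number
    # of random permutations.
    return (exceed + 1) / (n_perm + 1)


def binomial_degradation_pvalue(n_correct, n_labeled, baseline_accuracy):
    """Performance-degradation test on a small labeled audit sample.

    H0: deployed accuracy >= baseline_accuracy.
    H1: deployed accuracy <  baseline_accuracy.
    p-value: P(X <= n_correct) for X ~ Binomial(n_labeled, baseline_accuracy).
    """
    p0 = baseline_accuracy
    tail = sum(
        math.comb(n_labeled, k) * p0**k * (1 - p0) ** (n_labeled - k)
        for k in range(n_correct + 1)
    )
    return min(1.0, tail)  # guard against floating-point overshoot


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic data, for illustration only.
    reference = [rng.gauss(0.0, 1.0) for _ in range(200)]
    deployed = [rng.gauss(0.5, 1.0) for _ in range(200)]
    alpha = 0.05

    p_shift = permutation_shift_pvalue(reference, deployed)
    p_perf = binomial_degradation_pvalue(
        n_correct=78, n_labeled=100, baseline_accuracy=0.90
    )
    print("data change detected:", p_shift < alpha)
    print("performance degradation detected:", p_perf < alpha)
```

Keeping the two tests separate reflects the framing above: a detected change in the data does not by itself imply degraded performance, and the reverse also holds. Rejecting each null hypothesis at a pre-specified level alpha bounds the Type I error of that test under the stated assumptions. Running both tests repeatedly over time would additionally require a multiple-testing or sequential correction, which this sketch omits.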