Empirically derived evaluation requirements for responsible deployments of AI in safety-critical settings.
Journal: npj Digital Medicine
Published Date: Jun 18, 2025
Abstract
Processes to assure the safe, effective, and responsible deployment of artificial intelligence (AI) in safety-critical settings are urgently needed. Here we show a procedure to empirically evaluate the impacts of AI augmentation as a basis for responsible deployment. We evaluated three augmentative AI technologies that nurses used to recognize imminent patient emergencies; the technologies included combinations of AI recommendations and explanations. The evaluation involved 450 nursing students and 12 licensed nurses assessing 10 historical patient cases. With each technology, nurses' performance improved when the AI algorithm was most correct and degraded when it was most misleading. Our findings caution that AI capabilities alone do not guarantee a safe and effective joint human-AI system. We propose two minimum requirements for evaluating AI in safety-critical settings: (1) empirically measure the performance of people and AI together, and (2) examine a range of challenging cases that elicit strong, mediocre, and poor AI performance.
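The two proposed requirements can be illustrated with a minimal sketch. It is not the authors' analysis. The records, the score scale, the stratum labels, and the function name `joint_effect_by_ai_quality` are all hypothetical, chosen only to show the shape of the comparison: human performance with and without AI, broken down by how well the AI performed on each case.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, NOT data from the study:
# (case_id, ai_quality, condition, score)
# ai_quality labels how well the AI performed on that case;
# score is an illustrative 0-1 measure of the human's assessment accuracy.
records = [
    ("case_a", "strong", "human_only", 0.70),
    ("case_a", "strong", "human_with_ai", 0.85),
    ("case_b", "mediocre", "human_only", 0.60),
    ("case_b", "mediocre", "human_with_ai", 0.60),
    ("case_c", "poor", "human_only", 0.65),
    ("case_c", "poor", "human_with_ai", 0.40),
]


def joint_effect_by_ai_quality(rows):
    """Return mean(human_with_ai) - mean(human_only) for each AI-quality stratum."""
    grouped = defaultdict(lambda: defaultdict(list))
    for _case_id, ai_quality, condition, score in rows:
        grouped[ai_quality][condition].append(score)

    effects = {}
    for ai_quality, by_condition in grouped.items():
        # Requirement (1): measure people and AI together, against people alone.
        effects[ai_quality] = (
            mean(by_condition["human_with_ai"]) - mean(by_condition["human_only"])
        )
    return effects


# Requirement (2): report the effect separately for strong, mediocre, and poor AI cases.
effects = joint_effect_by_ai_quality(records)
for ai_quality, delta in effects.items():
    print(f"{ai_quality}: change in score with AI = {delta:+.2f}")
```

Reporting the difference per stratum, instead of a single pooled average, is what exposes the pattern the abstract describes: a gain on cases where the AI is strong can mask a loss on cases where it is misleading.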