Evaluation of large language model chatbot responses to psychotic prompts
Journal: medRxiv
Published Date: Jan 1, 2025
Abstract
The large language model (LLM) chatbot product ChatGPT has accumulated 800 million weekly users since its 2022 launch. In 2025, several media outlets reported on individuals in whom apparent psychotic symptoms emerged or worsened in the context of using ChatGPT. As LLM chatbots are trained to align with user input, they may have difficulty responding to psychotic content. To assess whether ChatGPT can reliably generate appropriate responses to prompts containing psychotic symptoms, we conducted a cross-sectional study of ChatGPT responses to psychotic and control prompts, with blinded clinician ratings of response appropriateness. We accessed the ChatGPT web application on August 28 and 29, 2025, and tested three product versions: GPT-5 Auto (current paid default), GPT-4o (previous paid default), and “Free” (version accessible without subscription or account). We presented 158 unique prompts (79 control and 79 psychotic, generated based on the Structured Interview for Psychosis-Risk Syndromes) to each of the three product versions, yielding 474 prompt-response pairs. Blinded clinicians assigned each pair an appropriateness rating (0 = completely appropriate, 1 = somewhat appropriate, 2 = completely inappropriate) using a standardized rubric. We hypothesized a priori that psychotic prompts would be more likely than control prompts to elicit less appropriate responses both across and within product versions. In the primary (across-version) analysis, psychotic prompts were 25.84 times more likely than control prompts to elicit less appropriate responses with “Free” ChatGPT (95% CI 12.45 to 53.66, p < 0.001). GPT-5 Auto reduced risk somewhat (odds ratio [OR] for the interaction term 0.33, 95% CI 0.16 to 0.68, p = 0.005) yet still generated less appropriate responses at a greatly elevated rate (implied OR 8.53, 95% CI 3.05 to 23.84). In the secondary (within-version) analysis, ORs were 9.08 for GPT-5 Auto (95% CI 4.24 to 21.02), 14.15 for GPT-4o (95% CI 6.12 to 37.23), and 43.37 for “Free” (95% CI 18.44 to 112.80).
In an exploratory analysis, prompts reflecting grandiosity or disorganized communication were more likely to elicit inappropriate responses than those reflecting delusions. No tested version of ChatGPT reliably generated appropriate responses to psychotic content.
Can the popular large language model product ChatGPT reliably generate appropriate responses to prompts containing psychotic content? Psychotic prompts were 26 times more likely than control prompts to elicit less appropriate responses from the current free version of ChatGPT, and 9 times more likely to elicit them from the current paid version. No tested version of ChatGPT can reliably generate appropriate responses to psychotic content.
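
The arithmetic behind two of the reported figures can be reproduced with a short sketch. It assumes that the primary analysis is a logistic-type regression with a prompt type by product version interaction and “Free” as the reference version. The abstract does not name the model, so this is an assumption, and the function names are illustrative only. Under that assumption, coefficients add on the log-odds scale, so the implied OR for GPT-5 Auto is the product of the “Free” OR and the interaction OR.

```python
import math

# Values quoted in the abstract (primary, across-version analysis).
OR_PSYCHOTIC_FREE = 25.84    # psychotic vs. control prompts, "Free" version
OR_INTERACTION_GPT5 = 0.33   # interaction term: psychotic prompt x GPT-5 Auto


def implied_odds_ratio(reference_or: float, interaction_or: float) -> float:
    """Combine a reference-level OR with an interaction OR.

    Assumes a model in which log-odds coefficients add, so the
    corresponding odds ratios multiply (hypothetical helper).
    """
    return math.exp(math.log(reference_or) + math.log(interaction_or))


def n_prompt_response_pairs(n_prompts: int, n_versions: int) -> int:
    """Each prompt is presented once to each product version."""
    return n_prompts * n_versions


if __name__ == "__main__":
    # Implied OR for GPT-5 Auto, rounded as in the abstract.
    print(round(implied_odds_ratio(OR_PSYCHOTIC_FREE, OR_INTERACTION_GPT5), 2))  # 8.53
    # 79 control + 79 psychotic prompts, three product versions.
    print(n_prompt_response_pairs(79 + 79, 3))  # 474
```

The confidence interval for the implied OR cannot be recovered the same way: it depends on the covariance between the two coefficient estimates, which the abstract does not report, so multiplying the interval bounds would not reproduce 3.05 to 23.84.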