Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.

Journal: Journal of Medical Internet Research
Published Date:

Abstract

BACKGROUND: Large language models (LLMs), such as OpenAI's GPT-3.5, GPT-4, and GPT-4o, have garnered early and significant enthusiasm for their potential applications within mental health, ranging from documentation support to chatbot therapy. Understanding the accuracy and reliability of the psychiatric "knowledge" stored within the parameters of these models, and developing measures of confidence in their responses (ie, the likelihood that an LLM response is accurate), are crucial for the safe and effective integration of these tools into mental health settings.
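One simple way to operationalize the "confidence measure" idea described above is to sample a model's answer to the same multiple-choice item several times and treat the rate of agreement across samples as a proxy for the likelihood that the response is accurate. The Python sketch below illustrates this under stated assumptions: ask_model is a hypothetical placeholder for a real LLM API call, and the example question, options, and sampling weights are invented for illustration. This is not the study's actual method, only a minimal sketch of one possible confidence proxy.

    from collections import Counter
    import random

    def agreement_confidence(ask_model, question, options, n_samples=10):
        # Query the model repeatedly on the same item and treat the
        # frequency of the most common answer as a rough confidence
        # score in [0, 1].
        answers = [ask_model(question, options) for _ in range(n_samples)]
        top_answer, top_count = Counter(answers).most_common(1)[0]
        return top_answer, top_count / n_samples

    # Hypothetical stand-in for a real LLM call: returns one of the
    # option letters, here with made-up weights for demonstration only.
    def ask_model(question, options):
        return random.choices(list(options), weights=[6, 2, 1, 1])[0]

    question = "Which medication requires regular serum level monitoring?"
    options = {"A": "Lithium", "B": "Sertraline",
               "C": "Buspirone", "D": "Trazodone"}
    answer, confidence = agreement_confidence(ask_model, question, options)
    print(f"Answer: {answer} ({options[answer]}), agreement: {confidence:.0%}")

Answer agreement is only one of several possible confidence proxies (token log-probabilities or verbalized confidence ratings are alternatives); its practical appeal is that it works with any black-box chat API, requiring no access to model internals.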

Authors

  • Kaitlin Hanss
    Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States.
  • Karthik V Sarma
    Department of Bioengineering, University of California, Los Angeles, Los Angeles, CA, United States.
  • Anne L Glowinski
    Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States.
  • Andrew Krystal
    Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States.
  • Ramotse Saunders
    Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States.
  • Andrew Halls
    Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States.
  • Sasha Gorrell
    Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States.
  • Erin Reilly
    Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States.