Evaluating anti-LGBTQIA+ medical bias in large language models

Journal: medRxiv
Published Date:

Abstract

Large language models (LLMs) are increasingly deployed in clinical settings for tasks ranging from patient communication to decision support. While these models demonstrate race-based and binary gender biases, anti-LGBTQIA+ bias remains understudied despite documented healthcare disparities affecting these populations. In this work, we evaluated the potential of LLMs to propagate anti-LGBTQIA+ medical bias and misinformation. We prompted 4 LLMs (Gemini 1.5 Flash, Claude 3 Haiku, GPT-4o, Stanford Medicine Secure GPT [GPT-4.0]) with 38 prompts consisting of explicit questions and synthetic clinical notes created by medically-trained reviewers and LGBTQIA+ health experts. The prompts were written in pairs, with and without LGBTQIA+ identity terms, and explored clinical situations across two axes: (i) situations where historical bias has been observed versus not observed, and (ii) situations where LGBTQIA+ identity is relevant to clinical care versus not relevant. Medically-trained reviewers evaluated LLM responses for appropriateness (safety, privacy, hallucination/accuracy, and bias) and clinical utility. We found that all 4 LLMs generated inappropriate responses for prompts with and without LGBTQIA+ identity terms. The proportion of inappropriate responses ranged from 43-62% for prompts mentioning LGBTQIA+ identities versus 47-65% for those without. The most common reason for an inappropriate classification tended to be hallucination/accuracy, followed by bias or safety. Qualitatively, we observed differential bias patterns, with LGBTQIA+ prompts eliciting more severe bias. The average clinical utility score was lower for inappropriate responses than for appropriate responses (2.6 versus 3.7 on a 5-point Likert scale). Future work should focus on tailoring output formats to stated use cases, decreasing sycophancy and reliance on extraneous information in the prompt, and improving accuracy and decreasing bias for LGBTQIA+ patients.
We present our prompts and annotated responses as a benchmark for evaluation of future models.

Content warning: This paper includes prompts and model-generated responses that may be offensive.

Large Language Models (LLMs), such as ChatGPT, have the potential to enhance healthcare by helping with tasks such as responding to patient messages and supporting providers in making medical decisions. However, these technologies might inadvertently spread medical misinformation or reinforce harmful biases against minoritized groups. Our research examined the risk of LLMs perpetuating anti-LGBTQIA+ biases in medical contexts. We tested four LLMs with prompts designed by medical and LGBTQIA+ health experts. These prompts addressed various clinical scenarios, some historically linked to bias against LGBTQIA+ individuals. Our evaluation revealed that all four LLMs produced responses that were inaccurate or biased, both for prompts that mentioned LGBTQIA+ identity terms and for prompts that did not. Qualitatively, the nature of inappropriate responses differed between these groups, with LGBTQIA+ identity terms eliciting more severe bias. The clinical utility of responses was, on average, lower for inappropriate responses than for appropriate responses. These findings highlight the urgent need to ensure that LLMs used in medical contexts provide accurate and safe medical advice for LGBTQIA+ patients. Future efforts should focus on refining how LLMs generate responses, minimizing biases, enhancing reliability in clinical settings, and critically examining use cases. This work is crucial for fostering equitable healthcare for all individuals.
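
The summary statistics reported in the abstract (the proportion of inappropriate responses per model and prompt group, and the mean clinical utility by appropriateness) could be computed from reviewer annotations roughly as follows. This is a minimal sketch, not the authors' analysis code: the `RatedResponse` schema, the field names, the rule that any failed criterion makes a response inappropriate, and the toy ratings are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Appropriateness criteria named in the abstract.
CRITERIA = ("safety", "privacy", "hallucination_accuracy", "bias")


@dataclass(frozen=True)
class RatedResponse:
    """One reviewer-rated LLM response (hypothetical schema, not the authors')."""
    model: str
    mentions_identity: bool  # True if the prompt includes LGBTQIA+ identity terms
    flags: frozenset         # criteria marked as failed; empty means appropriate
    utility: int             # clinical utility on a 1-5 Likert scale

    def __post_init__(self):
        unknown = set(self.flags) - set(CRITERIA)
        if unknown:
            raise ValueError(f"unknown criteria: {sorted(unknown)}")

    @property
    def inappropriate(self) -> bool:
        # Assumption: a response is inappropriate if it fails any criterion.
        return len(self.flags) > 0


def inappropriate_rate(responses):
    """Return {(model, mentions_identity): fraction of inappropriate responses}."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for r in responses:
        key = (r.model, r.mentions_identity)
        totals[key] += 1
        bad[key] += int(r.inappropriate)
    return {key: bad[key] / totals[key] for key in totals}


def mean_utility(responses, inappropriate):
    """Mean Likert utility over responses with the given appropriateness status."""
    scores = [r.utility for r in responses if r.inappropriate == inappropriate]
    return sum(scores) / len(scores) if scores else None


# Toy ratings for illustration only; these are NOT data from the study.
toy = [
    RatedResponse("model_a", True, frozenset({"bias"}), 2),
    RatedResponse("model_a", True, frozenset(), 4),
    RatedResponse("model_a", False, frozenset({"hallucination_accuracy"}), 3),
    RatedResponse("model_a", False, frozenset(), 4),
]
rates = inappropriate_rate(toy)
```

Keying the rates by `(model, mentions_identity)` mirrors the paired design, so the two halves of each prompt pair can be compared within the same model.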

Authors

  • Crystal T. Chang; Neha Srivathsa; Charbel Bou-Khalil; Akshay Swaminathan; Mitchell R. Lunn; Kavita Mishra; Sanmi Koyejo; Roxana Daneshjou