Large Language Models' Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products
Journal: arXiv
Published Date: Jul 6, 2025
Abstract
Machine learning methods are increasingly applied to analyze health-related
public discourse based on large-scale data, but questions remain regarding
their ability to accurately detect different types of health sentiments.
In particular, Large Language Models (LLMs) have gained attention as a powerful
technology, yet their accuracy and feasibility in capturing different opinions
and perspectives on health issues are largely unexplored. Thus, this research
examines how accurately three prominent LLMs (GPT, Gemini, and LLAMA) detect
risk-promoting versus health-supporting sentiments across two
critical public health topics: Human Papillomavirus (HPV) vaccination and
heated tobacco products (HTPs). Drawing on data from Facebook and Twitter, we
curated multiple sets of messages supporting or opposing recommended health
behaviors, supplemented with human annotations as the gold standard for
sentiment classification. The findings indicate that all three LLMs generally
demonstrate substantial accuracy in classifying risk-promoting and
health-supporting sentiments, although notable discrepancies emerge by
platform, health issue, and model type. Specifically, models often show higher
accuracy for risk-promoting sentiment on Facebook, whereas they detect
health-supporting messages more accurately on Twitter. An additional analysis
also shows that LLMs have difficulty reliably detecting neutral messages. These
results highlight the importance of carefully selecting and validating language
models for public health analyses, particularly given potential biases in
training data that may lead LLMs to overestimate or underestimate the
prevalence of certain perspectives.
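
The comparison described above (model labels scored against human annotations, broken down by platform and sentiment class) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function name `per_class_accuracy`, the label strings, and the toy records are all hypothetical and exist only to show how per-class accuracy might be tabulated.

```python
from collections import defaultdict


def per_class_accuracy(records):
    """Return {(platform, gold_label): share of messages labeled correctly}.

    Each record is a (platform, human_label, model_label) tuple, with the
    human label treated as the gold standard.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for platform, gold, predicted in records:
        key = (platform, gold)
        total[key] += 1
        if predicted == gold:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Made-up toy records for illustration only; not data from the study.
toy_records = [
    ("facebook", "risk-promoting", "risk-promoting"),
    ("facebook", "risk-promoting", "risk-promoting"),
    ("facebook", "health-supporting", "risk-promoting"),
    ("facebook", "health-supporting", "health-supporting"),
    ("twitter", "risk-promoting", "neutral"),
    ("twitter", "health-supporting", "health-supporting"),
]

accuracy = per_class_accuracy(toy_records)
for (platform, label), value in sorted(accuracy.items()):
    print(f"{platform:9s} {label:18s} {value:.2f}")
```

Grouping by both platform and gold label, rather than reporting a single overall accuracy, is what would expose the kind of asymmetry the abstract reports, where a model is more accurate for one sentiment class on one platform than on another.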