Machine-Assisted Topic Analysis of Large-Scale Health Experience Data: Identifying Sociodemographic Differences and Evaluating Bias in Large Language Models

Journal: medRxiv
Published Date:

Abstract

Introduction: Large-scale free-text data with socio-demographic information can capture nuanced accounts of lived experience that are difficult to detect in structured measures. However, manual qualitative analysis is difficult to scale, while automated approaches may obscure subgroup variation or introduce bias. This is especially relevant for large language models (LLMs), whose use in qualitative health research is increasing despite limited evaluation in socio-demographically stratified analysis. Objectives: This study examined how socio-demographic differences in health and wellbeing experiences were manifested in a large-scale free-text dataset, and evaluated how different AI-assisted analytic approaches identified these differences. Specifically, it aimed to: (1) identify socio-demographic differences using Machine-Assisted Topic Analysis (MATA); (2) compare MATA outputs with topic modelling combined with LLM-based topic interpretation; and (3) examine potential bias in LLM-based analysis. Methods: We analysed 2,177 valid free-text responses from the UK COVID-19 Wellbeing Tracker, a longitudinal survey of adults recruited during the pandemic. Responses described factors influencing health behaviours, mood, and wellbeing over time. Data were preprocessed and stratified by gender, age, and socioeconomic status (SES). MATA combined topic modelling, using Latent Dirichlet Allocation, with human-led qualitative interpretation of topic keywords and representative responses. The same topic model outputs were then interpreted using an LLM for comparison. Potential LLM bias was assessed using a demographic label-swap crossover design, with bias evaluated through Jaccard lexical similarity, VADER sentiment, and NRC emotion analysis. Grounded Review and Assessment of Computational Evidence (GRACE) was used to evaluate the AI outputs.
Results: MATA identified meaningful socio-demographic thematic differences in pandemic-related mood and wellbeing across gender, age, and SES. Common themes included disruption, adaptation, uncertainty, routine, and the influence of work, relationships, and health on wellbeing. Male-stratified topics emphasised routines, habits, and coping with external pressures, whereas female-stratified topics were more relational and reflective, focusing on connection, isolation, family wellbeing, and anxiety. Lower SES narratives included practical strain, financial pressure, and loss of control, while higher SES narratives more often reflected adjustment, autonomy, and meaning-making. Older adults described health, gratitude, and family connection, whereas younger adults emphasised work-related stress and competing demands. LLM-based interpretation broadly reproduced the high-level subgroup patterns identified through MATA, but outputs were more generalised, less conceptually differentiated, and showed greater thematic overlap. Bias analysis showed systematic shifts in vocabulary, sentiment, and emotional tone when demographic labels were swapped, suggesting a risk of representational bias. Conclusions: MATA identified meaningful socio-demographic differences while retaining interpretative depth at scale. LLM-based topic interpretation showed utility for rapid thematic summarisation, but produced less conceptually differentiated outputs and was sensitive to demographic framing. The analysis also identified "LLM speak", where outputs appeared coherent but relied on abstract, generalised, and overlapping interpretations. Human oversight, structured qualitative appraisal, and explicit bias evaluation are necessary when using LLMs to analyse socially stratified free-text health data.
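
The Jaccard lexical similarity used in the label-swap comparison can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify tokenisation or preprocessing, so the simple regex tokeniser, the function names, and the two example outputs below are all assumptions made for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens (assumed tokeniser)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over the token sets of two texts."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    union = tokens_a | tokens_b
    if not union:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(tokens_a & tokens_b) / len(union)


# Hypothetical LLM interpretations of one topic, before and after swapping
# the demographic label in the prompt. These strings are invented examples.
original_label_output = "Routines and habits helped with coping under work pressure"
swapped_label_output = "Family connection and routines helped with coping with anxiety"

score = jaccard_similarity(original_label_output, swapped_label_output)
print(f"Jaccard similarity: {score:.2f}")
```

Under this reading, a score near 1 would mean the LLM used largely the same vocabulary regardless of the demographic label, while a lower score would indicate a vocabulary shift of the kind the bias analysis reports.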

Authors

  • Bondaronek, P.
  • Ward, E.
  • Beecham, E.
  • Zhang, E.
  • Huang, Y.
  • Ive, J.
  • Naughton, F.
  • Wu, H.
  • Vindrola-Padros, C.