Thousand Voices of Trauma: A Large-Scale Synthetic Dataset for Modeling Prolonged Exposure Therapy Conversations
Journal: arXiv
Published Date: Apr 16, 2025
Abstract
The advancement of AI systems for mental health support is hindered by
limited access to therapeutic conversation data, particularly for trauma
treatment. We present Thousand Voices of Trauma, a synthetic benchmark dataset
of 3,000 therapy conversations based on Prolonged Exposure therapy protocols
for Post-traumatic Stress Disorder (PTSD). The dataset comprises 500 unique
cases, each explored through six conversational perspectives that mirror the
progression of therapy from initial anxiety to peak distress to emotional
processing. We incorporated diverse demographic profiles (ages 18-80, mean age
49.3; 49.4% male, 44.4% female, 6.2% non-binary), 20 trauma types, and 10
trauma-related behaviors, generated using a combination of deterministic and
probabilistic methods. Analysis reveals realistic distributions of trauma types (witnessing
violence 10.6%, bullying 10.2%) and symptoms (nightmares 23.4%, substance abuse
20.8%). Clinical experts validated the dataset's therapeutic fidelity,
highlighting its emotional depth while suggesting refinements for greater
authenticity. We also developed an emotional trajectory benchmark with
standardized metrics for evaluating model responses. This privacy-preserving
dataset addresses critical gaps in trauma-focused mental health data, offering
a valuable resource for advancing both patient-facing applications and
clinician training tools.
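
The composition described above (500 cases, six perspectives per case, 3,000 conversations) can be illustrated with a minimal sketch. This is not the authors' generation pipeline: the names `ConversationSlot` and `build_slots`, the placeholder perspective labels, and the uniform age sampling are all assumptions made for illustration. Only the counts, the age range, and the gender shares come from the abstract.

```python
import random
from dataclasses import dataclass

# Figures stated in the abstract.
NUM_CASES = 500
PERSPECTIVES_PER_CASE = 6
GENDER_SHARES = {"male": 0.494, "female": 0.444, "non-binary": 0.062}

# Hypothetical placeholder labels: the abstract does not name the six perspectives.
PERSPECTIVES = [f"perspective_{i}" for i in range(1, PERSPECTIVES_PER_CASE + 1)]


@dataclass(frozen=True)
class ConversationSlot:
    """One conversation to be generated: a case viewed from one perspective."""
    case_id: int
    perspective: str
    age: int
    gender: str


def build_slots(seed: int = 0) -> list[ConversationSlot]:
    """Enumerate every (case, perspective) pair with a sampled client profile."""
    rng = random.Random(seed)
    genders = list(GENDER_SHARES)
    weights = list(GENDER_SHARES.values())
    slots = []
    for case_id in range(NUM_CASES):
        # Probabilistic step: sample the profile once per case.
        # Uniform age sampling is an assumption, not the paper's method.
        age = rng.randint(18, 80)
        gender = rng.choices(genders, weights=weights, k=1)[0]
        # Deterministic step: every case is paired with all six perspectives.
        for perspective in PERSPECTIVES:
            slots.append(ConversationSlot(case_id, perspective, age, gender))
    return slots


slots = build_slots()
print(len(slots))  # 3000
```

Sampling the profile once per case and then iterating over the perspectives keeps the client's demographics consistent across all six conversations for that case, which is one plausible reading of "deterministic and probabilistic generation methods".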