WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning
Journal: arXiv
Published Date: Jan 15, 2025
Abstract
Current speech encoding pipelines often rely on an additional text-based
language model (LM) to obtain robust representations of human communication,
even though state-of-the-art speech-to-text models often contain an LM
internally. This work proposes an approach to improve the LM within an audio
model so that the subsequent text LM is
unnecessary. We introduce WhiSPA (Whisper with Semantic and Psychological
Alignment), which leverages a novel audio training objective: contrastive loss
with a language model embedding as a teacher. Using over 500k speech segments
from mental health audio interviews, we evaluate the utility of aligning
Whisper's latent space with semantic representations from a text autoencoder
(SBERT) and lexically derived embeddings of basic psychological dimensions:
emotion and personality. Across self-supervised affective tasks and downstream
psychological tasks, WhiSPA surpasses current speech encoders, achieving an
average error reduction of 73.4% and 83.8%, respectively. WhiSPA demonstrates
that it is not always necessary to run a subsequent text LM on speech-to-text
output in order to get a rich psychological representation of human
communication.