ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer
Journal: arXiv
Published Date: Mar 26, 2025
Abstract
Text-driven speech style transfer aims to mold the intonation, pace, and
timbre of a spoken utterance to match stylistic cues from text descriptions.
Existing methods rely on large-scale neural architectures or pre-trained
language models, and their computational cost often remains high. In this paper, we
present \emph{ReverBERT}, an efficient framework for text-driven speech style
transfer that draws on the state space model (SSM) paradigm and is
loosely motivated by the image-based method of Wang and
Liu~\cite{wang2024stylemamba}. Unlike image-domain techniques, our method
operates in the speech space and integrates a discrete Fourier transform of
latent speech features to enable smooth and continuous style modulation. We
also propose a novel \emph{Transformer-based SSM} layer for bridging textual
style descriptors with acoustic attributes, dramatically reducing inference
time while preserving high-quality speech characteristics. Extensive
experiments on benchmark speech corpora demonstrate that \emph{ReverBERT}
significantly outperforms baselines in terms of naturalness, expressiveness,
and computational efficiency. We release our model and code publicly to foster
further research in text-driven speech style transfer.
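
As a rough illustration of what modulating latent speech features in the Fourier domain could look like, the sketch below applies a discrete Fourier transform along the time axis of a latent sequence, scales each frequency bin by a style gain, and transforms back. This is not the ReverBERT implementation: the function names `modulate_latents` and `interpolate_gains`, the array shapes, and the hand-written gain curves are all assumptions made for this example. In an actual system the gains would presumably be predicted from the text style descriptor.

```python
import numpy as np


def modulate_latents(latents, gains):
    """Scale each temporal-frequency bin of a latent sequence.

    latents: real array of shape (T, D), T frames by D latent channels.
    gains:   real array of shape (T // 2 + 1,), one gain per rFFT bin.
             Hypothetical stand-in for style parameters derived from text.
    """
    latents = np.asarray(latents, dtype=float)
    gains = np.asarray(gains, dtype=float)
    n_frames = latents.shape[0]
    if gains.shape != (n_frames // 2 + 1,):
        raise ValueError("expected one gain per rFFT bin")
    spectrum = np.fft.rfft(latents, axis=0)            # (T // 2 + 1, D), complex
    spectrum = spectrum * gains[:, None]               # per-bin scaling
    return np.fft.irfft(spectrum, n=n_frames, axis=0)  # back to (T, D), real


def interpolate_gains(neutral, target, alpha):
    """Blend linearly between two gain curves; alpha is expected in [0, 1]."""
    neutral = np.asarray(neutral, dtype=float)
    target = np.asarray(target, dtype=float)
    return (1.0 - alpha) * neutral + alpha * target


# Toy data: 16 frames of 4-dimensional latents (random, for illustration only).
rng = np.random.default_rng(0)
latents = rng.standard_normal((16, 4))

neutral = np.ones(9)                # identity: leaves the latents unchanged
target = np.linspace(1.0, 0.0, 9)   # hypothetical style: damp fast variation

same = modulate_latents(latents, interpolate_gains(neutral, target, 0.0))
half = modulate_latents(latents, interpolate_gains(neutral, target, 0.5))
full = modulate_latents(latents, interpolate_gains(neutral, target, 1.0))
```

Because the transform is linear, moving `alpha` from 0 to 1 moves the output continuously from the unmodified latents to the fully modulated ones, which is one simple way to read "smooth and continuous style modulation" in this sketch.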