STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation
Journal: arXiv
Published Date: Jul 9, 2025
Abstract
Recent breakthroughs in singing voice synthesis (SVS) have heightened the
demand for high-quality annotated datasets, yet manual annotation remains
prohibitively labor- and resource-intensive. Existing automatic
singing annotation (ASA) methods, meanwhile, address only isolated aspects of
the annotation pipeline. To close this gap, we present
STARS, which is, to our knowledge, the first unified framework that
simultaneously addresses singing transcription, alignment, and refined style
annotation. Our framework delivers comprehensive multi-level annotations
encompassing: (1) precise phoneme-audio alignment, (2) robust note
transcription and temporal localization, (3) expressive vocal technique
identification, and (4) global stylistic characterization including emotion and
pace. The proposed architecture processes acoustic features hierarchically
at the frame, word, phoneme, note, and sentence levels, and its novel
non-autoregressive local acoustic encoders enable structured hierarchical
representation learning. Experimental validation confirms the framework's
superior performance across multiple evaluation dimensions compared to existing
annotation approaches. Furthermore, applications in SVS training demonstrate
that models trained on STARS-annotated data achieve significantly better
perceptual naturalness and precise style control. This work not only overcomes
critical scalability challenges in the creation of singing datasets but also
pioneers new methodologies for controllable singing voice synthesis. Audio
samples are available at https://gwx314.github.io/stars-demo/.
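
To make the idea of multi-level acoustic features concrete, the following is a minimal sketch of how frame-level features could be pooled into coarser phoneme-, note-, and sentence-level representations once segment boundaries are known. It is an illustration only, not the STARS architecture: the function `pool_frames`, the use of plain average pooling, and the toy boundaries are all assumptions introduced here, and the abstract does not specify how the local acoustic encoders aggregate information.

```python
from typing import List, Sequence, Tuple


def pool_frames(
    frames: Sequence[Sequence[float]],
    segments: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Average frame-level feature vectors inside each [start, end) segment.

    Hypothetical helper: average pooling is an assumption made for this
    sketch, not the aggregation used by STARS.
    """
    pooled = []
    for start, end in segments:
        if not 0 <= start < end <= len(frames):
            raise ValueError(f"invalid segment ({start}, {end})")
        chunk = frames[start:end]
        dim = len(chunk[0])
        # Mean over the frames in this segment, one value per feature dimension.
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


# Toy input: 4 frames with 2-dimensional features (made-up values).
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]

# Assumed boundaries, given as frame indices [start, end).
phoneme_segments = [(0, 1), (1, 4)]
note_segments = [(0, 2), (2, 4)]

phoneme_feats = pool_frames(frames, phoneme_segments)
note_feats = pool_frames(frames, note_segments)
sentence_feat = pool_frames(frames, [(0, len(frames))])[0]
```

In this sketch the same frame sequence yields different representations depending on which boundaries are used, which is the basic reason a hierarchy of levels can carry complementary information (for example, phoneme timing versus note pitch versus sentence-wide emotion and pace).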