Graph Network for Sign Language Tasks
Journal: arXiv
Published Date: Apr 16, 2025
Abstract
Recent advances in sign language research have benefited from CNN-based
backbones, which are primarily transferred from traditional computer vision
tasks (e.g., object identification, image recognition). However, these CNN-based
backbones usually excel at extracting features like contours and texture, but
may struggle with capturing sign-related features. In fact, sign language tasks
require focusing on sign-related regions, including the collaboration between
different regions (e.g., the left-hand and right-hand regions) and the
effective content in a single region. To capture such region-related features,
we introduce MixSignGraph, which represents sign sequences as a group of mixed
graphs and uses three graph modules for feature extraction: the Local Sign
Graph (LSG) module, the Temporal Sign Graph (TSG) module, and the Hierarchical
Sign Graph (HSG) module. Specifically, the LSG module learns the correlation of
cross-region features within a single frame, i.e., it focuses on spatial
features. The TSG module tracks the interaction of cross-region features among
adjacent frames, i.e., it focuses on temporal features. The HSG module
aggregates same-region features from feature maps of different granularity
within a frame, i.e., it focuses on hierarchical features. In addition, to
further improve the performance of sign language tasks without gloss
annotations, we propose a simple yet counter-intuitive Text-driven CTC
Pre-training (TCP) method, which generates pseudo gloss labels from text labels
for model pre-training. Extensive experiments on five current public
sign language datasets demonstrate the superior performance of the proposed
model. Notably, our model surpasses the SOTA models on multiple sign language
tasks across several datasets, without relying on any additional cues.
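
To make the three graph types concrete, the following is a minimal sketch of how
edges might be built over region features. It is not the authors' implementation:
the abstract does not say how edges are chosen, so the k-nearest-neighbour rule on
cosine similarity, the stride-2 mapping between feature maps, and all function
names (knn_edges, local_edges, temporal_edges, hierarchical_edges) are
assumptions made purely for illustration.

```python
import numpy as np


def knn_edges(src, dst, k, exclude_self=False):
    """Link each src node to its k most cosine-similar dst nodes (assumed rule)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_n = src / (np.linalg.norm(src, axis=1, keepdims=True) + 1e-8)
    dst_n = dst / (np.linalg.norm(dst, axis=1, keepdims=True) + 1e-8)
    sim = src_n @ dst_n.T  # (num_src, num_dst) cosine similarities
    if exclude_self:
        # Only meaningful when src and dst are the same node set.
        np.fill_diagonal(sim, -np.inf)
    top = np.argsort(-sim, axis=1)[:, :k]  # k best matches per src node
    return [(i, int(j)) for i in range(src.shape[0]) for j in top[i]]


def local_edges(frame_feats, k):
    """LSG-style edges: cross-region links inside one frame (no self-loops)."""
    return knn_edges(frame_feats, frame_feats, k, exclude_self=True)


def temporal_edges(feats_t, feats_next, k):
    """TSG-style edges: regions in frame t linked to regions in frame t+1."""
    return knn_edges(feats_t, feats_next, k)


def hierarchical_edges(h, w, stride=2):
    """HSG-style edges: each fine-grid region linked to the coarse-grid region
    covering it. Assumes h and w are divisible by stride."""
    coarse_w = w // stride
    return [
        (r * w + c, (r // stride) * coarse_w + (c // stride))
        for r in range(h)
        for c in range(w)
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_a = rng.normal(size=(16, 32))  # 4x4 regions, 32-dim features
    frame_b = rng.normal(size=(16, 32))
    print(len(local_edges(frame_a, k=3)))
    print(len(temporal_edges(frame_a, frame_b, k=3)))
    print(len(hierarchical_edges(4, 4)))
```

In this sketch each edge is a (source, target) pair of flattened region indices. A
graph layer would then aggregate features along these edges; that step is omitted
here because the abstract does not describe it.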