CS-VLM: Compressed Sensing Attention for Efficient Vision-Language Representation Learning
Journal:
arXiv
Published Date:
Jun 30, 2025
Abstract
Vision-Language Models (vLLMs) have emerged as powerful architectures for
joint reasoning over visual and textual inputs, enabling breakthroughs in image
captioning, cross-modal retrieval, and multimodal dialogue. However, as these
models scale to longer video sequences and richer language descriptions, the
quadratic complexity of the standard attention mechanism presents a fundamental
computational bottleneck. This challenge is exacerbated in vLLMs, where
attention must be computed not only within modalities but also across them,
leading to prohibitive memory and latency costs. In this work, we introduce the
Compressed Sensing Attention Transformer (CSAT), a novel architecture that
reimagines attention computation through the lens of compressed sensing. By
projecting high-dimensional key and value representations into a
lower-dimensional subspace via random measurement matrices and reconstructing
the attention outputs using sparse recovery algorithms, CSAT significantly
reduces attention complexity while maintaining semantic fidelity. Applied to
vLLMs, CSAT exploits the inherent compressibility of both visual and textual
representations, which is especially evident in video, where temporal
redundancy is high, and in language, where cross-modal grounding is often
sparse. In contrast to
LLMs, which must often model entangled symbolic dependencies, vLLMs benefit
from structured sparsity in alignment and scene composition, making them
particularly well-suited to compressed attention. We provide a formal
mathematical treatment of CSAT, demonstrate its integration into
vision-language pipelines, and validate its performance on standard benchmarks,
highlighting its promise as a scalable, interpretable, and resource-efficient
solution for next-generation multimodal transformers.
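
The abstract combines two compressed sensing ingredients: a random measurement matrix that projects a signal into fewer dimensions, and a sparse recovery algorithm that reconstructs it. The NumPy sketch below illustrates only that generic primitive on a single sparse vector. It is not the CSAT algorithm: the abstract does not say which recovery method is used or along which axis keys and values are compressed, so Orthogonal Matching Pursuit (OMP), the Gaussian matrix `Phi`, the sizes `n`, `m`, `k`, and every function and variable name here are assumptions chosen for illustration.

```python
import numpy as np


def omp(Phi, y, k):
    """Orthogonal Matching Pursuit: greedily estimate a k-sparse x from y = Phi @ x.

    Illustrative stand-in for "sparse recovery"; not taken from the paper.
    """
    n = Phi.shape[1]
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        correlations = np.abs(Phi.T @ residual)
        if support:
            correlations[support] = 0.0  # never reselect a chosen column
        support.append(int(np.argmax(correlations)))
        # Least-squares fit on the chosen columns, then refresh the residual.
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x_hat = np.zeros(n)
    x_hat[support] = coef
    return x_hat


rng = np.random.default_rng(0)
n, m, k = 128, 64, 3  # signal length, number of measurements, sparsity (assumed)

# A k-sparse signal whose nonzero entries have magnitude between 1 and 2.
x = np.zeros(n)
idx = rng.choice(n, size=k, replace=False)
x[idx] = rng.choice([-1.0, 1.0], size=k) * (1.0 + rng.random(k))

# Random Gaussian measurement matrix: compresses n values into m < n.
Phi = rng.standard_normal((m, n)) / np.sqrt(m)
y = Phi @ x

x_hat = omp(Phi, y, k)
print("max abs recovery error:", np.max(np.abs(x_hat - x)))
```

In this toy setting the signal is exactly sparse and noiseless, which is the most favorable case for recovery. Attention outputs in a real model would be only approximately sparse at best, so any practical variant would need to tolerate approximation error.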