Leveraging State Space Models in Long Range Genomics
Journal: arXiv
Published Date: Apr 7, 2025
Abstract
Long-range dependencies are critical for understanding genomic structure and
function, yet most conventional methods struggle with them. Widely adopted
transformer-based models, while excelling at short-context tasks, are limited
by the attention module's computational cost, which grows quadratically with
sequence length, and by their inability to extrapolate to sequences longer
than those seen in training. In this work, we
explore State Space Models (SSMs) as a promising alternative by benchmarking
two SSM-inspired architectures, Caduceus and Hawk, on long-range genomics
modeling tasks under conditions comparable to those of a 50M-parameter
transformer baseline. We find that SSMs match transformer performance and exhibit
impressive zero-shot extrapolation across multiple tasks, handling contexts 10
to 100 times longer than those seen during training, indicating more
generalizable representations better suited for modeling the long and complex
human genome. Moreover, we demonstrate that these models can efficiently
process sequences of 1M tokens on a single GPU, allowing for modeling entire
genomic regions at once, even in labs with limited compute. Our findings
establish SSMs as efficient and scalable for long-context genomic analysis.
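
As a rough illustration of the scaling argument above, the following toy sketch contrasts the number of pairwise scores a full self-attention head would compute with a scalar linear state-space recurrence that carries a fixed-size state. This is not the Caduceus or Hawk architecture: the function names and the parameters a, b, and c are hypothetical and chosen only for illustration, and real SSM layers use learned, vector-valued states.

```python
def attention_score_count(seq_len: int) -> int:
    """Pairwise scores in one full self-attention head: seq_len * seq_len."""
    return seq_len * seq_len


def linear_recurrence(xs, a=0.9, b=0.1, c=1.0):
    """Toy scalar state-space recurrence (illustrative assumption, not a real layer).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    The state h is a single float, so memory does not grow with sequence
    length, and the same parameters apply to inputs of any length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # one constant-cost update per token
        ys.append(c * h)
    return ys


# The same parameters run unchanged on a sequence 100 times longer.
short_out = linear_recurrence([1.0] * 10)
long_out = linear_recurrence([1.0] * 1000)

# Full attention at 1M tokens would need 10**12 pairwise scores per head,
# while the recurrence performs one state update per token.
scores_at_1m = attention_score_count(1_000_000)
```

In this sketch the outputs on the first ten tokens are identical whether the input has 10 or 1000 tokens, which is the sense in which a recurrence is not tied to a fixed context length. Whether a trained model actually generalizes to longer contexts is an empirical question, which is what the benchmarks in this work examine.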