Hierarchical refinements of cis-regulatory inputs improve scalable gene expression prediction
Journal:
bioRxiv
Published Date:
Jun 2, 2026
Abstract
Deciphering the relationships between cis-regulatory elements (CREs) and target gene expression has long been a challenging problem in molecular biology. However, predicting gene expression from hundreds of candidate cis-regulatory elements (cCREs) requires models that scale to long, noisy inputs while retaining interpretable regulatory structure. Existing Transformer-based approaches typically attend over all nucleotides and all surrounding cCREs, diluting causal signals when hundreds of elements compete for limited model capacity. Here we introduce a two-stage selective framework (TSSF) that performs hierarchical refinements: nucleotide-level masking within each cCRE, followed by cCRE-level selection around each gene, implemented with information-bottleneck priors and a fully Transformer-based architecture. Across 70 human cell types and tissues, TSSF and lightweight variants improve expression prediction and enhancer-gene prioritization relative to strong baselines, including on cross-cell-line and cell-type-specific benchmarks. Prediction-stratified analysis motivates a distance-decay prior that aligns attention with long-range regulatory geometry, and chromatin-contact augmentation improves recovery of distal links. Motif analyses of high-confidence predictions recover proximal and distal regulatory programs, supporting mechanistic interpretability. TSSF offers a general strategy for scalable, interpretable modeling of high-dimensional regulatory inputs in genomics.