Puget predicts gene expression across cell types using sequence and 3D chromatin organization data

Journal: bioRxiv
Published Date:

Abstract

Gene expression is governed by both linear DNA sequence and three-dimensional (3D) chromatin architecture. Most gene expression prediction models rely on sequence alone, thereby failing to capture structural context and to generalize to unseen cell types. We present Puget, a deep learning model that predicts cell type-specific gene expression from sequence and Hi-C data, which captures 3D chromatin organization. Puget pairs pretrained sequence and Hi-C encoders with a lightweight transformer decoder. Using paired Hi-C/RNA-seq from 36 human and 4 mouse biosamples, we evaluate Puget’s ability to generalize to held-out genes, held-out biosamples, and from human to mouse. Relative to a sequence-only baseline, Puget improves cross-biosample Pearson correlation by up to 25% on highly variable genes in training biosamples and, unlike the sequence-only model, generalizes to held-out biosamples and across species. In addition, in silico perturbation experiments show that Puget can prioritize experimentally validated enhancer-gene pairs. Together, these results highlight a generalizable approach for modeling gene expression from sequence and 3D chromatin organization.
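
To make the reported metric concrete, the following is a minimal sketch of one way a cross-biosample Pearson correlation could be computed. It assumes that the metric is calculated per gene, by correlating predicted and observed expression across biosamples, and then averaged over genes; the abstract does not spell out this procedure, so the aggregation step is an assumption. The function names (`pearson`, `cross_biosample_pearson`), the gene labels, and the toy values are hypothetical illustrations and are not taken from Puget.

```python
from math import sqrt, isnan


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        # Correlation is undefined when either vector is constant.
        return float("nan")
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(sxx * syy)


def cross_biosample_pearson(observed, predicted):
    """Per-gene correlation across biosamples, plus the mean over genes.

    observed, predicted: dict mapping gene -> list of expression values,
    one value per biosample, in the same biosample order (assumed layout).
    Genes with undefined correlation are excluded from the mean.
    """
    per_gene = {g: pearson(observed[g], predicted[g]) for g in observed}
    valid = [r for r in per_gene.values() if not isnan(r)]
    mean_r = sum(valid) / len(valid) if valid else float("nan")
    return per_gene, mean_r


# Toy data: two hypothetical genes measured in three hypothetical biosamples.
observed = {"GENE_A": [1.0, 2.0, 3.0], "GENE_B": [3.0, 1.0, 2.0]}
predicted = {"GENE_A": [2.0, 4.0, 6.0], "GENE_B": [2.0, 1.0, 3.0]}

per_gene, mean_r = cross_biosample_pearson(observed, predicted)
print(per_gene, mean_r)
```

Under this reading, a gene whose expression is constant across biosamples contributes no signal, which is consistent with the abstract reporting the metric on highly variable genes.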

Authors

  • Shengqi Hang; Xiao Wang; Ghulam Murtaza; Anupama Jha; Bo Wen; Tangqi Fang; Justin Sanders; Sheng Wang; William Stafford Noble