Integrating histology and spatial transcriptomics via multimodal transformers and contrastive representation learning for accurate gene expression prediction.
Journal:
Journal of Biomedical Informatics
Published Date:
Feb 26, 2026
Abstract
Predicting spatial gene expression from histological images is a fundamental task in understanding tissue organization and molecular phenotypes. However, existing methods often rely on single-model representations or lack effective alignment between image and transcriptomic features. To address these limitations, we propose MViTGene, a unified multimodal learning framework that integrates histological imaging and spatial transcriptomics through a shared latent representation space. Specifically, H&E-stained histological images are encoded by a ResNet50-based convolutional stem and a MobileViT Transformer backbone to extract hierarchical visual representations. Both modalities are projected into a shared latent space via linear-GELU-dropout transformation blocks, enabling cross-modal alignment through a contrastive learning objective that maximizes agreement between corresponding image and spot embeddings. Experimental results on the 10x Genomics Visium dataset of human liver tissue demonstrate that MViTGene achieves significantly higher prediction accuracy than existing methods across multiple gene subsets, with improvements of 20%, 33%, and 12% in predicting marker genes, highly expressed genes, and highly variable genes, respectively. The significant improvement in relevance indicates that the model captures the true correspondence between tissue morphology and gene expression more accurately, thereby enabling more reliable biological interpretation. MViTGene provides a computational tool for high-throughput spatial gene expression prediction that balances performance and interpretability.
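The projection and alignment step described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact form of the contrastive objective, so this sketch assumes a CLIP-style symmetric InfoNCE loss over a batch of matched image/spot pairs. The function names, feature dimensions, temperature, and random inputs are hypothetical and are not taken from the paper. The random features stand in for the outputs of the image encoder and for the spot expression vectors.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def projection_block(x, weight, bias, drop_prob=0.0, rng=None):
    """Linear -> GELU -> dropout (inverted dropout, skipped when drop_prob == 0)."""
    h = gelu(x @ weight + bias)
    if drop_prob > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        keep_mask = rng.random(h.shape) >= drop_prob
        h = h * keep_mask / (1.0 - drop_prob)
    return h


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, spot_emb, temperature=0.1):
    """Assumed CLIP-style loss: row i of each input is a matched image/spot pair."""
    img = l2_normalize(image_emb)
    spot = l2_normalize(spot_emb)
    logits = img @ spot.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])             # matched pairs sit on the diagonal
    loss_image_to_spot = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_spot_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_spot + loss_spot_to_image)


# Hypothetical sizes: 8 spots, 64-d image features, 32 genes, 16-d shared space.
rng = np.random.default_rng(0)
batch, img_dim, gene_dim, latent_dim = 8, 64, 32, 16
image_features = rng.normal(size=(batch, img_dim))
spot_expression = rng.normal(size=(batch, gene_dim))

w_img = rng.normal(scale=0.1, size=(img_dim, latent_dim))
w_spot = rng.normal(scale=0.1, size=(gene_dim, latent_dim))
b_img = np.zeros(latent_dim)
b_spot = np.zeros(latent_dim)

image_emb = projection_block(image_features, w_img, b_img, drop_prob=0.1, rng=rng)
spot_emb = projection_block(spot_expression, w_spot, b_spot, drop_prob=0.1, rng=rng)

print("contrastive loss:", symmetric_contrastive_loss(image_emb, spot_emb))
```

Under this assumed objective, the loss is small when each image embedding is most similar to its own spot embedding and grows when the pairing is scrambled, which is the sense in which training "maximizes agreement" between corresponding embeddings.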