Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification
Journal:
arXiv
Published Date:
Feb 11, 2025
Abstract
The interactions between DNA, RNA, and proteins are fundamental to biological
processes, as illustrated by the central dogma of molecular biology. While
modern biological pre-trained models have achieved great success in analyzing
these macromolecules individually, their interconnected nature remains
under-explored. In this paper, we follow the guidance of the central dogma to
redesign both the data and model pipeline and offer a comprehensive framework,
Life-Code, that spans different biological functions. As for data flow, we
propose a unified pipeline to integrate multi-omics data by
reverse-transcribing RNA and reverse-translating amino acids into
nucleotide-based sequences. As for the model, we design a codon tokenizer and a
hybrid long-sequence architecture to encode the interactions of both coding and
non-coding regions with masked modeling pre-training. To model the translation
and folding process with coding sequences, Life-Code learns protein structures
of the corresponding amino acids by knowledge distillation from off-the-shelf
protein language models. Such designs enable Life-Code to capture complex
interactions within genetic sequences, providing a more comprehensive
understanding of multi-omics with the central dogma. Extensive Experiments show
that Life-Code achieves state-of-the-art performance on various tasks across
three omics, highlighting its potential for advancing multi-omics analysis and
interpretation.