A transformer-based language model reveals developmental constraint and network complexity during zebrafish embryogenesis

Journal: bioRxiv
Published Date:

Abstract

Understanding how regulatory complexity and constraint shape organismal development remains a central challenge in biology. The developmental hourglass framework posits that mid-embryogenesis, the phylotypic stage, is a period of heightened conservation and regulatory organization. We test this using Zebraformer, a transformer-based language model trained on single-cell transcriptomic data from zebrafish embryos. Unlike models focused on prediction or classification, Zebraformer learns context-sensitive representations of genes and cells that encode temporal progression, anatomical identity, and regulatory relationships. Embeddings reflect differentiation timing and transcriptional divergence, while attention-derived gene networks reveal a transient rise in complexity during the phylotypic stage. This stage also exhibits increased perturbation sensitivity and a shift toward centralized, modular network topology. These features are supported by graph-theoretic metrics and gene ontology enrichment, offering data-driven evidence for highly structured regulation during mid-embryogenesis. Our results demonstrate that language models can extract interpretable biological structure and support longstanding developmental theory from high-dimensional data.

Understanding how cells coordinate to build a complex organism remains a central challenge in biology. Development is not only genetically encoded but also context-dependent, shaped by dynamic interactions among genes, cells, and time. Here, we use a transformer-based language model, Zebraformer, trained on single-cell gene expression data from zebrafish embryos, to investigate how regulatory structure evolves during development. The model captures key features of organismal formation: increasing transcriptional divergence, anatomical specificity, and a transient rise in regulatory complexity and perturbation sensitivity during the conserved phylotypic stage. These findings provide data-driven support for the developmental hourglass hypothesis and demonstrate that contextual models can uncover fundamental organizational principles from biological data alone.

Authors

  • Juan F Poyatos