Combining transcriptomic resolutions and machine learning strategies uncovers new OXPHOS genes in Caenorhabditis elegans

Journal: bioRxiv
Published Date:

Abstract

Assigning functions to genes is still a major challenge in biology: a large fraction of genes remain unannotated despite the availability of complete genomes. Oxidative phosphorylation (OXPHOS), the primary source of ATP in eukaryotes, exemplifies this gap. Although it has been extensively studied in mammals, our understanding of this process in other lineages remains limited. Research in other organisms has generally relied on identifying sequence homologs of genes previously characterized in mammals. While this strategy has enabled the inference of certain conserved functions, it may overlook genes that play key roles but lack detectable homology. This highlights the need for alternative approaches, such as the integration of transcriptomic data, to better understand the specific features and adaptations of this process across evolutionary lineages. Caenorhabditis elegans provides a powerful framework to address this problem, combining conserved mitochondrial pathways with extensive transcriptomic resources. Studying this organism also has translational relevance for parasitic helminths, in which OXPHOS is a promising therapeutic target. We hypothesized that genes involved in OXPHOS share transcriptional signatures that can be exploited for functional prediction. Using a curated set of 65 well-established OXPHOS genes, we applied two complementary machine learning strategies to identify new candidates. First, we trained an ensemble of supervised learning models on a time-resolved bulk RNA-seq transcriptome of C. elegans. To address uncertainty in functional annotations, we implemented a novel informed bagging strategy combined with a two-round training scheme, in which weak positives were initially excluded and subsequently incorporated based on model predictions. In parallel, we performed cluster-based functional inference using embryonic and adult single-cell RNA-seq datasets.
Integrating both approaches produced a list of candidate genes supported by strong predictive performance on an independent evaluation set. Several candidates lack prior functional annotation. A strain mutant for ril-1, one of the highly supported predictions, showed lower respiration rates than the wild-type strain. Our results highlight the value of integrating biological priors, complementary learning paradigms, and multi-resolution transcriptomic data to enable systematic gene function discovery.
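
The two-round scheme mentioned in the abstract can be illustrated with a toy sketch. The abstract does not specify the models, the sampling rule, or the promotion criterion, so everything below is an assumption made for illustration: a nearest-centroid scorer stands in for the supervised ensemble, each bag draws provisional negatives from unlabeled genes while never drawing weak positives, and weak positives are promoted when their round-one score reaches a hypothetical threshold of 0.5. Gene names and expression values are invented.

```python
import random
from statistics import mean

def centroid(vectors):
    """Component-wise mean of equal-length expression vectors."""
    return [mean(col) for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bagged_scores(expr, positives, excluded, n_bags=25, seed=0):
    """Fraction of bags in which a gene lies closer to the positive centroid.

    Assumption: each bag pairs all positives with an equally sized random
    sample of unlabeled genes used as provisional negatives. Genes in
    `excluded` are never drawn as negatives, and only genes outside the
    bag's training set are scored.
    """
    rng = random.Random(seed)
    pool = sorted(g for g in expr if g not in positives and g not in excluded)
    pos_centroid = centroid([expr[g] for g in positives])
    votes = {g: [] for g in expr}
    for _ in range(n_bags):
        negatives = set(rng.sample(pool, min(len(positives), len(pool))))
        neg_centroid = centroid([expr[g] for g in negatives])
        for gene, vec in expr.items():
            if gene in negatives or gene in positives:
                continue  # skip genes used for training in this bag
            closer = sq_dist(vec, pos_centroid) < sq_dist(vec, neg_centroid)
            votes[gene].append(1.0 if closer else 0.0)
    return {g: mean(v) for g, v in votes.items() if v}

def two_round(expr, strong, weak, threshold=0.5, n_bags=25, seed=0):
    """Round 1 withholds weak positives; round 2 adds those scoring >= threshold."""
    strong, weak = set(strong), set(weak)
    round1 = bagged_scores(expr, strong, excluded=weak, n_bags=n_bags, seed=seed)
    promoted = {g for g in weak if round1.get(g, 0.0) >= threshold}
    round2 = bagged_scores(expr, strong | promoted, excluded=weak - promoted,
                           n_bags=n_bags, seed=seed + 1)
    return promoted, round2

# Invented toy data: three "time points" per gene.
expr = {
    "pos1": [5.0, 5.1, 4.9], "pos2": [5.2, 4.8, 5.0], "pos3": [4.9, 5.0, 5.1],
    "weak_good": [5.1, 5.0, 4.8], "weak_bad": [0.2, 0.1, 0.0],
    "cand": [4.8, 5.2, 5.0],
}
for i in range(12):  # background genes with low, flat expression
    expr[f"bg{i}"] = [0.1 * (i % 3), 0.1 * ((i + 1) % 3), 0.1 * ((i + 2) % 3)]

promoted, scores = two_round(expr, ["pos1", "pos2", "pos3"],
                             ["weak_good", "weak_bad"])
print("promoted weak positives:", sorted(promoted))
print("top candidates:", sorted(scores, key=scores.get, reverse=True)[:3])
```

In this sketch, withholding weak positives in round one keeps uncertain labels out of both the positive and the negative training sets, so they are scored like any other unlabeled gene before a decision is made about using them for training.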

Authors

  • Zeballos-Goron, S.
  • Salinas, G.
  • Pazos Obregon, F.

Categories