Enhancing Downstream Analysis in Genome Sequencing: Species Classification While Basecalling
Journal:
arXiv
Published Date:
Apr 9, 2025
Abstract
The ability to quickly and accurately identify microbial species in a sample,
known as metagenomic profiling, is critical across various fields, from
healthcare to environmental science. This paper introduces a novel method to
profile signals coming from sequencing devices in parallel with determining
their nucleotide sequences, a process known as basecalling, via a
multi-objective deep neural network for simultaneous basecalling and
multi-class genome classification. We introduce a new loss strategy where
losses for basecalling and classification are back-propagated separately, with
model weights combined for the shared layers, and a pre-configured ranking
strategy allowing top-K species accuracy, giving users flexibility to choose
between higher accuracy and higher speed when identifying species. We achieve
state-of-the-art basecalling accuracy, while classification accuracy matches
or exceeds that of state-of-the-art binary classifiers, attaining an
average of 92.5%/98.9% accuracy at identifying the top-1/3 species among a
total of 17 genomes in the Wick bacterial dataset. The work presented here has
implications for future studies in metagenomic profiling by accelerating the
bottleneck step of matching the DNA sequence to the correct genome.
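
To make the top-1/3 figures concrete, the following is a minimal sketch of how top-K species accuracy is typically computed from per-read class scores. It is not the paper's implementation: the function names `top_k_species` and `top_k_accuracy`, and the assumption that the classifier emits one score per candidate genome for each read, are illustrative only.

```python
from typing import List, Sequence


def top_k_species(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring classes, best first.

    `scores` is assumed (hypothetically) to hold one classifier score per
    candidate genome for a single read.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def top_k_accuracy(
    all_scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of reads whose true genome index is among the top-k predictions."""
    if len(all_scores) != len(labels):
        raise ValueError("all_scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one read is required")
    hits = sum(
        1
        for scores, label in zip(all_scores, labels)
        if label in top_k_species(scores, k)
    )
    return hits / len(labels)
```

With K = 1 a read counts as correct only if the highest-scoring genome is the true one; with K = 3 it counts if the true genome is among the three highest-scoring candidates, which is why top-3 accuracy is never lower than top-1 accuracy on the same reads.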