AIEdit: Alignment-free genome assembly polisher trained on spaced seed match patterns.

Journal: PLOS Computational Biology
Published Date:

Abstract

Polishing, the process of correcting base-level errors in genome assemblies, is a critical step for ensuring accuracy in downstream analyses such as variant calling, gene annotation, and clinical genomics applications. While recent advances in long-read sequencing technologies have improved assembly contiguity and genome completeness, maintaining high base-level accuracy in those genomes remains challenging because certain long-read sequencing technologies still have appreciable error rates. Existing polishing approaches face notable trade-offs: alignment-based methods achieve high accuracy but incur long run times, alignment-free k-mer-based tools are scalable but struggle in regions with dense errors, and machine learning-based polishers often perform well only on specific platforms and require read-to-assembly alignments. We present AIEdit, a machine learning-based polisher designed to operate alignment-free and to generalize across sequencing platforms while remaining computationally efficient. We developed AIEdit by combining spaced seed matching with a neural network trained to detect and correct dense error patterns in an alignment-free manner. We benchmarked the method on simulated and experimental DNA sequencing data. On simulated human long-read assemblies with high error rates, AIEdit reduced the error rate by 58%, compared with 21% for ntEdit, and completed in 2.7 hours using 230 GB of memory. This was faster than POLCA and Medaka (multi-day run times) and used 3× less memory than JASPER (689 GB). On experimental Oxford Nanopore Technologies (ONT) data from the NA24385 human genome, AIEdit increased the Merqury quality value (QV) from 28.7 to 32.9 in 9.5 hours, achieving accuracy comparable to Medaka (QV 32.7) in a fraction of Medaka's run time (1.5+ days) and outperforming the k-mer-based tools ntEdit (QV 31.0) and JASPER (QV 31.7). Overall, AIEdit enables scalable and accurate genome polishing across diverse datasets.
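
To make the idea of spaced seed match patterns more concrete, the following is a minimal Python sketch of how a spaced seed can be matched between reads and a draft assembly without alignment. It is an illustration only and not AIEdit's actual implementation: the seed pattern `1101011`, the function names, and the toy sequences are all hypothetical. A `1` marks a position that must match and a `0` marks a position that is ignored, so a window can still match the reads when an error falls on an ignored position. The resulting per-window match/mismatch profile is the kind of signal a classifier could, in principle, take as input.

```python
def spaced_seed_key(window: str, pattern: str) -> str:
    """Keep only the bases at 'care' positions (where pattern is '1')."""
    return "".join(base for base, flag in zip(window, pattern) if flag == "1")


def index_reads(reads, pattern):
    """Collect the masked keys of every window of every read into a set."""
    span = len(pattern)
    keys = set()
    for read in reads:
        for i in range(len(read) - span + 1):
            keys.add(spaced_seed_key(read[i:i + span], pattern))
    return keys


def match_profile(assembly, read_keys, pattern):
    """For each assembly window, record whether its masked key occurs in the reads."""
    span = len(pattern)
    return [
        spaced_seed_key(assembly[i:i + span], pattern) in read_keys
        for i in range(len(assembly) - span + 1)
    ]


# Hypothetical seed pattern and toy data, chosen only for illustration.
PATTERN = "1101011"
reads = ["ACGTACGTAC"]
assembly_ok = "ACGTACGTAC"
assembly_err = "ACGTTCGTAC"  # substitution A->T at index 4

read_keys = index_reads(reads, PATTERN)
profile_ok = match_profile(assembly_ok, read_keys, PATTERN)
profile_err = match_profile(assembly_err, read_keys, PATTERN)
print(profile_ok)
print(profile_err)
```

In this toy case the error-free assembly matches in every window, while the assembly with a substitution yields the profile `[True, False, True, False]`: the windows in which the erroneous base falls on an ignored position still match, and the others do not. A real polisher would need to handle reverse complements, indels, sequencing depth, and memory-efficient storage of seeds, none of which this sketch attempts.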

Authors
