A universal indel filtering workflow for both long-read and short-read NGS data.

Journal: BMC Research Notes
Published Date:

Abstract

Accurate detection of insertions and deletions (indels) is critical for applications in disease genomics, population genetics, and personalized healthcare. Despite advances in sequencing technologies, indel detection remains challenging, particularly in difficult-to-map genomic regions. In this study, we present a universal machine learning-based filtering workflow that significantly improves indel detection accuracy for both long-read and short-read sequencing data. The workflow uses only publicly available genomic annotation datasets and therefore requires no sequencing workflow-specific information, such as read depth. Our method, a gradient-boosting classifier built on XGBoost, improves precision by ~26% for long-read and ~24% for short-read data while maintaining high recall (~90%). We validate our approach using the Genome in a Bottle (GIAB) dataset and 62 indel call sets from the precisionFDA Truth Challenge V2, demonstrating its effectiveness in handling complex genomic regions and diverse sequencing workflows. Our tool is open-access and workflow-agnostic, making it a valuable resource for improving indel calling accuracy across various applications.
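
The precision and recall trade-off described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The `IndelCall` type, the `score` field (standing in for a classifier's predicted probability that a call is real), the 0.5 threshold, and the toy call set are all assumptions made for illustration, and the resulting numbers are unrelated to the reported results. Recall here is measured only against the true indels present in the unfiltered toy call set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndelCall:
    """One candidate indel (hypothetical structure, for illustration only)."""
    chrom: str
    pos: int
    score: float   # assumed classifier probability that the call is real
    is_true: bool  # toy truth label, e.g. from comparison with a truth set


def precision_recall(kept, total_true):
    """Precision over the kept calls; recall against total_true true indels."""
    tp = sum(1 for c in kept if c.is_true)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / total_true if total_true else 0.0
    return precision, recall


def filter_calls(calls, threshold=0.5):
    """Keep calls whose score reaches the (assumed) threshold."""
    return [c for c in calls if c.score >= threshold]


# Toy call set: 4 true indels and 4 false positives (made-up values).
calls = [
    IndelCall("chr1", 1000, 0.95, True),
    IndelCall("chr1", 2000, 0.90, True),
    IndelCall("chr1", 3000, 0.80, True),
    IndelCall("chr1", 4000, 0.40, True),   # true indel removed by the filter
    IndelCall("chr2", 1500, 0.70, False),  # false positive that survives
    IndelCall("chr2", 2500, 0.30, False),
    IndelCall("chr2", 3500, 0.20, False),
    IndelCall("chr2", 4500, 0.10, False),
]

total_true = sum(1 for c in calls if c.is_true)
before = precision_recall(calls, total_true)
kept = filter_calls(calls)
after = precision_recall(kept, total_true)

print("before filtering (precision, recall):", before)
print("after filtering  (precision, recall):", after)
```

In this toy case, filtering removes three of the four false positives at the cost of one true indel, so precision rises while recall falls. Choosing the threshold is how a filter of this kind balances the two.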

Authors
