Gsformer: a dual-architecture deep learning framework with CNN-self-attention and sparse-attention for genomic selection.

Journal: Genetics, selection, evolution : GSE

Abstract

BACKGROUND: Genomic selection (GS) has revolutionized modern breeding by utilizing genome-wide single nucleotide polymorphisms (SNPs). While traditional models such as GBLUP and Bayesian approaches remain prevalent, several deep learning approaches have recently been introduced for plant GS, demonstrating superior predictive performance. Here, we introduce Gsformer, a novel deep learning framework designed to predict phenotypes by modeling complex genetic architectures. It features two distinct architectures: CSA, which combines convolutional neural networks (CNNs) with self-attention to capture local and long-range genomic dependencies, and NSA, which employs a native sparse attention mechanism to enhance computational efficiency by focusing on the most informative features. We evaluated Gsformer on six datasets spanning animal and plant species (pig, cattle, chicken, mouse, wheat, and maize) and compared its phenotypic prediction performance against five established GS methods: DNNGP, MLP, LightGBM, SVR, and GBLUP.

RESULTS: Gsformer generally ranked among the top two models across six diverse animal and plant genomic prediction datasets. Specifically, Gsformer-CSA yielded notable improvements in predicting cattle fat percentage, while Gsformer-NSA was more accurate in predicting chicken first egg weight, pig age at 100 kg body weight, and mouse anxiety. With the topN hyperparameter set to 20%, Gsformer-NSA matched or marginally exceeded Gsformer-CSA for most traits, though it showed lower accuracy for a subset of traits. Adjusting the topN value further enhanced Gsformer-NSA's performance, allowing it to match that of Gsformer-CSA. Ablation studies confirmed the complementary roles of CNN and self-attention modules in the CSA architecture. To enhance interpretability, we applied SHAP (SHapley Additive exPlanations) to identify influential SNPs and annotate candidate genes associated with growth and body size traits in pigs. Functional enrichment analysis revealed biologically relevant pathways involved in nervous system development, glycolytic process regulation, and digestive tract morphogenesis.

CONCLUSIONS: In summary, Gsformer establishes a flexible and powerful framework for genomic prediction, demonstrating broad applicability across both animal and plant breeding. Owing to its lower computational cost, Gsformer-NSA is recommended over Gsformer-CSA in scenarios where the minor sacrifice in prediction accuracy is acceptable.
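
The abstract does not spell out how the NSA architecture uses its topN hyperparameter, so the following is only an illustrative sketch of one common form of top-N sparse attention: a query scores every key, keeps the highest-scoring fraction (20% in the abstract's default setting), and applies softmax over the kept positions alone. The function name `top_n_sparse_attention`, its signature, and the selection rule are assumptions made for illustration and are not taken from the Gsformer implementation.

```python
import math


def top_n_sparse_attention(query, keys, values, top_n=0.2):
    """Hypothetical sketch: attend only to the top_n fraction of keys.

    query  : list of floats, length d
    keys   : list of key vectors, each of length d
    values : list of value vectors, one per key
    top_n  : fraction of keys to keep (assumed meaning of "topN")
    """
    d = len(query)
    # Scaled dot-product score of the query against every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]

    # Keep at least one key; all other positions are dropped entirely.
    n_keep = max(1, math.ceil(top_n * len(keys)))
    kept = sorted(range(len(keys)), key=lambda i: scores[i],
                  reverse=True)[:n_keep]

    # Softmax over the kept scores only (max subtracted for stability).
    peak = max(scores[i] for i in kept)
    exps = {i: math.exp(scores[i] - peak) for i in kept}
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}

    # Weighted sum of the kept value vectors.
    dim = len(values[0])
    output = [sum(weights[i] * values[i][j] for i in kept)
              for j in range(dim)]
    return output, weights


if __name__ == "__main__":
    # Toy example: five SNP-like tokens, keep the top 40% (two tokens).
    query = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5], [2.0, 0.0]]
    values = [[1.0], [2.0], [3.0], [4.0], [5.0]]
    out, w = top_n_sparse_attention(query, keys, values, top_n=0.4)
    print(sorted(w), out)
```

Under this assumed scheme, the softmax and weighted sum run over only the kept fraction of positions rather than all of them, which is one plausible source of the efficiency gain the abstract attributes to NSA. Raising `top_n` trades that saving for wider context, in line with the reported effect of tuning topN.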
