Benchmarking reveals the superiority of nucleic acid foundation models in predicting lncRNA coding potential.
Journal:
Genome Biology
Published Date:
Jun 4, 2026
Abstract
BACKGROUND: A subset of long noncoding RNAs (lncRNAs) contains short open reading frames and can encode functional micropeptides. However, identifying these coding lncRNAs (codlncRNAs) remains challenging due to weak coding signals, short peptide products, and heterogeneous evidence across databases. Existing computational tools lack unified benchmarks, and the utility of nucleic acid foundation models for this task remains unclear. RESULTS: We construct the first multi-species, evidence-stratified benchmark for codlncRNA prediction and systematically characterize codlncRNAs across molecular dimensions. CodlncRNAs consistently exhibit transitional features between mRNAs and untranslated lncRNAs in sequence, structural, and physicochemical properties. Using this benchmark, we evaluate 12 classical tools and 4 foundation models. Classical methods show limited zero-shot performance, whereas RNA-FM, RiNALMo, and DNABERT-2 achieve substantial gains after fine-tuning. Notably, DNABERT-2, trained solely on DNA, performs competitively with, or even better than, RNA-specific models. An ensemble framework integrating foundation and classical models further improves robustness and has been deployed as an accessible web server. CONCLUSIONS: Our study establishes the first benchmark for codlncRNA prediction, delineates the distinctive transitional molecular profile of codlncRNAs, and supports the utility of nucleic acid foundation models for this task within the current benchmark scope. Moreover, the proposed framework provides a practical, scalable computational foundation for micropeptide discovery and RNA functional characterization.
Authors