FveOCRs - a database for open chromatin prediction in wild strawberry based on a large language model
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
Woodland strawberry (Fragaria vesca) is a widely used model system for molecular genetic studies of cultivated strawberry and other Rosaceae. Nevertheless, few databases are available for identifying its cis-regulatory elements. With the emergence of large language models in plant research, strawberry research could benefit significantly from applying these models. One such model, derived from Plant DNA Large Language Models (PDLLMs), effectively predicts open chromatin regions (OCRs), which are nucleosome-depleted regions typically associated with cis-regulatory elements. However, the model's accessibility and utility need to be improved before it can be readily applied in strawberry and other plant species. We developed the FveOCRs database (http://liulab.online:33838/FveOCRLiuLabTestVer25), which provides predicted open chromatin regions within the 5,000 bp upstream sequences, the first and second introns and exons, and the longest introns of Fragaria vesca genes. Predictions were generated using a sliding-window approach based on one of the PDLLMs. Users can also submit DNA sequences from any plant species and obtain graphs showing the precise positions of the predicted open chromatin regions. To our knowledge, FveOCRs is the first database for open chromatin region prediction in Fragaria vesca based on a Plant DNA Large Language Model. The resource and its accessible visual output will facilitate the discovery of cis-regulatory elements and the engineering of gene expression in strawberry and other plant species.
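The abstract does not spell out the sliding-window procedure, so the following Python sketch only illustrates the general idea: tile a sequence with overlapping windows, score each window, and merge positive windows into regions. The window size (512 bp), step (256 bp), threshold (0.5), and all function names are assumptions made for illustration. `toy_score` is a hypothetical stand-in for the PDLLM-derived classifier, not the model used by FveOCRs.

```python
from typing import Callable, List, Tuple


def sliding_windows(seq: str, window: int, step: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) coordinates of windows tiling seq."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(seq) <= window:
        # A short sequence is scored as a single window.
        return [(0, len(seq))] if seq else []
    starts = list(range(0, len(seq) - window + 1, step))
    # Add a final window so the tail of the sequence is always covered.
    if starts[-1] + window < len(seq):
        starts.append(len(seq) - window)
    return [(s, s + window) for s in starts]


def predict_open_regions(
    seq: str,
    score_window: Callable[[str], float],
    window: int = 512,       # assumed value, not taken from the paper
    step: int = 256,         # assumed value, not taken from the paper
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[Tuple[int, int]]:
    """Score every window and merge overlapping or adjacent positive windows."""
    regions: List[Tuple[int, int]] = []
    for start, end in sliding_windows(seq, window, step):
        if score_window(seq[start:end]) >= threshold:
            if regions and start <= regions[-1][1]:
                # Extend the previous region instead of opening a new one.
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions


def toy_score(subseq: str) -> float:
    """Placeholder scorer (A/T fraction); NOT the PDLLM classifier."""
    if not subseq:
        return 0.0
    s = subseq.upper()
    return (s.count("A") + s.count("T")) / len(s)
```

In practice, `toy_score` would be replaced by a call to the trained model that returns the probability that a window is open chromatin. The merged coordinates are half-open intervals relative to the start of the input sequence, which is the kind of output a per-gene plot of predicted regions could be drawn from.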