PlantBert: An Open Source Language Model for Plant Science
Journal:
arXiv
Published Date:
Jun 10, 2025
Abstract
The rapid advancement of transformer-based language models has catalyzed
breakthroughs in biomedical and clinical natural language processing; however,
plant science remains markedly underserved by such domain-adapted tools. In
this work, we present PlantBert, a high-performance, open-source language model
specifically tailored for extracting structured knowledge from plant
stress-response literature. Built upon the DeBERTa architecture-known for its
disentangled attention and robust contextual encoding-PlantBert is fine-tuned
on a meticulously curated corpus of expert-annotated abstracts, with a primary
focus on lentil (Lens culinaris) responses to diverse abiotic and biotic
stressors. Our methodology combines transformer-based modeling with
rule-enhanced linguistic post-processing and ontology-grounded entity
normalization, enabling PlantBert to capture biologically meaningful
relationships with precision and semantic fidelity. The underlying corpus is
annotated using a hierarchical schema aligned with the Crop Ontology,
encompassing molecular, physiological, biochemical, and agronomic dimensions of
plant adaptation. PlantBert exhibits strong generalization capabilities across
entity types and demonstrates the feasibility of robust domain adaptation in
low-resource scientific fields. By providing a scalable and reproducible
framework for high-resolution entity recognition, PlantBert bridges a critical
gap in agricultural NLP and paves the way for intelligent, data-driven systems
in plant genomics, phenomics, and agronomic knowledge discovery. Our model is
publicly released to promote transparency and accelerate cross-disciplinary
innovation in computational plant science.