A Pilot Evaluation of Open-Weight Large Language Models for Screening RNA-seq Metadata in Public Databases
Journal: bioRxiv
Published Date: May 5, 2026
Abstract
Although the Gene Expression Omnibus and other public repositories are expanding rapidly, curation of these databases has not kept pace. Data reuse is often hindered by unstandardized metadata recorded as unstructured text. To address this, we developed a workflow that combines retrieval via application programming interfaces (APIs) with semantic filtering by large language models (LLMs) to support metadata screening as an initial step in broader curation workflows. As a focused pilot evaluation, we benchmarked multiple LLMs on metadata from 150 candidate Arabidopsis RNA-seq projects, with the task of identifying projects that contain both samples treated with exogenous abscisic acid (ABA) and matched untreated controls. Simple keyword searches yielded many false positives (F1 = 0.59), whereas LLM-based classification significantly improved performance. Several open-weight models achieved near-perfect classification performance on this defined task (F1 > 0.98), comparable to that of closed models. We also found that, for some high-performing models, self-reported confidence scores may help identify high-confidence cases that can be prioritized for automated processing. These results suggest that open-weight LLMs can support scalable metadata screening in local environments, providing a foundation for accelerating public dataset reuse.
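As an illustrative sketch only, and not the authors' implementation, the three ideas in the abstract (a keyword baseline, the F1 metric, and confidence-based triage of model outputs) might be expressed in Python as follows. The function names, the keyword pattern, the 0 to 1 confidence scale, and the 0.9 threshold are all assumptions introduced here for illustration; none of them comes from the paper.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword pattern; the paper does not state its exact search terms.
ABA_PATTERN = re.compile(r"\b(aba|abscisic acid)\b", re.IGNORECASE)


def keyword_screen(metadata_text: str) -> bool:
    """Naive baseline: flag a project if its metadata mentions ABA at all."""
    return ABA_PATTERN.search(metadata_text) is not None


def f1_score(y_true: list[bool], y_pred: list[bool]) -> float:
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


@dataclass
class Prediction:
    project_id: str    # hypothetical project identifier
    label: bool        # model's answer: ABA-treated samples plus matched controls?
    confidence: float  # self-reported confidence, assumed to lie in [0, 1]


def triage(predictions: list[Prediction], threshold: float = 0.9):
    """Split predictions into an automated queue and a manual-review queue."""
    auto = [p for p in predictions if p.confidence >= threshold]
    review = [p for p in predictions if p.confidence < threshold]
    return auto, review
```

A metadata string such as "ABA-deficient mutant under drought" matches the keyword pattern even though it describes no exogenous ABA treatment, which is the kind of false positive that a purely lexical screen cannot rule out. The threshold in `triage` is a tunable trade-off between automation rate and the share of cases sent for manual review, and would need to be calibrated per model.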