Evaluating a Locally Deployed 20-Billion Parameter Large Language Model for Automated Abstract Screening in Systematic Reviews

Journal: medRxiv
Published Date:

Abstract

Background: Systematic reviews (SRs) are essential for evidence-based medicine but require extensive time and resources for abstract screening. Large language models (LLMs) offer potential for automating this process, yet concerns about data privacy, intellectual property protection, and reproducibility limit the use of cloud-based solutions in research settings.
Objective: To evaluate the performance of a locally deployed 20-billion parameter LLM for automated abstract screening in systematic reviews using a sensitivity-enhanced prompting strategy, with blind expert adjudication of all discordant human-AI cases.
Methods: We deployed GPT-OSS:20B locally using Ollama and evaluated its performance across three systematic reviews: AI applications in pediatric surgical pathology (n=3,350), LLM applications in electronic health records (n=4,326), and parental stress/caregiver burden in surgically treated children (n=8,970). A sensitivity-enhanced prompting strategy instructing the model to include abstracts when uncertain was employed. All discordant cases underwent blind expert adjudication.
Results: Across 16,646 abstracts, the LLM demonstrated variable sensitivity after expert adjudication: 100% in SR1, 95.7% in SR2, and 85.7% in SR3. Expert adjudication identified 11 human screening errors across all reviews that the LLM had correctly classified. The LLM completed screening 4.7 times faster than human reviewers.
Conclusions: A locally deployed LLM with sensitivity-enhanced prompting shows promising performance for systematic review abstract screening, particularly for technology-focused topics. Performance variability across domains suggests that screening accuracy depends partly on the objectivity of inclusion criteria. We recommend deploying LLMs as second screeners alongside human reviewers until performance is more fully validated across diverse domains.
Keywords: systematic review; large language model; abstract screening; artificial intelligence; natural language processing; evidence synthesis; local deployment
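
The screening setup described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the function names, and the `gpt-oss:20b` model tag are assumptions made for illustration, and the only external interface used is Ollama's local `/api/generate` REST endpoint on its default port. The sketch shows the "include when uncertain" rule applied both in the prompt and when parsing the reply, and computes sensitivity against adjudicated reference labels.

```python
import json
import urllib.request

# Assumptions: Ollama is running locally on its default port and the model
# has been pulled under the tag below. Neither detail is stated in the abstract.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "gpt-oss:20b"


def build_prompt(criteria: str, abstract: str) -> str:
    """Hypothetical sensitivity-enhanced prompt: include when uncertain."""
    return (
        "You are screening abstracts for a systematic review.\n"
        f"Inclusion criteria:\n{criteria}\n\n"
        f"Abstract:\n{abstract}\n\n"
        "Answer with one word, INCLUDE or EXCLUDE. "
        "If you are uncertain, answer INCLUDE."
    )


def parse_decision(reply: str) -> bool:
    """Return True (include) unless the reply says EXCLUDE and not INCLUDE."""
    text = reply.upper()
    return not ("EXCLUDE" in text and "INCLUDE" not in text)


def screen_abstract(criteria: str, abstract: str) -> bool:
    """Send one abstract to the local model and return the include decision."""
    payload = json.dumps(
        {"model": MODEL, "prompt": build_prompt(criteria, abstract), "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read())["response"]
    return parse_decision(reply)


def sensitivity(llm_included: list, reference_included: list) -> float:
    """Share of reference-included abstracts that the LLM also included."""
    pairs = list(zip(llm_included, reference_included))
    true_pos = sum(1 for llm, ref in pairs if ref and llm)
    false_neg = sum(1 for llm, ref in pairs if ref and not llm)
    if true_pos + false_neg == 0:
        raise ValueError("reference set contains no included abstracts")
    return true_pos / (true_pos + false_neg)
```

Treating any ambiguous reply as an inclusion mirrors the sensitivity-first design: a wrongly included abstract costs a reviewer a few minutes at full-text stage, while a wrongly excluded one is lost to the review.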

Authors

  • Moreira Melo, P. H.
  • Poenaru, D.
  • Guadagno, E.