Large Language Model-Assisted Systematic Review: Validation Based on Cochrane Review Data.
Journal: Studies in health technology and informatics
Published Date: May 15, 2025
Abstract
Large Language Models (LLMs) offer potential for automating systematic reviews, a labor-intensive process in evidence-based medicine. We evaluated GPT-4o, GPT-4o-mini, and Llama 3.1:8B on abstract screening and risk of bias assessment using 12 Cochrane drug intervention reviews. GPT-4o achieved the best screening performance (recall 0.894, precision 0.492). We propose a one-shot inclusivity adjustment method that enables threshold modulation without repeated inference. For risk of bias assessment, accuracy varied by domain: it was highest for random sequence generation (0.873) and lowest for selective reporting (0.418). Our findings demonstrate both the practical utility and the current limitations of LLMs in automating systematic reviews.
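The abstract does not spell out how the one-shot inclusivity adjustment works. As a rough illustration only, the Python sketch below assumes that each abstract receives a single inclusion score from one model call, and that the include/exclude threshold is then moved over those stored scores without querying the model again. The scores, labels, and function names are invented for illustration and are not taken from the paper. The last helper shows the F1 score implied by the reported GPT-4o recall and precision.

```python
from typing import List, Sequence, Tuple


def precision_recall(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float]:
    """Return (precision, recall) for binary include/exclude decisions."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def screen(scores: Sequence[float], threshold: float) -> List[bool]:
    """Include an abstract when its stored score meets the threshold.

    Assumption: scores were produced once per abstract, so changing the
    threshold needs no further model calls.
    """
    return [s >= threshold for s in scores]


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical toy data (not from the paper): one score per abstract,
# plus the reviewers' reference decision for each.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]

# Lowering the threshold makes screening more inclusive:
# recall rises while precision falls.
for threshold in (0.7, 0.5, 0.2):
    p, r = precision_recall(screen(scores, threshold), labels)
    print(f"threshold={threshold:.1f} precision={p:.3f} recall={r:.3f}")

# F1 implied by the reported GPT-4o screening figures.
print(f"GPT-4o F1 = {f1(0.492, 0.894):.3f}")
```

In screening, a missed relevant study usually costs more than an extra abstract to read, which is why a high-recall, moderate-precision operating point like the one reported for GPT-4o can still be useful.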