Large Language Model-Assisted Systematic Review: Validation Based on Cochrane Review Data.

Journal: Studies in health technology and informatics
Published Date:

Abstract

Large Language Models (LLMs) offer potential for automating systematic reviews, a labor-intensive process in evidence-based medicine. We evaluated GPT-4o, GPT-4o-mini, and Llama 3.1:8B on abstract screening and risk of bias assessment using 12 Cochrane drug intervention reviews. GPT-4o achieved the best screening performance (recall 0.894, precision 0.492). We propose a one-shot inclusivity adjustment method that enables threshold modulation without repeated inference. For risk of bias assessment, accuracy varied by domain: it was highest for random sequence generation (0.873) and lowest for selective reporting (0.418). Our findings demonstrate both the practical utility and the current limitations of LLMs in automating systematic reviews.
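
To make the screening metrics and the idea of threshold modulation concrete, the following is a minimal sketch, not the authors' method, which the abstract does not describe in detail. It assumes that a single inference pass yields one inclusion score per abstract (the `scores` list, the `screen` helper, and the threshold values are all hypothetical), so that the inclusion threshold can be moved afterwards without querying the model again.

```python
from typing import List, Tuple


def precision_recall(predicted: List[bool], relevant: List[bool]) -> Tuple[float, float]:
    """Return (precision, recall) for binary include/exclude decisions."""
    tp = sum(p and r for p, r in zip(predicted, relevant))
    fp = sum(p and not r for p, r in zip(predicted, relevant))
    fn = sum((not p) and r for p, r in zip(predicted, relevant))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def screen(scores: List[float], threshold: float) -> List[bool]:
    """Include an abstract when its stored score reaches the threshold.

    Assumption: the scores come from one model call per abstract, so changing
    the threshold only re-reads stored values and needs no further inference.
    """
    return [s >= threshold for s in scores]


# Hypothetical scores and reference labels, for illustration only.
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1]
relevant = [True, True, True, False, False, False]

strict = precision_recall(screen(scores, 0.65), relevant)   # fewer inclusions
lenient = precision_recall(screen(scores, 0.35), relevant)  # more inclusions

print("strict  (precision, recall):", strict)
print("lenient (precision, recall):", lenient)
```

Lowering the threshold in this toy example raises recall at the cost of precision, which is the trade-off usually favored in abstract screening, where missing a relevant study is more costly than reviewing an extra one.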

Authors

  • Siun Kim
    Department of Applied Biomedical Engineering, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, Korea; Center for Convergence Approaches in Drug Development, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, Korea.
  • Hyung-Jin Yoon
    Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Korea.