Evaluating the performance of a custom GPT in full text screening of a systematic review.

Journal: Scandinavian Journal of Public Health
Published Date:

Abstract

AIM: Systematic reviewing is a time-consuming process that can be aided by artificial intelligence (AI). Several AI options exist to assist with title/abstract screening; however, options for full-text screening are limited. The objective of this study was to evaluate the performance of a custom generative pretrained transformer (cGPT) for full-text screening. METHODS: A cGPT powered by OpenAI's ChatGPT4o was tested with subsets of articles assessed in duplicate by human reviewers. Outputs from the testing subset were coded to simulate the cGPT as an autonomous reviewer and as an assistant reviewer. Cohen's kappa was used to assess interrater agreement. RESULTS: For the inclusion/exclusion decision, the human-human kappa scores ranged from 0.87 to 0.96, exceeding the ranges of kappa scores for autonomous cGPT-human pairings (0.59 to 0.67) and assistant cGPT-human pairings (0.62 to 0.72). For exclusion reason classification, the human-human kappa scores ranged from 0.71 to 0.78, exceeding the ranges of kappa scores for autonomous cGPT-human pairings (0.47 to 0.53) and assistant cGPT-human pairings (0.52 to 0.63). CONCLUSIONS: The assistant cGPT outperformed the autonomous cGPT. An assistant cGPT could speed up systematic reviewing in a sufficiently reliable manner; however, further research is needed to establish standardized thresholds for practical use. Faster systematic reviewing has implications for directing timely public health policy decisions.
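
Cohen's kappa, the agreement statistic reported above, compares the observed proportion of agreement between two raters with the agreement expected by chance given each rater's label frequencies: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal Python sketch of that calculation for one reviewer pairing. It is not the authors' analysis code; the function name `cohens_kappa` and the ten toy decisions are hypothetical illustrations, not study data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")

    n = len(rater_a)

    # Observed agreement: share of items on which the two raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over all categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used one identical category throughout; kappa is
        # undefined (0/0), so this sketch reports perfect agreement by convention.
        return 1.0

    return (observed - expected) / (1 - expected)


# Toy example only (assumed labels, not data from the study).
human = ["include", "include", "exclude", "exclude", "exclude",
         "include", "exclude", "exclude", "exclude", "exclude"]
cgpt = ["include", "exclude", "exclude", "exclude", "exclude",
        "include", "exclude", "include", "exclude", "exclude"]

kappa = cohens_kappa(human, cgpt)
print(f"kappa = {kappa:.3f}")
```

In this toy pairing the raters agree on 8 of 10 articles (p_o = 0.80), yet chance agreement is already 0.58 because both exclude most articles, so kappa is about 0.52. This shows why kappa, rather than raw percent agreement, is a common choice for screening decisions with unbalanced include/exclude rates. The same function accepts multi-category labels, such as exclusion reasons.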

Authors
