AutoReporter: Development of an artificial intelligence tool for automated assessment of research reporting guideline adherence

Journal: medRxiv
Published Date:

Abstract

The objective of this study was to develop AutoReporter, a large language model (LLM) system that automates evaluation of adherence to research reporting guidelines. Eight prompt-engineering and retrieval strategies, coupled with reasoning and general-purpose LLMs, were benchmarked on the SPIRIT-CONSORT-TM corpus. The top-performing approach, AutoReporter, was validated on BenchReport, a novel benchmark dataset of expert-rated reporting guideline assessments from 10 systematic reviews. AutoReporter, a zero-shot, no-retrieval prompt coupled with the o3-mini reasoning LLM, demonstrated optimal accuracy (CONSORT: 90.09%; SPIRIT: 92.07%), run-time (CONSORT: 617.26 seconds; SPIRIT: 544.51 seconds), and cost (CONSORT: 0.68 USD; SPIRIT: 0.65 USD). On the BenchReport benchmark, AutoReporter achieved a mean accuracy of 91.8% and substantial agreement with expert ratings (Cohen’s κ > 0.6). Structured prompting alone can match or exceed fine-tuned domain models while forgoing manually annotated corpora and computationally intensive training. LLMs can feasibly automate reporting guideline adherence assessments for scalable quality control in scientific research reporting. AutoReporter is publicly accessible at https://autoreporter.streamlit.app.
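
The two agreement statistics reported above, accuracy and Cohen's κ, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names and the toy ratings are hypothetical, and it assumes each guideline item receives one categorical label from the model and one from an expert. Cohen's κ is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies.

```python
from collections import Counter


def accuracy(model_labels, expert_labels):
    """Fraction of items on which the model and the expert give the same label."""
    if len(model_labels) != len(expert_labels) or not model_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(m == e for m, e in zip(model_labels, expert_labels))
    return matches / len(model_labels)


def cohen_kappa(model_labels, expert_labels):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(model_labels)
    p_o = accuracy(model_labels, expert_labels)  # observed agreement
    model_counts = Counter(model_labels)
    expert_counts = Counter(expert_labels)
    categories = model_counts.keys() | expert_counts.keys()
    # Expected chance agreement from the marginal label frequencies.
    p_e = sum(model_counts[c] * expert_counts[c] for c in categories) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy ratings: 1 = item reported, 0 = item not reported.
model = [1, 1, 0, 0]
expert = [1, 1, 0, 1]
print(accuracy(model, expert))     # 0.75
print(cohen_kappa(model, expert))  # 0.5
```

In this toy case the raters agree on three of four items (accuracy 0.75), but half of that agreement is expected by chance, so κ is 0.5. This is why a κ threshold is reported alongside raw accuracy.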

Authors

  • David Chen; Patrick Li; Ealia Khoshkish; Seungmin Lee; Tony Ning; Umair Tahir; Henry CY Wong; Michael SF Lee; Srinivas Raman