Auto-METRICS: LLM-assisted scientific quality control for radiomics research

Journal: medRxiv
Published Date:

Abstract

The quality and integrity of scientific publications in clinical machine learning are of paramount importance. Low-quality publications can offer excessively optimistic estimates of model performance, leading to unrealistic expectations when models are translated to clinical scenarios. Radiomics is no exception: recent studies have highlighted stark publication bias in radiomics research, and multiple investigations have shown that radiomics-based research can be affected by several methodological confounders. To address this systematically, the Radiomics Quality Score (RQS) and the METhodological RadiomICs Score (METRICS) were designed. Both provide a standardised measure of the methodological quality of radiomics manuscripts, thereby supporting judgements about their clinical translation. Of the two, METRICS has been shown to be more reproducible and accurate. In recent years, large language models (LLMs) have been used in many science-related tasks. Here, we ask the following question: how well do LLM-based METRICS assessments correlate with those performed by radiologists? Using two recent reproducibility studies, we provide evidence that LLM-based radiomics assessments can be a useful aid in determining the scientific quality of radiomics-based publications. In particular, we show that inter-rater agreement between LLMs and human raters is similar to that reported between human raters. We also show that the correlation and error between METRICS scores obtained by human raters and LLMs are similar to those obtained between human raters. These results constitute an important proof of concept: LLMs can be used to assist human raters in deriving standardised scores.
How well do LLM-based METRICS assessments correlate with those performed by radiologists when determining the scientific quality of radiomics-based publications, given the need for efficient quality control? LLM-based radiomics assessments show inter-rater agreement with human raters similar to that between human raters, suggesting that they can assist with standardised scoring. LLMs can assist in the standardised assessment of radiomics research quality, potentially improving the reliability and clinical translatability of radiomics-based tools by ensuring studies meet rigorous methodological standards, ultimately benefiting patient care through more robust clinical applications.
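
As an illustration of the kind of comparison described above, the sketch below computes an agreement statistic on item-level ratings and the correlation and error between total scores for two raters. It is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not say which agreement statistic was used, so Cohen's kappa, Pearson correlation and mean absolute error are chosen here for illustration only, and all ratings and scores in the example are invented placeholder values.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(a, b):
    """Cohen's kappa for two raters' categorical ratings of the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items rated identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal category frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sd_x = sqrt(sum((xi - mean_x) ** 2 for xi in x))
    sd_y = sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cov / (sd_x * sd_y)


def mean_absolute_error(x, y):
    """Average absolute difference between paired scores."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / len(x)


# Hypothetical item-level ratings (1 = criterion met, 0 = not met).
human_items = [1, 1, 0, 1, 0, 0, 1, 1]
llm_items = [1, 0, 0, 1, 0, 1, 1, 1]

# Hypothetical total quality scores (percent) for four manuscripts.
human_scores = [60.0, 72.5, 45.0, 80.0]
llm_scores = [65.0, 70.0, 50.0, 85.0]

kappa = cohen_kappa(human_items, llm_items)
r = pearson_r(human_scores, llm_scores)
mae = mean_absolute_error(human_scores, llm_scores)
```

Running the same three functions on a human-versus-human pair gives the baseline against which the LLM-versus-human values would be judged.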

Authors

  • José Guilherme de Almeida; Nickolas Papanikolaou