Identifying biomedical entities for datasets in scientific articles – A 4-step cache-augmented generation approach using GPT-4o and PubTator 3.0
Journal: medRxiv
Published Date: Jan 1, 2025
Abstract
Accurate annotation of biomedical entities in scientific articles is essential for effective metadata generation and helps ensure that data are findable, accessible, interoperable and reusable in collaborative research. This study introduces a novel 4-step Cache-Augmented Generation (CAG) approach to identifying biomedical entities that leverages GPT-4o and PubTator 3.0. The method integrates (1) GPT-4o-based entity generation, (2) PubTator-based validation, (3) term extraction based on a metadata schema developed for the specific research area, and (4) a combined evaluation of PubTator-validated and schema-related terms. The approach was applied to 23 articles published in the context of the Collaborative Research Centre OncoEscape and validated through supervised, face-to-face interviews with the article authors. Annotation precision was then assessed using random effects meta-analysis. The approach yielded a mean of 19.6 schema-related and 6.7 PubTator-validated biomedical entities per article. Overall precision was 98% [95% CI 94%-100%]. In a subsample (N=20), available supplemental material was included in the prediction process, which did not increase precision (98%, CI 95%-100%). Likewise, the mean numbers of schema-related (20.1, p=0.561) and PubTator-validated (6.7, p=0.681) biomedical entities did not increase when the supplement was included. This study highlights the potential of CAG for metadata annotation, and the findings underscore the practical feasibility of full-text analysis for routine metadata annotation in biomedical research.
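To illustrate how per-article precision can be pooled with a random effects meta-analysis, the following is a minimal sketch, not the authors' implementation. It assumes a DerSimonian-Laird estimator and an arcsine square-root transform of the proportions. The abstract does not state which estimator or transform was used, so both are assumptions, as are the function name `pooled_precision` and the example counts, which are hypothetical and not taken from the study.

```python
import math


def pooled_precision(counts):
    """Pool per-article precision with a DerSimonian-Laird random effects model.

    counts: list of (correct, total) pairs, one per article (hypothetical input format).
    Assumption: proportions are arcsine square-root transformed before pooling.
    Returns (pooled, ci_low, ci_high) on the proportion scale, using a 95% interval.
    """
    # Transform each proportion; the sampling variance 1/(4n) depends only on n,
    # so articles with 100% precision do not produce a zero variance.
    y = [math.asin(math.sqrt(c / n)) for c, n in counts]
    v = [1.0 / (4.0 * n) for _, n in counts]

    # Fixed-effect weights and Cochran's Q statistic.
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))

    # DerSimonian-Laird estimate of the between-article variance tau^2.
    k = len(counts)
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Random effects weights, pooled estimate and its standard error.
    w_re = [1.0 / (vi + tau2) for vi in v]
    y_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    def back(t):
        # Clamp to the valid range of the transform, then invert it.
        t = min(max(t, 0.0), math.pi / 2)
        return math.sin(t) ** 2

    return back(y_re), back(y_re - 1.96 * se), back(y_re + 1.96 * se)


if __name__ == "__main__":
    # Hypothetical (correct, total) entity counts for three articles.
    example = [(18, 20), (25, 26), (19, 19)]
    pooled, low, high = pooled_precision(example)
    print(f"pooled precision {pooled:.3f} (95% CI {low:.3f}-{high:.3f})")
```

The arcsine transform is chosen here only because it keeps the variance finite when an article reaches 100% precision. Established meta-analysis packages offer other transforms and estimators that may be preferable in practice.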