Relation Extraction for Diet, Non-Communicable Disease and Biomarker Associations (RECoDe): A CoDiet study

Journal: bioRxiv
Published Date:

Abstract

Diet plays a critical role in human health, with growing evidence linking dietary habits to disease outcomes. However, extracting structured dietary knowledge from biomedical literature remains challenging due to the lack of dedicated relation extraction datasets. To address this gap, we introduce RECoDe, a novel relation extraction (RE) dataset designed specifically for diet, disease, and related biomedical entities. RECoDe captures a diverse set of relation types, including a broad spectrum of positive association patterns and explicit negative examples, with over 5,000 human-annotated instances validated by up to five independent annotators. Furthermore, we benchmark various natural language processing (NLP) RE models, including BERT-based architectures and enhanced prompting techniques with locally deployed large language models (LLMs), to improve classification performance on underrepresented relation types. The best-performing model, gpt-oss-20B, a locally deployed LLM, achieved an F1-score of 64% for multi-class classification and 92% for binary classification using a hierarchical prompting strategy with a separate, built-in reflection step. To demonstrate the practical utility of RECoDe, we introduce the Contextual Co-occurrence Summarisation (CoCoS) framework, which aggregates sentence-level relation extractions into document-level summaries and further integrates evidence across multiple documents. CoCoS produces effect estimates consistent with established dietary knowledge, demonstrating its validity as a general framework for systematic evidence synthesis. Availability: The code, models, and data will be made freely available following peer review.
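
The aggregation idea described for CoCoS (sentence-level extractions rolled up into document-level summaries, then combined across documents) can be illustrated with a minimal sketch. This is not the authors' implementation: the majority-vote rule, the function names, the relation labels, and the example records are all hypothetical and chosen only to make the two aggregation levels concrete.

```python
from collections import Counter, defaultdict

# Hypothetical sentence-level extractions: (doc_id, head, tail, relation label).
# Entity names and labels are invented for illustration, not taken from RECoDe.
extractions = [
    ("doc1", "fibre", "type 2 diabetes", "negative_association"),
    ("doc1", "fibre", "type 2 diabetes", "negative_association"),
    ("doc1", "fibre", "type 2 diabetes", "no_relation"),
    ("doc2", "fibre", "type 2 diabetes", "negative_association"),
    ("doc2", "red meat", "type 2 diabetes", "positive_association"),
]


def summarise_documents(rows):
    """Reduce sentence-level labels to one majority label per (doc, entity pair).

    Majority voting is an assumed rule, used here only as a stand-in.
    """
    votes = defaultdict(Counter)
    for doc_id, head, tail, label in rows:
        votes[(doc_id, head, tail)][label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def pool_across_documents(doc_summaries):
    """Count, for each entity pair, how many documents support each label."""
    pooled = defaultdict(Counter)
    for (_doc_id, head, tail), label in doc_summaries.items():
        pooled[(head, tail)][label] += 1
    return pooled


doc_level = summarise_documents(extractions)
corpus_level = pool_across_documents(doc_level)
```

Here `doc_level` holds one label per document and entity pair, and `corpus_level` holds per-pair counts of how many documents support each label. A real evidence-synthesis pipeline would likely weight documents and model uncertainty rather than count votes.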

Authors

  • Choi, D.
  • Gu, Y.
  • Zong, K.
  • Lain, A. D.
  • Zaikis, D.
  • Rowlands, T.
  • Rei, M.
  • CoDiet Consortium
  • Beck, T.
  • Posma, J. M.

Categories