Prompting large language models to extract chemical‒disease relation precisely and comprehensively at the document level: an evaluation study.

Journal: PloS one
PMID:

Abstract

Given the scarcity of annotated data, current deep learning methods struggle with document-level chemical-disease relation extraction: they have difficulty achieving both precise extraction, which identifies relation types, and comprehensive extraction, which identifies relation-related factors. This study tests the ability of three large language models (LLMs), GPT3.5, GPT4.0, and Claude-opus, to perform precise and comprehensive document-level chemical-disease relation extraction on a self-constructed dataset. First, based on the task characteristics, the study designs six workflows for precise extraction and five workflows for comprehensive extraction using prompt engineering strategies, and analyzes the characteristics of the extraction process through the performance differences among workflows. Second, it analyzes content bias in LLM extraction by examining how effectively different workflows extract different types of content. Finally, it analyzes the error characteristics of the examples that the LLMs extract incorrectly. The experimental results show that: (1) the LLMs demonstrate good extraction capabilities, achieving highest F1 scores of 87% and 73% in precise extraction and comprehensive extraction, respectively; (2) in the extraction process, the LLMs exhibit a degree of stubbornness, and prompt engineering strategies have limited effect; (3) in terms of extraction content, the LLMs show a content bias, with a stronger ability to identify positive relations such as induction and acceleration; (4) extraction errors stem essentially from the LLMs' misunderstanding of implicit meanings in biomedical texts.
This study provides practical workflows for precise and comprehensive extraction of document-level chemical-disease relations and also indicates that optimizing training data is the key to building more efficient and accurate extraction methods in the future.
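
The abstract reports F1 scores but does not say how extracted relations were matched against the annotations. As a minimal sketch only, assuming exact matching on (chemical, disease, relation type) triples, the metric could be computed as below. The function name, the triple layout, and the sample data are hypothetical illustrations and are not taken from the study.

```python
from typing import Set, Tuple

# Hypothetical representation: (chemical, disease, relation_type).
Triple = Tuple[str, str, str]


def precision_recall_f1(gold: Set[Triple], predicted: Set[Triple]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact triple matching."""
    true_positives = len(gold & predicted)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example data, for illustration only.
gold = {
    ("chemical_A", "disease_X", "induces"),
    ("chemical_B", "disease_Y", "inhibits"),
    ("chemical_C", "disease_Z", "accelerates"),
}
predicted = {
    ("chemical_A", "disease_X", "induces"),
    ("chemical_B", "disease_Y", "induces"),  # wrong relation type, counts as an error
    ("chemical_C", "disease_Z", "accelerates"),
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
```

Under exact matching, a prediction with the right chemical and disease but the wrong relation type counts as both a false positive and a false negative, which is one way a bias toward positive relations such as induction could lower the score.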

Authors

  • Mei Chen
    The First People's Hospital of Longquanyi District, Chengdu, Sichuan Province 610100, China.
  • Tingting Zhang
    Department of Environmental Science and Engineering, College of Chemical Engineering, Beijing University of Chemical Technology, Beijing 100029, China. Electronic address: zhangtt@mail.buct.edu.cn.
  • Shibin Wang
    Xi'an Jiaotong University, PR China; School of Mechanical Engineering, Xi'an Jiaotong University, Xi'an, 710049, PR China; State Key Laboratory for Manufacturing Systems Engineering, Xi'an Jiaotong University, Xi'an, 710049, PR China. Electronic address: wangshibin2008@xjtu.edu.cn.