Feature engineering vs. deep learning for paper section identification: Toward applications in Chinese medical literature
Journal: arXiv
Published Date: Dec 15, 2024
Abstract
Section identification is an important task for library science, especially
knowledge management. Identifying the sections of a paper would help filter
noise in entity and relation extraction. In this research, we study the paper
section identification problem in the context of Chinese medical literature
analysis, where the subjects, methods, and results are more valuable from a
physician's perspective. Building on previous studies of section identification
in English literature, we experiment with which features are effective when used
with classic machine learning algorithms. We find that Conditional Random Fields
(CRFs), which consider sentence interdependency, are more effective at combining
different feature sets, such as bag-of-words, part-of-speech, and headings, for
Chinese literature section identification.
Moreover, we find that classic machine learning algorithms are more effective
than generic deep learning models for this problem. Based on these
observations, we design a novel deep learning model, the Structural
Bidirectional Long Short-Term Memory (SLSTM) model, which models word and
sentence interdependency together with the contextual information. Experiments
on a human-curated asthma literature dataset show that our approach outperforms
both traditional machine learning methods and other deep learning methods,
achieving close to 90% precision and recall on the task. The model shows good
potential for use in other text mining tasks. The research has significant
methodological and practical implications.
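
As a rough illustration of the feature-based approach summarized above, the sketch below builds one feature dictionary per sentence from bag-of-words, part-of-speech, and heading information, and marks the first and last sentence of a paper so that a sequence model can use position. The record layout, function names, feature keys, and toy sentences are all assumptions made for this example; the abstract does not specify how the features were actually encoded.

```python
from typing import Dict, List


def sentence_features(sent: dict) -> Dict[str, float]:
    """Combine bag-of-words, part-of-speech, and heading features for one sentence.

    `sent` is a hypothetical record with keys "tokens", "pos", and an
    optional "heading"; this layout is an assumption, not the paper's format.
    """
    feats: Dict[str, float] = {"bias": 1.0}
    # Bag-of-words: presence of each token in the sentence.
    for tok in sent["tokens"]:
        feats[f"bow={tok}"] = 1.0
    # Part-of-speech: count of each tag in the sentence.
    for tag in sent["pos"]:
        key = f"pos={tag}"
        feats[key] = feats.get(key, 0.0) + 1.0
    # Heading: the heading the sentence appears under, if any.
    if sent.get("heading"):
        feats[f"heading={sent['heading']}"] = 1.0
    return feats


def paper_features(sents: List[dict]) -> List[Dict[str, float]]:
    """Build the feature sequence for one paper, one dictionary per sentence."""
    seq = [sentence_features(s) for s in sents]
    if seq:
        seq[0]["BOS"] = 1.0   # first sentence of the paper
        seq[-1]["EOS"] = 1.0  # last sentence of the paper
    return seq


# Toy, already-segmented sentences (invented for illustration only).
paper = [
    {"tokens": ["目的", "探讨", "哮喘"], "pos": ["n", "v", "n"], "heading": "目的"},
    {"tokens": ["选取", "患者"], "pos": ["v", "n"], "heading": "方法"},
]
X = paper_features(paper)
```

A sequence of dictionaries like `X`, paired with one section label per sentence, is a layout that some linear-chain CRF toolkits accept as training input. The sentence interdependency mentioned in the abstract would then be captured by the CRF's label-transition weights rather than by these features alone.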