A Topic Recognition Method of News Text Based on Word Embedding Enhancement.

Journal: Computational intelligence and neuroscience
Published Date:

Abstract

Topic recognition is widely used to identify categories of news topics in the vast amount of web information, with broad application prospects in online public opinion monitoring, news recommendation, and related fields. However, effectively using key feature information in the text, such as syntax and semantics, to improve recognition accuracy remains challenging. Some researchers have proposed combining topic models with word embedding models, and their results show that this approach can enrich text representation and benefit downstream natural language processing tasks. For topic recognition of news texts, however, there is currently no standard way of combining a topic model with a word embedding model. In addition, some existing approaches of this kind are relatively complex and do not consider the fusion of topic distributions of different granularity with word embedding information. This paper therefore proposes a novel text representation method based on word embedding enhancement and builds on it a full-process topic recognition framework for news text. In contrast to traditional topic recognition methods, the framework uses the probabilistic topic model LDA and the word embedding models Word2vec and GloVe to extract and integrate the topic distribution, semantic knowledge, and syntactic relationships of the text, and then applies popular classifiers to the resulting text representation vectors to recognize news topic categories automatically. The framework can thus exploit both the document-topic relationship and context information, which improves expressive power and reduces dimensionality. Experimental results on two benchmark datasets, 20NewsGroup and BBC News, verify the effectiveness and superiority of the proposed method for news topic recognition.
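
As a rough illustration of the kind of representation the abstract describes, the sketch below builds one document vector from a topic distribution and two sets of word vectors. It is not the authors' method: the abstract does not state how the three sources are fused, so plain concatenation of the topic distribution with averaged word vectors is assumed here. The names `mean_vector` and `fuse` and all numeric values are hypothetical, and the toy lookup tables stand in for trained LDA, Word2vec, and GloVe models.

```python
from typing import Dict, List, Sequence


def mean_vector(tokens: Sequence[str], table: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the embeddings of tokens present in `table`; return zeros if none are present."""
    hits = [table[t] for t in tokens if t in table]
    if not hits:
        return [0.0] * dim
    return [sum(column) / len(hits) for column in zip(*hits)]


def fuse(
    topic_dist: Sequence[float],
    tokens: Sequence[str],
    w2v: Dict[str, List[float]],
    glove: Dict[str, List[float]],
    w2v_dim: int,
    glove_dim: int,
) -> List[float]:
    """Assumed fusion: concatenate the topic distribution with both averaged embeddings."""
    return (
        list(topic_dist)
        + mean_vector(tokens, w2v, w2v_dim)
        + mean_vector(tokens, glove, glove_dim)
    )


# Toy, hand-made inputs (hypothetical values, not outputs of trained models).
toy_w2v = {"market": [1.0, 0.0], "shares": [0.0, 1.0]}
toy_glove = {"market": [0.5, 0.5, 0.0], "shares": [0.5, 0.0, 0.5]}
toy_topic_dist = [0.7, 0.2, 0.1]  # stand-in for an LDA document-topic vector

# Out-of-vocabulary tokens such as "unknown" are skipped when averaging.
doc_vector = fuse(toy_topic_dist, ["market", "shares", "unknown"], toy_w2v, toy_glove, 2, 3)
print(len(doc_vector), doc_vector)
```

In a full pipeline, vectors of this kind would be passed to a classifier of choice. The low dimensionality mentioned in the abstract corresponds here to the vector length being the number of topics plus the two embedding sizes, rather than the vocabulary size.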

Authors

  • Qiming Du
    State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, China.
  • Nan Li
    School of Basic Medical Sciences, Jiamusi University No. 258, Xuefu Street, Xiangyang District, Jiamusi 154007, Heilongjiang, China.
  • Wenfu Liu
    State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, China.
  • Daozhu Sun
    State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, China.
  • Shudan Yang
    State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, China.
  • Feng Yue
    Bioinformatics and Genomics Program, Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, 16802, USA. fyue@hmc.psu.edu.