The Impact of Specialized Corpora for Word Embeddings in Natural Language Understanding.

Journal: Studies in health technology and informatics

Abstract

Recent studies in the biomedical domain suggest that learning statistical word representations (static or contextualized word embeddings) on large corpora of specialized data improves the results on downstream natural language processing (NLP) tasks. In this paper, we explore the impact of the data source of word representations on a natural language understanding task. We compared Fasttext (static) and ELMo (contextualized) embeddings learned either on general-domain data (Wikipedia) or on specialized data (electronic health records, EHR). The best results were obtained with ELMo representations learned on EHR data for both sub-tasks (gains of +7% and +4% in F1-score). Moreover, the ELMo representations were trained with only a fraction of the data used for Fasttext.
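To illustrate the kind of comparison described above, the minimal sketch below trains static Fasttext embeddings on two toy corpora (stand-ins for general-domain Wikipedia text and for EHR text) and inspects how the data source changes the nearest neighbours of a clinical term. The toy corpora, hyperparameters, and the use of the gensim library are assumptions for illustration only, not the authors' actual pipeline, data, or evaluation.

```python
# Minimal sketch (not the authors' pipeline): train static Fasttext embeddings
# on two small, invented corpora to show how the data source changes the
# learned representations of a clinical term.
from gensim.models import FastText

# Placeholder "general-domain" sentences (stand-in for Wikipedia).
general_corpus = [
    ["the", "patient", "waited", "at", "the", "station"],
    ["the", "film", "received", "positive", "reviews"],
    ["the", "city", "council", "approved", "the", "new", "budget"],
]

# Placeholder "specialized" sentences (stand-in for EHR notes).
ehr_corpus = [
    ["patient", "presents", "with", "acute", "dyspnea"],
    ["prescribed", "amoxicillin", "for", "suspected", "pneumonia"],
    ["patient", "reports", "chest", "pain", "on", "exertion"],
]

# Same hyperparameters, different source data.
general_model = FastText(sentences=general_corpus, vector_size=50,
                         window=3, min_count=1, epochs=10)
ehr_model = FastText(sentences=ehr_corpus, vector_size=50,
                     window=3, min_count=1, epochs=10)

# The nearest neighbours of the same term differ between the two models,
# reflecting the influence of the training corpus.
print(general_model.wv.most_similar("patient", topn=3))
print(ehr_model.wv.most_similar("patient", topn=3))
```

A contextualized model such as ELMo would be compared in the same spirit, by feeding the downstream task embeddings produced from either the general-domain or the EHR-trained model and measuring F1-score, but its training is not reproduced in this sketch.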

Authors

  • Antoine Neuraz
    Institut National de la Santé et de la Recherche Médicale (INSERM), Centre de Recherche des Cordeliers, UMR 1138 Equipe 22, Paris Descartes, Sorbonne Paris Cité University, Paris, France.
  • Bastien Rance
    AP-HP, University Hospital Georges Pompidou; INSERM, UMR_S 1138, Centre de Recherche des Cordeliers, Paris, France.
  • Nicolas Garcelon
    Plateforme Data Science - Institut des Maladies Génétiques Imagine, INSERM, Centre de Recherche des Cordeliers, UMR 1138 Équipe 22, Institut Imagine, Paris Descartes, Université Sorbonne Paris Cité, Paris, France.
  • Leonardo Campillos Llanos
    LIMSI, CNRS, Université Paris Saclay.
  • Anita Burgun
    Hôpital Necker-Enfants malades, AP-HP, Paris, France.
  • Sophie Rosset
    LIMSI, CNRS, Université Paris Saclay.