Do You Need Embeddings Trained on a Massive Specialized Corpus for Your Clinical Natural Language Processing Task?

Journal: Studies in Health Technology and Informatics
Published Date:

Abstract

We explore the impact of the data source on word representations for two NLP tasks in the clinical domain in French: natural language understanding and text classification. We compared word embeddings (FastText) and contextual language models (ELMo), learned either on general-domain data (Wikipedia) or on specialized data (electronic health records, EHR). The best results were obtained with ELMo representations learned on EHR data for both tasks (+7% and +8% gain in F1-score, respectively).
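The comparison described above can be illustrated with a toy nearest-neighbor probe: given word vectors from two embedding spaces, check which words are closest under cosine similarity. The vectors and vocabulary below are invented for illustration only; the study itself used FastText and ELMo representations, not these hand-written values.

```python
import math

# Toy word vectors standing in for embeddings trained on two corpora
# (general-domain vs. clinical). Values are made up for illustration.
general = {
    "fever": [0.9, 0.1, 0.0],
    "temperature": [0.8, 0.2, 0.1],
    "party": [0.1, 0.9, 0.3],
}
clinical = {
    "fever": [0.7, 0.6, 0.1],
    "temperature": [0.7, 0.5, 0.2],
    "party": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(vectors, word):
    """Most similar other word within the same embedding space."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

print(nearest(general, "fever"))   # -> temperature
print(nearest(clinical, "fever"))  # -> temperature
```

In practice such intrinsic probes only hint at quality; the paper evaluates the representations extrinsically, by F1-score on the downstream tasks.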

Authors

  • Antoine Neuraz
    Institut National de la Santé et de la Recherche Médicale (INSERM), Centre de Recherche des Cordeliers, UMR 1138 Equipe 22, Paris Descartes, Sorbonne Paris Cité University, Paris, France.
  • Vincent Looten
    Department of Medical Informatics, Necker-Enfants Malades Hospital, Assistance Publique-Hôpitaux de Paris (AP-HP), Paris, France.
  • Bastien Rance
    AP-HP, University Hospital Georges Pompidou; INSERM, UMR_S 1138, Centre de Recherche des Cordeliers, Paris, France.
  • Nicolas Daniel
    Hôpital Européen Georges Pompidou, AP-HP, Université Paris Descartes, Sorbonne Paris Cité, Paris, France.
  • Nicolas Garcelon
    Plateforme Data Science, Institut des Maladies Génétiques Imagine, INSERM, Centre de Recherche des Cordeliers, UMR 1138 Équipe 22, Institut Imagine, Université Paris Descartes, Sorbonne Paris Cité, Paris, France.
  • Leonardo Campillos Llanos
    LIMSI, CNRS, Université Paris Saclay.
  • Anita Burgun
    Hôpital Necker-Enfants malades, AP-HP, Paris, France.
  • Sophie Rosset
    LIMSI, CNRS, Université Paris Saclay.