Learning Portuguese Clinical Word Embeddings: A Multi-Specialty and Multi-Institutional Corpus of Clinical Narratives Supporting a Downstream Biomedical Task.

Journal: Studies in health technology and informatics
Published Date:

Abstract

In this paper, we trained a set of Portuguese clinical word embedding models of different granularities on multi-specialty, multi-institutional clinical narrative datasets. We then assessed their impact on a downstream biomedical NLP task: identifying urinary tract infection. We also evaluated our main model intrinsically, using a version of Bio-SimLex adapted to Portuguese. Empirically, the larger, coarse-grained model performed slightly better on the proposed task than the smaller, fine-grained model. The Bio-SimLex intrinsic evaluation also yielded satisfactory results.
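
As a rough illustration of what an intrinsic, similarity-based evaluation can look like, the sketch below scores word pairs by the cosine similarity of their vectors and compares those scores with human ratings using Spearman's rank correlation. The abstract does not state which metric was used, so the choice of Spearman correlation is an assumption here. The vectors, word pairs, and ratings are invented for the example and do not come from the paper or from Bio-SimLex.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx = sum(rx) / len(rx)
    my = sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical 3-dimensional vectors; real embeddings have far more dimensions.
embeddings = {
    "febre": [0.9, 0.1, 0.0],
    "hipertermia": [0.8, 0.2, 0.1],
    "urina": [0.1, 0.9, 0.2],
    "bexiga": [0.3, 0.7, 0.4],
    "fratura": [0.0, 0.1, 0.9],
}

# Hypothetical human similarity ratings on a 0-10 scale (not Bio-SimLex data).
rated_pairs = [
    ("febre", "hipertermia", 9.0),
    ("urina", "bexiga", 7.5),
    ("febre", "urina", 2.0),
    ("bexiga", "fratura", 1.0),
]

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in rated_pairs]
human_scores = [rating for _, _, rating in rated_pairs]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation with human ratings: {rho:.2f}")
```

A correlation closer to 1 means the model orders the word pairs more like the human raters do. In practice the vectors would be looked up in a trained embedding model rather than written by hand.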

Authors

  • Lucas Emanuel Silva E Oliveira
    Health Technology Program, Pontifical Catholic University of Paraná, Curitiba, PR, Brazil.
  • Yohan Bonescki Gumiel
    Health Technology Program, Pontifical Catholic University of Paraná, Curitiba, PR, Brazil.
  • Arnon Bruno Ventrilho Dos Santos
    Health Technology Program, Pontifical Catholic University of Paraná, Curitiba, PR, Brazil.
  • Lilian Mie Mukai Cintho
    Health Technology Program, Pontifical Catholic University of Paraná, Curitiba, PR, Brazil.
  • Deborah Ribeiro Carvalho
    Health Technology Program, Pontifical Catholic University of Paraná, Curitiba, PR, Brazil.
  • Sadid A Hasan
    Philips Research North America, New York, United States.
  • Claudia Maria Cabral Moro
    Health Technology Program, Pontifical Catholic University of Paraná, Curitiba, PR, Brazil.