Creating, anonymizing and evaluating the first medical language model pre-trained on Dutch Electronic Health Records: MedRoBERTa.nl.

Journal: Artificial Intelligence in Medicine
Published Date:

Abstract

Electronic Health Records (EHRs) contain notes written by a wide range of medical professionals about all aspects of a patient's well-being. When adequately processed with a Large Language Model (LLM), this enormous source of information can be analyzed quantitatively, which can lead to new insights, for example into treatment development or patterns of patient recovery. However, the language used in clinical notes is highly idiosyncratic, and available generic LLMs have not encountered it during pre-training. They have therefore not internalized an adequate representation of the semantics of these data, which is essential for building reliable Natural Language Processing (NLP) software. This article describes the development of the first domain-specific LLM for Dutch EHRs: MedRoBERTa.nl. We discuss in detail why and how we built our model, pre-training it on the notes in EHRs using different strategies, and how we were able to release it publicly by thoroughly anonymizing it. We evaluate our model extensively, comparing it to various other LLMs, and illustrate how it can be used by discussing several studies that built medical text-mining technology on top of it.
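
Since the abstract states that the model was anonymized and released publicly, a minimal sketch of how such a RoBERTa-style checkpoint could be queried is shown below, using the Hugging Face `transformers` fill-mask pipeline. The repository identifier `CLTL/MedRoBERTa.nl` and the example sentence are illustrative assumptions, not details given in the abstract; substitute whatever identifier the article specifies.

```python
# Minimal sketch: querying a publicly released RoBERTa-style checkpoint
# with the Hugging Face `transformers` fill-mask pipeline.
# NOTE: the repository id "CLTL/MedRoBERTa.nl" is an assumption about where
# the model is hosted, not a detail stated in the abstract.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="CLTL/MedRoBERTa.nl")

# A masked-language model predicts the token hidden behind its mask token.
# Illustrative Dutch clinical-style sentence (not taken from real EHR data):
sentence = f"De patiënt werd behandeld met {fill_mask.tokenizer.mask_token}."

# Print the three most probable completions with their scores.
for prediction in fill_mask(sentence, top_k=3):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```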

Authors

  • Stella Verkijk
    Vrije Universiteit Amsterdam, The Netherlands. Electronic address: s.verkijk@vu.nl.
  • Piek Vossen
    Computational Lexicology and Terminology Lab, Faculty of Humanities, Vrije Universiteit Amsterdam, The Netherlands.