Creating, anonymizing and evaluating the first medical language model pre-trained on Dutch Electronic Health Records: MedRoBERTa.nl.

Journal: Artificial Intelligence in Medicine
Published Date:

Abstract

Electronic Health Records (EHRs) contain notes written by a wide range of medical professionals about all aspects of a patient's well-being. When adequately processed with a Large Language Model (LLM), this enormous source of information can be analyzed quantitatively, which can lead to new insights, for example into treatment development or patterns of patient recovery. However, the language used in clinical notes is highly idiosyncratic, and available generic LLMs have not encountered it during pre-training. They have therefore not internalized an adequate representation of the semantics of these data, which is essential for building reliable Natural Language Processing (NLP) software. This article describes the development of the first domain-specific LLM for Dutch EHRs: MedRoBERTa.nl. We discuss in detail why and how we built our model, pre-training it on the notes in EHRs using different strategies, and how we were able to release it publicly by thoroughly anonymizing it. We evaluate our model extensively, comparing it to various other LLMs, and illustrate how it can be used by discussing several studies that built medical text-mining technology on top of it.
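
Since the abstract states that the model was anonymized and released publicly, a minimal sketch of how such a RoBERTa-style checkpoint could be queried is shown below, using the Hugging Face `transformers` fill-mask pipeline. The repository identifier `CLTL/MedRoBERTa.nl` and the example sentence are illustrative assumptions, not details given in the abstract; substitute whatever identifier the article specifies.

```python
# Minimal sketch: querying a publicly released RoBERTa-style checkpoint
# with the Hugging Face `transformers` fill-mask pipeline.
# NOTE: the repository id "CLTL/MedRoBERTa.nl" is an assumption about where
# the model is hosted, not a detail stated in the abstract.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="CLTL/MedRoBERTa.nl")

# A masked-language model predicts the token hidden behind its mask token.
# Illustrative Dutch clinical-style sentence (not taken from real EHR data):
sentence = f"De patiënt werd behandeld met {fill_mask.tokenizer.mask_token}."

# Print the three most probable completions with their scores.
for prediction in fill_mask(sentence, top_k=3):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```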

Authors

  • Stella Verkijk
    Vrije Universiteit Amsterdam, The Netherlands. Electronic address: s.verkijk@vu.nl.
  • Piek Vossen
    Computational Lexicology and Terminology Lab, Faculty of Humanities, Vrije Universiteit Amsterdam, The Netherlands.