Sharing models and tools for processing German clinical texts.

Journal: Studies in health technology and informatics

Abstract

The automatic processing of non-English clinical documents is severely hampered by the lack of publicly available medical language resources for training, testing and evaluating NLP components. We suggest sharing statistical models derived from access-protected clinical documents as a reasonable substitute and provide solutions for sentence splitting, tokenization and part-of-speech (POS) tagging of German clinical texts. These three components were trained on the confidential FRAMED corpus, a non-sharable collection of various German-language clinical document types. The resulting models outperform alternative components from OPENNLP and the Stanford POS tagger that were likewise trained on FRAMED.
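
The idea of sharing a model instead of the underlying documents can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy corpus, the STTS-style tags, the regex-based splitter and tokenizer, and all function names are assumptions made for illustration, and a most-frequent-tag lexicon stands in for a real statistical tagger.

```python
import json
import re
from collections import Counter, defaultdict

# Toy stand-in for an access-protected corpus (invented sentences, not FRAMED data).
PROTECTED_CORPUS = [
    [("Der", "ART"), ("Patient", "NN"), ("klagt", "VVFIN"),
     ("über", "APPR"), ("Schmerzen", "NN"), (".", "$.")],
    [("Die", "ART"), ("Patientin", "NN"), ("erhält", "VVFIN"),
     ("Ibuprofen", "NE"), (".", "$.")],
]


def train_tagger(tagged_sentences):
    """Derive a most-frequent-tag lexicon; the sentences themselves are not kept."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for token, tag_label in sentence:
            counts[token.lower()][tag_label] += 1
    lexicon = {tok: tags.most_common(1)[0][0] for tok, tags in counts.items()}
    # Unknown words fall back to a default tag (hypothetical choice: common noun).
    return {"lexicon": lexicon, "default_tag": "NN"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def tag(model, tokens):
    """Look each token up in the shared lexicon, using the default tag if unseen."""
    lexicon = model["lexicon"]
    return [(t, lexicon.get(t.lower(), model["default_tag"])) for t in tokens]


# Data holder side: train on the protected data, serialize only the model.
shared_model = json.dumps(train_tagger(PROTECTED_CORPUS), ensure_ascii=False)

# Recipient side: load the model and process new text without corpus access.
model = json.loads(shared_model)
for sentence in split_sentences("Der Patient erhält Ibuprofen. Die Patientin klagt."):
    print(tag(model, tokenize(sentence)))
```

The serialized model here still exposes the vocabulary it was trained on, so this sketch only shows the workflow of training behind the access barrier and distributing the derived artifact.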

Authors

  • Johannes Hellrich
    Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany.
  • Franz Matthies
    Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany.
  • Erik Faessler
    Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany.
  • Udo Hahn
    Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany.