Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit.

Journal: Artificial intelligence in medicine

Abstract

Electronic health records (EHRs) contain large volumes of unstructured text, requiring the application of information extraction (IE) technologies to enable clinical analysis. We present the open-source Medical Concept Annotation Toolkit (MedCAT), which provides: (a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary, including UMLS/SNOMED-CT; (b) a feature-rich annotation interface for customizing and training IE models; and (c) integrations with the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1: 0.448-0.738 vs. 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals, with self-supervised training over ∼8.8B words from ∼17M clinical records and further fine-tuning with ∼6K clinician-annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets, and concept types, indicating cross-domain, EHR-agnostic utility for accelerated clinical and research use cases.

Authors

  • Zeljko Kraljevic
    Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK.
  • Thomas Searle
    Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.
  • Anthony Shek
    Department of Clinical Neuroscience, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK.
  • Lukasz Roguski
  • Kawsar Noor
    Health Data Research UK London, University College London, London, UK; Institute of Health Informatics, University College London, London, UK; NIHR BRC Clinical Research Informatics Unit, University College London Hospitals, NHS Foundation Trust, London, UK.
  • Daniel Bean
    Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Health Data Research UK London, University College London, London, UK.
  • Aurelie Mascio
    Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.
  • Leilei Zhu
    Department of Endocrinology and Metabolism, West China Hospital, Sichuan University, Chengdu 610041, Sichuan Province, China. Electronic address: zhuleilei2018@163.com.
  • Amos A Folarin
    Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Institute of Health Informatics, University College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.
  • Angus Roberts
    Department of Computer Science, University of Sheffield, Sheffield, UK.
  • Rebecca Bendayan
    Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.
  • Mark P Richardson
    Department of Clinical Neuroscience, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK.
  • Robert Stewart
    Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom.
  • Anoop D Shah
    Farr Institute of Health Informatics Research, Institute of Health Informatics, University College London, London, United Kingdom.
  • Wai Keong Wong
    Institute of Health Informatics, University College London, London, UK; NIHR BRC Clinical Research Informatics Unit, University College London Hospitals, NHS Foundation Trust, London, UK.
  • Zina Ibrahim
    Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK.
  • James T Teo
    Department of Clinical Neuroscience, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Department of Neurology, King's College Hospital NHS Foundation Trust, London, UK.
  • Richard J B Dobson
    Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), King's College London, UK.