Comparing natural language processing representations of coded disease sequences for prediction in electronic health records.

Journal: Journal of the American Medical Informatics Association : JAMIA
PMID:

Abstract

OBJECTIVE: Natural language processing (NLP) algorithms are increasingly being applied to obtain unsupervised representations of electronic health record (EHR) data, but their comparative performance in predicting clinical endpoints remains unclear. Our objective was to compare how well unsupervised representations of disease code sequences, generated by bag-of-words versus sequence-based NLP algorithms, predict clinically relevant outcomes.
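
The distinction between bag-of-words and sequence-based representations can be illustrated with a minimal sketch. It is not the authors' method: the patient histories, the code labels, and the helper names `bag_of_words` and `ordered_bigrams` are all hypothetical, and adjacent code pairs are used only as a simple stand-in for the order information a sequence-based algorithm could exploit.

```python
from collections import Counter

# Hypothetical patient histories: ordered lists of made-up disease code labels.
patient_a = ["HYPERTENSION", "DIABETES", "CKD"]
patient_b = ["CKD", "DIABETES", "HYPERTENSION"]


def bag_of_words(codes, vocabulary):
    """Count how often each code occurs; the order of codes is discarded."""
    counts = Counter(codes)
    return [counts[code] for code in vocabulary]


def ordered_bigrams(codes):
    """Return adjacent code pairs, an illustrative order-aware view of a sequence."""
    return list(zip(codes, codes[1:]))


# Shared vocabulary, sorted so that vector positions are stable.
vocabulary = sorted(set(patient_a) | set(patient_b))

bow_a = bag_of_words(patient_a, vocabulary)
bow_b = bag_of_words(patient_b, vocabulary)

# The two histories contain the same codes, so their count vectors match.
same_bag = bow_a == bow_b

# The codes occur in a different order, so the adjacent pairs differ.
same_order = ordered_bigrams(patient_a) == ordered_bigrams(patient_b)

print(same_bag, same_order)
```

In this toy case the two patients are indistinguishable under the count vector but distinguishable once order is retained, which is the kind of difference the comparison described above is concerned with.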

Authors

  • Thomas Beaney
    Department of Primary Care and Public Health, Imperial College London, London, W12 0BZ, United Kingdom.
  • Sneha Jha
    Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London, London, SW7 2AZ, United Kingdom.
  • Asem Alaa
    Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London, London, SW7 2AZ, United Kingdom.
  • Alexander Smith
    Department of Forensic Psychiatry, University of Bern, Switzerland.
  • Jonathan Clarke
    Centre for Mathematics of Precision Healthcare, Department of Mathematics, Imperial College London, London, UK.
  • Thomas Woodcock
    Department of Primary Care and Public Health, Imperial College London, London, W12 0BZ, United Kingdom.
  • Azeem Majeed
    Imperial College London, London, UK.
  • Paul Aylin
  • Mauricio Barahona