Biomedical text normalization through generative modeling.

Journal: Journal of biomedical informatics
Published Date:

Abstract

OBJECTIVE: A large proportion of electronic health record (EHR) data consists of unstructured medical language text. The formatting of this text is often flexible and inconsistent, making it challenging to use for predictive modeling, clinical decision support, and data mining. The ability of large language models (LLMs) to understand context and semantic variations makes them promising tools for standardizing medical text. In this study, we develop and assess clinical text normalization pipelines built using LLMs.

Authors

  • Jacob S Berkowitz
    Center for Systems Immunology, Departments of Immunology and Computational & Systems Biology.
  • Apoorva Srinivasan
    Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N San Vicente Blvd, Pacific Design Center Suite G540, West Hollywood, CA 90069 United States.
  • Jose Miguel Acitores Cortina
    Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N San Vicente Blvd, Pacific Design Center Suite G540, West Hollywood, CA 90069 United States.
  • Yasaman Fatapour
    Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N San Vicente Blvd, Pacific Design Center Suite G540, West Hollywood, CA 90069 United States.
  • Nicholas P Tatonetti
    Departments of Biomedical Informatics, Systems Biology, and Medicine, Columbia University, 622 West 168th St VC5, New York, NY 10032, USA. Electronic address: nick.tatonetti@columbia.edu.