Lessons learned on information retrieval in electronic health records: a comparison of embedding models and pooling strategies.

Journal: Journal of the American Medical Informatics Association : JAMIA
PMID:

Abstract

OBJECTIVES: Applying large language models (LLMs) to the clinical domain is challenging due to the context-heavy nature of processing medical records. Retrieval-augmented generation (RAG) offers a solution by facilitating reasoning over large text sources. However, the retrieval system alone has many parameters to optimize. This paper presents an ablation study exploring how different embedding models and pooling methods affect information retrieval in the clinical domain.
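
For readers unfamiliar with the term, "pooling" here refers to collapsing per-token embedding vectors into a single vector per text before similarity search. The sketch below is an illustration only, not the authors' pipeline: the abstract does not say which pooling methods or similarity measure were compared, so mean pooling, max pooling, and cosine similarity are assumptions. The function names and toy vectors are likewise hypothetical.

```python
import math

def mean_pool(token_vectors):
    """Average each dimension across all token vectors."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def max_pool(token_vectors):
    """Take the per-dimension maximum across all token vectors."""
    return [max(column) for column in zip(*token_vectors)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query_vector, document_vectors):
    """Return document indices ordered from most to least similar to the query."""
    return sorted(
        range(len(document_vectors)),
        key=lambda i: cosine(query_vector, document_vectors[i]),
        reverse=True,
    )

# Toy token embeddings standing in for the output of an embedding model
# (hypothetical values; a real model would produce hundreds of dimensions).
note_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled_mean = mean_pool(note_tokens)
pooled_max = max_pool(note_tokens)

# Rank two toy document vectors against a toy query vector.
order = rank([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1]])
```

Because the pooled vector is what gets indexed and compared, the choice of pooling function changes which notes are retrieved even when the embedding model is held fixed. That is why the two can be varied independently in an ablation.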

Authors

  • Skatje Myers
    Department of Medicine, University of Wisconsin-Madison, Madison, WI 53726, United States.
  • Timothy A Miller
    Boston Children's Hospital, 300 Longwood Avenue, Boston, MA 02115, United States; Harvard Medical School, 25 Shattuck St, Boston, MA 02115, United States.
  • Yanjun Gao
    Department of Biomedical Informatics, University of Colorado-Anschutz Medical, Aurora, CO 80045, United States.
  • Matthew M Churpek
    Department of Medicine, University of Wisconsin-Madison, Madison, WI, United States.
  • Anoop Mayampurath
    Department of Medicine, University of Wisconsin-Madison, Madison, WI, United States.
  • Dmitriy Dligach
    Department of Public Health Sciences, Stritch School of Medicine, Loyola University Chicago, Maywood, IL.
  • Majid Afshar
    Loyola University Chicago, Chicago, IL.