Assessment of a zero-shot large language model in measuring documented goals-of-care discussions

Journal: medRxiv

Published Date: Jan 1, 2025

Abstract

Goals-of-care (GOC) discussions and their documentation are important process measures in palliative care. However, existing natural language processing (NLP) models for identifying such documentation require costly task-specific training data. Large language models (LLMs) hold promise for measuring such constructs with little or no task-specific training data. Our objective was to evaluate the performance of a publicly available LLM with no task-specific training data (zero-shot prompting) in identifying documented GOC discussions. We compared the performance of two NLP models in identifying documented GOC discussions: Llama 3.3 using zero-shot prompting, and a task-specific BERT (Bidirectional Encoder Representations from Transformers)-based model trained on 4,642 manually annotated notes. We tested both models on records from a series of clinical trials enrolling adult patients with chronic life-limiting illness hospitalized from 2018 to 2023. We evaluated the area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (AUPRC), and maximal F1 score for both note-level and patient-level classification over a 30-day period. In our text corpora, GOC documentation represented <1% of text and was found in 7.3-9.9% of notes for 23-37% of patients. In a 617-patient held-out test set, Llama 3.3 (zero-shot) and BERT (task-specific, trained) exhibited comparable performance in identifying GOC documentation (Llama 3.3: AUC 0.979, AUPRC 0.873, and F1 0.83; BERT: AUC 0.981, AUPRC 0.874, and F1 0.83). A zero-shot large language model with no task-specific training performed similarly to a task-specific trained BERT model in identifying documented goals-of-care discussions. This finding demonstrates the promise of LLMs in measuring novel clinical research outcomes.
This article reports the performance of a publicly available large language model with no task-specific training data in measuring the occurrence of documented goals-of-care discussions from electronic health records. The study demonstrates that newer large language models may allow investigators to measure novel outcomes without requiring costly training data.
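
For readers less familiar with the evaluation metrics named in the abstract, the following is a minimal sketch of how AUC, AUPRC (approximated here as average precision), and maximal F1 could be computed at the note level and then at the patient level. It is not the authors' code. The function names, the toy data, and in particular the patient-level aggregation rule (taking the maximum note score per patient, and counting a patient as positive if any note is positive) are assumptions made for illustration; the abstract does not state how patient-level scores were derived.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """AUC via the rank-sum formulation: the probability that a random
    positive outscores a random negative (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision, a common estimate of AUPRC.
    Tied scores are not specially handled in this sketch."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / true_pos


def max_f1(labels, scores):
    """Largest F1 over all thresholds drawn from the observed scores."""
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(labels, scores) if s < t and y == 1)
        if tp:
            best = max(best, 2 * tp / (2 * tp + fp + fn))
    return best


def patient_level(patient_ids, labels, scores):
    """ASSUMED aggregation: a patient's score is the max note score,
    and a patient is positive if any of their notes is positive."""
    agg_score = defaultdict(float)
    agg_label = defaultdict(int)
    for pid, y, s in zip(patient_ids, labels, scores):
        agg_score[pid] = max(agg_score[pid], s)
        agg_label[pid] = max(agg_label[pid], y)
    pids = sorted(agg_score)
    return [agg_label[p] for p in pids], [agg_score[p] for p in pids]


# Hypothetical toy data: five notes from three patients.
patient_ids = ["a", "a", "b", "b", "c"]
note_labels = [0, 1, 0, 0, 1]            # 1 = note documents a GOC discussion
note_scores = [0.1, 0.9, 0.2, 0.8, 0.7]  # model-assigned probabilities

note_auc = roc_auc(note_labels, note_scores)
note_auprc = average_precision(note_labels, note_scores)
note_f1 = max_f1(note_labels, note_scores)

pt_labels, pt_scores = patient_level(patient_ids, note_labels, note_scores)
pt_auc = roc_auc(pt_labels, pt_scores)
```

In practice these metrics would more likely be computed with an established library; the hand-written versions here only make the definitions explicit.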

Authors

Robert Y. Lee; Kevin S. Li; James Sibley; Trevor Cohen; William B. Lober; Danae G. Dotolo; Erin K. Kross

