Drug-drug interaction identification using large language models

Journal: medRxiv

Published Date: Jan 1, 2025

Abstract

Drug-drug interactions (DDIs) are a significant source of morbidity and adverse drug events (ADEs), particularly in polypharmacy and complex medication regimens. Rules-based software integrated into electronic health records (EHRs) has proven proficient at identifying DDIs in medication regimens. DDI identification by large language models (LLMs), however, requires thorough benchmarking and performance evaluation on high-quality datasets before it can be used safely.

We evaluated three LLMs (GPT-4o-mini, MedGemma-27B, and LLaMA3-70B) on a clinician-annotated benchmark dataset of 750 DDI scenarios spanning three levels of diagnostic complexity. Tasks were structured in three judgment formats: (1) a pointwise two-drug classification task, (2) a pairwise three-drug discrimination task, and (3) a listwise 4–6 drug selection task. All models received standardized zero-shot prompts with task-specific instructions. Performance was assessed using precision, recall, F1 score, and accuracy. Reliability was quantified using self-consistency across repeated runs and confidence-aligned metrics, which capture the stability of model reasoning.

Across the three experiments, model performance varied by task structure and interaction severity. LLaMA3-70B achieved the highest recall and F1 score in the pointwise task, whereas GPT-4o-mini achieved superior accuracy and consistency in the pairwise and listwise tasks. MedGemma-27B showed competitive performance in identifying Category D interactions. Self-consistency decreased as task complexity increased, indicating reduced stability in multi-drug reasoning. No model exhibited uniformly high reliability across all judgment formats.

Current LLMs show promising but uneven capabilities in identifying DDIs across clinically relevant task structures. Performance degrades as the reasoning space expands, and stability across repeated queries remains limited. These findings emphasize the need for multi-format evaluation frameworks and reliability-aware assessment when considering LLMs for medication-safety applications.
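The abstract names its evaluation metrics without giving formulas. The following Python sketch shows one way they could be computed for the pointwise two-drug task. It assumes binary labels (1 = interaction present, 0 = absent) and defines self-consistency, as an assumption since the abstract does not specify it, as the share of repeated runs that agree with the most common answer. The function names `binary_metrics` and `self_consistency` are hypothetical and are not taken from the study.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy, treating label 1 ("interaction present") as positive.

    Hypothetical helper: the binary labelling scheme is an assumption, not the study's protocol.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # interaction correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed interaction
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": correct / len(pairs)}


def self_consistency(runs: Sequence[Hashable]) -> float:
    """Fraction of repeated runs agreeing with the modal answer (assumed definition)."""
    if not runs:
        raise ValueError("need at least one run")
    _, modal_count = Counter(runs).most_common(1)[0]
    return modal_count / len(runs)


if __name__ == "__main__":
    # Illustrative toy data only, not results from the study.
    print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
    print(self_consistency(["D", "D", "X", "D", "C"]))  # 3 of 5 runs agree on "D"
```

With the toy data above, two interactions are caught, one is missed, and one is falsely flagged, so precision, recall, and F1 are each 2/3 and accuracy is 0.6. Self-consistency is computed per scenario and could then be averaged across scenarios; a multi-class severity label (for example, risk categories such as Category D) can be passed to `self_consistency` unchanged because it only compares answers for equality.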

Authors

Kaitlin Blotske; Xingmeng Zhao; Kelli Henry; Yanjun Gao; Adeleine Tilley; Moriah Cargile; Brian Murray; Susan E. Smith; Erin F. Barreto; Seth Bauer; Sunghwan Sohn; Tianming Liu; Tell Bennett; Mitch Cohen; Andrea Sikora
