Benchmarking Open-Source Large Language Models on Healthcare Text Classification Tasks
Journal: arXiv
Published Date: Mar 19, 2025
Abstract
The application of large language models (LLMs) to healthcare information
extraction has emerged as a promising approach. This study evaluates the
classification performance of five open-source LLMs: GEMMA-3-27B-IT,
LLAMA3-70B, LLAMA4-109B, DEEPSEEK-R1-DISTILL-LLAMA-70B, and
DEEPSEEK-V3-0324-UD-Q2_K_XL, across six healthcare-related classification tasks
involving both social media data (breast cancer, changes in medication regimen,
adverse pregnancy outcomes, potential COVID-19 cases) and clinical data (stigma
labeling, medication change discussion). We report precision, recall, and F1
scores with 95% confidence intervals for all model-task combinations. Our
findings reveal significant performance variability between LLMs, with
DEEPSEEK-V3-0324-UD-Q2_K_XL emerging as the strongest overall performer,
achieving the highest F1 scores in four of the six tasks. Notably, models
generally performed better on social media tasks than on clinical data tasks,
suggesting potential
domain-specific challenges. GEMMA-3-27B-IT demonstrated exceptionally high
recall despite its smaller parameter count, while LLAMA4-109B showed
surprisingly underwhelming performance compared to its predecessor LLAMA3-70B,
indicating that larger parameter counts do not guarantee improved
classification results. We observed distinct precision-recall trade-offs across
models, with some favoring recall over precision and others the reverse. These
findings highlight the importance of task-specific model selection for
healthcare applications, considering the particular data domain and
precision-recall requirements rather than model size alone. As healthcare
increasingly integrates AI-driven text classification tools, this comprehensive
benchmarking provides valuable guidance for model selection and implementation
while underscoring the need for continued evaluation and domain adaptation of
LLMs in healthcare contexts.
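
To make the reported metrics concrete, the following is a minimal sketch of how precision, recall, and F1 with 95% confidence intervals could be computed for one model-task combination. The abstract does not state how the intervals were derived, so the percentile bootstrap used here is an assumption, and the function names `precision_recall_f1` and `bootstrap_ci` are hypothetical, not taken from the paper.

```python
import random


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (e.g. no positive predictions).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap intervals (assumed method, not the paper's)."""
    rng = random.Random(seed)
    n = len(y_true)
    samples = []
    for _ in range(n_resamples):
        # Resample example indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        samples.append(
            precision_recall_f1([y_true[i] for i in idx],
                                [y_pred[i] for i in idx])
        )
    intervals = {}
    for k, name in enumerate(("precision", "recall", "f1")):
        values = sorted(s[k] for s in samples)
        lo = values[int((alpha / 2) * n_resamples)]
        hi = values[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
        intervals[name] = (lo, hi)
    return intervals
```

Calling `precision_recall_f1(gold_labels, model_labels)` gives the point estimates, and `bootstrap_ci(gold_labels, model_labels)` gives a lower and upper bound per metric. Here `gold_labels` and `model_labels` are placeholder names for the annotated labels and the labels parsed from an LLM's output on the same examples.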