Beyond Scale: Small Language Models are Comparable to GPT-4 in Mental Health Understanding
Journal:
arXiv
Published Date:
Jul 9, 2025
Abstract
The emergence of Small Language Models (SLMs) as privacy-preserving
alternatives for sensitive applications raises a fundamental question about
their inherent understanding capabilities compared to Large Language Models
(LLMs). This paper investigates the mental health understanding capabilities of
current SLMs through systematic evaluation across diverse classification tasks.
Employing zero-shot and few-shot learning paradigms, we benchmark their
performance against established LLM baselines to elucidate their relative
strengths and limitations in this critical domain. We assess five
state-of-the-art SLMs (Phi-3, Phi-3.5, Qwen2.5, Llama-3.2, Gemma2) against
three LLMs (GPT-4, FLAN-T5-XXL, Alpaca-7B) on six mental health understanding
tasks. Our findings reveal that SLMs achieve mean performance within 2\% of
LLMs on binary classification tasks (F1 scores of 0.64 vs 0.66 in zero-shot
settings), demonstrating notable competence despite orders of magnitude fewer
parameters. Both model categories experience similar degradation on multi-class
severity tasks (a drop of over 30\%), suggesting that nuanced clinical
understanding challenges transcend model scale. Few-shot prompting provides
substantial improvements for SLMs (up to 14.6\%), while LLM gains are more
variable. Our work highlights the potential of SLMs in mental health
understanding, showing they can be effective privacy-preserving tools for
analyzing sensitive online text data. In particular, their ability to quickly
adapt and specialize with minimal data through few-shot learning positions them
as promising candidates for scalable mental health screening tools.