Evaluation of the phi-3-mini SLM for identification of texts related to medicine, health, and sports injuries
Journal: arXiv
Published Date: Mar 31, 2025
Abstract
Small Language Models (SLMs) have the potential to automatically label and
identify aspects of text data for medicine/health-related purposes from
documents and the web. Because their resource requirements are significantly
lower than those of Large Language Models (LLMs), they can potentially be
deployed on more types of devices. SLMs are often benchmarked on
health/medicine-related tasks, such as MedQA, although performance on these can
vary especially depending on the size of the model in terms of number of
parameters. Furthermore, these test results may not necessarily reflect
real-world performance regarding the automatic labelling or identification of
texts in documents and the web. We therefore compared topic-relatedness
scores from Microsoft's phi-3-mini-4k-instruct SLM to the topic-relatedness
scores from 7 human evaluators on 1144 samples of medical/health-related texts
and 1117 samples of sports injury-related texts. These texts were from a larger
dataset of about 9 million news headlines, each of which was processed and
assigned scores by phi-3-mini-4k-instruct. Our sample was selected (filtered)
based on one (low filtering) or more (high filtering) Boolean conditions on the
phi-3 SLM scores. We found low-to-moderate significant correlations between the
scores from the SLM and human evaluators for sports injury texts with low
filtering (ρ = 0.3413, p < 0.001) and medicine/health texts with high
filtering (ρ = 0.3854, p < 0.001), and low significant correlation for
medicine/health texts with low filtering (ρ = 0.2255, p < 0.001). There
was a negligible, non-significant correlation for sports injury-related texts
with high filtering (ρ = 0.0318, p = 0.4466).
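
The comparison described above can be illustrated with a minimal Python sketch. It makes several assumptions not stated in the abstract: that the reported correlation coefficient is Spearman's rank correlation, that each headline carries numeric SLM scores and a human score, and that filtering means keeping records that satisfy every Boolean condition in a list. The field names (`slm_health`, `slm_injury`, `human`), the thresholds, and the sample records are all hypothetical and are not taken from the study.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def filter_samples(samples, conditions):
    """Keep the samples for which every Boolean condition holds."""
    return [s for s in samples if all(cond(s) for cond in conditions)]

# Hypothetical records: SLM scores per topic plus one human score.
records = [
    {"slm_health": 9, "slm_injury": 1, "human": 8},
    {"slm_health": 7, "slm_injury": 2, "human": 9},
    {"slm_health": 6, "slm_injury": 0, "human": 4},
    {"slm_health": 2, "slm_injury": 8, "human": 1},
    {"slm_health": 8, "slm_injury": 6, "human": 6},
]

# "Low filtering": one condition. "High filtering": more than one condition.
# These thresholds are illustrative assumptions only.
low = filter_samples(records, [lambda r: r["slm_health"] >= 5])
high = filter_samples(
    records,
    [lambda r: r["slm_health"] >= 5, lambda r: r["slm_injury"] <= 2],
)

rho_low = spearman_rho([r["slm_health"] for r in low], [r["human"] for r in low])
rho_high = spearman_rho([r["slm_health"] for r in high], [r["human"] for r in high])
```

The sketch omits the p-value; in practice a statistics library would typically supply both the coefficient and its significance test.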