FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models
Journal:
arXiv
Published Date:
Apr 20, 2025
Abstract
Research on evaluating and analyzing large language models (LLMs) has been
extensive for resource-rich languages such as English, yet their performance in
languages such as Persian has received considerably less attention. This paper
introduces FarsEval-PKBETS benchmark, a subset of FarsEval project for
evaluating large language models in Persian. This benchmark consists of 4000
questions and answers in various formats, including multiple choice, short
answer and descriptive responses. It covers a wide range of domains and
tasks,including medicine, law, religion, Persian language, encyclopedic
knowledge, human preferences, social knowledge, ethics and bias, text
generation, and respecting others' rights. This bechmark incorporates
linguistics, cultural, and local considerations relevant to the Persian
language and Iran. To ensure the questions are challenging for current LLMs,
three models -- Llama3-70B, PersianMind, and Dorna -- were evaluated using this
benchmark. Their average accuracy was below 50%, meaning they provided fully
correct answers to fewer than half of the questions. These results indicate
that current language models are still far from being able to solve this
benchmark