Grade Guard: A Smart System for Short Answer Automated Grading
Journal: arXiv
Published Date: Apr 1, 2025
Abstract
The advent of large language models (LLMs) in the education sector has spurred
efforts to automate the grading of short-answer questions. LLMs make
evaluating short answers very efficient, thus addressing issues such as staff
shortages. However, in the task of Automated Short Answer Grading (ASAG), LLM
responses are influenced by the diverse perspectives in their training data,
leading to inaccuracies in evaluating nuanced or partially correct answers. To
address this challenge, we propose a novel framework, Grade Guard, which makes
the following contributions:
1. To enhance the task-based specialization of the LLMs, the temperature
parameter is tuned using Root Mean Square Error (RMSE).
2. Unlike traditional approaches, the LLMs in Grade Guard compute an
Indecisiveness Score (IS) alongside each grade to reflect the uncertainty of
the predicted grade.
3. A Confidence-Aware Loss (CAL) is introduced to generate an optimized
Indecisiveness Score (IS).
4. To improve reliability, self-reflection based on the optimized IS is
introduced into the framework, enabling human re-evaluation to minimize
incorrect grade assignments.
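
As an illustration of points 1 and 4 only, the following is a minimal sketch
and not the implementation used in Grade Guard. It assumes a hypothetical
grader callable that returns a (grade, indecisiveness score) pair, picks the
temperature with the lowest RMSE from a candidate grid, and flags answers for
human re-evaluation when the IS exceeds a threshold. The names
`select_temperature`, `needs_human_review`, `toy_grader`, and the threshold
value are assumptions introduced here; CAL is not modeled.

```python
import math
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical grader signature: (answer, temperature) -> (grade, indecisiveness score)
Grader = Callable[[str, float], Tuple[float, float]]


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root Mean Square Error between predicted and reference grades."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def select_temperature(
    grade_fn: Grader,
    answers: Sequence[str],
    reference_grades: Sequence[float],
    temperatures: Sequence[float],
) -> Tuple[float, float]:
    """Return the (temperature, RMSE) pair with the lowest RMSE on the grid."""
    best: Optional[Tuple[float, float]] = None
    for t in temperatures:
        predicted = [grade_fn(answer, t)[0] for answer in answers]
        score = rmse(predicted, reference_grades)
        if best is None or score < best[1]:
            best = (t, score)
    if best is None:
        raise ValueError("temperatures must not be empty")
    return best


def needs_human_review(indecisiveness: float, threshold: float) -> bool:
    """Flag a grade for human re-evaluation when the IS exceeds the threshold."""
    return indecisiveness > threshold


def toy_grader(answer: str, temperature: float) -> Tuple[float, float]:
    """Deterministic stand-in for an LLM call (assumption, for demonstration).

    The grade is the word count plus an error that grows with the distance of
    the temperature from 0.4; the same error doubles as the IS.
    """
    error = abs(temperature - 0.4)
    return float(len(answer.split())) + error, error


if __name__ == "__main__":
    answers = ["a b", "a b c"]
    reference = [2.0, 3.0]
    best_t, best_rmse = select_temperature(toy_grader, answers, reference, [0.0, 0.4, 0.8])
    print(best_t, best_rmse)
    print(needs_human_review(toy_grader("a b", 0.0)[1], threshold=0.3))
```

In practice `toy_grader` would be replaced by a call to the chosen LLM, and
the review threshold would be chosen on held-out data rather than fixed by
hand.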
Our experiments show that the best configuration of Grade Guard outperforms
traditional methods in RMSE by 19.16% with Upstage Solar Pro, 23.64% with
Upstage Solar Mini, 4.00% with Gemini 1.5 Flash, and 10.20% with GPT-4o mini.
Future work includes generating rationales for grades to improve
interpretability and accuracy. Expanding benchmark datasets and annotating them
with domain-specific nuances should further improve grading accuracy. Finally,
analyzing feedback to increase confidence in predicted grades, reduce bias,
refine grading criteria, and personalize learning, together with support for
multilingual grading, would make the solution more accurate, adaptable, fair,
and inclusive.