Evaluating the Quality of AI-Generated Rubrics and Grading Reliability in Student Written Reflection Assignments.

Journal: American Journal of Pharmaceutical Education
Published Date:

Abstract

OBJECTIVE: To investigate whether generative artificial intelligence (genAI) can improve pharmacy assessment practices, specifically rubric development and grading of student written reflection assignments. METHODS: Sample student written reflections from an advanced diabetes management elective course were used. In phase I, ChatGPT, Copilot, and Gemini each generated grading rubrics from both a basic and an advanced prompt. Five faculty evaluators independently assessed the six blinded rubrics using a seven-item scoring tool. Rubric quality was analyzed using interrater reliability and descriptive statistics. In phase II, 45 deidentified student reflections were graded with the Copilot-generated rubric (advanced prompt) by three genAI platforms and two human evaluators. Scoring consistency across graders was analyzed using ANOVA, Tukey's post hoc tests, and intraclass correlation coefficients (ICC). RESULTS: Rubric quality differed significantly across genAI platforms, with moderate interrater reliability among faculty evaluators. Rubric quality did not differ significantly by prompt specificity. Significant differences were observed among the student reflection scores assigned by the five grading entities. Gemini yielded the lowest agreement with other AI platforms and human graders in both phases. CONCLUSION: GenAI can generate rubrics and score reflection assignments; however, variability across platforms and inconsistent agreement with human graders underscore the need for careful validation before educational use.
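
As an illustration of the agreement analysis named in METHODS, the sketch below computes one common form of the intraclass correlation coefficient, ICC(2,1) (two-way random effects, absolute agreement, single rater), from a table of scores with one row per reflection and one column per grader. The abstract does not state which ICC form the authors used, so the choice of ICC(2,1) is an assumption, the function name `icc2_1` is hypothetical, and the scores in the usage note are invented for illustration and are not study data.

```python
from statistics import mean


def icc2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: list of rows, one row per graded item (e.g. a reflection),
            each row holding one score per grader, in the same grader order.
    Assumption: this ICC form is chosen for illustration only; the source
    abstract does not say which form was used.
    """
    n = len(scores)       # number of graded items
    k = len(scores[0])    # number of graders

    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    # Two-way ANOVA sums of squares without replication
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between items
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between graders
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    # Undefined (division by zero) when every score in the table is identical
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

For example, `icc2_1([[1, 2], [2, 3], [3, 4]])` describes two hypothetical graders who rank three items identically while one consistently scores a point higher. The absolute-agreement form penalizes that systematic offset, which is the kind of between-grader difference an ANOVA with post hoc comparisons would also flag.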

Authors
