Using aggregated AI detector outcomes to eliminate false positives in STEM-student writing.

Journal: Advances in Physiology Education
PMID:

Abstract

Generative artificial intelligence (AI) large language models have become sufficiently accessible and user-friendly to assist students with coursework, study strategies, and written communication. AI-generated writing is now nearly indistinguishable from human-derived work. Instructors must rely on intuition and experience and, more recently, on online AI detectors to help distinguish student-written from AI-written material. Here, we tested the accuracy of AI detectors on writing samples from a fact-heavy, lower-division undergraduate anatomy and physiology course. Student participants (n = 190) completed three parts: writing a hand-written essay answering a prompt on the structure/function of the plasma membrane; creating an AI-generated answer to the same prompt; and completing a survey on their views of the quality of each essay as well as general AI use. Randomly selected (n = 50) participant-written and AI-generated essays were blindly uploaded to four AI detectors; a separate and unique group of randomly selected essays (n = 48) was provided to human raters (n = 9) for classification assessment. For the majority of essays, human raters and the best-performing AI detectors (n = 3) identified the correct origin at similar rates (84-95% and 93-98%, respectively) (P > 0.05). Approximately 1.3% and 5.0% of the essays were detected as false positives (human writing incorrectly labeled as AI) by AI detectors and human raters, respectively. Survey responses generally indicated that students viewed the AI-generated work as better than their own (P < 0.01). Using AI detectors in aggregate reduced the likelihood of detecting a false positive to nearly 0%, and this strategy was validated against human rater-labeled false positives. Taken together, our findings show that AI detectors, when used together, become a powerful tool to inform instructors.
We show how online artificial intelligence (AI) detectors can assist instructors in distinguishing between human- and AI-written work on course assignments. Although individual AI detectors vary in their accuracy in identifying the origin of written work, they are most effective when used in aggregate, informing instructors when human intuition gets it wrong. Using AI detectors for consensus detection reduces the false positive rate to nearly zero.
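The aggregation strategy described above can be sketched in code. The abstract does not specify the exact consensus rule the authors used, so the unanimity rule below (flag an essay as AI-written only when every detector agrees) is an assumption; it is one natural way to drive the false positive rate toward zero, since a single dissenting detector prevents a flag.

```python
def consensus_flag(detector_verdicts):
    """Flag an essay as AI-written only if all detectors agree.

    detector_verdicts: list of per-detector labels, e.g. ["AI", "Human", "AI"].
    Returns True (flag as AI) only on a unanimous "AI" verdict.
    This unanimity rule is an illustrative assumption, not the
    study's documented method.
    """
    return all(v == "AI" for v in detector_verdicts)


# Hypothetical verdicts from three detectors for two essays:
print(consensus_flag(["AI", "AI", "AI"]))     # unanimous -> flagged
print(consensus_flag(["AI", "Human", "AI"]))  # disagreement -> not flagged
```

Requiring unanimity trades sensitivity (some AI-written essays escape detection) for a near-zero false positive rate, which matches the priority the study places on not misaccusing students.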

Authors

  • Jon-Philippe K Hyatt
    College of Integrative Sciences and Arts, Arizona State University, Tempe, Arizona, United States.
  • Elisa Jayne Bienenstock
    Watts College of Public Service and Community Solutions, Arizona State University, Tempe, Arizona, United States.
  • Carla M Firetto
    Mary Lou Fulton College for Teaching and Learning Innovation, Arizona State University, Tempe, Arizona, United States.
  • Elizabeth R Woods
    College of Integrative Sciences and Arts, Arizona State University, Tempe, Arizona, United States.
  • Robert C Comus
    College of Integrative Sciences and Arts, Arizona State University, Tempe, Arizona, United States.