Is one run enough? Reproducibility of flagship large language models across temperature and reasoning settings in biomedical text processing.
Journal:
Journal of the American Medical Informatics Association : JAMIA
Published Date:
Jun 1, 2026
Abstract
OBJECTIVE: To quantify run-to-run reproducibility of Gemini 3 Flash Preview and GPT-5.2 for trial-success classification across temperature and reasoning/thinking settings, and to determine whether single-run reporting suffices. MATERIALS AND METHODS: We used 250 trial abstracts labeled by primary endpoint success. We evaluated Gemini across thinking levels (minimal, low, medium, high) and temperatures 0.0-2.0, and GPT-5.2 across reasoning-effort levels (none to x-high), with an additional temperature sweep when reasoning was disabled. Each setting was run 3 times. RESULTS: Reproducibility was high for Gemini (κ = 0.942-1.000; invalid outputs 0%-1.5%) and GPT-5.2 (κ = 0.984-0.995; no invalid outputs). F1 remained stable (mean/majority vote 0.955-0.971), with marginal gains from majority voting. CONCLUSION: For binary biomedical classification with tightly constrained outputs, both models were reproducible across decoding and reasoning settings, suggesting that single runs are often sufficient, with minimal replication as a practical stability check.
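The quantities reported above (agreement across 3 runs, majority voting, F1) can be illustrated with a minimal sketch. The abstract does not state which κ statistic was used, so Fleiss' κ is an assumption here, as are the function names, the 0/1 label encoding, and the toy data. The sketch also assumes every output is a valid label; invalid outputs would need separate handling.

```python
from collections import Counter


def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' kappa for items rated by the same number of runs.

    ratings: list of per-item label lists, one label per run.
    Assumption: the paper's kappa may be a different statistic.
    """
    n_items = len(ratings)
    n_runs = len(ratings[0])
    # counts[i][j] = number of runs assigning item i to category j
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Per-item observed agreement among pairs of runs
    p_i = [
        (sum(c * c for c in row) - n_runs) / (n_runs * (n_runs - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Expected agreement from the overall category proportions
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_runs)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        # Only one category ever used: agreement is trivially perfect
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


def majority_vote(item_labels):
    """Most frequent label across runs (an odd run count avoids ties)."""
    return Counter(item_labels).most_common(1)[0][0]


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 4 abstracts, 3 runs each, 1 = endpoint met
runs = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0]]
gold = [1, 0, 1, 0]

kappa = fleiss_kappa(runs)
voted = [majority_vote(item) for item in runs]
f1_voted = f1_score(gold, voted)
# Mean F1 over the individual runs, for comparison with the voted F1
f1_mean = sum(
    f1_score(gold, [item[r] for item in runs]) for r in range(3)
) / 3
```

In this toy case the single disagreeing run lowers κ and the mean per-run F1, while the majority vote recovers the gold labels, which mirrors the kind of marginal gain from voting described in the results.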