Is one run enough? Reproducibility of flagship large language models across temperature and reasoning settings in biomedical text processing.

Journal: Journal of the American Medical Informatics Association : JAMIA

Published Date: Jun 1, 2026

Abstract

OBJECTIVE: To quantify run-to-run reproducibility of Gemini 3 Flash Preview and GPT-5.2 for trial-success classification across temperature and reasoning/thinking settings, and to determine whether single-run reporting suffices. MATERIALS AND METHODS: We used 250 trial abstracts labeled by primary endpoint success. We evaluated Gemini across thinking levels (minimal, low, medium, high) and temperatures 0.0-2.0, and GPT-5.2 across reasoning-effort levels (none to x-high), with an additional temperature sweep when reasoning was disabled. Each setting was run 3 times. RESULTS: Reproducibility was high for Gemini (κ = 0.942-1.000; invalid outputs 0%-1.5%) and GPT-5.2 (κ = 0.984-0.995; no invalid outputs). F1 remained stable (mean/majority vote 0.955-0.971), with marginal gains from majority voting. CONCLUSION: For binary biomedical classification with tightly constrained outputs, both models were reproducible across decoding and reasoning settings, suggesting that single runs are often sufficient, with minimal replication as a practical stability check.
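The replication check described in the abstract (three runs per setting, an agreement statistic, and a majority vote scored by F1) can be sketched as follows. This is a minimal illustration and not the authors' code: the abstract does not say which κ variant was used, so mean pairwise Cohen's κ is an assumption here, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("runs must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each run's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs emitted one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(runs):
    """Average Cohen's kappa over all pairs of runs (assumed variant)."""
    pairs = list(combinations(runs, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def majority_vote(runs):
    """Per-item most frequent label; ties resolve to the earliest run's label."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*runs)]


def f1_score(gold, pred, positive=1):
    """F1 for the positive class (here: primary endpoint met)."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Hypothetical toy labels (1 = success, 0 = failure); not data from the study.
gold = [1, 0, 1, 1, 0, 0, 1, 0]
runs = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 0],
]

voted = majority_vote(runs)
agreement = mean_pairwise_kappa(runs)
voted_f1 = f1_score(gold, voted)
per_run_f1 = [f1_score(gold, run) for run in runs]
```

With three runs and two valid labels, the vote always has a strict majority; a tie can only arise if invalid outputs are kept as a third label, in which case they would need an explicit rule.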
