Evaluating the accuracy of ChatGPT model versions for giving care-seeking advice.
Journal:
Communications Medicine
Published Date:
Feb 25, 2026
Abstract
BACKGROUND: Artificial intelligence tools such as ChatGPT are increasingly used by laypeople to support their care-seeking decisions, although the accuracy of newer models remains unclear. We aimed to evaluate the accuracy of care-seeking advice generated by all currently available ChatGPT models. METHODS: We evaluated 22 ChatGPT models using 45 validated vignettes, each prompted ten times (9,900 total assessments). Each model classified the vignettes as requiring emergency care, non-emergency care, or self-care. We evaluated accuracy against each case's gold standard solution (determined by two physicians), examined the variability across trials, and tested algorithms to aggregate multiple recommendations to improve accuracy. RESULTS: We show that o1-mini achieves the highest accuracy (74%), but we do not observe an overall improvement with newer models, although reasoning models (e.g., o4-mini) improved their accuracy in identifying self-care cases. Selecting the lowest urgency level across multiple trials improves accuracy by 4 percentage points. CONCLUSIONS: Although newer models increasingly provide self-care advice, their accuracy remains insufficient for standalone use. However, making use of output variability with aggregation algorithms can improve the performance of existing models.
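The aggregation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the label strings, the function names (`lowest_urgency`, `majority_vote`, `accuracy`), the tie-breaking rule, and the toy data are all assumptions made for illustration. Only the three urgency levels, the "select the lowest urgency level" rule, and the study dimensions (22 models, 45 vignettes, ten trials) come from the abstract.

```python
from collections import Counter

# Urgency levels from the abstract, ordered from least to most urgent.
# The exact label strings are an assumption for this sketch.
URGENCY_ORDER = ["self-care", "non-emergency care", "emergency care"]
RANK = {label: i for i, label in enumerate(URGENCY_ORDER)}

# Study size stated in the abstract: models x vignettes x trials.
TOTAL_ASSESSMENTS = 22 * 45 * 10  # 9,900


def lowest_urgency(recommendations):
    """Return the least urgent label seen across repeated trials."""
    return min(recommendations, key=RANK.__getitem__)


def majority_vote(recommendations):
    """Return the most frequent label (a hypothetical comparison rule).

    Ties are broken toward lower urgency; that choice is an assumption.
    """
    counts = Counter(recommendations)
    top = max(counts.values())
    tied = [label for label, n in counts.items() if n == top]
    return min(tied, key=RANK.__getitem__)


def accuracy(trials_by_case, gold_by_case, aggregate):
    """Share of cases whose aggregated label matches the gold standard."""
    hits = sum(
        aggregate(trials) == gold_by_case[case]
        for case, trials in trials_by_case.items()
    )
    return hits / len(trials_by_case)


# Hypothetical toy data, not taken from the study.
toy_trials = {
    "vignette_1": ["non-emergency care", "self-care", "non-emergency care"],
    "vignette_2": ["emergency care", "emergency care", "non-emergency care"],
}
toy_gold = {"vignette_1": "self-care", "vignette_2": "emergency care"}

if __name__ == "__main__":
    for rule in (lowest_urgency, majority_vote):
        print(rule.__name__, accuracy(toy_trials, toy_gold, rule))
```

The toy data shows the trade-off in such a rule: taking the lowest urgency level recovers the self-care case that a majority vote misses, but it also downgrades the emergency case whenever a single trial returns a less urgent label.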