Preliminary evaluation of ChatGPT model iterations in emergency department diagnostics.
Journal:
Scientific Reports
PMID:
40140500
Abstract
Large language model chatbots such as ChatGPT have shown potential for assisting health professionals in emergency departments (EDs). However, the diagnostic accuracy of newer ChatGPT models remains unclear. This retrospective study evaluated the diagnostic performance of several ChatGPT models (including GPT-3.5, GPT-4, GPT-4o, and the o1 series) in predicting diagnoses for ED patients (n = 30) and examined the impact of explicitly requesting reasoning (thoughts). Earlier models such as GPT-3.5 achieved high accuracy for the top-three differential diagnoses (80.0%) but underperformed in identifying the leading diagnosis (47.8%) compared with newer models such as chatgpt-4o-latest (60%, p < 0.01) and o1-preview (60%, p < 0.01). Explicitly asking the model to provide its reasoning significantly improved leading-diagnosis accuracy for 4o models such as gpt-4o-2024-05-13 (from 45.6% to 56.7%; p = 0.03) and gpt-4o-mini-2024-07-18 (from 54.4% to 60.0%; p = 0.04) but had minimal impact on o1-mini and o1-preview. In challenging cases, such as pneumonia without fever, all models generally failed to predict the correct diagnosis, indicating that atypical presentations are a major limitation for ED application of current ChatGPT models.
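The kind of evaluation described here can be scripted against the OpenAI chat API: each case vignette is sent to several model versions, once with a plain prompt and once with an explicit request for reasoning, and the returned differential is checked against the reference diagnosis. The sketch below is a minimal illustration only, not the authors' protocol; the prompt wording, model list, case text, and keyword-based scoring are assumptions for demonstration.

# Minimal sketch (illustrative, not the study protocol): query several chat models
# with an ED case vignette, with and without an explicit request for reasoning,
# and check whether a reference diagnosis appears in the free-text answer.
# Model IDs, prompts, and scoring are assumptions; real scoring would use
# clinician adjudication of the top-1 and top-3 differentials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODELS = ["gpt-3.5-turbo", "gpt-4o-2024-05-13", "gpt-4o-mini-2024-07-18", "o1-preview"]

BASE_PROMPT = (
    "You are assisting an emergency physician. Based on the case below, "
    "list your top three differential diagnoses, most likely first.\n\nCase: {case}"
)
REASONING_SUFFIX = "\nBriefly explain your reasoning (thoughts) before giving the list."

def predict(model: str, case: str, ask_for_thoughts: bool) -> str:
    """Return the model's free-text differential for one case."""
    prompt = BASE_PROMPT.format(case=case)
    if ask_for_thoughts:
        prompt += REASONING_SUFFIX
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def hit(prediction: str, reference_diagnosis: str) -> bool:
    """Crude keyword match used only for this sketch."""
    return reference_diagnosis.lower() in prediction.lower()

if __name__ == "__main__":
    # Hypothetical atypical presentation (afebrile pneumonia), echoing the
    # challenging-case scenario mentioned in the abstract.
    case = "72-year-old with dyspnea and productive cough, afebrile, SpO2 91% on room air."
    reference = "pneumonia"
    for model in MODELS:
        for thoughts in (False, True):
            answer = predict(model, case, thoughts)
            print(model, "thoughts" if thoughts else "no-thoughts", hit(answer, reference))

In a study of this design, the loop above would be repeated over all 30 cases and scored for both leading-diagnosis (top-1) and top-three accuracy, which is where the reported differences between model versions and prompting styles would emerge.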