Comparison of physician and large language model chatbot responses to online ear, nose, and throat inquiries.
Journal:
Scientific Reports
Published Date:
Jul 1, 2025
Abstract
Large language models (LLMs) can potentially enhance the accessibility and quality of medical information. This study evaluates the reliability and quality of responses generated by ChatGPT-4, an LLM-driven chatbot, compared with those written by physicians, focusing on otorhinolaryngological advice in real-world, text-based workflows. Physician responses to inquiries posted on a public social media forum were anonymized, and ChatGPT-4 generated replies to the same inquiries. A panel of seven board-certified otorhinolaryngologists assessed both sets of responses on six criteria: overall quality, empathy, alignment with medical consensus, information accuracy, inquiry comprehension, and potential for harm. Ordinal logistic regression identified factors influencing response quality. ChatGPT-4 responses were preferred in 70.7% of cases and were significantly longer (median 162 words) than physician responses (median 67 words; P < .0001). The chatbot's responses received higher ratings across all criteria, with the key predictors of higher quality being greater empathy, stronger alignment with medical consensus, lower potential for harm, and fewer inaccuracies. ChatGPT-4 consistently outperformed physicians in generating responses that adhered to medical consensus, were accurate, and conveyed empathy. These findings suggest that integrating AI tools into text-based healthcare consultations could help physicians address complex, nuanced inquiries and provide high-quality, comprehensive medical advice.
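The abstract states that ordinal logistic regression was used to identify predictors of overall response quality. As an illustration only, and not the authors' analysis code, the sketch below fits a proportional-odds (cumulative logit) model to simulated panel ratings; the variable names, rating scales, sample size, and data are hypothetical assumptions.

```python
# Minimal sketch of an ordinal logistic regression on reviewer ratings.
# All data below are simulated; the real study used ratings from seven
# otorhinolaryngologists, not this synthetic set.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 200  # hypothetical number of rated responses

# Hypothetical predictor ratings on 1-5 scales
empathy = rng.integers(1, 6, n)
consensus = rng.integers(1, 6, n)
harm = rng.integers(1, 6, n)

# Simulate an ordinal overall-quality score loosely driven by the predictors
latent = 0.8 * empathy + 1.0 * consensus - 0.6 * harm + rng.normal(0, 1, n)
overall = pd.cut(latent, bins=5, labels=[1, 2, 3, 4, 5]).astype(int)

df = pd.DataFrame({
    "overall": overall,
    "empathy": empathy,
    "consensus": consensus,
    "harm": harm,
})

# Proportional-odds model: which criteria predict higher overall quality?
model = OrderedModel(
    df["overall"],
    df[["empathy", "consensus", "harm"]],
    distr="logit",
)
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```

In a setup like this, positive coefficients (e.g., on empathy or consensus) indicate higher odds of a better overall-quality rating, while negative coefficients (e.g., on harm potential) indicate the opposite, which mirrors the direction of the predictors reported in the abstract.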