Evaluating capabilities of large language models: Performance of GPT-4 on surgical knowledge assessments.

Journal: Surgery

Published Date: Jan 20, 2024

Abstract

BACKGROUND: Artificial intelligence has the potential to dramatically alter health care by enhancing how we diagnose and treat disease. One promising artificial intelligence model is ChatGPT, a general-purpose large language model trained by OpenAI. ChatGPT has shown human-level performance on several professional and academic benchmarks. We sought to evaluate its performance on surgical knowledge questions and assess the stability of this performance on repeat queries.

Authors

Brendin R Beaulieu-Jones

Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA. Electronic address: https://twitter.com/bratogram.

Margaret T Berrigan

Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA.

Sahaj Shah

Geisinger Commonwealth School of Medicine, Scranton, PA.

Jayson S Marwaha

American College of Surgeons Health Information Technology Committee and Artificial Intelligence Subcommittee, Chicago, IL.

Shuo-Lun Lai

Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, No.1, Sec.4, Roosevelt Road, Taipei, 10617, Taiwan.

Gabriel A Brat

American College of Surgeons Health Information Technology Committee and Artificial Intelligence Subcommittee, Chicago, IL.

Keywords

Artificial Intelligence; Benchmarking; Educational Status; Humans; Language; Surgeons

External Resources

PubMed ID: 38246839
