Evaluating capabilities of large language models: Performance of GPT-4 on surgical knowledge assessments.

Journal: Surgery
Published Date:

Abstract

BACKGROUND: Artificial intelligence has the potential to dramatically alter health care by enhancing how we diagnose and treat disease. One promising artificial intelligence model is ChatGPT, a general-purpose large language model trained by OpenAI. ChatGPT has shown human-level performance on several professional and academic benchmarks. We sought to evaluate its performance on surgical knowledge questions and assess the stability of this performance on repeat queries.

Authors

  • Brendin R Beaulieu-Jones
    Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA. Electronic address: https://twitter.com/bratogram.
  • Margaret T Berrigan
    Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA.
  • Sahaj Shah
    Geisinger Commonwealth School of Medicine, Scranton, PA.
  • Jayson S Marwaha
    American College of Surgeons Health Information Technology Committee and Artificial Intelligence Subcommittee, Chicago, IL.
  • Shuo-Lun Lai
    Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, No.1, Sec.4, Roosevelt Road, Taipei, 10617, Taiwan.
  • Gabriel A Brat
    American College of Surgeons Health Information Technology Committee and Artificial Intelligence Subcommittee, Chicago, IL.