Expert of Experts Verification and Alignment (EVAL) Framework for Large Language Models Safety in Gastroenterology.

Journal: NPJ digital medicine
Published Date:

Abstract

Large language models (LLMs) generate plausible text responses to medical questions, but inaccurate responses pose significant risks in medical decision-making. Grading LLM outputs to determine the best model or answer is time-consuming and impractical in clinical settings; therefore, we introduce EVAL (Expert-of-Experts Verification and Alignment) to streamline this process and enhance LLM safety for upper gastrointestinal bleeding (UGIB). We evaluated OpenAI's GPT-3.5/4/4o/o1-preview, Anthropic's Claude-3-Opus, Meta's LLaMA-2 (7B/13B/70B), and Mistral AI's Mixtral (7B) across 27 configurations, including zero-shot baseline, retrieval-augmented generation, and supervised fine-tuning. EVAL uses similarity-based ranking and a reward model trained on human-graded responses for rejection sampling. Among the similarity metrics employed, Fine-Tuned ColBERT achieved the highest alignment with human performance across three separate datasets (ρ = 0.81–0.91). The reward model replicated human grading in 87.9% of cases across temperature settings, and rejection sampling significantly improved accuracy by 8.36% overall. EVAL offers scalable potential to assess accuracy for high-stakes medical decision-making.
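
The two mechanisms named in the abstract, similarity-based ranking and reward-model rejection sampling, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `token_overlap` is a toy Jaccard similarity standing in for a learned metric such as Fine-Tuned ColBERT, `toy_reward` stands in for a reward model trained on human grades, and the function names, threshold, and placeholder strings are hypothetical rather than taken from the study.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def token_overlap(a: str, b: str) -> float:
    """Toy Jaccard similarity over lowercase word sets (a stand-in, not ColBERT)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def rank_by_similarity(
    candidates: Sequence[str],
    reference: str,
    similarity: Callable[[str, str], float] = token_overlap,
) -> List[Tuple[str, float]]:
    """Score each candidate against an expert reference and sort best-first."""
    scored = [(c, similarity(c, reference)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def rejection_sample(
    candidates: Sequence[str],
    reward: Callable[[str], float],
    threshold: Optional[float] = None,
) -> Optional[str]:
    """Keep the highest-reward candidate; reject all if it falls below threshold."""
    if not candidates:
        return None
    best = max(candidates, key=reward)
    if threshold is not None and reward(best) < threshold:
        return None
    return best


# Hypothetical placeholder strings; real inputs would be model answers
# and an expert-written reference answer.
reference = "alpha beta gamma delta"
candidates = ["alpha beta", "epsilon zeta", "alpha beta gamma delta"]


def toy_reward(answer: str) -> float:
    """Assumed stand-in for a reward model trained on human-graded responses."""
    return token_overlap(answer, reference)


ranked = rank_by_similarity(candidates, reference)
accepted = rejection_sample(candidates, toy_reward, threshold=0.9)
rejected = rejection_sample(candidates[:2], toy_reward, threshold=0.9)
```

Here `ranked` orders the candidates by agreement with the reference, `accepted` is the single best-scoring answer, and `rejected` is `None` because no remaining candidate clears the assumed threshold. In practice the reward function would score an answer without access to a reference, which is what makes rejection sampling usable at inference time.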

Authors

  • Mauro Giuffrè
    Section of Digestive Diseases, Department of Medicine, Yale School of Medicine, New Haven, USA.
  • Kisung You
    Department of Mathematics, Baruch College, The City University of New York, New York, USA.
  • Ziteng Pang
    Department of Statistics and Data Science, Northwestern University, Chicago, USA.
  • Simone Kresevic
    Section of Digestive Diseases, Department of Medicine, Yale School of Medicine, New Haven, USA.
  • Sunny Chung
    Section of Digestive Diseases, Department of Medicine, Yale School of Medicine, New Haven, USA.
  • Ryan Chen
    University of Massachusetts Chan Medical School, Worcester, Massachusetts, USA.
  • Youngmin Ko
    Department of Statistics and Data Science, Northwestern University, Chicago, USA.
  • Colleen Chan
    Department of Statistics and Data Science, Yale University, New Haven, USA.
  • Theo Saarinen
    Department of Statistics, University of California Berkeley, Berkeley, USA.
  • Milos Ajcevic
    Department of Engineering and Architecture, University of Trieste, Trieste, Italy.
  • Lory S Crocè
    Department of Medical, Surgical, and Health Sciences, University of Trieste, Trieste, Italy.
  • Guadalupe Garcia-Tsao
    Section of Digestive Diseases, Department of Medicine, Yale School of Medicine, New Haven, USA.
  • Ian Gralnek
    Rappaport Faculty of Medicine, Technion Israel Institute of Technology, Haifa, Israel.
  • Joseph J Y Sung
    Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore.
  • Alan Barkun
    Division of Gastroenterology, McGill University, Montreal, Canada.
  • Loren Laine
    Section of Digestive Diseases, Department of Medicine, Yale School of Medicine, New Haven, USA.
  • Jasjeet Sekhon
    Department of Statistics and Data Science, Yale University, New Haven, USA.
  • Bradly Stadie
    Department of Statistics and Data Science, Northwestern University, Chicago, USA.
  • Dennis L Shung
    Section of Digestive Diseases, Department of Medicine, Yale School of Medicine, New Haven, USA. dennis.shung@yale.edu.
