Benchmarking large language models for genomic knowledge with GeneTuring

Journal: bioRxiv

Published Date: Jan 1, 2025

Abstract

Large language models (LLMs) show promise in biomedical research, but their effectiveness for genomic inquiry remains unclear. We developed GeneTuring, a benchmark consisting of 16 genomics tasks with 1,600 curated questions, and manually evaluated 48,000 answers from ten LLM configurations, including GPT-4o (via API, ChatGPT with web access, and a custom GPT setup), GPT-3.5, Claude 3.5, Gemini Advanced, GeneGPT (both slim and full), BioGPT, and BioMedLM. A custom GPT-4o configuration integrated with NCBI APIs, developed in this study as SeqSnap, achieved the best overall performance. GPT-4o with web access and GeneGPT demonstrated complementary strengths. Our findings highlight both the promise and current limitations of LLMs in genomics, and emphasize the value of combining LLMs with domain-specific tools for robust genomic intelligence. GeneTuring offers a key resource for benchmarking and improving LLMs in biomedical research.

Dr. Wenpin Hou is an Assistant Professor (tenure-track) in the Department of Biostatistics at Columbia University and a member of its Data Science Institute, developing AI and statistical methods for decoding gene regulatory programs from single-cell and spatial multiomics data.

Authors

Xinyi Shang; Xu Liao; Zhicheng Ji; Wenpin Hou

External Resources

View on bioRxiv; Access via DOI
