Benchmarking large language models for genomic knowledge with GeneTuring
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
Large language models (LLMs) show promise in biomedical research, but their effectiveness for genomic inquiry remains unclear. We developed GeneTuring, a benchmark consisting of 16 genomics tasks with 1,600 curated questions, and manually evaluated 48,000 answers from ten LLM configurations, including GPT-4o (via API, ChatGPT with web access, and a custom GPT setup), GPT-3.5, Claude 3.5, Gemini Advanced, GeneGPT (both slim and full), BioGPT, and BioMedLM. A custom GPT-4o configuration integrated with NCBI APIs, developed in this study as SeqSnap, achieved the best overall performance. GPT-4o with web access and GeneGPT demonstrated complementary strengths. Our findings highlight both the promise and current limitations of LLMs in genomics, and emphasize the value of combining LLMs with domain-specific tools for robust genomic intelligence. GeneTuring offers a key resource for benchmarking and improving LLMs in biomedical research.

Dr. Wenpin Hou is an Assistant Professor (tenure-track) in the Department of Biostatistics at Columbia University and a member of its Data Science Institute, developing AI and statistical methods for decoding gene regulatory programs from single-cell and spatial multiomics data.