Biomedical knowledge graph-optimized prompt generation for large language models.

Journal: Bioinformatics (Oxford, England)
Published Date:

Abstract

MOTIVATION: Large language models (LLMs) are being adopted at an unprecedented rate, yet they still face challenges in knowledge-intensive domains such as biomedicine. Solutions such as pretraining and domain-specific fine-tuning add substantial computational overhead and require additional domain expertise. Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation (KG-RAG) framework that leverages a massive biomedical KG (SPOKE) with LLMs such as Llama-2-13b, GPT-3.5-Turbo, and GPT-4 to generate meaningful biomedical text rooted in established knowledge.

Authors

  • Karthik Soman
    Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, Tamil Nadu, 600036, India.
  • Peter W Rose
    San Diego Supercomputer Center, University of California, San Diego, CA 92093, United States.
  • John H Morris
    Department of Pharmaceutical Chemistry, School of Pharmacy, University of California, San Francisco, CA 94143, USA.
  • Rabia E Akbas
    Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, United States.
  • Brett Smith
    Department of Genomic Medicine, MD Anderson Cancer Center, Houston, Texas, USA.
  • Braian Peetoom
    Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, United States.
  • Catalina Villouta-Reyes
    Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, United States.
  • Gabriel Cerono
    Department of Neurology, University of California San Francisco, San Francisco, CA USA.
  • Yongmei Shi
    Bakar Computational Health Sciences Institute, University of California, San Francisco, CA 94158, United States.
  • Angela Rizk-Jackson
    Bakar Computational Health Sciences Institute, University of California, San Francisco, CA 94158, United States.
  • Sharat Israni
    Bakar Computational Health Sciences Institute, University of California, San Francisco, CA 94158, United States.
  • Charlotte A Nelson
    Integrated Program in Quantitative Biology, University of California San Francisco, San Francisco, CA, USA.
  • Sui Huang
    Institute for Systems Biology, 401 Terry Ave. N. Seattle, WA 98109, USA.
  • Sergio E Baranzini
    MS Genetics, Department of Neurology, School of Medicine, University of California San Francisco (UCSF), San Francisco, CA, USA.