Geneverse: A collection of Open-source Multimodal Large Language Models for Genomic and Proteomic Research
Journal: arXiv
Published Date: Jun 21, 2024
Abstract
Large language models (LLMs) hold promise for biomedical and healthcare
research. Despite the availability of open-source LLMs trained
using a wide range of biomedical data, current research on the applications of
LLMs to genomics and proteomics is still limited. To fill this gap, we propose
a collection of finetuned LLMs and multimodal LLMs (MLLMs), known as Geneverse,
for three novel tasks in genomic and proteomic research. The models in
Geneverse are trained and evaluated on domain-specific datasets, and we
use advanced parameter-efficient finetuning techniques to adapt the models
to these tasks: generating descriptions of gene function, inferring protein
function from structure, and selecting marker genes from spatial
transcriptomic data. We demonstrate that the adapted LLMs
and MLLMs perform well on these tasks and, in our evaluations of
truthfulness and structural correctness, may outperform closed-source
large-scale models. All of the training strategies and base models we used
are freely accessible.