COMET: Benchmark for Comprehensive Biological Multi-omics Evaluation Tasks and Language Models
Journal:
arXiv
Published Date:
Dec 13, 2024
Abstract
As key elements within the central dogma, DNA, RNA, and proteins play crucial
roles in maintaining life by guaranteeing accurate genetic expression and
implementation. Although research on these molecules has profoundly impacted
fields like medicine, agriculture, and industry, the diversity of machine
learning approaches (from traditional statistical methods to deep learning
models and large language models) makes it difficult for researchers to choose
the most suitable model for a specific task. The difficulty is greatest for
cross-omics and multi-omics tasks, which lack comprehensive benchmarks. To address this,
we introduce the first comprehensive multi-omics benchmark, COMET (Benchmark for
Biological COmprehensive Multi-omics Evaluation Tasks and Language Models),
designed to evaluate models across single-omics, cross-omics, and multi-omics
tasks. First, we curate and develop a diverse collection of downstream tasks
and datasets covering key structural and functional aspects in DNA, RNA, and
proteins, including tasks that span multiple omics levels. Then, we evaluate
existing foundational language models for DNA, RNA, and proteins, as well as
the newly proposed multi-omics method, offering valuable insights into their
performance in integrating and analyzing data from different biological
modalities. This benchmark aims to define critical issues in multi-omics
research and guide future directions, ultimately promoting advancements in
understanding biological processes through the integrated analysis of data from
different omics levels.