A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language
Journal: arXiv
Published Date: Jul 21, 2024
Abstract
Predicting a gene's function from its DNA sequence is a fundamental challenge in
biology. Many deep learning models have been proposed to embed DNA sequences
and predict their enzymatic function, leveraging information in public
databases linking DNA sequences to an enzymatic function label. However, much
of the scientific community's knowledge of biological function is not
represented in these categorical labels, and is instead captured in
unstructured text descriptions of mechanisms, reactions, and enzyme behavior.
These descriptions are often stored alongside DNA sequences in biological
databases, albeit in an unstructured manner. Deep learning models that predict
enzymatic function are likely to benefit from incorporating this multi-modal
data encoding scientific knowledge of biological function. There is, however,
no dataset designed for machine learning algorithms to leverage this
multi-modal information. Here we propose a novel dataset and benchmark suite
that enables the exploration and development of large multi-modal neural
network models on gene DNA sequences and natural language descriptions of gene
function. Baseline results on benchmarks for both unsupervised and supervised
tasks demonstrate the difficulty of this modeling objective and show the
potential benefit of incorporating multi-modal data types in function
prediction compared to DNA sequences alone.
Our dataset is at: https://hoarfrost-lab.github.io/BioTalk/.