Contrastive learning and mixture of experts enables precise vector embeddings in biological databases.

Journal: Scientific reports

PMID: 40301476

Abstract

The advancement of transformer neural networks has significantly enhanced the performance of sentence similarity models. However, these models often struggle with highly discriminative tasks and generate sub-optimal representations of complex documents such as peer-reviewed scientific literature. With the increased reliance on retrieval augmentation and search, representing structurally and thematically-varied research documents as concise and descriptive vectors is crucial. This study improves upon the vector embeddings of scientific text by assembling domain-specific datasets using co-citations as a similarity metric, focusing on biomedical domains. We introduce a novel Mixture of Experts (MoE) extension pipeline applied to pretrained BERT models, where every multi-layer perceptron section is copied into distinct experts. Our MoE variants are trained to classify whether two publications are cited together (co-cited) in a third paper based on their scientific abstracts across multiple biological domains. Notably, because of our unique routing scheme based on special tokens, the throughput of our extended MoE system is exactly the same as regular transformers. This holds promise for versatile and efficient One-Size-Fits-All transformer networks for encoding heterogeneous biomedical inputs. Our methodology marks advancements in representation learning and holds promise for enhancing vector database search and compilation.
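The MoE extension described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the first token of each input sequence is a domain-specific special token and that this token alone selects the expert; the names `MLP`, `TokenRoutedMoE`, and `DOMAIN_TOKENS`, the token ids, and the layer sizes are all hypothetical. Each expert starts as a copy of the pretrained MLP, and each sequence passes through exactly one expert, so the per-sequence compute matches a single dense MLP, consistent with the throughput claim above.

```python
import copy
import numpy as np

def gelu(x):
    # tanh approximation of GELU, the activation used in BERT-style MLP blocks
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class MLP:
    """Stand-in for one pretrained transformer feed-forward (MLP) block."""
    def __init__(self, d_model, d_hidden, rng):
        self.w1 = rng.normal(0.0, 0.02, (d_model, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.normal(0.0, 0.02, (d_hidden, d_model))
        self.b2 = np.zeros(d_model)

    def __call__(self, x):
        return gelu(x @ self.w1 + self.b1) @ self.w2 + self.b2

class TokenRoutedMoE:
    """Hypothetical MoE layer: one expert per domain special token."""
    def __init__(self, pretrained_mlp, domain_token_ids):
        # Every expert begins as an independent copy of the pretrained MLP.
        self.experts = {tok: copy.deepcopy(pretrained_mlp) for tok in domain_token_ids}

    def __call__(self, hidden, input_ids):
        # hidden: (batch, seq, d_model); input_ids: (batch, seq)
        out = np.empty_like(hidden)
        for i in range(hidden.shape[0]):
            # Assumption: the token at position 0 identifies the domain.
            expert = self.experts[int(input_ids[i, 0])]
            out[i] = expert(hidden[i])  # one expert per sequence, no learned router
        return out

# Toy usage with made-up token ids for two biological domains.
DOMAIN_TOKENS = [30001, 30002]
rng = np.random.default_rng(0)
base = MLP(d_model=8, d_hidden=32, rng=rng)
moe = TokenRoutedMoE(base, DOMAIN_TOKENS)

hidden = rng.normal(size=(2, 5, 8))
input_ids = np.array([[30001, 11, 12, 13, 14],
                      [30002, 21, 22, 23, 24]])
out = moe(hidden, input_ids)
```

Because the experts are exact copies at initialization, the extended layer initially reproduces the pretrained MLP; the experts diverge only once they are fine-tuned on domain-specific data.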

Authors

Logan Hallee
Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19713, USA.

Rohan Kapur
Lincoln Laboratory, Massachusetts Institute of Technology, Boston, USA.

Arjun Patel
The College of the University of Chicago, Chicago, USA.

Jason P Gleghorn
Department of Biomedical Engineering, University of Delaware, Newark, USA. gleghorn@udel.edu.

Bohdan B Khomtchouk
Center for Therapeutic Innovation and Department of Psychiatry and Behavioral Sciences, University of Miami Miller School of Medicine, 1120 NW 14th St., Miami, FL, USA 33136.

Keywords

Algorithms; Databases, Factual; Humans; Machine Learning; Neural Networks, Computer
