A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI

Journal: arXiv

Published Date: Mar 26, 2025

Abstract

Despite the excitement behind biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data - the foundation for modern AI systems - is still a bottleneck to unlocking its full potential. To address this gap, we introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 million image-text pairs, along with 27 metadata fields (including expert human annotations). To overcome the challenges of accessing our large-scale dataset, we provide scalable streaming and search APIs through a web server, facilitating seamless integration with AI systems. We demonstrate the utility of the Biomedica dataset by building embedding models, chat-style models, and retrieval-augmented chat agents. Notably, all our AI models surpass previous open systems in their respective categories, underscoring the critical role of diverse, high-quality, and large-scale biomedical data.

Authors

Alejandro Lozano
Min Woo Sun
James Burgess
Jeffrey J. Nirschl
Christopher Polzak
Yuhui Zhang
Liangyu Chen
Jeffrey Gu
Ivan Lopez
Josiah Aklilu
Anita Rau
Austin Wolfgang Katzer
Collin Chiu
Orr Zohar
Xiaohan Wang
Alfred Seunghoon Song
Chiang Chia-Chun
Robert Tibshirani
Serena Yeung-Levy

External Resources

View on arXiv arXiv (http://arxiv.org/abs/2503.22727v2)

A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI

Abstract

Authors

Categories

External Resources

Popular Topics

Recent Journals

A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI

Abstract

Authors

Categories

External Resources

Don't Miss the Future of Medicine

Popular Topics

Recent Journals