A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI
Journal: arXiv
Published Date: Mar 26, 2025
Abstract
Despite the excitement behind biomedical artificial intelligence (AI), access
to high-quality, diverse, and large-scale data, which is the foundation of
modern AI systems, remains a bottleneck to unlocking its full potential. To address
this gap, we introduce Biomedica, an open-source dataset derived from the
PubMed Central Open Access subset, containing over 6 million scientific
articles and 24 million image-text pairs, along with 27 metadata fields
(including expert human annotations). To overcome the challenges of accessing
our large-scale dataset, we provide scalable streaming and search APIs through
a web server, facilitating seamless integration with AI systems. We demonstrate
the utility of the Biomedica dataset by building embedding models, chat-style
models, and retrieval-augmented chat agents. Notably, all our AI models surpass
previous open systems in their respective categories, underscoring the critical
role of diverse, high-quality, and large-scale biomedical data.