A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI

Journal: arXiv
Published Date:

Abstract

Despite the excitement around biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data, the foundation of modern AI systems, remains a bottleneck to unlocking its full potential. To address this gap, we introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 million image-text pairs, along with 27 metadata fields (including expert human annotations). To overcome the challenges of accessing our large-scale dataset, we provide scalable streaming and search APIs through a web server, facilitating seamless integration with AI systems. We demonstrate the utility of the Biomedica dataset by building embedding models, chat-style models, and retrieval-augmented chat agents. Notably, all our AI models surpass previous open systems in their respective categories, underscoring the critical role of diverse, high-quality, and large-scale biomedical data.

Authors

  • Alejandro Lozano
  • Min Woo Sun
  • James Burgess
  • Jeffrey J. Nirschl
  • Christopher Polzak
  • Yuhui Zhang
  • Liangyu Chen
  • Jeffrey Gu
  • Ivan Lopez
  • Josiah Aklilu
  • Anita Rau
  • Austin Wolfgang Katzer
  • Collin Chiu
  • Orr Zohar
  • Xiaohan Wang
  • Alfred Seunghoon Song
  • Chiang Chia-Chun
  • Robert Tibshirani
  • Serena Yeung-Levy