HISTAI: An Open-Source, Large-Scale Whole Slide Image Dataset for Computational Pathology
Journal: arXiv
Published Date: May 17, 2025
Abstract
Recent advancements in Digital Pathology (DP), particularly through
artificial intelligence and Foundation Models, have underscored the importance
of large-scale, diverse, and richly annotated datasets. Despite their critical
role, publicly available Whole Slide Image (WSI) datasets often lack sufficient
scale, tissue diversity, and comprehensive clinical metadata, limiting the
robustness and generalizability of AI models. In response, we introduce the
HISTAI dataset, a large, multimodal, open-access WSI collection comprising over
60,000 slides from various tissue types. Each case in the HISTAI dataset is
accompanied by extensive clinical metadata, including diagnosis, demographic
information, detailed pathological annotations, and standardized diagnostic
coding. The dataset aims to fill gaps identified in existing resources,
promoting innovation, reproducibility, and the development of clinically
relevant computational pathology solutions. The dataset can be accessed at
https://github.com/HistAI/HISTAI.