STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics
Journal:
arXiv
Published Date:
Jun 10, 2024
Abstract
Recent advances in multi-modal algorithms have driven and been driven by the
increasing availability of large image-text datasets, leading to significant
strides in various fields, including computational pathology. However, in most
existing medical image-text datasets, the text typically provides high-level
summaries that may not sufficiently describe sub-tile regions within a large
pathology image. For example, an image might cover an extensive tissue area
containing cancerous and healthy regions, but the accompanying text might only
specify that this image is a cancer slide, lacking the nuanced details needed
for in-depth analysis. In this study, we introduce STimage-1K4M, a novel
dataset designed to bridge this gap by providing genomic features for sub-tile
images. STimage-1K4M contains 1,149 images derived from spatial transcriptomics
data, which captures gene expression information at the level of individual
spatial spots within a pathology image. Specifically, each image in the dataset
is broken down into smaller sub-image tiles, with each tile paired with
15,000-30,000 dimensional gene expressions. With 4,293,195 pairs of sub-tile
images and gene expressions, STimage-1K4M offers unprecedented granularity,
paving the way for a wide range of advanced research in multi-modal data
analysis an innovative applications in computational pathology, and beyond.