SpatialDINO: A Self-Supervised 3D Vision Transformer that enables Segmentation and Tracking in Crowded Cellular Environments
Journal: bioRxiv
Published Date: Jan 25, 2026
Abstract
Quantitative, time-resolved 3D fluorescence microscopy can reveal complex cellular dynamics in living cells and tissues. Broader use remains limited by the difficulty of identifying, segmenting, and tracking objects of different sizes and shapes in crowded intracellular environments imaged as low-contrast, anisotropic, monochromatic volumes. Objects overlap, deform, appear and disappear, and span wide ranges of size and intensity. Classical segmentation pipelines typically require high signal-to-noise data and rely on intensity heuristics with hand-tuned post-processing that generalize poorly. Supervised deep learning methods require extensive voxel-level annotations that are costly, inconsistent across phenotypes, and quickly obsolete as imaging conditions change. We introduce SpatialDINO, a fully automated self-supervised method that trains a native 3D vision transformer based on a modified version of DINOv2. SpatialDINO yields robust semantic feature maps from single channels of multi-channel microscopy that, irrespective of object shape, support object detection and segmentation directly from naive 3D images across z-spacings, numbers of planes, and imaging modalities, without retraining or voxel annotations. We trained SpatialDINO on a small set of confocal volumes acquired by live-cell fluorescent 3D lattice light-sheet microscopy, spanning targets of different sizes and shapes located in crowded cellular environments, from diffraction-limited clathrin-coated pits and clathrin-coated vesicles to larger structures, including endosomes and lysosomes, as well as endosomes and lysosomes pharmacologically enlarged to highlight endosomal membrane profiles. Post-processing of the features generated by SpatialDINO allows detection and unique identification of these objects in naive 3D images. It also enables detection of substantially different, previously unseen object classes, such as cellular plasma membranes, nuclei, and even tumors in MRI scans.
Finally, we illustrate its value by tracking endosomes in 3D time series, combining SpatialDINO-derived feature similarity with spatial proximity to improve association through occlusion, abrupt appearance changes, and dense packing, all conditions that have been challenging for existing methods. SpatialDINO therefore lowers a major barrier to quantitative analysis of heterogeneous, monochromatic objects in crowded 3D cellular environments.
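The idea of combining feature similarity with spatial proximity for frame-to-frame association can be illustrated with a minimal sketch. The abstract does not specify how the two terms are combined or how links are chosen, so everything below is an assumption made for illustration: the function names `association_cost` and `greedy_match`, the weighting parameter `alpha`, the gating radius `max_dist`, the use of cosine distance on per-object feature vectors, and the greedy (rather than globally optimal) matching.

```python
import numpy as np


def association_cost(pos_a, feat_a, pos_b, feat_b, max_dist=5.0, alpha=0.5):
    """Cost of linking each object in frame t (a) to each object in frame t+1 (b).

    pos_*  : (N, 3) object centroids.
    feat_* : (N, D) per-object feature vectors (hypothetical pooled features).
    alpha  : weight of the spatial term; (1 - alpha) weights the feature term.
    Pairs farther apart than max_dist get infinite cost (never linked).
    """
    pos_a, pos_b = np.asarray(pos_a, float), np.asarray(pos_b, float)
    feat_a, feat_b = np.asarray(feat_a, float), np.asarray(feat_b, float)

    # Pairwise Euclidean distance between centroids, shape (Na, Nb).
    dist = np.linalg.norm(pos_a[:, None, :] - pos_b[None, :, :], axis=-1)

    # Pairwise cosine distance between feature vectors, shape (Na, Nb).
    eps = 1e-12  # guards against division by zero for all-zero features
    fa = feat_a / (np.linalg.norm(feat_a, axis=1, keepdims=True) + eps)
    fb = feat_b / (np.linalg.norm(feat_b, axis=1, keepdims=True) + eps)
    cos_dist = 1.0 - fa @ fb.T

    cost = alpha * (dist / max_dist) + (1.0 - alpha) * cos_dist
    cost[dist > max_dist] = np.inf  # spatial gating
    return cost


def greedy_match(cost):
    """Repeatedly link the cheapest remaining pair; each object is used once."""
    cost = cost.copy()
    links = []
    while np.isfinite(cost).any():
        i, j = np.unravel_index(np.argmin(cost), cost.shape)
        links.append((int(i), int(j)))
        cost[i, :] = np.inf  # object i in frame t is now taken
        cost[:, j] = np.inf  # object j in frame t+1 is now taken
    return sorted(links)


# Two objects pass close to each other: the nearest neighbour in space is the
# wrong partner, but the feature vectors disambiguate the link.
pos_t = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
feat_t = [[1.0, 0.0], [0.0, 1.0]]
pos_t1 = [[1.2, 0.0, 0.0], [0.8, 0.0, 0.0]]
feat_t1 = [[0.9, 0.1], [0.1, 0.9]]

print(greedy_match(association_cost(pos_t, feat_t, pos_t1, feat_t1)))
# prints [(0, 0), (1, 1)]
```

With `alpha=1.0` the cost reduces to distance alone and the same example links each object to the wrong partner, which is the failure mode that a feature term is meant to mitigate in dense packing. A globally optimal assignment solver could replace the greedy loop; that choice is likewise not stated in the abstract.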