ESMStereo: Enhanced ShuffleMixer Disparity Upsampling for Real-Time and Accurate Stereo Matching
Journal:
arXiv
Published Date:
Jun 26, 2025
Abstract
Stereo matching has become an increasingly important component of modern
autonomous systems. Developing deep learning-based stereo matching models that
deliver high accuracy while operating in real time continues to be a major
challenge in computer vision. In the domain of cost-volume-based stereo
matching, accurate disparity estimation depends heavily on large-scale cost
volumes. However, such large volumes store substantial redundant information
and also require computationally intensive aggregation units for processing and
regression, making real-time performance unattainable. Conversely, small-scale
cost volumes followed by lightweight aggregation units provide a promising
route for real-time performance, but lack sufficient information to ensure
highly accurate disparity estimation. To address this challenge, we propose the
Enhanced Shuffle Mixer (ESM) to mitigate information loss associated with
small-scale cost volumes. ESM restores critical details by integrating primary
features into the disparity upsampling unit. It quickly extracts features from
the initial disparity estimation and fuses them with image features. These
features are mixed by shuffling and layer splitting, then refined through a
compact feature-guided hourglass network to recover more detailed scene
geometry. The ESM focuses on local contextual connectivity with a large
receptive field and low computational cost, leading to the reconstruction of a
highly accurate disparity map in real time. The compact version of ESMStereo
achieves an inference speed of 116 FPS on high-end GPUs and 91 FPS on the AGX
Orin.
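
The abstract says the fused disparity and image features are "mixed by shuffling and layer splitting" but does not specify the exact operation. As a rough illustration only, the sketch below assumes a ShuffleNet-style channel shuffle followed by an even channel split. The function names, the group count, and the channel labels are hypothetical and are not taken from the paper. Channels are modeled as a plain Python list so the interleaving is easy to see; in a tensor framework the same effect is usually obtained with a reshape, a transpose, and a second reshape.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (ShuffleNet-style; assumed, not the paper's code)."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    # Treat the list as a (groups, per_group) grid and read it column by column,
    # so neighbouring output channels come from different groups.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


def split_channels(channels, parts=2):
    """Split the channel list into equal contiguous parts."""
    n = len(channels)
    if n % parts != 0:
        raise ValueError("number of channels must be divisible by parts")
    size = n // parts
    return [channels[k * size:(k + 1) * size] for k in range(parts)]


# Hypothetical fused feature: 4 channels derived from the initial disparity
# ("d*") concatenated with 4 image-feature channels ("i*").
fused = ["d0", "d1", "d2", "d3", "i0", "i1", "i2", "i3"]

mixed = channel_shuffle(fused, groups=2)
first, second = split_channels(mixed, parts=2)

print(mixed)   # ['d0', 'i0', 'd1', 'i1', 'd2', 'i2', 'd3', 'i3']
print(first)   # ['d0', 'i0', 'd1', 'i1']
print(second)  # ['d2', 'i2', 'd3', 'i3']
```

After the shuffle, each split half contains both disparity-derived and image-derived channels, so a lightweight per-branch operation sees information from both sources. This is one plausible reading of how shuffling and splitting can mix features at low cost.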