Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation
Journal: arXiv
Published Date: Jun 17, 2025
Abstract
We present Sparsh-X, the first multisensory touch representation spanning four
tactile modalities: image, audio, motion, and pressure. Trained on ~1M
contact-rich interactions collected with the Digit 360 sensor, Sparsh-X
captures complementary touch signals at diverse temporal and spatial scales. By
leveraging self-supervised learning, Sparsh-X fuses these modalities into a
unified representation that captures physical properties useful for robot
manipulation tasks. We study how to effectively integrate real-world touch
representations for both imitation learning and tactile adaptation of
sim-trained policies, showing that Sparsh-X boosts policy success rates by 63%
over an end-to-end model using tactile images and improves robustness by 90% in
recovering object states from touch. Finally, we benchmark Sparsh-X's ability to
make inferences about physical properties, such as object-action
identification, material-quantity estimation, and force estimation. Sparsh-X
improves accuracy in characterizing physical properties by 48% compared to
end-to-end approaches, demonstrating the advantages of multisensory pretraining
for capturing features essential for dexterous manipulation.