TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation
Journal: arXiv
Published Date: Jun 26, 2025
Abstract
The rapid advancement of 3D vision-language models (VLMs) has spurred
significant interest in interactive point cloud processing tasks, particularly
for real-world applications. However, existing methods often underperform on
point-level tasks such as segmentation because they lack direct 3D-text
alignment, which limits their ability to link local 3D features with textual
context. To address this, we propose TSDASeg, a Two-Stage model coupled
with a Direct cross-modal Alignment module and a memory module for interactive
point cloud Segmentation. We introduce the direct cross-modal alignment module
to establish explicit alignment between 3D point clouds and textual/2D image
data. Within the memory module, we employ multiple dedicated memory banks to
separately store text features, visual features, and their cross-modal
correspondence mappings. These memory banks are queried dynamically through
self-attention and cross-attention mechanisms to update scene-specific features
from previously stored data, effectively addressing inconsistencies in
interactive segmentation results across diverse scenarios. Experiments
conducted on multiple 3D instruction, reference, and semantic segmentation
datasets demonstrate that the proposed method achieves state-of-the-art
performance.
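
The memory-bank update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `MemoryBank` class, the `update_scene_features` function, the FIFO eviction policy, the residual connection, and the assumption that all three banks hold vectors of the same dimension are hypothetical choices made only for illustration. The sketch uses plain scaled dot-product attention without learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention: each query reads a weighted mix of values."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


class MemoryBank:
    """Hypothetical fixed-capacity store of feature vectors (oldest evicted first)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def write(self, features):
        self.items.extend(features)
        overflow = len(self.items) - self.capacity
        if overflow > 0:
            del self.items[:overflow]


def update_scene_features(scene_feats, banks):
    """Refine scene features with self-attention, then read from the banks.

    Assumption: text, visual, and correspondence banks share one feature
    dimension, so their contents can be concatenated into a single memory.
    """
    # Self-attention: scene features attend to one another.
    refined = attend(scene_feats, scene_feats, scene_feats)
    memory = [item for bank in banks for item in bank.items]
    if not memory:
        return refined  # nothing stored yet, so no cross-attention step
    # Cross-attention: scene features query the stored memory.
    recalled = attend(refined, memory, memory)
    # Residual combination (an illustrative choice, not from the paper).
    return [[r + m for r, m in zip(rv, mv)]
            for rv, mv in zip(refined, recalled)]


# Usage: three banks standing in for text, visual, and correspondence memory.
text_bank, visual_bank, corr_bank = MemoryBank(4), MemoryBank(4), MemoryBank(4)
text_bank.write([[0.2, 0.8]])
visual_bank.write([[0.9, 0.1]])
scene = [[1.0, 0.0], [0.0, 1.0]]
updated = update_scene_features(scene, [text_bank, visual_bank, corr_bank])
```

Keeping the banks separate and concatenating them only at read time mirrors the abstract's description of dedicated stores. A real model would add learned query, key, and value projections and would likely attend to each bank with its own parameters.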