Patient-specific vs Multi-Patient Vision Transformer for Markerless Tumor Motion Forecasting
Journal: arXiv
Published Date: Jul 10, 2025
Abstract
Background: Accurate forecasting of lung tumor motion is essential for
precise dose delivery in proton therapy. While current markerless methods
mostly rely on deep learning, transformer-based architectures remain unexplored
in this domain, despite their proven performance in trajectory forecasting.
Purpose: This work introduces a markerless forecasting approach for lung
tumor motion using Vision Transformers (ViT). Two training strategies are
evaluated under clinically realistic constraints: a patient-specific (PS)
approach that learns individualized motion patterns, and a multi-patient (MP)
model designed for generalization. The comparison explicitly accounts for the
limited number of images that can be generated between planning and treatment
sessions.
Methods: Digitally reconstructed radiographs (DRRs) derived from planning
4DCT scans of 31 patients were used to train the MP model; a 32nd patient was
held out for evaluation. PS models were trained using only the target patient's
planning data. Both models used 16 DRRs per input and predicted tumor motion
over a 1-second horizon. Performance was assessed using Average Displacement
Error (ADE) and Final Displacement Error (FDE), on both planning (T1) and
treatment (T2) data.
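As an illustration of the two metrics named above, the following is a minimal sketch using their usual definitions in trajectory forecasting: ADE is the mean Euclidean distance between predicted and true positions over all forecast steps, and FDE is that distance at the final step. The abstract does not give the exact formulation, coordinate system, or units used in the study, so the function name `displacement_errors`, the 2D `(x, y)` points, and the sample values are assumptions made for illustration only.

```python
import math


def displacement_errors(predicted, actual):
    """Return (ADE, FDE) for a single forecast trajectory.

    predicted, actual: equal-length sequences of (x, y) tumor positions,
    one per forecast time step, expressed in the same unit.
    This layout is assumed for illustration, not taken from the paper.
    """
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")

    # Euclidean distance between prediction and ground truth at each step.
    distances = [math.dist(p, a) for p, a in zip(predicted, actual)]

    ade = sum(distances) / len(distances)  # average over the whole horizon
    fde = distances[-1]                    # error at the last forecast step
    return ade, fde


# Hypothetical three-step forecast (values are made up for the example).
predicted = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
actual = [(0.0, 0.0), (1.0, 2.0), (2.0, 5.0)]
ade, fde = displacement_errors(predicted, actual)
```

ADE summarizes accuracy across the whole 1-second horizon, whereas FDE isolates the furthest-ahead prediction, which is typically the hardest one.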
Results: On T1 data, PS models outperformed MP models across all training set
sizes, especially with larger datasets (up to 25,000 DRRs, p < 0.05). However,
MP models demonstrated stronger robustness to inter-fractional anatomical
variability and achieved comparable performance on T2 data without retraining.
Conclusions: This is the first study to apply ViT architectures to markerless
tumor motion forecasting. While PS models achieve higher precision, MP models
offer robust out-of-the-box performance and are well suited to time-constrained
clinical settings.