Teaching Metric Distance to Autoregressive Multimodal Foundational Models
Journal:
arXiv
Published Date:
Mar 4, 2025
Abstract
As large language models expand beyond natural language to domains such as
mathematics, multimodal understanding, and embodied agents, tokens increasingly
reflect metric relationships rather than purely linguistic meaning. We
introduce DIST2Loss, a distance-aware framework designed to train
autoregressive discrete models by leveraging predefined distance relationships
among output tokens. At its core, DIST2Loss transforms continuous exponential
family distributions derived from inherent distance metrics into discrete,
categorical optimization targets compatible with the models' architectures.
This approach enables the models to learn and preserve meaningful distance
relationships during token generation while maintaining compatibility with
existing architectures. Empirical evaluations show consistent performance gains
in diverse multimodal applications, including visual grounding, robotic
manipulation, generative reward modeling, and image generation using
vector-quantized features. These improvements are most notable in low-data
regimes, demonstrating DIST2Loss's strength under resource constraints.