Decoupled Visual Interpretation and Linguistic Reasoning for Math Problem Solving
Journal: arXiv
Published Date: May 23, 2025
Abstract
Current large vision-language models (LVLMs) typically employ a connector
module to link visual features with text embeddings of large language models
(LLMs) and use end-to-end training to achieve multi-modal understanding in a
unified process. Achieving good alignment requires high-quality pre-training
data and a carefully designed training process. Even so, current LVLMs
struggle with complex vision-language reasoning tasks, and their reasoning
capabilities lag notably behind those of LLMs. This paper proposes a
paradigm shift: instead of training end-to-end vision-language reasoning
models, we advocate for developing a decoupled reasoning framework based on
existing visual interpretation specialists and text-based reasoning LLMs. Our
approach leverages (1) a dedicated vision-language model to transform the
visual content of images into textual descriptions and (2) an LLM to perform
reasoning over the visually derived text and the original question. This
method presents a cost-efficient solution for multi-modal model development by
optimizing existing models to work collaboratively, avoiding end-to-end
development of vision-language models from scratch. Because images are
transformed into text representations that any language model can consume, the
framework allows low-cost, flexible upgrades as more powerful LLMs become
available. We introduce an
outcome-rewarded joint-tuning strategy to optimize the cooperation between the
visual interpretation and linguistic reasoning models. Evaluation results on
vision-language benchmarks demonstrate that the decoupled reasoning framework
outperforms recent LVLMs. Our approach yields particularly significant
performance gains on visually intensive geometric mathematics problems. The
code is available at https://github.com/guozix/DVLR.
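
The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the DVLR repository: the names `solve`, `outcome_reward`, `stub_interpreter`, and `stub_reasoner` are hypothetical, the prompt layout is an assumption, and the exact-match reward is only one plausible reading of "outcome-rewarded", since the abstract does not define the reward.

```python
from typing import Callable

# Hypothetical interfaces: any callables with these shapes can be plugged in.
Interpreter = Callable[[bytes], str]  # image bytes -> textual description
Reasoner = Callable[[str], str]       # text prompt -> final answer


def solve(image: bytes, question: str,
          interpreter: Interpreter, reasoner: Reasoner) -> str:
    """Decoupled inference: describe the image, then reason over text only."""
    description = interpreter(image)
    # Assumed prompt layout; the reasoner never sees the image itself.
    prompt = (
        f"Image description:\n{description}\n\n"
        f"Question:\n{question}\n\nAnswer:"
    )
    return reasoner(prompt)


def outcome_reward(predicted: str, reference: str) -> float:
    """Assumed reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


# Stand-in models so the sketch runs without any real VLM or LLM.
def stub_interpreter(image: bytes) -> str:
    return "A right triangle with legs of length 3 and 4."


def stub_reasoner(prompt: str) -> str:
    return "5"


if __name__ == "__main__":
    answer = solve(b"", "What is the length of the hypotenuse?",
                   stub_interpreter, stub_reasoner)
    print(answer, outcome_reward(answer, "5"))
```

Because the reasoner only receives text, swapping in a different LLM means passing a different `reasoner` callable. In a joint-tuning setup, a scalar such as `outcome_reward` would be computed on the final answer and used as the training signal for both models; how that signal is propagated is not specified in the abstract.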