DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model
Journal:
arXiv
Published Date:
Apr 24, 2025
Abstract
This paper presents the technical solution proposed by Huawei Translation
Service Center (HW-TSC) for the "End-to-End Document Image Machine Translation
for Complex Layouts" competition at the 19th International Conference on
Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging
state-of-the-art open-source large vision-language model (LVLM), we introduce a
training framework that combines multi-task learning with perceptual
chain-of-thought to develop a comprehensive end-to-end document translation
system. During the inference phase, we apply minimum Bayesian decoding and
post-processing strategies to further enhance the system's translation
capabilities. Our solution uniquely addresses both OCR-based and OCR-free
document image translation tasks within a unified framework. This paper
systematically details the training methods, inference strategies, LVLM base
models, training data, experimental setups, and results, demonstrating an
effective approach to document image machine translation.