Towards better text image machine translation with multimodal codebook and multi-stage training.
Journal:
Neural Networks: the official journal of the International Neural Network Society
Published Date:
May 23, 2025
Abstract
As a widely used machine translation task, text image machine translation (TIMT) aims to translate source texts embedded in images into a target language. However, studies in this area face two challenges: (1) dominant models are constructed in a cascaded manner and thus suffer from error propagation from optical character recognition (OCR), and (2) the field lacks publicly available large-scale datasets. To deal with these issues, we propose a multimodal codebook-based TIMT model. In addition to a text encoder, an image encoder, and a text decoder, our model is equipped with a multimodal codebook that effectively associates images with relevant texts, thus providing useful supplementary information for translation. In particular, we present a multi-stage training framework that fully exploits various datasets to train our model effectively. Concretely, we first conduct preliminary training of the text encoder and decoder on bilingual texts. Next, via an additional code-conditioned mask translation task, we use the bilingual texts to continue training the text encoder, multimodal codebook, and decoder. Afterwards, by further introducing an image-text alignment task and adversarial training, we train the whole model except for the text decoder on an OCR dataset. Finally, using all of the above training tasks except text translation, we fine-tune the whole model on a TIMT dataset. Besides, we manually annotate a Chinese-English TIMT dataset, named OCRMT30K, and extend it to a Chinese-German TIMT dataset through an automatic translation tool. To the best of our knowledge, it is the first publicly available, manually annotated TIMT dataset, which facilitates future studies of this task. To investigate the effectiveness of our model, we conduct extensive experiments on Chinese-English and Chinese-German TIMT tasks. Both the experimental results and in-depth analyses strongly demonstrate the effectiveness of our model. We release our code and dataset at https://github.com/DeepLearnXMU/mc_tit.
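The abstract gives no implementation details, but the codebook component can be illustrated with a vector-quantization-style lookup. Below is a minimal, hypothetical PyTorch sketch, not the authors' released implementation: the class name `MultimodalCodebook`, the sizes `num_codes` and `dim`, and the nearest-code lookup with a straight-through estimator are all illustrative assumptions about how encoder features might be mapped to shared latent codes that supplement the text decoder.

```python
# Minimal, hypothetical sketch of a multimodal codebook lookup.
# Not the authors' implementation; module names, sizes, and the
# nearest-neighbor quantization scheme are illustrative assumptions.
import torch
import torch.nn as nn


class MultimodalCodebook(nn.Module):
    def __init__(self, num_codes: int = 2048, dim: int = 512):
        super().__init__()
        # Latent codes intended to be shared across the image and text
        # modalities, so that image features retrieved at inference time
        # are associated with text-relevant codes.
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        """features: (batch, seq_len, dim) from the image or text encoder."""
        weight = self.codes.weight                        # (num_codes, dim)
        # Squared Euclidean distance from each feature to every code.
        dists = (features.pow(2).sum(-1, keepdim=True)
                 - 2.0 * features @ weight.t()
                 + weight.pow(2).sum(-1))                 # (B, L, num_codes)
        indices = dists.argmin(dim=-1)                    # (B, L)
        quantized = self.codes(indices)                   # (B, L, dim)
        # Straight-through estimator: quantized values in the forward
        # pass, identity gradient to the encoder in the backward pass.
        return features + (quantized - features).detach()


if __name__ == "__main__":
    codebook = MultimodalCodebook()
    image_feats = torch.randn(2, 49, 512)   # e.g., a 7x7 grid of visual features
    supplementary = codebook(image_feats)   # extra context for the decoder
    print(supplementary.shape)              # torch.Size([2, 49, 512])
```

Read against the abstract's schedule, such a module would sit idle during the initial bilingual warm-up, be trained jointly with the text encoder and decoder via the code-conditioned mask translation task, be aligned with the image encoder on OCR data (with the text decoder frozen), and finally be fine-tuned with the whole model on TIMT data.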