Towards better text image machine translation with multimodal codebook and multi-stage training.
Journal:
Neural Networks: the official journal of the International Neural Network Society
Published Date:
May 23, 2025
Abstract
As a widely used machine translation task, text image machine translation (TIMT) aims to translate source texts embedded in images into a target language. However, studies in this area face two challenges: (1) dominant models are constructed in a cascaded manner and thus suffer from error propagation from optical character recognition (OCR), and (2) the field lacks publicly available large-scale datasets. To deal with these issues, we propose a multimodal codebook-based TIMT model. In addition to a text encoder, an image encoder, and a text decoder, our model is equipped with a multimodal codebook that effectively associates images with relevant texts, thus providing useful supplementary information for translation. In particular, we present a multi-stage training framework that fully exploits various datasets to train our model effectively. Concretely, we first conduct preliminary training of the text encoder and decoder on bilingual texts. Next, via an additional code-conditioned mask translation task, we use the bilingual texts to continue training the text encoder, multimodal codebook, and decoder. Afterwards, by further introducing an image-text alignment task and adversarial training, we train the whole model except for the text decoder on an OCR dataset. Finally, using all of the above training tasks except text translation, we fine-tune the whole model on a TIMT dataset. Besides, we manually annotate a Chinese-English TIMT dataset, named OCRMT30K, and extend it to a Chinese-German TIMT dataset through an automatic translation tool. To the best of our knowledge, it is the first publicly available, manually annotated TIMT dataset, which facilitates future studies of this task. To investigate the effectiveness of our model, we conduct extensive experiments on Chinese-English and Chinese-German TIMT tasks. Both the experimental results and in-depth analyses strongly demonstrate the effectiveness of our model. We release our code and dataset at https://github.com/DeepLearnXMU/mc_tit.
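The abstract gives no implementation details, but the codebook component can be illustrated with a vector-quantization-style lookup. Below is a minimal, hypothetical PyTorch sketch, not the authors' released implementation: the class name `MultimodalCodebook`, the sizes `num_codes` and `dim`, and the nearest-code lookup with a straight-through estimator are all illustrative assumptions about how encoder features might be mapped to shared latent codes that supplement the text decoder.

```python
# Minimal, hypothetical sketch of a multimodal codebook lookup.
# Not the authors' implementation; module names, sizes, and the
# nearest-neighbor quantization scheme are illustrative assumptions.
import torch
import torch.nn as nn


class MultimodalCodebook(nn.Module):
    def __init__(self, num_codes: int = 2048, dim: int = 512):
        super().__init__()
        # Latent codes intended to be shared across the image and text
        # modalities, so that image features retrieved at inference time
        # are associated with text-relevant codes.
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        """features: (batch, seq_len, dim) from the image or text encoder."""
        weight = self.codes.weight                        # (num_codes, dim)
        # Squared Euclidean distance from each feature to every code.
        dists = (features.pow(2).sum(-1, keepdim=True)
                 - 2.0 * features @ weight.t()
                 + weight.pow(2).sum(-1))                 # (B, L, num_codes)
        indices = dists.argmin(dim=-1)                    # (B, L)
        quantized = self.codes(indices)                   # (B, L, dim)
        # Straight-through estimator: quantized values in the forward
        # pass, identity gradient to the encoder in the backward pass.
        return features + (quantized - features).detach()


if __name__ == "__main__":
    codebook = MultimodalCodebook()
    image_feats = torch.randn(2, 49, 512)   # e.g., a 7x7 grid of visual features
    supplementary = codebook(image_feats)   # extra context for the decoder
    print(supplementary.shape)              # torch.Size([2, 49, 512])
```

Read against the abstract's schedule, such a module would sit idle during the initial bilingual warm-up, be trained jointly with the text encoder and decoder via the code-conditioned mask translation task, be aligned with the image encoder on OCR data (with the text decoder frozen), and finally be fine-tuned with the whole model on TIMT data.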