HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation

Journal: arXiv

Published Date: Jun 3, 2025

Abstract

With the advancement of language models, unified multimodal understanding and generation have made significant strides, with model architectures evolving from separate components to unified single-model frameworks. This paper explores an efficient training paradigm to build a single transformer for unified multimodal understanding and generation. Specifically, we propose a multimodal warmup strategy that uses prior knowledge to extend capabilities. To address cross-modal compatibility challenges, we introduce feature pre-scaling and multimodal AdaLN techniques. Integrating these techniques, we present HaploOmni, a new single multimodal transformer. With limited training cost, HaploOmni achieves competitive performance against advanced unified models across multiple image and video understanding and generation benchmarks. All code will be made public at https://github.com/Tencent/HaploVLM.

Authors

Yicheng Xiao
Lin Song
Rui Yang
Cheng Cheng
Zunnan Xu
Zhaoyang Zhang
Yixiao Ge
Xiu Li
Ying Shan

External Resources

View on arXiv (http://arxiv.org/abs/2506.02975v1)
