ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations

Journal: arXiv

Published Date: May 29, 2025

Abstract

Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, existing open-source multi-modal models are weak at multi-turn interaction, especially over long contexts. To address this issue, we first introduce a context modeling module, termed ContextQFormer, which uses a memory block to enhance the representation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced later. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports research on multi-turn multi-modal dialogue. In addition, we compare ContextQFormer with three baselines on TMDialog, and experimental results show that ContextQFormer achieves an improvement of 2%-4% in available rate over the baselines.

Authors

Yiming Lei
Zhizheng Yang
Zeming Liu
Haitao Leng
Shaoguo Liu
Tingting Gao
Qingjie Liu
Yunhong Wang

External Resources

View on arXiv (http://arxiv.org/abs/2505.23121v1)
