ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations
Journal:
arXiv
Published Date:
May 29, 2025
Abstract
Multi-modal large language models have demonstrated remarkable zero-shot
abilities and powerful image-understanding capabilities. However, the existing
open-source multi-modal models suffer from the weak capability of multi-turn
interaction, especially for long contexts. To address the issue, we first
introduce a context modeling module, termed ContextQFormer, which utilizes a
memory block to enhance the presentation of contextual information.
Furthermore, to facilitate further research, we carefully build a new
multi-turn multi-modal dialogue dataset (TMDialog) for pre-training,
instruction-tuning, and evaluation, which will be open-sourced lately. Compared
with other multi-modal dialogue datasets, TMDialog contains longer
conversations, which supports the research of multi-turn multi-modal dialogue.
In addition, ContextQFormer is compared with three baselines on TMDialog and
experimental results illustrate that ContextQFormer achieves an improvement of
2%-4% in available rate over baselines.