On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval
Journal: arXiv
Published Date: Jun 13, 2025
Abstract
Multimodal chatbots have become a major topic in dialogue systems
in both the research community and industry. Recently, researchers have shed light
on the multimodality of responses as well as dialogue contexts. This work
explores how a dialogue system can output responses in various modalities such
as text and images. To this end, we first formulate a multimodal dialogue
response retrieval task for retrieval-based systems as the combination of three
subtasks. We then propose three integration methods based on a two-step
approach and an end-to-end approach, and compare the merits and demerits of
each method. Experimental results on two datasets demonstrate that the
end-to-end approach achieves comparable performance without the intermediate
step that the two-step approach requires. In addition, a parameter sharing strategy not
only reduces the number of parameters but also boosts performance by
transferring knowledge across the subtasks and the modalities.
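
The contrast between the two-step and end-to-end approaches can be illustrated with a minimal sketch. The abstract does not specify the subtasks, encoders, or scoring functions, so everything below is an assumption made for illustration: the two-step variant is assumed to first predict a response modality and then rank candidates within it, the end-to-end variant is assumed to rank all candidates in one shared embedding space, and the names `two_step_retrieve`, `end_to_end_retrieve`, and `modality_weights` are hypothetical. It is not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
# A candidate response: (modality, candidate id, embedding). Hypothetical layout.
Candidate = Tuple[str, str, Vector]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_step_retrieve(
    context_vec: Vector,
    modality_weights: Dict[str, Vector],
    candidates: List[Candidate],
) -> Candidate:
    """Assumed two-step scheme: predict the modality, then rank within it."""
    # Step 1 (the intermediate step): choose the response modality with a
    # hypothetical linear scorer over the dialogue-context embedding.
    modality = max(
        modality_weights, key=lambda m: dot(context_vec, modality_weights[m])
    )
    # Step 2: rank only the candidates of the predicted modality.
    pool = [c for c in candidates if c[0] == modality]
    return max(pool, key=lambda c: dot(context_vec, c[2]))


def end_to_end_retrieve(
    context_vec: Vector, candidates: List[Candidate]
) -> Candidate:
    """Assumed end-to-end scheme: rank text and image candidates jointly."""
    # No explicit modality decision: the modality of the top-ranked
    # candidate is the modality of the response.
    return max(candidates, key=lambda c: dot(context_vec, c[2]))


# Toy data, invented purely for illustration.
context = [1.0, 0.0]
modality_weights = {"text": [0.2, 0.0], "image": [0.9, 0.0]}
candidates = [
    ("text", "t1", [0.5, 0.1]),
    ("text", "t2", [0.1, 0.9]),
    ("image", "i1", [0.8, 0.2]),
    ("image", "i2", [0.3, 0.3]),
]

two_step_result = two_step_retrieve(context, modality_weights, candidates)
end_to_end_result = end_to_end_retrieve(context, candidates)
```

On this toy input both functions select the image candidate `i1`. In the sketch, the end-to-end variant needs no separate modality scorer, which mirrors the abstract's point about dropping the intermediate step; a shared embedding space for both modalities is one way parameter sharing could be realized, though the paper's actual strategy is not described here.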