Multimodal Multihop Source Retrieval for Web Question Answering
Journal: arXiv
Published Date: Jan 7, 2025
Abstract
This work addresses the challenge of learning and reasoning for multi-modal
multi-hop question answering (QA). We propose a graph reasoning network based
on the semantic structure of the sentences to learn multi-source reasoning
paths and find the supporting facts across both image and text modalities for
answering the question. In this paper, we investigate the importance of graph
structure for multi-modal multi-hop question answering. Our analysis is
centered on WebQA. We construct a strong baseline model that finds relevant
sources using a pairwise classification task. We establish that, with the
proper use of feature representations from pre-trained models, graph structure
helps improve multi-modal multi-hop question answering. We point out that
both the graph structure and the adjacency matrix are task-related prior knowledge, and
that the graph structure can be leveraged to improve retrieval performance for the
task. Experiments and visualized analysis demonstrate that message propagation
over graph networks or the entire graph structure can replace massive
multimodal transformers with token-wise cross-attention. We demonstrate the
applicability of our method and show a gain of 4.6% in retrieval F1 score
over the transformer baselines, despite the model being very lightweight.
We further demonstrate the applicability of our model to a large-scale
retrieval setting.
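
To make the idea of message propagation over a source graph concrete, the following is a minimal sketch and not the authors' implementation. It assumes each candidate source (image or text) is already represented by a fixed-size feature vector from a pre-trained model, and that an adjacency matrix linking related sources is given. The function name `propagate`, the mean-aggregation rule, and the toy inputs are all illustrative assumptions; the abstract does not specify the actual layer design.

```python
import numpy as np


def propagate(features: np.ndarray, adjacency: np.ndarray) -> np.ndarray:
    """One round of mean-aggregation message passing (hypothetical helper).

    features:  (num_sources, dim) array of per-source feature vectors.
    adjacency: (num_sources, num_sources) 0/1 matrix; 1 means two sources are linked.
    Returns an array of the same shape as `features`, where each row is the
    average of that source's own vector and its neighbours' vectors.
    """
    # Add self-loops so every node keeps a share of its own representation.
    a_hat = adjacency + np.eye(adjacency.shape[0])
    # Row-normalise so that each node averages over itself and its neighbours.
    degree = a_hat.sum(axis=1, keepdims=True)
    return (a_hat / degree) @ features


# Toy example: three sources, the first two linked, the third isolated.
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 0.0, 0.0]])
updated = propagate(features, adjacency)
```

In this toy graph the two linked sources end up sharing information after one round, while the isolated source keeps its original features. A learned model would typically add trainable weights and a nonlinearity on top of such an aggregation step.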