Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild
Journal:
arXiv
Published Date:
Jan 6, 2025
Abstract
Complex visual reasoning remains a key challenge today. Typically, the
challenge is tackled using methodologies such as Chain of Thought (CoT) and
visual instruction tuning. However, how to organically combine these two
methodologies for greater success remains unexplored. Also, issues like
hallucinations and high training cost still need to be addressed. In this work,
we devise an innovative multi-round training and reasoning framework suitable
for lightweight Multimodal Large Language Models (MLLMs). Our self-questioning
approach heuristically guides MLLMs to focus on visual clues relevant to the
target problem, reducing hallucinations and enhancing the model's ability to
describe fine-grained image details. This ultimately enables the model to
perform well in complex visual reasoning and question-answering tasks. We have
named this framework Socratic Questioning (SQ). To facilitate future research,
we create a multimodal mini-dataset named CapQA, which includes 1k images of
fine-grained activities, for visual instruction tuning and evaluation. Our
proposed SQ method leads to a 31.2% improvement in the hallucination score. Our
extensive experiments on various benchmarks demonstrate SQ's remarkable
capabilities in heuristic self-questioning, zero-shot visual reasoning and
hallucination mitigation. Our model and code will be publicly available.