Towards Statistical Factuality Guarantee for Large Vision-Language Models
Journal: arXiv
Published Date: Feb 27, 2025
Abstract
Advancements in Large Vision-Language Models (LVLMs) have demonstrated
promising performance in a variety of vision-language tasks involving
image-conditioned free-form text generation. However, growing concerns about
hallucinations in LVLMs, where the generated text is inconsistent with the
visual context, are becoming a major impediment to deploying these models in
applications that demand guaranteed reliability. In this paper, we introduce a
framework to address this challenge, ConfLVLM, which is grounded in conformal
prediction to achieve finite-sample distribution-free statistical guarantees on
the factuality of LVLM output. This framework treats an LVLM as a hypothesis
generator, where each generated text detail (or claim) is considered an
individual hypothesis. It then applies a statistical hypothesis testing
procedure to verify each claim using efficient heuristic uncertainty measures
to filter out unreliable claims before returning any responses to users. We
conduct extensive experiments covering three representative application
domains, including general scene understanding, medical radiology report
generation, and document understanding. Remarkably, ConfLVLM reduces the error
rate of claims generated by LLaVA-1.5 for scene descriptions from 87.8% to
10.0% by filtering out erroneous claims with a 95.3% true positive rate. Our
results further demonstrate that ConfLVLM is highly flexible, and can be
applied to any black-box LVLM paired with any uncertainty measure for any
image-conditioned free-form text generation task while providing a rigorous
guarantee on controlling the risk of hallucination.
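
To make the idea of filtering claims with a calibrated threshold concrete, the following is a minimal sketch of a generic split-conformal filter. It is not the paper's procedure: the function names `calibrate_threshold` and `filter_claims`, the data layout, and the choice of score (the highest confidence given to any false claim in a response) are all assumptions made for illustration. Under the usual exchangeability assumption between calibration and test responses, a threshold chosen this way keeps the probability that a filtered response retains any false claim at or below `alpha`.

```python
import math


def calibrate_threshold(calibration, alpha):
    """Pick a confidence threshold from labeled calibration responses.

    calibration: list of responses, each a list of (confidence, is_correct)
    pairs, one pair per claim (hypothetical layout, assumed for this sketch).
    alpha: tolerated probability that a filtered response keeps a false claim.
    """
    scores = []
    for claims in calibration:
        # Score of a response: highest confidence given to any false claim.
        # A response with no false claims needs no filtering at all.
        false_conf = [conf for conf, ok in claims if not ok]
        scores.append(max(false_conf) if false_conf else -math.inf)
    scores.sort()
    n = len(scores)
    # Finite-sample conformal rank (1-indexed): ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration responses for this alpha: retain nothing.
        return math.inf
    return scores[k - 1]


def filter_claims(claims, threshold):
    """Keep only claims whose confidence is strictly above the threshold.

    claims: list of (text, confidence) pairs for one new response.
    """
    return [text for text, conf in claims if conf > threshold]


# Toy calibration set: three responses, each with one true and one false claim.
calibration = [
    [(0.90, True), (0.30, False)],
    [(0.80, True), (0.60, False)],
    [(0.95, True), (0.90, False)],
]
threshold = calibrate_threshold(calibration, alpha=0.5)
kept = filter_claims(
    [("a dog", 0.7), ("a red ball", 0.6), ("two cats", 0.2)], threshold
)
```

The strict inequality in `filter_claims` matters: a claim whose confidence exactly equals the threshold is dropped, which is what keeps the bound valid when scores tie. The confidence values themselves could come from any heuristic uncertainty measure, since the calibration step does not assume they are well calibrated.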