Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction
Journal:
arXiv
Published Date:
Apr 24, 2025
Abstract
This study addresses the critical challenge of hallucination mitigation in
Large Vision-Language Models (LVLMs) for Visual Question Answering (VQA) tasks
through a Split Conformal Prediction (SCP) framework. While LVLMs excel in
multi-modal reasoning, their outputs often exhibit hallucinated content with
high confidence, posing risks in safety-critical applications. We propose a
model-agnostic uncertainty quantification method that integrates dynamic
threshold calibration and cross-modal consistency verification. By partitioning
data into calibration and test sets, the framework computes nonconformity
scores to construct prediction sets with statistical guarantees under
user-defined risk levels ($\alpha$). Key innovations include: (1) rigorous
control of \textbf{marginal coverage} to ensure empirical error rates remain
strictly below $\alpha$; (2) dynamic adjustment of prediction set sizes
inversely with $\alpha$, filtering low-confidence outputs; (3) elimination of
prior distribution assumptions and retraining requirements. Evaluations on
benchmarks (ScienceQA, MMMU) with eight LVLMs demonstrate that SCP enforces
theoretical guarantees across all $\alpha$ values. The framework achieves
stable performance across varying calibration-to-test split ratios,
underscoring its robustness for real-world deployment in healthcare, autonomous
systems, and other safety-sensitive domains. This work bridges the gap between
theoretical reliability and practical applicability in multi-modal AI systems,
offering a scalable solution for hallucination detection and uncertainty-aware
decision-making.