Recent advances in multimodal LLMs (MLLMs) have demonstrated a remarkable capability to generate descriptive captions for input videos. However, these models often introduce factual inaccuracies into the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework that enables object and action faithfulness by suppressing spurious correlations and enforcing emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations that lie in the MLLM and transform the original captions into contrastive negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on hallucination examination benchmarks.
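To make the hallucinative self-augmentation scheme concrete, the following is a minimal sketch of how token-level contrastive negatives could be constructed from the MLLM's own predictive distribution. The tensor shapes, the probability-margin criterion, and the function name are our assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of hallucinative self-augmentation: swap in tokens the MLLM is
# prone to confuse with the ground truth (e.g., synonyms or hypernyms),
# turning the original caption into a hallucinative negative.
import torch

def self_augment_negatives(logits: torch.Tensor,
                           gt_ids: torch.Tensor,
                           margin: float = 0.1) -> torch.Tensor:
    """logits: (T, V) per-position vocabulary logits from the MLLM theta_M.
    gt_ids: (T,)   ground-truth token ids of the original caption.
    Returns token ids of a hallucinative negative caption.
    """
    probs = logits.softmax(dim=-1)                        # (T, V)
    gt_prob = probs.gather(-1, gt_ids[:, None]).squeeze(-1)
    # Mask out the ground-truth token, then find the strongest competitor.
    probs = probs.scatter(-1, gt_ids[:, None], 0.0)
    comp_prob, comp_ids = probs.max(dim=-1)               # (T,), (T,)
    # A position is a likely hallucination if a competing token's probability
    # comes within `margin` of (or exceeds) the ground-truth token's.
    confusable = comp_prob > gt_prob - margin
    return torch.where(confusable, comp_ids, gt_ids)

# Toy usage with random logits standing in for the MLLM's real outputs.
T, V = 12, 1000
neg = self_augment_negatives(torch.randn(T, V), torch.randint(0, V, (T,)))
```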
Overview of SANTA. (a) Mitigating video-level hallucination: Hallucinative Self-Augmentation identifies highly probable hallucinated tokens in the MLLM \(\theta_M\) that deviate from the ground-truth words (e.g., synonyms or hypernyms), after which video-caption contrastive alignment is performed. (b) Mitigating object- and action-level hallucinations: Tracklet-Phrase Contrastive Alignment aligns object and action tracklets with their visual and temporal phrases while contrasting hallucinative negatives.
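For the alignment step, one plausible realization is an InfoNCE-style objective in which each tracklet is pulled toward its matched phrase and pushed away from both in-batch phrases and phrases extracted from the self-augmented hallucinative captions. The sketch below assumes pooled feature vectors and a temperature \(\tau\); the loss form is our assumption, as the paper's exact objective may differ.

```python
# Hedged sketch of tracklet-phrase contrastive alignment with
# hallucinative hard negatives (InfoNCE-style, assumed form).
import torch
import torch.nn.functional as F

def tracklet_phrase_contrastive(tracklets: torch.Tensor,
                                phrases: torch.Tensor,
                                neg_phrases: torch.Tensor,
                                tau: float = 0.07) -> torch.Tensor:
    """tracklets:   (B, D) pooled object/action tracklet features.
    phrases:     (B, D) matched visual/temporal phrase features.
    neg_phrases: (B, D) phrase features from the self-augmented
                 hallucinative captions (hard negatives).
    """
    t = F.normalize(tracklets, dim=-1)
    p = F.normalize(phrases, dim=-1)
    n = F.normalize(neg_phrases, dim=-1)
    # Similarities to all in-batch phrases plus each hallucinative negative.
    batch_sims = t @ p.t() / tau                         # (B, B)
    hard_neg = (t * n).sum(-1, keepdim=True) / tau       # (B, 1)
    logits = torch.cat([batch_sims, hard_neg], dim=1)    # (B, B + 1)
    target = torch.arange(t.size(0), device=t.device)    # diagonal = positives
    return F.cross_entropy(logits, target)

# Toy usage with random features.
loss = tracklet_phrase_contrastive(torch.randn(8, 256),
                                   torch.randn(8, 256),
                                   torch.randn(8, 256))
```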
Evaluation of both object and action hallucinations against existing methods on MiraData-9k across three types of video captions: overall content, main object, and background. Bold and underlined numbers indicate the best and second-best results, respectively.
Quantitative comparison with hallucination mitigation methods on video captioning using the FactVC metric.
Quantitative evaluation of both object and action hallucinations on video question answering using VidHal.
Qualitative comparison of video captions predicted by HACL and SANTA. Words highlighted in green indicate action faithfulness, while those in red indicate action hallucination; similarly, words in blue represent object faithfulness, whereas those in orange indicate object hallucination. The examples in (a) and (b) are sampled from the hallucination benchmark.
t-SNE visualization of the latent features of (a) video and caption, (b) object tracklets and phrases, and (c) action tracklets and phrases. For the w/o SANTA setting, we directly visualize features from LLaVA-Video. After training with SANTA, LLaVA-Video improves the alignment between the visual and language modalities while exhibiting clearer separation from the hallucinative captions.