Mitigating Object and Action Hallucinations in Multimodal LLMs
via Self-Augmented Contrastive Alignment

WACV 2026
¹National Taiwan University   ²NVIDIA
f11942093@ntu.edu.tw, frankwang@nvidia.com
Teaser Image

Goal

Enable MLLMs to generate faithful textual captions that accurately describe visual objects and temporal actions without hallucinations.

Challenge

MLLMs often hallucinate non-existent objects or incorrect actions due to language priors and an inability to ground temporal dynamics.

Our Solution

SANTA enhances faithfulness via self-augmented hallucinations as hard negatives and fine-grained tracklet-phrase contrastive alignment.

Abstract

Recent advances in multimodal LLMs (MLLMs) have demonstrated their remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework that enables object and action faithfulness by suppressing spurious correlations and enforcing emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations inherent in the MLLM and transform the original captions into contrastive negatives. Furthermore, we develop a tracklet-phrase contrastive alignment that matches regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on hallucination examination benchmarks.
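
To make the self-augmentation idea concrete, the snippet below is a minimal sketch of one way such hard negatives could be produced; it is not the authors' released implementation. The `model(video_feats, caption_ids)` forward interface is a hypothetical stand-in, and WordNet is used here only to approximate the synonym/hypernym check, whose exact criterion in the paper may differ.

```python
# Minimal sketch of hallucinative self-augmentation (illustrative only).
# Assumptions: a causal MLLM whose forward pass returns per-token logits
# aligned with the caption positions, and WordNet (nltk.download('wordnet'))
# as a stand-in for the synonym/hypernym check.
import torch
from nltk.corpus import wordnet as wn

def is_paraphrase(gt_word: str, pred_word: str) -> bool:
    """Treat the prediction as benign if it is a synonym or hypernym
    of the ground-truth word (the paper's exact criterion may differ)."""
    synsets = wn.synsets(gt_word)
    lemmas = {l.name().lower() for s in synsets for l in s.lemmas()}
    hypernyms = {l.name().lower() for s in synsets
                 for h in s.hypernyms() for l in h.lemmas()}
    return pred_word.lower() in lemmas | hypernyms

@torch.no_grad()
def self_augment(model, tokenizer, video_feats, caption_ids):
    """Swap ground-truth tokens the MLLM is prone to mispredict,
    yielding a hallucinative hard-negative caption."""
    # Hypothetical forward pass: (1, T, vocab) next-token logits.
    logits = model(video_feats, caption_ids).logits
    negative_ids = caption_ids.clone()
    for t in range(caption_ids.size(1) - 1):
        gt_id = caption_ids[0, t + 1].item()
        pred_id = logits[0, t].argmax().item()
        gt_word = tokenizer.decode([gt_id]).strip()
        pred_word = tokenizer.decode([pred_id]).strip()
        # Only substitutions that change the meaning count as hallucinations.
        if pred_id != gt_id and not is_paraphrase(gt_word, pred_word):
            negative_ids[0, t + 1] = pred_id
    return negative_ids
```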

Methodology

SANTA Framework Overview

Overview of SANTA. We (a) mitigate video-level hallucination by applying Hallucinative Self-Augmentation, which identifies the tokens in MLLM \(\theta_M\) most likely to be hallucinated, i.e., those deviating from the ground-truth words (e.g., synonyms or hypernyms), and then performs video-caption contrastive alignment. SANTA then (b) mitigates object- and action-level hallucinations via Tracklet-Phrase Contrastive Alignment, which aligns object and action tracklets with their visual and temporal phrases while contrasting hallucinative negatives.
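
For reference, an InfoNCE-style sketch of such a contrastive alignment objective is given below. It assumes tracklet and phrase embeddings have already been extracted and paired with \(K\) self-augmented hallucinative phrases per sample; the function name, shapes, and loss weighting are illustrative rather than the paper's exact formulation.

```python
# Minimal InfoNCE-style sketch of tracklet-phrase contrastive alignment
# with self-augmented hard negatives (illustrative; not the exact loss).
import torch
import torch.nn.functional as F

def alignment_loss(tracklets, phrases, neg_phrases, tau=0.07):
    """tracklets:   (B, D) object/action tracklet embeddings
       phrases:     (B, D) matched visual/temporal phrase embeddings
       neg_phrases: (B, K, D) self-augmented hallucinative phrase embeddings"""
    v = F.normalize(tracklets, dim=-1)
    p = F.normalize(phrases, dim=-1)
    n = F.normalize(neg_phrases, dim=-1)

    pos = (v * p).sum(-1, keepdim=True) / tau       # (B, 1) matched pair
    hard = torch.einsum('bd,bkd->bk', v, n) / tau   # (B, K) hallucinations
    batch = (v @ p.t()) / tau                       # (B, B) other phrases
    batch = batch.masked_fill(                      # mask the diagonal,
        torch.eye(len(v), dtype=torch.bool, device=v.device),  # it repeats pos
        float('-inf'))

    logits = torch.cat([pos, hard, batch], dim=1)   # positive sits at index 0
    target = torch.zeros(len(v), dtype=torch.long, device=v.device)
    return F.cross_entropy(logits, target)
```

The hallucinative negatives share most of their tokens with the positive phrase, which is what makes them hard: the model can only separate them by attending to the mismatched object or action.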

Quantitative Results

Video Hallucination Examination Benchmarks

Evaluation of both object and action hallucinations against existing methods on MiraData-9k across three types of video captions: overall content, main object, and background. Bold and underline indicate the best and second-best results, respectively.

MiraData Results

Quantitative comparisons with hallucination mitigation methods on video captioning using FactVC.

FactVC Results

Quantitative evaluation of both object and action hallucinations on video question answering using VidHal.

VidHal Results

Qualitative Results & Analysis

Qualitative Comparison

Qualitative Results

Qualitative comparison of video captions predicted by HACL and SANTA. Words highlighted in green indicate action faithfulness, while those in red indicate action hallucination; similarly, words in blue represent object faithfulness, whereas those in orange indicate object hallucination. Examples (a) and (b) are sampled from the hallucination benchmark.

Analysis of Feature Space (t-SNE)

t-SNE Analysis

t-SNE visualization of the latent features of (a) video and caption, (b) object tracklets and phrases, and (c) action tracklets and phrases. For the w/o SANTA setting, we directly visualize features from LLaVA-Video. After training with SANTA, LLaVA-Video shows improved alignment between the visual and language modalities as well as a clearer separation from the hallucinative captions.
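
A plot of this kind can be reproduced with a short scikit-learn sketch, assuming the feature matrices have already been extracted from the model; all variable names below are illustrative and not taken from the released code.

```python
# Minimal t-SNE sketch for visualizing extracted features (illustrative).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(video_feats, caption_feats, halluc_feats, out_path="tsne.png"):
    """Each argument is an (N_i, D) array of pre-extracted features."""
    feats = np.concatenate([video_feats, caption_feats, halluc_feats], axis=0)
    coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)

    # Split the 2-D coordinates back into the three feature groups.
    splits = np.cumsum([len(video_feats), len(caption_feats)])
    labels = ["video", "caption", "hallucinative caption"]
    for chunk, label in zip(np.split(coords, splits), labels):
        plt.scatter(chunk[:, 0], chunk[:, 1], s=8, label=label)
    plt.legend()
    plt.savefig(out_path, dpi=200)
```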