ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition
[ICLR 2025]

Center for Research in Computer Vision (CRCV), University of Central Florida

Abstract

Bias in machine learning models can lead to unfair decision making, and while it has been well-studied in the image and text domains, it remains underexplored in action recognition. Action recognition models often suffer from background bias (i.e., inferring actions based on background cues) and foreground bias (i.e., relying on subject appearance), which can be detrimental to real-life applications such as autonomous vehicles or assisted living monitoring. While prior approaches have mainly focused on mitigating background bias using specialized augmentations, we thoroughly study both biases.

In this paper, we propose ALBAR, a novel adversarial training method that mitigates foreground and background biases without requiring specialized knowledge of the bias attributes. Our framework applies an adversarial cross-entropy loss to a sampled static clip (in which all frames are identical) and pushes its class probabilities toward uniformity using a proposed entropy maximization loss. Additionally, we introduce a gradient penalty loss that regularizes the debiasing process. We evaluate our method on established background and foreground bias protocols, setting a new state of the art and improving combined debiasing performance by over 12% on HMDB51. Furthermore, we identify an issue of background leakage in the existing UCF101 bias evaluation protocol, which offers a shortcut for predicting actions and therefore does not accurately measure a model's debiasing capability. We address this issue by proposing more fine-grained segmentation boundaries for the actor, on which our method also outperforms existing approaches.

Background/Foreground Bias in Action Recognition

Paper Details

Method Diagram

Figure 1: Full ALBAR method diagram showing two types of clips passed through the same video encoder, with different losses applied to each. The first clip uses a standard cross-entropy loss to learn to classify actions from clips with motion. The second clip is created by sampling a frame from the first clip and repeating it to match the original clip shape, producing a static clip with no motion. The adversarial component is created by subtracting the cross-entropy of the static clip prediction. This prediction is encouraged to be uncertain by the entropy loss, and the gradients w.r.t. the static prediction (shown in red) are encouraged to be lower for more stable training by the gradient penalty loss.

Static Adversarial Loss

We propose an adversarial setup that applies a negative cross-entropy loss to a static clip, encouraging the model to predict incorrect action classes when given a clip with no motion. The model is trained to predict the correct class for the original clip while simultaneously being pushed away from the correct class for the static clip. This creates a push-pull effect between the two components, leading to a more robust model that is less reliant on static information.
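To make this concrete, below is a minimal PyTorch-style sketch of the static adversarial term, assuming a generic encoder that maps clips of shape (B, C, T, H, W) to class logits; the function names and frame-sampling choice are illustrative, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def static_adversarial_losses(encoder, clip, labels):
    """clip: (B, C, T, H, W) video tensor; labels: (B,) action classes."""
    T = clip.shape[2]

    # Standard cross-entropy on the original clip with motion.
    logits_motion = encoder(clip)
    ce_motion = F.cross_entropy(logits_motion, labels)

    # Build a static clip: sample one frame and repeat it T times.
    t = torch.randint(0, T, (1,)).item()
    static_clip = clip[:, :, t:t + 1].expand(-1, -1, T, -1, -1).contiguous()

    # Adversarial term: the *negative* cross-entropy of the static prediction,
    # pushing the model away from the true class when no motion is present.
    logits_static = encoder(static_clip)
    adv_loss = -F.cross_entropy(logits_static, labels)

    return ce_motion, adv_loss, logits_static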

Static Entropy Maximization Loss

Applying the adversarial loss alone still lets the model learn static correlations; it simply learns to choose an incorrect class for the static clip with high confidence. To combat this, we introduce a static entropy maximization loss that encourages the model to be uncertain about the static clip prediction by maximizing the entropy of that prediction. This way, the encoder is trained to have higher uncertainty when no temporal motion information is provided, ideally losing the ability to predict actions from spatial information alone.
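A minimal sketch of this term, reusing the static-clip logits from the snippet above: minimizing the negative entropy of the softmax prediction is equivalent to maximizing its entropy, pushing the static prediction toward a uniform distribution.

import torch.nn.functional as F

def negative_entropy(logits_static):
    """logits_static: (B, num_classes) predictions for the static clip."""
    log_probs = F.log_softmax(logits_static, dim=1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=1).mean()  # mean entropy over the batch
    return -entropy  # minimizing this maximizes prediction uncertainty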

Gradient Penalty Loss

Even with the entropy loss, we find that the adversarial training remains unstable, with large fluctuations in performance during intermittent validation steps. To further stabilize training and prevent drastic weight updates from static inputs, we introduce a gradient penalty loss that minimizes the gradient norm computed only with respect to static clips.
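The sketch below shows one common instantiation of such a penalty, computed with respect to the static input and built with create_graph=True so the penalty itself can be backpropagated; the paper's exact formulation may differ, so treat this as an assumption-laden illustration.

import torch
import torch.nn.functional as F

def static_gradient_penalty(encoder, static_clip, labels):
    # Penalize the norm of the gradient of the static-clip loss w.r.t. the static input.
    static_clip = static_clip.detach().requires_grad_(True)
    logits_static = encoder(static_clip)
    ce_static = F.cross_entropy(logits_static, labels)
    grads, = torch.autograd.grad(ce_static, static_clip, create_graph=True)
    return grads.flatten(1).norm(dim=1).mean()  # mean per-sample L2 norm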

Figure 2 below highlights the efficacy of the gradient penalty objective in stabilizing training and improving performance. It prevents the model from taking large steps in opposing directions as the static adversarial and temporal cross-entropy objectives compete, leading to smoother learning and better final performance.

Figure 2: Training curves showing the stabilizing effect of the gradient penalty loss. Without the penalty (red), training exhibits high variance and lower overall performance. Adding the gradient penalty (blue) results in smoother training and better final performance.
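For reference, here is a hedged sketch of how the three objectives might be combined in a single training step, reusing the helper functions from the snippets above; the loss weights are illustrative placeholders rather than the paper's tuned values.

def training_step(encoder, optimizer, clip, labels,
                  lambda_adv=1.0, lambda_ent=1.0, lambda_gp=1.0):
    optimizer.zero_grad()
    ce_motion, adv_loss, logits_static = static_adversarial_losses(encoder, clip, labels)

    # Re-derive a static clip for the gradient penalty (a real implementation
    # would reuse the one sampled above rather than rebuilding it from frame 0).
    T = clip.shape[2]
    static_clip = clip[:, :, :1].expand(-1, -1, T, -1, -1).contiguous()

    loss = (ce_motion
            + lambda_adv * adv_loss
            + lambda_ent * negative_entropy(logits_static)
            + lambda_gp * static_gradient_penalty(encoder, static_clip, labels))
    loss.backward()
    optimizer.step()
    return loss.item()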

Results

Below are quantitative results comparing ALBAR against existing debiasing techniques on both background and foreground debiasing across all HMDB51-based protocols. ALBAR sets a strong state of the art, with particularly large gains in foreground debiasing, demonstrating the efficacy of the proposed static adversarial approach.

Table 1: Comparison of ALBAR with existing debiasing approaches on HMDB51 dataset. Results show significant improvements in both background and foreground debiasing performance.

UCF101 Updated Protocol

StillMix proposed the SCUBA and SCUFO datasets and metrics to evaluate both background and foreground bias in video action recognition models. These protocols require masks to extract the foreground from the original clips. However, the masks used for the UCF101 variant are bounding boxes, so surrounding background information is carried into the bias evaluation videos, as seen in Figure 3(b). This makes the protocol insufficient for evaluating debiasing on this dataset: a classifier reliant on the background can still exploit the leaked information to score highly. To mitigate this effect, we use a flexible video object segmentation model, SAMTrack, to segment the actors (subjects) in each video. The actors are initially grounded using the same bounding boxes, and each testing video is manually checked for accurate segmentation. We then follow the StillMix dataset creation protocol to build SCUBA and SCUFO variations with these new masks. The improved benchmark no longer leaks background information as in Figure 3(b), instead tightly bounding the human subject as seen in Figure 3(c). Table 2 reports results on our newly created benchmark.

Figure 3: Comparison of (a) original video, (b) existing bounding box-based foreground extraction, and (c) our improved SAMTrack segmentation-based foreground extraction on UCF101 videos.
Table 2: Performance comparison on our improved UCF101 benchmark using SAMTrack segmentation masks.
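As a simplified illustration of the mask-based foreground extraction described above (not the exact StillMix/SAMTrack pipeline), per-frame boolean actor masks can be composited onto a replacement background so that no pixels outside the segmented subject leak into the evaluation clip.

import numpy as np

def composite_foreground(frames, masks, background):
    """frames: (T, H, W, 3) video; masks: (T, H, W) boolean actor masks;
    background: (H, W, 3) replacement background image."""
    masks = masks[..., None]  # broadcast over the channel dimension
    return np.where(masks, frames, background[None]).astype(frames.dtype)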

Conclusion

We propose ALBAR, a novel label-free adversarial training framework for efficient background and foreground debiasing of video action recognition models. The framework eliminates the need for direct knowledge of bias attributes or components such as an additional critic model, instead using the negative cross-entropy of a motion-free clip passed through the same model as the adversarial signal. To ensure stable training, we incorporate static clip entropy maximization and gradient penalty objectives. We thoroughly validate our approach across a comprehensive suite of bias evaluation protocols, demonstrating its effectiveness and generalization across multiple datasets. Moreover, ALBAR can be seamlessly combined with existing debiasing augmentations to achieve performance that significantly surpasses the current state of the art. We hope our work contributes to the development of fair, unbiased, and trustworthy video understanding models.

For more technical details and results, please check out our main paper. Thank you!

BibTeX


@inproceedings{fioresi2025albar,
  title={ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition},
  author={Fioresi, Joseph and Dave, Ishan Rajendrakumar and Shah, Mubarak},
  booktitle={Proceedings of the International Conference on Learning Representations},
  pages={13598--13609},
  year={2025}
}

Acknowledgement

This work was supported in part by the National Science Foundation (NSF) and Center for Smart Streetscapes (CS3) under NSF Cooperative Agreement No. EEC-2133516.