Bias in machine learning models can lead to unfair decision-making, and while it has been well studied in the image and text domains, it remains underexplored in action recognition. Action recognition models often suffer from background bias (i.e., inferring actions based on background cues) and foreground bias (i.e., relying on subject appearance), which can be detrimental to real-life applications such as autonomous vehicles or assisted living monitoring. While prior approaches have mainly focused on mitigating background bias using specialized augmentations, we thoroughly study both biases.
In this paper, we propose ALBAR, a novel adversarial training method that mitigates foreground and background biases without requiring specialized knowledge of the bias attributes. Our framework applies an adversarial cross-entropy loss to a sampled static clip (where all frames are identical) and pushes its class probabilities toward uniform via a proposed entropy maximization loss. Additionally, we introduce a gradient penalty loss to regularize the debiasing process. We evaluate our method on established background and foreground bias protocols, setting a new state-of-the-art and improving combined debiasing performance by over 12% on HMDB51. Furthermore, we identify an issue of background leakage in the existing UCF101 bias evaluation protocol: it provides a shortcut for predicting actions and therefore does not accurately measure a model's debiasing capability. We address this issue by proposing more fine-grained segmentation boundaries for the actor, where our method also outperforms existing approaches.
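The components above (supervised loss on the motion clip, adversarial cross-entropy plus entropy maximization on the static clip, and a gradient penalty) can be sketched as a single training objective. This is an illustrative sketch only, not the paper's exact implementation; the function name `albar_loss` and the weights `lambda_adv`, `lambda_ent`, `lambda_gp` are assumptions:

```python
import torch
import torch.nn.functional as F

def albar_loss(model, clip, static_clip, labels,
               lambda_adv=1.0, lambda_ent=1.0, lambda_gp=1.0):
    """clip: (B, T, C, H, W) video; static_clip: one frame repeated T times.

    Hypothetical sketch of an ALBAR-style objective; weights are placeholders.
    """
    # Standard supervised cross-entropy on the normal (motion) clip.
    ce = F.cross_entropy(model(clip), labels)

    # Adversarial term: the motionless clip should NOT be classifiable,
    # so its cross-entropy is subtracted (ascent on the static-clip loss).
    static_clip = static_clip.clone().requires_grad_(True)
    static_logits = model(static_clip)
    adv_ce = F.cross_entropy(static_logits, labels)

    # Entropy maximization: push static-clip predictions toward uniform.
    probs = F.softmax(static_logits, dim=1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=1).mean()

    # Gradient penalty: penalize the input-gradient magnitude of the static
    # branch to keep the adversarial signal from destabilizing training.
    grads = torch.autograd.grad(static_logits.sum(), static_clip,
                                create_graph=True)[0]
    gp = grads.pow(2).flatten(1).sum(dim=1).mean()

    return ce - lambda_adv * adv_ce - lambda_ent * entropy + lambda_gp * gp
```

A static clip can be built by repeating the first frame, e.g. `clip[:, :1].expand(-1, T, -1, -1, -1)`, so the adversarial branch sees appearance but no motion.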
Below are quantitative results highlighting the improvement over existing debiasing techniques in both background and foreground debiasing across all protocols based on the HMDB51 dataset. ALBAR sets a strong state-of-the-art, achieving impressive foreground debiasing performance and demonstrating the efficacy of the proposed static adversarial approach.
StillMix proposed the SCUBA and SCUFO datasets and metrics to evaluate both background and foreground bias in video action recognition models. These protocols require masks to extract the foreground from the original clips. However, the masks used for the UCF101 variant are bounding boxes, so surrounding background information is carried into the bias evaluation videos, as seen in Figure 3. This is insufficient for evaluating debiasing on this dataset: a classifier reliant on the background can still exploit the leaked information to score high on the protocol. To mitigate this effect, we use a flexible video object segmentation model, SAMTrack, to segment the actors (subjects) in each video. The actors are initially grounded using the same bounding boxes, and each testing video is manually checked for accurate segmentation. We then follow the StillMix dataset creation protocol to build SCUBA and SCUFO variants with these new masks. The improved benchmark no longer includes background information such as in Figure 3(b); the masks tightly bound the human subject, as seen in Figure 3(c). Table 2 reports results on our newly created benchmark.
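The key difference between the bounding-box and segmentation protocols is the compositing step: with a tight segmentation mask, no original background pixels survive in the evaluation frame. A minimal sketch of this compositing, assuming per-frame boolean actor masks (the helper name `composite_actor` is hypothetical; the actual SCUBA/SCUFO generation follows the StillMix protocol):

```python
import numpy as np

def composite_actor(frame: np.ndarray, mask: np.ndarray,
                    background: np.ndarray) -> np.ndarray:
    """Paste the masked actor from `frame` onto `background`.

    frame, background: (H, W, 3) uint8 images; mask: (H, W) bool, True = actor.
    A tight segmentation mask (unlike a bounding box) ensures no original
    background pixels leak into the bias-evaluation clip.
    """
    out = background.copy()
    out[mask] = frame[mask]  # boolean indexing copies only actor pixels
    return out
```

With a bounding-box mask, every pixel inside the box (including true background) would be copied, which is exactly the leakage the refined benchmark removes.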
We propose ALBAR, a novel label-free adversarial training framework for efficient background and foreground debiasing of video action recognition models. The framework eliminates the need for direct knowledge of bias attributes or an additional critic model, instead using the negative cross-entropy of a motionless clip passed through the same model as the adversarial signal. To ensure stable training, we incorporate static-clip entropy maximization and gradient penalty objectives. We thoroughly validate our approach across a comprehensive suite of bias evaluation protocols, demonstrating its effectiveness and generalization across multiple datasets. Moreover, ALBAR can be seamlessly combined with existing debiasing augmentations to achieve performance that significantly surpasses the current state-of-the-art. We hope our work contributes to the development of fair, unbiased, and trustworthy video understanding models.
For more technical details and results, please see the attached main paper. Thank you!
@inproceedings{fioresi2025albar,
title={ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition},
author={Fioresi, Joseph and Dave, Ishan Rajendrakumar and Shah, Mubarak},
booktitle={Proceedings of the International Conference on Learning Representations},
pages={13598--13609},
year={2025}
}