We introduce a novel formulation of visual privacy preservation for video foundation models that operates entirely in the latent space. While spatio-temporal features learned by foundation models have deepened general understanding of video content, sharing or storing these extracted visual features for downstream tasks inadvertently reveals sensitive personal information such as skin color, gender, or clothing. Current privacy preservation methods focus on input-pixel-level anonymization, which requires retraining the entire utility video model and results in task-specific anonymization, making them unsuitable for recent video foundation models. To address these challenges, we introduce a lightweight Anonymizing Adapter Module (AAM) that removes private information from video features while retaining general task utility. AAM can be applied in a plug-and-play fashion to frozen video encoders, minimizing the computational burden of finetuning and re-extracting features. Our framework employs three newly designed training objectives: (1) a clip-level self-supervised privacy objective to reduce mutual information between static clips, (2) a co-training objective to retain utility across seen tasks, and (3) a latent consistency loss for generalization on unseen tasks. Our extensive evaluations demonstrate a significant 35% reduction in privacy leakage while maintaining near-baseline utility performance across various downstream tasks: Action Recognition (Kinetics400, UCF101, HMDB51), Temporal Action Detection (THUMOS14), and Anomaly Detection (UCF-Crime). We also provide an analysis of anonymization for sensitive temporal attribute recognition. Additionally, we propose new protocols for assessing gender bias in action recognition models, showing that our method effectively mitigates such biases and promotes more equitable video understanding.
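The three training objectives above can be sketched as a single combined loss. The snippet below is a minimal, illustrative stand-in only: the adapter is reduced to a ReLU projection, the privacy term to a cosine-similarity penalty between two clips of the same video, and the function names, loss weights, and dimensions are all our own assumptions, not the paper's actual architecture or loss definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 4, 8, 3  # batch size, latent feature dim, number of action classes

def anonymizing_adapter(z, W, b):
    # Lightweight adapter applied on top of frozen-encoder features z
    # (a simple ReLU projection as a stand-in for the real AAM).
    return np.maximum(z @ W + b, 0.0)

def privacy_loss(za1, za2):
    # Self-supervised privacy term (stand-in): penalize similarity between
    # anonymized features of two clips drawn from the same video.
    num = np.sum(za1 * za2, axis=-1)
    den = np.linalg.norm(za1, axis=-1) * np.linalg.norm(za2, axis=-1) + 1e-8
    return float(np.mean(num / den))

def utility_loss(za, head, labels):
    # Co-training term: cross-entropy of a task head on anonymized features.
    logits = za @ head
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    logp = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
    return float(-np.mean(logp[np.arange(len(labels)), labels]))

def consistency_loss(za, z):
    # Latent consistency term (stand-in): keep anonymized features near
    # the originals so unseen downstream tasks still transfer.
    return float(np.mean((za - z) ** 2))

# Toy forward pass on random features.
z1, z2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
W, b = rng.normal(size=(d, d)) / np.sqrt(d), np.zeros(d)
head = rng.normal(size=(d, k)) / np.sqrt(d)
labels = rng.integers(0, k, size=n)

za1, za2 = anonymizing_adapter(z1, W, b), anonymizing_adapter(z2, W, b)
lam_u, lam_c = 1.0, 0.1  # illustrative weights, not tuned values from the paper
total = (privacy_loss(za1, za2)
         + lam_u * utility_loss(za1, head, labels)
         + lam_c * consistency_loss(za1, z1))
```

In the actual framework only the adapter would receive gradients from `total`, while the encoder stays frozen.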
Below are quantitative results covering privacy protocols and a variety of downstream tasks. We observe in Table 1 that our approach consistently generalizes well across all tasks, closely maintaining the performance of the non-anonymized videos. In contrast, previous methods struggle to preserve performance uniformly across tasks, as is evident in the temporal action detection results. Experiments with large video foundation models show similar performance trends, confirming the efficacy and scalability of SPLAVU.
We propose SPLAVU, a privacy-preserving method built on a novel formulation of latent space anonymization. Our method is the first to enable generalized anonymization, achieving strong performance across various downstream video understanding tasks, including action recognition, anomaly detection, and temporal action detection. It employs a clip-level self-supervised privacy objective within the latent space, coupled with a latent consistency loss that preserves its generalization capability. Moreover, the latent formulation enables, for the first time, training an anonymizer with gradients from multiple downstream tasks, which is impractical for pixel-level anonymization. Furthermore, our novel protocols for assessing gender bias contribute to the development of more responsible and unbiased video understanding models.
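The multi-task property above, a single anonymizer updated by gradients from several downstream heads operating on latent features, can be sketched as follows. This is a toy illustration under heavy assumptions: the anonymizer is a linear map, the task heads are frozen linear regression heads with MSE losses, and all names and dimensions are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 16  # batch of clips, latent feature dim

# Frozen-encoder features for a batch of video clips.
z = rng.normal(size=(n, d))
# Linear anonymizer: the only trainable module in this sketch.
A = np.eye(d)
# Two frozen downstream task heads (e.g. action recognition, anomaly
# detection), reduced here to linear heads trained with MSE.
heads = [rng.normal(size=(d, 5)) / np.sqrt(d),
         rng.normal(size=(d, 2)) / np.sqrt(d)]
targets = [rng.normal(size=(n, 5)), rng.normal(size=(n, 2))]

def multi_task_loss(A):
    # Sum of per-task MSE losses on the anonymized latents z @ A.
    za = z @ A
    return sum(float(np.mean((za @ W - t) ** 2))
               for W, t in zip(heads, targets))

initial_loss = multi_task_loss(A)
lr = 0.05
for _ in range(100):
    za = z @ A
    grad = np.zeros_like(A)
    for W, t in zip(heads, targets):
        err = za @ W - t  # per-task residual
        # d(MSE)/dA for this head, accumulated across tasks.
        grad += 2.0 * z.T @ (err @ W.T) / (n * t.shape[1])
    A -= lr * grad  # one anonymizer updated by all task gradients at once
final_loss = multi_task_loss(A)
```

A pixel-level anonymizer would need every utility model re-run end-to-end on regenerated video to obtain these gradients; in the latent formulation the heads operate directly on the features the anonymizer produces, which is what makes the joint update cheap.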
For more technical details and results, please see the attached main paper. Thank you!
@inproceedings{fioresi2025privacy,
title={Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding},
author={Fioresi, Joseph and Dave, Ishan Rajendrakumar and Shah, Mubarak},
booktitle={arXiv},
year={2025}
}