Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding
[ICLR 2026]

Institute of Artificial Intelligence, University of Central Florida

Abstract

We introduce a novel formulation of visual privacy preservation for video foundation models that operates entirely in the latent space. While spatio-temporal features learned by foundation models have deepened general understanding of video content, sharing or storing these extracted visual features for downstream tasks inadvertently reveals sensitive personal information such as skin color, gender, or clothing. Current privacy preservation methods focus on input-pixel-level anonymization, which requires retraining the entire utility video model and results in task-specific anonymization, making them unsuitable for recent video foundation models. To address these challenges, we introduce a lightweight Anonymizing Adapter Module (AAM) that removes private information from video features while retaining general task utility. AAM can be applied in a plug-and-play fashion to frozen video encoders, minimizing the computational burden of finetuning and re-extracting features. Our framework employs three newly designed training objectives: (1) a clip-level self-supervised privacy objective to reduce mutual information between static clips, (2) a co-training objective to retain utility across seen tasks, and (3) a latent consistency loss for generalization to unseen tasks. Our extensive evaluations demonstrate a significant 35% reduction in privacy leakage while maintaining near-baseline utility performance across various downstream tasks: Action Recognition (Kinetics400, UCF101, HMDB51), Temporal Action Detection (THUMOS14), and Anomaly Detection (UCF-Crime). We also provide an analysis of anonymization for sensitive temporal attribute recognition. Additionally, we propose new protocols for assessing gender bias in action recognition models, showing that our method effectively mitigates such biases and promotes more equitable video understanding.

Latent Space Anonymization

Teaser Figure
Figure 1: Our proposed latent anonymization setup (red) utilizes large pretrained video encoders, applying a lightweight anonymizer that maintains performance on multiple video understanding tasks while strongly reducing performance on private attribute prediction tasks (right).

Paper Details

Method Diagram

Figure 2: Workflow illustrating the SPLAVU training process. The process begins with a video clip, from which two random frames are sampled to create static clips. All clips are passed through the frozen video encoder to extract latent features, then further processed by our Anonymization Adapter Module (AAM). The temporal clip features are used for the latent consistency loss and given to the set of task-specific classifier heads. The two static clip features are utilized in the self-supervised mutual information minimization objective. Gradients from all losses are back-propagated through AAM.
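The static-clip construction described in the caption can be sketched as follows. This is an illustrative implementation under assumed conventions (a `(T, H, W, C)` video tensor and a `make_static_clips` helper name are our choices, not the authors' code):

```python
import numpy as np

def make_static_clips(video, num_frames=2, rng=None):
    """Sample `num_frames` random frames from a clip and repeat each
    along the temporal axis to form static (motion-free) clips.

    video: array of shape (T, H, W, C)
    returns: list of arrays, each of shape (T, H, W, C)
    """
    rng = rng or np.random.default_rng()
    T = video.shape[0]
    idx = rng.choice(T, size=num_frames, replace=False)
    # Each static clip repeats one frame T times, so it carries the
    # clip's spatial content but no temporal information.
    return [np.repeat(video[i:i + 1], T, axis=0) for i in idx]
```

Because a static clip contains spatial appearance but no motion, features extracted from it isolate exactly the information the privacy objective seeks to suppress.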

Anonymization with Multitask Co-training

To retain the action understanding capabilities of the pretrained model, we employ a co-training framework where multiple tasks collaborate to optimize performance. The action classifier head is trained using the standard cross-entropy loss. Our latent formulation enables, for the first time, anonymization training using gradients from alternate downstream utility tasks, namely Temporal Action Detection (TAD) and Anomaly Detection (AD). As such, we integrate training objectives from state-of-the-art approaches in TAD and AD.
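The co-training objective above can be sketched as a weighted sum of per-task losses, so gradients from every utility head flow back into AAM. The head names and weights below are illustrative assumptions; the paper uses task-specific losses from state-of-the-art TAD and AD approaches rather than plain cross-entropy for those heads:

```python
import numpy as np

def cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single example."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def cotrain_loss(task_outputs, task_labels, weights):
    """Weighted sum of per-task losses. Each task head (e.g. action
    recognition, TAD, AD) contributes gradients that are
    back-propagated through the shared Anonymization Adapter Module."""
    return sum(weights[t] * cross_entropy(task_outputs[t], task_labels[t])
               for t in task_outputs)
```

In practice each task would use its own loss formulation; the key point is that the anonymizer is optimized against all of them jointly.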

Regularization via Latent Consistency

Early experiments with the privacy and utility losses indicated that the anonymization process tends to overfit to the proxy-utility tasks used in training, compromising its effectiveness on unseen tasks. Consequently, the primary motivation behind introducing our latent consistency objective is to ensure that the anonymization learned by the model remains generalizable and is not biased toward the specific utility task(s) it is trained on. This can be accomplished by regularizing the anonymization to preserve the general latent structure of the utility encoder. To this end, we propose a latent consistency loss that encourages the model to preserve important latent features while still achieving privacy preservation.
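One plausible instantiation of this regularizer is a simple distance between the anonymized features and the frozen encoder's original features. The mean-squared-error form below is an assumption for illustration; the paper's exact distance function may differ:

```python
import numpy as np

def latent_consistency_loss(anon_feat, orig_feat):
    """Penalize deviation of AAM's output from the frozen encoder's
    features, regularizing the anonymizer toward the general latent
    structure of the utility encoder (MSE form assumed here)."""
    return np.mean((anon_feat - orig_feat) ** 2)
```

Minimizing this term alongside the privacy objective keeps the anonymized latents close to the encoder's general-purpose representation, which is what allows transfer to utility tasks never seen during anonymization training.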

Clip-Level Self-Supervised Privacy Objective

Our clip-level self-supervised budget privacy objective is the key component enabling anonymization without private attribute labels. The intuition is that two static clips drawn from the same video share substantial mutual information; minimizing the similarity between their features therefore destroys the shared spatial information. A crucial difference setting SPLAVU apart from prior work is that the anonymizer operates across the temporal dimension on 3D clip features rather than through a 2D U-Net. Combined with the utility task losses, the anonymization model thus learns to remove spatial information while retaining only the temporal information needed to solve the utility tasks.
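A minimal sketch of this objective, assuming the budget is realized as a penalty on the cosine similarity between the two static-clip feature vectors (the squared form, which drives the features toward orthogonality, is our assumption):

```python
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    """Cosine similarity between two flattened feature vectors."""
    a = a / (np.linalg.norm(a) + eps)
    b = b / (np.linalg.norm(b) + eps)
    return float(np.dot(a, b))

def privacy_budget_loss(static_feat_a, static_feat_b):
    """Self-supervised privacy budget: driving the similarity of two
    static-clip features toward zero suppresses the spatial information
    they share, with no private-attribute labels required."""
    return cosine_sim(static_feat_a, static_feat_b) ** 2
```

Since the two static clips differ only in which frame was sampled, whatever their features agree on is (mostly) static spatial content, which is precisely what this loss penalizes.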

Results

Below are quantitative results covering the privacy protocols and a variety of downstream tasks. Table 1 shows that our approach consistently generalizes across all tasks, closely matching the performance of the non-anonymized videos. In contrast, previous methods struggle to preserve performance uniformly across tasks, as is evident in the temporal action detection results. Experiments with large video foundation models show similar trends, confirming the efficacy and scalability of SPLAVU.

Table 1: Performance of anonymization methods across a downstream task evaluation suite. Methods in gray train using private attribute labels. Our method achieves a strong improvement in privacy-preservation with minimal reduction in task performance.

Task Ablation

Our key ablation in Table 2 demonstrates the effect of training our anonymizer without specific tasks. Notably, the highlighted cells show strong generalization to unseen tasks, with only a minor drop in performance compared to training on them directly. For example, row (c) shows anonymization training with only action detection, yet performance on action recognition and anomaly detection remains within 1.3% of the non-anonymized scores. Across the board, thanks to the latent consistency loss, performance does not depend on having seen a given utility task during training, demonstrating that SPLAVU generalizes effectively to unseen tasks.

Table 2: Ablation on tasks seen during anonymization training. A checkmark (✓) labels seen tasks; an x-mark (✗) and highlighted cells indicate tasks unseen during training. Performance generalizes to unseen tasks, while training on a task directly further improves results.

Bias Evaluation

The first row of Table 3 shows the performance difference between gender presentation subclasses in the NTU-Bias-F protocol, where the action brush_hair is chosen as the gendered shortcut action label. The baseline performance disparity between perceived gender subclasses is an unacceptably large 9.42%. Applying latent anonymization reduces this gap by a relative 42.3%. The second row reports results for the complementary NTU-Bias-M protocol (also using the brush_hair shortcut). Interestingly, the baseline subclass disparity is smaller than that of NTU-Bias-F (5.00%), but our method still reduces this unfair split while improving overall performance. To confirm that these observations hold in a real-world setting, the final row of Table 3 reports performance on the TSH protocol. Notably, our method improves both classifier quality and fairness. In this realistic scenario with a naturally occurring bias, SPLAVU reduces the gap between perceived gender subclasses by a relative 39.5%. Please refer to the paper for more details on protocol creation.

Table 3: Bias evaluation across gendered groups; anonymization reduces subclass accuracy gaps.

Conclusion

We propose SPLAVU, a privacy-preserving method built on a novel formulation of latent-space anonymization. Our method is the first to enable generalized anonymization, achieving strong performance across various downstream video understanding tasks, including action recognition, anomaly detection, and temporal action detection. It employs a clip-level self-supervised privacy budget within the latent space, coupled with a latent consistency loss that maintains its generalization capability. Moreover, the latent formulation enables, for the first time, training an anonymizer with gradients from multiple downstream tasks, which is impractical for pixel-level anonymization. Finally, our novel protocols for assessing gender bias contribute to the development of more responsible and unbiased video understanding models.

For more technical details and results, please see the attached main paper. Thank you!

BibTeX


@inproceedings{fioresi2025privacy,
  title={Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding},
  author={Fioresi, Joseph and Dave, Ishan Rajendrakumar and Shah, Mubarak},
  booktitle={arXiv},
  year={2025}
}

Acknowledgement

This work was supported in part by the National Science Foundation (NSF) and Center for Smart Streetscapes (CS3) under NSF Cooperative Agreement No. EEC-2133516.