MART

Masked Affective RepresenTation Learning via Masked Temporal Distribution Distillation

CVPR 2024

Zhicheng Zhang1,2, Pancheng Zhao1,2, Eunil Park3, Jufeng Yang1,2
1VCIP & TMCC & DISSec, College of Computer Science, Nankai University
2Nankai International Advanced Research Institute (SHENZHEN·FUTIAN)
3Data eXperience Laboratory, College of Computing, Sungkyunkwan University

Overview

TL;DR: We present MART, an MAE-style self-supervised method for learning robust affective representations of videos that exploits the sentiment-complementary and emotion-intrinsic relationships among temporal segments.

Limited training data is a long-standing problem for video emotion analysis (VEA). Existing works leverage the power of large-scale image datasets for transfer learning but fail to capture the temporal correlation of affective cues in video. Inspired by psychology research and empirical theory, we verify that the degree of emotion may vary across different segments of a video, and thereby introduce the sentiment-complementary and emotion-intrinsic relationships among temporal segments. Accordingly, we propose MART, an MAE-style method for learning robust affective representations of videos via masking. First, we extract affective cues from a lexicon and verify each extracted cue by computing its matching score against the video content; a hierarchical verification strategy, operating at the sentiment and emotion levels, identifies the matched cues along the temporal dimension. Then, with the verified cues, we propose masked affective modeling to recover the temporal emotion distribution. We present temporal affective complementary learning, which pulls together the complementary parts and pushes apart the intrinsic parts of the masked multimodal features, to learn robust affective representations. Under the affective-complementarity constraint, we leverage cross-modal attention among features to mask the video and recover the degree of emotion across segments. Extensive experiments on five benchmark datasets demonstrate the superiority of our method on video sentiment analysis, video emotion recognition, multimodal sentiment analysis, and multimodal emotion recognition.
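To make the hierarchical verification concrete, below is a minimal PyTorch sketch of the two-level (sentiment, then emotion) check described above. It is our illustrative reconstruction, not the released code: the function name verify_cues, the polarity labels, and the softmax temperature are assumptions; we only presume pre-extracted, L2-normalized embeddings for each video segment and each lexicon cue.

# Hypothetical sketch of hierarchical cue verification (not the paper's code).
import torch

def verify_cues(seg_feats, cue_feats, cue_sentiments, seg_sentiments, tau=0.07):
    """Keep, per segment, the lexicon cue that (1) shares the segment's
    sentiment polarity and (2) has the highest emotion-level matching score.

    seg_feats:      (T, D) segment embeddings, L2-normalized
    cue_feats:      (C, D) lexicon-cue embeddings, L2-normalized
    cue_sentiments: (C,) polarity label per cue (+1 / -1)
    seg_sentiments: (T,) polarity label per segment (+1 / -1)
    """
    # Emotion-level matching score: cosine similarity, softmaxed over cues.
    sim = seg_feats @ cue_feats.T                    # (T, C)
    score = torch.softmax(sim / tau, dim=-1)         # (T, C)

    # Sentiment-level gate: a cue is eligible for a segment only
    # when their polarities agree.
    polarity_ok = seg_sentiments[:, None] == cue_sentiments[None, :]  # (T, C)

    # Verified cue per segment = best-scoring polarity-consistent cue.
    score = score.masked_fill(~polarity_ok, float("-inf"))
    return score.argmax(dim=-1)                      # (T,) index of matched cue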
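The pull/push objective of temporal affective complementary learning can likewise be pictured as an InfoNCE-style contrastive loss in which complementary segment pairs act as positives and intrinsic pairs as negatives. This pairing rule and the name complementary_loss are our assumptions for illustration, not the paper's exact formulation.

# Hedged sketch: "pull complementary, push intrinsic" as a contrastive loss.
import torch
import torch.nn.functional as F

def complementary_loss(feats, comp_mask, tau=0.1):
    """feats:     (T, D) masked multimodal segment features
       comp_mask: (T, T) bool, True where segments i and j are
                  sentiment-complementary (positives); off-diagonal
                  False entries act as intrinsic negatives."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.T / tau                      # (T, T)
    sim.fill_diagonal_(float("-inf"))                # ignore self-pairs
    log_p = F.log_softmax(sim, dim=-1)
    # Average log-probability over each segment's complementary partners.
    pos = (log_p * comp_mask).sum(-1) / comp_mask.sum(-1).clamp(min=1)
    return -pos.mean()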
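Finally, a sketch of how cross-modal attention could guide the masking step: attention from a verified textual cue scores the video segments, and the most salient ones are hidden so the model must recover their degree of emotion. The 50% mask ratio and the helper name attention_guided_mask are hypothetical choices for this example.

# Hypothetical attention-guided masking sketch (assumed details, see above).
import torch

def attention_guided_mask(video_feats, cue_feat, mask_ratio=0.5):
    """video_feats: (T, D); cue_feat: (D,). Returns a bool mask over the
    segments most attended by the cue, to be dropped before decoding."""
    attn = torch.softmax(video_feats @ cue_feat, dim=0)   # (T,) attention weights
    k = int(mask_ratio * video_feats.size(0))
    masked_idx = attn.topk(k).indices                     # most salient segments
    mask = torch.zeros(video_feats.size(0), dtype=torch.bool)
    mask[masked_idx] = True
    return mask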

BibTeX

@inproceedings{zhang2024MART,
  title={MART: Masked Affective RepresenTation Learning via Masked Temporal Distribution Distillation},
  author={Zhang, Zhicheng and Zhao, Pancheng and Park, Eunil and Yang, Jufeng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}