MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding

ICML 2025 Spotlight

Zhicheng Zhang1,2, Wuyou Xia1, Cheni Zhao1, Yan Zhou3, Xiaoqiang Liu3, Yongjie Zhu3, Wenyu Qin3, Pengfei Wan3, Di Zhang3, Jufeng Yang1,2
1VCIP & TMCC & DISSec, College of Computer Science, Nankai University
2Pengcheng Laboratory
3Kuaishou Technology

Overview

TL;DR: We i) identify attention deficit disorder as a critical barrier to fine-grained content understanding in MLLMs; ii) introduce a modular duplex attention mechanism to mitigate modality bias and correct attention scores; and iii) develop MODA-based MLLMs that enable fine-grained multimodal understanding across perception, cognition, and emotion tasks.

Multimodal large language models (MLLMs) have recently demonstrated a strong capacity for integrating data across multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning while paying less attention to the multimodal tokens mixed through attention, posing challenges in high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), that simultaneously conducts inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped into duplex modality spaces via basis vectors, enabling interaction between the visual and language modalities. The correctness of attention scores is then ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks.
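To make the correct-after-align idea concrete, below is a minimal PyTorch sketch of one duplex attention layer. It is an illustrative reading of the abstract, not the released implementation: the class name ModularDuplexAttention, the per-modality basis matrices standing in for the duplex modality spaces, and the learned modality-pair bias standing in for the adaptive masked attention are all our assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModularDuplexAttention(nn.Module):
    """Sketch of correct-after-align attention (hypothetical names)."""

    def __init__(self, dim: int, num_heads: int = 8, num_basis: int = 16):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Alignment phase: one learned basis per modality
        # (index 0 = vision, 1 = language), defining the duplex spaces.
        self.basis = nn.Parameter(torch.randn(2, num_basis, dim) * dim ** -0.5)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Correction phase: a learned logit per (query-modality,
        # key-modality) pair acts as an adaptive soft mask on the scores.
        self.mask_logit = nn.Parameter(torch.zeros(2, 2))

    def forward(self, x, modality_ids):
        # x: (batch, seq, dim); modality_ids: (seq,) with values in {0, 1}.
        B, N, D = x.shape
        # Align: express each token in its modality's basis, then
        # reconstruct it, mapping tokens into the duplex spaces.
        basis = self.basis[modality_ids]                # (N, num_basis, D)
        coeff = torch.einsum('bnd,nkd->bnk', x, basis)  # basis coefficients
        x = torch.einsum('bnk,nkd->bnd', coeff, basis)  # aligned tokens
        # Inner-modal and inter-modal token mixing via multi-head attention.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        # Correct: add the modality-dependent bias so every
        # (query-modality, key-modality) pair gets its own mask pattern.
        bias = self.mask_logit[modality_ids[:, None], modality_ids[None, :]]
        attn = F.softmax(scores + bias, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)

# Example: 4 visual tokens followed by 6 language tokens.
layer = ModularDuplexAttention(dim=256)
tokens = torch.randn(2, 10, 256)
ids = torch.tensor([0] * 4 + [1] * 6)
print(layer(tokens, ids).shape)  # torch.Size([2, 10, 256])

In this sketch the learned modality-pair logits play the role of the customizable masking patterns described above, chosen because a 2x2 additive bias is the simplest soft mask that distinguishes inner-modal from cross-modal attention.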

BibTeX

@inproceedings{zhang2025MODA,
  title={MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding},
  author={Zhicheng Zhang and Wuyou Xia and Cheni Zhao and Yan Zhou and Xiaoqiang Liu and Yongjie Zhu and Wenyu Qin and Pengfei Wan and Di Zhang and Jufeng Yang},
  booktitle={Proceedings of the 42nd International Conference on Machine Learning},
  year={2025}
}