What is it about?
Emotion recognition plays a vital role in enhancing human-computer interaction. In this study, we tackle the MER-SEMI challenge of the MER2025 competition by proposing a novel multimodal emotion recognition framework. To address the issue of data scarcity, we leverage large-scale pre-trained models to extract informative features from the visual, audio, and textual modalities. For the visual modality, we design a dual-branch visual encoder that captures both global frame-level features and localized facial representations. For the textual modality, we introduce a context-enriched method that employs large language models (LLMs) to enrich the emotional cues in the input text. To integrate these multimodal features effectively, we propose a fusion strategy with two key components: self-attention mechanisms for dynamic modality weighting, and residual connections to preserve the original representations. Beyond the architectural design, we further refine noisy labels in the training set with a multi-source labeling strategy. Our approach achieves a substantial improvement over the official baseline on the MER2025-SEMI dataset, attaining a weighted F-score of 87.49% compared to 78.63%, validating the effectiveness of the proposed framework.
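The summary above describes the fusion idea only in words. As a rough illustration, the minimal PyTorch sketch below shows one way self-attention-based modality weighting with a residual connection could be wired up: per-modality feature vectors are stacked as a short token sequence, self-attention re-weights the modalities, and a residual connection preserves the original representations. The feature dimension, number of heads, six emotion classes, and mean-pooling over modality tokens are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumptions noted above), not the paper's exact configuration.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, visual, audio, text):
        # Each input: (batch, dim) features from the per-modality encoders.
        tokens = torch.stack([visual, audio, text], dim=1)   # (batch, 3, dim)
        attended, _ = self.attn(tokens, tokens, tokens)      # dynamic modality weighting
        fused = self.norm(tokens + attended)                 # residual keeps original features
        return self.classifier(fused.mean(dim=1))            # pool modalities, then classify


# Example usage with random features
if __name__ == "__main__":
    b, d = 8, 256
    model = AttentionFusion(dim=d)
    logits = model(torch.randn(b, d), torch.randn(b, d), torch.randn(b, d))
    print(logits.shape)  # torch.Size([8, 6])
```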
Why is it important?
This study proposes a multimodal emotion recognition framework to address the MER2025 challenge. Our contributions are threefold:
• To address the issue of data scarcity, we leverage appropriate pre-trained models as multimodal feature extractors. For the visual modality, we design a dual-branch visual encoder that captures both global frame-level features and localized facial representations. For the textual modality, we propose a context-enriched method that uses LLMs to enrich the emotional cues in the text inputs (a sketch of this step follows the list).
• To handle modality competition, we design a fusion strategy that dynamically weights the different modalities to ensure robust performance.
• Extensive experiments on the official dataset demonstrate a significant improvement over the baseline, achieving a weighted F-score of 87.49% compared to 78.63%.
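To give a rough idea of what the context-enrichment step could look like, the snippet below asks an LLM to make the emotional cues in a raw transcript explicit before the text is passed to the text encoder. The prompt wording, the OpenAI client, and the model name are illustrative assumptions; the paper's actual LLM and prompt are not given in this summary.

```python
# Hypothetical context-enrichment step; provider, model, and prompt are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def enrich_transcript(transcript: str) -> str:
    """Ask an LLM to rewrite a raw utterance so its emotional cues are explicit."""
    prompt = (
        "Rewrite the following utterance transcript, making the speaker's "
        "emotional cues explicit (tone, attitude, implied feeling) while "
        "preserving the original meaning:\n\n" + transcript
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name, not taken from the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return response.choices[0].message.content


# The enriched text would then be fed to the pre-trained text encoder
# in place of the raw transcript.
```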
Perspectives
This study presents a multimodal emotion recognition framework for the MER2025-SEMI challenge, leveraging pre-trained models and advanced fusion techniques to improve performance under limited labeled data. Our contributions include: a context-enriched method using LLMs to improve the emotional expressiveness of text features; a dual-branch visual encoder integrating global frame-level features and localized facial representations to strengthen the visual modality; a fusion strategy based on self-attention with residual connections to effectively integrate multimodal features; and a multi-source labeling strategy to correct noisy labels in the training set. Experimental results demonstrate superior performance on the MER2025-SEMI dataset, significantly outperforming the baseline. Future work will explore additional data augmentation and fusion strategies to further enhance the accuracy and robustness of the proposed emotion recognition framework.
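The multi-source labeling idea can be pictured as a simple agreement check across several label sources. The sketch below keeps (or overwrites) a label only when a majority of sources agree; the specific sources and the agreement threshold are assumptions for illustration, not the paper's exact refinement procedure.

```python
# Illustrative label-refinement sketch; the agreement rule is an assumption.
from collections import Counter


def refine_labels(label_sources: list[list[str]], min_agree: int = 2) -> list[str | None]:
    """label_sources: one list of labels per source, all of equal length.
    Returns a refined label per sample, or None if no majority is reached."""
    refined = []
    for votes in zip(*label_sources):
        label, count = Counter(votes).most_common(1)[0]
        refined.append(label if count >= min_agree else None)
    return refined


# Example: the original noisy labels plus predictions from two independent models
original = ["happy", "sad", "neutral"]
model_a  = ["happy", "angry", "worried"]
model_b  = ["happy", "angry", "sad"]
print(refine_labels([original, model_a, model_b]))
# ['happy', 'angry', None] -> the second sample's noisy label is corrected,
# the third has no majority and is left for manual review or discarded
```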
Pring Wong
Read the Original
This page is a summary of: ECMF: Enhanced Cross-Modal Fusion for Multimodal Emotion Recognition in MER-SEMI Challenge, October 2025, ACM (Association for Computing Machinery), DOI: 10.1145/3746270.3760225.