Multi-Modal Emotion Recognition Using Situation-Based Video Context Emotion Dataset

Authors

  • Guiping Lu, Key Laboratory of Short-Range Radio Equipment Testing and Evaluation, Ministry of Industry and Information Technology, and Terahertz Science Application Center (TSAC), Beijing Institute of Technology, Zhuhai, Guangdong, China
  • Honghua Liu, Key Laboratory of Short-Range Radio Equipment Testing and Evaluation, Ministry of Industry and Information Technology, and Terahertz Science Application Center (TSAC), Beijing Institute of Technology, Zhuhai, Guangdong, China
  • Kejun Wang, Key Laboratory of Short-Range Radio Equipment Testing and Evaluation, Ministry of Industry and Information Technology, and Terahertz Science Application Center (TSAC), Beijing Institute of Technology, Zhuhai, Guangdong, China
  • Weidong Hu, Key Laboratory of Short-Range Radio Equipment Testing and Evaluation, Ministry of Industry and Information Technology, and Terahertz Science Application Center (TSAC), Beijing Institute of Technology, Zhuhai, Guangdong, China
  • Wenliang Peng, Key Laboratory of Short-Range Radio Equipment Testing and Evaluation, Ministry of Industry and Information Technology, and Terahertz Science Application Center (TSAC), Beijing Institute of Technology, Zhuhai, Guangdong, China
  • Tao Yang, School of Intelligent Science and Engineering, Harbin Engineering University, Harbin, Heilongjiang, China
  • Shan Lu, BMW Brilliance Automotive Ltd., Shenyang, Liaoning, China

Keywords

Multi-modal fusion, emotion recognition, transfer learning, dataset, deep learning

Abstract

Current multi-modal emotion recognition techniques primarily rely on modalities such as facial expression, speech, text, and gesture. Existing methods capture emotion only at the current moment in an image or video, neglecting the influence of time and past experiences on human emotion. Expanding the temporal scope can provide additional cues for emotion recognition. To address this, we constructed the Situation-Based Video Context Emotion Dataset (SVCEmotion) in video form. Experiments show that both VGGish and BERT-base achieve good results on SVCEmotion. A comparison with other audio emotion recognition methods shows that VGGish is better suited to audio emotion feature extraction on SVCEmotion. Comparison experiments on textual descriptions demonstrate that the contextual descriptions introduced in SVCEmotion provide useful cues for emotion recognition over a wide temporal range, and that combining them with factual descriptions substantially improves recognition performance.
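To illustrate the kind of fusion the abstract describes, the following is a minimal sketch of a late-fusion classifier that combines a pre-extracted 128-dimensional VGGish audio embedding with BERT-base text features. It is not the authors' implementation; the PyTorch/Hugging Face setup, the 7-class emotion set, and all class and variable names here are assumptions for illustration only.

# Minimal late-fusion sketch (assumed details, not the paper's method):
# concatenate a precomputed 128-d VGGish audio embedding with the BERT-base
# [CLS] embedding of the textual description, then classify emotions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class AudioTextFusionClassifier(nn.Module):
    def __init__(self, num_emotions=7, audio_dim=128, text_dim=768, hidden_dim=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.fusion = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_emotions),
        )

    def forward(self, audio_embedding, input_ids, attention_mask):
        # [CLS] token embedding summarises the factual + contextual description.
        text_feat = self.bert(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state[:, 0]
        fused = torch.cat([audio_embedding, text_feat], dim=-1)
        return self.fusion(fused)

# Usage with a placeholder audio embedding standing in for a VGGish output:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(["She recalls yesterday's argument while opening the letter."],
                return_tensors="pt", padding=True, truncation=True)
audio_emb = torch.randn(1, 128)   # placeholder for a real VGGish embedding
model = AudioTextFusionClassifier(num_emotions=7)   # 7 classes is an assumption
logits = model(audio_emb, enc["input_ids"], enc["attention_mask"])

In this sketch the two modalities are fused only at the classifier, which keeps the audio and text extractors independent; the paper's actual fusion strategy and label set are described in the full text.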

Published

2025-10-30

How to Cite

Lu, G., Liu, H., Wang, K., Hu, W., Peng, W., Yang, T., & Lu, S. (2025). Multi-Modal Emotion Recognition Using Situation-Based Video Context Emotion Dataset. Computing and Informatics, 44(5). Retrieved from http://147.213.75.17/ojs/index.php/cai/article/view/7147