Multi-Modal Emotion Recognition Using Situation-Based Video Context Emotion Dataset
Keywords:
Multi-modal fusion, emotion recognition, transfer learning, dataset, deep learning

Abstract
Current multi-modal emotion recognition techniques primarily rely on modalities such as facial expression, speech, text, and gesture. However, existing methods capture emotion only from the current moment in an image or video, neglecting the influence of time and past experience on human emotion; expanding the temporal scope can provide additional cues for emotion recognition. To address this, we constructed the Situation-Based Video Context Emotion Dataset (SVCEmotion) in video form. Experiments show that both VGGish and BERT-base achieve good results on SVCEmotion. A comparison with other audio emotion recognition methods shows that VGGish is better suited to extracting audio emotion features from the dataset constructed in this paper. Comparison experiments on textual descriptions demonstrate that the contextual descriptions introduced in SVCEmotion provide useful cues for emotion recognition over a wide temporal range, and that combining them with factual descriptions substantially improves recognition performance.
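To illustrate the kind of audio-text fusion the abstract describes, the sketch below combines a VGGish audio embedding with a BERT-base text embedding through a simple late-fusion classifier. This is not the authors' implementation: it assumes precomputed 128-dimensional VGGish embeddings (VGGish emits one 128-d vector per 0.96 s audio frame), uses the HuggingFace `transformers` BERT-base model for the textual description, and the fusion head, hidden size, and number of emotion classes are illustrative choices.

```python
# Minimal late-fusion sketch (illustrative, not the paper's method):
# audio features are assumed to be precomputed VGGish embeddings,
# text features come from BERT-base [CLS] token embeddings.
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

NUM_CLASSES = 7  # assumed number of emotion categories

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

class LateFusionClassifier(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, hidden=256, num_classes=NUM_CLASSES):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden),  # concatenate modalities
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, audio_emb, text_emb):
        # audio_emb: (B, 128) mean-pooled VGGish frames; text_emb: (B, 768) BERT [CLS]
        return self.fc(torch.cat([audio_emb, text_emb], dim=-1))

def encode_text(description: str) -> torch.Tensor:
    """Encode a contextual/factual description with BERT-base, returning the [CLS] vector."""
    inputs = tokenizer(description, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0]  # (1, 768)

# Usage example with a dummy audio embedding standing in for real VGGish output.
audio_emb = torch.randn(1, 128)  # placeholder for a mean-pooled VGGish embedding
text_emb = encode_text("The speaker recalls an argument from earlier in the day.")
model = LateFusionClassifier()
logits = model(audio_emb, text_emb)
print(logits.shape)  # torch.Size([1, 7])
```

In practice the audio branch would pool VGGish frame embeddings over the clip before fusion; concatenation followed by an MLP is only one of several fusion strategies (attention-based or gated fusion are common alternatives), and the paper's actual fusion design is not specified in the abstract.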
