Multi-Modal Emotion Recognition Using Situation-Based Video Context Emotion Dataset
Keywords:
Multi-modal fusion, emotion recognition, transfer learning, dataset, deep learning

Abstract
Current multi-modal emotion recognition techniques primarily rely on modalities such as facial expression, speech, text, and gesture. However, existing methods capture emotion only from the current moment in an image or video, neglecting the influence of time and past experience on human emotion; expanding the temporal scope can provide additional cues for emotion recognition. To address this, we constructed the Situation-Based Video Context Emotion Dataset (SVCEmotion) in video form. Experiments show that both VGGish and BERT-base achieve good results on SVCEmotion. A comparison with other audio emotion recognition methods shows that VGGish is better suited to extracting audio emotion features from the dataset constructed in this paper. Comparison experiments on textual descriptions demonstrate that the contextual descriptions introduced in SVCEmotion provide useful cues for emotion recognition over a wide temporal range, and that combining them with factual descriptions substantially improves recognition performance.
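To illustrate the kind of audio-text fusion the abstract describes, the sketch below combines a VGGish audio embedding with a BERT-base text embedding through a simple late-fusion classifier. This is not the authors' implementation: it assumes precomputed 128-dimensional VGGish embeddings (VGGish emits one 128-d vector per 0.96 s audio frame), uses the HuggingFace `transformers` BERT-base model for the textual description, and the fusion head, hidden size, and number of emotion classes are illustrative choices.

```python
# Minimal late-fusion sketch (illustrative, not the paper's method):
# audio features are assumed to be precomputed VGGish embeddings,
# text features come from BERT-base [CLS] token embeddings.
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

NUM_CLASSES = 7  # assumed number of emotion categories

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

class LateFusionClassifier(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, hidden=256, num_classes=NUM_CLASSES):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden),  # concatenate modalities
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, audio_emb, text_emb):
        # audio_emb: (B, 128) mean-pooled VGGish frames; text_emb: (B, 768) BERT [CLS]
        return self.fc(torch.cat([audio_emb, text_emb], dim=-1))

def encode_text(description: str) -> torch.Tensor:
    """Encode a contextual/factual description with BERT-base, returning the [CLS] vector."""
    inputs = tokenizer(description, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0]  # (1, 768)

# Usage example with a dummy audio embedding standing in for real VGGish output.
audio_emb = torch.randn(1, 128)  # placeholder for a mean-pooled VGGish embedding
text_emb = encode_text("The speaker recalls an argument from earlier in the day.")
model = LateFusionClassifier()
logits = model(audio_emb, text_emb)
print(logits.shape)  # torch.Size([1, 7])
```

In practice the audio branch would pool VGGish frame embeddings over the clip before fusion; concatenation followed by an MLP is only one of several fusion strategies (attention-based or gated fusion are common alternatives), and the paper's actual fusion design is not specified in the abstract.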
