FESNet: Spotting Facial Expressions Using Local Spatial Discrepancy and Multi-Scale Temporal Aggregation

Bohao Zhang; Jiale Lu; Changbo Wang; Gaoqi He

doi:10.31577/cai_2024_2_458

Authors

Bohao Zhang, Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University, Chongqing 401120, China & School of Computer Science and Technology, East China Normal University, Shanghai 200333, China
Jiale Lu, Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University, Chongqing 401120, China & School of Computer Science and Technology, East China Normal University, Shanghai 200333, China
Changbo Wang, Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University, Chongqing 401120, China & School of Computer Science and Technology, East China Normal University, Shanghai 200333, China
Gaoqi He, Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University, Chongqing 401120, China & School of Computer Science and Technology, East China Normal University, Shanghai 200333, China

DOI:

https://doi.org/10.31577/cai_2024_2_458

Keywords:

Facial expression analysis, micro-expression spotting, video understanding, convolutional neural networks

Abstract

Facial expression (FE) spotting aims to split long videos into intervals of neutral expression, macro-expression, or micro-expression. Recent works mainly rely on feature descriptors or optical flow and struggle both to capture subtle facial motion and to aggregate temporal information efficiently. This paper proposes a novel end-to-end network, FESNet (Facial Expression Spotting Network), to address these challenges. The main idea is to model subtle facial motion as local spatial discrepancy and to incorporate temporal correlation through multi-scale temporal convolution. FESNet comprises a local spatial discrepancy module (LSDM) and a multi-scale temporal aggregation module (MTAM). The LSDM first extracts static spatial features from each frame by residual convolution and learns the inner spatial correlation by multi-head attention. It then models the subtle facial motion of an expression as the discrepancy between the first frame and the current frame of the input interval and makes frame-wise spatial proposals. Taking the local spatial discrepancy features and proposals as input, the MTAM incorporates temporal correlation by multi-scale temporal convolution and performs cascade refinement to make the final prediction. Furthermore, this paper proposes a smooth loss to ensure the temporal consistency of the cascade-refined proposals from MTAM. Comprehensive experiments show that FESNet achieves competitive performance compared to state-of-the-art methods.
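
The three ideas named in the abstract (discrepancy against the first frame, aggregation over several temporal scales, and a penalty on temporally inconsistent scores) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the discrepancy is taken as a plain element-wise subtraction, the learned temporal convolutions are replaced by fixed moving averages, the window sizes (1, 3, 5) are arbitrary, and the smoothness term is a mean squared difference between adjacent frames. All function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def local_discrepancy(features: List[Vector]) -> List[Vector]:
    """Subtract the first frame's features from every frame of the interval.

    Assumption: the discrepancy is an element-wise difference.
    """
    first = features[0]
    return [[x - ref for x, ref in zip(frame, first)] for frame in features]


def temporal_average(signal: List[float], window: int) -> List[float]:
    """Centered moving average, truncated at the sequence borders.

    A fixed stand-in for one learned temporal convolution.
    """
    half = window // 2
    n = len(signal)
    out = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def multi_scale_aggregate(signal: List[float],
                          windows: Sequence[int] = (1, 3, 5)) -> List[float]:
    """Average the responses of several temporal window sizes per frame."""
    per_scale = [temporal_average(signal, w) for w in windows]
    return [sum(values) / len(values) for values in zip(*per_scale)]


def smoothness_penalty(scores: List[float]) -> float:
    """Mean squared difference between adjacent frame scores.

    Assumption: one plausible form of a temporal-consistency loss.
    """
    if len(scores) < 2:
        return 0.0
    diffs = [(b - a) ** 2 for a, b in zip(scores, scores[1:])]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Toy per-frame features: only the third frame departs from the first.
    frames = [[1.0, 2.0], [1.0, 2.0], [1.5, 2.5], [1.0, 2.0]]
    disc = local_discrepancy(frames)
    # Per-frame motion magnitude as the L1 norm of the discrepancy.
    motion = [sum(abs(v) for v in d) for d in disc]
    scores = multi_scale_aggregate(motion)
    print(motion, scores, smoothness_penalty(scores))
```

In this toy setting the first frame always has zero discrepancy, so any nonzero magnitude marks a departure from the start of the interval; the multi-scale average spreads that response over neighboring frames, and the penalty grows as adjacent scores diverge.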
