Title :
Fusion of learned multi-modal representations and dense trajectories for emotional analysis in videos
Author :
Acar, Esra ; Hopfgartner, Frank ; Albayrak, Sahin
Author_Institution :
DAI Lab., Tech. Univ. Berlin, Berlin, Germany
Abstract :
When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher-level representations based on these low-level features. In this work, we propose to use deep learning methods, in particular convolutional neural networks (CNNs), to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modalities of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space. We also incorporate dense trajectory-based motion features to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher-level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently of the chosen representation.
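To make the classification and fusion step concrete, the following is a minimal sketch (not the authors' implementation) of late fusion over modality-specific multi-class SVMs, written in Python with numpy and scikit-learn. The randomly generated feature matrices stand in for the learned audio (MFCC-based CNN) and visual (HSV-based CNN) representations and for the dense-trajectory motion descriptors; all feature dimensions, the RBF kernel, and probability-averaging fusion are assumptions, since the abstract does not specify the exact fusion mechanism.

    # Hypothetical late-fusion sketch: one multi-class SVM per modality,
    # per-class probabilities averaged to pick a Valence-Arousal quadrant.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    QUADRANTS = ["high-V/high-A", "low-V/high-A", "low-V/low-A", "high-V/low-A"]

    # Random stand-ins for the learned mid-level representations and motion
    # descriptors; the dimensions are assumptions, not values from the paper.
    n_clips = 200
    X_audio = rng.normal(size=(n_clips, 64))    # audio CNN activations (assumed)
    X_visual = rng.normal(size=(n_clips, 64))   # visual CNN activations (assumed)
    X_motion = rng.normal(size=(n_clips, 128))  # dense-trajectory features (assumed)
    y = rng.integers(0, 4, size=n_clips)        # quadrant labels

    split = 150  # simple train/test split for the sketch
    modalities = (X_audio, X_visual, X_motion)
    svms = [SVC(kernel="rbf", probability=True).fit(X[:split], y[:split])
            for X in modalities]

    # Late fusion: average the per-class probabilities of the three SVMs.
    probs = np.mean([clf.predict_proba(X[split:])
                     for clf, X in zip(svms, modalities)], axis=0)
    pred = probs.argmax(axis=1)
    print("fused accuracy on held-out clips:", (pred == y[split:]).mean())
    print("first held-out clip assigned to:", QUADRANTS[pred[0]])

An early-fusion alternative would instead concatenate the three feature matrices and train a single SVM; averaging per-modality scores, as above, is one common way to let each modality contribute independently.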
Keywords :
audio-visual systems; cepstral analysis; feature selection; learning (artificial intelligence); neural nets; support vector machines; video signal processing; CNNs; DEAP dataset; HSV color space; MFCC; VA space; convolutional neural networks; deep learning methods; dense trajectory-based motion features; discriminative feature selection; emotional analysis; fusion mechanisms; handcrafted higher-level representations; learned multi-modal representations; low-level audio-visual features; mel-frequency cepstral coefficients; mid-level representations; multi-class SVMs; multi-class support vector machines; music video clips; valence-arousal space; video affective content analysis algorithm; video segment representation; Color; Feature extraction; Mel frequency cepstral coefficient; Support vector machines; Trajectory; Videos; Visualization
Conference_Title :
2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI)
Conference_Location :
Prague, Czech Republic
DOI :
10.1109/CBMI.2015.7153603