• DocumentCode
    720880
  • Title
    Fusion of learned multi-modal representations and dense trajectories for emotional analysis in videos
  • Author
    Acar, Esra; Hopfgartner, Frank; Albayrak, Sahin

  • Author_Institution
    DAI Lab., Technische Universität Berlin, Berlin, Germany
  • fYear
    2015
  • fDate
    10-12 June 2015
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher-level representations based on these low-level features. In this work, we propose to use deep learning methods, in particular convolutional neural networks (CNNs), to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modalities of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space. We also incorporate dense-trajectory-based motion features to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher-level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently of the chosen representation.
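  • Method_Sketch
    The abstract describes per-modality feature extraction (CNN-learned mid-level audio/visual representations plus dense-trajectory motion features) followed by multi-class SVM classification and fusion. The minimal Python sketch below illustrates only the final stage, a score-level (late) fusion of per-modality SVMs over the four VA quadrants; the feature extractors are stubbed with random vectors, and all names and dimensionalities are hypothetical rather than taken from the paper.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    n_clips, n_train, n_classes = 120, 100, 4      # four VA quadrants
    y = rng.integers(0, n_classes, size=n_clips)   # quadrant labels per clip

    # Stand-ins for the learned mid-level representations per modality
    # (hypothetical dimensionalities; real features would come from CNNs
    # over MFCC / HSV inputs and from dense-trajectory descriptors).
    feats = {
        "audio_mfcc_cnn": rng.normal(size=(n_clips, 64)),
        "visual_hsv_cnn": rng.normal(size=(n_clips, 64)),
        "dense_traj":     rng.normal(size=(n_clips, 128)),
    }

    # One multi-class SVM per modality; probability outputs enable fusion.
    models = {name: SVC(kernel="linear", probability=True)
                    .fit(X[:n_train], y[:n_train])
              for name, X in feats.items()}

    # Late fusion: average the per-modality class-probability scores,
    # then pick the quadrant with the highest fused score.
    scores = np.mean([models[name].predict_proba(X[n_train:])
                      for name, X in feats.items()], axis=0)
    pred = scores.argmax(axis=1)
    print("fused quadrant accuracy:", (pred == y[n_train:]).mean())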
  • Keywords
    audio-visual systems; cepstral analysis; feature selection; learning (artificial intelligence); neural nets; support vector machines; video signal processing; CNNs; DEAP dataset; HSV color space; MFCC; VA space; convolutional neural networks; deep learning methods; dense trajectory based motion features; discriminative feature selection; emotional analysis; fusion mechanisms; handcrafted higher level representations; learned multimodal representations; low-level audio-visual features; mel-frequency cepstral coefficients; midlevel representations; multiclass SVMs; multiclass support vector machines; music video clips; valence-arousal space; video affective content analysis algorithm; video segment representation; Color; Feature extraction; Mel frequency cepstral coefficient; Support vector machines; Trajectory; Videos; Visualization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    2015 13th International Workshop on Content-Based Multimedia Indexing (CBMI)
  • Conference_Location
    Prague, Czech Republic
  • Type
    conf
  • DOI
    10.1109/CBMI.2015.7153603
  • Filename
    7153603