• DocumentCode
    73022
  • Title
    Multimodal Analysis for Identification and Segmentation of Moving-Sounding Objects
  • Author
    Izadinia, Hamid ; Saleemi, Imran ; Shah, Mubarak
  • Author_Institution
    Dept. of Electr. Eng. & Comput. Sci., Univ. of Central Florida, Orlando, FL, USA
  • Volume
    15
  • Issue
    2
  • fYear
    2013
  • fDate
    Feb. 2013
  • Firstpage
    378
  • Lastpage
    390
  • Abstract
    In this paper, we propose a novel method that exploits the correlation between the audio and visual dynamics of a video to segment and localize the objects that are the dominant source of audio. Our approach consists of a two-step spatiotemporal segmentation mechanism that relies on the velocity and acceleration of moving objects as visual features. Each frame of the video is segmented into regions based on motion and appearance cues using the QuickShift algorithm, and these regions are then clustered over time using K-means to obtain a spatiotemporal video segmentation. The video is represented by motion features computed over individual segments. The audio is represented by the Mel-frequency cepstral coefficients (MFCCs) of the audio signal and their first-order derivatives. The proposed framework assumes there is a non-trivial correlation between these audio features and the velocity and acceleration of the moving and sounding objects. Canonical correlation analysis (CCA) is used to identify the moving objects that are most correlated with the audio signal. In addition to moving-sounding object identification, the same framework is also exploited to solve the problem of audio-video synchronization, and is used to aid interactive segmentation. We evaluate the performance of our proposed method on challenging videos. Our experiments demonstrate a significant increase in performance over the state of the art, both qualitatively and quantitatively, and validate the feasibility and superiority of our approach.
  • Keywords
    audio signal processing; image motion analysis; image representation; image segmentation; learning (artificial intelligence); object detection; pattern clustering; statistical analysis; synchronisation; video signal processing; K-means clustering; Mel-frequency cepstral coefficients; QuickShift algorithm; appearance cue; audio dominant source; audio signal correlation; audio-video synchronization; audio-visual dynamics; canonical correlation analysis; motion cue; motion feature representation; moving object acceleration; moving object velocity; moving-sounding object; multimodal analysis; object identification; object segmentation; two-step spatiotemporal segmentation mechanism; video dynamics; Acceleration; Correlation; Feature extraction; Image segmentation; Mel frequency cepstral coefficient; Motion segmentation; Visualization; Audio-visual analysis; audio-visual synchronization; canonical correlation analysis; video segmentation;
  • fLanguage
    English
  • Journal_Title
    Multimedia, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1520-9210
  • Type
    jour
  • DOI
    10.1109/TMM.2012.2228476
  • Filename
    6357311
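
The CCA step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the regularization constant, the feature dimensions, and the synthetic data below are all assumptions made for illustration. Random arrays stand in for the MFCC-based audio features and for the per-segment velocity and acceleration features; the segment whose motion features have the highest first canonical correlation with the audio is taken as the sounding object.

```python
import numpy as np

def first_canonical_correlation(X, Y, reg=1e-6):
    """Largest canonical correlation between X (T x p) and Y (T x q).

    reg is a small ridge term (an assumption of this sketch) that keeps
    the covariance matrices positive definite.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Whiten both views with Cholesky factors: M = Lx^{-1} Sxy Ly^{-T}.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    M = np.linalg.solve(Lx, Sxy)
    M = np.linalg.solve(Ly, M.T).T
    # Singular values of M are the canonical correlations.
    return float(np.linalg.svd(M, compute_uv=False)[0])

def most_correlated_segment(audio_feats, segment_feats):
    """Index of the segment whose motion features best match the audio."""
    scores = [first_canonical_correlation(audio_feats, seg)
              for seg in segment_feats]
    return int(np.argmax(scores)), scores

# Synthetic stand-ins (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
T = 400                                    # number of time steps
audio = rng.standard_normal((T, 4))        # placeholder for MFCCs + deltas
mix = np.array([[1.0, 0.5], [-0.3, 0.8]])  # fixed audio-to-motion mixing
silent_a = rng.standard_normal((T, 2))     # motion unrelated to the audio
sounding = audio[:, :2] @ mix + 0.1 * rng.standard_normal((T, 2))
silent_b = rng.standard_normal((T, 2))

best, scores = most_correlated_segment(audio, [silent_a, sounding, silent_b])
print("selected segment:", best)
print("first canonical correlations:", np.round(scores, 3))
```

In this toy setup the second segment is constructed to depend linearly on the audio features, so it should receive the highest score. Real motion and audio features would also need to be resampled to a common frame rate before being compared, which this sketch does not cover.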