• DocumentCode
    1759293
  • Title
    Multimodal Saliency and Fusion for Movie Summarization Based on Aural, Visual, and Textual Attention
  • Author
    Evangelopoulos, Georgios; Zlatintsi, Athanasia; Potamianos, Alexandros; Maragos, Petros; Rapantzikos, Konstantinos; Skoumas, Georgios; Avrithis, Yannis
  • Author_Institution
    Sch. of Electr. & Comput. Eng., Nat. Tech. Univ. of Athens, Athens, Greece
  • Volume
    15
  • Issue
    7
  • fYear
    2013
  • fDate
    Nov. 2013
  • Firstpage
    1553
  • Lastpage
    1568
  • Abstract
    Multimodal streams of sensory information are naturally parsed and integrated by humans using signal-level feature extraction and higher-level cognitive processes. Detection of attention-invoking audiovisual segments is formulated in this work on the basis of saliency models for the audio, visual, and textual information conveyed in a video stream. Aural or auditory saliency is assessed by cues that quantify multifrequency waveform modulations, extracted through nonlinear operators and energy tracking. Visual saliency is measured through a spatiotemporal attention model driven by intensity, color, and orientation. Textual or linguistic saliency is extracted from part-of-speech tagging on the subtitle information available with most movie distributions. The individual saliency streams, obtained from modality-dependent cues, are integrated into a multimodal saliency curve, modeling the time-varying perceptual importance of the composite video stream and signifying prevailing sensory events. The multimodal saliency representation forms the basis of a generic, bottom-up video summarization algorithm. Different fusion schemes are evaluated on a movie database of multimodal saliency annotations, with comparative results provided across modalities. The produced summaries, based on low-level features and content-independent fusion and selection, are of subjectively high aesthetic and informative quality.
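    The fusion-and-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual algorithm: the min-max normalization, equal linear weights, and top-fraction frame selection below are illustrative assumptions standing in for the fusion schemes the paper evaluates.

    ```python
    import numpy as np

    def fuse_saliency(aural, visual, textual, weights=(1/3, 1/3, 1/3)):
        """Fuse per-frame saliency streams into one multimodal curve.

        Each stream is min-max normalized to [0, 1], then combined by a
        weighted linear sum (one simple fusion scheme among many possible).
        """
        normed = []
        for s in (aural, visual, textual):
            s = np.asarray(s, dtype=float)
            span = s.max() - s.min()
            normed.append((s - s.min()) / span if span > 0 else np.zeros_like(s))
        return sum(w * s for w, s in zip(weights, normed))

    def select_summary_frames(curve, ratio=0.2):
        """Return sorted indices of the top `ratio` fraction of salient frames."""
        k = max(1, int(len(curve) * ratio))
        return np.sort(np.argsort(curve)[-k:])
    ```

    A summary would then be assembled by keeping the video segments around the selected frame indices, in temporal order.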
  • Keywords
    feature extraction; image colour analysis; image fusion; video signal processing; video streaming; attention-invoking audiovisual segment detection; auditory saliency; aural saliency; bottom-up video summarization algorithm; composite video streaming; content-independent fusion; energy tracking; fusion schemes; higher-level cognitive processes; linguistic saliency; low-level features; modality-dependent cues; movie database; movie distributions; movie summarization; multifrequency waveform modulations; multimodal saliency annotations; multimodal saliency curve; multimodal saliency model; multimodal streams; part-of-speech tagging; sensory information; signal-level feature extraction; spatiotemporal attention model; textual attention; video streaming; visual saliency; attention; audio saliency; fusion; multistream processing; text saliency; video summarization
  • fLanguage
    English
  • Journal_Title
    IEEE Transactions on Multimedia
  • Publisher
    IEEE
  • ISSN
    1520-9210
  • Type
    jour
  • DOI
    10.1109/TMM.2013.2267205
  • Filename
    6527322