• DocumentCode
    254123
  • Title

    Zero-Shot Event Detection Using Multi-modal Fusion of Weakly Supervised Concepts

  • Author

    Wu, Shuang ; Bondugula, Sravanthi ; Luisier, Florian ; Zhuang, Xiaodan ; Natarajan, Prem

  • Author_Institution
    Speech, Language & Multimedia, Raytheon BBN Technol., Cambridge, MA, USA
  • fYear
    2014
  • fDate
    23-28 June 2014
  • Firstpage
    2665
  • Lastpage
    2672
  • Abstract
    Current state-of-the-art systems for visual content analysis require large training sets for each class of interest, and performance degrades rapidly with fewer examples. In this paper, we present a general framework for the zero-shot learning problem of performing high-level event detection with no training exemplars, using only textual descriptions. This task goes beyond the traditional zero-shot framework of adapting a given set of classes with training data to unseen classes. We leverage video and image collections with free-form text descriptions from widely available web sources to learn a large bank of concepts, in addition to using several off-the-shelf concept detectors, speech, and video text for representing videos. We utilize natural language processing technologies to generate event description features. The extracted features are then projected to a common high-dimensional space using text expansion, and similarity is computed in this space. We present extensive experimental results on the large TRECVID MED [26] corpus to demonstrate our approach. Our results show that the proposed concept detection methods significantly outperform current attribute classifiers such as Classemes [34], ObjectBank [21], and SUN attributes [28]. Further, we find that fusion, both within as well as between modalities, is crucial for optimal performance.
  • Keywords
    Web sites; feature extraction; natural language processing; TRECVID MED [26] corpus; Web sources; common high-dimensional space; event description features; extracted features; free-form text descriptions; high-level event detection; image collections; multimodal fusion; text expansion; textual descriptions; training sets; video collections; visual content analysis; weakly supervised concepts; zero-shot event detection; zero-shot framework; zero-shot learning problem; Detectors; Feature extraction; Speech; Support vector machines; Training; Vectors; Visualization; Concept Detection; Multimodal Fusion; Video Event Detection; Zero-shot Learning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on
  • Conference_Location
    Columbus, OH
  • Type

    conf

  • DOI
    10.1109/CVPR.2014.341
  • Filename
    6909737