Regional Information Center for Science and Technology (RICeST) - Multimodal feature fusion for robust event detection in web videos

DocumentCode :

2714488

Title :

Multimodal feature fusion for robust event detection in web videos

Author :

Natarajan, Pradeep ; Wu, Shuang ; Vitaladevuni, Shiv ; Zhuang, Xiaodan ; Tsakalidis, Stavros ; Park, Unsang ; Prasad, Rohit ; Natarajan, Premkumar

Author_Institution :

Speech, Language & Multimedia Business Unit, Raytheon BBN Technologies, Cambridge, MA, USA

fYear :

2012

fDate :

16-21 June 2012

Firstpage :

1298

Lastpage :

1305

Abstract :

Combining multiple low-level visual features is a proven and effective strategy for a range of computer vision tasks. However, limited attention has been paid to combining such features with information from other modalities, such as audio and videotext, for large-scale analysis of web videos. In our work, we rigorously analyze and combine a large set of low-level features that capture appearance, color, motion, audio, and audio-visual co-occurrence patterns in videos. We also evaluate the utility of high-level (i.e., semantic) visual information obtained from detecting scene, object, and action concepts. Further, we exploit multimodal information by analyzing available spoken and videotext content using state-of-the-art automatic speech recognition (ASR) and videotext recognition systems. We combine these diverse features using a two-step strategy employing multiple kernel learning (MKL) and late score-level fusion methods. Based on the TRECVID MED 2011 evaluations for detecting 10 events in a large benchmark set of ~45,000 videos, our system showed the best performance among the 19 international teams.
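
The two-step strategy named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`combine_kernels`, `min_max_normalize`, `late_fusion`), the min-max score normalization, and the hand-picked weights are all assumptions made for illustration. In particular, real MKL learns the kernel weights from training data, whereas here they are simply passed in.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def combine_kernels(kernels: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Step 1 (early fusion): weighted sum of per-feature Gram matrices.

    MKL would learn `weights`; in this sketch the caller supplies them.
    """
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    n = len(kernels[0])
    fused = [[0.0] * n for _ in range(n)]
    for kernel, w in zip(kernels, weights):
        for i in range(n):
            for j in range(n):
                fused[i][j] += w * kernel[i][j]
    return fused


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale one system's scores to [0, 1] so systems are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Constant scores carry no ranking information; map them to 0.5.
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def late_fusion(
    system_scores: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Step 2 (late fusion): weighted average of normalized per-video scores."""
    total = sum(weights[name] for name in system_scores)
    normalized = {name: min_max_normalize(s) for name, s in system_scores.items()}
    n_videos = len(next(iter(normalized.values())))
    return [
        sum(weights[name] * normalized[name][v] for name in normalized) / total
        for v in range(n_videos)
    ]


if __name__ == "__main__":
    # Two hypothetical 2x2 kernels over the same pair of videos.
    k_appearance = [[1.0, 0.0], [0.0, 1.0]]
    k_motion = [[1.0, 1.0], [1.0, 1.0]]
    print(combine_kernels([k_appearance, k_motion], [0.5, 0.5]))

    # Hypothetical detection scores for three videos from two subsystems.
    scores = {"visual": [0.0, 2.0, 4.0], "asr": [10.0, 10.0, 30.0]}
    print(late_fusion(scores, {"visual": 3.0, "asr": 1.0}))
```

In this sketch the fused kernel would feed a single kernel classifier (for example an SVM), and `late_fusion` then merges that classifier's scores with scores from separately trained subsystems such as ASR or videotext.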

Keywords :

Internet; audio-visual systems; computer vision; feature extraction; image colour analysis; image motion analysis; learning (artificial intelligence); object detection; speech recognition; video signal processing; ASR; MKL; TRECVID MED 2011; Web videos; action concepts; audio-visual cooccurrence patterns; color patterns; computer vision tasks; high-level visual information; large scale analysis; motion patterns; multimodal information; multiple kernel learning; multiple low-level visual features fusion; object detection; robust event detection; scene detection; score level fusion methods; state-of-the-art automatic speech recognition; two-step strategy; videotext recognition systems; Encoding; Feature extraction; Image color analysis; Kernel; Speech; Vectors; Videos;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on

Conference_Location :

Providence, RI

ISSN :

1063-6919

Print_ISBN :

978-1-4673-1226-4

Electronic_ISBN :

1063-6919

Type :

conf

DOI :

10.1109/CVPR.2012.6247814

Filename :

6247814

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2714488