Regional Information Center for Science and Technology (RICEST) - YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

DocumentCode :

3427364

Title :

YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

Author :

Guadarrama, Sergio ; Krishnamoorthy, Niveda ; Malkarnenkar, Girish ; Venugopalan, Subhashini ; Mooney, Raymond ; Darrell, Trevor ; Saenko, Kate

Year :

2013

Date :

1-8 Dec. 2013

Firstpage :

2712

Lastpage :

2719

Abstract :

Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities "in-the-wild". We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help choose an appropriate level of generalization, and priors learned from Web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects. We also use a Web-scale language model to "fill in" novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.
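
Illustrative note: the idea of backing off to "a less specific answer" along a semantic hierarchy can be sketched roughly as below. This is a minimal toy sketch under stated assumptions, not the authors' implementation: the hierarchy in PARENT, the leaf probabilities, the 0.7 threshold, and the function names are all hypothetical, and the scoring rule (summing leaf probabilities up to each ancestor, then picking the deepest node that clears the threshold) is one simple way to trade specificity against confidence.

```python
# Hypothetical toy hierarchy (child -> parent); not taken from the paper.
PARENT = {
    "slice": "cut",
    "chop": "cut",
    "cut": "prepare",
    "mix": "prepare",
    "prepare": None,  # root of the toy hierarchy
}
ROOT = "prepare"


def depth(node):
    """Number of edges from `node` up to the root."""
    d = 0
    while PARENT[node] is not None:
        node = PARENT[node]
        d += 1
    return d


def node_scores(leaf_probs):
    """Add each leaf probability to the leaf itself and to all its ancestors."""
    scores = {node: 0.0 for node in PARENT}
    for leaf, p in leaf_probs.items():
        node = leaf
        while node is not None:
            scores[node] += p
            node = PARENT[node]
    return scores


def most_specific_label(leaf_probs, threshold=0.7):
    """Return the deepest node whose accumulated score meets the threshold.

    Falls back to the root when no node is confident enough.
    """
    scores = node_scores(leaf_probs)
    confident = [n for n, s in scores.items() if s >= threshold]
    if not confident:
        return ROOT
    # Prefer deeper (more specific) nodes; break ties by higher score.
    return max(confident, key=lambda n: (depth(n), scores[n]))


if __name__ == "__main__":
    # Assumed classifier outputs: ambiguous between "slice" and "chop".
    print(most_specific_label({"slice": 0.45, "chop": 0.40, "mix": 0.15}))
    # Assumed classifier outputs: confident about "slice".
    print(most_specific_label({"slice": 0.90, "chop": 0.05, "mix": 0.05}))
```

In this toy setup, an input that is ambiguous between two sibling verbs is labelled with their shared parent, while a confident leaf prediction is kept as is.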

Keywords :

natural language processing; object recognition; social networking (online); text analysis; Web-scale natural language corpora; YouTube2Text; actor-action-object combination penalization; arbitrary activity description; arbitrary activity recognition; generalization level; large YouTube corpus; large-scale object recognition; out-of-domain actions; pragmatic standpoint; pretrained model; semantic hierarchy learning; short-sentence description generation; training set; verbs; video activity; video clips; zero-shot recognition; Accuracy; Predictive models; Semantics; Support vector machines; Training; Visualization; YouTube; Describing Activities in videos; Large-scale activity recognition; Recognizing activities in videos; semantic hierarchies; zero-shot learning

Language :

English

Publisher :

IEEE

Conference_Title :

Computer Vision (ICCV), 2013 IEEE International Conference on

Conference_Location :

Sydney, NSW

ISSN :

1550-5499

Type :

conf

DOI :

10.1109/ICCV.2013.337

Filename :

6751448

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3427364