Parsing video events with goal inference and intent prediction

Author

Pei, Mingtao ; Jia, Yunde ; Zhu, Song-Chun

Author_Institution

Lab. of Intell. Inf. Technol., Beijing Inst. of Technol., Beijing, China

fYear

2011

fDate

6-13 Nov. 2011

Firstpage

487

Lastpage

494

Abstract

In this paper, we present an event parsing algorithm based on a Stochastic Context Sensitive Grammar (SCSG) for understanding events, inferring the goal of agents, and predicting their plausible intended actions. The SCSG represents the hierarchical compositions of events and the temporal relations between sub-events. The alphabet of the SCSG consists of atomic actions, which are defined by the poses of agents and their interactions with objects in the scene. The temporal relations are learned from the training data and are used to distinguish events with similar structures and to interpolate missing portions of events. In comparison with existing methods, our paper makes the following contributions: i) we define atomic actions by a set of relations based on the fluents of agents and their interactions with objects in the scene; ii) our algorithm handles event insertion and multi-agent events, keeps all possible interpretations of the video to preserve ambiguities, and achieves the globally optimal parsing solution in a Bayesian framework; iii) the algorithm infers the goal of the agents and predicts their intents by a top-down process; iv) the algorithm improves the detection of atomic actions by using event contexts. We show satisfactory results of event recognition and atomic action detection on a data set we captured, which contains 12 event categories in both indoor and outdoor videos.

Keywords

Bayes methods; context-sensitive grammars; inference mechanisms; multi-agent systems; video signal processing; Bayesian framework; atomic actions; event insertion; event recognition; goal inference; indoor videos; intent prediction; multiagent events; optimal parsing solution; outdoor videos; stochastic context sensitive grammar; temporal relations; top-down process; video event parsing algorithm; Atomic clocks; Context; Grammar; Hidden Markov models; Portable computers; Prediction algorithms; Training data

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Vision (ICCV), 2011 IEEE International Conference on

Conference_Location

Barcelona

ISSN

1550-5499

Print_ISBN

978-1-4577-1101-5

Type

conf

DOI

10.1109/ICCV.2011.6126279

Filename

6126279