DocumentCode :
2530156
Title :
A Multi-Scale Hierarchical Codebook Method for Human Action Recognition in Videos Using a Single Example
Author :
Roshtkhari, Mehrsan Javan ; Levine, Martin D.
Author_Institution :
Dept. of Electr. & Comput. Eng., McGill Univ., Montreal, QC, Canada
fYear :
2012
fDate :
28-30 May 2012
Firstpage :
182
Lastpage :
189
Abstract :
This paper presents a novel action matching method based on a hierarchical codebook of local spatio-temporal video volumes (STVs). Given a single example of an activity as a query video, the proposed method finds videos similar to the query in a video dataset. It is based on the bag of video words (BOV) representation and does not require prior knowledge about actions, background subtraction, motion estimation, or tracking. It is also robust to spatial and temporal scale changes, as well as to some deformations. The hierarchical algorithm yields a compact subset of salient code words of STVs for the query video; the likelihood of similarity between the query video and all STVs in the target video is then measured using a probabilistic inference mechanism. The hierarchy is built by first constructing a codebook of STVs while accounting for the uncertainty in codebook construction, which current versions of the BOV approach ignore. At the second level of the hierarchy, a large contextual region containing many STVs (an ensemble of STVs) is considered in order to construct a probabilistic model of STVs and their spatio-temporal compositions. At the third level of the hierarchy, a codebook is formed for the ensembles of STVs based on their contextual similarities. These high-level code words are the proposed labels for the actions being exhibited in the video. Finally, at the highest level of the hierarchy, the salient labels for the actions are selected by analyzing the high-level code words assigned to each image pixel as a function of time. The algorithm was applied to three available video datasets for action recognition of differing complexity (KTH, Weizmann, and MSR II), and the results were superior to those of other approaches, especially in the cases of a single training example and cross-dataset action recognition.
Keywords :
gesture recognition; image matching; image representation; inference mechanisms; motion estimation; tracking; video signal processing; action matching; background subtraction; bag of video words representation; codebook construction; human action recognition; local spatio-temporal video volumes; motion tracking; multiscale hierarchical codebook; probabilistic inference mechanism; probabilistic model; query video; video dataset; Context; Humans; Probabilistic logic; Probability density function; Uncertainty; Videos; Volume measurement; action recognition; bag of video words; hierarchical codebook
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer and Robot Vision (CRV), 2012 Ninth Conference on
Conference_Location :
Toronto, ON
Print_ISBN :
978-1-4673-1271-4
Type :
conf
DOI :
10.1109/CRV.2012.32
Filename :
6233140