A Multi-Scale Hierarchical Codebook Method for Human Action Recognition in Videos Using a Single Example

Author

Roshtkhari, Mehrsan Javan ; Levine, Martin D.

Author_Institution

Dept. of Electr. & Comput. Eng., McGill Univ., Montreal, QC, Canada

fYear

2012

fDate

28-30 May 2012

Firstpage

182

Lastpage

189

Abstract

This paper presents a novel action matching method based on a hierarchical codebook of local spatio-temporal video volumes (STVs). Given a single example of an activity as a query video, the proposed method finds videos similar to the query in a video dataset. It is based on the bag of video words (BOV) representation and does not require prior knowledge about actions, background subtraction, motion estimation or tracking. It is also robust to spatial and temporal scale changes, as well as some deformations. The hierarchical algorithm yields a compact subset of salient code words of STVs for the query video, and then the likelihood of similarity between the query video and all STVs in the target video is measured using a probabilistic inference mechanism. This hierarchy is achieved by initially constructing a codebook of STVs, while considering the uncertainty in the codebook construction, which is always ignored in current versions of the BOV approach. At the second level of the hierarchy, a large contextual region containing many STVs (an ensemble of STVs) is considered in order to construct a probabilistic model of STVs and their spatio-temporal compositions. At the third level of the hierarchy, a codebook is formed for the ensembles of STVs based on their contextual similarities. The latter are the proposed labels (code words) for the actions being exhibited in the video. Finally, at the highest level of the hierarchy, the salient labels for the actions are selected by analyzing the high-level code words assigned to each image pixel as a function of time. The algorithm was applied to three available video datasets for action recognition with different complexities (KTH, Weizmann, and MSR II) and the results were superior to other approaches, especially in the cases of a single training example and cross-dataset action recognition.

Keywords

gesture recognition; image matching; image representation; inference mechanisms; motion estimation; tracking; video signal processing; action matching; background subtraction; bag of video words representation; codebook construction; human action recognition; local spatio-temporal video volumes; motion tracking; multiscale hierarchical codebook; probabilistic inference mechanism; probabilistic model; query video; video dataset; Context; Humans; Probabilistic logic; Probability density function; Uncertainty; Videos; Volume measurement; action recognition; bag of video words; hierarchical codebook

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer and Robot Vision (CRV), 2012 Ninth Conference on

Conference_Location

Toronto, ON

Print_ISBN

978-1-4673-1271-4

Type

conf

DOI

10.1109/CRV.2012.32

Filename

6233140