DocumentCode :
70746
Title :
Animated Pose Templates for Modeling and Detecting Human Actions
Author :
Yao, Benjamin Z. ; Nie, Bruce X. ; Liu, Zicheng ; Zhu, Song-Chun
Author_Institution :
Beijing Univ. of Posts & Telecommun., Beijing, China
Volume :
36
Issue :
3
fYear :
2014
fDate :
March 2014
Firstpage :
436
Lastpage :
452
Abstract :
This paper presents animated pose templates (APTs) for detecting short-term, long-term, and contextual actions from cluttered scenes in videos. Each pose template consists of two components: 1) a shape template with deformable parts represented in an And-node, whose appearances are described by Histogram of Oriented Gradient (HOG) features, and 2) a motion template specifying the motion of the parts by Histogram of Optical-Flows (HOF) features. A shape template may have more than one motion template, represented by an Or-node. Therefore, each action is defined as a mixture (Or-node) of pose templates in an And-Or tree structure. While this pose template is suitable for detecting short-term action snippets of two to five frames, we extend it in two ways: 1) for long-term actions, we animate the pose templates by adding temporal constraints in a Hidden Markov Model (HMM), and 2) for contextual actions, we treat contextual objects as additional parts of the pose templates and add constraints that encode spatial correlations between parts. To train the model, we manually annotate part locations on several keyframes of each video and cluster them into pose templates using EM. This leaves the unknown parameters for our learning algorithm in two groups: 1) latent variables for the unannotated frames, including pose-IDs and part locations, and 2) model parameters shared by all training samples, such as weights for HOG and HOF features, canonical part locations of each pose, and coefficients penalizing pose transitions and part deformations. To learn these parameters, we introduce a semi-supervised structural SVM algorithm that iterates between two steps: 1) learning (updating) model parameters using labeled data by solving a structural SVM optimization, and 2) imputing missing variables (i.e., detecting actions on unlabeled frames) with the parameters learned in the previous step and progressively accepting high-score frames as newly labeled examples.
This algorithm belongs to a family of optimization methods known as the Concave-Convex Procedure (CCCP), which converge to a locally optimal solution. The inference algorithm consists of two components: 1) detecting top candidates for the pose templates, and 2) computing the sequence of pose templates. Both are done by dynamic programming or, more precisely, beam search. In experiments, we demonstrate that this method is capable of discovering salient poses of actions as well as interactions with contextual objects. We test our method on several public action data sets and on a challenging outdoor contextual action data set that we collected ourselves. The results show that our model achieves performance comparable to or better than state-of-the-art methods.
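The abstract states that the sequence of pose templates is computed by dynamic programming or, more precisely, beam search. The sketch below illustrates that idea only; it is not the authors' implementation. It assumes the per-frame score of each pose template (for example, its best HOG/HOF part configuration) and the HMM-style pose-transition scores are already available as plain arrays. The function name and the parameters `unary`, `transition`, and `beam_width` are hypothetical. At each frame it keeps only the best hypothesis ending in each pose (the Viterbi max step) and then prunes to the `beam_width` highest-scoring poses.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothesis for one pose-ID at the current frame: (total score, pose-ID path).
Hypothesis = Tuple[float, List[int]]


def _prune(beam: Dict[int, Hypothesis], width: int) -> Dict[int, Hypothesis]:
    """Keep only the `width` highest-scoring hypotheses."""
    kept = sorted(beam.items(), key=lambda item: item[1][0], reverse=True)[:width]
    return dict(kept)


def beam_search_pose_sequence(
    unary: Sequence[Sequence[float]],
    transition: Sequence[Sequence[float]],
    beam_width: int,
) -> Tuple[List[int], float]:
    """Hypothetical beam-pruned Viterbi over pose-IDs, one pose per frame.

    unary[t][k]: assumed precomputed score of pose template k at frame t.
    transition[j][k]: assumed score (negative = penalty) for pose j -> pose k.
    With beam_width >= number of poses, this reduces to exact Viterbi DP.
    """
    if beam_width < 1:
        raise ValueError("beam_width must be at least 1")
    num_poses = len(unary[0])
    beam = _prune({k: (unary[0][k], [k]) for k in range(num_poses)}, beam_width)
    for t in range(1, len(unary)):
        new_beam: Dict[int, Hypothesis] = {}
        for k in range(num_poses):
            # Viterbi step: best surviving predecessor for pose k.
            best_score, best_path = max(
                ((score + transition[prev][k], path) for prev, (score, path) in beam.items()),
                key=lambda cand: cand[0],
            )
            new_beam[k] = (best_score + unary[t][k], best_path + [k])
        beam = _prune(new_beam, beam_width)
    best_pose = max(beam, key=lambda k: beam[k][0])
    score, path = beam[best_pose]
    return path, score


if __name__ == "__main__":
    # Toy scores: 3 frames, 2 poses; staying in a pose is free, switching costs 1.
    unary = [[2.0, 0.0], [0.0, 3.0], [1.0, 0.5]]
    transition = [[0.0, -1.0], [-1.0, 0.0]]
    print(beam_search_pose_sequence(unary, transition, beam_width=2))
```

On the toy input, the search switches from pose 0 to pose 1 once, because the gain of pose 1 at the second frame outweighs the switching penalty. Recombining hypotheses per pose before pruning is what makes a full-width beam exact. Pruning raw paths without recombination could discard the best path into some pose.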
Keywords :
computer animation; concave programming; convex programming; dynamic programming; expectation-maximisation algorithm; feature extraction; hidden Markov models; image motion analysis; image representation; inference mechanisms; learning (artificial intelligence); object detection; pose estimation; search problems; support vector machines; video signal processing; APT; And-Or tree structure; And-node; CCCP; EM algorithm; HMM; HOF features; HOG features; Or-node; animated pose templates; appearance representation; beam search; concave-convex procedure; contextual action detection; deformable parts; dynamic programming; expectation-maximization algorithm; hidden Markov model; histogram of optical-flows; histogram-of-oriented gradient; human action detection; human action modeling; inference algorithm; latent variables; learning algorithm; long-term action detection; missing variables; model parameters; motion template; part-deformation; pose-transition; salient pose detection; semi-supervised structural SVM algorithm; shape template; short-term action detection; structural SVM optimization; support vector machines; temporal constraints; videos; Complexity theory; Feature extraction; Hidden Markov models; Optical imaging; Shape; Support vector machines; Videos; Action detection; action recognition; animated pose templates; structural SVM;
fLanguage :
English
Journal_Title :
IEEE Transactions on Pattern Analysis and Machine Intelligence
Publisher :
IEEE
ISSN :
0162-8828
Type :
jour
DOI :
10.1109/TPAMI.2013.144
Filename :
6574854