Title :
Parsing human motion with stretchable models
Author :
Sapp, Benjamin ; Weiss, David ; Taskar, Ben
Author_Institution :
Univ. of Pennsylvania, Philadelphia, PA, USA
Abstract :
We address the problem of articulated human pose estimation in videos using an ensemble of tractable models with rich appearance, shape, contour and motion cues. In previous articulated pose estimation work on unconstrained videos, temporal coupling of limb positions has made little to no difference in performance over parsing frames individually. One crucial reason for this is that joint parsing of multiple articulated parts over time involves intractable inference and learning problems, and previous work has resorted to approximate inference and simplified models. We overcome these computational and modeling limitations using an ensemble of tractable submodels which couple locations of body joints within and across frames using expressive cues. Each submodel is responsible for tracking a single joint through time (e.g., left elbow) and also models the spatial arrangement of all joints in a single frame. Because each submodel is tree-structured, we can perform efficient exact inference and use rich temporal features that depend on image appearance, e.g., color tracking and optical flow contours. We propose and experimentally investigate a hierarchy of submodel combination methods, and we find that a highly efficient max-marginal combination method outperforms approximate inference using dual decomposition, which is slower by orders of magnitude. We apply our pose model to a new video dataset of highly varied and articulated poses from TV shows. We show significant quantitative and qualitative improvements over state-of-the-art single-frame pose estimation approaches.
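The max-marginal combination mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each submodel is a simple chain (a special case of a tree) over the same T variables with K discrete states each, whereas the paper's submodels are trees spanning joints and frames. The function names `chain_max_marginals` and `combine_by_max_marginals` and the toy scores are hypothetical. Exact max-sum message passing gives, for every variable and state, the best total score of any full assignment consistent with that state; the combination step then sums these tables across submodels and decodes each variable independently.

```python
def chain_max_marginals(unary, pairwise):
    """Exact max-marginals for a chain model (illustrative assumption).

    unary[t][s]       : score of variable t taking state s
    pairwise[t][r][s] : score of (x_t = r, x_{t+1} = s)
    Returns mm[t][s] = best total score over assignments with x_t = s.
    """
    T, K = len(unary), len(unary[0])
    fwd = [[0.0] * K for _ in range(T)]  # best score of everything before t
    bwd = [[0.0] * K for _ in range(T)]  # best score of everything after t
    for t in range(1, T):
        for s in range(K):
            fwd[t][s] = max(
                fwd[t - 1][r] + unary[t - 1][r] + pairwise[t - 1][r][s]
                for r in range(K)
            )
    for t in range(T - 2, -1, -1):
        for s in range(K):
            bwd[t][s] = max(
                pairwise[t][s][r] + unary[t + 1][r] + bwd[t + 1][r]
                for r in range(K)
            )
    return [
        [fwd[t][s] + unary[t][s] + bwd[t][s] for s in range(K)]
        for t in range(T)
    ]


def combine_by_max_marginals(submodels):
    """Sum max-marginals over submodels, then pick the best state per variable."""
    tables = [chain_max_marginals(u, p) for u, p in submodels]
    T, K = len(tables[0]), len(tables[0][0])
    decoded = []
    for t in range(T):
        summed = [sum(tab[t][s] for tab in tables) for s in range(K)]
        decoded.append(max(range(K), key=summed.__getitem__))
    return decoded


# Toy example: 3 variables, 2 states, pairwise scores that reward agreement.
unary = [[1.5, 0.0], [0.0, 2.0], [0.5, 0.0]]
smooth = [[[0.0, -1.0], [-1.0, 0.0]], [[0.0, -1.0], [-1.0, 0.0]]]
print(combine_by_max_marginals([(unary, smooth)]))
```

Because each table comes from exact inference in a tractable model, this kind of combination needs only one forward and one backward pass per submodel, which is consistent with the abstract's remark that it is far cheaper than iterative dual decomposition.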
Keywords :
image colour analysis; image sequences; inference mechanisms; motion estimation; pose estimation; tree data structures; video signal processing; approximate inference; articulated human pose estimation problem; color tracking; dual decomposition; human motion parsing; image appearance; intractable inference; joint multiple articulated parts parsing; learning problem; max-marginal combination method; motion cues; optical flow contours; single-frame pose estimation; stretchable model; temporal coupling; temporal features; tractable model ensemble; tree structure; unconstrained videos; video dataset; Computational modeling; Decoding; Elbow; Humans; Image color analysis; Joints; Videos;
Conference_Title :
Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on
Conference_Location :
Providence, RI
Print_ISBN :
978-1-4577-0394-2
DOI :
10.1109/CVPR.2011.5995607