Animated Pose Templates for Modeling and Detecting Human Actions


    Benjamin Z. Yao, Bruce X. Nie, Zicheng Liu, Senior Member, IEEE, and Song-Chun Zhu, Fellow, IEEE

Abstract: This paper presents animated pose templates (APTs) for detecting short-term, long-term, and contextual actions from cluttered scenes in videos. Each pose template consists of two components: 1) a shape template with deformable parts represented in an And-node, whose appearances are represented by Histogram of Oriented Gradient (HOG) features, and 2) a motion template specifying the motion of the parts by Histogram of Optical-Flow (HOF) features. A shape template may have more than one motion template, represented by an Or-node. Therefore, each action is defined as a mixture (Or-node) of pose templates in an And-Or tree structure. While this pose template is suitable for detecting short-term action snippets in two to five frames, we extend it in two ways: 1) for long-term actions, we animate the pose templates by adding temporal constraints in a Hidden Markov Model (HMM), and 2) for contextual actions, we treat contextual objects as additional parts of the pose templates and add constraints that encode spatial correlations between parts. To train the model, we manually annotate part locations on several keyframes of each video and cluster them into pose templates using EM. This leaves the unknown parameters for our learning algorithm in two groups: 1) latent variables for the unannotated frames, including pose-IDs and part locations, and 2) model parameters shared by all training samples, such as weights for HOG and HOF features, canonical part locations of each pose, and coefficients penalizing pose transitions and part deformations. To learn these parameters, we introduce a semi-supervised structural SVM algorithm that iterates between two steps: 1) learning (updating) model parameters using labeled data by solving a structural SVM optimization, and 2) imputing missing variables (i.e., detecting actions on unlabeled frames) with parameters learned from the previous step and progressively accepting high-score frames as newly labeled examples. This algorithm belongs to a family of optimization methods known as the Concave-Convex Procedure (CCCP), which converges to a local optimum. The inference algorithm consists of two components: 1) detecting top candidates for the pose templates, and 2) computing the sequence of pose templates. Both are done by dynamic programming or, more precisely, beam search. In experiments, we demonstrate that this method is capable of discovering salient poses of actions as well as interactions with contextual objects. We test our method on several public action data sets and a challenging outdoor contextual action data set collected by ourselves. The results show that our model achieves comparable or better performance compared to state-of-the-art methods.

Index Terms: Action detection, action recognition, structural SVM, animated pose templates
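The alternating learning scheme summarized in the abstract can be illustrated with a simplified, self-contained sketch. The structural-SVM step is replaced here by a toy mean-difference classifier over binary labels; `fit`, `impute`, and `train_semi_supervised` are illustrative names for the two alternating steps, not the paper's implementation:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit(labeled):
    # Toy stand-in for the structural-SVM step: a mean-difference
    # classifier over (feature_vector, +/-1 label) pairs.
    pos = [x for x, y in labeled if y > 0]
    neg = [x for x, y in labeled if y < 0]
    mean = lambda vs: [sum(c) / len(vs) for c in zip(*vs)]
    return [p - n for p, n in zip(mean(pos), mean(neg))]

def impute(w, x):
    # Toy stand-in for detection on an unlabeled frame:
    # predict a label together with a confidence score.
    s = dot(w, x)
    return (1 if s >= 0 else -1), abs(s)

def train_semi_supervised(labeled, unlabeled, rounds=3, accept=0.5):
    w = fit(labeled)                       # step 1: learn from labeled data
    for _ in range(rounds):
        confident, remaining = [], []
        for x in unlabeled:
            y, s = impute(w, x)            # step 2: impute missing labels
            (confident if s >= accept else remaining).append((x, y))
        labeled = labeled + confident      # accept only high-score frames
        unlabeled = [x for x, _ in remaining]
        w = fit(labeled)                   # re-learn on the enlarged set
    return w
```

The CCCP property of the actual algorithm (monotone convergence to a local optimum) depends on the structural-SVM objective and is not reproduced by this toy classifier; the sketch only shows the control flow of the alternation.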


1.1 Background and Motivation

HUMAN action recognition has attracted increasing research interest in recent years, motivated by a range of applications from video surveillance and human-computer interaction to content-based video retrieval. Building a robust system for real-world human action understanding presents challenges at multiple levels: 1) localizing the actions of interest; 2) recognizing the actions; and 3) interpreting the interactions between agents and contextual objects. Recent research has made major progress on classifying actions under idealized conditions in several public data sets, such as the KTH data set [1] and the Weizmann data set [2], where each video clip contains one person acting in front of a static or uncluttered background, with one action per video clip. State-of-the-art methods have achieved nearly 100 percent accuracy on these two data sets. On the other hand, action understanding from real-world videos, which commonly contain heavy clutter and background motions, remains a hard problem.

In general, actions are building blocks for activities or events, and thus are simpler than the latter in terms of the number of agents involved and the time duration, but still have diverse complexities in space and time:

1. In space, actions can be defined by body parts, such as waving and clapping; by human poses, such as walking; or by human-scene interactions, such as making coffee in an office and washing dishes in a kitchen. In the last case, the whole image provides contextual information for action recognition.

2. In time, actions can be defined in a single frame, such as sitting and meeting; in two to five frames, such as pushing a button and waving a hand, which are also called action snippets [3]; or over a longer duration, say 5 seconds, such as answering a phone call.

In the following, we briefly review the literature in four categories according to their space-time complexity:

1. Action recognition by template matching. The idea of template matching has been previously exploited by researchers for action recognition. These approaches attempt to characterize the motion by looking at video sequences as

B.Z. Yao is with Beijing University of Posts and Telecommunications, China.

B.X. Nie and S.-C. Zhu are with the Statistics Department, University of California at Los Angeles, 8125 Math Sciences Bldg., Los Angeles, CA 90095. E-mail: {xhnie, sczhu}

Z. Liu is with Microsoft Research, One Microsoft Way, Building 113/2116, Redmond, WA 98052. E-mail:

Manuscript received 7 May 2012; revised 5 Jan. 2013; accepted 30 Mar. 2013; published online 2 Aug. 2013. Recommended for acceptance by F. de la Torre. For information on obtaining reprints of this article, please send e-mail, and reference IEEECS Log Number TPAMI-2012-05-0351. Digital Object Identifier no. 10.1109/TPAMI.2013.144.

0162-8828/14/$31.00 © 2014 IEEE. Published by the IEEE Computer Society.

either 2D templates or 3D volumes. For example, Essa and Pentland [4] built detailed, physically based 2D templates of the face using optical flow features. Recognition is then done by direct pattern matching with the templates. Bobick and Davis [5] use motion history images that capture both motion and shape to represent actions. They introduced the global descriptors motion energy image and motion history image, which are used as templates that can be matched to stored models of known actions. Efros et al. [6] also perform action recognition by correlating optical flow measurements, but they focus on the case of low-resolution video of human behaviors, targeting what they refer to as the 30-pixel man. Following this line of research, Gorelick et al. [2] extend the concept of template matching from 2D motion templates to 3D space-time volumes. They extract space-time features for action recognition, such as local space-time saliency, action dynamics, shape structures, and orientation. One common shortcoming of these template matching methods is that they rely on the restriction of static backgrounds, which allows them to segment the foreground using background subtraction. In our method, there is no need for this kind of foreground segmentation. Also, since all these methods use rigid templates, rather than the deformable templates used in this paper, they are much more sensitive to appearance and viewpoint variations.
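The motion history image of Bobick and Davis [5] admits a compact recursive update. The sketch below assumes a precomputed binary motion mask (e.g., from frame differencing, which is omitted here); the function name and list-of-lists representation are illustrative, not from the original work:

```python
def update_mhi(mhi, motion_mask, tau=10):
    # H(x, y, t) = tau where motion is detected at frame t,
    # otherwise max(0, H(x, y, t-1) - 1): recent motion stays bright
    # while older motion fades linearly, yielding a single template
    # that encodes both shape and the recency of motion.
    return [[tau if moved else max(0, h - 1)
             for h, moved in zip(h_row, m_row)]
            for h_row, m_row in zip(mhi, motion_mask)]
```

In [5], recognition then matches such templates (via moment-based shape descriptors) against stored models of known actions.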

2. Action recognition by spatiotemporal interest points (STIP). To overcome the large geometric and appearance variations in videos of an action, researchers extract spatiotemporal interest points and compute HOG and HOF features around them for action classification in the SVM framework [7]. Different approaches either pool the features in a bag-of-words (BoW) representation [1], [8] or embrace a pyramid structure [9].

If we compare with the task of object recognition, it is worth noting that a quantum jump in performance was achieved in the past few years when researchers departed from BoW features and adopted the deformable part-based model (DPM) [10], especially for human detection in images [11], [12]. Thus, we expect that a better representation for human action should extend the DPM to the temporal domain and capture some important information missed by the STIP representations. First, body parts should be explicitly modeled with rich appearance features, unlike the BoW methods that rely on a codebook of quantized features or visual words. The quantization of codebooks through K-means clustering is often unreliable, largely because these clusters have varying complexities and dimensions. Second, spatial relations between parts should be represented to account for the human poses. Third, the motion information for each part, and of the pose as a whole, should be represented to account for the transitions and, thus, temporal regularity in sequential poses.

These observations motivated our animated pose template (APT) model; we shall elaborate on these aspects when we overview APT in the next section.
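The three ingredients above suggest a template score of the form appearance + motion − deformation. A schematic version is sketched below; all inputs (per-part filter responses, displacements, penalty weights) are toy placeholders rather than the paper's actual parameterization:

```python
def pose_template_score(hog_resp, hof_resp, part_offsets, deform_coef):
    # hog_resp / hof_resp: per-part filter responses on HOG (shape)
    # and HOF (motion) features; part_offsets: (dx, dy) displacement
    # of each part from its canonical anchor location;
    # deform_coef: weights of a quadratic deformation penalty.
    appearance = sum(hog_resp)
    motion = sum(hof_resp)
    deformation = sum(c * (dx * dx + dy * dy)
                      for c, (dx, dy) in zip(deform_coef, part_offsets))
    return appearance + motion - deformation
```

Scores of this shape decompose over parts, which is what makes dynamic-programming inference over part placements tractable in DPM-style models.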

3. Action recognition by pose estimation. Encouraged by the relative success of human detection using deformable part-based models, people have recently attempted to treat action recognition as a pose-estimation problem in a single image frame. Ferrari et al. [13] retrieved TV shots containing a particular 2D human pose by first estimating the human pose, then searching for shots based on a feature vector extracted from the pose. Johnson and Everingham [14] used a mixture of tree-structured poses. Yang et al. [11] used HOG features and support vector machine (SVM) classifiers for action recognition, where they argue that it is benefi