

Action Recognition based on Human Movement Characteristics

Radu Dondera∗, David Doermann and Larry Davis
University of Maryland, College Park, MD USA

{rdondera, [email protected]}, {[email protected]}

Abstract

We present a motion descriptor for human action recognition where appearance and shape information are unreliable. Unlike other motion-based approaches, we leverage image characteristics specific to human movement to achieve better robustness and lower computational cost. Drawing on recent work on motion recognition with ballistic dynamics, an action is modeled as a series of short correlated linear movements and represented with a probability density function over motion vector data. We are targeting common human actions composed of ballistic movements, and our descriptor can handle both short actions (e.g. reaching with the hand) and long actions with events at relatively stable time offsets (e.g. walking). The proposed descriptor is used for both classification and detection of action instances, in a nearest-neighbor framework. We evaluate the descriptor on the KTH action database and obtain a recognition rate of 90% in a relevant test setting, comparable to the state-of-the-art approaches that use other cues in addition to motion. We also acquired a database of actions with slight occlusion and a human actor manipulating objects of various shapes and appearances. This database makes the use of appearance and shape information problematic, but we obtain a recognition rate of 95%. Our work demonstrates that human movement has distinctive patterns, and that these patterns can be used effectively for action recognition.

1. Introduction

Human action recognition is an active field of computer vision research with applications to visual surveillance, human computer interaction and video indexing. One of the main challenges in action recognition is action representation. Previous work has investigated the use of appearance, shape, motion and sequencing information and their combinations. Although state of the art methods [17] [18] [21] achieve near perfect results on standard databases [24], action recognition is still challenging for real-world data.

∗ Supported by a grant from Panasonic Technologies Inc.

Understanding human behavior at a checkout counter of a department store is a challenging application, and current statistics on revenue loss due to illicit cashier actions [9] make it an important problem. It is hard because the appearance and shape of people can vary considerably: customers can wear different clothes, have different heights and body types. Moreover, as customers push shopping carts, cashiers stand behind counters, and large items are manipulated, occlusions and cluttered background frequently occur. Action recognition approaches that track humans, estimate pose, extract silhouettes, find space-time interest points or employ any cues other than motion are therefore problematic. Since actions relevant to this context have distinctive movement patterns, a motion-based approach can effectively compensate for missing and noisy motion vectors through partial matching. Typical motion-based representations compute optical flow over a video volume of interest and classify the action in the volume without assuming a particular model for the resulting set of motion vectors. But ignoring the characteristics of human movement in these and other motion-based approaches limits performance in terms of both robustness and computational cost.

Efros et al. [7] track human actors in a video, where an action is defined by the optical flow for each of the actor-centered bounding boxes in a time interval. The approach is time consuming because it computes similarity using the cross-correlation score and thus examines each pair of corresponding frames in training and testing video sequences. Fathi and Mori [10] also work on carved volumes obtained through tracking, but increase efficiency by using a two level boosting scheme with weak classifiers defined on small and medium sized cuboids. For both approaches, occlusions, background changes or lighting differences occurring for long periods of time can lead to misalignment of the bounding box and substantially reduce performance. Ke et al. [16] use a one level boosting scheme with weak classifiers on arbitrary sized cuboids (volumetric features) and do not require bounding boxes. However, they scan entire videos with a query video volume, and during training they select relevant volumetric features from a very large number of candidates, even for a query volume of short duration (1 million for 40 frames, 25 fps).

We propose a motion-based action representation, loosely based on psycho-kinesiological models of human movement, that achieves high robustness to appearance and shape variation, handles both short and long actions and has very low computational requirements. The representation is suitable for common human actions composed of ballistic movements, which create linear image patterns that can be efficiently leveraged to detect and classify actions. While most other motion-based approaches require initial processing involving appearance and/or shape, our action recognition approach relies on movement only, offering advantages in certain recognition scenarios and potential for extension by incorporating appearance and shape information.

1.1. Related Work

In early work, Bregler [4] formulated the action recognition problem as grouping pixels based on color, motion and time continuity and then classifying actions based on sequences of atomic group motions. Bobick and Davis [3] considered the shape and motion of the human body during actions, assuming either stationary background or reliable background subtraction. Restricting the recognition context, Cutler and Davis [5] specifically targeted periodic actions. In a broader context, Parameswaran and Chellappa [23] focused on viewpoint invariant action recognition.

A class of approaches has been built on the concept of space-time interest points, which capture distinctive local events in videos [6] [19]. 3D neighborhood descriptors of interest points detected in training videos are clustered and actions are represented in terms of the resulting cluster representatives (visual words) [24] [22]. In recent work, Mikolajczyk and Uemura [21] detected space (image) interest points and matched them, based on appearance descriptors and velocity, to a learned motion-appearance vocabulary of interest points. The consistency of appearance and motion plays a major role in both action detection and classification, allowing them to use interest points from only a small number of frames. Relying on consistency is typical of interest point approaches, which become problematic in settings with appearance variation, background clutter and extraneous objects, especially for non-repetitive actions.

A complementary class of approaches was designed to work with dense video volume information, either optical flow [7] or intensity values directly [17]. Other approaches used boosting schemes: Ke et al. [16] and Fathi and Mori [10] computed optical flow and represented video volumes with arbitrary size cuboids and with small fixed size cuboids containing salient points for motion, respectively. Drawing on neurobiological models of perception, Jhuang et al. [14] proposed a hierarchy of motion features and were able to reduce the number of features at the top level to hundreds for video sequences a few seconds long.

In model-based motion recognition work, Fablet and Bouthemy [8] employed low level information derived from the optical flow equation, and captured co-occurrence statistics with temporal multiscale Gibbs models. High level information was used by Song et al. [26], who modeled the positions and velocities of vertices of a human body graph detected in a sequence of frames. More recently, Ali et al. [2] represented actions with chaos theory invariants extracted from trajectories of six semantically relevant points on the human body (e.g., “head”). The latter two approaches assume that points with clear semantics are detected and tracked throughout the entire video, which is difficult to achieve in practice. In the related field of motion segmentation, a number of approaches clustered motion vectors to obtain meaningful tracks [13] [28]. Jiang and Martin [15] detected actions by matching tracks of pixels on the human body contour. In the absence of high level information, clustering may produce inconsistent partitions (very different for slightly different action instances) and track matching is computationally difficult.

Specifically targeting the same physical setting as our work (checkout counters), Fan et al. [9] proposed a framework for recognition of repetitive sequential actions. Action primitives (e.g. cashier picks up an item) were efficiently integrated into sequences of actions (e.g. cashier picks up, scans, puts down an item) through a Viterbi-like algorithm leveraging spatio-temporal and sequencing constraints. Action primitive detection is the low level of the framework and is implemented mainly through heuristics related to physical constraints. This is in line with other work in recognition of composite actions, which focused on the high level of the framework by attempting to support parallel execution, partial ordering of actions and robustness to detection errors [20] [29].

Recently, Vitaladevuni et al. [27] proposed an action recognition framework based on ballistic dynamics. Inspired by psychological studies showing the units of movement planning are ballistic in nature (e.g. reach), they segmented videos into time intervals with semantically relevant movements of the whole body in one direction (e.g. reach for object on floor). Motion History Images [3] were computed on segments of both training and test videos and then matched using dynamic time warping, resulting in better recognition performance than Motion History Images on entire videos. We draw on [27] for the idea of linear movements, central to our representation.

1.2. Approach

The underlying assumption of our representation is that a human action is composed of a set of short and small correlated linear movements (< 1 s, tens of pixels). Although actions with multiple ballistic movements can be seen as having multiple such sets (one per ballistic movement), we represent the entire set without temporally segmenting the video, in contrast to [27].

Given a video volume, we compute sparse optical flow [25] between consecutive frames, transform the directions and positions of the resulting motion vectors into a position-augmented Hough space, and compute a probability density function over vector coordinates in this space. We use the Bhattacharyya coefficient of two pdf's as the similarity measure in a k-NN framework for action detection and classification. The proposed motion descriptor can function as execution rate dependent or independent; however, it assumes the action events occur at relatively stable time offsets. Our descriptor recognizes actions performed in the same image location, but it can also be made location independent through a simple scheme to estimate the displacement between two similar action instances. The descriptor is viewpoint dependent, but in the checkout counter scenario the camera is fixed and the number of possible postures for each action of interest is small and manageable in a k-NN framework with a smoothly varying similarity measure.

2. Motion Descriptor

2.1. Motion Profiles

Our assumption that “human actions consist of small and short correlated linear movements” implies that (1) the motion vectors will lie on a small number of image lines, (2) the motion vectors that lie approximately on the same image line will form groups spanning small image regions and small time intervals, and (3) groups will be correlated in terms of image regions and time intervals. We represent an action with a mixture of Gaussians, each modeling the image coordinates and times of motion vectors in a group, and we call this representation a motion profile. We leverage the linearity hypothesis by transforming motion vectors into an augmented Hough space that enables efficient use of the mixture model. The following subsections present the steps involved in computing motion profiles.

2.1.1 Sparse Optical Flow

In the context of our assumptions on human movement, accurate motion information is more important than dense information, so we track KLT (Kanade-Lucas-Tomasi) feature points [1] from frame to frame and process the resulting tracks to reduce noise. Static feature points are eliminated with a threshold on velocity magnitudes (e.g. 1 pixel/frame) and short movements (most often occurring as background noise confusing the KLT tracker) are eliminated with a threshold on track duration (e.g. 5 frames). We subsequently discard track information and what remains is an unordered set of motion vectors given by origin, velocity and time (video frame index relative to a base frame): {(u1, v1, du1, dv1, t1), . . . , (un, vn, dun, dvn, tn)}.
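As an illustration of this preprocessing step, the Python sketch below builds such a motion vector set with OpenCV's pyramidal Lucas-Kanade tracker rather than the Birchfield KLT library cited as [1]; the two thresholds are the example values from the text, while the detector parameters, the absence of feature re-detection and all names are our own simplifications.

```python
import cv2
import numpy as np

MIN_SPEED = 1.0      # pixels/frame; slower points are treated as static (example value from the text)
MIN_TRACK_LEN = 5    # frames; shorter tracks are treated as tracker noise (example value from the text)

def motion_vectors(frames):
    """Return an unordered list of (u, v, du, dv, t) motion vectors from grayscale frames."""
    pts = cv2.goodFeaturesToTrack(frames[0], maxCorners=500, qualityLevel=0.01, minDistance=5)
    tracks = [[(0, float(p[0][0]), float(p[0][1]))] for p in pts]   # each track: list of (t, x, y)
    finished = []
    for t in range(1, len(frames)):
        if not tracks:
            break
        prev_pts = np.float32([[tr[-1][1:]] for tr in tracks])
        next_pts, status, _ = cv2.calcOpticalFlowPyrLK(frames[t - 1], frames[t], prev_pts, None)
        alive = []
        for tr, p, ok in zip(tracks, next_pts, status.ravel()):
            if ok:
                tr.append((t, float(p[0][0]), float(p[0][1])))
                alive.append(tr)
            else:
                finished.append(tr)
        tracks = alive
    finished.extend(tracks)

    vectors = []
    for tr in finished:
        if len(tr) < MIN_TRACK_LEN:
            continue                      # threshold on track duration
        for (t0, x0, y0), (_, x1, y1) in zip(tr, tr[1:]):
            du, dv = x1 - x0, y1 - y0
            if np.hypot(du, dv) < MIN_SPEED:
                continue                  # threshold on velocity magnitude
            vectors.append((x0, y0, du, dv, t0))
    return vectors
```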

Modeling motion vectors with a mixture of Gaussians at this point requires some form of partitioning, but general purpose frameworks (e.g. Expectation Maximization, Mean Shift) do not produce partitions consistent across action instances. We solve this problem with a transformation step that allows us to address the partitioning problem with a sliding window technique.

2.1.2 Augmented Hough Space

The Hough space [12] can be used to represent the image lines along which movement occurs. To completely encode direction and position information for motion vectors, we augment the Hough space (ρ, θ) by attaching a 1D coordinate system d to each line. The origin of that coordinate system is the projection of the image origin on the line and the positive direction is left of the perpendicular (see Figure 1(a)). For each motion vector, we drop magnitude information and express direction and position by (ρ, θ, d) according to the equations

α = atan2(dv, du)    (1)
ρ = −u sin α + v cos α    (2)
θ = α + π/2    (3)
d = u cos α + v sin α    (4)

θ is normalized to the range [0, 2π] and ρ can be positive or negative. The set of motion vectors becomes {(ρ1, θ1, d1, t1), . . . , (ρn, θn, dn, tn)}.

We discretize the augmented Hough space (cell size e.g. 10 pixels and 5 degrees) and approximate the motion vectors in cell (i, j) as belonging to the cell midline, (ρ_midline = (i + 0.5) × (ρ cell size), θ_midline = (j + 0.5) × (θ cell size)). The i's can be negative because the ρ's can be negative, but we offset them so all values are positive. The d value is replaced by the d of the projection of the motion vector origin on the cell midline (see Figure 1(b)). At this point, the set of motion vectors in a video volume is stored as {(i1, j1, d1, t1), . . . , (in, jn, dn, tn)}.
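A small sketch of the mapping and discretization above, under the example cell sizes of 10 pixels and 5 degrees; the function names are illustrative and the offsetting of negative i indices is left out.

```python
import math

RHO_CELL = 10.0                  # example rho cell size (pixels)
THETA_CELL = math.radians(5.0)   # example theta cell size (5 degrees)

def to_augmented_hough(u, v, du, dv):
    """Map a motion vector with origin (u, v) and velocity (du, dv) to (rho, theta, d), Eqs. 1-4."""
    alpha = math.atan2(dv, du)
    rho = -u * math.sin(alpha) + v * math.cos(alpha)
    theta = (alpha + math.pi / 2) % (2 * math.pi)        # normalized to [0, 2*pi]
    d = u * math.cos(alpha) + v * math.sin(alpha)
    return rho, theta, d

def discretize(u, v, rho, theta):
    """Cell indices (i, j) and d re-measured along the cell midline (Figure 1(b)).

    i may be negative because rho may be negative; in the full scheme the indices
    are offset so that all values are positive."""
    i = int(math.floor(rho / RHO_CELL))
    j = int(math.floor(theta / THETA_CELL))
    theta_mid = (j + 0.5) * THETA_CELL
    alpha_mid = theta_mid - math.pi / 2
    d_mid = u * math.cos(alpha_mid) + v * math.sin(alpha_mid)   # projection of the origin on the midline
    return i, j, d_mid
```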

2.1.3 Motion Vector Groups

The visualization of (d, t) values in a Hough space cell (i, j) shows clear separation into groups (see Figure 2), confirming the existence of short linear movements. Since the groups typically span short time intervals, we use a sliding window to extract groups from a cell. We order vectors in a cell by time and scan the list with a small window (e.g. 0.5 s), counting the number of motion vectors inside and detecting peaks. Peaks with too few motion vectors (e.g. < 3) are considered noise and eliminated, and groups are formed with the vectors in the windows of the remaining peaks. Groups are included in one cell only and cells may contain zero, one or multiple groups. The partitioning may be inadequate when two groups are too close (i.e. within 0.5 s of each other) or a group splits into two or more small groups in adjacent Hough space cells, but we observed that the procedure works well for the end goal of action recognition.

Figure 1. (a) Coordinates of a motion vector in augmented Hough space. The coordinate system of a line is defined using the perpendicular from the image origin. (b) Coordinates in the discretized version of the augmented Hough space.

Figure 2. Typical (d, t) values of motion vectors in a Hough space cell, for a 500 frame video volume (25 fps). Our partitioning procedure produces the four highlighted groups.
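The text does not specify the exact peak detection, so the sketch below is one greedy reading of the sliding-window procedure: candidate windows are ranked by how many of the cell's vectors they contain, and each surviving peak claims its vectors as a group. The window length and minimum vector count are the example values above.

```python
import numpy as np

WINDOW = 12       # sliding window length in frames (~0.5 s at 25 fps, as in the text)
MIN_VECTORS = 3   # peaks with fewer motion vectors are discarded as noise

def group_cell(cell_vectors):
    """Partition the (d, t) vectors of one Hough-space cell into temporal groups.

    cell_vectors: list of (d, t) pairs; returns a list of groups (lists of (d, t) pairs)."""
    cell_vectors = sorted(cell_vectors, key=lambda v: v[1])
    times = np.array([t for _, t in cell_vectors])
    # number of vectors inside a window starting at each vector's time
    counts = np.array([np.sum((times >= t) & (times < t + WINDOW)) for t in times])

    groups = []
    used = np.zeros(len(cell_vectors), dtype=bool)
    for idx in np.argsort(-counts):                 # visit candidate windows, densest first
        if used[idx] or counts[idx] < MIN_VECTORS:
            continue
        in_window = (times >= times[idx]) & (times < times[idx] + WINDOW) & ~used
        if in_window.sum() < MIN_VECTORS:
            continue
        groups.append([cell_vectors[i] for i in np.nonzero(in_window)[0]])
        used |= in_window
    return groups
```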

2.1.4 Mixture Model

We represent an action's motion vectors with a mixture of bivariate Gaussians, each modeling the (d, t) values in a group of motion vectors. Body parts may have different numbers of feature points on them (due to different body size, clothing, illumination etc.) and we weigh the Gaussians equally to reduce the influence of appearance and shape. The probability density function is

p(ρ, θ, d, t) = (1/n) ∑_{(ρ,θ) ∈ cell(i)} Ni(d, t)    (5)

where n is the number of groups and Ni is the bivariate Gaussian for group i. The summation is over the groups in the cell containing (ρ, θ).

If we stipulate that a Gaussian is zero outside the cell of its group, the probability density function becomes

p(ρ, θ, d, t) = (1/n) ∑_i Ni(d, t)    (6)

(summation over all groups). We observe that groups in the same cell are usually far away from each other in time, therefore there is no overlap between Gaussians in the same cell, so at any (ρ, θ, d, t) point all terms of the sum in Equation 6 are negligible or zero except for at most one.

Our representation maintains a Hough space cell index and (d, t) averages and covariance matrices for each group. The averages and covariances are efficiently computed via an integral video structure [16] that stores for each cell the number of motion vectors and the sums of their d, t, d², t² and dt values. The values of these statistics for arbitrary time intervals are computed in O(1), and so the averages and covariances are also computed in O(1).
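One possible layout for this integral structure is sketched below; the class and array organization are assumptions, but the stored quantities (counts and sums of d, t, d², t², dt) and the O(1) interval queries follow the description above.

```python
import numpy as np

class CellIntegral:
    """Per-cell cumulative sums of (1, d, t, d^2, t^2, d*t) over frame index, so the
    statistics of any time interval come from two row lookups (O(1))."""

    def __init__(self, num_frames):
        self.cum = {}              # cell (i, j) -> array of shape (num_frames + 1, 6)
        self.num_frames = num_frames

    def add(self, cell, d, t):
        """Accumulate one motion vector; call for every vector before finalize()."""
        if cell not in self.cum:
            self.cum[cell] = np.zeros((self.num_frames + 1, 6))
        self.cum[cell][t + 1] += [1, d, t, d * d, t * t, d * t]

    def finalize(self):
        """Turn the per-frame sums into cumulative sums along the time axis."""
        for c in self.cum:
            self.cum[c] = np.cumsum(self.cum[c], axis=0)

    def stats(self, cell, t0, t1):
        """Count, mean and covariance of (d, t) for the vectors of `cell` with t in [t0, t1].
        Assumes finalize() has been called and the interval contains at least one vector."""
        s = self.cum[cell][t1 + 1] - self.cum[cell][t0]
        n, sd, st, sdd, stt, sdt = s
        mean = np.array([sd, st]) / n
        cov = np.array([[sdd / n - mean[0] ** 2, sdt / n - mean[0] * mean[1]],
                        [sdt / n - mean[0] * mean[1], stt / n - mean[1] ** 2]])
        return n, mean, cov
```

With this structure, the mean and covariance of each discovered group reduce to a single interval query over its cell.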

Although our approach is largely inspired by the use of the Hough transform in line detection, it is significantly different. In line detection schemes, edge pixels vote for cells and local maxima are the detection results, while our approach does not vote for cells, but instead maintains all non-empty cells. Also, note that there are no assumptions about the numbers of groups in cells, and the mixture model is built from the groups discovered in the data, rather than by learning the parameters of a fixed cell structure. Finally, even though it might seem that the d values in a group would depend linearly on the t values, this is not the case. The points on body parts slowly change direction and their trajectories are multiple connected small segments (best modeled by Gaussians), rather than one large segment (best modeled by a uniform distribution).


2.2. Motion Profile Similarity

To compute similarity between motion profiles, we use the Bhattacharyya coefficient, defined as

BC(p1, p2) = ∫_X √(p1(x) p2(x)) dx    (7)

for two probability density functions p1 and p2 over X. We chose the Bhattacharyya coefficient as the similarity measure because of the simplifications it enables in the final expression. Also, the Bhattacharyya coefficient can handle "spiky" distributions like the ones used here, for which other similarity metrics are not suitable (e.g. the symmetric KL divergence would equal infinity).

For mixtures of equally weighted Gaussians,

BC(p1, p2) = ∫_X √((1/n1) ∑_i N1,i(x) · (1/n2) ∑_j N2,j(x)) dx    (8)

where n1 and n2 are the numbers of groups in the profiles and N1,i and N2,j are the Gaussians for groups i and j, respectively.

As observed in Section 2.1.4, in a motion profile there is no overlap between Gaussians in the same cell or between Gaussians in different cells. Consequently, the similarity computation for two motion profiles can be simplified to summing Bhattacharyya coefficients for pairs of Gaussians in the same cell:

BC(p1, p2) = (1 / √(n1 n2)) ∑_{cell(i)=cell(j)} BC(N1,i, N2,j)    (9)

Gaussians in a cell of one motion profile may be better matched with Gaussians from neighboring cells in the other motion profile. This is not reflected in Equation 9, but we alleviate the problem with a multiresolution technique (described below).

The Bhattacharyya coefficient for two Gaussians reflects the closeness of means and the similarity of covariances in two distinct factors:

BC(N1, N2) = exp(−(1/8) (µ1 − µ2)^T Σ12^{−1} (µ1 − µ2)) · √(√(det Σ1 det Σ2) / det Σ12)    (10)

where N1 ∼ N(µ1; Σ1), N2 ∼ N(µ2; Σ2) and Σ12 = (Σ1 + Σ2)/2.

Since the number of groups in a motion profile is typically on the order of hundreds, naive evaluation of the sum in Equation 9 would be computationally expensive (O(n1 n2)). We use a reverse lookup table, mapping cells to the groups belonging to them, to reduce the complexity to O(n1 + n2), and we store precomputation results (e.g. covariance matrix determinants) for faster evaluation of the right hand side of Equation 10.
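For concreteness, a sketch of Equations 9 and 10 in which a dictionary keyed by cell index plays the role of the reverse lookup table; names and the data layout are illustrative.

```python
import numpy as np
from collections import defaultdict

def bc_gauss(mu1, cov1, mu2, cov2):
    """Bhattacharyya coefficient of two Gaussians (Equation 10)."""
    cov12 = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    expo = -0.125 * diff @ np.linalg.solve(cov12, diff)
    return np.exp(expo) * np.sqrt(np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2))
                                  / np.linalg.det(cov12))

def profile_similarity(groups1, groups2):
    """Equation 9: sum BC over pairs of groups that fall in the same Hough cell.

    groups*: lists of (cell, mean, cov) tuples; the dict keyed by cell acts as the
    reverse lookup table, so the cost is O(n1 + n2) plus the matched pairs."""
    if not groups1 or not groups2:
        return 0.0
    by_cell = defaultdict(list)
    for cell, mu, cov in groups2:
        by_cell[cell].append((mu, cov))
    total = 0.0
    for cell, mu1, cov1 in groups1:
        for mu2, cov2 in by_cell[cell]:
            total += bc_gauss(mu1, cov1, mu2, cov2)
    return total / np.sqrt(len(groups1) * len(groups2))
```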

We use a multiresolution technique to ensure the similarity measure does not become unreliable due to inadequate cell size, and observe that recognition performance improves. We compute each motion profile at r exponentially decreasing resolutions of the Hough space and sum similarity measures with exponentially decreasing weights, as in the pyramid match kernel scheme [11]:

similarity(p1, p2) = (∑_{k=0}^{r} 2^{−k} BC_k(p1, p2)) / (∑_{k=0}^{r} 2^{−k})    (11)

where BC_k(p1, p2) is the Bhattacharyya coefficient of the two motion profiles at resolution k.

The similarity measure can be made invariant to execution rate by using time values normalized with respect to the video volume's time extent. It can also be made invariant to image location through a scheme to estimate the image displacement between motion profiles (see Section 2.3). We experiment with the four resulting variants of the similarity measure in a k-NN framework for action detection and classification.
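Assuming a profile_similarity function in the spirit of the Equation 9 sketch above, the multiresolution combination of Equation 11 and a minimal 1-NN instance of the k-NN framework could look as follows; the choice of r and the nearest-neighbor details are illustrative.

```python
import numpy as np

def pyramid_similarity(profiles1, profiles2, r, similarity_fn):
    """Equation 11: combine similarity scores computed at r+1 Hough-space resolutions
    with exponentially decreasing weights (2^-k is largest for the first resolution),
    as in the pyramid match kernel scheme [11].

    profiles1[k], profiles2[k]: group lists of the two motion profiles at resolution k."""
    weights = 2.0 ** -np.arange(r + 1)
    scores = np.array([similarity_fn(profiles1[k], profiles2[k]) for k in range(r + 1)])
    return float(np.sum(weights * scores) / np.sum(weights))

def classify_1nn(test_profiles, train_set, similarity_fn, r=2):
    """Label of the most similar training instance; train_set is a list of
    (multiresolution_profiles, label) pairs."""
    scores = [pyramid_similarity(test_profiles, tr, r, similarity_fn) for tr, _ in train_set]
    return train_set[int(np.argmax(scores))][1]
```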

2.3. Estimating the Displacement between Similar Action Instances

In certain surveillance scenarios, the image location where an action is performed carries meaning, but in general recognition tasks (e.g. is the human running or walking?) the location is irrelevant. We develop a scheme to estimate the displacement between two similar action instances that is robust to small changes in the motion vector sets (due to e.g. body occlusion or missed feature points).

We observe that the translation of a motion vector affects its ρ and d values but not its θ value. Given the (ρ, θ, d) values of a motion vector and the (ρ′, θ, d′) values of its translated version, from Equations 2-4 the translation (du, dv) is

du = (ρ′ − ρ) cos θ + (d′ − d) sin θ    (12)
dv = (ρ′ − ρ) sin θ − (d′ − d) cos θ    (13)

The groups in the motion profile of a set of motion vectors correspond to the groups in the profile of a displaced version of that set. For two matched groups, the displacement can be estimated via Equations 12 and 13, but we do not attempt to find correspondences between groups. Instead, we average the displacements estimated for pairs of groups in cells with the same discretized θ value; errors for wrong pairs compensate if the two action instances are similar. We iterate over all discretized θ values, weigh displacements proportional to closeness in time, and use a multiresolution technique as for profile similarity (Section 2.2).
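A rough single-resolution sketch of this averaging, assuming each group is summarized by its cell indices (i, j) and its mean (d, t); the inverse time-distance weighting is only one reading of "proportional to closeness in time".

```python
import math
from collections import defaultdict

def estimate_displacement(groups1, groups2, rho_cell=10.0, theta_cell=math.radians(5.0)):
    """Estimate the (du, dv) displacement between two motion profiles.

    groups*: lists of (i, j, d, t) per group, where (i, j) are the discretized
    (rho, theta) cell indices and (d, t) the group's mean coordinates. Every pair
    of groups sharing a theta index yields a candidate displacement via Eqs. 12-13;
    candidates are averaged with weights favoring pairs that are close in time."""
    by_theta = defaultdict(list)
    for i, j, d, t in groups2:
        by_theta[j].append((i, d, t))
    sum_du = sum_dv = sum_w = 0.0
    for i1, j, d1, t1 in groups1:
        theta = (j + 0.5) * theta_cell
        rho1 = (i1 + 0.5) * rho_cell
        for i2, d2, t2 in by_theta[j]:
            rho2 = (i2 + 0.5) * rho_cell
            du = (rho2 - rho1) * math.cos(theta) + (d2 - d1) * math.sin(theta)   # Eq. 12
            dv = (rho2 - rho1) * math.sin(theta) - (d2 - d1) * math.cos(theta)   # Eq. 13
            w = 1.0 / (1.0 + abs(t2 - t1))   # closer in time -> larger weight (assumed form)
            sum_du += w * du
            sum_dv += w * dv
            sum_w += w
    if sum_w == 0.0:
        return 0.0, 0.0
    return sum_du / sum_w, sum_dv / sum_w
```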

With an estimate of the displacement between two profiles, we can translate the groups of one profile and then compare it to the other. The ρ and d values of groups are updated by adding du, dv to u, v in Equations 2 and 4. The new ρ values are discretized, possibly shifting groups to new cells. The estimated displacement is precise only for reasonably similar action instances, and the computed similarity value is high. Dissimilar instances tend to produce random displacements, which could still lead to high similarity values, but this is not a problem in practice.

3. Experimental Evaluation

3.1. KTH Database

We tested our system on the commonly used KTH action database [24], which has 6 action classes (boxing, clapping, waving, jogging, running, walking) performed by 25 persons under 4 scenarios (outdoors, outdoors with camera zoom, outdoors with different clothes, indoors). We followed the leave-one-person-out procedure [22] [17]: train on videos from 24 persons and test on videos from the remaining person, iterating through all 25 persons. The videos with camera zoom are discarded because our motion descriptor does not yet handle scale changes. We obtained an overall recognition rate of 0.90; the confusion matrix is shown in Figure 3(a). We also tested on the entire database (including videos with camera zoom) and used a random 16-9 persons split [14] (20 trials) in order to compare our results with state of the art approaches (see Figure 3(b)). On an Intel Core2 3GHz and with a MATLAB implementation, our system is an order of magnitude faster than state of the art methods. Recognition performance is slightly lower than state of the art, but a large fraction of the misclassified instances are virtually impossible to discriminate using motion only. Instances of clapping vary a lot in their motion patterns and some patterns are similar to those of boxing and waving (see Figure 4); there is confusion between clapping and boxing and between clapping and waving, but not vice versa, as boxing and waving are more homogeneous. The classic jogging-running confusion and the less common jogging-walking confusion are more pronounced because our approach does not use appearance or shape, which can discriminate further when speeds are too similar.

Figure 3. Results on the KTH database. (a) Confusion matrix when videos with camera zoom are excluded. The overall recognition rate is 0.90. (b) Comparison with other results when videos with camera zoom are included.

(a)
     | box  | clap | wave | jog  | run  | walk
box  | 0.97 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00
clap | 0.10 | 0.81 | 0.08 | 0.00 | 0.01 | 0.00
wave | 0.00 | 0.04 | 0.96 | 0.00 | 0.00 | 0.00
jog  | 0.00 | 0.00 | 0.00 | 0.84 | 0.11 | 0.05
run  | 0.00 | 0.00 | 0.00 | 0.16 | 0.84 | 0.00
walk | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.99

(b)
Method           | Cues                | Rate for LOO | Rate for 2/3 − 1/3 | Time per frame (s)
Ours             | motion              | 0.8712       | 0.8510             | 0.02
Ke [16]          | motion              | –            | 0.6296             | –
Fathi [10]       | motion & tracking   | –            | 0.9050             | 0.75
Niebles [22]     | appearance & motion | 0.8150       | –                  | –
Jhuang [14]      | appearance & motion | –            | 0.9170             | 0.6
Mikolajczyk [21] | appearance & motion | –            | 0.9317             | 0.5 – 10
Laptev [18]      | appearance & motion | –            | 0.9650             | –

Figure 4. Clapping examples (top row a, b) with motion tracks similar to boxing and waving (bottom row a, b). (a) Arms move diagonally; in boxing, fists are thrown forward and elbows drawn back on a single diagonal direction and they occlude each other. (b) Arms are dropped after hands come together and then raised for the next cycle.

3.2. Checkout Counter Database

We also tested the system on a database of videos simulating a checkout counter of a department store in a laboratory setting. One person performed 5 actions: pick up item from cart, put item down on belt (customer), pick up item from belt, scan item and put item down on table (cashier). There are 447 action instances and the average and maximum duration are 0.89 and 2.13 s respectively (shorter than the average KTH duration of ≈ 4 s). The human actor is sometimes occluded and manipulates objects of various shapes and appearances (see Figure 5).

We evaluated the system on two action recognition tasks: classification and detection. For the first task, we randomly chose a training set of 2/3 of the instances and a validation set with the remaining 1/3. Averaging over 20 trials, we obtain an overall recognition rate of 0.95 and the confusion matrix in Figure 6(a). For the second task, we randomly chose a training set of 2/3 of the videos and a validation set with the remaining 1/3. An action detection is considered correct if its endpoints are within 0.5 s of the endpoints of an annotated instance of that action. Although this seems generous when compared to the average duration, instances of the same action occur more than one second away from each other (there is at least one other instance in between or, in the case of "scan item", a "reset" gesture). Averaging over all actions and over 20 trials, we trace the Precision-Recall curve in Figure 6(b); the crossover point is 0.71.
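As a concrete reading of this scoring criterion, a small evaluation sketch with greedy one-to-one matching between detections and annotations (the matching order and names are our own choices):

```python
def precision_recall(detections, annotations, tol=0.5):
    """Score detections against annotations: a detection (label, start, end), in seconds,
    is correct if both endpoints are within `tol` seconds of an unmatched annotation of
    the same action."""
    matched = [False] * len(annotations)
    tp = 0
    for label, start, end in detections:
        for k, (alabel, astart, aend) in enumerate(annotations):
            if matched[k] or alabel != label:
                continue
            if abs(start - astart) <= tol and abs(end - aend) <= tol:
                matched[k] = True
                tp += 1
                break
    precision = tp / len(detections) if detections else 0.0
    recall = tp / len(annotations) if annotations else 0.0
    return precision, recall
```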

Figure 5. The actions in the checkout counter database: (a) pick up item from cart, (b) put item down on belt, (c) pick up item from belt, (d) scan item, (e) put item down on table.

We evaluated the displacement estimation procedure in a similar way, but randomly translated the motion vector sets of validation instances. The translation can be expressed with a distance and an angle, and we computed overall recognition rates for increasing distances. For each distance value, we ran 20 trials with a random 2/3 - 1/3 split and a random angle. With displacement estimation, the overall recognition rate is approximately 0.9 for displacements of less than 150 pixels, and decreases slowly as motion profiles are clipped by the image window. Without estimation, the rate degrades faster to random guessing (Figure 6(c)).

Figure 6. Results on the checkout counter database. (a) Confusion matrix for the classification task. The overall recognition rate is 0.95. (b) PR curve for the detection task. (c) Overall recognition rate as function of displacement (similarity measure with and without displacement estimation).

(a)
                  | pick up from cart | put down on belt | pick up from belt | scan | put down on table
pick up from cart | 0.97              | 0.03             | 0.00              | 0.00 | 0.00
put down on belt  | 0.00              | 1.00             | 0.00              | 0.00 | 0.00
pick up from belt | 0.00              | 0.00              | 1.00             | 0.00 | 0.00
scan              | 0.00              | 0.00             | 0.00              | 0.96 | 0.04
put down on table | 0.00              | 0.00             | 0.03              | 0.16 | 0.81

4. Conclusion

We presented a novel motion descriptor for action recognition based on human movement characteristics. Our approach assumes short and small correlated linear movements during an action and models motion vectors with a probability distribution over the Hough space augmented with position information. We used a feature point tracker to obtain motion vectors, formed motion vector groups, aggregated group statistics into mixtures of Gaussians, and measured the similarity of two such pdf's with the Bhattacharyya coefficient.

We demonstrated good recognition performance on the standard KTH action database, as well as on a locally acquired database on which approaches that use appearance and shape are bound to fail. Our descriptor has very low computational requirements, can handle both short and long actions, and since it relies only on movement, it has potential for extension by incorporating appearance and shape information.

References

[1] http://www.ces.clemson.edu/~stb/klt/.
[2] S. Ali, A. Basharat, and M. Shah. Chaotic invariants for human action recognition. In IEEE International Conference on Computer Vision, pages 1–8, 2007.
[3] A. F. Bobick and J. W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):257–267, 2001.
[4] C. Bregler. Learning and recognizing human dynamics in video sequences. IEEE Conference on Computer Vision and Pattern Recognition, pages 568–574, 1997.
[5] R. Cutler and L. Davis. Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:781–796, 2000.
[6] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, October 2005.
[7] A. A. Efros, A. C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In IEEE International Conference on Computer Vision, 2003.
[8] R. Fablet and P. Bouthemy. Motion recognition using nonparametric image motion models estimated from temporal and multiscale co-occurrence statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:1619–1624, 2003.
[9] Q. Fan, R. Bobbitt, Z. Yun, A. Yanagwa, S. Pankanti, and A. Hampapur. Recognition of repetitive sequential human activity. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[10] A. Fathi and G. Mori. Action recognition by learning mid-level motion features. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[11] K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. IEEE Conference on Computer Vision and Pattern Recognition, 2:1458–1465, 2005.
[12] P. Hough. Machine analysis of bubble chamber pictures. Proc. Int'l Conf. High Energy Accelerators and Instrumentation, 1959.



[13] M. Hu, S. Ali, and M. Shah. Detecting global motion patterns in complex videos. In International Conference on Pattern Recognition, pages 1–5, 2008.
[14] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. In IEEE International Conference on Computer Vision, pages 1–8, 2007.
[15] H. Jiang and D. R. Martin. Finding actions using shape flows. In European Conference on Computer Vision, pages 278–292, 2008.
[16] Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In IEEE International Conference on Computer Vision, volume 1, pages 166–173, 2005.
[17] T.-K. Kim and R. Cipolla. Canonical correlation analysis of video volume tensors for action categorization and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 99(1), 2008.
[18] I. Laptev, B. Caputo, C. Schuldt, and T. Lindeberg. Local velocity-adapted motion events for spatio-temporal recognition. Computer Vision and Image Understanding, 108(3):207–229, 2007.
[19] I. Laptev and T. Lindeberg. Space-time interest points. In IEEE International Conference on Computer Vision, pages 432–439, 2003.
[20] B. Laxton, J. Lim, and D. J. Kriegman. Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video. In IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[21] K. Mikolajczyk and H. Uemura. Action recognition with motion-appearance vocabulary forest. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.
[22] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 79(3):299–318, 2008.
[23] V. Parameswaran and R. Chellappa. View invariants for human action recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2:613, 2003.
[24] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In International Conference on Pattern Recognition, pages 32–36, 2004.
[25] J. Shi and C. Tomasi. Good features to track. In IEEE Conference on Computer Vision and Pattern Recognition, pages 593–600, 1994.
[26] Y. Song, L. Goncalves, and P. Perona. Unsupervised learning of human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 2003.
[27] S. N. P. Vitaladevuni, V. Kellokumpu, and L. S. Davis. Action recognition using ballistic dynamics. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[28] J. Yan and M. Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In European Conference on Computer Vision, pages 94–106, 2006.
[29] Z. Zhang, K. Huang, and T. Tan. Multi-thread parsing for recognizing complex events in videos. In European Conference on Computer Vision, pages 738–751, 2008.