
Frame-by-frame crowd motion classification from affine motion models

Antoine Basset, Patrick Bouthemy, Charles Kervrann
Inria, Centre Rennes – Bretagne Atlantique

Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France
[email protected]

Abstract

Recognizing dynamic behaviors of dense crowds in videos is of great interest in many surveillance applications. In contrast to most existing methods, which are based on trajectories or tracklets, our approach for crowd motion analysis provides a crowd motion classification on a frame-by-frame and pixel-wise basis. Indeed, we only compute affine motion models from pairs of consecutive video images. The classification itself relies on simple rules on the coefficients of the computed affine motion models, and therefore does not imply any prior learning stage. The overall method proceeds in four steps: (i) detection of moving points, (ii) computation of a set of motion model candidates over a collection of windows, (iii) selection of the best motion model at each point according to a maximum likelihood criterion, (iv) determination of the crowd motion class at each pixel with a hierarchical classification tree regularized by majority votes. The algorithm is almost parameter-free, and is efficient in terms of memory and computation load. Experiments on computer-generated sequences and real video sequences demonstrate that our method is accurate and can successfully handle complex situations.

1. Introduction and related work

Important research efforts have been devoted to crowd analysis for several years (for a survey, see [15]), involving pedestrian tracking [7, 10], anomaly detection [5, 8, 12] and path classification [14, 16]. As for crowd behavior classification, Zhou et al. [17] and Cheriyadat and Radke [2] studied coherent and dominant crowd motions. Zhou et al. proposed to group moving points according to the so-called coherent neighbor invariance, able to represent the spatial proximity and the correlation over time of velocity vectors of data points. In [2], the trajectories are organized into clusters according to a longest common subsequence (LCSS) criterion.

Table 1. Classes of Solmaz et al. [13] and ours

Motion types   Behaviors of Solmaz et al.   Our classes
Translation    Lane                         North, West, South, East
Rotation       Ring/Arch                    Clockwise, Counterclockwise
Scaling        Bottleneck                   Convergence
               Fountainhead                 Divergence
Skewing        Blocking                     Not included

In [14], Wang et al. developed the recognition of semantic regions (prominent paths in the scene) within the framework of hierarchical Dirichlet processes and latent topics, while Zhou et al. [16] introduced the so-called random field topic model for semantic region analysis to account for spatial and temporal coherence between tracklets. To our knowledge, only Hu et al. [7] and Solmaz et al. [13] have focused on classifying structured group motions. The former determined motion patterns by clustering 4D flow vectors (2D position and velocity of points) in each frame according to proximity and similarity rules. The latter proposed to extract trajectories and accumulation points from the advection of flow fields over video sequences. In this paper, we address the same problem of classifying coherent crowd motions in videos recorded by a fixed camera, but we propose a totally different approach.

Existing crowd analysis methods usually exploit motion-based features computed on extended time intervals: spatio-temporal cuboids [5, 8, 11], tracklets [16] and, mostly, trajectories [2, 10, 13, 14, 17]. Our method only requires simple affine motion models estimated from pairs of images. It does not involve any temporal integration or trajectory computation. Besides, neither a learning stage nor parameter adjustment is needed. We assume that the apparent motion of a group of pedestrians can locally be represented by one of the three following motion models: translation, scaling or rotation. Scaling motions correspond to gathering (Convergence) or dispersing pedestrians (Divergence). Rotation motions are subdivided into Clockwise and Counterclockwise classes. Since our classification scheme is view-based, we choose to distinguish four image-related translation directions: North, West, South, East. A finer subdivision could be handled as well if required. These eight crowd motion classes can be related to the behaviors introduced in [13], as summarized in Table 1.


Let us mention that our scheme is applicable to any point in the image, not only around a few critical points as in [13].

We have evaluated our method on both computer-generated and real image sequences. Furthermore, the method is flexible enough to be extended to “crowd-alike” videos like animal flocks or traffic images. The rest of the paper is organized as follows. In Section 2, we introduce the motion models and the set of computation windows. We also explain how we select the proper motion model at every point, and how we infer the crowd motion classification. In Section 3, implementation details are given and experimental results are reported and discussed. Section 4 contains concluding remarks.

2. Crowd motion analysis

Our method is divided into four main steps: (i) detection of moving areas in the image, (ii) estimation of three “sub-affine” motion models over a collection of windows of different sizes, (iii) point-wise selection of the optimal motion models, and (iv) crowd motion classification by a decision tree involving majority votes. The first step is achieved with the motion detection algorithm [3], which follows a background subtraction approach and involves a mixed-state conditional random field (see [3] for details). The set of detected moving areas is denoted S, with S ⊂ Ω, where Ω is the image domain. The motion detection algorithm is tuned by two parameters that we have kept fixed for all processed sequences, and their setting was not critical for our classification task. With step 2, we end up with a set of motion model candidates at every point p ∈ S. Step 3 allows us to find, at every point p ∈ S, the most relevant motion model among these candidates with a maximum likelihood (ML) criterion. The last step consists in determining the crowd motion class at every point p ∈ S while regularizing the solution using two majority votes.

2.1. Estimation of the motion model candidates

We only consider 2D parametric motion models. At any point p = (x, y) ∈ S, the optical flow vector w(p) is approximated by an affine flow vector wθ(p) defined by:

$$ w_\theta(p) \;=\; \underbrace{\begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix}}_{A} \begin{pmatrix} x \\ y \end{pmatrix} \;+\; \underbrace{\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}}_{B}, \qquad (1) $$

with θ = (a1, a2, a3, a4, b1, b2)ᵀ the model parameter vector. In order to characterize the eight previously introduced crowd motion classes, only three “sub-affine” motion models are necessary: translation, scaling, and rotation motions, respectively corresponding to the following 2×2 matrices A, as explained in [6]:

$$ A_T = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, \qquad A_S = \begin{pmatrix} a_1 & 0 \\ 0 & a_1 \end{pmatrix}, \qquad A_R = \begin{pmatrix} 0 & a_2 \\ -a_2 & 0 \end{pmatrix}. \qquad (2) $$

Figure 1. The set of windows W(p) the point p belongs to (Stairs sequence): (a) coarse size (100%), (b) medium size (50%), (c) fine size (25%). In (a), the blue window includes the whole image, and the red and yellow ones are truncated to lie inside it.

The vector B is considered in any case, since it corresponds to the displacement of the origin of the coordinate system. So, for each motion model, only two (translation case) or three (scaling and rotation cases) coefficients have to be estimated, respectively:

$$ \theta_T = (b_1, b_2)^T, \qquad \theta_S = (b_1, b_2, a_1)^T, \qquad \theta_R = (b_1, b_2, a_2)^T. \qquad (3) $$
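For concreteness, the following minimal sketch (illustrative only, written in numpy; it is not the authors' implementation) evaluates the flow vector wθ(p) of Eqs. (1)–(3) for each of the three sub-affine models.

```python
# Illustrative sketch: the affine flow w_theta(p) of Eq. (1) restricted to the
# three sub-affine models of Eqs. (2)-(3).
import numpy as np

def affine_flow(theta, kind, x, y):
    """Return the flow vector w_theta(p) at point p = (x, y).

    theta -- (b1, b2) for 'T', (b1, b2, a1) for 'S', (b1, b2, a2) for 'R'
    kind  -- 'T' (translation), 'S' (scaling) or 'R' (rotation)
    """
    b = np.array(theta[:2], dtype=float)
    if kind == 'T':                        # A_T = 0
        A = np.zeros((2, 2))
    elif kind == 'S':                      # A_S = a1 * Identity
        a1 = theta[2]
        A = np.array([[a1, 0.0], [0.0, a1]])
    elif kind == 'R':                      # A_R = [[0, a2], [-a2, 0]]
        a2 = theta[2]
        A = np.array([[0.0, a2], [-a2, 0.0]])
    else:
        raise ValueError("kind must be 'T', 'S' or 'R'")
    return A @ np.array([x, y], dtype=float) + b
```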

Since we do not know in advance the appropriate spatial support to estimate the motion models, we consider a collection W of overlapping windows of various sizes – typically, 25%, 50%, and 100% of the image dimensions. For a given size, the overlap rate is 50%, so that a given point p belongs to four windows of that size (apart from border effects). An example is given in Figure 1.
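A possible construction of such a window collection is sketched below; only the window sizes and the 50% overlap are taken from the text, while the exact tiling and clipping rules are assumptions of this sketch.

```python
# Sketch of the window collection W: for each size (100%, 50%, 25% of the
# image dimensions), windows overlap by 50%. Window coordinates are clipped
# to the image domain, as for the truncated windows of Figure 1a.
def window_collection(height, width, ratios=(1.0, 0.5, 0.25)):
    windows = []
    for r in ratios:
        wh = max(1, int(round(r * height)))
        ww = max(1, int(round(r * width)))
        step_y, step_x = max(1, wh // 2), max(1, ww // 2)   # 50% overlap
        for y0 in range(0, height, step_y):
            for x0 in range(0, width, step_x):
                y1, x1 = min(y0 + wh, height), min(x0 + ww, width)
                if y1 > y0 and x1 > x0:
                    windows.append((y0, x0, y1, x1))
    return windows
```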

We estimate the three motion models (3) in every window, using the robust method [9] based on a multi-resolution and incremental scheme. The robust estimation allows us to capture the dominant motion if several motions are present inside the window, and to tolerate errors in the motion detection stage. Since the minimization of the robust penalty function amounts to an IRLS (Iteratively Reweighted Least Squares) procedure [9], each point p is assigned at the end a weight representing its influence in the robust estimation. A point whose weight is close to 1 is called an inlier. Let θk,i be the parameters of the motion model k ∈ {T, S, R}, estimated in the window Wi ∈ W. The set of inliers for the model of parameter vector θk,i is denoted by Xk,i (it is obtained by thresholding the weights).
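For intuition, a heavily simplified, single-scale IRLS sketch for one sub-affine model inside one window is given below. The actual estimator [9] is multi-resolution and incremental; the linearized brightness-constancy residual, the Cauchy weights and the scale `c` used here are illustrative assumptions, not the paper's exact choices.

```python
# Simplified IRLS sketch (single scale, single linearization) for one
# sub-affine model in one window; not the estimator of [9].
import numpy as np

def estimate_model(I0, I1, window, kind, n_iter=10, c=5.0):
    y0, x0, y1, x1 = window
    ys, xs = np.mgrid[y0:y1, x0:x1]
    gy, gx = np.gradient(I0.astype(float))          # spatial gradients of I_t
    gx, gy = gx[y0:y1, x0:x1].ravel(), gy[y0:y1, x0:x1].ravel()
    dI = (I1.astype(float) - I0.astype(float))[y0:y1, x0:x1].ravel()
    x, y = xs.ravel().astype(float), ys.ravel().astype(float)

    # Design matrix G such that the linearized residual is G @ theta + dI.
    if kind == 'T':                                 # w = (b1, b2)
        G = np.stack([gx, gy], axis=1)
    elif kind == 'S':                               # w = (b1 + a1*x, b2 + a1*y)
        G = np.stack([gx, gy, gx * x + gy * y], axis=1)
    else:                                           # 'R': w = (b1 + a2*y, b2 - a2*x)
        G = np.stack([gx, gy, gx * y - gy * x], axis=1)

    theta = np.zeros(G.shape[1])
    w = np.ones(G.shape[0])
    for _ in range(n_iter):
        r = G @ theta + dI
        w = 1.0 / (1.0 + (r / c) ** 2)              # Cauchy weights (IRLS)
        GW = G * w[:, None]
        # Weighted normal equations: (G^T W G) theta = -G^T W dI
        theta = np.linalg.lstsq(GW.T @ G, -GW.T @ dI, rcond=None)[0]
    inliers = w > 0.5                               # thresholded weights -> X_{k,i}
    return theta, inliers
```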

The conformity evaluation of a point p to a given motion model of parameters θk,i is based on the displaced frame difference (DFD) and is defined by:

$$ \varepsilon(p, \theta_{k,i}) = I_{t+1}\big(p + w_{\theta_{k,i}}(p)\big) - I_t(p), \qquad (4) $$

where It represents the intensities of the t-th frame and wθk,i(p) is the velocity of p deduced from θk,i according to (1). Conformity corresponds to ε close to 0.
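Since p + wθ(p) generally falls between pixels, evaluating the DFD requires sub-pixel sampling of It+1; the paper does not specify the interpolation scheme, so the sketch below assumes bilinear interpolation.

```python
# Sketch of the conformity measure of Eq. (4): the displaced frame difference
# at p, sampling I_{t+1} at the non-integer location p + w_theta(p) by
# bilinear interpolation (an assumption of this sketch).
import numpy as np

def dfd(I_t, I_t1, x, y, flow_vec):
    dx, dy = flow_vec
    h, w = I_t1.shape
    xf = min(max(x + dx, 0.0), w - 1.001)           # clamp inside the image
    yf = min(max(y + dy, 0.0), h - 1.001)
    x0, y0 = int(xf), int(yf)
    ax, ay = xf - x0, yf - y0
    warped = ((1 - ax) * (1 - ay) * I_t1[y0, x0] + ax * (1 - ay) * I_t1[y0, x0 + 1]
              + (1 - ax) * ay * I_t1[y0 + 1, x0] + ax * ay * I_t1[y0 + 1, x0 + 1])
    return warped - I_t[y, x]
```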

For every motion model in every window Wi, we compute both the motion parameters θk,i and the empirical variance σ²k,i, given by:

$$ \sigma^2_{k,i} = \frac{1}{|X_{k,i}|} \sum_{p \in X_{k,i}} \varepsilon^2(p, \theta_{k,i}), \qquad (5) $$

where |Xk,i| denotes the cardinality of Xk,i. Let W(p) ⊂ W be the subset of windows containing a given point p, and M(p) the set of motion model candidates for p. In our experiments, using the previously mentioned window collection, 33 motion model candidates are available for each pixel (only 30 for pixels lying on the image borders).

2.2. Motion model selection

From the set of motion models M(p) available at each point p, we select the most relevant one based on an ML criterion. At the same time, we get the motion type (translation, scaling, rotation) of the point, as explained below.

Assuming the residuals ε(p, θk,i) are independent, centered and normally distributed, the likelihood of the motion model of parameter θk,i over a patch Np centered in p can be written as:

$$ L(p, \theta_{k,i}) = \prod_{q \in N_p} \frac{1}{\sqrt{2\pi\sigma^2_{k,i}}} \exp\!\left( -\frac{\varepsilon^2(q, \theta_{k,i})}{2\sigma^2_{k,i}} \right), \qquad (6) $$

with σ²k,i the variance defined in (5) and Np a small neighborhood of p, typically 3×3.

The preliminary motion model selection results from:

$$ \theta(p) = \arg\max_{\theta_{k,i} \in \Theta(p)} L(p, \theta_{k,i}), \qquad (7) $$

where Θ(p) is the set of parameters corresponding to M(p).

In the case where θ(p) corresponds to a scaling (resp. rotation) motion, the motion type is accepted provided that the magnitude of the coefficient a1 (resp. a2) is higher than a threshold τ (set to 10⁻³). Otherwise, it is considered as a translation motion, i.e., a1 (resp. a2) is set to 0. This thresholding prevents us from accepting very local and small scaling or rotation motions, which actually correspond to irrelevant information. It can be viewed as an empirical model dimension penalization to countervail the tendency of the ML criterion to favor the more complex model. It also allows us to limit perspective effects, inherent to our view-based classification scheme.
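Putting Eqs. (6)–(7) and the τ-thresholding together, a point-wise selection could look as follows; the `candidates` list and the `residuals` callback are hypothetical placeholders for the quantities computed in the previous steps, not the authors' API.

```python
# Sketch of the model selection of Eqs. (6)-(7) at a single point p, followed
# by the tau-thresholding on a1 / a2.
import numpy as np

TAU = 1e-3

def select_model(candidates, residuals):
    """candidates: list of (kind, theta, sigma2) gathered from all windows
    containing p; residuals(theta, kind): DFD values eps(q, theta) over the
    3x3 neighborhood N_p (both are placeholders for this sketch)."""
    best, best_loglik = None, -np.inf
    for kind, theta, sigma2 in candidates:
        eps = residuals(theta, kind)
        # log of the Gaussian likelihood of Eq. (6), summed over N_p
        loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma2) - eps ** 2 / (2 * sigma2))
        if loglik > best_loglik:
            best, best_loglik = (kind, np.array(theta, dtype=float)), loglik
    kind, theta = best
    # tau-thresholding: weak scaling / rotation coefficients fall back to translation
    if kind in ('S', 'R') and abs(theta[2]) < TAU:
        kind, theta = 'T', theta[:2]
    return kind, theta
```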

2.3. From motion models to crowd motion classes

From the three translation, scaling and rotation motion models, respectively denoted as T, S, and R, we can define eight crowd motion classes, according to the sign of the coefficients b1, b2, a1, a2 and some combinations of them. We represent these classes by colors: CT = {•, •, •, •}, CS = {•, }, CR = {•, •}, as explained in Table 2. The set of all eight classes is denoted as C = CT ∪ CS ∪ CR. The classification consists in first computing an initial classification map Cinit = {cinit(p) ∈ C | p ∈ S ⊂ Ω}. Then, we regularize it by majority votes.

Table 2. Classes definition

Motion types   Crowd motion classes     Criteria
Translation    • North                  b1 + b2 > 0,  b1 − b2 < 0
               • West                   b1 + b2 < 0,  b1 − b2 < 0
               • South                  b1 + b2 < 0,  b1 − b2 > 0
               • East                   b1 + b2 > 0,  b1 − b2 > 0
Scaling        • Convergence            a1 < 0
                 Divergence             a1 > 0
Rotation       • Clockwise              a2 < 0
               • Counterclockwise       a2 > 0
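The rules of Table 2 translate directly into a small decision function; the class names below stand in for the color codes of the paper.

```python
# Sketch of the classification rules of Table 2, mapping the selected motion
# type and its parameters to one of the eight crowd motion classes.
def crowd_class(kind, theta):
    if kind == 'T':
        b1, b2 = theta[0], theta[1]
        s, d = b1 + b2, b1 - b2
        if s > 0 and d < 0: return 'North'
        if s < 0 and d < 0: return 'West'
        if s < 0 and d > 0: return 'South'
        return 'East'                     # s > 0 and d > 0
    if kind == 'S':
        return 'Convergence' if theta[2] < 0 else 'Divergence'
    return 'Clockwise' if theta[2] < 0 else 'Counterclockwise'
```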

Figure 2. Considered classification trees: (a) flat, (b) full hierarchy (a first level on the motion types T, S, R, then the individual classes), (c) hierarchy on translations only (the four translations gathered under a T node, the scaling and rotation classes kept at the first level).

In our approach, Cinit is obtained from the motion model parameters {θ(p) | p ∈ S} defined in (7). As an illustration, let us consider the point p plotted in Figure 1. Since it belongs to 11 windows, 33 motion models (3) and likelihoods (6) are evaluated. The highest likelihood (7) is obtained with a scaling motion model, hence cinit(p) ∈ CS. Moreover, a1(p) = −0.0044 < −10⁻³, which yields cinit(p) = Convergence (•). An example of initial classification map is given in Figure 3a.

The regularization step is organized as a decision tree with up to two levels. The decisions are taken according to majority votes. We have investigated three different hierarchical trees, depicted in Figure 2.

The flat “tree” consists in classifying the crowd motions directly from Cinit. Here, the class c(p) of any point p is deduced from a majority vote over a wide square window centered in p, denoted by Pp – to capture large group motions, we set its side length to 25% of the image width. The class selection is formally given by:

$$ c(p) = \arg\max_{\lambda \in C} \sum_{q \in P_p} \delta\big(c_{\mathrm{init}}(q) = \lambda\big), \qquad (8) $$

where δ(cinit(q) = λ) = 1 if cinit(q) = λ and 0 otherwise.
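Because Eq. (8) only counts class occurrences inside a box, it can be evaluated densely with box filters (or integral images, see Section 3); below is a sketch assuming integer class labels and a label −1 for undetected points.

```python
# Sketch of the flat vote of Eq. (8): each class becomes a binary map, class
# occurrences are counted over the square window P_p with a box filter, and
# the most frequent class wins at every detected point.
import numpy as np
from scipy.ndimage import uniform_filter

def flat_vote(c_init, n_classes, side):
    """c_init: 2D array of class indices (-1 where no motion was detected)."""
    counts = np.stack([
        uniform_filter((c_init == c).astype(float), size=side, mode='constant')
        for c in range(n_classes)
    ])
    voted = counts.argmax(axis=0)
    voted[c_init < 0] = -1                # keep undetected points unlabeled
    return voted
```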

With the fully hierarchical method, we apply a first majority vote on the motion type (T, S, R) over Pp in order to distinguish translation, rotation and scaling motions (see Figure 3b).

Figure 3. The hierarchical classification process: (a) initial map Cinit, (b) motion type vote, (c) crowd motion class vote. In the Shoar sequence, fish twist counterclockwise. For (a) and (c), the color code is given in Table 2. In (b), blue, green, and orange respectively represent translation, scaling, and rotation motions.

This way, we obtain the motion type m(p) of p:

$$ m(p) = \arg\max_{k \in \{T,S,R\}} \sum_{q \in P_p} \delta\big(c_{\mathrm{init}}(q) \in C_k\big), \qquad (9) $$

where δ(cinit(q) ∈ Ck) = 1 if cinit(q) ∈ Ck and 0 otherwise.

The eight crowd motion classes are then selected by a second majority vote. Within the window Pp, we only consider the points belonging to the same motion type as p:

$$ c(p) = \arg\max_{\lambda \in C_{m(p)}} \sum_{\substack{q \in P_p \\ m(q) = m(p)}} \delta\big(c_{\mathrm{init}}(q) = \lambda\big). \qquad (10) $$
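The two-level vote of Eqs. (9)–(10) can be sketched as follows, written point-wise for clarity (box filters or integral images would be used in practice); the `class_type` array and the label −1 for undetected points are assumptions of this sketch.

```python
# Sketch of the hierarchical vote: Eq. (9) on motion types, then Eq. (10) on
# classes restricted to the points of P_p sharing the voted type.
import numpy as np

def hierarchical_vote(c_init, class_type, half):
    """c_init: class indices (-1 = undetected); class_type: 1D array mapping a
    class index to its motion type index (0=T, 1=S, 2=R); half: half the side
    of the vote window P_p."""
    h, w = c_init.shape
    t_init = np.where(c_init >= 0, class_type[c_init], -1)
    out = np.full_like(c_init, -1)
    for y in range(h):
        for x in range(w):
            if c_init[y, x] < 0:
                continue
            sl = (slice(max(0, y - half), y + half + 1),
                  slice(max(0, x - half), x + half + 1))
            wc, wt = c_init[sl], t_init[sl]
            # Eq. (9): majority vote on the motion type over P_p
            m_p = np.bincount(wt[wt >= 0]).argmax()
            # Eq. (10): class vote restricted to the points of P_p of type m_p
            out[y, x] = np.bincount(wc[(wc >= 0) & (wt == m_p)]).argmax()
    return out
```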

The third classification procedure we have tested is a compromise between the first two. We only gather the four translations in a tree node, while clockwise and counterclockwise rotation motions, along with convergences and divergences, are separated right from the first level. The strategy is summarized in Figure 2c.

The experimental results demonstrate that this third method is the most efficient. It is easy to understand why the flat tree may behave poorly. Consider a hypothetical example where people walk together from North-West to South-East. About half of the translating pixels would be classified as translating to the South, and about the same amount to the East. Then, if slightly more points than either of these two amounts were labeled as rotation, the most represented class would be a rotation motion, whereas translation should predominate by far. This effect is perceptible in Figure 4a.

The fully hierarchical approach might be the most theoretically satisfying one, since it is coherent with the motion estimation paradigm. Arguments to understand the efficiency of the last approach reside in the continuity between classes: continuously modifying θT ∈ R² does not impact the motion type (translation), while, for example, continuously modifying θR ∈ R³ changes a rotation motion into a translation one when a2 ∈ [−τ, τ]. Hence, the four translation motions are pairwise adjacent, while clockwise and counterclockwise rotation motions are “separated” by translation ones, and so are convergent and divergent motions.

Figure 4. Illustration of the effect of the classification tree on the synthetic sequence Corridor: (a) flat, (b) full hierarchy, (c) hierarchy on translations only. The threshold τ is set to 10⁻³ for all trees. Some rotation motions (•) due to perspective effects are discarded with the hierarchical classification of translation motions.

Therefore, a South translation motion and an East one are close to each other, but clockwise and counterclockwise rotation motions can be considered as opposed, and so aggregating their scores in the vote can lead to an erroneous decision, as illustrated in Figure 4b. Finally, gathering translation motions is coherent with the fact that we arbitrarily chose the number of detected translation directions.

3. Implementation and experimental results

Algorithm 1

▷ Motion detection
Determine the moving regions S [3].
▷ Motion estimation
for each Wi ∈ W do
    Estimate the motion parameters θk,i, k ∈ {T, S, R} [9].
▷ Model selection
for each p ∈ S do
    for each θk,i ∈ Θ(p) do
        Evaluate the variable ε(p, θk,i) (4).
        Evaluate the likelihood L(p, θk,i) (6).
    Select the best model θ(p) (7).
▷ Crowd motion classification
for each p ∈ S do
    Determine cinit(p) (Table 2).
for each p ∈ S do
    Select the crowd motion class c(p) (Figure 2c).

Algorithm 1 can be massively parallelized in the estimation loop, thanks to the independence of the windows and motion models involved. Moreover, the memory load is far smaller than the pseudo-code of Algorithm 1 might suggest, since for each point only the motion model with the best likelihood is stored. Also of interest is the systematic use of integral images (introduced in [4]) in equations (6), (8), (9), and (10), reducing the O(|Np|) and O(|Pp|) sums to additions of four terms, which do not depend on the window dimensions. These optimizations allow a fast computation: on a laptop with a 4-core 2.3 GHz processor and 8 GB of 1.6 GHz memory, classifying a 720×576-pixel frame takes 1 to 9 seconds, depending on the number of moving points inside the image.
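As a reminder of how the summed-area table [4] turns window sums into four lookups, independently of the window size, a minimal sketch:

```python
# Integral image (summed-area table) and constant-time box sums.
import numpy as np

def integral_image(f):
    s = np.cumsum(np.cumsum(f, axis=0), axis=1)
    return np.pad(s, ((1, 0), (1, 0)))    # zero row/column for easy indexing

def box_sum(s, y0, x0, y1, x1):
    """Sum of f over the half-open rectangle [y0, y1) x [x0, x1)."""
    return s[y1, x1] - s[y0, x1] - s[y1, x0] + s[y0, x0]
```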

Figure 5. Tests on 50-frame generated sequences exhibiting a single crowd motion class: (a) Corridor (people walking rightward) corresponds to Eastward translation (•), (b) Escape (people gathering at the exit) to convergence (•), and (c) Mekkah (people circling clockwise) to clockwise rotation (•); results on frame 1. Cardinals of each class and true positive rates (TPR) are reported in the table below. The TPR is computed as the proportion of detected moving points which are correctly classified over the sequence.

Class counts    Corridor     Escape       Mekkah
# •             0            7            59
# •             0            0            1,601
# •             0            9            34
# •             2,392,551    417          78
# •             15,191       1,179,719    3,200
#               129,025      4,939        150
# •             7,768        4,138        2,447,843
# •             35,782       1,885        0
TPR             92.7%        99.0%        99.7%

Figure 6. The Marathon bend sequence; (a) results on frame 1, (b) frame 21, (c) frame 41. People run from upper left to upper right, describing a U. The movement is quite constant in the whole sequence and so is the classification: in the left branch, people go South (•), then turn counterclockwise (•) until the end of the bend. Some Eastward translation (•) is sometimes found here because of the large radius of curvature. Finally, the North translation is recovered (•). The points in the upper right corner of the image are classified as translations to the West (•), but the translation direction is closer to North than to West (North-North-West): it is also due to the lateral presence of pedestrians walking to the left.

Figure 7. Demonstration, a video where the perspective effect is strong but marginally affects the method performance; (a) results on frame 1, (b) frame 21, (c) frame 41. The demonstrators pass in front of the camera from right to left, corresponding to a West translation (•). Visually, the perspective is not negligible, but the τ-thresholding on a1 nearly discards the scaling motion type. Only a few pixels are classified as divergent in the entire sequence.

Moreover, a quarter of the computation time lies in the initial motion detection stage, whose optimization could lead to a substantially faster computation.

We have evaluated our method on both computer-generated image sequences [1] and real datasets collected by [11, 13]. All the reported results have been obtained with the same parameters and options: τ = 10⁻³, Np is a 3×3 neighborhood of p, the side length of Pp is 25% of the image width, and the classification is hierarchical only for translation motions. The processed sequences (Figures 5 to 9) exhibit various situations in terms of angles of view, velocities, densities, and motion classes. Detailed comments are given in the respective captions.

In most cases, results are accurate (Figure 5), even if the segmentation is inaccurate (Figures 7 and 8) or perspective effects are not negligible (sequence Demonstration).

In Figure 7, the convergence class, which could appear due to the perspective, is seldom selected owing to the thresholding on a1 (Subsection 2.2). However, if it is relevant for a given application to identify people coming closer to the camera, this can be achieved by setting τ to a lower value. In Figures 6 and 8, translation motions are sometimes found instead of a rotation one due to the wide radius of curvature. In Figures 6 and 7, we display one frame out of twenty, but the classification is stable over the sequences, as shown in Figure 8b depicting a so-called kymograph, representing the evolution in time of the class labels of a line of points.

To outline the difference in approach between [13] and our method, we have analyzed the sequence Roundabout, which was presented by Solmaz et al. Figure 9 compares the results of both methods, which do not deliver the same kind of information: the method of [13] classifies trajectories over a time interval and around critical points, while ours delivers a pixel-wise, frame-based motion classification.

Figure 8. Stability over time (100 frames) illustrated on Shoar: (a) frame 1, with the line section y = 200 pointed by an arrow; (b) kymograph of that line from frame 1 to frame t. The temporal evolution of the crowd motion classes for the horizontal line pointed by the arrow in (a) is illustrated in (b). The dark areas in the kymograph correspond to undetected motions. The stability of the classification is not perfect but, within the detected moving points, 86.9% are correctly classified (•). Most misclassifications are due to locally low rotation motion curvatures leading to translation selection (•, •, •, •).

Figure 9. Results comparison between (a) the method of Solmaz et al. [13] and (b) ours. In the Roundabout sequence, cars arrive from the bottom left, turn left and quit the roundabout in the upper half of the frame. In (a), resp. (b), a lane (+), resp. an East translation (•), is detected at the bottom, and an arch, resp. a counterclockwise rotation (•), above it. In (b), another translation to the North is found (•), and a car going leftward (•). Irrelevant classification in the background trees (•) and over the fountain (•) is due to motion detection errors.

Thus, even if the method might be prone to instability, it can capture short and spatially localized events.

4. Conclusions

We have proposed an original method to classify crowd motions in videos on a frame basis. It exploits three simple image motion models computed over a collection of windows in the image. A preliminary selection among motion model candidates is performed at every point using an ML criterion. The final crowd motion classification is achieved with a decision tree regularized with majority votes. Eight crowd motion classes are considered. Moreover, since the whole classification process only requires two consecutive frames, even short events can be captured.

The algorithm is fast and requires neither a learning stage, nor fine parameter tuning, nor trajectory computation. The experiments we have carried out demonstrate the accuracy and efficiency of our approach in various situations.

As future work, we plan to exploit our instantaneous crowd motion classification for anomaly detection and temporal scenario recognition.

This project is partially supported by Region Bretagne (Brittany Coun-cil) through a contribution to A. Basset’s Ph.D. student grant.

References

[1] P. Allain, N. Courty, and T. Corpetti. Agoraset: a dataset for crowd video analysis. In 1st Int. Workshop on Pattern Recognition and Crowd Analysis, ICPR'12, Tsukuba, Nov. 2012.
[2] A. M. Cheriyadat and R. J. Radke. Detecting dominant motions in dense crowds. J. Selected Topics in Sig. Processing, 2(4):568–581, Aug. 2008.
[3] T. Crivelli, P. Bouthemy, B. Cernuschi-Frías, and J.-F. Yao. Simultaneous motion detection and background reconstruction with a mixed-state conditional Markov random field. Int. J. Comp. Vis., 94(3):295–316, 2011.
[4] F. C. Crow. Summed-area tables for texture mapping. ACM SIGGRAPH Comp. Graphics, 18(3):207–212, Jul. 1984.
[5] J. Feng, C. Zhang, and P. Hao. Online learning with self-organizing maps for anomaly detection in crowd scenes. In 20th Int. Conf. Pattern Recog., ICPR'10, Istanbul, Aug. 2010.
[6] E. François and P. Bouthemy. Derivation of qualitative information in motion analysis. Image and Vis. Comp., 8(4):279–288, Nov. 1990.
[7] M. Hu, S. Ali, and M. Shah. Learning motion patterns in crowded scenes using motion flow field. In 19th Int. Conf. Pattern Recog., ICPR'08, Tampa, Dec. 2008.
[8] L. Kratz and K. Nishino. Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models. In IEEE Conf. Comp. Vis. and Pattern Recog., CVPR'09, Miami Beach, Jun. 2009.
[9] J.-M. Odobez and P. Bouthemy. Robust multiresolution estimation of parametric motion models. J. Visual Communication and Image Representation, 6(4):348–369, Dec. 1995.
[10] M. Rodriguez, S. Ali, and T. Kanade. Tracking in unstructured crowded scenes. In 12th IEEE Int. Conf. Comp. Vis., ICCV'09, Kyoto, Sep. 2009.
[11] M. Rodriguez, J. Sivic, I. Laptev, and J.-Y. Audibert. Data-driven crowd analysis in videos. In 13th IEEE Int. Conf. Comp. Vis., ICCV'11, Barcelona, Nov. 2011.
[12] D. Ryan, S. Denman, C. Fookes, and S. Sridharan. Textures of optical flow for real-time anomaly detection in crowds. In 8th IEEE Int. Conf. Advanced Video and Sig. Based Surveillance, AVSS'11, Klagenfurt, Aug. 2011.
[13] B. Solmaz, B. E. Moore, and M. Shah. Identifying behaviors in crowded scenes using stability analysis for dynamical systems. IEEE Trans. Pattern Analysis and Machine Intel., 34(10):1–8, 2012.
[14] X. Wang, K. T. Ma, G. Ng, and W. E. L. Grimson. Trajectory analysis and semantic region modeling using nonparametric hierarchical Bayesian models. Int. J. Comp. Vis., 95(3):287–312, 2011.
[15] B. Zhan, D. N. Monekosso, P. Remagnino, S. A. Velastin, and L.-Q. Xu. Crowd analysis: a survey. Machine Vis. and Applic., 19(5–6):345–357, 2008.
[16] B. Zhou, X. Wang, and X. Tang. Random field topic model for semantic region analysis in crowded scenes from tracklets. In IEEE Conf. Comp. Vis. and Pattern Recog., CVPR'11, Colorado Springs, Jun. 2011.
[17] B. Zhou, X. Tang, and X. Wang. Coherent filtering: detecting coherent motions from crowd clutters. In 12th Eur. Conf. Comp. Vis., ECCV'12, Firenze, Oct. 2012.