Scene-Independent Group Profiling in Crowd. CVPR 2014. openaccess.thecvf.com/content_cvpr_2014/papers/...



Scene-Independent Group Profiling in Crowd

Jing Shao¹  Chen Change Loy²  Xiaogang Wang¹

¹ Department of Electronic Engineering, The Chinese University of Hong Kong
² Department of Information Engineering, The Chinese University of Hong Kong

[email protected], [email protected], [email protected]

Abstract

Groups are the primary entities that make up a crowd. Understanding group-level dynamics and properties is thus scientifically important and practically useful in a wide range of applications, especially for crowd understanding. In this study we show that fundamental group-level properties, such as intra-group stability and inter-group conflict, can be systematically quantified by visual descriptors. This is made possible through learning a novel Collective Transition prior, which leads to a robust approach for group segregation in public spaces. From the prior, we further devise a rich set of group property visual descriptors. These descriptors are scene-independent, and can be effectively applied to public scenes with a variety of crowd densities and distributions. Extensive experiments on hundreds of public scene video clips demonstrate that such property descriptors are not only useful but also necessary for group state analysis and crowd scene understanding.

1. Introduction

Group dynamics have been extensively studied in socio-psychological [35] and biological [37] research as the primary processes that influence crowd behaviors. In these studies, group dynamics are characterized by both intra- and inter-group properties. Intra-group properties, e.g. collectiveness, stability, and uniformity, denote internal coordination among members of the same group, whilst inter-group properties, e.g. conflict, reflect the external interaction between members of different groups. Such properties widely exist in animal/insect crowd systems (e.g. bacterial colonies and bird flocks), and are also frequently researched in socio-psychological studies [25]. For instance, bacterial colonies were found to exhibit collective behavior to achieve a common goal, i.e. the spreading of diseases [37]. From a sociological viewpoint, conflict occurs through competition for resources or goal incompatibility [35].

In the context of visual surveillance, groups also primarily make up human crowds. Indeed, a rich body of literature [5, 14, 19, 26] suggests that the majority of pedestrians tend to move in groups with their friends and family members. The tendency to form a coherent group with other pedestrians becomes more prominent in dense crowds, where pedestrians have to align with others to form collective behaviors instead of moving freely [26].

Figure 1. Crowd behavior can be better understood through inherent intra- and inter-group properties. In this study, we show the possibility of quantifying such properties with scene-independent visual descriptors. Best viewed in color.

When pedestrians form groups, they exhibit some interesting properties in their dynamics, which share commonalities with socio-psychological and biological studies (Fig. 1). For instance, collective behavior is observed when pedestrians in a group maneuver towards a common destination. During a crowd disaster, turbulent dynamics in the crowd can be characterized by the stability property. A crowd tends to have a non-uniform distribution when its members have different social relationships and walk in a less restricted area. Two pedestrian groups with different goals, e.g. when crossing roads from different directions, exhibit conflict behavior. Clearly, understanding such properties provides a critical mid-level representation for crowd motion analysis [3, 24, 17, 15], and could facilitate other high-level semantic analysis such as crowd scene understanding, crowd video classification, and crowd event retrieval.

Our goal is to characterize and quantify these group properties from a vision point of view, and study their potential in crowd behavior analysis and crowd scene understanding. We consider a group beyond just a collection of spatially proximate individuals: it is also a dynamic unit that exhibits various fundamental intra- and inter-group properties, which can be used to compare group activities across different crowd systems. To our knowledge, this study is the first attempt in computer vision to investigate comprehensively and systematically the universal properties of groups in crowds. We make the following contributions:

1) A robust group detector - We introduce a novel Collective Transition (CT) prior to capture the underlying dynamics of a group. Based on the prior, we formulate a robust group detector that outperforms state-of-the-art methods [11, 39].

2) Scene-independent group descriptors - Based on the CT prior, we devise a set of visual descriptors to quantify four fundamental intra- and inter-group properties, namely collectiveness, stability, uniformity, and conflict. These descriptors convey richer group-level information in comparison to conventional group size and velocity information [9]. Importantly, these descriptors are scene-invariant and robust across public scenes with a variety of crowdedness levels.

3) Group-driven crowd scene understanding - We show that the proposed descriptors are effective in identifying intrinsic group states (gas, fluid, and solid) following the common analogy employed in the crowd modeling literature [31, 13, 12]. We also demonstrate their superiority for scene-independent group state analysis and crowd video classification over existing activity descriptors [16].

Experiments are conducted on hundreds of video clips collected from over 200 crowded scenes. The dataset and the ground truth are made publicly available to facilitate future research in group-level crowd analysis¹.

2. Related Work

Most existing imagery-based crowd analysis methods tend to treat a crowd either as a collection of individuals [10, 23, 27] or as an aggregated whole [3, 8, 24, 17]. In contrast to these studies, we analyze crowds at the group level. The object-centered approaches require explicit detection and segmentation of individuals from the crowd; these techniques are infeasible in crowded scenes where inter-object occlusion is severe. The activity representations employed by holistic methods, e.g. optical flow codewords [17, 22], dynamic textures [8], and grids of particles [3, 24], are useful for learning scene-level spatio-temporal patterns, but are not directly applicable to learning group-level properties, which requires finer group segregation.

State-of-the-art methods [39, 11] achieve group detection through tracklet clustering. Zhou et al. [39] present the Coherent Filtering (CF) approach for segmenting coherent motion in crowds, whilst Ge et al. [11] discover small groups by hierarchical clustering based on pairwise object velocity and distance. As shown in our experiments, the above methods are either too sensitive to tracking noise or unscalable to extremely crowded scenes. Importantly, neither of them learns group properties further nor analyzes crowd behavior at the group level.

¹ http://www.ee.cuhk.edu.hk/~xgwang/CUHKcrowd.html

A number of approaches [2] have been proposed for recognizing group activities such as meeting and fighting. These studies tend to analyze small social groups, with specific focus on scenario-specific predicate learning [9], contextual and interaction modeling [4, 10, 19, 21], and social signaling analysis [6]. Moreover, many crowd modeling approaches are scene-specific [17, 34, 41], i.e. activity models learned from a specific scene cannot be applied to other scenes. Our work differs significantly from the aforementioned studies: (i) we focus on understanding and quantifying the fundamental group properties, which generalize well to different crowd systems, and (ii) we focus on crowded scenes where a group may have an arbitrarily large number of members (see Fig. 1).

3. Profiling Group Properties

We consider a group as a set of members with a common goal and collective behaviors. Given a short video clip of τ frames, a set of groups {G_i}_{i=1}^m is detected. Each group G_i encompasses a set of tracklets {z} detected by the KLT feature point tracker. From each detected group, we wish to extract a set of visual descriptors to represent its properties.

3.1. Collective Transition Prior

Precise group detection in a crowd is challenging due to complex interactions among pedestrians. We assume that pedestrian movements in a scene are intimately governed by a finite number of Collective Transition (CT) priors. These priors are discovered simultaneously with the group detection process. We show that group detection can be made more robust by considering the temporal smoothness and consistency enforced by the priors. Furthermore, we demonstrate in Sec. 3.3 that certain group properties can be readily derived from the discovered CT priors.

Each pedestrian group has a specific CT prior, which can be discovered from a video clip. More precisely, for n tracklets {z_k}_{k=1}^n, we assume there exist m Markov chains, where m < n and m is inferred automatically. Each Markov chain is a time-series model of the form

    z_k^t = A z_k^{t-1} + v^t,    (1)

where the continuous observation z_k^t evolves by a transition matrix A ∈ R^{3×3}, and Gaussian noise v^t ~ N(0, Q) is assumed between transitions. Let z_k^t = [x^t, y^t, 1]^T represent the position of a pedestrian in homogeneous coordinates², and let the initial observation z_k^1 follow a Gaussian distribution N(µ, Σ). We denote Θ = {A, Q, µ, Σ} as the parameters of the chain. A represents the CT prior, which reveals

² A represents affine transforms. Translation, contraction, expansion, dilation, rotation, shear, and their combinations are all affine transforms.



the collective motions of all the members in a group, while {µ, Σ} ensure that group members are spatially proximate at the initial frame. Next, we discuss how to learn this prior and perform robust group detection simultaneously.

Figure 2. (a) Coherent filtering [39] fails to distinguish two subtle groups. We address this by discovering (b) a representative anchor tracklet (in red) and subsequently (c) a set of seeding tracklets to infer a group-specific CT prior. (d) With refinement based on the CT prior, the two groups are separated.

3.2. Group Detection by Collective Transition

The key idea is to search for pedestrian groupings that fit well to the priors discovered within the video clip. The method permits fragmented tracklets that fail to sustain over the whole clip: the missing data of z_k can be inferred with EM. It is thus suitable for group detection in dense crowds. In addition, it relies on local spatio-temporal relationships and velocity correlations without assumptions on the global shape of the pedestrian group. Therefore it can be applied to scenes with different scales and perspectives.

The key steps of learning the CT priors for group discovery are summarized in Alg. 1.

Step-1: Generate coherent filtering clusters: We first discover a set of initial tracklet clusters {C_j}_{j=1}^r using CF [39] (Fig. 2(a)). These clusters do not align with our group perception perfectly but can serve as the basis for finding the final tracklet groups {G_i}_{i=1}^m.

Step-2: Identify anchor tracklets: The iterative scheme begins by randomly picking a cluster C_i and finding its anchor tracklet z*_i with long duration and low variance (Fig. 2(b)).

Step-3: Discover seeding tracklets for learning CT priors: As shown in Fig. 2(c), a set of seeding tracklets, S_i, is selected with the following criteria: (1) they are also from C_i; and (2) they have high velocity correlation with z*_i,

    ⟨v_{z∈C_i}, v_{z*_i}⟩ / (‖v_{z∈C_i}‖ · ‖v_{z*_i}‖) > η,    (2)

where η is a threshold. S_i includes reliable tracklets and is used to learn a representative CT prior with EM, which will be used to refine the group itself in Step-4.

Algorithm 1: Group detection by collective transition.
  Input: Tracklets {z_k}_{k=1}^n in a video clip.
  Output: m tracklet groups, {G_i}_{i=1}^m.
  Step-1: i = 1; generate coherent filtering clusters {C_j}_{j=1}^r;
  while {C} ≠ ∅ do
      Step-2: Identify an anchor tracklet, z*_i;
      Step-3: Discover the seeding tracklet set S_i from C_i;
              learn the collective transition prior A_i with S_i;
      Step-4: Perform group refinement to discover G_i;
      {G} = {G} ∪ G_i; {C} = {C} \ G_i; i = i + 1;
  end

Step-4: Group refinement: We fit each tracklet z in the initial cluster C_i with the prior A_i of the i-th Markov chain. The fitting error ε of a tracklet is defined as

    ε = (1/(τ−1)) Σ_{t=1}^{τ−1} ‖A z^t − z^{t+1}‖₂².    (3)

Any tracklet with ε < δ is retained to construct G_i. Unqualified tracklets repeat the iterative process to be considered for a different group.

3.3. Group Descriptors for Crowd Scenes

We formulate a set of descriptors to quantify group properties (Table 1). The first three quantify the spatio-temporal evolution of intra-group structure, whilst the fourth characterizes inter-group interaction. Sec. 4 shows that they complement each other to perform well on scene-independent group state analysis and crowd video classification.

Table 1. List of group descriptors.

Property        Descriptor   Equation
Collectiveness  φ_coll(G)    (4)
Stability       Φ_stab(G)    (10)
Uniformity      Φ_unif(G)    (13)
Conflict        Φ_conf(G)    (14)

To facilitate explanation, we make an analogy between a point and a member. A detected group has n members in a frame, which form a K-NN graph G(V, E), whose vertices V represent the members; member pairs are connected by edges E. The edges are weighted by an affinity matrix W, with elements w_ij = exp(−d_ij²/σ²), where d_ij is the spatial distance between two members. We denote the set of nearest neighbors of a member z at every frame of a given clip as N_z^1, ..., N_z^τ. Next we discuss the descriptors in detail.

Collectiveness: The collectiveness property indicates the degree to which individuals act as a union in collective motion. It is a fundamental and universal measurement for various crowd systems [32, 40]. A collectiveness measurement for a whole video was proposed in [40] using manifold learning. In contrast, we quantify collectiveness at the group level with the proposed collective transition prior A, since it captures the coherent motion of all group members. In particular, we compute the collectiveness of group G as

    φ_coll(G) = (1/|G|) Σ_{z∈G} ε(z, A),    (4)

where | · | denotes the cardinality of the input set, and ε(z, A) is defined in Eqn. (3).

Since ε measures the deviation from the shared CT prior, a low value of φ_coll(G) indicates that the members of a group move coherently towards a common destination. The descriptor is useful for distinguishing low-collectiveness groups, e.g. in a train station or a wet market, from high-collectiveness groups, e.g. observed during a marathon or on an escalator.

Stability: The stability property characterizes whether a group keeps its internal topological structure over time. It is analogous to molecular stability in a chemical system. In particular, stable members tend to (1) maintain a similar set of nearest neighbors; (2) keep consistent topological distances to their neighbors throughout a clip; and (3) rarely leave their current nearest neighbor set. Following this idea, we formulate three stability descriptors.

We compute the first stability descriptor by counting and averaging the number of invariant neighbors of each member in the K-NN graph over time:

    φ_stab_a(G) = (1/|G|) Σ_{z∈G} (K − |N_z^1 \ N_z^τ|),    (5)

where |N_z^1 \ N_z^τ| = |{z′ : z′ ∈ N_z^1 and z′ ∉ N_z^τ}|.
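Eq. (5) counts, per member, how many of its K nearest neighbors survive from the first to the last frame. A minimal sketch, with neighbor sets stored as plain Python sets (the data layout is our own assumption):

```python
def stability_a(first_nn, last_nn, K):
    """Eq. (5): average number of invariant K-NN neighbors per member.
    first_nn / last_nn map each member id to its neighbor set at frame 1
    and frame tau, respectively."""
    members = list(first_nn)
    # K minus the neighbors lost between the first and last frame.
    invariant = [K - len(first_nn[z] - last_nn[z]) for z in members]
    return sum(invariant) / len(members)

# Three members, K = 2; only member 'b' swaps one neighbor over the clip.
first = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}}
last = {'a': {'b', 'c'}, 'b': {'a', 'd'}, 'c': {'a', 'b'}}
score = stability_a(first, last, K=2)
```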

The second stability descriptor is formulated to examine whether the members keep consistent topological distances to their nearest neighbors. This is achieved by first ranking the nearest neighbors of a member z in accordance with their pairwise affinity, and subsequently applying the Levenshtein string metric distance d_z^t [20] to compare the rankings at every two consecutive frames. d_z^t = 0 if the two rankings are the same, and d_z^t = K if the ranking indices of all the members have changed. Through collecting d_z^t over τ frames, we construct its histogram with K bins, h(z), for each member z. The second stability descriptor is then obtained as an averaged histogram

    Φ_stab_b(G) = (1/|G|) Σ_{z∈G} h(z).    (6)

It reveals information about the change of topological distances between members in a group.
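The ranking comparison uses the standard Levenshtein distance. A self-contained sketch applied to neighbor rankings at two consecutive frames (the ranking lists are illustrative):

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two
    sequences: 0 for identical rankings, up to K when all K ranking
    positions change."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

# Neighbors of a member ranked by affinity at frames t and t+1.
d_same = levenshtein(['b', 'c', 'd'], ['b', 'c', 'd'])
d_swap = levenshtein(['b', 'c', 'd'], ['c', 'b', 'd'])
```

Collecting d_z^t over the clip and histogramming it with K bins then gives h(z).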

The third stability descriptor measures how likely a member would depart from its existing nearest neighbor set. We assume random walk behavior on all the group members, i.e. we allow the members to transit freely within the group and join other members to form new neighborhoods. We then measure the stability of a member as the difference between its initial and final transition probabilities. We initialize the transition probability matrix P ∈ R^{n×n} as

    P = D⁻¹ W,    (7)

where D is a diagonal matrix whose elements are D_ii = Σ_j w_ij. The probability distribution of the i-th member 'walking' to and 'joining' other members is defined by

    q_i = e_i^T [(I − αP)⁻¹ − I],    (8)

where q_i ∈ R^{1×n}, I is the identity matrix, and e_i is an indicator vector whose i-th entry is 1 and all other entries are 0. The parameter α has the range 0 < α < 1/ρ(P), where ρ(P) denotes the spectral radius of P. We set α = 0.9/K. The stability of the i-th member is computed by measuring the Kullback-Leibler (KL) divergence [18] of q_i between the first and final frames. A lower KL-divergence score s_kl suggests higher stability. We compute the third stability descriptor by averaging the scores across all members:

    φ_stab_c(G) = (1/|G|) Σ_{z∈G} s_kl(z).    (9)

Figure 3. (a) and (b) show a uniform and a non-uniform group, respectively. From left to right: the original coherent group detection, the sub-groups obtained through further clustering, and the optimal number of cuts inferred by the modularity function.
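Eqs. (7)-(8) and the KL comparison can be sketched with numpy. The function names are our own, and we normalize the profiles before the KL computation, which the paper does not spell out:

```python
import numpy as np

def walk_profile(W, i, alpha):
    """q_i = e_i^T [(I - alpha P)^{-1} - I] with P = D^{-1} W (Eqs. (7)-(8)):
    the discounted probability profile of member i walking to and joining
    the other members. Requires 0 < alpha < 1/rho(P)."""
    P = W / W.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    n = W.shape[0]
    M = np.linalg.inv(np.eye(n) - alpha * P) - np.eye(n)
    return M[i]

def kl_score(q_first, q_last, eps=1e-12):
    """s_kl: KL divergence between the normalized first- and last-frame
    profiles; lower means a more stable member."""
    p = q_first / q_first.sum()
    q = q_last / q_last.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Fully connected triangle of members: if the affinity structure is
# unchanged at the last frame, the member's s_kl is zero.
W = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
q0 = walk_profile(W, 0, alpha=0.9)
```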

The final stability descriptor is the concatenation

    Φ_stab(G) = [φ_stab_a(G), Φ_stab_b(G), φ_stab_c(G)].    (10)

Uniformity: Uniformity is an important property for characterizing the homogeneity of a group in terms of spatial distribution. This is in contrast to the two previous properties, which measure temporal aspects. A group is uniform if its members stay close to each other and are evenly distributed in space. A non-uniform group has a tendency to be further divided into subgroups. A comparative example of uniform and non-uniform groups is shown in Fig. 3(a) and 3(b).

We quantify uniformity by inferring the optimal number (c*) of graph cuts on the K-NN graph. A higher c* suggests a higher degree of non-uniformity. A hierarchy of clusters (H) is generated with agglomerative clustering [38], and the modularity function Q [30] is used to find c*. Specifically, given a cluster number c, a graph partition {V_1, ..., V_c} is obtained from H. Computing Q_c for c ∈ {1, ..., C} and taking its maximum suggests the optimal number of cuts:

    c* = argmax_{c ∈ {1,...,C}} Q_c,    (11)

given

    Q_c = Σ_{i=1}^{c} [ A(V_i, V_i)/A(V, V) − (A(V_i, V)/A(V, V))² ],    (12)

where A(V′, V″) = Σ_{i∈V′, j∈V″} w(i, j). Examples are shown in the last column of Fig. 3; a non-uniform group yields a relatively higher number of cuts.

Since the uniformity of a group may change as the group evolves, we measure its uniformity by the mean µ_c* and variance σ_c* of the optimal number of cuts over time:

    Φ_unif(G_i) = {µ_c*, σ_c*}.    (13)

Figure 4. (a) Conflict locations in groups. Hot color indicates a high degree of conflict. (b) Conflict distribution of the group. The map is normalized with the group's center, moving direction, and the largest distance from the group's contour to its center.
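The modularity of Eq. (12) for a candidate partition can be computed directly from the affinity matrix. A minimal sketch (the function and example graph are illustrative):

```python
import numpy as np

def modularity(W, labels):
    """Eq. (12): Q_c = sum_i [A(V_i,V_i)/A(V,V) - (A(V_i,V)/A(V,V))^2]
    for the partition encoded by labels; W is the symmetric affinity matrix."""
    total = W.sum()                         # A(V, V)
    Q = 0.0
    for c in set(labels):
        idx = [k for k, l in enumerate(labels) if l == c]
        within = W[np.ix_(idx, idx)].sum()  # A(V_i, V_i)
        degree = W[idx, :].sum()            # A(V_i, V)
        Q += within / total - (degree / total) ** 2
    return Q

# Two disconnected pairs: cutting into the two natural sub-groups scores
# higher than keeping a single cluster, so c* = 2 here.
W = np.array([[0, 1, 0, 0], [1, 0, 0, 0],
              [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
q_two = modularity(W, [0, 0, 1, 1])
q_one = modularity(W, [0, 0, 0, 0])
```

Sweeping c over the agglomerative hierarchy and keeping the argmax reproduces Eq. (11).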

Conflict: The conflict property characterizes interaction/friction between groups when they approach each other. The spatial distribution and level of conflict experienced by a group can be visualized on a 2D normalized map as shown in Fig. 4. Such a map is informative for crowd understanding as it contains rich information about the different natures of inter-group interactions observed in different scenes. On this map, the group contour is obtained as the outer boundary of the internal members, whereas a conflict point is defined as a member with external group members in its K-NN set, N. Note that the K-NN sets defined here differ from those we employed earlier, as the current sets are allowed to include members from external groups.

To represent the conflict map compactly with invariance to scale, we formulate a Conflict Shape Context (CSC) descriptor inspired by shape context [7]. The first step is to capture the spatial distribution for each conflict point by computing a histogram of the relative coordinates of group contour points. This is achieved by introducing a polar coordinate system [7] centered on each conflict point, and computing the frequency of contour points in the bins; 8 equally spaced angle bins and 5 equally spaced radius bins are used. The second step is to perform K-means clustering over training clips to build a vocabulary on the histograms, and produce a Bag-of-Words (BoW) representation. Using locality-constrained linear coding [33], the i-th conflict point has a distribution u_i over the vocabulary. We further compute the level of conflict of this conflict point based on the CT prior introduced in Sec. 3.1:

    ε_i^conf = (1/|N_i|) Σ_{z∈N_i} ε(z, A),    (14)

where ε(z, A) is defined in Eqn. (3), and A is the CT prior of the group where the conflict point resides. Intuitively, if the nearest neighbors of a conflict point are mostly external members that do not fit A well, a high value of ε^conf is obtained. The final conflict property Φ_conf(G) of a group is computed by max pooling {u_i} weighted by {ε_i^conf}, as in [33].

4. Applications and Experimental Results

We evaluate group detection, and demonstrate the effectiveness of our descriptors on two applications: group state analysis and crowd video classification. Both are scene-independent.

4.1. Crowd Database

Evaluations are conducted on a new CUHK Crowd Dataset. It includes crowd videos with various densities and perspective scales, collected from many different environments, e.g. streets, shopping malls, airports, and parks. It consists of 474 video clips from 215 scenes, among which 419 clips were collected from Pond5³ and Getty Images⁴, and 55 clips were captured by us. It is larger than any existing crowd dataset [3, 29, 40] (they are in fact covered by our dataset) in terms of scene diversity and number of clips. Although the video clips have various lengths, we only take the first 30 frames from each clip for our approach⁵. The full video clips are available in the dataset. The ground truth for group detection, group state analysis, and crowd video classification is manually annotated and checked by multiple annotators.

4.2. Group Detection

Tracklets from 300 video clips are manually annotated into groups for evaluation, based on the criterion that members of the same group have a common goal and form collective movement. Tracklets not belonging to any group are annotated as outliers. We compare our group detection method using Collective Transition priors (CT) with three state-of-the-art approaches: mixture of dynamic textures (DTM) [8], hierarchical clustering (HC) [11], and coherent filtering (CF) [39]. Examples of the ground truth and the detection results are shown in Fig. 5.

³ http://www.pond5.com/
⁴ http://www.gettyimages.com/
⁵ We found that considering longer frames does not make a significant difference in our evaluation performance, because group motions in each segmented clip remain similar across its whole length.



Figure 5. Comparative results of group detection with four methods (rows: DTM [8], HC [11], CF [39], CT, and the ground truth). Groups are distinguished with colors; red indicates outliers. Arrows are moving directions. Best viewed in color.

DTM well separates the background and simple group motions, but it performs poorly on complex and mixed group motions. Besides, it requires manual specification of the group number (we provide the ground truth as input), and for each clip it takes a hundred-fold longer time than our method. HC hierarchically clusters tracklets with velocity and spatial constraints and does not consider a group dynamic prior; it thus leads to more errors than ours. CF detects coherent motions with a neighborhood measurement without modeling the dynamics shared by the whole group, and is thus sensitive to tracking failures. This can be observed in the first column of Fig. 5, where CF splits a group moving in the same direction into subgroups. Moreover, CF first detects groups with coherent motions between consecutive frames, and then associates the groups through the whole clip; its errors are therefore accumulated. In the second and third columns of Fig. 5, CF associates two groups moving in different directions into one due to errors made in single frames.

For quantitative evaluation, we consider group detection as a clustering problem, and adopt three measurements widely used in clustering evaluation, i.e. Normalized Mutual Information (NMI) [36], Purity [1], and Rand Index (RI) [28]. The comparison is shown in Fig. 6; the bar chart on the right shows the relative improvement of our method compared with DTM, HC, and CF.

Methods    NMI    Purity    RI
DTM [8]    0.30   0.68      0.71
HC [11]    0.27   0.62      0.73
CF [39]    0.42   0.73      0.78
CT         0.48   0.78      0.83

Figure 6. Left: quantitative comparison of group detection methods. Right: relative improvement of our approach (CT) compared with DTM, HC, and CF.
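Purity and the Rand Index used above follow their standard definitions. Minimal sketches (the paper's exact evaluation protocol, e.g. how annotated outliers are handled, is not reproduced here):

```python
import numpy as np

def purity(pred, truth):
    """Fraction of samples that fall in the majority ground-truth class of
    their predicted cluster."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    correct = 0
    for c in np.unique(pred):
        counts = np.bincount(truth[pred == c])
        correct += counts.max()
    return correct / len(truth)

def rand_index(pred, truth):
    """Fraction of sample pairs on which the two clusterings agree
    (same cluster in both, or different cluster in both)."""
    n = len(pred)
    pairs = n * (n - 1) // 2
    agree = sum((pred[i] == pred[j]) == (truth[i] == truth[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / pairs

p = purity([0, 0, 0, 1], [0, 0, 1, 1])      # one sample misplaced
r = rand_index([0, 0, 0, 1], [0, 0, 1, 1])
```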

Figure 7. Distributions of different types of groups (fluid, gas, solid) in a shopping mall and an escalator scene. Colors indicate group states automatically recognized with our descriptors. Best viewed in color.

4.3. Application I: Group State Analysis

Research on crowd modelling and analysis [31, 13, 12] generally classifies crowd particles into the following states (impure fluid is added by us), with an analogy to classifying different phases of matter in equilibrium statistical mechanics. It is assumed that the underlying physical models are different for different states.

• Gas: particles moving in different directions without forming collective behaviors with others.

• Solid: particles moving in the same direction collectively. Their relative positions remain unchanged, bounded by internal forces.

• Pure fluid: particles moving in the same direction; however, their relative positions change constantly due to the lack of inter-particle forces.

• Impure fluid: similar to pure fluid, but with invasion of particles from other groups.

These states are determined by multiple socio-psychological and physical factors, including crowd density, goals, the interactions and relationships of group members, and scene structures. As the examples in Fig. 7 show, in a large open area pedestrians behave more like gas and fluid, while they move as a flying solid on an escalator track or in a queue. In the figure on the left, fluid groups appear frequently on the paths connecting entrance and exit regions, while gas groups are located randomly; they are isolated customers walking around. In the figure on the right, the states of groups transit between solid and fluid at the exits of escalators. In a scene where crowds compete for resources, they behave like fluid. Group states reflect these factors well and are of interest in various applications. We use the proposed group descriptors to classify crowd groups into states, which is useful for crowd scene understanding.
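In the paper the states are recognized with an SVM over the proposed descriptors (Sec. 4.3); purely as an illustration of how descriptor values relate to the four states, a toy rule-based mapping might look as follows. The threshold `t` and the assumption that descriptors are normalized to [0, 1] are hypothetical, not part of the paper:

```python
def group_state(collectiveness, stability, conflict, t=0.5):
    """Toy mapping from descriptor values (assumed in [0, 1]) to the four
    group states; the single threshold t is a hypothetical simplification."""
    if collectiveness < t:
        return "gas"            # no collective motion formed with others
    if stability >= t:
        return "solid"          # collective, with rigid relative positions
    if conflict >= t:
        return "impure fluid"   # collective but invaded by other groups
    return "pure fluid"         # collective, loose, no inter-group conflict
```

The real classifier learns these boundaries from data rather than hand-set thresholds, but the ordering of tests mirrors the state definitions above.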

Figure 8. (a) Confusion matrix of classifying group states by combining all the group descriptors. (b) Average accuracy of using each descriptor and of combining them. (c) Performance drop when using only one descriptor to classify each of the group states. A lower bar indicates that the descriptor is more effective at classifying a particular group state. (d) Legend for (b) and (c): 1 ∼ 5 are single descriptors, and 6 is the combination of the five descriptors.

There are 927 groups manually labeled as ground truth: 128 gas groups, 291 solid groups, 349 pure fluid groups, and 159 impure fluid groups. Half of the data is randomly selected for training and the remainder for testing (the training and test sets do not contain the same scenes). All of our proposed descriptors, together with "group size"6, are combined as feature input to an SVM classifier. The confusion matrix averaged over 10 trials is shown in Figure 8(a). The average accuracy7 is 60%, while random guessing achieves 25%. The result shows the effectiveness of our group descriptors and their generalization power across scenes. It is understandable that pure and impure fluid groups are the most easily confused classes, since some pure fluid groups interact with other groups at their boundaries. The major but subtle difference between the two classes is the spatial distribution of conflict points. Tracking errors also increase the difficulty of separating these two classes. Figures 8(b) and 8(c) show the effectiveness of each group descriptor in classifying the different group states. We observe that stability and conflict are the most effective for classifying solid groups, collectiveness and conflict are the most effective for classifying pure fluid groups, and group size is effective for gas groups.
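The class-balanced accuracy described in footnote 7 (per-class accuracy first, then the mean over classes) can be computed from a confusion matrix with a few lines; a minimal sketch:

```python
def per_class_average_accuracy(confusion):
    # confusion[i][j] = number of class-i samples predicted as class j.
    # Average the per-class recalls (diagonal over row sum) so that a
    # large class cannot dominate the overall figure.
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total:
            recalls.append(row[i] / total)
    return sum(recalls) / len(recalls)
```

For an unbalanced confusion matrix such as `[[90, 10], [5, 5]]`, the plain accuracy is about 0.86 but the class-balanced figure is 0.7, which is why the paper reports the latter.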

4.4. Application II: Crowd Video Classification

We also demonstrate the robustness and effectiveness of the proposed group descriptors in the application of classifying crowd videos rather than individual groups. There exist research studies [16, 34] on using holistic descriptors to classify crowd video clips. For instance, Kratz et al. [16] divided a video clip into spatio-temporal cuboids and extracted motion features from each cube. We show that our descriptors, specially designed for quantifying group properties, are much more effective than generic features.

6Group size is the number of tracklets in a group normalized by the total number of tracklets in a scene. It is useful for classifying gas groups. Some groups have one pedestrian with multiple feature points.

7We first calculate the accuracy within each class and then average them, so the largest class does not dominate the average accuracy.

All 474 video clips in our dataset are manually assigned to 8 classes, as shown in Table 2. These classes are commonly seen in crowd videos, and some are of special interest in crowd management and traffic control. For example, crowd merging and crowd crossing may cause traffic congestion and crowd disasters such as stampedes. It is also important to keep escalator traffic smooth at entrance and exit regions to avoid blocking, collisions, and potential dangers. In class 1, pedestrians walk in multiple directions with highly mixed behaviors. In classes 2 and 3, most pedestrians follow the main stream: in class 2 the relative positions of pedestrians are stable and overtaking events are rare, while pedestrians in class 3 are not well organized. Most crowd videos can be classified into the above three categories; however, we identify a few classes (4 ∼ 8) that are of particular interest in crowd management and wish to distinguish them from the remaining crowd videos. Classes 1 ∼ 3 therefore exclude videos belonging to classes 4 ∼ 8. All 8 categories are classified together. Leave-one-out evaluation is used: each time one scene (which may include multiple video clips) is selected for testing and the remaining scenes are used for training, which tests cross-scene generalization. If a video has multiple groups, we take the average of each descriptor over the groups as the video descriptor. An SVM is used for classification. The confusion matrices are shown in Figure 9. The average accuracy of our approach is 70%, much higher than random guessing (12.5%) and the result of using

Figure 9. Confusion matrices of crowd video classification. Left: using the holistic features in [16]; the average accuracy is 44%. Right: using our descriptors; the average accuracy is 70%.

Table 2. List of crowd video classes.
1  Highly mixed pedestrian walking
2  Crowd walking following a mainstream and well organized
3  Crowd walking following a mainstream but poorly organized
4  Crowd merge
5  Crowd split
6  Crowd crossing in opposite directions
7  Intervened escalator traffic
8  Smooth escalator traffic

the holistic crowd scene descriptor proposed in [16] (44%).
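The evaluation protocol described above, per-scene leave-one-out with each video represented by the mean of its group descriptors, can be sketched as follows. The data layout (`(scene_id, feature_vector, label)` tuples) is an assumption for illustration:

```python
from collections import defaultdict

def video_descriptor(group_descriptors):
    # A video with several groups is represented by the element-wise
    # mean of its group descriptor vectors.
    dim = len(group_descriptors[0])
    return [sum(g[d] for g in group_descriptors) / len(group_descriptors)
            for d in range(dim)]

def leave_one_scene_out(videos):
    # videos: list of (scene_id, feature_vector, label) tuples.
    # Yield (train, test) splits where the whole held-out scene --
    # possibly several clips -- is excluded from training, so the
    # classifier is always tested on scenes it has never seen.
    by_scene = defaultdict(list)
    for v in videos:
        by_scene[v[0]].append(v)
    for scene in by_scene:
        train = [v for v in videos if v[0] != scene]
        yield train, by_scene[scene]
```

Each split would then feed an off-the-shelf SVM; the point of the per-scene grouping is that clips from the test scene never leak into training.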

5. Conclusions

In this paper, we have systematically studied, from the vision point of view, fundamental and universal group properties that exist in various crowd systems. They are motivated by socio-psychological studies and by their importance in crowd scene understanding. Through learning the Collective Transition prior, we proposed a robust group detection algorithm and a rich set of group-property visual descriptors, and applied them to scene-independent group state analysis and crowd video classification. This research may also inspire new applications in future work, such as cross-scene crowd event detection and modeling pedestrian dynamics with group context.

Acknowledgement: This work is supported by the General Research Fund sponsored by the Research Grants Council of Hong Kong (Project No. CUHK 417110, CUHK 417011, CUHK 429412).

References
[1] C. C. Aggarwal. A human-computer interactive method for projected clustering. TPAMI, 16(4):448–460, 2004.
[2] J. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Computing Surveys, 43(3):16, 2011.
[3] S. Ali and M. Shah. A Lagrangian particle dynamics approach for crowd flow segmentation and stability analysis. In CVPR, 2007.
[4] M. R. Amer, D. Xie, M. Zhao, S. Todorovic, and S.-C. Zhu. Cost-sensitive top-down/bottom-up inference for multiscale activity recognition. In ECCV, 2012.
[5] A. F. Aveni. The not-so-lonely crowd: Friendship groups in collective behavior. Sociometry, 40:96–99, 1977.
[6] L. Bazzani, M. Cristani, G. Paggetti, D. Tosato, G. Menegaz, and V. Murino. Analyzing groups: a social signaling perspective. In Video Analytics for Business Intelligence, pages 271–305. 2012.
[7] S. Belongie, G. Mori, and J. Malik. Matching with shape contexts. In Statistics and Analysis of Shapes, pages 81–105. 2006.
[8] A. B. Chan and N. Vasconcelos. Modeling, clustering, and segmenting video with mixtures of dynamic textures. TPAMI, 30(5):909–926, 2008.
[9] M.-C. Chang, N. Krahnstoever, and W. Ge. Probabilistic group-level motion analysis and scenario recognition. In ICCV, 2011.
[10] W. Choi and S. Savarese. A unified framework for multi-target tracking and collective activity recognition. In ECCV, 2012.
[11] W. Ge, R. T. Collins, and R. B. Ruback. Vision-based analysis of small groups in pedestrian crowds. TPAMI, 34(5):1003–1016, 2012.
[12] D. Helbing and A. Johansson. Pedestrian, crowd and evacuation dynamics. In Extreme Environmental Events, pages 697–716. 2011.
[13] L. F. Henderson. The statistics of crowd fluids. Nature, 229:381–383, 1971.
[14] I. Karamouzas and M. Overmars. Simulating and evaluating the local behavior of small pedestrian groups. TVCG, 18(3):394–406, 2012.
[15] S. Kim, S. J. Guy, and D. Manocha. Velocity-based modeling of physical interactions in multi-agent simulations. In SIGGRAPH, pages 125–133. ACM, 2013.
[16] L. Kratz and K. Nishino. Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models. In CVPR, 2009.
[17] D. Kuettel, M. D. Breitenstein, L. Van Gool, and V. Ferrari. What's going on? Discovering spatio-temporal dependencies in dynamic scenes. In CVPR, 2010.
[18] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[19] T. Lan, Y. Wang, W. Yang, S. N. Robinovitch, and G. Mori. Discriminative latent models for recognizing contextual group activities. TPAMI, 34(8):1549–1562, 2012.
[20] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707–710, 1966.
[21] R. Li, R. Chellappa, and S. K. Zhou. Recognizing interactive group activities using temporal interaction matrices and their Riemannian statistics. IJCV, 101(2):305–328, 2013.
[22] C. C. Loy, T. Xiang, and S. Gong. Modelling multi-object activity by Gaussian processes. In BMVC, pages 1–11, 2009.
[23] C. C. Loy, T. Xiang, and S. Gong. Detecting and discriminating behavioural anomalies. Pattern Recognition, 44(1):117–132, 2011.
[24] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In CVPR, 2009.
[25] M. Moussaid, S. Garnier, G. Theraulaz, and D. Helbing. Collective information processing and pattern formation in swarms, flocks, and crowds. Topics in Cognitive Science, 1:469–497, 2009.
[26] M. Moussaid, N. Perozo, S. Garnier, D. Helbing, and G. Theraulaz. The walking behaviour of pedestrian social groups and its impact on crowd dynamics. PLoS ONE, 5(4), 2010.
[27] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In ICCV, 2009.
[28] W. M. Rand. Objective criteria for the evaluation of clustering methods. JASA, 66(336):846–850, 1971.
[29] M. Rodriguez, J. Sivic, I. Laptev, and J.-Y. Audibert. Data-driven crowd analysis in videos. In ICCV, 2011.
[30] S. Smyth and S. White. A spectral clustering approach to finding communities in graphs. In SDM, 2005.
[31] J. Toner, Y. Tu, and S. Ramaswamy. Hydrodynamics and phases of flocks. Annals of Physics, 318:170–244, 2005.
[32] T. Vicsek and A. Zafeiris. Collective motion. arXiv:1010.5017, 2012.
[33] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.
[34] X. Wang, X. Ma, and W. E. L. Grimson. Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. TPAMI, 31(3):539–555, 2009.
[35] S. A. Wheelan. The handbook of group research and practice. Sage, 2005.
[36] M. Wu and B. Scholkopf. A local learning approach for clustering. In NIPS, 2006.
[37] H.-P. Zhang, A. Beer, E.-L. Florin, and H. L. Swinney. Collective motion and density fluctuations in bacterial colonies. PNAS, 107(31):13626–13630, 2010.
[38] W. Zhang, X. Wang, D. Zhao, and X. Tang. Graph degree linkage: Agglomerative clustering on a directed graph. In ECCV, 2012.
[39] B. Zhou, X. Tang, and X. Wang. Coherent filtering: Detecting coherent motions from crowd clutters. In ECCV, 2012.
[40] B. Zhou, X. Tang, and X. Wang. Measuring crowd collectiveness. In CVPR, 2013.
[41] B. Zhou, X. Wang, and X. Tang. Understanding collective crowd behaviors: Learning a mixture model of dynamic pedestrian-agents. In CVPR, 2012.