Accepted Manuscript

Graph-based approach for human action recognition using spatio-temporal features

Najib Ben Aoun, Mahmoud Mejdoub, Chokri Ben Amar

PII: S1047-3203(13)00191-0
DOI: http://dx.doi.org/10.1016/j.jvcir.2013.11.003
Reference: YJVCI 1284
To appear in: J. Vis. Commun. Image R.
Received Date: 29 January 2013
Accepted Date: 5 November 2013

Please cite this article as: N.B. Aoun, M. Mejdoub, C.B. Amar, Graph-based approach for human action recognition using spatio-temporal features, J. Vis. Commun. Image R. (2013), doi: http://dx.doi.org/10.1016/j.jvcir.2013.11.003

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Graph-based approach for human action recognition using spatio-temporal features

Najib Ben Aoun, Mahmoud Mejdoub, Chokri Ben Amar

REGIM-Lab: REsearch Groups on Intelligent Machines, University of Sfax, National Engineering School of Sfax (ENIS), BP 1173, Sfax, 3038, Tunisia

Abstract

Due to the exponential growth of the video data stored and uploaded on Internet websites, especially YouTube, effective analysis of video actions has become essential. In this paper, we tackle the challenging problem of human action recognition in realistic video sequences. The proposed system combines the efficiency of the Bag-of-visual-Words strategy with the power of graphs for the structural representation of features. It is built upon the commonly used Space-Time Interest Point (STIP) local features, followed by a graph-based video representation which models the spatio-temporal relations among these features. The experiments are carried out on two challenging datasets: Hollywood2 and UCF YouTube Action. The experimental results show the effectiveness of the proposed method.

Keywords: Human action recognition, Spatio-temporal features, Graph-based video modeling, Bag-of-sub-Graphs, Frequent sub-graphs, Support Vector Machines

1. Introduction

In this paper we address the task of Human Action Recognition (HAR), which is the process of naming human actions based on the video content. This task has been an active research topic in computer vision over the last decade since it represents a fundamental part of many applications such as video categorization, human-computer interaction, video surveillance and robotics.

Email addresses: [email protected] (Najib Ben Aoun), [email protected] (Mahmoud Mejdoub), [email protected] (Chokri Ben Amar)

Among recently developed systems, Bag-of-visual-Words (BoW) based methods have gained significant interest [5, 6, 7, 8, 9, 15, 23, 34, 35, 36]. The BoW approach represents the video by a histogram of visual words. First, local spatio-temporal features are computed from the video. Then, the k-means algorithm is usually used to quantize the spatio-temporal features into k clusters. The obtained clusters form the visual word vocabulary (codebook). Afterward, each spatio-temporal feature is replaced by the visual word that represents its nearest cluster. Subsequently, each video is characterized by its histogram of visual words. Based on these histograms of visual words, human actions are recognized.

One of the notorious disadvantages of BoW is the lack of relationships between the features. This flaw can be surmounted by interconnecting features by means of structured models such as graphs. Indeed, graphs are efficient for providing a natural description of human actions [16, 17, 18, 20, 21]. Another advantage of the graph-based representation is its robustness to noisy video inputs. The graph-based representation overcomes the problem of missing parts since, even with occluded parts, the majority of the information is maintained.

Motivated by the promising results given by the spatio-temporal features [3, 6, 7, 23, 24, 32, 33] and the success of the graph-based data representation for representing the relationships between features [1, 2, 17, 20, 21, 25], we have developed a new HAR system. The proposed graph-based HAR system (GHAR) is built upon our previous work reported in [2]. In [2], we modeled the video using a spatial graph representation of frame regions. These frame regions are obtained by segmentation with the Hill Climbing algorithm and are characterized by spatial features (color histogram, Gabor filter and the gray-level co-occurrence matrix). Then, the video is indexed based on the spatial graph features and block-matching-based motion features. In this paper, we represent each video with sets of spatial and temporal graphs constructed from its STIP features. These graphs describe respectively the intra-frame and the inter-frame relationships between STIP features. The spatial and temporal graphs collected over all training videos form two graph databases. After that, frequent sub-graphs are extracted from these two graph databases using the graph-based substructure pattern mining algorithm (gSpan) [27]. Video indexing is then conducted by
computing a histogram of frequent sub-graphs for each video. This is done in the same way as in the BoW approach, with the frequent sub-graphs playing the role of the visual words. From this fact came the idea to name our graph-based video indexing the Bag-of-sub-Graphs (BoG) approach.

In comparison with our previous work [2], four main contributions can be highlighted in this work: (1) rather than using spatial and temporal features computed separately from the video, spatio-temporal STIP features are used to capture at the same time the video motion and appearance in a local volumetric space-time neighborhood around 3D interest points; (2) in [2], spatial graphs are constructed from segmented frame regions, while in this work, spatial graphs are constructed from detected spatio-temporal interest points. Consequently, these spatial graphs not only avoid the problem of region segmentation but also integrate the temporal information in the video; (3) in this work, the spatial graphs are coupled with temporal graphs in order to extend the graph-based modeling from the frame level to the video level and thus give an overall description of the video; (4) for the graph-based video indexing, we have improved the histogram encoding from a binary histogram, which only indicates the presence or absence of the frequent sub-graphs, to a numerical histogram that gives the number of occurrences of the frequent sub-graphs.

The remaining sections of this paper are organized as follows. We begin, in Section 2, by presenting the related works on human action recognition. In Section 3, we explain in detail our proposed GHAR system, which profits from the benefits of the graph representation to model and index the video. In Section 4, we evaluate our system and give the experimental results on two challenging benchmark datasets for human action recognition. The experimental results prove the efficiency of our GHAR system in recognizing human actions. We conclude the paper by giving a summary of the presented work as well as future extensions.

2. Related works

During the last decade, several human action recognition methods have been introduced. In [3], Gilbert et al. propose a hierarchical approach for constructing and selecting discriminative compound features of 2D Harris corners [30]. Later, in [4], a HAR method that uses features learned from spatio-temporal data using independent subspace analysis was presented.

Due to the success of the spatio-temporal features in describing video actions, most methods use them following the BoW approach [5, 6, 7, 8, 9, 10, 15]. Laptev and Lindeberg [5] first introduced the Space-Time Interest Points (STIP) for action recognition by extending the famous Harris detector [30] to video. Their 3D Harris detector takes into consideration the pixel variations in both space and time. The Histogram of Oriented Gradients (HOG) and the Histogram of Optical Flow (HOF) features are then computed in the local neighborhood of the interest points. The combination of the HOG as a spatial feature representing the local appearance and the HOF as a temporal feature describing the video motion has given promising results. STIP detection in complex action datasets [6, 7] results in a large number of background STIPs which usually lead to erroneous action recognition. To surmount this problem, Liu et al. [7] have used static and motion feature pruning to keep the most informative features, and a divisive information-theoretic algorithm has been employed to group semantically related features. An extension of the standard BoW approach is presented in [8] by locally applying BoW to regions that are spatially and temporally segmented to get rid of the unnecessary STIPs.

Chakraborty et al. [9] have introduced a new Spatial Interest Point (SIP) feature. Interest points are detected using the basic Harris corner detector [30]. Apart from the SIPs detected on the human actors, a significant number of unwanted background SIPs are also detected. So, only distinctive SIP features are kept by suppressing unwanted background SIPs using a surround suppression mask (SSM). The SSM is centered on each SIP, the influence of all surrounding points of the mask is estimated, and accordingly, a suppression decision is taken. The final set of interest points is selected by applying non-maxima suppression similar to [31] in order to retain only SIPs which represent local maxima in comparison with their neighborhood. Despite this local constraint, some background SIPs might remain. So, a temporal constraint is imposed to remove static SIPs using an interest point matching algorithm between each two consecutive frames along with a temporal Gabor filter response. Consequently, selective spatio-temporal interest points (SSTIP) are used for local descriptor-based action recognition. This approach then follows a spatial pyramid based BoW improved by a vocabulary compression technique. Ikizler-Cinbis et al. have presented in [10] an approach for combining features of the people, objects and scenes and have used a multiple instance learning (MIL) based framework for better recognition of actions.

The biggest drawback of the BoW approach is the lack of structural organization of the features. Indeed, higher-level descriptors can be extracted by correlating local features. Temporal feature correlation may be characterized based on motion trajectories. Some recent methods [11, 12, 13, 15] show impressive results for action recognition by exploiting the motion information of trajectories obtained by temporally correlating features. Brendel and Todorovic [11] have built a HAR system where activities are compactly represented as time series of a few snapshots of human-body parts. Indeed, these parts are characterized by (2D+t) HOG descriptors and tracked with the Viterbi algorithm to get their trajectories. Then, the action is recognized by retrieving the exemplar trajectories that best match the query video trajectories by aligning their short time series representations. Sun et al. [12] extract motion trajectories by matching Scale-Invariant Feature Transform (SIFT) descriptors between two consecutive frames using Markov chaining. Each descriptor has a unique match, and the matches which are too far apart are discarded. Similar work introduced by Bregonzio et al. [13] extracts the trajectories by SIFT matching and by tracking detected STIPs with the Kanade-Lucas-Tomasi (KLT) method [14]. The trajectories are characterized by a concatenation of three descriptors: an orientation-magnitude descriptor, a shape descriptor and a SIFT-based appearance descriptor. Then, the BoW approach is used and feature selection is applied to eliminate the unwanted STIPs. Lately, Wang et al. [15] have proposed a method for trajectory extraction by tracking densely sampled points using optical flow fields. The dense trajectories are then described with the Motion Boundary Histograms (MBH), HOG and HOF features to index the video.

A graph is a natural way to represent the correlation between features. Rather than representing the trajectories by histograms as in [11, 13, 15], a graph can better represent the temporal ordering in the trajectories. In addition to our previous work [2], some recent works [16, 17, 18, 20, 21] have exploited the graph-based representation for video modeling. In [16], Ta et al. have proposed the PairWise Features (PWF), which represent the appearance and the spatio-temporal relations between STIP features. PWFs are constructed by connecting pairs of STIPs which are spatio-temporally close. The BoW approach is followed to generate two codebooks according to the appearance and the geometric similarity of the PWFs. A video is then characterized by a feature vector which combines the two histograms of visual words and classified using the Support Vector Machines (SVM) [15] method. Song et al. [17] use point features obtained in each frame and
subsequently tracked to construct a decomposable triangulated graph. This graph, which is a collection of size-three cliques (see footnotes 1 and 2 below), models the whole-body relationships in order to detect and recognize human actions. The location and the velocity of each point, together with other cues such as local appearance, are extracted and used as image cues. In [18], a spatio-temporal feature based on the Speeded-Up Robust Feature (SURF) [19] is proposed. Indeed, moving SURF interest points are detected with the Lucas-Kanade method [14] and three-interest-point cliques are grouped by Delaunay triangulation. These small triangular graph features are combined with Gabor texture features as appearance features and global optical-flow histograms as motion features using Multiple Kernel Learning (MKL). Celiktutan et al. [20] propose a graph-based approach for human action recognition. Only the most salient point, which has the highest confidence of the interest point detector, is kept in each frame and the video is represented by a graph linking all points. Then, video actions are classified and recognized by graph matching. Graph matching is also used in [21] to determine the classes of the actions present inside the video. The video is represented with a set ("string") of feature graphs that respect the spatio-temporal ordering.

Despite these recent attempts, the exploitation of the graph-based representation for video modeling remains limited. Therefore, we have exploited, in our work, the graph representation to model the videos. The video is represented by a set of spatial graphs interconnecting all STIP features and a set of temporal graphs which are a collection of the trajectories of all STIPs. Using the gSpan algorithm, frequent spatial and temporal sub-graphs are discovered from the two graph sets. Then, the video is modeled with a histogram of frequent spatial and temporal sub-graphs. Finally, the SVM algorithm is applied to classify video actions.

1. The size of a graph is the number of its edges.
2. The maximum clique is the largest subset of vertices in which every two vertices are connected.

The contributions of this work in comparison with the state-of-the-art methods can be summarized as follows: (1) Unlike [16, 17, 18, 20, 21], we use a combination of spatial and temporal graphs. (2) To overcome the aforementioned drawbacks of the BoW based methods [5, 7, 8, 9], we use spatio-temporal features and a model on top of them which provides the spatio-temporal relationships. (3) Our method can be considered as a
generalization of the methods used in [16, 17, 18, 20, 21], where relatively simple graph structures, with a fixed order and fixed topography (shape), are used to model the video. In contrast, our method allows a more general structure with a higher and variable order as well as a flexible topography. (4) The trajectories are described by temporal graphs that preserve the temporal ordering, unlike the histogram-based trajectory methods used in [13, 15]. (5) Unlike most methods, only significant and relevant sub-graphs are retained to model the video. These sub-graphs are selected by the gSpan algorithm. (6) Unlike [20, 21], which employ sequential graph matching to retrieve the most similar videos, we convert the graph matching problem into a vector space one by representing a graph with a histogram of frequent sub-graphs. This enables us to apply learning algorithms on these histograms for action classification. (7) Since we construct complete spatial graphs that connect all detected spatio-temporal interest points, we do not need to segment the images as in [1, 2] or to model body parts as in [11, 17].

3. Our proposed human action recognition method

The general architecture of our human action recognition system (GHAR) is illustrated in Figure 1. Our GHAR system is composed of two phases: the training and the testing phases. In the training phase, for each training video, spatio-temporal features are extracted and used to model the video with spatial and temporal graphs, which leads to two graph databases. The gSpan algorithm is applied to discover the frequent sub-graphs in each graph database. Then, each video is indexed by two histograms (the histogram of the frequent spatial sub-graphs and the histogram of the frequent temporal sub-graphs). The two histograms are horizontally concatenated together to form the final video descriptor. Finally, the video action models are built from all training video descriptors by classifying human actions using SVM [15]. The same indexing procedure is followed in the testing phase to compute the descriptors of the testing videos. The actions are then recognized using the models built by SVM in the training phase.

Figure 1: Overview of our GHAR system

In the next sections, the GHAR processing steps are explained in detail.

3.1. Spatio-temporal feature extraction

In our work, we have utilized the space-time interest point detection method proposed by Laptev in [22], followed by the HOG/HOF descriptors presented
in [23], as spatio-temporal features since they have given good results, especially for complex videos [7, 24].

We have used the Laptev implementation available online (footnote 3), where interest points are extracted at different spatial scales σ² = {4, 8, 16, 32, 64, 128} and temporal scales τ² = {2, 4}. Then, the HOG/HOF features are computed on a volumetric video patch in the neighborhood of each detected STIP. The patch is divided into a grid of 3 × 3 × 2 spatio-temporal blocks, as suggested by the authors. Afterward, 4-bin HOG and 5-bin HOF histograms are calculated for all blocks and concatenated to form a vector of 162 elements (72 elements for HOG and 90 elements for HOF) for each STIP.

3. http://www.di.ens.fr/~laptev/download.html#stip
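As an illustration of the descriptor layout described above, the following minimal sketch (not taken from the authors' code) assembles one 162-dimensional STIP descriptor from the 3 × 3 × 2 block grid, assuming 4-bin HOG and 5-bin HOF histograms per block; the per-block histograms here are random placeholders standing in for real gradient and optical-flow statistics.

```python
import numpy as np

n_blocks = 3 * 3 * 2                 # spatio-temporal grid around one interest point
hog_bins, hof_bins = 4, 5            # histogram bins per block

# placeholder per-block histograms (real HOG/HOF values would go here)
hog_blocks = np.random.rand(n_blocks, hog_bins)
hof_blocks = np.random.rand(n_blocks, hof_bins)

# concatenation: 18 * 4 = 72 HOG elements followed by 18 * 5 = 90 HOF elements
descriptor = np.concatenate([hog_blocks.ravel(), hof_blocks.ravel()])
assert descriptor.shape == (162,)
```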

3.2. Graph-based video modeling

Once the STIP features are extracted from each video, the STIPs are considered as vertices of the spatial and temporal graphs of the video. The STIP features of all training videos are clustered by the k-means algorithm into k clusters in order to attribute each STIP to a cluster. Then, each graph vertex is labeled by the cluster of its corresponding STIP.
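A minimal sketch of this quantization step, assuming the pooled training descriptors are stacked in an array `all_train_stips` of shape (N, 162) and the descriptors of one video in `stips` (both hypothetical names); the subset size and the 8 initializations follow the experimental settings reported in Section 4.

```python
import numpy as np
from sklearn.cluster import KMeans

k = 2000                                             # number of vertex labels (kvs)
idx = np.random.choice(len(all_train_stips), 100_000, replace=False)
codebook = KMeans(n_clusters=k, n_init=8).fit(all_train_stips[idx])

vertex_labels = codebook.predict(stips)              # one cluster label per STIP vertex
```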

Each video will be modeled with two sets of graphs:

• The Spatial Video Graph Set (SVGS): we compute for each video frame a Spatial Frame Graph (SFG). The SVGS is then the set of SFGs (see Figure 2). The SFG is a complete graph constructed by connecting all the STIPs inside the video frame. To label an SFG edge, we compute
the edge vector. It is composed of the displacements dx and dy, respectively, between the x-coordinates and the y-coordinates of the two participating vertices. Then, the k-means algorithm is applied to the edge vectors of all training videos, and each edge is labeled by the closest cluster to its vector.

• The Temporal Video Graph Set (TVGS): we connect each STIP to the one which has the same label in the next frame (see Figure 2). The search for the STIP in the next frame is performed inside a window of n × n pixels. To label the graph edge, the temporal displacement between the two participating vertices is quantized into 4 directions: top-left, top-right, bottom-left and bottom-right. The TVGS is formed by the connected graphs. These graphs reflect the STIP trajectories (a construction sketch is given after the figure below).

Figure 2: Illustration of an SVGS (formed by the SFGs) and a TVGS (formed by the STIP trajectories)
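The sketch below illustrates, under stated assumptions, how one Spatial Frame Graph and the temporal links of the TVGS could be built. It is not the authors' implementation: each STIP is assumed to be a dict with frame position 'x', 'y' and a k-means vertex label 'v', `edge_codebook` is assumed to be a k-means model fitted on (dx, dy) edge vectors, and the n × n search window is interpreted as being centred on the STIP.

```python
import networkx as nx
from itertools import combinations

def spatial_frame_graph(frame_stips, edge_codebook):
    """Complete graph over the STIPs of one frame, with quantized (dx, dy) edge labels."""
    g = nx.Graph()
    for i, p in enumerate(frame_stips):
        g.add_node(i, label=p['v'])
    for i, j in combinations(range(len(frame_stips)), 2):
        dx = frame_stips[j]['x'] - frame_stips[i]['x']
        dy = frame_stips[j]['y'] - frame_stips[i]['y']
        g.add_edge(i, j, label=int(edge_codebook.predict([[dx, dy]])[0]))
    return g

def temporal_links(frame_t, frame_t1, n=13):
    """Link each STIP to a same-label STIP of the next frame inside an n x n window,
    labelling the edge with one of the 4 displacement directions."""
    links = []
    for p in frame_t:
        for q in frame_t1:
            if q['v'] == p['v'] and abs(q['x'] - p['x']) <= n // 2 and abs(q['y'] - p['y']) <= n // 2:
                direction = ('top' if q['y'] < p['y'] else 'bottom') + \
                            ('-left' if q['x'] < p['x'] else '-right')
                links.append((p, q, direction))
                break   # keep the first candidate; the paper does not specify tie-breaking
    return links
```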

3.3. Video indexing by means of graph representation

Up to this stage, each video is represented by an SVGS and a TVGS. The SVGSs and the TVGSs associated with the training videos are respectively
collected to form the database of spatial graphs (DSG) and the database of temporal graphs (DTG). The gSpan algorithm is applied to each database to retrieve the frequent spatial and temporal sub-graphs (see Figure 3). Then, each video is indexed by two histograms (the histogram of the frequent spatial sub-graphs and the histogram of the frequent temporal sub-graphs), which are then horizontally concatenated together to form the final video descriptor.

Figure 3: Video indexing using frequent spatial and temporal sub-graphs

Great importance [1, 2, 25, 29] is given to the frequent sub-graph discovery approach. This is because it helps to reduce the high complexity of graph isomorphism (graph similarity), since graphs can be more easily matched based on the frequent sub-graphs that they contain. The discovery of frequent sub-graphs in a database of graphs is based on the minimum support minSup, which is the number of appearances that a sub-graph must exceed to be considered frequent in the database. This parameter must be carefully chosen to give frequent sub-graphs which are significantly representative of the graph database. If minSup is fixed to a high value, a small number of frequent sub-graphs will be obtained, which will not accurately represent the graph database. In contrast, setting the threshold minSup to a small value will lead to a large number of frequent sub-graphs which over-represent the graph database, cause overfitting in the classification phase and require a high mining cost in resources and time.
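The role of minSup can be summarized with the toy sketch below (purely illustrative, hypothetical support values): a candidate sub-graph is kept as frequent only if the number of database graphs that contain it reaches the threshold.

```python
# hypothetical supports: number of database graphs containing each candidate sub-graph
support = {'sg_a': 420, 'sg_b': 37, 'sg_c': 950, 'sg_d': 101}
min_sup = 100

# ">=" is used here; conventions differ on whether the threshold must be reached or exceeded
frequent = [sg for sg, s in support.items() if s >= min_sup]
print(frequent)   # ['sg_a', 'sg_c', 'sg_d']; raising min_sup shrinks this set
```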

There are several algorithms in the literature [27, 26, 28] which try to find frequent sub-graphs (patterns) in a graph database. Most of these methods iteratively generate sub-graph candidates, compute their supports and determine whether they are frequent sub-graphs or not. The Frequent Sub-Graph mining algorithm (FSG) [26] uses a breadth-first search when generating the candidate sub-graphs. The graph-based Substructure pattern mining algorithm (gSpan) [27] and the Fast Frequent Sub-graph Mining algorithm (FFSM) [28] use a depth-first traversal of the search space, which significantly reduces the computational cost and provides efficient memory consumption. The gSpan algorithm outperforms the other algorithms in computational time and is capable of mining large frequent sub-graphs in a big graph set [27]. This makes it an appropriate algorithm for our GHAR system.

To index the video, we compute two video histograms (the video histogram of the frequent spatial sub-graphs and the video histogram of the frequent temporal sub-graphs). To compute the video histogram of the frequent spatial sub-graphs, we proceed as follows:
• For each graph in the SVGS, we form a frequent spatial sub-graph histogram. It consists in counting the number of occurrences, in the graph, of each frequent spatial sub-graph discovered by gSpan.

• The histograms are added up over all the graphs in the SVGS to form the video histogram of the frequent spatial sub-graphs.

We give, in Figure 4, an example of a video histogram of 4 frequent spatial sub-graphs. In the same way, we compute the video histogram of the frequent temporal sub-graphs. The two histograms are concatenated to form the video descriptor, which is then normalized. The normalization is conducted to standardize the descriptor element values on the same scale (between 0 and 1). The normalization is defined by:

y_i = (x_i − b_l) / (b_u − b_l)        (1)

where y_i is the normalized value, b_l is the lower bound and b_u is the upper bound of the i-th descriptor element x_i.

Figure 4: The graph G is indexed with 4 frequent spatial sub-graphs and has a spatial histogram = [2, 1, 1, 1]

4. Sub-graph isomorphism consists in determining whether one graph is contained within another.

Finding the frequencies of a sub-graph in a graph is a challenging task since it involves sub-graph isomorphism (footnote 4), which is an NP-complete problem. To surmount the sub-graph isomorphism problem, the Maximum Common Sub-graph (MCS) method is used in [29]. It (1) finds the Common Sub-graph (CS), which is the largest common substructure between the graph and the sub-graph, and (2) computes the maximum clique [29] in the CS. If the maximum
clique has at least the size of the sub-graph, the sub-graph is present in the graph. In our work, to compute the number of occurrences No of a frequent sub-graph SG in a graph G, we begin by computing the CS between them. The CS is composed of:

• Edges which exist, with the same labels and the same vertex labels, in both SG and G (commonly present edges).

• Edges between vertices already in the CS which do not exist in SG but exist in G (added to the CS for calculation reasons, in order to be able to use the maximum clique detection method).

It should be noticed that the CS edges corresponding to the second situation are added only when we deal with a non-complete frequent sub-graph (see Fig. 4.d), which is usually the case. Such edge additions are necessary for maximum clique identification (since the maximum clique must be completely connected). Figure 5 shows an example of a CS between a graph G and a frequent sub-graph SG. In the CS, the edges in bold are those
which are commonly present, and the dashed edge is the one added for computation reasons.

Figure 5: An example of a Common Sub-graph between a graph and a non-complete frequent sub-graph

In the next step, we detect in the CS the maximum cliques that have a size equal to or greater than the size of SG. No then corresponds to the number of detected maximum cliques which have at least the size of the sub-graph. In the CS of Figure 5, there is one maximum clique of size 3, which is greater than the size of SG (2). So, the frequency of SG in G is equal to 1.
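For readers who want a quick approximation of this counting step, the sketch below uses networkx's labelled (node-induced) sub-graph isomorphism matcher rather than the authors' MCS and maximum-clique procedure, so the counts it produces are not guaranteed to coincide with No; vertex and edge labels are assumed to be stored in a 'label' attribute.

```python
import networkx as nx
from networkx.algorithms import isomorphism

def count_occurrences(sg, g):
    gm = isomorphism.GraphMatcher(
        g, sg,
        node_match=isomorphism.categorical_node_match('label', None),
        edge_match=isomorphism.categorical_edge_match('label', None))
    # each match maps a vertex subset of G onto SG; distinct subsets are counted once
    matches = {frozenset(m.keys()) for m in gm.subgraph_isomorphisms_iter()}
    return len(matches)
```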

4. Experimental study

In this section, we evaluate our GHAR method on different benchmark datasets for human action recognition. For each dataset, the system settings must be carefully defined since each benchmark dataset has its own particularities.

In our experiments, to label the graphs, the STIP features and the graph edges are quantized using the k-means algorithm. The number of frequent sub-graphs is obtained by fixing the minSup in gSpan. For the DSG, we denote by kvs, kes and Nfs respectively the number of vertex labels, the number of edge labels and the number of frequent spatial sub-graphs. For the DTG, we denote by kvt and Nft respectively the number of vertex labels and the number of frequent temporal sub-graphs. As in [24], a subset of 100,000 randomly selected features is used for the k-means algorithm, which is initialized 8 times to obtain a more precise clustering. For the construction of the TVGSs, we have taken, as search window size, n=13 for the Hollywood2 dataset and n=15
for the UCF YouTube Action dataset, since these values experimentally gave the best results when applying cross-validation on the training video set.

The video descriptors are used to learn the human action models by means of an SVM classifier [15] with a one-versus-all multi-class approach.
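A minimal sketch of this classification stage with scikit-learn, assuming `X_train` / `X_test` hold the normalized BoG descriptors and `y_train` the action labels; the kernel choice is illustrative since it is not fixed by the text above.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

clf = OneVsRestClassifier(SVC(kernel='linear'))   # one-versus-all multi-class SVM
clf.fit(X_train, y_train)
predicted_actions = clf.predict(X_test)
```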

4.1. Datasets

Several challenging video datasets have been provided to evaluate the performance of HAR methods.

The Hollywood2 dataset (footnote 5) was proposed by Marszalek et al. [6] as an extension of the Hollywood dataset [23] with a greater number of actions and more scene training data. It was collected from a set of 69 movies (33 movies for training and 36 movies for testing). 1707 manually labeled videos are provided: 823 videos are taken for training and 884 videos for testing. An additional 810 automatically labeled videos are also given for training, but for most evaluations only the 1707 manually labeled videos are used. The dataset is labeled with 12 human action classes (one example for each action is shown in Figure 6). The Average Precision (AP) is used as the evaluation criterion. The AP is computed for each human action, which allows comparing our system on each action, and then the mean AP (mAP) is calculated across all actions as the final evaluation metric.

The UCF YouTube Action dataset (footnote 6) (UCF11), introduced in [7], is one of the most extensive realistic action datasets available. It consists of 1168 YouTube videos covering 11 action categories (see Figure 7), where the videos of each category are divided into 25 relatively independent subsets. This dataset is challenging due to the large variations in camera motion, viewpoint, object appearance and pose, object scale, cluttered background, illumination conditions and low resolution. Performance is measured as in [7] using the Leave-One-Out Cross-Validation (LOOCV) average accuracy.
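The two evaluation protocols can be sketched as follows (hypothetical variable names): per-class Average Precision and its mean for Hollywood2, and per-split accuracy averaged over the leave-one-out folds for UCF11.

```python
import numpy as np
from sklearn.metrics import average_precision_score, accuracy_score

# Hollywood2: AP per action from binary ground truth and classifier scores, then mAP
ap_per_action = [average_precision_score(y_true[:, c], scores[:, c])
                 for c in range(y_true.shape[1])]
mAP = np.mean(ap_per_action)

# UCF11: accuracy of one left-out group; the LOOCV score averages this over the 25 groups
fold_accuracy = accuracy_score(y_test_fold, y_pred_fold)
```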

4.2. Experiments on the Hollywood2 dataset

We have evaluated our GHAR system on the Hollywood2 dataset to recognize the 12 required video actions. The experiments were done using the manually labeled videos. Videos are downscaled to half the frame resolution (320 × 240) to accelerate the computation of the features and to compare our method with other methods working on the same video resolution.

5. http://www.di.ens.fr/~laptev/actions/hollywood2/
6. http://crcv.ucf.edu/data/UCF_YouTube_Action.php

Figure 6: Examples from Hollywood2 dataset

Figure 7: Examples from UCF11 dataset

Table 1: mAP given by testing different values of the DSG parameters (kvs is the number of spatial graph vertex labels, kes is the number of spatial graph edge labels and Nfs is the number of frequent spatial sub-graphs)

kvs   1000  1000  1000  2000  2000   2000  3000  3000  3000
kes   1000  1500  2000  1000  1500   2000  1000  1500  2000
Nfs   8100  7300  6700  6200  5800   5300  4900  4300  3700
mAP   49.4  50.6  49.6  50.5  51.78  50.7  49.8  50.2  49.2

In the first experiment, our proposed GHAR system is evaluated on the Hollywood2 dataset using the STIP detection method. We tune, using cross-validation on the training video set, the kvs, kes and minSup parameters of gSpan for the DSG, as well as kvt and minSup of gSpan for the DTG. For each couple (kvs, kes), we change the minSup value; for each minSup, we obtain a value of Nfs. For the DSG, the optimal values of kvs, kes and Nfs are respectively 2000, 1500 and 5800. This setting is validated on the test dataset since it gives the best mAP (see Table 1). The value of Nfs reported in Table 1 corresponds to the optimal value retained in the validation step for a fixed couple (kvs, kes). For the DTG, the optimal values of kvt and Nft are respectively 1000 and 3100. With these values, we obtain, using the videos of the test dataset, a mAP equal to 53.1%.

To test the impact of Nfs on the mAP of the test video dataset, we have fixed kvs and kes to their optimal values and varied the value of Nfs (Figure 8). The best mAP is obtained for Nfs equal to 5800. The decrease of the mAP beyond Nfs = 5800 can be explained by the increased redundancy among the selected frequent sub-graphs.

Figure 8: Impact of the number of frequent spatial sub-graphs Nfs on the mAP for the Hollywood2 dataset

It should be noticed that, by using only 3000 frequent spatial sub-graphs (without using the frequent temporal sub-graphs), our BoG method gives a mAP equal to 48.4%, which is 3.2% higher than the STIP-based BoW method (with 4000 visual words) [24]. Using 4000 frequent spatial sub-graphs, we obtain a mAP 5.35% higher than the BoW method.

Looking at Table 2, some conclusions can be drawn about the contribution of the combination of the frequent spatial sub-graphs and the frequent temporal sub-graphs. For most actions, the frequent temporal sub-graphs give better results than the frequent spatial sub-graphs since most actions are accompanied by significant motion. On the other hand, the frequent spatial sub-graphs give better results for some actions where the scene (the car shape for GetOutCar, the bedroom scene for SitUp) plays an important role in their
recognition. Besides, frequent spatial sub-graphs work better for actions with small motions such as AnswerPhone and HandShake. The combination of the frequent spatial and the frequent temporal sub-graphs improves the recognition accuracy for most actions, which proves their complementarity.

As shown in Table 3, we have tested our previous work [2] on the Hollywood2 dataset and obtained a mAP equal to 37.18%. This low recognition rate is due to the frame segmentation into regions, which is not always done efficiently. Also, as previously explained in Section 1, Ben Aoun et al. [2] use relatively simple spatial features interconnected only by spatial graphs. Besides, a binary histogram is used to model the video action, which reduces the recognition accuracy.

Table 4 presents a comparison between our GHAR system and the state-of-the-art methods on the Hollywood2 dataset in terms of per-action AP and mAP. It can be observed that most actions show low recognition rates. This can be explained by the fact that the Hollywood2 dataset contains particularly challenging video conditions. Indeed, the videos present a large amount of camera motion, clutter, lighting changes and variations in action execution, and they are taken from different view angles. Compared to other datasets such as UCF11, many actions (FightPerson, HandShake, HugPerson and Kiss) are carried out by more than one person, with multiple moving people in the background.

Table 2: Results on the Hollywood2 dataset in terms of AP and mAP given by our system using frequent spatial sub-graphs, frequent temporal sub-graphs and their combination

Action         Frequent spatial sub-graphs   Frequent temporal sub-graphs   Combination
AnswerPhone    35.02                         32.83                          34.83
DriveCar       82.98                         85.14                          89.31
Eat            52.02                         56.27                          61.06
FightPerson    74.34                         78.12                          80.24
GetOutCar      48.12                         46.36                          53.92
HandShake      35.27                         30.10                          40.63
HugPerson      47.36                         50.27                          55.92
Kiss           54.38                         57.80                          62.27
Run            73.28                         76.71                          79.85
SitDown        55.37                         58.04                          62.47
SitUp          24.11                         20.62                          26.83
StandUp        53.08                         60.23                          63.92
mAP            51.78                         53.96                          59.27

Table 3: AP and mAP given by our BoG method and our previous method [2]

Action         BEN AOUN [2]   BoG
AnswerPhone    14.1           34.83
DriveCar       74.6           89.31
Eat            32.6           61.06
FightPerson    63.8           80.24
GetOutCar      21.4           53.92
HandShake      19.8           40.63
HugPerson      26.9           55.92
Kiss           52.3           62.27
Run            56.1           79.85
SitDown        31.2           62.47
SitUp          18.8           26.83
StandUp        34.5           63.92
mAP            37.18          59.27

Table 4: Comparison of our method with the state-of-the-art methods in terms of AP (%) and mAP (%) over all actions of the Hollywood2 dataset. STIP: Spatio-Temporal Interest Points proposed by Laptev in [22]; SSTIP: Selective Spatio-Temporal Interest Points proposed by Chakraborty et al. in [9]

Action         Gilbert [3]   Le [4]   Ullah [8]   Wang [15]   Chakr. [9]   BoG+STIP   BoG+SSTIP
AnswerPhone    40.2          29.9     26.3        32.6        41.6         34.83      43.24
DriveCar       75            85.2     86.5        88          88.49        89.31      89.52
Eat            51.5          59.7     59.2        65.2        56.5         61.06      59.65
FightPerson    77.1          77.2     76.2        81.4        78.2         80.24      80.72
GetOutCar      45.6          45.4     45.7        52.7        47.37        53.92      52.93
HandShake      28.9          20.3     49.7        29.6        52.5         40.63      53.81
HugPerson      49.4          38.2     45.4        54.2        50.3         55.92      55.84
Kiss           56.6          57.9     59          65.8        57.35        62.27      60.16
Run            47.5          75.7     72          82.1        76.73        79.85      78.76
SitDown        62            59.4     62.4        62.5        62.5         62.47      62.87
SitUp          26.8          25.7     27.5        20          30           26.83      31.41
StandUp        50.7          64.7     58.8        65.2        60           63.92      62.32
mAP            50.9          53.3     55.7        58.3        58.46        59.27      60.94

Also, actions like AnswerPhone, HandShake and SitUp are harder to interpret without context information [6].

Gilbert et al. [3] achieved a mAP equal to 50.9%, although they use higher-level knowledge with a hierarchical approach. Le et al. [4] obtained 53.3% by learning features from spatio-temporal data using independent subspace analysis. The BoW-based methods [8, 9] suffer from the lack of feature relationship modeling. Ullah et al. [8] obtained 55.7% by locally applying BoW on semantically meaningful regions segmented (both spatially and temporally) to get rid of the unnecessary STIPs. Chakraborty et al. [9] attain a mAP equal to 58.46% using selective STIPs, which are detected by suppressing background SIPs and imposing local and temporal constraints. The dense trajectory-based method proposed in [15] reached 58.3% by combining trajectory, HOG, HOF and MBH descriptors. Our BoG approach based on STIP interest points (BoG+STIP) gives a mAP equal to 59.27%, which is better than the presented state-of-the-art methods [3, 4, 8, 9, 15].

To further prove the efficiency of the temporal graphs used by our BoG method, we have run it with the same setup on the Hollywood2 dataset using separately the HOG and the HOF features with the STIP detection method. The dense trajectory based method [15] gives 41.5% using the HOG features and 50.8% using the HOF features. Although dense trajectories were used in [15], we have obtained better results (42.4% using only the HOG features and 51.5% using only the HOF features). This can be explained by the fact that the trajectories in [15] are encoded by histograms, which can result in a loss of the temporal ordering information. In addition, a fixed trajectory size (15 frames) is imposed, which may perturb the recognition of actions. In our work, we have used only the most pertinent temporal sub-graphs mined by the gSpan algorithm. These sub-graphs (1) represent variable-size trajectories, (2) describe the temporal ordering of the STIP features in the trajectory and (3) integrate the displacement information of the trajectory.

As can be noticed from Table 4, our BoG approach based on STIP features (BoG+STIP) shows a minor improvement in comparison with Wang et al. [15] and Chakraborty et al. [9]. The slight improvement over Wang et al. [15] is due to their dense trajectory based method performing well for the actions presenting important background information (Eat, FightPerson, Kiss, Run, SitDown and StandUp). In comparison with Chakraborty et al. [9], the minor improvement can be explained by the detection of noisy interest points from the background, which
is caused by the STIP detection method [5] that we have used. In contrast, Chakraborty et al. [9] use an SSTIP based method in which only relevant points are selected to characterize the actions. Their method removes the STIPs detected in the noisy background. Consequently, a good recognition is obtained, especially for the actions which present a complex background (AnswerPhone, HandShake and SitUp).

In the second experiment, to test the impact of the SSTIP detection method, our proposed GHAR system is evaluated on the Hollywood2 dataset by integrating the SSTIP detection method in our BoG approach in place of the STIP detection one. We note in Table 4 that, in comparison with the BoG+STIP approach, the proposed BoG+SSTIP method gives a better mAP (60.94%). The improvements can be noticed essentially for the actions AnswerPhone, HandShake and SitUp. This confirms the efficiency of the SSTIP detection method for actions which present a large amount of background noise. On the contrary, for the actions Eat, GetOutCar, HugPerson, Kiss, Run and StandUp, the recognition rate decreases since the background plays an important role in their characterization.

In comparison with [15] and [9], our BoG+SSTIP method shows respectively an increase of 2.64% and 2.48% in terms of mAP. We achieve better results than [9] for all actions and than [15] for the majority of actions (AnswerPhone, DriveCar, GetOutCar, HandShake, HugPerson, SitDown and SitUp). In comparison with the other state-of-the-art methods [3, 8], we note that the BoG+SSTIP method gives better results for all actions.

4.3. Experiments on the UCF11 dataset

In the first experiment, our proposed GHAR system is evaluated on the UCF11 dataset using the STIP detection method. Similarly to the Hollywood2 dataset, cross-validation-based tests have been conducted to obtain the optimal values of kvs, kes and Nfs for the DSG, as well as kvt and Nft for the DTG. The optimal values are: kvs=2000, kes=1000, Nfs=5300, kvt=500 and Nft=3400.

Table 5 presents the results obtained using the frequent spatial sub-graphs, the frequent temporal sub-graphs and their combination. For this dataset, the frequent spatial sub-graph approach outperforms the frequent temporal sub-graph approach. This can be explained by the fact that this dataset is collected from videos "in the wild" with a large amount of background motion, which disturbs the characterization of action-related motions.

Table 5: Results on the UCF11 dataset given by our system using frequent spatial sub-graphs, frequent temporal sub-graphs and their combination

Action                 Frequent spatial sub-graphs   Frequent temporal sub-graphs   Combination
basketball shooting    48                            51                             55
biking/cycling         92.2                          85.2                           91.4
diving                 92                            95                             98
golf swinging          90                            93                             96
horseback riding       84                            80                             87
soccer juggling        71                            76                             75
swinging               87                            83                             90
tennis swinging        84                            79                             88
trampoline jumping     90                            92                             96
volleyball spiking     94                            87                             96
walking with a dog     84.6                          79.3                           87.3
Average accuracy       83.34                         81.86                          87.29

Furthermore, the frequent spatial sub-graphs help to distinguish between similar actions, such as biking/cycling and horseback riding, which present similar object motions and can then be more easily differentiated by the object shape (bicycle vs. horse). However, the frequent temporal sub-graphs give better results (81.86%) than some state-of-the-art methods (see Table 6). They are efficient especially for the actions that present high motion such as soccer juggling and diving. The combination of the frequent spatial and temporal sub-graphs is beneficial for recognition since it gives an average accuracy equal to 87.29%.

As shown in Table 6, our BoG+STIP method gives better results (average accuracy equal to 87.29%) compared to different state-of-the-art methods [4, 7, 9, 10, 11, 15]. Liu et al. [7] achieved an average accuracy of 71.2% using a combination of static and motion features and following the BoW approach to describe the actions. Also, Ikizler-Cinbis et al. [10] followed the BoW approach and reached 75.21% by combining features of people, objects and scenes. Le et al. [4] attain an average accuracy of 76.5% by learning multiple features using independent subspace analysis. Trajectory-based methods have given better results: Brendel et al. [11] obtained an average accuracy of 77.8% using time series of a few snapshots of human body parts, and [15] reached 84.2% using dense trajectories.

Table 6: Accuracy per action class on the UCF11 dataset for our method compared with the state-of-the-art methods. In Chakr. [9], the authors did not provide per-action results. STIP: Spatio-Temporal Interest Points proposed by Laptev in [22]; SSTIP: Selective Spatio-Temporal Interest Points proposed by Chakraborty et al. in [9]

Action                 Liu [7]   Ikizler. [10]   Le [4]   Brendel [11]   Wang [15]   Chakr. [9]   BoG+STIP   BoG+SSTIP
basketball shooting    53        48.48           86.9     60.1           43          -            55         61
biking/cycling         73        75.17           93       79.3           91.7        -            91.9       92.4
diving                 81        95              85       85.8           99          -            98         99.3
golf swinging          86        95              64       89.8           97          -            96         97.4
horseback riding       72        73              87       80.6           85          -            87         87.5
soccer juggling        54        53              76       59.3           76          -            75         78.7
swinging               57        66              46.5     61.7           88          -            90         90.6
tennis swinging        80        77              81       87.8           71          -            88         90.1
trampoline jumping     79        93              88       88.3           94          -            96         97.3
volleyball spiking     73.3      85              56       80.5           95          -            96         96.5
walking with a dog     75        66.67           78.1     82.7           87          -            87.3       89.9
Average accuracy       71.2      75.21           76.5     77.8           84.2        86.98        87.29      89.15

The method proposed by Chakraborty et al. [9] reached 86.98% using an SSTIP based BoW method. In the second experiment, our proposed GHAR system is evaluated on the UCF11 dataset using the SSTIP detection method. The BoG+SSTIP method outperforms [9] by 2.17% in terms of average accuracy. Besides, we can note in Table 6 that the BoG+SSTIP method gives better results for all actions in comparison with the BoG+STIP method and with [7], [10], [11] and [15]. This demonstrates that our BoG+SSTIP method copes well with the noisy background problem which characterizes the UCF11 dataset actions. It should be noticed that the "basketball shooting" action is not well classified by any of the methods since it contains a lot of background noise.

5. Conclusion

In this paper, we have presented a combination procedure merging spatio-temporal local features and graph-based video modeling for human action recognition. Graph-based video features are extracted to index the videos with a histogram of frequent spatial and temporal sub-graphs. Experimental
results on HAR benchmark datasets have shown the efficiency of our BoG method. An interesting option for our future work would be the combination of features to index the videos using our BoG approach. This combination can be conducted with the MKL method since it has given promising results for feature fusion [12, 18]. Moreover, we intend to apply the BoG approach to other applications such as video event recognition and detection.

References


[1] N. Ben Aoun, H. Elghazel, M.S. Hacid, C. Ben Amar, Graph aggregation based image modeling and indexing for video annotation, in: Proceedings of the 14th International Conference on Computer Analysis of Images and Patterns (CAIP'11), Part II, LNCS, vol. 6855, Springer-Verlag Berlin Heidelberg, 2011, pp. 324-331.

[2] N. Ben Aoun, H. Elghazel, C. Ben Amar, Graph modeling based video event detection, in: Proceedings of the 7th International Conference on Innovations in Information Technology (IIT'11), 2011, pp. 114-117.

[3] A. Gilbert, J. Illingworth, R. Bowden, Action Recognition using Mined Hierarchical Compound Features, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (5) (2011) 883-897.

[4] Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng, Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'11), 2011, pp. 3361-3368.

[5] I. Laptev, T. Lindeberg, Space-time interest points, in: Proceedings of the 9th International Conference on Computer Vision (ICCV'03), vol. 1, 2003, pp. 432-439.

[6] M. Marszalek, I. Laptev, C. Schmid, Actions in Context, in: Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'09), 2009, pp. 2929-2936.

[7] J. Liu, J. Luo, M. Shah, Recognizing realistic actions from videos "in the wild", in: Proceedings of the 2009 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR’09), 2009, pp. 1996-2003.

[8] M. Ullah, S. Parizi, I. Laptev, Improving bag-of-features action recognition with non-local cues, in: Proceedings of the British Machine Vision Conference (BMVC'10), 2010, pp. 95.1-95.11.

[9] B. Chakraborty, M.B. Holte, T.B. Moeslund, J. Gonzalez, Selective spatio-temporal interest points, Computer Vision and Image Understanding 116 (3) (2012) 396-410.

[10] N. Ikizler-Cinbis, S. Sclaroff, Object, scene and actions: Combining multiple features for human action recognition, in: Proceedings of the 11th European Conference on Computer Vision (ECCV'10), 2010, pp. 494-507.

[11] W. Brendel, S. Todorovic, Activities as time series of human postures, in: Proceedings of the 11th European Conference on Computer Vision (ECCV'10), 2010, pp. 721-734.

[12] J. Sun, X. Wu, S. Yan, L.-F. Cheong, T.-S. Chua, J. Li, Hierarchical spatio-temporal context modeling for action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'09), 2009, pp. 2004-2011.

[13] M. Bregonzio, J. Li, S. Gong, T. Xiang, Discriminative topics modelling for action feature selection and recognition, in: Proceedings of the British Machine Vision Conference (BMVC'10), 2010, pp. 8.1-8.11.

[14] P. Matikainen, M. Hebert, R. Sukthankar, Trajectons: Action recognition through the motion analysis of tracked features, in: Proceedings of the 12th International Conference on Computer Vision Workshops (ICCV'09), 2009, pp. 514-521.

[15] H. Wang, A. Klaser, C. Schmid, C.-L. Liu, Action Recognition by Dense Trajectories, in: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'11), 2011, pp. 3169-3176.

[16] A.P. Ta, C. Wolf, G. Lavoué, A. Baskurt, J.M. Jolion, Pairwise features for human action recognition, in: Proceedings of the 20th International Conference on Pattern Recognition (ICPR'10), 2010, pp. 3224-3227.

[17] Y. Song, L. Goncalves, P. Perona, Unsupervised learning of human motion, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (7) (2003) 814-827.

[18] A. Noguchi, K. Yanai, A SURF-based Spatio-Temporal Feature for Feature-fusion-based Action Recognition, in: Proceedings of the 11th European Conference on Computer Vision (ECCV'10), 2010, pp. 153-167.

[19] H. Bay, A. Ess, T. Tuytelaars, L.V. Gool, SURF: Speeded up robust features, Computer Vision and Image Understanding 110 (3) (2008) 346-359.

[20] O. Celiktutan, C. Wolf, B. Sankur, E. Lombardi, Real-Time Exact Graph Matching with Application in Human Action Recognition, in: Proceedings of the 3rd International Workshop on Human Behavior Understanding (HBU'12), LNCS, vol. 7559, Springer-Verlag Berlin Heidelberg, 2012, pp. 17-28.

[21] U. Gaur, Y. Zhu, B. Song, A. Roy-Chowdhury, A "string of feature graphs" model for recognition of complex activities in natural videos, in: Proceedings of the 13th International Conference on Computer Vision (ICCV'11), 2011, pp. 2595-2602.

[22] I. Laptev, On space-time interest points, International Journal of Computer Vision 64 (2-3) (2005) 107-123.

[23] I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning Realistic Human Actions from Movies, in: Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'08), 2008, pp. 1-8.

[24] H. Wang, M.M. Ullah, A. Klaser, I. Laptev, C. Schmid, Evaluation of local spatio-temporal features for action recognition, in: Proceedings of the British Machine Vision Conference (BMVC'09), 2009, pp. 124.1-124.11.

[25] N. Acosta-Mendoza, A. Gago-Alonso, J.E. Medina-Pagola, Frequent approximate subgraphs as features for graph-based image classification, Knowledge-Based Systems 27 (2012) 381-392.

[26] M. Kuramochi, G. Karypis, Frequent subgraph discovery, in: Proceedings of the IEEE International Conference on Data Mining (ICDM'01), 2001, pp. 313-320.

[27] X. Yan, J. Han, gSpan: graph-based substructure pattern mining, in: Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM'02), 2002, pp. 721-724.

[28] J. Huan, W. Wang, J. Prins, Efficient mining of frequent subgraphs in the presence of isomorphism, in: Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM'03), 2003, pp. 549-552.

[29] H. Elghazel, M. Hacid, Aggregated Search in Graph Databases: Preliminary Results, in: Proceedings of the 8th IAPR-TC-15 Workshop on Graph-based Representations in Pattern Recognition (GbRPR'11), LNCS, Springer-Verlag Berlin Heidelberg, 2011, pp. 92-101.

[30] C. Harris, M. Stephens, A combined corner and edge detector, in: Proceedings of the 4th Alvey Vision Conference, 1988, pp. 147-151.

[31] C. Grigorescu, N. Petkov, M.A. Westenberg, Contour and boundary detection improved by surround suppression of texture edges, Image and Vision Computing 22 (8) (2004) 609-622.

[32] A. Wali, N. Ben Aoun, H. Karray, C. Ben Amar, A.M. Alimi, A new system for event detection from video surveillance sequences, in: Proceedings of the International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS'10), Part II, LNCS, vol. 6475, Springer-Verlag Berlin Heidelberg, 2010, pp. 110-120.

[33] M. Sekma, M. Mejdoub, C. Ben Amar, Human Action Recognition Using Temporal Segmentation and Accordion Representation, in: Proceedings of the 15th International Conference on Computer Analysis of Images and Patterns (CAIP'13), Part II, LNCS, vol. 8048, Springer-Verlag Berlin Heidelberg, 2013, pp. 563-570.

[34] M. Dammak, M. Mejdoub, M. Zaied, C. Ben Amar, Feature vector approximation based on wavelet networks, in: Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART), 2012, pp. 394-399.

[35] M. Mejdoub, L. Fonteles, C. Ben Amar, M. Antonini, Fast indexing method for image retrieval using tree-structured lattices, in: Proceedings of the International Workshop on Content-Based Multimedia Indexing (CBMI'08), 2008, pp. 365-372.

[36] M. Mejdoub, C. Ben Amar, Classification improvement of local feature vectors over the KNN algorithm, Multimedia Tools and Applications 64 (1) (2013) 197-218.

- We represent the video with spatial and temporal sets of graphs.

- We extract frequent spatial and temporal sub-graphs from the spatial and the temporal graph databases.

- The video is indexed with a combination of a histogram of the frequent spatial sub-graphs and a histogram of the temporal sub-graphs.

- Our graph-based approach has shown its efficiency for human action recognition.