
Bags of Spacetime Energies for Dynamic Scene Recognition

Christoph Feichtenhofer¹,²  Axel Pinz¹  Richard P. Wildes²

¹ Institute of Electrical Measurement and Measurement Signal Processing, TU Graz, Austria
² Department of Electrical Engineering and Computer Science, York University, Toronto, Canada

{feichtenhofer, axel.pinz}@tugraz.at [email protected]

Abstract

This work presents a unified bag of visual words (BoW) framework for dynamic scene recognition. The approach builds on primitive features that uniformly capture spatial and temporal orientation structure of the imagery (e.g., video), as extracted via application of a bank of spatiotemporally oriented filters. A novel approach to adaptive pooling of the encoded features is presented that captures the spatial layout of the scene even while being robust to situations where camera motion and scene dynamics are confounded. The resulting overall approach has been evaluated on two standard, publicly available dynamic scene datasets. The results show that the proposed approach outperforms the previous state-of-the-art in classification accuracy by 10%.

1. Introduction

Use of temporal image sequences (e.g., video) to support scene recognition has the potential to improve performance over single images owing to the additional information available, e.g., regarding scene dynamics. A key challenge in exploiting such information is disentangling intrinsic scene dynamics from camera motion incurred during capture. This paper proposes a novel approach to dynamic scene recognition, Bags of Spacetime Energies (BoSE), within the framework indicated in Fig. 1. The approach combines primitive features based on local measurements of spatiotemporal orientation, careful selection of encoding technique and a dynamic pooling strategy that in empirical evaluation outperforms the previous state-of-the-art in dynamic scene recognition by a significant margin.

2. Bags of Spacetime Energies

Figure 1. Proposed BoSE representation. First, spatiotemporally oriented primitive features are extracted from a temporal subset of the video. Second, features are encoded into a mid-level representation and also steered to extract dynamic pooling energies. Third, the encoded features are pooled via a novel dynamic spacetime pyramid that adapts to the temporal image structure, as guided by the pooling energies. The pooled encodings are concatenated into vectors that serve as the representation for online recognition.

Primitive feature extraction. The underlying descriptor is based on spatiotemporal orientation measurements that jointly capture spacetime image appearance, dynamics and colour information at multiple scales [1]. To extract the representation, input imagery is filtered with oriented 3D Gaussian third-derivative filters G^(3)_3D(θ_i, σ_j). The responses are pointwise squared and blurred to yield oriented spacetime energy measurements

    E(x; θ_i, σ_j) = G_3D(σ_j) ∗ |G^(3)_3D(θ_i, σ_j) ∗ V(x)|²,    (1)

where G_3D is a 3D Gaussian, x = (x, y, t)⊤ are spacetime coordinates, V is the grayscale spacetime volume formed by stacking all frames in a sequence along the temporal axis and ∗ denotes convolution. To achieve invariance to multiplicative contrast variation, the responses, (1), are normalized with respect to the sum of all filter responses at a point. Finally, colour information is incorporated into the present spacetime primitives via the addition of three smoothed LUV colour measurements.
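To make the construction concrete, the following is a minimal sketch of the energy computation in Eq. (1), using axis-aligned Gaussian third derivatives as a stand-in for the steerable, truly oriented filters of the full approach; the function, sigma value and array shapes are illustrative assumptions, not the authors' implementation.

import numpy as np
from scipy.ndimage import gaussian_filter

def oriented_energies(video, sigma=2.0, eps=1e-8):
    """video: grayscale spacetime volume V(x) of shape (T, H, W)."""
    # Third-derivative filters along each spacetime axis (t, y, x);
    # the paper instead steers G^(3)_3D to a set of orientations theta_i.
    orders = [(3, 0, 0), (0, 3, 0), (0, 0, 3)]
    energies = []
    for order in orders:
        response = gaussian_filter(video, sigma=sigma, order=order)
        # Pointwise squaring followed by Gaussian blurring, as in Eq. (1).
        energies.append(gaussian_filter(response ** 2, sigma=sigma))
    energies = np.stack(energies)  # (N_orientations, T, H, W)
    # Normalize by the sum of all filter responses at each point for
    # invariance to multiplicative contrast variation.
    return energies / (energies.sum(axis=0, keepdims=True) + eps)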

Coding. A variety of different coding procedures exist to convert primitive local features, v(x) ∈ R^D, into more effective intermediate-level representations, c(x) ∈ R^K, for classification purposes, and the choice can significantly impact performance. In the present case, the primitive features are given in terms of the feature vector constructed in the previous subsection. To best mediate between these primitives and dynamic scene classification, a systematic empirical evaluation of a representative set of four contemporary coding techniques has been performed: vector quantization (VQ), locality-constrained linear coding (LLC), Fisher vectors (FV) and their improved version with power- and ℓ2-normalization (IFV).
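As an illustration of two of these encodings, the sketch below shows hard-assignment vector quantization and the IFV-style power- and ℓ2-normalization; the codebook and dimensions are hypothetical (e.g., a k-means codebook learned offline), and this is a simplified sketch rather than the evaluated implementations.

import numpy as np

def vq_encode(features, codebook):
    """Hard-assignment VQ: features (N, D) -> (K,) word histogram."""
    # Distance of every feature to every codeword, then nearest assignment.
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=codebook.shape[0]).astype(float)

def ifv_normalize(fisher_vector, alpha=0.5):
    """IFV post-processing: signed power-normalization, then ℓ2-normalization."""
    v = np.sign(fisher_vector) * np.abs(fisher_vector) ** alpha
    return v / (np.linalg.norm(v) + 1e-8)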

Dynamic pooling. When pooling the encoded features, c(x), from dynamic scenes, those that significantly change their spatial location across time should be pooled adaptively in a correspondingly dynamic fashion. In contrast, features that retain their image positions over time (i.e., static patterns) can be pooled within finer, predefined grids, e.g., as with standard spatial pyramid matching (SPM). Indeed, even highly dynamic features that retain their overall spatial position across time (i.e., temporally stochastic patterns, such as fluttering leaves on a bush and other dynamic textures) can be pooled with fine grids. Thus, it is not simply the presence of image dynamics that should relax finely gridded pooling, but rather the presence of larger scale coherent motion (e.g., as encountered with global camera motion).

In response to the above observations, a set of dynamic pooling energies, E_D, have been derived that favour orderless pooling (global aggregation) when coarse scale image motion dominates and spatial pooling (as in an SPM scheme) when a visual word is static or its motion is stochastic but otherwise not changing in overall spatial position. These energies are used as pooling weights applied to the locally encoded features so that they can be pooled in an appropriate fashion.

In the present context, a set of dynamic energies representing coherent image motion in 4 equally spaced directions (horizontal (r−l), vertical (u−d) and two diagonals (ru−ld and lu−rd)), as well as a static channel (s+ε) that indicates lack of coarse motion, are employed. The dynamic pooling energies for a temporal subset of a street sequence are shown in Figure 2.
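The following is a hedged sketch of how such pooling weights might be applied: each local code is weighted by its static energy within fine SPM cells, while motion-dominated codes contribute to a global, orderless sum. The 2×2 grid, the weighting scheme and the function name are illustrative assumptions rather than the paper's exact dynamic spacetime pyramid.

import numpy as np

def dynamic_pool(codes, e_static, grid=(2, 2)):
    """codes: (H, W, K) encoded features; e_static: (H, W) static energy in [0, 1]."""
    H, W, K = codes.shape
    # Global, orderless pooling weighted by (coherent) motion energy.
    pooled = [(codes * (1.0 - e_static)[..., None]).sum(axis=(0, 1))]
    gy, gx = grid
    for i in range(gy):
        for j in range(gx):
            # Fine gridded pooling weighted by the static energy channel.
            cell = codes[i*H//gy:(i+1)*H//gy, j*W//gx:(j+1)*W//gx]
            w = e_static[i*H//gy:(i+1)*H//gy, j*W//gx:(j+1)*W//gx, None]
            pooled.append((cell * w).sum(axis=(0, 1)))
    return np.concatenate(pooled)  # final representation vector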

3. Experimental evaluation

The proposed Bags of Spacetime Energies (BoSE) system is evaluated on the Maryland [4] and YUPENN [1] dynamic scene recognition datasets. In Table 1 the full proposed BoSE system is compared to several others that previously have shown best performance: GIST + histograms of flow (HOF), GIST + chaotic dynamic features (Chaos) [4], spatiotemporal oriented energies (SOE) [1], slow feature analysis (SFA) [5]¹ and complementary spacetime orientation (CSO) features [3]. For both datasets, BoSE performs considerably better than the previous state-of-the-art, CSO [3], with an improvement of 10% or better.

¹ See corrected results at http://webia.lip6.fr/~theriaultc/sfa.html

Figure 2. Distribution of Spatiotemporal Oriented Pooling Energies of a Street Sequence from the YUPENN Dataset. (a), (b), (c) and (d) show the first 8, middle 8, last 9 and centre frame of the filter support region. (e)-(i) show the decomposition of the sequence into a distribution of spacetime energies: the static channel E_D|s+ε|(x) in (e) indicates stationarity/homogeneity, while the channels E_D|r−l|(x), E_D|u−d|(x), E_D|ru−ld|(x) and E_D|lu−rd|(x) in (f)-(i) indicate coarse coherent motion in the respective directions. Hotter colours (e.g., red) correspond to larger filter responses.

Approach    HOF+GIST  Chaos+GIST  SOE    SFA    CSO    BoSE
Maryland    33.08     58.46       43.08  60.00  67.69  77.69
YUPENN      68.33     22.86       80.71  85.48  85.95  96.19

Table 1. Classification accuracy (%) for different representations.

BoSE is able to best represent the videos by modelling the visual words with locally encoded spacetime energies that are pooled based on their dynamics.

4. Conclusion

The insights of this work have an impact on the design of dynamic scene classification approaches, as they significantly extend the state-of-the-art. More generally, the outstanding performance of the presented spacetime recognition framework suggests application to a variety of other areas such as video retrieval or indexing. Details of the approach are documented in the full version of the paper [2] at http://www.cse.yorku.ca/vision/publications/FeichtenhoferPinzWildesCVPR2014.pdf.

References

[1] K. Derpanis, M. Lecce, K. Daniilidis, and R. P. Wildes. Dynamic scene understanding: The role of orientation features in space and time in scene classification. In CVPR, 2012.
[2] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Bags of spacetime energies for dynamic scene recognition. In CVPR, 2014.
[3] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spacetime forests with complementary features for dynamic scene recognition. In BMVC, 2013.
[4] N. Shroff, P. Turaga, and R. Chellappa. Moving vistas: Exploiting motion for describing scenes. In CVPR, 2010.
[5] C. Theriault, N. Thome, and M. Cord. Dynamic scene classification: Learning motion descriptors with slow features analysis. In CVPR, 2013.