A51 Spatio Temporal Features1

Spatio-temporal features

• Actions are short, task-oriented body movements such as “waving a hand” or “drinking from a bottle”. Some actions are atomic, but the actions of interest often have a cyclic nature, such as “walking” or “running”.

• Activities involve multiple people or happen over longer timeframes. Activities are often the result of a combination of actions, like “taking money out of an ATM” or “waiting for a bus”.

• We often refer to an Event as a combination of activities, usually involving more people and happening in a given context, such as “a soccer match”, a “car accident” or a “fire in a wood”.

None of these are rigorous definitions.

Actions, Activities, Events

Computer vision grand challenge: video understanding

[Figure: video frames annotated with objects (glass, candle, person, people, car, building), scenes (indoors, outdoors, house, street, road, field, countryside) and actions (drinking, kidnapping, car enter, car crash, exit through a door).]

Objects: cars, glasses, people, etc.

Scene categories: indoors, outdoors, street scene, etc.

Actions: drinking, running, door exit, car enter, etc.

Geometry: street, wall, field, stair, etc.

[Diagram: constraints link the categories above.]

Credit

• A generic action recognition framework needs a representation robust enough to let classifiers concentrate on the truly discriminant spatio-temporal features, without being distracted by clutter or other irrelevant intra-class variations. Intra-class variation is due to many factors:

− Person appearance variation because of gender, clothing, body posture and size.
− Camera parameters, scene clutter and illumination.

• Camera motion needs to be either removed via motion compensation or handled by robust representations.

• A robust representation removes the nuisance factors (clothing, gender, illumination, scale, etc.) and preserves the variability of the body motion involved in different actions.

Requirements for action recognition

Action representation

• Actions can be described following different approaches:

− Holistic representations: each action is represented by a vector of features.
− Local representations: each action is represented with a set of feature vectors.
− Feature fusion/context modelling: each action is represented with a fusion of multiple diverse features that also represent the context of the action.

Holistic representation: Motion History Images

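• Perform image differencing to detect motion, possibly with background subtraction. Let D(x, y, t) be the resulting binary motion mask:

− The Motion Energy Image (MEI) is a binary image defined as follows:

$$E_\tau(x, y, t) = \bigcup_{i=0}^{\tau-1} D(x, y, t-i)$$

it describes WHERE the motion happens.

− The Motion History Image (MHI) is a real-valued image defined as follows:

$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\bigl(0,\; H_\tau(x, y, t-1) - 1\bigr) & \text{otherwise} \end{cases}$$

it describes HOW the motion happens.

A.F. Bobick and J.W. Davis, IEEE TPAMI 2001

[Figure: example MHI.]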
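The MEI/MHI update can be sketched in a few lines of NumPy. This is only an illustration: the motion mask is assumed to come from thresholded frame differencing, and `motion_images` and `diff_threshold` are hypothetical names, not part of the original method description.

```python
import numpy as np

def motion_images(frames, tau, diff_threshold=15):
    """Return (MEI, MHI) at the last frame of a grayscale sequence.

    frames: array of shape (T, H, W); tau: temporal window in frames.
    diff_threshold is an assumed, hypothetical motion threshold.
    """
    frames = np.asarray(frames, dtype=np.float64)
    mhi = np.zeros(frames.shape[1:], dtype=np.float64)
    for t in range(1, len(frames)):
        # Binary motion mask D(x, y, t) from simple frame differencing.
        moving = np.abs(frames[t] - frames[t - 1]) > diff_threshold
        # Moving pixels are reset to tau; the others decay by one, down to 0.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    # A pixel is in the MEI if it moved within the last tau frames.
    mei = mhi > 0
    return mei, mhi
```

Recent motion is bright in the MHI and older motion fades, which is why the MHI encodes the direction of the movement while the MEI only encodes its extent.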

• A Motion History Image can be described compactly through image moments of different orders:

Motion History Image descriptors

Hu moments: recall

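• Given a distribution ρ(x, y) (image intensity), the moments of order (p, q) are defined as:

$$m_{pq} = \iint x^p y^q \rho(x, y)\, dx\, dy$$

• Central moments are translation invariant and are defined in terms of the moments:

$$\mu_{pq} = \iint (x - \bar{x})^p (y - \bar{y})^q \rho(x, y)\, dx\, dy, \qquad \bar{x} = m_{10}/m_{00}, \quad \bar{y} = m_{01}/m_{00}$$

• In order to obtain rotational invariance, central moments are combined into the Hu moments. Dividing each μ_pq by μ_00^(1+(p+q)/2) beforehand also gives scale invariance.

• The first four Hu moments are defined as:

$$\begin{aligned}
\phi_1 &= \mu_{20} + \mu_{02} \\
\phi_2 &= (\mu_{20} - \mu_{02})^2 + 4\mu_{11}^2 \\
\phi_3 &= (\mu_{30} - 3\mu_{12})^2 + (3\mu_{21} - \mu_{03})^2 \\
\phi_4 &= (\mu_{30} + \mu_{12})^2 + (\mu_{21} + \mu_{03})^2
\end{aligned}$$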
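A minimal NumPy sketch of the first four Hu moments follows. It uses the standard textbook formulas on scale-normalized central moments; the function name `hu_moments_first_four` is hypothetical, and the image is assumed to have non-zero total intensity.

```python
import numpy as np

def hu_moments_first_four(image):
    """First four Hu moments of a 2D intensity image (e.g. an MHI)."""
    rho = np.asarray(image, dtype=np.float64)
    y, x = np.mgrid[: rho.shape[0], : rho.shape[1]].astype(np.float64)
    m00 = rho.sum()
    # Centroid, from the first-order raw moments.
    xc, yc = (x * rho).sum() / m00, (y * rho).sum() / m00

    def eta(p, q):
        # Central moment mu_pq, normalized for scale invariance.
        mu = ((x - xc) ** p * (y - yc) ** q * rho).sum()
        return mu / m00 ** (1 + (p + q) / 2)

    h1 = eta(2, 0) + eta(0, 2)
    h2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    h3 = (eta(3, 0) - 3 * eta(1, 2)) ** 2 + (3 * eta(2, 1) - eta(0, 3)) ** 2
    h4 = (eta(3, 0) + eta(1, 2)) ** 2 + (eta(2, 1) + eta(0, 3)) ** 2
    return np.array([h1, h2, h3, h4])
```

Shifting the shape inside the image or rotating it by 90 degrees leaves the four values unchanged up to floating-point error, which is a quick sanity check for any implementation.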

Aerobics dataset (18 moves): a nearest-neighbour (NN) classifier using the Mahalanobis distance achieves 66% accuracy.

Example
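A nearest-neighbour rule with the Mahalanobis distance could look like the sketch below. It assumes a single covariance pooled over all training descriptors; the slides do not say how the covariance is estimated, and `mahalanobis_nn` is a hypothetical name.

```python
import numpy as np

def mahalanobis_nn(train_x, train_y, query):
    """Label of the training vector closest to `query` in Mahalanobis distance.

    train_x: (N, D) array of descriptors (D >= 2), train_y: N labels.
    """
    train_x = np.asarray(train_x, dtype=np.float64)
    # Assumption: one covariance pooled over the whole training set.
    # pinv guards against a singular matrix when features are correlated.
    inv_cov = np.linalg.pinv(np.cov(train_x, rowvar=False))
    diff = train_x - np.asarray(query, dtype=np.float64)
    # Squared Mahalanobis distance of the query to every training vector.
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return train_y[int(np.argmin(d2))]
```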

Holistic approaches summary

• Simple and fast solution: works very well in controlled settings.

• Prone to errors of background subtraction.

Variations in light, shadows, clothing… What is the background here?

• Does not capture interior motion and shape. Silhouette tells little about actions.

Space-time local features

• A more effective approach is to extract local features at space-time interest points and encode the temporal information directly into the local feature. This results in spatio-temporal local features that embed space and time jointly. In this case:

− Videos are considered as volumes of pixels.
− Spatio-temporal features are located at spatio-temporal salient points that are extracted with interest point operators.
− As in the 2D case, the interest point structures sought are those that are stable under rotation, viewpoint, scale and illumination changes.

• Space time interest point detectors are extensions of 2D interest point detectors that incorporate temporal information.

• Detectors:
− STIP, Spatio-Temporal Interest Points (Harris3D) [I. Laptev, IJCV 2005]
− Dollar’s detector [P. Dollar et al., VS-PETS 2005]
− Hessian3D [G. Willems et al., ECCV 2008]
− Regular sampling [H. Wang et al., BMVC 2009]

• Descriptors:
− HOG/HOF [I. Laptev et al., CVPR 2008]
− Dollar [P. Dollar et al., VS-PETS 2005]
− HoG3D [A. Klaeser et al., BMVC 2008]
− Extended SURF [G. Willems et al., ECCV 2008]

Most popular solutions

STIP: Spatio Temporal Interest Points

• Spatio-temporal Interest Points (STIP) were proposed by I. Laptev in 2005. They are based on the detection of spatio-temporal corners.

• Spatio-temporal corners are located in regions that exhibit a high variation of image intensity in all three directions (x, y, t). This requires that spatio-temporal corners are located at spatial corners whose motion inverts between two consecutive frames (high temporal gradient variation).

• They are identified from local maxima of a cornerness function computed for all pixels across spatial and temporal scales.

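STIP Detector

• The Harris corner operator is extended to time:

− Represent the video as a function f(x, y, t).

− Compute Gaussian derivatives with a kernel g whose covariance is Σ = diag(σ², σ², τ²), with spatial scale σ and temporal scale τ. For each single scale pair (σ, τ), Gaussian derivatives are computed for each pixel p. The space-time gradient is obtained as:

$$L = g(\cdot;\, \sigma^2, \tau^2) * f, \qquad \nabla L = (L_x, L_y, L_t)^T$$

− Extract interest points by evaluating the distribution of ∇L within a local neighbourhood. The second-moment matrix μ measures the variation of gradients (s > 1 sets the integration scale):

$$\mu = g(\cdot;\, s\sigma^2, s\tau^2) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix}$$

High variation of ∇L implies large eigenvalues of μ.

− Spatio-temporal corners are obtained from the local maxima of H over (x, y, t):

$$H = \det(\mu) - k\,\mathrm{trace}^3(\mu) = \lambda_1 \lambda_2 \lambda_3 - k(\lambda_1 + \lambda_2 + \lambda_3)^3$$

This is similar to the Harris operator: λ1, λ2, λ3 are the eigenvalues of μ and k is a small constant. In 3D, k must stay below 1/27 for H to be positive at a corner; Laptev uses k ≈ 0.005.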
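The cornerness function can be sketched with NumPy alone. This is a rough illustration under stated assumptions, not Laptev's implementation: `smooth` and `harris3d` are hypothetical names, the integration factor `s = 2` and the default scales are arbitrary, and k = 0.005 is the value reported in Laptev's paper.

```python
import numpy as np

def smooth(volume, sigmas):
    """Separable Gaussian smoothing of a (T, H, W) volume, one sigma per axis."""
    out = np.asarray(volume, dtype=np.float64)
    for axis, s in enumerate(sigmas):
        r = int(np.ceil(3 * s))
        k = np.exp(-np.arange(-r, r + 1) ** 2 / (2.0 * s**2))
        k /= k.sum()
        # Replicate the border so the output keeps the input shape.
        pad = [(r, r) if a == axis else (0, 0) for a in range(out.ndim)]
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(np.convolve, axis, padded, k, mode="valid")
    return out

def harris3d(video, sigma=1.5, tau=1.5, s=2.0, k=0.005):
    """Cornerness H = det(mu) - k * trace(mu)^3 at every voxel."""
    L = smooth(video, (tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)  # axes are ordered (t, y, x)
    grads = (Lx, Ly, Lt)
    # Integration scale: variances s*sigma^2 and s*tau^2.
    integ = (np.sqrt(s) * tau, np.sqrt(s) * sigma, np.sqrt(s) * sigma)
    mu = np.empty(L.shape + (3, 3))
    for i in range(3):
        for j in range(i, 3):
            mu[..., i, j] = mu[..., j, i] = smooth(grads[i] * grads[j], integ)
    return np.linalg.det(mu) - k * np.trace(mu, axis1=-2, axis2=-1) ** 3
```

A video that is constant in time has a zero temporal gradient, so det(μ) vanishes and H is never positive. An isolated space-time blob gives three comparable eigenvalues and a positive H at its centre. Interest points would then be taken as local maxima of the returned volume.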

• Scale invariance is obtained by selecting space-time locations at their characteristic scale. The normalized Laplacian is able to select this scale for Harris corners.

• Scale selection algorithm:
− Detect space-time corners for a sparse combination of spatial and temporal scales (σi, τj).
− For each point detected at (x, y, t, σi, τj), compute the normalized Laplacian at that location and at the neighbouring scales (x, y, t, 2^δ σi, 2^δ τj), with δ = -0.25, 0, +0.25.
− Select the scales (σi, τj) that maximize the normalized Laplacian.

Scale selection in space and time

[Figure: examples of scale selection for boxing and hand waving.]

STIP summary

• Derived from the 2D Harris corner detector.

• Maxima of H correspond to:
− spatial corners inverting motion
− joining/splitting structures

• It is very robust but sparse.

• Scale selection is computationally expensive.

Dollar’s periodic motion detector

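$$R = (I * g * h_{ev})^2 + (I * g * h_{od})^2$$

$$g(x, y; \sigma) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/2\sigma^2}$$

$$h_{ev}(t; \tau, \omega) = -\cos(2\pi t \omega)\, e^{-t^2/\tau^2}, \qquad h_{od}(t; \tau, \omega) = -\sin(2\pi t \omega)\, e^{-t^2/\tau^2}, \qquad \omega = 4/\tau$$

$$\sigma = 2, 4, 8, \ldots \qquad \tau = 2, 4, 8, \ldots$$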

• The spatio-temporal detector proposed by Dollar treats time and space differently. It attempts to solve the excessive sparseness of the interest points of Laptev’s detector, which is due to the rarity of true space-time corners and to the scale-selection process.

• Dollar’s detector obtains a denser sampling by avoiding scale selection, and uses a Gaussian filter in space and a Gabor filter in time.

− The Gaussian filter performs spatial scale (σ) selection by smoothing each frame.
− The Gabor band-pass filter gives high responses to periodic variations of the signal.

• The interest point response R sums the squared responses of the even and odd temporal Gabor filters applied to the spatially smoothed video.
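The response R can be sketched as follows. `filter_axis` and `dollar_response` are hypothetical names, and the kernel radii (3σ and 2τ) are assumptions. Note that with t counted in frames, ω = 4/τ is below the Nyquist limit of 0.5 cycles per frame only when τ > 8, so the sketch also accepts an explicit `omega`.

```python
import numpy as np

def filter_axis(volume, kernel, axis):
    """Convolve along one axis, zero-padded so the shape is preserved."""
    r = len(kernel) // 2
    pad = [(r, r) if a == axis else (0, 0) for a in range(volume.ndim)]
    padded = np.pad(volume, pad)  # zero padding
    return np.apply_along_axis(np.convolve, axis, padded, kernel, mode="valid")

def dollar_response(video, sigma=2.0, tau=16.0, omega=None):
    """R = (I*g*h_ev)^2 + (I*g*h_od)^2 for a (T, H, W) grayscale video."""
    I = np.asarray(video, dtype=np.float64)
    # Spatial Gaussian g, applied separably to the rows and columns of each frame.
    r = int(np.ceil(3 * sigma))
    g = np.exp(-np.arange(-r, r + 1) ** 2 / (2.0 * sigma**2))
    g /= g.sum()
    for axis in (1, 2):
        I = filter_axis(I, g, axis)
    # Temporal Gabor quadrature pair; omega defaults to 4 / tau.
    omega = 4.0 / tau if omega is None else omega
    rt = int(np.ceil(2 * tau))
    t = np.arange(-rt, rt + 1)
    envelope = np.exp(-t**2 / tau**2)
    h_ev = -np.cos(2 * np.pi * t * omega) * envelope
    h_od = -np.sin(2 * np.pi * t * omega) * envelope
    return filter_axis(I, h_ev, 0) ** 2 + filter_axis(I, h_od, 0) ** 2
```

A pixel whose intensity oscillates at frequency ω produces a large R, while a static pixel produces a response close to zero. Interest points are the local maxima of R.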

Multiple scales in space and time can be used in order to increase the number of interest points selected and to represent space-time structures at different scales.

• The spatial scale refers to the size of the moving object:
− we can detect the same event observed at different distances
− we are able to select events of different spatial sizes (e.g. head, legs)

• The temporal scale refers to the speed at which the object moves:
− we are able to detect the same event performed at a different speed
− we are able to detect the proper scale for different events

Importance of multiple scales

Consider a walking person: at a certain scale only the torso motion is detected, although leg and arm movements are undoubtedly more informative.

Detector response (large scale) Detector response (small scale)

Red denotes a high detector response at a given space and time.

Dollar’s detector summary

• Separates time and space filters.

• It is a band-pass filter in time.

• It is denser than Harris 3D.

• There is no scale selection: dense scale sampling can be used instead.

Hessian 3D detector

• It is conceptually derived from SURF extended to time: it uses box filters and integral videos for speed.

• It is faster and denser than Harris 3D but less dense than Dollar’s detector.

• It performs scale selection, but by scaling the filter rather than the image.

Space-time feature detectors

[Figure: interest points detected by Harris3D, Hessian3D, Dollar’s detector and dense sampling.]

• At each spatio-temporal interest point, descriptors are defined taking into account the volume of the cuboid neighbourhood. The size of the cuboid is obtained from the scale as (kσ) × (kσ) × (k′τ), with k a suitable constant, typically equal to 6.

• Descriptors of the volume are computed with a common framework:
− Preprocessing: volumes are smoothed with a 3D Gaussian kernel.
− Spatio-temporal pooling: the volume is subdivided into a number of smaller cuboid volumes (e.g. 3x3x2 cuboids).
− Feature computation (for each pixel a function or a transformation is computed in order to obtain invariance to illumination and rotation) followed by feature quantization (histograms of the computed features are accumulated).

Descriptors for spatio-temporal patches
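Cutting the cuboid around an interest point can be sketched as below. `extract_cuboid` is a hypothetical helper; using the same default of 6 for the temporal constant `k_t` is an assumption, and the cuboid is simply clipped at the video borders.

```python
import numpy as np

def extract_cuboid(video, point, sigma, tau, k=6, k_t=6):
    """Cut a (k_t*tau) x (k*sigma) x (k*sigma) cuboid centred on `point`.

    video: (T, H, W) array; point: (t, y, x) integer coordinates.
    """
    video = np.asarray(video)
    sizes = (int(round(k_t * tau)), int(round(k * sigma)), int(round(k * sigma)))
    slices = []
    for centre, size, limit in zip(point, sizes, video.shape):
        start = centre - size // 2
        # Clip the cuboid where it would leave the video volume.
        slices.append(slice(max(start, 0), min(start + size, limit)))
    return video[tuple(slices)]
```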

HOG, HOF descriptors

• Widely used representations are:
− Histogram of 3D gradient orientations (HOG), based on the space-time derivatives of pixel values. It models the appearance.
− Histogram of Optical Flow magnitude and orientation (HOF). It models the motion.

• They obtain better performance since they represent the dynamic content of the cuboid volume.

3D Gradient (HoG)

• The 3D gradient is computed at each pixel by differentiating the image function I(x, y, t) → R (three channels are obtained):

− Gx (x,y,t) = I(x+1, y, t) - I(x-1, y, t)

− Gy (x,y,t) = I(x, y+1, t) - I(x, y-1, t)

− Gt (x,y,t) = I(x, y, t+1) - I(x, y, t-1)

• The gradient is represented by its magnitude M and its orientations θ and φ.

Orientations are quantized similarly to SIFT, but in 3D there is a normalization issue: with a uniform quantization of θ and φ, bins near the “equator” cover larger solid angles and therefore weigh more than bins near the “poles”.

$$M = \sqrt{G_x^2 + G_y^2 + G_t^2}$$

$$\theta = \tan^{-1}(G_y / G_x)$$

$$\phi = \tan^{-1}\left(G_t \big/ \sqrt{G_x^2 + G_y^2}\right)$$

3D Gradient
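The per-pixel gradient, magnitude and orientations can be sketched in NumPy as below. `gradient_3d` is a hypothetical name; border voxels are left at zero, and `arctan2` is used in place of a plain arctangent so that θ covers the full circle.

```python
import numpy as np

def gradient_3d(volume):
    """Magnitude M and orientations theta, phi of a (T, H, W) volume."""
    I = np.asarray(volume, dtype=np.float64)
    Gx, Gy, Gt = np.zeros_like(I), np.zeros_like(I), np.zeros_like(I)
    # Central differences; border voxels keep a zero gradient.
    Gx[:, :, 1:-1] = I[:, :, 2:] - I[:, :, :-2]
    Gy[:, 1:-1, :] = I[:, 2:, :] - I[:, :-2, :]
    Gt[1:-1, :, :] = I[2:, :, :] - I[:-2, :, :]
    M = np.sqrt(Gx**2 + Gy**2 + Gt**2)
    theta = np.arctan2(Gy, Gx)                # in-plane orientation
    phi = np.arctan2(Gt, np.hypot(Gx, Gy))    # elevation towards the time axis
    return M, theta, phi
```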

Solution 1): Weight the orientation bins with the inverse of the solid angle.

Solution 2): Use “platonic solids” located at the centre of each cuboid subvolume to quantize the gradient orientation (platonic solids have congruent faces, i.e. the angles corresponding to the faces are all equal) and perform quantization by projecting gradient vectors onto the normals of the solid faces.

Solution 3): Quantize the two orientations separately: this avoids rescaling the bins and keeps the histograms dense (the simplest solution).

[Figure: solution 2 projects the gradient vector, jointly characterized by θ and φ; solution 3 computes histograms of θ and φ separately.]
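Solution 3 combined with spatio-temporal pooling can be sketched as follows. The bin counts (8 for θ, 4 for φ), the magnitude weighting and the final L2 normalization are assumptions, and both function names are hypothetical.

```python
import numpy as np

def cell_histograms(M, theta, phi, n_theta=8, n_phi=4):
    """Magnitude-weighted histograms of theta and phi, quantized separately."""
    h_theta, _ = np.histogram(
        theta, bins=n_theta, range=(-np.pi, np.pi), weights=M
    )
    h_phi, _ = np.histogram(
        phi, bins=n_phi, range=(-np.pi / 2, np.pi / 2), weights=M
    )
    return np.concatenate([h_theta, h_phi])

def cuboid_descriptor(M, theta, phi, grid=(2, 3, 3)):
    """Concatenate cell histograms over a (nt, ny, nx) grid of sub-cuboids."""
    parts = []
    for idx in np.ndindex(*grid):
        # Slice selecting one sub-cuboid along (t, y, x).
        sl = tuple(
            slice(i * n // g, (i + 1) * n // g)
            for i, n, g in zip(idx, M.shape, grid)
        )
        parts.append(cell_histograms(M[sl], theta[sl], phi[sl]))
    d = np.concatenate(parts)
    return d / (np.linalg.norm(d) + 1e-12)
```

With the default 3x3x2 grid and 8 + 4 bins per cell, the descriptor has 18 × 12 = 216 dimensions.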

• Optical flow measures the apparent motion of a pixel between two frames. If the camera is still, it corresponds to the movement of objects in the world projected onto the image plane. In the case of ego-motion, the information carried by the optical flow may be misleading.

• Several methods have been proposed (Horn and Schunck 1981, Lucas and Kanade 1981). They assume that the intensity of a point does not change significantly from one frame to the next because of illumination. Variations of intensity are therefore exploited to compute pixel velocities. Aperture problem: a vertical edge moving vertically produces null optical flow.

• Optical flow is represented by quantizing the orientation of the velocity vector with components Vx, Vy. A no-motion bin is usually added.

$$M = \sqrt{V_x^2 + V_y^2}$$

$$\theta = \tan^{-1}(V_y / V_x)$$

Optical flow

Optical flow (HoF)
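Given a flow field that has already been computed, the HOF histogram of one cell can be sketched as below. Using four orientation bins, plain counts and a `min_speed` threshold for the no-motion bin are assumptions, and `hof_histogram` is a hypothetical name.

```python
import numpy as np

def hof_histogram(Vx, Vy, n_bins=4, min_speed=0.5):
    """Orientation histogram of optical flow; the last bin counts no-motion."""
    Vx = np.asarray(Vx, dtype=np.float64).ravel()
    Vy = np.asarray(Vy, dtype=np.float64).ravel()
    M = np.hypot(Vx, Vy)
    theta = np.arctan2(Vy, Vx)  # in [-pi, pi]
    moving = M >= min_speed
    # Map the orientation to one of n_bins equal sectors of the circle.
    sector = (theta[moving] + np.pi) / (2 * np.pi) * n_bins
    bins = np.floor(sector).astype(int) % n_bins
    hist = np.zeros(n_bins + 1)
    hist[:n_bins] = np.bincount(bins, minlength=n_bins)
    hist[n_bins] = np.count_nonzero(~moving)
    return hist
```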

E-SURF descriptor

• The 3D cuboid is divided into cells.

• Bins are filled with weighted sums of the responses of the axis-aligned Haar wavelets dx, dy, dt.

• Sums of absolute values are not included (unlike in 2D SURF) since they do not improve performance.

[Figure: PCA basis, first 100 eigenvectors.]

• Learn a PCA basis from the gradients of cuboids.

• Project the gradients of a cuboid’s pixels onto the first 100 principal components to get the descriptor.
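The two steps can be sketched with an SVD. Both function names are hypothetical, and each cuboid gradient is assumed to be flattened into one row vector of fixed length.

```python
import numpy as np

def learn_pca_basis(samples, n_components=100):
    """Rows of `samples` are flattened cuboid gradients; returns (mean, basis)."""
    X = np.asarray(samples, dtype=np.float64)
    mean = X.mean(axis=0)
    # Right singular vectors of the centred data are the principal
    # directions, already sorted by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_descriptor(gradient_vector, mean, basis):
    """Project one flattened cuboid gradient onto the learned basis."""
    return basis @ (np.asarray(gradient_vector, dtype=np.float64) - mean)
```

The basis is learned once from training cuboids and then reused, so computing a descriptor at test time is a single matrix-vector product.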
