Space-time interest points
Computational Vision and Active Perception Laboratory (CVAP)
Dept of Numerical Analysis and Computer Science
KTH (Royal Institute of Technology)
SE-100 44 Stockholm, Sweden
Ivan Laptev and Tony Lindeberg
General motivation
Spatio-temporal image data contains rich information about the external world.
Traditional methods for video analysis include
• optical flow estimation;
• tracking of features/models over time.
Observation:
Events in video are often characterised by non-constant motion and non-constant appearance
Spatio-temporal data
Idea: detect points with high spatio-temporal variation of image values
Direct method for event detection
Why local features in time?
Non-constant motion in images may be an indication of
• physical interaction between objects in the world (a ball bouncing off the ground, a car crash, etc.);
• non-rigid motion, e.g. relative motion of body parts, gestures, etc.;
• occlusions/disocclusions in the field of view.
Goal: make a sparse and informative representation of complex motion patterns; obtain robustness w.r.t. missing data (occlusions) and outliers (dynamic, complex background)
Interest points in space
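(Harris and Stephens 1988): image points with high variation of values in both image directions
High eigenvalues of the second-moment matrix integrated over a local neighbourhood
$\mu = g(\cdot;\sigma_i^2) * \begin{pmatrix} L_x^2 & L_x L_y \\ L_x L_y & L_y^2 \end{pmatrix}$
where Lx, Ly are Gaussian derivatives
Select points with positive maxima of the corner function
$H = \det(\mu) - k\,\mathrm{trace}^2(\mu)$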
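As a rough illustration of the spatial corner function, the sketch below computes H = det(μ) − k·trace²(μ) from Gaussian derivatives with NumPy and SciPy. The function name `harris_response`, the two scale values and k = 0.04 are illustrative assumptions, not settings taken from the slides.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(image, sigma=1.0, sigma_i=2.0, k=0.04):
    """Harris corner function H = det(mu) - k * trace(mu)^2 at every pixel."""
    image = np.asarray(image, dtype=float)
    # Gaussian derivatives Lx, Ly at local scale sigma (axis 0 is y, axis 1 is x).
    Lx = gaussian_filter(image, sigma, order=(0, 1))
    Ly = gaussian_filter(image, sigma, order=(1, 0))
    # Entries of the second-moment matrix, integrated with a Gaussian window
    # of (assumed) integration scale sigma_i.
    mxx = gaussian_filter(Lx * Lx, sigma_i)
    mxy = gaussian_filter(Lx * Ly, sigma_i)
    myy = gaussian_filter(Ly * Ly, sigma_i)
    return mxx * myy - mxy ** 2 - k * (mxx + myy) ** 2

# Usage: a bright square on a dark background.
img = np.zeros((64, 64))
img[20:44, 20:44] = 1.0
H = harris_response(img)
```

On this test image the response is positive near the corners of the square, negative along its straight edges and zero in flat regions, which is why only positive maxima are selected.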
Interest points in space-time
High variation of image values in both space and time
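Extend the Harris corner function into the 3D spatio-temporal domain; compute the second-moment matrix
$\mu = g(\cdot;\sigma_i^2,\tau_i^2) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix}$
where Lx, Ly, Lt are Gaussian derivatives in space-time obtained by spatio-temporal convolution:
$L(\cdot;\sigma^2,\tau^2) = g(\cdot;\sigma^2,\tau^2) * f$
and
$g(x,y,t;\sigma^2,\tau^2) = \frac{1}{\sqrt{(2\pi)^3\sigma^4\tau^2}} \exp\left(-\frac{x^2+y^2}{2\sigma^2} - \frac{t^2}{2\tau^2}\right)$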
Interest points in space-time
Points with high space-time variations of image values correspond to the maxima of
$H = \det(\mu) - k\,\mathrm{trace}^3(\mu) = \lambda_1\lambda_2\lambda_3 - k(\lambda_1+\lambda_2+\lambda_3)^3$
where λ1, λ2, λ3 are the eigenvalues of μ.
Distinct scale parameters are used for the spatial scale (σ²) and the temporal scale (τ²): the spatial and temporal extents of events are independent in general.
Convolution with Gaussian kernels violates the causality constraint of the temporal domain. Alternative (recursive) kernels can be used to address this problem (Koenderink 1988, Lindeberg & Fagerström 1996, Florack 1997).
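A minimal sketch of the space-time extension follows, assuming a corner function of the form H = det(μ) − k·trace³(μ) with a 3×3 second-moment matrix. The name `spacetime_harris`, the (t, y, x) array layout, the integration-scale factor `s` and k = 0.005 are assumptions made for illustration. Gaussian smoothing over time is used here for simplicity, so the sketch is not time-causal.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spacetime_harris(video, sigma=2.0, tau=2.0, s=2.0, k=0.005):
    """Space-time corner function for a video indexed as (t, y, x)."""
    video = np.asarray(video, dtype=float)
    scale = (tau, sigma, sigma)  # separate temporal and spatial scales
    # Gaussian derivatives in space-time.
    Lt = gaussian_filter(video, scale, order=(1, 0, 0))
    Ly = gaussian_filter(video, scale, order=(0, 1, 0))
    Lx = gaussian_filter(video, scale, order=(0, 0, 1))
    # Integration window with variances s * sigma^2 and s * tau^2 (assumed).
    window = tuple(np.sqrt(s) * v for v in scale)
    def avg(a):
        return gaussian_filter(a, window)
    mxx, myy, mtt = avg(Lx * Lx), avg(Ly * Ly), avg(Lt * Lt)
    mxy, mxt, myt = avg(Lx * Ly), avg(Lx * Lt), avg(Ly * Lt)
    # Determinant and trace of the symmetric 3x3 matrix at every voxel.
    det = (mxx * (myy * mtt - myt ** 2)
           - mxy * (mxy * mtt - myt * mxt)
           + mxt * (mxy * myt - myy * mxt))
    trace = mxx + myy + mtt
    return det - k * trace ** 3

# Usage: a blob that is localised in both space and time.
t, y, x = np.mgrid[0:32, 0:32, 0:32]
blob = np.exp(-((x - 16) ** 2 + (y - 16) ** 2 + (t - 16) ** 2) / (2 * 9.0))
H = spacetime_harris(blob)
```

For a pattern that does not change over time, all temporal entries of μ vanish, the determinant is zero and H cannot be positive, so purely spatial corners are not detected.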
Experiments with synthetic sequences
[Figures: spatio-temporal ”corner”; Collision I]
Experiments with synthetic sequences
[Figure: Collision II, detected at σ² = 16, τ² = 16 and at σ² = 8, τ² = 8]
Motivation for scale selection
[Figure: interest points detected at (σ², τ²) = (2, 8), (8, 8), (2, 2) and (8, 2)]
Motivation for velocity adaptation
[Figure panels: vx = -0.8; vx = 1.4; vx = 0.0]
Spatio-temporal scale selection
Estimate the spatio-temporal extent of image structures
Local scale estimation has been investigated and applied previously in the spatial domain (Lindeberg IJCV’98; Chomat et al. ECCV’00; Mikolajczyk and Schmid ICCV’01).
Here: Extend scale selection into the spatio-temporal domain; estimate spatial and temporal scale parameters
Task: find normalisation parameters (a,b,c,d) of the second-order spatial and temporal derivatives
such that the normalised derivatives attain extrema at scales corresponding to the extents of image structures in space-time
Spatio-temporal scale selection
Analyse spatio-temporal blob
The extrema constraints
give the parameter values a=1, b=1/4, c=1/2, d=3/4
The normalised spatio-temporal Laplacian operator
assumes extrema values at positions and scales corresponding to the centres and the spatio-temporal extent of a Gaussian blob
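As a worked form of the statement above, and assuming the normalised derivatives are parameterised as powers of σ and τ (the exponent convention below is an assumption), inserting a = 1, b = 1/4, c = 1/2, d = 3/4 gives:

```latex
% Assumed parameterisation of the normalised second-order derivatives
L_{xx,norm} = \sigma^{2a}\tau^{2b} L_{xx}, \qquad
L_{yy,norm} = \sigma^{2a}\tau^{2b} L_{yy}, \qquad
L_{tt,norm} = \sigma^{2c}\tau^{2d} L_{tt}

% With a = 1, b = 1/4, c = 1/2, d = 3/4
\nabla^2_{norm} L = L_{xx,norm} + L_{yy,norm} + L_{tt,norm}
                  = \sigma^{2}\tau^{1/2}\,(L_{xx} + L_{yy}) + \sigma\,\tau^{3/2}\,L_{tt}
```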
Spatio-temporal scale selection
Want to adapt point neighbourhoods to the direction of motion and obtain invariance w.r.t. the first-order motion
Velocity adaptation
Stationary pattern:
First-order motion is described by the Galilean transformation
where
and it follows
Velocity adaptation
expansion gives
However, this scheme needs an estimate of the velocity (vx, vy) in advance in order to adapt the smoothing filter kernel.
Iteratively estimate the velocity and adapt the filter kernel until the fixed-point condition is reached:
with
(Similar approach for affine shape adaptation in space, Lindeberg)
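For reference, one common way to write first-order motion and its effect on the second-moment matrix is sketched below. The sign convention and the symbols G, p and μ′ are assumptions made for this sketch, and the relation for μ′ presumes that the smoothing kernels are transformed along with the image.

```latex
% Galilean transformation of a space-time point p = (x, y, t)^T
p' = G p, \qquad
G = \begin{pmatrix} 1 & 0 & v_x \\ 0 & 1 & v_y \\ 0 & 0 & 1 \end{pmatrix}

% With f'(p') = f(p), gradients transform as \nabla' = G^{-T}\nabla, hence
\mu' = G^{-T} \mu\, G^{-1}

% Requiring \mu'_{xt} = \mu'_{yt} = 0 (no residual first-order motion) gives
\begin{pmatrix} v_x \\ v_y \end{pmatrix}
  = \begin{pmatrix} \mu_{xx} & \mu_{xy} \\ \mu_{xy} & \mu_{yy} \end{pmatrix}^{-1}
    \begin{pmatrix} \mu_{xt} \\ \mu_{yt} \end{pmatrix}
```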
Find interest points p = (x, y, t, σ², τ², vx, vy) that
• are maxima of the corner function H over (x, y, t);
• are maxima of the normalised Laplacian over (σ², τ²);
• satisfy the fixed-point condition.
Scale and velocity adaptation
Approach:
1. Find interest points P for a set of sampled (σ², τ², vx, vy)
2. For each pi in P
3. select new scale (σ², τ²) at (x, y, t) that maximises the Laplacian in the local scale-neighbourhood
4. estimate velocity (vx, vy)
5. re-detect the interest point for the new scales and velocities
6. If changes in (σ², τ², vx, vy) => repeat from 3.
7. else i=i+1
(Similar in spatial domain: Mikolajczyk and Schmid ICCV01, ECCV02)
Scale- and velocity-adapted interest points
Experiments
[Figure: stationary camera; stabilised camera]
Experiments
[Figure: rows: stationary camera, stabilised camera; columns: no adaptation, scale adaptation, scale and velocity adaptation]
Experiments
Invariance with respect to size changes
Experiments
Selection of temporal scales captures the temporal extents of events
Applications of interest points
(preliminary results)
Classify detected interest points using their spatio-temporal neighbourhoods
Represent video data by a set of classified interest points (features)
Align video sequences by matching spatio-temporal features
Recognise motion patterns using probability distribution of features derived from training sequences
Classification of events
When analysing periodic motion such as the gait pattern, the interest points with similar spatio-temporal structure are likely to correspond to the interesting events, while the others are more likely to be caused by noise.
Describe each interest point pi, i=1,...,n by the local responses of spatio-temporal Gaussian derivatives:
and normalise descriptors w.r.t. the covariance
Group similar points in the space of normalised descriptors using k-means clustering
Select significant clusters and represent each of them by the mean and the covariance matrix
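The normalisation and grouping steps above can be sketched as follows, using only NumPy. The helper names, the deterministic farthest-point initialisation and the descriptor dimension are assumptions for illustration; the slides do not specify how k-means was initialised or how many clusters were used.

```python
import numpy as np

def whiten_descriptors(descriptors):
    """Normalise descriptors w.r.t. their covariance (Mahalanobis whitening)."""
    X = descriptors - descriptors.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # L @ L.T = inv(cov), so Euclidean distances between rows of X @ L
    # equal Mahalanobis distances between the original descriptors.
    L = np.linalg.cholesky(np.linalg.inv(cov))
    return X @ L

def kmeans(Z, k, n_iter=50):
    """Plain Lloyd iterations with farthest-point initialisation (assumed)."""
    centres = [Z[0]]
    for _ in range(k - 1):
        d = np.min([np.sum((Z - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(Z[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each centre; keep the old one if a cluster became empty.
        centres = np.array([Z[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels, centres

def significant_clusters(Z, labels, n_keep):
    """Keep the n_keep most populated clusters as (mean, covariance) pairs."""
    counts = np.bincount(labels)
    keep = np.argsort(counts)[::-1][:n_keep]
    return [(Z[labels == j].mean(axis=0), np.cov(Z[labels == j], rowvar=False))
            for j in keep]

# Usage with random stand-in descriptors (real ones would be jet responses).
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(100, 4))
Z = whiten_descriptors(descriptors)
labels, centres = kmeans(Z, k=4)
models = significant_clusters(Z, labels, n_keep=2)
```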
K-means clustering
For the gait pattern, the four most significant clusters (those with the most points) correspond to distinct spatio-temporal events
[Figure: clusters c1, c2, c3, c4; panels: clustering, classification]
Application I: Sequence matching
Represent the model sequence and the test sequence by a set of classified spatio-temporal points.
Find a valid transformation of a model that brings model features in correspondence with data features.
Problem: Find walking people and estimate their poses from image sequences
Match a model sequence with data sequences using spatio-temporal interest points
Note: the feature matching is defined in a 3D spatio-temporal window
Walking model
Represent the gait pattern using classified spatio-temporal points corresponding to one gait cycle
Define the state of the model X for the moment t0 by the position, the size, the phase and the velocity of a person:
Associate each phase with a silhouette of a person extracted from the original sequence
Sequence alignment
Given a data sequence with the current moment t0, detect and classify interest points in the time window of length tw: (t0, t0-tw)
Transform model features according to X and, for each model feature fm,i = (xm,i, ym,i, tm,i, σm,i, τm,i, cm,i), compute its distance di to the closest data feature fd,j of the same class, cd,j = cm,i:
Define the ”fit function” D of model configuration X as a sum of distances of all features weighted w.r.t. their ”age” (t0-tm) such that recent features get more influence on the matching
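One possible form of such a fit function is sketched below. The exponential age weighting, the `decay` parameter, the use of squared Euclidean distance in (x, y, t) only, and the choice to skip model features with no same-class match are all assumptions for illustration; the model features are taken to be already transformed according to X.

```python
import numpy as np

def fit_function(model, data, t0, decay=10.0):
    """Age-weighted sum of distances from model features to data features.

    model, data: arrays with one feature per row and columns (x, y, t, class).
    """
    total = 0.0
    for x, y, t, c in model:
        same_class = data[data[:, 3] == c]
        if len(same_class) == 0:
            continue  # assumed: unmatched classes contribute nothing
        # Squared distance to the closest data feature of the same class.
        d2 = np.min(np.sum((same_class[:, :3] - np.array([x, y, t])) ** 2, axis=1))
        # Recent features (small age t0 - t) receive larger weights.
        total += np.exp(-(t0 - t) / decay) * d2
    return total

# Usage: two model features and a slightly displaced copy as data.
model = np.array([[0.0, 0.0, 0.0, 1.0],
                  [5.0, 5.0, 9.0, 2.0]])
data = model.copy()
data[:, 0] += 1.0
D = fit_function(model, data, t0=10.0)
```

A minimiser such as Gauss-Newton would then vary X, re-transform the model features and re-evaluate D.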
Sequence alignment
[Figure legend: data features; model features]
At each moment t0, minimise D with respect to X using the standard Gauss-Newton minimisation method
Experiments
Application II: Action recognition
[Figure: example sequences: walking, exercise, running, cycling]
1. Detect spatio-temporal velocity- and scale-adapted interest points and compute their jet descriptors
2. Cluster all the descriptors using k-means
3. Compute distributions of points over detected clusters for each sequence separately
Model histograms (related to Leung & Malik, IJCV01)
[Figure: histograms over cluster id for walking, exercise, running and cycling]
Test sequences
[Figure: walking, exercise, running, cycling, background]
Classification
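1. Detect interest points and classify their jet responses w.r.t. the cluster means
2. Compute the distribution of cluster labels and classify the sequence as an action if the distance between this distribution and the model histogram of that action falls below a decision threshold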
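The two classification steps can be sketched as below. The χ² histogram distance, the threshold value and the function names are assumptions; the slides mention several histogram-distance measures without the details surviving here.

```python
import numpy as np

def label_histogram(descriptors, cluster_means):
    """Assign each descriptor to its nearest cluster mean and histogram the labels."""
    d2 = ((descriptors[:, None, :] - cluster_means[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(cluster_means)).astype(float)
    return hist / hist.sum()

def classify(hist, model_hists, threshold):
    """Return the closest action, or None (background) if no model is close enough."""
    dists = {name: 0.5 * np.sum((hist - h) ** 2 / (hist + h + 1e-12))
             for name, h in model_hists.items()}
    best = min(dists, key=dists.get)
    return best if dists[best] < threshold else None

# Usage with two hypothetical clusters and two hypothetical action models.
cluster_means = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1], [9.9, 10.1]])
model_hists = {"walking": np.array([0.8, 0.2]),
               "running": np.array([0.1, 0.9])}
hist = label_histogram(descriptors, cluster_means)
action = classify(hist, model_hists, threshold=0.1)
```

Sweeping `threshold` trades missed actions against false detections, which is how a ROC curve for the classifier can be traced.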
Confusion matrix:
[Table: models walking, exercise, running, cycling against test walking, test exercise, test running, test cycling, test background]
Classification
ROC curve corresponding to changes of the decision threshold when classifying 37 sequences using different histogram-distance measures
[Figure: ROC curve; axes: % correct, % false]
Performance comparison
Velocity- and scale-adapted space-time interest points
Non-adapted space-time interest points
Spatial interest points
Back-projection of points
Test running
Test cycling
Test walking
Test exercise
Summary
Interest point detection
• Points with high variation of image values in space-time are detected
• Direct approach for event detection (no tracking needed)
• Invariant treatment of events at different spatial and temporal scales; invariance w.r.t. camera motion
Applications
• Classified space-time features provide a compact representation of video information
• Interpretation of scenes with complex, non-stationary backgrounds
Future work: contrast and orientation invariant descriptors, large-scale action recognition experiments, integration of multi-local constraints, on-line implementation.