7/24/2019 Lecture on Action Recognition
1/72
Action recognition
ECS734 Techniques in Computer Vision
Ioannis Patras
Slides thanks to Hays, Hoiem, Grauman, Oikonomopoulos
Past lectures
Recognition in static images
Object recognition
Image categorisation
Today's lecture
Recognition in image sequences
Action recognition (body gestures / facial expressions)
To come
Recognition in image sequences
Action recognition
(body gestures / facial expressions)
Tracking
Structure from motion
Surveillance
Today's lecture
Introduction and application
Features
Template classification methods
Recognition using pose estimation and objects (in brief)
Part-based action localisation
What is an action?
Action: a transition from one state to another
Who is the actor? How is the state of the actor changing?
What (if anything) is being acted on?
How is that thing changing?
What is the purpose of the action (if any)?
Human activity in video
No universal terminology, but approximately:
Actions: atomic motion patterns, often gesture-like, with a single clear-cut trajectory and a single nameable behavior (e.g., break eggs, lift spoon, kick, wave arms)
Activity: series or composition of actions (e.g., people having a conversation, interacting)
Event: combination of activities or actions (e.g., a football game, a traffic accident, cooking a meal)
Adapted from Venu Govindaraju
How do we represent actions?
Categories: walking, hammering, dancing, skiing, sitting down, standing up, jumping
Poses
Nouns and Predicates
Applications
Human-Computer
interfaces
Augmented Reality
Sports Analysis
[ C. Sminchisescu, 2007 ]
Surveillance
http://users.isr.ist.utl.pt/~etienne/mypubs/Auvinetal06PETS.pdf
2011
Interfaces
How can we identify actions?
Motion
Pose
Held objects
Nearby objects
Today's lecture
Introduction and application
Features
Template classification methods
Recognition using pose estimation and objects
Part-based action localisation
Representing Motion
Bobick Davis 2001
Optical Flow with Motion History
http://www.cse.ohio-state.edu/~jwdavis/Publications/pami01.pdf
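A minimal sketch of the motion-history idea behind temporal templates, assuming the usual update rule: a pixel that moves in the current frame is set to a maximum value `tau`, and every other pixel decays by one per frame, so brighter values mean more recent motion. The frame-differencing motion mask, the threshold and all function names are assumptions made for illustration; they are not taken from the slides.

```python
def motion_mask(prev, curr, threshold):
    """Binary mask: True where the grey level changed by more than threshold."""
    return [[abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)]


def update_mhi(mhi, mask, tau):
    """One update step: moving pixels are reset to tau, all others decay by 1."""
    return [[tau if moving else max(0, h - 1) for h, moving in zip(h_row, m_row)]
            for h_row, m_row in zip(mhi, mask)]


def motion_history(frames, tau, threshold=10):
    """Motion history image of a sequence of greyscale frames (lists of rows)."""
    height, width = len(frames[0]), len(frames[0][0])
    mhi = [[0] * width for _ in range(height)]
    for prev, curr in zip(frames, frames[1:]):
        mhi = update_mhi(mhi, motion_mask(prev, curr, threshold), tau)
    return mhi


# A bright pixel moving left to right across a 1x3 image.
frames = [[[255, 0, 0]], [[0, 255, 0]], [[0, 0, 255]]]
print(motion_history(frames, tau=3))  # [[2, 3, 3]]
```

The leftmost pixel stopped changing one frame earlier than the others, so its value has already started to decay.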
Representing Motion
Efros et al. 2003
Optical Flow with Split Channels
http://graphics.cs.cmu.edu/people/efros/research/action/efros-iccv03.pdf
Representing Motion
Tracked Points
Matikainen et al. 2009
http://www.cs.cmu.edu/~rahuls/pub/voec2009-rahuls.pdf
Representing Motion
Space-Time Interest Points
Corner detectors in space-time
Laptev 2005
http://www.irisa.fr/vista/Papers/2005_ijcv_laptev.pdf
Representing Motion
Space-Time Interest Points
Laptev 2005
http://www.irisa.fr/vista/Papers/2005_ijcv_laptev.pdf
Representing Motion
Space-Time Volumes
Blank et al. 2005
http://www.wisdom.weizmann.ac.il/~vision/VideoAnalysis/Demos/SpaceTimeActions/SpaceTimeActions_iccv05.pdf
Examples of Action Recognition Systems
Feature-based classification
Recognition using pose and objects
Part-based recognition and localisation
Today's lecture
Introduction and application
Features
Feature classification methods
Recognition using pose estimation and objects
Part-based action localisation
Action recognition as classification
Retrieving actions in movies, Laptev and Perez, 2007
http://www.irisa.fr/vista/Papers/2007_iccv_laptev.pdf
Remember image categorization
[Diagram. Training: Training Images -> Image Features -> Classifier Training (with Training Labels) -> Trained Classifier. Testing: Test Image -> Image Features -> Trained Classifier -> Prediction (Outdoor)]
Spatial pyramids.
Compute histogram in each spatial bin
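As an illustration of "compute histogram in each spatial bin", here is a minimal sketch of a spatial pyramid over an image whose pixels have already been assigned visual-word labels. The input format (a 2D list of integer labels), the doubling of the grid at each level and the function names are assumptions for this example.

```python
def grid_histograms(labels, n_words, cells):
    """Histogram of visual-word labels in each cell of a cells x cells grid."""
    rows, cols = len(labels), len(labels[0])
    hists = [[0] * n_words for _ in range(cells * cells)]
    for y in range(rows):
        for x in range(cols):
            cell_y = y * cells // rows   # grid row that pixel (x, y) falls into
            cell_x = x * cells // cols   # grid column that pixel (x, y) falls into
            hists[cell_y * cells + cell_x][labels[y][x]] += 1
    return hists


def spatial_pyramid(labels, n_words, levels=2):
    """Concatenate the histograms of 1x1, 2x2, 4x4, ... grids into one feature."""
    feature = []
    for level in range(levels):
        for hist in grid_histograms(labels, n_words, 2 ** level):
            feature.extend(hist)
    return feature


labels = [[0, 0, 1, 1],
          [0, 0, 1, 1],
          [1, 1, 0, 0],
          [1, 1, 0, 0]]
print(spatial_pyramid(labels, n_words=2))  # [8, 8, 4, 0, 0, 4, 0, 4, 4, 0]
```

The global histogram (first two entries) cannot tell this image from one with the labels shuffled; the four cell histograms that follow can. A spatio-temporal pyramid extends the same binning with a third index over frames.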
Features for Classifying Actions
1. Spatio-temporal pyramids (14x14x8 bins)
Image Gradients
Optical Flow
Features for Classifying Actions
2. Spatio-temporal interest points
Corner detectors in
space-time
Descriptors based on Gaussian derivative filters over x, y, time
Classification
Boosted stubs for pyramids of optical flow,
gradient
Nearest neighbor for STIP
Searching the video for an action
1. Detect keyframes using a trained HOG detector in each frame
2. Classify detected keyframes as positive (e.g., drinking) or negative (other)
Accuracy in searching video
Without keyframe detection
With keyframe detection
Learning realistic human actions from movies, Laptev et al. 2008
Talk on phone
Get out of car
http://www.irisa.fr/vista/Papers/2008_cvpr_laptev.pdf
Approach
Space-time interest point detectors
Descriptors
HOG, HOF
Pyramid histograms (3x3x2)
SVMs with Chi-Squared Kernel
Interest Points
Spatio-Temporal Binning
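To make "SVMs with Chi-Squared Kernel" concrete, the following sketch computes a chi-squared distance between two histograms and turns it into a kernel value by exponentiating the negative scaled distance. The function names and the `scale` parameter are assumptions for illustration; the scale is often set to the mean distance between training samples, but that choice is not stated on the slide. The resulting Gram matrix could be passed to any SVM implementation that accepts a precomputed kernel.

```python
import math


def chi2_distance(h1, h2):
    """Chi-squared distance between two histograms; bins empty in both are skipped."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def chi2_kernel(h1, h2, scale=1.0):
    """Kernel value exp(-distance / scale); equals 1.0 for identical histograms."""
    return math.exp(-chi2_distance(h1, h2) / scale)


def gram_matrix(histograms, scale=1.0):
    """Pairwise kernel values for a list of histograms."""
    return [[chi2_kernel(a, b, scale) for b in histograms] for a in histograms]


hists = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
for row in gram_matrix(hists):
    print([round(v, 3) for v in row])
```

Histograms that share no mass (the first and last above) get the smallest kernel value, and each histogram has kernel value 1 with itself.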
Results
Today's lecture
Introduction and application
Features
Template classification methods
Recognition using pose estimation and objects
Part-based action localisation
Human-Object Interaction
Torso
Head
Human pose estimation
Holistic image based classification
Integrated reasoning
Slide Credit: Yao/Fei-Fei
Human-Object Interaction
Tennis
racket
Human pose estimation
Holistic image based classification
Integrated reasoning
Object detection
Slide Credit: Yao/Fei-Fei
Human-Object Interaction
Human pose estimation
Holistic image based classification
Integrated reasoning
Object detection
Torso
Head
Tennis
racket
HOI activity: Tennis Forehand
Slide Credit: Yao/Fei-Fei
Action categorization
Human pose estimation & Object detection
Human pose estimation is challenging.
Difficult part appearance
Self-occlusion
Image region looks like a body part
Felzenszwalb & Huttenlocher, 2005
Ren et al, 2005
Ramanan, 2006
Ferrari et al, 2008
Yang & Mori, 2008
Andriluka et al, 2009
Eichner & Ferrari, 2009
Slide Credit: Yao/Fei-Fei
Human pose estimation & Object detection
Facilitate
Given the object is detected.
Slide Credit: Yao/Fei-Fei
Human pose estimation & Object detection
Object detection is challenging
Small, low-resolution, partially occluded
Image region similar to detection target
Viola & Jones, 2001
Lampert et al, 2008
Divvala et al, 2009
Vedaldi et al, 2009
Slide Credit: Yao/Fei-Fei
Human pose estimation & Object detection
Facilitate
Given the pose is estimated.
Slide Credit: Yao/Fei-Fei
Human pose estimation & Object detection
Mutual Context
Slide Credit: Yao/Fei-Fei
Mutual Context Model Representation
A: Activity (e.g., croquet shot, volleyball smash, tennis forehand)
O: Object (e.g., croquet mallet, volleyball, tennis racket)
H: Human pose. Intra-class variations: more than one H for each A; unobserved during training.
P: Body parts ($P_1, P_2, \ldots, P_N$). $l_P$: location; $\theta_P$: orientation; $s_P$: scale.
f: Image evidence ($f_O, f_1, f_2, \ldots, f_N$). Shape context. [Belongie et al, 2002]
Slide Credit: Yao/Fei-Fei
Activity Classification Results
Classification accuracy: Our model 83.3%; Gupta et al, 2009 78.9%; Bag-of-Words 52.5%
[Chart: classification accuracy (axis 0.5 to 0.9); labels: Cricket shot, Tennis forehand; methods: Bag-of-words (SIFT+SVM), Gupta et al, 2009, Our model]
Slide Credit: Yao/Fei-Fei
Today's lecture
Introduction and application
Features
Template classification methods
Recognition using pose estimation and objects
Part-based action localisation
Part-based Recognition & Localisation
Implicit shape model
Goal:
Recognize categories of actions
Localize them in terms of their bounding box (space + time)
Challenges: occlusions, clutter, variations
Hypothesis: analysis can be restricted to a set of spatiotemporally interesting/salient events
Information theoretical spatial saliency
Proposal: Use signal unpredictability as an indicator of saliency
HD=3.866
HD=7.201
Spatial Saliency: Unpredictability in a single frame
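A small sketch of "unpredictability" measured as Shannon entropy: the grey levels inside a region are histogrammed and the entropy of that histogram is computed in bits. A flat region scores 0 and a region with many equally likely values scores high. The function name and the use of base-2 logarithms are assumptions made for this example.

```python
import math
from collections import Counter


def patch_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of the values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


flat_patch = [128] * 16           # perfectly predictable
textured_patch = list(range(16))  # every grey level appears exactly once

print(patch_entropy(flat_patch) == 0)   # True
print(patch_entropy(textured_patch))    # 4.0
```

With 8-bit grey levels the entropy is bounded by 8 bits, which is consistent with the range of the HD values shown on the slide.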
Towards scale invariance
[Plot: entropy against scale (circle radius); scales 29 and 59 marked on the axis]
The entropy maxima reveal the spatial scale(s) of a salient region
Detected salient points in a single frame
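The scale-selection step can be sketched as peak picking: given the entropy measured at a range of circle radii, keep the radii at which the entropy has a local maximum. The strict-maximum rule and the function name are assumptions for illustration, and the entropy values below are made up for the example.

```python
def salient_scales(scales, entropies):
    """Scales at which the entropy curve has a strict local maximum."""
    return [scales[i]
            for i in range(1, len(entropies) - 1)
            if entropies[i - 1] < entropies[i] > entropies[i + 1]]


# Illustrative numbers only: entropy measured at five circle radii.
radii = [10, 20, 30, 40, 50]
entropy_per_radius = [0.1, 0.5, 0.3, 0.7, 0.2]
print(salient_scales(radii, entropy_per_radius))  # [20, 40]
```

A region can produce more than one peak, which is why the slide speaks of scale(s) in the plural.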
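Spatial and spatiotemporal saliency
Spatiotemporal Saliency:
Driven by signal unpredictability in a spatiotemporal volume (cylinder / sphere)
Examine entropy: $Y(v_k) = H(v_k)\, w(v_k)$, where $H$ is the entropy's height and $w$ is the entropy's peakness
[Equations: definitions of $H$ and $w$ in terms of the density $p$ of signal values $q$ inside the volume $D$]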
Descriptor extraction, codebook creation
[Diagram labels: Input sequence; Optical Flow; Optical Flow after median subtraction; Spatiotemporal Salient Point Detection; Feature ensembles (O. Boiman & M. Irani [ICCV05]); Feature selection; Ensemble codewords; Codebook (class-specific): c1, c2, ..., cN]
Optical Flow + Spatial Gradient Descriptors. Bin in histograms and concatenate.
Class-dependent spatio-temporal probabilistic voting
[Figure labels: current frame t; action duration T; offsets -t and T-t]
Parameters stored for each ensemble in the training set:
average spatial position of the ensemble with respect to the subject center and lower bound.
distance in frames of the activated ensemble from the start/end of the action.
average spatiotemporal scale of the ensemble.
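The stored parameters suggest a simple voting scheme, sketched below under strong simplifying assumptions: each codeword stores only an average offset from the subject center and a distance in frames from the start of the action, each detection casts one weighted vote, and votes are accumulated in coarse bins rather than as the probability densities used in the lecture. All names, the tuple layouts and the bin size are hypothetical.

```python
from collections import defaultdict


def cast_votes(detections, codebook, bin_size=4):
    """Accumulate weighted votes for (center_x, center_y, action_start) bins.

    detections: tuples (x, y, t, codeword, weight) with integer coordinates.
    codebook: maps codeword -> (dx, dy, frames_from_start), the stored offsets
              of that codeword relative to the subject center and action start.
    """
    votes = defaultdict(float)
    for x, y, t, word, weight in detections:
        dx, dy, frames_from_start = codebook[word]
        center_x, center_y = x - dx, y - dy   # where the subject center should be
        start = t - frames_from_start         # when the action should have started
        key = (center_x // bin_size, center_y // bin_size, start // bin_size)
        votes[key] += weight
    return votes


def best_hypothesis(votes, bin_size=4):
    """Origin of the bin with the largest accumulated vote, and that vote."""
    key = max(votes, key=votes.get)
    return tuple(k * bin_size for k in key), votes[key]


codebook = {"a": (10, 0, 5), "b": (-10, 0, 20)}
detections = [(50, 40, 25, "a", 1.0),   # votes for center (40, 40), start 20
              (30, 41, 41, "b", 0.5),   # votes for center (40, 41), start 21
              (90, 10, 7, "a", 0.8)]    # an outlier, e.g. from background clutter
print(best_hypothesis(cast_votes(detections, codebook)))  # ((40, 40, 20), 1.5)
```

Because each part votes independently, the two consistent detections outvote the outlier even though no single detection sees the whole action.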
Localisation model learned for codeword/cluster $c_i$:
[Equations: probabilistic localisation model over X, T and S, expressed through the ensembles $e_d$ and the codeword $c_i$]
Discriminative learning
Higher weights for pdfs with higher localisation accuracy
Class dictionary comprises discriminative codewords
Adaboost on the codeword similarities
[Equation: weight $w_i$ for codeword $c_i$, an exponential of a $p(\cdot \mid c_i) \log p(\cdot \mid c_i)$ term]
Spatio-temporal probabilistic voting
Hypothesis verification with
RVM-based classification
Mean-shift responses
used as features in RVM-based classification
Two class classification problem for class l
Select class l that maximizes the posterior probability
$K(F, F') = e^{-D_C(F, F')^2 / (2\eta^2)}$
$F = \{f_1, \ldots, f_i, \ldots\}$
$c_l(F; w) = w_{l0} + \sum_{j=1}^{N} w_{lj}\, K(F, F_j)$
$p(l \mid F) = 1 / (1 + e^{-c_l(F; w)})$
The Relevance Vector Machine (RVM) is a probabilistic (sparse Bayesian) variant of the Support Vector Machine
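The decision rule "select the class that maximizes the posterior probability" can be sketched as follows: each class has its own two-class model whose score is a weighted sum of kernel values, and a logistic function maps the score to a posterior. This is only the prediction side; the weights here are fixed by hand rather than learned by RVM training. The Euclidean distance inside the kernel, the `width` parameter and the dictionary layout are assumptions for illustration.

```python
import math


def gaussian_kernel(f1, f2, width):
    """Gaussian kernel on the squared Euclidean distance between two vectors."""
    squared = sum((a - b) ** 2 for a, b in zip(f1, f2))
    return math.exp(-squared / (2 * width ** 2))


def class_score(f, model, width):
    """Bias plus weighted kernel responses to the model's stored vectors."""
    return model["bias"] + sum(
        w * gaussian_kernel(f, v, width)
        for w, v in zip(model["weights"], model["vectors"]))


def posterior(score):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


def classify(f, models, width=1.0):
    """Return the label with the highest posterior and all posteriors."""
    posteriors = {label: posterior(class_score(f, m, width))
                  for label, m in models.items()}
    return max(posteriors, key=posteriors.get), posteriors


# Hand-set toy models, one stored vector per class.
models = {
    "walk": {"weights": [2.0], "bias": -1.0, "vectors": [[0.0, 0.0]]},
    "run": {"weights": [2.0], "bias": -1.0, "vectors": [[3.0, 3.0]]},
}
label, posteriors = classify([0.0, 0.0], models)
print(label)  # walk
```

Because each class is treated as its own two-class problem, the posteriors need not sum to one across classes; only their ranking is used.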
Localisation of single actions
Localisation accuracy (KTH)
Action recognition
KTH dataset average: 88%
HoHA dataset average: 37%
Localisation under artificial occlusions (KTH)
Localisation under clutter (KTH)
Summary
Advantages
Highly flexible structure model
Each part casts votes independently
Only few training examples are needed
Fast recognition
Robustness to occlusions
Disadvantages
Loose spatial model that does not model co-occurrence of parts.
False positives in background (clutter)
Take-home messages
Action recognition is an open problem.
How to define actions? How to infer them?
What are good visual cues?
How do we incorporate higher level reasoning?
Some work done, but it is just the beginning of exploring the problem. So far:
Actions are mainly categorical
Most approaches are classification using simple features (spatial-temporal histograms of gradients or flow, s-t interest points, SIFT in images)
Just a couple of works on how to incorporate pose and objects
Not much idea of how to reason about long-term activities or to describe video sequences
To come
Recognition in image sequences
Action recognition (body gestures / facial expressions)
Tracking
Structure from motion
Surveillance
References
C. Sminchisescu. Learning and Inference Algorithms for Monocular Perception - Applications to Visual Object Detection, Localization and Time Series Models for 3D Human Motion Understanding. Habilitation Thesis, University of Bonn, Faculty of Mathematics and Natural Sciences, 2007.
A. Bobick and J. Davis. The recognition of human movement using temporal templates. IEEE Trans. PAMI, vol. 23, pp. 257-267, Mar 2001.
Alexei A. Efros, Alexander C. Berg, Greg Mori, and Jitendra Malik. Recognizing action at a distance. In Proceedings of IEEE ICCV '03, Volume 2, 2003.
P. Matikainen, M. Hebert, and R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In Workshop on Video-Oriented Object and Event Classification, ICCV 2009.
Ivan Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3), 2005.
M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri. Actions as space-time shapes. In ICCV, Beijing, China, Oct 15-21, 2005.
Ivan Laptev, Patrick Pérez. Retrieving actions in movies. In Proceedings of the ICCV '07, Rio de Janeiro, Brazil, October 2007, pp. 1-8.
Ivan Laptev, Marcin Marszałek, Cordelia Schmid, Benjamin Rozenfeld. Learning realistic human actions from movies. In Proceedings of the CVPR '08, Anchorage, AK, June 2008, pp. 1-8.
Bangpeng Yao and Li Fei-Fei. Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities. IEEE CVPR '10, San Francisco, CA, USA, June 13-18, 2010.
O. Boiman, M. Irani. Detecting irregularities in images and in video. In ICCV, Beijing, China, Oct 15-21, 2005.
P. Felzenszwalb, D. Huttenlocher. Pictorial Structures for Object Recognition. IJCV, Vol. 61, No. 1, January 2005.
Deva Ramanan. Learning to parse images of articulated bodies. In NIPS 19, Vancouver, Canada, December 2006.
V. Ferrari, M. Marin, and A. Zisserman. Progressive Search Space Reduction for Human Pose Estimation. IEEE CVPR, Alaska, June 2008.
Yang Wang and Greg Mori. Multiple Tree Models for Occlusion and Spatial Constraints in Human Pose Estimation. ECCV, 2008.
M. Andriluka, S. Roth, B. Schiele. Pictorial Structures Revisited: People Detection and Articulated Pose Estimation. In IEEE CVPR, 2009.
M. Eichner and V. Ferrari. Better Appearance Models for Pictorial Structures. BMVC, London, September 2009.
Paul A. Viola and Michael J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR (1), 2001.
Matthew B. Blaschko, Christoph H. Lampert. Learning to Localize Objects with Structured Output Regression. ECCV, Marseilles, France, 2008.
A. Oikonomopoulos, I. Patras and M. Pantic. Spatiotemporal Localization and Categorization of Human Actions in Unsegmented Image Sequences. IEEE Trans. Image Processing, vol. 20, no. 4, pp. 1126-1140, Mar. 2011.
A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple Kernels for Object Detection. In Proceedings of the ICCV, 2009.