EVENT DETECTION AND HUMAN
BEHAVIOR RECOGNITION
Ing. Lorenzo Seidenari
e-mail: seidenari@dsi.unifi.it
What is an Event?
Dictionary.com definition:
“something that occurs in a certain place during
a particular interval of time.”
Examples from various domains:
• Sports: shot on goal
• Surveillance: entering a car
• Movies: drinking
Importance of Human Actions
• Most videos recorded and downloadable from the web contain people; their semantics are therefore defined by people's behavior.
• Third generation video-surveillance systems benefit from
automatic interpretation of human actions and behaviors.
Definition 1: physical body motion.
Definition 2: interaction with the environment (objects or people) for a specific purpose.
Human action recognition challenges
• Actor appearance variation: gender, clothing, body posture and size.
• Scale, illumination and background change as in object categorization.
• Semantically different but perceptually
similar actions (e.g. running and jogging).
• Different ways of executing the same action, which change limb trajectories and speed.
Are actions space-time objects?
We already know how to detect instances of object categories in static images.
How do we take advantage of time to describe dynamic concepts (i.e. human actions)?
[Figure: interest points extraction, bag-of-features, visual dictionary, bag-of-words, SVM classifier. Action classes: running, walking, jogging, handwaving, handclapping, boxing.]
Framework Overview:
• Same three steps as object categorization (feature extraction, dictionary formation, classification).
• The feature detector and descriptor differ here!
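The dictionary and classification steps listed above can be illustrated with a minimal sketch of the bag-of-words encoding. It assumes hard nearest-codeword assignment and Euclidean distance; the 2-D toy points stand in for real spatio-temporal descriptors, and the function names are hypothetical rather than taken from the slides.

```python
import math

def nearest_word(descriptor, codebook):
    """Index of the codeword closest to the descriptor (Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(descriptor, codebook[k]))

def bag_of_words(descriptors, codebook):
    """L1-normalized histogram of codeword occurrences for one video clip."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy data (assumption): 2-D points instead of real descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
clip = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (3.9, 4.2)]
histogram = bag_of_words(clip, codebook)
print(histogram)  # [0.25, 0.5, 0.25]
```

The resulting histogram is the fixed-length clip representation that a classifier such as an SVM would take as input.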
Descriptor combination strategy
[Figure: ST patch, descriptors (3DGrad, HoF), visual dictionaries, BoW action representation. Combinations shown: 3DGrad_HoF and 3DGrad + HoF BoW.]
Effective codebooks:
• Spatio-temporal descriptors span an extremely high-dimensional feature space.
• Our dense multi-scale sampling produces a non-uniform feature space.
• K-means cluster centers are attracted towards densely populated regions.
• Less dense zones are not represented correctly.
• Radius-based clustering [Jurie ICCV05] exploits mode finding to place cluster centers.
• This gives a more accurate coding of the feature space.
Note: to reduce the uncertainty we perform soft assignment.
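To make the idea concrete, here is a small sketch of a radius-based codebook followed by soft assignment. It is a greedy simplification, not the method of [Jurie ICCV05]: instead of mean-shift mode finding it simply picks the sample with the most neighbours within a radius, then discards the covered samples. The Gaussian weighting, `radius`, `sigma` and the toy points are all assumptions made for illustration.

```python
import math

def radius_based_codebook(points, radius, max_words):
    """Greedy simplification of radius-based clustering.

    Repeatedly picks the point with the most neighbours within `radius`
    as a cluster center, then removes every point that center covers.
    """
    remaining = list(points)
    centers = []
    while remaining and len(centers) < max_words:
        def coverage(c):
            return sum(1 for p in remaining if math.dist(c, p) <= radius)
        best = max(remaining, key=coverage)
        centers.append(best)
        remaining = [p for p in remaining if math.dist(best, p) > radius]
    return centers

def soft_assign(descriptor, centers, sigma):
    """Gaussian-weighted membership of a descriptor to every codeword."""
    weights = [math.exp(-math.dist(descriptor, c) ** 2 / (2 * sigma ** 2))
               for c in centers]
    total = sum(weights)
    if total == 0.0:
        # Descriptor is far from every center: fall back to uniform weights.
        return [1.0 / len(centers)] * len(centers)
    return [w / total for w in weights]

# Toy data (assumption): one dense cluster plus two isolated samples.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1),
          (5.0, 5.0), (9.0, 0.0)]
centers = radius_based_codebook(points, radius=1.0, max_words=3)
memberships = soft_assign((0.2, 0.0), centers, sigma=1.0)
print(centers)
```

In this toy case the dense cluster consumes only one codeword, so the two sparse samples still get their own centers, which is the behaviour the bullets above contrast with K-means.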
Results: codebook performance
[Figure: two plots, KTH and Weizmann, with codebook size on the x-axis. Annotated regions: non-informative high-frequency terms and informative mid-frequency terms.]
Words are sorted by frequency and added incrementally to the dictionary.
Results: dataset
• KTH
• 25 actors
• 6 actions
• 4 viewing conditions
• 2391 clips
• Weizmann
• 9 actors
• 10 actions
• 1 viewing condition
• 93 clips
The approach is tested on two standard datasets.
The Weizmann dataset is considered less challenging because of the reduced variability of shooting conditions and the smaller number of actors.
Results: comparison with the state of the art
Method                       KTH     Weizmann
Our method                   92.57   95.41
Laptev et al., HoG ['08]     81.6    -
Laptev et al., HoF ['08]     89.7    -
Dollár et al. ['05]          81.2    -
Wong and Cipolla ['07]       86.6    -
Scovanner et al. ['07]       -       82.6
Niebles et al. ['08]         83.3    90
Liu et al. ['08]             -       90.4
Kläser et al. ['08]          91.4    84.3
Willems et al. ['08]         84.2    -
We compare our results using the same methodology, to measure the improvement w.r.t. the current state of the art.
Real video footage
• We test our detector on a sequence taken in a garage.
• A sliding temporal window is used to perform the segmentation.
[Figure: video frames labeled walking and running.]
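A sliding temporal window of this kind can be sketched as follows. This is only an illustration under assumptions: `classify` is a hypothetical callable standing in for the trained action classifier, the per-frame features are reduced to a single toy motion value, and the window and stride sizes are arbitrary.

```python
def sliding_window_segments(frame_features, classify, window, stride):
    """Label consecutive temporal windows and merge equal neighbours.

    `classify` is a hypothetical callable mapping a list of per-frame
    features to an action label.
    Returns a list of (start_frame, end_frame_exclusive, label) tuples.
    """
    segments = []
    for start in range(0, len(frame_features) - window + 1, stride):
        end = start + window
        label = classify(frame_features[start:end])
        if segments and segments[-1][2] == label:
            # Extend the previous segment when the label does not change.
            segments[-1] = (segments[-1][0], end, label)
        else:
            segments.append((start, end, label))
    return segments

# Toy stand-in for the real classifier: thresholds the mean motion value.
def toy_classifier(window_features):
    mean_motion = sum(window_features) / len(window_features)
    return "running" if mean_motion > 0.5 else "walking"

motion = [0.1, 0.2, 0.1, 0.2, 0.9, 0.8, 0.9, 1.0]
segments = sliding_window_segments(motion, toy_classifier, window=2, stride=2)
print(segments)  # [(0, 4, 'walking'), (4, 8, 'running')]
```

With a stride smaller than the window, neighbouring segments with different labels overlap by a few frames; how to resolve that overlap is a design choice the slides do not specify.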
Recognizing generic video events
• Online video search and video indexing.
• Events are characterized by an evolution of scenes, objects and actions over time.
• 56 events are defined in LSCOM.
• Event examples in the news domain: Airplane Flying, Car Exiting.
Event Recognition: Object Tracking
• A possible approach, which exploits object recognition, is to detect an object of interest, track it over time, and model its spatio-temporal dynamics.
• Some events are well defined by the presence and motion of an object.
[Figure: object detection and localization, tracking, inference, "Airplane Landing"?]
• It is hard to detect events without explicit object motion, such as Riot.
Event recognition: exploit dynamic concept evolution
[Figure: for each shot, feature extraction, then concept detectors, then EMD distance.]
• Global low-level features are extracted, such as edge histograms, Gabor texture descriptors and grid color moments.
• 108 concept detectors are trained on these features.
• Each frame is represented by 108 concept scores.
• Shot similarity is evaluated by computing the Earth Mover’s Distance (EMD).
• Plug the EMD into an RBF kernel and use it in an SVM to predict the category.
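The step of plugging the EMD into an RBF kernel can be sketched as below. The form K = exp(-D / scale) and the choice of the mean pairwise distance as the default scale are assumptions made for this example, and the distance values are toy numbers, not results from the slides.

```python
import math

def emd_rbf_kernel(distances, scale=None):
    """Turn a square matrix of pairwise EMD values into K = exp(-D / scale)."""
    n = len(distances)
    if scale is None:
        # Assumption: normalize by the mean off-diagonal distance.
        off_diag = [distances[i][j]
                    for i in range(n) for j in range(n) if i != j]
        scale = sum(off_diag) / len(off_diag)
    return [[math.exp(-distances[i][j] / scale) for j in range(n)]
            for i in range(n)]

# Toy pairwise EMD values between three shots (assumption).
emd = [[0.0, 1.0, 3.0],
       [1.0, 0.0, 2.0],
       [3.0, 2.0, 0.0]]
kernel = emd_rbf_kernel(emd)
print(kernel[0])
```

A matrix like `kernel` can then be passed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.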
Content Representation: Mid-level Semantic Concept Scores
• Concept detectors are trained on low-level features.
• Mid-level semantic concept features are more robust.
• Columbia developed and released 374 semantic concept detectors, available online: http://www.ee.columbia.edu/ln/dvmm/columbia374/
[Figure: image database with positive (+) and negative (-) examples feeding the concept detectors.]
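Earth Mover’s Distance (EMD): Approach
• Supplier P has a given amount of goods.
• Receiver Q has a given limited capacity.
• d_ij is the ground distance between element i of P and element j of Q.
• Weights: [formula not recoverable from the slide]
• Solved by linear programming.
• Temporal shift: a frame at the beginning of P can be mapped to a frame at the end of Q.
• Scale variations: a frame from P can be mapped to multiple frames in Q.
[Figure: example mapping between the frames of P and Q, with frame weights such as 1 and 1/2.]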
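For reference, the standard transportation formulation of the EMD can be written as below. The symbols are introduced here and are not taken from the slide: P has m frames with weights w_{p_i}, Q has n frames with weights w_{q_j}, and f_{ij} is the flow from frame i of P to frame j of Q. The uniform weights in the last line are an assumption about how frames might be weighted, not a detail confirmed by the source.

```latex
\begin{aligned}
\mathrm{EMD}(P, Q) &= \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\, d_{ij}}
                           {\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}}, \\
\text{where } f &= \arg\min_{f} \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\, d_{ij} \\
\text{subject to } & f_{ij} \ge 0, \qquad
  \sum_{j=1}^{n} f_{ij} \le w_{p_i}, \qquad
  \sum_{i=1}^{m} f_{ij} \le w_{q_j}, \\
& \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}
  = \min\Big(\sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{n} w_{q_j}\Big), \\
\text{assumed weights: } & w_{p_i} = \tfrac{1}{m}, \qquad w_{q_j} = \tfrac{1}{n}.
\end{aligned}
```

With these uniform weights the total flow equals 1, so the EMD reduces to the minimal transport cost. Since the objective and all constraints are linear in f_{ij}, this is a linear program, and nothing in it forces frame i to match frame i, which is what allows the temporal shifts and one-to-many mappings listed above.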
Experiments: keyframe-based feature performance
[Figure: bar chart with values from 0.0 to 1.0 per event (Car Crash, Protest, Greeting, Car Exiting, Combat, Marching, Riot, Running, Shooting, Walking, and the average) for four features: concept scores, Gabor texture, edge direction histogram, color moment.]
Dataset: TRECVID 2005
Evaluation metric: Average Precision
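As a reminder of how the evaluation metric behaves, here is a minimal sketch of non-interpolated average precision over a ranked list. It assumes every relevant shot appears somewhere in the ranked list; official TRECVID scoring may differ in details such as the result depth, and the relevance flags below are toy values.

```python
def average_precision(ranked_relevance):
    """Non-interpolated AP for a ranked list of 0/1 relevance flags."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Toy ranking (assumption): relevant shots at ranks 1 and 3.
ap = average_precision([1, 0, 1, 0, 0])
print(round(ap, 4))  # 0.8333
```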
References
Laptev, I. On space-time interest points. IJCV, 2005.
Dollár, P., Rabaud, V., Cottrell, G. and Belongie, S. Behavior recognition via sparse spatio-temporal features. ICCV VS-PETS, 2005.
Ballan, L., Bertini, M., Del Bimbo, A., Seidenari, L. and Serra, G. Effective codebooks for human action recognition. ICCV VOEC, 2009.
Xu, D. and Chang, S.-F. Video event recognition using kernel methods with multilevel temporal alignment. TPAMI, 2008.