The Shanghai-Hongkong Team at MediaEval2012: Violent
Scene Detection Using Trajectory-based Features
Yu-Gang Jiang*, Qi Dai*, Chun Chet Tan**, Xiangyang Xue*, Chong-Wah Ngo**
*School of Computer Science, Fudan University, Shanghai
**Department of Computer Science, City University of Hong Kong, HK
MediaEval 2012 Workshop, Oct 4-5, Pisa, Italy
Outline
• Introduction
• Framework
• Feature Extraction
• Classifiers
• Temporal Smoothing
• Results
• Discussions
• First 20 clips retrieved
Introduction
• Violent Scene Detection task [1]: a practical challenge with great potential in applications.
• Focus on novel features.
• Top performance in mAP@20, runner-up in mAP@100
[1] C.-H. Demarty, C. Penet, G. Gravier, and M. Soleymani. The MediaEval 2012 Affect Task: Violent Scenes Detection. In MediaEval 2012 Workshop, Pisa, Italy, 2012.
Framework
The circled numbers indicate the 5 submitted runs
[Figure: framework diagram. Labels: Video shots; Feature extraction (Trajectory-based (7 features), SIFT, Spatial-temporal interest point, MFCC audio feature, Concept-based); χ2 kernel SVM classifiers; All features except concept-based; Temporal feature smoothing; Detection score-level temporal smoothing; circled run numbers 1 to 5.]
Feature Extraction
• Trajectory-based features [2]:
- dense trajectory, HOG, HOF, MBH [5]
- TrajMF (relative locations and motions between trajectory pairs)
- Trajectory shape feature
• Advantages: robust to camera movement, rich in information, and implicitly capture object-object and object-background relationships.
[2] Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, 2012.
[5] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
Feature Extraction
• SIFT [4]
• STIP [3]
• MFCC
• Concept-based Features (10 concepts: blood, carchase, coldarms, fights, fire, firearms, gore, explosions, gunshots, screams)
[3] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64:107-123, 2005.
[4] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110, 2004.
Classifiers
• BoW representation
• Chi-squared kernel SVMs
• Kernel-level early fusion combines the multiple features
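As a rough illustration of the classifier stage, the sketch below computes an exponential chi-squared kernel for each feature's BoW histograms and fuses the kernels by averaging. The averaging rule, the `gamma` value, the function names, and the toy histograms are all assumptions made for illustration; the slides do not state how the kernels are combined.

```python
import math

def chi2_distance(x, y, eps=1e-10):
    """Chi-squared distance between two non-negative histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(x, y))

def chi2_kernel_matrix(hists, gamma=1.0):
    """Exponential chi-squared kernel: K[i][j] = exp(-gamma * d(h_i, h_j))."""
    n = len(hists)
    return [[math.exp(-gamma * chi2_distance(hists[i], hists[j]))
             for j in range(n)] for i in range(n)]

def fuse_kernels(kernels):
    """Fuse per-feature kernel matrices by averaging (assumed fusion rule)."""
    n = len(kernels[0])
    return [[sum(k[i][j] for k in kernels) / len(kernels)
             for j in range(n)] for i in range(n)]

# Toy BoW histograms for three shots under two hypothetical feature types.
hog_hists = [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0], [0.0, 0.1, 0.9]]
mfcc_hists = [[0.2, 0.8], [0.3, 0.7], [0.9, 0.1]]

fused = fuse_kernels([chi2_kernel_matrix(hog_hists),
                      chi2_kernel_matrix(mfcc_hists)])
```

The fused matrix could then be passed to any SVM implementation that accepts a precomputed kernel.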
Temporal Smoothing
• Feature Smoothing – averaged features over a three-shot window.
• Score Smoothing – averaged prediction scores over a three-shot window.
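Both smoothing variants can be sketched with the same moving average. The version below assumes a window centered on the current shot and truncated at the start and end of the video; the slides give only the window size, so the centering, the boundary handling, and the function names are assumptions.

```python
def smooth_scores(scores, window=3):
    """Average each shot's score with its neighbours in a centered window."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)                 # truncate at the first shot
        hi = min(len(scores), i + half + 1)   # truncate at the last shot
        neighbourhood = scores[lo:hi]
        smoothed.append(sum(neighbourhood) / len(neighbourhood))
    return smoothed

def smooth_features(features, window=3):
    """Apply the same averaging to every feature dimension across shots."""
    columns = [smooth_scores(list(col), window) for col in zip(*features)]
    return [list(row) for row in zip(*columns)]

# An isolated high score is spread over the adjacent shots.
shot_scores = [0.0, 0.9, 0.0]
smoothed_scores = smooth_scores(shot_scores)

# Per-shot feature vectors (one row per shot) are averaged dimension-wise.
shot_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
smoothed_features = smooth_features(shot_features)
```

Score smoothing operates on the one-dimensional classifier outputs, whereas feature smoothing averages the shot representations before classification.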
Results (mAP@20)
[Figure: bar chart of mean average precision at 20 for the five runs, plotted in the order r3, r2, r5, r4, r1; y-axis from 0 to 0.8.]
• Run 5: 7 dense trajectory features
• Run 4: Run 5 + SIFT + STIP + MFCC
• Run 3: Run 4 + concept scores
• Run 2: Run 4 + feature smoothing
• Run 1: Run 4 + score smoothing
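For readers unfamiliar with the metric, the sketch below shows one common way to compute average precision over the top k ranked shots. Normalizing by the number of relevant shots found in the top k is an assumption, as are the function name and the toy ranking; the slides do not spell out the official evaluation formula.

```python
def average_precision_at_k(ranked_labels, k=20):
    """Average precision over the top-k shots of a ranked list.

    ranked_labels: 1 for a violent shot, 0 otherwise, sorted by
    decreasing detection score.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_violent in enumerate(ranked_labels[:k], start=1):
        if is_violent:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumed normalization: number of relevant shots within the top k.
    return precision_sum / hits if hits else 0.0

# Toy ranking: violent shots at ranks 1 and 3.
ap = average_precision_at_k([1, 0, 1, 0], k=4)
```

Averaging this value over all test movies would give a mean average precision at k under the same assumption.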
Results (mAP@100)
• Run 5: 7 dense trajectory features
• Run 4: Run 5 + SIFT + STIP + MFCC
• Run 3: Run 4 + concept scores
• Run 2: Run 4 + feature smoothing
• Run 1: Run 4 + score smoothing
[Figure: bar chart of mean average precision at 100 for the five runs, plotted in the order r3, r4, r5, r2, r1; y-axis from 0 to 0.7.]
Discussions
• Adding SIFT + STIP + MFCC yields only an insignificant improvement; TrajMF already encodes the rich information of SIFT and STIP.
• Concept-based scores do not improve performance, likely because the SVMs overfit given insufficient training data. Even so, using mid-level concept detectors is a promising direction.
• Score smoothing boosts performance. Feature smoothing, which "blurs" the features across shots, might not be a good option.
First 20 clips retrieved
Thank You