Description
The Violent Scenes Detection task aims at evaluating algorithms that automatically localize violent segments in both Hollywood movies and short web videos. The definition of violence is subjective: "segments one would not let an 8-year-old child see in a movie because they contain physical violence". This is a highly challenging problem because of the strong content variations among the positive instances. In this year's evaluation, we adopted our recently proposed classification method, named regularized DNN, to fuse multiple features using Deep Neural Networks (DNN). We extracted a set of visual and audio features that have been observed to be useful, then applied the regularized DNN for feature fusion and classification. Results indicate that using multiple features is still very helpful and, more importantly, that our proposed regularized DNN offers significantly better results than the popular SVM. We achieved a mean average precision of 0.63 for the main task and 0.60 for the generalization task. http://ceur-ws.org/Vol-1263/mediaeval2014_submission_65.pdf
FUDAN-NJUST AT MEDIAEVAL 2014: VIOLENT SCENES DETECTION USING DEEP NEURAL NETWORKS
Qi Dai*, Zuxuan Wu*, Yu-Gang Jiang*, Xiangyang Xue*, Jinhui Tang#
*School of Computer Science, Fudan University, Shanghai, China
#School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
MediaEval 2014 Workshop, Oct 16-17, Barcelona, Spain
Problem
• Detecting violent scenes in both movies and short web videos
[Figure: example violent scenes from Hollywood movies and from short web videos]
System Overview
• Several features were used, including trajectory-based features and two other visual/audio features
• In addition to SVM, we adopted deep neural networks (DNN) as a feature fusion and classification method
[System diagram: video clips → feature extraction (FV-HOG, FV-HOF, FV-MBH, FV-TrajShape, TrajMF-HOG, TrajMF-HOF, TrajMF-MBH, STIP, MFCC) → SVM and DNN classification → fusion → smoothing & merging]
Features
• Trajectory-based features:
  Improved Dense Trajectories (HOG, HOF, MBH, Trajectory Shape), encoded with Fisher Vectors (FV)
  Dimension-reduced TrajMF (relative locations and motions between trajectory pairs), implemented based on:
• Two additional visual/audio features (complementary to the trajectory-based features):
  Spatio-temporal interest points (STIP)
  Audio MFCC
Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling
of human actions with motion reference points. In ECCV, 2012.
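As a rough illustration of the Fisher Vector encoding mentioned above, the sketch below encodes a set of local descriptors against a diagonal-covariance GMM, using the standard gradients with respect to the means and standard deviations plus power and L2 normalization. The GMM size, descriptor dimension, and data are toy values for illustration, not the settings used in our runs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Encode local descriptors as a Fisher Vector: gradients w.r.t.
    the GMM means and standard deviations (diagonal covariance)."""
    X = np.atleast_2d(descriptors)          # (N, D) local descriptors
    N = X.shape[0]
    q = gmm.predict_proba(X)                # (N, K) soft assignments
    pi = gmm.weights_                       # (K,)  mixture weights
    mu = gmm.means_                         # (K, D)
    sigma = np.sqrt(gmm.covariances_)       # (K, D) std devs (diag)

    fv_mu, fv_sigma = [], []
    for k in range(len(pi)):
        diff = (X - mu[k]) / sigma[k]
        # gradient w.r.t. the k-th mean
        fv_mu.append((q[:, k, None] * diff).sum(0) / (N * np.sqrt(pi[k])))
        # gradient w.r.t. the k-th standard deviation
        fv_sigma.append((q[:, k, None] * (diff ** 2 - 1)).sum(0)
                        / (N * np.sqrt(2 * pi[k])))
    fv = np.concatenate(fv_mu + fv_sigma)
    # power + L2 normalization, standard for FV encodings
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

# usage: fit a small GMM on toy "descriptors", then encode one clip
rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(rng.normal(size=(500, 8)))
clip_fv = fisher_vector(rng.normal(size=(120, 8)), gmm)
print(clip_fv.shape)  # 2 * K * D = (64,)
```

The resulting vector has dimension 2·K·D, which is why FV encodings of dense-trajectory descriptors become very high-dimensional in practice.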
Classifiers
• SVM
  Chi-square kernel for STIP and MFCC; linear kernel for the other features
  Kernel fusion within trajectory-based features; score-level late fusion to combine trajectory-based features with STIP and MFCC
• Regularized DNN (ACM Multimedia 2014 full paper)
  Multiple hand-crafted features are used as inputs
  Fuses features in a more rigorous fashion by considering both feature correlation and feature diversity
Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, X. Xue, Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification, In ACM Multimedia Orlando, USA, Nov. 2014. (Full Paper)
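The SVM side of the pipeline can be sketched as follows: a chi-square kernel SVM per histogram-style feature, with score-level late fusion by averaging decision values. The features, labels, and dimensions below are invented stand-ins, not our actual STIP/MFCC data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

# toy non-negative histogram features (stand-ins for STIP / MFCC bags-of-words)
rng = np.random.default_rng(0)
X_a, X_b = rng.random((100, 32)), rng.random((100, 16))
y = (rng.random(100) > 0.5).astype(int)
X_a_test, X_b_test = rng.random((10, 32)), rng.random((10, 16))

def chi2_svm_scores(X_train, y_train, X_test):
    """Train an SVM on a precomputed chi-square kernel; return decision scores."""
    clf = SVC(kernel="precomputed").fit(chi2_kernel(X_train), y_train)
    return clf.decision_function(chi2_kernel(X_test, X_train))

# score-level late fusion: average per-feature decision scores
fused = np.mean([chi2_svm_scores(X_a, y, X_a_test),
                 chi2_svm_scores(X_b, y, X_b_test)], axis=0)
print(fused.shape)  # (10,)
```

Kernel fusion (averaging kernel matrices before training, as we did within the trajectory-based features) trains a single SVM on the combined kernel instead of averaging scores afterwards.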
[Regularized DNN architecture: per-feature abstraction layers → feature fusion layers → classification layer]
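The regularized DNN itself adds correlation- and diversity-based regularizers on the fusion weights (see the ACM Multimedia 2014 paper). The numpy sketch below only illustrates the forward-pass structure of the architecture: per-feature abstraction, fusion by concatenation, then classification. Layer sizes and weights are made up for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dnn_fuse_forward(features, W_abs, W_fuse, W_cls):
    """Multi-feature fusion network (forward pass only):
    per-feature abstraction -> shared fusion layer -> classification."""
    # 1) abstract each hand-crafted feature with its own layer
    hidden = [relu(x @ W) for x, W in zip(features, W_abs)]
    # 2) fuse by concatenation followed by a shared layer
    fused = relu(np.concatenate(hidden, axis=1) @ W_fuse)
    # 3) sigmoid output = per-clip violence score
    return 1.0 / (1.0 + np.exp(-(fused @ W_cls)))

# two toy feature streams for a batch of 5 clips
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(5, 64)), rng.normal(size=(5, 32))
W_abs = [rng.normal(size=(64, 16)) * 0.1, rng.normal(size=(32, 16)) * 0.1]
W_fuse = rng.normal(size=(32, 8)) * 0.1
W_cls = rng.normal(size=(8, 1)) * 0.1
scores = dnn_fuse_forward([f1, f2], W_abs, W_fuse, W_cls)
print(scores.shape)  # (5, 1)
```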
Results (MAP2014)
• Run 1: SVM + score merging
• Run 2: DNN + merging
• Run 3: SVM + DNN + merging
• Run 4: SVM + DNN + smoothing + merging
• Run 5: SVM + DNN
• Note: for DNN, we used fewer features (excluding the FV encodings of HOG, HOF, and MBH)
[Bar chart: MAP of Runs 1–5 on the Main Task and the Generalization Task; best results: 0.63 (Main Task) and 0.60 (Generalization Task)]
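At its core, the MAP2014 metric averages per-video average precision over the test set. As a toy illustration of the underlying average-precision computation for one video's segment scores (labels and scores below are invented):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# invented segment-level ground truth (1 = violent) and detection scores
y = np.array([1, 0, 1, 1, 0, 0])
s = np.array([0.9, 0.4, 0.8, 0.35, 0.5, 0.1])

# ranking by score: positives at ranks 1, 2, 5 -> AP = (1/1 + 2/2 + 3/5) / 3
ap = average_precision_score(y, s)
print(round(ap, 2))  # 0.87
```

MAP is then the mean of these per-video AP values; the official MAP2014 definition additionally specifies how predicted segments are matched against ground-truth annotations.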
Observations
• DNN is significantly better than SVM, even though some features were not used in the DNN.
• Directly fusing SVM and DNN incurs a small performance drop; better results may be obtained after parameter optimization (needs more investigation).
• Smoothing and merging are useful, but the correct order of applying the two (i.e., which is used first) and their individual contributions need more experiments to be fully understood.
• Some conclusions drawn from the main task do not carry over to the generalization task; this also requires further investigation before we have a concrete understanding of the problem.
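A minimal sketch of what the smoothing and merging post-processing could look like: moving-average smoothing of per-shot scores, then thresholding and merging positive runs separated by small gaps into single segments. The window size, threshold, and gap tolerance below are illustrative, not our actual post-processing parameters.

```python
import numpy as np

def smooth_scores(scores, window=5):
    """Moving-average smoothing of per-shot violence scores."""
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="same")

def merge_segments(scores, threshold=0.5, max_gap=1):
    """Threshold smoothed scores, then merge positive runs separated
    by at most `max_gap` negative shots into single (start, end) segments."""
    positive = scores >= threshold
    segments, start, gap = [], None, 0
    for i, p in enumerate(positive):
        if p:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap > max_gap:           # gap too large: close the segment
                segments.append((start, i - gap))
                start, gap = None, 0
    if start is not None:               # close a segment running to the end
        segments.append((start, len(positive) - 1 - gap))
    return segments

raw = np.array([0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.7])
print(merge_segments(smooth_scores(raw, 3)))  # [(1, 3)]
```

Note that the two steps do not commute (smoothing before merging bridges small score dips, merging first does not), which is exactly why their ordering matters in the observation above.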
Thank You!
Questions: [email protected]