

Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Networks


DESCRIPTION

The Violent Scenes Detection task aims at evaluating algorithms that automatically localize violent segments in both Hollywood movies and short web videos. The definition of violence is subjective: "the segments that one would not let an 8 years old child see in a movie because they contain physical violence". This is a highly challenging problem because of the strong content variations among the positive instances. In this year's evaluation, we adopted our recently proposed classification method, named regularized DNN, to fuse multiple features using Deep Neural Networks (DNN). We extracted a set of visual and audio features that have previously been observed to be useful, and then applied the regularized DNN for feature fusion and classification. Results indicate that using multiple features is still very helpful and, more importantly, that our proposed regularized DNN offers significantly better results than the popular SVM. We achieved a mean average precision of 0.63 for the main task and 0.60 for the generalization task. http://ceur-ws.org/Vol-1263/mediaeval2014_submission_65.pdf


Page 1

Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Networks

Qi Dai*, Zuxuan Wu*, Yu-Gang Jiang*, Xiangyang Xue*, Jinhui Tang#

*School of Computer Science, Fudan University, Shanghai, China

#School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China

[email protected]

MediaEval 2014 Workshop, Oct 16-17, Barcelona, Spain

Page 2

Problem

• Detecting violent scenes in both movies and short web videos

[Figures: a violent scene in a Hollywood movie and a violent scene in a short web video]

Page 3

System Overview

• Several features were used, including trajectory-based features and two other visual/audio features

• In addition to SVM, we adopted deep neural networks (DNN) as a feature fusion and classification method

[Figure: system overview. Video clips go through feature extraction (FV-HOG, FV-HOF, FV-MBH, FV-TrajShape, TrajMF-HOG, TrajMF-HOF, TrajMF-MBH, STIP, MFCC), followed by SVM and DNN classification, fusion, and smoothing & merging; the numbered paths (1-5) appear to mark the five submitted runs.]

Page 4

Features

• Trajectory-based features:

Improved Dense Trajectories (HOG, HOF, MBH, Trajectory Shape)

Features are encoded with Fisher Vectors (FV); a minimal encoding sketch follows the reference below

Dimension-reduced TrajMF (relative locations and motions between trajectory pairs), implemented based on:

• Two additional visual/audio features (complementary to the trajectory-based features):

Spatio-temporal interest points (STIP)

Audio MFCC

Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, 2012.
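The slides do not include implementation details for the FV encoding; the snippet below is a minimal Fisher Vector sketch in Python (NumPy + scikit-learn), assuming a diagonal-covariance GMM and using only the gradients with respect to means and variances. The number of Gaussians, any PCA preprocessing, and variable names such as local_descs are illustrative, not taken from the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(local_descs, gmm):
    """Encode local descriptors (T x D) as a Fisher Vector using a
    diagonal-covariance GMM: gradients w.r.t. means and variances,
    followed by power and L2 normalization."""
    T = local_descs.shape[0]
    gamma = gmm.predict_proba(local_descs)                    # (T, K) soft assignments
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_   # (K,), (K, D), (K, D)

    # Normalized differences between descriptors and Gaussian means
    diff = (local_descs[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]

    # Gradients w.r.t. means and (diagonal) variances
    g_mu = np.einsum('tk,tkd->kd', gamma, diff) / (T * np.sqrt(w)[:, None])
    g_var = np.einsum('tk,tkd->kd', gamma, diff ** 2 - 1.0) / (T * np.sqrt(2 * w)[:, None])

    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                    # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                  # L2 normalization

# Illustrative usage: fit the GMM on descriptors sampled from training videos,
# then encode each clip's trajectory descriptors into a single vector.
# gmm = GaussianMixture(n_components=256, covariance_type='diag').fit(sampled_descs)
# clip_fv = fisher_vector(clip_descs, gmm)
```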

Page 5

Classifiers

• SVM

Chi-square kernel for STIP and MFCC; linear kernel for the other features

Kernel fusion within the trajectory-based features; score-level late fusion to combine the trajectory-based features with STIP and MFCC (a minimal sketch follows this slide's reference)

• Regularized DNN (ACM Multimedia 2014 full paper)

Multiple hand-crafted features are used as inputs

Features are fused in a more rigorous fashion by considering both feature correlation and feature diversity

Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue. Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification. In ACM Multimedia, Orlando, USA, Nov. 2014. (Full Paper)
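The exact SVM settings (C values, kernel parameters, fusion weights) are not given in the slides; the following is a minimal scikit-learn sketch of the described setup, assuming precomputed kernels: linear kernels averaged across the trajectory-based features (kernel fusion), an exponentiated chi-square kernel for the histogram-like STIP and MFCC features, and unweighted averaging of decision scores for the score-level late fusion. All variable names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_precomputed_svm(K_train, y_train, C=1.0):
    """Train an SVM on a precomputed (n_train x n_train) kernel matrix."""
    return SVC(kernel='precomputed', C=C).fit(K_train, y_train)

def fused_linear_kernel(feats_a, feats_b):
    """Kernel fusion: average the linear kernels of several feature types.
    feats_a / feats_b are lists of (N_a x D_i) / (N_b x D_i) matrices."""
    return np.mean([a @ b.T for a, b in zip(feats_a, feats_b)], axis=0)

def late_fusion(clfs, test_kernels, weights=None):
    """Score-level late fusion: (weighted) average of SVM decision scores."""
    scores = np.stack([clf.decision_function(K) for clf, K in zip(clfs, test_kernels)])
    weights = np.full(len(clfs), 1.0 / len(clfs)) if weights is None else np.asarray(weights)
    return weights @ scores

# Illustrative usage:
# K_traj = fused_linear_kernel(traj_feats_train, traj_feats_train)   # trajectory features
# K_stip = chi2_kernel(stip_train)                                   # non-negative histograms
# K_mfcc = chi2_kernel(mfcc_train)
# clfs = [train_precomputed_svm(K, y_train) for K in (K_traj, K_stip, K_mfcc)]
# test_kernels = [fused_linear_kernel(traj_feats_test, traj_feats_train),
#                 chi2_kernel(stip_test, stip_train),
#                 chi2_kernel(mfcc_test, mfcc_train)]
# violence_scores = late_fusion(clfs, test_kernels)
```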

Page 6

[Figure: architecture of the regularized DNN, showing feature abstraction, feature fusion, and classification layers.]

Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue. Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification. In ACM Multimedia, Orlando, USA, Nov. 2014. (Full Paper)
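The regularized DNN of Wu et al. constrains the fusion weights with a structural regularizer that exploits feature correlation and diversity; that regularizer is not reproduced here. The sketch below is only a plain multi-branch network in PyTorch matching the layout shown on the slide (feature abstraction, feature fusion, classification); layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiFeatureNet(nn.Module):
    """Plain multi-branch network: one abstraction branch per hand-crafted feature,
    a shared fusion layer, and a classification layer. The correlation/diversity
    regularizer of the regularized DNN (Wu et al., ACM MM 2014) is NOT included."""

    def __init__(self, feature_dims, abstract_dim=512, fused_dim=256, n_classes=2):
        super().__init__()
        # Feature abstraction: one small subnetwork per input feature type
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(d, abstract_dim), nn.ReLU()) for d in feature_dims
        ])
        # Feature fusion: concatenate abstracted features and project them jointly
        self.fusion = nn.Sequential(
            nn.Linear(abstract_dim * len(feature_dims), fused_dim), nn.ReLU()
        )
        # Classification: violent vs. non-violent scores
        self.classifier = nn.Linear(fused_dim, n_classes)

    def forward(self, features):
        # features: list of tensors, one (batch, d_i) tensor per feature type
        abstracted = [branch(x) for branch, x in zip(self.branches, features)]
        fused = self.fusion(torch.cat(abstracted, dim=1))
        return self.classifier(fused)

# Illustrative usage: net = MultiFeatureNet(feature_dims=[4000, 4000, 5000])
```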

Page 7

Results (MAP2014)

• Run 1: SVM + Score Merging
• Run 2: DNN + Merging
• Run 3: SVM + DNN + Merging
• Run 4: SVM + DNN + Smoothing + Merging
• Run 5: SVM + DNN
• Note: for the DNN, we used fewer features (excluding the FV encodings of HOG, HOF, and MBH)

[Bar chart: MAP2014 of Runs 1-5 on the Main Task and the Generalization Task; the best scores were 0.63 on the Main Task and 0.60 on the Generalization Task.]

Page 8

Observations

• DNN is significantly better than SVM, even though some features were not used in the DNN.

• Directly fusing SVM and DNN incurs a small performance drop. A better result may be obtained after parameter optimization (this needs more investigation).

• Smoothing and merging are useful, but the correct order of applying the two (i.e., which comes first) and their individual contributions need more experiments to be fully understood (a minimal post-processing sketch follows this list).

• Some conclusions drawn from the main task do not carry over to the generalization task; this also requires further investigation to reach a more concrete understanding of the problem.
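The slides do not specify how smoothing and merging are implemented; as a hedged illustration of this kind of post-processing, the sketch below smooths per-segment violence scores with a moving average and then merges adjacent above-threshold segments into longer detections. The window size and threshold are illustrative assumptions, not values from the submission.

```python
import numpy as np

def smooth_scores(scores, window=5):
    """Moving-average smoothing of per-segment violence scores (illustrative window)."""
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode='same')

def merge_segments(segments, scores, threshold=0.5):
    """Keep segments whose score exceeds the threshold and merge temporally
    adjacent detections. segments: sorted, non-overlapping (start, end) pairs."""
    merged = []
    for (start, end), s in zip(segments, scores):
        if s < threshold:
            continue
        if merged and start <= merged[-1][1] + 1:   # adjacent to the previous detection
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Illustrative usage:
# detections = merge_segments(shot_segments, smooth_scores(raw_scores), threshold=0.5)
```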

Page 9

Thank You!

Questions: [email protected]