Description
The Violent Scenes Detection task aims at evaluating algorithms that automatically localize violent segments in both Hollywood movies and short web videos. The definition of violence is subjective: "segments one would not let an 8-year-old child see in a movie because they contain physical violence". This is a highly challenging problem because of the strong content variations among the positive instances. In this year's evaluation, we adopted our recently proposed classification method, named regularized DNN, to fuse multiple features using Deep Neural Networks (DNN). We extracted a set of visual and audio features that have been observed to be useful, then applied the regularized DNN for feature fusion and classification. Results indicate that using multiple features is still very helpful and, more importantly, that our proposed regularized DNN offers significantly better results than the popular SVM. We achieved a mean average precision of 0.63 for the main task and 0.60 for the generalization task. http://ceur-ws.org/Vol-1263/mediaeval2014_submission_65.pdf
FUDAN-NJUST AT MEDIAEVAL 2014: VIOLENT SCENES DETECTION USING DEEP NEURAL NETWORKS
Qi Dai*, Zuxuan Wu*, Yu-Gang Jiang*, Xiangyang Xue*, Jinhui Tang#
*School of Computer Science, Fudan University, Shanghai, China
#School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
MediaEval 2014 Workshop, Oct 16-17, Barcelona, Spain
Problem
• Detecting violent scenes in both movies and short web videos
[Figure: example violent scenes from Hollywood movies and from short web videos]
System Overview
• Several features were used, including trajectory-based features and two other visual/audio features
• In addition to SVM, we adopted deep neural networks (DNN) as a feature fusion and classification method
[System diagram: video clips → feature extraction (FV-HOG, FV-HOF, FV-MBH, FV-TrajShape, TrajMF-HOG, TrajMF-HOF, TrajMF-MBH, STIP, MFCC) → SVM and DNN classification → fusion → smoothing & merging]
Features
• Trajectory-based features:
  Improved Dense Trajectories (HOG, HOF, MBH, Trajectory Shape), encoded with Fisher Vectors (FV)
  Dimension-reduced TrajMF (relative locations and motions between trajectory pairs), implemented based on:
• Two additional visual/audio features (complementary to the trajectory-based features):
  Spatio-temporal interest points (STIP)
  Audio MFCC
Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling
of human actions with motion reference points. In ECCV, 2012.
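As a rough illustration of the Fisher Vector encoding mentioned above, the sketch below encodes a set of local descriptors against a diagonal-covariance GMM, using the standard gradients with respect to the means and standard deviations plus power and L2 normalization. The GMM size, descriptor dimension, and data are toy values for illustration, not the settings used in our runs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Encode local descriptors as a Fisher Vector: gradients w.r.t.
    the GMM means and standard deviations (diagonal covariance)."""
    X = np.atleast_2d(descriptors)          # (N, D) local descriptors
    N = X.shape[0]
    q = gmm.predict_proba(X)                # (N, K) soft assignments
    pi = gmm.weights_                       # (K,)  mixture weights
    mu = gmm.means_                         # (K, D)
    sigma = np.sqrt(gmm.covariances_)       # (K, D) std devs (diag)

    fv_mu, fv_sigma = [], []
    for k in range(len(pi)):
        diff = (X - mu[k]) / sigma[k]
        # gradient w.r.t. the k-th mean
        fv_mu.append((q[:, k, None] * diff).sum(0) / (N * np.sqrt(pi[k])))
        # gradient w.r.t. the k-th standard deviation
        fv_sigma.append((q[:, k, None] * (diff ** 2 - 1)).sum(0)
                        / (N * np.sqrt(2 * pi[k])))
    fv = np.concatenate(fv_mu + fv_sigma)
    # power + L2 normalization, standard for FV encodings
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

# usage: fit a small GMM on toy "descriptors", then encode one clip
rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(rng.normal(size=(500, 8)))
clip_fv = fisher_vector(rng.normal(size=(120, 8)), gmm)
print(clip_fv.shape)  # 2 * K * D = (64,)
```

The resulting vector has dimension 2·K·D, which is why FV encodings of dense-trajectory descriptors become very high-dimensional in practice.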
Classifiers
• SVM
  Chi-square kernel for STIP and MFCC; linear kernel for the other features
  Kernel fusion within trajectory-based features; score-level late fusion to combine trajectory-based features with STIP and MFCC
• Regularized DNN (ACM Multimedia 2014 full paper)
  Multiple hand-crafted features are used as inputs
  Fuses features in a more rigorous fashion by considering both feature correlation and feature diversity
Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, X. Xue, Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification, In ACM Multimedia Orlando, USA, Nov. 2014. (Full Paper)
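The SVM side of the pipeline can be sketched as follows: a chi-square kernel SVM per histogram-style feature, with score-level late fusion by averaging decision values. The features, labels, and dimensions below are invented stand-ins, not our actual STIP/MFCC data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

# toy non-negative histogram features (stand-ins for STIP / MFCC bags-of-words)
rng = np.random.default_rng(0)
X_a, X_b = rng.random((100, 32)), rng.random((100, 16))
y = (rng.random(100) > 0.5).astype(int)
X_a_test, X_b_test = rng.random((10, 32)), rng.random((10, 16))

def chi2_svm_scores(X_train, y_train, X_test):
    """Train an SVM on a precomputed chi-square kernel; return decision scores."""
    clf = SVC(kernel="precomputed").fit(chi2_kernel(X_train), y_train)
    return clf.decision_function(chi2_kernel(X_test, X_train))

# score-level late fusion: average per-feature decision scores
fused = np.mean([chi2_svm_scores(X_a, y, X_a_test),
                 chi2_svm_scores(X_b, y, X_b_test)], axis=0)
print(fused.shape)  # (10,)
```

Kernel fusion (averaging kernel matrices before training, as we did within the trajectory-based features) trains a single SVM on the combined kernel instead of averaging scores afterwards.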
[Regularized DNN architecture: per-feature abstraction layers → feature fusion layers → classification layer]
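The regularized DNN itself adds correlation- and diversity-based regularizers on the fusion weights (see the ACM Multimedia 2014 paper). The numpy sketch below only illustrates the forward-pass structure of the architecture: per-feature abstraction, fusion by concatenation, then classification. Layer sizes and weights are made up for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dnn_fuse_forward(features, W_abs, W_fuse, W_cls):
    """Multi-feature fusion network (forward pass only):
    per-feature abstraction -> shared fusion layer -> classification."""
    # 1) abstract each hand-crafted feature with its own layer
    hidden = [relu(x @ W) for x, W in zip(features, W_abs)]
    # 2) fuse by concatenation followed by a shared layer
    fused = relu(np.concatenate(hidden, axis=1) @ W_fuse)
    # 3) sigmoid output = per-clip violence score
    return 1.0 / (1.0 + np.exp(-(fused @ W_cls)))

# two toy feature streams for a batch of 5 clips
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(5, 64)), rng.normal(size=(5, 32))
W_abs = [rng.normal(size=(64, 16)) * 0.1, rng.normal(size=(32, 16)) * 0.1]
W_fuse = rng.normal(size=(32, 8)) * 0.1
W_cls = rng.normal(size=(8, 1)) * 0.1
scores = dnn_fuse_forward([f1, f2], W_abs, W_fuse, W_cls)
print(scores.shape)  # (5, 1)
```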
Results (MAP2014)
• Run 1: SVM + score merging
• Run 2: DNN + merging
• Run 3: SVM + DNN + merging
• Run 4: SVM + DNN + smoothing + merging
• Run 5: SVM + DNN
• Note: for DNN, we used fewer features (excluding the FV encodings of HOG, HOF, and MBH)
[Bar chart: MAP of Runs 1–5 on the Main Task and the Generalization Task; best results: 0.63 (Main Task) and 0.60 (Generalization Task)]
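At its core, the MAP2014 metric averages per-video average precision over the test set. As a toy illustration of the underlying average-precision computation for one video's segment scores (labels and scores below are invented):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# invented segment-level ground truth (1 = violent) and detection scores
y = np.array([1, 0, 1, 1, 0, 0])
s = np.array([0.9, 0.4, 0.8, 0.35, 0.5, 0.1])

# ranking by score: positives at ranks 1, 2, 5 -> AP = (1/1 + 2/2 + 3/5) / 3
ap = average_precision_score(y, s)
print(round(ap, 2))  # 0.87
```

MAP is then the mean of these per-video AP values; the official MAP2014 definition additionally specifies how predicted segments are matched against ground-truth annotations.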
Observations
• DNN is significantly better than SVM, even though some features were not used in the DNN.
• Directly fusing SVM and DNN incurs a small performance drop; better results may be obtained after parameter optimization (needs more investigation).
• Smoothing and merging are useful, but the correct order of applying the two (i.e., which is used first) and their individual contributions need more experiments to be fully understood.
• Some conclusions drawn from the main task do not carry over to the generalization task; this also requires further investigation before we have a concrete understanding of the problem.
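A minimal sketch of what the smoothing and merging post-processing could look like: moving-average smoothing of per-shot scores, then thresholding and merging positive runs separated by small gaps into single segments. The window size, threshold, and gap tolerance below are illustrative, not our actual post-processing parameters.

```python
import numpy as np

def smooth_scores(scores, window=5):
    """Moving-average smoothing of per-shot violence scores."""
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="same")

def merge_segments(scores, threshold=0.5, max_gap=1):
    """Threshold smoothed scores, then merge positive runs separated
    by at most `max_gap` negative shots into single (start, end) segments."""
    positive = scores >= threshold
    segments, start, gap = [], None, 0
    for i, p in enumerate(positive):
        if p:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap > max_gap:           # gap too large: close the segment
                segments.append((start, i - gap))
                start, gap = None, 0
    if start is not None:               # close a segment running to the end
        segments.append((start, len(positive) - 1 - gap))
    return segments

raw = np.array([0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.7])
print(merge_segments(smooth_scores(raw, 3)))  # [(1, 3)]
```

Note that the two steps do not commute (smoothing before merging bridges small score dips, merging first does not), which is exactly why their ordering matters in the observation above.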
Thank You!
Questions: [email protected]