Learnable pooling with Context Gating for video classification
Antoine Miech (1,2), Josef Sivic (1,2,3), Ivan Laptev (1,2)
(1) DI ENS, École Normale Supérieure, PSL Research University; (2) Inria; (3) CIIRC, CTU in Prague
Ranked 1st at Kaggle
Model Overview
Goal
Challenges
Contributions
LOUPE TensorFlow toolbox: github.com/antoine77340/LOUPE
arXiv: 1706.06905
Focus on aggregation
Results on the Youtube-8M dataset
Training speed and Generalization
Learnable pooling via Clustering
Context Gating
References
• Multi-label video tagging
Example 1
Groundtruth: Car - Vehicle - Sport Utility Vehicle - Dacia Duster - Renault - 4-Wheel Drive
Top 6 scores: Car - Vehicle - Sport Utility Vehicle - Dacia Duster - Fiat Automobiles - Volkswagen Beetles
Example 2
Groundtruth: Tree - Christmas Tree - Christmas Decoration - Christmas
Top 6 scores: Christmas - Christmas decoration - Origami - Paper - Tree - Christmas Tree
• Large diversity of labels
• Incomplete and noisy annotation
• What is a good representation for video?
• Learnable pooling for video representations
• Context Gating: A non-linear learnable module for modeling interdependencies between activations
[Figure: Context Gating block. Input, Fully-connected, Sigmoid, Multiplication with Input, Gated Output. Two panels with axis label "Feature activation" and entries Ski, Tree, Snow.]
• Models interdependencies between activations with a self-gating mechanism
• Adds quadratic interactions
• Inspired by Gated Linear Unit [3]

[1] R. Arandjelović et al. NetVLAD: CNN architecture for weakly supervised place recognition. CVPR 2016.
[2] J. Sivic et al. Video Google: a text retrieval approach to object matching in videos. ICCV 2003.
[3] Y. N. Dauphin et al. Language modeling with gated convolutional networks. arXiv:1612.08083.
[4] H. Jégou et al. Aggregating local descriptors into a compact image representation. CVPR 2010.
[5] F. Perronnin et al. Fisher kernels on visual vocabularies for image categorization. CVPR 2007.
Release
• How to aggregate multi-modal sequences of features into a good compact representation?
[Figure: Visual features and audio features are fed to a supervised model, which produces a single and compact pooled representation.]
• Training time vs. performance for LSTM, GRU, NetVLAD and Gated NetVLAD.
• Performance as a function of the training set size.
• Soft-DBoW
Soft Bag-of-Words [2] formulated in terms of differentiable operations.
• NetVLAD [1] / NetRVLAD
NetVLAD: Approximates a VLAD [4] encoding with differentiable operations.
NetRVLAD: Residual-less NetVLAD.
• NetFV
A variant of the Fisher Vector representation [5] that approximates the Gaussian Mixture Model with differentiable operations. This is done through a simple extension of NetVLAD [1].
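
The clustering-based pooling schemes listed above can be sketched in a few lines. The block below is only an illustration in plain Python, not the LOUPE implementation: the function names, the toy values, and the omission of the normalization steps are all assumptions made for brevity. It shows the soft assignment of each descriptor to K clusters, the Soft-DBoW histogram, and the NetVLAD and NetRVLAD aggregations (the latter simply drops the cluster-center residual).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_assign(x, W, b):
    """Soft assignment of descriptor x to K clusters: softmax(W x + b)."""
    return softmax([sum(w * xj for w, xj in zip(wk, x)) + bk
                    for wk, bk in zip(W, b)])

def soft_dbow(X, W, b):
    """Soft bag-of-words: sum of soft assignments over all descriptors."""
    hist = [0.0] * len(W)
    for x in X:
        for k, a_k in enumerate(soft_assign(x, W, b)):
            hist[k] += a_k
    return hist

def netvlad(X, W, b, C, residual=True):
    """K x D matrix of assignment-weighted sums of (x - c_k).

    With residual=False the center c_k is not subtracted (NetRVLAD).
    Normalization steps are omitted in this sketch.
    """
    K, D = len(W), len(X[0])
    V = [[0.0] * D for _ in range(K)]
    for x in X:
        a = soft_assign(x, W, b)
        for k in range(K):
            for j in range(D):
                center = C[k][j] if residual else 0.0
                V[k][j] += a[k] * (x[j] - center)
    return V

# Hypothetical toy input: N=2 descriptors of dimension D=2, K=2 clusters.
X = [[1.0, 0.0], [0.0, 1.0]]
W = [[2.0, 0.0], [0.0, 2.0]]   # assignment weights (learned in practice)
b = [0.0, 0.0]                 # assignment biases (learned in practice)
C = [[0.5, 0.5], [0.5, 0.5]]   # cluster centers (learned in practice)

bow = soft_dbow(X, W, b)
vlad = netvlad(X, W, b, C)
rvlad = netvlad(X, W, b, C, residual=False)
```

Because every operation here is a sum, product, or softmax, the same computation is differentiable when written in a framework such as TensorFlow, so W, b, and C can be trained jointly with the classifier.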
[Figure: GAP (83.0 to 85.5) as a function of the number of models in the ensemble (0 to 25).]
• Comparison of different pooling based methods (with or without Context Gating) and the RNN pooling approaches.
• The effect of ensembling different methods. Averaging the prediction scores of 7 different models achieves first place out of 650 teams in the Kaggle Youtube-8M challenge. The models are: Gated NetVLAD, Gated NetRVLAD, Gated NetFV, Gated Soft-DBoW, DBoW, GRU and LSTM.
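
Averaging prediction scores across models, as described above, can be sketched as follows. The scores and the three-model, three-label setup are hypothetical values chosen for illustration, and `ensemble_average` is an illustrative name, not a function from the authors' code.

```python
def ensemble_average(predictions):
    """Average per-label scores over models.

    predictions: list of per-model score lists, all of the same length.
    """
    n_models = len(predictions)
    n_labels = len(predictions[0])
    return [sum(p[j] for p in predictions) / n_models
            for j in range(n_labels)]

# Hypothetical scores from three models for three labels of one video.
preds = [
    [0.9, 0.2, 0.4],
    [0.7, 0.4, 0.2],
    [0.8, 0.3, 0.6],
]
avg = ensemble_average(preds)

# Index of the label with the highest averaged score.
top = max(range(len(avg)), key=avg.__getitem__)
```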
[Figure: GAP (65 to 85) as a function of the number of training samples (10^5 to 10^7) for Gated NetVLAD, NetVLAD, LSTM, and Average Pooling.]
Y = σ(WX + b) ∘ X
X: input layer
Y: output layer
W, b: parameters to learn
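
A minimal sketch of Context Gating in plain Python, following the blocks named on the poster: a fully-connected layer, a sigmoid, and an element-wise multiplication with the input. The weights below are hypothetical and set by hand purely for illustration (in the model they are learned), and the Ski/Tree/Snow values are made up rather than read from the poster's figure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def context_gating(x, W, b):
    """Return Y = sigmoid(W X + b) * X, with * the element-wise product."""
    gates = [
        sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]
    return [g * x_i for g, x_i in zip(gates, x)]

labels = ["Ski", "Tree", "Snow"]
x = [2.0, 1.0, 2.0]            # hypothetical input activations

# Hypothetical parameters: the gate for "Tree" is pushed down when
# "Ski" and "Snow" are active; the other gates depend only on the bias.
W = [
    [0.0, 0.0, 0.0],
    [-1.0, 0.0, -1.0],
    [0.0, 0.0, 0.0],
]
b = [2.0, 2.0, 2.0]

y = context_gating(x, W, b)
for name, before, after in zip(labels, x, y):
    print(f"{name}: {before:.2f} -> {after:.2f}")
```

Since each gate is a function of the whole input and is then multiplied by one input component, the output contains products of pairs of input activations, which is one way to read the "quadratic interactions" bullet.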
[Figure: GAP (0.70 to 0.84) as a function of training time in hours (2 to 14) for Gated NetVLAD, NetVLAD, GRU, and LSTM.]
• 7+ million videos
• 4716 labels
• 3.4 labels per video on average
• 450,000 video hours
[Figure: Model overview. Stages: INPUT FEATURES, FEATURE POOLING, CLASSIFICATION. Blocks: video features and audio features, each followed by learnable pooling (e.g. NetVLAD, NetFV); FC layer; Context Gating; MoE; Context Gating.]
[Figure: GAP (81.5 to 83.5) per model: NetVLAD, NetRVLAD, NetFV, Soft-DBoW, GRU, LSTM. Legend: With Context Gating, Based on Clustering, RNN.]