
Convolutional Neural Networks (Part II)

08, 10 & 17 Nov, 2016

J. Ezequiel Soto S.
Image Processing 2016

Prof. Luiz Velho


Summary & References

● 08/11: ImageNet Classification with Deep Convolutional Neural Networks (2012, Krizhevsky et al.) [source]
● 10/11: Going Deeper with Convolutions (2015, Szegedy et al.) [source]
● 17/11: Painting Style Transfer for Head Portraits using Convolutional Neural Networks (2016, Selim & Elgharib) [source]
+ An Analysis of Deep Neural Network Models for Practical Applications (2016, Canziani & Culurciello) [source]
+ Provable Bounds for Learning Some Deep Representations (2013, Arora et al.) [source]


Going Deeper with Convolutions

Szegedy et al., 2015


Outline
● Introduction
● Related Work
● Motivation
● Architecture Detail
● GoogLeNet
● Training
● ILSVRC 2014
● Conclusions


Introduction
● GoogLeNet → the submission to ILSVRC 2014

● Accuracy + low computational cost (1.5 billion multiply-adds at inference) → real-world applicability

● Efficient CNN architecture: Inception

● Depth in two senses: more network layers + the Inception module

● Results!!! → new state of the art


Related Work
● Standard CNN layer: convolution + normalization + max pooling
● Good results on MNIST, CIFAR and ImageNet (with dropout against overfitting)
● Concerns that max-pooling loses spatial information
● Neuroscience model of primate vision: a stack of filters → inspiration for the Inception module
● Network in Network (NiN) model
● 1×1 convolutions (sketch below):
– increase depth
– dimension reduction (reduce computational cost)
● Regions with Convolutional Neural Networks: R-CNN
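To make the 1×1 bullet concrete: a 1×1 convolution is just a linear map applied across channels at every spatial position, so it can cheaply shrink the channel dimension before a more expensive filter. A minimal PyTorch sketch; the sizes (256 → 64 channels, 28×28 maps) are illustrative choices, not numbers from the paper:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 256, 28, 28)              # one 28x28 feature map with 256 channels
    reduce = nn.Conv2d(256, 64, kernel_size=1)   # 1x1 convolution: 256 -> 64 channels

    # A 1x1 convolution is a per-pixel linear map across channels:
    w = reduce.weight[:, :, 0, 0]                # shape (64, 256)
    b = reduce.bias                              # shape (64,)
    y_conv = reduce(x)
    y_lin = torch.einsum('oi,bihw->bohw', w, x) + b[None, :, None, None]
    print(torch.allclose(y_conv, y_lin, atol=1e-5))   # True: same operation

    # The reduction pays off before a larger filter (multiply-add counts):
    hw = 28 * 28
    direct  = hw * 3 * 3 * 256 * 256              # 3x3 conv applied directly, 256 -> 256
    reduced = hw * (256 * 64 + 3 * 3 * 64 * 256)  # 1x1 reduce to 64, then 3x3 conv to 256
    print(direct, reduced)                        # the reduced path needs far fewer operations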


Motivation
● Improve CNNs by growing them deeper and wider…
– Too many parameters → overfitting
– Computational cost: in two chained layers, 2× the filters → 2² × the computation (see the sketch below)
– Zero entries? → sparsity control*
– Lack of structure, large numbers of filters and large batches → efficient use of dense computation

* Theoretical results: 2013, Arora et al., “Provable Bounds for Learning Some Deep Representations”, 54 p.
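A back-of-the-envelope check of the cost bullet above, in plain Python (map size and filter counts are illustrative): if two chained convolutional layers both double their filter banks, the second layer sees twice the input channels and produces twice the output channels, so its multiply-add count grows by 2² = 4.

    def conv_madds(h, w, k, c_in, c_out):
        """Multiply-adds of a k x k convolution over an h x w feature map."""
        return h * w * k * k * c_in * c_out

    # Second of two chained 3x3 layers, before and after doubling the filters:
    before = conv_madds(28, 28, 3, 128, 128)
    after  = conv_madds(28, 28, 3, 256, 256)
    print(after / before)   # 4.0 -> roughly quadratic growth in computation from 2x filters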


Motivation
“This raises the question whether there is any hope for a next, intermediate step: an architecture that makes use of the extra sparsity, even at filter level, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices.”

● Inception idea…
– a case study trying to approximate Arora’s sparse structure with dense, readily available components (convolutions)
– highly speculative / immediate good results

CAUTION: “although the proposed architecture has become a success for computer vision, it is still questionable whether its quality can be attributed to the guiding principles that have lead to its construction”


“Given samples from a sparsely connected neural network whose each layer is a denoising autoencoder, can the net (and hence its reverse) be learnt in polynomial time with low sample complexity?”

Video 1 / Video 2


Architecture Detail
“finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components”

● Translation invariance → convolutional building blocks
● A local construction that repeats
● Theory points at analyzing the correlations of the last layer and clustering by them
● Lower layers: correlation → spatial localization
● Avoid “aligned” correlations… by using filters of different sizes


Architecture Detail
● Higher levels → higher abstraction
● Spatial correlation decreases → increased use of bigger filters (3×3, 5×5)
● Stacking large filters blows up the number of outputs! → reduce dimension

● Avoid too much compression of the information and maintain sparsity → 1×1 convolutions before the larger ones! (see the sketch below)
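A minimal PyTorch sketch of an Inception module along these lines (the class and parameter names are mine; they simply mirror the four parallel branches and the 1×1 reductions before the 3×3 and 5×5 convolutions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class InceptionModule(nn.Module):
        """Four parallel branches whose outputs are concatenated along the channel axis."""
        def __init__(self, c_in, n1x1, n3x3red, n3x3, n5x5red, n5x5, pool_proj):
            super().__init__()
            self.b1 = nn.Conv2d(c_in, n1x1, kernel_size=1)                # 1x1 branch
            self.b2_red = nn.Conv2d(c_in, n3x3red, kernel_size=1)         # 1x1 reduce ...
            self.b2 = nn.Conv2d(n3x3red, n3x3, kernel_size=3, padding=1)  # ... before 3x3
            self.b3_red = nn.Conv2d(c_in, n5x5red, kernel_size=1)         # 1x1 reduce ...
            self.b3 = nn.Conv2d(n5x5red, n5x5, kernel_size=5, padding=2)  # ... before 5x5
            self.b4 = nn.Conv2d(c_in, pool_proj, kernel_size=1)           # 1x1 after max-pooling

        def forward(self, x):
            y1 = F.relu(self.b1(x))
            y2 = F.relu(self.b2(F.relu(self.b2_red(x))))
            y3 = F.relu(self.b3(F.relu(self.b3_red(x))))
            y4 = F.relu(self.b4(F.max_pool2d(x, kernel_size=3, stride=1, padding=1)))
            return torch.cat([y1, y2, y3, y4], dim=1)  # same spatial size, channels added up

Because every branch preserves the spatial size, the concatenated output can feed the next module directly, which is what allows the modules to be stacked.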


Inception module

Video


Architecture Detail
● Lower levels: classic convolutions
● Higher levels: Inception modules

* The author thinks this split isn’t strictly necessary; it just compensates for some inefficiency of the current structure design…

● Intuition → scale invariance of visual information before abstraction

● Increased computational efficiency achieved by the reductions, allowing the network to grow in depth and breadth

● Efficiency: 3–10× faster than similarly performing networks without Inception modules, but the design has to be careful.


GoogLeNet
● The specific design with Inception modules used in the ILSVRC 2014 competition

● Same design for 6 of the 7 ensemble models

● 22 layers deep

● Details:
– All convolutions include ReLU
– Input: 224×224 RGB images with the mean subtracted (zero mean)
– #3×3 reduce = number of 1×1 filters before the 3×3 convolutions
– #5×5 reduce = number of 1×1 filters before the 5×5 convolutions
– pool proj = number of 1×1 filters after the built-in max-pooling
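As a usage example of this notation, reusing the InceptionModule sketch from the Architecture Detail section: if I recall the paper’s table correctly, inception (3a) receives a 28×28×192 input with #1×1 = 64, #3×3 reduce = 96, #3×3 = 128, #5×5 reduce = 16, #5×5 = 32 and pool proj = 32, giving a 28×28×256 output (treat the numbers as illustrative).

    import torch

    # Parameters of inception (3a) as recalled from the paper's table (illustrative):
    m = InceptionModule(c_in=192, n1x1=64, n3x3red=96, n3x3=128,
                        n5x5red=16, n5x5=32, pool_proj=32)
    x = torch.randn(1, 192, 28, 28)
    print(m(x).shape)   # torch.Size([1, 256, 28, 28]); 64 + 128 + 32 + 32 = 256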


GoogLeNet
● 22 layers (27 counting max-pooling)
● About 100 independent building blocks
● Pooling before classifying (as in NiN) + a linear layer: convenience / easy to adapt to other label sets
● Average pooling instead of fully connected layers gives +0.6% top-1 accuracy
● Dropout remained essential

● Propagate the gradient in an effective manner → make the middle layers discriminate correctly

● Inclusion of intermediate classifiers: convolutional networks on top of the Inception modules (4a) and (4d) → their losses are added to the total with weight 0.3

● Auxiliary classifiers are discarded at inference / marginal effect


GoogLeNet
● Auxiliary network:
– Average pooling: 5×5 filter, stride 3 → 4×4×512 for (4a), 4×4×528 for (4d)
– 1×1 convolution with 128 filters + ReLU
– FC layer with 1024 units + ReLU
– Dropout layer (70%)
– Linear layer + softmax over the 1000 classes

(removed at inference)
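The auxiliary branch is simple enough to write down directly from the bullet list above; a PyTorch sketch (the class name is mine, the 512/528 input depths come from the slide, and the 14×14 spatial size at (4a)/(4d) is what makes the 5×5/stride-3 pooling produce 4×4 maps):

    import torch.nn as nn

    class AuxClassifier(nn.Module):
        """Auxiliary head attached to (4a) or (4d); discarded at inference time."""
        def __init__(self, c_in, n_classes=1000):
            super().__init__()
            self.pool = nn.AvgPool2d(kernel_size=5, stride=3)   # 14x14 -> 4x4
            self.conv = nn.Conv2d(c_in, 128, kernel_size=1)     # 1x1 conv, 128 filters + ReLU
            self.fc1 = nn.Linear(128 * 4 * 4, 1024)             # FC layer, 1024 units + ReLU
            self.drop = nn.Dropout(p=0.7)                       # 70% dropout
            self.fc2 = nn.Linear(1024, n_classes)               # linear layer over 1000 classes

        def forward(self, x):
            x = nn.functional.relu(self.conv(self.pool(x)))
            x = x.flatten(start_dim=1)
            x = self.drop(nn.functional.relu(self.fc1(x)))
            return self.fc2(x)                                  # logits; softmax applied in the loss

    # During training the auxiliary losses are added with weight 0.3:
    #   total_loss = main_loss + 0.3 * (aux_loss_4a + aux_loss_4d)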


Training Methodology
● DistBelief (Google’s infrastructure): modest model & data parallelism, CPU-only implementation → estimated at about one week on a few high-end GPUs (memory being the main limitation)

● Stochastic Gradient Descent:
– 0.9 momentum
– Fixed learning rate schedule: decrease by 4% every 8 epochs
– Polyak(-Ruppert) averaging of the SGD iterates for the final model

● Many different methods for sampling and training over the images… (see the sketch below)
– Crops of different sizes
– Patches covering 8%–100% of the image area
– Aspect ratio in [3/4, 4/3]
– Photometric distortions
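A plain-Python sketch of the last two bullet groups; the paper does not spell out the exact sampling code, so this only illustrates “patches covering 8%–100% of the area with aspect ratio between 3/4 and 4/3”, plus the 4%-every-8-epochs learning-rate schedule (the base rate 0.01 is a placeholder of mine):

    import random

    def lr_at_epoch(epoch, base_lr=0.01):
        """Fixed schedule: decrease the learning rate by 4% every 8 epochs."""
        return base_lr * (0.96 ** (epoch // 8))

    def sample_patch(img_w, img_h):
        """Sample a crop covering 8%-100% of the area with aspect ratio in [3/4, 4/3]."""
        for _ in range(10):                                  # retry a few times, then give up
            area = random.uniform(0.08, 1.0) * img_w * img_h
            aspect = random.uniform(3 / 4, 4 / 3)            # width / height
            w = int(round((area * aspect) ** 0.5))
            h = int(round((area / aspect) ** 0.5))
            if 0 < w <= img_w and 0 < h <= img_h:
                x = random.randint(0, img_w - w)
                y = random.randint(0, img_h - h)
                return x, y, w, h                            # crop box; resize to 224x224 afterwards
        return 0, 0, img_w, img_h                            # fallback: the whole image

    print(lr_at_epoch(24))          # 0.01 * 0.96 ** 3
    print(sample_patch(640, 480))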


ILSVRC 2014: Classification
● No external data for training
● 7 versions of the GoogLeNet model (1 wider)
– Same initialization (same weights, due to an oversight)
– Same learning rate policies
– Different sampling
→ Ensemble prediction

● Testing (more aggressive cropping than AlexNet):
– 4 scales (shorter side at 256, 288, 320, 352)
– Left, center and right squares (top, center, bottom for portrait images)
– Each square: the full square + 4 corners + center, all at 224×224
– Plus the mirrored image
→ 4 × 3 × 6 × 2 = 144 crops per image
(Not strictly necessary in practice / decreasing marginal benefit)

● Softmax probabilities averaged over all crops and all models: 144 crops × 7 models = 1008 predictions per image
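The 144-crop / 1008-prediction arithmetic can be checked by enumerating crop descriptors (plain Python; the labels only describe the crops, they do not implement the cropping itself):

    from itertools import product

    scales = [256, 288, 320, 352]                  # shorter side resized to these values
    squares = ["left", "center", "right"]          # top / center / bottom for portrait images
    views = ["full"] + ["corner_%d" % i for i in range(4)] + ["center"]  # six 224x224 views
    mirrored = [False, True]

    crops = list(product(scales, squares, views, mirrored))
    print(len(crops))        # 4 * 3 * 6 * 2 = 144 crops per image
    print(len(crops) * 7)    # 1008 softmax vectors averaged over the 7-model ensemble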


Source: 2016, Canziani & Culurciello


ILSVRC 2014: Detection
● Task: produce bounding boxes around objects from 200 classes
– A detection is correct if the bounding box overlaps the ground truth by at least 50%
– Extraneous detections (false positives) are penalized

● Submission:
– R-CNN approach with the Inception model as region classifier
– Region proposals: selective search (superpixel size increased 2×) + MultiBox
– Region classification: ensemble of 6 GoogLeNet models
– No bounding-box regression (unlike R-CNN)
– Results reported as mean average precision (mAP)


Source: 2016, Canziani & Culurciello


Conclusions
“...approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision.”

● Large gain for a small increase in computation
● Detection is very competitive despite using neither context nor bounding-box regression

● Moving to sparser architectures: feasible & useful
● Importance of the theoretical analysis!!! (2013, Arora et al.)

● DeepDream (a side result): the examples are creepy… but they show the network being run in reverse!
– Input image → force it to get closer to the animal categories
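The “network in reverse” behind DeepDream can be sketched as gradient ascent on the input image: freeze the weights and update the pixels so that the score of a chosen (e.g. animal) class increases. A hedged PyTorch sketch; model stands for any pretrained classifier, and the step size and iteration count are arbitrary choices of mine:

    import torch

    def dream_towards_class(model, image, class_idx, steps=20, step_size=0.05):
        """Nudge the input image so the network's score for class_idx increases."""
        model.eval()
        x = image.clone().requires_grad_(True)      # shape (1, 3, H, W), normalized like the training data
        for _ in range(steps):
            model.zero_grad()
            score = model(x)[0, class_idx]          # logit of the target (e.g. an animal) class
            score.backward()                        # gradients w.r.t. the pixels, not the weights
            with torch.no_grad():
                x += step_size * x.grad / (x.grad.abs().mean() + 1e-8)  # normalized ascent step
                x.grad.zero_()
        return x.detach()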


Will continue, again...
