Lecture 8: Deep Learning
Tuo Zhao
Schools of ISyE and CSE, Georgia Tech
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Deep Learning = Artificial Intelligence?
Neural Network
Single Neuron
Basic Building Block
Input: $x_1, x_2, x_3$, and a bias unit $+1$
Output: $h_{w,b}(x) = \sigma(w^\top x + b) = \sigma\big(\sum_{j=1}^{3} w_j x_j + b\big)$
Activation function $\sigma : \mathbb{R} \to \mathbb{R}$
Activation Function
Sigmoid function: $\sigma(z) = \dfrac{1}{1 + \exp(-z)}$
Tanh function: $\sigma(z) = \dfrac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}$
ReLU function: $\sigma(z) = \max\{0, z\}$
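As a minimal sketch, these three activations can be written in NumPy as follows (the function names are illustrative, not part of the lecture):

    import numpy as np

    def sigmoid(z):
        # 1 / (1 + exp(-z)): squashes inputs into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):
        # (exp(z) - exp(-z)) / (exp(z) + exp(-z)): squashes inputs into (-1, 1)
        return np.tanh(z)

    def relu(z):
        # max{0, z}, applied elementwise
        return np.maximum(0.0, z)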
Multiple Neurons
Supervised Learning: X → Y
Input: $x_1, x_2, x_3$, and a bias unit $+1$
Hidden Units:
$a^{(2)}_1 = \sigma\big(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1\big)$
$a^{(2)}_2 = \sigma\big(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2\big)$
$a^{(2)}_3 = \sigma\big(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3\big)$
Output:
$h_{W,b}(x) = a^{(3)}_1 = \sigma\big(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1\big)$
Feedforward Network
$h_{W,b}(x) = W^{(3)}\,\sigma\big(W^{(2)}\,\sigma\big(W^{(1)} x + b^{(1)}\big) + b^{(2)}\big) + b^{(3)}$
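A minimal NumPy sketch of this forward pass, assuming sigmoid activations and small random weights purely for illustration:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W1, b1, W2, b2, W3, b3):
        # h_{W,b}(x) = W3 sigma(W2 sigma(W1 x + b1) + b2) + b3
        a2 = sigmoid(W1 @ x + b1)   # first hidden layer
        a3 = sigmoid(W2 @ a2 + b2)  # second hidden layer
        return W3 @ a3 + b3         # linear output layer

    # example: 3 inputs, two hidden layers of width 4, scalar output
    rng = np.random.default_rng(0)
    x = rng.normal(size=3)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
    W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
    W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)
    print(forward(x, W1, b1, W2, b2, W3, b3))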
Backpropagation Algorithm
Empirical Risk Minimization
Supervised Learning: $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$
Loss function:
$\mathcal{L}(W, b) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(h_{W,b}(x^{(i)}), y^{(i)}\big)$
Empirical Risk Minimization:
$\mathcal{L}(W, b) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(h_{W,b}(x^{(i)}), y^{(i)}\big) + \lambda R(W, b)$
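A minimal NumPy sketch of this regularized empirical risk, assuming a squared loss and an L2 penalty (both are illustrative choices, not prescribed by the slide); here `model` could be the `forward` sketch above:

    import numpy as np

    def empirical_risk(params, X, Y, model, lam):
        # (1/n) sum_i loss(h(x_i), y_i), here with squared loss
        preds = np.array([model(x, *params) for x in X]).ravel()
        data_term = np.mean((preds - Y) ** 2)
        # lambda * R(W, b), here an L2 penalty on all parameters
        reg_term = lam * sum(np.sum(p ** 2) for p in params)
        return data_term + reg_term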
Backpropagation Algorithm
Nonconvex Optimization: Convergence to stationary solutions
Gradient Descent: Not scalable
Stochastic Gradient Descent: Most popular
$W^{(p)}_{jk} \leftarrow W^{(p)}_{jk} - \alpha\, \dfrac{\partial \ell\big(h_{W,b}(x^{(t)}), y^{(t)}\big)}{\partial W^{(p)}_{jk}}$
$b^{(p)}_{j} \leftarrow b^{(p)}_{j} - \alpha\, \dfrac{\partial \ell\big(h_{W,b}(x^{(t)}), y^{(t)}\big)}{\partial b^{(p)}_{j}}$
Step size $\alpha$: also known as the learning rate
Backpropagation Algorithm
Composite function: h(x) = f(g(x))
Chain Rule: h′(x) = f ′(g(x))g′(x)
Error Backpropagation ⇔ Stochastic Gradient Descent
Momentum:
$\delta_{W^{(p)}_{jk}} \leftarrow \gamma\, \delta_{W^{(p)}_{jk}} + \alpha\, \dfrac{\partial \ell\big(h_{W,b}(x^{(t)}), y^{(t)}\big)}{\partial W^{(p)}_{jk}}$
$\delta_{b^{(p)}_{j}} \leftarrow \gamma\, \delta_{b^{(p)}_{j}} + \alpha\, \dfrac{\partial \ell\big(h_{W,b}(x^{(t)}), y^{(t)}\big)}{\partial b^{(p)}_{j}}$
$W^{(p)}_{jk} \leftarrow W^{(p)}_{jk} - \delta_{W^{(p)}_{jk}}, \qquad b^{(p)}_{j} \leftarrow b^{(p)}_{j} - \delta_{b^{(p)}_{j}}$
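A minimal NumPy sketch of one momentum step for a single weight matrix; the gradient `grad_W` is assumed to come from backpropagation:

    import numpy as np

    def momentum_step(W, delta_W, grad_W, alpha=0.01, gamma=0.9):
        # delta <- gamma * delta + alpha * gradient
        delta_W = gamma * delta_W + alpha * grad_W
        # W <- W - delta
        return W - delta_W, delta_W

    # usage: carry delta_W across iterations, starting from np.zeros_like(W);
    # setting gamma = 0 recovers plain stochastic gradient descent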
GPU & Asynchronous SGD
A Function Approximation Perspective
Supervised Learning: $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$
Decision function $f : \mathbb{R}^d \to \mathbb{R}$
Empirical Risk Minimization:
$\hat{f} = \operatorname*{argmin}_{f \in \mathcal{F}}\; \sum_{i=1}^{n} \ell\big(f(x^{(i)}), y^{(i)}\big) + R(f)$
Linear Model: $f(x^{(i)}) = \theta^\top x^{(i)}$
Nonparametric Model: Polynomial Regressions
Neural Network: $f(x^{(i)}) = h_{W,b}(x^{(i)})$
Universal Approximation
Any continuous function $f$ can be approximated arbitrarily well (on a compact domain) by a neural net with one hidden layer.
A wide and shallow network is therefore sufficient for representation.
However, the hidden layer may need a very large number of neurons, which is generally computationally intractable.
How can we get such a good neural net?
Mission Impossible
How to “Hack” a Better Neural Network
Vanishing Gradient
Overfitting: No Errors to Propagate
Avoid Zero Derivatives
Dropout Training
Randomly drop neurons:
High dropout probability: e.g. 0.5
Implicit regularization
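A minimal NumPy sketch of (inverted) dropout on a layer's activations during training; the rescaling by 1/(1 - p) is one common convention and is an assumption here:

    import numpy as np

    def dropout(a, p=0.5, rng=np.random.default_rng()):
        # keep each neuron with probability 1 - p, zero it out otherwise
        mask = rng.random(a.shape) >= p
        # rescale survivors so the expected activation is unchanged (inverted dropout)
        return a * mask / (1.0 - p)

    # at test time dropout is switched off and the full activations are used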
Batch Normalization
Normalize each layer's activations: standardization
Reduces (internal) covariate shift
Implicit regularization
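A minimal NumPy sketch of the standardization step over a mini-batch; the learnable scale gamma and shift beta are part of batch normalization, and their default values here are assumptions:

    import numpy as np

    def batch_norm(A, gamma=1.0, beta=0.0, eps=1e-5):
        # A: mini-batch of activations, shape (batch_size, num_units)
        mu = A.mean(axis=0)                    # per-unit mean over the batch
        var = A.var(axis=0)                    # per-unit variance over the batch
        A_hat = (A - mu) / np.sqrt(var + eps)  # standardize each unit
        return gamma * A_hat + beta            # learnable scale and shift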
Step Size Annealing
Noise Annealing
$W^{(p)}_{jk} \leftarrow W^{(p)}_{jk} - \alpha\, \dfrac{\partial \ell\big(h_{W,b}(x^{(t)}), y^{(t)}\big)}{\partial W^{(p)}_{jk}} + \varepsilon^{(p)}_{jk}$
$b^{(p)}_{j} \leftarrow b^{(p)}_{j} - \alpha\, \dfrac{\partial \ell\big(h_{W,b}(x^{(t)}), y^{(t)}\big)}{\partial b^{(p)}_{j}} + \varepsilon^{(p)}_{j}$
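A minimal NumPy sketch of one noisy update; zero-mean Gaussian noise whose scale is annealed toward zero over iterations is an assumption about how epsilon is chosen:

    import numpy as np

    def noisy_sgd_step(W, grad_W, alpha, noise_scale, rng=np.random.default_rng()):
        # standard SGD step plus injected noise epsilon
        eps = rng.normal(scale=noise_scale, size=W.shape)
        return W - alpha * grad_W + eps

    # anneal the noise, e.g. noise_scale = c / (1 + t) at iteration t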
Adaptive Optimization
We solve the following optimization problem:
$\min_{\theta} f(\theta), \quad \text{where } g(\theta) = \nabla f(\theta).$
We can make the step sizes and momentum adaptive to coordinates (Animation 1, Animation 2):
AdaGrad: $\theta^{(t+1)}_j = \theta^{(t)}_j - \eta^{(t)}_j\, g_j(\theta^{(t)})$
AdaM: $\theta^{(t+1)}_j = \theta^{(t)}_j - \eta^{(t)}_j\, g_j(\theta^{(t)}) + \alpha^{(t)}_j \big(\theta^{(t)}_j - \theta^{(t-1)}_j\big)$
The AdaGrad algorithm takes
$\theta^{(t+1)}_j = \theta^{(t)}_j - \dfrac{\eta\, g_j(\theta^{(t)})}{\sqrt{1 + \sum_{i=1}^{t} g_j(\theta^{(i)})^2}}.$
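A minimal NumPy sketch of the coordinate-wise AdaGrad update above; initializing the accumulator at one mirrors the "1 +" term in the denominator, and the step size value is an arbitrary assumption:

    import numpy as np

    def adagrad_step(theta, grad, accum, eta=0.1):
        # accumulate squared gradients per coordinate
        accum = accum + grad ** 2
        # shrink the step size of coordinates with large accumulated gradients
        theta = theta - eta * grad / np.sqrt(accum)
        return theta, accum

    # usage: start with accum = np.ones_like(theta), matching 1 + sum_i g_j^2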
Early Stopping
Residual Network
Skip-Layer Connection
$F_{W,V}(x) = \sigma\big(V\,\sigma(Wx) + x\big)$
Acts as an ensemble of multiple neural networks
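A minimal NumPy sketch of this residual block; using ReLU for sigma and square weight matrices (so that x can be added directly) are assumptions:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def residual_block(x, W, V):
        # F_{W,V}(x) = sigma(V sigma(W x) + x): the skip connection adds x back in
        return relu(V @ relu(W @ x) + x)

    # W and V must be d x d so that the skip connection x can be added elementwise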
Xavier Initialization
[Excerpt from Glorot and Bengio (2010), "Understanding the difficulty of training deep feedforward neural networks": histograms of activation values and back-propagated gradients under the standard vs. the normalized (Xavier) initialization.]
Standard Initialization: $W^{(l)} \sim U\Big(-\dfrac{\sqrt{3}}{\sqrt{n_l}},\; \dfrac{\sqrt{3}}{\sqrt{n_l}}\Big)$
Xavier Initialization: $W^{(l)} \sim U\Big(-\dfrac{\sqrt{6}}{\sqrt{n_l + n_{l+1}}},\; \dfrac{\sqrt{6}}{\sqrt{n_l + n_{l+1}}}\Big)$
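A minimal NumPy sketch of sampling a weight matrix under the two schemes; taking $n_l$ as the fan-in and $n_{l+1}$ as the fan-out of layer $l$ is the usual reading of these formulas:

    import numpy as np

    def standard_init(n_in, n_out, rng=np.random.default_rng()):
        # W ~ U(-sqrt(3)/sqrt(n_l), sqrt(3)/sqrt(n_l)); each entry has variance 1/n_l
        bound = np.sqrt(3.0) / np.sqrt(n_in)
        return rng.uniform(-bound, bound, size=(n_out, n_in))

    def xavier_init(n_in, n_out, rng=np.random.default_rng()):
        # W ~ U(-sqrt(6)/sqrt(n_l + n_{l+1}), sqrt(6)/sqrt(n_l + n_{l+1}))
        bound = np.sqrt(6.0) / np.sqrt(n_in + n_out)
        return rng.uniform(-bound, bound, size=(n_out, n_in))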
Deep vs. Shallow Networks
AlexNet and VGGNet
GoogLeNet
Deep vs. Shallow Networks
Deep networks are very powerful in representation
Deep networks turn out to be easier to optimize
AlexNet: 8 layers ⇒ GoogLeNet: 22 layers ⇒ ResNet: 152 layers
Why?
Convolutional Neural Networks
The Architecture of CNNs
5 Convolution Layers
3 Max Pooling Layers
3 Dense Layers
Convolutional Neural Networks
Preview: A ConvNet is a sequence of convolution layers, interspersed with activation functions.
[Figure: a 32x32x3 image passes through a CONV + ReLU layer with 6 5x5x3 filters, giving a 28x28x6 output.]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolutional Neural Networks
[Figure: 32x32x3 input => CONV + ReLU with 6 5x5x3 filters => 28x28x6 => CONV + ReLU with 10 5x5x6 filters => 24x24x10 => ...]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolution Operation
The convolution operation
Benefits of Convolution
Reason 1: Sparse Connectivity
Benefits of Convolution
Reason 2: Parameter Sharing
Benefits of Convolution
Translational Invariance
Convolution Layer
[Figure: a 32x32x3 image: width 32, height 32, depth 3.]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolution Layer
[Figure: a 5x5x3 filter over a 32x32x3 image.]
Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products."
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolution Layer
[Figure: a 5x5x3 filter over a 32x32x3 image.]
1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product + bias)
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolution Layer
[Figure: convolving (sliding) a 5x5x3 filter over all spatial locations of a 32x32x3 image produces a 28x28x1 activation map.]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolution Layer
[Figure: a second (green) 5x5x3 filter, convolved over all spatial locations, produces a second 28x28x1 activation map.]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolution Layer
For example, if we had 6 5x5 filters, we'll get 6 separate activation maps.
We stack these up to get a "new image" of size 28x28x6!
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
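A minimal NumPy sketch of the naive convolution loop behind these slides, producing one activation map per filter; stride 1 and no padding are assumed, and the loops are written for clarity rather than speed:

    import numpy as np

    def conv2d_single(image, filt, bias=0.0):
        # image: (H, W, C), filt: (F, F, C); output: (H - F + 1, W - F + 1)
        H, W, C = image.shape
        F = filt.shape[0]
        out = np.zeros((H - F + 1, W - F + 1))
        for i in range(H - F + 1):
            for j in range(W - F + 1):
                # one number: dot product of the filter with an FxFxC chunk of the image
                chunk = image[i:i + F, j:j + F, :]
                out[i, j] = np.sum(chunk * filt) + bias
        return out

    # a 32x32x3 image with a 5x5x3 filter gives a 28x28 activation map;
    # stacking the maps of 6 such filters gives the 28x28x6 output volume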
Stride Convolution
Stride
Stride Convolution
A closer look at spatial dimensions:
[Figure: a 32x32x3 image convolved with a 5x5x3 filter over all spatial locations gives a 28x28x1 activation map.]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
[Figure: a 7x7 input (spatially), with a 3x3 filter slid across all positions at stride 1 => 5x5 output.]
[Figure: a 7x7 input, with a 3x3 filter applied at stride 2 => 3x3 output!]
[Figure: a 7x7 input, with a 3x3 filter applied at stride 3? Doesn't fit! A 3x3 filter cannot be applied to a 7x7 input with stride 3.]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Stride Convolution
Output size: (N - F) / stride + 1, for an NxN input and an FxF filter.
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
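A minimal Python sketch of this output-size rule; the function name and the None return for a non-fitting stride are my own conventions:

    def conv_output_size(N, F, stride):
        # output size = (N - F) / stride + 1, valid only when (N - F) is divisible by the stride
        if (N - F) % stride != 0:
            return None  # the filter placements do not fit, e.g. N = 7, F = 3, stride = 3
        return (N - F) // stride + 1

    print(conv_output_size(7, 3, 1))  # 5
    print(conv_output_size(7, 3, 2))  # 3
    print(conv_output_size(7, 3, 3))  # None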
Zero-Padding
Zero-Padding: it is common to zero-pad the border
[Figure: a 7x7 input padded with a one-pixel border of zeros.]
e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?
(recall: (N - F) / stride + 1)
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output? 7x7 output!
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Tiled Convolution
Local connectivity
[Figure: a locally connected layer vs. a convolutional layer vs. a fully connected layer.]
Tiled Convolution
Tiled convolution
[Figure: a locally connected layer vs. a tiled convolution layer vs. a convolutional layer.]
Pooling
Effect: invariance to small translations of the input
Pooling
- makes the representations smaller and more manageable
- operates over each activation map independently
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Pooling
Max Pooling
[Figure: a single 4x4 depth slice
    1 1 2 4
    5 6 7 8
    3 2 1 0
    1 2 3 4
max pooled with 2x2 filters and stride 2 gives
    6 8
    3 4 ]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
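A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single depth slice, reproducing the example above:

    import numpy as np

    def max_pool_2x2(A):
        # A: (H, W) single depth slice with even H and W
        H, W = A.shape
        # group into non-overlapping 2x2 blocks and take the max of each block
        return A.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

    A = np.array([[1, 1, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]])
    print(max_pool_2x2(A))  # [[6 8]
                            #  [3 4]]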
Case Study: AlexNet
Case Study: AlexNet [Krizhevsky et al. 2012]
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
The End
Congratulations!