Lecture 8: Deep Learning
Tuo Zhao
Schools of ISyE and CSE, Georgia Tech
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Deep Learning = Artificial Intelligence?
Neural Network
Single Neuron
Basic Building Block
Input: $x_1, x_2, x_3$, and a bias unit $+1$
Output: $h_{w,b}(x) = \sigma(w^\top x + b) = \sigma\big(\sum_{j=1}^{3} w_j x_j + b\big)$
Activation function $\sigma : \mathbb{R} \to \mathbb{R}$
Activation Function
Sigmoid function: $\sigma(z) = \dfrac{1}{1 + \exp(-z)}$
Tanh function: $\sigma(z) = \dfrac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}$
ReLU function: $\sigma(z) = \max\{0, z\}$
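As a minimal sketch, these three activations can be written in NumPy as follows (the function names are illustrative, not part of the lecture):

    import numpy as np

    def sigmoid(z):
        # 1 / (1 + exp(-z)): squashes inputs into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):
        # (exp(z) - exp(-z)) / (exp(z) + exp(-z)): squashes inputs into (-1, 1)
        return np.tanh(z)

    def relu(z):
        # max{0, z}, applied elementwise
        return np.maximum(0.0, z)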
Multiple Neurons
Supervised Learning: X → Y
Input: $x_1, x_2, x_3$, and a bias unit $+1$
Hidden Units:
$a^{(2)}_1 = \sigma\big(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1\big)$
$a^{(2)}_2 = \sigma\big(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2\big)$
$a^{(2)}_3 = \sigma\big(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3\big)$
Output:
$h_{W,b}(x) = a^{(3)}_1 = \sigma\big(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1\big)$
Feedforward Network
$h_{W,b}(x) = W^{(3)}\,\sigma\big(W^{(2)}\,\sigma\big(W^{(1)} x + b^{(1)}\big) + b^{(2)}\big) + b^{(3)}$
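A minimal NumPy sketch of this forward pass, assuming sigmoid activations and small random weights purely for illustration:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W1, b1, W2, b2, W3, b3):
        # h_{W,b}(x) = W3 sigma(W2 sigma(W1 x + b1) + b2) + b3
        a2 = sigmoid(W1 @ x + b1)   # first hidden layer
        a3 = sigmoid(W2 @ a2 + b2)  # second hidden layer
        return W3 @ a3 + b3         # linear output layer

    # example: 3 inputs, two hidden layers of width 4, scalar output
    rng = np.random.default_rng(0)
    x = rng.normal(size=3)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
    W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
    W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)
    print(forward(x, W1, b1, W2, b2, W3, b3))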
Backpropagation Algorithm
Empirical Risk Minimization
Supervised Learning: $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$
Loss function:
$\mathcal{L}(W, b) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(h_{W,b}(x^{(i)}), y^{(i)}\big)$
Empirical Risk Minimization:
$\mathcal{L}(W, b) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(h_{W,b}(x^{(i)}), y^{(i)}\big) + \lambda R(W, b)$
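A minimal NumPy sketch of this regularized empirical risk, assuming a squared loss and an L2 penalty (both are illustrative choices, not prescribed by the slide); here `model` could be the `forward` sketch above:

    import numpy as np

    def empirical_risk(params, X, Y, model, lam):
        # (1/n) sum_i loss(h(x_i), y_i), here with squared loss
        preds = np.array([model(x, *params) for x in X]).ravel()
        data_term = np.mean((preds - Y) ** 2)
        # lambda * R(W, b), here an L2 penalty on all parameters
        reg_term = lam * sum(np.sum(p ** 2) for p in params)
        return data_term + reg_term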
Backpropagation Algorithm
Nonconvex Optimization: Convergence to stationary solutions
Gradient Descent: Not scalable
Stochastic Gradient Descent: Most popular
$W^{(p)}_{jk} \leftarrow W^{(p)}_{jk} - \alpha\, \dfrac{\partial \ell\big(h_{W,b}(x^{(t)}), y^{(t)}\big)}{\partial W^{(p)}_{jk}}$
$b^{(p)}_{j} \leftarrow b^{(p)}_{j} - \alpha\, \dfrac{\partial \ell\big(h_{W,b}(x^{(t)}), y^{(t)}\big)}{\partial b^{(p)}_{j}}$
Step size $\alpha$: also known as the learning rate
Backpropagation Algorithm
Composite function: h(x) = f(g(x))
Chain Rule: h′(x) = f ′(g(x))g′(x)
Error Backpropagation ⇔ Stochastic Gradient Descent
Momentum:
$\delta_{W^{(p)}_{jk}} \leftarrow \gamma\, \delta_{W^{(p)}_{jk}} + \alpha\, \dfrac{\partial \ell\big(h_{W,b}(x^{(t)}), y^{(t)}\big)}{\partial W^{(p)}_{jk}}$
$\delta_{b^{(p)}_{j}} \leftarrow \gamma\, \delta_{b^{(p)}_{j}} + \alpha\, \dfrac{\partial \ell\big(h_{W,b}(x^{(t)}), y^{(t)}\big)}{\partial b^{(p)}_{j}}$
$W^{(p)}_{jk} \leftarrow W^{(p)}_{jk} - \delta_{W^{(p)}_{jk}}, \qquad b^{(p)}_{j} \leftarrow b^{(p)}_{j} - \delta_{b^{(p)}_{j}}$
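A minimal NumPy sketch of one momentum step for a single weight matrix; the gradient `grad_W` is assumed to come from backpropagation:

    import numpy as np

    def momentum_step(W, delta_W, grad_W, alpha=0.01, gamma=0.9):
        # delta <- gamma * delta + alpha * gradient
        delta_W = gamma * delta_W + alpha * grad_W
        # W <- W - delta
        return W - delta_W, delta_W

    # usage: carry delta_W across iterations, starting from np.zeros_like(W);
    # setting gamma = 0 recovers plain stochastic gradient descent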
GPU & Asynchronous SGD
A Function Approximation Perspective
Supervised Learning: $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$
Decision function $f : \mathbb{R}^d \to \mathbb{R}$
Empirical Risk Minimization:
$\hat{f} = \operatorname*{argmin}_{f \in \mathcal{F}}\; \sum_{i=1}^{n} \ell\big(f(x^{(i)}), y^{(i)}\big) + R(f)$
Linear Model: $f(x^{(i)}) = \theta^\top x^{(i)}$
Nonparametric Model: Polynomial Regressions
Neural Network: $f(x^{(i)}) = h_{W,b}(x^{(i)})$
Universal Approximation
Any continuous function $f$ can be approximated arbitrarily well (on a compact domain) by a neural net with one hidden layer.
A wide and shallow network is therefore sufficient for representation.
However, the hidden layer may need a very large number of neurons, which is generally computationally intractable.
How can we get such a good neural net?
Mission Impossible
How to “Hack” a Better Neural Network
Vanishing Gradient
Overfitting: No Errors to Propagate
Avoid Zero Derivatives
Dropout Training
Randomly drop neurons:
High dropout probability: e.g. 0.5
Implicit regularization
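A minimal NumPy sketch of (inverted) dropout on a layer's activations during training; the rescaling by 1/(1 - p) is one common convention and is an assumption here:

    import numpy as np

    def dropout(a, p=0.5, rng=np.random.default_rng()):
        # keep each neuron with probability 1 - p, zero it out otherwise
        mask = rng.random(a.shape) >= p
        # rescale survivors so the expected activation is unchanged (inverted dropout)
        return a * mask / (1.0 - p)

    # at test time dropout is switched off and the full activations are used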
Batch Normalization
Normalize each layer's activations: standardization
Reduces (internal) covariate shift
Implicit regularization
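A minimal NumPy sketch of the standardization step over a mini-batch; the learnable scale gamma and shift beta are part of batch normalization, and their default values here are assumptions:

    import numpy as np

    def batch_norm(A, gamma=1.0, beta=0.0, eps=1e-5):
        # A: mini-batch of activations, shape (batch_size, num_units)
        mu = A.mean(axis=0)                    # per-unit mean over the batch
        var = A.var(axis=0)                    # per-unit variance over the batch
        A_hat = (A - mu) / np.sqrt(var + eps)  # standardize each unit
        return gamma * A_hat + beta            # learnable scale and shift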
Step Size Annealing
Noise Annealing
$W^{(p)}_{jk} \leftarrow W^{(p)}_{jk} - \alpha\, \dfrac{\partial \ell\big(h_{W,b}(x^{(t)}), y^{(t)}\big)}{\partial W^{(p)}_{jk}} + \varepsilon^{(p)}_{jk}$
$b^{(p)}_{j} \leftarrow b^{(p)}_{j} - \alpha\, \dfrac{\partial \ell\big(h_{W,b}(x^{(t)}), y^{(t)}\big)}{\partial b^{(p)}_{j}} + \varepsilon^{(p)}_{j}$
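A minimal NumPy sketch of one noisy update; zero-mean Gaussian noise whose scale is annealed toward zero over iterations is an assumption about how epsilon is chosen:

    import numpy as np

    def noisy_sgd_step(W, grad_W, alpha, noise_scale, rng=np.random.default_rng()):
        # standard SGD step plus injected noise epsilon
        eps = rng.normal(scale=noise_scale, size=W.shape)
        return W - alpha * grad_W + eps

    # anneal the noise, e.g. noise_scale = c / (1 + t) at iteration t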
Adaptive Optimization
We solve the following optimization problem:
$\min_{\theta} f(\theta), \quad \text{where } g(\theta) = \nabla f(\theta).$
We can make the step sizes and momentum adaptive to coordinates (Animation 1, Animation 2):
AdaGrad: $\theta^{(t+1)}_j = \theta^{(t)}_j - \eta^{(t)}_j\, g_j(\theta^{(t)})$
AdaM: $\theta^{(t+1)}_j = \theta^{(t)}_j - \eta^{(t)}_j\, g_j(\theta^{(t)}) + \alpha^{(t)}_j \big(\theta^{(t)}_j - \theta^{(t-1)}_j\big)$
The AdaGrad algorithm takes
$\theta^{(t+1)}_j = \theta^{(t)}_j - \dfrac{\eta\, g_j(\theta^{(t)})}{\sqrt{1 + \sum_{i=1}^{t} g_j(\theta^{(i)})^2}}.$
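A minimal NumPy sketch of the coordinate-wise AdaGrad update above; initializing the accumulator at one mirrors the "1 +" term in the denominator, and the step size value is an arbitrary assumption:

    import numpy as np

    def adagrad_step(theta, grad, accum, eta=0.1):
        # accumulate squared gradients per coordinate
        accum = accum + grad ** 2
        # shrink the step size of coordinates with large accumulated gradients
        theta = theta - eta * grad / np.sqrt(accum)
        return theta, accum

    # usage: start with accum = np.ones_like(theta), matching 1 + sum_i g_j^2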
Early Stopping
Residual Network
Skip-Layer Connection
$F_{W,V}(x) = \sigma\big(V\,\sigma(Wx) + x\big)$
Acts as an ensemble of multiple neural networks
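A minimal NumPy sketch of this residual block; using ReLU for sigma and square weight matrices (so that x can be added directly) are assumptions:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def residual_block(x, W, V):
        # F_{W,V}(x) = sigma(V sigma(W x) + x): the skip connection adds x back in
        return relu(V @ relu(W @ x) + x)

    # W and V must be d x d so that the skip connection x can be added elementwise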
Xavier Initialization
[Excerpt from Glorot and Bengio (2010), "Understanding the difficulty of training deep feedforward neural networks": histograms of activation values and back-propagated gradients under the standard vs. the normalized (Xavier) initialization.]
Standard Initialization: $W^{(l)} \sim U\Big(-\dfrac{\sqrt{3}}{\sqrt{n_l}},\; \dfrac{\sqrt{3}}{\sqrt{n_l}}\Big)$
Xavier Initialization: $W^{(l)} \sim U\Big(-\dfrac{\sqrt{6}}{\sqrt{n_l + n_{l+1}}},\; \dfrac{\sqrt{6}}{\sqrt{n_l + n_{l+1}}}\Big)$
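A minimal NumPy sketch of sampling a weight matrix under the two schemes; taking $n_l$ as the fan-in and $n_{l+1}$ as the fan-out of layer $l$ is the usual reading of these formulas:

    import numpy as np

    def standard_init(n_in, n_out, rng=np.random.default_rng()):
        # W ~ U(-sqrt(3)/sqrt(n_l), sqrt(3)/sqrt(n_l)); each entry has variance 1/n_l
        bound = np.sqrt(3.0) / np.sqrt(n_in)
        return rng.uniform(-bound, bound, size=(n_out, n_in))

    def xavier_init(n_in, n_out, rng=np.random.default_rng()):
        # W ~ U(-sqrt(6)/sqrt(n_l + n_{l+1}), sqrt(6)/sqrt(n_l + n_{l+1}))
        bound = np.sqrt(6.0) / np.sqrt(n_in + n_out)
        return rng.uniform(-bound, bound, size=(n_out, n_in))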
Deep vs. Shallow Networks
AlexNet and VGGNet
GoogLeNet
Deep vs. Shallow Networks
Deep networks are very powerful in representation
Deep networks turn out to be easier to optimize
AlexNet: 8 layers ⇒ GoogLeNet: 22 layers ⇒ ResNet: 152 layers
Why?
Convolutional Neural Networks
The Architecture of CNNs
5 Convolution Layers
3 Max Pooling Layers
3 Dense Layers
Convolutional Neural Networks
Preview: A ConvNet is a sequence of convolution layers, interspersed with activation functions.
[Figure: a 32x32x3 image passes through a CONV + ReLU layer with 6 5x5x3 filters, giving a 28x28x6 output.]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolutional Neural Networks
[Figure: 32x32x3 input => CONV + ReLU with 6 5x5x3 filters => 28x28x6 => CONV + ReLU with 10 5x5x6 filters => 24x24x10 => ...]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolution Operation
The convolution operation
Benefits of Convolution
Reason 1: Sparse Connectivity
Benefits of Convolution
Reason 2: Parameter Sharing
Benefits of Convolution
Translational Invariance
Convolution Layer
[Figure: a 32x32x3 image: width 32, height 32, depth 3.]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolution Layer
[Figure: a 5x5x3 filter over a 32x32x3 image.]
Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products."
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolution Layer
[Figure: a 5x5x3 filter over a 32x32x3 image.]
1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product + bias)
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolution Layer
[Figure: convolving (sliding) a 5x5x3 filter over all spatial locations of a 32x32x3 image produces a 28x28x1 activation map.]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolution Layer
[Figure: a second (green) 5x5x3 filter, convolved over all spatial locations, produces a second 28x28x1 activation map.]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Convolution Layer
For example, if we had 6 5x5 filters, we'll get 6 separate activation maps.
We stack these up to get a "new image" of size 28x28x6!
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
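A minimal NumPy sketch of the naive convolution loop behind these slides, producing one activation map per filter; stride 1 and no padding are assumed, and the loops are written for clarity rather than speed:

    import numpy as np

    def conv2d_single(image, filt, bias=0.0):
        # image: (H, W, C), filt: (F, F, C); output: (H - F + 1, W - F + 1)
        H, W, C = image.shape
        F = filt.shape[0]
        out = np.zeros((H - F + 1, W - F + 1))
        for i in range(H - F + 1):
            for j in range(W - F + 1):
                # one number: dot product of the filter with an FxFxC chunk of the image
                chunk = image[i:i + F, j:j + F, :]
                out[i, j] = np.sum(chunk * filt) + bias
        return out

    # a 32x32x3 image with a 5x5x3 filter gives a 28x28 activation map;
    # stacking the maps of 6 such filters gives the 28x28x6 output volume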
Stride Convolution
Stride
Stride Convolution
A closer look at spatial dimensions:
[Figure: a 32x32x3 image convolved with a 5x5x3 filter over all spatial locations gives a 28x28x1 activation map.]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
[Figure: a 7x7 input (spatially), with a 3x3 filter slid across all positions at stride 1 => 5x5 output.]
[Figure: a 7x7 input, with a 3x3 filter applied at stride 2 => 3x3 output!]
[Figure: a 7x7 input, with a 3x3 filter applied at stride 3? Doesn't fit! A 3x3 filter cannot be applied to a 7x7 input with stride 3.]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Stride Convolution
Output size: (N - F) / stride + 1, for an NxN input and an FxF filter.
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
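A minimal Python sketch of this output-size rule; the function name and the None return for a non-fitting stride are my own conventions:

    def conv_output_size(N, F, stride):
        # output size = (N - F) / stride + 1, valid only when (N - F) is divisible by the stride
        if (N - F) % stride != 0:
            return None  # the filter placements do not fit, e.g. N = 7, F = 3, stride = 3
        return (N - F) // stride + 1

    print(conv_output_size(7, 3, 1))  # 5
    print(conv_output_size(7, 3, 2))  # 3
    print(conv_output_size(7, 3, 3))  # None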
Zero-Padding
Zero-Padding: it is common to zero-pad the border
[Figure: a 7x7 input padded with a one-pixel border of zeros.]
e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?
(recall: (N - F) / stride + 1)
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output? 7x7 output!
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Tiled Convolution
Local connectivity
[Figure: a locally connected layer vs. a convolutional layer vs. a fully connected layer.]
Tiled Convolution
Tiled convolution
[Figure: a locally connected layer vs. a tiled convolution layer vs. a convolutional layer.]
Pooling
Effect: invariance to small translations of the input
Pooling
- makes the representations smaller and more manageable
- operates over each activation map independently
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Pooling
Max Pooling
[Figure: a single 4x4 depth slice
    1 1 2 4
    5 6 7 8
    3 2 1 0
    1 2 3 4
max pooled with 2x2 filters and stride 2 gives
    6 8
    3 4 ]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
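A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single depth slice, reproducing the example above:

    import numpy as np

    def max_pool_2x2(A):
        # A: (H, W) single depth slice with even H and W
        H, W = A.shape
        # group into non-overlapping 2x2 blocks and take the max of each block
        return A.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

    A = np.array([[1, 1, 2, 4],
                  [5, 6, 7, 8],
                  [3, 2, 1, 0],
                  [1, 2, 3, 4]])
    print(max_pool_2x2(A))  # [[6 8]
                            #  [3 4]]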
Case Study: AlexNet
Case Study: AlexNet [Krizhevsky et al. 2012]
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
The End
Congratulations!