
Neural Networks: Design

Shan-Hung Wu ([email protected])
Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning

Outline

1 The Basics
  Example: Learning the XOR
2 Training
  Back Propagation
3 Neuron Design
  Cost Function & Output Neurons
  Hidden Neurons
4 Architecture Design
  Architecture Tuning


Model: a Composite Function I

A feedforward neural network, or multilayer perceptron, defines a function composition

  $\hat{y} = f^{(L)}(\cdots f^{(2)}(f^{(1)}(x; \theta^{(1)}); \theta^{(2)}) \cdots ; \theta^{(L)})$

that approximates the target function $f$.

The parameters $\theta^{(1)}, \cdots, \theta^{(L)}$ are learned from the training set $\mathbb{X}$.

"Feedforward" because information flows from input to output.


Model: a Composite Function II

At each layer $k$, the function $f^{(k)}(\,\cdot\,; W^{(k)}, b^{(k)})$ is nonlinear and outputs a value $a^{(k)} \in \mathbb{R}^{D^{(k)}}$, where

  $a^{(k)} = \mathrm{act}^{(k)}(W^{(k)\top} a^{(k-1)} + b^{(k)})$

$\mathrm{act}^{(k)}(\cdot) : \mathbb{R} \to \mathbb{R}$ is an activation function applied elementwise.

Shorthand (absorbing the bias): $a^{(k)} = \mathrm{act}^{(k)}(W^{(k)\top} a^{(k-1)})$, where $a^{(k-1)} \in \mathbb{R}^{D^{(k-1)}+1}$, $a^{(k-1)}_0 = 1$, and $W^{(k)} \in \mathbb{R}^{(D^{(k-1)}+1) \times D^{(k)}}$.
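A minimal NumPy sketch of the layer map above, assuming the bias-absorbed convention (a leading 1 in $a^{(k-1)}$) and a ReLU activation; the function name layer_forward and the toy numbers are illustrative only, not part of the slides.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def layer_forward(a_prev, W, act=relu):
    """a^(k) = act(W^(k)T a^(k-1)), with the bias absorbed:
    a_prev carries a leading 1 and W has shape (D^(k-1)+1, D^(k))."""
    return act(W.T @ a_prev)

# Toy usage: a 2-dimensional input feeding a 3-unit hidden layer.
a0 = np.array([1.0, 0.5, -0.3])      # [1, x1, x2], bias entry prepended
W1 = np.array([[ 0.1, -0.2, 0.0],    # arbitrary (D^(0)+1) x D^(1) weights
               [ 0.4,  0.3, 0.2],
               [-0.5,  0.1, 0.7]])
print(layer_forward(a0, W1))         # a^(1), shape (3,)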


Neurons I

Each $f^{(k)}_j = \mathrm{act}^{(k)}(W^{(k)\top}_{:,j} a^{(k-1)}) = \mathrm{act}^{(k)}(z^{(k)}_j)$ is a unit (or neuron).

E.g., the perceptron.

Loosely guided by neuroscience.


Neurons II

Modern NN design is mainly guided by mathematical and engineering disciplines. Consider a binary classifier where $y \in \{0,1\}$:

  Hidden units: $a^{(k)} = \max(0, z^{(k)})$
  Output unit: $a^{(L)} = \hat{\rho} = \sigma(z^{(L)})$, assuming $\Pr(y = 1 \,|\, x) \sim \mathrm{Bernoulli}(\rho)$
  Prediction: $\hat{y} = 1(\hat{\rho};\, \hat{\rho} > 0.5) = 1(z^{(L)};\, z^{(L)} > 0)$, i.e., a logistic regressor with input $a^{(L-1)}$


Representation Learning

The outputs $a^{(1)}, a^{(2)}, \cdots, a^{(L-1)}$ of the hidden layers $f^{(1)}, f^{(2)}, \cdots, f^{(L-1)}$ are distributed representations of $x$:
  nonlinear in the input space, since the $f^{(k)}$'s are nonlinear;
  usually more abstract at a deeper layer.

$f^{(L)}$ is the actual prediction function. As in nonlinear SVM / polynomial regression, a simple linear function suffices:

  $z^{(L)} = W^{(L)\top} a^{(L-1)}$

$\mathrm{act}^{(L)}(\cdot)$ just "normalizes" $z^{(L)}$ to give $\hat{\rho} \in (0,1)$.


Learning the XOR I

Why do ReLUs learn a nonlinear (and better) representation?

Let's learn XOR ($f^*$) in a binary classification task:
  $x \in \mathbb{R}^2$ and $y \in \{0,1\}$
  XOR is nonlinear, so it cannot be learned by linear models.

Consider an NN with one hidden layer:
  $a^{(1)} = \max(0, W^{(1)\top} x)$
  $a^{(2)} = \hat{\rho} = \sigma(w^{(2)\top} a^{(1)})$
  Prediction: $1(\hat{\rho};\, \hat{\rho} > 0.5)$

The network learns XOR by "merging" data points first.


Learning the XOR II

  $X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} \in \mathbb{R}^{N \times (1+D)}, \quad
  W^{(1)} = \begin{bmatrix} 0 & -1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad
  w^{(2)} = \begin{bmatrix} -1 \\ 2 \\ -4 \end{bmatrix}$

  $\hat{y} = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} = 1\big(\sigma([\,\mathbf{1} \;\; \max(0, XW^{(1)})\,]\, w^{(2)}) > 0.5\big)$

Latent Representation $A^{(1)}$

  $XW^{(1)} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}
  \begin{bmatrix} 0 & -1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix}
  = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$

  $A^{(1)} = [\,\mathbf{1} \;\; \max(0, XW^{(1)})\,] = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 2 & 1 \end{bmatrix}$

Output Distribution $a^{(2)}$

  $a^{(2)} = \sigma(A^{(1)} w^{(2)}) = \sigma\!\left(
  \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 2 & 1 \end{bmatrix}
  \begin{bmatrix} -1 \\ 2 \\ -4 \end{bmatrix}\right)
  = \sigma\!\left(\begin{bmatrix} -1 \\ 1 \\ 1 \\ -1 \end{bmatrix}\right)$

  $\hat{y} = 1(a^{(2)} > 0.5) = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$

But how do we train $W^{(1)}$ and $w^{(2)}$ from examples?
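A small NumPy check of the XOR forward pass above, using exactly the weights from the slides; only the variable names are my own.

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR inputs with a leading bias column, as on the slide.
X  = np.array([[1, 0, 0],
               [1, 0, 1],
               [1, 1, 0],
               [1, 1, 1]], dtype=float)
W1 = np.array([[0, -1],
               [1,  1],
               [1,  1]], dtype=float)
w2 = np.array([-1, 2, -4], dtype=float)

H  = np.maximum(0.0, X @ W1)                 # hidden activations max(0, X W^(1))
A1 = np.hstack([np.ones((4, 1)), H])         # prepend the bias column -> A^(1)
a2 = sigmoid(A1 @ w2)                        # output distribution a^(2)
y_hat = (a2 > 0.5).astype(int)

print(A1)      # [[1 0 0], [1 1 0], [1 1 0], [1 2 1]]
print(y_hat)   # [0 1 1 0]  -- the XOR labels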


Training an NN

Given examples $\mathbb{X} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, how do we learn the parameters $\Theta = \{W^{(1)}, \cdots, W^{(L)}\}$?

Most NNs are trained using maximum likelihood by default (assuming i.i.d. examples):

  $\arg\max_\Theta \log P(\mathbb{X} \,|\, \Theta) = \arg\min_\Theta -\log P(\mathbb{X} \,|\, \Theta)$
  $= \arg\min_\Theta \sum_i -\log P(x^{(i)}, y^{(i)} \,|\, \Theta)$
  $= \arg\min_\Theta \sum_i \big[-\log P(y^{(i)} \,|\, x^{(i)}, \Theta) - \log P(x^{(i)} \,|\, \Theta)\big]$
  $= \arg\min_\Theta \sum_i -\log P(y^{(i)} \,|\, x^{(i)}, \Theta)$
  $= \arg\min_\Theta \sum_i C^{(i)}(\Theta)$

(The term $\log P(x^{(i)} \,|\, \Theta)$ can be dropped because it does not depend on the parameters of the conditional model.)

The minimizer $\hat{\Theta}$ is an unbiased estimator of the "true" $\Theta^*$, which is good for large $N$.


Example: Binary Classification

Assume $\Pr(y = 1 \,|\, x) \sim \mathrm{Bernoulli}(\rho)$, where $x \in \mathbb{R}^D$ and $y \in \{0,1\}$, and let $a^{(L)} = \hat{\rho} = \sigma(z^{(L)})$ be the predicted distribution.

The cost function $C^{(i)}(\Theta)$ can be written as:

  $C^{(i)}(\Theta) = -\log P(y^{(i)} \,|\, x^{(i)}; \Theta)$
  $= -\log\big[(a^{(L)})^{y^{(i)}} (1 - a^{(L)})^{1 - y^{(i)}}\big]$
  $= -\log\big[\sigma(z^{(L)})^{y^{(i)}} (1 - \sigma(z^{(L)}))^{1 - y^{(i)}}\big]$
  $= -\log\big[\sigma((2y^{(i)} - 1) z^{(L)})\big]$
  $= \zeta\big((1 - 2y^{(i)}) z^{(L)}\big)$

where $\zeta(u) = \log(1 + e^u)$ is the softplus function.
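A sketch of this cost in NumPy, assuming the softplus form derived above; the helper names (binary_nll, softplus) are illustrative. The check confirms it matches the naive Bernoulli negative log-likelihood on a moderate logit.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(u):
    # log(1 + e^u), written stably for large |u|
    return np.logaddexp(0.0, u)

def binary_nll(z_L, y):
    """C^(i) = softplus((1 - 2y) z^(L)) for y in {0, 1}."""
    return softplus((1.0 - 2.0 * y) * z_L)

# Agreement with the naive -log[rho^y (1 - rho)^(1 - y)] form.
z, y = 1.7, 1
rho = sigmoid(z)
naive = -(y * np.log(rho) + (1 - y) * np.log(1.0 - rho))
print(np.isclose(binary_nll(z, y), naive))   # True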


Optimization Algorithm

Most NNs use SGD to solve $\arg\min_\Theta \sum_i C^{(i)}(\Theta)$:
  fast convergence in time [1];
  supports (GPU-based) parallelism;
  supports online learning;
  easy to implement.

(Mini-Batched) Stochastic Gradient Descent (SGD)
  Initialize $\Theta^{(0)}$ randomly;
  Repeat until convergence {
    Randomly partition the training set $\mathbb{X}$ into minibatches of size $M$;
    $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta \nabla_\Theta \sum_{i=1}^{M} C^{(i)}(\Theta^{(t)})$;
  }

How do we compute $\nabla_\Theta \sum_i C^{(i)}(\Theta^{(t)})$ efficiently? There can be a huge number of $W^{(k)}_{i,j}$'s in $\Theta$.
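A minimal minibatch-SGD skeleton matching the pseudocode above, written in NumPy; sgd, grad_fn, and the hyperparameter defaults are assumptions for illustration, and grad_fn stands in for the backprop routine developed later.

import numpy as np

def sgd(params, grad_fn, data, lr=0.1, batch_size=32, epochs=10, rng=None):
    """Minimal minibatch SGD sketch.

    params:  list of NumPy arrays (the Theta = {W^(1), ..., W^(L)})
    grad_fn: callable(params, X_batch, y_batch) -> list of gradients,
             one per parameter (e.g., computed by backprop).
    data:    tuple (X, y) of training examples.
    """
    rng = rng or np.random.default_rng(0)
    X, y = data
    N = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(N)                 # random partition into minibatches
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            grads = grad_fn(params, X[idx], y[idx])
            for p, g in zip(params, grads):
                p -= lr * g                        # Theta <- Theta - eta * gradient
    return params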


Back Propagation

  $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \eta \nabla_\Theta \sum_{n=1}^{M} C^{(n)}(\Theta^{(t)})$

We have $\nabla_\Theta \sum_n C^{(n)}(\Theta^{(t)}) = \sum_n \nabla_\Theta C^{(n)}(\Theta^{(t)})$.

Let $c^{(n)} = C^{(n)}(\Theta^{(t)})$; our goal is to evaluate $\dfrac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}}$ for all $i$, $j$, $k$, and $n$.

Back propagation (or simply backprop) is an efficient way to evaluate multiple partial derivatives at once, assuming the partial derivatives share some common evaluation steps. By the chain rule, we have

  $\dfrac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}} = \dfrac{\partial c^{(n)}}{\partial z^{(k)}_j} \cdot \dfrac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}}$


Forward Pass

The second term: $\dfrac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}}$

When $k = 1$, we have $z^{(1)}_j = \sum_i W^{(1)}_{i,j} x^{(n)}_i$ and $\dfrac{\partial z^{(1)}_j}{\partial W^{(1)}_{i,j}} = x^{(n)}_i$.

Otherwise ($k > 1$), we have $z^{(k)}_j = \sum_i W^{(k)}_{i,j} a^{(k-1)}_i$ and $\dfrac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}} = a^{(k-1)}_i$.

We can therefore get the second terms of all $\dfrac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}}$'s starting from the shallowest layer.


Backward Pass I

Conversely, we can get the first terms of all $\dfrac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}}$'s starting from the deepest layer.

Define the error signal $\delta^{(k)}_j$ as the first term $\dfrac{\partial c^{(n)}}{\partial z^{(k)}_j}$.

When $k = L$, the evaluation varies from task to task, depending on the definition of $\mathrm{act}^{(L)}$ and $C^{(n)}$. E.g., in binary classification, we have:

  $\delta^{(L)} = \dfrac{\partial c^{(n)}}{\partial z^{(L)}} = \dfrac{\partial \zeta((1 - 2y^{(n)}) z^{(L)})}{\partial z^{(L)}} = \sigma((1 - 2y^{(n)}) z^{(L)}) \cdot (1 - 2y^{(n)})$
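A quick numerical sanity check of the output error signal above, comparing the closed form against a centered finite difference of the softplus cost; delta_L and the sample numbers are illustrative.

import numpy as np

sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))
softplus = lambda u: np.logaddexp(0.0, u)

def delta_L(z_L, y):
    """delta^(L) = sigma((1 - 2y) z^(L)) * (1 - 2y) for the binary case."""
    s = 1.0 - 2.0 * y
    return sigmoid(s * z_L) * s

# Compare with a finite difference of c^(n) = softplus((1 - 2y) z^(L)).
z, y, eps = 0.8, 1, 1e-6
numeric = (softplus((1 - 2*y) * (z + eps)) - softplus((1 - 2*y) * (z - eps))) / (2 * eps)
print(np.isclose(delta_L(z, y), numeric))   # True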


Backward Pass II

When $k < L$, we have

  $\delta^{(k)}_j = \dfrac{\partial c^{(n)}}{\partial z^{(k)}_j}
  = \dfrac{\partial c^{(n)}}{\partial a^{(k)}_j} \cdot \dfrac{\partial a^{(k)}_j}{\partial z^{(k)}_j}
  = \dfrac{\partial c^{(n)}}{\partial a^{(k)}_j} \cdot \mathrm{act}'(z^{(k)}_j)$
  $= \left(\sum_s \dfrac{\partial c^{(n)}}{\partial z^{(k+1)}_s} \cdot \dfrac{\partial z^{(k+1)}_s}{\partial a^{(k)}_j}\right) \mathrm{act}'(z^{(k)}_j)
  = \left(\sum_s \delta^{(k+1)}_s \cdot \dfrac{\partial \sum_i W^{(k+1)}_{i,s} a^{(k)}_i}{\partial a^{(k)}_j}\right) \mathrm{act}'(z^{(k)}_j)$
  $= \left(\sum_s \delta^{(k+1)}_s \cdot W^{(k+1)}_{j,s}\right) \mathrm{act}'(z^{(k)}_j)$

Theorem (Chain Rule)
Let $g : \mathbb{R} \to \mathbb{R}^d$ and $f : \mathbb{R}^d \to \mathbb{R}$; then

  $(f \circ g)'(x) = f'(g(x))\, g'(x) = \nabla f(g(x))^\top \begin{bmatrix} g_1'(x) \\ \vdots \\ g_d'(x) \end{bmatrix}$


Backward Pass III

  $\delta^{(k)}_j = \left(\sum_s \delta^{(k+1)}_s \cdot W^{(k+1)}_{j,s}\right) \mathrm{act}'(z^{(k)}_j)$

We can evaluate all $\delta^{(k)}_j$'s starting from the deepest layer. The information propagates along a new kind of feedforward network, from the output layer back toward the input layer.


Backprop Algorithm (Minibatch Size M = 1)

Input: $(x^{(n)}, y^{(n)})$ and $\Theta^{(t)}$

Forward pass:
  $a^{(0)} \leftarrow [\,1 \;\; x^{(n)}\,]^\top$;
  for $k \leftarrow 1$ to $L$ do
    $z^{(k)} \leftarrow W^{(k)\top} a^{(k-1)}$;
    $a^{(k)} \leftarrow \mathrm{act}(z^{(k)})$;
  end

Backward pass:
  Compute the error signal $\delta^{(L)}$ (e.g., $(1 - 2y^{(n)})\,\sigma((1 - 2y^{(n)}) z^{(L)})$ in binary classification);
  for $k \leftarrow L-1$ to $1$ do
    $\delta^{(k)} \leftarrow \mathrm{act}'(z^{(k)}) \odot (W^{(k+1)} \delta^{(k+1)})$;
  end

Return $\dfrac{\partial c^{(n)}}{\partial W^{(k)}} = a^{(k-1)} \otimes \delta^{(k)}$ for all $k$
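A NumPy sketch of the M = 1 algorithm above, assuming ReLU hidden layers and a sigmoid cross-entropy output (so $\delta^{(L)} = (1-2y)\sigma((1-2y)z^{(L)}) = \hat{\rho} - y$); backprop_single and the toy call on the XOR weights are illustrative, not the slides' code.

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu    = lambda z: np.maximum(0.0, z)
relu_d  = lambda z: (z > 0).astype(float)       # act'(z); we pick 0 at z == 0

def backprop_single(x, y, Ws):
    """Backprop for one example: ReLU hidden layers, sigmoid + cross-entropy
    output. Ws[k] has shape (D^(k-1)+1, D^(k)). Returns dC/dW^(k) for all k."""
    L = len(Ws)
    a = [np.concatenate(([1.0], x))]            # a^(0) = [1, x^(n)]
    zs = []
    for k in range(L):                          # forward pass
        z = Ws[k].T @ a[-1]
        zs.append(z)
        h = sigmoid(z) if k == L - 1 else relu(z)
        if k < L - 1:
            h = np.concatenate(([1.0], h))      # re-attach the bias entry a^(k)_0 = 1
        a.append(h)
    deltas = [None] * L                         # backward pass
    deltas[L - 1] = a[-1] - y                   # (1-2y) sigma((1-2y) z^(L)) = rho_hat - y
    for k in range(L - 2, -1, -1):
        # Drop the bias row of the next layer's weights: the constant bias
        # entry of a^(k) carries no error signal.
        deltas[k] = relu_d(zs[k]) * (Ws[k + 1][1:, :] @ deltas[k + 1])
    return [np.outer(a[k], deltas[k]) for k in range(L)]

# Tiny usage on the XOR network from the earlier slides (one training example).
Ws = [np.array([[0., -1.], [1., 1.], [1., 1.]]), np.array([[-1.], [2.], [-4.]])]
grads = backprop_single(np.array([0., 1.]), np.array([1.]), Ws)
print([g.shape for g in grads])                 # [(3, 2), (3, 1)]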

Backprop Algorithm (Minibatch Size M > 1)

Input: $\{(x^{(n)}, y^{(n)})\}_{n=1}^{M}$ and $\Theta^{(t)}$

Forward pass:
  $A^{(0)} \leftarrow [\,a^{(0,1)} \cdots a^{(0,M)}\,]^\top$;
  for $k \leftarrow 1$ to $L$ do
    $Z^{(k)} \leftarrow A^{(k-1)} W^{(k)}$;
    $A^{(k)} \leftarrow \mathrm{act}(Z^{(k)})$;
  end

Backward pass:
  Compute the error signals $\Delta^{(L)} = [\,\delta^{(L,1)} \cdots \delta^{(L,M)}\,]^\top$;
  for $k \leftarrow L-1$ to $1$ do
    $\Delta^{(k)} \leftarrow \mathrm{act}'(Z^{(k)}) \odot (\Delta^{(k+1)} W^{(k+1)\top})$;
  end

Return $\dfrac{\partial c}{\partial W^{(k)}} = \sum_{n=1}^{M} a^{(k-1,n)} \otimes \delta^{(k,n)}$ for all $k$

How do we speed this up with GPUs?
  Large width $D^{(k)}$ at each layer
  Large batch size
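A sketch of one matrix-form backward step and the corresponding gradient, assuming ReLU hidden units; it also uses the identity $\sum_n a^{(k-1,n)} \otimes \delta^{(k,n)} = A^{(k-1)\top}\Delta^{(k)}$. The shapes and function names are illustrative only.

import numpy as np

relu_d = lambda Z: (Z > 0).astype(float)

def backward_step(Z_k, Delta_next, W_next):
    """Delta^(k) = act'(Z^(k)) o (Delta^(k+1) W^(k+1)T); rows index examples.
    The bias row of W^(k+1) is dropped, since the bias entry has no z."""
    return relu_d(Z_k) * (Delta_next @ W_next[1:, :].T)

def grad_W(A_prev, Delta_k):
    """dC/dW^(k) = sum_n a^(k-1,n) (outer) delta^(k,n) = A^(k-1)T Delta^(k)."""
    return A_prev.T @ Delta_k

# Shapes for a toy layer: M = 4 examples, 3 units (plus bias) feeding 2 units.
M, D_prev, D_k, D_next = 4, 3, 2, 5
A_prev     = np.random.randn(M, D_prev + 1)
Z_k        = np.random.randn(M, D_k)
Delta_next = np.random.randn(M, D_next)
W_next     = np.random.randn(D_k + 1, D_next)
print(grad_W(A_prev, backward_step(Z_k, Delta_next, W_next)).shape)  # (D_prev + 1, D_k) = (4, 2)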


Neuron Design

The design of modern neurons is largely influenced by how an NN is trained.

Maximum likelihood principle:

  $\arg\max_\Theta \log P(\mathbb{X} \,|\, \Theta) = \arg\min_\Theta \sum_i -\log P(y^{(i)} \,|\, x^{(i)}, \Theta)$

  A universal cost function; different output units for different $P(y \,|\, x)$.

Gradient-based optimization: during SGD, the gradient

  $\dfrac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}} = \dfrac{\partial c^{(n)}}{\partial z^{(k)}_j} \cdot \dfrac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}} = \delta^{(k)}_j \dfrac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}}$

should be sufficiently large before we get a satisfactory NN.


Negative Log Likelihood and Cross Entropy

The cost function of most NNs:

  $\arg\max_\Theta \log P(\mathbb{X} \,|\, \Theta) = \arg\min_\Theta \sum_i -\log P(y^{(i)} \,|\, x^{(i)}, \Theta)$

For NNs that output an entire distribution $\hat{P}(y \,|\, x)$, the problem can be equivalently described as minimizing the cross entropy (or KL divergence) from $\hat{P}$ to the empirical distribution of the data:

  $\arg\min_{\hat{P}} -\mathbb{E}_{(x,y) \sim \mathrm{Empirical}(\mathbb{X})}\big[\log \hat{P}(y \,|\, x)\big]$

This provides a consistent way to define output units.


Sigmoid Units for Bernoulli Output Distributions

In binary classification, we assume $P(y = 1 \,|\, x) \sim \mathrm{Bernoulli}(\rho)$, with $y \in \{0,1\}$ and $\rho \in (0,1)$.

Sigmoid output unit:

  $a^{(L)} = \hat{\rho} = \sigma(z^{(L)}) = \dfrac{\exp(z^{(L)})}{\exp(z^{(L)}) + 1}$

  $\delta^{(L)} = \dfrac{\partial c^{(n)}}{\partial z^{(L)}} = \dfrac{\partial -\log \hat{P}(y^{(n)} \,|\, x^{(n)}; \Theta)}{\partial z^{(L)}} = (1 - 2y^{(n)})\,\sigma((1 - 2y^{(n)}) z^{(L)})$

This is close to 0 only when $y^{(n)} = 1$ and $z^{(L)}$ is a large positive value, or $y^{(n)} = 0$ and $z^{(L)}$ is a very negative value. In other words, the loss $c^{(n)}$ saturates (becomes flat) only when $\hat{\rho}$ is "correct".
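A short numerical illustration of the error signal above; it also shows the equivalent form $\delta^{(L)} = \hat{\rho} - y$. The helper name delta_L and the sample logits are illustrative.

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def delta_L(z, y):
    s = 1.0 - 2.0 * y
    return s * sigmoid(s * z)

z = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
for y in (0, 1):
    d = delta_L(z, y)
    print(y, np.round(d, 4))
    print(np.allclose(d, sigmoid(z) - y))   # same quantity, written as rho_hat - y
# For y = 1, delta is near 0 only at large positive z; for y = 0, only at very negative z.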


Softmax Units for Categorical Output Distributions I

In multiclass classification, we can assume that $P(y \,|\, x) \sim \mathrm{Categorical}(\rho)$, where $y, \rho \in \mathbb{R}^K$ and $\mathbf{1}^\top \rho = 1$.

Softmax units:

  $a^{(L)}_j = \hat{\rho}_j = \mathrm{softmax}(z^{(L)})_j = \dfrac{\exp(z^{(L)}_j)}{\sum_{i=1}^{K} \exp(z^{(L)}_i)}$

Actually, to define a Categorical distribution we only need $\rho_1, \cdots, \rho_{K-1}$ ($\rho_K = 1 - \sum_{i=1}^{K-1} \rho_i$ can be discarded). We can alternatively define $K - 1$ output units (discarding the $K$-th output and implicitly fixing $z^{(L)}_K = 0$):

  $a^{(L)}_j = \hat{\rho}_j = \dfrac{\exp(z^{(L)}_j)}{\sum_{i=1}^{K-1} \exp(z^{(L)}_i) + 1}$

which is a direct generalization of $\sigma$ in binary classification. In practice, the two versions make little difference.
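A minimal softmax sketch in NumPy; the max-subtraction trick is a standard numerical-stability detail, not something the slide discusses, and the sample logits are illustrative.

import numpy as np

def softmax(z):
    """softmax(z)_j = exp(z_j) / sum_i exp(z_i), computed stably by
    subtracting max(z) first (this does not change the result)."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0, 1000.0])   # a large logit would overflow the naive form
rho = softmax(z)
print(rho, rho.sum())                     # probabilities summing to 1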


Softmax Units for Categorical Output Distributions II

Now we have

  $\delta^{(L)}_j = \dfrac{\partial c^{(n)}}{\partial z^{(L)}_j}
  = \dfrac{\partial -\log \hat{P}(y^{(n)} \,|\, x^{(n)}; \Theta)}{\partial z^{(L)}_j}
  = \dfrac{\partial -\log \prod_i \hat{\rho}_i^{\,1(y^{(n)};\, y^{(n)} = i)}}{\partial z^{(L)}_j}$

If $y^{(n)} = j$, then $\delta^{(L)}_j = -\dfrac{\partial \log \hat{\rho}_j}{\partial z^{(L)}_j} = -\dfrac{1}{\hat{\rho}_j}\big(\hat{\rho}_j - \hat{\rho}_j^2\big) = \hat{\rho}_j - 1$.
  $\delta^{(L)}_j$ is close to 0 only when $\hat{\rho}_j$ is "correct", i.e., when $z^{(L)}_j$ dominates among all the $z^{(L)}_i$'s.

If $y^{(n)} = i \neq j$, then $\delta^{(L)}_j = -\dfrac{\partial \log \hat{\rho}_i}{\partial z^{(L)}_j} = -\dfrac{1}{\hat{\rho}_i}(-\hat{\rho}_i \hat{\rho}_j) = \hat{\rho}_j$.
  Again, this is close to 0 only when $\hat{\rho}_j$ is "correct".
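A finite-difference check of the two cases above, which together say $\delta^{(L)}_j = \hat{\rho}_j - 1(y^{(n)} = j)$; the sample logits and label are illustrative.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def nll(z, y):
    return -np.log(softmax(z)[y])          # c^(n) = -log rho_hat_y

z, y, eps = np.array([0.5, -1.2, 2.0]), 2, 1e-6
rho = softmax(z)
analytic = rho.copy()
analytic[y] -= 1.0                          # delta^(L)_j = rho_hat_j - 1(y = j)
numeric = np.array([
    (nll(z + eps * np.eye(3)[j], y) - nll(z - eps * np.eye(3)[j], y)) / (2 * eps)
    for j in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-6))   # True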


Linear Units for Gaussian Means

An NN can also output just one conditional statistic of $y$ given $x$. For example, we can assume $P(y \,|\, x) \sim \mathcal{N}(\mu, \Sigma)$ for regression. How do we design the output neurons if we want to predict the mean $\hat{\mu}$?

Linear units:

  $a^{(L)} = \hat{\mu} = z^{(L)}$

We have

  $\delta^{(L)} = \dfrac{\partial c^{(n)}}{\partial z^{(L)}} = \dfrac{\partial -\log \mathcal{N}(y^{(n)}; \hat{\mu}, \Sigma)}{\partial z^{(L)}}$

Letting $\Sigma = I$, maximizing the log-likelihood is equivalent to minimizing the SSE/MSE:

  $\delta^{(L)} = \partial \|y^{(n)} - z^{(L)}\|^2 / \partial z^{(L)}$ (see linear regression)

Linear units do not saturate, so they pose little difficulty for gradient-based optimization.
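A tiny sketch of that error signal, assuming $\Sigma = I$ and the squared-error form shown above (constants dropped as on the slide); the function name is illustrative.

import numpy as np

def delta_L_gaussian(z_L, y):
    """With Sigma = I, c^(n) is (up to constants) ||y - z^(L)||^2,
    so delta^(L) is its gradient w.r.t. z^(L): 2 (z^(L) - y)."""
    return 2.0 * (z_L - y)

z_L = np.array([0.3, -1.0])
y   = np.array([0.0, -0.5])
print(delta_L_gaussian(z_L, y))   # [ 0.6 -1. ]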


Design Considerations

Most units differ from each other only in their activation functions:

  $a^{(k)} = \mathrm{act}(z^{(k)}) = \mathrm{act}(W^{(k)\top} a^{(k-1)})$

Why use the ReLU, $\mathrm{act}(z^{(k)}) = \max(0, z^{(k)})$, as the default hidden unit?

Why not, for example, use the sigmoid as a hidden unit?


Vanishing Gradient Problem

In the backward pass of backprop:

  $\delta^{(k)}_j = \left(\sum_s \delta^{(k+1)}_s \cdot W^{(k+1)}_{j,s}\right) \mathrm{act}'(z^{(k)}_j)$

If $\mathrm{act}'(\cdot) = \sigma'(\cdot) < 1$, then $\delta^{(k)}_j$ becomes smaller and smaller during the backward pass, and the surface of the cost function becomes very flat at the shallow layers. This slows down the learning of the entire network, since the weights at deeper layers depend on those in the shallow ones, and it also leads to numeric problems, e.g., underflow.
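A rough simulation of the effect above, under assumptions of my own (random Gaussian weights and pre-activations, 30 layers of width 64): it pushes a random error signal backward using the recursion and compares the sigmoid derivative with the ReLU derivative. The setup is illustrative, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
sigmoid   = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_d = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # <= 0.25 everywhere
relu_d    = lambda z: (z > 0).astype(float)             # 0 or 1

def final_delta_norm(act_d, depth=30, width=64):
    """Push a random error signal backward through `depth` layers,
    mimicking delta^(k) = act'(z^(k)) * (W^(k+1) delta^(k+1))."""
    delta = rng.normal(size=width)
    for _ in range(depth):
        W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
        z = rng.normal(size=width)          # stand-in pre-activations
        delta = act_d(z) * (W @ delta)
    return np.linalg.norm(delta)

print(final_delta_norm(sigmoid_d))   # tiny: sigmoid' < 1 shrinks the signal each layer
print(final_delta_norm(relu_d))      # many orders of magnitude larger with ReLU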


Page 112: Neural Networks: Design - GitHub Pages

ReLU I

act'(z^{(k)}) = \begin{cases} 1, & \text{if } z^{(k)} > 0 \\ 0, & \text{otherwise} \end{cases}

No vanishing gradients in

    \delta^{(k)}_j = \Big( \sum_s \delta^{(k+1)}_s \, W^{(k+1)}_{j,s} \Big) \, act'(z^{(k)}_j)

What if z^{(k)} = 0? In practice, we usually assign 1 or 0 randomly

Floating-point values are rarely exactly zero anyway (a short implementation sketch follows below)

Shan-Hung Wu (CS, NTHU) NN Design Machine Learning 37 / 49
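
As a concrete reference, here is a minimal sketch (the helper names are my own, not code from the slides) of the ReLU forward value and the derivative used in the backward pass, with an explicit choice at z = 0.

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def relu_grad(z, at_zero=0.0):
        # act'(z) = 1 if z > 0, 0 if z < 0; at exactly z = 0 either value may be used
        g = (z > 0).astype(float)
        g[z == 0.0] = at_zero
        return g

    z = np.array([-2.0, 0.0, 3.0])
    print(relu(z))       # [0. 0. 3.]
    print(relu_grad(z))  # [0. 0. 1.]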

Page 116: Neural Networks: Design - GitHub Pages

ReLU II

Why piecewise linear? To avoid vanishing gradients we could instead modify \sigma(\cdot) to make it steeper in the middle so that \sigma'(\cdot) > 1

The second derivative ReLU''(\cdot) is 0 everywhere (except at 0, where it is undefined)

This eliminates second-order effects and makes gradient-based optimization more useful (than, e.g., Newton's method)

Problem: for neurons with \delta^{(k)}_j = 0, their weights W^{(k)}_{:,j} will not be updated, since

    \frac{\partial c^{(n)}}{\partial W^{(k)}_{i,j}} = \delta^{(k)}_j \, \frac{\partial z^{(k)}_j}{\partial W^{(k)}_{i,j}}

(a small sketch of this dead-unit effect follows below)

Improvement?

Shan-Hung Wu (CS, NTHU) NN Design Machine Learning 38 / 49
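
A minimal numerical sketch of the problem above (my own toy setup, not the slides' code): if a unit's pre-activation is non-positive on every example, its delta is zero everywhere and its incoming weights receive exactly zero gradient.

    import numpy as np

    rng = np.random.default_rng(1)

    A_prev = np.abs(rng.standard_normal((100, 5)))   # a^(k-1), nonnegative (e.g., previous ReLU outputs)
    W = rng.standard_normal((5, 3))
    W[:, 0] = -np.abs(W[:, 0])                       # make unit 0 "dead": z_0 <= 0 for all inputs

    Z = A_prev @ W                                   # z^(k) for every example
    upstream = rng.standard_normal(Z.shape)          # stand-in for sum_s delta^(k+1)_s W^(k+1)_{j,s}
    delta = upstream * (Z > 0)                       # delta^(k) = upstream * act'(z^(k))
    grad_W = A_prev.T @ delta                        # dc/dW^(k), summed over the batch

    print(np.abs(grad_W[:, 0]).max())                # 0.0 -> unit 0's weights never move
    print(np.abs(grad_W[:, 1]).max())                # > 0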

Page 119: Neural Networks: Design - GitHub Pages

Leaky/Parametric ReLU

act(z^{(k)}) = \max(\alpha \cdot z^{(k)}, z^{(k)}), \quad \text{for some } \alpha \in \mathbb{R}

Leaky ReLU: \alpha is set in advance (fixed during training), usually to a small value, or to a domain-specific one

Example: absolute value rectification, \alpha = -1

Used for object recognition from images, where we seek features that are invariant under a polarity reversal of the input illumination

Parametric ReLU (PReLU): \alpha is learned automatically by gradient descent

Shan-Hung Wu (CS, NTHU) NN Design Machine Learning 39 / 49
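
For reference, a minimal sketch (assumed helper names, and alpha assumed to lie in (0, 1) so that max(alpha*z, z) equals alpha*z on the negative side) of a leaky ReLU together with the two derivatives a PReLU needs: the derivative with respect to z for backprop, and the derivative with respect to alpha so that alpha itself can be updated by gradient descent.

    import numpy as np

    def leaky_relu(z, alpha=0.01):
        return np.maximum(alpha * z, z)

    def leaky_relu_grad_z(z, alpha=0.01):
        return np.where(z > 0, 1.0, alpha)

    def leaky_relu_grad_alpha(z):
        # act(z) = alpha * z on the negative side, so d act / d alpha = z there, 0 otherwise
        return np.where(z > 0, 0.0, z)

    z = np.array([-2.0, 0.5, 3.0])
    print(leaky_relu(z))             # [-0.02  0.5   3.  ]
    print(leaky_relu_grad_z(z))      # [0.01 1.   1.  ]
    print(leaky_relu_grad_alpha(z))  # [-2.  0.  0.]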

Page 122: Neural Networks: Design - GitHub Pages

Maxout Units I

Maxout units generalize the ReLU variants further:

    act(z^{(k)})_j = \max_s z^{(k)}_{j,s}

a^{(k-1)} is linearly mapped to multiple groups of z^{(k)}_{j,:}'s

Learns a piecewise linear, convex activation function automatically, which covers both leaky ReLU and PReLU (a forward-pass sketch follows below)

Shan-Hung Wu (CS, NTHU) NN Design Machine Learning 40 / 49
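
A minimal forward-pass sketch (shapes and names are my own assumptions): each of the D_out maxout units owns S weight vectors, and its output is the maximum of its S pre-activations z^{(k)}_{j,s}.

    import numpy as np

    rng = np.random.default_rng(2)
    D_in, D_out, S = 5, 3, 4
    W = rng.standard_normal((D_in, D_out, S))        # one weight vector per (unit, group)
    b = rng.standard_normal((D_out, S))

    def maxout(a_prev):
        z = np.einsum("i,ijs->js", a_prev, W) + b    # z^(k)_{j,s}: pre-activation of group s of unit j
        return z.max(axis=1), z.argmax(axis=1)       # a^(k)_j = max_s z_{j,s}, plus the winning group

    a_prev = rng.standard_normal(D_in)
    a_k, winners = maxout(a_prev)
    print(a_k, winners)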

Page 124: Neural Networks: Design - GitHub Pages

Maxout Units II

How to train an NN with maxout units?

Given a training example (x^{(n)}, y^{(n)}), update the weights that correspond to the winning z^{(k)}_{j,s}'s for this example (see the routing sketch below)

Different examples may update different parts of the network

This offers some "redundancy" that helps resist the catastrophic forgetting phenomenon [2], where an NN forgets how to perform tasks it was trained on in the past

Shan-Hung Wu (CS, NTHU) NN Design Machine Learning 41 / 49
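
A minimal, self-contained sketch of that update rule (again with assumed shapes and names): for one example, only the winning group of each maxout unit receives a nonzero weight gradient, so different examples can end up training different weight vectors.

    import numpy as np

    rng = np.random.default_rng(3)
    D_in, D_out, S = 5, 3, 4
    W = rng.standard_normal((D_in, D_out, S))        # weight vector per (unit, group)
    a_prev = rng.standard_normal(D_in)               # a^(k-1) for one training example

    z = np.einsum("i,ijs->js", a_prev, W)            # z^(k)_{j,s} (bias omitted in this sketch)
    winners = z.argmax(axis=1)                       # winning group s* of each unit j
    delta = rng.standard_normal(D_out)               # stand-in for delta^(k)

    grad_W = np.zeros_like(W)
    for j in range(D_out):
        grad_W[:, j, winners[j]] = delta[j] * a_prev # only the winner's weights get gradient
    print(np.count_nonzero(grad_W), "of", grad_W.size, "weight entries updated for this example")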

Page 128: Neural Networks: Design - GitHub Pages

Maxout Units III

Cons?

Each maxout unit is now parametrized by multiple weight vectors instead of just one

It therefore typically requires more training data; otherwise, regularization is needed

Shan-Hung Wu (CS, NTHU) NN Design Machine Learning 42 / 49

Page 131: Neural Networks: Design - GitHub Pages

Outline

1 The Basics

Example: Learning the XOR

2 Training

Back Propagation

3 Neuron Design

Cost Function & Output Neurons

Hidden Neurons

4 Architecture Design

Architecture Tuning

Shan-Hung Wu (CS, NTHU) NN Design Machine Learning 43 / 49

Page 132: Neural Networks: Design - GitHub Pages

Architecture Design

Thin-and-deep or fat-and-shallow?

Theorem (Universal Approximation Theorem [3, 4])

A feedforward network with at least one hidden layer (and enough hidden units) can approximate, to arbitrary accuracy, any continuous function on a closed and bounded subset of \mathbb{R}^D, or any function mapping from a finite-dimensional discrete space to another.

In short, a feedforward network with a single hidden layer is sufficient to represent (approximately) any such function (a constructive 1-D sketch follows below)

Why go deep then?

Shan-Hung Wu (CS, NTHU) NN Design Machine Learning 44 / 49
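
To make the single-hidden-layer claim concrete, here is a small constructive sketch in 1-D (my own illustration, not from the slides): a single hidden layer of ReLU units, one per knot, reproduces the piecewise-linear interpolant of a continuous target, and adding knots drives the error down as far as we like.

    import numpy as np

    def target(x):
        return np.sin(2.0 * np.pi * x)

    knots = np.linspace(0.0, 1.0, 30)                    # more knots -> a better approximation
    y = target(knots)
    slopes = np.diff(y) / np.diff(knots)                 # slope of each linear piece between knots
    w = np.concatenate(([slopes[0]], np.diff(slopes)))   # output weight of each hidden ReLU unit
    b = -knots[:-1]                                      # hidden biases: relu(x + b_i) = relu(x - knot_i)

    def f_hat(x):
        # single hidden layer: f_hat(x) = y[0] + sum_i w_i * relu(x - knot_i)
        hidden = np.maximum(0.0, x[:, None] + b[None, :])
        return y[0] + hidden @ w

    x = np.linspace(0.0, 1.0, 1001)
    print("max |f - f_hat| on [0, 1]:", np.max(np.abs(target(x) - f_hat(x))))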

Page 135: Neural Networks: Design - GitHub Pages

Exponential Gain in the Number of Hidden Units

Functions representable with a deep rectifier NN can require an exponential number of hidden units in a shallow NN [5]

Deep NNs are therefore easier to learn given a fixed amount of data

Example: an NN with absolute value rectification units

Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value)

By composing these folding operations, we obtain an exponentially large number of piecewise linear regions, which can capture all kinds of regular (e.g., repeating) patterns (a counting sketch follows below)

Shan-Hung Wu (CS, NTHU) NN Design Machine Learning 45 / 49
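
A small numerical illustration of the folding argument (my own construction, not from the slides): one absolute-value unit implements the fold x -> 1 - |2x - 1| of [0, 1] onto itself, and composing k such folds, i.e., a depth-k chain of units, yields a function with about 2^k linear pieces, far more than k units could produce in a single layer.

    import numpy as np

    def fold(x):
        # one absolute-value rectification unit: an affine map into |.| and out again
        return 1.0 - np.abs(2.0 * x - 1.0)

    x = np.linspace(0.0, 1.0, 200001)
    y = x.copy()
    for k in range(1, 6):
        y = fold(y)                                  # the k-fold composition, i.e., a depth-k chain
        slope_sign = np.sign(np.diff(y))
        pieces = 1 + np.count_nonzero(np.diff(slope_sign))
        print(f"{k} fold(s) composed -> {pieces} linear pieces")  # doubles with each extra layer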

Page 138: Neural Networks: Design - GitHub Pages

Encoding Prior Knowledge

Choosing a deep model also encodes a very general belief that the function we want to learn should involve the composition of several simpler functions

If this belief is valid, deep NNs give better generalizability

When is it valid?

Representation learning point of view: the learning problem consists of discovering a set of underlying factors, and these factors can in turn be described using other, simpler underlying factors

Computer program point of view: the function to learn is a computer program consisting of multiple steps; each step makes use of the previous step's output, and intermediate outputs can be counters or pointers for internal processing

Shan-Hung Wu (CS, NTHU) NN Design Machine Learning 46 / 49

Page 144: Neural Networks: Design - GitHub Pages

Outline

1 The Basics

Example: Learning the XOR

2 Training

Back Propagation

3 Neuron Design

Cost Function & Output Neurons

Hidden Neurons

4 Architecture Design

Architecture Tuning

Shan-Hung Wu (CS, NTHU) NN Design Machine Learning 47 / 49

Page 145: Neural Networks: Design - GitHub Pages

[Figure: width & depth]

Shan-Hung Wu (CS, NTHU) NN Design Machine Learning 48 / 49

Page 146: Neural Networks: Design - GitHub Pages

Reference I

[1] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.

[2] Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.

[3] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Shan-Hung Wu (CS, NTHU) NN Design Machine Learning 48 / 49

Page 147: Neural Networks: Design - GitHub Pages

Reference II

[4] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.

[5] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.

Shan-Hung Wu (CS, NTHU) NN Design Machine Learning 49 / 49