Training Neural Networks


SS 2020 – Machine Perception, Otmar Hilliges

12 March 2020

Errata

• Projects are in teams of two – mea culpa
• See Piazza announcement for more details and post questions there


More announcements – COVID-19

The situation changes daily; we keep monitoring it.

If a minimum distance of 1 meter cannot be guaranteed, events at ETH are forbidden.

This affects lectures -> please sit apart from each other. If possible, prioritize the video recordings over the lecture hall.


Last lecture

• Perceptron learning algorithm
• MLP as engineering model of a neural network


This lecture

• What types of functions can be approximated by neural networks?
  • Universal approximation theorem
• How do we train neural networks?
  • Backprop algorithm


Perceptron - Block diagram


[Block diagram: $x \to (\,\cdot\, w + b) \to \sigma \to y$]

$x = [x_1, x_2, x_3, \dots, x_n]$

$w = [w_1, w_2, w_3, \dots, w_n]$

$y = \sigma(w^\top x + b)$

Multi-layered Perceptron - Block diagram

We can combine several layers. With $x^{(0)} = x$:

$\forall l = 1, \dots, L: \quad x^{(l)} = \sigma\big(w^{(l)\top} x^{(l-1)} + b^{(l)}\big)$

and $f(x; w, b) = x^{(L)}$


[Block diagram: $x \to (\,\cdot\, w + b) \to \sigma \to (\,\cdot\, w + b) \to \sigma \to \dots \to y$]
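A minimal NumPy sketch of this layered forward pass (the layer sizes and the choice of tanh as σ are illustrative assumptions, not part of the slides):

```python
import numpy as np

def mlp_forward(x, weights, biases, sigma=np.tanh):
    """Compute x_l = sigma(W_l^T x_{l-1} + b_l) for l = 1..L and return x_L."""
    h = x
    for W, b in zip(weights, biases):
        h = sigma(W.T @ h + b)
    return h

# Illustrative shapes: input of size 3, one hidden layer of size 4, output of size 2
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
biases = [np.zeros(4), np.zeros(2)]
y = mlp_forward(rng.normal(size=3), weights, biases)
print(y.shape)  # (2,)
```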

Linear activation functions?

Assuming that $\sigma$ is a linear transform,

$\forall x \in \mathbb{R}^n: \quad \sigma(x) = \alpha x + \beta I$

with $\alpha, \beta \in \mathbb{R}$, we get:

$\forall l = 1, \dots, L: \quad x^{(l)} = \alpha\, w^{(l)\top} x^{(l-1)} + \alpha\, b^{(l)} + \beta I,$

which results in an affine mapping:

$f(x; w, b) = A^{(L)} x + B^{(L)},$

where $A^{(0)} = I$, $B^{(0)} = 0$ and

$\forall l = 1, \dots, L: \quad A^{(l)} = \alpha\, w^{(l)\top} A^{(l-1)}, \qquad B^{(l)} = \alpha\, w^{(l)\top} B^{(l-1)} + \alpha\, b^{(l)} + \beta I$


Important message: the activation function must be non-linear; otherwise the resulting MLP is just an affine mapping with a peculiar parametrization!
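A quick numerical check of this collapse (a minimal sketch; the identity activation, i.e. α = 1, β = 0, and the layer sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

def two_layer_identity_act(x):
    """Two layers with the identity activation (alpha = 1, beta = 0)."""
    h = W1.T @ x + b1
    return W2.T @ h + b2

# The stack collapses into one affine map: A = W2^T W1^T, B = W2^T b1 + b2
A = W2.T @ W1.T
B = W2.T @ b1 + b2

x = rng.normal(size=3)
assert np.allclose(two_layer_identity_act(x), A @ x + B)
```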

So what happens with a non-linearity?


Example: Solving 'XOR'

Task: learn a function $y = f^*(x)$ that maps the binary variables $x_1, x_2$ to "true" if exactly one $x_i = 1$, and to "false" otherwise. We will use a function $f(x; \Theta)$ and find parameters $\Theta$ to make $f \approx f^*$.

$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}, \qquad Y = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$


Attempt I: Use a linear model?

Assuming a linear function $f(x; w, b) = x^\top w + b$ and the squared loss

$J(\Theta) = \frac{1}{N} \sum_{x \in X} \big(f^*(x) - f(x; \Theta)\big)^2, \qquad N = 4$

Solving via the normal equation gives $w = \mathbf{0}$ and $b = \frac{1}{2}$.
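A small NumPy check of this result (a sketch; augmenting X with a ones column to absorb the bias is my own choice for setting up the normal equation):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 1, 1, 0], dtype=float)

# Augment X with a column of ones so b is estimated jointly with w.
Xa = np.hstack([X, np.ones((4, 1))])

# Normal equation: theta = (Xa^T Xa)^{-1} Xa^T Y
theta = np.linalg.solve(Xa.T @ Xa, Xa.T @ Y)
w, b = theta[:2], theta[2]
print(w, b)  # w ~ [0, 0], b ~ 0.5: the linear model predicts 0.5 everywhere
```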

Visualization


[Figure 6.1 (Goodfellow et al., Deep Learning, Ch. 6 "Deep Feedforward Networks", p. 173): Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) Original x space (axes $x_1$, $x_2$): a linear model applied directly to the original input cannot implement the XOR function. When $x_1 = 0$, the model's output must increase as $x_2$ increases; when $x_1 = 1$, the output must decrease as $x_2$ increases. A linear model must apply a fixed coefficient $w_2$ to $x_2$ and therefore cannot use the value of $x_1$ to change the coefficient on $x_2$, so it cannot solve this problem. (Right) Learned h space (axes $h_1$, $h_2$): in the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. The two points that must have output 1 have been collapsed into a single point in feature space: the nonlinear features map both $x = [1, 0]^\top$ and $x = [0, 1]^\top$ to $h = [1, 0]^\top$. The linear model can then describe the function as increasing in $h_1$ and decreasing in $h_2$. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set; in more realistic applications, learned representations can also help the model generalize.]

ReLU

$g(z) = \max\{0, z\}$

[Figure 6.3 (Goodfellow et al., Deep Learning, Ch. 6 "Deep Feedforward Networks", p. 175): The rectified linear activation function $g(z) = \max\{0, z\}$. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying it to the output of a linear transformation yields a nonlinear transformation; however, the function remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods, and many of the properties that make linear models generalize well. A common principle throughout computer science is that we can build complicated systems from minimal components: much as a Turing machine's memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.]

Feedforward Neural Network


[Block diagram: $X \to (\,\cdot\, W + c) \to \sigma \to (\,\cdot\, w + b) \to y$]

$f(x; W, c, w, b) = w^\top \max\{0,\, X W + c\} + b$

The XOR Multi-Layered Perceptron

Function we want to learn: $f(x; W, c, w, b) = w^\top \max\{0,\, X W + c\} + b$

Oracle solution:

$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad b = 0$

Step 1: multiply the inputs by the first-layer weights (targets $Y = [0, 1, 1, 0]^\top$):

$X W = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$

Step 2: add the bias $c$ (broadcast to every row):

$X W + c^\top = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix} + \begin{bmatrix} 0 & -1 \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$

Step 3: apply the rectifier element-wise:

$\max\{0,\, X W + c^\top\} = \max\left\{0,\, \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}\right\} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$

The rows of $\max\{0,\, X W + c^\top\}$ are the hidden representations $(h_1, h_2)$ of the four inputs. Plotted in the learned h space (cf. Figure 6.1, right), both $x = [0, 1]^\top$ and $x = [1, 0]^\top$ map to the same point $h = [1, 0]^\top$, so a linear model on $h$ can now separate the XOR classes.

Step 4: apply the output weights and bias:

$\max\{0,\, X W + c^\top\}\, w + b = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ -2 \end{bmatrix} + 0 = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} = Y$
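The same walkthrough as a small NumPy check of the oracle parameters (a sketch; NumPy is an assumption, the numbers are taken from the slides):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 1, 1, 0], dtype=float)

# Oracle parameters from the slides
W = np.array([[1, 1], [1, 1]], dtype=float)
c = np.array([0, -1], dtype=float)
w = np.array([1, -2], dtype=float)
b = 0.0

h = np.maximum(0.0, X @ W + c)   # hidden representations: (0,0), (1,0), (1,0), (2,1)
y = h @ w + b                    # [0, 1, 1, 0]
assert np.allclose(y, Y)
```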

So far: The non-linearity is critically important

Next: What type of functions can we approximate?


Universal approximation theorem

Given that $\sigma \in C(\mathbb{R})$ is non-linear (e.g., the sigmoid), what type of functions can we learn?


For any continuous function $f(x)$ there exists a neural network $g(x)$ such that $g(x) \approx f(x)$, i.e. $|g(x) - f(x)| < \epsilon$.

It is an approximation, given enough hidden units. One layer is enough in theory; in practice, deeper is better.

Original proof: "Multilayer feedforward networks are universal approximators" [Hornik et al., 1989].

Universal approximation theorem

[see also: http://neuralnetworksanddeeplearning.com/chap4.html]

Single sigmoid unit:

$g(x) = \sigma(w^\top x + b) = \frac{\exp(w^\top x + b)}{\exp(w^\top x + b) + 1}$, where $\sigma$ is the sigmoid function.

[Plot of $g(x)$ vs. $x$ for $w = 16$, $b = -9$]

Universal approximation theorem

[Plot: the weight controls the steepness of the sigmoid: $w \in \{8, 16, 40\}$, $b = -9$]

Universal approximation theorem

[Plot: the bias shifts the sigmoid: $b \in \{-7, -9, -12\}$, $w = 16$]

Universal approximation theorem

With a large weight the sigmoid approximates a step function: $w = 100$, $b = -40$. The leading edge of the step sits at $s = -\frac{b}{w} = 0.4$.

[Plot of the resulting step-like $g(x)$ vs. $x$]

Universal approximation theorem

[Plot: step-function units with step position $s = 0.4$ and output weights $h = 0.5$.]

The output layer is always assumed to be linear.

Universal approximation theorem

[Plot: two step units with positions $s_1 = 0.2$, $s_2 = 0.4$ and output weights $h_1 = 0.5$, $h_2 = 0.5$.]

Universal approximation theorem

"Bump function"

[Plot: step positions $s_1 = 0.2$, $s_2 = 0.4$ with output weights $h_1 = 0.5$, $h_2 = -0.5$ produce a bump of height 0.5 on the interval $[0.2, 0.4]$.]

Universal approximation theorem

[Plot: summing several such pairs (step positions $s_1, \dots, s_4$ with output weights $h_1, -h_1, h_2, -h_2$) yields a sum of bumps $g(x)$.]
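As a small illustration of this bump construction (a sketch; the target function $\sin(2\pi x)$, the number of bumps, and the steepness $w = 1000$ are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, s1, s2, h, w=1000.0):
    """Approximate an indicator of [s1, s2] with height h using two steep sigmoids."""
    return h * (sigmoid(w * (x - s1)) - sigmoid(w * (x - s2)))

# Sum of bumps approximating an assumed target f(x) = sin(2*pi*x) on [0, 1]
x = np.linspace(0.0, 1.0, 1000)
edges = np.linspace(0.0, 1.0, 21)                  # 20 intervals
centers = 0.5 * (edges[:-1] + edges[1:])
g = sum(bump(x, a, b, np.sin(2 * np.pi * c))
        for a, b, c in zip(edges[:-1], edges[1:], centers))

# Approximation error; it shrinks as the number of bumps grows
print(np.max(np.abs(g - np.sin(2 * np.pi * x))))
```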

Universal approximation theorem

[Plot: approximating a target function $y(x)$ with many bumps.]

• Tells us that feed-forward networks can approximate $f(x)$
• It does not tell us anything about the chances of learning the correct parameters

More precisely…

Let $\sigma: \mathbb{R} \to \mathbb{R}$ be a non-constant, bounded and continuous (activation) function. $I_m$ denotes the $m$-dimensional unit hypercube $[0,1]^m$, and the space of real-valued continuous functions on $I_m$ is denoted by $C(I_m)$.

Then any function $f \in C(I_m)$ can be approximated: given any $\epsilon > 0$, there exist an integer $N$, real constants $v_i, b_i \in \mathbb{R}$ and real vectors $w_i \in \mathbb{R}^m$ for $i = 1, \dots, N$ such that

$f(x) \approx g(x) = \sum_{i=1}^{N} v_i\, \sigma(w_i^\top x + b_i)$

and $|g(x) - f(x)| < \epsilon$ for all $x \in I_m$.

[Cybenko, G. "Approximation by superpositions of a sigmoidal function", Mathematics of Control, Signals, and Systems, 1989.]

However…

Networks with a single hidden layer may need exponential width.

In practice, deeper networks work better.

There is lots of ongoing work to provide theory on which network properties lead to which approximation capabilities.

E.g., [Lu et al. “The Expressive Power of Neural Networks: A View from the Width”. NeurIPS 2017]


So far: NNs are universal approximators

Next: How do we find the network parameters?


General procedure

Iterative gradient descent to find parameters $\Theta$:

• Initialize weights with small random values
• Initialize biases with 0 or small positive values
• Compute gradients
• Update parameters with SGD


SGD in a nutshell: compute the negative gradient at the current parameters $\theta^t$,

$-\nabla C(\theta^t),$

and scale it by the learning rate $\eta$ to obtain the update step

$-\eta\, \nabla C(\theta^t), \qquad \text{i.e.} \quad \theta^{t+1} = \theta^t - \eta\, \nabla C(\theta^t)$
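A minimal sketch of this update loop (the quadratic toy loss, its gradient, and the mini-batch sampling are illustrative assumptions, not part of the slides):

```python
import numpy as np

def sgd(theta, grad_fn, data, lr=0.1, epochs=50, batch_size=4, seed=0):
    """Plain mini-batch SGD: theta <- theta - lr * grad(C)(theta; batch)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            theta = theta - lr * grad_fn(theta, data[idx])
    return theta

# Example: minimize C(theta) = mean((theta - x)^2) over some assumed data
data = np.array([1.0, 2.0, 3.0, 4.0])
grad_fn = lambda theta, batch: np.mean(2 * (theta - batch))
theta = sgd(np.array(0.0), grad_fn, data)
print(theta)  # close to the mean of the data, 2.5
```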

Chain rule

$y = g(x)$ and $z = f(g(x)) = f(y)$

$\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$

For vector-valued variables:

$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$


Two ways to compute gradients

Consider the scalar function:

$f = \exp\big(\exp(x) + \exp(x)^2\big) + \sin\big(\exp(x) + \exp(x)^2\big)$

Symbolic differentiation gives us:

$\frac{df}{dx} = \exp\big(\exp(x) + \exp(x)^2\big)\big(\exp(x) + 2\exp(x)^2\big) + \cos\big(\exp(x) + \exp(x)^2\big)\big(\exp(x) + 2\exp(x)^2\big)$


If we were to write a program…

Consider the scalar function:

$f = \exp\big(\exp(x) + \exp(x)^2\big) + \sin\big(\exp(x) + \exp(x)^2\big)$

We would define and compute intermediate variables:

$a = \exp(x)$
$b = a^2$
$c = a + b$
$d = \exp(c)$
$e = \sin(c)$
$f = d + e$
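As a quick sketch, this program in plain Python, using exactly the intermediate variables from the slide and checking the result against the closed-form expression (the test point x = 0.5 is an assumption):

```python
import math

def forward(x):
    # Intermediate variables as on the slide
    a = math.exp(x)
    b = a ** 2
    c = a + b
    d = math.exp(c)
    e = math.sin(c)
    return d + e

x = 0.5
direct = math.exp(math.exp(x) + math.exp(x) ** 2) + math.sin(math.exp(x) + math.exp(x) ** 2)
assert abs(forward(x) - direct) < 1e-9
```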


We can draw this as a graph


$f = \exp\big(\exp(x) + \exp(x)^2\big) + \sin\big(\exp(x) + \exp(x)^2\big)$

$a = \exp(x)$, $b = a^2$, $c = a + b$, $d = \exp(c)$, $e = \sin(c)$, $f = d + e$

[Computation graph: $x \to \exp(\cdot) \to a$; $a \to (\cdot)^2 \to b$; $(a, b) \to + \to c$; $c \to \exp(\cdot) \to d$; $c \to \sin(\cdot) \to e$; $(d, e) \to + \to f$]

Mechanically writing down derivatives

General form:

$\frac{\partial f}{\partial d} = 1$

$\frac{\partial f}{\partial e} = 1$

$\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d}\frac{\partial d}{\partial c} + \frac{\partial f}{\partial e}\frac{\partial e}{\partial c}$

$\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c}\frac{\partial c}{\partial b}$

$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial c}\frac{\partial c}{\partial a} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial a}$

$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a}\frac{\partial a}{\partial x}$

Plugging in the local derivatives of each node:

$\frac{\partial f}{\partial d} = 1$

$\frac{\partial f}{\partial e} = 1$

$\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d}\exp(c) + \frac{\partial f}{\partial e}\cos(c)$

$\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c}$

$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial c} + \frac{\partial f}{\partial b}\, 2a$

$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a}\exp(x)$
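A sketch of this backward pass in plain Python (variable names follow the slide; the test point x = 0.5 and the comparison against the symbolic derivative are my own additions):

```python
import math

def f_and_grad(x):
    # Forward pass: intermediate variables from the slide
    a = math.exp(x)
    b = a ** 2
    c = a + b
    d = math.exp(c)
    e = math.sin(c)
    f = d + e

    # Reverse pass: accumulate df/d(variable) from the output back to the input
    df_dd = 1.0
    df_de = 1.0
    df_dc = df_dd * math.exp(c) + df_de * math.cos(c)
    df_db = df_dc
    df_da = df_dc + df_db * 2 * a
    df_dx = df_da * math.exp(x)
    return f, df_dx

x = 0.5
_, g = f_and_grad(x)
# Compare with the symbolic derivative from the earlier slide
u = math.exp(x) + math.exp(x) ** 2
sym = (math.exp(u) + math.cos(u)) * (math.exp(x) + 2 * math.exp(x) ** 2)
assert abs(g - sym) < 1e-6
```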

Backpropagation in Neural Networks

Example & Derivation on the blackboard



Next week

Convolutional Neural Networks


Recommended