Training Neural Networks


SS 2020 – Machine Perception, Otmar Hilliges

12 March 2020

Errata

• Projects are in teams of two – mea culpa
• See Piazza announcement for more details and post questions there


More announcements – COVID-19

The situation changes daily; we keep monitoring it.

If a minimum distance of 1 meter cannot be guaranteed, events at ETH are forbidden.

This affects lectures -> please sit apart from each other. If possible, prioritize the video recordings over the lecture hall.


Last lecture

• Perceptron learning algorithm
• MLP as engineering model of a neural network


This lecture

• What types of functions can be approximated by neural networks?
  • Universal approximation theorem
• How do we train neural networks?
  • Backprop algorithm


Perceptron - Block diagram


[Block diagram: $x \to (\,\cdot\, w + b) \to \sigma \to y$]

$x = [x_1, x_2, x_3, \dots, x_n]$

$w = [w_1, w_2, w_3, \dots, w_n]$

$y = \sigma(w^\top x + b)$

Multi-layered Perceptron - Block diagram

We can combine several layers. With $x^{(0)} = x$:

$\forall l = 1, \dots, L: \quad x^{(l)} = \sigma\big(w^{(l)\top} x^{(l-1)} + b^{(l)}\big)$

and $f(x; w, b) = x^{(L)}$


[Block diagram: $x \to (\,\cdot\, w + b) \to \sigma \to (\,\cdot\, w + b) \to \sigma \to \dots \to y$]
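A minimal NumPy sketch of this layered forward pass (the layer sizes and the choice of tanh as σ are illustrative assumptions, not part of the slides):

```python
import numpy as np

def mlp_forward(x, weights, biases, sigma=np.tanh):
    """Compute x_l = sigma(W_l^T x_{l-1} + b_l) for l = 1..L and return x_L."""
    h = x
    for W, b in zip(weights, biases):
        h = sigma(W.T @ h + b)
    return h

# Illustrative shapes: input of size 3, one hidden layer of size 4, output of size 2
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
biases = [np.zeros(4), np.zeros(2)]
y = mlp_forward(rng.normal(size=3), weights, biases)
print(y.shape)  # (2,)
```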

Linear activation functions?

Assuming that $\sigma$ is a linear transform,

$\forall x \in \mathbb{R}^n: \quad \sigma(x) = \alpha x + \beta I$

with $\alpha, \beta \in \mathbb{R}$, we get:

$\forall l = 1, \dots, L: \quad x^{(l)} = \alpha\, w^{(l)\top} x^{(l-1)} + \alpha\, b^{(l)} + \beta I,$

which results in an affine mapping:

$f(x; w, b) = A^{(L)} x + B^{(L)},$

where $A^{(0)} = I$, $B^{(0)} = 0$ and

$\forall l = 1, \dots, L: \quad A^{(l)} = \alpha\, w^{(l)\top} A^{(l-1)}, \qquad B^{(l)} = \alpha\, w^{(l)\top} B^{(l-1)} + \alpha\, b^{(l)} + \beta I$


Important message: the activation function must be non-linear; otherwise the resulting MLP is just an affine mapping with a peculiar parametrization!
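A quick numerical check of this collapse (a minimal sketch; the identity activation, i.e. α = 1, β = 0, and the layer sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

def two_layer_identity_act(x):
    """Two layers with the identity activation (alpha = 1, beta = 0)."""
    h = W1.T @ x + b1
    return W2.T @ h + b2

# The stack collapses into one affine map: A = W2^T W1^T, B = W2^T b1 + b2
A = W2.T @ W1.T
B = W2.T @ b1 + b2

x = rng.normal(size=3)
assert np.allclose(two_layer_identity_act(x), A @ x + B)
```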

So what happens with a non-linearity?


Example: Solving 'XOR'

Task: learn a function $y = f^*(x)$ that maps the binary variables $x_1, x_2$ to "true" if exactly one $x_i = 1$, and to "false" otherwise. We will use a function $f(x; \Theta)$ and find parameters $\Theta$ to make $f \approx f^*$.

$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}, \qquad Y = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$


Attempt I: Use a linear model?

Assuming a linear function $f(x; w, b) = x^\top w + b$ and the squared loss

$J(\Theta) = \frac{1}{N} \sum_{x \in X} \big(f^*(x) - f(x; \Theta)\big)^2, \qquad N = 4$

Solving via the normal equation gives $w = \mathbf{0}$ and $b = \frac{1}{2}$.
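A small NumPy check of this result (a sketch; augmenting X with a ones column to absorb the bias is my own choice for setting up the normal equation):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 1, 1, 0], dtype=float)

# Augment X with a column of ones so b is estimated jointly with w.
Xa = np.hstack([X, np.ones((4, 1))])

# Normal equation: theta = (Xa^T Xa)^{-1} Xa^T Y
theta = np.linalg.solve(Xa.T @ Xa, Xa.T @ Y)
w, b = theta[:2], theta[2]
print(w, b)  # w ~ [0, 0], b ~ 0.5: the linear model predicts 0.5 everywhere
```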

Visualization


[Figure 6.1 (Goodfellow et al., Deep Learning, Ch. 6 "Deep Feedforward Networks", p. 173): Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) Original x space (axes $x_1$, $x_2$): a linear model applied directly to the original input cannot implement the XOR function. When $x_1 = 0$, the model's output must increase as $x_2$ increases; when $x_1 = 1$, the output must decrease as $x_2$ increases. A linear model must apply a fixed coefficient $w_2$ to $x_2$ and therefore cannot use the value of $x_1$ to change the coefficient on $x_2$, so it cannot solve this problem. (Right) Learned h space (axes $h_1$, $h_2$): in the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. The two points that must have output 1 have been collapsed into a single point in feature space: the nonlinear features map both $x = [1, 0]^\top$ and $x = [0, 1]^\top$ to $h = [1, 0]^\top$. The linear model can then describe the function as increasing in $h_1$ and decreasing in $h_2$. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set; in more realistic applications, learned representations can also help the model generalize.]

ReLU

$g(z) = \max\{0, z\}$

[Figure 6.3 (Goodfellow et al., Deep Learning, Ch. 6 "Deep Feedforward Networks", p. 175): The rectified linear activation function $g(z) = \max\{0, z\}$. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying it to the output of a linear transformation yields a nonlinear transformation; however, the function remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods, and many of the properties that make linear models generalize well. A common principle throughout computer science is that we can build complicated systems from minimal components: much as a Turing machine's memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.]

Feedforward Neural Network


[Block diagram: $X \to (\,\cdot\, W + c) \to \sigma \to (\,\cdot\, w + b) \to y$]

$f(x; W, c, w, b) = w^\top \max\{0,\, X W + c\} + b$

The XOR Multi-Layered Perceptron

Function we want to learn: $f(x; W, c, w, b) = w^\top \max\{0,\, X W + c\} + b$

Oracle solution:

$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad b = 0$

Step 1: multiply the inputs by the first-layer weights (targets $Y = [0, 1, 1, 0]^\top$):

$X W = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$

Step 2: add the bias $c$ (broadcast to every row):

$X W + c^\top = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix} + \begin{bmatrix} 0 & -1 \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$

Step 3: apply the rectifier element-wise:

$\max\{0,\, X W + c^\top\} = \max\left\{0,\, \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}\right\} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$

The rows of $\max\{0,\, X W + c^\top\}$ are the hidden representations $(h_1, h_2)$ of the four inputs. Plotted in the learned h space (cf. Figure 6.1, right), both $x = [0, 1]^\top$ and $x = [1, 0]^\top$ map to the same point $h = [1, 0]^\top$, so a linear model on $h$ can now separate the XOR classes.

Step 4: apply the output weights and bias:

$\max\{0,\, X W + c^\top\}\, w + b = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ -2 \end{bmatrix} + 0 = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} = Y$
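The same walkthrough as a small NumPy check of the oracle parameters (a sketch; NumPy is an assumption, the numbers are taken from the slides):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 1, 1, 0], dtype=float)

# Oracle parameters from the slides
W = np.array([[1, 1], [1, 1]], dtype=float)
c = np.array([0, -1], dtype=float)
w = np.array([1, -2], dtype=float)
b = 0.0

h = np.maximum(0.0, X @ W + c)   # hidden representations: (0,0), (1,0), (1,0), (2,1)
y = h @ w + b                    # [0, 1, 1, 0]
assert np.allclose(y, Y)
```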

So far: The non-linearity is critically important

Next: What type of functions can we approximate?


Universal approximation theorem

Given that $\sigma \in C(\mathbb{R})$ is non-linear (e.g., the sigmoid), what type of functions can we learn?


For any continuous function $f(x)$ there exists a neural network $g(x)$ such that $g(x) \approx f(x)$, i.e. $|g(x) - f(x)| < \epsilon$.

It is an approximation, given enough hidden units. One layer is enough in theory; in practice, deeper is better.

Original proof: "Multilayer feedforward networks are universal approximators" [Hornik et al., 1989].

Universal approximation theorem

[see also: http://neuralnetworksanddeeplearning.com/chap4.html]

Single sigmoid unit:

$g(x) = \sigma(w^\top x + b) = \frac{\exp(w^\top x + b)}{\exp(w^\top x + b) + 1}$, where $\sigma$ is the sigmoid function.

[Plot of $g(x)$ vs. $x$ for $w = 16$, $b = -9$]

Universal approximation theorem

[Plot: the weight controls the steepness of the sigmoid: $w \in \{8, 16, 40\}$, $b = -9$]

Universal approximation theorem

[Plot: the bias shifts the sigmoid: $b \in \{-7, -9, -12\}$, $w = 16$]

Universal approximation theorem

With a large weight the sigmoid approximates a step function: $w = 100$, $b = -40$. The leading edge of the step sits at $s = -\frac{b}{w} = 0.4$.

[Plot of the resulting step-like $g(x)$ vs. $x$]

Universal approximation theorem

[Plot: step-function units with step position $s = 0.4$ and output weights $h = 0.5$.]

The output layer is always assumed to be linear.

Universal approximation theorem

[Plot: two step units with positions $s_1 = 0.2$, $s_2 = 0.4$ and output weights $h_1 = 0.5$, $h_2 = 0.5$.]

Universal approximation theorem

"Bump function"

[Plot: step positions $s_1 = 0.2$, $s_2 = 0.4$ with output weights $h_1 = 0.5$, $h_2 = -0.5$ produce a bump of height 0.5 on the interval $[0.2, 0.4]$.]

Universal approximation theorem

[Plot: summing several such pairs (step positions $s_1, \dots, s_4$ with output weights $h_1, -h_1, h_2, -h_2$) yields a sum of bumps $g(x)$.]
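As a small illustration of this bump construction (a sketch; the target function $\sin(2\pi x)$, the number of bumps, and the steepness $w = 1000$ are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, s1, s2, h, w=1000.0):
    """Approximate an indicator of [s1, s2] with height h using two steep sigmoids."""
    return h * (sigmoid(w * (x - s1)) - sigmoid(w * (x - s2)))

# Sum of bumps approximating an assumed target f(x) = sin(2*pi*x) on [0, 1]
x = np.linspace(0.0, 1.0, 1000)
edges = np.linspace(0.0, 1.0, 21)                  # 20 intervals
centers = 0.5 * (edges[:-1] + edges[1:])
g = sum(bump(x, a, b, np.sin(2 * np.pi * c))
        for a, b, c in zip(edges[:-1], edges[1:], centers))

# Approximation error; it shrinks as the number of bumps grows
print(np.max(np.abs(g - np.sin(2 * np.pi * x))))
```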

Universal approximation theorem

[Plot: approximating a target function $y(x)$ with many bumps.]

• Tells us that feed-forward networks can approximate $f(x)$
• It does not tell us anything about the chances of learning the correct parameters

More precisely…

Let $\sigma: \mathbb{R} \to \mathbb{R}$ be a non-constant, bounded and continuous (activation) function. $I_m$ denotes the $m$-dimensional unit hypercube $[0,1]^m$, and the space of real-valued continuous functions on $I_m$ is denoted by $C(I_m)$.

Then any function $f \in C(I_m)$ can be approximated: given any $\epsilon > 0$, there exist an integer $N$, real constants $v_i, b_i \in \mathbb{R}$ and real vectors $w_i \in \mathbb{R}^m$ for $i = 1, \dots, N$ such that

$f(x) \approx g(x) = \sum_{i=1}^{N} v_i\, \sigma(w_i^\top x + b_i)$

and $|g(x) - f(x)| < \epsilon$ for all $x \in I_m$.

[Cybenko, G. "Approximation by superpositions of a sigmoidal function", Mathematics of Control, Signals, and Systems, 1989.]

However…

Networks with a single hidden layer may need exponential width.

In practice, deeper networks work better.

There is lots of ongoing work to provide theory on which network properties lead to which approximation capabilities.

E.g., [Lu et al. “The Expressive Power of Neural Networks: A View from the Width”. NeurIPS 2017]


So far: NNs are universal approximators

Next: How do we find the network parameters?


General procedure

Iterative gradient descent to find parameters $\Theta$:

• Initialize weights with small random values
• Initialize biases with 0 or small positive values
• Compute gradients
• Update parameters with SGD


SGD in a nutshell: compute the negative gradient at the current parameters $\theta^t$,

$-\nabla C(\theta^t),$

and scale it by the learning rate $\eta$ to obtain the update step

$-\eta\, \nabla C(\theta^t), \qquad \text{i.e.} \quad \theta^{t+1} = \theta^t - \eta\, \nabla C(\theta^t)$
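A minimal sketch of this update loop (the quadratic toy loss, its gradient, and the mini-batch sampling are illustrative assumptions, not part of the slides):

```python
import numpy as np

def sgd(theta, grad_fn, data, lr=0.1, epochs=50, batch_size=4, seed=0):
    """Plain mini-batch SGD: theta <- theta - lr * grad(C)(theta; batch)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            theta = theta - lr * grad_fn(theta, data[idx])
    return theta

# Example: minimize C(theta) = mean((theta - x)^2) over some assumed data
data = np.array([1.0, 2.0, 3.0, 4.0])
grad_fn = lambda theta, batch: np.mean(2 * (theta - batch))
theta = sgd(np.array(0.0), grad_fn, data)
print(theta)  # close to the mean of the data, 2.5
```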

Chain rule

$y = g(x)$ and $z = f(g(x)) = f(y)$

$\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$

For vector-valued variables:

$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$


Two ways to compute gradients

Consider the scalar function:

$f = \exp\big(\exp(x) + \exp(x)^2\big) + \sin\big(\exp(x) + \exp(x)^2\big)$

Symbolic differentiation gives us:

$\frac{df}{dx} = \exp\big(\exp(x) + \exp(x)^2\big)\big(\exp(x) + 2\exp(x)^2\big) + \cos\big(\exp(x) + \exp(x)^2\big)\big(\exp(x) + 2\exp(x)^2\big)$


If we were to write a program…

Consider the scalar function:

$f = \exp\big(\exp(x) + \exp(x)^2\big) + \sin\big(\exp(x) + \exp(x)^2\big)$

We would define and compute intermediate variables:

$a = \exp(x)$
$b = a^2$
$c = a + b$
$d = \exp(c)$
$e = \sin(c)$
$f = d + e$
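As a quick sketch, this program in plain Python, using exactly the intermediate variables from the slide and checking the result against the closed-form expression (the test point x = 0.5 is an assumption):

```python
import math

def forward(x):
    # Intermediate variables as on the slide
    a = math.exp(x)
    b = a ** 2
    c = a + b
    d = math.exp(c)
    e = math.sin(c)
    return d + e

x = 0.5
direct = math.exp(math.exp(x) + math.exp(x) ** 2) + math.sin(math.exp(x) + math.exp(x) ** 2)
assert abs(forward(x) - direct) < 1e-9
```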


We can draw this as a graph


$f = \exp\big(\exp(x) + \exp(x)^2\big) + \sin\big(\exp(x) + \exp(x)^2\big)$

$a = \exp(x)$, $b = a^2$, $c = a + b$, $d = \exp(c)$, $e = \sin(c)$, $f = d + e$

[Computation graph: $x \to \exp(\cdot) \to a$; $a \to (\cdot)^2 \to b$; $(a, b) \to + \to c$; $c \to \exp(\cdot) \to d$; $c \to \sin(\cdot) \to e$; $(d, e) \to + \to f$]

Mechanically writing down derivatives

General form:

$\frac{\partial f}{\partial d} = 1$

$\frac{\partial f}{\partial e} = 1$

$\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d}\frac{\partial d}{\partial c} + \frac{\partial f}{\partial e}\frac{\partial e}{\partial c}$

$\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c}\frac{\partial c}{\partial b}$

$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial c}\frac{\partial c}{\partial a} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial a}$

$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a}\frac{\partial a}{\partial x}$

Plugging in the local derivatives of each node:

$\frac{\partial f}{\partial d} = 1$

$\frac{\partial f}{\partial e} = 1$

$\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d}\exp(c) + \frac{\partial f}{\partial e}\cos(c)$

$\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c}$

$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial c} + \frac{\partial f}{\partial b}\, 2a$

$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a}\exp(x)$
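A sketch of this backward pass in plain Python (variable names follow the slide; the test point x = 0.5 and the comparison against the symbolic derivative are my own additions):

```python
import math

def f_and_grad(x):
    # Forward pass: intermediate variables from the slide
    a = math.exp(x)
    b = a ** 2
    c = a + b
    d = math.exp(c)
    e = math.sin(c)
    f = d + e

    # Reverse pass: accumulate df/d(variable) from the output back to the input
    df_dd = 1.0
    df_de = 1.0
    df_dc = df_dd * math.exp(c) + df_de * math.cos(c)
    df_db = df_dc
    df_da = df_dc + df_db * 2 * a
    df_dx = df_da * math.exp(x)
    return f, df_dx

x = 0.5
_, g = f_and_grad(x)
# Compare with the symbolic derivative from the earlier slide
u = math.exp(x) + math.exp(x) ** 2
sym = (math.exp(u) + math.cos(u)) * (math.exp(x) + 2 * math.exp(x) ** 2)
assert abs(g - sym) < 1e-6
```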

Backpropagation in Neural Networks

Example & Derivation on the blackboard



Next week

Convolutional Neural Networks


Recommended