
Page 1: Training Neural Networks

Training Neural Networks

SS 2020 – Machine Perception, Otmar Hilliges

12 March 2020

Page 2: Training Neural Networks

Errata

• Projects are in teams of two – mea culpa
• See Piazza announcement for more details and post questions there


Page 3: Training Neural Networks

More announcements – COVID-19

The situation changes daily; we keep monitoring it.

If a minimum distance of 1 meter cannot be guaranteed, events at ETH are forbidden.

This affects lectures -> please sit apart from each other. If possible, prioritize the video recordings over the lecture hall.


Page 4: Training Neural Networks

Last lecture

• Perceptron learning algorithm
• MLP as engineering model of a neural network


Page 5: Training Neural Networks

This lecture

• What types of functions can be approximated by neural networks?
  • Universal approximation theorem
• How do we train neural networks?
  • Backprop algorithm


Page 6: Training Neural Networks

Perceptron - Block diagram


[Block diagram: $x$ → ($\cdot\,w$, $+\,b$, $\sigma$) → $y$]

$x = [x_1, x_2, x_3, \dots, x_n], \qquad w = [w_1, w_2, w_3, \dots, w_n]$

$y = \sigma(w^T x + b)$

Page 7: Training Neural Networks

Multi-layered Perceptron - Block diagram

We can combine several layers: with $x^{(0)} = x$,

$\forall l = 1, \dots, L:\quad x^{(l)} = \sigma\!\left(w^{(l)T} x^{(l-1)} + b^{(l)}\right)$

and $f(x; w, b) = x^{(L)}$.

[Block diagram: $x$ → ($\cdot\,w$, $+\,b$, $\sigma$) → ($\cdot\,w$, $+\,b$, $\sigma$) → … → $y$]
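To make the layered forward pass concrete, here is a minimal NumPy sketch; the 2-3-1 layer sizes, random parameters and the sigmoid non-linearity are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases):
    """Apply x^(l) = sigma(W^(l)T x^(l-1) + b^(l)) for l = 1..L."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W.T @ a + b)
    return a

# Illustrative 2-3-1 network with arbitrary parameters.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((3, 1))]
biases = [np.zeros(3), np.zeros(1)]
y = mlp_forward(np.array([0.5, -1.0]), weights, biases)
print(y)  # a single output in (0, 1)
```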

Page 8: Training Neural Networks

Linear activation functions?

Assuming that $\sigma$ is a linear transform, $\forall x \in \mathbb{R}^d:\ \sigma(x) = \alpha x + \beta I$

with $\alpha, \beta \in \mathbb{R}$, we get: $\forall l = 1, \dots, L:\ x^{(l)} = \alpha\, w^{(l)T} x^{(l-1)} + \alpha b^{(l)} + \beta I$,

which results in an affine mapping: $f(x; w, b) = A^{(L)} x + B^{(L)}$,

where $A^{(0)} = I$, $B^{(0)} = 0$ and

$\forall l \le L:\quad A^{(l)} = \alpha\, w^{(l)T} A^{(l-1)}, \qquad B^{(l)} = \alpha\, w^{(l)T} B^{(l-1)} + \alpha b^{(l)} + \beta I$

Important message: the activation function should be non-linear; otherwise the resulting MLP is just an affine mapping with a peculiar parametrization!
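A quick numerical sanity check of this message, as a sketch with arbitrarily chosen $\alpha$, $\beta$, weights and input: stacking two layers with a linear "activation" $\sigma(z) = \alpha z + \beta$ gives exactly the same output as a single affine map $Ax + B$.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 0.7, 0.3
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((1, 3)), rng.standard_normal(1)
x = rng.standard_normal(2)

sigma = lambda z: alpha * z + beta           # "linear" activation
two_layer = sigma(W2 @ sigma(W1 @ x + b1) + b2)

# Collapse the two layers into a single affine map y = A x + B.
A = alpha * W2 @ (alpha * W1)
B = alpha * W2 @ (alpha * b1 + beta) + alpha * b2 + beta
print(np.allclose(two_layer, A @ x + B))     # True
```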

Page 9: Training Neural Networks

So what happens with a non-linearity?


Page 10: Training Neural Networks

Example: Solving ’XOR’

Task: learn a function $y = f^*(x)$ that maps the binary variables $x_1, x_2$ to "true" if one and only one $x_i = 1$, and to "false" otherwise. We will use a function $f(x; \Theta)$ and find parameters $\Theta$ to make $f \approx f^*$.

$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}, \qquad Y = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$

Page 11: Training Neural Networks

Attempt I: Use a linear model?

Assuming a linear function $f(x; w, b) = x^T w + b$ and the cost

$J(\Theta) = \frac{1}{N} \sum_{x \in X} \left( f^*(x) - f(x; \Theta) \right)^2, \qquad N = 4$

Solving via the normal equations yields $w = \mathbf{0}$ and $b = \frac{1}{2}$.
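A small sketch of this attempt in NumPy (using np.linalg.lstsq in place of explicitly forming the normal equations): the best linear fit to the four XOR points is indeed $w = 0$, $b = 1/2$, i.e. the model predicts 0.5 everywhere.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 1, 1, 0], dtype=float)

# Augment with a column of ones so the bias is part of the parameter vector.
X_aug = np.hstack([X, np.ones((4, 1))])
theta, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
w, b = theta[:2], theta[2]
print(w, b)            # w ≈ [0, 0], b ≈ 0.5
print(X_aug @ theta)   # 0.5 for every input -- the linear model cannot solve XOR
```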

Page 12: Training Neural Networks

Visualization

[Figure 6.1 from Goodfellow et al., Deep Learning: left panel shows the original x space (axes $x_1$, $x_2$), right panel the learned h space (axes $h_1$, $h_2$).]

Figure 6.1: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x1 = 0, the model's output must increase as x2 increases. When x1 = 1, the model's output must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change the coefficient on x2 and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0]^T and x = [0, 1]^T to a single point in feature space, h = [1, 0]^T. The linear model can now describe the function as increasing in h1 and decreasing in h2. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set. In more realistic applications, learned representations can also help the model to generalize.

Page 13: Training Neural Networks

ReLU

[Figure 6.3 from Goodfellow et al., Deep Learning: plot of $g(z) = \max\{0, z\}$.]

Figure 6.3: The rectified linear activation function. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying this function to the output of a linear transformation yields a nonlinear transformation. However, the function remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well. A common principle throughout computer science is that we can build complicated systems from minimal components. Much as a Turing machine's memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.

$g(z) = \max\{0, z\}$

Page 14: Training Neural Networks

Feedforward Neural Network

[Block diagram: input $X$ → linear layer ($W$, $c$) → $\sigma$ (ReLU) → linear output layer ($w$, $b$) → $y$]

$f(x; W, c, w, b) = w^T \max\{0,\ XW + c\} + b$

Page 15: Training Neural Networks

The XOR Multi-Layered Perceptron

Function we want to learn: $f(x; W, c, w, b) = w^T \max\{0,\ XW + c\} + b$

Oracle solution: $W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad b = 0$

$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}, \qquad Y = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$

$XW = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$

Page 16: Training Neural Networks

The XOR Multi-Layered Perceptron

Function we want to learn: $f(x; W, c, w, b) = w^T \max\{0,\ XW + c\} + b$

Oracle solution: $W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad b = 0$

$XW + c = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix} + \begin{bmatrix} 0 & -1 \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$

Page 17: Training Neural Networks

The XOR Multi-Layered Perceptron

Function we want to learn: $f(x; W, c, w, b) = w^T \max\{0,\ XW + c\} + b$

Oracle solution: $W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad b = 0$

$\max\{0,\ XW + c\} = \max\left\{0,\ \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}\right\} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$

Page 18: Training Neural Networks

$XW + c = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}, \qquad \max\{0,\ XW + c\} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$ (columns $h_1$, $h_2$)

[Figure 6.1 from Goodfellow et al. again (see Page 12): in the learned h space, the inputs $X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}$ are mapped onto the points $(0,0)$, $(1,0)$, $(1,0)$, $(2,1)$, which a linear model can now separate.]

Page 19: Training Neural Networks

The XOR Multi-Layered Perceptron

Function we want to learn: $f(x; W, c, w, b) = w^T \max\{0,\ XW + c\} + b$

Oracle solution: $W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad b = 0$

$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}, \qquad Y = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$

$\max\{0,\ XW + c\}\, w + b = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ -2 \end{bmatrix} + 0 = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} = Y$
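The worked computation on these slides can be reproduced in a few lines of NumPy (a sketch; variable names mirror the slides' notation):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([0, 1, 1, 0])

# Oracle solution from the slides.
W = np.array([[1, 1], [1, 1]])
c = np.array([0, -1])
w = np.array([1, -2])
b = 0

h = np.maximum(0, X @ W + c)   # hidden representation (ReLU), rows: [0 0], [1 0], [1 0], [2 1]
y = h @ w + b                  # linear output layer
print(y)                       # [0 1 1 0], identical to Y
```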

Page 20: Training Neural Networks

So far: The non-linearity is critically important

Next: What type of functions can we approximate?


Page 21: Training Neural Networks

Universal approximation theorem

Given that $\sigma \in C(\mathbb{R})$ is non-linear (e.g., sigmoid), what type of functions can we learn?

For any continuous function $f(x)$ there exists $g(x)$, realizable as a neural network, such that $g(x) \approx f(x)$, i.e. $|g(x) - f(x)| < \epsilon$.

• It is an approximation
• Given enough hidden units; one layer is enough in theory, in practice deeper is better

Original proof: "Multilayer feedforward networks are universal approximators" (Hornik, Stinchcombe & White, 1989)

Page 22: Training Neural Networks

Universal approximation theorem

[see also: http://neuralnetworksanddeeplearning.com/chap4.html]

$g(x) = \sigma(w^T x + b) = \dfrac{\exp(w^T x + b)}{\exp(w^T x + b) + 1}$, where $\sigma$ is the sigmoid function

[Plot of $g(x)$ against $x$ for $w = 16$, $b = -9$]

Page 23: Training Neural Networks

Universal approximation theorem

[see also: http://neuralnetworksanddeeplearning.com/chap4.html]

$g(x) = \sigma(w^T x + b) = \dfrac{\exp(w^T x + b)}{\exp(w^T x + b) + 1}$

[Plots of $g(x)$ for $b = -9$ and $w = 8$, $w = 16$, $w = 40$: increasing $w$ makes the transition steeper]

Page 24: Training Neural Networks

Universal approximation theorem

[see also: http://neuralnetworksanddeeplearning.com/chap4.html]

$g(x) = \sigma(w^T x + b) = \dfrac{\exp(w^T x + b)}{\exp(w^T x + b) + 1}$, where $\sigma$ is the sigmoid function

[Plots of $g(x)$ for $w = 16$ and $b = -12$, $b = -9$, $b = -7$: the bias shifts the position of the transition]

Page 25: Training Neural Networks

Universal approximation theorem

[see also: http://neuralnetworksanddeeplearning.com/chap4.html]

Step function: for large weights (e.g., $w = 100$, $b = -40$) the sigmoid approaches a step function.

Leading edge: $s = -\dfrac{b}{w} = 0.4$

[Plot of $g(x)$ against $x$]

Page 26: Training Neural Networks

Universal approximation theorem

[see also: http://neuralnetworksanddeeplearning.com/chap4.html]

Step function

[Plot of $g(x)$ against $x$: hidden units with output weights $h = 0.5$, $h = 0.5$ and step positions $s = 0.4$, $s = 0.4$]

The output layer is always assumed to be linear.

Page 27: Training Neural Networks

Universal approximation theorem

[see also: http://neuralnetworksanddeeplearning.com/chap4.html]

[Plot of $g(x)$ against $x$: two step functions with positions $s_1 = 0.2$, $s_2 = 0.4$ and output weights $h_1 = 0.5$, $h_2 = 0.5$]

Page 28: Training Neural Networks

Universal approximation theorem

[see also: http://neuralnetworksanddeeplearning.com/chap4.html]

"Bump function"

[Plot of $g(x)$ against $x$: a step up at $s_1 = 0.2$ with weight $h_1 = 0.5$ and a step down at $s_2 = 0.4$ with weight $h_2 = -0.5$ produce a bump between $s_1$ and $s_2$]
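A small sketch of this construction in NumPy: with large sigmoid weights (here $w = 100$, an illustrative choice) each hidden unit is approximately a step, and output weights $+0.5$ at $s_1 = 0.2$ and $-0.5$ at $s_2 = 0.4$ yield a bump of height 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, s1=0.2, s2=0.4, h=0.5, w=100.0):
    # Each hidden unit sigmoid(w*x + b) with b = -w*s approximates a step at x = s.
    step1 = sigmoid(w * x - w * s1)   # turns on near s1
    step2 = sigmoid(w * x - w * s2)   # turns on near s2
    return h * step1 - h * step2      # bump of height h between s1 and s2

for x in [0.0, 0.1, 0.3, 0.5, 1.0]:
    print(x, round(bump(x), 3))       # ~0 outside [0.2, 0.4], ~0.5 inside
```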

Page 29: Training Neural Networks

Universal approximation theorem

[see also: http://neuralnetworksanddeeplearning.com/chap4.html]

[Network diagram: input $x$, pairs of hidden units with step positions $s_1, s_2, s_3, s_4$ and output weights $h_1, -h_1, h_2, -h_2$, output $g(x)$: several bumps can be combined to approximate more complex functions]

Page 30: Training Neural Networks

Universal approximation theorem

[Diagram: a network mapping $x$ to $y$]

• Tells us that feed-forward networks can approximate $f(x)$
• It does not tell us anything about the chances of learning the correct parameters

Page 31: Training Neural Networks

More precisely…

Let $\sigma: \mathbb{R} \to \mathbb{R}$ be a non-constant, bounded and continuous (activation) function. $I_m$ denotes the $m$-dimensional unit hypercube $[0,1]^m$, and the space of real-valued continuous functions on $I_m$ is denoted by $C(I_m)$. Then any function $f \in C(I_m)$ can be approximated: for any $\epsilon > 0$ there exist an integer $N$, real constants $v_i, b_i \in \mathbb{R}$ and real vectors $w_i \in \mathbb{R}^m$ for $i = 1, \dots, N$ such that

$f(x) \approx g(x) = \sum_{i=1}^{N} v_i\, \sigma(w_i^T x + b_i)$

and $|g(x) - f(x)| < \epsilon$ for all $x \in I_m$.

[Cybenko, G. "Approximations by superpositions of sigmoidal functions", Mathematics of Control, Signals, and Systems, 1989.]
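As an illustrative sketch of this statement (not part of the lecture): draw random $w_i, b_i$, and fit only the output weights $v_i$ by least squares; with enough hidden units, $g(x) = \sum_i v_i\, \sigma(w_i x + b_i)$ closely approximates a target function on $[0, 1]$ (here $\sin(2\pi x)$, an arbitrary choice).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N = 50                                   # number of hidden units
w = rng.uniform(-30, 30, size=N)         # random input weights w_i
b = rng.uniform(-30, 30, size=N)         # random biases b_i

x = np.linspace(0, 1, 200)
f = np.sin(2 * np.pi * x)                # a target function f in C([0,1])

Phi = sigmoid(np.outer(x, w) + b)        # features sigma(w_i x + b_i)
v, *_ = np.linalg.lstsq(Phi, f, rcond=None)

g = Phi @ v                              # g(x) = sum_i v_i sigma(w_i x + b_i)
print(np.max(np.abs(g - f)))             # max error is small and shrinks as N grows
```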

Page 32: Training Neural Networks

However…

Networks with a single hidden layer may need to have exponential width.

In practice, deeper networks work better.

Lots of ongoing work to provide theory on which network properties lead to which approximation capabilities.

E.g., [Lu et al. “The Expressive Power of Neural Networks: A View from the Width”. NeurIPS 2017]


Page 33: Training Neural Networks

So far: NNs are universal approximators

Next: How do we find the network parameters?


Page 34: Training Neural Networks

General procedure

Iterative gradient descent to find parameters Θ:

• Initialize weights with small random values
• Initialize biases with 0 or small positive values
• Compute gradients
• Update parameters with SGD

SGD in a nutshell:
• Compute the negative gradient at $\theta^t$: $-\nabla C(\theta^t)$
• Scale it by the learning rate $\eta$: $-\eta \nabla C(\theta^t)$
• Take the step: $\theta^{t+1} = \theta^t - \eta \nabla C(\theta^t)$
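A minimal sketch of this update loop on a toy quadratic cost (the cost, learning rate and step count are illustrative assumptions; true SGD would additionally evaluate the gradient on random mini-batches):

```python
import numpy as np

def cost(theta):
    return np.sum((theta - 3.0) ** 2)     # toy cost C(theta), minimum at theta = [3, 3]

def grad(theta):
    return 2.0 * (theta - 3.0)            # gradient of C

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=2)     # small random initialization
eta = 0.1                                 # learning rate

for step in range(100):
    theta = theta - eta * grad(theta)     # theta <- theta - eta * grad C(theta)

print(theta, cost(theta))                 # theta close to [3, 3], cost close to 0
```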

Page 35: Training Neural Networks

Chain rule

$y = g(x)$ and $z = f(g(x)) = f(y)$

$\dfrac{dz}{dx} = \dfrac{dz}{dy}\,\dfrac{dy}{dx}$

For vector-valued variables: $\dfrac{\partial z}{\partial x} = \dfrac{\partial z}{\partial y}\,\dfrac{\partial y}{\partial x}$

Page 36: Training Neural Networks

Two ways to compute gradients

Consider the scalar function: $f = \exp\!\left(\exp(x) + \exp(x)^2\right) + \sin\!\left(\exp(x) + \exp(x)^2\right)$

Symbolic differentiation gives us:

$\dfrac{df}{dx} = \exp\!\left(\exp(x) + \exp(x)^2\right)\left(\exp(x) + 2\exp(x)^2\right) + \cos\!\left(\exp(x) + \exp(x)^2\right)\left(\exp(x) + 2\exp(x)^2\right)$

Page 37: Training Neural Networks

If we were to write a program…

Consider the scalar function: $f = \exp\!\left(\exp(x) + \exp(x)^2\right) + \sin\!\left(\exp(x) + \exp(x)^2\right)$

We would define and compute intermediate variables:
$a = \exp(x)$
$b = a^2$
$c = a + b$
$d = \exp(c)$
$e = \sin(c)$
$f = d + e$
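Written out as an actual Python program, the forward evaluation with intermediate variables looks as follows (the comparison against the closed-form expression is added only as a check):

```python
import math

def forward(x):
    a = math.exp(x)
    b = a ** 2
    c = a + b
    d = math.exp(c)
    e = math.sin(c)
    f = d + e
    return f

x = 0.3
direct = math.exp(math.exp(x) + math.exp(x) ** 2) + math.sin(math.exp(x) + math.exp(x) ** 2)
print(forward(x), direct)   # identical values
```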

Page 38: Training Neural Networks

We can draw this as graph

$f = \exp\!\left(\exp(x) + \exp(x)^2\right) + \sin\!\left(\exp(x) + \exp(x)^2\right)$

$a = \exp(x)$, $b = a^2$, $c = a + b$, $d = \exp(c)$, $e = \sin(c)$, $f = d + e$

[Computation graph: $x \xrightarrow{\exp(\cdot)} a$, $a \xrightarrow{(\cdot)^2} b$, $(a, b) \xrightarrow{+} c$, $c \xrightarrow{\exp(\cdot)} d$, $c \xrightarrow{\sin(\cdot)} e$, $(d, e) \xrightarrow{+} f$]

Page 39: Training Neural Networks

Mechanically writing down derivatives

General form (one node at a time):

$\dfrac{\partial f}{\partial d} = 1, \qquad \dfrac{\partial f}{\partial e} = 1$

$\dfrac{\partial f}{\partial c} = \dfrac{\partial f}{\partial d}\dfrac{\partial d}{\partial c} + \dfrac{\partial f}{\partial e}\dfrac{\partial e}{\partial c}$

$\dfrac{\partial f}{\partial b} = \dfrac{\partial f}{\partial c}\dfrac{\partial c}{\partial b}$

$\dfrac{\partial f}{\partial a} = \dfrac{\partial f}{\partial c}\dfrac{\partial c}{\partial a} + \dfrac{\partial f}{\partial b}\dfrac{\partial b}{\partial a}$

$\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial a}\dfrac{\partial a}{\partial x}$

With the local derivatives filled in:

$\dfrac{\partial f}{\partial d} = 1, \qquad \dfrac{\partial f}{\partial e} = 1$

$\dfrac{\partial f}{\partial c} = \dfrac{\partial f}{\partial d}\exp(c) + \dfrac{\partial f}{\partial e}\cos(c)$

$\dfrac{\partial f}{\partial b} = \dfrac{\partial f}{\partial c}$

$\dfrac{\partial f}{\partial a} = \dfrac{\partial f}{\partial c} + \dfrac{\partial f}{\partial b}\,2a$

$\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial a}\exp(x)$

[Computation graph as on the previous slide: $x \to a \to b \to c \to \{d, e\} \to f$]
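Continuing the same program, these derivatives can be accumulated mechanically in reverse order; the sketch below hand-codes this reverse pass and checks it against a finite-difference estimate (the check is an addition for illustration):

```python
import math

def forward_backward(x):
    # Forward pass with intermediate variables.
    a = math.exp(x)
    b = a ** 2
    c = a + b
    d = math.exp(c)
    e = math.sin(c)
    f = d + e

    # Backward pass: accumulate df/d(node) from the output back to the input.
    df_dd = 1.0
    df_de = 1.0
    df_dc = df_dd * math.exp(c) + df_de * math.cos(c)
    df_db = df_dc                     # c = a + b  ->  dc/db = 1
    df_da = df_dc + df_db * 2 * a     # a feeds into c directly and via b = a^2
    df_dx = df_da * math.exp(x)
    return f, df_dx

x = 0.3
f, dfdx = forward_backward(x)
eps = 1e-6
numeric = (forward_backward(x + eps)[0] - forward_backward(x - eps)[0]) / (2 * eps)
print(dfdx, numeric)   # the two values agree
```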

Page 40: Training Neural Networks

Backpropagation in Neural Networks

Example & Derivation on the blackboard
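The blackboard derivation is not reproduced here, but the following sketch shows what backpropagation through a one-hidden-layer sigmoid network can look like in code, trained on the XOR data from earlier (architecture, loss and learning rate are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
eta = 1.0                                         # learning rate (illustrative)

for step in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)                      # hidden activations
    y = sigmoid(h @ W2 + b2)                      # network output
    # Backward pass: chain rule for the squared-error loss L = mean((y - Y)^2).
    dL_dz2 = 2 * (y - Y) / len(X) * y * (1 - y)   # sigmoid'(z) = y * (1 - y)
    dL_dW2, dL_db2 = h.T @ dL_dz2, dL_dz2.sum(axis=0)
    dL_dz1 = (dL_dz2 @ W2.T) * h * (1 - h)
    dL_dW1, dL_db1 = X.T @ dL_dz1, dL_dz1.sum(axis=0)
    # Gradient descent update.
    W1 -= eta * dL_dW1; b1 -= eta * dL_db1
    W2 -= eta * dL_dW2; b2 -= eta * dL_db2

print(y.round(2).ravel())   # typically close to [0, 1, 1, 0]; exact values depend on the init
```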


Page 41: Training Neural Networks


Page 42: Training Neural Networks


Page 43: Training Neural Networks
Page 44: Training Neural Networks

Next week

Convolutional Neural Networks
