Training Neural Networks
SS 2020 – Machine Perception
Otmar Hilliges
12 March 2020
Errata
• Projects are in teams of two – Mea Culpa
• See Piazza announcement for more details and post questions there
More announcements – COVID-19
Situation changes daily; we keep on monitoring this
If minimum distance of 1 meter cannot be guaranteed, events at ETH are forbidden
This affects lectures -> please sit apart from each other.
If possible, prioritize video recordings over the lecture hall.
Last lecture
• Perceptron learning algorithm
• MLP as engineering model of a neural network
This lecture
• What types of functions can be approximated by neural networks?
  • Universal approximation theorem
• How do we train neural networks?
  • Backprop algorithm
Perceptron - Block diagram
Block diagram: $x \rightarrow [\,\cdot\, w] \rightarrow [+\, b] \rightarrow \sigma \rightarrow y$

$x = [x_1, x_2, x_3, \dots, x_d]$
$w = [w_1, w_2, w_3, \dots, w_d]$

Pre-activation: $w^T x + b$
Output: $y = \sigma(w^T x + b)$
Multi-layered Perceptron - Block diagram
We can combine several layers. With $x^{(0)} = x$,

$\forall l = 1, \dots, L: \quad x^{(l)} = \sigma\left( w^{(l)T} x^{(l-1)} + b^{(l)} \right)$

and $f(x; w, b) = x^{(L)}$.
Block diagram: $x \rightarrow [\,\cdot\, w^{(1)}] \rightarrow [+\, b^{(1)}] \rightarrow \sigma \rightarrow \dots \rightarrow [\,\cdot\, w^{(L)}] \rightarrow [+\, b^{(L)}] \rightarrow \sigma \rightarrow y$
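A minimal NumPy sketch of this forward pass (the layer sizes and the use of a logistic sigmoid here are illustrative assumptions, not part of the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases, activation=sigmoid):
    """Compute x^(l) = sigma(W^(l).T @ x^(l-1) + b^(l)) for l = 1..L."""
    h = x
    for W, b in zip(weights, biases):
        h = activation(W.T @ h + b)
    return h  # f(x; w, b) = x^(L)

# Example: a 2 -> 3 -> 1 network with arbitrary (illustrative) parameters.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
biases = [np.zeros(3), np.zeros(1)]
print(mlp_forward(np.array([1.0, 2.0]), weights, biases))
```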
Linear activation functions?
Assuming that $\sigma$ is a linear transform,

$\forall x \in \mathbb{R}^d: \quad \sigma(x) = \alpha x + \beta I$

with $\alpha, \beta \in \mathbb{R}$, we get

$\forall l = 1, \dots, L: \quad x^{(l)} = \alpha w^{(l)T} x^{(l-1)} + \alpha b^{(l)} + \beta I,$

which results in an affine mapping

$f(x; w, b) = A^{(L)} x + B^{(L)},$

where $A^{(0)} = I$, $B^{(0)} = 0$ and, for all $l = 1, \dots, L$,

$A^{(l)} = \alpha w^{(l)T} A^{(l-1)}, \qquad B^{(l)} = \alpha w^{(l)T} B^{(l-1)} + \alpha b^{(l)} + \beta I.$
Important message: The activation function must be non-linear, otherwise the resulting MLP is just an affine mapping with a peculiar parametrization!
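A quick numerical check of this collapse (a sketch; the layer sizes and the values of $\alpha$ and $\beta$ are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 0.5
W = [rng.normal(size=(4, 4)) for _ in range(3)]   # three layers with a "linear activation"
b = [rng.normal(size=4) for _ in range(3)]

def linear_sigma(z):
    return alpha * z + beta   # a linear "activation"

def forward(x):
    h = x
    for Wl, bl in zip(W, b):
        h = linear_sigma(Wl.T @ h + bl)
    return h

# Collapse the whole stack into a single affine map A x + B.
A, B = np.eye(4), np.zeros(4)
for Wl, bl in zip(W, b):
    A = alpha * Wl.T @ A
    B = alpha * (Wl.T @ B + bl) + beta

x = rng.normal(size=4)
print(np.allclose(forward(x), A @ x + B))  # True: the "deep" network is just affine
```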
So what happens with a non-linearity?
Example: Solving 'XOR'
Task: learn a function $y = f^*(x)$ that maps the binary variables $x_1, x_2$ to "true" if one and only one $x_i = 1$, and to "false" otherwise. We will use a function $f(x; \Theta)$ and find parameters $\Theta$ such that $f \approx f^*$.

$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}, \quad Y = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$
Attempt I: Use a linear model?
Assuming a linear function

$f(x; w, b) = x^T w + b$

and the loss

$J(\Theta) = \frac{1}{N} \sum_{x \in X} \left( f^*(x) - f(x; \Theta) \right)^2, \quad N = 4,$

solving via the normal equation yields $w = \mathbf{0}$ and $b = \frac{1}{2}$.
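A small sketch verifying this result with the standard least-squares solution (the bias is folded in as a constant column; all names are illustrative):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 1, 1, 0], dtype=float)

# Augment with a constant column so the bias is part of the parameter vector.
X_aug = np.hstack([X, np.ones((4, 1))])

# Normal equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y)
w, b = theta[:2], theta[2]
print(w, b)           # w = [0, 0], b = 0.5
print(X_aug @ theta)  # every prediction is 0.5 -> the linear model cannot solve XOR
```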
Visualization
[Figure 6.1 from Goodfellow et al., Deep Learning: left panel "Original x space" (axes $x_1$, $x_2$), right panel "Learned h space" (axes $h_1$, $h_2$).]

Figure 6.1: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x1 = 0, the model's output must increase as x2 increases. When x1 = 1, the model's output must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change the coefficient on x2 and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0]^T and x = [0, 1]^T to a single point in feature space, h = [1, 0]^T. The linear model can now describe the function as increasing in h1 and decreasing in h2. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set. In more realistic applications, learned representations can also help the model to generalize.
ReLU
[Figure 6.3 from Goodfellow et al., Deep Learning: plot of the rectified linear activation $g(z) = \max\{0, z\}$ over $z$.]

Figure 6.3: The rectified linear activation function. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying this function to the output of a linear transformation yields a nonlinear transformation. However, the function remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well. A common principle throughout computer science is that we can build complicated systems from minimal components. Much as a Turing machine's memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.

$g(z) = \max\{0, z\}$
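As code (a trivial NumPy sketch):

```python
import numpy as np

def relu(z):
    """Rectified linear activation: g(z) = max{0, z}, applied elementwise."""
    return np.maximum(0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))  # [0.  0.  0.  0.5 2. ]
```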
Feedforward Neural Network
Block diagram: $X \rightarrow [\,\cdot\, W] \rightarrow [+\, c] \rightarrow \sigma \rightarrow [\,\cdot\, w] \rightarrow [+\, b] \rightarrow y$

$f(x; W, c, w, b) = w^T \max\{0, XW + c\} + b$
The XOR Multi-Layered Perceptron
Function we want to learn: $f(x; W, c, w, b) = w^T \max\{0, XW + c\} + b$

Oracle solution: $W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad b = 0$

$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}, \quad Y = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$

Step 1: $XW = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$
The XOR Multi-Layered Perceptron
Step 2: $XW + c = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix} + \begin{bmatrix} 0 & -1 \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$ (adding $c^T$ to every row)
The XOR Multi-Layered Perceptron
Step 3: $\max\{0, XW + c\} = \max\left\{0, \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}\right\} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$
The hidden representation is $h = \max\{0, XW + c\} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$ (columns $h_1$, $h_2$).

[Figure 6.1 again (see above): in the learned $h$ space the two inputs that must output 1, $x = [0, 1]^T$ and $x = [1, 0]^T$, collapse to the single point $h = [1, 0]^T$, so a linear model can now solve the problem.]
The XOR Multi-Layered Perceptron
Step 4: multiply the hidden representation by the output weights and add the bias:

$\max\{0, XW + c\}\, w + b = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ -2 \end{bmatrix} + 0 = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} = Y$
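A compact NumPy sketch that checks the oracle solution end-to-end (the broadcasting of $c$ and $b$ is made explicit):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 1, 1, 0], dtype=float)

# Oracle parameters from the slides.
W = np.array([[1, 1], [1, 1]], dtype=float)
c = np.array([0, -1], dtype=float)
w = np.array([1, -2], dtype=float)
b = 0.0

h = np.maximum(0, X @ W + c)    # hidden representation, rows are [h1, h2]
y = h @ w + b                   # linear output layer
print(h)                        # [[0 0] [1 0] [1 0] [2 1]]
print(y, np.array_equal(y, Y))  # [0. 1. 1. 0.] True
```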
So far: The non-linearity is critically important
Next: What type of functions can we approximate?
Universal approximation theorem
Given that $\sigma$ is a continuous, non-linear function (e.g., sigmoid), what type of functions can we learn?
For any continuous function $f(x)$ there exists a neural network $g(x)$ such that $g(x) \approx f(x)$, i.e., $|g(x) - f(x)| < \epsilon$.

It is an approximation, given enough hidden units. One layer is enough in theory; in practice, deeper is better.

Original proof: "Multilayer feedforward networks are universal approximators".
Universal approximation theorem
[see also: http://neuralnetworksanddeeplearning.com/chap4.html]
$g(x) = \sigma(w^T x + b) = \dfrac{\exp(w^T x + b)}{\exp(w^T x + b) + 1}$, where $\sigma$ is the sigmoid function.

[Plot of $g(x)$ over $x$ for $w = 16$, $b = -9$.]
Universal approximation theorem
[Plots of $g(x) = \sigma(wx + b)$ for $b = -9$ and $w = 8, 16, 40$: increasing $w$ makes the transition steeper.]
Universal approximation theorem
[Plots of $g(x) = \sigma(wx + b)$ for $w = 16$ and $b = -12, -9, -7$: changing $b$ shifts the transition along $x$.]
Universal approximation theorem
For large weights the sigmoid approaches a step function. With $w = 100$, $b = -40$, the leading edge of the step is at

$s = -\dfrac{b}{w} = 0.4$

[Plot of $g(x)$ over $x$: an approximate step function at $x = 0.4$.]
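A small sketch (pure NumPy, printing a few sample values instead of plotting) showing that a large $w$ turns the sigmoid into an approximate step at $s = -b/w$:

```python
import numpy as np

def g(x, w, b):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))   # sigmoid of wx + b

w, b = 100.0, -40.0
s = -b / w                                      # step position: 0.4
xs = np.array([0.0, 0.3, 0.39, 0.41, 0.5, 1.0])
print(s)
print(g(xs, w, b).round(3))  # ~0 well below 0.4, ~1 well above, sharp transition near 0.4
```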
Universal approximation theorem
[Plot: two hidden units, each an approximate step function with step position $s = 0.4$ and output weight $h = 0.5$.]

The output layer is always assumed linear.
Universal approximation theorem
[Plot: two step functions with positions $s_1 = 0.2$ and $s_2 = 0.4$ and output weights $h_1 = h_2 = 0.5$; $g(x)$ increases in two steps.]
Universal approximation theorem
"Bump function"

[Plot: step positions $s_1 = 0.2$ and $s_2 = 0.4$ with output weights $+0.5$ and $-0.5$; subtracting the second step from the first yields a bump of height $0.5$ on the interval $[0.2, 0.4]$.]
Universal approximation theorem
[Plot: several pairs of steps at positions $s_1, s_2, s_3, s_4$ with output weights $h_1, -h_1, h_2, -h_2$; each pair produces one bump, and summing many such bumps lets $g(x)$ approximate an arbitrary continuous function.]
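A sketch of this bump construction (the target function, the number of bumps, and the steepness are arbitrary assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, left, right, height, steep=500.0):
    """Approximate `height` times the indicator of [left, right]
    using two steep sigmoids (i.e., two hidden units)."""
    return height * (sigmoid(steep * (x - left)) - sigmoid(steep * (x - right)))

def g(x, f_target, n_bumps=50):
    """Sum of bumps whose heights sample the target function on [0, 1]."""
    edges = np.linspace(0.0, 1.0, n_bumps + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return sum(bump(x, l, r, f_target(c))
               for l, r, c in zip(edges[:-1], edges[1:], centers))

f_target = lambda x: np.sin(2 * np.pi * x) ** 2        # some continuous function on [0, 1]
xs = np.linspace(0.05, 0.95, 7)
print(np.max(np.abs(g(xs, f_target) - f_target(xs))))  # small approximation error
```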
Universal approximation theorem
• Tells us that feedforward networks can approximate $f(x)$
• It does not tell us anything about the chances of learning the correct parameters
More precisely…
Let $\sigma: \mathbb{R} \to \mathbb{R}$ be a non-constant, bounded and continuous (activation) function. $I_m$ denotes the $m$-dimensional unit hypercube $[0,1]^m$, and the space of real-valued functions on $I_m$ is denoted by $C(I_m)$. Then any function $f \in C(I_m)$ can be approximated: given any $\epsilon > 0$, there exist an integer $N$, real constants $v_i, b_i \in \mathbb{R}$ and real vectors $w_i \in \mathbb{R}^m$ for $i = 1, \dots, N$ such that

$f(x) \approx g(x) = \sum_{i=1}^{N} v_i\, \sigma(w_i^T x + b_i)$

and $|g(x) - f(x)| < \epsilon$ for all $x \in I_m$.

[Cybenko, G. "Approximation by superpositions of a sigmoidal function", Mathematics of Control, Signals, and Systems, 1989.]
However…
Networks with a single hidden layer need to have exponential width.

In practice, deeper networks work better.

Lots of ongoing work to provide theory on what network properties lead to which approximation capabilities.

E.g., [Lu et al. "The Expressive Power of Neural Networks: A View from the Width". NeurIPS 2017]
So far: NNs are universal approximators
Next: How do we find the network parameters?
General procedure
Iterative gradient descent to find parameters Θ:
• Initialize weights with small random values
• Initialize biases with 0 or small positive values
• Compute gradients
• Update parameters with SGD

SGD in a nutshell: compute the negative gradient at the current parameters $\theta$, $-\nabla C(\theta)$, and scale it by the learning rate $\eta$ to obtain the update step $-\eta \nabla C(\theta)$.
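A minimal sketch of this procedure for a one-hidden-layer ReLU network (the XOR task, sizes, learning rate, and the use of finite-difference gradients instead of backprop are illustrative assumptions; backprop is covered next):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR again (illustrative).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 1, 1, 0], dtype=float)

# Initialize weights with small random values, biases with 0.
params = [0.1 * rng.normal(size=(2, 2)), np.zeros(2),   # W, c
          0.1 * rng.normal(size=2), np.zeros(1)]        # w, b

def loss(p):
    W, c, w, b = p
    h = np.maximum(0, X @ W + c)           # hidden layer (ReLU)
    return np.mean((h @ w + b - Y) ** 2)   # mean squared error

def numerical_grad(p, eps=1e-5):
    """Finite-difference gradient of the loss w.r.t. every parameter."""
    grads = []
    for arr in p:
        g = np.zeros_like(arr)
        flat, gflat = arr.ravel(), g.ravel()   # writable views
        for i in range(flat.size):
            old = flat[i]
            flat[i] = old + eps; lp = loss(p)
            flat[i] = old - eps; lm = loss(p)
            flat[i] = old
            gflat[i] = (lp - lm) / (2 * eps)
        grads.append(g)
    return grads

eta = 0.1
for step in range(2000):   # (full-batch) gradient descent
    grads = numerical_grad(params)
    params = [p - eta * g for p, g in zip(params, grads)]

print(loss(params))  # final loss; may or may not reach ~0 for XOR, depending on the random init
```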
Chain rule
$y = g(x)$ and $z = f(g(x)) = f(y)$

$\dfrac{dz}{dx} = \dfrac{dz}{dy} \dfrac{dy}{dx}$

For vector types:

$\dfrac{\partial z}{\partial x} = \dfrac{\partial z}{\partial y} \dfrac{\partial y}{\partial x}$
Two ways to compute gradients
Consider the scalar function:

$f = \exp\left(\exp(x) + \exp(x)^2\right) + \sin\left(\exp(x) + \exp(x)^2\right)$

Symbolic differentiation gives us:

$\dfrac{df}{dx} = \exp\left(\exp(x) + \exp(x)^2\right)\left(\exp(x) + 2\exp(x)^2\right) + \cos\left(\exp(x) + \exp(x)^2\right)\left(\exp(x) + 2\exp(x)^2\right)$
If we were to write a program…
Consider the scalar function:

$f = \exp\left(\exp(x) + \exp(x)^2\right) + \sin\left(\exp(x) + \exp(x)^2\right)$

We would define and compute intermediate variables:

$a = \exp(x)$
$b = a^2$
$c = a + b$
$d = \exp(c)$
$e = \sin(c)$
$f = d + e$
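As a (trivial) program:

```python
import math

def forward(x):
    # Forward pass with the intermediate variables from the slide.
    a = math.exp(x)
    b = a ** 2
    c = a + b
    d = math.exp(c)
    e = math.sin(c)
    f = d + e
    return f

print(forward(0.5))
```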
We can draw this as a graph
$f = \exp\left(\exp(x) + \exp(x)^2\right) + \sin\left(\exp(x) + \exp(x)^2\right)$

$a = \exp(x)$, $b = a^2$, $c = a + b$, $d = \exp(c)$, $e = \sin(c)$, $f = d + e$

Computational graph: $x \xrightarrow{\exp(\cdot)} a$, then $a \xrightarrow{(\cdot)^2} b$; $a$ and $b$ feed into $c = a + b$; $c \xrightarrow{\exp(\cdot)} d$ and $c \xrightarrow{\sin(\cdot)} e$; finally $d$ and $e$ feed into $f = d + e$.
Mechanically writing down derivatives
Traversing the graph from the output backwards (chain rule at every node), with the concrete local derivatives filled in:

$\dfrac{\partial f}{\partial d} = 1 \qquad\qquad \dfrac{\partial f}{\partial e} = 1$

$\dfrac{\partial f}{\partial c} = \dfrac{\partial f}{\partial d}\dfrac{\partial d}{\partial c} + \dfrac{\partial f}{\partial e}\dfrac{\partial e}{\partial c} = \dfrac{\partial f}{\partial d}\exp(c) + \dfrac{\partial f}{\partial e}\cos(c)$

$\dfrac{\partial f}{\partial b} = \dfrac{\partial f}{\partial c}\dfrac{\partial c}{\partial b} = \dfrac{\partial f}{\partial c}$

$\dfrac{\partial f}{\partial a} = \dfrac{\partial f}{\partial c}\dfrac{\partial c}{\partial a} + \dfrac{\partial f}{\partial b}\dfrac{\partial b}{\partial a} = \dfrac{\partial f}{\partial c} + \dfrac{\partial f}{\partial b}\, 2a$

$\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial a}\dfrac{\partial a}{\partial x} = \dfrac{\partial f}{\partial a}\exp(x)$
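A sketch of this backward traversal in code, checked against the symbolic derivative from above:

```python
import math

def f_and_grad(x):
    # Forward pass with intermediate variables.
    a = math.exp(x)
    b = a ** 2
    c = a + b
    d = math.exp(c)
    e = math.sin(c)
    f = d + e
    # Backward pass: accumulate df/d(node) from the output back to x.
    df_dd = 1.0
    df_de = 1.0
    df_dc = df_dd * math.exp(c) + df_de * math.cos(c)
    df_db = df_dc
    df_da = df_dc + df_db * 2 * a
    df_dx = df_da * math.exp(x)
    return f, df_dx

x = 0.3
u = math.exp(x) + math.exp(x) ** 2
symbolic = (math.exp(u) + math.cos(u)) * (math.exp(x) + 2 * math.exp(x) ** 2)
print(f_and_grad(x)[1], symbolic)  # the two values agree
```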
Backpropagation in Neural Networks
Example & Derivation on the blackboard
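Since the derivation itself happens on the blackboard, here is only a rough sketch of what backprop looks like for a one-hidden-layer network with sigmoid activations and squared-error loss (the architecture and loss are assumptions for illustration, not necessarily the blackboard example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    """Gradients of L = 0.5 * ||y_hat - y||^2 for a 1-hidden-layer sigmoid MLP."""
    # Forward pass, keeping the intermediate values.
    z1 = W1 @ x + b1
    h = sigmoid(z1)
    z2 = W2 @ h + b2
    y_hat = sigmoid(z2)
    # Backward pass (chain rule, layer by layer).
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # dL/dz2
    dW2 = np.outer(delta2, h)
    db2 = delta2
    delta1 = (W2.T @ delta2) * h * (1 - h)       # dL/dz1
    dW1 = np.outer(delta1, x)
    db1 = delta1
    return dW1, db1, dW2, db2

# Tiny example with arbitrary shapes and values.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
grads = backprop(np.array([0.5, -1.0]), np.array([1.0]), W1, b1, W2, b2)
print([g.shape for g in grads])
```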
Next week
Convolutional Neural Networks