Lecture 15: From (Sigmoidal) Perceptrons to Neural Networks
Reference: We will be referring to sections of ‘Deep Learning’ by Ian Goodfellow, Yoshua Bengio and Aaron Courville
https://youtu.be/4PtaZVUbilI?list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa&t=1187
Recap: Non-linearity via Kernelization (Eg., LR)
The Regularized (Logistic) Cross-Entropy Loss function (minimized wrt w ∈ ℝ^p):

E(w) = -(1/m) Σ_{i=1}^{m} [ y^(i) log f_w(x^(i)) + (1 - y^(i)) log(1 - f_w(x^(i))) ] + (λ/2m) ||w||_2^2    (1)
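As a sketch, Eq. (1) can be computed directly in plain Python (function and variable names here are our own, not from the lecture):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def regularized_ce_loss(w, X, y, lam):
    """Regularized cross-entropy loss of Eq. (1).
    X: list of m feature vectors, y: labels in {0, 1}, lam: lambda."""
    m = len(X)
    ce = 0.0
    for xi, yi in zip(X, y):
        f = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))  # f_w(x^(i))
        ce -= yi * math.log(f) + (1 - yi) * math.log(1 - f)
    reg = lam / (2 * m) * sum(wj * wj for wj in w)          # (lambda/2m) ||w||^2
    return ce / m + reg
```

At w = 0 every prediction is 0.5, so the loss is log 2 regardless of the data — a handy sanity check.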
Equivalent dual kernelized objective¹ (minimized wrt α ∈ ℝ^m):

E_D(α) = Σ_{i=1}^{m} [ Σ_{j=1}^{m} ( -y^(i) K(x^(i), x^(j)) α_j + (λ/2) α_i K(x^(i), x^(j)) α_j ) + log( 1 + exp( Σ_{j=1}^{m} α_j K(x^(i), x^(j)) ) ) ]    (2)

Decision function: f_w(x) = 1 / ( 1 + exp( -Σ_{j=1}^{m} α_j K(x, x^(j)) ) )

¹ Representer Theorem and http://perso.telecom-paristech.fr/~clemenco/Projets_ENPC_files/kernel-log-regression-svm-boosting.pdf
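A minimal sketch of the kernelized decision function, using the convention p(y = 1 | x) = sigmoid(Σ_j α_j K(x, x^(j))) and an RBF kernel purely for illustration (the lecture does not fix a kernel choice):

```python
import math

def rbf_kernel(x, xp, gamma=1.0):
    """K(x, x') = exp(-gamma ||x - x'||^2); one common kernel choice."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xp)))

def kernel_decision(alpha, X_train, x, kernel=rbf_kernel):
    """Dual decision function: sigmoid of the alpha-weighted kernel sum."""
    s = sum(a_j * kernel(x, x_j) for a_j, x_j in zip(alpha, X_train))
    return 1.0 / (1.0 + math.exp(-s))
```

With α = 0 the score is 0 and the prediction is exactly 0.5, mirroring the w = 0 case of the primal.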
Story so Far
Perceptron
Kernel Perceptron
Logistic Regression
Kernelized Logistic Regression

Neural Networks: Universal Approximation Properties and Depth (Section 6.4)
With a single hidden layer of a sufficient size and a reasonable choice of nonlinearity (including the sigmoid, hyperbolic tangent, and RBF unit), one can represent any smooth function to any desired accuracy.
The greater the required accuracy, the more hidden units are required.
No free lunch theorems.
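The intuition behind the single-hidden-layer result can be sketched in a few lines: a difference of two steep, shifted sigmoid units forms an approximate "bump", and such bumps can tile any smooth 1-D function. The weights below are illustrative values of our own choosing, not from the lecture:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def one_hidden_layer(x, W1, b1, w2, b2):
    """y = w2 . g(W1 * x + b1) + b2 for a scalar input x and sigmoid hidden units."""
    hidden = [sigmoid(w * x + b) for w, b in zip(W1, b1)]
    return sum(v * h for v, h in zip(w2, hidden)) + b2

# Two steep sigmoids, shifted apart, form a bump that is ~1 on (0, 1) and ~0 outside.
bump = lambda x: one_hidden_layer(x, W1=[50, 50], b1=[0, -50], w2=[1, -1], b2=0)
```

More hidden units mean more (and narrower) bumps, hence the accuracy/width trade-off stated above.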
Problem in Perspective
Given data points x_i, i = 1, 2, . . . , m
Possible class choices: c_1, c_2, . . . , c_k
Wish to generate a mapping/classifier
f : x → {c_1, c_2, . . . , c_k}
to get class labels y_1, y_2, . . . , y_m
Problem in Perspective
In general, a series of mappings
x --f(·)--> y --g(·)--> z --h(·)--> {c_1, c_2, . . . , c_k}
where y, z are in some latent space.
https://playground.tensorflow.org
Other non-linear activation functions?
Consider classification: f(x) = g(w^T φ(x))
sign(w^T φ(x)) replaced by g(w^T φ(x)), where g(s) is a
1 step function: g(s) = 1 if s ∈ [θ, ∞) and g(s) = 0 otherwise, OR
2 sigmoid function: g(s) = 1 / (1 + e^{-s}), with possible thresholding using some θ (such as 1/2),
3 Rectified Linear Unit (ReLU): g(s) = max(0, s): the most popular activation function,
4 Softplus: g(s) = ln(1 + e^s).
Options 2 and 4 have the thresholding step deferred. The threshold changes as the bias is changed.
Demonstration at https://www.desmos.com/calculator
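The four options above are one-liners in plain Python (a sketch; the default θ = 1/2 for the step function follows the slide):

```python
import math

def step(s, theta=0.5):
    """Option 1: threshold at theta."""
    return 1.0 if s >= theta else 0.0

def sigmoid(s):
    """Option 2: smooth squashing into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def relu(s):
    """Option 3: Rectified Linear Unit."""
    return max(0.0, s)

def softplus(s):
    """Option 4: ln(1 + e^s), a smooth version of ReLU."""
    return math.log(1.0 + math.exp(s))
```

Note that softplus(s) ≈ relu(s) for large |s| but is differentiable everywhere, which is why option 4 "defers" the hard thresholding of option 1.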
Neural Networks: Cascade of layers of perceptrons giving you non-linearity. Check out https://playground.tensorflow.org/
Recall: Logistic Extension to multi-class
1 Each class c = 1, 2, . . . , K - 1 can have a different weight vector [w_{c,1}, w_{c,2}, . . . , w_{c,k}, . . . , w_{c,K-1}] and

p(Y = c | φ(x)) = e^{-(w_c)^T φ(x)} / ( 1 + Σ_{k=1}^{K-1} e^{-(w_k)^T φ(x)} )

for c = 1, . . . , K - 1, so that

p(Y = K | φ(x)) = 1 / ( 1 + Σ_{k=1}^{K-1} e^{-(w_k)^T φ(x)} )
Softmax: (Equivalent) LR extension to multi-class
1 Each class c = 1, 2, . . . , K can have a different weight vector [w_{c,1}, w_{c,2}, . . . , w_{c,K}] and

p(Y = c | φ(x)) = e^{-(w_c)^T φ(x)} / Σ_{k=1}^{K} e^{-(w_k)^T φ(x)}

for c = 1, . . . , K.
2 This has one set of additional (redundant) weight vector parameters.
3 Tutorial 7: Show the (simple) equivalence between the two formulations.
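The redundancy in item 2 can be checked numerically: adding the same vector v to every w_c leaves the softmax probabilities unchanged, so one weight vector can be fixed (e.g. to zero), recovering the K-1 formulation. A sketch using the slide's e^{-w·φ} convention (values below are arbitrary illustrations):

```python
import math

def softmax_probs(W, phi):
    """p(Y = c | phi) with p proportional to exp(-w_c . phi), per the slide."""
    scores = [math.exp(-sum(w_j * p_j for w_j, p_j in zip(wc, phi))) for wc in W]
    Z = sum(scores)  # the normalizer: sum over all K classes
    return [s / Z for s in scores]
```

Shifting every w_c by v multiplies every unnormalized score by the same factor e^{-v·φ}, which cancels in the normalization.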
Multi-layer Perceptron/LR (Neural Network) and VC Dimension
Measure for (non)separability using a classifier?
Aspect 1: Number of functions that can be represented
Tutorial 3: Given n boolean variables, how many of the 2^(2^n) boolean functions can be represented by the classifier? Eg, with the perceptron we saw that for n = 2 it is 14, for n = 3 it is 104, for n = 4 it is 1882.
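The n = 2 count can be verified by brute force: enumerate all 16 truth tables and search a small weight grid for a separating perceptron. This is a sketch; the grid below is an assumption of ours (weights in {-1, 0, 1} with half-integer biases happen to suffice for n = 2), not a general method:

```python
from itertools import product

def separable(truth_table, weight_grid):
    """True if some (w1, w2, b) in the grid gives sign(w1*x1 + w2*x2 + b)
    matching the truth table on all four boolean inputs."""
    inputs = list(product([0, 1], repeat=2))
    for w1, w2, b in product(weight_grid, repeat=3):
        if all((w1 * x1 + w2 * x2 + b > 0) == bool(label)
               for (x1, x2), label in zip(inputs, truth_table)):
            return True
    return False

grid = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
count = sum(separable(tt, grid) for tt in product([0, 1], repeat=4))
```

The two missing functions are XOR and XNOR, which no single perceptron can represent.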
Measure for (non)separability using a classifier?
Aspect 2: Cardinality of the largest set of points that can be shattered
VC (Vapnik-Chervonenkis) dimension ⇒ richness of the space of functions learnable by a statistical classification algorithm.
A classification function f_w is said to shatter a set of data points (x_1, x_2, . . . , x_n) if, for all assignments of labels to those points, there exists a w such that f_w makes no errors when evaluating that set of data points.
The cardinality of the largest set of points that f_w can shatter is its VC-dimension (see extra & optional material on VC-dimension).
VC dimension: Examples
Three points can be shattered using linear classifiers
[Figure: the 2^3 labelings of three points, each separated by a line]
Four points can be shattered using axis-parallel rectangles
[Figure: the 2^4 labelings of four points, each separated by an axis-parallel rectangle]
Measure for (non)separability using a classifier?
A classification function f_w is said to shatter a set of data points (x_1, x_2, . . . , x_n) if, for all assignments of labels to those points, there exists a w such that f_w makes no errors when evaluating that set of data points.
The cardinality of the largest set of points that f_w can shatter is its VC-dimension (see extra & optional material on VC-dimension).
Eg: For f as a threshold classifier (in ℝ), VC dimension = 1
Eg: For f as an interval classifier (in ℝ), VC dimension = 2
Eg: For f as a perceptron (in ℝ^2), VC dimension = 3
Eg: For f as a perceptron (in ℝ^n), VC dimension = n + 1 (see extra slides for proof)
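The first two examples can be checked by enumeration: list every labeling each family can realize on a point set and compare against all 2^n labelings. A sketch (the candidate thresholds are midpoints between distinct sorted points, an assumption that covers every realizable labeling on the line):

```python
def threshold_labelings(points):
    """Labelings realizable by g(x) = 1 iff x >= theta."""
    pts = sorted(set(points))
    thetas = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    return {tuple(1 if x >= t else 0 for x in points) for t in thetas}

def interval_labelings(points):
    """Labelings realizable by g(x) = 1 iff a <= x <= b."""
    pts = sorted(set(points))
    cand = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    labs = {tuple(0 for _ in points)}  # the empty interval labels everything 0
    for a in cand:
        for b in cand:
            if a <= b:
                labs.add(tuple(1 if a <= x <= b else 0 for x in points))
    return labs

def shatters(points, labelings):
    """All 2^n labelings realized?"""
    return len(labelings) == 2 ** len(points)
```

A threshold shatters one point but not two (it cannot label the left point 1 and the right point 0); an interval shatters two points but not three (it cannot realize 1, 0, 1).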
Neural Networks
Great expressive power (recall VC dimension). Keys to non-linearity:
1 non-linearity in activation functions (Tutorial 7)
2 cascade of non-linear activation functions
Varied activation functions f(x) = g(w^T φ(x)), where g(s) can be any of the following:
1 sign/step function: g(s) = sign(s), or g(s) = 1 if s ∈ [θ, ∞) and g(s) = 0 otherwise
2 sigmoid function: g(s) = 1 / (1 + e^{-s}), with possible thresholding using some θ (such as 1/2); tanh(s) = 2 sigmoid(2s) - 1
3 Rectified Linear Unit (ReLU): g(s) = max(0, s): the most popular activation function
4 Softplus: g(s) = ln(1 + e^s): a smooth version of ReLU
5 Others: leaky ReLU, RBF, Maxout, hard tanh, absolute value rectification (Section 6.2.1)
Neural Networks: Cascade of layers of perceptrons giving you non-linearity
The 4 Design Choices in a Neural Network (Section 6.1)
Some activation functions
Derivatives of some activation functions
Some interesting visualizations
https://distill.pub/2018/building-blocks/
https://distill.pub/2017/feature-visualization/
http://colah.github.io/posts/2015-01-Visualizing-Representations/
http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Simple OR using (step) perceptron
[Figure: inputs x and y with weights 1 and 1, bias input b = 1 with weight -0.25, threshold θ = 1/2; output x ∨ y]
AND using (step) perceptron
[Figure: inputs x and y with weights 1 and 1, bias input b = 1 with weight -1.25, threshold θ = 1/2; output x ∧ y]
NOT using perceptron
[Figure: input x with weight -1, bias input b = 1 with weight 0.75, threshold θ = 1/2; output ¬x]
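The three gates above can be verified directly with the weights from the slides (a sketch; `step_perceptron` is our own name):

```python
def step_perceptron(inputs, weights, bias_weight, theta=0.5):
    """g(w . x + b) with a step at theta = 1/2; the bias weight multiplies
    the constant input b = 1, as in the slide diagrams."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias_weight
    return 1 if s >= theta else 0

OR  = lambda x, y: step_perceptron([x, y], [1, 1], -0.25)   # fires unless both inputs are 0
AND = lambda x, y: step_perceptron([x, y], [1, 1], -1.25)   # fires only when both inputs are 1
NOT = lambda x:    step_perceptron([x], [-1], 0.75)         # inverts its single input
```

Each gate reproduces its full truth table, e.g. AND(1, 1) gives 1 + 1 - 1.25 = 0.75 ≥ 1/2, while AND(1, 0) gives -0.25 < 1/2.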
Feed-forward Neural Nets
[Figure: inputs x_1, x_2, . . . , x_n and a constant input 1 feed hidden units z_1 = g(Σ) and z_2 = g(Σ) through weights w_11, w_21, . . . , w_n1, w_01 and w_12, w_22, . . . , w_n2, w_02; the hidden units feed outputs f_1 = g(·) and f_2 = g(·)]
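A generic forward pass for such a network can be sketched as follows (our own naming; each unit j is stored as its weight list [w_0j, w_1j, . . . , w_nj], with w_0j the bias weight on the constant input 1):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def layer(inputs, units, g):
    """One layer: unit j computes g(w_0j + sum_i w_ij * x_i)."""
    return [g(u[0] + sum(w * x for w, x in zip(u[1:], inputs))) for u in units]

def feed_forward(x, layers, g=sigmoid):
    """Cascade the layers: the outputs of one become the inputs of the next."""
    for units in layers:
        x = layer(x, units, g)
    return x
```

With all weights zero the sum is 0 and each sigmoid unit outputs 0.5, independent of the inputs.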
Eg: Feed-forward Neural Net for XOR (θ = 0)
[Figure: inputs x_1, x_2 and constant input 1; hidden unit z_1 = g(x_1 + x_2 - 0.25) (an OR gate), hidden unit z_2 = g(-x_1 - x_2 + 1.25) (a NAND gate); output f_w = g(z_1 + z_2 - 1.25) (an AND gate), so the network computes x_1 XOR x_2]
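The XOR network can be verified with the weights from the diagram, using the step activation with θ = 0 as stated on the slide:

```python
def step0(s):
    # step activation with theta = 0, per the slide
    return 1 if s >= 0 else 0

def xor_net(x1, x2):
    z1 = step0(1 * x1 + 1 * x2 - 0.25)    # hidden unit 1: x1 OR x2
    z2 = step0(-1 * x1 - 1 * x2 + 1.25)   # hidden unit 2: NOT (x1 AND x2)
    return step0(1 * z1 + 1 * z2 - 1.25)  # output: z1 AND z2  ->  XOR
```

This is exactly the classic decomposition XOR = OR AND NAND, which a single perceptron cannot represent but a one-hidden-layer cascade can.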