Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Deep learning and neural networks
1
B. Mehlig, Department of Physics, University of Gothenburg, SwedenFFR135/FIM720 Artificial Neural Networks Chalmers/Gothenburg University, 7.5 credits
Thanks to E. Werner, H. Linander & O. Balabanov.
Neurons in the cerebral cortex
Neurons in the cerebral cortex (outer layer of the cerebrum, the largest and best developed part of the Human brain)
brainmaps.org
2
Neurons in the cerebral cortex
Neurons in the cerebral cortex (outer layer of the cerebrum, the largest and best developed part of the Human brain)
brainmaps.org
3
dendrites
cell body
axon
input
process
output
Neuron anatomy and activity
Neurons in the cerebral cortex
Spike train in electrosensory pyramidal neuron in fish (Eigenmannia)Gabbiani & Metzner, J. Exp. Biol. 202 (1999) 1267
active inactive
10 µm
Total length of dendrites up to . � cm
time
4
Schematic drawing of a neuron Output of a neuron: spike train
input process output
McCulloch-Pitts neuron
Simple model for a neuron: Signal processing: weighted sum of inputs Activation function
�
Weights(synapticcouplings)
wi1
wi2
wiN
Incoming signalsj=1,...,N
wij ThresholdNeuron number
�i
�i
Output
Oi
Oi
g(bi)
= bi
i
(local field)
5
McCulloch & Pitts, Bull. Math. Biophys. 5 (1943) 115
x1
x2
xN
Oi = g� N�
j=1
wijxj � �i
�
= g�wi1x1 + . . . + wiNxN � �i)
xj
Activation function
Signal processing of McCulloch-Pitts neuron: weighted sum of inputs with activation function : g(bi)
= bi
InputsWeights wij
ThresholdOutput Oi
�i
g(b)
b
1
Step function
g(b)
b
1
�1
0
Signum function
g(b)
b
1
Sigmoid functionsgn(b)
0
6
g(b)
ReLU function
0
max(0, b)
b
Oi = g� N�
j=1
wijxj � �i
�= g
�wi1x1 + . . . + wiNxN � �i) xj
g(x) =1
e�b + 1
(local field)
�H(b)
Highly simplified idealisation of a neuron
Take activation function to be signum function .Output of neuron can assume only two values, ,representing the level of activity of the neuron.
This is an idealisation.
-real neurons fire spike trains
-switching active inactive can be delayed (time delay)
-real neurons may perform non-linear summation of inputs
-real neurons subject to noise (stochastic dynamics)
��
�
1
�1
active
inactiveisgn(bi)±1
7
Neural networks
8
Simple perceptrons
Recurrent network
Boltzmann machine
inputs inputs
Two-layer perceptron
Connect neurons into networks that can perform computing tasks: for example objectidentification, video analysis, speech recognition, classification, data clustering,...
The term perceptron was coined in 1958.
Rosenblatt, Psychological Review 65 (1958) 386
A classification task
Input patterns . Index labels different patterns.
Each pattern has two components, and .
Arrange components into vector, . is shown in the Figure.
Patterns fall into two classes: onthe left, and on the right.
µ = 1, . . . , p
9
x
(µ)
x
(µ) =
"x
(µ)1
x
(µ)2
#x
(µ)1 x
(µ)2
x
(8)
x1
x2
x
(8)
A classification task
Input patterns . Index labels different patterns.
Each pattern has two components, and .
Arrange components into vector, . is shown in the Figure.
Patterns fall into two classes: onthe left, and on the right.
Draw a red line (decision boundary) to distinguish the two types of patterns ( and ).
10
µ = 1, . . . , p
x
(8)
x1
x2
x
(µ) =
"x
(µ)1
x
(µ)2
#
x
(µ)
x
(µ)1 x
(µ)2
x
(8)
A classification task
Input patterns . Index labels different patterns.
Each pattern has two components, and .
Arrange components into vector, , is shown in the Figure.
Patterns fall into two classes: onthe left, and on the right.
Draw a red line (decision boundary) to distinguish the two types of patterns ( and ).
Aim: train a neural network to compute the decision boundary. To do this, define targetvalues: for , and for
Training set .
11
t(µ) = 1 t(µ) = �1
µ = 1, . . . , p
x1
x2
x
(µ)
x
(µ)1 x
(µ)2
x
(µ) =
"x
(µ)1
x
(µ)2
#
(x(µ), t(µ)), µ = 1, . . . , p
x
(8)
x
(8)
Simple perceptron
Simple perceptron: one neuron. Two input terminals and . Activation function . No threshold, . Output
Vectors
Scalar product between vectors:
So
Aim: adjust the weights so that network outputs correct target value for all patterns:
w1
w2
sgn(b)
w =
w1
w2
�
� = 0
angle between and w
' w
= |w|
cos �sin�
�O
↵�
12
inputs
output denotes length of vector w
w
µ = 1, . . . , pfor
cos �
�2���2
x =
x1
x2
�= |x|
cos ↵sin↵
�
= |w| |x|(cos ↵ cos � + sin↵ sin��
w · x = w1x1 + w2x2
= |w| |x| cos(↵� �)
w · x = |w| |x| cos � x
O = sgn(w1x1 + w2x2)
O(µ) = sgn(w · x
(µ)) = t(µ)
x2x1
x2
x1
= sgn(w · x)
x
Geometrical solution
Simple perceptron: one neuron. Two input terminals and . Activation function . No threshold, . Output with
Aim: adjust the weights so that network outputs correct target values for all patterns:
Solution:
w1
w2
w
'
angle between and w
� = 0
13
Check: so . Correct since .
O
inputs
output
w
µ = 1, . . . , pfor
Legendt(µ) = 1
t(µ) = �1
cos �
�2���2
O = sgn(w · x) w · x = |w| |x| cos � x
O(µ) = sgn(w · x
(µ)) = t(µ)
x1
x2
x2
x1
x
(8)
w · x
(8) > 0 O(8) = sgn(w · x
(8)) = 1t(8) = 1
sgn(b)x2x1
define decision boundary by w · x = 0so that decision boundary. w ?
Hebb’s rule
Now the pattern is on wrong side of the red line. So .
Move the red line so that , by rotating the weight vector :w
w
14
Legendt(µ) = 1
t(µ) = �1x
(8)
x1
x2
x
(8) O(8) = sgn(w · x
(8)) 6= t(8)
O(8) = sgn(w · x
(8)) = t(8)
Hebb’s rule
Now the pattern is on wrong side of the red line. So .
Move the red line so that , by rotating the weight vector :
(small parameter ) so that .
w
w�
� > 0
15
Legendt(µ) = 1
t(µ) = �1
x
(8) O(8) = sgn(w · x
(8)) 6= t(8)
O(8) = sgn(w · x
(8)) = t(8)
w� = w + �x(8) O(8) = sgn(w · x
(8)) = t(8)
x
(8)
x1
x2
Hebb’s rule
Now the pattern is on wrong side of the red line. So .
Move the red line so that , by rotating the weight vector :
(small parameter ) so that .
w
� > 0
w
16
Legendt(µ) = 1
t(µ) = �1
x1
x2
x
(4)
x
(4) O(4) = sgn(w · x
(4)) 6= t(4)
O(4) = sgn(w · x
(4)) = t(4)
O(4) = sgn(w · x
(4)) = t(4)w� = w � �x(4)
Hebb’s rule
Now the pattern is on wrong side of the red line. So .
Move the red line so that , by rotating the weight vector :
(small parameter ) so that .
w
w�
� > 0
17
Legendt(µ) = 1
t(µ) = �1
x
(4) O(4) = sgn(w · x
(4)) 6= t(4)
O(4) = sgn(w · x
(4)) = t(4)
O(4) = sgn(w · x
(4)) = t(4)w� = w � �x(4)
x1
x2
x
(4)
Hebb’s rule
Now the pattern is on wrong side of the red line. So .
Move the red line so that , by rotating the weight vector :
(small parameter ) so that .
Note the minus sign:
where
where
Learning rule (Hebb’s rule)
with
w
w�
� > 0
w0 = w + �w
18
Apply learning rule many times until problem is solved.
Legendt(µ) = 1
t(µ) = �1 t(8) = 1
t(4) = �1
x
(4) O(4) = sgn(w · x
(4)) 6= t(4)
O(4) = sgn(w · x
(4)) = t(4)
O(4) = sgn(w · x
(4)) = t(4)w� = w � �x(4)
w� = w + �x(8)
w� = w � �x(4)
�w = ⌘ t(µ)x
(µ)x1
x2
x
(4)
Example - AND function
Logical AND
The threshold determines intersection of decision boundary with -axis.
Condition for decision boundary: .
Line equation for decision boundary: .
w
w2 = 1
w1 = 1
19
Legendt(µ) = 1
t(µ) = �1
1
⇠1 ⇠2 t
0 0 -1
1 0 -1
0 1 -1
1 1 1
x1
x2
threshold
x2
w · x� ✓ = 0intersection with -axis x2
�
x2 = �w1w2
x1 + ✓w2
x1 x2
x1
x2
Example - XOR function
Logical XOR
20
Legendt(µ) = 1
t(µ) = �1
1
⇠1 ⇠2 t
0 0 -1
1 0 1
0 1 1
1 1 -1
x1
x2
x2x1
Example - XOR function
Logical XOR
This problem is not linearly separable because we cannot separate from by asingle red line.
21
Legendt(µ) = 1
t(µ) = �1
1
⇠1 ⇠2 t
0 0 -1
1 0 1
0 1 1
1 1 -1
x1
x2
x2x1
Example - XOR function
Logical XOR
This problem is not linearly separable because we cannot separate from by asingle red line.
Solution: use two red lines.
22
Legendt(µ) = 1
t(µ) = �1
1
⇠1 ⇠2 t
0 0 -1
1 0 1
0 1 1
1 1 -1
x1
x2
x2x1
Example - XOR function
layer of hidden neurons and (neither input nor output)
all hidden weights equal to
Logical XOR
Two hidden neurons, each one defines one red line.
Together they define three regions, , , and .
We need a third neuron (with weights and to associate region with ,and regions and with .
I II III
W1 W2 II
II IIII
II III
1
11
1
1
1
2
21
V1
V2
V2V1
O = sgn(V1 � V2 � 1)
1
�1
threshold 23
Legendt(µ) = 1
t(µ) = �1
1
⇠1 ⇠2 t
0 0 -1
1 0 1
0 1 1
1 1 -1
x1
x2
x1
x2
x2x1
Non-(linearly) separable problems
Solve problems that are not linearly separable with a hidden layer of neurons
Four hidden neurons - one for each red line-segment. Move the red lines into the correct configuration by repeatedly using Hebb’s rule until the problem is solved(a fifth neuron assigns regions and solves the classification problem).
24
Legendt(µ) = 1
t(µ) = �1
x1
x2
x1
x2
Generalisation
Train the network on a training set : move red lines into the correct configuration by repeatedly applying Hebb’s rule to adjust all weights. Usually many iterations necessary.
Once all red lines are in the right place (all weights determined), apply network to a newdata set. If the training set was reliable, then the network has learnt to classify the new data,it has learnt to generalise.
25
Legendt(µ) = 1
t(µ) = �1
x1
x2
x1
x2
(x(µ), t(µ)), µ = 1, . . . , p
Training
Train the network on a training set : move red lines into the correct configuration by repeatedly applying Hebb’s rule to adjust all weights. Usually many iterations necessary.
Once all red lines are in the right place (all weights determined), apply network to a newdata set. If the training set was reliable, then the network has learnt to classify the new data,it has learnt to generalise.
26
Legendt(µ) = 1
t(µ) = �1
x1
x2
x1
x2
(x(µ), t(µ)), µ = 1, . . . , p
Training
Train the network on a training set : move red lines into the correct configuration by repeatedly applying Hebb’s rule to adjust all weights. Usually many iterations necessary.
Usually a small number of errors is acceptable. It is often not meaningful to try to fine-tune very precisely (input patterns are subject to noise).
error
27
Legendt(µ) = 1
t(µ) = �1
x1
x2
x1
x2
(x(µ), t(µ)), µ = 1, . . . , p
Overfitting
Train the network on a training set : move red lines into the correct configuration by repeatedly applying Hebb’s rule to adjust all weights. Usually many iterations necessary.
Here: used hidden neurons to fit decision boundary very precisely.
But often the inputs are affected by noise, which can be very different in different data sets.
28
15
training data set
Legendt(µ) = 1
t(µ) = �1
x1
x2
(x(µ), t(µ)), µ = 1, . . . , p
Overfitting
Train the network on a training set : move red lines into the correct configuration by repeatedly applying Hebb’s rule to adjust all weights. Usually many iterations necessary.
Here: used hidden neurons to fit decision boundary very precisely.
But often the inputs are affected by noise, which can be very different in different data sets.
Here the network just learnt noise in the training set (overfitting). Overfitting is a huge problem because networks usually have many parameters (weights and thresholds).
29
15
test data set
Legendt(µ) = 1
t(µ) = �1
x1
x2
(x(µ), t(µ)), µ = 1, . . . , p
Multilayer perceptrons
All examples had two-dimensional input space and one-dimensional outputs.So that we could draw the problem and understand its geometrical nature.
In reality problems are often high-dimensional. For example image recognition: input dimension . Output dimension: number of classes to be recognised (car, lorry, road sign, person, bicycle,...)
Multilayer perceptron: network with many hidden layers
Problems: (i) overfitting severe if many weights (ii) nets with many layers learn slowly
Questions: (i) how many hidden layers necessary? (ii) how many hidden layers efficient?
N = #pixels� 3
hidden layersnumber , , and
30
�� 1 � � + 1
... ...
How many hidden layers?
All Boolean functions with inputs can be trained/learned with a single hidden layer.
Proof by construction. Requires neurons in hidden layer.
For large , this architecture is not practicalbecause the number of neurons increasesexponentially with .
Example: parity function. More efficient layout: build network using XOR units. Requires only units.
But such deep networksare in general hard to train.
N
2N
N
N
hidden layers
31
inputs outputs
� N
Parity function: if oddnumber of inputs equal ,otherwise .
O = 11
O = 0
Deep learning
Deep learning. Perceptrons withlarge numbers of layers (more than two). Neural nets recognise objects in images with very high accuracy.
Better than Humans?
Neural nets have been around since the 80ies.
It was always thought that they are very difficult to train.
Why does it suddenly work so well?
32
neural nets
estimate ofHuman error
Image classification interface for Humans
33
cs.stanford.edu/people/karpathy/ilsvrc
Better training sets
To obtain training set: must collect and manually annotate data. Expensive.Challenges: (i) efficient data collection and annotation (ii) require large variety in collected data
Example: Image-net (publicly available data set): images, classified into classes.
> 107 103
image-net.org
34
Manual annotation of training data
35
xkcd.com/1897
reCAPTCHA
Two goals. (i) protects websites from bot (ii) users who click correctly provide correct annotation.
github.com/google/recaptcha
36
Better layouts: convolutional networks
Convolutional network. Each neuron infirst hidden layer is connected to a smallregion of inputs. Here pixels.
Slide the region over input image. Use sameweights for all hidden neurons. Update
- detects the same local feature everywhere in input image (edge, corner,...). Feature map.
- the form of the sum is called convolution
- less overfitting because fewer weights
- use several feature maps to detect different features in input image
1
• • • � � � � � � � �• • • � � � � � � � �• • • � � � � � � � �� � � � � � � � � � �� � � � � � � � � � �� � � � � � � � � � �� � � � � � � � � � �� � � � � � � � � � �� � � � � � � � � � �� � � � � � � � � � �
1
• � � � � � � � �� � � � � � � � �� � � � � � � � �� � � � � � � � �� � � � � � � � �� � � � � � � � �� � � � � � � � �� � � � � � � � �1
� • • • � � � � � � �� • • • � � � � � � �� • • • � � � � � � �� � � � � � � � � � �� � � � � � � � � � �� � � � � � � � � � �� � � � � � � � � � �� � � � � � � � � � �� � � � � � � � � � �� � � � � � � � � � �
1
� • � � � � � � �� � � � � � � � �� � � � � � � � �� � � � � � � � �� � � � � � � � �� � � � � � � � �� � � � � � � � �� � � � � � � � �
3� 3
Vmn
inputs hidden
V11
V12
10� 10
10� 10
10� 10
8� 8
8� 8
8� 8� 4
37
Vmn = g� 3�
k=1
3�
l=1
wklxm+k�1,n+l�1
�
Convolutional networks
- max-pooling: take maximum of over small region ( ) to reduce # of parameters
- add fully connected layers to learn more abstract features
But not new!
Vmn
inputs hidden
2� 2
8� 8� 410� 10 max-pooling
4� 4� 4
38
outputs
Krizhevsky, Sutskever & Hinton (2012)
Patterns detected by convolutional net in 1986 Rumelhart, Hinton & Williams (1986)
Object detection
Object detection from video frames using Yolo-9000
Film recorded with an Iphone on my way home from work.
Tiny number of weights (so that the algorithm runs on my laptop).
Pretrained on Pascal VOC data set.
github.com/philipperemy/yolo-9000
39
host.robots.ox.ac.uk/pascal/VOC/
Recognising NIST digits
National Institute of Standard (NIST) in USA has a database of hand-written digits.
Convolutional network trained on the M-NIST data set (60 000 digits) classifies digits in the corresponding test set with high accuracy. Only 23 out of 10 000 wrong.
But the convolutional net fails on our own digits!
Convolutional nets excel at learning patterns in a given input distribution very precisely.But they may fail if the distribution is slightly distorted.
40
http://yann.lecun.com/exdb/mnist/
Ciresan, Meier & Schmidhuber, arxiv:1202.2745
O. Balabanov (2018)
Further problems
Convolutional nets do not understand what they see in the same way as Humans do.
Convolutional net misclassifies slightly distorted image with high confidence
Convolutional net misclassifies noise with certainty.
The nearest decision boundary is very Translational-invariant nature ofclose. Not intuitive but possible in very convolutional layout makes ithigh-dimensional input space. difficult to learn global features.
41
original image correctlyclassified as car
slightly distorted imageclassified as ostrich
Szegedy et al. arxiv:1312.6199
99.6%
classified as leopard
Khurshudov, blog (2015)
Nguyen et al. arxiv:1412.61897
Conclusions
Artificial neural networks were already studied in the 80ies.
In the last few years: deep-learning revolution due to - better training sets - better hardware (GPUs, dedicated chips) - improved algorithms for object detection Applications: google, facebook, tesla, zenuity,... .
Perspectives - close connections to statistical physics (spin glasses, random magnets) - related methods in mathematical statistics (big data analysis) - unsupervised learning (clustering and familiarity of input data)
But convolutional nets do not understand what they learn in the way Humans do.
Literature: lecture notes Artificial neural networks FFR 135 (Chalmers) FIM720 (GU).
42
B. Mehlig, physics.gu.se/~frtbm
OpenTAStellan Östlund & Hampus Linander
43