
Deep learning and neural networks


B. Mehlig, Department of Physics, University of Gothenburg, Sweden
FFR135/FIM720 Artificial Neural Networks, Chalmers/Gothenburg University, 7.5 credits

Thanks to E. Werner, H. Linander & O. Balabanov.

Neurons in the cerebral cortex

Neurons in the cerebral cortex (the outer layer of the cerebrum, the largest and best-developed part of the human brain).

[Figure: micrographs of cortical neurons, brainmaps.org]


Neuron anatomy and activity

[Figure: schematic drawing of a neuron, with dendrites (input), cell body (process), and axon (output); scale bar 10 µm; total length of dendrites up to … cm]

Output of a neuron: spike train (active/inactive as a function of time).
Spike train in an electrosensory pyramidal neuron in fish (Eigenmannia). Gabbiani & Metzner, J. Exp. Biol. 202 (1999) 1267

McCulloch-Pitts neuron

Simple model for a neuron. Signal processing: weighted sum of the inputs, followed by an activation function $g$.

Incoming signals $x_j$, $j = 1, \ldots, N$; weights (synaptic couplings) $w_{ij}$; threshold $\theta_i$; $i$ is the neuron number. Output of neuron $i$:

$O_i = g(b_i)$, with local field $b_i = \sum_{j=1}^{N} w_{ij} x_j - \theta_i$,

so that

$O_i = g\Big(\sum_{j=1}^{N} w_{ij} x_j - \theta_i\Big) = g(w_{i1} x_1 + \ldots + w_{iN} x_N - \theta_i)$.

McCulloch & Pitts, Bull. Math. Biophys. 5 (1943) 115

Activation function

Signal processing of the McCulloch-Pitts neuron: weighted sum of the inputs $x_j$ with weights $w_{ij}$ and threshold $\theta_i$, passed through an activation function $g$: output $O_i = g(b_i)$ with local field $b_i = \sum_{j=1}^{N} w_{ij} x_j - \theta_i$.

Common choices for the activation function $g(b)$:

- step function $\theta_H(b)$ (0 for $b < 0$, 1 for $b \ge 0$)
- signum function $\mathrm{sgn}(b)$ (values $-1$ and $+1$)
- sigmoid function $g(b) = \dfrac{1}{1 + e^{-b}}$
- ReLU function $g(b) = \max(0, b)$
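For reference, the four activation functions listed above can be written in a few lines of NumPy (a sketch; the convention sgn(0) = +1 is my choice):

```python
import numpy as np

def step(b):                          # Heaviside step: 0 for b < 0, 1 for b >= 0
    return np.where(b >= 0, 1.0, 0.0)

def signum(b):                        # signum function, values -1 and +1 (sgn(0) taken as +1)
    return np.where(b >= 0, 1.0, -1.0)

def sigmoid(b):                       # sigmoid g(b) = 1 / (1 + exp(-b))
    return 1.0 / (1.0 + np.exp(-b))

def relu(b):                          # ReLU g(b) = max(0, b)
    return np.maximum(0.0, b)

b = np.linspace(-3.0, 3.0, 7)
for g in (step, signum, sigmoid, relu):
    print(g.__name__, g(b))
```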

Highly simplified idealisation of a neuron

Take the activation function to be the signum function $\mathrm{sgn}(b_i)$. The output of neuron $i$ can then assume only two values, $\pm 1$, representing the level of activity of the neuron ($+1$ active, $-1$ inactive).

This is an idealisation:

- real neurons fire spike trains
- switching active ↔ inactive can be delayed (time delay)
- real neurons may perform non-linear summation of inputs
- real neurons are subject to noise (stochastic dynamics)

Neural networks

Connect neurons into networks that can perform computing tasks: for example object identification, video analysis, speech recognition, classification, data clustering, ...

Examples of layouts: simple perceptrons, two-layer perceptron, recurrent network, Boltzmann machine.

The term perceptron was coined in 1958.
Rosenblatt, Psychological Review 65 (1958) 386

A classification task

Input patterns $\mathbf{x}^{(\mu)}$. The index $\mu = 1, \ldots, p$ labels the different patterns. Each pattern has two components, $x_1^{(\mu)}$ and $x_2^{(\mu)}$. Arrange the components into a vector,

$\mathbf{x}^{(\mu)} = \begin{bmatrix} x_1^{(\mu)} \\ x_2^{(\mu)} \end{bmatrix}$.

The pattern $\mathbf{x}^{(8)}$ is shown in the Figure. The patterns fall into two classes: one on the left and one on the right.

Draw a red line (a decision boundary) to distinguish the two types of patterns.

Aim: train a neural network to compute the decision boundary. To do this, define target values: $t^{(\mu)} = 1$ for the one class and $t^{(\mu)} = -1$ for the other.

Training set: $(\mathbf{x}^{(\mu)}, t^{(\mu)})$, $\mu = 1, \ldots, p$.

Simple perceptron

Simple perceptron: one neuron with two input terminals $x_1$ and $x_2$. Activation function $\mathrm{sgn}(b)$. No threshold, $\theta = 0$. Output:

$O = \mathrm{sgn}(w_1 x_1 + w_2 x_2) = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x})$.

Vectors: $\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = |\mathbf{w}| \begin{bmatrix} \cos\beta \\ \sin\beta \end{bmatrix}$ and $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = |\mathbf{x}| \begin{bmatrix} \cos\alpha \\ \sin\alpha \end{bmatrix}$, where $|\mathbf{w}|$ denotes the length of the vector $\mathbf{w}$.

Scalar product between the vectors:

$\mathbf{w} \cdot \mathbf{x} = w_1 x_1 + w_2 x_2 = |\mathbf{w}|\,|\mathbf{x}| (\cos\alpha \cos\beta + \sin\alpha \sin\beta) = |\mathbf{w}|\,|\mathbf{x}| \cos(\alpha - \beta)$.

So $\mathbf{w} \cdot \mathbf{x} = |\mathbf{w}|\,|\mathbf{x}| \cos\varphi$, where $\varphi$ is the angle between $\mathbf{x}$ and $\mathbf{w}$.

Aim: adjust the weights so that the network outputs the correct target value for all patterns:

$O^{(\mu)} = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x}^{(\mu)}) = t^{(\mu)}$ for $\mu = 1, \ldots, p$.
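A small sketch (the weight vector and the pattern below are arbitrary example values) of the perceptron output $O = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x})$ and of the identity $\mathbf{w} \cdot \mathbf{x} = |\mathbf{w}|\,|\mathbf{x}| \cos\varphi$:

```python
import numpy as np

w = np.array([1.0, 0.5])                         # weight vector
x = np.array([0.8, -0.3])                        # input pattern

O = np.sign(np.dot(w, x))                        # perceptron output, no threshold (theta = 0)

# Angle phi between w and x, and check of w . x = |w| |x| cos(phi)
phi = np.arccos(np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x)))
print(O, np.dot(w, x), np.linalg.norm(w) * np.linalg.norm(x) * np.cos(phi))
```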

Geometrical solution

Simple perceptron: one neuron with two input terminals $x_1$ and $x_2$. Activation function $\mathrm{sgn}(b)$. No threshold, $\theta = 0$. Output $O = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x})$ with $\mathbf{w} \cdot \mathbf{x} = |\mathbf{w}|\,|\mathbf{x}| \cos\varphi$, where $\varphi$ is the angle between $\mathbf{x}$ and $\mathbf{w}$.

Aim: adjust the weights so that the network outputs the correct target values for all patterns: $O^{(\mu)} = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x}^{(\mu)}) = t^{(\mu)}$ for $\mu = 1, \ldots, p$.

Solution: define the decision boundary by $\mathbf{w} \cdot \mathbf{x} = 0$, so that $\mathbf{w} \perp$ decision boundary.

Check: $\mathbf{w} \cdot \mathbf{x}^{(8)} > 0$, so $O^{(8)} = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x}^{(8)}) = 1$. Correct, since $t^{(8)} = 1$.

(Legend in the Figures: the two classes are $t^{(\mu)} = 1$ and $t^{(\mu)} = -1$.)

Hebb’s rule

Now the pattern $\mathbf{x}^{(8)}$ (with $t^{(8)} = 1$) is on the wrong side of the red line, so $O^{(8)} = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x}^{(8)}) \neq t^{(8)}$.

Move the red line so that $O^{(8)} = \mathrm{sgn}(\mathbf{w}' \cdot \mathbf{x}^{(8)}) = t^{(8)}$, by rotating the weight vector $\mathbf{w}$:

$\mathbf{w}' = \mathbf{w} + \eta\,\mathbf{x}^{(8)}$ (small parameter $\eta > 0$).

If instead the pattern $\mathbf{x}^{(4)}$ (with $t^{(4)} = -1$) is on the wrong side of the red line, $O^{(4)} = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x}^{(4)}) \neq t^{(4)}$, rotate the weight vector the other way:

$\mathbf{w}' = \mathbf{w} - \eta\,\mathbf{x}^{(4)}$, so that $O^{(4)} = \mathrm{sgn}(\mathbf{w}' \cdot \mathbf{x}^{(4)}) = t^{(4)}$.

Note the minus sign: $\mathbf{w}' = \mathbf{w} + \eta\,\mathbf{x}^{(8)}$ where $t^{(8)} = 1$, but $\mathbf{w}' = \mathbf{w} - \eta\,\mathbf{x}^{(4)}$ where $t^{(4)} = -1$.

Learning rule (Hebb’s rule): $\mathbf{w}' = \mathbf{w} + \delta\mathbf{w}$ with $\delta\mathbf{w} = \eta\, t^{(\mu)} \mathbf{x}^{(\mu)}$.

Apply the learning rule many times, until the problem is solved.
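A minimal training loop implementing the learning rule $\delta\mathbf{w} = \eta\, t^{(\mu)} \mathbf{x}^{(\mu)}$. The update is applied to misclassified patterns, as in the example above; the toy data set is my own (linearly separable, with a decision boundary through the origin since $\theta = 0$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: p linearly separable patterns with targets t = +/-1.
p = 20
x = rng.uniform(-1.0, 1.0, size=(p, 2))
t = np.where(x[:, 0] + x[:, 1] > 0, 1.0, -1.0)

w = rng.normal(size=2)                    # initial weight vector
eta = 0.1                                 # small learning rate eta > 0

for epoch in range(100):
    errors = 0
    for mu in range(p):
        if np.sign(np.dot(w, x[mu])) != t[mu]:   # pattern on the wrong side of the red line
            w = w + eta * t[mu] * x[mu]          # Hebb's rule: dw = eta * t * x
            errors += 1
    if errors == 0:                       # all patterns classified correctly
        break

print("weights:", w, " epochs used:", epoch + 1)
```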

Example - AND function

Logical AND:

x_1  x_2   t
 0    0   -1
 1    0   -1
 0    1   -1
 1    1    1

The threshold $\theta$ determines the intersection of the decision boundary with the $x_2$-axis.

Condition for the decision boundary: $\mathbf{w} \cdot \mathbf{x} - \theta = 0$.

Line equation for the decision boundary: $x_2 = -\dfrac{w_1}{w_2}\, x_1 + \dfrac{\theta}{w_2}$.

For example, $w_1 = 1$ and $w_2 = 1$ (with a suitable threshold $\theta$) solve the problem.
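To check the AND construction numerically: the weights $w_1 = w_2 = 1$ are as above, while the threshold value $\theta = 1.5$ is my own choice (any value between 1 and 2 places the decision boundary correctly):

```python
import numpy as np

w = np.array([1.0, 1.0])                  # w1 = w2 = 1
theta = 1.5                               # assumed threshold, 1 < theta < 2

patterns = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
targets = np.array([-1, -1, -1, 1])

for x, t in zip(patterns, targets):
    O = int(np.sign(np.dot(w, x) - theta))    # O = sgn(w . x - theta)
    print(x, "->", O, " target:", t)
```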

Example - XOR function

Logical XOR:

x_1  x_2   t
 0    0   -1
 1    0    1
 0    1    1
 1    1   -1

This problem is not linearly separable, because we cannot separate the patterns with $t^{(\mu)} = 1$ from those with $t^{(\mu)} = -1$ by a single red line.

Solution: use two red lines.

Example - XOR function

Introduce a layer of hidden neurons $V_1$ and $V_2$ (neither input nor output neurons), with all hidden weights equal to 1.

Two hidden neurons, each one defines one red line. Together they define three regions: I, II, and III.

We need a third neuron (with weights $W_1$ and $W_2$ and a threshold) to associate region II with $t = 1$, and regions I and III with $t = -1$:

$O = \mathrm{sgn}(V_1 - V_2 - 1)$.
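A sketch of the two-layer construction. The output neuron $O = \mathrm{sgn}(V_1 - V_2 - 1)$ and the hidden weights equal to 1 follow the slide; the hidden thresholds $1/2$ and $3/2$ are my assumption for where the two red lines are placed:

```python
import numpy as np

def sgn(b):
    return np.where(b >= 0, 1.0, -1.0)    # sgn(0) taken as +1

def xor_net(x1, x2):
    # Hidden layer: two neurons, each defining one red line (weights 1, assumed thresholds).
    V1 = sgn(x1 + x2 - 0.5)
    V2 = sgn(x1 + x2 - 1.5)
    # Output neuron: weights +1 and -1, threshold 1 (as on the slide).
    return sgn(V1 - V2 - 1.0)

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print((x1, x2), "->", int(xor_net(x1, x2)))   # -1, 1, 1, -1
```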

Non-(linearly) separable problems

Solve problems that are not linearly separable with a hidden layer of neurons.

Four hidden neurons, one for each red line-segment. Move the red lines into the correct configuration by repeatedly using Hebb’s rule until the problem is solved (a fifth neuron assigns the regions and solves the classification problem).

Generalisation

Train the network on a training set $(\mathbf{x}^{(\mu)}, t^{(\mu)})$, $\mu = 1, \ldots, p$: move the red lines into the correct configuration by repeatedly applying Hebb’s rule to adjust all weights. Usually many iterations are necessary.

Once all red lines are in the right place (all weights determined), apply the network to a new data set. If the training set was reliable, then the network has learnt to classify the new data: it has learnt to generalise.


Training

The network is trained as above, by repeatedly applying Hebb’s rule to adjust all weights until the training set is classified correctly.

Usually a small number of errors is acceptable. It is often not meaningful to try to fine-tune the decision boundary very precisely, because the input patterns are subject to noise.

Overfitting

Here, 15 hidden neurons were used to fit the decision boundary of the training data set very precisely. But often the inputs are affected by noise, which can be very different in different data sets: compare the training data set with a test data set.

Here the network has just learnt the noise in the training set (overfitting). Overfitting is a huge problem, because networks usually have many parameters (weights and thresholds).

Multilayer perceptrons

All examples so far had a two-dimensional input space and one-dimensional outputs, so that we could draw the problem and understand its geometrical nature.

In reality, problems are often high-dimensional. For example image recognition: input dimension $N = \#\text{pixels} \times 3$. Output dimension: the number of classes to be recognised (car, lorry, road sign, person, bicycle, ...).

Multilayer perceptron: a network with many hidden layers (layers number $\ell - 1$, $\ell$, $\ell + 1$, ...).

Problems: (i) overfitting is severe if there are many weights, (ii) nets with many layers learn slowly.

Questions: (i) how many hidden layers are necessary? (ii) how many hidden layers are efficient?

How many hidden layers?

All Boolean functions with $N$ inputs can be trained/learned with a single hidden layer.

Proof by construction; requires $2^N$ neurons in the hidden layer. For large $N$, this architecture is not practical, because the number of neurons increases exponentially with $N$.

Example: the parity function ($O = 1$ if an odd number of inputs equal 1, otherwise $O = 0$). A more efficient layout: build the network out of XOR units, with several hidden layers. This requires only of order $N$ units. But such deep networks are in general hard to train.

Deep learning

Deep learning: perceptrons with large numbers of layers (more than two). Neural nets recognise objects in images with very high accuracy. Better than humans?

Neural nets have been around since the 1980s. It was always thought that they are very difficult to train. Why does it suddenly work so well?

[Figure: image-classification error of neural nets over time, compared with an estimate of the human error rate.]

Image classification interface for humans

cs.stanford.edu/people/karpathy/ilsvrc

Better training sets

To obtain a training set one must collect and manually annotate data. This is expensive. Challenges: (i) efficient data collection and annotation, (ii) a large variety in the collected data is required.

Example: ImageNet (a publicly available data set): more than $10^7$ images, classified into $10^3$ classes.

image-net.org

Manual annotation of training data


xkcd.com/1897

reCAPTCHA

Two goals: (i) protect websites from bots, (ii) users who click correctly provide correct annotations.

github.com/google/recaptcha

Better layouts: convolutional networks

Convolutional network. Each neuron in the first hidden layer is connected to a small region of the inputs, here $3 \times 3$ pixels.

Slide the region over the input image and use the same weights for all hidden neurons. Update:

$V_{mn} = g\Big( \sum_{k=1}^{3} \sum_{l=1}^{3} w_{kl}\, x_{m+k-1,\, n+l-1} \Big)$

- detects the same local feature everywhere in the input image (edge, corner, ...): a feature map
- the form of the sum is called a convolution
- less overfitting, because there are fewer weights
- use several feature maps to detect different features in the input image

[Figure: a $3 \times 3$ region slid over a $10 \times 10$ input image; each position gives one hidden neuron $V_{11}, V_{12}, \ldots$ With four feature maps, the $10 \times 10$ inputs map to an $8 \times 8 \times 4$ hidden layer.]
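The update formula can be written directly in NumPy. A sketch with a $10 \times 10$ input and a $3 \times 3$ kernel as in the slide (the image, the weights, and the choice of $g$ are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(10, 10))     # 10 x 10 input image
w = rng.normal(size=(3, 3))       # 3 x 3 kernel; the same weights for all hidden neurons
g = np.tanh                       # some activation function g

# V_mn = g( sum_{k=1..3} sum_{l=1..3} w_kl * x_{m+k-1, n+l-1} )
V = np.zeros((8, 8))              # one 8 x 8 feature map
for m in range(8):
    for n in range(8):
        V[m, n] = g(np.sum(w * x[m:m + 3, n:n + 3]))

print(V.shape)                    # (8, 8)
```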

Convolutional networks

- max-pooling: take the maximum of $V_{mn}$ over a small region ($2 \times 2$) to reduce the number of parameters
- add fully connected layers to learn more abstract features

[Figure: $10 \times 10$ inputs → $8 \times 8 \times 4$ hidden neurons (convolution) → $4 \times 4 \times 4$ (max-pooling) → fully connected layers → outputs.]

Krizhevsky, Sutskever & Hinton (2012)

But this is not new! Patterns detected by a convolutional net in 1986: Rumelhart, Hinton & Williams (1986)
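A $2 \times 2$ max-pooling step in NumPy (a sketch; it reduces an $8 \times 8$ feature map to $4 \times 4$, as in the layout above):

```python
import numpy as np

def max_pool_2x2(V):
    """Maximum of V over non-overlapping 2 x 2 regions."""
    m, n = V.shape
    return V.reshape(m // 2, 2, n // 2, 2).max(axis=(1, 3))

V = np.arange(64, dtype=float).reshape(8, 8)   # stand-in for an 8 x 8 feature map
print(max_pool_2x2(V).shape)                   # (4, 4)
```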

Object detection

Object detection from video frames using Yolo-9000. Film recorded with an iPhone on my way home from work. Tiny number of weights (so that the algorithm runs on my laptop). Pretrained on the Pascal VOC data set.

github.com/philipperemy/yolo-9000
host.robots.ox.ac.uk/pascal/VOC/

Recognising NIST digits

The National Institute of Standards and Technology (NIST) in the USA has a database of hand-written digits.

A convolutional network trained on the MNIST data set (60 000 digits) classifies the digits in the corresponding test set with high accuracy: only 23 out of 10 000 wrong. But the convolutional net fails on our own digits!

Convolutional nets excel at learning patterns in a given input distribution very precisely, but they may fail if the distribution is slightly distorted.

http://yann.lecun.com/exdb/mnist/
Ciresan, Meier & Schmidhuber, arxiv:1202.2745
O. Balabanov (2018)
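As an illustration only (this is not the architecture of the cited work), a minimal PyTorch sketch of a small convolutional network for $28 \times 28$ MNIST digits:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3),   # 1 input channel -> 4 feature maps, 3 x 3 kernels
    nn.ReLU(),
    nn.MaxPool2d(2),                  # 2 x 2 max-pooling: 26 x 26 -> 13 x 13
    nn.Flatten(),
    nn.Linear(4 * 13 * 13, 10),       # fully connected layer -> 10 digit classes
)

x = torch.randn(8, 1, 28, 28)         # a batch of 8 dummy digit images
print(model(x).shape)                 # torch.Size([8, 10])
```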

Further problems

Convolutional nets do not understand what they see in the way humans do.

A convolutional net misclassifies a slightly distorted image with high confidence: an original image is correctly classified as a car, but the slightly distorted image is classified as an ostrich. The nearest decision boundary is very close. Not intuitive, but possible in a very high-dimensional input space.
Szegedy et al., arxiv:1312.6199

A convolutional net misclassifies noise with certainty: noise classified as a leopard with 99.6% confidence. The translation-invariant nature of the convolutional layout makes it difficult to learn global features.
Khurshudov, blog (2015); Nguyen et al., arxiv:1412.1897

Conclusions

Artificial neural networks were already studied in the 1980s.

In the last few years there has been a deep-learning revolution, due to
- better training sets
- better hardware (GPUs, dedicated chips)
- improved algorithms for object detection
Applications: Google, Facebook, Tesla, Zenuity, ...

Perspectives:
- close connections to statistical physics (spin glasses, random magnets)
- related methods in mathematical statistics (big-data analysis)
- unsupervised learning (clustering and familiarity of input data)

But convolutional nets do not understand what they learn in the way humans do.

Literature: lecture notes, Artificial Neural Networks FFR135 (Chalmers) / FIM720 (GU).
B. Mehlig, physics.gu.se/~frtbm

OpenTA: Stellan Östlund & Hampus Linander