Lecture Slides for

INTRODUCTION TO Machine Learning

ETHEM ALPAYDIN
© The MIT Press, 2004

[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml

CHAPTER 11:

Multilayer Perceptrons

Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)

Neural Networks

Networks of processing units (neurons) with connections (synapses) between them

Large number of neurons: $10^{10}$

Large connectivity: $10^5$

Parallel processing

Distributed computation/memory

Robust to noise, failures


Understanding the Brain

Levels of analysis (Marr, 1982)
1. Computational theory
2. Representation and algorithm
3. Hardware implementation

Reverse engineering: From hardware to theory

Parallel processing: SIMD vs MIMD

Neural net: SIMD with modifiable local memory

Learning: Update by training/experience


Perceptron

(Rosenblatt, 1962)

$$ y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x} $$

$$ \mathbf{w} = [w_0, w_1, \ldots, w_d]^T $$

$$ \mathbf{x} = [1, x_1, \ldots, x_d]^T $$
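A minimal sketch of the weighted sum above, assuming NumPy; the helper name `perceptron_output` and the numeric values are illustrative and not from the slides. The input is augmented with the bias unit $x_0 = +1$ so that the output is a single dot product.

```python
import numpy as np

def perceptron_output(w, x):
    """Return y = w^T x, with w = [w_0, w_1, ..., w_d] and x augmented by x_0 = +1."""
    x_aug = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    return float(np.dot(w, x_aug))

# Illustrative values (assumed): w_0 = 0.5, w_1 = 2, w_2 = -1
w = np.array([0.5, 2.0, -1.0])
y = perceptron_output(w, [3.0, 4.0])  # 0.5 + 2*3 - 1*4 = 2.5
```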


What a Perceptron Does

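Regression: $y = wx + w_0$

Classification: $y = 1(wx + w_0 > 0)$

[Figure: perceptron with input $x$, bias unit $x_0 = +1$, weights $w$ and $w_0$, and output $y$; shown with linear, threshold, and sigmoid ($s$) outputs]

$$ o = \mathbf{w}^T \mathbf{x} $$

$$ y = \text{sigmoid}(o) = \frac{1}{1 + \exp[-\mathbf{w}^T \mathbf{x}]} $$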


K Outputs

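Regression:

$$ y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = \mathbf{w}_i^T \mathbf{x}, \qquad \mathbf{y} = \mathbf{W}\mathbf{x} $$

Classification:

$$ o_i = \mathbf{w}_i^T \mathbf{x} $$

$$ y_i = \frac{\exp o_i}{\sum_k \exp o_k} $$

$$ \text{choose } C_i \text{ if } y_i = \max_k y_k $$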
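The K-output classification case can be sketched as follows, assuming NumPy. The function names and the weight values are hypothetical, and subtracting the maximum score inside the softmax is a standard numerical-stability step that is not on the slide.

```python
import numpy as np

def softmax(o):
    """y_i = exp(o_i) / sum_k exp(o_k); the max is subtracted only for numerical stability."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

def k_output_perceptron(W, x):
    """W has shape (K, d+1); column 0 holds the bias weights w_i0."""
    x_aug = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    o = W @ x_aug                     # o_i = w_i^T x
    y = softmax(o)
    return o, y, int(np.argmax(y))    # choose C_i with the largest y_i

# Illustrative weights for K = 3 classes and d = 2 inputs (assumed values)
W = np.array([[0.0,  1.0,  0.0],
              [0.0,  0.0,  1.0],
              [0.5, -1.0, -1.0]])
o, y, chosen = k_output_perceptron(W, [2.0, 1.0])
```

Because the softmax is monotonic in the scores, the chosen class is also the one with the largest $o_i$.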


Training

Online (instances seen one by one) vs batch (whole sample) learning:

No need to store the whole sample

Problem may change in time

Wear and degradation in system components

Stochastic gradient-descent: Update after a single pattern

Generic update rule (LMS rule):

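$$ \text{Update} = \text{LearningFactor} \cdot (\text{DesiredOutput} - \text{ActualOutput}) \cdot \text{Input} $$

$$ \Delta w_{ij}^t = \eta \, (r_i^t - y_i^t) \, x_j^t $$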


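Training a Perceptron: Regression

Regression (Linear output):

$$ E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \frac{1}{2}(r^t - y^t)^2 = \frac{1}{2}\left[r^t - (\mathbf{w}^T \mathbf{x}^t)\right]^2 $$

$$ \Delta w_j^t = \eta \, (r^t - y^t) \, x_j^t $$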
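The online update for a linear output can be sketched as a short stochastic gradient-descent loop, assuming NumPy. The function name, learning rate, epoch count, and the noise-free toy data $r = 2x + 1$ are all assumptions chosen for illustration.

```python
import numpy as np

def train_linear_perceptron(X, r, eta=0.05, epochs=200, seed=0):
    """Online training: w_j += eta * (r^t - y^t) * x_j^t after each single pattern."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    X_aug = np.hstack([np.ones((N, 1)), X])       # prepend x_0 = +1
    w = rng.uniform(-0.01, 0.01, size=d + 1)      # small random initial weights
    for _ in range(epochs):
        for t in rng.permutation(N):              # visit patterns in random order
            y = w @ X_aug[t]                      # linear output y^t = w^T x^t
            w += eta * (r[t] - y) * X_aug[t]
    return w

# Noise-free toy data (assumed): r = 2x + 1
X = np.linspace(-1.0, 1.0, 20).reshape(-1, 1)
r = 2.0 * X[:, 0] + 1.0
w = train_linear_perceptron(X, r)                 # w approaches [w_0, w_1] = [1, 2]
```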


Classification

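Single sigmoid output

$$ y^t = \text{sigmoid}(\mathbf{w}^T \mathbf{x}^t) $$

$$ E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = -r^t \log y^t - (1 - r^t) \log(1 - y^t) $$

$$ \Delta w_j^t = \eta \, (r^t - y^t) \, x_j^t $$

K>2 softmax outputs

$$ y_i^t = \frac{\exp \mathbf{w}_i^T \mathbf{x}^t}{\sum_k \exp \mathbf{w}_k^T \mathbf{x}^t} $$

$$ E^t(\{\mathbf{w}_i\}_i \mid \mathbf{x}^t, \mathbf{r}^t) = -\sum_i r_i^t \log y_i^t $$

$$ \Delta w_{ij}^t = \eta \, (r_i^t - y_i^t) \, x_j^t $$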
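One way to check the softmax update rule is to compare it against a finite-difference gradient of the cross-entropy error. The sketch below assumes NumPy; the function names, random weights, and step size `eps` are illustrative choices, not part of the slides.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def cross_entropy(W, x, r):
    """E = -sum_i r_i log y_i for one pattern."""
    return -np.sum(r * np.log(softmax(W @ x)))

def analytic_grad(W, x, r):
    """dE/dw_ij = -(r_i - y_i) x_j, so the update is Delta w_ij = eta (r_i - y_i) x_j."""
    y = softmax(W @ x)
    return -np.outer(r - y, x)

def numeric_grad(W, x, r, eps=1e-6):
    """Central-difference estimate of dE/dW, one weight at a time."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            G[i, j] = (cross_entropy(Wp, x, r) - cross_entropy(Wm, x, r)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))                       # K = 3 classes, d = 3 inputs plus bias
x = np.concatenate(([1.0], rng.normal(size=3)))   # x_0 = +1
r = np.array([0.0, 1.0, 0.0])                     # one-hot desired output
max_diff = np.max(np.abs(analytic_grad(W, x, r) - numeric_grad(W, x, r)))
```

A small `max_diff` indicates that the analytic gradient and the numerical estimate agree.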


Learning Boolean AND
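A threshold perceptron can represent AND because the two classes are linearly separable. The sketch below assumes NumPy and uses one possible set of weights, $w_0 = -1.5$, $w_1 = w_2 = 1$, chosen for illustration; any weights defining a separating line would do.

```python
import numpy as np

def threshold_perceptron(w, x):
    """y = 1(w^T x > 0), with the bias unit x_0 = +1 prepended to x."""
    return int(w @ np.concatenate(([1.0], x)) > 0)

w_and = np.array([-1.5, 1.0, 1.0])   # [w_0, w_1, w_2], one possible choice
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [threshold_perceptron(w_and, np.array(x, dtype=float)) for x in inputs]
```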


XOR

No $w_0, w_1, w_2$ satisfy:

$$
\begin{aligned}
w_0 &\le 0 \\
w_2 + w_0 &> 0 \\
w_1 + w_0 &> 0 \\
w_1 + w_2 + w_0 &\le 0
\end{aligned}
$$

(Minsky and Papert, 1969)


Multilayer Perceptrons

(Rumelhart et al., 1986)

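$$ y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0} $$

$$ z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]} $$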


x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
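The decomposition above can be sketched as a two-layer network with hand-set weights, assuming NumPy. For simplicity this sketch uses threshold units rather than sigmoids, and the particular weight values are one illustrative choice.

```python
import numpy as np

def step(a):
    """Threshold activation 1(a > 0), applied elementwise."""
    return (a > 0).astype(float)

# Hidden layer: row h is [w_h0, w_h1, w_h2]
W = np.array([[-0.5,  1.0, -1.0],    # z_1 = x1 AND ~x2
              [-0.5, -1.0,  1.0]])   # z_2 = ~x1 AND x2
v = np.array([-0.5, 1.0, 1.0])       # y = z_1 OR z_2, as [v_0, v_1, v_2]

def xor_mlp(x1, x2):
    x = np.array([1.0, x1, x2])                       # x_0 = +1
    z = step(W @ x)                                   # hidden unit outputs
    return int(v @ np.concatenate(([1.0], z)) > 0)    # z_0 = +1 for the output bias

table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```

The hidden layer maps the four inputs to a space where the two classes are linearly separable, which is what a single perceptron cannot do on the raw inputs.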


Backpropagation

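$$ y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0} $$

$$ z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]} $$

$$ \frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial z_h} \frac{\partial z_h}{\partial w_{hj}} $$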


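Regression

$$ E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = \frac{1}{2} \sum_t (r^t - y^t)^2 $$

Forward

$$ z_h^t = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}^t) $$

$$ y^t = \sum_{h=1}^{H} v_h z_h^t + v_0 $$

Backward

$$ \Delta v_h = \eta \sum_t (r^t - y^t) \, z_h^t $$

$$
\begin{aligned}
\Delta w_{hj} &= -\eta \frac{\partial E}{\partial w_{hj}} \\
&= -\eta \sum_t \frac{\partial E}{\partial y^t} \frac{\partial y^t}{\partial z_h^t} \frac{\partial z_h^t}{\partial w_{hj}} \\
&= -\eta \sum_t -(r^t - y^t) \, v_h \, z_h^t (1 - z_h^t) \, x_j^t \\
&= \eta \sum_t (r^t - y^t) \, v_h \, z_h^t (1 - z_h^t) \, x_j^t
\end{aligned}
$$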
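The forward and backward passes for single-output regression can be sketched as a batch training loop, assuming NumPy. The function names, the number of hidden units, the learning rate, the epoch count, and the toy target $\sin(\pi x)$ are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(W, v, X_aug):
    Z = sigmoid(X_aug @ W.T)                        # z_h^t, shape (N, H)
    Z_aug = np.hstack([np.ones((len(Z), 1)), Z])    # prepend z_0 = +1 for v_0
    return Z, Z_aug, Z_aug @ v                      # y^t = sum_h v_h z_h^t + v_0

def train_mlp_regression(X, r, H=4, eta=0.005, epochs=3000, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    X_aug = np.hstack([np.ones((N, 1)), X])         # prepend x_0 = +1
    W = rng.uniform(-0.01, 0.01, size=(H, d + 1))   # first-layer weights w_hj
    v = rng.uniform(-0.01, 0.01, size=H + 1)        # second-layer weights v_h
    errors = []
    for _ in range(epochs):
        Z, Z_aug, y = forward(W, v, X_aug)
        err = r - y                                 # r^t - y^t
        errors.append(0.5 * np.sum(err ** 2))       # E before this epoch's update
        dv = eta * Z_aug.T @ err                    # eta * sum_t (r^t - y^t) z_h^t
        delta = (err[:, None] * v[1:]) * Z * (1.0 - Z)   # (r - y) v_h z_h (1 - z_h)
        dW = eta * delta.T @ X_aug                  # summed over t, times x_j^t
        v += dv                                     # both updates use the old v
        W += dW
    return W, v, errors

# Toy data (assumed): one input, target sin(pi * x)
X = np.linspace(-1.0, 1.0, 30).reshape(-1, 1)
r = np.sin(np.pi * X[:, 0])
W, v, errors = train_mlp_regression(X, r)
```

Note that `delta` is computed before `v` is changed, so the first-layer update uses the same second-layer weights that produced the forward pass.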


Regression with Multiple Outputs

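$$ E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = \frac{1}{2} \sum_t \sum_i (r_i^t - y_i^t)^2 $$

$$ y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0} $$

$$ \Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t) \, z_h^t $$

$$ \Delta w_{hj} = \eta \sum_t \left[ \sum_i (r_i^t - y_i^t) \, v_{ih} \right] z_h^t (1 - z_h^t) \, x_j^t $$

[Figure: network with inputs $x_j$, first-layer weights $w_{hj}$, hidden units $z_h$, second-layer weights $v_{ih}$, and outputs $y_i$]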




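[Figure: panels showing the hidden unit inputs $w_h x + w_0$, the hidden unit outputs $z_h$, and the weighted outputs $v_h z_h$]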


Two-Class Discrimination

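One sigmoid output $y^t$ for $P(C_1 \mid \mathbf{x}^t)$ and $P(C_2 \mid \mathbf{x}^t) \equiv 1 - y^t$

$$ y^t = \text{sigmoid}\left( \sum_{h=1}^{H} v_h z_h^t + v_0 \right) $$

$$ E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log(1 - y^t) \right] $$

$$ \Delta v_h = \eta \sum_t (r^t - y^t) \, z_h^t $$

$$ \Delta w_{hj} = \eta \sum_t (r^t - y^t) \, v_h \, z_h^t (1 - z_h^t) \, x_j^t $$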


K>2 Classes

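$$ o_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0} $$

$$ y_i^t = \frac{\exp o_i^t}{\sum_k \exp o_k^t} \equiv P(C_i \mid \mathbf{x}^t) $$

$$ E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t $$

$$ \Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t) \, z_h^t $$

$$ \Delta w_{hj} = \eta \sum_t \left[ \sum_i (r_i^t - y_i^t) \, v_{ih} \right] z_h^t (1 - z_h^t) \, x_j^t $$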
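The backpropagated gradients for the K-class case can be sketched for a single pattern and checked numerically, assuming NumPy. The function names, the random weights, and the choice of which weight to perturb are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(W, V, x):
    """x includes x_0 = +1; V[:, 0] holds the output biases v_i0."""
    z = sigmoid(W @ x)                          # hidden units z_h
    o = V @ np.concatenate(([1.0], z))          # o_i = sum_h v_ih z_h + v_i0
    e = np.exp(o - o.max())                     # softmax, shifted for stability
    return z, e / e.sum()

def error(W, V, x, r):
    _, y = forward(W, V, x)
    return -np.sum(r * np.log(y))               # E = -sum_i r_i log y_i

def backprop_grads(W, V, x, r):
    """Return dE/dV and dE/dW for one pattern (the updates are -eta times these)."""
    z, y = forward(W, V, x)
    dV = -np.outer(r - y, np.concatenate(([1.0], z)))
    back = (r - y) @ V[:, 1:]                   # sum_i (r_i - y_i) v_ih
    dW = -np.outer(back * z * (1.0 - z), x)
    return dV, dW

rng = np.random.default_rng(1)
d, H, K = 3, 4, 3
W = rng.normal(size=(H, d + 1))
V = rng.normal(size=(K, H + 1))
x = np.concatenate(([1.0], rng.normal(size=d)))
r = np.array([1.0, 0.0, 0.0])                   # one-hot desired output

# Central-difference check on one hidden-layer weight, w_21
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[2, 1] += eps
Wm[2, 1] -= eps
numeric = (error(Wp, V, x, r) - error(Wm, V, x, r)) / (2 * eps)
analytic = backprop_grads(W, V, x, r)[1][2, 1]
```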


Multiple Hidden Layers

MLP with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple layers may lead to simpler networks

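$$ z_{1h} = \text{sigmoid}(\mathbf{w}_{1h}^T \mathbf{x}) = \text{sigmoid}\left( \sum_{j=1}^{d} w_{1hj} x_j + w_{1h0} \right), \quad h = 1, \ldots, H_1 $$

$$ z_{2l} = \text{sigmoid}(\mathbf{w}_{2l}^T \mathbf{z}_1) = \text{sigmoid}\left( \sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0} \right), \quad l = 1, \ldots, H_2 $$

$$ y = \mathbf{v}^T \mathbf{z}_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0 $$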
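The forward pass through two hidden layers can be sketched as below, assuming NumPy. The function names and the all-zero first- and second-layer weights are illustrative; the zeros are chosen only so the result is easy to verify by hand.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def with_bias(u):
    """Prepend the constant +1 unit used for the bias weight."""
    return np.concatenate(([1.0], u))

def two_hidden_layer_forward(W1, W2, v, x):
    z1 = sigmoid(W1 @ with_bias(x))     # first hidden layer, shape (H1,)
    z2 = sigmoid(W2 @ with_bias(z1))    # second hidden layer, shape (H2,)
    return float(v @ with_bias(z2))     # linear output y = v^T z_2

d, H1, H2 = 3, 4, 2
W1 = np.zeros((H1, d + 1))              # zero weights give z_1h = 0.5
W2 = np.zeros((H2, H1 + 1))             # zero weights give z_2l = 0.5
v = np.array([0.5, 1.0, 1.0])           # [v_0, v_1, v_2]
y = two_hidden_layer_forward(W1, W2, v, np.array([1.0, 2.0, 3.0]))  # 0.5 + 0.5 + 0.5
```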


Improving Convergence

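Momentum

$$ \Delta w_i^t = -\eta \frac{\partial E^t}{\partial w_i} + \alpha \, \Delta w_i^{t-1} $$

Adaptive learning rate

$$ \Delta \eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\eta & \text{otherwise} \end{cases} $$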
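Both rules can be sketched as small helper functions, assuming NumPy. The function names, the default constants, and the one-dimensional error $E(w) = \frac{1}{2} w^2$ (whose gradient is $w$) are assumptions used only to make the arithmetic checkable.

```python
import numpy as np

def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.9):
    """Delta w^t = -eta * dE/dw + alpha * Delta w^(t-1)."""
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw

def adapt_learning_rate(eta, E_new, E_old, a=0.01, b=0.1):
    """Increase eta by a if the error decreased, otherwise shrink it by b * eta."""
    return eta + a if E_new < E_old else eta - b * eta

# Two momentum steps on E(w) = 0.5 * w^2, starting from w = 1
w, dw = np.array([1.0]), np.array([0.0])
w, dw = momentum_step(w, grad=w, prev_dw=dw)   # dw = -0.1,  w = 0.9
w, dw = momentum_step(w, grad=w, prev_dw=dw)   # dw = -0.18, w = 0.72

eta_up = adapt_learning_rate(0.1, E_new=1.0, E_old=2.0)     # error fell: 0.1 + 0.01
eta_down = adapt_learning_rate(0.1, E_new=2.0, E_old=1.0)   # error rose: 0.1 - 0.01
```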


Overfitting/Overtraining

Number of weights: $H(d+1) + (H+1)K$
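As a quick illustration of how the number of weights grows with the hidden layer size, the count can be computed directly. The function name and the example sizes are assumptions.

```python
def mlp_weight_count(d, H, K):
    """H hidden units with d inputs plus bias, and K outputs with H inputs plus bias."""
    return H * (d + 1) + (H + 1) * K

small = mlp_weight_count(d=2, H=3, K=1)      # 3*3 + 4*1 = 13
large = mlp_weight_count(d=2, H=30, K=1)     # 30*3 + 31*1 = 121
```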



Structured MLP

(Le Cun et al., 1989)


Weight Sharing


Hints

Invariance to translation, rotation, size

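Virtual examples

Augmented error: $E' = E + \lambda_h E_h$

If $x'$ and $x$ are the "same": $E_h = [g(x \mid \theta) - g(x' \mid \theta)]^2$

Approximation hint:

(Abu-Mostafa, 1995)

$$ E_h = \begin{cases} 0 & \text{if } g(x \mid \theta) \in [a_x, b_x] \\ (g(x \mid \theta) - a_x)^2 & \text{if } g(x \mid \theta) < a_x \\ (g(x \mid \theta) - b_x)^2 & \text{if } g(x \mid \theta) > b_x \end{cases} $$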
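The two hint penalties and the augmented error can be sketched as plain functions. All names and the example numbers below are hypothetical; `g` stands for any trained model called as `g(x)`.

```python
def invariance_hint_error(g, x, x_virtual):
    """E_h = [g(x) - g(x')]^2 when x' is a virtual example that should map like x."""
    return (g(x) - g(x_virtual)) ** 2

def approximation_hint_error(gx, a, b):
    """Zero inside [a, b]; squared distance to the nearest bound outside it."""
    if gx < a:
        return (gx - a) ** 2
    if gx > b:
        return (gx - b) ** 2
    return 0.0

def augmented_error(E, E_h, lam_h):
    """E' = E + lambda_h * E_h."""
    return E + lam_h * E_h

inside = approximation_hint_error(0.5, a=0.0, b=1.0)    # within the bounds
below = approximation_hint_error(-1.0, a=0.0, b=1.0)    # (-1 - 0)^2 = 1
above = approximation_hint_error(3.0, a=0.0, b=1.0)     # (3 - 1)^2 = 4
E_aug = augmented_error(E=2.0, E_h=above, lam_h=0.5)    # 2 + 0.5 * 4 = 4
```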


Tuning the Network Size

Destructive

Weight decay:

$$ \Delta w_i = -\eta \frac{\partial E}{\partial w_i} - \lambda w_i $$

$$ E' = E + \frac{\lambda}{2} \sum_i w_i^2 $$

Constructive

Growing networks

(Ash, 1989) (Fahlman and Lebiere, 1989)
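The weight decay update can be sketched as below, assuming NumPy. The function names and constants are illustrative. With a zero error gradient, each weight simply shrinks toward zero by the factor $(1 - \lambda)$, which is the "decay".

```python
import numpy as np

def weight_decay_step(w, grad_E, eta=0.1, lam=0.01):
    """Delta w_i = -eta * dE/dw_i - lambda * w_i."""
    return w + (-eta * grad_E - lam * w)

def penalized_error(E, w, lam):
    """E' = E + (lambda / 2) * sum_i w_i^2."""
    return E + 0.5 * lam * np.sum(w ** 2)

w = np.array([1.0, -2.0])
w_next = weight_decay_step(w, grad_E=np.zeros(2))   # pure decay: [0.99, -1.98]
E_prime = penalized_error(1.0, w, lam=0.01)         # 1 + 0.005 * (1 + 4) = 1.025
```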


Bayesian Learning

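Consider weights $w_i$ as random variables with prior $p(w_i)$

$$ p(\mathbf{w} \mid \mathcal{X}) = \frac{p(\mathcal{X} \mid \mathbf{w}) \, p(\mathbf{w})}{p(\mathcal{X})}, \qquad \hat{\mathbf{w}}_{MAP} = \arg\max_{\mathbf{w}} \log p(\mathbf{w} \mid \mathcal{X}) $$

$$ \log p(\mathbf{w} \mid \mathcal{X}) = \log p(\mathcal{X} \mid \mathbf{w}) + \log p(\mathbf{w}) + C $$

$$ p(\mathbf{w}) = \prod_i p(w_i) \quad \text{where} \quad p(w_i) = c \cdot \exp\left[ -\frac{w_i^2}{2(1/2\lambda)} \right] $$

$$ E' = E + \lambda \|\mathbf{w}\|^2 $$

Weight decay, ridge regression, regularization

cost = data-misfit + λ complexity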


Dimensionality Reduction



Learning Time

Applications:

Sequence recognition: Speech recognition

Sequence reproduction: Time-series prediction

Sequence association

Network architectures

Time-delay networks (Waibel et al., 1989)

Recurrent networks (Rumelhart et al., 1986)


Time-Delay Neural Networks


Recurrent Networks


Unfolding in Time
