Harvard-MIT Division of Health Sciences and Technology
HST.951J: Medical Decision Support, Fall 2005
Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo

6.873/HST.951 Medical Decision Support, Spring 2005

Artificial Neural Networks

Lucila Ohno-Machado
(with many slides borrowed from Stephan Dreiseitl. Courtesy of Stephan Dreiseitl. Used with permission.)
Overview
• Motivation
• Perceptrons
• Multilayer perceptrons
• Improving generalization
• Bayesian perspective
Motivation
Images removed due to copyright reasons.
[Image captions: benign lesion, malignant lesion]
Motivation
• Human brain
  – Parallel processing
  – Distributed representation
  – Fault tolerant
  – Good generalization capability
• Mimic its structure and processing in a computational model
Biological Analogy
[Figure: a biological neuron — dendrites, axon, and synapses (excitatory + and inhibitory −) — mapped onto network nodes and weighted connections]
Perceptrons
[Figure: binary input patterns (00, 01, 10, 11) presented to an input layer; the output layer maps them to sorted patterns]
Activation functions
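The figure for this slide is not reproduced here. As a minimal sketch (names and test values my own), the two activation functions used throughout this deck — the perceptron's threshold step and the sigmoid of the later slides — can be written as:

```python
import numpy as np

def step(a, theta=0.5):
    """Threshold activation: 1 if a > theta, else 0 (the perceptron's f)."""
    return np.where(a > theta, 1, 0)

def sigmoid(a, theta=0.0):
    """Logistic activation o = 1 / (1 + e^-(a + theta)), as on later slides."""
    return 1.0 / (1.0 + np.exp(-(a + theta)))

print(step(np.array([0.2, 0.6])))           # [0 1]
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.007, 0.5, 0.993]
```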
Perceptrons (linear machines)

[Figure: input units (Cough, Headache) connect through weights to output units (No disease, Pneumonia, Flu, Meningitis); error = what we wanted − what we got; the Δ rule changes the weights to decrease the error]
Abdominal Pain Perceptron

[Figure: inputs Male = 1, Age = 20, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1, connected by adjustable weights to outputs Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis; output vector 0 1 0 0 0 0 0]
AND

y = f(x1·w1 + x2·w2), threshold θ = 0.5
f(a) = 1 for a > θ; f(a) = 0 for a ≤ θ

input (x1, x2) → output y
(0,0) → 0    f(0·w1 + 0·w2) = 0
(0,1) → 0    f(0·w1 + 1·w2) = 0
(1,0) → 0    f(1·w1 + 0·w2) = 0
(1,1) → 1    f(1·w1 + 1·w2) = 1

Some possible values for w1 and w2:
w1    w2
0.20  0.35
0.20  0.40
0.25  0.30
0.40  0.20
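As a quick check — assuming the column-wise pairing of the reconstructed weight table, e.g. (w1, w2) = (0.20, 0.35) — the truth table above can be verified directly:

```python
def f(a, theta=0.5):
    # Step activation from the slide: 1 if a > theta, else 0.
    return 1 if a > theta else 0

# One of the weight pairs listed on the slide.
w1, w2 = 0.20, 0.35

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = f(x1 * w1 + x2 * w2)
    print(f"({x1}, {x2}) -> {y}")   # prints 0, 0, 0, 1: the AND function
```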
Single layer neural network

Input to unit i: ai = measured value of variable i
Input to unit j: aj = Σ wij ai (sum over inputs i)
Output of unit j: oj = 1 / (1 + e^−(aj + θj))
Single layer neural network (continued)

Output of unit j: oj = 1 / (1 + e^−(aj + θj)), with aj = Σ wij ai and ai the measured value of variable i.

[Figure: sigmoid curves plotted over roughly −15 to 15; output runs from 0 to 1, and increasing θ shifts the curve]
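A minimal sketch of one such sigmoid unit, with illustrative weights and bias rather than values from the slides:

```python
import numpy as np

# One sigmoid output unit over 3 input variables (all values illustrative).
a_i = np.array([0.5, 1.0, -0.3])    # measured values of the input variables
w_ij = np.array([0.4, -0.2, 0.7])   # weights from inputs i to unit j
theta_j = 0.1                       # bias of unit j

a_j = np.dot(w_ij, a_i)             # a_j = sum_i w_ij * a_i
o_j = 1.0 / (1.0 + np.exp(-(a_j + theta_j)))
print(o_j)                          # a value in (0, 1)
```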
Training: Minimize Error

[Figure: the same network as before — input units (Cough, Headache), weights, output units (No disease, Pneumonia, Flu, Meningitis); error = what we wanted − what we got; the Δ rule changes the weights to decrease the error]
Error Functions
• Mean Squared Error (for regression problems): Σ (t − o)² / n, where t is the target and o is the output
• Cross-Entropy Error (for binary classification): −Σ [t log o + (1 − t) log(1 − o)], with output oj = 1 / (1 + e^−(aj + θj))
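Both error functions are a few lines of code; this sketch assumes numpy arrays of targets t and outputs o (with o strictly between 0 and 1 for the cross-entropy):

```python
import numpy as np

def mean_squared_error(t, o):
    return np.mean((t - o) ** 2)   # sum (t - o)^2 / n

def cross_entropy_error(t, o):
    return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

t = np.array([1, 0, 1])
o = np.array([0.9, 0.2, 0.7])
print(mean_squared_error(t, o), cross_entropy_error(t, o))
```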
Error function
• Convention: w := (w0, w), x := (1, x)
• w0 is the “bias”
• o = f(w · x)
• Class labels ti ∈ {+1, −1}
• Error measure: E = −Σ ti (w · xi), summed over misclassified i
• How to minimize E?

[Figure: perceptron y = f(w1·x1 + w2·x2), threshold θ = 0.5]
Minimizing the Error

[Figure: error surface over weight w; following the derivative moves w from its initial value (initial error) to its trained value (final error) at a local minimum]
Gradient descent

[Figure: error curve with a local minimum and a global minimum]
Perceptron learning
• Find the minimum of E by iterating wk+1 = wk − η gradw E
• E = −Σ ti (w · xi) over misclassified i ⇒ gradw E = −Σ ti xi over misclassified i
• “online” version: pick a misclassified xi and update wk+1 = wk + η ti xi
Perceptron learning
• Update rule: wk+1 = wk + η ti xi
• Theorem: perceptron learning converges for linearly separable sets
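A sketch of the online update rule on a toy linearly separable set (the data and learning rate are illustrative, not from the slides); because the set is separable, the loop is guaranteed to terminate:

```python
import numpy as np

# Online perceptron learning: w_{k+1} = w_k + eta * t_i * x_i for a
# misclassified x_i. Labels t in {+1, -1}; the first component of each x
# is the constant 1 (the bias convention w := (w0, w)).
X = np.array([[1, 2.0, 1.0], [1, 1.5, 2.0], [1, -1.0, -1.5], [1, -2.0, -0.5]])
t = np.array([1, 1, -1, -1])
w = np.zeros(3)
eta = 0.1

for epoch in range(100):
    misclassified = [i for i in range(len(t)) if t[i] * np.dot(w, X[i]) <= 0]
    if not misclassified:
        break                      # converged: every example classified correctly
    i = misclassified[0]           # pick one misclassified example
    w = w + eta * t[i] * X[i]

print(w, [int(np.sign(np.dot(w, x))) for x in X])
```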
Gradient descent
• Simple function minimization algorithm
• Gradient is the vector of partial derivatives
• Negative gradient is the direction of steepest descent
[Figure: surface and contour plots of a function, with the gradient-descent path stepping toward the minimum]
Figures by MIT OCW.
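A minimal sketch of gradient descent on a toy quadratic error (the function and step size are illustrative only):

```python
import numpy as np

# Gradient descent w_{k+1} = w_k - eta * grad E on
# E(w) = (w0 - 1)^2 + 2*(w1 + 0.5)^2, whose minimum is at (1, -0.5).
def grad_E(w):
    return np.array([2 * (w[0] - 1), 4 * (w[1] + 0.5)])

w = np.array([3.0, 2.0])
eta = 0.1
for k in range(100):
    w = w - eta * grad_E(w)        # step in the direction of steepest descent

print(w)                           # approaches (1, -0.5)
```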
Classification Model

[Figure: inputs Age = 34, Gender = 1, Stage = 4, combined through weights (.5, .8, .4) into a single output, 0.6 = “Probability of being Alive”; independent variables x1, x2, x3, coefficients a, b, c, dependent variable p — the prediction]
Terminology
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = “weights”
• Estimates = “targets”
• Iterative step = cycle, epoch
XOR

y = f(x1·w1 + x2·w2), threshold θ = 0.5
f(a) = 1 for a > θ; f(a) = 0 for a ≤ θ

input (x1, x2) → output y
(0,0) → 0    f(0·w1 + 0·w2) = 0
(0,1) → 1    f(0·w1 + 1·w2) = 1
(1,0) → 1    f(1·w1 + 0·w2) = 1
(1,1) → 0    f(1·w1 + 1·w2) = 0

Some possible values for w1 and w2: there are none — XOR is not linearly separable, so no single perceptron can compute it.
XOR with one hidden unit

z = f(w1·x1 + w2·x2), θ = 0.5
y = f(w3·x1 + w4·x2 + w5·z), θ = 0.5
f(a) = 1 for a > θ; 0 for a ≤ θ

input (x1, x2) → output y: (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 0

A possible set of values: (w1, w2, w3, w4, w5) = (0.3, 0.3, 1, 1, −2)
XOR with two hidden units

h1 = f(w1·x1 + w2·x2), h2 = f(w3·x1 + w4·x2), y = f(w5·h1 + w6·h2), θ = 0.5 for all units
f(a) = 1 for a > θ; 0 for a ≤ θ

input (x1, x2) → output y: (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 0

A possible set of values: (w1, w2, w3, w4, w5, w6) = (0.6, −0.6, −0.7, 0.8, 1, 1)
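The exact wiring of w1…w6 is inferred from the slide's layout, so treat the assignment below as an assumption; with it, the listed weights do reproduce XOR:

```python
def f(a, theta=0.5):
    return 1 if a > theta else 0

w1, w2, w3, w4, w5, w6 = 0.6, -0.6, -0.7, 0.8, 1, 1

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h1 = f(w1 * x1 + w2 * x2)      # first hidden unit
    h2 = f(w3 * x1 + w4 * x2)      # second hidden unit
    y = f(w5 * h1 + w6 * h2)       # output unit
    print(f"({x1}, {x2}) -> {y}")  # 0, 1, 1, 0: XOR
```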
From perceptrons to multilayer perceptrons

Why?
Abdominal Pain

[Figure: inputs Male = 1, Age = 20, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1, connected by adjustable weights to outputs Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis; output vector 0 1 0 0 0 0 0]
Heart Attack Network

[Figure: inputs Duration of Pain, Intensity of Pain, ECG: ST elevation, Smoker, Age, Male feed a network whose output, 0.8, is the “probability” of Myocardial Infarction]
Multilayered Perceptrons

Perceptron:
  Input to unit i: ai = measured value of variable i
  Input to unit j: aj = Σ wij ai
  Output of unit j: oj = 1 / (1 + e^−(aj + θj))

Multilayered perceptron adds a layer of hidden units between inputs and outputs:
  Input to unit k: ak = Σ wjk oj
  Output of unit k: ok = 1 / (1 + e^−(ak + θk))
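A sketch of the forward pass these equations describe, with illustrative layer sizes and random weights (not values from the slides):

```python
import numpy as np

def sigmoid(a, theta):
    return 1.0 / (1.0 + np.exp(-(a + theta)))

a_i = np.array([0.5, 1.0, -0.3])      # measured values of the variables
W_ij = np.random.randn(4, 3) * 0.5    # weights input i -> hidden j
theta_j = np.zeros(4)
W_jk = np.random.randn(2, 4) * 0.5    # weights hidden j -> output k
theta_k = np.zeros(2)

o_j = sigmoid(W_ij @ a_i, theta_j)    # a_j = sum_i w_ij a_i, then sigmoid
o_k = sigmoid(W_jk @ o_j, theta_k)    # a_k = sum_j w_jk o_j, then sigmoid
print(o_k)
```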
Neural Network Model

[Figure: inputs Age = 34, Gender = 2, Stage = 4, connected by weights (.6, .5, .8, .2, .1, .3, .7, .2, .4, .2) through a hidden layer to one output, 0.6 = “Probability of being Alive”; independent variables → weights → hidden layer → weights → dependent variable (the prediction)]
“Combined logistic models”

[Figure: the same network drawn as combined logistic models — inputs Age = 34, Gender = 2, Stage = 4, with one path of weights (.6, .5, .8, .1, .7) highlighted, output 0.6 = “Probability of being Alive”]
[Figure: the same network with a different path of weights highlighted — inputs Age = 34, Gender = 1, Stage = 4, output 0.6 = “Probability of being Alive”]
Not really — no target for hidden units...

[Figure: the full network again (Age = 34, Gender = 2, Stage = 4 → hidden layer → 0.6 “Probability of being Alive”); unlike the output, the hidden units have no target values to fit logistic models against]
Hidden Units and Backpropagation

[Figure: error = what we wanted − what we got is computed at the output units; the Δ rule updates the output-layer weights, and the error signal is backpropagated so the Δ rule can also update the hidden-layer weights]
Multilayer perceptrons
• Sigmoidal hidden layer
• Can represent arbitrary decision regions
• Can be trained similarly to perceptrons
ECG Interpretation

[Figure: a network mapping ECG measurements — R-R interval, P-R interval, S-T elevation, QRS duration, QRS amplitude, AVF lead — to diagnoses: SV tachycardia, Ventricular tachycardia, LV hypertrophy, RV hypertrophy, Myocardial infarction]
Linear Separation

Separate n-dimensional space using an (n − 1)-dimensional hyperplane.

[Figure: 2-D example — quadrants Cough/No cough × Headache/No headache labeled No disease, Pneumonia, Flu, Meningitis; 3-D example — cube corners 000 through 111 split by a plane into Treatment vs. No treatment]
Another way of thinking about this…
• Have a data set D = {(xi, ti)} drawn from a probability distribution P(x, t)
• Model P(x, t) given samples D by an ANN with adjustable parameter w
• Statistics analogy:
Maximum Likelihood Estimation
• Maximize the likelihood of the data D
• Likelihood L = Π p(xi, ti) = Π p(ti|xi) p(xi)
• Minimize −log L = −Σ log p(ti|xi) − Σ log p(xi)
• Drop the second term: it does not depend on w
• Two cases: “regression” and classification
Likelihood for classification (i.e., categorical target)
• For classification, the targets t are class labels
• Minimize −Σ log p(ti|xi)
• p(ti|xi) = y(xi,w)^ti (1 − y(xi,w))^(1−ti) ⇒
  −log p(ti|xi) = −ti log y(xi,w) − (1 − ti) log(1 − y(xi,w))
• Minimizing −log L is equivalent to minimizing
  −Σ [ti log y(xi,w) + (1 − ti) log(1 − y(xi,w))]  (cross-entropy error)
Likelihood for “regression” (i.e., continuous target)
• For regression, the targets t are real values
• Minimize −Σ log p(ti|xi)
• p(ti|xi) = (1/Z) exp(−(y(xi,w) − ti)² / (2σ²)) ⇒
  −log p(ti|xi) = (1/(2σ²)) (y(xi,w) − ti)² + log Z
• y(xi,w) is the network output
• Minimizing −log L is equivalent to minimizing Σ (y(xi,w) − ti)²  (sum-of-squares error)
Backpropagation algorithm
• Minimize the error function by gradient descent: wk+1 = wk − η gradw E
• Iterative gradient calculation by propagating error signals backwards through the network ⇒ backpropagation algorithm
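A minimal backpropagation sketch for one hidden layer of sigmoid units and a sum-of-squares error; the network size, data, and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))             # 8 examples, 3 inputs
T = rng.random(size=(8, 1))             # real-valued targets
W1 = rng.normal(scale=0.5, size=(3, 4)) # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(4, 1)) # hidden -> output weights
eta = 0.5

for epoch in range(1000):
    H = sigmoid(X @ W1)                           # hidden activations
    Y = sigmoid(H @ W2)                           # network output
    delta_out = (Y - T) * Y * (1 - Y)             # error signal at the output
    delta_hid = (delta_out @ W2.T) * H * (1 - H)  # error propagated backwards
    W2 -= eta * H.T @ delta_out                   # gradient descent step
    W1 -= eta * X.T @ delta_hid

print(np.mean((Y - T) ** 2))                      # error decreases with training
```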
Backpropagation algorithm
Problem: how to set the learning rate η?
Better: use more advanced minimization algorithms (second-order information).

[Figure: contour plots of gradient-descent trajectories for different learning rates]
Figures by MIT OCW.
Backpropagation algorithm
• Classification: cross-entropy error
• Regression: sum-of-squares error
Overfitting

[Figure: the real distribution vs. an overfitted model]
Improving generalization

Problem: memorizing (x, t) combinations (“overtraining”)

[Table: training pairs of inputs and targets, e.g. (0.7, 0.5) → 0, (0.9, −0.5) → 1, (0.3, 0.6) → 1, and a new query (0.5, −0.2) → ? that memorization cannot answer]
Improving generalization
• Need a test set to judge performance
• Goal: represent the information in the data set, not the noise
• How to improve generalization?
  – Limit network topology
  – Early stopping
  – Weight decay
Limit network topology
• Idea: fewer weights ⇒ less flexibility
• Analogy to polynomial interpolation:
Limit network topology
Early Stopping

[Figure: left — CHD vs. age, an overfitted model vs. the “real” model; right — error vs. training cycles for the training and holdout sets, with the holdout curve eventually rising]
Early stopping
• Idea: stop training when the information (but not the noise) is modeled
• Need a hold-out set to determine when to stop training
Overfitting

[Figure: total sum of squares (tss) vs. epochs; tss b on the training set (b) keeps decreasing while tss a on the hold-out set (a) turns upward — the overfitted model; stopping criterion: min(Δtss) on the hold-out set]
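One common way to implement this stopping criterion is to track the hold-out error each epoch and keep the weights from the best epoch; train_step and holdout_error below are illustrative stand-ins, not slide code:

```python
import numpy as np

def train_step(w):
    return w - 0.01 * np.random.randn(*w.shape)   # stand-in for one training epoch

def holdout_error(w):
    return float(np.sum(w ** 2))                  # stand-in hold-out error measure

w = np.random.randn(5)
best_w, best_err = w.copy(), holdout_error(w)
patience, bad_epochs = 10, 0

for epoch in range(1000):
    w = train_step(w)
    err = holdout_error(w)
    if err < best_err:
        best_w, best_err, bad_epochs = w.copy(), err, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # hold-out error stopped improving: stop early
            break

print(best_err)
```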
Early stopping
Weight decay
• Idea: control smoothness of the network output by controlling the size of the weights
• Add a term α‖w‖² to the error function
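Adding α‖w‖² to the error adds 2αw to its gradient, so weight decay is a one-line change to gradient descent; the error function here is an illustrative stand-in:

```python
import numpy as np

alpha = 0.01   # weight-decay coefficient

def grad_E(w):
    # Gradient of a toy error E(w) = (w0 - 1)^2 + (w1 + 2)^2.
    return np.array([2 * (w[0] - 1), 2 * (w[1] + 2)])

w = np.array([3.0, 3.0])
eta = 0.1
for _ in range(200):
    total_grad = grad_E(w) + 2 * alpha * w   # decay term pulls w toward 0
    w = w - eta * total_grad

print(w)   # slightly shrunk relative to the unpenalized minimum (1, -2)
```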
Bayesian perspective
• Error function minimization corresponds to the maximum likelihood (ML) estimate: a single best solution wML
• Can lead to overtraining
• Bayesian approach: consider the weight posterior distribution p(w|D)
Bayesian perspective
• Posterior = likelihood × prior
• p(w|D) = p(D|w) p(w) / p(D)
• Two approaches to approximating p(w|D):
  – Sampling
  – Gaussian approximation
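A toy, one-dimensional illustration of posterior = likelihood × prior on a grid (the slides show this pictorially; the data and prior here are assumptions):

```python
import numpy as np

w = np.linspace(-3, 3, 601)
prior = np.exp(-0.5 * w ** 2)                 # Gaussian prior on the weight w
data = np.array([0.8, 1.1, 0.9])              # observations with mean ~ w
likelihood = np.exp(-0.5 * np.sum((data[:, None] - w) ** 2, axis=0))

posterior = likelihood * prior                # posterior = likelihood * prior
posterior /= np.trapz(posterior, w)           # normalize (the 1/p(D) factor)
print(w[np.argmax(posterior)])                # posterior mode, pulled toward 0
```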
Sampling from p(w|D)

[Figure: prior and likelihood densities over w]
Sampling from p(w|D)
prior * likelihood = posterior
Bayesian example for regression
Bayesian example for classification
Model Features (with strong personal biases)

Model                   Modeling Effort   Examples Needed   Explanation
Rule-based Exp. Syst.   high              low               high?
Classification Trees    low               high+             “high”
Neural Nets, SVM        low               high              low
Regression Models       high              moderate          moderate
Learned Bayesian Nets   low               high+             high (beautiful when it works)
Regression vs. Neural Networks

Regression: Y = a(X1) + b(X2) + c(X3) + d(X1X2) + … with explicit interaction terms X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3 — (2³ − 1) possible combinations.

[Figure: a neural network with inputs X1, X2, X3 whose hidden units play the role of the terms “X1”, “X2”, “X3”, “X1X2X3”]
Summary
• ANNs inspired by the functionality of the brain
• Nonlinear data model
• Trained by minimizing an error function
• Goal is to generalize well
• Avoid overtraining
• Distinguish ML and MAP solutions
Some References

Introductory and Historical Textbooks
• Rumelhart DE, McClelland JL (eds). Parallel Distributed Processing. MIT Press, Cambridge, 1986.
• Hertz JA, Palmer RG, Krogh AS. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991.
• Pao YH. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, 1989.
• Bishop CM. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.