Harvard-MIT Division of Health Sciences and Technology
HST.951J: Medical Decision Support, Fall 2005
Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo

6.873/HST.951 Medical Decision Support, Spring 2005

Artificial Neural Networks

Lucila Ohno-Machado
(with many slides borrowed from Stephan Dreiseitl. Courtesy of Stephan Dreiseitl. Used with permission.)
Overview
• Motivation
• Perceptrons
• Multilayer perceptrons
• Improving generalization
• Bayesian perspective
Motivation
Images removed due to copyright reasons.
[Image captions: benign lesion, malignant lesion]
Motivation
• Human brain
  – Parallel processing
  – Distributed representation
  – Fault tolerant
  – Good generalization capability
• Mimic its structure and processing in a computational model
Biological Analogy
[Figure: a biological neuron — dendrites, axon, and synapses (excitatory + and inhibitory −) — mapped onto network nodes and weighted connections]
Perceptrons
[Figure: binary input patterns (00, 01, 10, 11) presented to an input layer; the output layer maps them to sorted patterns]
Activation functions
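The figure for this slide is not reproduced here. As a minimal sketch (names and test values my own), the two activation functions used throughout this deck — the perceptron's threshold step and the sigmoid of the later slides — can be written as:

```python
import numpy as np

def step(a, theta=0.5):
    """Threshold activation: 1 if a > theta, else 0 (the perceptron's f)."""
    return np.where(a > theta, 1, 0)

def sigmoid(a, theta=0.0):
    """Logistic activation o = 1 / (1 + e^-(a + theta)), as on later slides."""
    return 1.0 / (1.0 + np.exp(-(a + theta)))

print(step(np.array([0.2, 0.6])))           # [0 1]
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.007, 0.5, 0.993]
```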
Perceptrons (linear machines)

[Figure: input units (Cough, Headache) connect through weights to output units (No disease, Pneumonia, Flu, Meningitis); error = what we wanted − what we got; the Δ rule changes the weights to decrease the error]
Abdominal Pain Perceptron

[Figure: inputs Male = 1, Age = 20, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1, connected by adjustable weights to outputs Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis; output vector 0 1 0 0 0 0 0]
AND

y = f(x1·w1 + x2·w2), threshold θ = 0.5
f(a) = 1 for a > θ; f(a) = 0 for a ≤ θ

input (x1, x2) → output y
(0,0) → 0    f(0·w1 + 0·w2) = 0
(0,1) → 0    f(0·w1 + 1·w2) = 0
(1,0) → 0    f(1·w1 + 0·w2) = 0
(1,1) → 1    f(1·w1 + 1·w2) = 1

Some possible values for w1 and w2:
w1    w2
0.20  0.35
0.20  0.40
0.25  0.30
0.40  0.20
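As a quick check — assuming the column-wise pairing of the reconstructed weight table, e.g. (w1, w2) = (0.20, 0.35) — the truth table above can be verified directly:

```python
def f(a, theta=0.5):
    # Step activation from the slide: 1 if a > theta, else 0.
    return 1 if a > theta else 0

# One of the weight pairs listed on the slide.
w1, w2 = 0.20, 0.35

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = f(x1 * w1 + x2 * w2)
    print(f"({x1}, {x2}) -> {y}")   # prints 0, 0, 0, 1: the AND function
```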
Single layer neural network

Input to unit i: ai = measured value of variable i
Input to unit j: aj = Σ wij ai (sum over inputs i)
Output of unit j: oj = 1 / (1 + e^−(aj + θj))
Single layer neural network (continued)

Output of unit j: oj = 1 / (1 + e^−(aj + θj)), with aj = Σ wij ai and ai the measured value of variable i.

[Figure: sigmoid curves plotted over roughly −15 to 15; output runs from 0 to 1, and increasing θ shifts the curve]
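A minimal sketch of one such sigmoid unit, with illustrative weights and bias rather than values from the slides:

```python
import numpy as np

# One sigmoid output unit over 3 input variables (all values illustrative).
a_i = np.array([0.5, 1.0, -0.3])    # measured values of the input variables
w_ij = np.array([0.4, -0.2, 0.7])   # weights from inputs i to unit j
theta_j = 0.1                       # bias of unit j

a_j = np.dot(w_ij, a_i)             # a_j = sum_i w_ij * a_i
o_j = 1.0 / (1.0 + np.exp(-(a_j + theta_j)))
print(o_j)                          # a value in (0, 1)
```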
Training: Minimize Error

[Figure: the same network as before — input units (Cough, Headache), weights, output units (No disease, Pneumonia, Flu, Meningitis); error = what we wanted − what we got; the Δ rule changes the weights to decrease the error]
Error Functions
• Mean Squared Error (for regression problems): Σ (t − o)² / n, where t is the target and o is the output
• Cross-Entropy Error (for binary classification): −Σ [t log o + (1 − t) log(1 − o)], with output oj = 1 / (1 + e^−(aj + θj))
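Both error functions are a few lines of code; this sketch assumes numpy arrays of targets t and outputs o (with o strictly between 0 and 1 for the cross-entropy):

```python
import numpy as np

def mean_squared_error(t, o):
    return np.mean((t - o) ** 2)   # sum (t - o)^2 / n

def cross_entropy_error(t, o):
    return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

t = np.array([1, 0, 1])
o = np.array([0.9, 0.2, 0.7])
print(mean_squared_error(t, o), cross_entropy_error(t, o))
```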
Error function
• Convention: w := (w0, w), x := (1, x)
• w0 is the “bias”
• o = f(w · x)
• Class labels ti ∈ {+1, −1}
• Error measure: E = −Σ ti (w · xi), summed over misclassified i
• How to minimize E?

[Figure: perceptron y = f(w1·x1 + w2·x2), threshold θ = 0.5]
Minimizing the Error

[Figure: error surface over weight w; following the derivative moves w from its initial value (initial error) to its trained value (final error) at a local minimum]
Gradient descent

[Figure: error curve with a local minimum and a global minimum]
Perceptron learning
• Find the minimum of E by iterating wk+1 = wk − η gradw E
• E = −Σ ti (w · xi) over misclassified i ⇒ gradw E = −Σ ti xi over misclassified i
• “online” version: pick a misclassified xi and update wk+1 = wk + η ti xi
Perceptron learning
• Update rule: wk+1 = wk + η ti xi
• Theorem: perceptron learning converges for linearly separable sets
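A sketch of the online update rule on a toy linearly separable set (the data and learning rate are illustrative, not from the slides); because the set is separable, the loop is guaranteed to terminate:

```python
import numpy as np

# Online perceptron learning: w_{k+1} = w_k + eta * t_i * x_i for a
# misclassified x_i. Labels t in {+1, -1}; the first component of each x
# is the constant 1 (the bias convention w := (w0, w)).
X = np.array([[1, 2.0, 1.0], [1, 1.5, 2.0], [1, -1.0, -1.5], [1, -2.0, -0.5]])
t = np.array([1, 1, -1, -1])
w = np.zeros(3)
eta = 0.1

for epoch in range(100):
    misclassified = [i for i in range(len(t)) if t[i] * np.dot(w, X[i]) <= 0]
    if not misclassified:
        break                      # converged: every example classified correctly
    i = misclassified[0]           # pick one misclassified example
    w = w + eta * t[i] * X[i]

print(w, [int(np.sign(np.dot(w, x))) for x in X])
```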
Gradient descent
• Simple function minimization algorithm
• Gradient is the vector of partial derivatives
• Negative gradient is the direction of steepest descent
[Figure: surface and contour plots of a function, with the gradient-descent path stepping toward the minimum]
Figures by MIT OCW.
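A minimal sketch of gradient descent on a toy quadratic error (the function and step size are illustrative only):

```python
import numpy as np

# Gradient descent w_{k+1} = w_k - eta * grad E on
# E(w) = (w0 - 1)^2 + 2*(w1 + 0.5)^2, whose minimum is at (1, -0.5).
def grad_E(w):
    return np.array([2 * (w[0] - 1), 4 * (w[1] + 0.5)])

w = np.array([3.0, 2.0])
eta = 0.1
for k in range(100):
    w = w - eta * grad_E(w)        # step in the direction of steepest descent

print(w)                           # approaches (1, -0.5)
```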
Classification Model

[Figure: inputs Age = 34, Gender = 1, Stage = 4, combined through weights (.5, .8, .4) into a single output, 0.6 = “Probability of being Alive”; independent variables x1, x2, x3, coefficients a, b, c, dependent variable p — the prediction]
Terminology
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = “weights”
• Estimates = “targets”
• Iterative step = cycle, epoch
XOR

y = f(x1·w1 + x2·w2), threshold θ = 0.5
f(a) = 1 for a > θ; f(a) = 0 for a ≤ θ

input (x1, x2) → output y
(0,0) → 0    f(0·w1 + 0·w2) = 0
(0,1) → 1    f(0·w1 + 1·w2) = 1
(1,0) → 1    f(1·w1 + 0·w2) = 1
(1,1) → 0    f(1·w1 + 1·w2) = 0

Some possible values for w1 and w2: there are none — XOR is not linearly separable, so no single perceptron can compute it.
XOR with one hidden unit

z = f(w1·x1 + w2·x2), θ = 0.5
y = f(w3·x1 + w4·x2 + w5·z), θ = 0.5
f(a) = 1 for a > θ; 0 for a ≤ θ

input (x1, x2) → output y: (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 0

A possible set of values: (w1, w2, w3, w4, w5) = (0.3, 0.3, 1, 1, −2)
XOR with two hidden units

h1 = f(w1·x1 + w2·x2), h2 = f(w3·x1 + w4·x2), y = f(w5·h1 + w6·h2), θ = 0.5 for all units
f(a) = 1 for a > θ; 0 for a ≤ θ

input (x1, x2) → output y: (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 0

A possible set of values: (w1, w2, w3, w4, w5, w6) = (0.6, −0.6, −0.7, 0.8, 1, 1)
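The exact wiring of w1…w6 is inferred from the slide's layout, so treat the assignment below as an assumption; with it, the listed weights do reproduce XOR:

```python
def f(a, theta=0.5):
    return 1 if a > theta else 0

w1, w2, w3, w4, w5, w6 = 0.6, -0.6, -0.7, 0.8, 1, 1

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h1 = f(w1 * x1 + w2 * x2)      # first hidden unit
    h2 = f(w3 * x1 + w4 * x2)      # second hidden unit
    y = f(w5 * h1 + w6 * h2)       # output unit
    print(f"({x1}, {x2}) -> {y}")  # 0, 1, 1, 0: XOR
```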
From perceptrons to multilayer perceptrons

Why?
Abdominal Pain

[Figure: inputs Male = 1, Age = 20, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1, connected by adjustable weights to outputs Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis; output vector 0 1 0 0 0 0 0]
Heart Attack Network

[Figure: inputs Duration of Pain, Intensity of Pain, ECG: ST elevation, Smoker, Age, Male feed a network whose output, 0.8, is the “probability” of Myocardial Infarction]
Multilayered Perceptrons

Perceptron:
  Input to unit i: ai = measured value of variable i
  Input to unit j: aj = Σ wij ai
  Output of unit j: oj = 1 / (1 + e^−(aj + θj))

Multilayered perceptron adds a layer of hidden units between inputs and outputs:
  Input to unit k: ak = Σ wjk oj
  Output of unit k: ok = 1 / (1 + e^−(ak + θk))
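A sketch of the forward pass these equations describe, with illustrative layer sizes and random weights (not values from the slides):

```python
import numpy as np

def sigmoid(a, theta):
    return 1.0 / (1.0 + np.exp(-(a + theta)))

a_i = np.array([0.5, 1.0, -0.3])      # measured values of the variables
W_ij = np.random.randn(4, 3) * 0.5    # weights input i -> hidden j
theta_j = np.zeros(4)
W_jk = np.random.randn(2, 4) * 0.5    # weights hidden j -> output k
theta_k = np.zeros(2)

o_j = sigmoid(W_ij @ a_i, theta_j)    # a_j = sum_i w_ij a_i, then sigmoid
o_k = sigmoid(W_jk @ o_j, theta_k)    # a_k = sum_j w_jk o_j, then sigmoid
print(o_k)
```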
Neural Network Model

[Figure: inputs Age = 34, Gender = 2, Stage = 4, connected by weights (.6, .5, .8, .2, .1, .3, .7, .2, .4, .2) through a hidden layer to one output, 0.6 = “Probability of being Alive”; independent variables → weights → hidden layer → weights → dependent variable (the prediction)]
“Combined logistic models”

[Figure: the same network drawn as combined logistic models — inputs Age = 34, Gender = 2, Stage = 4, with one path of weights (.6, .5, .8, .1, .7) highlighted, output 0.6 = “Probability of being Alive”]
[Figure: the same network with a different path of weights highlighted — inputs Age = 34, Gender = 1, Stage = 4, output 0.6 = “Probability of being Alive”]
Not really — no target for hidden units...

[Figure: the full network again (Age = 34, Gender = 2, Stage = 4 → hidden layer → 0.6 “Probability of being Alive”); unlike the output, the hidden units have no target values to fit logistic models against]
Hidden Units and Backpropagation

[Figure: error = what we wanted − what we got is computed at the output units; the Δ rule updates the output-layer weights, and the error signal is backpropagated so the Δ rule can also update the hidden-layer weights]
Multilayer perceptrons
• Sigmoidal hidden layer
• Can represent arbitrary decision regions
• Can be trained similarly to perceptrons
ECG Interpretation

[Figure: a network mapping ECG measurements — R-R interval, P-R interval, S-T elevation, QRS duration, QRS amplitude, AVF lead — to diagnoses: SV tachycardia, Ventricular tachycardia, LV hypertrophy, RV hypertrophy, Myocardial infarction]
Linear Separation

Separate n-dimensional space using an (n − 1)-dimensional hyperplane.

[Figure: 2-D example — quadrants Cough/No cough × Headache/No headache labeled No disease, Pneumonia, Flu, Meningitis; 3-D example — cube corners 000 through 111 split by a plane into Treatment vs. No treatment]
Another way of thinking about this…
• Have a data set D = {(xi, ti)} drawn from a probability distribution P(x, t)
• Model P(x, t) given samples D by an ANN with adjustable parameter w
• Statistics analogy:
Maximum Likelihood Estimation
• Maximize the likelihood of the data D
• Likelihood L = Π p(xi, ti) = Π p(ti|xi) p(xi)
• Minimize −log L = −Σ log p(ti|xi) − Σ log p(xi)
• Drop the second term: it does not depend on w
• Two cases: “regression” and classification
Likelihood for classification (i.e., categorical target)
• For classification, the targets t are class labels
• Minimize −Σ log p(ti|xi)
• p(ti|xi) = y(xi,w)^ti (1 − y(xi,w))^(1−ti) ⇒
  −log p(ti|xi) = −ti log y(xi,w) − (1 − ti) log(1 − y(xi,w))
• Minimizing −log L is equivalent to minimizing
  −Σ [ti log y(xi,w) + (1 − ti) log(1 − y(xi,w))]  (cross-entropy error)
Likelihood for “regression” (i.e., continuous target)
• For regression, the targets t are real values
• Minimize −Σ log p(ti|xi)
• p(ti|xi) = (1/Z) exp(−(y(xi,w) − ti)² / (2σ²)) ⇒
  −log p(ti|xi) = (1/(2σ²)) (y(xi,w) − ti)² + log Z
• y(xi,w) is the network output
• Minimizing −log L is equivalent to minimizing Σ (y(xi,w) − ti)²  (sum-of-squares error)
Backpropagation algorithm
• Minimize the error function by gradient descent: wk+1 = wk − η gradw E
• Iterative gradient calculation by propagating error signals backwards through the network ⇒ backpropagation algorithm
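A minimal backpropagation sketch for one hidden layer of sigmoid units and a sum-of-squares error; the network size, data, and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))             # 8 examples, 3 inputs
T = rng.random(size=(8, 1))             # real-valued targets
W1 = rng.normal(scale=0.5, size=(3, 4)) # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(4, 1)) # hidden -> output weights
eta = 0.5

for epoch in range(1000):
    H = sigmoid(X @ W1)                           # hidden activations
    Y = sigmoid(H @ W2)                           # network output
    delta_out = (Y - T) * Y * (1 - Y)             # error signal at the output
    delta_hid = (delta_out @ W2.T) * H * (1 - H)  # error propagated backwards
    W2 -= eta * H.T @ delta_out                   # gradient descent step
    W1 -= eta * X.T @ delta_hid

print(np.mean((Y - T) ** 2))                      # error decreases with training
```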
Backpropagation algorithm
Problem: how to set the learning rate η?
Better: use more advanced minimization algorithms (second-order information).

[Figure: contour plots of gradient-descent trajectories for different learning rates]
Figures by MIT OCW.
Backpropagation algorithm
• Classification: cross-entropy error
• Regression: sum-of-squares error
Overfitting

[Figure: the real distribution vs. an overfitted model]
Improving generalization

Problem: memorizing (x, t) combinations (“overtraining”)

[Table: training pairs of inputs and targets, e.g. (0.7, 0.5) → 0, (0.9, −0.5) → 1, (0.3, 0.6) → 1, and a new query (0.5, −0.2) → ? that memorization cannot answer]
Improving generalization
• Need a test set to judge performance
• Goal: represent the information in the data set, not the noise
• How to improve generalization?
  – Limit network topology
  – Early stopping
  – Weight decay
Limit network topology
• Idea: fewer weights ⇒ less flexibility
• Analogy to polynomial interpolation:
Limit network topology
Early Stopping

[Figure: left — CHD vs. age, an overfitted model vs. the “real” model; right — error vs. training cycles for the training and holdout sets, with the holdout curve eventually rising]
Early stopping
• Idea: stop training when the information (but not the noise) is modeled
• Need a hold-out set to determine when to stop training
Overfitting

[Figure: total sum of squares (tss) vs. epochs; tss b on the training set (b) keeps decreasing while tss a on the hold-out set (a) turns upward — the overfitted model; stopping criterion: min(Δtss) on the hold-out set]
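One common way to implement this stopping criterion is to track the hold-out error each epoch and keep the weights from the best epoch; train_step and holdout_error below are illustrative stand-ins, not slide code:

```python
import numpy as np

def train_step(w):
    return w - 0.01 * np.random.randn(*w.shape)   # stand-in for one training epoch

def holdout_error(w):
    return float(np.sum(w ** 2))                  # stand-in hold-out error measure

w = np.random.randn(5)
best_w, best_err = w.copy(), holdout_error(w)
patience, bad_epochs = 10, 0

for epoch in range(1000):
    w = train_step(w)
    err = holdout_error(w)
    if err < best_err:
        best_w, best_err, bad_epochs = w.copy(), err, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # hold-out error stopped improving: stop early
            break

print(best_err)
```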
Early stopping
Weight decay
• Idea: control smoothness of the network output by controlling the size of the weights
• Add a term α‖w‖² to the error function
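Adding α‖w‖² to the error adds 2αw to its gradient, so weight decay is a one-line change to gradient descent; the error function here is an illustrative stand-in:

```python
import numpy as np

alpha = 0.01   # weight-decay coefficient

def grad_E(w):
    # Gradient of a toy error E(w) = (w0 - 1)^2 + (w1 + 2)^2.
    return np.array([2 * (w[0] - 1), 2 * (w[1] + 2)])

w = np.array([3.0, 3.0])
eta = 0.1
for _ in range(200):
    total_grad = grad_E(w) + 2 * alpha * w   # decay term pulls w toward 0
    w = w - eta * total_grad

print(w)   # slightly shrunk relative to the unpenalized minimum (1, -2)
```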
Bayesian perspective
• Error function minimization corresponds to the maximum likelihood (ML) estimate: a single best solution wML
• Can lead to overtraining
• Bayesian approach: consider the weight posterior distribution p(w|D)
Bayesian perspective
• Posterior = likelihood × prior
• p(w|D) = p(D|w) p(w) / p(D)
• Two approaches to approximating p(w|D):
  – Sampling
  – Gaussian approximation
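A toy, one-dimensional illustration of posterior = likelihood × prior on a grid (the slides show this pictorially; the data and prior here are assumptions):

```python
import numpy as np

w = np.linspace(-3, 3, 601)
prior = np.exp(-0.5 * w ** 2)                 # Gaussian prior on the weight w
data = np.array([0.8, 1.1, 0.9])              # observations with mean ~ w
likelihood = np.exp(-0.5 * np.sum((data[:, None] - w) ** 2, axis=0))

posterior = likelihood * prior                # posterior = likelihood * prior
posterior /= np.trapz(posterior, w)           # normalize (the 1/p(D) factor)
print(w[np.argmax(posterior)])                # posterior mode, pulled toward 0
```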
Sampling from p(w|D)

[Figure: prior and likelihood densities over w]
Sampling from p(w|D)
prior * likelihood = posterior
Bayesian example for regression
Bayesian example for classification
Model Features (with strong personal biases)

Model                   Modeling Effort   Examples Needed   Explanation
Rule-based Exp. Syst.   high              low               high?
Classification Trees    low               high+             “high”
Neural Nets, SVM        low               high              low
Regression Models       high              moderate          moderate
Learned Bayesian Nets   low               high+             high (beautiful when it works)
Regression vs. Neural Networks

Regression: Y = a(X1) + b(X2) + c(X3) + d(X1X2) + … with explicit interaction terms X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3 — (2³ − 1) possible combinations.

[Figure: a neural network with inputs X1, X2, X3 whose hidden units play the role of the terms “X1”, “X2”, “X3”, “X1X2X3”]
Summary
• ANNs inspired by the functionality of the brain
• Nonlinear data model
• Trained by minimizing an error function
• Goal is to generalize well
• Avoid overtraining
• Distinguish ML and MAP solutions
Some References

Introductory and Historical Textbooks
• Rumelhart DE, McClelland JL (eds). Parallel Distributed Processing. MIT Press, Cambridge, 1986.
• Hertz JA, Palmer RG, Krogh AS. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991.
• Pao YH. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, 1989.
• Bishop CM. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.