Artificial Neural Networks
Machine Learning Course, Master's Degree in Computer Science
Nicola Fanizzi
Dipartimento di Informatica, Università degli Studi di Bari
November 9, 2009
Outline
- Multilayer networks
- BACKPROPAGATION
- Hidden layer representations
- Example: Face Recognition
- Advanced topics
Limitations of the Linear Models

Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. they cannot learn XOR.

- Linear models (LMs) provide powerful gradient descent methods for reducing the error, even when the patterns are not linearly separable.
- Unfortunately, LMs are not general enough for applications in which linear discriminants are insufficient for minimum error.
- With a clever choice of nonlinear φ functions one can obtain arbitrary decision regions leading to minimum error:
  - choosing a complete basis set (e.g. polynomials) gives a classifier with too many free parameters to be determined from a limited number of training patterns;
  - prior knowledge relevant to the classification problem can be exploited to guide the choice of nonlinearity.
Connectionist Models
Consider humans:
- neuron switching time ∼ 0.001 second
- number of neurons ∼ 10^10
- connections per neuron ∼ 10^4 to 10^5
- scene recognition time ∼ 0.1 second
- 100 inference steps doesn't seem like enough
→ much parallel computation

Properties of artificial neural nets (ANNs):
- many neuron-like threshold switching units
- many weighted interconnections among units
- highly parallel, distributed processing
- emphasis on tuning weights automatically
When to Consider Neural Networks
- Input is high-dimensional, discrete or real-valued (e.g. raw sensor input)
- Output is discrete or real-valued
- Output is a vector of values
- Possibly noisy data
- Form of target function is unknown
- Human readability of the result is unimportant

Examples:
- speech phoneme recognition [Waibel]
- image classification [Kanade, Baluja, Rowley]
- financial prediction
Application
Nonlinear decision surface: learning how to predict vowels in the context h.d
Input: numeric features from spectral analysis of the sound
Multilayer ANN

- Networks of perceptron-like units can be created to approximate arbitrary target concepts.
- Such a multilayer artificial neural network consists of an input layer, hidden layer(s), and an output layer.
- The topological structure is usually found by experimentation.
- The parameters can be found using BACKPROPAGATION.
- In analogy with neurobiology, connections are sometimes called synapses and the values of the connections the synaptic weights.
Multilayer Network Structure

[Figure: structure of a multilayer feed-forward network]
Feed-forward Operation I
Input Layer: each input vector is presented to the input units, whose output equals the corresponding components.

Hidden Layer: each hidden unit computes the weighted sum of its inputs to form its (scalar) net activation, the inner product of the inputs with the weights at the hidden unit:

$$net_j = \sum_{i=0}^{d} x_i w_{ji} = \vec{w}_j^{\,t}\,\vec{x}$$

Each hidden unit emits an output that is a nonlinear function (transfer function) of its activation:

$$y_j = f(net_j)$$

Example: a simple threshold or sign function

$$f(net) = \operatorname{sgn}(net) = \begin{cases} +1 & net \ge 0 \\ -1 & net < 0 \end{cases}$$
Feed-forward Operation II
Output Layer: each output unit computes its net activation based on the hidden unit signals:

$$net_k = \sum_{j=0}^{n_H} y_j w_{kj} = \vec{w}_k^{\,t}\,\vec{y}$$

Each output unit then computes the nonlinear function of its net activation, emitting

$$z_k = f(net_k)$$

Typically there are c output units, and the classification is decided by the label corresponding to the maximum $z_k = g_k(\vec{x})$.
Feed-forward Operation III
General discriminant functions:

$$g_k(\vec{x}) = z_k = f\left(\sum_{j=1}^{n_H} w_{kj}\, f\left(\sum_{i=1}^{d} w_{ji} x_i + w_{j0}\right) + w_{k0}\right)$$

This is the class of functions that can be implemented by a three-layer neural network (see the sketch below).

Broader generalizations:
1. transfer functions at the output layer different from those at the hidden layer
2. different functions at each individual unit
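A minimal NumPy sketch of this feed-forward computation; the sigmoid transfer function and the bias-in-column-0 weight layout are illustrative assumptions, not fixed by the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W_hidden, W_output, f=sigmoid):
    """Three-layer feed-forward pass.
    W_hidden: (n_H, d+1), W_output: (c, n_H+1);
    column 0 of each holds the bias weights w_j0 / w_k0."""
    xb = np.concatenate(([1.0], x))   # bias input x_0 = 1
    y = f(W_hidden @ xb)              # hidden outputs y_j = f(net_j)
    yb = np.concatenate(([1.0], y))   # bias unit y_0 = 1
    return f(W_output @ yb)           # outputs z_k = f(net_k) = g_k(x)

# classify by the maximum output
rng = np.random.default_rng(0)
d, n_H, c = 4, 3, 2
W_h = rng.normal(scale=0.1, size=(n_H, d + 1))
W_o = rng.normal(scale=0.1, size=(c, n_H + 1))
print(forward(rng.normal(size=d), W_h, W_o).argmax())
```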
Expressive Capabilities of ANNs

Boolean functions:
- Every boolean function can be represented by a network with a single hidden layer,
- but it might require a number of hidden units exponential in the number of inputs.

Continuous functions:
- Kolmogorov: any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units, proper nonlinearities, and weights.
- Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
Sigmoid Unit I
How to learn the weights given the network structure?
- Cannot simply use the perceptron learning rule, because we have hidden layer(s).
- The function we are trying to minimize: the error.
- Can use gradient descent.
- Need a differentiable activation function: use the sigmoid function instead of the threshold function

$$f(x) = \frac{1}{1 + \exp(-x)}$$

- Need a differentiable error function: can't use the zero-one loss, but can use the squared error

$$E(x) = \frac{1}{2}\,(y - f(x))^2$$
Sigmoid Unit II
σ(x) is the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Sigmoid Unit III
Nice property:

$$\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$$
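A quick numerical sanity check of this identity (an illustrative sketch, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
analytic = sigmoid(x) * (1.0 - sigmoid(x))             # sigma(x)(1 - sigma(x))
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
print(np.max(np.abs(analytic - numeric)))              # prints a value near zero
```

This property is what makes the BACKPROPAGATION updates cheap to compute: f'(net) is obtained directly from the already-computed activation f(net).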
Multilayer Networks
We can derive gradient descent rules to train multilayer networks of (sigmoid) units → BACKPROPAGATION

Multiple outputs → new error expression:

$$E[\vec{w}] = \frac{1}{2}\sum_k (t_k - z_k)^2$$
Criterion Function and Gradient Descent
Squared error:

$$J[\vec{w}] = \frac{1}{2}\sum_{k=1}^{c} e_k^2 = \frac{1}{2}\sum_{k=1}^{c}(t_k - z_k)^2 = \frac{1}{2}\,(\vec{t} - \vec{z})^2$$

where $\vec{t}$ and $\vec{z}$ represent the target and the network output (length = c).

Gradient descent: the weights are initialized with random values, then changed in a direction that will reduce the error, $\Delta\vec{w} = -\eta\,\partial J/\partial\vec{w}$, that is:

$$\Delta w_{qp} = -\eta\,\frac{\partial J}{\partial w_{qp}}$$
Fitting the Weights I
Iterative update: $\vec{w}(m+1) = \vec{w}(m) + \Delta\vec{w}(m)$, where m indexes the particular input example:

$$\underbrace{\Delta w_{qp}(m)}_{\text{weight correction}} = \underbrace{\eta}_{\text{learning rate}} \times \underbrace{\delta_q(m)}_{\text{local gradient at unit } q} \times \underbrace{x_p(m)}_{\text{input}}$$

Evaluate $\Delta w_{qp} = -\eta\,\partial J/\partial w_{qp}$:
- for output units
- for hidden units

We can transform $\partial J/\partial w_{qp}$ using the chain rule:

$$\frac{\partial J}{\partial w_{qp}} = \frac{\partial J}{\partial e_q}\,\frac{\partial e_q}{\partial f(net_q)}\,\frac{\partial f(net_q)}{\partial net_q}\,\frac{\partial net_q}{\partial w_{qp}}$$
Fitting the Weights II
$$\frac{\partial J}{\partial e_q} = \frac{\partial}{\partial e_q}\left(\frac{1}{2}\sum_{k=1}^{c} e_k^2\right) = e_q$$

$$\frac{\partial e_q}{\partial f(net_q)} = \frac{\partial (t_q - f(net_q))}{\partial f(net_q)} = -1$$

$$\frac{\partial f(net_q)}{\partial net_q} = f'(net_q)$$

$$\frac{\partial net_q}{\partial w_{qp}} = f(net_p) = x_p$$
Fitting the Weights III

Hence:

$$\frac{\partial J}{\partial w_{qp}} = -e_q\, f'(net_q)\, x_p$$

Then the correction to be applied is defined by the delta rule:

$$\Delta w_{qp} = -\eta\,\frac{\partial J}{\partial w_{qp}}$$

If we define the local gradient

$$\delta_q = -\frac{\partial J}{\partial net_q} = -\frac{\partial J}{\partial e_q}\,\frac{\partial e_q}{\partial f(net_q)}\,\frac{\partial f(net_q)}{\partial net_q} = e_q\, f'(net_q)$$

the delta rule becomes:

$$\Delta w_{qp} = \eta\,\delta_q\, x_p$$
Fitting the Weights IV

1. Hidden-to-output weights: the error is not explicitly dependent upon $w_{kj}$, so use the chain rule for differentiation:

$$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k}\,\frac{\partial net_k}{\partial w_{kj}}$$

First term: the (negative) local gradient, a.k.a. error or sensitivity, of unit k:

$$\delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k}\,\frac{\partial z_k}{\partial net_k} = (t_k - z_k)\, f'(net_k)$$

Second term:

$$\frac{\partial net_k}{\partial w_{kj}} = y_j$$

Summing up, the weight update is:

$$\Delta w_{kj} = \eta\,\delta_k\, y_j = \eta\,(t_k - z_k)\, f'(net_k)\, y_j$$
Fitting the Weights V

2. Input-to-hidden weights: the credit assignment problem

$$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j}\,\frac{\partial y_j}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}}$$

First term:

$$\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j}\left[\frac{1}{2}\sum_{k=1}^{c}(t_k - z_k)^2\right] = -\sum_{k=1}^{c}(t_k - z_k)\,\frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c}(t_k - z_k)\,\frac{\partial z_k}{\partial net_k}\,\frac{\partial net_k}{\partial y_j} = -\sum_{k=1}^{c}(t_k - z_k)\, f'(net_k)\, w_{kj}$$

Second term: $\frac{\partial y_j}{\partial net_j} = f'(net_j)$; this suggests defining $\delta_j = f'(net_j)\sum_{k=1}^{c} w_{kj}\,\delta_k$

Third term:

$$\frac{\partial net_j}{\partial w_{ji}} = x_i$$

Summing up, the weight update is:

$$\Delta w_{ji} = \eta\,\delta_j\, x_i = \eta\, x_i\, f'(net_j)\sum_{k=1}^{c} w_{kj}\,\delta_k$$
Learning Algorithm
BACKPROPAGATION
Initialize weights $w_{ji}$, $w_{kj}$; stopping criterion θ; learning rate η; m ← 0
do
    m ← m + 1
    Input the training example x(m) to the network and compute the outputs $z_k$
    for each output unit k: compute $\delta_k$; $w_{kj} \leftarrow w_{kj} + \eta\,\delta_k\, y_j$
    for each hidden unit j: compute $\delta_j$; $w_{ji} \leftarrow w_{ji} + \eta\,\delta_j\, x_i$
until $\|\nabla J\| < \theta$
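A runnable NumPy sketch of this algorithm for a single hidden layer (an illustration under assumptions: sigmoid units, bias weights stored in column 0, stochastic updates):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_epoch(X, T, W_h, W_o, eta=0.5):
    """One stochastic-BACKPROPAGATION pass over the training set.
    X: (n, d) inputs, T: (n, c) targets;
    W_h: (n_H, d+1), W_o: (c, n_H+1), biases in column 0."""
    for x, t in zip(X, T):
        # forward pass
        xb = np.concatenate(([1.0], x))
        y = sigmoid(W_h @ xb)                  # hidden outputs y_j
        yb = np.concatenate(([1.0], y))
        z = sigmoid(W_o @ yb)                  # network outputs z_k
        # local gradients; for the sigmoid, f'(net) = f(net)(1 - f(net))
        delta_k = (t - z) * z * (1.0 - z)
        delta_j = y * (1.0 - y) * (W_o[:, 1:].T @ delta_k)
        # delta-rule updates
        W_o += eta * np.outer(delta_k, yb)     # w_kj += eta delta_k y_j
        W_h += eta * np.outer(delta_j, xb)     # w_ji += eta delta_j x_i
    return W_h, W_o

# e.g. the XOR problem that a single linear unit cannot solve:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.5, size=(2, 3))
W_o = rng.normal(scale=0.5, size=(1, 3))
for _ in range(5000):                          # may need restarts: local minima
    W_h, W_o = backprop_epoch(X, T, W_h, W_o)
```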
Stopping Condition in BACKPROPAGATION
- Error over the training examples falling below some threshold
- Error over a separate validation set meeting some criterion
- ...

Warning:
- too few iterations → fail to reduce the error
- too many iterations → overfitting
More on BACKPROPAGATION
Gradient descent over the entire network weight vector.

1. Often include a weight momentum α (see the sketch after this list):

$$\Delta w_{ij}(n) = \eta\,\delta_j\, x_{ij} + \alpha\,\Delta w_{ij}(n-1)$$

2. Easily generalized to arbitrary directed graphs:

$$\delta_r = o_r(1 - o_r)\sum_{s \in \mathrm{downstream}(r)} w_{sr}\,\delta_s$$

3. Will find a local, not necessarily global, error minimum; in practice it often works well (can run multiple times).
4. Minimizes error over the training examples; will it generalize well to subsequent examples?
5. Training can take thousands of iterations → slow!
6. Using the network after training is very fast.
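A minimal sketch of the momentum update in item 1; `grad_step` stands in for the current gradient term η δ_j x_ij (the names are illustrative):

```python
import numpy as np

def momentum_step(W, grad_step, prev_delta, alpha=0.9):
    """Delta w(n) = grad_step + alpha * Delta w(n-1)."""
    delta = grad_step + alpha * prev_delta
    return W + delta, delta              # carry delta over to the next call

# inside a training loop, prev_delta starts at zero
W = np.zeros((3, 4))
prev_delta = np.zeros_like(W)
grad_step = 0.1 * np.ones_like(W)        # placeholder for eta * delta_j * x_ij
W, prev_delta = momentum_step(W, grad_step, prev_delta)
```

Momentum keeps the update moving through flat regions of the error surface and damps oscillations across narrow ravines.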
Learning Hidden Layer Representations I
Given an ANN with 8 inputs, 3 hidden units, and 8 outputs:
Learning Hidden Layer Representations II
A target function (identity):

Input    →  Output
10000000 →  10000000
01000000 →  01000000
00100000 →  00100000
00010000 →  00010000
00001000 →  00001000
00000100 →  00000100
00000010 →  00000010
00000001 →  00000001

Can this be learned?
Learning Hidden Layer Representations III
Learned hidden layer representation (after 5000 epochs):

Input    →  Hidden Values  →  Output
10000000 →  .89 .04 .08    →  10000000
01000000 →  .01 .11 .88    →  01000000
00100000 →  .01 .97 .27    →  00100000
00010000 →  .99 .97 .71    →  00010000
00001000 →  .03 .05 .02    →  00001000
00000100 →  .22 .99 .99    →  00000100
00000010 →  .80 .01 .98    →  00000010
00000001 →  .60 .94 .01    →  00000001

Rounding the hidden values to 0 or 1 yields a 3-bit encoding of the eight distinct values.

→ The network can learn/invent new features!
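This experiment can be reproduced with the backprop_epoch sketch from the learning-algorithm section (illustrative; the initialization scale and learning rate are arbitrary choices):

```python
import numpy as np
# assumes sigmoid and backprop_epoch from the BACKPROPAGATION sketch above

X = np.eye(8)                              # the eight one-hot patterns
T = X.copy()                               # identity target
rng = np.random.default_rng(1)
W_h = rng.normal(scale=0.3, size=(3, 9))   # 8 inputs + bias -> 3 hidden units
W_o = rng.normal(scale=0.3, size=(8, 4))   # 3 hidden + bias -> 8 outputs
for _ in range(5000):
    W_h, W_o = backprop_epoch(X, T, W_h, W_o, eta=0.3)
for x in X:                                # inspect the learned hidden code
    print(np.round(sigmoid(W_h @ np.concatenate(([1.0], x))), 2))
```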
Training I
[Plot: one line per network output]
Training II
[Plot: evolution of the hidden layer representation for output 01000000]
Training III
[Plot: evolution of the weights for one of the three hidden units]
Convergence of BACKPROPAGATION
Gradient descent converges to some local minimum:
- perhaps not the global minimum...
- add momentum
- stochastic gradient descent
- train multiple nets with different initial weights

Nature of convergence:
- initialize weights near zero
- therefore, initial networks are near-linear
- increasingly non-linear functions become possible as training progresses
Remarks I
- Weights can be updated after all training instances have been processed or incrementally: batch learning vs. stochastic backpropagation.
- Weights are initialized to small random values.

How to avoid overfitting?
- Early stopping: use a validation set to check when to stop.
- Weight decay: add a penalty term to the error function.

How to speed up learning?
- Momentum: re-use a proportion of the old weight change (formula below).
- Use an optimization method that employs the 2nd derivative.
Remarks II

Momentum:

$$\vec{w}(m+1) \leftarrow \vec{w}(m) + \underbrace{\Delta\vec{w}(m)}_{\text{gradient descent}} + \underbrace{\alpha\,\Delta\vec{w}(m-1)}_{\text{momentum}}$$
Overfitting in ANNs I
[Plot: better stopping after 9100 iterations]
Overfitting in ANNs II
When to stop? Not always obvious: the error decreases, then increases, then decreases again...
Neural Nets for Face Recognition
Typical input images
90% accuracy in learning head pose and in recognizing 1 of 20 faces
Learned Hidden Unit Weights
http://www.cs.cmu.edu/~tom/faces.html
Alternative Error Functions
Weight decay: penalize large weights

$$E(\vec{w}) \equiv \frac{1}{2}\sum_{d\in D}\;\sum_{k\in outputs}(t_{kd} - o_{kd})^2 + \gamma\sum_{i,j} w_{ji}^2$$

This biases learning against complex decision surfaces (see the sketch below).

Train on target slopes as well as values:

$$E(\vec{w}) \equiv \frac{1}{2}\sum_{d\in D}\;\sum_{k\in outputs}\left[(t_{kd} - o_{kd})^2 + \mu \sum_{j\in inputs}\left(\frac{\partial t_{kd}}{\partial x_{d}^{j}} - \frac{\partial o_{kd}}{\partial x_{d}^{j}}\right)^2\right]$$

Tie together weights: e.g., in a phoneme recognition network.
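A sketch of how the weight-decay penalty enters the gradient step (illustrative; `grad_E` stands in for the gradient of the unpenalized error):

```python
import numpy as np

def weight_decay_step(W, grad_E, eta=0.1, gamma=1e-3):
    """Gradient step on E + gamma * sum(w_ji^2): the penalty contributes
    2*gamma*W, shrinking every weight toward zero at each update."""
    return W - eta * (grad_E + 2.0 * gamma * W)
```

Equivalently, each step multiplies the weights by (1 − 2ηγ) before applying the error gradient, which is where the name "weight decay" comes from.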
Recurrent Networks

From acyclic graphs to...
Recurrent network: an output (at time t) can be input for nodes at previous layers (at time t + 1)
- applies to time series

Learning algorithm: unfold in time + BACKPROPAGATION [Mozer, 1995] (see the sketch below)
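A minimal sketch of the unfolding idea: one copy of the recurrent layer per time step turns the cyclic graph into an ordinary feed-forward one (the weight shapes and the sigmoid choice are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def unfolded_forward(xs, W_in, W_rec, W_out):
    """Forward pass of a recurrent net unfolded over the sequence xs.
    The same weights are reused at every step; BACKPROPAGATION through
    the unfolded graph sums each weight's gradient over all steps."""
    h = np.zeros(W_rec.shape[0])          # initial hidden state
    outputs = []
    for x in xs:                          # one 'layer' per time step
        h = sigmoid(W_in @ x + W_rec @ h)
        outputs.append(sigmoid(W_out @ h))
    return outputs

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))              # a length-5 sequence of 3-d inputs
W_in = rng.normal(scale=0.5, size=(4, 3))
W_rec = rng.normal(scale=0.5, size=(4, 4))
W_out = rng.normal(scale=0.5, size=(2, 4))
ys = unfolded_forward(xs, W_in, W_rec, W_out)
```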
Dynamically Modifying Network Structure
CASCADE-CORRELATION [Fahlman & Lebiere, 1990]
- start from an ANN without hidden layer nodes
- while there is residual error, add hidden layer nodes, maximizing the correlation between the new hidden unit's output and the residual error

"Optimal brain damage" [LeCun, 1990]: the opposite strategy
- start with a complex ANN
- prune connections that are inessential (see the sketch below), e.g.:
  - weight close to 0
  - study the effect of variations of the weight on the error
- until a termination condition based on the error is met
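A sketch of the simplest pruning criterion above, removing connections whose weight is close to 0 (the threshold is an arbitrary illustrative choice):

```python
import numpy as np

def prune_small_weights(W, threshold=1e-2):
    """Zero out connections with |w| below the threshold; the mask can be
    used to keep pruned weights frozen during further training."""
    mask = np.abs(W) >= threshold
    return W * mask, mask

W = np.array([[0.8, -0.003], [0.001, -1.2]])
W_pruned, mask = prune_small_weights(W)   # removes the two near-zero weights
```

Optimal brain damage proper estimates each weight's effect on the error with second-derivative information rather than using the magnitude alone.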
Radial Basis Function Networks
Radial Basis Function Networks (RBF networks): another type of feed-forward network with 3 layers
- Hidden units represent points in instance space; activation depends on distance
- To this end, distance is converted into similarity: Gaussian activation function f
- The width may be different for each hidden unit
- Points of equal activation form a hypersphere (or hyperellipsoid), as opposed to a hyperplane
- Output layer same as in multilayer feed-forward networks
Learning Radial Basis Function Networks
Parameters: centers and widths of the RBFs + weights in the output layer
- The two sets of parameters can be learned independently and still give accurate models:
  - e.g., clusters from k-means can be used to form the basis functions
  - a linear model can then be fit on top of the fixed RBFs
  - this makes learning RBF networks very efficient (see the sketch below)

Disadvantage: no built-in attribute weighting based on relevance
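A sketch of that two-stage recipe on a toy regression problem (illustrative choices throughout: scikit-learn's KMeans for the centers, one shared width, least squares for the linear output layer):

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_features(X, centers, width):
    """Gaussian activations, one feature per center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = np.sin(X[:, 0]) + X[:, 1] ** 2                  # toy target

# stage 1: fix the basis functions with k-means
centers = KMeans(n_clusters=10, n_init=10).fit(X).cluster_centers_
width = 1.0                                          # one shared width

# stage 2: fit the linear output layer by least squares
Phi = np.hstack([np.ones((len(X), 1)), rbf_features(X, centers, width)])
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
pred = Phi @ w
```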
Example: Reduced Coulomb Energy Nets I
[Figure: RCE network architecture; input units x_1 ... x_d, pattern units λ_1 ... λ_k, category units z_1 ... z_c]
Example: Reduced Coulomb Energy Nets II
[Figure: binary case]
Credits
- R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley
- T. M. Mitchell: Machine Learning, McGraw Hill
- I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann