Artificial Neural Networks
Machine Learning Course, Master's Degree in Computer Science
Nicola Fanizzi
Dipartimento di Informatica, Università degli Studi di Bari
November 9, 2009
Outline
- Multilayer networks
- BACKPROPAGATION
- Hidden layer representations
- Example: Face Recognition
- Advanced topics
Limitations of the Linear Models

Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. they cannot learn XOR.

- Linear models (LMs) provide powerful gradient descent methods for reducing the error, even when the patterns are not linearly separable.
- Unfortunately, LMs are not general enough for applications in which linear discriminants are insufficient for minimum error.
- With a clever choice of nonlinear φ functions one can obtain arbitrary decision regions leading to minimum error:
  - choosing a complete basis set (e.g. polynomials) gives a classifier with too many free parameters to be determined from a limited number of training patterns;
  - prior knowledge relevant to the classification problem can be exploited to guide the choice of nonlinearity.
Connectionist Models
Consider humans:
- neuron switching time ∼ 0.001 second
- number of neurons ∼ 10^10
- connections per neuron ∼ 10^4 to 10^5
- scene recognition time ∼ 0.1 second
- 100 inference steps doesn't seem like enough
→ much parallel computation

Properties of artificial neural nets (ANNs):
- many neuron-like threshold switching units
- many weighted interconnections among units
- highly parallel, distributed processing
- emphasis on tuning weights automatically
When to Consider Neural Networks
- Input is high-dimensional, discrete or real-valued (e.g. raw sensor input)
- Output is discrete or real-valued
- Output is a vector of values
- Possibly noisy data
- Form of target function is unknown
- Human readability of the result is unimportant

Examples:
- speech phoneme recognition [Waibel]
- image classification [Kanade, Baluja, Rowley]
- financial prediction
Application
Nonlinear decision surface: learning how to predict vowels in the context h.d
Input: numeric features from spectral analysis of the sound
Multilayer ANN

- Networks of perceptron-like units can be created to approximate arbitrary target concepts.
- Such a multilayer artificial neural network consists of an input layer, hidden layer(s), and an output layer.
- The topological structure is usually found by experimentation.
- The parameters can be found using BACKPROPAGATION.
- In analogy with neurobiology, connections are sometimes called synapses and the values of the connections the synaptic weights.
Multilayer Network Structure

[Figure: structure of a multilayer feed-forward network]
Feed-forward Operation I
Input Layer: each input vector is presented to the input units, whose output equals the corresponding components.

Hidden Layer: each hidden unit computes the weighted sum of its inputs to form its (scalar) net activation, the inner product of the inputs with the weights at the hidden unit:

$$net_j = \sum_{i=0}^{d} x_i w_{ji} = \vec{w}_j^{\,t}\,\vec{x}$$

Each hidden unit emits an output that is a nonlinear function (transfer function) of its activation:

$$y_j = f(net_j)$$

Example: a simple threshold or sign function

$$f(net) = \operatorname{sgn}(net) = \begin{cases} +1 & net \ge 0 \\ -1 & net < 0 \end{cases}$$
Feed-forward Operation II
Output Layer: each output unit computes its net activation based on the hidden unit signals:

$$net_k = \sum_{j=0}^{n_H} y_j w_{kj} = \vec{w}_k^{\,t}\,\vec{y}$$

Each output unit then computes the nonlinear function of its net activation, emitting

$$z_k = f(net_k)$$

Typically there are c output units, and the classification is decided by the label corresponding to the maximum $z_k = g_k(\vec{x})$.
Feed-forward Operation III
General discriminant functions:

$$g_k(\vec{x}) = z_k = f\left(\sum_{j=1}^{n_H} w_{kj}\, f\left(\sum_{i=1}^{d} w_{ji} x_i + w_{j0}\right) + w_{k0}\right)$$

This is the class of functions that can be implemented by a three-layer neural network (see the sketch below).

Broader generalizations:
1. transfer functions at the output layer different from those at the hidden layer
2. different functions at each individual unit
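A minimal NumPy sketch of this feed-forward computation; the sigmoid transfer function and the bias-in-column-0 weight layout are illustrative assumptions, not fixed by the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W_hidden, W_output, f=sigmoid):
    """Three-layer feed-forward pass.
    W_hidden: (n_H, d+1), W_output: (c, n_H+1);
    column 0 of each holds the bias weights w_j0 / w_k0."""
    xb = np.concatenate(([1.0], x))   # bias input x_0 = 1
    y = f(W_hidden @ xb)              # hidden outputs y_j = f(net_j)
    yb = np.concatenate(([1.0], y))   # bias unit y_0 = 1
    return f(W_output @ yb)           # outputs z_k = f(net_k) = g_k(x)

# classify by the maximum output
rng = np.random.default_rng(0)
d, n_H, c = 4, 3, 2
W_h = rng.normal(scale=0.1, size=(n_H, d + 1))
W_o = rng.normal(scale=0.1, size=(c, n_H + 1))
print(forward(rng.normal(size=d), W_h, W_o).argmax())
```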
Expressive Capabilities of ANNs

Boolean functions:
- Every boolean function can be represented by a network with a single hidden layer,
- but it might require a number of hidden units exponential in the number of inputs.

Continuous functions:
- Kolmogorov: any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units, proper nonlinearities, and weights.
- Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
Sigmoid Unit I
How to learn the weights given the network structure?
- Cannot simply use the perceptron learning rule, because we have hidden layer(s).
- The function we are trying to minimize: the error.
- Can use gradient descent.
- Need a differentiable activation function: use the sigmoid function instead of the threshold function

$$f(x) = \frac{1}{1 + \exp(-x)}$$

- Need a differentiable error function: can't use the zero-one loss, but can use the squared error

$$E(x) = \frac{1}{2}\,(y - f(x))^2$$
Sigmoid Unit II
σ(x) is the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Sigmoid Unit III
Nice property:

$$\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$$
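A quick numerical sanity check of this identity (an illustrative sketch, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
analytic = sigmoid(x) * (1.0 - sigmoid(x))             # sigma(x)(1 - sigma(x))
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
print(np.max(np.abs(analytic - numeric)))              # prints a value near zero
```

This property is what makes the BACKPROPAGATION updates cheap to compute: f'(net) is obtained directly from the already-computed activation f(net).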
Multilayer Networks
We can derive gradient descent rules to train multilayer networks of (sigmoid) units → BACKPROPAGATION

Multiple outputs → new error expression:

$$E[\vec{w}] = \frac{1}{2}\sum_k (t_k - z_k)^2$$
Criterion Function and Gradient Descent
Squared error:

$$J[\vec{w}] = \frac{1}{2}\sum_{k=1}^{c} e_k^2 = \frac{1}{2}\sum_{k=1}^{c}(t_k - z_k)^2 = \frac{1}{2}\,(\vec{t} - \vec{z})^2$$

where $\vec{t}$ and $\vec{z}$ represent the target and the network output (length = c).

Gradient descent: the weights are initialized with random values, then changed in a direction that will reduce the error, $\Delta\vec{w} = -\eta\,\partial J/\partial\vec{w}$, that is:

$$\Delta w_{qp} = -\eta\,\frac{\partial J}{\partial w_{qp}}$$
Fitting the Weights I
Iterative update: $\vec{w}(m+1) = \vec{w}(m) + \Delta\vec{w}(m)$, where m indexes the particular input example:

$$\underbrace{\Delta w_{qp}(m)}_{\text{weight correction}} = \underbrace{\eta}_{\text{learning rate}} \times \underbrace{\delta_q(m)}_{\text{local gradient at unit } q} \times \underbrace{x_p(m)}_{\text{input}}$$

Evaluate $\Delta w_{qp} = -\eta\,\partial J/\partial w_{qp}$:
- for output units
- for hidden units

We can transform $\partial J/\partial w_{qp}$ using the chain rule:

$$\frac{\partial J}{\partial w_{qp}} = \frac{\partial J}{\partial e_q}\,\frac{\partial e_q}{\partial f(net_q)}\,\frac{\partial f(net_q)}{\partial net_q}\,\frac{\partial net_q}{\partial w_{qp}}$$
Fitting the Weights II
$$\frac{\partial J}{\partial e_q} = \frac{\partial}{\partial e_q}\left(\frac{1}{2}\sum_{k=1}^{c} e_k^2\right) = e_q$$

$$\frac{\partial e_q}{\partial f(net_q)} = \frac{\partial (t_q - f(net_q))}{\partial f(net_q)} = -1$$

$$\frac{\partial f(net_q)}{\partial net_q} = f'(net_q)$$

$$\frac{\partial net_q}{\partial w_{qp}} = f(net_p) = x_p$$
Fitting the Weights III

Hence:

$$\frac{\partial J}{\partial w_{qp}} = -e_q\, f'(net_q)\, x_p$$

Then the correction to be applied is defined by the delta rule:

$$\Delta w_{qp} = -\eta\,\frac{\partial J}{\partial w_{qp}}$$

If we define the local gradient

$$\delta_q = -\frac{\partial J}{\partial net_q} = -\frac{\partial J}{\partial e_q}\,\frac{\partial e_q}{\partial f(net_q)}\,\frac{\partial f(net_q)}{\partial net_q} = e_q\, f'(net_q)$$

the delta rule becomes:

$$\Delta w_{qp} = \eta\,\delta_q\, x_p$$
Fitting the Weights IV

1. Hidden-to-output weights: the error is not explicitly dependent upon $w_{kj}$, so use the chain rule for differentiation:

$$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k}\,\frac{\partial net_k}{\partial w_{kj}}$$

First term: the (negative) local gradient, a.k.a. error or sensitivity, of unit k:

$$\delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k}\,\frac{\partial z_k}{\partial net_k} = (t_k - z_k)\, f'(net_k)$$

Second term:

$$\frac{\partial net_k}{\partial w_{kj}} = y_j$$

Summing up, the weight update is:

$$\Delta w_{kj} = \eta\,\delta_k\, y_j = \eta\,(t_k - z_k)\, f'(net_k)\, y_j$$
Fitting the Weights V

2. Input-to-hidden weights: the credit assignment problem

$$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j}\,\frac{\partial y_j}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}}$$

First term:

$$\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j}\left[\frac{1}{2}\sum_{k=1}^{c}(t_k - z_k)^2\right] = -\sum_{k=1}^{c}(t_k - z_k)\,\frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c}(t_k - z_k)\,\frac{\partial z_k}{\partial net_k}\,\frac{\partial net_k}{\partial y_j} = -\sum_{k=1}^{c}(t_k - z_k)\, f'(net_k)\, w_{kj}$$

Second term: $\frac{\partial y_j}{\partial net_j} = f'(net_j)$; this suggests defining $\delta_j = f'(net_j)\sum_{k=1}^{c} w_{kj}\,\delta_k$

Third term:

$$\frac{\partial net_j}{\partial w_{ji}} = x_i$$

Summing up, the weight update is:

$$\Delta w_{ji} = \eta\,\delta_j\, x_i = \eta\, x_i\, f'(net_j)\sum_{k=1}^{c} w_{kj}\,\delta_k$$
Learning Algorithm
BACKPROPAGATION
Initialize weights $w_{ji}$, $w_{kj}$; stopping criterion θ; learning rate η; m ← 0
do
    m ← m + 1
    Input the training example x(m) to the network and compute the outputs $z_k$
    for each output unit k: compute $\delta_k$; $w_{kj} \leftarrow w_{kj} + \eta\,\delta_k\, y_j$
    for each hidden unit j: compute $\delta_j$; $w_{ji} \leftarrow w_{ji} + \eta\,\delta_j\, x_i$
until $\|\nabla J\| < \theta$
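A runnable NumPy sketch of this algorithm for a single hidden layer (an illustration under assumptions: sigmoid units, bias weights stored in column 0, stochastic updates):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_epoch(X, T, W_h, W_o, eta=0.5):
    """One stochastic-BACKPROPAGATION pass over the training set.
    X: (n, d) inputs, T: (n, c) targets;
    W_h: (n_H, d+1), W_o: (c, n_H+1), biases in column 0."""
    for x, t in zip(X, T):
        # forward pass
        xb = np.concatenate(([1.0], x))
        y = sigmoid(W_h @ xb)                  # hidden outputs y_j
        yb = np.concatenate(([1.0], y))
        z = sigmoid(W_o @ yb)                  # network outputs z_k
        # local gradients; for the sigmoid, f'(net) = f(net)(1 - f(net))
        delta_k = (t - z) * z * (1.0 - z)
        delta_j = y * (1.0 - y) * (W_o[:, 1:].T @ delta_k)
        # delta-rule updates
        W_o += eta * np.outer(delta_k, yb)     # w_kj += eta delta_k y_j
        W_h += eta * np.outer(delta_j, xb)     # w_ji += eta delta_j x_i
    return W_h, W_o

# e.g. the XOR problem that a single linear unit cannot solve:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.5, size=(2, 3))
W_o = rng.normal(scale=0.5, size=(1, 3))
for _ in range(5000):                          # may need restarts: local minima
    W_h, W_o = backprop_epoch(X, T, W_h, W_o)
```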
Stopping Condition in BACKPROPAGATION
- Error over the training examples falling below some threshold
- Error over a separate validation set meeting some criterion
- ...

Warning:
- too few iterations → fail to reduce the error
- too many iterations → overfitting
More on BACKPROPAGATION
Gradient descent over the entire network weight vector.

1. Often include a weight momentum α (see the sketch after this list):

$$\Delta w_{ij}(n) = \eta\,\delta_j\, x_{ij} + \alpha\,\Delta w_{ij}(n-1)$$

2. Easily generalized to arbitrary directed graphs:

$$\delta_r = o_r(1 - o_r)\sum_{s \in \mathrm{downstream}(r)} w_{sr}\,\delta_s$$

3. Will find a local, not necessarily global, error minimum; in practice it often works well (can run multiple times).
4. Minimizes error over the training examples; will it generalize well to subsequent examples?
5. Training can take thousands of iterations → slow!
6. Using the network after training is very fast.
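A minimal sketch of the momentum update in item 1; `grad_step` stands in for the current gradient term η δ_j x_ij (the names are illustrative):

```python
import numpy as np

def momentum_step(W, grad_step, prev_delta, alpha=0.9):
    """Delta w(n) = grad_step + alpha * Delta w(n-1)."""
    delta = grad_step + alpha * prev_delta
    return W + delta, delta              # carry delta over to the next call

# inside a training loop, prev_delta starts at zero
W = np.zeros((3, 4))
prev_delta = np.zeros_like(W)
grad_step = 0.1 * np.ones_like(W)        # placeholder for eta * delta_j * x_ij
W, prev_delta = momentum_step(W, grad_step, prev_delta)
```

Momentum keeps the update moving through flat regions of the error surface and damps oscillations across narrow ravines.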
Learning Hidden Layer Representations I
Given an ANN with 8 inputs, 3 hidden units, and 8 outputs:
Learning Hidden Layer Representations II
A target function (identity):

Input    →  Output
10000000 →  10000000
01000000 →  01000000
00100000 →  00100000
00010000 →  00010000
00001000 →  00001000
00000100 →  00000100
00000010 →  00000010
00000001 →  00000001

Can this be learned?
Learning Hidden Layer Representations III
Learned hidden layer representation (after 5000 epochs):

Input    →  Hidden Values  →  Output
10000000 →  .89 .04 .08    →  10000000
01000000 →  .01 .11 .88    →  01000000
00100000 →  .01 .97 .27    →  00100000
00010000 →  .99 .97 .71    →  00010000
00001000 →  .03 .05 .02    →  00001000
00000100 →  .22 .99 .99    →  00000100
00000010 →  .80 .01 .98    →  00000010
00000001 →  .60 .94 .01    →  00000001

Rounding the hidden values to 0 or 1 yields a 3-bit encoding of the eight distinct values.

→ The network can learn/invent new features!
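This experiment can be reproduced with the backprop_epoch sketch from the learning-algorithm section (illustrative; the initialization scale and learning rate are arbitrary choices):

```python
import numpy as np
# assumes sigmoid and backprop_epoch from the BACKPROPAGATION sketch above

X = np.eye(8)                              # the eight one-hot patterns
T = X.copy()                               # identity target
rng = np.random.default_rng(1)
W_h = rng.normal(scale=0.3, size=(3, 9))   # 8 inputs + bias -> 3 hidden units
W_o = rng.normal(scale=0.3, size=(8, 4))   # 3 hidden + bias -> 8 outputs
for _ in range(5000):
    W_h, W_o = backprop_epoch(X, T, W_h, W_o, eta=0.3)
for x in X:                                # inspect the learned hidden code
    print(np.round(sigmoid(W_h @ np.concatenate(([1.0], x))), 2))
```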
Training I
[Plot: one line per network output]
Training II
[Plot: evolution of the hidden layer representation for output 01000000]
Training III
[Plot: evolution of the weights for one of the three hidden units]
Convergence of BACKPROPAGATION
Gradient descent converges to some local minimum:
- perhaps not the global minimum...
- add momentum
- stochastic gradient descent
- train multiple nets with different initial weights

Nature of convergence:
- initialize weights near zero
- therefore, initial networks are near-linear
- increasingly non-linear functions become possible as training progresses
Remarks I
- Weights can be updated after all training instances have been processed or incrementally: batch learning vs. stochastic backpropagation.
- Weights are initialized to small random values.

How to avoid overfitting?
- Early stopping: use a validation set to check when to stop.
- Weight decay: add a penalty term to the error function.

How to speed up learning?
- Momentum: re-use a proportion of the old weight change (formula below).
- Use an optimization method that employs the 2nd derivative.
Remarks II

Momentum:

$$\vec{w}(m+1) \leftarrow \vec{w}(m) + \underbrace{\Delta\vec{w}(m)}_{\text{gradient descent}} + \underbrace{\alpha\,\Delta\vec{w}(m-1)}_{\text{momentum}}$$
Overfitting in ANNs I
[Plot: better stopping after 9100 iterations]
Overfitting in ANNs II
When to stop? Not always obvious: the error decreases, then increases, then decreases again...
Neural Nets for Face Recognition
Typical input images
90% accuracy in learning head pose and in recognizing 1 of 20 faces
Learned Hidden Unit Weights
http://www.cs.cmu.edu/~tom/faces.html
Alternative Error Functions
Weight decay: penalize large weights

$$E(\vec{w}) \equiv \frac{1}{2}\sum_{d\in D}\;\sum_{k\in outputs}(t_{kd} - o_{kd})^2 + \gamma\sum_{i,j} w_{ji}^2$$

This biases learning against complex decision surfaces (see the sketch below).

Train on target slopes as well as values:

$$E(\vec{w}) \equiv \frac{1}{2}\sum_{d\in D}\;\sum_{k\in outputs}\left[(t_{kd} - o_{kd})^2 + \mu \sum_{j\in inputs}\left(\frac{\partial t_{kd}}{\partial x_{d}^{j}} - \frac{\partial o_{kd}}{\partial x_{d}^{j}}\right)^2\right]$$

Tie together weights: e.g., in a phoneme recognition network.
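A sketch of how the weight-decay penalty enters the gradient step (illustrative; `grad_E` stands in for the gradient of the unpenalized error):

```python
import numpy as np

def weight_decay_step(W, grad_E, eta=0.1, gamma=1e-3):
    """Gradient step on E + gamma * sum(w_ji^2): the penalty contributes
    2*gamma*W, shrinking every weight toward zero at each update."""
    return W - eta * (grad_E + 2.0 * gamma * W)
```

Equivalently, each step multiplies the weights by (1 − 2ηγ) before applying the error gradient, which is where the name "weight decay" comes from.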
Recurrent Networks

From acyclic graphs to...
Recurrent network: an output (at time t) can be input for nodes at previous layers (at time t + 1)
- applies to time series

Learning algorithm: unfold in time + BACKPROPAGATION [Mozer, 1995] (see the sketch below)
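A minimal sketch of the unfolding idea: one copy of the recurrent layer per time step turns the cyclic graph into an ordinary feed-forward one (the weight shapes and the sigmoid choice are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def unfolded_forward(xs, W_in, W_rec, W_out):
    """Forward pass of a recurrent net unfolded over the sequence xs.
    The same weights are reused at every step; BACKPROPAGATION through
    the unfolded graph sums each weight's gradient over all steps."""
    h = np.zeros(W_rec.shape[0])          # initial hidden state
    outputs = []
    for x in xs:                          # one 'layer' per time step
        h = sigmoid(W_in @ x + W_rec @ h)
        outputs.append(sigmoid(W_out @ h))
    return outputs

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))              # a length-5 sequence of 3-d inputs
W_in = rng.normal(scale=0.5, size=(4, 3))
W_rec = rng.normal(scale=0.5, size=(4, 4))
W_out = rng.normal(scale=0.5, size=(2, 4))
ys = unfolded_forward(xs, W_in, W_rec, W_out)
```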
Dynamically Modifying Network Structure
CASCADE-CORRELATION [Fahlman & Lebiere, 1990]
- start from an ANN without hidden layer nodes
- while there is residual error, add hidden layer nodes, maximizing the correlation between the new hidden unit's output and the residual error

"Optimal brain damage" [LeCun, 1990]: the opposite strategy
- start with a complex ANN
- prune connections that are inessential (see the sketch below), e.g.:
  - weight close to 0
  - study the effect of variations of the weight on the error
- until a termination condition based on the error is met
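A sketch of the simplest pruning criterion above, removing connections whose weight is close to 0 (the threshold is an arbitrary illustrative choice):

```python
import numpy as np

def prune_small_weights(W, threshold=1e-2):
    """Zero out connections with |w| below the threshold; the mask can be
    used to keep pruned weights frozen during further training."""
    mask = np.abs(W) >= threshold
    return W * mask, mask

W = np.array([[0.8, -0.003], [0.001, -1.2]])
W_pruned, mask = prune_small_weights(W)   # removes the two near-zero weights
```

Optimal brain damage proper estimates each weight's effect on the error with second-derivative information rather than using the magnitude alone.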
Radial Basis Function Networks
Radial Basis Function Networks (RBF networks): another type of feed-forward network with 3 layers
- Hidden units represent points in instance space; activation depends on distance
- To this end, distance is converted into similarity: Gaussian activation function f
- The width may be different for each hidden unit
- Points of equal activation form a hypersphere (or hyperellipsoid), as opposed to a hyperplane
- Output layer same as in multilayer feed-forward networks
Learning Radial Basis Function Networks
Parameters: centers and widths of the RBFs + weights in the output layer
- The two sets of parameters can be learned independently and still give accurate models:
  - e.g., clusters from k-means can be used to form the basis functions
  - a linear model can then be fit on top of the fixed RBFs
  - this makes learning RBF networks very efficient (see the sketch below)

Disadvantage: no built-in attribute weighting based on relevance
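A sketch of that two-stage recipe on a toy regression problem (illustrative choices throughout: scikit-learn's KMeans for the centers, one shared width, least squares for the linear output layer):

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_features(X, centers, width):
    """Gaussian activations, one feature per center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = np.sin(X[:, 0]) + X[:, 1] ** 2                  # toy target

# stage 1: fix the basis functions with k-means
centers = KMeans(n_clusters=10, n_init=10).fit(X).cluster_centers_
width = 1.0                                          # one shared width

# stage 2: fit the linear output layer by least squares
Phi = np.hstack([np.ones((len(X), 1)), rbf_features(X, centers, width)])
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
pred = Phi @ w
```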
Example: Reduced Coulomb Energy Nets I
[Figure: RCE network architecture; input units x_1 ... x_d, pattern units λ_1 ... λ_k, category units z_1 ... z_c]
Example: Reduced Coulomb Energy Nets II
[Figure: binary case]
Credits
- R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley
- T. M. Mitchell: Machine Learning, McGraw Hill
- I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann