
Artificial Neural Networks

Corso di Apprendimento Automatico, Laurea Magistrale in Informatica

Nicola Fanizzi

Dipartimento di Informatica, Università degli Studi di Bari

November 9, 2009


Outline

Multilayer networks
BACKPROPAGATION
Hidden layer representations
Example: Face Recognition
Advanced topics


Limitations of the Linear Models I

Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. they can't learn XOR

Linear models (LMs) provide powerful gradient descent methods for reducing the error, even when the patterns are not linearly separable
Unfortunately, LMs are not general enough for applications in which linear discriminants are insufficient for minimum error
With a clever choice of nonlinear φ functions one can obtain arbitrary decision regions leading to minimum error

One could choose a complete basis set (e.g. polynomials), but such a classifier would have too many free parameters to be determined from a limited number of training patterns
Instead, prior knowledge relevant to the classification problem should be exploited to guide the choice of nonlinearity


Limitations of the Linear Models II


Connectionist Models

Consider humans:
Neuron switching time ∼ 0.001 second
Number of neurons ∼ 10^10
Connections per neuron ∼ 10^4 – 10^5
Scene recognition time ∼ 0.1 second
100 inference steps doesn't seem like enough

→ much parallel computation

Properties of artificial neural nets (ANNs):
Many neuron-like threshold switching units
Many weighted interconnections among units
Highly parallel, distributed processing
Emphasis on tuning weights automatically


When to Consider Neural Networks

Input is high-dimensional, discrete or real-valued (e.g. raw sensor input)
Output is discrete or real-valued
Output is a vector of values
Possibly noisy data
Form of target function is unknown
Human readability of result is unimportant

Examples:

Speech phoneme recognition [Waibel]
Image classification [Kanade, Baluja, Rowley]
Financial prediction


Application

Nonlinear decision surface

Learning how to predict vowels appearing in the context h_d; input: numeric features from spectral analysis of the sound


Multilayer ANN I

Can create a network of perceptrons to approximate arbitrary target concepts

A multilayer artificial neural network (multilayer perceptron) is one such network

It consists of an input layer, hidden layer(s), and an output layer
The topological structure is usually found by experimentation
The parameters (weights) can be learned with BACKPROPAGATION

In analogy with neurobiology, connections are sometimes called synapses and the values of the connections synaptic weights


Multilayer ANN II


Multilayer ANN III


Multilayer Network Structure


Feed-forward Operation I

Input Layer: each input vector is presented to the input units, whose output equals the corresponding components
Hidden Layer: each hidden unit computes the weighted sum of its inputs to form its (scalar) net activation (inner product of the inputs with the weights at the hidden unit):

net_j = \sum_{i=0}^{d} x_i w_{ji} = \vec{w}_j^{\,t} \vec{x}

Each hidden unit emits an output that is a nonlinear function (transfer function) of its activation:

y_j = f(net_j)

Example: a simple threshold or sign function

f(net) = \mathrm{sgn}(net) = \begin{cases} +1 & net \geq 0 \\ -1 & net < 0 \end{cases}


Feed-forward Operation II

Output Layer: each output unit computes its net activation based on the hidden unit signals:

net_k = \sum_{j=0}^{n_H} y_j w_{kj} = \vec{w}_k^{\,t} \vec{y}

Each output unit then computes the nonlinear function of its net, emitting

z_k = f(net_k)

Typically c output units are given, and the classification is decided by the label corresponding to the maximum z_k = g_k(\vec{x})


Feed-forward Operation III

General discriminant functions:

g_k(\vec{x}) = z_k = f\!\left( \sum_{j=1}^{n_H} w_{kj}\, f\!\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right)

This is the class of functions that can be implemented by a three-layer neural network

Broader generalizations:
1. transfer functions at the output layer different from those at the hidden layer
2. different functions at each individual unit
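To make the feed-forward computation concrete, here is a minimal NumPy sketch of the three-layer forward pass g_k(\vec{x}) above (not part of the original slides); the weight-matrix shapes, the bias handled via a prepended constant input, and the sigmoid transfer function are illustrative assumptions.

```python
import numpy as np

def sigmoid(net):
    # transfer function f(net) = 1 / (1 + exp(-net))
    return 1.0 / (1.0 + np.exp(-net))

def forward(x, W_hidden, W_output, f=sigmoid):
    """Three-layer feed-forward pass.

    x        : input vector of length d
    W_hidden : (n_H, d + 1) weights w_ji, column 0 holds the bias w_j0
    W_output : (c, n_H + 1) weights w_kj, column 0 holds the bias w_k0
    Returns the hidden outputs y and the network outputs z = g(x).
    """
    x_aug = np.concatenate(([1.0], x))      # prepend x_0 = 1 for the bias
    net_hidden = W_hidden @ x_aug           # net_j = sum_i w_ji x_i
    y = f(net_hidden)                       # y_j = f(net_j)
    y_aug = np.concatenate(([1.0], y))      # prepend y_0 = 1 for the bias
    net_output = W_output @ y_aug           # net_k = sum_j w_kj y_j
    z = f(net_output)                       # z_k = f(net_k)
    return y, z

# Example: d = 2 inputs, n_H = 3 hidden units, c = 2 outputs
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(3, 3))
W_o = rng.normal(scale=0.1, size=(2, 4))
_, z = forward(np.array([0.5, -1.0]), W_h, W_o)
print(z)  # g_k(x) for each output unit k
```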


Expressive Capabilities of ANNs I

Boolean functions:
Every Boolean function can be represented by a network with a single hidden layer
but it might require a number of hidden units exponential in the number of inputs

Continuous functions:
Kolmogorov: any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units, proper nonlinearities and weights
Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]


Expressive Capabilities of ANNs II


Sigmoid Unit I

How to learn the weights given the network structure?
We cannot simply use the Perceptron learning rule, because we have hidden layer(s)
The function we are trying to minimize: the error
We can use gradient descent
We need a differentiable activation function: use the sigmoid function instead of the threshold function

f(x) = \frac{1}{1 + \exp(-x)}

We need a differentiable error function: we can't use zero-one loss, but we can use the squared error

E(x) = \frac{1}{2} (y - f(x))^2


Sigmoid Unit II

σ(x) is the sigmoid function

\sigma(x) = \frac{1}{1 + e^{-x}}


Sigmoid Unit III

Nice property:

\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))
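A quick numerical check of this derivative property (a small illustrative snippet, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
analytic = sigmoid(x) * (1.0 - sigmoid(x))                  # sigma(x)(1 - sigma(x))
numeric = (sigmoid(x + 1e-6) - sigmoid(x - 1e-6)) / 2e-6    # central difference
print(np.max(np.abs(analytic - numeric)))  # tiny: the two derivatives agree
```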


Multilayer Networks

We can derive gradient descent rules to train multilayer networks of (sigmoid) units → BACKPROPAGATION

Multiple outputs → new error expression:

E[\vec{w}] = \frac{1}{2} \sum_k (t_k - z_k)^2


Criterion Function and Gradient Descent

Squared error:

J[\vec{w}] = \frac{1}{2} \sum_{k=1}^{c} e_k^2 = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2} \|\vec{t} - \vec{z}\|^2

where \vec{t} and \vec{z} represent the target and the network output (length = c)

Gradient descent: the weights are initialized with random values and changed in a direction that will reduce the error, \Delta\vec{w} = -\eta\, \partial J / \partial \vec{w}, that is:

\Delta w_{qp} = -\eta \frac{\partial J}{\partial w_{qp}}


Fitting the Weights I

Iterative update: \vec{w}(m+1) = \vec{w}(m) + \Delta\vec{w}(m), where m indexes the particular input example

\Delta w_{qp}(m) = \eta \times \delta_q(m) \times x_p(m)

(weight correction = learning rate × local gradient at unit q × input)

Evaluate \Delta w_{qp} = -\eta \frac{\partial J}{\partial w_{qp}}:
for the output units
for the hidden units

We can transform \frac{\partial J}{\partial w_{qp}} using the chain rule:

\frac{\partial J}{\partial w_{qp}} = \frac{\partial J}{\partial e_q} \, \frac{\partial e_q}{\partial f(net_q)} \, \frac{\partial f(net_q)}{\partial net_q} \, \frac{\partial net_q}{\partial w_{qp}}


Fitting the Weights II

\frac{\partial J}{\partial e_q} = \frac{\partial}{\partial e_q} \left( \frac{1}{2} \sum_{k=1}^{c} e_k^2 \right) = e_q

\frac{\partial e_q}{\partial f(net_q)} = \frac{\partial (t_q - f(net_q))}{\partial f(net_q)} = -1

\frac{\partial f(net_q)}{\partial net_q} = f'(net_q)

\frac{\partial net_q}{\partial w_{qp}} = f(net_p) = x_p


Fitting the Weights III

Hence:

\frac{\partial J}{\partial w_{qp}} = -e_q\, f'(net_q)\, x_p

Then the correction to be applied is defined by the delta rule:

\Delta w_{qp} = -\eta \frac{\partial J}{\partial w_{qp}}

If we consider the local gradient defined as

\delta_q = -\frac{\partial J}{\partial net_q} = -\frac{\partial J}{\partial e_q} \, \frac{\partial e_q}{\partial f(net_q)} \, \frac{\partial f(net_q)}{\partial net_q} = e_q\, f'(net_q)

the delta rule becomes:

\Delta w_{qp} = \eta\, \delta_q\, x_p


Fitting the Weights IV

1. Hidden-to-output weights: the error is not explicitly dependent upon w_{kj}, so use the chain rule for differentiation:

\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k} \, \frac{\partial net_k}{\partial w_{kj}}

First term: the local gradient (a.k.a. error or sensitivity) of unit k:

\delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k} \, \frac{\partial z_k}{\partial net_k} = (t_k - z_k)\, f'(net_k)

Second term: \frac{\partial net_k}{\partial w_{kj}} = y_j

Summing up, the weight update is:

\Delta w_{kj} = \eta\, \delta_k\, y_j = \eta\, (t_k - z_k)\, f'(net_k)\, y_j


Fitting the Weights V

2. Input-to-hidden weights: the credit assignment problem

\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \, \frac{\partial y_j}{\partial net_j} \, \frac{\partial net_j}{\partial w_{ji}}

First term:

\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j} \left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial net_k} \frac{\partial net_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k)\, f'(net_k)\, w_{kj}

Second term: \frac{\partial y_j}{\partial net_j} = f'(net_j); combining the first two terms, define

\delta_j = f'(net_j) \sum_{k=1}^{c} w_{kj}\, \delta_k

Third term: \frac{\partial net_j}{\partial w_{ji}} = x_i

Summing up, the weight update is:

\Delta w_{ji} = \eta\, \delta_j\, x_i = \eta\, x_i\, f'(net_j) \sum_{k=1}^{c} w_{kj}\, \delta_k


Learning Algorithm

BACKPROPAGATION

Initialize weights w_ji, w_kj; stopping criterion θ; learning rate η; m ← 0
do
    m ← m + 1
    Input the training example x(m) to the network and compute the outputs z_k
    for each output unit k:
        compute δ_k
        w_kj ← w_kj + η δ_k y_j
    for each hidden unit j:
        compute δ_j
        w_ji ← w_ji + η δ_j x_i
until ‖∇J‖ < θ
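Below is a minimal NumPy sketch of this stochastic BACKPROPAGATION loop for one hidden layer of sigmoid units, following the δ_k and δ_j formulas derived earlier; the matrix shapes, the bias handling, and the fixed number of epochs (instead of the stopping test ‖∇J‖ < θ) are simplifying assumptions.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def backprop_train(X, T, n_hidden, eta=0.1, epochs=1000, seed=0):
    """Stochastic backpropagation for a d -> n_hidden -> c sigmoid network.

    X: (N, d) inputs, T: (N, c) targets in [0, 1].
    Returns the weight matrices (bias stored in column 0).
    """
    rng = np.random.default_rng(seed)
    d, c = X.shape[1], T.shape[1]
    W_hid = rng.uniform(-0.05, 0.05, size=(n_hidden, d + 1))   # w_ji
    W_out = rng.uniform(-0.05, 0.05, size=(c, n_hidden + 1))   # w_kj

    for _ in range(epochs):
        for x, t in zip(X, T):                       # one example at a time
            x_aug = np.concatenate(([1.0], x))
            y = sigmoid(W_hid @ x_aug)               # hidden outputs y_j
            y_aug = np.concatenate(([1.0], y))
            z = sigmoid(W_out @ y_aug)               # network outputs z_k

            # local gradients: delta_k = (t_k - z_k) f'(net_k)
            delta_k = (t - z) * z * (1.0 - z)
            # delta_j = f'(net_j) * sum_k w_kj delta_k  (skip the bias column)
            delta_j = y * (1.0 - y) * (W_out[:, 1:].T @ delta_k)

            # delta rule: w <- w + eta * delta * input
            W_out += eta * np.outer(delta_k, y_aug)
            W_hid += eta * np.outer(delta_j, x_aug)
    return W_hid, W_out
```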


Stopping Condition in BACKPROPAGATION

Error over training examples falling below some threshold
Error over a separate validation set meeting some criterion
...

Warning:
too few iterations → fail to reduce the error sufficiently
too many iterations → overfitting


More on BACKPROPAGATION

Gradient descent over the entire network weight vector
1. Often include a weight momentum term α:

\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji} + \alpha\, \Delta w_{ji}(n-1)

2. Easily generalized to arbitrary directed graphs:

\delta_r = o_r (1 - o_r) \sum_{s \in downstream(r)} w_{sr}\, \delta_s

3. Will find a local, not necessarily global, error minimum; in practice it often works well (can run multiple times)
4. Minimizes the error over the training examples; will it generalize well to subsequent examples?
5. Training can take thousands of iterations → slow!
6. Using the network after training is very fast


Learning Hidden Layer Representations I

Given an ANN:


Learning Hidden Layer Representations II

A target function (identity):

Input    →  Output
10000000 →  10000000
01000000 →  01000000
00100000 →  00100000
00010000 →  00010000
00001000 →  00001000
00000100 →  00000100
00000010 →  00000010
00000001 →  00000001

Can this be learned??


Learning Hidden Layer Representations III

Learned hidden layer representation (after 5000 epochs):

Input    →  Hidden Values  →  Output
10000000 →  .89 .04 .08    →  10000000
01000000 →  .01 .11 .88    →  01000000
00100000 →  .01 .97 .27    →  00100000
00010000 →  .99 .97 .71    →  00010000
00001000 →  .03 .05 .02    →  00001000
00000100 →  .22 .99 .99    →  00000100
00000010 →  .80 .01 .98    →  00000010
00000001 →  .60 .94 .01    →  00000001

Rounding the hidden values to 0 or 1 yields a distinct encoding for each of the eight inputs

→ The network can learn/invent new features!
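As an illustration only, the 8-3-8 identity task can be reproduced with the backprop_train sketch given after the Learning Algorithm slide; the learning rate, epoch count, and seed are assumptions, and the exact hidden codes obtained will differ from the table above (momentum or more epochs may be needed for convergence).

```python
import numpy as np

# assumes the backprop_train sketch defined earlier in this document
X = np.eye(8)      # the eight one-hot input patterns
T = X.copy()       # identity target function
W_hid, W_out = backprop_train(X, T, n_hidden=3, eta=0.3, epochs=5000)

x_aug = np.hstack([np.ones((8, 1)), X])
hidden = 1.0 / (1.0 + np.exp(-(x_aug @ W_hid.T)))
print(np.round(hidden, 2))  # each row: the 3-value hidden encoding of one input
```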


Training I

one line per network output


Training II

evolution of the weights in the hidden layer representation for output 01000000


Training III

evolution of the weights for one of the three hidden units


Convergence of BACKPROPAGATION

Gradient descent to some local minimum
Perhaps not the global minimum...
Add momentum
Stochastic gradient descent
Train multiple nets with different initial weights

Nature of convergence:
Initialize weights near zero
Therefore, initial networks are near-linear
Increasingly non-linear functions become possible as training progresses


Remarks I

Can update weights after all training instances have been processed, or incrementally:
batch learning vs. stochastic backpropagation

Weights are initialized to small random values

How to avoid overfitting?
Early stopping: use a validation set to check when to stop
Weight decay: add a penalty term to the error function

How to speed up learning?
Momentum: re-use a proportion of the old weight change
Use an optimization method that employs the 2nd derivative


Remarks II


Remarks III

Momentum:

\vec{w}(m+1) \leftarrow \vec{w}(m) + \underbrace{\Delta\vec{w}(m)}_{\text{gradient descent}} + \underbrace{\alpha\,\Delta\vec{w}(m-1)}_{\text{momentum}}
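A minimal sketch of how the momentum term changes a plain gradient-descent step (the variable names and parameter values are illustrative assumptions):

```python
import numpy as np

def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """One weight update with momentum.

    w          : current weights
    grad       : dJ/dw evaluated at w
    prev_delta : the previous update Delta w(m - 1)
    Returns the new weights and the update to remember for the next step.
    """
    delta = -eta * grad + alpha * prev_delta   # gradient descent + momentum
    return w + delta, delta

# usage: feed the returned delta back in on the next iteration
w = np.zeros(3)
prev = np.zeros(3)
for grad in (np.array([1.0, -2.0, 0.5]),) * 5:   # dummy repeated gradient
    w, prev = momentum_step(w, grad, prev)
print(w)
```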


Overfitting in ANNs I

better stopping after 9100 iterations


Overfitting in ANNs II

When to stop? Not always obvious:
the error decreases, then increases, then decreases again...


Neural Nets for Face Recognition I


Neural Nets for Face Recognition II

Typical input images

90% accurate learning head pose, and recognizing 1-of-20 faces


Learned Hidden Unit Weights

Learned Weights

http://www.cs.cmu.edu/~tom/faces.html


Alternative Error Functions

Weight decay: penalize large weights

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2

This biases learning against complex decision surfaces (a small gradient-update sketch follows this slide)

Train on target slopes as well as values:

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]

Tie weights together:
e.g., in a phoneme recognition network
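As a concrete illustration of the weight-decay alternative above, here is a sketch of a single gradient step on the penalized error (the γ value and the explicit shrinkage form are assumptions for illustration):

```python
import numpy as np

def weight_decay_step(W, grad_E, eta=0.1, gamma=1e-4):
    """Gradient step on E plus gamma * sum of squared weights.

    grad_E : gradient of the squared-error term with respect to W
    The penalty adds 2 * gamma * W to the gradient, which shrinks
    every weight slightly toward zero at each update.
    """
    return W - eta * (grad_E + 2.0 * gamma * W)

W = np.array([[0.5, -2.0], [3.0, 0.1]])
# even with zero error gradient, the weights shrink a little
print(weight_decay_step(W, grad_E=np.zeros_like(W)))
```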


Recurrent Networks I

From acyclic graphs to...
Recurrent Network: an output (at time t) can be an input for nodes at previous layers (at time t + 1)

Applies to time series

Learning algorithm:
unfold the network in time + BACKPROPAGATION [Mozer, 1995]


Recurrent Networks II


Dynamically Modifying Network Structure

CASCADE-CORRELATION [Fahlman & Lebiere, 1990]
start from an ANN without hidden layer nodes
while there is residual error, add hidden layer nodes one at a time, maximizing the correlation between the new hidden unit's output and the residual error

"Optimal brain damage" [LeCun, 1990]: the opposite strategy
start with a complex ANN
prune connections that are unessential, e.g.
  weight close to 0
  study the effect of variations of the weight on the error
until a termination condition based on the error is met


Radial Basis Function Networks

Radial Basis Function Networks (RBF Networks): another type of feed-forward network with 3 layers
Hidden units represent points in instance space, and their activation depends on the distance to the input

To this end, distance is converted into similarity: Gaussian activation function f
The width may be different for each hidden unit

Points of equal activation form a hypersphere (or hyperellipsoid), as opposed to a hyperplane

Output layer: same as in multilayer feed-forward networks


Learning Radial Basis Function Networks

Parameters: centers and widths of the RBFs + weights in the output layer
The two sets of parameters can be learned independently and still give accurate models

E.g.: clusters from k-means can be used to form the basis functions
A linear model can then be fit on the fixed RBFs
This makes learning RBF networks very efficient (see the sketch below)

Disadvantage: no built-in attribute weighting based on relevance
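A minimal sketch of this two-stage recipe (k-means centers, then a linear least-squares output layer); the width heuristic, the scikit-learn KMeans call, and the least-squares fit are illustrative assumptions, not the slides' prescription:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_rbf_network(X, T, n_centers=10, seed=0):
    """Two-stage RBF network fit: k-means centers + linear output layer."""
    km = KMeans(n_clusters=n_centers, n_init=10, random_state=seed).fit(X)
    centers = km.cluster_centers_
    # one shared width: a common heuristic based on the spread of the centers
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    sigma = dists.max() / np.sqrt(2.0 * n_centers)

    def features(X_):
        d2 = ((X_[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        phi = np.exp(-d2 / (2.0 * sigma ** 2))            # Gaussian activations
        return np.hstack([np.ones((len(X_), 1)), phi])    # add a bias column

    # linear output layer fit by least squares on the fixed RBF features
    W_out, *_ = np.linalg.lstsq(features(X), T, rcond=None)
    return lambda X_: features(X_) @ W_out

# usage: rbf = fit_rbf_network(X_train, T_train); predictions = rbf(X_test)
```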


Example: Reduced Coulomb Energy Nets I

(Figure: structure of a Reduced Coulomb Energy network, with input units x_1 ... x_d, pattern units λ_1 ... λ_k, and category units z_1 ... z_c)


Example: Reduced Coulomb Energy Nets II

(binary case: figure of two-class training patterns omitted)


Example: Reduced Coulomb Energy Nets III


Credits

R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley
T. M. Mitchell: Machine Learning, McGraw Hill
I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann
