© Enn Tyugu 1
Algorithms of Artificial Intelligence
Lecture 6: Learning
E. Tyugu
Content
• Basic concepts
  – transfer function
  – classification
  – stages of usage
• Perceptron
• Hopfield net
• Hamming net
• Carpenter-Grossberg's net
• Kohonen's feature maps
• Bayesian networks
• ID3
• AQ
Neural nets
Neural nets provide another form of massively parallel learning functionality. They are well suited for learning pattern recognition. A simple way to describe a neural net is to represent it as a graph. Each node of the graph has an associated variable called state and a constant called threshold. Each arc of the graph has an associated numeric value called weight. Behaviour of a neural net is determined by transfer functions for nodes which compute new values of states from previous states of neighbouring nodes.
Node of a net

A common transfer function is of the form

xj = f(Σi wij·xi − tj)

where the sum is taken over the incoming arcs with weights wij, the xi are the states of the neighbouring nodes, and tj is the threshold of the node j where the new state is computed. Learning in neural nets means changing the weights in the right way.

[Figure: a node j with inputs x1, …, xn on arcs weighted w1j, …, wnj, producing the state xj = f(…).]
Transfer functions

[Figure: graphs of three transfer functions f(x): the hard limiter (output jumps from −1 to +1), threshold logic (a linear ramp clipped at +1), and the sigmoid.]
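As a concrete sketch, the three transfer functions and the generic node update can be written in Python (the exact output ranges and slopes vary between net types; the function names are ours):

```python
import math

def hard_limiter(x):
    # f(x) = +1 if x >= 0, else -1
    return 1.0 if x >= 0 else -1.0

def threshold_logic(x):
    # linear ramp clipped into [0, 1] (other ranges are used as well)
    return max(0.0, min(1.0, x))

def sigmoid(x):
    # smooth approximation of the hard limiter, output in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def node_state(weights, states, threshold, f=hard_limiter):
    # x_j = f(sum_i w_ij * x_i - t_j)
    return f(sum(w * x for w, x in zip(weights, states)) - threshold)
```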
Forward-pass and layered nets
1. A forward-pass neural net is an acyclic graph. Its nodes can be classified as input, output and internal nodes. Input nodes have no neighbours on incoming arcs, output nodes have none on outgoing arcs, and internal nodes possess both kinds of neighbours.

2. A layered (n-layered) net is a forward-pass net where each path from an input node to an output node contains exactly n nodes. Each node in such a graph belongs to exactly one layer. An n-layered net is strongly connected if each node in the i-th layer is connected to all nodes of the (i+1)-st layer, i = 1, 2, …, n−1. The states of the layered net can be interpreted as decisions made on the basis of the states of the input nodes.
Layered neural net
[Figure: a layered net with a row of input nodes, intermediate nodes in the middle layers, and a row of output nodes.]
Learning in a layered net can be performed by means of back-propagation. In this case, the states taken by output nodes are evaluated and credit or blame is assigned to each output node. The evaluations are propagated back to other layers.
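Back-propagation itself is not spelled out on the slide; a minimal Python sketch for a two-layer net of sigmoid nodes (thresholds omitted for brevity, all variable names ours) illustrates how the blame assigned to the output node is propagated back to the hidden layer:

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(x, d, W1, W2, h=0.5):
    # forward pass: hidden states, then the output state
    hid = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    out = sigmoid(sum(w * hi for w, hi in zip(W2, hid)))
    # credit/blame for the output node ...
    delta_out = (d - out) * out * (1 - out)
    # ... propagated back to the hidden layer through the old weights
    delta_hid = [delta_out * W2[j] * hid[j] * (1 - hid[j]) for j in range(len(hid))]
    # weight updates
    for j in range(len(W2)):
        W2[j] += h * delta_out * hid[j]
        for i in range(len(x)):
            W1[j][i] += h * delta_hid[j] * x[i]
    return out
```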
Stages of usage
1. Selection of the structure (of the network type)
2. Assignment of initial weights
3. Learning/teaching
4. Application
Perceptrons
Perceptron nodes are hard limiters or sigmoids. Examples:

[Figure: single-layer, double-layer and three-layer perceptrons.]
Learning in a single-layer perceptron
1. Initialize the weights wi and the threshold t to small random values.
2. Take an input x1, …, xn and the desired output d.
3. Calculate the output x of the perceptron.
4. Adapt the weights: wi′ = wi + h·(d − x)·xi, where h < 1 is a positive gain value and

d = +1, if the input is from one class;
d = −1, if the input is from the other class.

Repeat steps 2–4, if needed.
NB! Note that the weights are changed only when the output x is incorrect (x ≠ d).
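The steps above can be sketched in Python (the data representation and the function name are ours):

```python
import random

def train_perceptron(samples, h=0.1, epochs=100):
    """samples: list of (inputs, d) pairs with d = +1 or -1.
    Returns the weights w and the threshold t of a hard-limiter node."""
    n = len(samples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]  # step 1
    t = random.uniform(-0.05, 0.05)
    for _ in range(epochs):
        for xs, d in samples:                            # step 2
            x = 1 if sum(wi * xi for wi, xi in zip(w, xs)) - t >= 0 else -1  # step 3
            if x != d:  # weights change only on an incorrect output
                for i in range(n):                       # step 4
                    w[i] += h * (d - x) * xs[i]
                t -= h * (d - x)  # threshold treated as a weight on input -1
    return w, t
```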
Regions separable by perceptrons
[Figure: decision regions formed by single-layered (a half-plane), double-layered (a convex region) and three-layered (arbitrary regions) perceptrons separating the classes A and B.]
Hopfield net
[Figure: a Hopfield net with inputs x1, x2, …, xn and outputs x1′, x2′, …, xn′.]

Every node is connected to all other nodes, and the weights are symmetric (wij = wji). The net works with binary (+1, −1) input signals; the output is also a tuple of values +1 or −1. As the transfer function, even a sigmoid can be used instead of the hard limiter.
Hopfield net

1. Initialize the connection weights:

wij = Σs xis·xjs,  i ≠ j,

where xis is +1 or −1 as in the description x1s, …, xns of the class s.
2. Initialise the states with an unknown pattern x1, …, xn.
3. Iterate until convergence (this can even be done asynchronously):

xj′ = f(Σi wij·xi),

where f is the hard limiter.

Remarks:
• A Hopfield net can be used either as a classifier or as an associative memory.
• It always converges, but a match may not occur.
• It works well when the number of classes is less than 0.15·n.
• There are several modifications of the Hopfield net architecture.
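A minimal Python sketch of the algorithm above (synchronous updates, hard limiter as f; the function names are ours):

```python
def hopfield_weights(exemplars):
    # w_ij = sum over classes s of x_i^s * x_j^s, i != j
    n = len(exemplars[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in exemplars)
             for j in range(n)] for i in range(n)]

def hopfield_recall(w, x, max_iter=100):
    # iterate x_j' = f(sum_i w_ij * x_i) until the state stops changing
    x = list(x)
    for _ in range(max_iter):
        x_new = [1 if sum(w[i][j] * x[j] for j in range(len(x))) >= 0 else -1
                 for i in range(len(x))]
        if x_new == x:  # converged
            return x_new
        x = x_new
    return x
```

Used as an associative memory, the net restores a stored pattern from a corrupted copy.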
Hamming net

The Hamming net calculates the Hamming distance to the exemplar of each class and produces a positive output for the class with the minimal distance.

This net is widely used for restoring corrupted binary fixed-length signals.

A Hamming net works faster than a Hopfield net and has fewer connections for a larger number of input signals.

It implements the optimum minimum-error classifier when bit errors are random and independent.
Hamming net
[Figure: a Hamming net with inputs x1, x2, …, xn feeding a lower subnet that calculates the Hamming distances (middle nodes z1, z2, …, zm) and an upper subnet that selects the best match (outputs y1, y2, …, ym).]
Hamming net
The value at a middle node zs is n − hds, where hds is the Hamming distance to the exemplar pattern ps.

Therefore, in the lower subnet the weight from the input xi to the middle node zs is wis = xis/2, with the offset ts = n/2 for each exemplar s.

Indeed, the completely incorrect code gives 0, and 1 = (+1 − (−1))·xis/2 is added for each correct input signal, so a completely correct code gives n.
Hamming net continued
1. Initialize the weights and offsets:

a) lower subnet: wis = xis/2, ts = n/2 for each exemplar s;

b) upper subnet: tk = 0, wsk = if k = s then 1 else −e, where 0 < e < 1/m.

2. Initialize the lower subnet with an (unknown) pattern x1, …, xn and calculate

yj = f(Σi wij·xi + tj).

3. Iterate in the upper subnet until convergence:

yj′ = f(yj − e·Σk≠j yk).
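A compact Python sketch of the algorithm (the offset n/2 is added so that a middle-node value equals n minus the Hamming distance, matching the previous slide; f is taken as threshold logic clipping negative values to 0; names are ours):

```python
def hamming_classify(exemplars, x, e=None, max_iter=1000):
    m, n = len(exemplars), len(x)
    if e is None:
        e = 1.0 / (2 * m)  # any 0 < e < 1/m works
    # lower subnet: y_s = sum_i (x_i^s / 2) * x_i + n/2  ==  n - HammingDistance
    y = [sum(p[i] / 2 * x[i] for i in range(n)) + n / 2 for p in exemplars]
    # upper subnet (lateral inhibition): iterate until convergence
    f = lambda v: v if v > 0 else 0.0  # clip negative values to 0
    for _ in range(max_iter):
        y_new = [f(y[k] - e * sum(y[j] for j in range(m) if j != k))
                 for k in range(m)]
        if y_new == y:
            break
        y = y_new
    return max(range(m), key=lambda k: y[k])  # index of the winning exemplar
```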
A comparator subnet
Here is a comparator subnet that selects the maximum of two analog inputs x0, x1. By combining several of these nets one builds comparators for more inputs (4, 8, etc.; approximately log2 n layers for n inputs). The output z is the maximum value, y0 and y1 indicate which input is the maximum, dark nodes are hard limiters, light nodes are threshold-logic nodes, all thresholds are 0, and the weights are shown on the arcs.

[Figure: the comparator subnet with inputs x0, x1 and outputs z, y0, y1; the arcs carry the weights 0.5, 1 and −1.]
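One way to read the arithmetic of this subnet: the node computing z realizes max(x0, x1) = (x0 + x1)/2 + |x0 − x1|/2 through the 0.5-weighted arcs, while the hard limiters signal which input won. A sketch under this reading (not a claim about the exact wiring):

```python
def comparator(x0, x1):
    # z = max(x0, x1), via 0.5*(x0 + x1) + 0.5*|x0 - x1|
    z = 0.5 * (x0 + x1) + 0.5 * abs(x0 - x1)
    # indicator outputs: which input is the maximum
    y0 = 1 if x0 >= x1 else 0
    y1 = 1 if x1 > x0 else 0
    return z, y0, y1
```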
Carpenter-Grossberg net
This net forms clusters without supervision. Its clustering algorithm is similar to the simple leader clustering algorithm:
select the first input as the exemplar for the first cluster;
if the next input is close enough to some cluster exemplar, it is
added to the cluster, otherwise it becomes the exemplar of a new cluster.
The net includes much feedback and is described by nonlinear differential equations.
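The simple leader clustering algorithm referred to above can be sketched as follows (the distance function and closeness threshold are parameters; this is the clustering rule only, not the differential-equation net):

```python
def leader_cluster(inputs, distance, threshold):
    """The first input becomes the exemplar of the first cluster; each next
    input joins the nearest cluster if it is close enough, otherwise it
    becomes the exemplar of a new cluster."""
    exemplars, clusters = [], []
    for x in inputs:
        if exemplars:
            best = min(range(len(exemplars)),
                       key=lambda k: distance(x, exemplars[k]))
            if distance(x, exemplars[best]) <= threshold:
                clusters[best].append(x)
                continue
        exemplars.append(x)   # x starts a new cluster
        clusters.append([x])
    return exemplars, clusters
```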
Carpenter-Grossberg net
A Carpenter-Grossberg net for three binary inputs x0, x1, x2 and two classes.

[Figure: the net with input nodes x0, x1, x2 and two output nodes.]
Kohonen´s feature maps

A Kohonen's self-organizing feature map (K-map) uses an analogy with biological neural structures in which the placement of neurons is orderly and reflects the structure of the external (sensed) stimuli (e.g. in the auditory and visual pathways).

A K-map learns when continuous-valued input vectors are presented to it without specifying the desired output. The weights of the connections can adjust to regularities in the input. A large number of examples is needed.

A K-map mimics well the learning in biological neural structures.

It is usable, for instance, in speech recognizers.
Kohonen´s feature maps continued

This is a flat (two-dimensional) structure with connections between neighbors and connections from each input node to all output nodes.

It learns clusters of input vectors without any help from a teacher and preserves closeness (topology).

[Figure: a two-dimensional grid of output nodes, each connected to the continuous-valued input vector.]
Learning in K-maps
1. Initialize weights to small random numbers and set initial radius of neighborhood of nodes.
2. Get an input x1, …, xn.
3. Compute the distance dj to each output node:

dj = Σi (xi − wij)²

4. Select the output node s with the minimal distance ds.
5. Update the weights of the node s and of all nodes in its neighborhood:
wij′ = wij + h·(xi − wij), where h < 1 is a gain that decreases in time.

Repeat steps 2–5.
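A minimal Python sketch of the whole loop (the grid size, initial radius and the gain/radius schedules are our illustrative choices):

```python
import random

def train_som(inputs, grid_w, grid_h, epochs=100):
    # 2-D map: output node (r, c) carries one weight per input dimension
    n = len(inputs[0])
    w = {(r, c): [random.uniform(0, 1) for _ in range(n)]
         for r in range(grid_h) for c in range(grid_w)}          # step 1
    for t in range(epochs):
        h = 0.5 * (1 - t / epochs)                   # gain decreasing in time
        radius = max(grid_w, grid_h) * (1 - t / epochs)  # shrinking neighborhood
        for x in inputs:                                          # step 2
            # steps 3-4: node s with minimal d_j = sum_i (x_i - w_ij)^2
            s = min(w, key=lambda node: sum((xi - wi) ** 2
                                            for xi, wi in zip(x, w[node])))
            # step 5: update s and its neighborhood
            for node, wv in w.items():
                if abs(node[0] - s[0]) + abs(node[1] - s[1]) <= radius:
                    for i in range(n):
                        wv[i] += h * (x[i] - wv[i])
    return w
```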
Bayesian networks

Bayesian networks use the conditional probability formula

P(e,H) = P(H|e)P(e) = P(e|H)P(H)

binding the conditional probabilities of the evidence e and the hypothesis H.

A Bayesian network is a graph whose nodes are variables denoting the occurrence of events; arcs express causal dependence of the events. Each node x has conditional probabilities for every possible combination of the events influencing the node, i.e. for every collection of events in the nodes of pred(x) immediately preceding the node x in the graph.
Bayesian networks

Example:

[Figure: a network with the nodes x1, …, x6 and the arcs x1→x2, x1→x3, x1→x4, x2→x4, x2→x5, x3→x5, x5→x6.]
The joint probability assessment for all nodes x1,…,xn:
P(x1,…,xn) = P(x1|pred(x1))*...*P(xn|pred(xn))
constitutes a joint-probability model that supports the assessed event combination. For the present example it is as follows:
P(x1,…,x6) = P(x6|x5)*P(x5|x2,x3)*P(x4|x1,x2)*P(x3|x1)*P(x2|x1)*P(x1)
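The product formula can be evaluated directly once the conditional probabilities are given. The sketch below uses the structure of the example, but every numeric probability is invented purely for illustration; a correct joint model must sum to 1 over all event combinations:

```python
# Structure of the example: pred(x2) = pred(x3) = {x1}, pred(x4) = {x1, x2},
# pred(x5) = {x2, x3}, pred(x6) = {x5}.
# All numbers below are made up for illustration only.
P1 = 0.3                                   # P(x1)
P2 = {True: 0.8, False: 0.1}               # P(x2 | x1)
P3 = {True: 0.6, False: 0.2}               # P(x3 | x1)
P4 = {(a, b): 0.9 if a and b else 0.05     # P(x4 | x1, x2)
      for a in (True, False) for b in (True, False)}
P5 = {(a, b): 0.7 if (a or b) else 0.1     # P(x5 | x2, x3)
      for a in (True, False) for b in (True, False)}
P6 = {True: 0.95, False: 0.05}             # P(x6 | x5)

def pr(p, event):
    # if P(event is True) = p, then P(event is False) = 1 - p
    return p if event else 1 - p

def joint(x1, x2, x3, x4, x5, x6):
    # P(x1,...,x6) = P(x6|x5) P(x5|x2,x3) P(x4|x1,x2) P(x3|x1) P(x2|x1) P(x1)
    return (pr(P6[x5], x6) * pr(P5[(x2, x3)], x5) * pr(P4[(x1, x2)], x4)
            * pr(P3[x1], x3) * pr(P2[x1], x2) * pr(P1, x1))
```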
Bayesian networks continued
A Bayesian network can be used for diagnosis/classification: given some events, the probabilities of the events depending on the given ones can be predicted.

To construct a Bayesian network, one needs to
• determine its structure (topology)
• find conditional probabilities for each dependency.
Taxonomy of neural nets

NEURAL NETS
• binary-valued inputs
  – supervised learning: Hopfield nets, Hamming nets
  – unsupervised learning: Carpenter-Grossberg nets
• continuous inputs
  – supervised learning: single-layered perceptrons, multi-layered perceptrons
  – unsupervised learning: Kohonen maps
A decision tree

[Figure: a decision tree for the weather data: the root tests outlook (sunny / overcast / rain); under sunny, humidity is tested (high → −, normal → +); overcast gives +; under rain, windy is tested (true → −, false → +).]
ID3 algorithm

• To get the fastest decision-making procedure, one has to arrange the attributes in a decision tree in a proper order: the most discriminating attributes first. This is done by the algorithm called ID3.

• The most discriminating attribute can be defined in precise terms as the attribute for which fixing its value changes the entropy of the possible decisions the most. Let wj be the frequency of the j-th decision in a set of examples x. Then the entropy of the set is

E(x) = −Σj wj·log(wj)

• Let fix(x,a,v) denote the set of those elements of x whose value of the attribute a is v. The average entropy that remains in x after the value of a has been fixed is

H(x,a) = Σv kv·E(fix(x,a,v)),

where kv is the ratio of examples in x with the attribute a having the value v.
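The two entropy measures can be computed directly; in this sketch examples are represented as dicts with a "class" key, which is our choice of representation:

```python
import math
from collections import Counter

def entropy(examples):
    # E(x) = -sum_j w_j * log(w_j), w_j = frequency of the j-th decision
    counts = Counter(e["class"] for e in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def fix(examples, a, v):
    # subset of examples whose value of attribute a is v
    return [e for e in examples if e[a] == v]

def avg_entropy(examples, a):
    # H(x,a) = sum_v k_v * E(fix(x,a,v))
    values = {e[a] for e in examples}
    return sum(len(fix(examples, a, v)) / len(examples)
               * entropy(fix(examples, a, v)) for v in values)
```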
ID3 algorithm
ID3 uses the following variables and functions:
p -- pointer to the root of the decision tree being built;
x -- set of examples;
E(x) -- entropy of the set of examples x;
H(x,a) -- average entropy that remains in x after the value of a has been fixed;
atts(x) -- attributes of the set of examples x;
vals(a) -- values of the attribute a;
mark(p,d) -- mark the node p with d;
newsucc(p,v) -- new successor of the node p with the attribute value v; returns a pointer to the new node;
fix(x,a,v) -- subset of the given set of examples x with the value v of the attribute a.
ID3 continued
A.3.10:
ID3(x,p) =
  if empty(x) then failure
  elif E(x) = 0 then mark(p, decision(x))
  else h := bignumber;
    for a ∈ atts(x) do
      if H(x,a) < h then h := H(x,a); am := a fi
    od;
    mark(p, am);
    for v ∈ vals(am,x) do
      ID3(fix(x,am,v), newsucc(p,v))
    od
  fi
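A Python rendering of A.3.10, returning the tree as nested pairs instead of marking graph nodes (helper definitions are repeated so the sketch is self-contained; the example representation is ours):

```python
import math
from collections import Counter

def entropy(x):
    c = Counter(e["class"] for e in x)
    return -sum(n / len(x) * math.log2(n / len(x)) for n in c.values())

def fix(x, a, v):
    return [e for e in x if e[a] == v]

def avg_entropy(x, a):
    return sum(len(fix(x, a, v)) / len(x) * entropy(fix(x, a, v))
               for v in {e[a] for e in x})

def id3(x, atts):
    """Returns a class label, or a node (attribute, {value: subtree})."""
    if not x:
        raise ValueError("failure: empty example set")
    if entropy(x) == 0:
        return x[0]["class"]                    # mark(p, decision(x))
    am = min(atts, key=lambda a: avg_entropy(x, a))  # most discriminating attribute
    return (am, {v: id3(fix(x, am, v), [a for a in atts if a != am])
                 for v in {e[am] for e in x}})
```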
AQ algorithm
This algorithm is for learning knowledge in the form of rules.
The algorithm AQ(ex,cl) builds a set of rules from the given set of examples ex for the collection of classes cl using the function aqrules(p,n,c) for building a set of rules for a class c from its given positive examples p and negative examples n.
pos(ex,c) is a set of positive examples for class c in ex
neg(ex,c) is a set of negative examples for class c in ex
covers(r,e) is a predicate which is true when example e satisfies the rule r.
prune(rules) throws away rules covered by some other rule.
AQ continued
A.3.11:
AQ(ex,cl) =
  allrules := { };
  for c ∈ cl do
    allrules := allrules ∪ aqrules(pos(ex,c), neg(ex,c), c)
  od;
  return(allrules)

aqrules(pos,neg,c) =
  rules := {aqrule(selectFrom(pos), neg, c)};
  for e ∈ pos do
    L: { for r ∈ rules do
           if covers(r,e) then break L fi
         od;
         rules := rules ∪ {aqrule(e,neg,c)};
         prune(rules) }
  od;
  return(rules)
AQ continued

aqrule(seed,neg,c) -- builds a new rule from the initial condition seed and the negative examples neg for the class c.
newtests(r,seed,e) -- generates amendments q to the rule r such that r&q covers seed and not e.
worstelements(star) -- chooses the least promising elements in star.

aqrule(seed,neg,c) =
  star := {true};
  for e ∈ neg do
    for r ∈ star do
      if covers(r,e) then
        star := (star ∪ {r&q | q ∈ newtests(r,seed,e)}) \ {r}
      fi;
      while size(star) > maxstar do
        star := star \ worstelements(star)
      od
    od
  od;
  return("if" bestin(star) "then" c)
A clustering problem (learning without a teacher)
Hierarchy of learning methods

Learning
• parametric learning
  – numerically
  – by automata
• massively parallel learning
  – neural nets
  – genetic algorithms
• symbolic learning
  – search in concept space (specific to general, general to specific)
  – inductive inference
  – inverse resolution
Decision tables

A decision table is a compact way of representing knowledge when the knowledge serves for making a choice among a finite (and not particularly large) set of possibilities.

A decision table is a compound table consisting of three types of areas.

The list of conditions (C1, C2, …, Ck are conditions, written down in some formal language that can be translated into a program):

C1 C2 … Ck
Decision tables

The selection matrix consists of columns corresponding to the conditions and rows corresponding to the selectable variants.

Each cell of the table may hold one of three values:
y – yes, the condition must be satisfied;
n – no, the condition must not be satisfied;
0 – it does not matter whether the condition is satisfied or not (such a cell is then often simply left empty).

C1 C2 … Ck
Decision tables

The third area of the table contains the selectable decisions. If there are two areas of both the first and the second type, the third area can also be given the shape of a matrix:

[Figure: a decision table whose rows and columns are labelled by the conditions (entries y/n), with the decisions in the body of the matrix.]
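The y/n/0 semantics of the selection matrix translates directly into code; a sketch in which the rows are tried in order (the row-order policy and all names are our assumptions):

```python
def match(rule, conditions):
    """rule: tuple of 'y' / 'n' / '0' per condition; conditions: booleans.
    'y' requires the condition to hold, 'n' requires it not to,
    '0' means the condition does not matter."""
    return all(r == "0" or (r == "y") == c for r, c in zip(rule, conditions))

def decide(table, conditions):
    # table: list of (rule, decision) rows; returns the first matching decision
    for rule, dec in table:
        if match(rule, conditions):
            return dec
    return None  # no variant applies
```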
Bibliography
• Kohonen, T. (1984) Self-Organization and Associative Memory. Springer-Verlag, Berlin.
• Lippmann, R. (1987) An Introduction to Computing with Neural Nets. IEEE ASSP Magazine, No. 4, 4 - 22.
• Michalski, R. (1983) Theory and methodology of inductive learning. In: R. Michalski, J. Carbonell, T. Mitchell, eds. Machine Learning: an Artificial Intelligence Approach. Tioga, Palo Alto, 83–134.
Exercises
Sample data for ID3
Outlook    Temperature  Humidity  Windy  Class
Sunny      Hot          High      False  −
Sunny      Hot          High      True   −
Overcast   Hot          High      False  +
Rain       Mild         High      False  +
Rain       Cool         Normal    False  +
Rain       Cool         Normal    True   −
Overcast   Cool         Normal    True   +
Sunny      Mild         High      False  −
Sunny      Cool         Normal    False  +
Rain       Mild         Normal    False  +
Sunny      Mild         Normal    True   +
Overcast   Mild         High      True   +
Overcast   Hot          Normal    False  +
Rain       Mild         High      True   −
1. Calculate the entropies of the attributes.
2. Build a decision tree.