© Enn Tyugu 1
Algorithms of Artificial Intelligence
Lecture 6: Learning
E. Tyugu
Content
• Basic concepts
  – transfer function
  – classification
  – stages of usage
• Perceptron
• Hopfield net
• Hamming net
• Carpenter-Grossberg's net
• Kohonen's feature maps
• Bayesian networks
• ID3
• AQ
Neural nets
Neural nets provide another form of massively parallel learning functionality. They are well suited for learning pattern recognition. A simple way to describe a neural net is to represent it as a graph. Each node of the graph has an associated variable called state and a constant called threshold. Each arc of the graph has an associated numeric value called weight. Behaviour of a neural net is determined by transfer functions for nodes which compute new values of states from previous states of neighbouring nodes.
Node of a net

A common transfer function is of the form

xj = f(Σi wij·xi − tj)

where the sum is taken over the incoming arcs with weights wij, the xi are the states of the neighbouring nodes, and tj is the threshold of the node j where the new state is computed. Learning in neural nets means changing the weights in the right way.

[Figure: a node j with inputs x1, …, xn on arcs weighted w1j, …, wnj, producing the state xj = f(…).]
Transfer functions

[Figure: graphs of three transfer functions f(x): the hard limiter (output jumps from −1 to +1), threshold logic (a linear ramp clipped at +1), and the sigmoid.]
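As a concrete sketch, the three transfer functions and the generic node update can be written in Python (the exact output ranges and slopes vary between net types; the function names are ours):

```python
import math

def hard_limiter(x):
    # f(x) = +1 if x >= 0, else -1
    return 1.0 if x >= 0 else -1.0

def threshold_logic(x):
    # linear ramp clipped into [0, 1] (other ranges are used as well)
    return max(0.0, min(1.0, x))

def sigmoid(x):
    # smooth approximation of the hard limiter, output in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def node_state(weights, states, threshold, f=hard_limiter):
    # x_j = f(sum_i w_ij * x_i - t_j)
    return f(sum(w * x for w, x in zip(weights, states)) - threshold)
```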
Forward-pass and layered nets
1. A forward-pass neural net is an acyclic graph. Its nodes can be classified as input, output and internal nodes. Input nodes have no neighbours on incoming arcs, output nodes have none on outgoing arcs, and internal nodes possess both kinds of neighbours.

2. A layered (n-layered) net is a forward-pass net where each path from an input node to an output node contains exactly n nodes. Each node in such a graph belongs to exactly one layer. An n-layered net is strongly connected if each node in the i-th layer is connected to all nodes of the (i+1)-st layer, i = 1, 2, …, n−1. The states of the layered net can be interpreted as decisions made on the basis of the states of the input nodes.
Layered neural net
[Figure: a layered net with a row of input nodes, intermediate nodes in the middle layers, and a row of output nodes.]
Learning in a layered net can be performed by means of back-propagation. In this case, the states taken by output nodes are evaluated and credit or blame is assigned to each output node. The evaluations are propagated back to other layers.
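Back-propagation itself is not spelled out on the slide; a minimal Python sketch for a two-layer net of sigmoid nodes (thresholds omitted for brevity, all variable names ours) illustrates how the blame assigned to the output node is propagated back to the hidden layer:

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(x, d, W1, W2, h=0.5):
    # forward pass: hidden states, then the output state
    hid = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    out = sigmoid(sum(w * hi for w, hi in zip(W2, hid)))
    # credit/blame for the output node ...
    delta_out = (d - out) * out * (1 - out)
    # ... propagated back to the hidden layer through the old weights
    delta_hid = [delta_out * W2[j] * hid[j] * (1 - hid[j]) for j in range(len(hid))]
    # weight updates
    for j in range(len(W2)):
        W2[j] += h * delta_out * hid[j]
        for i in range(len(x)):
            W1[j][i] += h * delta_hid[j] * x[i]
    return out
```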
Stages of usage
1. Selection of the structure (of the network type)
2. Assignment of initial weights
3. Learning/teaching
4. Application
Perceptrons
Perceptron nodes are hard limiters or sigmoids. Examples:

[Figure: single-layer, double-layer and three-layer perceptrons.]
Learning in a single-layer perceptron
1. Initialize the weights wi and the threshold t to small random values.
2. Take an input x1, …, xn and the desired output d.
3. Calculate the output x of the perceptron.
4. Adapt the weights: wi′ = wi + h·(d − x)·xi, where h < 1 is a positive gain value and

d = +1, if the input is from one class;
d = −1, if the input is from the other class.

Repeat steps 2–4, if needed.
NB! Note that the weights are changed only when the output x is incorrect (x ≠ d).
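The steps above can be sketched in Python (the data representation and the function name are ours):

```python
import random

def train_perceptron(samples, h=0.1, epochs=100):
    """samples: list of (inputs, d) pairs with d = +1 or -1.
    Returns the weights w and the threshold t of a hard-limiter node."""
    n = len(samples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]  # step 1
    t = random.uniform(-0.05, 0.05)
    for _ in range(epochs):
        for xs, d in samples:                            # step 2
            x = 1 if sum(wi * xi for wi, xi in zip(w, xs)) - t >= 0 else -1  # step 3
            if x != d:  # weights change only on an incorrect output
                for i in range(n):                       # step 4
                    w[i] += h * (d - x) * xs[i]
                t -= h * (d - x)  # threshold treated as a weight on input -1
    return w, t
```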
Regions separable by perceptrons
[Figure: decision regions formed by single-layered (a half-plane), double-layered (a convex region) and three-layered (arbitrary regions) perceptrons separating the classes A and B.]
Hopfield net
[Figure: a Hopfield net with inputs x1, x2, …, xn and outputs x1′, x2′, …, xn′.]

Every node is connected to all other nodes, and the weights are symmetric (wij = wji). The net works with binary (+1, −1) input signals; the output is also a tuple of values +1 or −1. As the transfer function, even a sigmoid can be used instead of the hard limiter.
Hopfield net

1. Initialize the connection weights:

wij = Σs xis·xjs,  i ≠ j,

where xis is +1 or −1 as in the description x1s, …, xns of the class s.
2. Initialise the states with an unknown pattern x1, …, xn.
3. Iterate until convergence (this can even be done asynchronously):

xj′ = f(Σi wij·xi),

where f is the hard limiter.

Remarks:
• A Hopfield net can be used either as a classifier or as an associative memory.
• It always converges, but a match may not occur.
• It works well when the number of classes is less than 0.15·n.
• There are several modifications of the Hopfield net architecture.
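A minimal Python sketch of the algorithm above (synchronous updates, hard limiter as f; the function names are ours):

```python
def hopfield_weights(exemplars):
    # w_ij = sum over classes s of x_i^s * x_j^s, i != j
    n = len(exemplars[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in exemplars)
             for j in range(n)] for i in range(n)]

def hopfield_recall(w, x, max_iter=100):
    # iterate x_j' = f(sum_i w_ij * x_i) until the state stops changing
    x = list(x)
    for _ in range(max_iter):
        x_new = [1 if sum(w[i][j] * x[j] for j in range(len(x))) >= 0 else -1
                 for i in range(len(x))]
        if x_new == x:  # converged
            return x_new
        x = x_new
    return x
```

Used as an associative memory, the net restores a stored pattern from a corrupted copy.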
Hamming net

The Hamming net calculates the Hamming distance to the exemplar of each class and produces a positive output for the class with the minimal distance.

This net is widely used for restoring corrupted binary fixed-length signals.

A Hamming net works faster than a Hopfield net and has fewer connections for a larger number of input signals.

It implements the optimum minimum-error classifier when bit errors are random and independent.
Hamming net
[Figure: a Hamming net with inputs x1, x2, …, xn feeding a lower subnet that calculates the Hamming distances (middle nodes z1, z2, …, zm) and an upper subnet that selects the best match (outputs y1, y2, …, ym).]
Hamming net
The value at a middle node zs is n − hds, where hds is the Hamming distance to the exemplar pattern ps.

Therefore, in the lower subnet the weight from the input xi to the middle node zs is wis = xis/2, with the offset ts = n/2 for each exemplar s.

Indeed, the completely incorrect code gives 0, and 1 = (+1 − (−1))·xis/2 is added for each correct input signal, so a completely correct code gives n.
Hamming net continued
1. Initialize the weights and offsets:

a) lower subnet: wis = xis/2, ts = n/2 for each exemplar s;

b) upper subnet: tk = 0, wsk = if k = s then 1 else −e, where 0 < e < 1/m.

2. Initialize the lower subnet with an (unknown) pattern x1, …, xn and calculate

yj = f(Σi wij·xi + tj).

3. Iterate in the upper subnet until convergence:

yj′ = f(yj − e·Σk≠j yk).
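A compact Python sketch of the algorithm (the offset n/2 is added so that a middle-node value equals n minus the Hamming distance, matching the previous slide; f is taken as threshold logic clipping negative values to 0; names are ours):

```python
def hamming_classify(exemplars, x, e=None, max_iter=1000):
    m, n = len(exemplars), len(x)
    if e is None:
        e = 1.0 / (2 * m)  # any 0 < e < 1/m works
    # lower subnet: y_s = sum_i (x_i^s / 2) * x_i + n/2  ==  n - HammingDistance
    y = [sum(p[i] / 2 * x[i] for i in range(n)) + n / 2 for p in exemplars]
    # upper subnet (lateral inhibition): iterate until convergence
    f = lambda v: v if v > 0 else 0.0  # clip negative values to 0
    for _ in range(max_iter):
        y_new = [f(y[k] - e * sum(y[j] for j in range(m) if j != k))
                 for k in range(m)]
        if y_new == y:
            break
        y = y_new
    return max(range(m), key=lambda k: y[k])  # index of the winning exemplar
```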
A comparator subnet
Here is a comparator subnet that selects the maximum of two analog inputs x0, x1. By combining several of these nets one builds comparators for more inputs (4, 8, etc.; approximately log2 n layers for n inputs). The output z is the maximum value, y0 and y1 indicate which input is the maximum, dark nodes are hard limiters, light nodes are threshold-logic nodes, all thresholds are 0, and the weights are shown on the arcs.

[Figure: the comparator subnet with inputs x0, x1 and outputs z, y0, y1; the arcs carry the weights 0.5, 1 and −1.]
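One way to read the arithmetic of this subnet: the node computing z realizes max(x0, x1) = (x0 + x1)/2 + |x0 − x1|/2 through the 0.5-weighted arcs, while the hard limiters signal which input won. A sketch under this reading (not a claim about the exact wiring):

```python
def comparator(x0, x1):
    # z = max(x0, x1), via 0.5*(x0 + x1) + 0.5*|x0 - x1|
    z = 0.5 * (x0 + x1) + 0.5 * abs(x0 - x1)
    # indicator outputs: which input is the maximum
    y0 = 1 if x0 >= x1 else 0
    y1 = 1 if x1 > x0 else 0
    return z, y0, y1
```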
Carpenter-Grossberg net
This net forms clusters without supervision. Its clustering algorithm is similar to the simple leader clustering algorithm:
select the first input as the exemplar for the first cluster;
if the next input is close enough to some cluster exemplar, it is
added to the cluster, otherwise it becomes the exemplar of a new cluster.
The net includes much feedback and is described by nonlinear differential equations.
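The simple leader clustering algorithm referred to above can be sketched as follows (the distance function and closeness threshold are parameters; this is the clustering rule only, not the differential-equation net):

```python
def leader_cluster(inputs, distance, threshold):
    """The first input becomes the exemplar of the first cluster; each next
    input joins the nearest cluster if it is close enough, otherwise it
    becomes the exemplar of a new cluster."""
    exemplars, clusters = [], []
    for x in inputs:
        if exemplars:
            best = min(range(len(exemplars)),
                       key=lambda k: distance(x, exemplars[k]))
            if distance(x, exemplars[best]) <= threshold:
                clusters[best].append(x)
                continue
        exemplars.append(x)   # x starts a new cluster
        clusters.append([x])
    return exemplars, clusters
```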
Carpenter-Grossberg net
A Carpenter-Grossberg net for three binary inputs x0, x1, x2 and two classes.

[Figure: the net with input nodes x0, x1, x2 and two output nodes.]
Kohonen´s feature maps

A Kohonen's self-organizing feature map (K-map) uses an analogy with biological neural structures in which the placement of neurons is orderly and reflects the structure of the external (sensed) stimuli (e.g. in the auditory and visual pathways).

A K-map learns when continuous-valued input vectors are presented to it without specifying the desired output. The weights of the connections can adjust to regularities in the input. A large number of examples is needed.

A K-map mimics well the learning in biological neural structures.

It is usable, for instance, in speech recognizers.
Kohonen´s feature maps continued

This is a flat (two-dimensional) structure with connections between neighbors and connections from each input node to all output nodes.

It learns clusters of input vectors without any help from a teacher and preserves closeness (topology).

[Figure: a two-dimensional grid of output nodes, each connected to the continuous-valued input vector.]
Learning in K-maps
1. Initialize weights to small random numbers and set initial radius of neighborhood of nodes.
2. Get an input x1, …, xn.
3. Compute the distance dj to each output node:

dj = Σi (xi − wij)²

4. Select the output node s with the minimal distance ds.
5. Update the weights of the node s and of all nodes in its neighborhood:
wij′ = wij + h·(xi − wij), where h < 1 is a gain that decreases in time.

Repeat steps 2–5.
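A minimal Python sketch of the whole loop (the grid size, initial radius and the gain/radius schedules are our illustrative choices):

```python
import random

def train_som(inputs, grid_w, grid_h, epochs=100):
    # 2-D map: output node (r, c) carries one weight per input dimension
    n = len(inputs[0])
    w = {(r, c): [random.uniform(0, 1) for _ in range(n)]
         for r in range(grid_h) for c in range(grid_w)}          # step 1
    for t in range(epochs):
        h = 0.5 * (1 - t / epochs)                   # gain decreasing in time
        radius = max(grid_w, grid_h) * (1 - t / epochs)  # shrinking neighborhood
        for x in inputs:                                          # step 2
            # steps 3-4: node s with minimal d_j = sum_i (x_i - w_ij)^2
            s = min(w, key=lambda node: sum((xi - wi) ** 2
                                            for xi, wi in zip(x, w[node])))
            # step 5: update s and its neighborhood
            for node, wv in w.items():
                if abs(node[0] - s[0]) + abs(node[1] - s[1]) <= radius:
                    for i in range(n):
                        wv[i] += h * (x[i] - wv[i])
    return w
```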
Bayesian networks

Bayesian networks use the conditional probability formula

P(e,H) = P(H|e)P(e) = P(e|H)P(H)

binding the conditional probabilities of the evidence e and the hypothesis H.

A Bayesian network is a graph whose nodes are variables denoting the occurrence of events; arcs express causal dependence of the events. Each node x has conditional probabilities for every possible combination of the events influencing the node, i.e. for every collection of events in the nodes of pred(x) immediately preceding the node x in the graph.
Bayesian networks

Example:

[Figure: a network with the nodes x1, …, x6 and the arcs x1→x2, x1→x3, x1→x4, x2→x4, x2→x5, x3→x5, x5→x6.]
The joint probability assessment for all nodes x1,…,xn:
P(x1,…,xn) = P(x1|pred(x1))*...*P(xn|pred(xn))
constitutes a joint-probability model that supports the assessed event combination. For the present example it is as follows:
P(x1,…,x6) = P(x6|x5)*P(x5|x2,x3)*P(x4|x1,x2)*P(x3|x1)*P(x2|x1)*P(x1)
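The product formula can be evaluated directly once the conditional probabilities are given. The sketch below uses the structure of the example, but every numeric probability is invented purely for illustration; a correct joint model must sum to 1 over all event combinations:

```python
# Structure of the example: pred(x2) = pred(x3) = {x1}, pred(x4) = {x1, x2},
# pred(x5) = {x2, x3}, pred(x6) = {x5}.
# All numbers below are made up for illustration only.
P1 = 0.3                                   # P(x1)
P2 = {True: 0.8, False: 0.1}               # P(x2 | x1)
P3 = {True: 0.6, False: 0.2}               # P(x3 | x1)
P4 = {(a, b): 0.9 if a and b else 0.05     # P(x4 | x1, x2)
      for a in (True, False) for b in (True, False)}
P5 = {(a, b): 0.7 if (a or b) else 0.1     # P(x5 | x2, x3)
      for a in (True, False) for b in (True, False)}
P6 = {True: 0.95, False: 0.05}             # P(x6 | x5)

def pr(p, event):
    # if P(event is True) = p, then P(event is False) = 1 - p
    return p if event else 1 - p

def joint(x1, x2, x3, x4, x5, x6):
    # P(x1,...,x6) = P(x6|x5) P(x5|x2,x3) P(x4|x1,x2) P(x3|x1) P(x2|x1) P(x1)
    return (pr(P6[x5], x6) * pr(P5[(x2, x3)], x5) * pr(P4[(x1, x2)], x4)
            * pr(P3[x1], x3) * pr(P2[x1], x2) * pr(P1, x1))
```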
Bayesian networks continued
A Bayesian network can be used for diagnosis/classification: given some events, the probabilities of the events depending on the given ones can be predicted.

To construct a Bayesian network, one needs to
• determine its structure (topology)
• find conditional probabilities for each dependency.
Taxonomy of neural nets

NEURAL NETS
• binary-valued inputs
  – supervised learning: Hopfield nets, Hamming nets
  – unsupervised learning: Carpenter-Grossberg nets
• continuous inputs
  – supervised learning: single-layered perceptrons, multi-layered perceptrons
  – unsupervised learning: Kohonen maps
A decision tree

[Figure: a decision tree for the weather data: the root tests outlook (sunny / overcast / rain); under sunny, humidity is tested (high → −, normal → +); overcast gives +; under rain, windy is tested (true → −, false → +).]
ID3 algorithm

• To get the fastest decision-making procedure, one has to arrange the attributes in a decision tree in a proper order: the most discriminating attributes first. This is done by the algorithm called ID3.

• The most discriminating attribute can be defined in precise terms as the attribute for which fixing its value changes the entropy of the possible decisions the most. Let wj be the frequency of the j-th decision in a set of examples x. Then the entropy of the set is

E(x) = −Σj wj·log(wj)

• Let fix(x,a,v) denote the set of those elements of x whose value of the attribute a is v. The average entropy that remains in x after the value of a has been fixed is

H(x,a) = Σv kv·E(fix(x,a,v)),

where kv is the ratio of examples in x with the attribute a having the value v.
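The two entropy measures can be computed directly; in this sketch examples are represented as dicts with a "class" key, which is our choice of representation:

```python
import math
from collections import Counter

def entropy(examples):
    # E(x) = -sum_j w_j * log(w_j), w_j = frequency of the j-th decision
    counts = Counter(e["class"] for e in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def fix(examples, a, v):
    # subset of examples whose value of attribute a is v
    return [e for e in examples if e[a] == v]

def avg_entropy(examples, a):
    # H(x,a) = sum_v k_v * E(fix(x,a,v))
    values = {e[a] for e in examples}
    return sum(len(fix(examples, a, v)) / len(examples)
               * entropy(fix(examples, a, v)) for v in values)
```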
ID3 algorithm
ID3 uses the following variables and functions:
p -- pointer to the root of the decision tree being built;
x -- set of examples;
E(x) -- entropy of the set of examples x;
H(x,a) -- average entropy that remains in x after the value of a has been fixed;
atts(x) -- attributes of the set of examples x;
vals(a) -- values of the attribute a;
mark(p,d) -- mark the node p with d;
newsucc(p,v) -- new successor of the node p with the attribute value v; returns a pointer to the new node;
fix(x,a,v) -- subset of the given set of examples x with the value v of the attribute a.
ID3 continued
A.3.10:
ID3(x,p) =
  if empty(x) then failure
  elif E(x) = 0 then mark(p, decision(x))
  else h := bignumber;
    for a ∈ atts(x) do
      if H(x,a) < h then h := H(x,a); am := a fi
    od;
    mark(p, am);
    for v ∈ vals(am,x) do
      ID3(fix(x,am,v), newsucc(p,v))
    od
  fi
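A Python rendering of A.3.10, returning the tree as nested pairs instead of marking graph nodes (helper definitions are repeated so the sketch is self-contained; the example representation is ours):

```python
import math
from collections import Counter

def entropy(x):
    c = Counter(e["class"] for e in x)
    return -sum(n / len(x) * math.log2(n / len(x)) for n in c.values())

def fix(x, a, v):
    return [e for e in x if e[a] == v]

def avg_entropy(x, a):
    return sum(len(fix(x, a, v)) / len(x) * entropy(fix(x, a, v))
               for v in {e[a] for e in x})

def id3(x, atts):
    """Returns a class label, or a node (attribute, {value: subtree})."""
    if not x:
        raise ValueError("failure: empty example set")
    if entropy(x) == 0:
        return x[0]["class"]                    # mark(p, decision(x))
    am = min(atts, key=lambda a: avg_entropy(x, a))  # most discriminating attribute
    return (am, {v: id3(fix(x, am, v), [a for a in atts if a != am])
                 for v in {e[am] for e in x}})
```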
AQ algorithm
This algorithm is for learning knowledge in the form of rules.
The algorithm AQ(ex,cl) builds a set of rules from the given set of examples ex for the collection of classes cl using the function aqrules(p,n,c) for building a set of rules for a class c from its given positive examples p and negative examples n.
pos(ex,c) is a set of positive examples for class c in ex
neg(ex,c) is a set of negative examples for class c in ex
covers(r,e) is a predicate which is true when example e satisfies the rule r.
prune(rules) throws away rules covered by some other rule.
AQ continued
A.3.11:
AQ(ex,cl) =
  allrules := { };
  for c ∈ cl do
    allrules := allrules ∪ aqrules(pos(ex,c), neg(ex,c), c)
  od;
  return(allrules)

aqrules(pos,neg,c) =
  rules := {aqrule(selectFrom(pos), neg, c)};
  for e ∈ pos do
    L: { for r ∈ rules do
           if covers(r,e) then break L fi
         od;
         rules := rules ∪ {aqrule(e,neg,c)};
         prune(rules) }
  od;
  return(rules)
AQ continued

aqrule(seed,neg,c) -- builds a new rule from the initial condition seed and the negative examples neg for the class c.
newtests(r,seed,e) -- generates amendments q to the rule r such that r&q covers seed and not e.
worstelements(star) -- chooses the least promising elements in star.

aqrule(seed,neg,c) =
  star := {true};
  for e ∈ neg do
    for r ∈ star do
      if covers(r,e) then
        star := (star ∪ {r&q | q ∈ newtests(r,seed,e)}) \ {r}
      fi;
      while size(star) > maxstar do
        star := star \ worstelements(star)
      od
    od
  od;
  return("if" bestin(star) "then" c)
A clustering problem (learning without a teacher)
Hierarchy of learning methods

Learning
• parametric learning
  – numerically
  – by automata
• massively parallel learning
  – neural nets
  – genetic algorithms
• symbolic learning
  – search in concept space (specific to general, general to specific)
  – inductive inference
  – inverse resolution
Decision tables

A decision table is a compact way of representing knowledge when the knowledge serves for making a choice among a finite (and not particularly large) set of possibilities.

A decision table is a compound table consisting of three types of areas.

The list of conditions (C1, C2, …, Ck are conditions, written down in some formal language that can be translated into a program):

C1 C2 … Ck
Decision tables

The selection matrix consists of columns corresponding to the conditions and rows corresponding to the selectable variants.

Each cell of the table may hold one of three values:
y – yes, the condition must be satisfied;
n – no, the condition must not be satisfied;
0 – it does not matter whether the condition is satisfied or not (such a cell is then often simply left empty).

C1 C2 … Ck
Decision tables

The third area of the table contains the selectable decisions. If there are two areas of both the first and the second type, the third area can also be given the shape of a matrix:

[Figure: a decision table whose rows and columns are labelled by the conditions (entries y/n), with the decisions in the body of the matrix.]
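The y/n/0 semantics of the selection matrix translates directly into code; a sketch in which the rows are tried in order (the row-order policy and all names are our assumptions):

```python
def match(rule, conditions):
    """rule: tuple of 'y' / 'n' / '0' per condition; conditions: booleans.
    'y' requires the condition to hold, 'n' requires it not to,
    '0' means the condition does not matter."""
    return all(r == "0" or (r == "y") == c for r, c in zip(rule, conditions))

def decide(table, conditions):
    # table: list of (rule, decision) rows; returns the first matching decision
    for rule, dec in table:
        if match(rule, conditions):
            return dec
    return None  # no variant applies
```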
Bibliography
• Kohonen, T. (1984) Self-Organization and Associative Memory. Springer-Verlag, Berlin.
• Lippmann, R. (1987) An Introduction to Computing with Neural Nets. IEEE ASSP Magazine, No. 4, 4 - 22.
• Michalski, R. (1983) Theory and methodology of inductive learning. In: R. Michalski, J. Carbonell, T. Mitchell, eds. Machine Learning: an Artificial Intelligence Approach. Tioga, Palo Alto, 83–134.
Exercises
Sample data for ID3
Outlook    Temperature  Humidity  Windy  Class
Sunny      Hot          High      False  −
Sunny      Hot          High      True   −
Overcast   Hot          High      False  +
Rain       Mild         High      False  +
Rain       Cool         Normal    False  +
Rain       Cool         Normal    True   −
Overcast   Cool         Normal    True   +
Sunny      Mild         High      False  −
Sunny      Cool         Normal    False  +
Rain       Mild         Normal    False  +
Sunny      Mild         Normal    True   +
Overcast   Mild         High      True   +
Overcast   Hot          Normal    False  +
Rain       Mild         High      True   −
1. Calculate the entropies of the attributes.
2. Build a decision tree.