
Page 1:

Learning

“Any process by which a system improves its performance” -- Herb Simon

[Figure: a learning agent embedded in its Environment. The Critic sends Feedback to the Learning Element; the Learning Element makes Changes to, and draws Knowledge from, the Performance Element, and sets Learning Goals for the Problem Generator. The Agent perceives the environment through Sensors and acts on it through Effectors.]

Page 2:

Components

• Performance Element
  – Responsible for selecting external actions.
  – E.g. decision trees.

• Learning Element
  – Monitors how the agent is doing in order to modify and hopefully improve its future performance.
  – The learning element can improve the competence and efficiency of the performance element.
  – E.g. reorganising and extending decision trees.

• Critic
  – Indicates how well the agent is performing in terms of some external performance standard.
  – E.g. in chess the performance element might select a checkmate move, but only the critic can indicate that this is a good thing.

• Problem Generator
  – Suggests actions that will lead to new informative experiences.

Page 3:

Simple Rote Learning

• Rote Learning = Memorisation
  – Retrieval instead of computation.
  – Rote learning involves taking problems that the performance element has solved and storing the problem and the solution.

• Learning to Improve Efficiency
  – The aim is to improve performance element efficiency.
  – Competence is not improved (?)

• Store vs Compute Trade-Off
  – The Utility Problem.
  – E.g. consider the application of rote learning to the task of multiplication, as sketched below.
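The store-versus-compute trade-off is easy to see in code. Below is a minimal sketch of rote learning as memoisation, applied to the multiplication example; the store and the deliberately slow computation are illustrative assumptions, not anything from the slides.

```python
# A minimal sketch of rote learning as memoisation (illustrative only).
store = {}  # the rote memory: problem -> solution

def multiply(a, b):
    """Compute a*b, retrieving a stored answer when one exists."""
    if (a, b) in store:                  # retrieval instead of computation
        return store[(a, b)]
    result = sum(a for _ in range(b))    # the "expensive" computation
    store[(a, b)] = result               # store the problem and its solution
    return result

print(multiply(12, 9))   # computed, then stored
print(multiply(12, 9))   # retrieved from the store
```

Every stored entry trades memory for speed, which is exactly the utility problem: a large enough store eventually costs more to hold and search than recomputation would.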

Page 4:

Samuel’s Checkers

• Checkers is a difficult game to play well.
  – Roughly 10^40 possible moves.
  – Samuel’s checkers playing program knew the rules of the game so it could play correctly.
  – However it could not play well.

• Standard Game Tree Search
  – Only possible to look a few moves ahead.
  – Static evaluation is then used to evaluate the lookahead positions.
  – Choose the move that leads to the best attainable lookahead position (use the minimax method).

• Learning to Play Better
  – Improve the lookahead by rote learning board positions.

Page 5:

[Figure: a game tree rooted at board position A with successor positions B, C, and D; static evaluation scores at the leaves back up through minimax, giving A the value 8.]

Samuel's rote learning procedure saved every board position encountered along with its minimax value.

So, in the example, the pair [A,8] would be stored for future reuse.

When A is encountered in later games its score is retrieved (8) rather than recomputed - thus the program is made more efficient.
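A minimal sketch of how rote learning sits on top of minimax; the game interface (the successors and static_value functions) and the toy tree are hypothetical stand-ins, not Samuel's actual program.

```python
# Minimax with a rote store of previously evaluated positions (a sketch).
stored = {}  # board position -> previously computed minimax value

def minimax(pos, depth, maximising, successors, static_value):
    if pos in stored:                    # reuse the rote-learned value
        return stored[pos]
    moves = successors(pos)
    if depth == 0 or not moves:
        return static_value(pos)         # leaf: fall back on static evaluation
    values = [minimax(m, depth - 1, not maximising, successors, static_value)
              for m in moves]
    value = max(values) if maximising else min(values)
    stored[pos] = value                  # save the pair [position, value]
    return value

# Toy example: A's successors B and C are leaves with static scores 3 and 8.
tree = {'A': ['B', 'C'], 'B': [], 'C': []}
vals = {'B': 3, 'C': 8}
print(minimax('A', 2, True, lambda p: tree[p], lambda p: vals[p]))  # 8
print(stored)   # {'A': 8} -- the pair [A,8] is now available for reuse
```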

Page 6:

There is a more important benefit to this type of learning than just a simple efficiency gain.

The memorised value of A is more accurate than the static value of A because it is based on a lookahead search. Thus the lookahead power of the program is improved.

[Figure: a search from the current position E reaches the stored position A after 3 moves.]

The program is considering what to do from position E. It searches ahead 3 moves and applies the static evaluator. For position A, however, it uses its stored value (8), which actually represents a further lookahead of 3 moves. Thus the effective lookahead is now 6 (at least for this example).

Page 7:

Case-Based Learning

• Domain Models
  – Later we will look at learning techniques that display the ability to learn accurate problem solving knowledge and models.
  – In some domains, however, good models might be impossible to build or learn.

• Record Cases
  – When faced with a new problem select the nearest case and use its solution.

• Decision Trees & K-D Trees
  – Perform nearest-neighbour searches in logarithmic time.

Page 8:

The Feature Space

[Figure: eight plates — red, orange, red, violet, blue, green, purple, and yellow.]

Each plate has a known colour, height and width.

Page 9:

Plotting the heights and widths of these plates yields the following feature space.

[Figure: a feature space with Width (0–6) on the horizontal axis and Height (0–6) on the vertical axis; the eight coloured plates and the unknown Target are plotted as points.]

Using our knowledge of existing cases can we predict the colour of the target plate given that we know its width and height?

We can find the nearest known plate and use its colour as an educated guess.
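A minimal sketch of that nearest-neighbour prediction; the plate coordinates below are made-up illustrative values, not the ones in the figure.

```python
import math

# Hypothetical (width, height) readings for the eight plates.
cases = [
    ((1.0, 1.0), 'Red'), ((2.0, 4.0), 'Orange'), ((5.0, 2.0), 'Red'),
    ((0.5, 2.0), 'Violet'), ((3.0, 1.5), 'Blue'), ((1.5, 0.5), 'Green'),
    ((5.0, 5.0), 'Purple'), ((4.5, 6.0), 'Yellow'),
]

def predict_colour(width, height):
    """Return the colour of the nearest stored plate (1-nearest-neighbour)."""
    dist = lambda p: math.hypot(p[0] - width, p[1] - height)
    features, colour = min(cases, key=lambda c: dist(c[0]))
    return colour

print(predict_colour(2.5, 3.5))  # an educated guess from the nearest case
```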

Page 10:

A Robotics Example

[Figure: a two-link robot arm with link lengths L1 and L2 and joint angles A1 and A2.]

Consider the problem of moving a robot hand along a prescribed path at a certain speed.

Need to know how the joint angles (A1 & A2) will change in time and also what joint motor torques will produce these angles.

Page 11:

Kinematics Problem

Relating joint angles to hand position.

For the two joint, two dimensional robot in our example we can come up with the following formulae for determining joint angles given a desired hand position:

A1 = tan⁻¹(y/x) − tan⁻¹(L2·sin(A2) / (L1 + L2·cos(A2)))
A2 = cos⁻¹((x² + y² − L1² − L2²) / (2·L1·L2))

Dynamics Problem

However, the joint motors only understand torque; they do not understand angles. We need equations that relate joint motions to motor torques.

These give us complicated equations in terms of torques, positions, velocities, velocity products, and accelerations.

For 3D situations it's even worse because we have 6 joints.

Even with all this mathematical sophistication, performance (accuracy and smoothness of motion) can be poor - it ignores gravity, Coriolis forces, etc.
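A small sketch of the inverse-kinematics formulae above; atan2 is used instead of tan⁻¹(y/x) to keep quadrants correct, and the target point is assumed reachable (this picks the elbow-down solution).

```python
import math

def joint_angles(x, y, L1, L2):
    """Joint angles for a two-joint planar arm whose hand is at (x, y)."""
    # A2 from the law of cosines; assumes (x, y) is within reach.
    a2 = math.acos((x**2 + y**2 - L1**2 - L2**2) / (2 * L1 * L2))
    # A1 corrects the angle to the hand by the offset introduced by link 2.
    a1 = math.atan2(y, x) - math.atan2(L2 * math.sin(a2),
                                       L1 + L2 * math.cos(a2))
    return a1, a2

a1, a2 = joint_angles(1.0, 1.0, 1.0, 1.0)
print(math.degrees(a1), math.degrees(a2))  # 0.0, 90.0 for this pose
```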

Page 12:

We can use nearest-neighbour calculations in a feature space defined by the various controlling parameters of motion (position, velocities, etc.).

The robot arm can then fill this feature space by sampling its motion at various intervals. For example, during some operation it might note its parameter values for a given position in space (x,y). This reading is stored in a giant case table.

Now when we want to move the arm along a particular trajectory: we break the trajectory up into pieces; treat the giant table as a populated feature space; look for cases with entries for position, velocity etc that closely match our current target segment; interpolate among the cases and work out the appropriate torques for our target segment.

One worry is that the table would have to be huge to properly fill the feature space. But the system can fill the table as it learns through practice.

The first time the robot tries to follow some trajectory it fails miserably because its table is very sparsely populated. But even during this failure new cases are being learned. After a few tries the motion becomes much smoother and more accurate as the cases properly cover the trajectory.
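A minimal sketch of the case-table idea: store (state, torques) pairs as the arm practises, then interpolate among the nearest recorded cases. The table entries and the inverse-distance weighting are illustrative assumptions, not a scheme specified in the slides.

```python
import math

# Hypothetical case table: (x, y, vx, vy) -> (torque1, torque2),
# as would be recorded by the arm during practice.
table = [
    ((0.0, 1.0, 0.1, 0.0), (0.30, 0.10)),
    ((0.1, 1.0, 0.1, 0.0), (0.32, 0.11)),
    ((0.0, 1.1, 0.1, 0.0), (0.28, 0.12)),
]

def torques_for(state, k=2):
    """Interpolate torques from the k nearest recorded cases
    (inverse-distance weighting)."""
    dist = lambda case: math.dist(case[0], state) + 1e-9
    nearest = sorted(table, key=dist)[:k]
    weights = [1.0 / dist(case) for case in nearest]
    total = sum(weights)
    return tuple(sum(w * t[i] for w, (_, t) in zip(weights, nearest)) / total
                 for i in range(2))

print(torques_for((0.05, 1.0, 0.1, 0.0)))  # blend of the two nearest cases
```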

Page 13:

[Figure: the arm's path over Trials 1, 2, and 3, getting progressively closer to the desired trajectory as more cases are recorded.]

Page 14:

Decision Trees

• Nearest-Neighbour Computations
  – n examples means n(n-1) comparisons.

• Decision Trees (log2 n)
  – Each node in the tree corresponds to a feature test.
  – Each arc corresponds to a possible feature value.
  – Cases must be divided up in advance of the nearest-neighbour calculation so that an appropriate decision tree can be built.

[Figure: a decision node testing feature F1, with one branch per possible value (F1=V1, F1=V2).]

Page 15:

[Figure: the plate feature space again, now divided by axis-parallel boundaries into regions.]

We can divide our feature space into 8 sets, each containing just one plate.

Finding the nearest neighbour is now a matter of finding a path through the corresponding decision tree.

In fact, in this example only 3 comparisons will be needed to find the nearest neighbour of our target plate. Once the distance to this neighbour has been calculated, 3 more comparisons are needed to validate the decision. If the decision turns out to be wrong, then further comparisons in the previously ignored sets have to be made.

Page 16:

Dividing Cases

DIVIDE(Cases, ...)

If there is only one case then stop.

Choose a dimension for comparison that is different from the dimension chosen during the previous division.

Using only this dimension, find the average of the two middle cases. This is the threshold. Construct a decision tree test that compares unknowns in the dimension of comparison against this threshold.

Also note the positions of the two middle cases and call them the upper and lower boundaries.

Split up the cases into two subsets according to the threshold test.

Divide up the cases in each subset.
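A minimal sketch of DIVIDE in Python; the Node layout and the simple alternation between the two dimensions are my own choices, not fixed by the slides.

```python
# Build a k-d tree over 2-D cases of the form ((width, height), label).
class Node:
    def __init__(self, dim=None, threshold=None, left=None, right=None, case=None):
        self.dim, self.threshold = dim, threshold  # test: point[dim] > threshold?
        self.left, self.right = left, right        # <= threshold, > threshold
        self.case = case                           # leaf: a single stored case

def divide(cases, dim=0):
    if len(cases) == 1:                            # only one case: stop
        return Node(case=cases[0])
    cases = sorted(cases, key=lambda c: c[0][dim])
    mid = len(cases) // 2
    # The two middle cases give the lower/upper boundaries; the threshold
    # is their average, as in DIVIDE.
    lower, upper = cases[mid - 1][0][dim], cases[mid][0][dim]
    threshold = (lower + upper) / 2
    next_dim = 1 - dim                             # alternate the dimension
    return Node(dim, threshold,
                divide(cases[:mid], next_dim),
                divide(cases[mid:], next_dim))

plates = [((1.0, 1.0), 'Red'), ((2.0, 4.0), 'Orange'),
          ((5.0, 2.0), 'Red'), ((0.5, 2.0), 'Violet')]
tree = divide(plates)
```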

Page 17:

[Figure: the resulting tree. The root tests Height > 3.5 (upper boundary 5, lower boundary 2); the second level tests Width > 3.0 and Width > 3.5; the third level tests Height > 1.5 and Height > 5.5; the leaves are the eight plates (Violet, Red, Green, Blue, Orange, Red, Purple, Yellow).]

The overall result is called a K-D tree (a k-dimensional decision tree).

In a k-d tree each test specifies a threshold and a neutral zone around this containing no cases.

Each test results in a binary division of the remaining cases.

Page 18:

[Figure: the plate feature space with the target plotted.]

Target height > 3.5, which indicates that the target is likely to be nearer one of the tall plates.

So temporarily ignore the short plates. Because the tallest short plate (Red) is 2 cms tall, the distance between the target and this plate is at least 2 cms. If the target is less than or equal to 2 cms from a tall plate then our decision to ignore short plates will become permanent. If the target is more than 2 cms from a tall plate then we will have to return to the short plates eventually.

So looking at the tall plates … Because the target width is less than 3.5 it is more likely (but not certain) to be closer to one of the narrow tall plates than one of the wide tall plates.

So temporarily ignore the wide tall cases. If the target proves to be less than or equal to 4 cms from a narrow tall plate then our decision to ignore the wide tall plates becomes permanent.

One more step puts the target with the narrow tall plates, of which there is only one, an orange plate. If the target differs in height from this orange one by less than 2 cms then we can ignore the competing red plate, which differs by 2 cms in height alone.

Clearly the orange plate is nearer the target than 2 cms. This justifies the rejection of the red plate. The orange plate is less than 4 cms in width from the target, justifying the earlier rejection of the yellow and purple plates, and again it is less than 2 cms in height, justifying the initial rejection of all the short plates.

[Figures: the same feature space annotated with the 2 cm and 4 cm distances used to justify each pruning decision.]

Page 19:

K-D Tree Search

K-D_NN_Search(Target, ...)

If there is only one case in the current set then report it.

Otherwise the result of the current decision node on the target determines the likely set.

Find the nearest-neighbour (nn) in this set using this procedure.

If the distance to the nn in the likely set is <= the distance to the other set’s boundary in the comparison dimension then report the nn from the likely set.

Otherwise, check the unlikely set using the procedure, returning the nearer of the nn’s from the likely set and the unlikely set.
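A sketch of this search, continuing the divide() sketch from the Dividing Cases slide (it assumes that Node class and a tree built by divide()); using the threshold itself in the boundary test is a simplification of the upper/lower boundaries the slides record.

```python
import math

def nn_search(node, target):
    """Nearest stored case to `target` in a tree built by divide()."""
    if node.case is not None:                       # one case: report it
        return node.case
    go_right = target[node.dim] > node.threshold
    likely = node.right if go_right else node.left
    unlikely = node.left if go_right else node.right
    best = nn_search(likely, target)
    best_d = math.dist(best[0], target)
    # If the best distance is within the gap to the other set's boundary
    # in the comparison dimension, the decision is validated.
    if best_d <= abs(target[node.dim] - node.threshold):
        return best
    other = nn_search(unlikely, target)             # must check the other side
    return best if best_d <= math.dist(other[0], target) else other

# e.g. print(nn_search(tree, (2.5, 3.5)))  # `tree` from the divide() sketch
```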

Page 20:

More Decision Trees

• Expressiveness of Decision Trees
  – Can DTs be used to represent any set?
  – No. DTs represent a propositional language.
  – Any Boolean function.

• Applicability?
  – What types of problem situations can be usefully solved by a suitable decision tree?

• How can we learn DTs?
  – How can we learn/build good DTs?
  – What is a good DT? Compactness & Error.

• Difficulties
  – Overfitting
  – Continuous valued attributes
  – Many valued discrete attributes
  – Unknown or missing values
  – Variable attribute testing costs

Page 21:

Expressiveness

• DTs are fully expressive within the class of propositional languages.
  – DTs are implicitly limited to talking about 1 object.
  – We cannot test between 2 or more objects.
  – Any Boolean function can be written as a DT -- each row in the truth table is a path in the tree.

A B | A and B
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1

[Figure: a decision tree for A AND B that tests A at the root and then tests B along both branches.]

Not a very efficient tree for A AND B.

Try XOR and the parity function and the majority function.

Page 22:

Applicability

• DTs good for some Boolean functions but not others.
  – E.g. the parity function results in exponentially large decision trees.
  – There is no representation that is efficient for all functions.

• Examples as attribute-value pairs.
  – Obviously each example must be represented as a set of attribute-value pairs.
  – Each attribute corresponds to a test feature in a DT.

• Discrete (atomic) output values

Page 23:

Exercise:

You are stranded on a deserted island, with only a pile of records and a book of poetry, and there is no way to determine which of the many types of fruit available are safe to eat. They are of various colors and sizes, some have hairy skins and others are smooth. After a great deal of stomach ache, you compile the table of data shown below:

Conclusion  Skin    Color  Size   Flesh
Safe        Hairy   Brown  Large  Hard
Safe        Hairy   Green  Large  Hard
Dangerous   Smooth  Red    Large  Soft
Safe        Hairy   Green  Large  Soft
Safe        Hairy   Red    Small  Hard
Safe        Smooth  Red    Small  Hard
Safe        Smooth  Brown  Small  Hard
Dangerous   Hairy   Green  Small  Soft
Dangerous   Smooth  Green  Small  Hard
Safe        Hairy   Red    Large  Hard
Safe        Smooth  Brown  Large  Soft
Dangerous   Smooth  Green  Small  Soft
Safe        Hairy   Red    Small  Soft
Dangerous   Smooth  Red    Large  Hard
Safe        Smooth  Red    Small  Hard
Dangerous   Hairy   Green  Small  Hard

Page 24:

Entropy

• -log2 p(ci|aj) is the amount of information that aj has to offer about conclusion ci.

• Entropy = - Σ_{j=1..m} p(aj) · Σ_{i=1..n} p(ci|aj) log2 p(ci|aj)

Entropy is a measure of “degree of doubt” - the higher it is, the more doubt there is about the possible conclusion.

Page 25:

The solution

For the Size attribute:

p(safe|large) = 5/7, p(dangerous|large) = 2/7, p(large) = 7/16

p(safe|small) = 5/9, p(dangerous|small) = 4/9, p(small) = 9/16

-(7/16(5/7 log2(5/7) + 2/7 log2(2/7)) + 9/16(5/9 log2(5/9) + 4/9 log2(4/9))) = 0.9350955...

If you do all the calculations you will find that Color has the smallest entropy. The next step is to partition the set of examples according to the possible values for the color. This eventually leads to the rules:
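A short sketch that reproduces the Size-attribute calculation above (the helper name and data layout are my own).

```python
from math import log2

def entropy_after(split):
    """split: list of (subset probability, class probabilities in subset)."""
    return -sum(p_a * sum(p_c * log2(p_c) for p_c in classes if p_c > 0)
                for p_a, classes in split)

# Size from the fruit table: Large (7/16) and Small (9/16).
size = [(7/16, [5/7, 2/7]),   # large: 5 safe, 2 dangerous
        (9/16, [5/9, 4/9])]   # small: 5 safe, 4 dangerous
print(entropy_after(size))    # 0.9350955... as above
```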

Page 26:

The rules

• If color = brown conclusion = safe

• If color = green & size = large conclusion = safe

• If color = green & size = small conclusion = dangerous

• If color = red & skin = hairy conclusion = safe

• If color = red & skin = smooth & size = large conclusion = dangerous

• If color = red & skin = smooth & size = small conclusion = safe

Page 27:

Learning DTs

Learn a DT for the following training examples.

No.  Outlook   Temp.  Humidity  Windy  Class
1    Sunny     Hot    High      False  -
2    Sunny     Hot    High      True   -
3    Overcast  Hot    High      False  +
4    Rain      Mild   High      False  +
5    Rain      Cool   Normal    False  +
6    Rain      Cool   Normal    True   -
7    Overcast  Cool   Normal    True   +
8    Sunny     Mild   High      False  -
9    Sunny     Cool   Normal    False  +
10   Rain      Mild   Normal    False  +
11   Sunny     Mild   Normal    True   +
12   Overcast  Mild   High      True   +
13   Overcast  Hot    Normal    False  +
14   Rain      Mild   High      True   -

Page 28:

Learning Algorithm

Top-Down Induction of Decision Trees

DT-Learn(Examples)

Choose best attribute.

Extend tree by adding a new node for this attribute and a new branch for each attribute value.

Sort training examples down through this node to the current leaves.

If the training examples are unambiguously classified then stop.

Otherwise repeat from the top.
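A compact sketch of this top-down induction loop; the (dict, label) example format and the pluggable choose function are my own choices, with information gain (coming up on the next slides) as the intended plug-in.

```python
def dt_learn(examples, attributes, choose):
    """examples: list of (attribute dict, class label)."""
    labels = {cls for _, cls in examples}
    if len(labels) == 1:                       # unambiguously classified: stop
        return labels.pop()
    if not attributes:                         # noise: report the majority class
        return max(labels, key=lambda c: sum(1 for _, x in examples if x == c))
    best = choose(examples, attributes)        # choose the best attribute
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in {ex[best] for ex, _ in examples}:   # a branch per value
        subset = [(ex, c) for ex, c in examples if ex[best] == value]
        branches[value] = dt_learn(subset, rest, choose)
    return (best, branches)

# Trivial usage with a naive chooser (take the first attribute):
data = [({'Outlook': 'Sunny'}, '-'), ({'Outlook': 'Overcast'}, '+')]
print(dt_learn(data, ['Outlook'], lambda exs, ats: ats[0]))
```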

Page 29:

Good Decision Trees

• Training Consistency

• Accuracy

• Conciseness
  – It should perform the required classification with as few tests as possible.
  – Finding an optimal tree is intractable.
  – Finding a near optimal tree is not.

• Ockham’s Razor
  – The most likely hypothesis is the simplest one that is consistent with all the observations.

Page 30:

Attribute Choice

Does the order of attribute selection matter?

Which attribute should be selected first?

Choose Humidity first ...
  Humidity = High → [3+, 4-]; Humidity = Normal → [6+, 1-]

Choose Outlook first ...
  Outlook = Sunny → [2+, 3-]; Outlook = Overcast → [4+]; Outlook = Rain → [3+, 2-]

Page 31:

Selection Heuristics

• Best Attribute?
  – The attribute that makes the most difference to the classification of an example.
  – The ideal attribute would divide the examples into sets, each with only one classification, and we’d be finished; i.e. an immediate classification of all examples.
  – More likely, certain attributes may lead to an immediate classification of some examples. These may, in the future, also lead to an immediate classification of new problems. So these attributes are important.
  – E.g. Outlook = Overcast.

Page 32:

Information Theory

Suppose we have a possible test with n outcomes that partitions the set T of training cases into n subsets, T1,…,Tn. If this test is to be evaluated without exploring further divisions of the Ti’s, the only information available for evaluation is the distribution of classes in T and its subsets.

S is an set of cases.

freq(Ci,S) is the number of cases in S that belong to class Ci

Imagine selecting one case at random from a S and announcing that it belongs to class Cj. This message has a probability of,

freq(Cj,S) / |S|

And information theory tells us that this message has an information content of,

-log2(freq(Cj,S) / |S|) bits

Theinformation conveyed by a message depends on its probability and can be measured in bits as minus the

logarithm to base 2 of that probability.

Page 33:

So to find the expected information content from such a message pertaining to class membership, we sum over the k classes in proportion to their frequencies in S, giving

info(S) = - Σ_{j=1..k} (freq(Cj,S) / |S|) · log2(freq(Cj,S) / |S|) bits

When applied to a set of training cases T, info(T) measures the average amount of information needed to identify the class of a case in T.

We will often write info(S) as I(P(C1),…,P(Cn)) where P(Ci) is the probability of a random case coming from class Ci.

Page 34:

Information Content

• The Information Content of an Attribute
  – What is the information gained by testing on a particular attribute?
  – Information content is measured in bits.
  – 1 bit is enough information to answer an unbiased yes/no question.

Suppose we have a yes/no question and that the probability of a yes is P(yes) and the probability of a no is P(no). Then the average information content of the answer is given by:

I(P(yes),P(no)) = -P(yes) log2 P(yes) - P(no) log2 P(no)

In general, for a question with n answers, v1,…,vn, we have:

I(P(v1),…,P(vn)) = - Σ_{i=1..n} P(vi) log2 P(vi)

E.g. the information content of a coin flip (unbiased) is:

I(1/2,1/2) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit

Page 35:

ID3: Information Gain

For decision tree learning, the question is: for a given example what is the correct classification? A correct decision tree will answer this question. An initial estimate of the probabilities of the possible answers is given by the proportions of positive and negative examples in the training set.

Suppose the training set has p positive examples and n negative ones. Then an estimate of the information contained in a correct answer is:

I(p/(p+n), n/(p+n))

This is the expected information content of the tree.

A test on a single attribute will provide us with some of this information. We can estimate how much by looking at the information needed after the attribute has been tested.

Page 36:

Some attribute A divides the training set T into v subsets, T1,…,Tv, where A can take on v different values across the training set.

Each subset Ti has pi positive examples and ni negative examples. So if we travel along the vi branch we still need an additional I(pi/(pi+ni), ni/(pi+ni)) bits of information to answer the question.

A random example has value vi for attribute A with probability (pi+ni)/(p+n). So on average the remaining information we will need to answer the question, after testing attribute A, is given by:

Remainder(A) = Σ_{i=1..v} (pi+ni)/(p+n) · I(pi/(pi+ni), ni/(pi+ni))

So now we can compute the information gained by the attribute test on A as:

Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A)

Page 37:

Example

Return to our earlier table of weather examples. Of the 14 objects, 9 are of class + and 5 are of class -.

The total information required for the classification is:

I(P(+),P(-)) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.940 bits

Now consider the outlook attribute with the three values {sunny, overcast, rain}. 5 of the 14 examples have the first value, sunny, 2 of them from class + and 3 from class -.

So, p1 = 2, n1 = 3, I(2/5,3/5) = 0.971 (Sunny)
Similarly, p2 = 4, n2 = 0, I(1,0) = 0 (Overcast)
and p3 = 3, n3 = 2, I(3/5,2/5) = 0.971 (Rain)

Remainder(T, Outlook) = 5/14(0.971) + 4/14(0) + 5/14(0.971) = 0.694

Therefore, Gain(T, Outlook) = 0.246 bits. (Best)

Similarly, Gain(T, Temp) = 0.029 bits, Gain(T, Humidity) = 0.151 bits, Gain(T, Windy) = 0.048 bits.
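A sketch that reproduces these numbers from the weather table. Computed without intermediate rounding the gains come out as 0.247, 0.029, 0.152, and 0.048; the slide's 0.246 and 0.151 result from rounding the remainder to 0.694 first.

```python
from math import log2

def I(*probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# The 14 weather examples: (Outlook, Temp, Humidity, Windy, Class).
data = [('Sunny','Hot','High',False,'-'), ('Sunny','Hot','High',True,'-'),
        ('Overcast','Hot','High',False,'+'), ('Rain','Mild','High',False,'+'),
        ('Rain','Cool','Normal',False,'+'), ('Rain','Cool','Normal',True,'-'),
        ('Overcast','Cool','Normal',True,'+'), ('Sunny','Mild','High',False,'-'),
        ('Sunny','Cool','Normal',False,'+'), ('Rain','Mild','Normal',False,'+'),
        ('Sunny','Mild','Normal',True,'+'), ('Overcast','Mild','High',True,'+'),
        ('Overcast','Hot','Normal',False,'+'), ('Rain','Mild','High',True,'-')]

def gain(attr):                       # attr: column index 0..3
    p = sum(1 for row in data if row[4] == '+')
    n = len(data) - p
    remainder = 0.0
    for v in {row[attr] for row in data}:
        subset = [row for row in data if row[attr] == v]
        pi = sum(1 for row in subset if row[4] == '+')
        remainder += (len(subset) / len(data)
                      * I(pi / len(subset), (len(subset) - pi) / len(subset)))
    return I(p / (p + n), n / (p + n)) - remainder

for name, i in [('Outlook', 0), ('Temp', 1), ('Humidity', 2), ('Windy', 3)]:
    print(name, round(gain(i), 3))    # 0.247, 0.029, 0.152, 0.048
```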

Page 38:

Noise

• Problem ...
  – If there are no attributes left, but there are both positive and negative examples remaining, then we have a problem. Some of the data are incorrect. We say there is noise in the data.
  – It can also occur if the attributes do not give enough information to classify the examples.

• Solution ...
  – One solution is to have leaf nodes report the majority classification.
  – Another is to report the estimated classification probabilities using the relative frequencies of the remaining examples.

Page 39:

Overfitting

• Problem ...
  – Even when vital information is missing our algorithm may come up with a consistent hypothesis.
  – The algorithm can use irrelevant attributes to make spurious distinctions among the examples.

• Solution ...
  – Prevent recursive splitting on attributes that are not clearly relevant. This happens when information gain is close to 0. Why?
  – How large should Gain be to proceed?
  – Use statistical significance tests and estimate the likelihood that the gain observed for a particular example set size is due to chance. If, for example, there is only a 5% likelihood that it is due to chance then we can assume the attribute is relevant.
  – Cross validation is another technique that tries to predict how the current hypothesis will work with unseen data. It sets aside part of the training set to use as a test set. Testing can be carried out with different test sets and an average error computed.

Page 40:

Missing Data

• Missing Data
  – Not all attributes may be known for every example.
  – Given a complete decision tree, how do we classify a new example that is missing some attribute value?
  – How can we modify the Gain formula to accommodate missing values during learning?

• Solutions …
  – Assign a value for the empty attribute.
  – For example, use the average value of the same attribute in the training examples.
  – Or assume the attribute has all possible values and proceed down all branches. Weight each procession according to the relative frequency of that value in the examples that have reached the current node in the decision tree.
  – Similar ideas can be used to change the learning algorithm by weighting the Gain function according to the hypothesised value frequencies.

Page 41:

[Figure: a decision tree built from 20 cases, with root A. Branch A = a1 (15/20 of the cases) leads to node B, where B = b1 (5/15) gives + and B = b2 (10/15) gives -. Branch A = a2 (5/20) leads to node C, where C = c1 (2 cases) gives + and C = c2 (3 cases) gives -. The target to classify is A:? B:? C:c2.]

Probability of a + : (15/20)(5/15) = 1/4

Probability of a - : (15/20)(10/15) + (5/20) = 3/4

Page 42:

Multi-Valued Data

Attributes that have large numbers of possible values can give unfair Gain results, and so inaccurate measures of an attribute’s usefulness are often returned.

For example, if we used example number as an attribute then every training example would be unique and have a unique classification. This attribute would score very highly using the Gain function but a rather useless tree would result.

Solution ...

Penalise broad uniform splits.

Modify the gain formula to get the following so-called Gain Ratio formula:

GainRatio(T, A) = Gain(T, A) / Split-Info(T, A)

where Split-Info(T, A) = - Σ_{i=1..n} (|Ti|/|T|) log2(|Ti|/|T|)
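A small sketch of the Gain Ratio correction; the 5/4/5 Outlook split and the 0.246 gain value are taken from the earlier weather example.

```python
from math import log2

def split_info(subset_sizes):
    """Split-Info from the sizes |Ti| of the subsets of T."""
    total = sum(subset_sizes)
    return -sum(s / total * log2(s / total) for s in subset_sizes if s)

def gain_ratio(gain, subset_sizes):
    return gain / split_info(subset_sizes)

# Outlook splits the 14 weather examples 5/4/5:
print(gain_ratio(0.246, [5, 4, 5]))   # ~0.156
```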

Page 43:

Continuous Values

• So far we’ve looked at nominal values only.

• What about continuous values?
  – E.g. Height or Weight etc… (an infinity of values).
  – Building a tree using the actual training values can lead to problems when it comes to classification -- it is unlikely that a new target will have exactly the same value as one of the training examples for a particular continuous attribute.

• Discretisation
  – Split the value range up into manageable chunks.
  – No longer test for exact matches. Look for membership.
  – E.g. Tall = 1.8m < Height < 2.2m.
  – If height = 2m in the example then it passes the height test for Tallness.
  – One should preprocess the example data to determine which ranges are best in terms of the classifications they support.

Page 44:

Costly Attributes

• Attribute Costs
  – Testing certain attributes may carry a substantial cost.
  – E.g. in a medical setting, asking a patient if she has a headache is less costly than carrying out a CAT scan. However, a CAT scan may be more conclusive and lead to a quicker diagnosis. Therefore there can be a trade-off between information gain and testing cost.
  – Therefore, when building trees we should take cost into account.
  – A small tree may no longer be desirable if it incurs a large cost.
  – A larger tree may ask more questions on average but the answers may be less costly to come by.

• Solution …
  – Add a cost term to the attribute selection measure so that cost is considered during tree building.
  – E.g. Quality(A) = Gain(A) / Cost(A)²

Page 45:

Inductive Learning

• Supervised Learning
  – Learning is provided with correct input, output pairs.

• Inductive Learning
  – Task: to learn a function f.
  – Given: a collection of input, output examples of f.
  – Result: a new function h that approximates f.

For example, given a set of (x,y) points on the plane, where y = f(x), the task is to learn a function h that fits the points well.

[Figure: sample points (x, f(x)) with the true function f and a learned approximation h drawn through them.]

There are many possible choices for h. Preferences for one over another, beyond issues of mere consistency, are known as learning biases. All learning algorithms exhibit some sort of bias.

Page 46:

Learning Theory

• Basic Question:
  – How can we be sure that our inductive learner has learned a hypothesis h that is close to the true function f if we don’t know what f is?

• Computational Learning Theory
  – Why does learning work? Does it work?

• Basic Principle:
  – Any hypothesis that is seriously wrong will be found out with high probability after a small number of examples.
  – Thus any hypothesis that is consistent with a large set of examples is unlikely to be seriously wrong.

Page 47:

PAC Learning

• Probably Approximately Correct
  – Any hypothesis that is consistent with a large number of examples is probably approximately correct.
  – PAC Learning is devoted to this idea.

• Assumptions
  – The stationarity assumption says that the training set and test set are drawn from the same probability distribution.
  – In other words we don’t have to worry about using non-representative training examples - the malevolent teacher syndrome.

• The BIG Question
  – How many training examples do we need before we can be sufficiently confident that our learned function is sufficiently close to the real underlying function?

Page 48:

Training Set Size

• Let …
  – X be the set of all possible examples.
  – D be the distribution from which examples are drawn.
  – H be the set of all possible hypotheses.
  – m be the number of training examples.

• Hypothesis Error
  – We can define the error of a learned hypothesis, h, with respect to the true underlying function f as follows:

    error(h) = P(h(x) ≠ f(x) | x drawn from D)

• Approximately Correct
  – A hypothesis h is approximately correct if:

    error(h) <= ε, where ε is a small constant.

Page 49:

[Figure: the hypothesis space H, partitioned into HGOOD (the “ball” of hypotheses close to the true function f) and HBAD (the rest).]

One can visualise an approximately correct hypothesis as being “close” to the true function - it lies within what is called the ball of f (this is the set HGOOD).

The remainder of the hypotheses is called HBAD.

Page 50:

We need to calculate the probability that a “seriously wrong” hypothesis hb ∈ HBAD is consistent with the first m examples.

We know that error(hb) > ε.

Therefore the probability that this bad hypothesis agrees with any given example is <= (1 - ε).

Now, the probability that this hypothesis agrees with all m examples is given by:

P(hb agrees with m examples) <= (1 - ε)^m

We need to know the probability that HBAD contains such a consistent hypothesis. Well, for this to be true obviously at least one of the HBAD hypotheses must be consistent.

So if the probability that a particular hypothesis hb is consistent is at most (1 - ε)^m, then the probability that one of the |HBAD| such hypotheses is consistent is given by:

P(HBAD contains a consistent hyp.) <= |HBAD| · (1 - ε)^m <= |H| · (1 - ε)^m

Page 51:

Obviously we don’t want this to be terribly likely -- it is not a good idea for us to be learning “seriously wrong” hypotheses. In fact we would like this probability to be less than some small number δ. That is:

|H| · (1 - ε)^m <= δ

And rearranging this expression, we find that we can achieve this as long as our learning algorithm sees

m >= (1/ε)(ln(1/δ) + ln|H|)

examples. Thus, if a learning algorithm returns a hypothesis that is consistent with this many examples, then with probability at least 1 - δ it has an error of at most ε.

In other words, if δ is small then it is very likely that our hypothesis is approximately correct.

This number of examples, as a function of ε and δ, is called the sample complexity.
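A one-function sketch of this bound; the ε, δ, and |H| values in the example call are arbitrary choices of mine.

```python
from math import ceil, log

def sample_complexity(eps, delta, h_size):
    """Examples needed so that a consistent hypothesis is, with
    probability at least 1 - delta, of error at most eps."""
    return ceil(1 / eps * (log(1 / delta) + log(h_size)))

# e.g. all Boolean functions of n = 5 attributes: |H| = 2**(2**5)
print(sample_complexity(0.1, 0.05, 2**(2**5)))   # ~252 examples
```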

Page 52:

Obviously m depends very much on |H|, the size of the hypothesis space.

If H is the set of all Boolean functions of n attributes then |H| = 2^(2^n) (2 to the power of 2 to the power of n). Thus the sample complexity of this space grows as 2^n.

But the total number of possible examples is also 2^n, which means that any learning algorithm for the space of all Boolean functions can do no better than a look-up table: it merely returns a hypothesis that is consistent with all known examples.

Page 53:

The Dilemma !

Unless we restrict the space of functions the learning algorithm can consider it will not be possible to learn effectively.

However, if we do restrict the space, then we may eliminate the true function altogether.

Solutions:

Insist that our algorithm returns not just any consistent hypothesis, but preferably the simplest one (this is intractable in general)

Another approach is to consider subsets of the Boolean functions. In most cases we do not need the full expressiveness of Boolean functions, we can get by with restricted languages.

Page 54:

Decision Lists

• Decision Lists are
  – Restricted logical expressions.
  – A DL is a series of tests. Each test is a conjunction of literals. If a test succeeds on an example then the decision list specifies the value to return. If the test fails then the next test is checked.
  – For example, a learning method that saw the first 5 examples from our original set may learn the following hypothesis in DL form:

[Figure: a decision list — if Outlook(D,Sunny) & Temp(D,Hot) then -; else if Windy(D,False) then +; else -.]

• DLs are like decision trees
  – Their overall structure is simpler.
  – Their individual tests are more complex.

Page 55:

Restricted DLs

• K Literal Restriction
  – If we allow each DL test to be arbitrarily complex then DLs can represent any Boolean function.
  – However if we restrict the tests to k literals then it is possible to learn accurate hypotheses from a small number of examples.
  – In fact we can show that:

    m >= (1/ε)(ln(1/δ) + O(n^k log2(n^k)))

  – Therefore any algorithm that returns a DL consistent with the training examples will PAC-learn a k-DL function in a reasonable number of examples for small k.

Page 56:

Learning Algorithm

Learning Decision Lists

DL-Learn(Examples)

If Examples is empty then return False.

Select a test T that matches a nonempty subset, Examples’, of Examples such that the members of Examples’ are all positive or all negative.

If there is no such T then return Failure.

If Examples’ are all + then the outcome O is +.

Else O is -.

Return a DL with initial test T and outcome O, and remaining elements given by DL-Learn(Examples - Examples’).
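A minimal sketch of DL-Learn in which each test is a single attribute=value pair (the k = 1 case); the (dict, label) example format is the same convention used in the earlier sketches and is my own choice.

```python
def dl_learn(examples):
    """examples: list of (attribute dict, '+' or '-'). Returns a decision
    list [((attribute, value), outcome), ...], or 'Failure'."""
    if not examples:
        return []                                   # empty DL: always False
    for attr in examples[0][0]:
        for val in {ex[attr] for ex, _ in examples}:
            matched = [(ex, c) for ex, c in examples if ex[attr] == val]
            outcomes = {c for _, c in matched}
            if matched and len(outcomes) == 1:      # all + or all -
                rest = [(ex, c) for ex, c in examples if ex[attr] != val]
                return [((attr, val), outcomes.pop())] + dl_learn(rest)
    return 'Failure'                                # no uniform test exists

exs = [({'Outlook': 'Sunny', 'Windy': 'False'}, '-'),
       ({'Outlook': 'Overcast', 'Windy': 'False'}, '+')]
print(dl_learn(exs))
```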

Page 57:

Concept Learning

• Learning Single Concepts
  – A “concept” is taken to be a predicate that returns TRUE when applied to a positive example of the concept and FALSE otherwise.
  – A concept thus partitions the instance space into positive and negative subsets.
  – The learning algorithms we will look at learn single concepts by manipulating a sequence of training examples.

• Assumptions
  – All training examples are either positive or negative examples of a single concept.
  – This single concept can be represented by a point within the learning algorithm’s description space.

• Techniques
  – Decision Tree Learning (ID3)
  – Winston’s Structural Learning
  – Mitchell’s Version Space Method

Page 58:

Analysing Differences

• Learning by analysing differences
  – Current-Best Hypothesis Search
  – Positive examples
  – Negative examples

• Induction Heuristics
  – Class descriptions
  – Evolving Models

• Assumptions
  – Pedagogical order must be good.
  – Negative examples must be near-misses.

• Felicity Conditions

• Learning Special Cases

• Identification & Similarity Nets

Page 59:

Winston’s ARCH learning experiments …

[Figure: four training examples, each built from blocks A, B, and C — 1: Arch, 2: Non-Arch, 3: Non-Arch, 4: Arch.]

Page 60:

Induction Heuristics

Induction occurs when one uses specific examples to reach general conclusions.

Consider a program that learns arches from the previous “Arch” and “Non-Arch” examples.

The first example tells us what the general concept of an Arch is: namely, two standing blocks supporting a third.

Each subsequent example serves to refine this general description.

The second example allows our learning procedure to conclude that the support links must be an important aspect of Arches in general.

The third example (also negative) allows our learner to conclude that the standing bricks must not touch.

The fourth example allows the learner to generalise further by permitting wedge shaped lintels rather than only brick shaped lintels.

Page 61:

Near-Misses

• Initial Description
  – The learner begins with a typical member of the class to be learned (i.e. a typical arch).
  – This provides an initial description.

[Figure: the initial description network — blocks B and C each Support A, B is Left-Of C, and A Is-A Brick.]

Page 62:

• Near Misses
  – A near miss is a negative example that, for a small number of reasons, is not a member of the class being taught.
  – Example 2 is a near-miss because it differs only in terms of its absence of support links.
  – Its purpose, therefore, is to teach us about the importance of support links in the arch concept.

• Evolving Models
  – During learning, differences from the initial description (such as the lack of support links in the negative example) are used to indicate which model relations are important and which are not.
  – This leads to an augmented class description called an evolving model.

[Figure: the near-miss description — B is Left-Of C and A Is-A Brick, but the Support links are absent.]

Page 63:

Require-Link Heuristic

• Require Links
  – Because example 2 is negative and only differs from the current class model in terms of its lack of support links, our learning procedure concludes that support links are necessary in true arches.
  – This conclusion is an instance of the require-link heuristic.
  – The new class description looks like …

[Figure: the updated model — B and C each Must-Support A, B is Left-Of C, and A Is-A Brick.]

Page 64:

Forbid-Link Heuristic

• Unwanted Links
  – The next comparison, between the new model and the near-miss that is example 3, isolates a difference in the form of “touch” links.
  – That is, the near-miss includes two touch-links that are not in the current class description.
  – However, this time the example fails to be a class member because of the presence of these links rather than their absence.

[Figure: the near-miss description — B and C each Must-Support A, B is Left-Of C, A Is-A Brick, and B and C Touch each other.]

Page 65:

• The Forbid-Link Heuristic
  – The learning procedure uses the forbid-link heuristic to conclude that these extra links must be responsible for the non-arch nature of the example.
  – It handles this by converting these links to their negative emphatic form: must-not-touch.
  – The updated class description is shown below …

[Figure: the updated model — B and C each Must-Support A, B is Left-Of C, A Is-A Brick, and B and C Must-Not-Touch each other.]

Page 66:

Positive Examples

• Negative Examples RESTRICT
  – Limiting the concept of an “Arch”.

• Positive Examples RELAX
  – Expanding the concept of an “Arch”.
  – The fourth training example looks like ...

[Figure: the fourth example — B and C each Must-Support A, B is Left-Of C, B and C Must-Not-Touch, but A Is-A Wedge, where Wedge and Brick are both Is-A Block.]

Page 67:

Climb-Tree Heuristic

• The Climb-Tree Heuristic
  – Compared to the current evolving model, example 4 differs because its lintel object is a wedge instead of a brick.
  – If WEDGE and BRICK share some common super class then the climb-tree heuristic uses this super class in the class description. E.g. Bricks and Wedges are both types of Block.

[Figure: the updated model — B and C each Must-Support A, B is Left-Of C, B and C Must-Not-Touch, and A Must-Be-A Block.]

Page 68:

More Heuristics

• The Enlarge-Set Heuristic
  – Sometimes there is no classification tree for the climb-tree heuristic to climb, so no common super class can be found.
  – One solution is to facilitate the creation of a new common super class: the BRICK-OR-WEDGE super class for instance.

• The Drop-Link Heuristic
  – If there are no objects other than bricks or wedges the IS-A link can be dropped (bricks and wedges form an exhaustive set).
  – This heuristic can also be used when a link in the evolving model is not present in the current positive example.
  – For example, if the initial example has a colour link but future examples do not specify block colour, then this link can be dropped as irrelevant from the model.

Page 69:

Concept-Learn

Here is the procedure that learns by analysing differences and builds a general model of some target concept to be learned.

Note: the example order is important and must be chosen by the teacher. The initial example must be positive.

Concept-Learn(Examples)

Let the first example (positive) be the initial description.

For all subsequent examples:

If the example is a near-miss use the procedure SPECIALISE.

If the example is positive use the procedure GENERALISE.

Page 70:

SPECIALISE

SPECIALISE(Model, Example)

Match the model to the example to establish correspondences among parts.

If there is a single, most important difference between the model and the near-miss:

If the evolving model has a link that is not in the near-miss then use the require-link heuristic.

If the near-miss has a link that is not in the model then use the forbid-link heuristic.

Page 71:

GENERALISE

GENERALISE(Model, Example)

Match the model to the example to establish correspondences among parts.

For each difference determine the difference type:

If a link points to a class in the model different from the class that the link points to in the example:

If the classes are part of a class tree use the climb-tree heuristic.

If the classes form an exhaustive set use the drop-link heuristic.

Else use the enlarge-set heuristic.

If the link is missing from the example use the drop-link heuristic.

Else ignore the difference.
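A minimal sketch of the two procedures over link sets; representing models as (part, link, target) triples and the tiny class tree are my own simplifications, and the enlarge-set heuristic and exhaustive-set test are omitted.

```python
# A model is a set of (part, link, target) triples; hypothetical class tree.
CLASS_TREE = {'Brick': 'Block', 'Wedge': 'Block'}   # child -> parent class

def specialise(model, near_miss):
    only_model = model - near_miss
    only_miss = near_miss - model
    if len(only_model) == 1:                  # require-link heuristic
        a, link, b = only_model.pop()
        return (model - {(a, link, b)}) | {(a, 'Must-' + link, b)}
    if len(only_miss) == 1:                   # forbid-link heuristic
        a, link, b = only_miss.pop()
        return model | {(a, 'Must-Not-' + link, b)}
    return model                              # no single difference: wait & see

def generalise(model, example):
    new = set(model)
    for a, link, b in model - example:
        targets = [c for x, l, c in example if x == a and l == link]
        if (targets and CLASS_TREE.get(b)
                and CLASS_TREE.get(b) == CLASS_TREE.get(targets[0])):
            new.remove((a, link, b))          # climb-tree heuristic
            new.add((a, 'Must-Be-A', CLASS_TREE[b]))
        else:
            new.remove((a, link, b))          # drop-link (enlarge-set omitted)
    return new

arch = {('B', 'Support', 'A'), ('C', 'Support', 'A'),
        ('B', 'Left-Of', 'C'), ('A', 'Is-A', 'Brick')}
near_miss = arch - {('B', 'Support', 'A')}    # a single missing link
print(specialise(arch, near_miss))            # 'Support' becomes 'Must-Support'
print(generalise({('A', 'Is-A', 'Brick')},
                 {('A', 'Is-A', 'Wedge')}))   # {('A', 'Must-Be-A', 'Block')}
```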

Page 72:

Felicity Conditions

• The Wait-&-See Principle
  – The procedure given cannot unlearn.
  – It may be better not to learn something that will later have to be unlearned -- wait & see.
  – Our procedure honours this principle when it ignores negative examples for which it cannot identify a single most important difference (invalid near-misses).

• Felicity Conditions
  – The teacher (us) can help the procedure avoid the need to ignore negative examples by ensuring that the negative examples are bona fide near-misses.
  – Alternatively, difference types can be ranked according to their importance.
  – These types of teacher-student learning agreements are called felicity conditions.

Page 73:

Special Cases

• “Penguins can’t fly!”
  – Even when elaborate felicity conditions are provided, models can be generated which are inconsistent with future positive examples.
  – The classic example is that of the Penguin, a bird which cannot fly. Typically the evolving model will include a must-be-able-to-fly link which is obviously not present in the penguin description.
  – What should happen? Should the must-be-able-to-fly link be dropped by the drop-link heuristic as suggested by our GENERALISE procedure?

• The No-Altering Principle
  – When a positive example fails to match the current model, create a special-case exception.
  – Thus, penguin is listed as a special-case exception of the bird concept.

Page 74:

Learning & Search

• Search through a space of possible models (hypotheses)
  – Search Operators = Specialisation & Generalisation
  – Heuristic Depth-First Search
  – Search pursues a single hypothesis (model)

• Problems
  – Must ensure that model modifications result in consistent models. This can mean carefully constructing the specialise and generalise operators so that they only lead to consistent models, or it may mean rechecking the new model against all previous examples.
  – The search space is exponentially large and finding heuristics that efficiently traverse it is very difficult. Backtracking may be common.
  – The result of the learning procedure may not be the simplest possible model.

Page 75:

Identification

• Identification methods
  – Use learned models to recognise unknown objects.

• Emphatic Links Dominate Matching
  – Check whether the new object is compatible with the model’s emphatic links (must and must-not links).
  – All must-links in the model must be present in the unknown object’s description.
  – Similarly, must-not links must not be in the unknown description.

Page 76:

Similarity Nets

• Model-lists
  – Checking an unknown description against each possible model is feasible only if the number of models is small.

• Similarity Nets
  – We can arrange models in similarity nets in which links connect models that are similar.
  – Suppose the unknown fails to match a given model. If the match does not fail by much then similar models could be tried next. These will be connected to the original model in the similarity net.
  – In particular, if an unknown differs from a test model in the same way that a neighbour of the test model differs from the test model, then the neighbour should be examined next.

Page 77:

[Figure: a similarity net containing models M12, M35, M74, M121, and M332; the unknown U is compared via Sim(U,M35), and similarity links Sim(M35,M12) and Sim(M35,M74) connect M35 to its neighbours.]

Thus, attention moves not just to a family of likely models, but to a particular likely model.

Of course, it should be clear that the search procedure for traversing a similarity net is a hill-climbing one, because movement is to the nearest neighbour that seems most likely to yield an improved match with the unknown.

Page 78:

[Figure: the unknown has a spout; M35 (CUP) has no spout, while its neighbour M74 (JUG) does. The unknown differs from M35 in the same way M74 does, so M74 is examined next.]

Page 79:

Version-Spaces

• Description Generality & Partial Ordering
  – In all representation languages, sentences can be placed in a partial order according to the generality of each sentence.

For example (most general first):

∃c1 : Red(c1)

∃c1,c2 : Red(c1) & Red(c2)    ∃c1,c2 : Red(c1) & Black(c2)

∃c1,c2,c3 : Red(c1) & Red(c2) & Red(c3)
∃c1,c2,c3 : Red(c1) & Red(c2) & Black(c3)
∃c1,c2,c3 : Red(c1) & Black(c2) & Black(c3)

Page 80:

• Most General Concept
  – By dropping sentence conditions we can obtain the most general concept, usually the null description, which matches everything.

• Most Specific Concepts
  – The most specific concepts in the space correspond to the actual training instances themselves.

[Figure: the hypothesis space H, ordered from the Null Description (more general) at the top down to the Training Instances (more specific) at the bottom.]

Page 81:

• Boundary Representations– Thus, the set of hypotheses (H) can be represented

very compactly by two sets:

• G-SET, the set of most general elements in H.

• S-SET, the set of most specific elements in H.

[Figure: the G-SET and S-SET boundaries drawn inside the hypothesis space, between the null description and the training instances; the region between G and S is the version space, H'.]

• Version Space
– The set of plausible hypotheses, H'.
– Thus, H' is the set of all concept descriptions that are consistent with the training examples seen so far.

Page 82:

Version Space Method

• The Version-Space Method Shrinks H'
– Initially, G-SET is the null description and S-SET is the first positive example.
– As new examples are presented, the S-SET and G-SET descriptions change.
– If a positive example is seen, the program generalises, removing overly specific concept descriptions from H'.
– If a negative example is seen, the program must specialise, removing overly general concepts from H'.
– The effect is that H' shrinks, converging on a single concept description -- the target concept.

• Constraints
– Care must be taken when generalising and specialising that newly created models still lead to convergence.
– For example, each new specialisation must be a specialisation of some general model.

Page 83:

VP Algorithm

VP-Learn(S-Set, G-Set, Examples)

  S-Set = first positive example
  G-Set = null description

  For each remaining example, until S-Set and G-Set converge on a single concept:

    If the example is positive then
      VP-POS(S-Set, G-Set, Example)
    Else
      VP-NEG(S-Set, G-Set, Example)

Page 84:

VP-POS(S-Set, G-Set, Example)

Generalise all specific models to match the positive example, but ensure the following:

The new specific models involve minimal changes.

Each new specific model is a specialisation of some general model.

No new specific model is a generalisation of some other specific model.

Prune away all general models that fail to match the positive example.

Page 85:

VP-NEG(S-Set, G-Set, Example)

Specialise all general models to prevent a match with the negative example, but ensure the following:

The new general models involve minimal changes.

Each new general model is a generalisation of some specific model.

No new general model is a specialisation of some other general model.

Prune away all specific models that match the negative example.

Evidently, positive and negative examples are handled symmetrically by the version-space algorithm.

Note that multiple specialisations and generalisations are permitted in theory.
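
To make the procedure concrete, here is a runnable Python sketch of it for the attribute-vector representation used in the restaurant example on the following pages ('?' is a wildcard). It keeps a single specific model, as that example does; the names (vp_learn, matches, generalise, specialise) are our own, not from the original formulation.

def matches(hypothesis, example):
    """A hypothesis matches an example if every non-'?' attribute agrees."""
    return all(h == '?' or h == e for h, e in zip(hypothesis, example))

def generalise(s, example):
    """VP-POS step: minimally generalise the specific model by replacing
    each attribute that differs from the positive example with '?'."""
    return tuple(a if a == e else '?' for a, e in zip(s, example))

def specialise(g, s, example):
    """VP-NEG step: minimal specialisations of g that exclude the negative
    example while remaining generalisations of the specific model s
    (fill one '?' in g with the corresponding value from s)."""
    return [g[:i] + (s[i],) + g[i + 1:]
            for i in range(len(g))
            if g[i] == '?' and s[i] != '?' and s[i] != example[i]]

def vp_learn(examples):
    """examples: list of (attribute-tuple, is_positive) pairs; the first
    example must be positive, since it seeds the S-set."""
    s = examples[0][0]                        # most specific model
    g_set = [('?',) * len(s)]                 # most general model
    for ex, positive in examples[1:]:
        if positive:
            s = generalise(s, ex)
            g_set = [g for g in g_set if matches(g, ex)]   # prune G
        else:
            g_set = [h for g in g_set
                     for h in (specialise(g, s, ex) if matches(g, ex)
                               else [g])]
            # prune any general model subsumed by another general model
            g_set = [g for g in g_set
                     if not any(h != g and matches(h, g) for h in g_set)]
    return s, g_set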

Page 86:

[Figure, panels (a)-(d): the G and S boundaries between the most general model and the most specific model close in on each other. Negative samples specialise general models; positive samples generalise specific models; positive samples prune general models; negative samples prune specific models; eventually the boundaries converge on a solution.]

Page 87:

A Simple Example

    Restaurant   Meal        Day        Cost        Reaction
1   Sam's        Breakfast   Friday     Cheap       Yes
2   Lobdell      Lunch       Friday     Expensive   No
3   Sam's        Lunch       Saturday   Cheap       Yes
4   Sarah's      Breakfast   Sunday     Cheap       No
5   Sam's        Breakfast   Sunday     Expensive   No

The above table describes a patient's restaurant eating habits. The patient is suffering an occasional allergic reaction.

The task is to learn a model for the causes of this reaction in terms of restaurant, meal, day, and meal cost.

Page 88:

[ ? ? ? ? ]

[ Sam B’fast Fri Cheap ]

The birth of a Version-Space. It contains one positive example (the most specific model) and the null description (most general model).

We will keep generalisation and specialisation simple. To generalise a model we replace one attribute value with a question mark. To specialise we replace a question mark with a concrete value.
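
In code, using the tuple-with-'?' encoding from the sketch on the earlier page, these two minimal operations might look as follows (the helper names are our own):

def generalise_one(model, i):
    """Minimal generalisation: replace attribute i with a wildcard."""
    return model[:i] + ('?',) + model[i + 1:]

def specialise_one(model, i, value):
    """Minimal specialisation: fill the wildcard at position i."""
    assert model[i] == '?'
    return model[:i] + (value,) + model[i + 1:]

print(generalise_one(("Sam's", "B'fast", "Fri", "Cheap"), 2))
# -> ("Sam's", "B'fast", '?', 'Cheap')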

Page 89:

[ ? ? ? ? ]

[Sam B’fast Fri Cheap]

Negative: [Lobdell Lunch Friday Expensive]

[Sam ? ? ?] [? B’fast ? ?] [? ? Fri ?] [? ? ? Cheap]

The next example is negative and forces a specialisation of the general model. The result is four new models, one of which is invalid because it matches the negative example.

Note that each specialisation involves a minimal change to the most general model.

Each specialisation is formed by replacing a ? in the general model with the corresponding value in the most specific model. This ensures that each new specialisation is a generalisation of the most specific model -- this obviously cuts down the number of possible specialisations.

Page 90:

[ ? ? ? ? ]

[Sam B’fast Fri Cheap]

[Sam ? ? ?] [? B’fast ? ?] [? ? ? Cheap]

[Sam ? ? Cheap]

Positive: [Sam's Lunch Saturday Cheap]

The third example (positive) forces a generalisation of the specific model. To generalise, every attribute in the specific model that differs from the positive example is replaced by a question mark -- again a minimal change, to ensure convergence.

Also, one of the general models can be pruned as it cannot possibly match the positive example.

Page 91:

[ ? ? ? ? ]

[Sam B’fast Fri Cheap]

Negative: [Sarah's Breakfast Sunday Cheap]

[Sam ? ? ?]

[? ? ? Cheap]

[Sam ? ? Cheap]

[Sam’s ? ? Cheap]

Another negative example forces specialisation of any general model that matches it. Again, each new specialisation must remain a generalisation of a specific model (of which there is only one).

Note that the new specialisation [Sam's ? ? Cheap] is also a specialisation of another general model, namely [Sam ? ? ?]. So we prune away the new specialisation.

Page 92:

[ ? ? ? ? ]

[Sam B’fast Fri Cheap]

Negative: [Sam's Breakfast Sunday Expensive]

[Sam ? ? ?]

[Sam ? ? Cheap]

[Sam ? ? Cheap]

Same!

The final example is also negative. At this stage there are only two models left, one general and one specific. The negative example forces a specialisation of the general model in the direction of the specific one.

This results in a new general model that is the same as the specific one and the process has converged -- The patient is allergic to cheap food at Sam’s!
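
Running the vp_learn sketch from earlier on these five examples reproduces the trace just shown (assuming the tuple encoding; True marks an allergic reaction):

examples = [
    (("Sam's", "Breakfast", "Friday", "Cheap"), True),
    (("Lobdell", "Lunch", "Friday", "Expensive"), False),
    (("Sam's", "Lunch", "Saturday", "Cheap"), True),
    (("Sarah's", "Breakfast", "Sunday", "Cheap"), False),
    (("Sam's", "Breakfast", "Sunday", "Expensive"), False),
]
s, g = vp_learn(examples)
print(s)   # ("Sam's", '?', '?', 'Cheap')
print(g)   # [("Sam's", '?', '?', 'Cheap')]  -- converged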

Page 93:

Keep the Noise Down!

• Noisy Data Causes Problems
– For example, false positives result in over-generalisation.
– Eventually, noise can lead to a situation where no concept description is consistent with all of the training examples. When this happens, the G-Set passes the S-Set.

• Multiple-Boundary Solution
– One solution is to maintain multiple G and S sets. S0 and G0 are consistent with all examples, S1 and G1 are consistent with all but one, and so on…
– When G0 crosses S0, the algorithm concludes that no single concept will be consistent with all training instances and so checks S1 and G1.
– This works well for low levels of noise.
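
One brute-force way to realise this multiple-boundary idea on top of the earlier vp_learn sketch: if no concept survives all examples, retry on every subset that drops one example (giving S1/G1), then two, and so on. This is an illustration of the idea only, not the published algorithm.

from itertools import combinations

def vp_learn_noisy(examples, max_errors=2):
    """Return the first concept consistent with all but k examples,
    for the smallest k <= max_errors, together with k itself."""
    for k in range(max_errors + 1):           # k = 0 is the S0/G0 case
        for dropped in combinations(range(len(examples)), k):
            kept = [e for i, e in enumerate(examples) if i not in dropped]
            if not kept or not kept[0][1]:    # the S-set needs a positive seed
                continue
            s, g_set = vp_learn(kept)
            if g_set == [s]:                  # boundaries have converged
                return s, k
    return None, None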

Page 94:

[Figure: multiple boundary sets drawn between the null description and the training instances: general boundaries (G0, G1, …) and specific boundaries (S0, S1, S2, …), with the version space (H') between them.]

Page 95:

Deductive Learning

• The story so far…
– Learning from multiple examples.

  • Quinlan's ID3

  • Winston's near-miss method

  • Mitchell's Version Space procedure

– Many examples needed.

• Question
– How is it that people seem to be able to learn a lot from just a single example?
– For instance, what can we learn from the chess board overleaf?

Page 96:

The Fork Trap

The chess position shown is known as a “fork” because the white knight attacks both the king and the queen.

Black must move the king, thereby surrendering the queen.

Black can use this single experience to learn a lot about this trap.

In general, the following new rule can be acquired:

if any piece x attacks both the opponent's king and another piece y, then piece y will be lost.

Note that many examples of the fork trap are not needed.
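
As a rough sketch, the acquired rule could be expressed as a predicate over an attacks relation (the relation and all names are illustrative, not a full chess model):

def fork_victims(attacks, opponent_king):
    """Pieces y expected to be lost: some piece x attacks both the
    opponent's king and y, so the king must move and y is taken."""
    return {y for (x, y) in attacks
            if y != opponent_king and (x, opponent_king) in attacks}

# The white knight attacks both the black king and the black queen:
attacks = {("white-knight", "black-king"), ("white-knight", "black-queen")}
print(fork_victims(attacks, "black-king"))   # -> {'black-queen'}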

What makes such single-example learning possible?

Page 97:

Knowledge in Learning

• What makes single-example learning possible?
– Domain Knowledge
– E.g., rules of chess, previously acquired strategies

• How is this knowledge used?
– Identify critical aspects of the training example.
– Generalise from this example.

• Deductive Learning
– Explanation-Based Learning

Page 98:

EBL

• Explanation-Based Learning
– … learn from a single example x by explaining why x is an example of the target concept.

• Training Example

• Goal Concept
– High-level description of what the program is supposed to learn.

• Domain Theory
– A set of rules that describe the relationships between the objects and entities in a domain.

• An Operationality Criterion
– A predicate over concept descriptions, specifying the form in which the learned concept description must be expressed.

Page 99:

The training example input is familiar to us.

Providing a goal concept as input may seem strange. Up until now our learning programs have produced goal concepts as output. However, in EBL the goal concept is not operational. It is a high-level description of some learning goal.

The task in EBL is then to operationalise this concept -- to convert it into an expression that can be used by some problem solver.

The terms of this operational expression are provided in the operationality criterion.

Finally, a domain theory must be provided to guide the learning.

For example, remember the chess board…

GOAL CONCEPT: “Bad position for Black”

OP. CRITERION: Generalised board situation.

DOMAIN KN.: Rules of chess.

Page 100:

EBG

• Explanation-Based Generalisation
– … algorithm for EBL (Mitchell, 1986)

• Step 1: Explain
– The domain theory is used to prune away all unimportant aspects of the training example w.r.t. the goal concept.
– What is left is an explanation of why the example is an instance of the goal concept. This explanation is expressed in terms that satisfy the operationality criterion.

• Step 2: Generalise
– Generalise the explanation as far as possible while still describing the goal concept.

Page 101:

Example: CUP

• Training Example
– Owner(Object23, Ralph) & has-part(Object23, Concavity12) & is(Object23, Light) & Colour(Object23, Brown) & ...

• Domain Knowledge
– is(x, Light) & has-part(x, y) & isa(y, Handle) => Liftable(x)
– has-part(x, y) & isa(y, Bottom) & is(y, Flat) => Stable(x)
– has-part(x, y) & isa(y, Concavity) & is(y, Upward-Pointing) => Open-Vessel(x)

• Goal Concept
– CUP: x is a Cup iff x is Liftable, Stable, and an Open-Vessel.

• Operationality Criterion
– The concept definition must be expressed in purely structural terms (e.g., Light, Flat, etc.).

Page 102:

Step 1: Explain

We need to explain why Object23 is a cup.

We do this by constructing the proof shown, using standard theorem-proving techniques.

Notice that the proof has isolated only relevant features of the example. There is no mention of Owner or Colour.

Step 2: Generalise

The proof also serves as the basis for a valid generalisation.

We just replace constants with variables in our assumptions to get:

has-part(x,y) & isa(y,Concavity) & is(y,Upward-Pointing) & has-part(x,z) & isa(z,Bottom) & is(z,Flat) & has-part(x,w) & isa(w,Handle) & is(x,Light) => Cup(x)

Page 103:

[Proof tree for Cup(Object23):]

Cup(Object23)
  Liftable(Object23)
    is(Object23, Light)
    has-part(Object23, Handle16)
    isa(Handle16, Handle)
  Stable(Object23)
    has-part(Object23, Bottom19)
    isa(Bottom19, Bottom)
    is(Bottom19, Flat)
  Open-Vessel(Object23)
    has-part(Object23, Concavity12)
    isa(Concavity12, Concavity)
    is(Concavity12, Upward-Pointing)
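
A minimal sketch of Step 2 for this proof: collect the proof's operational leaves and replace each distinct constant with a variable, leaving structural terms alone. The leaf list is read off the tree above; the variablise helper and its variable-naming scheme are our own, not Mitchell's implementation.

leaves = [
    ("is", "Object23", "Light"),
    ("has-part", "Object23", "Handle16"),
    ("isa", "Handle16", "Handle"),
    ("has-part", "Object23", "Bottom19"),
    ("isa", "Bottom19", "Bottom"),
    ("is", "Bottom19", "Flat"),
    ("has-part", "Object23", "Concavity12"),
    ("isa", "Concavity12", "Concavity"),
    ("is", "Concavity12", "Upward-Pointing"),
]

STRUCTURAL = {"Light", "Handle", "Bottom", "Flat", "Concavity",
              "Upward-Pointing"}   # terms the operationality criterion allows

def variablise(leaves):
    """Replace each distinct constant with a fresh variable; keep
    structural terms as-is."""
    mapping, names = {}, iter("xyzwuv")
    out = []
    for pred, *args in leaves:
        new_args = []
        for a in args:
            if a in STRUCTURAL:
                new_args.append(a)
            else:
                if a not in mapping:
                    mapping[a] = next(names)
                new_args.append(mapping[a])
        out.append((pred, *new_args))
    return out

print(" & ".join(f"{p}({','.join(args)})" for p, *args in variablise(leaves)))
# is(x,Light) & has-part(x,y) & isa(y,Handle) & has-part(x,z) & ...
# Conjoined, these leaves give the operational definition of Cup(x).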

Page 104:

Issues

• Why do we need examples at all?
– We could have operationalised the goal concept Cup without referencing the example.
– Examples allow us to focus on relevant operationalisations.

• Providing a tractable domain theory is difficult.
– Complex or ill-structured domains pose problems for EBL.

• Generalisation
– Not always a matter of replacing constants with variables.