Support Vector Machines: a different approach to finding the decision boundary, particularly good at generalisation finishing off last lecture …


Suppose we can divide the classes with a simple hyperplane


There will be infinitely many such lines


One of them is ‘optimal’

Because it maximises the margin: the distance from the hyperplane to the ‘support vectors’, the instances that are closest to instances of a different class

A Support Vector Machine (SVM) finds this hyperplane


But, usually there is no simple hyperplane that separates the classes!


One dimension (x), two classes

Two dimensions (x, x*sin(x))

Now we can separate the classes

SVMs do this: if we add enough extra dimensions/fields using arbitrary functions of the existing fields, then it becomes very likely that we can separate the data.

SVMs apply such a transformation, then find the optimal separating hyperplane.

The ‘optimality’ of the separating hyperplane means good generalisation properties
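As a rough illustration of the extra-dimension idea, the sketch below lifts one-dimensional points x to (x, x*sin(x)). The data and labels are invented for this example (chosen so the two classes interleave along x), and `separable_by_threshold` is a hypothetical helper rather than part of any SVM library. A real SVM would go further and search for the maximum-margin hyperplane; this only checks whether a separating one exists.

```python
import math

# Hypothetical 1-D data: the class labels are made up for illustration,
# chosen so that the two classes interleave along the x axis.
xs = [0.5 * i for i in range(1, 40)]
labels = [1 if x * math.sin(x) > 0 else 0 for x in xs]

def separable_by_threshold(values, labels):
    """True if a single threshold on `values` splits the two classes."""
    pairs = sorted(zip(values, labels))
    # Sorted by value, a separable set changes class at most once.
    changes = sum(1 for a, b in zip(pairs, pairs[1:]) if a[1] != b[1])
    return changes <= 1

# Add a second field computed from the first: x -> (x, x*sin(x)).
lifted = [(x, x * math.sin(x)) for x in xs]

# In one dimension no single cut point separates the classes.
print(separable_by_threshold(xs, labels))
# A threshold on the new field is a horizontal line in the lifted 2-D space.
print(separable_by_threshold([p[1] for p in lifted], labels))
```

The second check thresholds only the new coordinate, which corresponds to a horizontal separating line in the two-dimensional space.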


Decision Trees


Real world applications of DTs

See here for a list: http://www.cbcb.umd.edu/~salzberg/docs/murthy_thesis/survey/node32.html

Includes: Agriculture, Astronomy, Biomedical Engineering, Control Systems, Financial analysis, Manufacturing and Production, Medicine, Molecular biology, Object recognition, Pharmacology, Physics, Plant diseases, Power systems, Remote Sensing, Software development, Text processing.

[Figure: example dataset annotated with field names, field values, and class values]

Why decision trees?

Popular, since they are interpretable

... and correspond to human reasoning/thinking about decision-making

Can perform quite well in accuracy when compared with other approaches

... and there are good algorithms to learn decision trees from data


Figure 1. Binary Strategy as a tree model.

Mohammed MA, Rudge G, Wood G, Smith G, et al. (2012) Which Is More Useful in Predicting Hospital Mortality - Dichotomised Blood Test Results or Actual Test Values? A Retrospective Study in Two Hospitals. PLoS ONE 7(10): e46860. doi:10.1371/journal.pone.0046860 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0046860

We will learn the ‘classic’ algorithm to learn a DT from categorical data: ID3

Suppose we want a tree that helps us predict someone’s politics, given their gender, age, and wealth

gender  age          wealth  politics
male    middle-aged  rich    Right-wing
male    young        rich    Right-wing
female  young        poor    Left-wing
female  middle-aged  poor    Left-wing
male    young        poor    Right-wing
male    old          poor    Right-wing

Choose a start node (field) at random

Age

Add branches for each value of this field

Age
  young
  mid
  old

Check to see what has filtered down

Age
  young: 1 L, 2 R
  mid: 1 L, 1 R
  old: 0 L, 1 R
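The "what has filtered down" check can be sketched as a grouping step: split the rows by their value for the chosen field and count the classes in each group. The row tuples, the `FIELDS` index map, and the function name `class_counts_by_value` are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

# The six training rows from the slides: (gender, age, wealth, politics).
ROWS = [
    ("male", "middle-aged", "rich", "Right-wing"),
    ("male", "young", "rich", "Right-wing"),
    ("female", "young", "poor", "Left-wing"),
    ("female", "middle-aged", "poor", "Left-wing"),
    ("male", "young", "poor", "Right-wing"),
    ("male", "old", "poor", "Right-wing"),
]
FIELDS = {"gender": 0, "age": 1, "wealth": 2}  # column index of each field

def class_counts_by_value(rows, field):
    """Group rows by their value for `field` and count the classes in each group."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[FIELDS[field]]][row[-1]] += 1
    return {value: dict(c) for value, c in counts.items()}

for value, counts in class_counts_by_value(ROWS, "age").items():
    print(value, counts)
```

A branch whose group contains a single class can be given that class straight away; a mixed group needs a further node.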

Where possible, assign a class value

Age
  young: 1 L, 2 R
  mid: 1 L, 1 R
  old: 0 L, 1 R -> Right-wing

Otherwise, we need to add further nodes

Age
  young: 1 L, 2 R -> ?
  mid: 1 L, 1 R -> ?
  old: 0 L, 1 R -> Right-wing

Repeat this process every time we need a new node

Starting with the first new node, choose a field at random

Age
  young: 1 L, 2 R -> wealth
  mid: 1 L, 1 R -> ?
  old: 0 L, 1 R -> Right-wing

Check the classes of the data at this node…

Age
  young: 1 L, 2 R -> wealth
    rich: 0 L, 1 R
    poor: 1 L, 1 R
  mid: 1 L, 1 R -> ?
  old: 0 L, 1 R -> Right-wing

And so on …

Age
  young: 1 L, 2 R -> wealth
    rich -> Right-wing
    poor: 1 L, 1 R
  mid: 1 L, 1 R -> ?
  old: 0 L, 1 R -> Right-wing

But we can do better than randomly chosen fields!

This is the tree we get if first choice is ‘gender’

gender
  male -> Right-wing
  female -> Left-wing
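One possible way to hold such a tree in code, and to use it for prediction, is a nested dictionary. The layout (`"field"` and `"branches"` keys) and the `classify` helper are assumptions for this sketch, not a standard representation.

```python
# The one-level tree from the slide, written as a nested dictionary.
# Leaves are plain class labels; inner nodes name the field they test.
GENDER_TREE = {
    "field": "gender",
    "branches": {"male": "Right-wing", "female": "Left-wing"},
}

def classify(tree, instance):
    """Follow the branch matching each tested field until a class label is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["field"]]]
    return tree

person = {"gender": "female", "age": "young", "wealth": "poor"}
print(classify(GENDER_TREE, person))
```

Because the loop descends while the current node is a dictionary, the same function works for deeper trees whose branches point at further nodes.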

Algorithms for building decision trees (of this type)

Initialise: tree T contains one ‘unexpanded’ node
Repeat until no unexpanded nodes:
  remove an unexpanded node U from T
  expand U by choosing a field
  add the resulting nodes to T
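The loop above can be sketched in Python as follows. The dictionary-based node layout, the `FIELDS` index map, and the pluggable `choose_field` parameter are assumptions for illustration. Passing a random chooser gives the random-field construction; passing an entropy-based chooser would give ID3-style behaviour.

```python
import random

# The six training rows from the slides: (gender, age, wealth, politics).
ROWS = [
    ("male", "middle-aged", "rich", "Right-wing"),
    ("male", "young", "rich", "Right-wing"),
    ("female", "young", "poor", "Left-wing"),
    ("female", "middle-aged", "poor", "Left-wing"),
    ("male", "young", "poor", "Right-wing"),
    ("male", "old", "poor", "Right-wing"),
]
FIELDS = {"gender": 0, "age": 1, "wealth": 2}  # column index of each field

def build_tree(rows, fields, choose_field):
    """Grow a tree by expanding unexpanded nodes until none remain."""
    root = {"rows": rows, "fields": list(fields)}
    unexpanded = [root]                  # T starts with one unexpanded node
    while unexpanded:                    # repeat until no unexpanded nodes
        node = unexpanded.pop()          # remove an unexpanded node U
        labels = [row[-1] for row in node["rows"]]
        if len(set(labels)) == 1 or not node["fields"]:
            # Leaf: one class left (or no fields left), so assign the majority class.
            node["label"] = max(sorted(set(labels)), key=labels.count)
            continue
        field = choose_field(node["rows"], node["fields"])  # expand U by choosing a field
        node["field"] = field
        node["children"] = {}
        remaining = [f for f in node["fields"] if f != field]
        for row in node["rows"]:
            child = node["children"].setdefault(
                row[FIELDS[field]], {"rows": [], "fields": remaining})
            child["rows"].append(row)
        unexpanded.extend(node["children"].values())        # add the resulting nodes to T
    return root

# Random field choice, as in the walkthrough (seeded only for repeatability).
random.seed(0)
tree = build_tree(ROWS, FIELDS, lambda rows, fields: random.choice(fields))
print(tree["field"], sorted(tree["children"]))
```

Keeping the field choice as a parameter separates the fixed expansion loop from the one decision that distinguishes the different tree-learning algorithms.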

Algorithms for building decision trees (of this type): the essential step is expanding a node

Field
  Value = X -> ?
  Value = Y -> ?
  Value = Z -> ?

So, which field?

Three choices: gender, age, or wealth

Suppose we choose age (table now sorted by age values)

gender  age          wealth  politics
male    middle-aged  rich    Right-wing
female  middle-aged  poor    Left-wing
male    old          poor    Right-wing
male    young        rich    Right-wing
female  young        poor    Left-wing
male    young        poor    Right-wing

Two of the values have a mixture of classes

Suppose we choose wealth (table now sorted by wealth values)

gender  age          wealth  politics
female  middle-aged  poor    Left-wing
male    old          poor    Right-wing
female  young        poor    Left-wing
male    young        poor    Right-wing
male    middle-aged  rich    Right-wing
male    young        rich    Right-wing

One of the values has a mixture of classes: is this choice a bit less mixed up than age?

Suppose we choose gender (table now sorted by gender values)

gender  age          wealth  politics
female  middle-aged  poor    Left-wing
female  young        poor    Left-wing
male    old          poor    Right-wing
male    middle-aged  rich    Right-wing
male    young        poor    Right-wing
male    young        rich    Right-wing

The classes are not mixed up at all within the values

So, at each step where we choose a node to expand, we make the choice where the relationship between the field values and the class values is least mixed up

Measuring ‘mixed-up’ness: Shannon’s entropy measure

Suppose you have a bag of N discrete things, and there are T different types of things.

Where p_T is the proportion of things in the bag that are of type T, the entropy of the bag is:

$$H = -\sum_{T} p_T \log(p_T)$$

Examples (logs are base 10):

This mixture: { left left left right right } has entropy: − ( 0.6 log(0.6) + 0.4 log(0.4) ) = 0.292

This mixture: { A A A A A A A A B C } has entropy: − ( 0.8 log(0.8) + 0.1 log(0.1) + 0.1 log(0.1) ) = 0.278

This mixture: { same same same same same same } has entropy: − ( 1.0 log(1.0) ) = 0

Lower entropy = less mixed up
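The three example values can be reproduced with a few lines of Python. This is a minimal sketch; the function name `entropy` is our own, and base-10 logs are used because that is what matches the figures quoted on the slide (other bases only rescale the values).

```python
import math
from collections import Counter

def entropy(bag):
    """Shannon entropy of a bag of items, using base-10 logs."""
    n = len(bag)
    # Sum -p * log10(p) over the proportion p of each type in the bag.
    return sum(-(c / n) * math.log10(c / n) for c in Counter(bag).values())

print(round(entropy(["left"] * 3 + ["right"] * 2), 3))  # 0.292
print(round(entropy(["A"] * 8 + ["B", "C"]), 3))        # 0.278
print(round(entropy(["same"] * 6), 3))                  # 0.0
```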

ID3 chooses fields based on entropy

Field1          Field2          Field3   …
val1 x p1       val1 x p1       val1 x p1
val2 x p2       val2 x p2       val2 x p2
val3 x p3       val3 x p3
= H(D|Field1)   = H(D|Field2)   = H(D|Field3)

Each val has an entropy value: how mixed up the classes are for that value choice.

And each val also has a proportion: how much of the data at this node has this val.

So ID3 works out H(D|Field) for each field, which is the sum of the entropies of the values weighted by their proportions.

The one with the lowest value is chosen; this maximises ‘Information Gain’
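A minimal sketch of the proportion-weighted entropy H(D|Field), assuming the data at a node is given as (field value, class) pairs. The function names are our own, and base-10 logs are used to stay consistent with the entropy examples. Information gain is the entropy of the classes before the split minus H(D|Field), which is why the lowest H(D|Field) gives the highest gain.

```python
import math
from collections import Counter, defaultdict

def entropy(classes):
    """Shannon entropy (base-10 logs) of a list of class labels."""
    n = len(classes)
    return sum(-(c / n) * math.log10(c / n) for c in Counter(classes).values())

def conditional_entropy(pairs):
    """H(D|Field): entropy of each value's classes, weighted by that value's proportion."""
    by_value = defaultdict(list)
    for value, cls in pairs:
        by_value[value].append(cls)
    total = len(pairs)
    return sum(len(cs) / total * entropy(cs) for cs in by_value.values())

# (wealth, politics) pairs taken from the six-row example table.
wealth_pairs = [("rich", "R"), ("rich", "R"), ("poor", "L"),
                ("poor", "L"), ("poor", "R"), ("poor", "R")]

h_before = entropy([cls for _, cls in wealth_pairs])
h_after = conditional_entropy(wealth_pairs)
print(round(h_after, 3))             # 0.201
print(round(h_before - h_after, 3))  # information gain of splitting on wealth
```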

Back to the choice between gender, age, or wealth

Suppose we choose age

H(D|age) = proportion-weighted entropy =
  0.3333 x − ( 0.5 x log(0.5) + 0.5 x log(0.5) )      (middle-aged: 1 L, 1 R)
+ 0.1666 x − ( 1 x log(1) )                            (old: 0 L, 1 R)
+ 0.5 x − ( 0.33 x log(0.33) + 0.66 x log(0.66) )      (young: 1 L, 2 R)
≈ 0.239 with base-10 logs

Suppose we choose wealth

H(D|wealth) =
  0.6666 x − ( 0.5 x log(0.5) + 0.5 x log(0.5) )      (poor: 2 L, 2 R)
+ 0.3333 x − ( 1 x log(1) )                            (rich: 0 L, 2 R)
≈ 0.201 with base-10 logs

Suppose we choose gender

H(D|gender) =
  0.3333 x − ( 1 x log(1) )      (female: 2 L, 0 R)
+ 0.6666 x − ( 1 x log(1) )      (male: 0 L, 4 R)
= 0

This is the one we would choose ...
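The three calculations can be checked with a short script that scores every field on the six-row table and picks the lowest H(D|Field). The row layout, the `FIELDS` index map, and the function names are assumptions for this sketch; logs are base 10, as in the entropy examples.

```python
import math
from collections import Counter, defaultdict

# The six training rows from the slides: (gender, age, wealth, politics).
ROWS = [
    ("male", "middle-aged", "rich", "Right-wing"),
    ("male", "young", "rich", "Right-wing"),
    ("female", "young", "poor", "Left-wing"),
    ("female", "middle-aged", "poor", "Left-wing"),
    ("male", "young", "poor", "Right-wing"),
    ("male", "old", "poor", "Right-wing"),
]
FIELDS = {"gender": 0, "age": 1, "wealth": 2}  # column index of each field

def entropy(classes):
    """Shannon entropy (base-10 logs) of a list of class labels."""
    n = len(classes)
    return sum(-(c / n) * math.log10(c / n) for c in Counter(classes).values())

def conditional_entropy(rows, field):
    """H(D|field): per-value class entropy weighted by the value's share of the rows."""
    by_value = defaultdict(list)
    for row in rows:
        by_value[row[FIELDS[field]]].append(row[-1])
    return sum(len(cs) / len(rows) * entropy(cs) for cs in by_value.values())

scores = {field: conditional_entropy(ROWS, field) for field in FIELDS}
best = min(scores, key=scores.get)  # lowest H(D|field) = highest information gain

for field, h in scores.items():
    print(field, round(h, 3))
print("chosen:", best)
```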

Alternatives to Information Gain: all, somehow or other, give a measure of mixed-upness and have been used in building DTs

• Chi Square
• Gain Ratio
• Symmetric Gain Ratio
• Gini index
• Modified Gini index
• Symmetric Gini index
• J-Measure
• Minimum Description Length
• Relevance
• RELIEF
• Weight of Evidence

Decision Trees

Further reading is on Google

Interesting topics in context are:

Pruning: close a branch down before you hit 0 entropy (why?)

Discretization and regression: trees that deal with real-valued fields

Decision Forests: what do you think these are?