Induction of Decision Trees
Laurent Orseau ([email protected])
AgroParisTech
based on slides by Antoine Cornuéjols
Induction of decision trees
• Task
Learning a discrimination function for patterns of several classes
• Protocol
Supervised learning by greedy iterative approximation
• Criterion of success
Classification error rate
• Inputs
Attribute-value data (space with N dimensions)
• Target functions
Decision trees
1- Decision trees: example
• Decision trees are classifiers for attribute/value instances
A node of the tree tests an attribute
There is a branch for each value of the tested attribute
The leaves specify the categories (two or more)
[Figure: example diagnosis tree. The root node tests "pain?"; branches: abdomen → appendicitis; chest → "fever?" (yes → heart attack); throat → "fever?" (yes → a cold, no → throat aches); none → "cough?" (yes → a cold, no → nothing).]
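A minimal sketch of how such a tree can be represented and applied to a new attribute/value instance (the nested-tuple encoding and the exact labels are illustrative, not taken from the course):

```python
# A decision node is (attribute, {value: subtree}); a leaf is a class label.
diagnosis_tree = ("pain", {
    "abdomen": "appendicitis",
    "chest":   ("fever", {"yes": "heart attack", "no": "nothing"}),
    "throat":  ("fever", {"yes": "a cold", "no": "throat aches"}),
    "none":    ("cough", {"yes": "a cold", "no": "nothing"}),
})

def classify(tree, instance):
    """Follow the branch matching the instance's attribute values until a leaf is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

print(classify(diagnosis_tree, {"pain": "throat", "fever": "yes"}))  # -> a cold
```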
1- Decision trees: the problem
• Each instance is described by an attribute/value vector
• Input: a set of instances with their class (given by an expert)
• The learning algorithm must build a decision tree, e.g. a decision tree for diagnosis (a common application in Machine Learning)

        Cough  Fever  Weight  Pain
Marie   no     yes    normal  throat
Fred    no     yes    normal  abdomen
Julie   yes    yes    thin    none
Elvis   yes    no     obese   chest

        Cough  Fever  Weight  Pain     Diagnostic
Marie   no     yes    normal  throat   a cold
Fred    no     yes    normal  abdomen  appendicitis
...
1- Decision trees: expressive power
• The choice of the attributes is very important!
• If a crucial attribute is not represented
Not possible to induce a good decision tree
• If two instances have the same representation but belong to two different classes, the language of the instances (attributes) is said to be inadequate.
        Cough  Fever  Weight  Pain     Diagnostic
Marie   no     yes    normal  abdomen  a cold
Polo    no     yes    normal  abdomen  appendicitis
...
inadequate language
1- Decision trees: expressive power
• Any boolean function can be represented with a decision tree
– Note: with 6 boolean attributes, there are about 1.8*10^19 boolean functions…
• Depending on the functions to represent, the trees are more or less large
E.g. “parity” and “majority” function: exponential growth
Sometimes a single node is enough
• Limited to propositional logic (only attribute-value, no relation)
• A tree can be represented by a disjunction of rules:
(IF Feathers = no THEN Class = not-bird)
OR (IF Feathers = yes AND Color = brown THEN Class = not-bird)
OR (IF Feathers = yes AND Color = B&W THEN Class = bird)
OR (IF Feathers = yes AND Color = yellow THEN Class = bird)
DT4
2- Decision trees: choice of a tree
        Color   Wings  Feathers  Sonar  Concept
Falcon  yellow  yes    yes       no     bird
Pigeon  B&W     yes    yes       no     bird
Bat     brown   yes    no        yes    not bird
Four decision trees coherent with the data:
DT1: Feathers? (yes → bird, no → not bird)
DT2: Sonar? (yes → not bird, no → bird)
DT3: Color? (brown → not bird, yellow → bird, B&W → bird)
DT4: Feathers? (no → not bird, yes → Color? (brown → not bird, yellow → bird, B&W → bird))
2- Decision trees: the choice of a tree
• When the language is adequate, it is always possible to build a decision tree that correctly classifies all the training examples.
• There are often many correct decision trees.
• Enumeration of all trees is not possible (NP-completeness)
Number of binary trees with n nodes: $\frac{1}{n+1}\binom{2n}{n}$
How to give a value to a tree?
Requires a constructive iterative method
2- What model for generalization?
• Among all possible coherent hypotheses, which one to choose for a good generalization?
Is the intuitive answer…
... confirmed by theory?
• Some learnability theory [Vapnik,82,89,95]
Consistency of empirical risk minimization (ERM)
Principle of structural risk minimization (SRM)
• In short, trees must be short
How?
Methods of induction of decision trees
3- Induction of decision trees: Example [Quinlan,86]
Attributes        Pif                  Temp             Humid         Wind
Possible values   sunny, cloudy, rain  hot, warm, cool  normal, high  true, false

N°  Pif     Temp  Humid   Wind   Golf (class)
1   sunny   hot   high    false  DontPlay
2   sunny   hot   high    true   DontPlay
3   cloudy  hot   high    false  Play
4   rain    warm  high    false  Play
5   rain    cool  normal  false  Play
6   rain    cool  normal  true   DontPlay
7   cloudy  cool  normal  true   Play
8   sunny   warm  high    false  DontPlay
9   sunny   cool  normal  false  Play
10  rain    warm  normal  false  Play
11  sunny   warm  normal  true   Play
12  cloudy  warm  high    true   Play
13  cloudy  hot   normal  false  Play
14  rain    warm  high    true   DontPlay
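For the sketches that follow, it is convenient to have the same table in machine-readable form (a plain transcription of the table above; the encoding is mine):

```python
from collections import Counter

# The 14 examples of [Quinlan,86], as (Pif, Temp, Humid, Wind, Golf) records.
golf = [
    ("sunny",  "hot",  "high",   "false", "DontPlay"),
    ("sunny",  "hot",  "high",   "true",  "DontPlay"),
    ("cloudy", "hot",  "high",   "false", "Play"),
    ("rain",   "warm", "high",   "false", "Play"),
    ("rain",   "cool", "normal", "false", "Play"),
    ("rain",   "cool", "normal", "true",  "DontPlay"),
    ("cloudy", "cool", "normal", "true",  "Play"),
    ("sunny",  "warm", "high",   "false", "DontPlay"),
    ("sunny",  "cool", "normal", "false", "Play"),
    ("rain",   "warm", "normal", "false", "Play"),
    ("sunny",  "warm", "normal", "true",  "Play"),
    ("cloudy", "warm", "high",   "true",  "Play"),
    ("cloudy", "hot",  "normal", "false", "Play"),
    ("rain",   "warm", "high",   "true",  "DontPlay"),
]
print(Counter(row[-1] for row in golf))  # Counter({'Play': 9, 'DontPlay': 5})
```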
3- Induction of decision trees
• Strategy: Top-down induction: TDIDT
Best-first search, no backtracking, with an evaluation function
Recursive choice of an attribute to test until stopping criterion
• Operation:
Choose the first attribute as the root of the tree: the most informative one
Then, iterate the same operation on all sub-nodes
recursive algorithm
3- Induction of decision trees: example
• If we choose attribute Temp? ...
Temp?
Root:  + {J3,J4,J5,J7,J9,J10,J11,J12,J13},  - {J1,J2,J6,J8,J14}
hot:   + {J3,J13},           - {J1,J2}
warm:  + {J4,J10,J11,J12},   - {J8,J14}
cool:  + {J5,J7,J9},         - {J6}
(the data table is repeated on the slide for reference)
3- Induction of decision trees: TDIDT algorithm
PROCEDURE AAD(T, E)
  IF all examples of E are in the same class Ci
  THEN label the current node with Ci. END
  ELSE select an attribute A with values v1...vn
       Partition E with v1...vn into E1, ..., En
       For j = 1 to n: AAD(Tj, Ej)
[Figure: node T with test A = {v1...vn}; E is partitioned into E1, ..., En (E = E1 ∪ ... ∪ En), each branch vj leading to a subtree Tj built from Ej.]
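A compact Python sketch of this recursion (the attribute-selection step is kept abstract here; the majority-class fallback when no attribute is left is a common addition not shown on the slide):

```python
def build_tree(examples, attributes, choose_attribute):
    """examples: list of (attribute_dict, class) pairs.
    Returns a class label (leaf) or a node (attribute, {value: subtree})."""
    labels = [c for _, c in examples]
    if len(set(labels)) == 1:                      # all examples in the same class Ci
        return labels[0]
    if not attributes:                             # no test left: majority class
        return max(set(labels), key=labels.count)
    a = choose_attribute(examples, attributes)     # select an attribute A = {v1...vn}
    node = (a, {})
    for v in {x[a] for x, _ in examples}:          # partition E into E1, ..., En
        subset = [(x, c) for x, c in examples if x[a] == v]
        remaining = [b for b in attributes if b != a]
        node[1][v] = build_tree(subset, remaining, choose_attribute)  # AAD(Tj, Ej)
    return node
```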
3- Induction of decision trees: selection of attribute
Wind?
Root:   + {J3,J4,J5,J7,J9,J10,J11,J12,J13},  - {J1,J2,J6,J8,J14}
false:  + {J3,J4,J5,J9,J10,J13},  - {J1,J8}
true:   + {J7,J11,J12},           - {J2,J6,J14}

Pif?
Root:    + {J3,J4,J5,J7,J9,J10,J11,J12,J13},  - {J1,J2,J6,J8,J14}
sunny:   + {J9,J11},          - {J1,J2,J8}
cloudy:  + {J3,J7,J12,J13},   - {}
rain:    + {J4,J5,J10},       - {J6,J14}
3- Selection of the test attribute
• How to build a “simple” tree?
Simple tree:
Minimize the expected number of tests to classify a new object
How to translate this global criterion into a local choice procedure?
• Criteria to choose a node
We don't know how to associate a local criterion to the global objective criterion
Use of heuristics
Notion of measure of ”impurity”
– Gini Index
– Entropic criterion (ID3, C4.5, C5.0)
– ...
3- Measure of impurity: the Gini index
• Ideally:
Null measure if all populations are homogeneous
Maximal measure if the populations are maximally mixed
• Gini Index [Breiman et al., 84]
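The formula itself is left to the slide's figure; the usual definition from [Breiman et al., 84] is:

$$Gini(S) = 1 - \sum_{i=1}^{C} p(c_i)^2$$

It is null for a homogeneous (pure) node and maximal ($1 - 1/C$) when the $C$ classes are equiprobable.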
3- The entropic criterion(1/3)
• Boltzmann's entropy ...
• ... and Shannon's entropy
Shannon, 1949, proposed a measure of entropy for discrete probability distributions.
Expresses the quantity of information, i.e. the number of bits needed to specify the distribution
Information entropy:
$$I = -\sum_{i=1}^{k} p_i \log_2 p_i$$
where $p_i$ is the probability of class $C_i$.
3- The entropic criterion(2/3)
Information entropy of S (in C classes):
• Null when there is only one class
• The closer the classes are to equiprobability, the higher I(S)
• Equal to log2(k) when the k classes are equiprobable
• Unit: the bit of information
$$I(S) = -\sum_{i=1}^{C} p(c_i) \log_2 p(c_i)$$
where p(c_i) is the probability of class c_i.
3- The entropic criterion (3/3): case of two classes
[Figure: plot of I(S) as a function of P = p/(p+n), from 0 to 1; I(S) is 0 at P = 0 and P = 1 and reaches its maximum of 1 bit at P = 0.5 (equiprobability).]
• For C = 2:
$$I(S) = -p_+ \log_2(p_+) - p_- \log_2(p_-)$$
With $p_+ = p/(p+n)$ and $p_- = n/(p+n)$:
$$I(S) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
i.e. $I(S) = -P \log_2 P - (1-P)\log_2(1-P)$ with $P = p/(p+n)$ ($P = 0.5$ at equiprobability).
3- Entropic gain associated with an attribute
$$Gain(S, A) = I(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|}\, I(S_v)$$
|S_v|: size of the sub-population in the branch v of A
Measures how informative the knowledge of the value of attribute A is about the class of an example.
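A direct transcription of I(S) and Gain(S, A) in Python (a sketch; examples are assumed to be (attribute_dict, class) pairs as in the earlier sketches):

```python
from collections import Counter
from math import log2

def information(examples):
    """I(S): entropy of the class distribution of the examples, in bits."""
    counts = Counter(c for _, c in examples)
    total = sum(counts.values())
    return -sum(n / total * log2(n / total) for n in counts.values())

def gain(examples, attribute):
    """Gain(S, A) = I(S) - sum over values v of |Sv|/|S| * I(Sv)."""
    total = len(examples)
    remainder = 0.0
    for v in {x[attribute] for x, _ in examples}:
        subset = [(x, c) for x, c in examples if x[attribute] == v]
        remainder += len(subset) / total * information(subset)
    return information(examples) - remainder
```

Plugged into the TDIDT sketch above, `choose_attribute` simply returns the attribute with the largest `gain`.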
3- Example (1/4)
• Entropy of initial set of examples
I(p,n) = - 9/14 log2(9/14) - 5/14 log2(5/14) ≈ 0.940 bits
• Entropy of subtrees associated with test on Pif?
p1 = 4 n1 = 0: I(p1,n1) = 0
p2 = 2 n2 = 3: I(p2,n2) = 0.971
p3 = 3 n3 = 2: I(p3,n3) = 0.971
• Entropy of subtrees associated with test on Temp?
p1 = 2 n1 = 2: I(p1,n1) = 1
p2 = 4 n2 = 2: I(p2,n2) = 0.918
p3 = 3 n3 = 1: I(p3,n3) = 0.811
3- Example (2/4)
[Figure: attribute A splits N objects (n + p = N, entropy I(S)) into branches val1, val2, val3 containing N1, N2, N3 objects (nj + pj = Nj, N1 + N2 + N3 = N).]
E(N,A) = N1/N × I(p1,n1) + N2/N × I(p2,n2) + N3/N × I(p3,n3)
Information gain of A: GAIN(A) = I(S) - E(N,A)
3- Example (3/4)
• For the initial examples
I(S) = - 9/14 log2(9/14) - 5/14 log2(5/14)
• Entropy of the tree associated with test on Pif?
E(Pif) = 4/14 I(p1,n1) + 5/14 I(p2,n2) + 5/14 I(p3,n3)
Gain(Pif) = 0.940 - 0.694 = 0.246 bits
Gain(Temp) = 0.029 bits
Gain(Humid) = 0.151 bits
Gain(Wind) = 0.048 bits
Choice of attribute Pif for the first test
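These values can be checked directly from the class counts of each branch (counts read off the data table and the splits shown earlier; a numerical sketch, not course code):

```python
from math import log2

def I(p, n):
    """Two-class entropy, in bits."""
    return -sum(x / (p + n) * log2(x / (p + n)) for x in (p, n) if x)

splits = {                              # (positive, negative) counts per branch
    "Pif":   [(2, 3), (4, 0), (3, 2)],  # sunny, cloudy, rain
    "Temp":  [(2, 2), (4, 2), (3, 1)],  # hot, warm, cool
    "Humid": [(3, 4), (6, 1)],          # high, normal
    "Wind":  [(3, 3), (6, 2)],          # true, false
}
i_root = I(9, 5)                        # 0.940 bits
for attr, branches in splits.items():
    n = sum(p + q for p, q in branches)
    e = sum((p + q) / n * I(p, q) for p, q in branches)
    print(f"Gain({attr}) = {i_root - e:.3f} bits")
# Reproduces the gains above, up to rounding in the last digit.
```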
3- Example (4/4)
• Final tree built:
Pif?
  sunny  → Humid? (normal → play, high → don't play)
  cloudy → play
  rain   → Wind?  (true → don't play, false → play)
3- Some TDIDT systems
Input: vector of attributes-values associated with each example
Output: decision tree
• CLS (Hunt, 1966) [data analysis]
• ID3 (Quinlan, 1979)
• ACLS (Paterson & Niblett, 1983)
• ASSISTANT (Bratko, 1984)
• C4.5 (Quinlan, 1993)
• CART (Breiman, Friedman, Olshen, Stone, 1984)
4- Potential problems
1. Continuous-valued attributes
2. Attributes with different branching factors
3. Missing values
4. Overfitting
5. Greedy search
6. Choice of attributes
7. Variance of results:
• Different trees from similar data
4.1. Discretization of continuous attribute values
Here, two possible thresholds: 16°C and 30°C
The attribute Temp > 16°C is the most informative, and is kept

Temp       6°C  8°C  14°C  18°C  20°C  28°C  32°C
Play golf  No   No   No    Yes   Yes   Yes   No
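A sketch of the usual procedure: sort the values, place a candidate threshold between two consecutive examples whose class differs, and keep the most informative resulting binary test (placing the threshold at the midpoint is an assumption; other conventions exist):

```python
temp_play = [(6, "No"), (8, "No"), (14, "No"), (18, "Yes"),
             (20, "Yes"), (28, "Yes"), (32, "No")]   # (Temp in °C, Play golf), sorted

candidates = [(v1 + v2) / 2
              for (v1, c1), (v2, c2) in zip(temp_play, temp_play[1:])
              if c1 != c2]                           # class changes between v1 and v2
print(candidates)                                    # [16.0, 30.0]
# Each candidate t defines a binary attribute "Temp > t"; the information gain
# is computed for each, and the best one (here Temp > 16°C) is kept.
```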
4.2. Different branching factors
• Problem:
The entropic gain criterion favors attributes with a higher branching factor
• Two solutions:
Make all attributes binary
– But loss of legibility of the trees
Introduce a normalization factor
$$Gain\_norm(S, A) = \frac{Gain(S, A)}{-\sum_{i=1}^{|values(A)|} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}}$$
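In code, the denominator is the entropy of the partition itself ("split information"), which grows with the branching factor and therefore penalizes many-valued attributes (a sketch under the same assumptions as before):

```python
from math import log2

def split_information(subset_sizes):
    """Entropy of the partition induced by attribute A: -sum |Si|/|S| log2(|Si|/|S|)."""
    total = sum(subset_sizes)
    return -sum(s / total * log2(s / total) for s in subset_sizes if s)

def normalized_gain(gain_value, subset_sizes):
    return gain_value / split_information(subset_sizes)

# Splitting 14 examples 5/4/5 (like Pif) vs. into 14 singletons (an "identifier" attribute):
print(round(split_information([5, 4, 5]), 3))   # 1.577
print(round(split_information([1] * 14), 3))    # 3.807 -> such an attribute is heavily penalized
```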
4.3. Processing missing values
• Let ⟨x, c(x)⟩ be an example for which we don't know the value of attribute A
• How to compute gain(S,A)?
1. Take the most frequent value in entire S
2. Take the most frequent value at this node
3. Split the example into fictitious examples with the different possible values of A
weighted by their respective frequency
E.g. if 6 examples at this node take the value A=a1 and 4 the value A=a2
A(x) = a1 with prob=0.6 and A(x) = a2 with prob=0.4
For prediction, classify the example with the label of the most probable leaf.
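A sketch of option 3: the example is replaced by weighted copies, one per possible value, in proportion to the observed frequencies at the node (the weighting follows the slide; the function name is mine):

```python
def split_missing(example, weight, attribute, value_counts):
    """Return weighted copies of the example, one per possible value of the missing attribute."""
    total = sum(value_counts.values())
    copies = []
    for value, count in value_counts.items():
        completed = dict(example, **{attribute: value})
        copies.append((completed, weight * count / total))
    return copies

# 6 examples at the node have A=a1 and 4 have A=a2 (as on the slide):
print(split_missing({"B": "b1"}, 1.0, "A", {"a1": 6, "a2": 4}))
# [({'B': 'b1', 'A': 'a1'}, 0.6), ({'B': 'b1', 'A': 'a2'}, 0.4)]
```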
5- The generalization problem
• Training set. Test set.
• Learning curve
• Methods to evaluate generalization
On a test set
Cross validation
– “Leave-one-out”
Did we learn a good decision tree?
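For illustration only (not from the course), this is how such an evaluation is typically run today with scikit-learn, which builds CART-style trees:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# 10-fold cross validation: mean accuracy over the held-out folds
print(cross_val_score(tree, X, y, cv=10).mean())

# "Leave-one-out": each fold holds out a single example
print(cross_val_score(tree, X, y, cv=LeaveOneOut()).mean())
```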
5.1. Overfitting: Effect of noise on induction
• Types of noise
Description errors
Classification errors
“clashes”
Missing values
• Effects
Over-developed tree: too deep, too many leaves
5.1. Overfitting: The generalization problem
• Low empirical risk. High real risk.
• SRM (Structural Risk Minimization)
Justification [Vapnik,71,79,82,95]
– Notion of “capacity” of the hypothesis space
– Vapnik-Chervonenkis dimension
We must control the hypothesis space
5.1. Control of space H: motivations & strategies
• Motivations:
Improve generalization performance (SRM)
Build a legible model of the data (for experts)
• Strategies:
1. Direct control of the size of the induced tree: pruning
2. Modify the state space (trees) in which to search
3. Modify the search algorithm
4. Restrict the data set
5. Translate built trees into another representation
5.2. Overfitting: Controlling the size with pre-pruning
• Idea: modify the termination criterion
Depth threshold (e.g. [Holte,93]: threshold =1 or 2)
Chi2 test
Laplacian error
Low information gain
Low number of examples
Population of examples not statistically significant
Comparison between ”static error” and ” dynamic error”
• Problem: often too short-sighted
5.2. Example: Chi2 test
Let A be a binary attribute splitting the examples of a node, (n) = (n1, n2), into a left branch g (proportion P) and a right branch d (proportion 1-P).

Observed counts:  g: (ng1, ng2),  d: (nd1, nd2),  with n1 = ng1 + nd1 and n2 = ng2 + nd2

Null hypothesis (the split carries no information about the class): expected counts
  neg1 = P·n1 ; ned1 = (1-P)·n1
  neg2 = P·n2 ; ned2 = (1-P)·n2

$$\chi^2 = \sum_{i=1}^{2} \frac{(n_{gi} - P\,n_i)^2}{P\,n_i} + \sum_{i=1}^{2} \frac{(n_{di} - (1-P)\,n_i)^2}{(1-P)\,n_i}$$
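As an illustration (the counts are made up), the same test can be run from the observed contingency table of a candidate binary split; a large p-value suggests the split is not significant and pre-pruning should stop:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: branches g and d of the candidate split; columns: the two classes.
observed = np.array([[20,  5],
                     [10, 15]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}, expected counts:\n{expected}")
```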
5.3. Overfitting: Controlling the size with post-pruning
• Idea: prune after the construction of the whole tree, replacing subtrees by leaves when this optimizes a pruning criterion at a node.
• Many methods. Still lots of research.
Minimal Cost-Complexity Pruning (MCCP) (Breiman et al., 84)
Reduced Error Pruning (REP) (Quinlan,87,93)
Minimum Error Pruning (MEP) (Niblett & Bratko,86)
Critical Value Pruning (CVP) (Mingers,87)
Pessimistic Error Pruning (PEP) (Quinlan,87)
Error-Based Pruning (EBP) (Quinlan,93) (used in C4.5)
...
5.3- Cost-Complexity pruning
• [Breiman et al., 84]
• Cost-complexity for a tree:
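The formula did not survive extraction; the standard criterion from [Breiman et al., 84] is:

$$R_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|$$

where $R(T)$ is the empirical error of tree $T$ on the training data, $|\widetilde{T}|$ is its number of leaves, and $\alpha \ge 0$ trades accuracy against complexity; pruning keeps, for increasing $\alpha$, the subtree of the full tree that minimizes $R_\alpha$.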
6. Forward search
• Instead of a greedy search, search n nodes ahead
If I choose this node and then this node and then …
• But exponential growth of the number of computations
6. Modification of the search strategy
• Idea: no more depth first search
• Methods that use a different measure:
Minimum Description Length principle
– Measure of the complexity of the tree
– Measure of the complexity of the examples not coded by the tree
– Keep tree that minimizes the sum of these measures
Measures from (weak) learnability theory
Kolmogorov-Smirnov measure
Class separation measure
Mix of selection tests
7. Modification of the search space
• Modification of the node tests
To solve the problems of an inadequate representation
Methods of constructive induction (e.g. multivariate tests)
E.g. Oblique decision trees
• Methods:
Numerical Operators
– Perceptron trees
– Trees and Genetic Programming
Logical operators
7. Oblique trees
[Figure: a tree over numerical attributes x1, x2 partitioning the (x1, x2) plane into classes c1 and c2; besides axis-parallel tests (x1 < 0.70, x2 < 0.30, x2 < 0.88, x2 < 0.62, x1 < 0.17) it uses an oblique test 1.1·x1 + x2 < 0.2.]
7. Induction of oblique trees
• Another cause of overly large (leafy) trees: an inadequate representation
• Solutions:
Ask an expert (e.g. chess endgame [Quinlan,83])
Do a PCA beforehand
Other attribute selection method
Apply a constructive induction
Induction of oblique trees
8. Translation into other representations
• Idea: Translate a complex tree into a representation where the result is simpler
• Translation into decision graphs
• Translation into rule sets
9. Conclusions
• Appropriate for:
Classification of attribute-value examples
Attributes with discrete values
Resistance to noise
• Strategy:
Search by incremental construction of the hypothesis
Local choice (greedy, gradient-like) based on a statistical criterion
• Generates
Interpretable decision trees (e.g. production rules)
• Requires a control of the size of the tree