Induction of Decision Trees
Laurent Orseau ([email protected])
AgroParisTech
based on slides by Antoine Cornuéjols
Induction of decision trees
• Task
Learning a discrimination function for patterns of several classes
• Protocol
Supervised learning by greedy iterative approximation
• Criterion of success
Classification error rate
• Inputs
Attribute-value data (space with N dimensions)
• Target functions
Decision trees
1- Decision trees: example
• Decision trees are classifiers for attribute/value instances
A node of the tree tests an attribute
There is a branch for each value of the tested attribute
The leaves specify the categories (two or more)
[Figure: example diagnosis tree. The root node tests "pain?"; branches: abdomen → appendicitis; chest → "fever?" (yes → heart attack); throat → "fever?" (yes → a cold, no → throat aches); none → "cough?" (yes → a cold, no → nothing).]
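A minimal sketch of how such a tree can be represented and applied to a new attribute/value instance (the nested-tuple encoding and the exact labels are illustrative, not taken from the course):

```python
# A decision node is (attribute, {value: subtree}); a leaf is a class label.
diagnosis_tree = ("pain", {
    "abdomen": "appendicitis",
    "chest":   ("fever", {"yes": "heart attack", "no": "nothing"}),
    "throat":  ("fever", {"yes": "a cold", "no": "throat aches"}),
    "none":    ("cough", {"yes": "a cold", "no": "nothing"}),
})

def classify(tree, instance):
    """Follow the branch matching the instance's attribute values until a leaf is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

print(classify(diagnosis_tree, {"pain": "throat", "fever": "yes"}))  # -> a cold
```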
1- Decision trees: the problem
• Each instance is described by an attribute/value vector
• Input: a set of instances with their class (given by an expert)
• The learning algorithm must build a decision tree, e.g. a decision tree for diagnosis (a common application in Machine Learning)

        Cough  Fever  Weight  Pain
Marie   no     yes    normal  throat
Fred    no     yes    normal  abdomen
Julie   yes    yes    thin    none
Elvis   yes    no     obese   chest

        Cough  Fever  Weight  Pain     Diagnostic
Marie   no     yes    normal  throat   a cold
Fred    no     yes    normal  abdomen  appendicitis
...
1- Decision trees: expressive power
• The choice of the attributes is very important!
• If a crucial attribute is not represented
Not possible to induce a good decision tree
• If two instances have the same representation but belong to two different classes, the language of the instances (attributes) is said to be inadequate.
        Cough  Fever  Weight  Pain     Diagnostic
Marie   no     yes    normal  abdomen  a cold
Polo    no     yes    normal  abdomen  appendicitis
...
inadequate language
1- Decision trees: expressive power
• Any boolean function can be represented with a decision tree
– Note: with 6 boolean attributes, there are about 1.8*10^19 boolean functions…
• Depending on the functions to represent, the trees are more or less large
E.g. “parity” and “majority” function: exponential growth
Sometimes a single node is enough
• Limited to propositional logic (only attribute-value, no relation)
• A tree can be represented by a disjunction of rules:
(IF Feathers = no THEN Class = not-bird)
OR (IF Feathers = yes AND Color = brown THEN Class = not-bird)
OR (IF Feathers = yes AND Color = B&W THEN Class = bird)
OR (IF Feathers = yes AND Color = yellow THEN Class = bird)
DT4
2- Decision trees: choice of a tree
        Color   Wings  Feathers  Sonar  Concept
Falcon  yellow  yes    yes       no     bird
Pigeon  B&W     yes    yes       no     bird
Bat     brown   yes    no        yes    not bird
Four decision trees coherent with the data:
DT1: Feathers? (yes → bird, no → not bird)
DT2: Sonar? (yes → not bird, no → bird)
DT3: Color? (brown → not bird, yellow → bird, B&W → bird)
DT4: Feathers? (no → not bird, yes → Color? (brown → not bird, yellow → bird, B&W → bird))
2- Decision trees: the choice of a tree
• When the language is adequate, it is always possible to build a decision tree that correctly classifies all the training examples.
• There are often many correct decision trees.
• Enumeration of all trees is not possible (NP-completeness)
Number of binary trees with n nodes: $\frac{1}{n+1}\binom{2n}{n}$
How to give a value to a tree?
Requires a constructive iterative method
2- What model for generalization?
• Among all possible coherent hypotheses, which one to choose for a good generalization?
Is the intuitive answer…
... confirmed by theory?
• Some learnability theory [Vapnik,82,89,95]
Consistency of empirical risk minimization (ERM)
Principle of structural risk minimization (SRM)
• In short, trees must be short
How?
Methods of induction of decision trees
3- Induction of decision trees: Example [Quinlan,86]
Attributes        Pif                  Temp             Humid         Wind
Possible values   sunny, cloudy, rain  hot, warm, cool  normal, high  true, false

N°  Pif     Temp  Humid   Wind   Golf (class)
1   sunny   hot   high    false  DontPlay
2   sunny   hot   high    true   DontPlay
3   cloudy  hot   high    false  Play
4   rain    warm  high    false  Play
5   rain    cool  normal  false  Play
6   rain    cool  normal  true   DontPlay
7   cloudy  cool  normal  true   Play
8   sunny   warm  high    false  DontPlay
9   sunny   cool  normal  false  Play
10  rain    warm  normal  false  Play
11  sunny   warm  normal  true   Play
12  cloudy  warm  high    true   Play
13  cloudy  hot   normal  false  Play
14  rain    warm  high    true   DontPlay
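For the sketches that follow, it is convenient to have the same table in machine-readable form (a plain transcription of the table above; the encoding is mine):

```python
from collections import Counter

# The 14 examples of [Quinlan,86], as (Pif, Temp, Humid, Wind, Golf) records.
golf = [
    ("sunny",  "hot",  "high",   "false", "DontPlay"),
    ("sunny",  "hot",  "high",   "true",  "DontPlay"),
    ("cloudy", "hot",  "high",   "false", "Play"),
    ("rain",   "warm", "high",   "false", "Play"),
    ("rain",   "cool", "normal", "false", "Play"),
    ("rain",   "cool", "normal", "true",  "DontPlay"),
    ("cloudy", "cool", "normal", "true",  "Play"),
    ("sunny",  "warm", "high",   "false", "DontPlay"),
    ("sunny",  "cool", "normal", "false", "Play"),
    ("rain",   "warm", "normal", "false", "Play"),
    ("sunny",  "warm", "normal", "true",  "Play"),
    ("cloudy", "warm", "high",   "true",  "Play"),
    ("cloudy", "hot",  "normal", "false", "Play"),
    ("rain",   "warm", "high",   "true",  "DontPlay"),
]
print(Counter(row[-1] for row in golf))  # Counter({'Play': 9, 'DontPlay': 5})
```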
3- Induction of decision trees
• Strategy: Top-down induction: TDIDT
Best-first search, no backtracking, with an evaluation function
Recursive choice of an attribute to test until stopping criterion
• Operation:
Choose the first attribute as the root of the tree: the most informative one
Then, iterate the same operation on all sub-nodes
recursive algorithm
3- Induction of decision trees: example
• If we choose attribute Temp? ...
Temp?
Root:  + {J3,J4,J5,J7,J9,J10,J11,J12,J13},  - {J1,J2,J6,J8,J14}
hot:   + {J3,J13},           - {J1,J2}
warm:  + {J4,J10,J11,J12},   - {J8,J14}
cool:  + {J5,J7,J9},         - {J6}
(the data table is repeated on the slide for reference)
3- Induction of decision trees: TDIDT algorithm
PROCEDURE AAD(T, E)
  IF all examples of E are in the same class Ci
  THEN label the current node with Ci. END
  ELSE select an attribute A with values v1...vn
       Partition E with v1...vn into E1, ..., En
       For j = 1 to n: AAD(Tj, Ej)
[Figure: node T with test A = {v1...vn}; E is partitioned into E1, ..., En (E = E1 ∪ ... ∪ En), each branch vj leading to a subtree Tj built from Ej.]
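A compact Python sketch of this recursion (the attribute-selection step is kept abstract here; the majority-class fallback when no attribute is left is a common addition not shown on the slide):

```python
def build_tree(examples, attributes, choose_attribute):
    """examples: list of (attribute_dict, class) pairs.
    Returns a class label (leaf) or a node (attribute, {value: subtree})."""
    labels = [c for _, c in examples]
    if len(set(labels)) == 1:                      # all examples in the same class Ci
        return labels[0]
    if not attributes:                             # no test left: majority class
        return max(set(labels), key=labels.count)
    a = choose_attribute(examples, attributes)     # select an attribute A = {v1...vn}
    node = (a, {})
    for v in {x[a] for x, _ in examples}:          # partition E into E1, ..., En
        subset = [(x, c) for x, c in examples if x[a] == v]
        remaining = [b for b in attributes if b != a]
        node[1][v] = build_tree(subset, remaining, choose_attribute)  # AAD(Tj, Ej)
    return node
```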
3- Induction of decision trees: selection of attribute
Wind?
Root:   + {J3,J4,J5,J7,J9,J10,J11,J12,J13},  - {J1,J2,J6,J8,J14}
false:  + {J3,J4,J5,J9,J10,J13},  - {J1,J8}
true:   + {J7,J11,J12},           - {J2,J6,J14}

Pif?
Root:    + {J3,J4,J5,J7,J9,J10,J11,J12,J13},  - {J1,J2,J6,J8,J14}
sunny:   + {J9,J11},          - {J1,J2,J8}
cloudy:  + {J3,J7,J12,J13},   - {}
rain:    + {J4,J5,J10},       - {J6,J14}
3- Selection of the test attribute
• How to build a “simple” tree?
Simple tree:
Minimize the expected number of tests to classify a new object
How to translate this global criterion into a local choice procedure?
• Criteria to choose a node
We don't know how to associate a local criterion to the global objective criterion
Use of heuristics
Notion of measure of ”impurity”
– Gini Index
– Entropic criterion (ID3, C4.5, C5.0)
– ...
3- Measure of impurity: the Gini index
• Ideally:
Null measure if all populations are homogeneous
Maximal measure if the populations are maximally mixed
• Gini Index [Breiman et al., 84]
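The formula itself is left to the slide's figure; the usual definition from [Breiman et al., 84] is:

$$Gini(S) = 1 - \sum_{i=1}^{C} p(c_i)^2$$

It is null for a homogeneous (pure) node and maximal ($1 - 1/C$) when the $C$ classes are equiprobable.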
3- The entropic criterion(1/3)
• Boltzmann's entropy ...
• ... and Shannon's entropy
Shannon, 1949, proposed a measure of entropy for discrete probability distributions.
Expresses the quantity of information, i.e. the number of bits needed to specify the distribution
Information entropy:
$$I = -\sum_{i=1}^{k} p_i \log_2 p_i$$
where $p_i$ is the probability of class $C_i$.
3- The entropic criterion(2/3)
Information entropy of S (in C classes):
• Null when there is only one class
• The closer the classes are to equiprobability, the higher I(S)
• Equal to log2(k) when the k classes are equiprobable
• Unit: the bit of information
$$I(S) = -\sum_{i=1}^{C} p(c_i) \log_2 p(c_i)$$
where p(c_i) is the probability of class c_i.
3- The entropic criterion (3/3): case of two classes
[Figure: plot of I(S) as a function of P = p/(p+n), from 0 to 1; I(S) is 0 at P = 0 and P = 1 and reaches its maximum of 1 bit at P = 0.5 (equiprobability).]
• For C = 2:
$$I(S) = -p_+ \log_2(p_+) - p_- \log_2(p_-)$$
With $p_+ = p/(p+n)$ and $p_- = n/(p+n)$:
$$I(S) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
i.e. $I(S) = -P \log_2 P - (1-P)\log_2(1-P)$ with $P = p/(p+n)$ ($P = 0.5$ at equiprobability).
3- Entropic gain associated with an attribute
$$Gain(S, A) = I(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|}\, I(S_v)$$
|S_v|: size of the sub-population in the branch v of A
Measures how informative the knowledge of the value of attribute A is about the class of an example.
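A direct transcription of I(S) and Gain(S, A) in Python (a sketch; examples are assumed to be (attribute_dict, class) pairs as in the earlier sketches):

```python
from collections import Counter
from math import log2

def information(examples):
    """I(S): entropy of the class distribution of the examples, in bits."""
    counts = Counter(c for _, c in examples)
    total = sum(counts.values())
    return -sum(n / total * log2(n / total) for n in counts.values())

def gain(examples, attribute):
    """Gain(S, A) = I(S) - sum over values v of |Sv|/|S| * I(Sv)."""
    total = len(examples)
    remainder = 0.0
    for v in {x[attribute] for x, _ in examples}:
        subset = [(x, c) for x, c in examples if x[attribute] == v]
        remainder += len(subset) / total * information(subset)
    return information(examples) - remainder
```

Plugged into the TDIDT sketch above, `choose_attribute` simply returns the attribute with the largest `gain`.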
3- Example (1/4)
• Entropy of initial set of examples
I(p,n) = - 9/14 log2(9/14) - 5/14 log2(5/14) ≈ 0.940 bits
• Entropy of subtrees associated with test on Pif?
p1 = 4 n1 = 0: I(p1,n1) = 0
p2 = 2 n2 = 3: I(p2,n2) = 0.971
p3 = 3 n3 = 2: I(p3,n3) = 0.971
• Entropy of subtrees associated with test on Temp?
p1 = 2 n1 = 2: I(p1,n1) = 1
p2 = 4 n2 = 2: I(p2,n2) = 0.918
p3 = 3 n3 = 1: I(p3,n3) = 0.811
3- Example (2/4)
[Figure: attribute A splits N objects (n + p = N, entropy I(S)) into branches val1, val2, val3 containing N1, N2, N3 objects (nj + pj = Nj, N1 + N2 + N3 = N).]
E(N,A) = N1/N × I(p1,n1) + N2/N × I(p2,n2) + N3/N × I(p3,n3)
Information gain of A: GAIN(A) = I(S) - E(N,A)
3- Example (3/4)
• For the initial examples
I(S) = - 9/14 log2(9/14) - 5/14 log2(5/14)
• Entropy of the tree associated with test on Pif?
E(Pif) = 4/14 I(p1,n1) + 5/14 I(p2,n2) + 5/14 I(p3,n3)
Gain(Pif) = 0.940 - 0.694 = 0.246 bits
Gain(Temp) = 0.029 bits
Gain(Humid) = 0.151 bits
Gain(Wind) = 0.048 bits
Choice of attribute Pif for the first test
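These values can be checked directly from the class counts of each branch (counts read off the data table and the splits shown earlier; a numerical sketch, not course code):

```python
from math import log2

def I(p, n):
    """Two-class entropy, in bits."""
    return -sum(x / (p + n) * log2(x / (p + n)) for x in (p, n) if x)

splits = {                              # (positive, negative) counts per branch
    "Pif":   [(2, 3), (4, 0), (3, 2)],  # sunny, cloudy, rain
    "Temp":  [(2, 2), (4, 2), (3, 1)],  # hot, warm, cool
    "Humid": [(3, 4), (6, 1)],          # high, normal
    "Wind":  [(3, 3), (6, 2)],          # true, false
}
i_root = I(9, 5)                        # 0.940 bits
for attr, branches in splits.items():
    n = sum(p + q for p, q in branches)
    e = sum((p + q) / n * I(p, q) for p, q in branches)
    print(f"Gain({attr}) = {i_root - e:.3f} bits")
# Reproduces the gains above, up to rounding in the last digit.
```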
3- Example (4/4)
• Final tree built:
Pif?
  sunny  → Humid? (normal → play, high → don't play)
  cloudy → play
  rain   → Wind?  (true → don't play, false → play)
3- Some TDIDT systems
Input: vector of attributes-values associated with each example
Output: decision tree
• CLS (Hunt, 1966) [data analysis]
• ID3 (Quinlan, 1979)
• ACLS (Paterson & Niblett, 1983)
• ASSISTANT (Bratko, 1984)
• C4.5 (Quinlan, 1993)
• CART (Breiman, Friedman, Olshen, Stone, 1984)
4- Potential problems
1. Continuous-valued attributes
2. Attributes with different branching factors
3. Missing values
4. Overfitting
5. Greedy search
6. Choice of attributes
7. Variance of results:
• Different trees from similar data
4.1. Discretization of continuous attribute values
Here, two possible thresholds: 16°C and 30°C
The attribute Temp > 16°C is the most informative, and is kept

Temp       6°C  8°C  14°C  18°C  20°C  28°C  32°C
Play golf  No   No   No    Yes   Yes   Yes   No
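A sketch of the usual procedure: sort the values, place a candidate threshold between two consecutive examples whose class differs, and keep the most informative resulting binary test (placing the threshold at the midpoint is an assumption; other conventions exist):

```python
temp_play = [(6, "No"), (8, "No"), (14, "No"), (18, "Yes"),
             (20, "Yes"), (28, "Yes"), (32, "No")]   # (Temp in °C, Play golf), sorted

candidates = [(v1 + v2) / 2
              for (v1, c1), (v2, c2) in zip(temp_play, temp_play[1:])
              if c1 != c2]                           # class changes between v1 and v2
print(candidates)                                    # [16.0, 30.0]
# Each candidate t defines a binary attribute "Temp > t"; the information gain
# is computed for each, and the best one (here Temp > 16°C) is kept.
```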
4.2. Different branching factors
• Problem:
The entropic gain criterion favors attributes with a higher branching factor
• Two solutions:
Make all attributes binary
– But loss of legibility of the trees
Introduce a normalization factor
$$Gain\_norm(S, A) = \frac{Gain(S, A)}{-\sum_{i=1}^{|values(A)|} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}}$$
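In code, the denominator is the entropy of the partition itself ("split information"), which grows with the branching factor and therefore penalizes many-valued attributes (a sketch under the same assumptions as before):

```python
from math import log2

def split_information(subset_sizes):
    """Entropy of the partition induced by attribute A: -sum |Si|/|S| log2(|Si|/|S|)."""
    total = sum(subset_sizes)
    return -sum(s / total * log2(s / total) for s in subset_sizes if s)

def normalized_gain(gain_value, subset_sizes):
    return gain_value / split_information(subset_sizes)

# Splitting 14 examples 5/4/5 (like Pif) vs. into 14 singletons (an "identifier" attribute):
print(round(split_information([5, 4, 5]), 3))   # 1.577
print(round(split_information([1] * 14), 3))    # 3.807 -> such an attribute is heavily penalized
```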
4.3. Processing missing values
• Let ⟨x, c(x)⟩ be an example for which we don't know the value of attribute A
• How to compute gain(S,A)?
1. Take the most frequent value in entire S
2. Take the most frequent value at this node
3. Split the example into fictitious examples with the different possible values of A
weighted by their respective frequency
E.g. if 6 examples at this node take the value A=a1 and 4 the value A=a2
A(x) = a1 with prob=0.6 and A(x) = a2 with prob=0.4
For prediction, classify the example with the label of the most probable leaf.
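A sketch of option 3: the example is replaced by weighted copies, one per possible value, in proportion to the observed frequencies at the node (the weighting follows the slide; the function name is mine):

```python
def split_missing(example, weight, attribute, value_counts):
    """Return weighted copies of the example, one per possible value of the missing attribute."""
    total = sum(value_counts.values())
    copies = []
    for value, count in value_counts.items():
        completed = dict(example, **{attribute: value})
        copies.append((completed, weight * count / total))
    return copies

# 6 examples at the node have A=a1 and 4 have A=a2 (as on the slide):
print(split_missing({"B": "b1"}, 1.0, "A", {"a1": 6, "a2": 4}))
# [({'B': 'b1', 'A': 'a1'}, 0.6), ({'B': 'b1', 'A': 'a2'}, 0.4)]
```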
5- The generalization problem
• Training set. Test set.
• Learning curve
• Methods to evaluate generalization
On a test set
Cross validation
– “Leave-one-out”
Did we learn a good decision tree?
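For illustration only (not from the course), this is how such an evaluation is typically run today with scikit-learn, which builds CART-style trees:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# 10-fold cross validation: mean accuracy over the held-out folds
print(cross_val_score(tree, X, y, cv=10).mean())

# "Leave-one-out": each fold holds out a single example
print(cross_val_score(tree, X, y, cv=LeaveOneOut()).mean())
```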
5.1. Overfitting: Effect of noise on induction
• Types of noise
Description errors
Classification errors
“clashes”
Missing values
• Effects
Over-developed tree: too deep, too many leaves
5.1. Overfitting: The generalization problem
• Low empirical risk. High real risk.
• SRM (Structural Risk Minimization)
Justification [Vapnik,71,79,82,95]
– Notion of “capacity” of the hypothesis space
– Vapnik-Chervonenkis dimension
We must control the hypothesis space
5.1. Control of space H: motivations & strategies
• Motivations:
Improve generalization performance (SRM)
Build a legible model of the data (for experts)
• Strategies:
1. Direct control of the size of the induced tree: pruning
2. Modify the state space (trees) in which to search
3. Modify the search algorithm
4. Restrict the data set
5. Translate built trees into another representation
5.2. Overfitting: Controlling the size with pre-pruning
• Idea: modify the termination criterion
Depth threshold (e.g. [Holte,93]: threshold =1 or 2)
Chi2 test
Laplacian error
Low information gain
Low number of examples
Population of examples not statistically significant
Comparison between ”static error” and ” dynamic error”
• Problem: often too short-sighted
5.2. Example: Chi2 test
Let A be a binary attribute splitting the examples of a node, (n) = (n1, n2), into a left branch g (proportion P) and a right branch d (proportion 1-P).

Observed counts:  g: (ng1, ng2),  d: (nd1, nd2),  with n1 = ng1 + nd1 and n2 = ng2 + nd2

Null hypothesis (the split carries no information about the class): expected counts
  neg1 = P·n1 ; ned1 = (1-P)·n1
  neg2 = P·n2 ; ned2 = (1-P)·n2

$$\chi^2 = \sum_{i=1}^{2} \frac{(n_{gi} - P\,n_i)^2}{P\,n_i} + \sum_{i=1}^{2} \frac{(n_{di} - (1-P)\,n_i)^2}{(1-P)\,n_i}$$
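As an illustration (the counts are made up), the same test can be run from the observed contingency table of a candidate binary split; a large p-value suggests the split is not significant and pre-pruning should stop:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: branches g and d of the candidate split; columns: the two classes.
observed = np.array([[20,  5],
                     [10, 15]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}, expected counts:\n{expected}")
```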
5.3. Overfitting: Controlling the size with post-pruning
• Idea: prune after the construction of the whole tree, replacing subtrees by leaves when this optimizes a pruning criterion at a node.
• Many methods. Still lots of research.
Minimal Cost-Complexity Pruning (MCCP) (Breiman et al., 84)
Reduced Error Pruning (REP) (Quinlan,87,93)
Minimum Error Pruning (MEP) (Niblett & Bratko,86)
Critical Value Pruning (CVP) (Mingers,87)
Pessimistic Error Pruning (PEP) (Quinlan,87)
Error-Based Pruning (EBP) (Quinlan,93) (used in C4.5)
...
5.3- Cost-Complexity pruning
• [Breiman et al., 84]
• Cost-complexity for a tree:
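The formula did not survive extraction; the standard criterion from [Breiman et al., 84] is:

$$R_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|$$

where $R(T)$ is the empirical error of tree $T$ on the training data, $|\widetilde{T}|$ is its number of leaves, and $\alpha \ge 0$ trades accuracy against complexity; pruning keeps, for increasing $\alpha$, the subtree of the full tree that minimizes $R_\alpha$.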
6. Forward search
• Instead of a greedy search, search n nodes ahead
If I choose this node and then this node and then …
• But exponential growth of the number of computations
6. Modification of the search strategy
• Idea: no more depth first search
• Methods that use a different measure:
Minimum Description Length principle
– Measure of the complexity of the tree
– Measure of the complexity of the examples not coded by the tree
– Keep tree that minimizes the sum of these measures
Measures from (weak) learnability theory
Kolmogorov-Smirnov measure
Class separation measure
Mix of selection tests
7. Modification of the search space
• Modification of the node tests
To solve the problems of an inadequate representation
Methods of constructive induction (e.g. multivariate tests)
E.g. Oblique decision trees
• Methods:
Numerical Operators
– Perceptron trees
– Trees and Genetic Programming
Logical operators
7. Oblique trees
[Figure: a tree over numerical attributes x1, x2 partitioning the (x1, x2) plane into classes c1 and c2; besides axis-parallel tests (x1 < 0.70, x2 < 0.30, x2 < 0.88, x2 < 0.62, x1 < 0.17) it uses an oblique test 1.1·x1 + x2 < 0.2.]
7. Induction of oblique trees
• Another cause of overly large (leafy) trees: an inadequate representation
• Solutions:
Ask an expert (e.g. chess endgame [Quinlan,83])
Do a PCA beforehand
Other attribute selection method
Apply a constructive induction
Induction of oblique trees
8. Translation into other representations
• Idea: Translate a complex tree into a representation where the result is simpler
• Translation into decision graphs
• Translation into rule sets
9. Conclusions
• Appropriate for:
Classification of attribute-value examples
Attributes with discrete values
Resistance to noise
• Strategy:
Search by incremental construction of the hypothesis
Local choice (greedy, gradient-like) based on a statistical criterion
• Generates
Interpretable decision trees (e.g. production rules)
• Requires a control of the size of the tree