Mohd Noor Abdul Hamid, Ph.D.
Universiti Utara Malaysia
After this class, you should be able to:
• Explain the C4.5 algorithm
• Use the algorithm to develop a decision tree
Decision trees are constructed using only those attributes best able to differentiate the concepts to be learned. The main goal is to minimize the number of tree levels and tree nodes, thereby maximizing data generalization.
Bear this in mind!
The C4.5 Algorithm

1. Let T be the set of training instances.

2. Choose the attribute that best differentiates the instances contained in T.

3. Create a tree node whose value is the chosen attribute. Create child links from this node, where each link represents a unique value for the chosen attribute. Use the child link values to further subdivide the instances into subclasses.

4. For each subclass:
• If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path (END).
• Otherwise, let T be the current set of subclass instances and return to step 2.
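To make the recursion concrete, here is a minimal Python sketch of the loop above. It is an illustration, not Quinlan's original implementation: instances are assumed to be dicts of attribute -> value plus a "target" key, and `score` is any attribute-scoring function, such as the Goodness Score used in the exercise below.

```python
def build_tree(instances, attributes, score):
    """Recursively build a decision tree from training instances T."""
    labels = [inst["target"] for inst in instances]

    # Step 4 stopping test: the subclass satisfies the predefined
    # criteria (pure, in this sketch) OR no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        # Classification for new instances following this path.
        return max(set(labels), key=labels.count)

    # Step 2: choose the attribute that best differentiates T.
    best = max(attributes, key=lambda a: score(instances, a))

    # Step 3: one child link per unique value of the chosen attribute;
    # each link subdivides T into a subclass.
    node = {"attribute": best, "children": {}}
    for value in {inst[best] for inst in instances}:
        subclass = [i for i in instances if i[best] == value]
        remaining = [a for a in attributes if a != best]
        node["children"][value] = build_tree(subclass, remaining, score)
    return node
```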
Exercise
Exercise: The Scenario
• BIGG Credit Card Company wishes to develop a predictive model to identify customers who are likely to take advantage of the life insurance promotion, so that the promotional item can be mailed to potential customers.
• The model will be developed using the data stored in the credit card promotion database. The data contain information obtained about customers through their initial credit card applications, as well as whether these individuals have accepted various promotional offerings sponsored by the company.
Dataset
Step 1: Let T be the set of training instances.

We follow our previous work with creditcardpromotion.xls. The dataset consists of 15 instances (observations), T, each having 10 attributes (variables). For our example, the attributes used are limited to five. Why? Because decision trees are constructed using only those attributes best able to differentiate the concepts to be learned.
Independent variables (inputs):
• Age: interval, 19 – 55 years
• Sex: nominal (Male, Female)
• Income Range: ordinal (20 – 30K, 30 – 40K, 40 – 50K, 50 – 60K)
• Credit Card Insurance: binary (Yes, No)

Dependent variable (target/output):
• Life Insurance Promotion: binary (Yes, No)
Step 2: Choose an attribute that best differentiates the instances contained in T.

C4.5 uses a measure taken from information theory to help with the attribute selection process: at any choice point in the tree, C4.5 selects the attribute that splits the data so as to show the largest gain in information. We need to choose the input attribute that best differentiates the instances in T. Our choices are among:
• Income Range
• Credit Card Insurance
• Sex
• Age
A Goodness Score is calculated for each attribute to determine which one best differentiates the training instances T. We can develop a partial tree for each attribute in order to calculate its Goodness Score:

Goodness Score = (sum of the most frequently encountered class count in each branch ÷ |T|) ÷ number of branches

The first factor (the sum divided by |T|) measures the accuracy of the split; dividing by the number of branches favors attributes that split the data into fewer branches, in line with the goal of maximizing generalization.
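As a quick sketch, this formula transcribes directly into Python (the list-of-counts representation is an assumption for illustration):

```python
def goodness_score(branch_counts, total):
    """Goodness Score of a candidate split.

    branch_counts: one list of class counts per branch, e.g. [yes, no].
    total:         number of training instances |T|.
    """
    # Most frequent class in each branch, summed, divided by |T|,
    # then divided by the number of branches.
    hits = sum(max(branch) for branch in branch_counts)
    return hits / total / len(branch_counts)
```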
Step 2a: Income Range

Income Range partial tree (Life Insurance Promotion counts per branch):
• 20 – 30K: 2 Yes, 2 No
• 30 – 40K: 4 Yes, 1 No
• 40 – 50K: 1 Yes, 3 No
• 50 – 60K: 2 Yes, 0 No
Goodness Score (Income Range)
= ((2 + 4 + 3 + 2) ÷ 15) ÷ 4
= 0.183
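Checking this with the goodness_score sketch above:

```python
income = [[2, 2], [4, 1], [1, 3], [2, 0]]  # (Yes, No) per income band
print(goodness_score(income, total=15))    # -> 0.1833...
```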
Step 2b: Credit Card Insurance

Credit Card Insurance partial tree:
• No: 6 Yes, 6 No
• Yes: 3 Yes, 0 No
Goodness Score (Credit Card Insurance)
= ((6 + 3) ÷ 15) ÷ 2
= 0.30
Step 2c: Sex

Sex partial tree:
• Male: 3 Yes, 5 No
• Female: 6 Yes, 1 No
Goodness Score (Sex)
= ((6 + 5) ÷ 15) ÷ 2
= 0.367
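The same helper reproduces both of these scores:

```python
cc_insurance = [[6, 6], [3, 0]]
sex = [[3, 5], [6, 1]]
print(goodness_score(cc_insurance, total=15))  # -> 0.30
print(goodness_score(sex, total=15))           # -> 0.3666...
```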
Step 2d: Age

Age is an interval (numeric) attribute, so we need to determine the best split point among all its values. For this example, we opt for a binary split. Why? Because the main goal is to minimize the number of tree levels and tree nodes, thereby maximizing data generalization.
Steps to determine the best splitting point for an interval (numeric) attribute:

1. Sort the Age values, paired with the target (Life Insurance Promotion):

Age: 19 27 29 35 38 39 40 41 42 43 43 43 45 55 55
LIP:  Y  N  Y  Y  Y  Y  Y  Y  N  Y  Y  N  N  N  N
2. A Goodness Score is computed for each possible split point.

Split between 19 and 27:
• Age ≤ 19: 1 Yes, 0 No
• Age > 19: 8 Yes, 6 No

Goodness Score = ((1 + 8) ÷ 15) ÷ 2 = 0.30
Split between 27 and 29:
• Age ≤ 27: 1 Yes, 1 No
• Age > 27: 8 Yes, 5 No

Goodness Score = ((1 + 8) ÷ 15) ÷ 2 = 0.30
This process continues until a score for the split between 45 and 55 is obtained. The split point with the highest Goodness Score is chosen: 43.
Split at 43:
• Age ≤ 43: 9 Yes, 3 No
• Age > 43: 0 Yes, 3 No

Goodness Score = ((9 + 3) ÷ 15) ÷ 2 = 0.40
The resulting Age partial tree (split at 43):
• ≤ 43: 9 Yes, 3 No
• > 43: 0 Yes, 3 No
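Below is a sketch of this split-point search, reusing goodness_score from above. Two illustrative assumptions: candidate thresholds are the observed values themselves, and ties are broken toward the larger threshold (on this data, the split after 41 also scores 0.40).

```python
def best_split(ages, labels):
    """Find the binary split threshold on a numeric attribute
    that maximizes the Goodness Score."""
    best_t, best_s = None, -1.0
    for t in sorted(set(ages))[:-1]:          # skip the largest value
        left  = [lab for a, lab in zip(ages, labels) if a <= t]
        right = [lab for a, lab in zip(ages, labels) if a > t]
        counts = [[b.count("Y"), b.count("N")] for b in (left, right)]
        s = goodness_score(counts, total=len(ages))
        if s >= best_s:                       # ties -> larger threshold
            best_t, best_s = t, s
    return best_t, best_s

ages   = [19, 27, 29, 35, 38, 39, 40, 41, 42, 43, 43, 43, 45, 55, 55]
labels = ["Y","N","Y","Y","Y","Y","Y","Y","N","Y","Y","N","N","N","N"]
print(best_split(ages, labels))               # -> (43, 0.4)
```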
Overall Goodness Score for each input attribute:
• Age: 0.400
• Sex: 0.367
• Credit Card Insurance: 0.300
• Income Range: 0.183

Therefore, the attribute Age is chosen as the top-level (root) node.
Step 3: Create a tree node whose value is the chosen attribute (Age), and create child links from this node, where each link represents a unique value for the chosen attribute.

Age
• ≤ 43: 9 Yes, 3 No
• > 43: 0 Yes, 3 No
Step 4: For each subclass:
a. If the instances in the subclass satisfy the predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
b. If the subclass does not satisfy the predefined criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
Applying this to the two Age subclasses:
• Age > 43 (0 Yes, 3 No) satisfies the predefined criteria. Classification: Life Insurance = No.
• Age ≤ 43 (9 Yes, 3 No) does not satisfy the predefined criteria. Subdivide!
The Age ≤ 43 subclass becomes the new T and is subdivided using Sex:

Age
• > 43: Life Insurance = No
• ≤ 43: Sex
  • Female: 6 Yes, 0 No → Life Insurance = Yes
  • Male: 3 Yes, 3 No → Subdivide!
The Male, Age ≤ 43 subclass becomes the new T and is subdivided using Credit Card Insurance:

CC Insurance
• No: 1 Yes, 3 No → Life Insurance = No
• Yes: 2 Yes, 0 No → Life Insurance = Yes
The complete Decision Tree (Life Insurance Promotion):

Age
• > 43: Life Insurance = No
• ≤ 43: Sex
  • Female: Life Insurance = Yes
  • Male: CC Insurance
    • No: Life Insurance = No
    • Yes: Life Insurance = Yes
1. Our Decision Tree is able to accurately classify 14 out of 15 training instances (the one error is the Yes instance in the Male, CC Insurance = No leaf).
2. Therefore, the accuracy of our model is 14/15 ≈ 93%.
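The finished tree is compact enough to transcribe directly as a function (a sketch; the parameter names and string encodings are assumptions):

```python
def predict_life_insurance(age, sex, cc_insurance):
    """Classify a customer using the Decision Tree above."""
    if age > 43:
        return "No"
    if sex == "Female":
        return "Yes"
    # Male and age <= 43: decided by Credit Card Insurance.
    return "Yes" if cc_insurance == "Yes" else "No"

print(predict_life_insurance(30, "Male", "Yes"))  # -> Yes
```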
Assignment
• Based on the Decision Tree model for the Life Insurance Promotion, develop an application (program) using any tool you are familiar with.
• Submit your code and report next week!