Chap4 Classification Sep13



    Unit 5

    Classification: Basic Concepts, Decision

    Trees, and Model Evaluation

    by

Pang-Ning Tan, Vipin Kumar, Michael Steinbach


    Examples of Classification Task

    Predicting tumor cells as benign or malignant

Classifying credit card transactions as legitimate or fraudulent

    Classifying secondary structures of a protein as alpha-helix, beta-sheet, or random coil

    Categorizing news stories as finance, weather, entertainment, sports, etc.


    Definition

Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y. The target function is also known as a classification model.

    Descriptive Modeling: the model serves as an explanatory tool to distinguish between objects of different classes (mammal, reptile, bird, fish, or amphibian).

    Predictive Modeling: the model is used to predict the class label of unknown records (e.g., class = fish).

    Example of an unlabeled record: gila monster — cold-blooded, has scales, does not give birth; class = ?


Name | Give Birth | Lay Eggs | Can Fly | Live in Water | Have Legs | Class
    human | yes | no | no | no | yes | mammals
    python | no | yes | no | no | no | reptiles
    salmon | no | yes | no | yes | no | fishes
    whale | yes | no | no | yes | no | mammals
    frog | no | yes | no | sometimes | yes | amphibians
    komodo | no | yes | no | no | yes | reptiles
    bat | yes | no | yes | no | yes | mammals
    pigeon | no | yes | yes | no | yes | birds
    cat | yes | no | no | no | yes | mammals
    leopard shark | yes | no | no | yes | no | fishes
    turtle | no | yes | no | sometimes | yes | reptiles
    penguin | no | yes | no | sometimes | yes | birds
    porcupine | yes | no | no | no | yes | mammals
    eel | no | yes | no | yes | no | fishes
    salamander | no | yes | no | sometimes | yes | amphibians
    gila monster | no | yes | no | no | yes | reptiles
    platypus | no | yes | no | no | yes | mammals
    owl | no | yes | yes | no | yes | birds
    dolphin | yes | no | no | yes | no | mammals
    eagle | no | yes | yes | no | yes | birds

    Vertebrate data set (20 records)


    General Approach to solve a Classification Problem

[Figure: a learning algorithm is applied to the Training Set to learn (induce) a Model; the Model is then applied (deduction) to the Test Set to predict the class of unlabeled records.]

    Training Set:
    Tid | Attrib1 | Attrib2 | Attrib3 | Class
    1 | Yes | Large | 125K | No
    2 | No | Medium | 100K | No
    3 | No | Small | 70K | No
    4 | Yes | Medium | 120K | No
    5 | No | Large | 95K | Yes
    6 | No | Medium | 60K | No
    7 | Yes | Large | 220K | No
    8 | No | Small | 85K | Yes
    9 | No | Medium | 75K | No
    10 | No | Small | 90K | Yes

    Test Set:
    Tid | Attrib1 | Attrib2 | Attrib3 | Class
    11 | No | Small | 55K | ?
    12 | Yes | Medium | 80K | ?
    13 | Yes | Large | 110K | ?
    14 | No | Small | 95K | ?
    15 | No | Large | 67K | ?


    Classification

Task of assigning objects to predefined categories.

    Given a collection of records (training set), where each record contains a set of attributes and one of the attributes is the class.

    Find a model for the class attribute as a function of the values of the other attributes.

    Goal: previously unseen records should be assigned a class as accurately as possible.

    A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets; the training set is used to build the model and the test set is used to validate it.
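To make the workflow concrete, here is a minimal sketch (not part of the slides) of the induction/deduction cycle, assuming Python with scikit-learn is available; the attribute encoding mirrors the toy table above and is purely illustrative.

    # minimal sketch of the train / test workflow (assumes scikit-learn is installed)
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # toy records: [home_owner (1 = Yes), marital_status code, annual income in K]
    X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
         [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
    y = ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes']

    # split into training and test sets, learn a model, then apply it
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier()           # induction: learn the model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)             # deduction: apply the model to unseen records
    print(accuracy_score(y_test, y_pred))      # estimate accuracy on the test set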


    Performance metrics

Accuracy = (number of correct predictions) / (total number of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)

    Error rate = (number of wrong predictions) / (total number of predictions) = (f10 + f01) / (f11 + f10 + f01 + f00)

    where f_ij denotes the number of records from class i predicted as class j (for a two-class problem with classes 1 and 0).
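As a quick illustration (a sketch, not from the slides), both metrics can be computed directly from the four confusion-matrix counts; the numbers below are made up.

    # accuracy and error rate from the confusion-matrix counts f11, f10, f01, f00
    def accuracy(f11, f10, f01, f00):
        return (f11 + f00) / (f11 + f10 + f01 + f00)

    def error_rate(f11, f10, f01, f00):
        return (f10 + f01) / (f11 + f10 + f01 + f00)

    print(accuracy(40, 10, 5, 45))    # 0.85
    print(error_rate(40, 10, 5, 45))  # 0.15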

    Classification Techniques

    Decision Tree Induction Methods

    Rule-based Classifier Methods

    Nearest-Neighbor classifiers

    Bayesian Classifiers


    Decision Tree Induction (How it works?)

Root node: no incoming edges and zero or more outgoing edges.

    Internal nodes: exactly one incoming edge and two or more outgoing edges.

    Leaf or terminal nodes: exactly one incoming edge and no outgoing edges.


Classifying an unlabeled vertebrate


    How to build a Decision Tree

Algorithms that employ a greedy strategy to grow the tree in a reasonable amount of time are used.

    One such algorithm is Hunt's algorithm.


Hunt's Algorithm

    Let D_t be the set of training records that reach node t.

    1. If all the records in D_t belong to the same class y_t, then t is a leaf node labeled as y_t.

    2. If D_t contains records that belong to more than one class, an attribute test condition is selected to split the data into smaller subsets. A child node is created for each outcome of the test condition, and the records in D_t are distributed to the children based on the outcomes.

    3. Recursively apply the procedure to each subset.


Conditions & Issues

    The first case means that most of the borrowers repaid their loans.

    We need to consider records from both classes, so the root is split on Home Owner.

    The left child is then split again, and splitting continues recursively.

    A child node can be empty, i.e. no records are associated with it; such a node is declared a leaf with the majority class of its parent's records.

    Records with identical attribute values (but possibly different class labels) cannot be split further; such a node is declared a leaf with the majority class.

    Design Issues:

    How should the training records be split? An attribute test condition is used to divide them into smaller subsets.

    When should splitting stop? One strategy is to expand a node until all its records belong to the same class or all have identical attribute values; early termination can also be used.


    Method for Expressing Attribute Test Conditions

Depends on attribute type:
    Binary
    Nominal
    Ordinal
    Continuous

    Depends on number of ways to split:
    2-way split
    Multi-way split


    Binary Attributes

    Test condition generates two potential outcomes


    Nominal Attributes

Multi-way split: use as many partitions as there are distinct values, e.g. CarType → {Family}, {Sports}, {Luxury}.

    Binary split: divides the values into two subsets; need to find the optimal partitioning, e.g. CarType → {Family, Luxury} vs {Sports}, or {Sports, Luxury} vs {Family}.

    A decision tree algorithm such as CART, which produces only binary splits, must consider the 2^(k−1) − 1 ways of creating a binary partition of k attribute values.


Ordinal Attributes

    Can produce multi-way or binary splits.

    Values can be grouped as long as the grouping does not violate the order of the attribute values.

    In Figure 4.10, groupings (a) and (b) preserve the order, but (c) does not, because it combines Small with Large and Medium with Extra Large.


    Continuous Attributes

Different ways of handling:

    Discretization to form an ordinal categorical attribute
      Static — discretize once at the beginning
      Dynamic — ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering

    Binary decision: (A < v) or (A ≥ v)
      Consider all possible splits and find the best cut
      Can be more compute intensive



    Measures for selecting the Best Split

p(i|t): the fraction of records belonging to class i at a given node t.

    The measures for selecting the best split are based on the degree of impurity of the child nodes: the smaller the degree of impurity, the more skewed the class distribution.

    A node with class distribution (0, 1) has impurity 0; a node with uniform distribution (0.5, 0.5) has the highest impurity.


    Which test condition is the best?


    How to determine the Best Split

Greedy approach: nodes with a homogeneous class distribution are preferred.

    Need a measure of node impurity:

    C0: 5, C1: 5 — non-homogeneous, high degree of impurity
    C0: 9, C1: 1 — homogeneous, low degree of impurity


    Measures of Node Impurity

Gini Index
    Entropy
    Misclassification error

    GINI(t) = 1 − Σ_i [p(i|t)]²

    Entropy(t) = −Σ_i p(i|t) log₂ p(i|t)

    Classification error(t) = 1 − max_i [p(i|t)]

    (NOTE: p(i|t) is the relative frequency of class i at node t; c is the number of classes and 0 log₂ 0 = 0 in entropy calculations.)
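All three impurity measures can be computed directly from a node's class counts; the following small Python sketch (not part of the slides) mirrors the formulas above.

    # impurity measures for a node, given its class counts
    from math import log2

    def gini(counts):
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts)

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * log2(c / n) for c in counts if c > 0)  # 0*log2(0) treated as 0

    def classification_error(counts):
        n = sum(counts)
        return 1 - max(c / n for c in counts)

    print(gini([3, 3]), entropy([3, 3]), classification_error([3, 3]))   # 0.5, 1.0, 0.5
    print(gini([0, 6]), entropy([0, 6]), classification_error([0, 6]))   # 0.0, 0.0, 0.0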


Measure of Impurity: GINI

    Gini index for a given node t:

    GINI(t) = 1 − Σ_i [p(i|t)]²

    (NOTE: p(i|t) is the relative frequency of class i at node t.)

    Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying the least interesting information.
    Minimum (0.0) when all records belong to one class, implying the most interesting information.

    C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0.000
    C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6; Gini = 1 − (1/6)² − (5/6)² = 0.278
    C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6; Gini = 1 − (2/6)² − (4/6)² = 0.444
    C1: 3, C2: 3 → P(C1) = 3/6, P(C2) = 3/6; Gini = 1 − (3/6)² − (3/6)² = 0.500


Alternative Splitting Criteria based on INFO

    Entropy at a given node t:

    Entropy(t) = −Σ_i p(i|t) log₂ p(i|t)

    (NOTE: p(i|t) is the relative frequency of class i at node t.)

    Measures the homogeneity of a node.
    Maximum (log₂ n_c) when records are equally distributed among all classes, implying the least information.
    Minimum (0.0) when all records belong to one class, implying the most information.
    Entropy-based computations are similar to the GINI index computations.


Examples for computing Entropy

    C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Entropy = −0 log₂ 0 − 1 log₂ 1 = 0 − 0 = 0
    C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6; Entropy = −(1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.65
    C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6; Entropy = −(2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.92


Splitting Criteria based on Classification Error

    Classification error at a node t:

    Error(t) = 1 − max_i [p(i|t)]

    Measures the misclassification error made by a node.
    Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying the least interesting information.
    Minimum (0.0) when all records belong to one class, implying the most interesting information.


Examples for Computing Error

    C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Error = 1 − max(0, 1) = 1 − 1 = 0
    C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6; Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6
    C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6; Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3


Misclassification Error vs Gini

    Parent: C1 = 7, C2 = 3; Gini = 0.42

    Split on A? (Yes → node N1, No → node N2):
    N1: C1 = 3, C2 = 0
    N2: C1 = 4, C2 = 3

    Gini(N1) = 1 − (3/3)² − (0/3)² = 0
    Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
    Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

    Gini improves!! Misclassification error, by contrast, stays the same: parent error = 1 − 7/10 = 0.3, children error = 3/10 × 0 + 7/10 × (1 − 4/7) = 0.3.


Information Gain to find best split

    Before splitting, the node has class counts C0: N00, C1: N01 and impurity M0.

    A candidate split on A? (Yes/No) produces nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21), with impurities M1 and M2; their weighted average is M12.

    A candidate split on B? (Yes/No) produces nodes N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41), with impurities M3 and M4; their weighted average is M34.

    The difference in entropy before and after splitting gives the information gain:
    Gain(A) = M0 − M12 vs Gain(B) = M0 − M34; choose the split with the larger gain.


    Splitting of Binary Attributes


Binary Attributes: Computing GINI Index

    Splits into two partitions.
    Effect of weighting partitions: larger and purer partitions are sought.

    Parent: C0 = 6, C1 = 6; Gini = 0.500

    Split on B? (Yes → N1, No → N2):
    N1: C0 = 5, C1 = 2
    N2: C0 = 1, C1 = 4

    Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
    Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
    Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371


    Splitting of Nominal Attributes: Computing Gini Index

For each distinct value, gather the counts for each class in the dataset, and use the count matrix to make decisions.

    Multi-way split on CarType:
    Gini{Family} = 0.375; Gini{Sports} = 0; Gini{Luxury} = 0.219
    Gini(CarType) = (4/20) × 0.375 + (8/20) × 0 + (8/20) × 0.219 = 0.163
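A small sketch (not from the slides) of how the weighted Gini of a multi-way split is computed; the per-class counts for each CarType value are assumptions that are consistent with the Gini values quoted above.

    # weighted Gini index of a candidate split, given per-partition class counts
    def gini(counts):
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts)

    def gini_split(partitions):
        total = sum(sum(p) for p in partitions)
        return sum(sum(p) / total * gini(p) for p in partitions)

    # assumed CarType partitions as [class1_count, class2_count]
    family, sports, luxury = [1, 3], [8, 0], [1, 7]
    print(gini_split([family, sports, luxury]))   # 0.1625, i.e. the 0.163 quoted above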


    Splitting of Continuous Attributes: Computing Gini Index

Use binary decisions based on one value, e.g. Taxable Income > 80K? (Yes / No).

    Several choices for the splitting value: the number of possible splitting values = the number of distinct values.

    Each splitting value v has a count matrix associated with it: the class counts in each of the two partitions, A < v and A ≥ v.

    Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. Computationally inefficient! Repetition of work.

    Tid | Refund | Marital Status | Taxable Income | Cheat
    1 | Yes | Single | 125K | No
    2 | No | Married | 100K | No
    3 | No | Single | 70K | No
    4 | Yes | Married | 120K | No
    5 | No | Divorced | 95K | Yes
    6 | No | Married | 60K | No
    7 | Yes | Divorced | 220K | No
    8 | No | Single | 85K | Yes
    9 | No | Married | 75K | No
    10 | No | Single | 90K | Yes


    Continuous Attributes: Computing Gini Index...

    For efficient computation: for each attribute,

    Sort the attribute on values

    Linearly scan these values, each time updating the count matrix and

    computing Gini index

    Choose the split position that has the least Gini index
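The sort-then-scan procedure is easy to sketch in code. The following is an illustrative Python version for one continuous attribute with two classes (it is a sketch of the idea above, not an implementation from the slides); run on the Taxable Income column it finds the cut at 97.5.

    # best binary split on a continuous attribute: sort values once, then linearly scan
    def gini(counts):
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

    def best_split(values, labels):
        data = sorted(zip(values, labels))
        classes = sorted(set(labels))
        left = {c: 0 for c in classes}
        right = {c: labels.count(c) for c in classes}
        best = (float('inf'), None)
        for i in range(len(data) - 1):
            left[data[i][1]] += 1                     # move one record into the left partition
            right[data[i][1]] -= 1                    # ...and out of the right partition
            cut = (data[i][0] + data[i + 1][0]) / 2   # candidate cut between consecutive values
            n = len(data)
            w = (i + 1) / n * gini(list(left.values())) + (n - i - 1) / n * gini(list(right.values()))
            best = min(best, (w, cut))
        return best  # (weighted Gini, cut value)

    income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
    cheat  = ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes']
    print(best_split(income, cheat))   # expected: weighted Gini 0.3 at cut 97.5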


Splitting Based on INFO...

    Information Gain:

    GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) Entropy(i)

    Parent node p is split into k partitions; n_i is the number of records in partition i.

    Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).

    Used in ID3 and C4.5.

    Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
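A minimal sketch (not from the slides) of GAIN_split for one candidate partitioning; the example counts reproduce the buys_computer result on the next slide.

    # information gain of a split, given parent class counts and per-partition class counts
    from math import log2

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * log2(c / n) for c in counts if c > 0)

    def information_gain(parent_counts, partitions):
        n = sum(parent_counts)
        weighted = sum(sum(p) / n * entropy(p) for p in partitions)
        return entropy(parent_counts) - weighted

    # parent (9 yes, 5 no) split by age into three partitions
    print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ~0.246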


Example of Information Gain

    Class P: buys_computer = "yes"; Class N: buys_computer = "no".

    age | p_i | n_i | I(p_i, n_i)
    <=30 | 2 | 3 | 0.971
    31…40 | 4 | 0 | 0
    >40 | 3 | 2 | 0.971

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

    Gain(age) = Info(D) − Info_age(D) = 0.246

    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048

    [Training data: 14 records with attributes age, income, student, credit_rating and class buys_computer; the full table is truncated in this copy.]


Splitting the samples using age

    age <= 30:
    income | student | credit_rating | buys_computer
    high | no | fair | no
    high | no | excellent | no
    medium | no | fair | no
    low | yes | fair | yes
    medium | yes | excellent | yes

    age 31…40 (all records labeled "yes", so this branch becomes a leaf):
    income | student | credit_rating | buys_computer
    high | no | fair | yes
    low | yes | excellent | yes
    medium | no | excellent | yes
    high | yes | fair | yes

    age > 40:
    income | student | credit_rating | buys_computer
    medium | no | fair | yes
    low | yes | fair | yes
    low | yes | excellent | no
    medium | yes | fair | yes
    medium | no | excellent | no


Splitting Based on INFO...

    Gain Ratio:

    GainRATIO_split = GAIN_split / SplitINFO

    SplitINFO = −Σ_{i=1..k} (n_i / n) log₂ (n_i / n)

    Parent node p is split into k partitions; n_i is the number of records in partition i.

    Adjusts information gain by the entropy of the partitioning (SplitINFO): higher-entropy partitioning (a large number of small partitions) is penalized!

    Used in C4.5. Designed to overcome the disadvantage of information gain.
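A short sketch (not from the slides) showing how the gain ratio penalizes splits with many small partitions; entropy is as defined earlier and log base 2 is assumed.

    # gain ratio = information gain / split information
    from math import log2

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * log2(c / n) for c in counts if c > 0)

    def gain_ratio(parent_counts, partitions):
        n = sum(parent_counts)
        gain = entropy(parent_counts) - sum(sum(p) / n * entropy(p) for p in partitions)
        split_info = -sum(sum(p) / n * log2(sum(p) / n) for p in partitions if sum(p) > 0)
        return gain / split_info

    # the same 2-class parent split two ways: two pure partitions vs. sixteen tiny pure ones
    print(gain_ratio([8, 8], [[8, 0], [0, 8]]))              # 1.0
    print(gain_ratio([8, 8], [[1, 0]] * 8 + [[0, 1]] * 8))   # gain is still 1.0, but the ratio drops to 0.25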


    Decision Tree Induction

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.

    Issues:
    Determine how to split the records
      How to specify the attribute test condition?
      How to determine the best split?
    Determine when to stop splitting


    Stopping Criteria for Tree Induction

Stop expanding a node when all the records belong to the same class.
    Stop expanding a node when all the records have similar attribute values.
    Early termination.


Algorithm: Decision Tree Algorithm

    TreeGrowth (E, F)
    1.  if stopping_cond(E, F) = true then
    2.      leaf = createNode()
    3.      leaf.label = Classify(E)
    4.      return leaf
    5.  else
    6.      root = createNode()
    7.      root.test_cond = find_best_split(E, F)
    8.      let V = { v | v is a possible outcome of root.test_cond }
    9.      for each v ∈ V do
    10.         E_v = { e | root.test_cond(e) = v and e ∈ E }
    11.         child = TreeGrowth(E_v, F)
    12.         add child as a descendent of root and label the edge (root → child) as v
    13.     end for
    14. end if
    15. return root


createNode(): creates a new node; a node has either a test condition (node.test_cond) or a class label (node.label).

    find_best_split(): determines which attribute should be selected as the test condition for splitting the records, using a measure such as entropy, Gini, or classification error.

    Classify(): determines the class label to be assigned to a leaf node: leaf.label = argmax_i p(i | t).

    stopping_cond(): terminates the tree growth by testing whether all the records have the same class label or the same attribute values.

    Tree pruning and overfitting are discussed later.
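For readers who want to run the skeleton, here is a compact, hedged Python rendering of TreeGrowth. It is a sketch under stated assumptions (categorical attributes, Gini-based find_best_split, each attribute used at most once per path), not the exact algorithm of any particular library.

    # a small runnable rendering of the TreeGrowth skeleton
    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def find_best_split(records, labels, attributes):
        best = None
        for a in attributes:
            values = set(r[a] for r in records)
            parts = [[y for r, y in zip(records, labels) if r[a] == v] for v in values]
            w = sum(len(p) / len(labels) * gini(p) for p in parts)   # weighted child impurity
            if best is None or w < best[0]:
                best = (w, a)
        return best[1]

    def tree_growth(records, labels, attributes):
        if len(set(labels)) == 1 or not attributes:          # stopping_cond
            return Counter(labels).most_common(1)[0][0]      # Classify: majority class at the leaf
        a = find_best_split(records, labels, attributes)
        node = {'test': a, 'children': {}}
        for v in set(r[a] for r in records):                 # one child per outcome of the test
            idx = [i for i, r in enumerate(records) if r[a] == v]
            node['children'][v] = tree_growth([records[i] for i in idx],
                                              [labels[i] for i in idx],
                                              [x for x in attributes if x != a])
        return node

    recs = [{'Refund': 'Yes', 'Status': 'Single'}, {'Refund': 'No', 'Status': 'Married'},
            {'Refund': 'No', 'Status': 'Single'}, {'Refund': 'No', 'Status': 'Divorced'}]
    print(tree_growth(recs, ['No', 'No', 'Yes', 'Yes'], ['Refund', 'Status']))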


    Decision Tree Based Classification

    Advantages: Inexpensive to construct

    Extremely fast at classifying unknown records

    Easy to interpret for small-sized trees

    Accuracy is comparable to other classification

    techniques for many simple data sets


    Decision Tree Induction

Many algorithms: Hunt's Algorithm (one of the earliest)

    CART

    ID3, C4.5

    SLIQ, SPRINT


Computing Impurity Measure

    Tid | Refund | Marital Status | Taxable Income | Class
    1 | Yes | Single | 125K | No
    2 | No | Married | 100K | No
    3 | No | Single | 70K | No
    4 | Yes | Married | 120K | No
    5 | No | Divorced | 95K | Yes
    6 | No | Married | 60K | No
    7 | Yes | Divorced | 220K | No
    8 | No | Single | 85K | Yes
    9 | No | Married | 75K | No
    10 | ? | Single | 90K | Yes    (missing value for Refund)

    Count matrix for the split on Refund:
                  Class = Yes | Class = No
    Refund = Yes        0     |     3
    Refund = No         2     |     4
    Refund = ?          1     |     0

    Before splitting: Entropy(Parent) = −0.3 log(0.3) − 0.7 log(0.7) = 0.8813

    Split on Refund:
    Entropy(Refund=Yes) = 0
    Entropy(Refund=No) = −(2/6) log(2/6) − (4/6) log(4/6) = 0.9183
    Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551

    Gain = 0.9 × (0.8813 − 0.551) ≈ 0.297    (the gain is scaled by 0.9, the fraction of records with a known Refund value)


Rule-Based Classifier

    Classify records by using a collection of if…then rules.

    Rule: (Condition) → y
    where Condition is a conjunction of attribute tests and y is the class label.
    LHS: rule antecedent or condition
    RHS: rule consequent

    Examples of classification rules:
    (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
    (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No


Rule-Based Classifier (Example)

    R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
    R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R5: (Live in Water = sometimes) → Amphibians

    Name | Blood Type | Give Birth | Can Fly | Live in Water | Class
    human | warm | yes | no | no | mammals
    python | cold | no | no | no | reptiles
    salmon | cold | no | no | yes | fishes
    whale | warm | yes | no | yes | mammals
    frog | cold | no | no | sometimes | amphibians
    komodo | cold | no | no | no | reptiles
    bat | warm | yes | yes | no | mammals
    pigeon | warm | no | yes | no | birds
    cat | warm | yes | no | no | mammals
    leopard shark | cold | yes | no | yes | fishes
    turtle | cold | no | no | sometimes | reptiles
    penguin | warm | no | no | sometimes | birds
    porcupine | warm | yes | no | no | mammals
    eel | cold | no | no | yes | fishes
    salamander | cold | no | no | sometimes | amphibians
    gila monster | cold | no | no | no | reptiles
    platypus | warm | no | no | no | mammals
    owl | warm | no | yes | no | birds
    dolphin | warm | yes | no | yes | mammals
    eagle | warm | no | yes | no | birds


Application of Rule-Based Classifier

    A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.

    Each attribute test (A_j op v_j) is an attribute–value pair called a conjunct; the condition is a conjunction of conjuncts:
    Condition_i = (A_1 op v_1) ∧ (A_2 op v_2) ∧ … ∧ (A_k op v_k)

    The rule R1 covers a hawk → Bird
    The rule R3 covers the grizzly bear → Mammal

    Name | Blood Type | Give Birth | Can Fly | Live in Water | Class
    hawk | warm | no | yes | no | ?
    grizzly bear | warm | yes | no | no | ?


Tid | Refund | Marital Status | Taxable Income | Class
    1 | Yes | Single | 125K | No
    2 | No | Married | 100K | No
    3 | No | Single | 70K | No
    4 | Yes | Married | 120K | No
    5 | No | Divorced | 95K | Yes
    6 | No | Married | 60K | No
    7 | Yes | Divorced | 220K | No
    8 | No | Single | 85K | Yes
    9 | No | Married | 75K | No
    10 | No | Single | 90K | Yes

    Rule: (Status = Single) → No
    Coverage = ____ %, Accuracy = ____ %


How does the Rule-Based Classifier Work?

    R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
    R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R5: (Live in Water = sometimes) → Amphibians

    A lemur triggers rule R3, so it is classified as a mammal.
    A turtle triggers both R4 and R5.
    A dogfish shark triggers none of the rules.

    Name | Blood Type | Give Birth | Can Fly | Live in Water | Class
    lemur | warm | yes | no | no | ?
    turtle | cold | no | no | sometimes | ?
    dogfish shark | cold | yes | no | yes | ?


Characteristics of Rule-Based Classifier

    Mutually exclusive rules: the classifier contains mutually exclusive rules if the rules are independent of each other — every record is covered by at most one rule.

    Exhaustive rules: the classifier has exhaustive coverage if it accounts for every possible combination of attribute values — each record is covered by at least one rule.


From Decision Trees To Rules

    [Decision tree: the root tests Refund (Yes → leaf NO); Refund = No tests Marital Status ({Married} → leaf NO); {Single, Divorced} tests Taxable Income (< 80K → leaf NO, > 80K → leaf YES).]

    Classification Rules:
    (Refund = Yes) ==> No
    (Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
    (Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
    (Refund = No, Marital Status = {Married}) ==> No

    Rules are mutually exclusive and exhaustive. The rule set contains as much information as the tree.

    Rules Can Be Simplified


[Same decision tree and training data as in the previous slides.]

    Initial Rule: (Refund = No) ∧ (Status = Married) → No
    Simplified Rule: (Status = Married) → No

    Effect of Rule Simplification


Rules are no longer mutually exclusive: a record may trigger more than one rule.
    Solution? Ordered rule set, or unordered rule set with a voting scheme.

    Rules are no longer exhaustive: a record may not trigger any rule.
    Solution? Use a default class.

    Ordered Rule Set


Rules are rank-ordered according to their priority; an ordered rule set is known as a decision list.

    When a test record is presented to the classifier, it is assigned to the class label of the highest-ranked rule it has triggered; if none of the rules fire, it is assigned to the default class.

    R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
    R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R5: (Live in Water = sometimes) → Amphibians

    Name | Blood Type | Give Birth | Can Fly | Live in Water | Class
    turtle | cold | no | no | sometimes | ?

    Rule Ordering Schemes


Rule-based ordering: individual rules are ranked based on their quality.
    Class-based ordering: rules that belong to the same class appear together.

    Rule-based Ordering:
    (Refund = Yes) ==> No
    (Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
    (Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
    (Refund = No, Marital Status = {Married}) ==> No

    Class-based Ordering:
    (Refund = Yes) ==> No
    (Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
    (Refund = No, Marital Status = {Married}) ==> No
    (Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes

    Building Classification Rules


Direct Method: extract rules directly from data; e.g. RIPPER, CN2, Holte's 1R.

    Indirect Method: extract rules from other classification models (e.g. decision trees, neural networks); e.g. C4.5rules.

    Direct Method: Sequential Covering


    1. Start from an empty rule

    2. Grow a rule using the Learn-One-Rule function

    3. Remove training records covered by the rule

    4. Repeat Step (2) and (3) until stopping criterion

    is met
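A hedged sketch of the sequential covering loop just described; Learn-One-Rule is deliberately left abstract here (the slides discuss several ways to grow a rule), so learn_one_rule and the example predicate are assumptions for illustration only.

    # sequential covering: repeatedly learn one rule, then remove the records it covers
    def sequential_covering(records, labels, target_class, learn_one_rule, max_rules=10):
        rules = []
        remaining = list(zip(records, labels))
        while any(y == target_class for _, y in remaining) and len(rules) < max_rules:
            rule = learn_one_rule(remaining, target_class)   # Step 2: grow a single rule
            if rule is None:                                  # Step 4: stopping criterion
                break
            rules.append(rule)
            remaining = [(r, y) for r, y in remaining if not rule(r)]   # Step 3: remove covered records
        return rules

    # a learned 'rule' is assumed to be a predicate over a record, e.g.:
    example_rule = lambda r: r.get('Status') == 'Single' and r.get('Refund') == 'No'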


    Example of Sequential Covering


[Figure: (i) original data; (ii) step 1; (iii) step 2 — rule R1 covers part of the positive examples; (iv) step 3 — after the records covered by R1 are removed, rule R2 is learned.]

    Aspects of Sequential Covering


    Rule Growing

    Instance Elimination

    Rule Evaluation

    Stopping Criterion

    Rule Pruning

    Rule Growing


Two common strategies:

    (a) General-to-specific: start from the empty rule { } => Class = Yes and grow it by adding one conjunct at a time; the candidate conjuncts shown in the figure include Refund = No, Status = Single, Status = Divorced, Status = Married, and Income > 80K, each evaluated by the class counts (Yes/No) of the records it covers.

    (b) Specific-to-general: start from specific rules such as (Refund = No, Status = Single, Income = 85K) → Class = Yes and (Refund = No, Status = Single, Income = 90K) → Class = Yes, and generalize them, e.g. to (Refund = No, Status = Single) → Class = Yes.

    Ripper Algorithm


Sequential covering algorithm:
    Start from an empty rule: {} => class
    Add conjuncts that maximize FOIL's information gain measure:

    R0: {} => class (initial rule)
    R1: {A} => class (rule after adding a conjunct)

    Gain(R0, R1) = t × [ log(p1 / (p1 + n1)) − log(p0 / (p0 + n0)) ]

    where t: number of positive instances covered by both R0 and R1
    p0: number of positive instances covered by R0
    n0: number of negative instances covered by R0
    p1: number of positive instances covered by R1
    n1: number of negative instances covered by R1
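A small sketch of FOIL's information gain as defined above; log base 2 is assumed here, and the example counts are made up.

    # FOIL's information gain between rule R0 and its specialization R1
    from math import log2

    def foil_gain(p0, n0, p1, n1):
        t = p1   # positives covered by both R0 and R1 (R1 covers a subset of R0's instances)
        return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

    # e.g. R0 covers 100 positives and 400 negatives; the added conjunct leaves 30 positives, 10 negatives
    print(foil_gain(100, 400, 30, 10))   # positive gain: the rule became much more accurate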

    Instance Elimination


Why do we need to eliminate instances? Otherwise, the next rule learned would be identical to the previous rule.

    Why do we remove positive instances? To ensure that the next rule is different.

    Why do we remove negative instances? To prevent underestimating the accuracy of the next rule. Compare rules R2 and R3 in the diagram.

    [Figure: scatter of positive (+) and negative (−) training instances with three rule regions R1, R2, R3 drawn over them.]


    Stopping Criterion and Rule Pruning


Stopping criterion: compute the gain; if the gain is not significant, discard the new rule.

    Rule pruning: similar to post-pruning of decision trees.
    Reduced error pruning: remove one of the conjuncts in the rule, compare the error rate on the validation set before and after pruning, and prune the conjunct if the error improves.


    Indirect Methods


Rule set from a decision tree (each path gives one rule with class + or −):

    r1: (P = No, Q = No) ==> −
    r2: (P = No, Q = Yes) ==> +
    r3: (P = Yes, R = No) ==> +
    r4: (P = Yes, R = Yes, Q = No) ==> −
    r5: (P = Yes, R = Yes, Q = Yes) ==> +

    [Figure: decision tree with root P; P = No branches to Q (No → −, Yes → +); P = Yes branches to R (No → +, Yes → Q, with Q = No → −, Q = Yes → +).]

    r2, r3 and r5 predict the positive class when Q = yes, so the rules can be simplified as:
    r2′: (Q = Yes) ==> +
    r3′: (P = Yes) ∧ (R = No) ==> +

Rule generation: C4.5rules


Extract classification rules from every path of a decision tree.

    For each rule r: A → y, consider an alternative rule r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A.

    Compare the pessimistic error rate for r against all the r′ rules; prune r if one of the r′ has a lower pessimistic error rate.

    Repeat until we can no longer improve the generalization error.

Rule ordering: C4.5rules


Class-based ordering is used: rules that predict the same class are grouped into the same subset.

    Compute the description length of each subset; the classes are arranged in increasing order of their total description length, and the class with the smallest description length is given the highest priority.

    Description length = L(exception) + g × L(model)
    g is a parameter that takes into account the presence of redundant attributes in a rule set (default value = 0.5).

    Example


[Vertebrate data set: the same 20-record table (Give Birth, Lay Eggs, Can Fly, Live in Water, Have Legs, Class) shown earlier.]

    C4.5 versus C4.5rules versus RIPPER


C4.5rules:
    (Give Birth = No, Can Fly = Yes) → Birds
    (Give Birth = No, Live in Water = Yes) → Fishes
    (Give Birth = Yes) → Mammals
    (Give Birth = No, Can Fly = No, Live in Water = No) → Reptiles
    ( ) → Amphibians

    [C4.5 decision tree: the root tests Give Birth? (Yes → Mammals); No → Live In Water? (Yes → Fishes, Sometimes → Amphibians); No → Can Fly? (Yes → Birds, No → Reptiles).]

    RIPPER:
    (Live in Water = Yes) → Fishes
    (Have Legs = No) → Reptiles
    (Give Birth = No, Can Fly = No, Live In Water = No) → Reptiles
    (Can Fly = Yes, Give Birth = No) → Birds
    ( ) → Mammals

    Advantages of Rule-Based Classifiers


    As highly expressive as decision trees

    Easy to interpret

    Easy to generate

    Can classify new instances rapidly

    Performance comparable to decision trees


    Rule Coverage and Accuracy


Coverage of a rule: the fraction of records that satisfy the antecedent of the rule = |A| / |D|

    Accuracy of a rule: the fraction of records that satisfy the antecedent which also satisfy the consequent of the rule = |A ∩ y| / |A|

    (Gives Birth = yes) ∧ (Body Temperature = warm-blooded) → Mammals
    Coverage = 33%, Accuracy = 6/6 = 100%
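A quick sketch (not from the slides) of these two measures over a list of records; the attribute names and the tiny data set are illustrative.

    # coverage and accuracy of a rule over a data set D
    def coverage_and_accuracy(D, antecedent, consequent_class):
        covered = [r for r in D if antecedent(r)]
        correct = [r for r in covered if r['Class'] == consequent_class]
        coverage = len(covered) / len(D)
        accuracy = len(correct) / len(covered) if covered else 0.0
        return coverage, accuracy

    D = [{'Give Birth': 'yes', 'Blood Type': 'warm', 'Class': 'mammals'},
         {'Give Birth': 'no',  'Blood Type': 'cold', 'Class': 'reptiles'},
         {'Give Birth': 'yes', 'Blood Type': 'warm', 'Class': 'mammals'}]
    rule = lambda r: r['Give Birth'] == 'yes' and r['Blood Type'] == 'warm'
    print(coverage_and_accuracy(D, rule, 'mammals'))   # (0.67, 1.0) on this tiny illustrative set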


    Rule Growing (Examples)


CN2 Algorithm:
    Start from an empty conjunct: {}
    Add conjuncts that minimize the entropy measure: {A}, {A, B}, …
    Determine the rule consequent by taking the majority class of the instances covered by the rule.

    RIPPER Algorithm:
    Start from an empty rule: {} => class
    Add conjuncts that maximize FOIL's information gain measure:
    R0: {} => class (initial rule)
    R1: {A} => class (rule after adding a conjunct)
    Gain(R0, R1) = t × [ log(p1 / (p1 + n1)) − log(p0 / (p0 + n0)) ]
    where t: number of positive instances covered by both R0 and R1; p0, n0: positive and negative instances covered by R0; p1, n1: positive and negative instances covered by R1.


    Rule Evaluation


Metrics (n: number of instances covered by the rule; n_c: number of instances of class c covered by the rule; k: number of classes; p: prior probability):

    Accuracy = n_c / n

    Laplace = (n_c + 1) / (n + k)

    M-estimate = (n_c + k·p) / (n + k)
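A tiny sketch of the three rule-evaluation metrics above; the example counts are made up.

    # rule-evaluation metrics: accuracy, Laplace, and m-estimate
    def rule_accuracy(nc, n):
        return nc / n

    def laplace(nc, n, k):
        return (nc + 1) / (n + k)

    def m_estimate(nc, n, k, p):
        return (nc + k * p) / (n + k)

    # a rule covering 50 instances, 45 of them of the predicted class, in a 2-class problem
    print(rule_accuracy(45, 50), laplace(45, 50, 2), m_estimate(45, 50, 2, 0.5))
    # 0.9, 0.884..., 0.884...  (with k = 2 and p = 0.5 the two smoothed metrics coincide)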


    Summary of Direct Method


    Grow a single rule

    Remove Instances from rule

    Prune the rule (if necessary)

    Add rule to Current Rule Set

    Repeat

    Direct Method: RIPPER


For a 2-class problem, choose one of the classes as the positive class and the other as the negative class; learn rules for the positive class, and let the negative class be the default class.

    For a multi-class problem, order the classes according to increasing class prevalence (the fraction of instances that belong to a particular class); learn the rule set for the smallest class first, treating the rest as the negative class, then repeat with the next smallest class as the positive class.

    Direct Method: RIPPER


Growing a rule:
    Start from an empty rule.
    Add conjuncts as long as they improve FOIL's information gain.
    Stop when the rule no longer covers negative examples.
    Prune the rule immediately using incremental reduced error pruning.
    Measure for pruning: v = (p − n) / (p + n), where p and n are the numbers of positive and negative examples covered by the rule in the validation set.
    Pruning method: delete any final sequence of conditions that maximizes v.

    Direct Method: RIPPER


Building a Rule Set:
    Use the sequential covering algorithm: find the best rule that covers the current set of positive examples, then eliminate both the positive and negative examples covered by the rule.
    Each time a rule is added to the rule set, compute the new description length; stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far.

    Direct Method: RIPPER


Optimize the rule set:
    For each rule r in the rule set R, consider two alternative rules:
      Replacement rule (r*): grow a new rule from scratch
      Revised rule (r′): add conjuncts to extend the rule r
    Compare the rule set containing r against the rule sets containing r* and r′, and choose the rule set that minimizes the MDL principle.
    Repeat rule generation and rule optimization for the remaining positive examples.


    Instance-Based Classifiers


[Figure: a set of stored cases (Atr1 … AtrN, Class ∈ {A, B, C}) and an unseen case (Atr1 … AtrN) whose class is to be predicted.]

    Store the training records.
    Use the training records to predict the class label of unseen cases.

    Instance Based Classifiers


Examples:

    Rote-learner: memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly.

    Nearest neighbor: uses the k closest points (nearest neighbors) to perform classification.

    Nearest Neighbor Classifiers


Basic idea: "If it walks like a duck and quacks like a duck, then it's probably a duck."

    [Figure: a test record is compared against the training records — compute the distance, then choose the k nearest records.]

    Nearest-Neighbor Classifiers


Requires three things:
    The set of stored records
    A distance metric to compute the distance between records
    The value of k, the number of nearest neighbors to retrieve

    To classify an unknown record:
    Compute the distance to the other training records
    Identify the k nearest neighbors
    Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

    Definition of Nearest Neighbor


[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record x.]

    The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

    1 nearest-neighbor


    Voronoi Diagram

    Nearest Neighbor Classification


Compute the distance between two points, e.g. Euclidean distance:

    d(p, q) = √( Σ_i (p_i − q_i)² )

    Determine the class from the nearest-neighbor list:
    take the majority vote of the class labels among the k nearest neighbors,
    or weigh each vote according to distance, e.g. weight factor w = 1 / d²
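A compact sketch of a k-nearest-neighbor classifier with both plain and distance-weighted voting; it assumes numeric attributes and the Euclidean distance above, and the tiny data set is illustrative.

    # k-nearest-neighbor classification with majority or distance-weighted voting
    from collections import defaultdict
    from math import sqrt

    def euclidean(p, q):
        return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def knn_predict(train_X, train_y, x, k=3, weighted=False):
        neighbors = sorted(zip(train_X, train_y), key=lambda t: euclidean(t[0], x))[:k]
        votes = defaultdict(float)
        for p, label in neighbors:
            votes[label] += 1.0 / (euclidean(p, x) ** 2 + 1e-9) if weighted else 1.0   # w = 1/d^2
        return max(votes, key=votes.get)

    X = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
    y = ['A', 'A', 'B', 'B']
    print(knn_predict(X, y, [1.1, 0.9], k=3))                  # 'A'
    print(knn_predict(X, y, [4.9, 5.1], k=3, weighted=True))   # 'B'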

    Nearest Neighbor Classification


Choosing the value of k:
    If k is too small, the classifier is sensitive to noise points.
    If k is too large, the neighborhood may include points from other classes.

    Nearest Neighbor Classification


Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.

    Example:
    height of a person may vary from 1.5 m to 1.8 m
    weight of a person may vary from 90 lb to 300 lb
    income of a person may vary from $10K to $1M

    Nearest Neighbor Classification


Problems with the Euclidean measure:
    High-dimensional data — curse of dimensionality.
    Can produce counter-intuitive results, e.g. for binary vectors:

    1 1 1 1 1 1 1 1 1 1 1 0   vs   0 1 1 1 1 1 1 1 1 1 1 1   →   d = 1.4142
    1 0 0 0 0 0 0 0 0 0 0 0   vs   0 0 0 0 0 0 0 0 0 0 0 1   →   d = 1.4142

    Solution: normalize the vectors to unit length.

    Nearest neighbor Classification


k-NN classifiers are lazy learners: they do not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems. As a result, classifying unknown records is relatively expensive.

    Example: PEBLS


PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)

    Works with both continuous and nominal features; for nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM).

    Each record is assigned a weight factor.

    Number of nearest neighbors: k = 1.


    Example: PEBLS


Distance between record X and record Y:

    Δ(X, Y) = w_X · w_Y · Σ_{i=1..d} d(X_i, Y_i)²

    Tid | Refund | Marital Status | Taxable Income | Cheat
    X | Yes | Single | 125K | No
    Y | No | Married | 100K | No

    where:

    w_X = (Number of times X is used for prediction) / (Number of times X predicts correctly)

    w_X ≈ 1 if X makes accurate predictions most of the time.