Chap4 Classification Sep13



    Unit 5

    Classification: Basic Concepts, Decision

    Trees, and Model Evaluation

    by

Pang-Ning Tan, Vipin Kumar, Michael Steinbach


    Examples of Classification Task

    Predicting tumor cells as benign or malignant

Classifying credit card transactions as legitimate or fraudulent

    Classifying secondary structures of a protein as alpha-helix, beta-sheet, or random coil

    Categorizing news stories as finance, weather, entertainment, sports, etc.


    Definition

Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y. The target function is also known as a classification model.

    Descriptive Modeling: the model serves as an explanatory tool to distinguish between objects of different classes (mammal, reptile, bird, fish, or amphibian).

    Predictive Modeling: the model is used to predict the class label of unknown records (e.g., class = fish).

    Example of an unlabeled record: gila monster — cold-blooded, has scales, does not give birth; class = ?


Name | Give Birth | Lay Eggs | Can Fly | Live in Water | Have Legs | Class
    human | yes | no | no | no | yes | mammals
    python | no | yes | no | no | no | reptiles
    salmon | no | yes | no | yes | no | fishes
    whale | yes | no | no | yes | no | mammals
    frog | no | yes | no | sometimes | yes | amphibians
    komodo | no | yes | no | no | yes | reptiles
    bat | yes | no | yes | no | yes | mammals
    pigeon | no | yes | yes | no | yes | birds
    cat | yes | no | no | no | yes | mammals
    leopard shark | yes | no | no | yes | no | fishes
    turtle | no | yes | no | sometimes | yes | reptiles
    penguin | no | yes | no | sometimes | yes | birds
    porcupine | yes | no | no | no | yes | mammals
    eel | no | yes | no | yes | no | fishes
    salamander | no | yes | no | sometimes | yes | amphibians
    gila monster | no | yes | no | no | yes | reptiles
    platypus | no | yes | no | no | yes | mammals
    owl | no | yes | yes | no | yes | birds
    dolphin | yes | no | no | yes | no | mammals
    eagle | no | yes | yes | no | yes | birds

    Vertebrate data set (20 records)


    General Approach to solve a Classification Problem

[Figure: a learning algorithm is applied to the Training Set to learn (induce) a Model; the Model is then applied (deduction) to the Test Set to predict the class of unlabeled records.]

    Training Set:
    Tid | Attrib1 | Attrib2 | Attrib3 | Class
    1 | Yes | Large | 125K | No
    2 | No | Medium | 100K | No
    3 | No | Small | 70K | No
    4 | Yes | Medium | 120K | No
    5 | No | Large | 95K | Yes
    6 | No | Medium | 60K | No
    7 | Yes | Large | 220K | No
    8 | No | Small | 85K | Yes
    9 | No | Medium | 75K | No
    10 | No | Small | 90K | Yes

    Test Set:
    Tid | Attrib1 | Attrib2 | Attrib3 | Class
    11 | No | Small | 55K | ?
    12 | Yes | Medium | 80K | ?
    13 | Yes | Large | 110K | ?
    14 | No | Small | 95K | ?
    15 | No | Large | 67K | ?


    Classification

Task of assigning objects to predefined categories.

    Given a collection of records (training set), where each record contains a set of attributes and one of the attributes is the class.

    Find a model for the class attribute as a function of the values of the other attributes.

    Goal: previously unseen records should be assigned a class as accurately as possible.

    A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets; the training set is used to build the model and the test set is used to validate it.
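To make the workflow concrete, here is a minimal sketch (not part of the slides) of the induction/deduction cycle, assuming Python with scikit-learn is available; the attribute encoding mirrors the toy table above and is purely illustrative.

    # minimal sketch of the train / test workflow (assumes scikit-learn is installed)
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # toy records: [home_owner (1 = Yes), marital_status code, annual income in K]
    X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
         [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
    y = ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes']

    # split into training and test sets, learn a model, then apply it
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier()           # induction: learn the model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)             # deduction: apply the model to unseen records
    print(accuracy_score(y_test, y_pred))      # estimate accuracy on the test set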


    Performance metrics

Accuracy = (number of correct predictions) / (total number of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)

    Error rate = (number of wrong predictions) / (total number of predictions) = (f10 + f01) / (f11 + f10 + f01 + f00)

    where f_ij denotes the number of records from class i predicted as class j (for a two-class problem with classes 1 and 0).
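As a quick illustration (a sketch, not from the slides), both metrics can be computed directly from the four confusion-matrix counts; the numbers below are made up.

    # accuracy and error rate from the confusion-matrix counts f11, f10, f01, f00
    def accuracy(f11, f10, f01, f00):
        return (f11 + f00) / (f11 + f10 + f01 + f00)

    def error_rate(f11, f10, f01, f00):
        return (f10 + f01) / (f11 + f10 + f01 + f00)

    print(accuracy(40, 10, 5, 45))    # 0.85
    print(error_rate(40, 10, 5, 45))  # 0.15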

    Classification Techniques

    Decision Tree Induction Methods

    Rule-based Classifier Methods

    Nearest-Neighbor classifiers

    Bayesian Classifiers


    Decision Tree Induction (How it works?)

Root node: no incoming edges and zero or more outgoing edges.

    Internal nodes: exactly one incoming edge and two or more outgoing edges.

    Leaf or terminal nodes: exactly one incoming edge and no outgoing edges.


Classifying an unlabeled vertebrate


    How to build a Decision Tree

Algorithms that employ a greedy strategy to grow the tree in a reasonable amount of time are used.

    One such algorithm is Hunt's algorithm.


Hunt's Algorithm

    Let D_t be the set of training records that reach node t.

    1. If all the records in D_t belong to the same class y_t, then t is a leaf node labeled as y_t.

    2. If D_t contains records that belong to more than one class, an attribute test condition is selected to split the data into smaller subsets. A child node is created for each outcome of the test condition, and the records in D_t are distributed to the children based on the outcomes.

    3. Recursively apply the procedure to each subset.


Conditions & Issues

    The first case means that most of the borrowers repaid their loans.

    We need to consider records from both classes, so the root is split on Home Owner.

    The left child is then split again, and splitting continues recursively.

    A child node can be empty, i.e. no records are associated with it; such a node is declared a leaf with the majority class of its parent's records.

    Records with identical attribute values (but possibly different class labels) cannot be split further; such a node is declared a leaf with the majority class.

    Design Issues:

    How should the training records be split? An attribute test condition is used to divide them into smaller subsets.

    When should splitting stop? One strategy is to expand a node until all its records belong to the same class or all have identical attribute values; early termination can also be used.


    Method for Expressing Attribute Test Conditions

Depends on attribute type:
    Binary
    Nominal
    Ordinal
    Continuous

    Depends on number of ways to split:
    2-way split
    Multi-way split


    Binary Attributes

    Test condition generates two potential outcomes


    Nominal Attributes

Multi-way split: use as many partitions as there are distinct values, e.g. CarType → {Family}, {Sports}, {Luxury}.

    Binary split: divides the values into two subsets; need to find the optimal partitioning, e.g. CarType → {Family, Luxury} vs {Sports}, or {Sports, Luxury} vs {Family}.

    A decision tree algorithm such as CART, which produces only binary splits, must consider the 2^(k−1) − 1 ways of creating a binary partition of k attribute values.


Ordinal Attributes

    Can produce multi-way or binary splits.

    Values can be grouped as long as the grouping does not violate the order of the attribute values.

    In Figure 4.10, groupings (a) and (b) preserve the order, but (c) does not, because it combines Small with Large and Medium with Extra Large.


    Continuous Attributes

Different ways of handling:

    Discretization to form an ordinal categorical attribute
      Static — discretize once at the beginning
      Dynamic — ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering

    Binary decision: (A < v) or (A ≥ v)
      Consider all possible splits and find the best cut
      Can be more compute intensive



    Measures for selecting the Best Split

p(i|t): the fraction of records belonging to class i at a given node t.

    The measures for selecting the best split are based on the degree of impurity of the child nodes: the smaller the degree of impurity, the more skewed the class distribution.

    A node with class distribution (0, 1) has impurity 0; a node with uniform distribution (0.5, 0.5) has the highest impurity.


    Which test condition is the best?


    How to determine the Best Split

Greedy approach: nodes with a homogeneous class distribution are preferred.

    Need a measure of node impurity:

    C0: 5, C1: 5 — non-homogeneous, high degree of impurity
    C0: 9, C1: 1 — homogeneous, low degree of impurity


    Measures of Node Impurity

Gini Index
    Entropy
    Misclassification error

    GINI(t) = 1 − Σ_i [p(i|t)]²

    Entropy(t) = −Σ_i p(i|t) log₂ p(i|t)

    Classification error(t) = 1 − max_i [p(i|t)]

    (NOTE: p(i|t) is the relative frequency of class i at node t; c is the number of classes and 0 log₂ 0 = 0 in entropy calculations.)
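All three impurity measures can be computed directly from a node's class counts; the following small Python sketch (not part of the slides) mirrors the formulas above.

    # impurity measures for a node, given its class counts
    from math import log2

    def gini(counts):
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts)

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * log2(c / n) for c in counts if c > 0)  # 0*log2(0) treated as 0

    def classification_error(counts):
        n = sum(counts)
        return 1 - max(c / n for c in counts)

    print(gini([3, 3]), entropy([3, 3]), classification_error([3, 3]))   # 0.5, 1.0, 0.5
    print(gini([0, 6]), entropy([0, 6]), classification_error([0, 6]))   # 0.0, 0.0, 0.0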


Measure of Impurity: GINI

    Gini index for a given node t:

    GINI(t) = 1 − Σ_i [p(i|t)]²

    (NOTE: p(i|t) is the relative frequency of class i at node t.)

    Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying the least interesting information.
    Minimum (0.0) when all records belong to one class, implying the most interesting information.

    C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0.000
    C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6; Gini = 1 − (1/6)² − (5/6)² = 0.278
    C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6; Gini = 1 − (2/6)² − (4/6)² = 0.444
    C1: 3, C2: 3 → P(C1) = 3/6, P(C2) = 3/6; Gini = 1 − (3/6)² − (3/6)² = 0.500


Alternative Splitting Criteria based on INFO

    Entropy at a given node t:

    Entropy(t) = −Σ_i p(i|t) log₂ p(i|t)

    (NOTE: p(i|t) is the relative frequency of class i at node t.)

    Measures the homogeneity of a node.
    Maximum (log₂ n_c) when records are equally distributed among all classes, implying the least information.
    Minimum (0.0) when all records belong to one class, implying the most information.
    Entropy-based computations are similar to the GINI index computations.


Examples for computing Entropy

    C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Entropy = −0 log₂ 0 − 1 log₂ 1 = 0 − 0 = 0
    C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6; Entropy = −(1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.65
    C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6; Entropy = −(2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.92


Splitting Criteria based on Classification Error

    Classification error at a node t:

    Error(t) = 1 − max_i [p(i|t)]

    Measures the misclassification error made by a node.
    Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying the least interesting information.
    Minimum (0.0) when all records belong to one class, implying the most interesting information.


Examples for Computing Error

    C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Error = 1 − max(0, 1) = 1 − 1 = 0
    C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6; Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6
    C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6; Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3


Misclassification Error vs Gini

    Parent: C1 = 7, C2 = 3; Gini = 0.42

    Split on A? (Yes → node N1, No → node N2):
    N1: C1 = 3, C2 = 0
    N2: C1 = 4, C2 = 3

    Gini(N1) = 1 − (3/3)² − (0/3)² = 0
    Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
    Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

    Gini improves!! Misclassification error, by contrast, stays the same: parent error = 1 − 7/10 = 0.3, children error = 3/10 × 0 + 7/10 × (1 − 4/7) = 0.3.


Information Gain to find best split

    Before splitting, the node has class counts C0: N00, C1: N01 and impurity M0.

    A candidate split on A? (Yes/No) produces nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21), with impurities M1 and M2; their weighted average is M12.

    A candidate split on B? (Yes/No) produces nodes N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41), with impurities M3 and M4; their weighted average is M34.

    The difference in entropy before and after splitting gives the information gain:
    Gain(A) = M0 − M12 vs Gain(B) = M0 − M34; choose the split with the larger gain.


    Splitting of Binary Attributes


Binary Attributes: Computing GINI Index

    Splits into two partitions.
    Effect of weighting partitions: larger and purer partitions are sought.

    Parent: C0 = 6, C1 = 6; Gini = 0.500

    Split on B? (Yes → N1, No → N2):
    N1: C0 = 5, C1 = 2
    N2: C0 = 1, C1 = 4

    Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
    Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
    Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371


    Splitting of Nominal Attributes: Computing Gini Index

For each distinct value, gather the counts for each class in the dataset, and use the count matrix to make decisions.

    Multi-way split on CarType:
    Gini{Family} = 0.375; Gini{Sports} = 0; Gini{Luxury} = 0.219
    Gini(CarType) = (4/20) × 0.375 + (8/20) × 0 + (8/20) × 0.219 = 0.163
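A small sketch (not from the slides) of how the weighted Gini of a multi-way split is computed; the per-class counts for each CarType value are assumptions that are consistent with the Gini values quoted above.

    # weighted Gini index of a candidate split, given per-partition class counts
    def gini(counts):
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts)

    def gini_split(partitions):
        total = sum(sum(p) for p in partitions)
        return sum(sum(p) / total * gini(p) for p in partitions)

    # assumed CarType partitions as [class1_count, class2_count]
    family, sports, luxury = [1, 3], [8, 0], [1, 7]
    print(gini_split([family, sports, luxury]))   # 0.1625, i.e. the 0.163 quoted above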


    Splitting of Continuous Attributes: Computing Gini Index

Use binary decisions based on one value, e.g. Taxable Income > 80K? (Yes / No).

    Several choices for the splitting value: the number of possible splitting values = the number of distinct values.

    Each splitting value v has a count matrix associated with it: the class counts in each of the two partitions, A < v and A ≥ v.

    Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. Computationally inefficient! Repetition of work.

    Tid | Refund | Marital Status | Taxable Income | Cheat
    1 | Yes | Single | 125K | No
    2 | No | Married | 100K | No
    3 | No | Single | 70K | No
    4 | Yes | Married | 120K | No
    5 | No | Divorced | 95K | Yes
    6 | No | Married | 60K | No
    7 | Yes | Divorced | 220K | No
    8 | No | Single | 85K | Yes
    9 | No | Married | 75K | No
    10 | No | Single | 90K | Yes


    Continuous Attributes: Computing Gini Index...

    For efficient computation: for each attribute,

    Sort the attribute on values

    Linearly scan these values, each time updating the count matrix and

    computing Gini index

    Choose the split position that has the least Gini index
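The sort-then-scan procedure is easy to sketch in code. The following is an illustrative Python version for one continuous attribute with two classes (it is a sketch of the idea above, not an implementation from the slides); run on the Taxable Income column it finds the cut at 97.5.

    # best binary split on a continuous attribute: sort values once, then linearly scan
    def gini(counts):
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

    def best_split(values, labels):
        data = sorted(zip(values, labels))
        classes = sorted(set(labels))
        left = {c: 0 for c in classes}
        right = {c: labels.count(c) for c in classes}
        best = (float('inf'), None)
        for i in range(len(data) - 1):
            left[data[i][1]] += 1                     # move one record into the left partition
            right[data[i][1]] -= 1                    # ...and out of the right partition
            cut = (data[i][0] + data[i + 1][0]) / 2   # candidate cut between consecutive values
            n = len(data)
            w = (i + 1) / n * gini(list(left.values())) + (n - i - 1) / n * gini(list(right.values()))
            best = min(best, (w, cut))
        return best  # (weighted Gini, cut value)

    income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
    cheat  = ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes']
    print(best_split(income, cheat))   # expected: weighted Gini 0.3 at cut 97.5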


Splitting Based on INFO...

    Information Gain:

    GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) Entropy(i)

    Parent node p is split into k partitions; n_i is the number of records in partition i.

    Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).

    Used in ID3 and C4.5.

    Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
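A minimal sketch (not from the slides) of GAIN_split for one candidate partitioning; the example counts reproduce the buys_computer result on the next slide.

    # information gain of a split, given parent class counts and per-partition class counts
    from math import log2

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * log2(c / n) for c in counts if c > 0)

    def information_gain(parent_counts, partitions):
        n = sum(parent_counts)
        weighted = sum(sum(p) / n * entropy(p) for p in partitions)
        return entropy(parent_counts) - weighted

    # parent (9 yes, 5 no) split by age into three partitions
    print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ~0.246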


Example of Information Gain

    Class P: buys_computer = "yes"; Class N: buys_computer = "no".

    age | p_i | n_i | I(p_i, n_i)
    <=30 | 2 | 3 | 0.971
    31…40 | 4 | 0 | 0
    >40 | 3 | 2 | 0.971

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

    Gain(age) = Info(D) − Info_age(D) = 0.246

    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048

    [Training data: 14 records with attributes age, income, student, credit_rating and class buys_computer; the full table is truncated in this copy.]


Splitting the samples using age

    age <= 30:
    income | student | credit_rating | buys_computer
    high | no | fair | no
    high | no | excellent | no
    medium | no | fair | no
    low | yes | fair | yes
    medium | yes | excellent | yes

    age 31…40 (all records labeled "yes", so this branch becomes a leaf):
    income | student | credit_rating | buys_computer
    high | no | fair | yes
    low | yes | excellent | yes
    medium | no | excellent | yes
    high | yes | fair | yes

    age > 40:
    income | student | credit_rating | buys_computer
    medium | no | fair | yes
    low | yes | fair | yes
    low | yes | excellent | no
    medium | yes | fair | yes
    medium | no | excellent | no


Splitting Based on INFO...

    Gain Ratio:

    GainRATIO_split = GAIN_split / SplitINFO

    SplitINFO = −Σ_{i=1..k} (n_i / n) log₂ (n_i / n)

    Parent node p is split into k partitions; n_i is the number of records in partition i.

    Adjusts information gain by the entropy of the partitioning (SplitINFO): higher-entropy partitioning (a large number of small partitions) is penalized!

    Used in C4.5. Designed to overcome the disadvantage of information gain.
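A short sketch (not from the slides) showing how the gain ratio penalizes splits with many small partitions; entropy is as defined earlier and log base 2 is assumed.

    # gain ratio = information gain / split information
    from math import log2

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * log2(c / n) for c in counts if c > 0)

    def gain_ratio(parent_counts, partitions):
        n = sum(parent_counts)
        gain = entropy(parent_counts) - sum(sum(p) / n * entropy(p) for p in partitions)
        split_info = -sum(sum(p) / n * log2(sum(p) / n) for p in partitions if sum(p) > 0)
        return gain / split_info

    # the same 2-class parent split two ways: two pure partitions vs. sixteen tiny pure ones
    print(gain_ratio([8, 8], [[8, 0], [0, 8]]))              # 1.0
    print(gain_ratio([8, 8], [[1, 0]] * 8 + [[0, 1]] * 8))   # gain is still 1.0, but the ratio drops to 0.25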


    Decision Tree Induction

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.

    Issues:
    Determine how to split the records
      How to specify the attribute test condition?
      How to determine the best split?
    Determine when to stop splitting


    Stopping Criteria for Tree Induction

Stop expanding a node when all the records belong to the same class.
    Stop expanding a node when all the records have similar attribute values.
    Early termination.


Algorithm: Decision Tree Algorithm

    TreeGrowth (E, F)
    1.  if stopping_cond(E, F) = true then
    2.      leaf = createNode()
    3.      leaf.label = Classify(E)
    4.      return leaf
    5.  else
    6.      root = createNode()
    7.      root.test_cond = find_best_split(E, F)
    8.      let V = { v | v is a possible outcome of root.test_cond }
    9.      for each v ∈ V do
    10.         E_v = { e | root.test_cond(e) = v and e ∈ E }
    11.         child = TreeGrowth(E_v, F)
    12.         add child as a descendent of root and label the edge (root → child) as v
    13.     end for
    14. end if
    15. return root


createNode(): creates a new node; a node has either a test condition (node.test_cond) or a class label (node.label).

    find_best_split(): determines which attribute should be selected as the test condition for splitting the records, using a measure such as entropy, Gini, or classification error.

    Classify(): determines the class label to be assigned to a leaf node: leaf.label = argmax_i p(i | t).

    stopping_cond(): terminates the tree growth by testing whether all the records have the same class label or the same attribute values.

    Tree pruning and overfitting are discussed later.
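For readers who want to run the skeleton, here is a compact, hedged Python rendering of TreeGrowth. It is a sketch under stated assumptions (categorical attributes, Gini-based find_best_split, each attribute used at most once per path), not the exact algorithm of any particular library.

    # a small runnable rendering of the TreeGrowth skeleton
    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def find_best_split(records, labels, attributes):
        best = None
        for a in attributes:
            values = set(r[a] for r in records)
            parts = [[y for r, y in zip(records, labels) if r[a] == v] for v in values]
            w = sum(len(p) / len(labels) * gini(p) for p in parts)   # weighted child impurity
            if best is None or w < best[0]:
                best = (w, a)
        return best[1]

    def tree_growth(records, labels, attributes):
        if len(set(labels)) == 1 or not attributes:          # stopping_cond
            return Counter(labels).most_common(1)[0][0]      # Classify: majority class at the leaf
        a = find_best_split(records, labels, attributes)
        node = {'test': a, 'children': {}}
        for v in set(r[a] for r in records):                 # one child per outcome of the test
            idx = [i for i, r in enumerate(records) if r[a] == v]
            node['children'][v] = tree_growth([records[i] for i in idx],
                                              [labels[i] for i in idx],
                                              [x for x in attributes if x != a])
        return node

    recs = [{'Refund': 'Yes', 'Status': 'Single'}, {'Refund': 'No', 'Status': 'Married'},
            {'Refund': 'No', 'Status': 'Single'}, {'Refund': 'No', 'Status': 'Divorced'}]
    print(tree_growth(recs, ['No', 'No', 'Yes', 'Yes'], ['Refund', 'Status']))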


    Decision Tree Based Classification

    Advantages: Inexpensive to construct

    Extremely fast at classifying unknown records

    Easy to interpret for small-sized trees

    Accuracy is comparable to other classification

    techniques for many simple data sets


    Decision Tree Induction

Many algorithms: Hunt's Algorithm (one of the earliest)

    CART

    ID3, C4.5

    SLIQ, SPRINT


Computing Impurity Measure

    Tid | Refund | Marital Status | Taxable Income | Class
    1 | Yes | Single | 125K | No
    2 | No | Married | 100K | No
    3 | No | Single | 70K | No
    4 | Yes | Married | 120K | No
    5 | No | Divorced | 95K | Yes
    6 | No | Married | 60K | No
    7 | Yes | Divorced | 220K | No
    8 | No | Single | 85K | Yes
    9 | No | Married | 75K | No
    10 | ? | Single | 90K | Yes    (missing value for Refund)

    Count matrix for the split on Refund:
                  Class = Yes | Class = No
    Refund = Yes        0     |     3
    Refund = No         2     |     4
    Refund = ?          1     |     0

    Before splitting: Entropy(Parent) = −0.3 log(0.3) − 0.7 log(0.7) = 0.8813

    Split on Refund:
    Entropy(Refund=Yes) = 0
    Entropy(Refund=No) = −(2/6) log(2/6) − (4/6) log(4/6) = 0.9183
    Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551

    Gain = 0.9 × (0.8813 − 0.551) ≈ 0.297    (the gain is scaled by 0.9, the fraction of records with a known Refund value)


Rule-Based Classifier

    Classify records by using a collection of if…then rules.

    Rule: (Condition) → y
    where Condition is a conjunction of attribute tests and y is the class label.
    LHS: rule antecedent or condition
    RHS: rule consequent

    Examples of classification rules:
    (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
    (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No


Rule-Based Classifier (Example)

    R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
    R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R5: (Live in Water = sometimes) → Amphibians

    Name | Blood Type | Give Birth | Can Fly | Live in Water | Class
    human | warm | yes | no | no | mammals
    python | cold | no | no | no | reptiles
    salmon | cold | no | no | yes | fishes
    whale | warm | yes | no | yes | mammals
    frog | cold | no | no | sometimes | amphibians
    komodo | cold | no | no | no | reptiles
    bat | warm | yes | yes | no | mammals
    pigeon | warm | no | yes | no | birds
    cat | warm | yes | no | no | mammals
    leopard shark | cold | yes | no | yes | fishes
    turtle | cold | no | no | sometimes | reptiles
    penguin | warm | no | no | sometimes | birds
    porcupine | warm | yes | no | no | mammals
    eel | cold | no | no | yes | fishes
    salamander | cold | no | no | sometimes | amphibians
    gila monster | cold | no | no | no | reptiles
    platypus | warm | no | no | no | mammals
    owl | warm | no | yes | no | birds
    dolphin | warm | yes | no | yes | mammals
    eagle | warm | no | yes | no | birds


Application of Rule-Based Classifier

    A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.

    Each attribute test (A_j op v_j) is an attribute–value pair called a conjunct; the condition is a conjunction of conjuncts:
    Condition_i = (A_1 op v_1) ∧ (A_2 op v_2) ∧ … ∧ (A_k op v_k)

    The rule R1 covers a hawk → Bird
    The rule R3 covers the grizzly bear → Mammal

    Name | Blood Type | Give Birth | Can Fly | Live in Water | Class
    hawk | warm | no | yes | no | ?
    grizzly bear | warm | yes | no | no | ?


Tid | Refund | Marital Status | Taxable Income | Class
    1 | Yes | Single | 125K | No
    2 | No | Married | 100K | No
    3 | No | Single | 70K | No
    4 | Yes | Married | 120K | No
    5 | No | Divorced | 95K | Yes
    6 | No | Married | 60K | No
    7 | Yes | Divorced | 220K | No
    8 | No | Single | 85K | Yes
    9 | No | Married | 75K | No
    10 | No | Single | 90K | Yes

    Rule: (Status = Single) → No
    Coverage = ____ %, Accuracy = ____ %


How does the Rule-Based Classifier Work?

    R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
    R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R5: (Live in Water = sometimes) → Amphibians

    A lemur triggers rule R3, so it is classified as a mammal.
    A turtle triggers both R4 and R5.
    A dogfish shark triggers none of the rules.

    Name | Blood Type | Give Birth | Can Fly | Live in Water | Class
    lemur | warm | yes | no | no | ?
    turtle | cold | no | no | sometimes | ?
    dogfish shark | cold | yes | no | yes | ?


Characteristics of Rule-Based Classifier

    Mutually exclusive rules: the classifier contains mutually exclusive rules if the rules are independent of each other — every record is covered by at most one rule.

    Exhaustive rules: the classifier has exhaustive coverage if it accounts for every possible combination of attribute values — each record is covered by at least one rule.


From Decision Trees To Rules

    [Decision tree: the root tests Refund (Yes → leaf NO); Refund = No tests Marital Status ({Married} → leaf NO); {Single, Divorced} tests Taxable Income (< 80K → leaf NO, > 80K → leaf YES).]

    Classification Rules:
    (Refund = Yes) ==> No
    (Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
    (Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
    (Refund = No, Marital Status = {Married}) ==> No

    Rules are mutually exclusive and exhaustive. The rule set contains as much information as the tree.

    Rules Can Be Simplified


[Same decision tree and training data as in the previous slides.]

    Initial Rule: (Refund = No) ∧ (Status = Married) → No
    Simplified Rule: (Status = Married) → No

    Effect of Rule Simplification


Rules are no longer mutually exclusive: a record may trigger more than one rule.
    Solution? Ordered rule set, or unordered rule set with a voting scheme.

    Rules are no longer exhaustive: a record may not trigger any rule.
    Solution? Use a default class.

    Ordered Rule Set


Rules are rank-ordered according to their priority; an ordered rule set is known as a decision list.

    When a test record is presented to the classifier, it is assigned to the class label of the highest-ranked rule it has triggered; if none of the rules fire, it is assigned to the default class.

    R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
    R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R5: (Live in Water = sometimes) → Amphibians

    Name | Blood Type | Give Birth | Can Fly | Live in Water | Class
    turtle | cold | no | no | sometimes | ?

    Rule Ordering Schemes


Rule-based ordering: individual rules are ranked based on their quality.
    Class-based ordering: rules that belong to the same class appear together.

    Rule-based Ordering:
    (Refund = Yes) ==> No
    (Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
    (Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
    (Refund = No, Marital Status = {Married}) ==> No

    Class-based Ordering:
    (Refund = Yes) ==> No
    (Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
    (Refund = No, Marital Status = {Married}) ==> No
    (Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes

    Building Classification Rules


Direct Method: extract rules directly from data; e.g. RIPPER, CN2, Holte's 1R.

    Indirect Method: extract rules from other classification models (e.g. decision trees, neural networks); e.g. C4.5rules.

    Direct Method: Sequential Covering


    1. Start from an empty rule

    2. Grow a rule using the Learn-One-Rule function

    3. Remove training records covered by the rule

    4. Repeat Step (2) and (3) until stopping criterion

    is met
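A hedged sketch of the sequential covering loop just described; Learn-One-Rule is deliberately left abstract here (the slides discuss several ways to grow a rule), so learn_one_rule and the example predicate are assumptions for illustration only.

    # sequential covering: repeatedly learn one rule, then remove the records it covers
    def sequential_covering(records, labels, target_class, learn_one_rule, max_rules=10):
        rules = []
        remaining = list(zip(records, labels))
        while any(y == target_class for _, y in remaining) and len(rules) < max_rules:
            rule = learn_one_rule(remaining, target_class)   # Step 2: grow a single rule
            if rule is None:                                  # Step 4: stopping criterion
                break
            rules.append(rule)
            remaining = [(r, y) for r, y in remaining if not rule(r)]   # Step 3: remove covered records
        return rules

    # a learned 'rule' is assumed to be a predicate over a record, e.g.:
    example_rule = lambda r: r.get('Status') == 'Single' and r.get('Refund') == 'No'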


    Example of Sequential Covering


[Figure: (i) original data; (ii) step 1; (iii) step 2 — rule R1 covers part of the positive examples; (iv) step 3 — after the records covered by R1 are removed, rule R2 is learned.]

    Aspects of Sequential Covering


    Rule Growing

    Instance Elimination

    Rule Evaluation

    Stopping Criterion

    Rule Pruning

    Rule Growing


Two common strategies:

    (a) General-to-specific: start from the empty rule { } => Class = Yes and grow it by adding one conjunct at a time; the candidate conjuncts shown in the figure include Refund = No, Status = Single, Status = Divorced, Status = Married, and Income > 80K, each evaluated by the class counts (Yes/No) of the records it covers.

    (b) Specific-to-general: start from specific rules such as (Refund = No, Status = Single, Income = 85K) → Class = Yes and (Refund = No, Status = Single, Income = 90K) → Class = Yes, and generalize them, e.g. to (Refund = No, Status = Single) → Class = Yes.

    Ripper Algorithm


Sequential covering algorithm:
    Start from an empty rule: {} => class
    Add conjuncts that maximize FOIL's information gain measure:

    R0: {} => class (initial rule)
    R1: {A} => class (rule after adding a conjunct)

    Gain(R0, R1) = t × [ log(p1 / (p1 + n1)) − log(p0 / (p0 + n0)) ]

    where t: number of positive instances covered by both R0 and R1
    p0: number of positive instances covered by R0
    n0: number of negative instances covered by R0
    p1: number of positive instances covered by R1
    n1: number of negative instances covered by R1
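A small sketch of FOIL's information gain as defined above; log base 2 is assumed here, and the example counts are made up.

    # FOIL's information gain between rule R0 and its specialization R1
    from math import log2

    def foil_gain(p0, n0, p1, n1):
        t = p1   # positives covered by both R0 and R1 (R1 covers a subset of R0's instances)
        return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

    # e.g. R0 covers 100 positives and 400 negatives; the added conjunct leaves 30 positives, 10 negatives
    print(foil_gain(100, 400, 30, 10))   # positive gain: the rule became much more accurate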

    Instance Elimination


Why do we need to eliminate instances? Otherwise, the next rule learned would be identical to the previous rule.

    Why do we remove positive instances? To ensure that the next rule is different.

    Why do we remove negative instances? To prevent underestimating the accuracy of the next rule. Compare rules R2 and R3 in the diagram.

    [Figure: scatter of positive (+) and negative (−) training instances with three rule regions R1, R2, R3 drawn over them.]


    Stopping Criterion and Rule Pruning


Stopping criterion: compute the gain; if the gain is not significant, discard the new rule.

    Rule pruning: similar to post-pruning of decision trees.
    Reduced error pruning: remove one of the conjuncts in the rule, compare the error rate on the validation set before and after pruning, and prune the conjunct if the error improves.


    Indirect Methods


Rule set from a decision tree (each path gives one rule with class + or −):

    r1: (P = No, Q = No) ==> −
    r2: (P = No, Q = Yes) ==> +
    r3: (P = Yes, R = No) ==> +
    r4: (P = Yes, R = Yes, Q = No) ==> −
    r5: (P = Yes, R = Yes, Q = Yes) ==> +

    [Figure: decision tree with root P; P = No branches to Q (No → −, Yes → +); P = Yes branches to R (No → +, Yes → Q, with Q = No → −, Q = Yes → +).]

    r2, r3 and r5 predict the positive class when Q = yes, so the rules can be simplified as:
    r2′: (Q = Yes) ==> +
    r3′: (P = Yes) ∧ (R = No) ==> +

Rule generation: C4.5rules


Extract classification rules from every path of a decision tree.

    For each rule r: A → y, consider an alternative rule r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A.

    Compare the pessimistic error rate for r against all the r′ rules; prune r if one of the r′ has a lower pessimistic error rate.

    Repeat until we can no longer improve the generalization error.

Rule ordering: C4.5rules


Class-based ordering is used: rules that predict the same class are grouped into the same subset.

    Compute the description length of each subset; the classes are arranged in increasing order of their total description length, and the class with the smallest description length is given the highest priority.

    Description length = L(exception) + g × L(model)
    g is a parameter that takes into account the presence of redundant attributes in a rule set (default value = 0.5).

    Example


[Vertebrate data set: the same 20-record table (Give Birth, Lay Eggs, Can Fly, Live in Water, Have Legs, Class) shown earlier.]

    C4.5 versus C4.5rules versus RIPPER


C4.5rules:
    (Give Birth = No, Can Fly = Yes) → Birds
    (Give Birth = No, Live in Water = Yes) → Fishes
    (Give Birth = Yes) → Mammals
    (Give Birth = No, Can Fly = No, Live in Water = No) → Reptiles
    ( ) → Amphibians

    [C4.5 decision tree: the root tests Give Birth? (Yes → Mammals); No → Live In Water? (Yes → Fishes, Sometimes → Amphibians); No → Can Fly? (Yes → Birds, No → Reptiles).]

    RIPPER:
    (Live in Water = Yes) → Fishes
    (Have Legs = No) → Reptiles
    (Give Birth = No, Can Fly = No, Live In Water = No) → Reptiles
    (Can Fly = Yes, Give Birth = No) → Birds
    ( ) → Mammals

    Advantages of Rule-Based Classifiers


    As highly expressive as decision trees

    Easy to interpret

    Easy to generate

    Can classify new instances rapidly

    Performance comparable to decision trees


    Rule Coverage and Accuracy


Coverage of a rule: the fraction of records that satisfy the antecedent of the rule = |A| / |D|

    Accuracy of a rule: the fraction of records that satisfy the antecedent which also satisfy the consequent of the rule = |A ∩ y| / |A|

    (Gives Birth = yes) ∧ (Body Temperature = warm-blooded) → Mammals
    Coverage = 33%, Accuracy = 6/6 = 100%
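A quick sketch (not from the slides) of these two measures over a list of records; the attribute names and the tiny data set are illustrative.

    # coverage and accuracy of a rule over a data set D
    def coverage_and_accuracy(D, antecedent, consequent_class):
        covered = [r for r in D if antecedent(r)]
        correct = [r for r in covered if r['Class'] == consequent_class]
        coverage = len(covered) / len(D)
        accuracy = len(correct) / len(covered) if covered else 0.0
        return coverage, accuracy

    D = [{'Give Birth': 'yes', 'Blood Type': 'warm', 'Class': 'mammals'},
         {'Give Birth': 'no',  'Blood Type': 'cold', 'Class': 'reptiles'},
         {'Give Birth': 'yes', 'Blood Type': 'warm', 'Class': 'mammals'}]
    rule = lambda r: r['Give Birth'] == 'yes' and r['Blood Type'] == 'warm'
    print(coverage_and_accuracy(D, rule, 'mammals'))   # (0.67, 1.0) on this tiny illustrative set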


    Rule Growing (Examples)


CN2 Algorithm:
    Start from an empty conjunct: {}
    Add conjuncts that minimize the entropy measure: {A}, {A, B}, …
    Determine the rule consequent by taking the majority class of the instances covered by the rule.

    RIPPER Algorithm:
    Start from an empty rule: {} => class
    Add conjuncts that maximize FOIL's information gain measure:
    R0: {} => class (initial rule)
    R1: {A} => class (rule after adding a conjunct)
    Gain(R0, R1) = t × [ log(p1 / (p1 + n1)) − log(p0 / (p0 + n0)) ]
    where t: number of positive instances covered by both R0 and R1; p0, n0: positive and negative instances covered by R0; p1, n1: positive and negative instances covered by R1.


    Rule Evaluation


Metrics (n: number of instances covered by the rule; n_c: number of instances of class c covered by the rule; k: number of classes; p: prior probability):

    Accuracy = n_c / n

    Laplace = (n_c + 1) / (n + k)

    M-estimate = (n_c + k·p) / (n + k)
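A tiny sketch of the three rule-evaluation metrics above; the example counts are made up.

    # rule-evaluation metrics: accuracy, Laplace, and m-estimate
    def rule_accuracy(nc, n):
        return nc / n

    def laplace(nc, n, k):
        return (nc + 1) / (n + k)

    def m_estimate(nc, n, k, p):
        return (nc + k * p) / (n + k)

    # a rule covering 50 instances, 45 of them of the predicted class, in a 2-class problem
    print(rule_accuracy(45, 50), laplace(45, 50, 2), m_estimate(45, 50, 2, 0.5))
    # 0.9, 0.884..., 0.884...  (with k = 2 and p = 0.5 the two smoothed metrics coincide)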


    Summary of Direct Method


    Grow a single rule

    Remove Instances from rule

    Prune the rule (if necessary)

    Add rule to Current Rule Set

    Repeat

    Direct Method: RIPPER


For a 2-class problem, choose one of the classes as the positive class and the other as the negative class; learn rules for the positive class, and let the negative class be the default class.

    For a multi-class problem, order the classes according to increasing class prevalence (the fraction of instances that belong to a particular class); learn the rule set for the smallest class first, treating the rest as the negative class, then repeat with the next smallest class as the positive class.

    Direct Method: RIPPER


Growing a rule:
    Start from an empty rule.
    Add conjuncts as long as they improve FOIL's information gain.
    Stop when the rule no longer covers negative examples.
    Prune the rule immediately using incremental reduced error pruning.
    Measure for pruning: v = (p − n) / (p + n), where p and n are the numbers of positive and negative examples covered by the rule in the validation set.
    Pruning method: delete any final sequence of conditions that maximizes v.

    Direct Method: RIPPER


Building a Rule Set:
    Use the sequential covering algorithm: find the best rule that covers the current set of positive examples, then eliminate both the positive and negative examples covered by the rule.
    Each time a rule is added to the rule set, compute the new description length; stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far.

    Direct Method: RIPPER


Optimize the rule set:
    For each rule r in the rule set R, consider two alternative rules:
      Replacement rule (r*): grow a new rule from scratch
      Revised rule (r′): add conjuncts to extend the rule r
    Compare the rule set containing r against the rule sets containing r* and r′, and choose the rule set that minimizes the MDL principle.
    Repeat rule generation and rule optimization for the remaining positive examples.


    Instance-Based Classifiers


[Figure: a set of stored cases (Atr1 … AtrN, Class ∈ {A, B, C}) and an unseen case (Atr1 … AtrN) whose class is to be predicted.]

    Store the training records.
    Use the training records to predict the class label of unseen cases.

    Instance Based Classifiers


Examples:

    Rote-learner: memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly.

    Nearest neighbor: uses the k closest points (nearest neighbors) to perform classification.

    Nearest Neighbor Classifiers


Basic idea: "If it walks like a duck and quacks like a duck, then it's probably a duck."

    [Figure: a test record is compared against the training records — compute the distance, then choose the k nearest records.]

    Nearest-Neighbor Classifiers


Requires three things:
    The set of stored records
    A distance metric to compute the distance between records
    The value of k, the number of nearest neighbors to retrieve

    To classify an unknown record:
    Compute the distance to the other training records
    Identify the k nearest neighbors
    Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

    Definition of Nearest Neighbor


[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record x.]

    The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

    1 nearest-neighbor


    Voronoi Diagram

    Nearest Neighbor Classification


Compute the distance between two points, e.g. Euclidean distance:

    d(p, q) = √( Σ_i (p_i − q_i)² )

    Determine the class from the nearest-neighbor list:
    take the majority vote of the class labels among the k nearest neighbors,
    or weigh each vote according to distance, e.g. weight factor w = 1 / d²
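A compact sketch of a k-nearest-neighbor classifier with both plain and distance-weighted voting; it assumes numeric attributes and the Euclidean distance above, and the tiny data set is illustrative.

    # k-nearest-neighbor classification with majority or distance-weighted voting
    from collections import defaultdict
    from math import sqrt

    def euclidean(p, q):
        return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def knn_predict(train_X, train_y, x, k=3, weighted=False):
        neighbors = sorted(zip(train_X, train_y), key=lambda t: euclidean(t[0], x))[:k]
        votes = defaultdict(float)
        for p, label in neighbors:
            votes[label] += 1.0 / (euclidean(p, x) ** 2 + 1e-9) if weighted else 1.0   # w = 1/d^2
        return max(votes, key=votes.get)

    X = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
    y = ['A', 'A', 'B', 'B']
    print(knn_predict(X, y, [1.1, 0.9], k=3))                  # 'A'
    print(knn_predict(X, y, [4.9, 5.1], k=3, weighted=True))   # 'B'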

    Nearest Neighbor Classification


Choosing the value of k:
    If k is too small, the classifier is sensitive to noise points.
    If k is too large, the neighborhood may include points from other classes.

    Nearest Neighbor Classification


Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.

    Example:
    height of a person may vary from 1.5 m to 1.8 m
    weight of a person may vary from 90 lb to 300 lb
    income of a person may vary from $10K to $1M

    Nearest Neighbor Classification


Problems with the Euclidean measure:
    High-dimensional data — curse of dimensionality.
    Can produce counter-intuitive results, e.g. for binary vectors:

    1 1 1 1 1 1 1 1 1 1 1 0   vs   0 1 1 1 1 1 1 1 1 1 1 1   →   d = 1.4142
    1 0 0 0 0 0 0 0 0 0 0 0   vs   0 0 0 0 0 0 0 0 0 0 0 1   →   d = 1.4142

    Solution: normalize the vectors to unit length.

    Nearest neighbor Classification


k-NN classifiers are lazy learners: they do not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems. As a result, classifying unknown records is relatively expensive.

    Example: PEBLS


PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)

    Works with both continuous and nominal features; for nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM).

    Each record is assigned a weight factor.

    Number of nearest neighbors: k = 1.


    Example: PEBLS


Distance between record X and record Y:

    Δ(X, Y) = w_X · w_Y · Σ_{i=1..d} d(X_i, Y_i)²

    Tid | Refund | Marital Status | Taxable Income | Cheat
    X | Yes | Single | 125K | No
    Y | No | Married | 100K | No

    where:

    w_X = (Number of times X is used for prediction) / (Number of times X predicts correctly)

    w_X ≈ 1 if X makes accurate predictions most of the time.