Homework 4: Solutions


CS4445/B12
Provided by: Kenneth J. Loomis

Homework 4 Solutions
CLASSIFICATION RULES: RIPPER ALGORITHM

RIPPER: First Rule
• The first thing that needs to be determined is the consequent of the rule. Recall that a rule is made up of an antecedent and a consequent.
• The table below contains the frequency counts of the possible consequents of the rules from the userprofile dataset, using budget as the classification attribute:

Rule                   Frequency
… -> budget=low        35
… -> budget=medium     91
… -> budget=high       5
… -> budget=?          7

• We can see that budget=high has the lowest frequency count in our training dataset, so we choose that as the first consequent that we will find rules for.
• Note: I have included missing values here, as one could classify the target as missing. Alternatively, these instances could be removed.
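The class-ordering step can be sketched in a few lines of Python. This is only an illustration using the frequency counts from the table above; the variable names (`class_counts`, `ordered`, `first_target`) are hypothetical and not part of the original solution.

```python
# Class frequencies copied from the table above ("?" is the missing-value class).
class_counts = {"low": 35, "medium": 91, "high": 5, "?": 7}

# Order the classes from least to most frequent; rules are learned
# for the rarest class first.
ordered = sorted(class_counts, key=class_counts.get)
first_target = ordered[0]

print(ordered)
print(first_target)
```

If the instances with a missing budget were removed instead, the same code would apply with the `"?"` entry dropped.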

RIPPER: First Rule
• Next we attempt to find the first condition in the antecedent. We need only look at possible conditions that exist in the 5 instances that have budget=high.
• The list of possible conditions is in the table below.

Rule: ___ -> budget=high

smoker=true                      ambience=family          personality=hard-worker
smoker=false                     ambience=friends         personality=conformist
drink_level=abstemious           transport=car owner      personality=hunter-ostentatious
drink_level=casual drinker       transport=public         personality=thrifty-protector
drink_level=social drinker       marital_status=single    religion=none
dress_preference=no preference   interest=technology      religion=mormon
dress_preference=informal        interest=none            religion=christian
dress_preference=formal          interest=variety         activity=student

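RIPPER: First Rule
• Next we determine the information gain for each of the candidate rules in the table.
• Below is a detailed example of the calculation for the rule smoker=true -> budget=high.

Given:
  $p_0$ is the number of instances such that budget=high
  $n_0$ is the number of instances such that budget ≠ high
  $p_1$ is the number of instances such that smoker=true and budget=high
  $n_1$ is the number of instances such that smoker=true but budget ≠ high

$$\text{Gain} = p_1 \left[ \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right]$$

With $p_0 = 5$, $n_0 = 133$, $p_1 = 1$ and $n_1 = 25$:

$$\text{Gain} = 1 \times \left[ \log_2 \frac{1}{26} - \log_2 \frac{5}{138} \right] = 0.0862$$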
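The gain calculation can be checked numerically. The sketch below assumes the FOIL-style gain used by RIPPER, gain = p1 * (log2(p1/(p1+n1)) - log2(p0/(p0+n0))), where p0 and n0 are the positive and negative counts before the condition is added and p1 and n1 are the counts after. The function name `foil_gain` is hypothetical, and the counts p1 = 1, n1 = 25 for smoker=true (and p1 = 1, n1 = 0 for religion=mormon) are assumptions inferred from the support counts and gains reported elsewhere in this solution, not values stated on the slide.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL-style information gain of adding one condition to a rule.

    p0, n0: positives/negatives covered before adding the condition.
    p1, n1: positives/negatives covered after adding the condition.
    """
    if p1 == 0:
        return 0.0  # a condition that covers no positives contributes nothing
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Empty rule -> budget=high: 5 positives, 133 negatives (138 instances in total).
gain_smoker_true = foil_gain(5, 133, 1, 25)   # assumed counts for smoker=true
gain_mormon = foil_gain(5, 133, 1, 0)         # assumed counts for religion=mormon

print(round(gain_smoker_true, 4))
print(round(gain_mormon, 4))
```

Under these assumed counts the two values agree with the 0.0862 and 4.7866 entries of the information gain table.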

RIPPER: First Rule
• Here we see a list of the information gain for each possible first condition in the antecedent.

Rule: ___ -> budget=high          Info Gain     Rule: ___ -> budget=high           Info Gain
smoker=true                        0.0862       marital_status=single               0.8889
smoker=false                       0.07365      interest=technology                 3.6049
drink_level=abstemious             2.0974       interest=none                      -0.1203
drink_level=casual drinker        -0.7680       interest=variety                    3.6049
drink_level=social drinker        -0.5353       personality=hard-worker            -1.1441
dress_preference=no preference     0.1174       personality=conformist              1.9792
dress_preference=informal         -0.3426       personality=hunter-ostentatious     1.2016
dress_preference=formal           -0.5710       personality=thrifty-protector      -0.1428
ambience=family                   -0.6854       religion=none                      -0.1203
ambience=friends                   2.5440       religion=mormon                     4.7866
transport=car owner                6.7865       religion=christian                  1.9792
transport=public                  -1.5710       activity=student                   -0.1343

RIPPER: First Rule

• Since the following rule results in the highest information gain, we select its condition as the first condition of our rule:
  transport=car owner -> budget=high
• Now we use the numbers of positive and negative instances covered by this rule as the new baseline counts, and we calculate the gain of every possible second condition in the next set of calculations.

RIPPER: First Rule
• Next we attempt to find the second condition in the antecedent. We need only look at possible conditions that exist in the 4 instances that have transport=car owner and budget=high.
• The list of possible conditions is in the table below.

Rule: transport=car owner and ___ -> budget=high

smoker=false                  ambience=friends                  personality=thrifty-protector
drink_level=abstemious        marital_status=single             religion=none
drink_level=casual drinker    interest=technology               religion=mormon
[missing]                     interest=none                     religion=christian
dress_preference=informal     interest=variety                  activity=student
dress_preference=elegant      personality=hard-worker
ambience=family               personality=hunter-ostentatious

RIPPER: First Rule
• Here we see a list of the information gain for each possible second condition in the antecedent.

Rule: transport=car owner and ___ -> budget=high

Condition                     Info Gain     Condition                          Info Gain
smoker=false                   2.5121       interest=none                       0.0875
drink_level=abstemious         5.0173       interest=variety                    2.5602
drink_level=casual drinker    -0.6130       personality=hard-worker            -1.1605
[missing]                     -0.06097      personality=hunter-ostentatious     0.7655
dress_preference=informal      0.7655       personality=thrifty-protector       1.5311
dress_preference=elegant       3.0875       religion=none                      -0.0824
ambience=family               -0.6130       religion=mormon                     3.0875
ambience=friends               1.5075       religion=christian                  3.0875
marital_status=single          2.7570       activity=student                   -0.0840
interest=technology            2.5602

RIPPER: First Rule

• Since the following rule results in the highest information gain, we select its last condition as the second condition of our rule:
  transport=car owner and drink_level=abstemious -> budget=high
• Now we use the numbers of positive and negative instances covered by this rule as the new baseline counts, and we calculate the gain of every possible third condition in the next set of calculations.

RIPPER: First Rule
• Next we attempt to find the third condition in the antecedent. We need only look at possible conditions that exist in the 3 instances that have transport=car owner and drink_level=abstemious and budget=high.
• The list of possible conditions is in the table below.

Rule: transport=car owner and drink_level=abstemious and ___ -> budget=high

smoker=false                     interest=technology               personality=thrifty-protector
dress_preference=no preference   interest=none                     religion=none
dress_preference=formal          interest=variety                  religion=catholic
ambience=family                  personality=hard-worker           religion=christian
ambience=friends                 personality=hunter-ostentatious   activity=student
marital_status=single

RIPPER: First Rule
• Here we see a list of the information gain for each possible third condition in the antecedent.

Rule: transport=car owner and drink_level=abstemious and ___ -> budget=high

Condition                         Info Gain     Condition                          Info Gain
smoker=false                       0            interest=variety                    0.4515
dress_preference=no preference    -0.3399       personality=hard-worker            -0.5850
dress_preference=formal            1.4513       personality=hunter-ostentatious     1.4150
ambience=family                   -0.5850       personality=thrifty-protector      -0.1699
ambience=friends                   2.8300       religion=none                      -0.1699
marital_status=single              1.2415       religion=catholic                  -0.5850
interest=technology                0.4515       religion=christian                  1.4150
interest=none                     -0.5850       activity=student                    0.01826

RIPPER: First Rule

• Since the following rule results in the highest information gain, we select its last condition as the third condition of our rule:
  transport=car owner and drink_level=abstemious and ambience=friends -> budget=high
• Note that this rule covers only positive examples (i.e., budget=high data instances). Since it doesn’t cover any negative examples, there is no need to add more conditions to the rule. RIPPER’s construction of the first rule is now complete.
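The whole rule-growing loop (pick the highest-gain condition, narrow the covered instances, stop once no negatives remain) can be sketched as follows. This is a minimal illustration and not the author's implementation: the six-row dataset is invented toy data, and the names `foil_gain` and `grow_rule` are hypothetical.

```python
import math

def foil_gain(p0, n0, p1, n1):
    # FOIL-style gain; returns 0 when the condition covers no positives.
    if p1 == 0:
        return 0.0
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def grow_rule(rows, target_attr, target_val):
    """Greedily add conditions until the rule covers no negative instances."""
    covered, conditions = list(rows), []
    while any(r[target_attr] != target_val for r in covered):
        p0 = sum(r[target_attr] == target_val for r in covered)
        n0 = len(covered) - p0
        # Candidate conditions come only from the positives still covered.
        candidates = sorted({
            (attr, row[attr])
            for row in covered if row[target_attr] == target_val
            for attr in row
            if attr != target_attr and (attr, row[attr]) not in conditions
        })
        best, best_gain = None, float("-inf")
        for attr, val in candidates:
            subset = [r for r in covered if r[attr] == val]
            p1 = sum(r[target_attr] == target_val for r in subset)
            gain = foil_gain(p0, n0, p1, len(subset) - p1)
            if gain > best_gain:
                best, best_gain = (attr, val), gain
        if best is None:
            break  # no conditions left to try
        conditions.append(best)
        covered = [r for r in covered if r[best[0]] == best[1]]
    return conditions

# Invented toy data (NOT the userprofile dataset).
toy_rows = [
    {"transport": "car owner", "drink_level": "abstemious", "budget": "high"},
    {"transport": "car owner", "drink_level": "abstemious", "budget": "high"},
    {"transport": "car owner", "drink_level": "casual drinker", "budget": "medium"},
    {"transport": "public", "drink_level": "abstemious", "budget": "medium"},
    {"transport": "public", "drink_level": "abstemious", "budget": "low"},
    {"transport": "public", "drink_level": "casual drinker", "budget": "low"},
]
rule = grow_rule(toy_rows, "budget", "high")
print(rule)
```

On the toy data the loop first picks transport=car owner and then drink_level=abstemious, at which point the rule covers only budget=high rows and the loop stops.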

RIPPER: Pruning the First Rule

First rule: transport=car owner and drink_level=abstemious and ambience=friends -> budget=high

In order to decide if/how to prune this rule, RIPPER will:
• use a validation set (that is, a piece of the training set that was kept apart and not used to construct the rule)
• use a metric for pruning: v = (p-n)/(p+n), where
  • p: # of positive examples covered by the rule in the validation set
  • n: # of negative examples covered by the rule in the validation set
• use a pruning method: delete the final sequence of conditions whose removal maximizes v. That is, it calculates v for each of the following pruned versions of the rule and keeps the version of the rule with maximum v:
  • transport=car owner & drink_level=abstemious & ambience=friends -> budget=high
  • transport=car owner & drink_level=abstemious -> budget=high
  • transport=car owner -> budget=high
  • -> budget=high
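The pruning step can be sketched as below. The validation-set counts are hypothetical placeholders chosen only to illustrate the comparison (the solution does not report them), and the names `prune_value`, `validation_counts` and `pruned_rule` are assumptions.

```python
def prune_value(p, n):
    """Pruning metric v = (p - n) / (p + n) on the validation set."""
    if p + n == 0:
        return float("-inf")  # a rule that covers nothing is never preferred
    return (p - n) / (p + n)

conditions = ["transport=car owner", "drink_level=abstemious", "ambience=friends"]

# HYPOTHETICAL (p, n) validation counts, keyed by how many conditions are kept.
validation_counts = {3: (1, 1), 2: (2, 0), 1: (2, 6), 0: (3, 40)}

scores = {k: prune_value(p, n) for k, (p, n) in validation_counts.items()}
best_length = max(scores, key=scores.get)
pruned_rule = conditions[:best_length]

print(scores)
print(" & ".join(pruned_rule), "-> budget=high")
```

With these made-up counts the two-condition version has the highest v, so the final condition would be dropped; with real validation data the outcome may of course differ.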

Homework 4 Solutions
ASSOCIATION RULES: APRIORI ALGORITHM

Apriori: Level 1
• We begin the Apriori algorithm by determining the order of the items:
  • Here I will use the order in which the attributes appear, with the values for each attribute in alphabetical order.
• Then all the possible single-item itemsets are generated and the support is calculated for each one.
  • The following slide shows the complete list of possible items.
• Support is calculated in the following manner:

$$\text{support}(X) = \frac{\text{number of instances that contain itemset } X}{\text{total number of instances}}$$

• Since we know the minimum acceptable support count is 55, we need only look at the numerator of this ratio to determine whether or not to keep this item.

Apriori: Level 1

Candidate Itemsets with Support Count

smoker=false                 109    transport=on foot                  14    religion=christian         7
smoker=true                   26    transport=public                   82    religion=jewish            1
drink_level=abstemious        51    marital_status=single             122    religion=mormon            1
drink_level=casual drinker    47    marital_status=married             10    religion=none             30
drink_level=social drinker    40    interest=eco-friendly              16    activity=professional     15
dress_preference=elegant       4    interest=none                      30    activity=student         113
dress_preference=formal       41    interest=technology                36    activity=unemployed        2
[missing]                     53    interest=variety                   50    activity=working-class     1
[missing]                     35    personality=conformist              7    budget=high                5
ambience=family               70    personality=hard-worker            61    budget=low                35
ambience=friends              46    personality=hunter-ostentatious    12    budget=medium             91
ambience=solitary             16    personality=thrifty-protector      58
transport=car owner           34    religion=catholic                  99

• We keep the itemsets that meet the minimum support threshold (a support count of at least 55).
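The Level 1 filter can be sketched as a simple threshold on the support counts. The counts below are copied from the candidate table; the two entries whose item names are missing from the table (counts 53 and 35) are left out, which does not affect the result because both are below the threshold. The names `MIN_SUPPORT_COUNT`, `support_counts` and `frequent_1` are hypothetical.

```python
MIN_SUPPORT_COUNT = 55

# Support counts of the single items, copied from the table above.
support_counts = {
    "smoker=false": 109, "smoker=true": 26,
    "drink_level=abstemious": 51, "drink_level=casual drinker": 47,
    "drink_level=social drinker": 40,
    "dress_preference=elegant": 4, "dress_preference=formal": 41,
    "ambience=family": 70, "ambience=friends": 46, "ambience=solitary": 16,
    "transport=car owner": 34, "transport=on foot": 14, "transport=public": 82,
    "marital_status=single": 122, "marital_status=married": 10,
    "interest=eco-friendly": 16, "interest=none": 30,
    "interest=technology": 36, "interest=variety": 50,
    "personality=conformist": 7, "personality=hard-worker": 61,
    "personality=hunter-ostentatious": 12, "personality=thrifty-protector": 58,
    "religion=catholic": 99, "religion=christian": 7, "religion=jewish": 1,
    "religion=mormon": 1, "religion=none": 30,
    "activity=professional": 15, "activity=student": 113,
    "activity=unemployed": 2, "activity=working-class": 1,
    "budget=high": 5, "budget=low": 35, "budget=medium": 91,
}

# Keep only the items whose support count reaches the threshold.
frequent_1 = {item: n for item, n in support_counts.items() if n >= MIN_SUPPORT_COUNT}

for item, n in frequent_1.items():
    print(item, n)
```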

Apriori: Level 1

Itemsets with Support Count
smoker=false                     109
ambience=family                   70
transport=public                  82
marital_status=single            122
personality=hard-worker           61
personality=thrifty-protector     58
religion=catholic                 99
activity=student                 113
budget=medium                     91

• We keep these itemsets as they have enough support, and we use them to generate the candidate itemsets for the next level.

Apriori: Level 2
• We merge pairs from the Level 1 set. Since there are no prefixes here, we must consider all combinations.

Candidate Itemsets with Support Count
smoker=false, ambience=family                                    59
smoker=false, transport=public                                   69
smoker=false, marital_status=single                              98
smoker=false, personality=hard-worker                            49
smoker=false, personality=thrifty-protector                      48
smoker=false, religion=catholic                                  79
smoker=false, activity=student                                   90
smoker=false, budget=medium                                      75
ambience=family, transport=public                                46
ambience=family, marital_status=single                           63
ambience=family, personality=hard-worker                         26
ambience=family, personality=thrifty-protector                   33
ambience=family, religion=catholic                               57
ambience=family, activity=student                                61
ambience=family, budget=medium                                   54
transport=public, marital_status=single                          76
transport=public, personality=hard-worker                        28
transport=public, personality=thrifty-protector                  44
transport=public, religion=catholic                              62
transport=public, activity=student                               71
transport=public, budget=medium                                  54
marital_status=single, personality=hard-worker                   52
marital_status=single, personality=thrifty-protector             51
marital_status=single, religion=catholic                         91
marital_status=single, activity=student                         107
marital_status=single, budget=medium                             79
personality=hard-worker, personality=thrifty-protector            0
personality=hard-worker, religion=catholic                       40
personality=hard-worker, activity=student                        46
personality=hard-worker, budget=medium                           40
personality=thrifty-protector, religion=catholic                 45
personality=thrifty-protector, activity=student                  50
personality=thrifty-protector, budget=medium                     41
religion=catholic, activity=student                              84
religion=catholic, budget=medium                                 67
activity=student, budget=medium                                  71
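Generating the Level 2 candidates amounts to taking every pair of frequent Level 1 items. A minimal sketch using `itertools.combinations` from the Python standard library is shown below; the variable names are hypothetical.

```python
from itertools import combinations

# Frequent Level 1 items, in the attribute order used throughout this solution.
level1 = [
    "smoker=false", "ambience=family", "transport=public",
    "marital_status=single", "personality=hard-worker",
    "personality=thrifty-protector", "religion=catholic",
    "activity=student", "budget=medium",
]

# With no shared prefix to match, every unordered pair is a candidate.
level2_candidates = list(combinations(level1, 2))

print(len(level2_candidates))
print(level2_candidates[0])
```

Nine frequent items give 9 * 8 / 2 = 36 candidate pairs, which matches the number of rows in the candidate table.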

Apriori: Level 2

Itemsets with Support Count
smoker=false, ambience=family               59
smoker=false, transport=public              69
smoker=false, marital_status=single         98
smoker=false, religion=catholic             79
smoker=false, activity=student              90
smoker=false, budget=medium                 75
ambience=family, marital_status=single      63
ambience=family, religion=catholic          57
ambience=family, activity=student           61
transport=public, marital_status=single     76
transport=public, religion=catholic         62
transport=public, activity=student          71
marital_status=single, religion=catholic    91
marital_status=single, activity=student    107
marital_status=single, budget=medium        79
religion=catholic, activity=student         84
religion=catholic, budget=medium            67
activity=student, budget=medium             71


Apriori: Level 3
• We generate the next level of candidate sets, but before we calculate the support we can use the Apriori principle to determine if they are viable candidates.

Itemsets from Level 2
smoker=false, ambience=family
smoker=false, transport=public
smoker=false, marital_status=single
smoker=false, religion=catholic
smoker=false, activity=student
smoker=false, budget=medium
ambience=family, marital_status=single
ambience=family, religion=catholic
ambience=family, activity=student
transport=public, marital_status=single
transport=public, religion=catholic
transport=public, activity=student
marital_status=single, religion=catholic
marital_status=single, activity=student
marital_status=single, budget=medium
religion=catholic, activity=student
religion=catholic, budget=medium
activity=student, budget=medium

Apriori: Level 3
• First we determine the candidates by “joining” itemsets with like prefixes (i.e., the first k-1 items in the itemsets are the same).
• Here we need only match the first item in the itemset.
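The prefix join can be sketched as follows. To keep the example short it uses only six of the Level 2 itemsets (those starting with ambience=family or transport=public); the function name `apriori_join` is hypothetical, and the code assumes the itemsets are stored as tuples in the attribute order used in this solution.

```python
def apriori_join(itemsets):
    """Join itemsets that share all items but the last into larger candidates.

    `itemsets` must be a list of equal-length tuples kept in a consistent
    item order, so that matching prefixes line up.
    """
    candidates = []
    for i, first in enumerate(itemsets):
        for second in itemsets[i + 1:]:
            if first[:-1] == second[:-1]:          # same prefix
                candidates.append(first + (second[-1],))
    return candidates

# A subset of the Level 2 itemsets, for illustration only.
level2_subset = [
    ("ambience=family", "marital_status=single"),
    ("ambience=family", "religion=catholic"),
    ("ambience=family", "activity=student"),
    ("transport=public", "marital_status=single"),
    ("transport=public", "religion=catholic"),
    ("transport=public", "activity=student"),
]

level3_candidates = apriori_join(level2_subset)
for candidate in level3_candidates:
    print(", ".join(candidate))
```

Each group of three itemsets with a common first item yields three candidates of size 3, so this subset produces six candidates.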


Apriori: Level 3
• That results in this set of potential candidate itemsets.

Potential Candidate Itemsets
smoker=false, ambience=family, transport=public
smoker=false, ambience=family, marital_status=single
smoker=false, ambience=family, religion=catholic
smoker=false, ambience=family, activity=student
smoker=false, ambience=family, budget=medium
smoker=false, transport=public, marital_status=single
smoker=false, transport=public, religion=catholic
smoker=false, transport=public, activity=student
smoker=false, transport=public, budget=medium
smoker=false, marital_status=single, religion=catholic
smoker=false, marital_status=single, activity=student
smoker=false, marital_status=single, budget=medium
smoker=false, activity=student, budget=medium
ambience=family, marital_status=single, religion=catholic
ambience=family, marital_status=single, activity=student
ambience=family, religion=catholic, activity=student
transport=public, marital_status=single, religion=catholic
transport=public, marital_status=single, activity=student
transport=public, religion=catholic, activity=student
marital_status=single, religion=catholic, activity=student
marital_status=single, religion=catholic, budget=medium
marital_status=single, activity=student, budget=medium
religion=catholic, activity=student, budget=medium

Apriori: Level 3
• We have one final step before calculating the support: we can eliminate unnecessary candidates. We must check that all subsets of size 2 in each of these itemsets also exist in the Level 2 set. We can make this a little easier by ignoring the prefix subsets, as we know those exist because we used them to create the itemsets.
• The following itemsets can be removed, as the subsets shown in parentheses do not appear in the Level 2 itemsets. This leaves us the candidate itemsets on the next slide.

Candidate Itemsets That Can Be Removed
smoker=false, ambience=family, transport=public    (ambience=family, transport=public)
smoker=false, ambience=family, budget=medium       (ambience=family, budget=medium)
smoker=false, transport=public, budget=medium      (transport=public, budget=medium)
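The subset check can be sketched with `itertools.combinations`. The set `level2_frequent` below holds only the handful of Level 2 itemsets needed for the two example calls, and the function name `has_infrequent_subset` is hypothetical.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """True if some subset one item smaller than `candidate` is not frequent."""
    size = len(candidate) - 1
    return any(subset not in frequent_prev
               for subset in combinations(candidate, size))

# A few frequent Level 2 itemsets (a subset, enough for the examples below).
level2_frequent = {
    ("smoker=false", "ambience=family"),
    ("smoker=false", "transport=public"),
    ("smoker=false", "marital_status=single"),
    ("smoker=false", "religion=catholic"),
    ("marital_status=single", "religion=catholic"),
}

removed = has_infrequent_subset(
    ("smoker=false", "ambience=family", "transport=public"), level2_frequent)
kept = not has_infrequent_subset(
    ("smoker=false", "marital_status=single", "religion=catholic"), level2_frequent)

print(removed, kept)
```

The first candidate is pruned because (ambience=family, transport=public) has a support count of 46 and is therefore not in Level 2; the second survives because all three of its pairs are frequent.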


Candidate Itemsets with Support Count
smoker=false, ambience=family, marital_status=single           53
smoker=false, ambience=family, religion=catholic               46
smoker=false, ambience=family, activity=student                52
smoker=false, transport=public, marital_status=single          63
smoker=false, transport=public, religion=catholic              52
smoker=false, transport=public, activity=student               58
smoker=false, marital_status=single, religion=catholic         72
smoker=false, marital_status=single, activity=student          85
smoker=false, marital_status=single, budget=medium             65
smoker=false, activity=student, budget=medium                  58
ambience=family, marital_status=single, religion=catholic      50
ambience=family, marital_status=single, activity=student       57
ambience=family, religion=catholic, activity=student           51
transport=public, marital_status=single, religion=catholic     57
transport=public, marital_status=single, activity=student      67
transport=public, religion=catholic, activity=student          59
marital_status=single, religion=catholic, activity=student     80
marital_status=single, religion=catholic, budget=medium        80
marital_status=single, activity=student, budget=medium         59
religion=catholic, activity=student, budget=medium             53

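Apriori: Level 3

Level 3 Itemsets with Support Count
smoker=false, transport=public, marital_status=single          63
smoker=false, transport=public, activity=student               58
smoker=false, marital_status=single, religion=catholic         72
smoker=false, marital_status=single, activity=student          85
smoker=false, marital_status=single, budget=medium             65
smoker=false, activity=student, budget=medium                  58
ambience=family, marital_status=single, activity=student       57
transport=public, marital_status=single, religion=catholic     57
transport=public, marital_status=single, activity=student      67
transport=public, religion=catholic, activity=student          59
marital_status=single, religion=catholic, activity=student     80
marital_status=single, religion=catholic, budget=medium        80
marital_status=single, activity=student, budget=medium         59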


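Apriori: Level 4
• We generate the next level of candidate sets, but before we calculate the support we can use the Apriori principle to determine if they are viable candidates.

Level 3 Itemsets
smoker=false, transport=public, marital_status=single
smoker=false, transport=public, activity=student
smoker=false, marital_status=single, religion=catholic
smoker=false, marital_status=single, activity=student
smoker=false, marital_status=single, budget=medium
smoker=false, activity=student, budget=medium
ambience=family, marital_status=single, activity=student
transport=public, marital_status=single, religion=catholic
transport=public, marital_status=single, activity=student
transport=public, religion=catholic, activity=student
marital_status=single, religion=catholic, activity=student
marital_status=single, religion=catholic, budget=medium
marital_status=single, activity=student, budget=medium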


Apriori: Level 4
• First we determine the candidates by “joining” itemsets with like prefixes (i.e., the first k-1 items in the itemsets match).
• Here we need only match the first two items in the itemset.

Apriori: Level 4

Potential Candidate Itemsets
smoker=false, transport=public, marital_status=single, activity=student
smoker=false, marital_status=single, religion=catholic, activity=student
smoker=false, marital_status=single, religion=catholic, budget=medium
smoker=false, marital_status=single, activity=student, budget=medium
transport=public, marital_status=single, religion=catholic, activity=student
marital_status=single, religion=catholic, activity=student, budget=medium

• That results in this set of candidate itemsets.
• We have one final step before calculating the support: we can eliminate unnecessary candidates. We must check that all subsets of size 3 in each of these itemsets also exist in the Level 3 set. We can make this a little easier by ignoring the prefix subsets, as we know those exist because we used them to create the itemsets.
• Here we again eliminate candidates from consideration when one of their subsets is missing from Level 3.

Apriori: Level 4

Candidate Itemsets with Support Count
smoker=false, marital_status=single, religion=catholic, activity=student    63
[missing], activity=student, budget=medium                                  53

• In the end we keep only a single itemset that has enough support for this level.
• The following slide depicts the complete set of frequent itemsets.

Level 4 Itemsets with Support Count
smoker=false, marital_status=single, religion=catholic, activity=student    63

Apriori: Complete Itemset

Itemsets with Support Count
smoker=false                                                              109
ambience=family                                                            70
transport=public                                                           82
marital_status=single                                                     122
personality=hard-worker                                                    61
personality=thrifty-protector                                              58
religion=catholic                                                          99
activity=student                                                          113
budget=medium                                                              91
smoker=false, ambience=family                                              59
smoker=false, transport=public                                             69
smoker=false, marital_status=single                                        98
smoker=false, religion=catholic                                            79
smoker=false, activity=student                                             90
smoker=false, budget=medium                                                75
ambience=family, marital_status=single                                     63
ambience=family, religion=catholic                                         57
ambience=family, activity=student                                          61
transport=public, marital_status=single                                    76
transport=public, religion=catholic                                        62
transport=public, activity=student                                         71
marital_status=single, religion=catholic                                   91
marital_status=single, activity=student                                   107
marital_status=single, budget=medium                                       79
religion=catholic, activity=student                                        84
religion=catholic, budget=medium                                           67
activity=student, budget=medium                                            71
smoker=false, transport=public, marital_status=single                      63
smoker=false, transport=public, activity=student                           58
smoker=false, marital_status=single, religion=catholic                     72
smoker=false, marital_status=single, activity=student                      85
smoker=false, marital_status=single, budget=medium                         65
smoker=false, activity=student, budget=medium                              58
ambience=family, marital_status=single, activity=student                   57
transport=public, marital_status=single, religion=catholic                 57
transport=public, marital_status=single, activity=student                  67
transport=public, religion=catholic, activity=student                      59
marital_status=single, religion=catholic, activity=student                 80
marital_status=single, religion=catholic, budget=medium                    80
marital_status=single, activity=student, budget=medium                     59
smoker=false, marital_status=single, religion=catholic, activity=student   63

Rule Construction

Largest itemset. Let’s call this itemset I4:

I4: smoker=false, marital_status=single, religion=catholic, activity=student

Rules constructed from I4 with 2 items in the antecedent:

R1: smoker=false, marital_status=single -> religion=catholic, activity=student
    conf(R1) = supp(I4)/supp(smoker=false, marital_status=single) = 63/98 = 64.29%
R2: smoker=false, religion=catholic -> marital_status=single, activity=student
    conf(R2) = supp(I4)/supp(smoker=false, religion=catholic) = 63/79 = 79.75%
R3: smoker=false, activity=student -> marital_status=single, religion=catholic
    conf(R3) = supp(I4)/supp(smoker=false, activity=student) = 63/90 = 70%
R4: marital_status=single, religion=catholic -> smoker=false, activity=student
    conf(R4) = supp(I4)/supp(marital_status=single, religion=catholic) = 63/91 = 69.23%
R5: marital_status=single, activity=student -> smoker=false, religion=catholic
    conf(R5) = supp(I4)/supp(marital_status=single, activity=student) = 63/107 = 58.88%
R6: religion=catholic, activity=student -> smoker=false, marital_status=single
    conf(R6) = supp(I4)/supp(religion=catholic, activity=student) = 63/84 = 75%
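The confidence values can be recomputed from the support counts listed in this solution. The sketch below enumerates every 2-item antecedent of I4 and divides supp(I4) by the support of the antecedent; the variable names are hypothetical.

```python
from itertools import combinations

I4 = ("smoker=false", "marital_status=single", "religion=catholic", "activity=student")
supp_I4 = 63

# Level 2 support counts for the pairs drawn from I4.
pair_support = {
    ("smoker=false", "marital_status=single"): 98,
    ("smoker=false", "religion=catholic"): 79,
    ("smoker=false", "activity=student"): 90,
    ("marital_status=single", "religion=catholic"): 91,
    ("marital_status=single", "activity=student"): 107,
    ("religion=catholic", "activity=student"): 84,
}

confidences = {}
for antecedent in combinations(I4, 2):
    consequent = tuple(item for item in I4 if item not in antecedent)
    confidences[antecedent] = supp_I4 / pair_support[antecedent]
    print(f"{', '.join(antecedent)} -> {', '.join(consequent)}: "
          f"{100 * confidences[antecedent]:.2f}%")
```

Rules with one or three items in the antecedent could be scored the same way by looking up the support of the corresponding Level 1 or Level 3 itemset.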
