1
Data Mining
Chapter 4, Part 1
Algorithms: The Basic Methods
Kirk Scott
2
Dendrogramma, a New Phylum
3
• Did you have any idea that symbols like these could be inserted into PowerPoint presentations?
• Ѿ• ҉ • ҈• ۞• ۩• ҂
4
Basic Methods
• A good rule of thumb is to try the simple things first
• Quite frequently the simple things will be good enough or will provide useful insights for further explorations
• One of the meta-tasks of data mining is figuring out which algorithm is the right one for a given data set
5
• Certain data sets have a certain structure
• Certain algorithms are designed to elicit particular kinds of structures
• The right algorithm applied to the right set will give straightforward results
• A mismatch between algorithm and data set will give complicated, cloudy results
6
• This is just another thing where you have to accept the fact that it’s magic, or a guessing game
• Or you could call it exploratory research
• Based on experience, you might have some idea of which data mining algorithm might be right for a given data set
• Otherwise, you just start trying them and seeing what kind of results you get
7
• Chapter 4 is divided into 8 basic algorithm descriptions plus a 9th topic
• The first four algorithms are covered by this set of overheads
• They are listed on the following overhead
8
• 4.1 Inferring Rudimentary Rules
• 4.2 Statistical Modeling
• 4.3 Divide and Conquer: Constructing Decision Trees
• 4.4 Covering Algorithms: Constructing Rules
9
4.1 Inferring Rudimentary Rules
• The 1R (1-rule) approach
• Given a training set with correct classifications, make a one-level decision tree based on one attribute
• Each value of the attribute generates a branch
• Each branch contains the collection of instances that have that attribute value
10
• For each collection of instances belonging to an attribute value, count up the number of occurrences of each classification
• Let the predicted classification for that branch be the classification that has the greatest number of occurrences in the training set
11
• Do this for each attribute
• Out of all of the trees generated, pick the one that has the lowest error rate
• The error rate is the count of the total number of misclassified instances across the training set for the rule set
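The 1R procedure described above is simple enough to sketch directly. The following is a minimal illustration in Python (not from the book; the function name and data layout are my own):

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, class_index):
    """1R: for each attribute, build a one-level rule set and keep
    the one with the fewest misclassifications on the training set.
    `instances` is a list of tuples; `class_index` locates the label."""
    best = None
    for attr in attributes:
        # Tally class counts for each value of this attribute.
        counts = defaultdict(Counter)
        for inst in instances:
            counts[inst[attr]][inst[class_index]] += 1
        # Predict the majority class on each branch; sum the errors.
        rules = {val: c.most_common(1)[0][0] for val, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best  # (attribute index, {value: class}, error count)
```

Applied to a training set, this returns whichever attribute’s one-level tree misclassifies the fewest training instances, along with its rules.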
12
• This simple approach frequently works well
• This suggests that for a lot of data sets, one dominant attribute is a strong determinant
13
Missing Values
• Missing values are easily handled by 1R
• “Missing” is just one of the branches in the decision tree
• 1R is fundamentally nominal
14
Numeric Attributes
• If 1R is fundamentally nominal, how do you decide how to branch on a numeric attribute?
• One approach to branching on numerics:
• Sort the numeric instances
• Create a break point everywhere in the sequence that the classification changes
• This partitions the domain
15
Overfitting
• The problem is that you may end up with lots of break points/partitions
• If there are lots of break points, this is counterproductive
• Rather than grouping things into categories, you’re fragmenting them
• This is a sign that you’re overfitting to the existing, individual instances in the data set
16
• In the extreme case, there are as many determinant values as there are classifications
• This is not good
• You’ve essentially determined a 1-1 coding from the attribute in question to the classification
17
• If this happens, instances in the future that do not have these values for the attributes can’t be classified by the system
• They will not fall into any known partition
• This is an extreme case
• It is the classic case of overfitting
18
• The less extreme case is when a model tends to have poor prediction performance due to overfitting to the existing data
• In other words, the model will make predictions, but the error rate is higher than a model that is less tightly fitted to the training data
19
Dealing with Overfitting in 1R
• This is a blanket rule of thumb to deal with the foregoing problem:
• When picking the numeric ranges that define the branches of the tree, specify that a minimum number, n, of instances from the training set have to be in each partition
20
• The ranges are taken from the sorted values in order
• At any point in this scheme, the problem that arises is that the next n instances may include more than one classification
• The solution is to take the majority classification of the n as the rule for that branch
21
• The latest rule of thumb, given above, may result in neighboring partitions with the same classification
• The solution to that is to merge those partitions
• This potentially will reduce the number of partitions significantly
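Taken together, the minimum-bucket rule and the merging step can be sketched as follows. This is my own illustrative Python (function name and return format invented), using the sorted temperature values from the book’s weather data:

```python
from collections import Counter

def discretize_1r(pairs, min_bucket=3):
    """1R's numeric handling, sketched: sort (value, class) pairs,
    cut the sequence into buckets of at least `min_bucket` instances,
    label each bucket with its majority class, then merge neighbouring
    buckets that share a label.  Returns (upper_value, class) pairs."""
    pairs = sorted(pairs)
    buckets = []
    i = 0
    while i < len(pairs):
        j = min(i + min_bucket, len(pairs))
        # Extend past ties so a break never splits equal values.
        while j < len(pairs) and pairs[j][0] == pairs[j - 1][0]:
            j += 1
        chunk = pairs[i:j]
        majority = Counter(c for _, c in chunk).most_common(1)[0][0]
        buckets.append((chunk[-1][0], majority))
        i = j
    # Merge adjacent buckets with the same majority class.
    merged = [buckets[0]]
    for upper, cls in buckets[1:]:
        if cls == merged[-1][1]:
            merged[-1] = (upper, cls)
        else:
            merged.append((upper, cls))
    return merged
```

With min_bucket = 3, the fourteen temperature values collapse to just two partitions, one predicting yes and one predicting no.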
22
• Notice how rough and ready this is
• It’s a series of rules of thumb to fix problems caused by the previous rule of thumb
• You are essentially guaranteed some misclassifications
• However, on the whole, you hope that these heuristics result in a significant proportion of correct classifications
23
Discussion
• 1R is fast and easy
• 1R quite often performs only slightly less well than advanced techniques
• It makes sense to start simple in order to get a handle on a data set
• Go to something more complicated if desired
24
• At the end of this section, the text describes a more complicated kind of 1R
• The details are unimportant
• Their concluding point is that, with experience, an initial examination of the data with simple techniques may give you insight into which more advanced technique might be suitable for it
25
4.2 Statistical Modeling
26
• This is basically a discussion of an application of Bayes’ Theorem
• Bayes’ Theorem makes a statement about what is known as conditional probability
• I will cover the same ideas as the book, but I will do it in a slightly different way
• Whichever explanation makes the most sense to you is the “right” one
27
• The book refers to this approach as Naïve Bayes
• It is based on the simplifying assumption that the attributes in a data set are independent
• Independence isn’t typical
• Otherwise there would be no associations to mine
• Even so, the technique gives good results
28
The Weather Example
• Table 4.2, shown on the following overhead, summarizes the outcome of play yes/no for each weather attribute for all instances in the training set
29
30
• Note this in particular on first viewing:
• 9 total yeses, 5 total nos
• The middle part of the table shows the raw counts of attribute value occurrences in the data set
• The bottom part of the table is the instructive one
31
• Take the Outlook attribute for example:
– Sunny: yes 2/9, no 3/5
– Overcast: yes 4/9, no 0/5
– Rainy: yes 3/9, no 2/5
• Given the outcome, yes/no (the denominator), these fractions tell you the likelihood that there was a given outlook (the numerator)
32
Bayes’ Theorem
• Bayes’ Theorem involves a hypothesis, H, and some evidence, E, relevant to the hypothesis
• The theorem gives a formula for finding the probability that H is true under the condition that you know that E is true
33
• The theorem is based on knowing some probabilistic quantities related to the problem
• It is a statement of conditional probability, which makes use of conditional probabilities
• This is the notation:
• P(A|B) = the probability of A given that B is true
34
• This is a statement of Bayes’ theorem:
• P(H|E) = P(E|H)P(H) / P(E)
35
Illustrating the Application of the Theorem with the Weather Example
• The book does its example with all of the attributes at once
• I will do this with one attribute and then generalize
• I will use the Outlook attribute
• Let H = (play = yes)
• Let E = (outlook = sunny)
36
• Then P(H|E) = P(play = yes | outlook = sunny)
• By Bayes’ Theorem this equals
• P(outlook = sunny | play = yes)P(play = yes) / P(outlook = sunny)
37
• Typing these things into PowerPoint is making me cry
• The following overheads show in words how you can intuitively understand what’s going on with the weather example
• They are based on the idea that you can express the probabilities in the expression in terms of fractions involving counts
38
39
40
• After you’ve simplified like this, it’s apparent that you could do the calculation by just pulling 2 values out of the table
• However, the full formula where you can have multiple different E (weather attributes, for example) is based on using Bayes’ formula with all of the intermediate expressions
41
• Before considering the case with multiple E, here is the simple case using the full Bayes’ formula
• Here are the fractions for the parts:
• P(E|H) = P(outlook = sunny | play = yes) = 2/9
• P(H) = P(play = yes) = 9/14
• P(E) = P(outlook = sunny) = 5/14
42
• Then P(H|E)
• = P(E|H)P(H) / P(E)
• = (2/9 * 9/14) / (5/14) = (2/14) / (5/14) = 2/5 = .4
43
• In other words, there were 2 sunny days when play = yes out of 5 sunny days total
• Using the same approach, you can find P(H|E) where H = (play = no)
• The arithmetic gives this result: .6
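The arithmetic above is easy to check mechanically. A quick sketch in Python, with the counts read from Table 4.2 (variable names are my own):

```python
# Counts from the weather training set (Table 4.2):
yes, no = 9, 5                 # play = yes / play = no
sunny_yes, sunny_no = 2, 3     # sunny days among the yes / no cases
total = yes + no               # 14 instances

p_sunny = (sunny_yes + sunny_no) / total              # P(E) = 5/14

# Bayes' theorem: P(H|E) = P(E|H) P(H) / P(E)
p_yes_given_sunny = (sunny_yes / yes) * (yes / total) / p_sunny
p_no_given_sunny = (sunny_no / no) * (no / total) / p_sunny

print(round(p_yes_given_sunny, 1))  # 0.4
print(round(p_no_given_sunny, 1))   # 0.6
```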
44
Using Bayes’ Theorem to Classify with More Than One Attribute
• The preceding example illustrated Bayes’ Theorem applied to one attribute
• A full Bayesian expression (conditional probability) will be derived including multiple bits of evidence, Ei, one for each of the i attributes
• The totality of evidence, including all attributes would be noted as E
45
• Prediction can be done for a new instance with values for the i attributes
• The fractions from the weather table corresponding to the instance’s attribute values can be plugged into the Bayesian expression
• The result would be a probability, or prediction that play = yes or play = no for that set of attribute values
46
Statistical Independence
• Recall that Naïve Bayes assumed statistical independence of the attributes
• Stated simply, one of the results of statistics is that the probability of two independent events both occurring is the product of their individual probabilities
• This is reflected when forming the expression for the general case
47
A Full Bayesian Example with Four Attributes
• The weather data had four attributes: outlook, temperature, humidity, windy
• Let E be the composite of E1, E2, E3, and E4
• In other words, an instance has values for each of these attributes, and fractional values for each of the attributes can be read from the table based on the training set
48
• Bayes’ Theorem extended to this case looks like this:
• P(E1|H)P(E2|H)P(E3|H)P(E4|H)P(H)
• P(H|E) = ----------------------------------------------• P(E)
49
• Suppose you get a new data item and you’d like to make a prediction based on its attribute values
• Let them be:
• Outlook = sunny
• Temperature = cool
• Humidity = high
• Windy = true
• Let the hypothesis be that Play = yes
50
• Referring back to the original table:
• P(E1|H) = P(sunny | play = yes) = 2/9
• P(E2|H) = P(cool | play = yes) = 3/9
• P(E3|H) = P(high humidity | play = yes) = 3/9
• P(E4|H) = P(windy | play = yes) = 3/9
• P(H) = P(play = yes) = 9/14
51
• The product of the quantities given on the previous overhead is the numerator of the Bayesian expression
• This product = .0053
• Doing the same calculation to find the numerator of the expression for play = no gives the value .0206
52
• In general, you would also be interested in the denominator of the expressions, P[E]
• However, in the case where there are only two alternative predictions, you don’t have to do a separate calculation
• You can arrive at the needed value by other means
53
• The universe of possible values is just yes or no, so you can form the denominator by just adding the two numerator values
• The denominator is .0053 + .0206
• So P[yes|E] = (.0053) / (.0053 + .0206) = 20.5%
• And P[no|E] = (.0206) / (.0053 + .0206) = 79.5%
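Putting the whole calculation in code (a sketch, not the book’s code; the fractions are read straight from the weather table for this instance):

```python
# Per-class likelihoods for the new instance
# (sunny, cool, high humidity, windy):
like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # numerator for yes
like_no = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)    # numerator for no

# Normalize: with only two outcomes, P(E) is just the sum.
p_yes = like_yes / (like_yes + like_no)
p_no = like_no / (like_yes + like_no)

print(round(like_yes, 4), round(like_no, 4))  # 0.0053 0.0206
print(round(p_yes, 3), round(p_no, 3))        # 0.205 0.795
```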
54
• In effect, instead of computing the complete expressions, we’ve normalized them
• However we arrive at the numeric figures, it is their relative values that make the prediction
55
• Given the set of attribute values, it is approximately 4 times more likely that play will have the value no than that play will have the value yes
• Therefore, the prediction, or classification of the instance will be no (with approximately 80% confidence)
56
A Small Problem with the Bayes’ Theorem Formula
• If one of the probabilities in the numerator is 0, the whole numerator goes to 0
• This would happen when the training set did not contain any instances with a particular value for an attribute i, but a new instance did
• You can’t compare yes/no probabilities if they have gone to 0
57
• One solution approach is to add constants to the top and bottom of fractions in the expression
• This can be accomplished without changing the relative yes/no outcome
• I don’t propose to go into this in detail (now)
• To me it seems more appropriate for an advanced discussion later, if needed
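For the curious, the usual form of this correction is the Laplace (or Lidstone) estimator, which adds a small constant to each count. A sketch, with names of my own choosing:

```python
def smoothed_fraction(count, class_total, n_values, mu=1.0):
    """Laplace/Lidstone correction: add mu to the numerator and
    n_values * mu to the denominator, so an attribute value unseen
    in the training data gets a small nonzero probability instead
    of zeroing out the whole Bayesian product."""
    return (count + mu) / (class_total + n_values * mu)

# outlook has 3 values; overcast never occurs with play = no,
# but the smoothed fraction is nonzero:
p = smoothed_fraction(0, 5, 3)   # (0 + 1) / (5 + 3) = 0.125
```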
58
Missing Values
• Missing values for one or more attributes in an instance to be classified are not a problem
• If an attribute value is missing, a fraction for its conditional probability is simply not included in the Bayesian formula
• In other words, you’re only doing prediction based on the attributes that do exist in the instance
59
Numeric Attributes
• The discussion so far has been based on the weather example
• As initially given, all of the attributes are categorical
• It is also possible to handle numeric attributes
• This involves a bit more work, but it’s straightforward
60
• For the purposes of illustration, assume that the distribution of numeric attributes is normal
• In the summary of the training set, instead of forming fractions of value occurrences as for nominal attributes, find the mean and standard deviation for numeric ones
61
• The important thing to remember is that in the earlier example, you did summaries for both the yes and no cases
• For numeric data, you need to find the mean and standard deviation for both the yes and no cases
• These parameters, µ and σ, will appear in the calculation of the parts of the Bayesian expression
62
• This is the normal probability density function (p.d.f.):
• f(x) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²))
• If x is distributed according to this p.d.f., then f(x) measures the relative likelihood (density) of the value x occurring
63
• Let µ and σ be the mean and standard deviation of some attribute for those cases in the training set where the value of play = yes
• Then put the value of x for that attribute into the equation
64
• f(x) is the probability of x, given that the value of play = yes
• In other words, this is P(E|H), the kind of thing in the numerator of the formula for Bayes’ theorem
65
• Now you can plug this into the Bayesian expression just like the fractions for the nominal attributes in the earlier example
• This procedure isn’t proven correct, but based on background knowledge in statistics it seems to make sense
• We’ll just accept it as given in the book and apply it as needed
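As a sketch, the density computation is just the normal p.d.f. evaluated at the attribute value, used in place of a count-based fraction. The figures below (mean 73, standard deviation 6.2 for temperature on the play = yes days, and a new day at 66) are illustrative values consistent with the book’s weather example:

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal p.d.f. at x; stands in for a count-based fraction
    when a Naive Bayes attribute is numeric."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)
            * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)))

# Temperature on yes days: mean 73, standard deviation 6.2.
# For a new day with temperature 66:
f = gaussian_density(66, 73.0, 6.2)   # roughly 0.034
```

The value f plays the role of P(E|H) for the temperature attribute in the Bayesian product.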
66
Naïve Bayes for Document Classification
• The details for this appear in a box in the text, which means it’s advanced and not to be covered in detail in this course
• The basic idea is that documents can be classified by which words appear in them
• The occurrence of a word can be modeled as a Boolean yes/no
67
• The classification can be improved if the frequency of words is also taken into account
• This is the barest introduction to the topic
• It may come up again later in the book
68
Discussion
• Like 1R, Naïve Bayes can often produce good results
• The rule of thumb remains: start simple
• It is true that dependency among attributes is a theoretical problem with Bayesian analysis and can lead to results which aren’t accurate
69
• The presence of dependent attributes means multiple factors for the same feature in the Bayesian expression
• Potentially too much weight will be put on a feature with multiple factors
• One solution is to try and select only a subset of independent attributes to work with as part of preprocessing
70
• For numeric attributes, if they’re not normally distributed, the normal p.d.f. shouldn’t be used
• If the attributes do fall into a known distribution, you can use its p.d.f.
• In the absence of any knowledge, the uniform distribution might be a starting point for an analysis, with statistical analysis revealing the actual distribution
71
4.3 Divide-and-Conquer: Constructing Decision Trees
• Note that like everything else in this course, this is a purely pragmatic presentation
• Ideas will be given
• Nothing will be proven
• The book gives things in a certain order
• I will try to cover pretty much the same things
• I will do it in a different order
72
When Forming a Tree…
• 1. The fundamental question at each level of the tree is always which attribute to split on
• In other words, given attributes x1, x2, x3…, do you branch first on x1 or x2 or x3…?
• Having chosen the first to branch on, which of the remaining ones do you branch on next, and so on?
73
• 2. Suppose you can come up with a function, the information (info) function
• This function is a measure of how much information is needed in order to make a decision at each node in a tree
• 3. You split on the attribute that gives the greatest information gain from level to level
74
• 4. A split is good if it means that little information will be needed at the next level down
• You measure the gain by subtracting the amount of information needed at the next level down from the amount needed at the current level
75
Defining an Information Function
• It is necessary to describe the information function more fully, first informally, then formally
• This is the guiding principle:
• The best split results in branches where the instances in each of the branches are all of the same classification
• The branches are leaves and you’ve arrived at a complete classification
76
• No more splitting is needed
• No more information is needed
• Expressed somewhat formally:
• If the additional information needed is 0, then the information gain from the previous level(s) is 100%
• I.e., whatever information remained to be gained has been gained
77
Developing Some Notation
• Taking part of the book’s example as a starting point:
• Let some node have a total of 9 cases
• Suppose that eventually 2 classify as yes and 7 classify as no
• This notation represents the information function:
• info([2, 7])
78
• info([2, 7])
• The idea is this:
• In practice, in advance we wouldn’t know that 9 would split into 2 and 7
• This is a symbolic way of indicating the information that would be needed to classify the instances into 2 and 7
• At this point we haven’t assigned a value to the expression info([2, 7])
79
• The next step in laying the groundwork of the function is simple arithmetic
• Let p1 = 2/9 and p2 = 7/9
• These fractions represent the proportion of each case out of the total
• They will appear in calculations of information gain
80
Properties Required of the Information Function
• Remember the general description of the information function
• It is a measure of how much information is needed in order to make a decision at each node in a tree
81
• This is a relatively formal description of the characteristics required of the information function
• 1. When a split gives a leaf that’s all one classification, no additional information should be needed at that leaf
• That is, the function applied to the leaf should evaluate to 0
82
• 2. Assuming you’re working with binary attributes, when a split gives a result node that’s exactly half and half in classification, the information needed should be a maximum
• We don’t know the function yet and how it computes a value, but however it does so, the half and half case should generate a maximum value for the information function at that node
83
The Multi-Stage Property
• 3. The function should have what is known as the multi-stage property
• An attribute may not be binary
• If it is not binary, you can accomplish the overall splitting by a series of binary splits
84
• The multi-stage property says that not only should you be able to accomplish the splitting with a series of binary splits
• It should also be possible to compute the information function value of the overall split based on the information function values of the series of binary splits
85
• Here is an example of the multi-stage property
• Let there be 9 cases overall, with 3 different classifications
• Let there be 2, 3, and 4 instances of each of the cases, respectively
86
• This is the multi-stage requirement:
• info([2, 3, 4]) = info([2, 7]) + 7/9 * info([3, 4])
• How to understand this:
• Consider the first term
• The info needed to split [2, 3, 4] includes the full cost of splitting 9 instances into two classifications, [2, 7]
87
• You could rewrite the first term in this way:
• info([2, 7]) = 2/9 * info([2, 7]) + 7/9 * info([2, 7])
• In other words, the cost is apportioned on an instance-by-instance basis
• A decision is computed for each instance
• The first term is “full cost” because it involves all 9 instances
88
• It is the apportionment idea that leads to the second term in the expression overall:
• 7/9 * info([3, 4])
• Splitting [3, 4] involves a particular cost per instance
• After the [2, 7] split is made, the [3, 4] cost is incurred in only 7 out of the 9 total cases in the original problem
89
Reiterating the Multi-Stage Property
• You can summarize this verbally as follows:
• Any split can be arrived at by a series of binary splits
• At each branch there is a per instance cost of computing/splitting
• The total cost at each branch is proportional to the number of cases at that branch
90
What is the Information Function? It is Entropy
• The information function for splitting trees is based on a concept from physics, known as entropy
• In physics (thermodynamics), entropy can be regarded as a measure of how “disorganized” a system is
• For our purposes, an unclassified data set is disorganized, while a classified data set is organized
91
• The book’s information function, based on logarithms, will be presented
• No derivation from physics will be given
• Also, no proof that this function meets the requirements for an information function will be given
92
• For what it’s worth, I tried to show that the function has the desired properties using calculus
• I was not successful
• Kamal Narang looked at the problem and speculated that it could be done as a multi-variable problem rather than a single-variable problem…
93
• I didn’t have the time or energy to pursue the mathematical analysis any further
• We will simply accept that the formula given is what is used to compute the information function
94
• Intuitively, when looking at this, keep in mind the following:
• The information function should be 0 when there is no split to be done
• The information function should be maximum when the split is half and half
• The multi-stage property has to hold
95
Definition of the entropy() Function by Example
• Using the concrete example presented so far, this defines the information function based on a definition of an entropy function:
• info([2, 7])
• = entropy(2/9, 7/9)
• = −2/9 log2(2/9) − 7/9 log2(7/9)
96
• Note that since the logarithms of values <1 are negative, the negative coefficients lead to a positive value overall
97
General Definition of the entropy() Function
• Recall that pi can be used to represent proportional fractions of classification at nodes in general
• Then the info(), entropy() function for 2 classifications can be written:
• -p1 log2(p1) – p2 log2(p2)
98
• For multiple classifications you get:
• −p1 log2(p1) − p2 log2(p2) − … − pn log2(pn)
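The general definition translates directly into code. A minimal Python version (my own sketch; terms with proportion 0 or 1 are skipped, since they contribute nothing):

```python
import math

def entropy(fractions):
    """Information in bits for a node whose classes occur with the
    given proportions; 0 log 0 (and 1 log 1) contribute nothing."""
    return sum(-p * math.log2(p) for p in fractions if 0 < p < 1)

def info(counts):
    """The text's info([2, 7]) notation: convert raw class counts
    to proportions, then apply entropy()."""
    total = sum(counts)
    return entropy([c / total for c in counts])

print(round(info([2, 7]), 3))  # 0.764
print(info([4, 0]))            # 0 — a pure node needs no information
print(info([7, 7]))            # 1.0 — half and half is the maximum
```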
99
Information, Entropy, with the Multi-Stage Property
• Remember that in the entropy version of the information function the pi are fractions
• Let p, q, and r represent fractions where p + q + r = 1
• Then this is the book’s presentation of the formula based on the multi-stage property:
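The formula itself appeared only as an image on the original slide. Reconstructed from the surrounding definitions (p, q, and r are fractions with p + q + r = 1), the standard form is:

```latex
\mathrm{entropy}(p, q, r) = \mathrm{entropy}(p,\, q + r)
  + (q + r)\,\mathrm{entropy}\!\left(\frac{q}{q+r},\, \frac{r}{q+r}\right)
```

This matches the worked example: info([2, 3, 4]) = info([2, 7]) + 7/9 · info([3, 4]), with p = 2/9, q = 3/9, r = 4/9.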
100
Characteristics of the Information, Entropy Function
• Each of the logarithms in the expression is taken on a positive fraction less than one
• The logarithms of these fractions are negative
• The minus signs on the terms of the expression reverse this
• The value of the information function is positive overall
101
• Note also that each term consists of a coefficient fraction multiplied by a logarithm, where the sum of the coefficient fractions is 1
• For two classifications, the expression overall gives a value no greater than 1, with the maximum occurring at the half-and-half split
• (With n classifications the maximum is log2 n, so the value can exceed 1)
102
• Logarithms base 2 are used
• If the properties for the information function hold for base 2 logarithms, they would hold for logarithms of any base
• In a binary world, it’s convenient to use 2
• Incidentally, although we will use decimal numbers, the values of the information function can be referred to as “bits” of information
103
An Example Applying the Information Function to Tree Formation
• The book’s approach is to show you the formation of a tree by splitting and then explain where the information function came from
• My approach has been to tell you about the information function first
• Now I will work through the example, applying it to forming a tree
104
• Start by considering Figure 4.2, shown on the following overhead
• The basic question is this:
• Which of the four attributes is best to branch on?
• If a decision leads to pure branches, that’s the best
• If the branches are not all pure, you use the information function to decide which branching is the best
105
106
• Another way of posing the question is:
• Which branching option gives the greatest information gain?
• 1. Calculate the amount of information needed at the previous level
• 2. Calculate the information needed if you branch on each of the four attributes
• 3. Calculate the information gain by finding the difference
107
• In the following presentation I am not going to show the arithmetic of finding the logs, multiplying by the fractions, and summing up the terms
• I will just present the numerical results given in the book
108
Basic Elements of the Example
• In this example there isn’t literally a previous level
• We are at the first split, deciding which of the 4 attributes to split on
• There are 14 instances
• The end result classification is either yes or no (binary)
• And in the training data set there are 9 yeses and 5 nos
109
The “Previous Level” in the Example
• The first measure, the so-called previous level, is simply a measure of the information needed overall to split 14 instances between 2 categories of 9 and 5 instances, respectively
• info([9, 5])
• = entropy(9/14, 5/14)
• = .940
110
The “Next Level” Branching on the “outlook” Attribute in the Example
• Now consider branching on the first attribute, outlook
• It is a three-valued attribute, so it gives three branches
• You calculate the information needed for each branch
• You multiply each value by the proportion of instances for that branch
• You then add these values up
111
• This sum represents the total information needed after branching
• You subtract the information still needed after branching from the information needed before branching, to arrive at the information gained by branching
112
The Three “outlook” Branches
• Branch 1 gives: info([2, 3]) = entropy(2/5, 3/5) = .971
• Branch 2 gives: info([4, 0]) = entropy(4/4, 0/4) = 0
• Branch 3 gives: info([3, 2]) = entropy(3/5, 2/5) = .971
113
• In total:
• info([2, 3], [4, 0], [3, 2])
• = (5/14)(.971) + (4/14)(0) + (5/14)(.971)
• = .693
• Information gain = .940 − .693 = .247
114
Branching on the Other Attributes
• If you do the same calculations for the other three attributes, you get this:
• Temperature: info gain = .029
• Humidity: info gain = .152
• Windy: info gain = .048
• Branching on outlook gives the greatest information gain
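All four gains can be reproduced with a few lines of Python (a sketch of my own; each branch is given as its [yes, no] tally from the weather table):

```python
import math

def entropy(fracs):
    return sum(-p * math.log2(p) for p in fracs if 0 < p < 1)

def info(counts):
    total = sum(counts)
    return entropy([c / total for c in counts])

def gain(branches):
    """Information gain of a split: info at the root minus the
    size-weighted info of each branch.  `branches` is a list of
    [yes, no] counts, one per attribute value."""
    total = sum(sum(b) for b in branches)
    root = info([sum(b[0] for b in branches), sum(b[1] for b in branches)])
    after = sum(sum(b) / total * info(b) for b in branches)
    return root - after

print(round(gain([[2, 3], [4, 0], [3, 2]]), 3))  # outlook:     0.247
print(round(gain([[2, 2], [4, 2], [3, 1]]), 3))  # temperature: 0.029
print(round(gain([[3, 4], [6, 1]]), 3))          # humidity:    0.152
print(round(gain([[6, 2], [3, 3]]), 3))          # windy:       0.048
```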
115
Tree Formation
• Forming a tree in this way is a “greedy” algorithm
• You split on the attribute with the greatest information gain (outlook)
• You continue recursively with the remaining attributes/levels of the tree
• The desired outcome of this greedy approach is to have as small and simple a tree as possible
116
A Few More Things to Note
• 1. Intuitively you might suspect that outlook is the best choice because one of its branches is pure
• For the overcast outcome, there is no further branching to be done
• Intuition is nice, but you can’t say anything for sure until you’ve done the math
117
• 2. When you do this, the goal is to end up with leaves that are all pure
• Keep in mind that the instances in a training set may not be consistent
• It is possible to end up, after a series of splits, with both yes and no instances in the same leaf node
• It is simply the case that values for the attributes at hand don’t fully determine the classification outcome
118
Highly Branching Attributes
• Recall the following idea:
• It is possible to do data mining and “discover” a 1-1 mapping from an identifier to a corresponding class value
• This is correct information, but you have “overfitted”
• No future instance will have the same identifier, so this is useless for practical classification prediction
119
• A similar problem can arise with trees
• From a node that represented an ID, you’d get a branch for each ID value, and one correctly classified instance in each child
• If such a key attribute existed in a data set, splitting based on information gain as described above would find it
120
• This is because at each ID branch, the resulting leaf would be pure
• It would contain exactly one correctly classified instance
• No more information would be needed for any of the branches, so none would be needed for all of them collectively
121
• Whatever the gain was, it would equal the total information still needed at the previous level
• The information gain would be 100%
• Recall that you start forming the tree by trying to find the best attribute to branch on
• The ID attribute will win every time and no further splitting will be needed
122
A Related Idea
• As noted, a tree based on an ID will have as many leaves as there are instances
• In general, this method for building trees will prefer attributes that have many branches, even if these aren’t ID attributes
• This goes against the grain of the goal of having small, simple trees
123
• The preference for many branches can be informally explained
• The greater the number of branches, the fewer the number of instances per branch, on average
• The fewer the number of instances per branch, the greater the likelihood of a pure branch or nearly pure branch
124
Counteracting the Preference for Many Branches
• The book goes into further computational detail
• I’m ready to simplify
• The basic idea is that instead of calculating the desirability of a split based on information gain alone—
• You calculate the gain ratio of the different branches and choose on that basis
125
• The gain ratio takes into account the number and size of the child nodes a split generates
• What happens, more or less, is that the information gain is divided by a measure of how finely the split fragments the instances (the “split info”, the entropy of the branch sizes)
126
• Once you go down this road, there are further complications to keep in mind
• First of all, this adjustment doesn’t protect you from a split on an ID attribute because it will still win
• Secondly, if you use the gain ratio this can lead to branching on a less desirable attribute
127
• The book cites this rule of thumb:
• Provisionally pick the attribute with the highest gain ratio
• Find the average absolute information gain for branching on each attribute
• Check the absolute information gain of the provisional choice against the average
• Only take the ratio winner if it is greater than the average
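For concreteness, the quantity that the gain ratio divides by is the entropy of the branch sizes themselves. A sketch (the outlook figures are from the running example; the resulting ratio of about 0.157 is my own arithmetic, not quoted from the book):

```python
import math

def entropy(fracs):
    return sum(-p * math.log2(p) for p in fracs if 0 < p < 1)

def gain_ratio(gain, branch_sizes):
    """Divide the raw information gain by the 'split info' -- the
    entropy of the branch sizes.  A split into many small branches
    has high split info, so its ratio is pulled down."""
    total = sum(branch_sizes)
    split_info = entropy([s / total for s in branch_sizes])
    return gain / split_info

# Outlook: gain .247 over branches of 5, 4, and 5 instances:
r = gain_ratio(0.247, [5, 4, 5])   # about 0.157
```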
128
Discussion
• Divide and conquer construction of decision trees is also known as top-down induction of decision trees
• The developers of this scheme have come up with methods for dealing with:
– Numeric attributes
– Missing values
– Noisy data
– The generation of rule sets from trees
129
• For what it’s worth, the algorithms discussed above have names
• ID3 is the name for the basic algorithm
• C4.5 refers to a practical implementation of this algorithm, with improvements, for decision tree induction
130
4.4 Covering Algorithms: Constructing Rules
• Recall, in brief, how top-down tree construction worked:
• Attribute by attribute, pick the one that does the best job of separating instances into distinct classes
• Covering is a converse approach
• It works from the bottom up
131
• The goal is to create a set of classification rules directly, without building a tree
• By definition, a training set with a classification attribute is explicitly grouped into classes
• The goal is to come up with (simple) rules that will correctly classify some new instance that comes along
• I.e., the goal is prediction
132
• The process goes like this:
• Choose one of the classes present in the training data set
• Devise a rule based on one attribute that “covers” only instances of that class
133
• In the ideal case, “cover” would mean:
• You wrote a single rule on a single attribute (a single condition)
• That rule identified all of the instances of one of the classes in a data set
• That rule identified only the instances of that class in the data set
134
• In practice some rule may not identify or “capture” all of the instances of the class
• It may also not capture only instances of the class
• If so, the rule may be refined by adding conditions on other attributes (or other conditions on the attribute already used)
135
Measures of Goodness of Rules
• Recycling some of the terminology for trees, the idea can be expressed this way:
• When you pick a rule, you want its results to be “pure”
• “Pure” in this context is analogous to pure with trees
• The rule, when applied to instances, should ideally give only one classification as a result
136
• The rule should ideally also cover all instances
• You make the rule stronger by adding conditions
• Each addition should improve purity/coverage
137
• When building trees, picking the branching attribute was guided by an objective measure of information gain
• When building rule sets, you also need a measure of the goodness of the covering achieved by rules on different attributes
138
The Rule Building Process in More Detail
• To cut to the chase, evaluating the goodness of a rule comes down to counting
• It doesn’t involve anything fancy like entropy
• You just compare rules based on how well they cover a classification
139
• Suppose you pick some classification value, say Y
• The covering process generates rules of this kind:
• If (the attribute of interest takes on value X)
• Then the classification of the instance is Y
140
Notation
• In order to discuss this with some precision, let this notation be given:
• A given data set will have m attributes
• Let an individual attribute be identified by the subscript i
• Thus, one of the attributes would be identified in this way:
• attribute_i
141
• A given attribute will have n different values
• Let an individual value be identified by the subscript j
• Thus, the specific, jth value for the ith attribute could be identified in this way:
• X_i,j
142
• There could be many different classifications
• However, in describing what’s going on, we’re only concerned with one classification at a time
• There is no need to introduce subscripting on the classifications
• A classification will simply be known as Y
143
• The kind of rule generated by covering could be made more specific and compact:
• If (the attribute of interest takes on value X)
• Then the classification of the instance is Y
• ≡
• if (attribute_i = X_i,j) then classification = Y
144
• Then for some combination of attribute and value:
• attribute_i = X_i,j
• Let t represent the total number of instances in the training set where this condition holds true
• Let p represent the number of those instances that are classified as Y
145
• The ratio p/t is like a success rate for predicting Y with each attribute and value pair
• Notice how this is pretty much like a confidence ratio in association rule mining
• Now you want to pick the best rule to add to your rule set
146
• Find p/t for all of the attributes and values in the data set
• Pick the highest p/t (success) ratio
• Then the condition for the attribute and value with the highest p/t ratio becomes part of the covering rule:
• if (attribute_i = X_i,j) then classification = Y
147
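The search for the best condition follows directly from the counting just described. The sketch below is illustrative, not the book’s code; the tiny data set and the field names (including using the string `"class"` as the classification key) are made up for the example:

```python
def best_condition(instances, attributes, target_class):
    """Return the (attribute, value) pair maximizing p/t, where t counts
    instances with attribute == value and p counts those among them
    classified as target_class. Instances are plain dicts."""
    best, best_ratio = None, -1.0
    for attr in attributes:
        for value in {inst[attr] for inst in instances}:
            covered = [i for i in instances if i[attr] == value]
            t = len(covered)
            p = sum(1 for i in covered if i["class"] == target_class)
            if t > 0 and p / t > best_ratio:
                best, best_ratio = (attr, value), p / t
    return best, best_ratio

data = [{"outlook": "sunny", "windy": "false", "class": "no"},
        {"outlook": "sunny", "windy": "true", "class": "no"},
        {"outlook": "overcast", "windy": "false", "class": "yes"},
        {"outlook": "rainy", "windy": "false", "class": "yes"},
        {"outlook": "rainy", "windy": "true", "class": "no"}]
print(best_condition(data, ["outlook", "windy"], "yes"))
# (('outlook', 'overcast'), 1.0)
```

This brute-force scan is exactly the m × n case count mentioned below: every attribute crossed with every one of its values.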
Costs of Doing This
• Note that it is not complicated to identify all of the cases, but potentially there are many of them to consider
• There are m attributes with a varying number of different values, represented as n
• In general, there are m × n cases
• For each case you have to count t and p
148
Reiteration of the Sources of Error
• Given the rule just devised:
• There may also be instances in the data set where attribute_i = X_i,j
• However, these may be instances that are not classified as Y
149
• This means you’ve got a rule with “impure” results
• You want a high p/t ratio of success
• You want a low absolute t – p count of failures
• In reality, you would prefer to add only rules with a 100% confidence rate to your rule set
150
• This is the second kind of problem with any interim rule:
• There may be instances in the data set where attribute_i <> X_i,j
• However, these may be instances that are classified as Y
• Here, the problem is less severe
• This just means that the rule’s coverage is not complete
151
Refining the Rule
• You refine the rule by repeating the process outlined above
• You just added a rule with this condition: attribute_i = X_i,j
• To that you want to add a new condition that does not involve attribute_i
152
• When counting successes for the new condition, you consider only the instances the current rule already covers, i.e., those where attribute_i = X_i,j holds
• Find p and t for the remaining attributes and their values on that subset
• Pick the attribute-value condition with the highest p/t ratio
• Add this condition to the existing rule using AND
153
• Continue until you’re satisfied
• Or until you’ve run through all of the attributes
• Or until you’ve run out of instances to classify
154
• Remember that what has been described thus far is the process of covering 1 classification
• You repeat the process for all classifications
• Or you do all but one of the classifications and leave the last one as the default case
155
• Suppose you do this exhaustively, completely and explicitly covering every class
• The rules will tend to have many conditions
• If you only added individual rules with p/t = 100% they will also be “perfect”
• In other words, for the training set there will be no ambiguity whatsoever
156
• Compare this with trees
• Successive splitting on attributes from the top down doesn’t guarantee pure (perfect) leaves
• Working from the bottom up you can always devise sufficiently complex rule sets to cover all of the existing classes
157
Rules versus Decision Lists
• The rules derived from the process given above can be applied in any order
• For any one class, it’s true that the rule is composed of multiple conditions which successively classify more tightly
• However, the end result of the process is a single rule with conditions in conjunction
158
• In theory, you could apply these parts of a rule in succession
• That would be the moral equivalent of testing the conditions in order, from left to right
• However, since the conditions are in conjunction, you can test them in any order with the same result
159
• In the same vein, it doesn’t matter which order you handle the separate classes in
• If an instance doesn’t fall into one class, move on and try the next
160
• The derived rules are “perfect” at most for all of the cases in the training set (only)
• It’s possible to get an instance in the future where >1 rule applies or no rule applies
• As usual, the solutions to these problems are of the form, “Assign it to the most frequently occurring…”
161
• The book summarizes the approach given above as a separate-and-conquer algorithm
• The name plays on divide and conquer: each rule separates out the instances it covers
• You then conquer the remainder, working class by class
162
• Within a given class the process is also progressive, step-by-step
• You bite off a chunk with one rule, then bite off the next chunk with another rule, until you’ve eaten everything in the class
• Notice that like with trees, the motivation is “greedy”
• You always take the best p/t ratio first, and then refine from there
163
• For what it’s worth, this algorithm for finding a covering rule set is known as PRISM
164
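Putting the pieces together, the whole separate-and-conquer loop can be sketched as below. This is an illustrative reconstruction of a PRISM-style learner, not the published implementation; the data set and field names are invented for the example:

```python
def learn_rule(instances, attributes, target_class):
    """Grow one rule, a list of (attribute, value) conditions, for
    target_class: greedily add the condition with the best p/t ratio,
    evaluated only on the instances the rule so far still covers."""
    rule, covered = [], list(instances)
    remaining = list(attributes)
    while remaining and any(i["class"] != target_class for i in covered):
        best, best_ratio = None, -1.0
        for attr in remaining:
            for value in {i[attr] for i in covered}:
                subset = [i for i in covered if i[attr] == value]
                p = sum(1 for i in subset if i["class"] == target_class)
                if p / len(subset) > best_ratio:
                    best, best_ratio = (attr, value), p / len(subset)
        attr, value = best
        rule.append((attr, value))
        covered = [i for i in covered if i[attr] == value]
        remaining.remove(attr)
    return rule

def prism(instances, attributes):
    """Separate and conquer: for each class, keep learning rules and
    removing the instances they cover until the class is exhausted."""
    rules = []
    for cls in sorted({i["class"] for i in instances}):
        pool = list(instances)
        while any(i["class"] == cls for i in pool):
            rule = learn_rule(pool, attributes, cls)
            rules.append((cls, rule))
            pool = [i for i in pool
                    if not all(i[a] == v for a, v in rule)]
    return rules

data = [{"outlook": "sunny", "windy": "false", "class": "no"},
        {"outlook": "sunny", "windy": "true", "class": "no"},
        {"outlook": "overcast", "windy": "false", "class": "yes"},
        {"outlook": "rainy", "windy": "false", "class": "yes"},
        {"outlook": "rainy", "windy": "true", "class": "no"}]
for cls, rule in prism(data, ["outlook", "windy"]):
    print(cls, rule)
```

Note the greedy structure at both levels: the inner loop always takes the best p/t condition first, and the outer loop bites off one covered chunk of a class at a time.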
The End