Measuring Association Rules Shan “Maggie” Duanmu Project for CSCI 765 Dec 9 th 2002

Measuring Association Rules

Shan “Maggie” Duanmu

Project for CSCI 765

Dec 9th 2002

Outline

The problems

Our solutions

Work to do

Definitions

Association rule: Association rule mining searches for interesting relationships among items in a given data set. Such relationships are typically expressed as an association rule of the form X=>Y, where X and Y are sets of items. The rule reads: whenever a transaction T contains X, it probably also contains Y.

Metrics: This probability is defined as the percentage of transactions containing Y in addition to X, relative to the number of transactions containing X. It is called confidence (or strength). While confidence represents the certainty of a rule, support represents its usefulness [1]. Formally, the support of a rule is the percentage of transactions containing both X and Y, relative to the number of transactions in the database.

Interesting rules: A rule is considered interesting if its confidence and support exceed certain thresholds. Such thresholds are generally assumed to be given by domain experts.
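The definitions above can be sketched in a few lines of Python. This is only an illustration: the toy transactions, the function names, and the threshold values are assumptions, not part of the original slides.

```python
from typing import Iterable, List, Set


def count_containing(transactions: List[Set[str]], itemset: Iterable[str]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)


def support(transactions: List[Set[str]], x: Set[str], y: Set[str]) -> float:
    """Fraction of all transactions that contain both X and Y."""
    return count_containing(transactions, x | y) / len(transactions)


def confidence(transactions: List[Set[str]], x: Set[str], y: Set[str]) -> float:
    """Fraction of the transactions containing X that also contain Y."""
    n_x = count_containing(transactions, x)
    if n_x == 0:
        raise ValueError("antecedent X never occurs in the data")
    return count_containing(transactions, x | y) / n_x


def is_interesting(transactions, x, y, min_support, min_confidence) -> bool:
    """Support-confidence test; thresholds are assumed to come from a domain expert."""
    return (support(transactions, x, y) >= min_support
            and confidence(transactions, x, y) >= min_confidence)


# Hypothetical toy database of five market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

print(support(transactions, {"bread"}, {"milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
```

Here the rule {bread}=>{milk} is supported by 3 of 5 transactions, and 3 of the 4 transactions containing bread also contain milk.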

The Problems

While the support-confidence framework has been widely used for measuring the interestingness of association rules, it is known that:

1. The resulting rules may be misleading [4-8]. A rule with high support and high confidence may still not indicate that X and Y are dependent.

2. Using support and confidence thresholds for pruning may obscure important rules.

3. Many unimportant rules may remain in the resulting rule set.
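A small numeric sketch of the first problem, using made-up probabilities (the values 0.25, 0.90, and 0.20 are hypothetical, chosen only for illustration). The ratio P(XY)/(P(X)P(Y)) equals 1 when X and Y are independent, so a value below 1 means X actually makes Y less likely, even though the rule passes typical support and confidence thresholds.

```python
def rule_stats(p_x: float, p_y: float, p_xy: float) -> dict:
    """Support, confidence and the independence ratio for the rule X=>Y."""
    return {
        "support": p_xy,                    # P(XY)
        "confidence": p_xy / p_x,           # P(Y|X)
        "ratio": p_xy / (p_x * p_y),        # 1.0 under independence
    }


# Hypothetical data: 25% of transactions contain X, 90% contain Y,
# and 20% contain both.
stats = rule_stats(p_x=0.25, p_y=0.90, p_xy=0.20)

# Confidence is about 0.8, yet P(Y) alone is 0.9: knowing X lowers the
# chance of Y, so the "strong" rule X=>Y is misleading.
print(stats)
```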

Many metrics…

To address the problems with the support-confidence framework, many other metrics have been proposed: interest, conviction, Gini index, Laplace, phi-coefficient, collective strength, reliability, …. So far, we can find at least 21 metrics in the literature. Which should we choose?

P. Tan, V. Kumar, J. Srivastava, “Selecting the Right Interestingness Measure for Association Patterns,” ACM SIGKDD ’02, 2002.
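For orientation, two of the listed metrics can be sketched as follows. The slides do not give formulas, so the definitions below are the commonly cited ones from the literature (interest as P(XY)/(P(X)P(Y)), conviction as P(X)P(not Y)/P(X and not Y)) and should be read as assumptions about what the author intended; the function names are illustrative.

```python
import math


def interest(p_x: float, p_y: float, p_xy: float) -> float:
    """Interest (also called lift): 1.0 when X and Y are independent."""
    return p_xy / (p_x * p_y)


def conviction(p_x: float, p_y: float, p_xy: float) -> float:
    """Conviction: P(X) * P(not Y) / P(X and not Y).

    Returns infinity when X never occurs without Y (the rule always holds).
    """
    p_x_not_y = p_x - p_xy
    if p_x_not_y <= 0:
        return math.inf
    return p_x * (1.0 - p_y) / p_x_not_y


# Independent items: both measures sit at their neutral value of 1.
print(interest(0.5, 0.5, 0.25), conviction(0.5, 0.5, 0.25))
```

Unlike interest, conviction is not symmetric in X and Y, which is why the two behave differently under the implication principle.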

Our Solutions

Six principles plus a partial order, in contrast to the prior total order or partial order of the support-confidence framework:

1. Implication

2. Correlation

3. Novelty

4. Utility

5. Top-N-rules

6. Efficiency

Implication principle

Principle 1 (implication principle): If a set of measures is defined to reflect the interestingness of an association rule X=>Y, then at least one measure m_i(X=>Y) in the set should satisfy the constraint m_i(X=>Y) > m_i(Y=>X) when P(X) < P(Y).
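A quick check of this constraint, assuming the usual definitions of confidence and interest (the probabilities and helper names are hypothetical). Confidence satisfies the constraint because P(XY)/P(X) > P(XY)/P(Y) whenever P(X) < P(Y); interest is symmetric in X and Y, so it cannot.

```python
def confidence(p_ante: float, p_cons: float, p_both: float) -> float:
    """Confidence of antecedent=>consequent; the consequent's marginal is unused."""
    return p_both / p_ante


def interest(p_ante: float, p_cons: float, p_both: float) -> float:
    """Interest of antecedent=>consequent (symmetric in its two marginals)."""
    return p_both / (p_ante * p_cons)


def satisfies_implication(measure, p_x: float, p_y: float, p_xy: float) -> bool:
    """True if measure(X=>Y) > measure(Y=>X); meaningful when P(X) < P(Y)."""
    return measure(p_x, p_y, p_xy) > measure(p_y, p_x, p_xy)


# Hypothetical probabilities with P(X) < P(Y).
p_x, p_y, p_xy = 0.2, 0.5, 0.15

print(satisfies_implication(confidence, p_x, p_y, p_xy))
print(satisfies_implication(interest, p_x, p_y, p_xy))
```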

Correlation principle

Principle 2 (correlation principle): If a set of measures is defined to reflect the interestingness of an association rule X=>Y, then at least one measure m_i(X=>Y) in the set should be directly proportional to the covariance of X and Y.
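To make the covariance concrete, one can treat X and Y as 0/1 indicator variables I_X and I_Y of a transaction. This encoding is an assumption (the slide does not spell it out), and the interest formula is the commonly cited definition rather than one given here.

```latex
\begin{align*}
\operatorname{Cov}(I_X, I_Y) &= E[I_X I_Y] - E[I_X]\,E[I_Y] = P(XY) - P(X)\,P(Y), \\
\operatorname{interest}(X \Rightarrow Y) - 1 &= \frac{P(XY)}{P(X)\,P(Y)} - 1
  = \frac{\operatorname{Cov}(I_X, I_Y)}{P(X)\,P(Y)}.
\end{align*}
```

Under these assumptions, for fixed marginals P(X) and P(Y), interest minus 1 is proportional to the covariance, and independence corresponds to a covariance of 0.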

Novelty principle

Principle 3 (novelty principle): If a set of measures is defined to reflect the interestingness of an association rule X=>Y, then for a given P(XY), at least one measure m_i in the set should reflect its novelty. The novelty measure m_i should be inversely proportional to p = max{P(X), P(Y)}.
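As a rough illustration, the sketch below holds P(XY) and P(X) fixed and lets P(Y) grow, so that p = max{P(X), P(Y)} = P(Y). It uses interest as the candidate novelty measure; both the choice of measure and the probabilities are assumptions made for this example only.

```python
def interest(p_x: float, p_y: float, p_xy: float) -> float:
    """Interest = P(XY) / (P(X) * P(Y)), assumed definition."""
    return p_xy / (p_x * p_y)


# Fixed P(XY) and P(X); P(Y) is at least P(X), so max{P(X), P(Y)} = P(Y).
p_xy, p_x = 0.1, 0.2
trend = [(p_y, interest(p_x, p_y, p_xy)) for p_y in (0.2, 0.4, 0.8)]

# The more common the consequent, the less novel the rule: interest falls.
for p_y, value in trend:
    print(f"P(Y)={p_y:.1f}  interest={value:.3f}")

# Limiting case: Y occurs in every transaction, so X=>Y tells us nothing new.
print(interest(0.3, 1.0, 0.3))
```

In the limiting case P(Y) = 1 the rule holds trivially and interest is exactly 1.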

Utility principle

Principle 4 (utility principle): If a set of measures is defined to reflect the interestingness of an association rule X=>Y, then at least one measure m_i in the set should reflect its utility, i.e., m_i is a monotonically increasing function of P(XY).

Top-N-rule principle

Principle 5 (top-N-rule principle): If a synthetic measure is defined to sort the rules for presenting the top N rules to users, then it is desirable that this measure obey principles 1-4.

Efficiency principle

Principle 6 (efficiency principle): If a set of measures is defined to reflect the interestingness of an association rule, then it is desirable that the thresholds used with those measures help reduce computational complexity.
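One well-known way a threshold can reduce computation is the downward-closure property of support: no superset of an infrequent itemset can be frequent, so candidates can be pruned level by level. The slides do not name an algorithm at this point, so the Apriori-style sketch below (function name, data, and threshold are all hypothetical) is only an illustration of the idea.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_itemsets(transactions: List[Set[str]],
                      min_support: float) -> Dict[FrozenSet[str], float]:
    """Level-wise search that only extends itemsets meeting min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent: Dict[FrozenSet[str], float] = {}
    k = 1
    while current:
        # Count each candidate of size k and keep those meeting the threshold.
        level = {}
        for cand in current:
            sup = sum(1 for t in transactions if cand <= t) / n
            if sup >= min_support:
                level[cand] = sup
        frequent.update(level)

        # Build size k+1 candidates; prune any with an infrequent k-subset.
        k += 1
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        current = list(candidates)
    return frequent


# Hypothetical toy database.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = frequent_itemsets(db, min_support=0.5)
print(sorted("".join(sorted(s)) for s in result))
```

With this data every single item and every pair meets the 0.5 threshold, while {a, b, c} (support 0.4) is counted once and then discarded.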

Partial results

Measures compared: Support, Confidence, Interest, Conviction, Reliability

Implication: X; X (when positively correlated); X (when positively correlated)

Correlation: X; X; X

Novelty: X; X (when negatively related)

Utility: X

A few conclusions

No measure is absolutely better than the others for obtaining the top-N rules.

When using a synthetic measure such as reliability or conviction, support is still an important utility measure. Interest should still be used as a novelty measure in order to fully characterize rules.

Interest not only can be used as a good correlation measure, it also can be used as a good novelty measure. It is always 1 when the rule contains no novel information.

When Interest is used as a synthetic measure for ranking rules, then confidence should also be included in addition to support. This is because Interest is a poor measure for implication examination.

While we may have three alternate frameworks for fully characterizing rules (support-confidence-interest, support-conviction-interest, support-reliability-interest), the support-confidence-interest framework is best. The other two work well only when rules are positively correlated.

Partial Order

Instead of the support-confidence framework, we suggest:

Support-confidence-interest framework

Support-conviction-interest framework

Support-reliability-interest framework

Other frameworks? Which is best?
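The slides do not define the partial order itself. One plausible reading, shown here purely as an assumption, is componentwise (Pareto) dominance over the three measures of a framework: rule A ranks above rule B only if A is at least as good on every measure and strictly better on one. The `Rule` type, the sample values, and the "higher is better" convention for interest are all hypothetical.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    name: str
    support: float
    confidence: float
    interest: float


def dominates(a: Rule, b: Rule) -> bool:
    """True if `a` is no worse than `b` on every measure and better on one."""
    pairs = list(zip(a[1:], b[1:]))  # skip the name field
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def non_dominated(rules: List[Rule]) -> List[Rule]:
    """Rules that no other rule dominates: the top layer of the partial order."""
    return [r for r in rules if not any(dominates(o, r) for o in rules)]


# Hypothetical rules scored in the support-confidence-interest framework.
rules = [
    Rule("A", support=0.30, confidence=0.80, interest=1.5),
    Rule("B", support=0.20, confidence=0.70, interest=1.2),
    Rule("C", support=0.10, confidence=0.90, interest=1.1),
]

print([r.name for r in non_dominated(rules)])
```

Rules A and C are incomparable (A has higher support and interest, C higher confidence), which is exactly what a partial order allows and a single total ranking would hide.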

Work to Do

Evaluate the frameworks with realistic application data (image data, KDD Cup data, Skyrocket data, …); the work has been criticized for lacking supporting applications.

Efficiency principle? P-tree algorithms and other algorithms for comparison

Other possible frameworks? Ours cover objective metrics; how can subjective metrics be combined for top-N rules?