Association Rule Mining (II). Instructor: Qiang Yang. Thanks: J. Han and J. Pei


Page 1: Association Rule Mining (II)


Association Rule Mining (II)

Instructor: Qiang Yang. Thanks: J. Han and J. Pei

Page 2: Association Rule Mining (II)

Frequent-pattern mining methods


Bottleneck of Frequent-pattern Mining

Multiple database scans are costly. Mining long patterns needs many passes of scanning and generates lots of candidates. To find the frequent itemset i1 i2 … i100:

Number of scans: 100
Number of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 !

Bottleneck: candidate generation and test. Can we avoid candidate generation?
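The candidate count quoted above can be verified directly:

```python
import math

# The number of candidate itemsets over 100 items is
# C(100,1) + C(100,2) + ... + C(100,100) = 2**100 - 1.
n_candidates = sum(math.comb(100, k) for k in range(1, 101))
assert n_candidates == 2**100 - 1
print(f"{n_candidates:.3g}")  # 1.27e+30
```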

Page 3: Association Rule Mining (II)


FP-growth: Frequent-pattern Mining Without Candidate Generation

Heuristic: let P be a frequent itemset, S be the set of transactions containing P, and x be an item. If x is a frequent item in S, {x} ∪ P must be a frequent itemset.

No candidate generation! A compact data structure, the FP-tree, stores the information needed for frequent-pattern mining.

Recursive mining algorithm for mining complete set of frequent patterns

Page 4: Association Rule Mining (II)


Example

TID  Items Bought
100  f,a,c,d,g,i,m,p
200  a,b,c,f,l,m,o
300  b,f,h,j,o
400  b,c,k,s,p
500  a,f,c,e,l,p,m,n

Min Support = 3

Page 5: Association Rule Mining (II)


Scan the database. List of frequent items, sorted (item:support): <(f:4), (c:4), (a:3), (b:3), (m:3), (p:3)>

The root of the tree is created and labeled with “{}”

Scan the database again. Scanning the first transaction leads to the first branch of the tree: <(f:1), (c:1), (a:1), (m:1), (p:1)>

Order according to frequency

Page 6: Association Rule Mining (II)


Scanning TID=100

Transaction database: TID 100: f,a,c,d,g,i,m,p

Tree after inserting TID 100 (first branch from the root):
{} → f:1 → c:1 → a:1 → m:1 → p:1

Header table (item : node count, with node-links): f:1, c:1, a:1, m:1, p:1

Page 7: Association Rule Mining (II)


Scanning TID=200

Frequent single items: F1 = <f,c,a,b,m,p>
TID=200: possible frequent items (intersect with F1): f,c,a,b,m
Along the first branch <f,c,a,m,p>, the shared prefix is <f,c,a>
Generate two children: <b>, <m>

TID  Items Bought
100  f,a,c,d,g,i,m,p
200  a,b,c,f,l,m,o
300  b,f,h,j,o
400  b,c,k,s,p
500  a,f,c,e,l,p,m,n

Page 8: Association Rule Mining (II)


Scanning TID=200

Transaction database: TID 200: f,c,a,b,m (filtered and sorted)

Tree after inserting TID 200:
{} → f:2 → c:2 → a:2 → m:1 → p:1
                  a:2 → b:1 → m:1

Header table (item : node count): f:1, c:1, a:1, b:1, m:2, p:1

Page 9: Association Rule Mining (II)


The final FP-tree

Transaction database:
TID  Items
100  f,a,c,d,g,i,m,p
200  a,b,c,f,l,m,o
300  b,f,h,j,o
400  b,c,k,s,p
500  a,f,c,e,l,p,m,n

Min support = 3. Frequent 1-items in frequency-descending order: f,c,a,b,m,p

Final FP-tree:
{} → f:4 → c:3 → a:3 → m:2 → p:2
                  a:3 → b:1 → m:1
     f:4 → b:1
{} → c:1 → b:1 → p:1

Header table (item : node count): f:1, c:2, a:1, b:3, m:2, p:2

Page 10: Association Rule Mining (II)


FP-Tree Construction

Scans the database only twice. Subsequent mining is based on the FP-tree.
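The two database scans can be sketched as follows (a minimal sketch: names like build_fp_tree are mine, and ties in the item order are broken as on the slides' F-list):

```python
from collections import defaultdict

MIN_SUP = 3
TRANSACTIONS = {  # the running example from the slides
    100: ["f", "a", "c", "d", "g", "i", "m", "p"],
    200: ["a", "b", "c", "f", "l", "m", "o"],
    300: ["b", "f", "h", "j", "o"],
    400: ["b", "c", "k", "s", "p"],
    500: ["a", "f", "c", "e", "l", "p", "m", "n"],
}
# Frequency-descending item order from the slides (the F-list).
F_LIST = ["f", "c", "a", "b", "m", "p"]
RANK = {item: i for i, item in enumerate(F_LIST)}

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions):
    # Scan 1: count item supports; keep only the frequent items.
    support = defaultdict(int)
    for items in transactions.values():
        for item in items:
            support[item] += 1
    frequent = {i for i, s in support.items() if s >= MIN_SUP}

    # Scan 2: insert each transaction, filtered and sorted by the F-list.
    root = Node(None, None)
    header = defaultdict(list)  # item -> list of nodes (node-links)
    for items in transactions.values():
        path = sorted((i for i in items if i in frequent), key=RANK.get)
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

root, header = build_fp_tree(TRANSACTIONS)
print({i: len(header[i]) for i in F_LIST})  # node counts match the header table
```

The node counts per item (f:1, c:2, a:1, b:3, m:2, p:2) reproduce the header table of the final FP-tree on Page 9.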

Page 11: Association Rule Mining (II)


How to Mine an FP-tree?

Step 1: form conditional pattern base

Step 2: construct conditional FP-tree

Step 3: recursively mine conditional FP-trees

Page 12: Association Rule Mining (II)


Conditional Pattern Base. Let {I} be a frequent item. Its conditional pattern base is the sub-database consisting of the prefix paths in the FP-tree that co-occur with {I} as a suffix pattern.

Example: {m} is a frequent item. {m}'s conditional pattern base: <f,c,a>: support = 2; <f,c,a,b>: support = 1.

Mine recursively on such databases.

The final FP-tree (from Page 9):
{} → f:4 → c:3 → a:3 → m:2 → p:2
                  a:3 → b:1 → m:1
     f:4 → b:1
{} → c:1 → b:1 → p:1

Page 13: Association Rule Mining (II)


Conditional Pattern Tree. Let {I} be a suffix item and {DB|I} be its conditional pattern base. The frequent-pattern tree TreeI built from {DB|I} is known as the conditional pattern tree.

Example: {m} is a frequent item. {m}'s conditional pattern base: <f,c,a>: support = 2; <f,c,a,b>: support = 1.

{m}'s conditional pattern tree:
{} → f:3 → c:3 → a:3

Page 14: Association Rule Mining (II)


Composition of patterns α and β

Let α be a frequent item in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is frequent in DB if and only if β is frequent in B.

Example: starting with α = {p}. {p}'s conditional pattern base (from the tree):
B = (f,c,a,m): 2
    (c,b): 1
Let β be {c}. Then α ∪ β = {p,c}, with support = 3.

Page 15: Association Rule Mining (II)


Single-path tree

Let P be a single-path FP-tree and {I1, I2, …, Ik} be an itemset in the tree. Let Ij have the lowest support. Then support({I1, I2, …, Ik}) = support(Ij).

Example:

{} → f:4 → c:3 → a:3 → m:2 → p:2
                  a:3 → b:1 → m:1
     f:4 → b:1
{} → c:1 → b:1 → p:1
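On a single path, every non-empty combination of nodes is a pattern whose support is the minimum node count, so all patterns can be enumerated directly. A sketch, using {m}'s conditional single-path tree (counts f:3, c:3, a:3 assumed from the preceding slides):

```python
from itertools import combinations

# Single path from {m}'s conditional FP-tree (counts from the slides).
path = [("f", 3), ("c", 3), ("a", 3)]

# Each combination of path nodes is frequent; its support is the lowest
# count among its nodes (the deepest node on a single path).
patterns = {}
for r in range(1, len(path) + 1):
    for combo in combinations(path, r):
        items = frozenset(i for i, _ in combo)
        patterns[items] = min(c for _, c in combo)

print(len(patterns))  # 2**3 - 1 = 7 patterns
```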

Page 16: Association Rule Mining (II)


FP_growth Algorithm Fig 6.10

Recursive algorithm. Input: a transaction database and min_sup. Output: the complete set of frequent patterns.

1. FP-tree construction
2. Mine the FP-tree by calling FP_growth(FP_tree, null)

Key idea: treat single-path and multi-path FP-trees separately; keep splitting until the tree is single-path.

Page 17: Association Rule Mining (II)


FP_Growth(tree, α):
If tree contains a single path P, then
  for each combination β of the nodes in path P,
    generate pattern β ∪ α with support = minimum support of the nodes in β
Else, for each aᵢ in the header of tree, do {
  generate pattern β = aᵢ ∪ α with support = aᵢ.support;
  construct (1) β's conditional pattern base and (2) β's conditional FP-tree Treeβ;
  if Treeβ is not empty, then call FP_growth(Treeβ, β);
}
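The recursion can be sketched end-to-end in Python. This is a hedged sketch, not the book's exact Figure 6.10 code: it omits the single-path shortcut (which is only an optimization, not needed for correctness), and helper names such as build_tree are mine.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_tree(weighted_txns, min_sup):
    """weighted_txns: (items, count) pairs. Returns the header table."""
    support = defaultdict(int)
    for items, cnt in weighted_txns:
        for item in items:
            support[item] += cnt
    freq = {i: s for i, s in support.items() if s >= min_sup}
    root = Node(None, None)
    header = defaultdict(list)  # item -> node-links
    for items, cnt in weighted_txns:
        path = sorted((i for i in items if i in freq),
                      key=lambda i: (-freq[i], i))  # ties broken arbitrarily
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += cnt
    return header

def fp_growth(header, alpha, min_sup, out):
    for item, nodes in header.items():
        beta = alpha | {item}
        out[frozenset(beta)] = sum(n.count for n in nodes)
        # beta's conditional pattern base: the prefix path of each node,
        # weighted by that node's count.
        base = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, n.count))
        cond_header = build_tree(base, min_sup)  # beta's conditional FP-tree
        if cond_header:
            fp_growth(cond_header, beta, min_sup, out)

txns = [(t, 1) for t in (["f","a","c","d","g","i","m","p"],
                         ["a","b","c","f","l","m","o"],
                         ["b","f","h","j","o"],
                         ["b","c","k","s","p"],
                         ["a","f","c","e","l","p","m","n"])]
patterns = {}
fp_growth(build_tree(txns, 3), frozenset(), 3, patterns)
print(patterns[frozenset("fcam")])  # 3: {f,c,a,m} occurs in TIDs 100, 200, 500
```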

Page 18: Association Rule Mining (II)


FP-Growth vs. Apriori: Scalability With the Support Threshold

[Figure: run time (sec.) vs. support threshold (%) on data set T25I20D10K; series: D1 FP-growth runtime, D1 Apriori runtime]

Page 19: Association Rule Mining (II)


FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

[Figure: runtime (sec.) vs. support threshold (%) on data set T25I20D100K; series: D2 FP-growth, D2 TreeProjection]

Page 20: Association Rule Mining (II)


Why Is FP-Growth the Winner?

Divide-and-conquer: decompose both the mining task and the DB according to the frequent patterns obtained so far; this leads to a focused search of smaller databases.

Other factors: no candidate generation, no candidate test; compressed database (FP-tree structure); no repeated scan of the entire database; basic operations are counting and FP-tree building, not pattern search and matching.

Page 21: Association Rule Mining (II)


Implications of the Methodology: Papers by Han, et al.

Mining closed frequent itemsets and max-patterns

CLOSET (DMKD’00)

Mining sequential patterns

FreeSpan (KDD’00), PrefixSpan (ICDE’01)

Constraint-based mining of frequent patterns

Convertible constraints (KDD’00, ICDE’01)

Computing iceberg data cubes with complex measures

H-tree and H-cubing algorithm (SIGMOD’01)

Page 22: Association Rule Mining (II)


Visualization of Association Rules: Pane Graph

Page 23: Association Rule Mining (II)


Visualization of Association Rules: Rule Graph

Page 24: Association Rule Mining (II)


Mining Various Kinds of Rules or Regularities

Multi-level and quantitative association rules, correlation and causality, ratio rules, sequential patterns, emerging patterns, temporal associations, partial periodicity.

Classification, clustering, iceberg cubes, etc.

Page 25: Association Rule Mining (II)


Multiple-level Association Rules

Items often form a hierarchy. Flexible support settings: items at the lower level are expected to have lower support. Transaction databases can be encoded based on dimensions and levels; explore shared multi-level mining.

Uniform support: Level 1 min_sup = 5%; Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%; Level 2 min_sup = 3%
Example: Milk [support = 10%]; 2% Milk [support = 6%]; Skim Milk [support = 4%]

Page 26: Association Rule Mining (II)


Quantitative Association Rules

Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized.

2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat. Cluster “adjacent” association rules to form general rules using a 2-D grid.

Example: age(X, ”34-35”) ∧ income(X, ”30K-50K”) ⇒ buys(X, ”high resolution TV”)

Page 27: Association Rule Mining (II)


Redundant Rules [SA95] Which rule is redundant?

milk ⇒ wheat bread, [support = 8%, confidence = 70%]

“skim milk” ⇒ wheat bread, [support = 2%, confidence = 72%]

The first rule is more general than the second rule.

A rule is redundant if its support is close to the “expected” value, based on a general rule, and its confidence is close to that of the general rule.

Page 28: Association Rule Mining (II)

INCREMENTAL MINING [CHNW96]

Rules in DB were found, and a set of new tuples db is added to DB. Task: find the new rules in DB + db. Usually, DB is much larger than db.

Properties of itemsets:
An itemset is frequent in DB + db if it is frequent in both DB and db.
An itemset is infrequent in DB + db if it is infrequent in both DB and db.
If frequent only in DB, merge its DB count with its count in db; no DB scan is needed!
If frequent only in db, scan DB once to update its itemset count.

The same principle is applicable to distributed/parallel mining.
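The merge rule can be illustrated with a toy sketch (the counts and threshold below are illustrative, not from the slides):

```python
# Illustrative sizes: a large DB and a small increment db.
n_DB, n_db = 10_000, 500
min_ratio = 0.02  # 2% minimum support

def frequent_in_merged(count_DB, count_db):
    # An itemset is frequent in DB + db iff its combined count
    # clears the combined threshold.
    return count_DB + count_db >= min_ratio * (n_DB + n_db)

# Frequent in both parts (300 >= 200 in DB, 20 >= 10 in db)
# -> frequent overall, per the slide's first property:
assert frequent_in_merged(300, 20)
# Infrequent in both parts -> infrequent overall:
assert not frequent_in_merged(100, 5)
```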

Page 29: Association Rule Mining (II)


CORRELATION RULES

Association does not measure correlation [BMS97, AY98]. Among 5000 students: 3000 play basketball, 3750 eat cereal, 2000 do both.

play basketball ⇒ eat cereal [40%, 66.7%]

The conclusion “basketball and cereal are correlated” is misleading, because the overall percentage of students eating cereal is 75%, higher than 66.7%.

Confidence does not always give the correct picture!

Page 30: Association Rule Mining (II)


Correlation Rules

P(A ∧ B) = P(A) · P(B) if A and B are independent events.

P(A ∧ B) / (P(A) · P(B)) = P(B|A) / P(B) is known as the lift of the rule A ⇒ B.

If the lift is less than 1, then A and B are negatively correlated; otherwise, A and B are positively correlated.

Basketball ⇒ Cereal: 2000/5000 / ((3000/5000) · (3750/5000)) = 2000 · 5000 / (3000 · 3750) ≈ 0.89 < 1
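The slide's numbers can be plugged into the lift formula directly:

```python
# Lift for the basketball/cereal example from the slide.
n = 5000
p_basketball = 3000 / n
p_cereal = 3750 / n
p_both = 2000 / n

confidence = p_both / p_basketball
lift = p_both / (p_basketball * p_cereal)

print(round(confidence, 3))  # 0.667: looks like a strong rule...
print(round(lift, 3))        # 0.889: ...but lift < 1, so negative correlation
```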

Page 31: Association Rule Mining (II)


Chi-square Correlation [BMS97]

The cutoff value at the 95% significance level is 3.84 > 0.9. Thus, we do not reject the independence assumption.

            Item2   not Item2   row sum
Item1         1        2           3
not Item1     4        2           6
column sum    5        4           9

χ² = (1 − 3·5/9)²/(3·5/9) + (2 − 3·4/9)²/(3·4/9) + (4 − 6·5/9)²/(6·5/9) + (2 − 6·4/9)²/(6·4/9) = 0.9
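The same statistic can be computed from the observed table:

```python
# Chi-square statistic for the 2x2 contingency table on the slide.
observed = [[1, 2], [4, 2]]
n = 9
row_sums = [sum(row) for row in observed]        # [3, 6]
col_sums = [sum(col) for col in zip(*observed)]  # [5, 4]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_sums[i] * col_sums[j] / n
        chi2 += (observed[i][j] - expected) ** 2 / expected

print(round(chi2, 6))  # 0.9, well below the 3.84 cutoff at the 95% level
```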

Page 32: Association Rule Mining (II)


Constraint-based Data Mining

Finding all the patterns in a database autonomously? Unrealistic! The patterns could be too many and not focused.

Data mining should be an interactive process: the user directs what is to be mined using a data mining query language (or a graphical user interface).

Constraint-based mining:
User flexibility: the user provides constraints on what is to be mined.
System optimization: the system explores such constraints for efficient mining.

Page 33: Association Rule Mining (II)


Constraints in Data Mining

Knowledge type constraint: classification, association, etc.

Data constraint (using SQL-like queries): find product pairs sold together in stores in Vancouver in Dec.’00

Dimension/level constraint: in relevance to region, price, brand, customer category

Rule (or pattern) constraint: small sales (price < $10) triggers big sales (sum > $200)

Interestingness constraint: strong rules: min_support ≥ 3%, min_confidence ≥ 60%

Page 34: Association Rule Mining (II)


Constrained Mining vs. Constraint-Based Search

Constrained mining vs. constraint-based search/reasoning: both are aimed at reducing the search space. Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI; constraint-pushing vs. heuristic search. How to integrate them is an interesting research problem.

Constrained mining vs. query processing in DBMS: database query processing requires finding all answers; constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing.

Page 35: Association Rule Mining (II)


Constrained Frequent Pattern Mining: A Mining Query Optimization Problem

Given a frequent-pattern mining query with a set of constraints C, the algorithm should be:
sound: it only finds frequent sets that satisfy the given constraints C
complete: all frequent sets satisfying the given constraints C are found

A naïve solution: first find all frequent sets, and then test them for constraint satisfaction.

More efficient approaches: analyze the properties of constraints comprehensively, and push them as deeply as possible inside the frequent-pattern computation.

Page 36: Association Rule Mining (II)


Anti-Monotonicity in Constraint-Based Mining

Anti-monotonicity: if an itemset S satisfies the constraint, so does any of its subsets.
sum(S.Price) ≤ v is anti-monotone; sum(S.Price) ≥ v is not anti-monotone.

Example. C: range(S.profit) ≤ 15 is anti-monotone. Itemset ab violates C, and so does every superset of ab.

TDB (min_sup = 2):
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10
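A quick sketch of how a miner exploits anti-monotonicity for C: range(S.profit) ≤ 15 (the satisfies helper name is mine):

```python
# Profit table from the slide.
profit = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}

def satisfies(itemset):
    # C: range(S.profit) <= 15
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals) <= 15

# ab violates C (range = 40 - 0 = 40 > 15), so every superset of ab
# can be pruned from the search without being counted:
assert not satisfies({"a", "b"})
assert not satisfies({"a", "b", "c"})  # supersets also violate
# Anti-monotonicity: subsets of a satisfying set also satisfy C.
assert satisfies({"c", "e"})  # range = -20 - (-30) = 10 <= 15
```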

Page 37: Association Rule Mining (II)


Which Constraints Are Anti-Monotone?

Constraint                              Antimonotone
v ∈ S                                   no
S ⊇ V                                   no
S ⊆ V                                   yes
min(S) ≤ v                              no
min(S) ≥ v                              yes
max(S) ≤ v                              yes
max(S) ≥ v                              no
count(S) ≤ v                            yes
count(S) ≥ v                            no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)              yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)              no
range(S) ≤ v                            yes
range(S) ≥ v                            no
avg(S) θ v, θ ∈ {=, ≤, ≥}               convertible
support(S) ≥ ξ                          yes
support(S) ≤ ξ                          no

Page 38: Association Rule Mining (II)


Monotonicity in Constraint-Based Mining

Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets.
sum(S.Price) ≥ v is monotone; min(S.Price) ≤ v is monotone.

Example. C: range(S.profit) ≥ 15. Itemset ab satisfies C, and so does every superset of ab.

TDB (min_sup = 2):
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10

Page 39: Association Rule Mining (II)


Which Constraints Are Monotone?

Constraint                              Monotone
v ∈ S                                   yes
S ⊇ V                                   yes
S ⊆ V                                   no
min(S) ≤ v                              yes
min(S) ≥ v                              no
max(S) ≤ v                              no
max(S) ≥ v                              yes
count(S) ≤ v                            no
count(S) ≥ v                            yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)              no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)              yes
range(S) ≤ v                            no
range(S) ≥ v                            yes
avg(S) θ v, θ ∈ {=, ≤, ≥}               convertible
support(S) ≥ ξ                          no
support(S) ≤ ξ                          yes

Page 40: Association Rule Mining (II)


Succinct, Convertible, and Inconvertible Constraints (in the book)

We will not consider these in this course.

Page 41: Association Rule Mining (II)


Associative Classification

Mine possible association rules of the form itemset ⇒ class. Itemset: a set of attribute-value pairs; Class: a class label.

Build a classifier: organize rules in decreasing precedence based on confidence and support.

B. Liu, W. Hsu & Y. Ma. Integrating classification and association rule mining. In KDD’98

Page 42: Association Rule Mining (II)


Classification by Aggregating Emerging Patterns

Emerging pattern (EP): a pattern frequent in one class of data but infrequent in others.

Example: age ≤ 30 is frequent in class “buys_computer = yes” and infrequent in class “buys_computer = no”. Rule: age ≤ 30 ⇒ buys computer

G. Dong & J. Li. Efficient mining of emerging patterns: discovering trends and differences. In KDD’99