
Page 1: Association Rule Mining

Instructor: Qiang Yang
Slides from Jiawei Han and Jian Pei, and from Introduction to Data Mining by Tan, Steinbach, Kumar

Page 2

Frequent-pattern mining methods

What Is Frequent Pattern Mining?

Frequent patterns: pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]

Frequent pattern mining: finding regularities in data
What products were often purchased together?
What are the subsequent purchases after buying a PC?

Page 3

Why Is Frequent Pattern Mining an Essential Task in Data Mining?

Foundation for many essential data mining tasks:
Association, correlation, causality
Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)

Broad applications:
Basket data analysis, cross-marketing, catalog design, sales campaign analysis
Web log (click stream) analysis, DNA sequence analysis, etc.

Page 4

Basic Concepts: Frequent Patterns and Association Rules

Itemset X = {x1, …, xk}
Find all rules X → Y with minimum confidence and support:
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y

Let min_support = 50%, min_conf = 50%:
A → C (50%, 66.7%)
C → A (50%, 100%)

(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both.)

Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F
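The two rules above can be checked in a few lines; this is an illustrative sketch (the helper names `support` and `confidence` are mine, not from the slides), using the four transactions from the table:

```python
# Transaction database from the slide (Transaction-id -> items bought)
db = {10: {"A", "B", "C"}, 20: {"A", "C"}, 30: {"A", "D"}, 40: {"B", "E", "F"}}

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db.values()) / len(db)

def confidence(lhs, rhs):
    """Pr(RHS | LHS) = support(LHS union RHS) / support(LHS)."""
    return support(lhs | rhs) / support(lhs)

# Rule A -> C: support 0.5, confidence 2/3
print(support({"A", "C"}), confidence({"A"}, {"C"}))
# Rule C -> A: support 0.5, confidence 1.0
print(support({"A", "C"}), confidence({"C"}, {"A"}))
```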

Page 5

Concept: Frequent Itemsets

Outlook | Temperature | Humidity | Play
sunny | hot | high | no
sunny | hot | high | no
overcast | hot | high | yes
rainy | mild | high | yes
rainy | cool | normal | yes
rainy | cool | normal | no
overcast | cool | normal | yes
sunny | mild | high | no
sunny | cool | normal | yes
rainy | mild | normal | yes
sunny | mild | normal | yes
overcast | mild | high | yes
overcast | hot | normal | yes
rainy | mild | high | no

Minimum support = 2: {sunny, hot, no}, {sunny, hot, high, no}, {rainy, normal}

Min Support = 3?

How strong is {sunny, no}? Count = ?  Percentage = ?
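One way to answer the count question is to test set containment row by row; a small sketch with the table transcribed (the helper `count` is illustrative):

```python
# Weather data from the slide: (Outlook, Temperature, Humidity, Play)
rows = [
    ("sunny", "hot", "high", "no"), ("sunny", "hot", "high", "no"),
    ("overcast", "hot", "high", "yes"), ("rainy", "mild", "high", "yes"),
    ("rainy", "cool", "normal", "yes"), ("rainy", "cool", "normal", "no"),
    ("overcast", "cool", "normal", "yes"), ("sunny", "mild", "high", "no"),
    ("sunny", "cool", "normal", "yes"), ("rainy", "mild", "normal", "yes"),
    ("sunny", "mild", "normal", "yes"), ("overcast", "mild", "high", "yes"),
    ("overcast", "hot", "normal", "yes"), ("rainy", "mild", "high", "no"),
]

def count(itemset):
    """Number of rows containing every value in `itemset`."""
    return sum(itemset <= set(row) for row in rows)

c = count({"sunny", "no"})
print(c, c / len(rows))  # count, and support as a percentage of the 14 rows
```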

Page 6

Concept: Itemset Rules

{sunny, hot, no} = {Outlook=sunny, Temp=hot, Play=no}
Generate a rule: Outlook=sunny and Temp=hot → Play=no
How strong is this rule?
Support of the rule = support of the itemset {sunny, hot, no} = 2 = Pr({sunny, hot, no}), expressed in either count form or percentage form.
Confidence = Pr(Play=no | Outlook=sunny, Temp=hot)
In general, for LHS → RHS: Confidence = Pr(RHS | LHS) = count(LHS and RHS) / count(LHS)

What is the confidence of Outlook=sunny → Play=no?
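The confidence formula can be applied directly to the weather data; a sketch (the helper `conf` is mine, not from the slides):

```python
# Weather rows from the slide as (Outlook, Temp, Humidity, Play) tuples
data = [
    ("sunny", "hot", "high", "no"), ("sunny", "hot", "high", "no"),
    ("overcast", "hot", "high", "yes"), ("rainy", "mild", "high", "yes"),
    ("rainy", "cool", "normal", "yes"), ("rainy", "cool", "normal", "no"),
    ("overcast", "cool", "normal", "yes"), ("sunny", "mild", "high", "no"),
    ("sunny", "cool", "normal", "yes"), ("rainy", "mild", "normal", "yes"),
    ("sunny", "mild", "normal", "yes"), ("overcast", "mild", "high", "yes"),
    ("overcast", "hot", "normal", "yes"), ("rainy", "mild", "high", "no"),
]

def conf(lhs, rhs):
    """Confidence of LHS -> RHS: count(LHS and RHS) / count(LHS)."""
    n_lhs = sum(lhs <= set(r) for r in data)
    n_both = sum((lhs | rhs) <= set(r) for r in data)
    return n_both / n_lhs

print(conf({"sunny"}, {"no"}))          # Outlook=sunny -> Play=no
print(conf({"sunny", "hot"}, {"no"}))   # Outlook=sunny and Temp=hot -> Play=no
```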

Page 7

Frequent Patterns

Patterns = itemsets {i1, i2, …, in}, where each item is a pair (Attribute = value)
Frequent patterns: itemsets whose support >= minimum support,
where support = count(itemset) / count(database)

Page 8

Frequent Itemset Generation

(Itemset lattice over items A–E:)
null
A  B  C  D  E
AB  AC  AD  AE  BC  BD  BE  CD  CE  DE
ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
ABCD  ABCE  ABDE  ACDE  BCDE
ABCDE

Given d items, there are 2^d possible candidate itemsets
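The 2^d blow-up is easy to see by enumerating the lattice level by level for d = 5:

```python
from itertools import combinations

items = ["A", "B", "C", "D", "E"]

# Enumerate every candidate itemset (including the empty set), level by level
candidates = [frozenset(c) for k in range(len(items) + 1)
              for c in combinations(items, k)]

print(len(candidates))  # 2**5 = 32 subsets of 5 items
```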

Page 9

Max-patterns

Max-pattern: a frequent pattern without a proper frequent super-pattern.
With min_sup = 2: BCDE and ACD are max-patterns; BCD is not a max-pattern.

Tid | Items
10 | A, B, C, D, E
20 | B, C, D, E
30 | A, C, D, F
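The max-pattern claims for this small database can be verified by brute force; a sketch (the helper names are mine):

```python
from itertools import combinations

# Transaction database from the slide (Tids 10, 20, 30), min_sup = 2
db = [set("ABCDE"), set("BCDE"), set("ACDF")]
min_sup = 2

def sup(itemset):
    return sum(itemset <= t for t in db)

# Brute-force enumeration of all frequent itemsets (fine at this size)
items = sorted(set().union(*db))
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if sup(set(c)) >= min_sup]

# Max-patterns: frequent itemsets with no frequent proper superset
maximal = {f for f in frequent if not any(f < g for g in frequent)}
print(sorted("".join(sorted(s)) for s in maximal))
```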

Page 10

Maximal Frequent Itemset

(Figure: the itemset lattice from null to ABCDE, with a border separating the frequent itemsets from the infrequent ones; the maximal itemsets sit on the frequent side of the border.)

An itemset is maximal frequent if none of its immediate supersets is frequent

Page 11

Frequent Max Patterns: Succinct Expression of Frequent Patterns

Let {a, b, c} be frequent. Then {a, b}, {b, c}, {a, c} must also be frequent, and {a}, {b}, {c} must also be frequent.
By writing down {a, b, c} once, we save lots of computation.

Max pattern: if {a, b, c} is a frequent max pattern, then {a, b, c, x} is NOT a frequent pattern, for any other item x.

Page 12

Find Frequent Max Patterns

Outlook | Temperature | Humidity | Play
sunny | hot | high | no
sunny | hot | high | no
overcast | hot | high | yes
rainy | mild | high | yes
rainy | cool | normal | yes
rainy | cool | normal | no
overcast | cool | normal | yes
sunny | mild | high | no
sunny | cool | normal | yes
rainy | mild | normal | yes
sunny | mild | normal | yes
overcast | mild | high | yes
overcast | hot | normal | yes
rainy | mild | high | no

Minimum support = 2: {sunny, hot, no} ??

Page 13

Closed Patterns

An itemset is closed if none of its immediate supersets has the same support as the itemset.
In the database below, {a, b}, {a, b, d}, and {a, b, c} are closed patterns, but {a, b} is not a max pattern.
Closed patterns record exactly where the support changes happen, reducing the number of patterns and rules (N. Pasquier et al., ICDT'99).

TID | Items
10 | a, b, c
20 | a, b, c
30 | a, b, d
40 | a, b, d
50 | c, e, f
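The closedness of {a, b}, {a, b, c}, and {a, b, d} can be checked by brute force on this database; a sketch (the helper names are mine):

```python
from itertools import combinations

# Transaction database from the slide
db = [set("abc"), set("abc"), set("abd"), set("abd"), set("cef")]

def sup(itemset):
    return sum(itemset <= t for t in db)

items = sorted(set().union(*db))
observed = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if sup(set(c)) > 0]

# Closed: no immediate superset has the same support
closed = {s for s in observed
          if not any(sup(s | {x}) == sup(s) for x in items if x not in s)}
print(sorted("".join(sorted(s)) for s in closed))
```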

Page 14

Maximal vs Closed Itemsets

TID | Items
1 | ABC
2 | ABCD
3 | BCE
4 | ACDE
5 | DE

(Figure: the itemset lattice annotated with the IDs of the supporting transactions next to each itemset, e.g. A: 1,2,4; C: 1,2,3,4; ABCD: 2. Itemsets supported by no transaction are marked as such.)

Page 15

Maximal vs Closed Frequent Itemsets

(Figure: the same transaction-annotated itemset lattice with minimum support = 2. The closed itemsets are marked; those that are both closed and maximal are highlighted separately from those that are closed but not maximal.)

Minimum support = 2
# Closed = 9
# Maximal = 4

Page 16

Note on Closed Patterns

Closed patterns do not require the minimum support to be specified in advance: given a dataset, we can compute its set of closed patterns once, and then for any minimum support value we can immediately obtain the corresponding set of patterns (a subset of the closed patterns).

Closed frequent patterns: patterns that are both closed and above the min support.

Page 17

Maximal vs Closed Itemsets

(Figure: nested sets. Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets.)

Page 18

Mining Association Rules—an Example

For rule A → C:
support = support({A} ∪ {C}) = 50%
confidence = support({A} ∪ {C}) / support({A}) = 66.7%

Min. support 50%, min. confidence 50%

Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

Frequent pattern | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

Page 19

Method 1. Apriori: A Candidate Generation-and-Test Approach

Any subset of a frequent itemset must be frequent: if {beer, diaper, nuts} is frequent, so is {beer, diaper}, since every transaction having {beer, diaper, nuts} also contains {beer, diaper}.
Apriori pruning principle: if any itemset is infrequent, its supersets need not be generated/tested!
Method: generate length-(k+1) candidate itemsets from length-k frequent itemsets, and test the candidates against the DB.
Performance studies show its efficiency and scalability.
(Agrawal & Srikant 1994; Mannila et al. 1994)

Page 20

The Apriori Algorithm: An Example

Database TDB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan: C1 → L1
C1:
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3
L1:
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{E} | 3

2nd scan: C2 → L2
C2: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
C2 with counts:
{A, B} 1, {A, C} 2, {A, E} 1, {B, C} 2, {B, E} 3, {C, E} 2
L2: {A, C} 2, {B, C} 2, {B, E} 3, {C, E} 2

3rd scan: C3 → L3
C3: {B, C, E}
L3: {B, C, E} 2
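The three scans above can be reproduced with a compact candidate generation-and-test loop; this is a sketch of Apriori (not the authors' implementation) that recovers the same L1, L2, and L3:

```python
from itertools import combinations

# Database TDB from the slide
db = {10: {"A", "C", "D"}, 20: {"B", "C", "E"},
      30: {"A", "B", "C", "E"}, 40: {"B", "E"}}
min_sup = 2

def sup(itemset):
    """Absolute support: number of transactions containing `itemset`."""
    return sum(itemset <= t for t in db.values())

items = sorted(set().union(*db.values()))
Lk = {frozenset({i}) for i in items if sup({i}) >= min_sup}  # L1
levels = [Lk]
k = 1
while Lk:
    k += 1
    # Join step: unions of two (k-1)-itemsets that have size k
    Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune step (Apriori principle): every (k-1)-subset must be frequent
    Ck = {c for c in Ck if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
    # Scan the database to keep only the frequent candidates
    Lk = {c for c in Ck if sup(c) >= min_sup}
    if Lk:
        levels.append(Lk)

for i, level in enumerate(levels, start=1):
    print(f"L{i}:", sorted("".join(sorted(s)) for s in level))
```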

Page 21

Speeding Up Association Rules

The Dynamic Hashing and Pruning (DHP) technique
(Thanks to Cheng Hong & Hu Haibo)

Page 22

DHP: Reduce the Number of Candidates

A k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent.
Candidates: a, b, c, d, e
Hash entries: {ab, ad, ae}, {bd, be, de}, …
Frequent 1-itemsets: a, b, d, e
ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold.
(J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95)

Page 23

Still challenging, the niche for DHP

DHP (Park '95): Dynamic Hashing and Pruning

The set of candidate large 2-itemsets is huge. DHP trims it using hashing.
The transaction database is so huge that even one scan per iteration is costly. DHP prunes both the number of transactions and the number of items in each transaction after each iteration.

Page 24

Hash Table Construction

Consider 2-itemsets, with all items numbered i1, i2, …, in. Any pair (x, y) is hashed according to:
bucket # = h({x, y}) = ((order of x) * 10 + (order of y)) % 7
Example: Items = A, B, C, D, E, with orders 1, 2, 3, 4, 5.
h({C, E}) = (3 * 10 + 5) % 7 = 0, so {C, E} belongs to bucket 0.

Page 25

How to trim candidate itemsets

In iteration k, hash all candidate (k+1)-itemsets into a hash table, and count the itemsets falling in each bucket.
In iteration k+1, examine each candidate itemset to see whether its corresponding bucket count is above the support threshold (a necessary condition for being frequent).

Page 26

Example

TID | Items
100 | A, C, D
200 | B, C, E
300 | A, B, C, E
400 | B, E

Figure 1. An example transaction database

Page 27

Generation of C1 & L1 (1st iteration)

C1:
Itemset | Sup
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3

L1:
Itemset | Sup
{A} | 2
{B} | 3
{C} | 3
{E} | 3

Page 28

Hash Table Construction

Find all 2-itemsets of each transaction

TID | 2-itemsets
100 | {A C} {A D} {C D}
200 | {B C} {B E} {C E}
300 | {A B} {A C} {A E} {B C} {B E} {C E}
400 | {B E}

Page 29

Hash Table Construction (2)

Hash function: h({x, y}) = ((order of x) * 10 + (order of y)) % 7

Hash table:
bucket | contents | count
0 | {A D} {C E} {C E} | 3
1 | {A E} | 1
2 | {B C} {B C} | 2
3 | (empty) | 0
4 | {B E} {B E} {B E} | 3
5 | {A B} | 1
6 | {A C} {C D} {A C} | 3

Page 30

C2 Generation (2nd iteration)

L1*L1 | # in the bucket
{A B} | 1
{A C} | 3
{A E} | 1
{B C} | 2
{B E} | 3
{C E} | 3

Resulting C2 (DHP): {A C}, {B C}, {B E}, {C E}

C2 of Apriori: {A B}, {A C}, {A E}, {B C}, {B E}, {C E}
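The bucket counts and the trimmed C2 can be reproduced by hashing every 2-itemset of every transaction; a sketch using the slide's hash function (helper names are mine):

```python
from itertools import combinations

# Transactions from Figure 1, and the slide's item ordering A=1, ..., E=5
db = {100: ["A", "C", "D"], 200: ["B", "C", "E"],
      300: ["A", "B", "C", "E"], 400: ["B", "E"]}
order = {item: i + 1 for i, item in enumerate("ABCDE")}
min_sup = 2

def h(pair):
    """DHP hash: ((order of x)*10 + (order of y)) % 7, with x ordered before y."""
    x, y = sorted(pair, key=order.get)
    return (order[x] * 10 + order[y]) % 7

# Pass over the data: hash every 2-itemset of every transaction into a bucket
buckets = [0] * 7
for t in db.values():
    for pair in combinations(t, 2):
        buckets[h(pair)] += 1
print(buckets)

# Trim L1*L1: keep a pair only if its bucket count can reach min_sup
L1 = ["A", "B", "C", "E"]
C2 = [set(p) for p in combinations(L1, 2) if buckets[h(p)] >= min_sup]
print(C2)
```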

Page 31

Effective Database Pruning

Apriori: does not prune the database; prunes Ck by support counting on the original database.
DHP: more efficient support counting can be achieved on the pruned database.

Page 32

Performance Comparison

Page 33

Performance Comparison (2)

Page 34

FP-growth Algorithm

Use a compressed representation of the database using an FP-tree

Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets

Page 35

FP-tree construction

TID | Items
1 | {A,B}
2 | {B,C,D}
3 | {A,C,D,E}
4 | {A,D,E}
5 | {A,B,C}
6 | {A,B,C,D}
7 | {B,C}
8 | {A,B,C}
9 | {A,B,D}
10 | {B,C,E}

After reading TID=1:  null → A:1 → B:1
After reading TID=2:  null → A:1 → B:1, plus a new branch null → B:1 → C:1 → D:1
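Inserting the transactions one by one, as in the two snapshots above, can be sketched with a nested-dict tree (a simplification: the header table and the node-link pointers are omitted):

```python
# Transactions from the slide; items inserted in the fixed order A < B < C < D < E
transactions = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"},
    {"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C"}, {"A", "B", "C"},
    {"A", "B", "D"}, {"B", "C", "E"},
]
order = "ABCDE"

# Each tree node is item -> [count, children]; the root is just a dict of children
root = {}

def insert(tree, items):
    """Walk one path from the root, creating nodes and bumping counts."""
    for item in items:
        node = tree.setdefault(item, [0, {}])
        node[0] += 1
        tree = node[1]

for t in transactions:
    insert(root, sorted(t, key=order.index))

print(root["A"][0], root["B"][0])    # counts of the two branches under null
print(root["A"][1]["B"][0])          # count of B on the A-branch
```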

Page 36

FP-Tree Construction

(Transaction database as on Page 35.)

Full FP-tree after all ten transactions:
null
+- A:7
|  +- B:5
|  |  +- C:3
|  |  |  +- D:1
|  |  +- D:1
|  +- C:1
|  |  +- D:1
|  |     +- E:1
|  +- D:1
|     +- E:1
+- B:3
   +- C:3
      +- D:1
      +- E:1

Header table (Item → Pointer): A, B, C, D, E
Pointers are used to assist frequent itemset generation

Page 37

FP-growth

(Figure: the FP-tree paths containing D, with counts A:4 and B:2 on the A-branch and B:1 at the root.)

Conditional pattern base for D: PB|D = {(A:1, B:1, C:1), (A:1, B:1), (A:1, C:1), (A:1), (B:1, C:1)}

Recursively apply FP-growth on PB|D, and then append D.

Thus, the frequent itemsets found from PB|D (with min support = 2) are:
AD, BD, CD, ABD, ACD, BCD
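Mining the conditional pattern base PB|D with min support = 2 can be checked by brute force, which stands in here for the recursive FP-growth call (fine at this size; helper names are mine):

```python
from itertools import combinations

# Conditional pattern base for D, read off the tree: (prefix path, count)
pb_d = [({"A", "B", "C"}, 1), ({"A", "B"}, 1), ({"A", "C"}, 1),
        ({"A"}, 1), ({"B", "C"}, 1)]
min_sup = 2

def sup(itemset):
    """Support of `itemset` within the conditional pattern base."""
    return sum(cnt for path, cnt in pb_d if itemset <= path)

# Every itemset frequent in PB|D, with D appended
items = sorted(set().union(*(path for path, _ in pb_d)))
frequent_with_d = {frozenset(c) | {"D"}
                   for k in range(1, len(items) + 1)
                   for c in combinations(items, k)
                   if sup(set(c)) >= min_sup}
print(sorted("".join(sorted(s)) for s in frequent_with_d))
```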

Page 38

FP-Growth vs. Apriori: Scalability With the Support Threshold

(Figure: run time (sec.), y-axis 0 to 100, versus support threshold (%), x-axis 0 to 3, on data set T25I20D10K; two series are plotted, D1 FP-growth runtime and D1 Apriori runtime.)