COMP9318: Data Warehousing and Data Mining 1 COMP9318: Data Warehousing and Data Mining — L6: Association Rule Mining —

cs9318/20t1/lect/6asso.pdf · 2020-04-11


COMP9318: Data Warehousing and Data Mining

— L6: Association Rule Mining —

• Problem definition and preliminaries

What Is Association Mining?

• Association rule mining:
  • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
• Motivation: finding regularities in data
  • What products were often purchased together? Beer and diapers?!
  • What are the subsequent purchases after buying a PC?
  • What kinds of DNA are sensitive to this new drug?
  • Can we automatically classify web documents?

Wei Wang

Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?

• Foundation for many essential data mining tasks
  • Association, correlation, causality
  • Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  • Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
• Broad applications
  • Basket data analysis, cross-marketing, catalog design, sale campaign analysis
  • Web log (click stream) analysis, DNA sequence analysis, etc.; cf. Google's spelling suggestion


Basic Concepts: Frequent Patterns and Association Rules

• Itemset X = {x1, …, xk}
  • Shorthand: x1 x2 … xk
• Find all the rules X → Y with min confidence and support
  • support, s: probability that a transaction contains X ∪ Y
  • confidence, c: conditional probability that a transaction having X also contains Y

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

Transaction-id   Items bought
10               { A, B, C }
20               { A, C }
30               { A, D }
40               { B, E, F }

Let min_support = 50%, min_conf = 70%:

• Frequent itemset: sup(AC) = 2
• Association rules (support, confidence):
  • A → C (50%, 66.7%)
  • C → A (50%, 100%)
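The definitions above can be checked by hand on the four transactions. The following is a minimal sketch; the function names `support_count`, `support` and `confidence` are illustrative and not part of the lecture.

```python
# The four transactions from the slide (transaction id -> items bought).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)

def support(itemset, db):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, db) / len(db)

def confidence(lhs, rhs, db):
    """Estimate of P(rhs in t | lhs in t) = sup(lhs ∪ rhs) / sup(lhs)."""
    return support_count(lhs | rhs, db) / support_count(lhs, db)

print(support_count({"A", "C"}, transactions))            # 2
print(support({"A", "C"}, transactions))                  # 0.5
print(round(confidence({"A"}, {"C"}, transactions), 3))   # 0.667
print(confidence({"C"}, {"A"}, transactions))             # 1.0
```

With min_conf = 70%, only C → A passes the confidence threshold; A → C has enough support but a confidence of only 66.7%.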


Mining Association Rules—an Example

Min. support 50%, Min. confidence 50%

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A → C:
  support = support({A} ∪ {C}) = 50%
  confidence = support({A} ∪ {C}) / support({A}) = 66.7%

Major computation challenge: calculate the support of itemsets ← the frequent itemset mining problem


• Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases

Association Rule Mining Algorithms

• Naïve algorithm
  • Enumerate all possible itemsets and check their support against min_sup
  • Generate all association rules and check their confidence against min_conf
• The Apriori property
• Apriori Algorithm (candidate generation & verification)
• FP-growth Algorithm
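The first step of the naïve algorithm can be sketched as below, assuming the database fits in memory as a list of sets. The function name `naive_frequent_itemsets` and the use of an absolute support count (rather than a percentage) are choices made for this illustration only.

```python
from itertools import combinations

def naive_frequent_itemsets(db, min_sup_count):
    """Enumerate every non-empty itemset and keep those with enough support."""
    items = sorted(set().union(*db))
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            # One full pass over the database per candidate itemset.
            count = sum(1 for t in db if set(candidate) <= t)
            if count >= min_sup_count:
                frequent[frozenset(candidate)] = count
    return frequent

# The four-transaction example used earlier; 50% of 4 transactions = 2.
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
freq = naive_frequent_itemsets(db, 2)
print({"".join(sorted(s)): n for s, n in freq.items()})
```

With 6 distinct items this already checks 2^6 - 1 = 63 candidates to find 4 frequent itemsets, which is why the enumeration does not scale.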


All Candidate Itemsets for {A, B, C, D, E}

[Figure: itemset lattice, one level per line]

null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE


Apriori Property

• A frequent (used to be called large) itemset is an itemset whose support is ≥ min_sup.
• Apriori property (downward closure): every subset of a frequent itemset is also a frequent itemset
  • Aka the anti-monotone property of support
  • Equivalently: "any supersets of an infrequent itemset are also infrequent itemsets"

[Figure: itemset lattice over {A, B, C, D}, one level per line]

A B C D
AB AC AD BC BD CD
ABC ABD ACD BCD


Illustrating Apriori Principle

[Figure: the itemset lattice over {A, B, C, D, E}, shown twice. In the first copy one itemset is marked "Found to be Infrequent"; in the second copy all of its supersets are marked "Pruned supersets".]

Q: How to design an algorithm to improve the naïve algorithm?


Apriori: A Candidate Generation-and-test Approach

• Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested!
• Algorithm [Agrawal & Srikant 1994]
  1. Ck ← Perform level-wise candidate generation (from singleton itemsets)
  2. Lk ← Verify Ck against the database
  3. Ck+1 ← generated from Lk
  4. Goto 2 if Ck+1 is not empty


The Apriori Algorithm

• Pseudo-code:

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do begin
            increment the count of all candidates in Ck+1 that are contained in t
        end
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;


The Apriori Algorithm—An Example

minsup = 50%

Database TDB

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1

Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1

Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1)

Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan → C2 with counts

Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2

Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (generated from L2)

Itemset
{B, C, E}

3rd scan → L3

Itemset     sup
{B, C, E}   2
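The trace above can be reproduced with a short level-wise implementation. This is a sketch under simplifying assumptions (in-memory database, itemsets as sorted tuples, support given as an absolute count); the names `apriori_gen` and `apriori` are illustrative and not taken from the lecture.

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Build size-k candidates from frequent (k-1)-itemsets: join, then prune."""
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:          # join: same first k-2 items
                c = p + (q[-1],)
                # prune: every (k-1)-subset must itself be frequent
                if all(s in prev_set for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

def apriori(db, min_sup_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    counts = {}
    for t in db:                          # 1st scan: count single items
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {c: n for c, n in counts.items() if n >= min_sup_count}
    result = dict(level)
    k = 1
    while level:
        k += 1
        candidates = apriori_gen(level.keys(), k)
        counts = {c: 0 for c in candidates}
        for t in db:                      # one database scan per level
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_sup_count}
        result.update(level)
    return result

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = apriori(tdb, min_sup_count=2)  # 50% of 4 transactions
for itemset in sorted(frequent, key=lambda s: (len(s), s)):
    print(itemset, frequent[itemset])
```

Counting by testing every candidate against every transaction is the simplest option; how to count supports efficiently is a separate design question that this sketch does not address.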


Important Details of Apriori

1. How to generate candidates?
   • Step 1: self-joining Lk (what's the join condition? why?)
   • Step 2: pruning
2. How to count supports of candidates?

Example of Candidate-generation

• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
  • abcd from abc and abd
  • acde from acd and ace
• Pruning:
  • acde is removed because ade is not in L3
• C4 = {abcd}
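The two steps of the example can be run separately to see the intermediate result. This sketch assumes each item is a single character, so an itemset is a sorted string; `self_join` and `prune` are names chosen for this illustration.

```python
from itertools import combinations

def self_join(l_prev):
    """Join two itemsets that agree on all but their last item."""
    prev = sorted(l_prev)
    joined = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:
                joined.append(p + q[-1])
    return joined

def prune(candidates, l_prev):
    """Drop candidates that have a (k-1)-subset missing from the previous level."""
    prev = set(l_prev)
    return [
        c for c in candidates
        if all("".join(s) in prev for s in combinations(c, len(c) - 1))
    ]

L3 = ["abc", "abd", "acd", "ace", "bcd"]
joined = self_join(L3)
C4 = prune(joined, L3)
print(joined)  # ['abcd', 'acde']
print(C4)      # ['abcd']
```

Requiring equality on the first k-2 items means each candidate is produced by exactly one pair, which is one answer to the "why?" in Step 1.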


Generating Candidates in SQL

• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2
      and p.itemk-1 < q.itemk-1

• Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck


Derive rules from frequent itemsets

• Frequent itemsets != association rules
• One more step is required to find association rules
• For each frequent itemset X,
  for each proper nonempty subset A of X,
  • Let B = X - A
  • A → B is an association rule if
    • confidence(A → B) ≥ min_conf,
      where support(A → B) = support(AB), and
      confidence(A → B) = support(AB) / support(A)


Example – deriving rules from frequent itemsets

• Suppose 234 is frequent, with supp = 50%
  • Proper nonempty subsets: 23, 24, 34, 2, 3, 4, with supp = 50%, 50%, 75%, 75%, 75%, 75% respectively
• These generate these association rules:
  • 23 => 4, confidence = 100%
  • 24 => 3, confidence = 100%
  • 34 => 2, confidence = 67% = (N * 50%) / (N * 75%)
  • 2 => 34, confidence = 67%
  • 3 => 24, confidence = 67%
  • 4 => 23, confidence = 67%
  • All rules have support = 50%

Q: is there any optimization (e.g., pruning) for this step?
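The rule derivation for itemset 234 can be sketched as follows. The supports are the ones stated above; the function name `derive_rules` and the two min_conf values are assumptions made for the illustration, since the example itself fixes no threshold.

```python
from itertools import combinations

def derive_rules(itemset, support, min_conf):
    """Return (A, B, confidence) for every rule A -> B with A ∪ B = itemset."""
    whole = frozenset(itemset)
    rules = []
    for r in range(1, len(whole)):                 # proper, non-empty subsets only
        for lhs in combinations(sorted(whole), r):
            a = frozenset(lhs)
            b = whole - a
            conf = support[whole] / support[a]     # support(AB) / support(A)
            if conf >= min_conf:
                rules.append((tuple(sorted(a)), tuple(sorted(b)), conf))
    return rules

# Supports taken from the example.
support = {
    frozenset("234"): 0.50,
    frozenset("23"): 0.50, frozenset("24"): 0.50, frozenset("34"): 0.75,
    frozenset("2"): 0.75, frozenset("3"): 0.75, frozenset("4"): 0.75,
}

all_rules = derive_rules("234", support, min_conf=0.0)
strong_rules = derive_rules("234", support, min_conf=0.7)
for a, b, conf in all_rules:
    print("".join(a), "=>", "".join(b), f"confidence={conf:.0%}")
```

Only support lookups are needed here, with no further database scan, because every subset of a frequent itemset is itself frequent and was already counted.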


Deriving rules

• To recap, in order to obtain A → B, we need to have Support(AB) and Support(A)
• This step is not as time-consuming as frequent itemsets generation
  • Why?
• It's also easy to speed up using techniques such as parallel processing.
  • How?
• Do we really need candidate generation for deriving association rules?
  • Frequent-Pattern Growth (FP-Tree)


Bottleneck of Frequent-pattern Mining

• Multiple database scans are costly
• Mining long patterns needs many passes of scanning and generates lots of candidates
  • To find frequent itemset i1i2…i100
    • # of scans: 100
    • # of candidates: $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1$
• Bottleneck: candidate-generation-and-test

Can we avoid candidate generation altogether?


• FP-growth

No Pain, No Gain

Java Lisp Scheme Python Ruby

Alice X X

Bob X X

Charlie X X X

Dora X X

minsup = 1

• Apriori:
  • L1 = {J, L, S, P, R}
  • C2 = all the $\binom{5}{2}$ combinations
    • Most of C2 do not contribute to the result
    • There is no way to tell because

No Pain, No Gain

minsup = 1

[Figure: the same table, with a search tree rooted at ɸ; node J is annotated with its support set {A, C}]

Ideas:
• Keep the support set for each frequent itemset
• DFS

J → JL? J → ???
Only need to look at support set for J

No Pain, No Gain

minsup = 1

[Figure: the same table, with a DFS search tree over the nodes ɸ, J, JP, JR and JPR; each node is annotated with its support set ({A, C} or {C})]

Ideas:
• Keep the support set for each frequent itemset
• DFS
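The two ideas can be sketched as a depth-first search that carries the support set (set of transaction ids) of the current itemset, so extending an itemset only needs a set intersection. The sketch uses the Database TDB from the Apriori example with a support count of 2; the function name `dfs_mine` is illustrative.

```python
def dfs_mine(db, min_sup_count):
    """db: {tid: set of items}. Returns {itemset (sorted tuple): support set}."""
    # Support set of every single item.
    tidsets = {}
    for tid, items in db.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    frequent_items = sorted(i for i, s in tidsets.items() if len(s) >= min_sup_count)
    result = {}

    def extend(prefix, prefix_tids, candidates):
        for idx, item in enumerate(candidates):
            # Only the support set of the prefix is consulted, not the whole DB.
            tids = prefix_tids & tidsets[item]
            if len(tids) >= min_sup_count:
                itemset = prefix + (item,)
                result[itemset] = tids
                extend(itemset, tids, candidates[idx + 1:])

    extend((), set(db), frequent_items)
    return result

tdb = {10: {"A", "C", "D"}, 20: {"B", "C", "E"}, 30: {"A", "B", "C", "E"}, 40: {"B", "E"}}
support_sets = dfs_mine(tdb, 2)
for itemset, tids in support_sets.items():
    print("".join(itemset), sorted(tids))
```

Extending only with items that come later in the fixed order keeps each itemset from being visited more than once.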

Notations and Invariants

• ConditionalDB:
  • DB|p = {t ∈ DB | t contains itemset p}
  • DB = DB|∅ (i.e., conditioned on nothing)
  • Shorthand: DB|px = DB|(p∪x)
• SupportSet(p∪x, DB) = SupportSet(x, DB|p)
  • {x | x mod 6 = 0 ⋀ x ∈ [100]} = {x | x mod 3 = 0 ⋀ x ∈ even([100])}
• An FP-tree|p is equivalent to a DB|p
  • One can be converted to another
  • Next, we illustrate the alg using conditionalDB
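The invariant can be checked directly on a small database. This sketch reuses the Database TDB from the Apriori example, reads [100] as the integers 1 to 100 (an assumption about the notation), and uses illustrative function names.

```python
def conditional_db(db, p):
    """DB|p: the transactions of db that contain every item of p."""
    return {tid: items for tid, items in db.items() if p <= items}

def support_set(x, db):
    """Ids of the transactions in db that contain itemset x."""
    return {tid for tid, items in db.items() if x <= items}

tdb = {10: {"A", "C", "D"}, 20: {"B", "C", "E"}, 30: {"A", "B", "C", "E"}, 40: {"B", "E"}}

p, x = {"B"}, {"C", "E"}
lhs = support_set(p | x, tdb)                  # SupportSet(p ∪ x, DB)
rhs = support_set(x, conditional_db(tdb, p))   # SupportSet(x, DB|p)
print(lhs, rhs)

# The numeric analogy: filter by 6 directly, or by 2 first and then by 3.
hundred = range(1, 101)
multiples_of_6 = {n for n in hundred if n % 6 == 0}
evens = [n for n in hundred if n % 2 == 0]
multiples_of_3_among_evens = {n for n in evens if n % 3 == 0}
```

Conditioning on p first shrinks the data that later steps have to look at, which is what the recursive algorithm exploits.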


FP-tree Essential Idea /1

• Recursive algorithm again!
• FreqItemsets(DB|p):
  • X = FindFrequentItems(DB|p)
  • output { (x p) | x ∈ X }
  • Foreach x in X
    • DB*|px = GetConditionalDB+(DB*|p, x)
    • FreqItemsets(DB*|px)

All frequent itemsets in DB|p belong to one of the following categories:

• patterns ~ xi p (easy task, as only items (not itemsets) are needed)
• patterns ~ ★px1 (obtained via recursion)
• patterns ~ ★px2 (obtained via recursion)
• patterns ~ ★pxi (obtained via recursion)
• patterns ~ ★pxn (obtained via recursion)

No Pain, No Gain

DB|J (minsup = 1)

Java Lisp Scheme Python Ruby

Alice X X

Charlie X X X

• FreqItemsets(DB|J):
  • {P, R} ← FindFrequentItems(DB|J)
  • Output {JP, JR}
  • Get DB*|JP; FreqItemsets(DB*|JP)
  • Get DB*|JR; FreqItemsets(DB*|JR)
  • // Guaranteed no other frequent itemset in DB|J

FP-tree Essential Idea /2

• FreqItemsets(DB|p):
  • If boundary condition, then …
  • X = FindFrequentItems(DB|p)
  • output { (x p) | x ∈ X }: also output each item in X (appended with the conditional pattern)
  • [optional] DB*|p = PruneDB(DB|p, X): remove items not in X; potentially reduce # of transactions (∅ or dup). Improves the efficiency.
  • Foreach x in X
    • DB*|px = GetConditionalDB+(DB*|p, x): also gets rid of items already processed before x → avoid duplicates
    • [optional] if DB*|px is degenerated, then powerset(DB*|px)
    • FreqItemsets(DB*|px)

Lv 1 Recursion

• minsup = 3

DB                    DB*
F C A D G I M P       F C A M P
A B C F L M O         F C A B M
B F H J O W           F B
B C K S P             C B P
A F C E L P M N       F C A M P

X = {F, C, A, B, M, P}

Conditional DBs:

• DB*|P: F C A M P; C B P; F C A M P
• DB*|M (sans P)
• DB*|B (sans MP)
• DB*|A (sans BMP): F C A; F C A; F C A
• DB*|C (sans ABMP)
• DB*|F (sans CABMP)

Grayed items are for illustration purpose only.

Output: F, C, A, B, M, P
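The recursion on conditional databases can be sketched as below, using plain lists instead of an FP-tree. Assumptions made for this illustration: the fixed item order is descending support with ties broken alphabetically (so C precedes F here, unlike the slide's F C A B M P; any fixed order yields the same itemsets), the optional powerset shortcut is omitted, and the function name `mine_frequent_itemsets` is hypothetical.

```python
from collections import Counter

def mine_frequent_itemsets(db, min_sup):
    """Return {frozenset(itemset): support count} by recursing on conditional DBs."""
    global_counts = Counter(item for t in db for item in set(t))
    # Fixed global order: descending support, ties broken alphabetically (assumption).
    ordered = sorted(global_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rank = {item: r for r, (item, _) in enumerate(ordered)}
    out = {}

    def rec(cond_db, suffix):
        counts = Counter(item for t in cond_db for item in t)
        frequent = sorted((i for i in counts if counts[i] >= min_sup), key=rank.get)  # X
        for x in frequent:
            out[frozenset(suffix + (x,))] = counts[x]   # output { (x p) | x in X }
        keep = set(frequent)
        # PruneDB: drop infrequent items, keep each transaction in the fixed order.
        pruned = [sorted((i for i in t if i in keep), key=rank.get) for t in cond_db]
        for x in frequent:
            # DB*|px: transactions containing x, cut down to the items ordered before x.
            cond = [t[:t.index(x)] for t in pruned if x in t]
            cond = [t for t in cond if t]
            if cond:
                rec(cond, suffix + (x,))

    rec([sorted(set(t), key=rank.get) for t in db], ())
    return out

db = [
    "F C A D G I M P".split(),
    "A B C F L M O".split(),
    "B F H J O W".split(),
    "B C K S P".split(),
    "A F C E L P M N".split(),
]
result = mine_frequent_itemsets(db, min_sup=3)
print(sorted("".join(sorted(s)) for s in result))
```

Keeping only the items ordered before x in DB*|px is what prevents the same itemset from being reported through two different recursion paths.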

Lv 2 Recursion on DB*|P

• minsup = 3

DB              DB*
F C A M P       C
C B P           C
F C A M P       C

X = {C}

Output: CP

DB*|C (which is actually FullDB*|CP):
C
C
C

Context = Lv 3 recursion on DB*|CP: DB has only empty sets or X = {} → immediately returns

Lv 2 Recursion on DB*|A (sans …)

• minsup = 3

DB          DB*
F C A       F C
F C A       F C
F C A       F C

X = {F, C}

Output: FA, CA

DB*|C (which is actually FullDB*|CA): F C; F C; F C → further recursion (output: FCA)

DB*|F: F; F; F → boundary case

Different Example: Lv 2 Recursion on DB*|P

• minsup = 2

DB              DB*
F C A M P       F C A
F C B P         F C
F A P           F A

X = {F, C, A}

Output: FP, CP, AP

DB*|A (which is actually FullDB*|AP): F C; F
  X = {F}
  Output: FAP

DB*|C: F; F

DB*|F: F; F

I will give you back the FP-tree

• An FP-tree of DB consists of:
  • A fixed order among items in DB
  • A prefix, threaded tree of sorted transactions in DB
  • Header table: (item, freq, ptr)
• When used in the algorithm, the input DB is always pruned (cf. PruneDB())
  • Remove infrequent items
  • Remove infrequent items in every transaction

FP-tree Example

minsup = 3

TID   Items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o, w}          {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

[Figure: building the FP-tree from the (ordered) frequent items of the table above; children are indented under their parent]

Insert t1

{ }
  f : 1
    c : 1
      a : 1
        m : 1
          p : 1

Insert t2

{ }
  f : 2
    c : 2
      a : 2
        m : 1
          p : 1
        b : 1
          m : 1

Insert all ti

{ }
  f : 4
    c : 3
      a : 3
        m : 2
          p : 2
        b : 1
          m : 1
    b : 1
  c : 1
    b : 1
      p : 1

Header table

Item   freq
f      4
c      4
a      3
b      3
m      3
p      3

Output: f, c, a, b, m, p
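The construction above can be sketched as a prefix tree with per-node counts. Assumptions made for this illustration: transactions are already pruned and sorted in the fixed order f, c, a, b, m, p, each item is a single character, and the header table keeps a plain list of nodes per item in place of the threaded pointers. `FPNode` and `build_fp_tree` are hypothetical names.

```python
class FPNode:
    """One node of the prefix tree: an item, a count, and links to parent/children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(ordered_transactions):
    root = FPNode(None)
    header = {}  # item -> list of nodes carrying that item
    for transaction in ordered_transactions:
        node = root
        for item in transaction:
            child = node.children.get(item)
            if child is None:              # new branch: create and thread the node
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1               # shared prefix: only bump the count
            node = child
    return root, header

ordered = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
root, header = build_fp_tree(ordered)
freq = {item: sum(n.count for n in nodes) for item, nodes in header.items()}
print(freq)
```

Summing the counts of all nodes of an item recovers the freq column of the header table.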

[Figure: the complete FP-tree and header table (f 4, c 4, a 3, b 3, m 3, p 3)]

p's conditional pattern base

  f c a m : 2
  c b : 1

Item counts in the base: f 2, c 3, a 2, b 1, m 2

Output: pc

Cleaned p's conditional pattern base

  c : 2
  c : 1

Conditional FP-tree (with HeaderTable)

{ }
  c : 3

STOP
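A conditional pattern base can also be computed straight from the ordered transactions, as sketched below. This deliberately skips the tree's node links and relies on the earlier statement that an FP-tree|p is equivalent to a DB|p; single-character items and the function name `conditional_pattern_base` are assumptions of the illustration.

```python
from collections import Counter

def conditional_pattern_base(ordered_transactions, item):
    """Prefix paths that precede `item`, each with the number of times it occurs."""
    base = Counter()
    for t in ordered_transactions:
        if item in t:
            prefix = t[:t.index(item)]
            if prefix:
                base[prefix] += 1
    return base

ordered = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

base_p = conditional_pattern_base(ordered, "p")
print(dict(base_p))   # {'fcam': 2, 'cb': 1}

# Item counts inside p's base; only c reaches minsup = 3.
item_counts = Counter()
for prefix, n in base_p.items():
    for i in prefix:
        item_counts[i] += n
print(dict(item_counts))
```

The same call with "m", "b", "a" or "c" gives the pattern bases used for the remaining items.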

FP-tree (p already processed)

{ }
  f : 4
    c : 3
      a : 3
        m : 2
        b : 1
          m : 1
    b : 1
  c : 1
    b : 1

Header table: f 4, c 4, a 3, b 3, m 3

m's conditional pattern base

  f c a : 2
  f c a b : 1

Item counts in the base: f 3, c 3, a 3, b 1

Output: mf, mc, ma

Conditional FP-tree (with HeaderTable)

{ }
  f : 3
    c : 3
      a : 3

gen_powerset

Output: mac, maf, mcf, macf

FP-tree (m and p already processed)

{ }
  f : 4
    c : 3
      a : 3
        b : 1
    b : 1
  c : 1
    b : 1

Header table: f 4, c 4, a 3, b 3

b's conditional pattern base

  f c a : 1
  f : 1
  c : 1

Item counts in the base: f 2, c 2, a 1

STOP

FP-tree (b, m and p already processed)

{ }
  f : 4
    c : 3
      a : 3
  c : 1

Header table: f 4, c 4, a 3

a's conditional pattern base

  f c : 3

Item counts in the base: f 3, c 3

Output: af, ac

Conditional FP-tree (with HeaderTable)

{ }
  f : 3
    c : 3

gen_powerset

Output: acf

FP-tree (a, b, m and p already processed)

{ }
  f : 4
    c : 3
  c : 1

Header table: f 4, c 4

c's conditional pattern base

  f : 3

Item counts in the base: f 3

Output: cf

Conditional FP-tree (with HeaderTable)

{ }
  f : 3

STOP

FP-tree (only f left)

{ }
  f : 4

Header table: f 4

STOP

FP-Growth vs. Apriori: Scalability With the Support Threshold

[Figure: run time (sec., 0 to 100) against support threshold (%, 0 to 3) for two curves, "D1 FP-growth runtime" and "D1 Apriori runtime"; data set T25I20D10K]

Why Is FP-Growth the Winner?

• Divide-and-conquer:
  • decompose both the mining task and DB according to the frequent patterns obtained so far
  • leads to focused search of smaller databases
• Other factors
  • no candidate generation, no candidate test
  • compressed database: FP-tree structure
  • no repeated scan of entire database
  • basic ops: counting local freq items and building sub FP-tree, no pattern search and matching