CS246 Association Rule Mining

Junghoo "John" Cho (UCLA Computer Science)



Association Rule Mining

What is the problem? What is an association rule?


Motivating Problem

If a customer buys “Diet Coke,” is she likely to buy a nutrition bar?
  Useful for arranging store shelves, etc.
Beer and diapers

Life as a parent is tough…


Word of Caution

Famous example: J. B. Rhine at Duke
  Tested students for “extrasensory perception”
  Asked them to guess 10 cards – red or black
  About 1/1000 of them guessed all 10 correctly (pure chance gives 1/2^10 = 1/1024)

If an experiment is repeated many times, some unlikely events happen for purely statistical reasons
  No physical validity


Problem Definition

Input: transaction records (each a set of items)
  T1: Bread, Milk, Apple
  T2: Beer, Chips
  T3: Pants, Brush, Toothpaste, Chopstick
  …
Output: all “association rules”
  Bread, Milk → Apple
  If a customer buys bread and milk, he is likely to buy an apple.


Confidence

Bread → Apple: if a customer buys bread, he is likely to buy an apple.

What does “likely” mean?
  A large fraction of the baskets with bread also have an apple.
  Formally, P{ I1 | I2 , I3 } > c
    c: confidence threshold, say 0.95
    Probability of buying an item given the other items
    If a customer buys I2 , I3 , she buys I1 with more than 95% probability
    “Strength” of the rule

Identify all association rules satisfying the confidence threshold c


Support

Do we really want to find all association rules?
  If we sell only 5 items of a particular product, who cares what it is sold with?
Find association rules only for the sets of items that appear often enough.
  Formally, P{ I1 , I2 , I3 } > s
    s: support threshold
    Fraction of records containing the itemset
    Statistical “significance”
    I1 , I2 , I3 : frequent itemset
Find association rules for frequent itemsets


Problem Definition

Input: transaction records (each a set of items)
Output: all association rules
  I2 , I3 → I1
  with support: P{ I1 , I2 , I3 } > s
  and confidence: P{ I1 | I2 , I3 } > c

Is the difference between confidence and support clear?
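To make the two definitions concrete, here is a minimal Python sketch that estimates support and confidence by counting baskets. The helper names and the four sample transactions are illustrative assumptions, not part of the lecture.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P{consequent | antecedent} as a ratio of two supports.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical sample data for the rule Bread, Milk -> Apple
transactions = [
    {"Bread", "Milk", "Apple"},
    {"Bread", "Milk"},
    {"Bread", "Apple"},
    {"Beer", "Chips"},
]
rule_support = support(transactions, {"Bread", "Milk", "Apple"})
rule_confidence = confidence(transactions, {"Bread", "Milk"}, {"Apple"})
```

Here one basket in four contains all three items (support), while one of the two baskets with bread and milk also has an apple (confidence), so the two measures differ.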


Basic Algorithm?

Step 1: Find all frequent itemsets
  P{ I1 , I2 , I3 } > s
Step 2: From the frequent (“large”) itemsets, identify high-confidence rules
  P{ I1 | I2 , I3 } > c


Step 1: Frequent Itemsets

Find all $\{I_1, I_2, I_3, \ldots, I_k\}$ with $P\{I_1, I_2, I_3, \ldots, I_k\} > s$: frequent itemsets

More informally, find all sets of items appearing in more than a given number of transactions

Is it really difficult? How can we solve it?


Naïve Approach

Keep counters for all subsets of items
  Items {A, B, C}: counters for {A}, {B}, {C}, {AB}, {BC}, {AC}, {ABC}
Scan all transaction records and increment the matching counters
  Transaction {A, C}: {A}++, {C}++, {AC}++
What is difficult?
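The naïve approach can be sketched in a few lines of Python. This version is an assumption-laden illustration: it creates counters lazily (only for subsets that actually occur) instead of allocating all of them up front, but it still enumerates every non-empty subset of every transaction.

```python
from collections import Counter
from itertools import combinations


def naive_counts(transactions):
    """Count every non-empty subset of every transaction."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[subset] += 1
    return counts


# Two hypothetical transactions over the items A, B, C
counts = naive_counts([{"A", "C"}, {"A", "B", "C"}])
```

A transaction with m items touches 2^m - 1 counters, which is where the approach breaks down.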


Main Challenge?

Problem: 2^n subsets for n items
  1000 items: 2^1000 ≈ 10^301
  Clearly not feasible
Lesson: When the data size is large, even a simple problem can be very difficult.
What was their main idea?


Main Idea of Apriori Algorithm

If (A, B, C) is a frequent itemset, (A, B) is a frequent itemset
  $P\{A, B, C\} \le P\{A, B\}$
If (A, B) is not a frequent itemset, (A, B, C) cannot be a frequent itemset
Consider (A, B, C) only if all its subsets are frequent itemsets


Apriori Algorithm

1. L1 = { frequent 1-itemsets }, k = 1
2. Candidate set generation: generate candidate set Ck+1 using Lk
   Candidate set Ck: potentially frequent k-itemsets
   {A, B, C} is a candidate iff all its subsets {A, B}, {B, C} and {A, C} are frequent itemsets
3. Scanning: check whether the candidate sets are actually frequent
4. Increase k by 1, and go to step 2 (stop when no candidate turns out to be frequent)


Example

Items: {A, B, C, D}
Transactions: {A, B}, {A, D}, {A, B, C}, {B}
Support: 0.5 = 2 transactions


Example

Candidate 1-itemsets: A, B, C, D
Transactions: {A, B}, {A, D}, {A, B, C}, {B}
Candidate 2-itemsets: {A, B}
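The example can be traced with a small Python sketch of Apriori. This is an illustrative implementation, not the paper's code: it keeps everything in memory, uses a simple pairwise join for candidate generation, and treats an itemset as frequent when it appears in at least `min_count` transactions (an assumption that matches "0.5 = 2 transactions" here).

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the data: count each candidate, keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {item for t in transactions for item in t}
    level = count({frozenset([item]) for item in items})  # L1
    frequent = dict(level)
    k = 1
    while level:
        # Join pairs of frequent k-itemsets into (k+1)-itemsets and keep a
        # candidate only if all of its k-subsets are frequent.
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in level for s in combinations(union, k)
            ):
                candidates.add(union)
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent


result = apriori([{"A", "B"}, {"A", "D"}, {"A", "B", "C"}, {"B"}], min_count=2)
```

On this data, C and D are dropped after the first pass, so {A, B} is the only 2-itemset that is ever counted.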


Why Does Apriori Work?

Typical grocery-store scenario:
  100,000 different items
  10M baskets with 10 items each (10^8 item occurrences)
  support = 0.01
Q: How many items can Apriori eliminate?
A: At most 1000 items remain (at most 1% of the items)
  A frequent item must appear at least 0.01 * 10^7 = 10^5 times
  10^8 item occurrences in total, so at most 10^8 / 10^5 = 1000 frequent items


Basic Algorithm

Step 1: Find all frequent itemsets
  P{ I1 , I2 , I3 } > s
  Apriori algorithm
Step 2: From the frequent (“large”) itemsets, identify high-confidence rules
  P{ I1 | I2 , I3 } > c


Step 2: High Confidence Rules

In principle, the second step is straightforward:

$$P\{I_1 \mid I_2, \ldots, I_k\} = \frac{P\{I_1, I_2, \ldots, I_k\}}{P\{I_2, \ldots, I_k\}}$$

We already estimated the values $P\{I_1, I_2, I_3, \ldots, I_k\}$ in the first step
Piece of cake. Simple division!


More On Step 2

Q: But given a frequent k-itemset, how many potential rules?
A: About 2^k (exponential in k)!
Any efficient algorithm?
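A brute-force version of step 2 can be sketched as follows. It assumes a hypothetical `support` dictionary (as step 1 would produce) that holds the count of the itemset and of every non-empty subset, and it tries each non-empty proper subset as the left-hand side, which gives 2^k - 2 rules for a k-itemset.

```python
from itertools import combinations


def rules_from_itemset(itemset, support, min_conf):
    """Return (antecedent, consequent, confidence) for the rules of one itemset.

    `support` maps frozenset -> count (or fraction) and must contain the
    itemset and all of its non-empty subsets.
    """
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            conf = support[itemset] / support[lhs]  # the "simple division"
            if conf >= min_conf:
                rules.append((lhs, itemset - lhs, conf))
    return rules


# Hypothetical counts: A in 3 baskets, B in 3 baskets, both in 2
support = {frozenset("A"): 3, frozenset("B"): 3, frozenset("AB"): 2}
rules = rules_from_itemset({"A", "B"}, support, min_conf=0.6)
```

Each confidence is a single dictionary lookup and division; the cost comes only from the number of subsets to try.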


Questions (1)

Is support pruning valid?
  What about Castillo de Ygay ($5000 wine) → Caviar?
  Even if we only sell 100 items, significant profit…
Technically very challenging
  Finding all association rules without support pruning
  Topic of the next paper


Questions (2)

Is P{Beer|Diaper} > 0.95 really meaningful?
  What if beer appears in 95% of baskets?

Interest: P{Beer, Diaper} / ( P{Beer} P{Diaper} )

Implication strength: Beer → Diaper == ~(Beer, ~Diaper)
  P{~Diaper} P{Beer} / P{~Diaper, Beer}
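A short Python sketch shows why high confidence alone can mislead. The probabilities are made-up numbers chosen so that beer and diapers are statistically independent; the function names are illustrative.

```python
def interest(p_xy, p_x, p_y):
    """P{X, Y} / (P{X} P{Y}); equals 1 when X and Y are independent."""
    return p_xy / (p_x * p_y)


def implication_strength(p_x, p_y, p_x_not_y):
    """P{X} P{~Y} / P{X, ~Y} for the rule X -> Y.

    Assumes P{X, ~Y} > 0, i.e. the rule is violated at least once.
    """
    return p_x * (1 - p_y) / p_x_not_y


# Hypothetical: beer in 95% of baskets, diapers in 10%, independent of each other
p_beer, p_diaper = 0.95, 0.10
p_both = p_beer * p_diaper
conf = p_both / p_diaper                      # P{Beer | Diaper}
lift = interest(p_both, p_beer, p_diaper)
strength = implication_strength(p_beer, p_diaper, p_beer * (1 - p_diaper))
```

The confidence comes out at about 0.95 even though the two items are unrelated, while interest and implication strength both come out at about 1, which signals no real association.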


Follow-up Works

Candidate set generation still costly
  Iceberg queries
  No candidate set generation stage
Minimizing number of passes


Mining without Support Pruning

What is the problem?
  How can we identify “Castillo de Ygay → Caviar”?
  Apriori is efficient only for frequent items
Problem definition
  Data mining: low support, high correlation
  Finding rare, but very similar items


Matrix Representation

Typical scenario
  100,000 items
  10M baskets with 10 items each
Matrix
  Columns = items
  Rows = baskets
  (i, j) = 1 if item cj is in basket ri
Very sparse: almost all 0’s (about 0.01% 1’s)


Matrix Example

Basket       a b c d e f g
{a, b}       1 1 0 0 0 0 0
{a, f}       1 0 0 0 0 1 0
{b, d, g}    0 1 0 1 0 0 1
{b, e}       0 1 0 0 1 0 0
{c, d, e}    0 0 1 1 1 0 0
{a}          1 0 0 0 0 0 0
{e, f}       0 0 0 0 1 1 0


Association Rule and Similarity

Think of column Ci as the set of rows with 1

Association rule (confidence)
  $P\{C_2 \mid C_1\} = \frac{|C_1 \cap C_2|}{|C_1|}$

Similarity
  $\mathrm{Sim}(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$


Example

C1 C2

0 1

1 1

1 0

1 1

1 0

0 0

Sim(C1, C2) = 2/5

P(C2|C1) = 2/4
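The two numbers above can be checked with a minimal Python sketch that treats each column as the set of row indices holding a 1. The function names are illustrative, and both functions assume the columns are not all zeros.

```python
def rows_with_one(column):
    """Set of row indices where the column has a 1."""
    return {i for i, bit in enumerate(column) if bit == 1}


def similarity(c1, c2):
    """|C1 ∩ C2| / |C1 ∪ C2|"""
    a, b = rows_with_one(c1), rows_with_one(c2)
    return len(a & b) / len(a | b)


def confidence(c1, c2):
    """P{C2 | C1} = |C1 ∩ C2| / |C1|"""
    a, b = rows_with_one(c1), rows_with_one(c2)
    return len(a & b) / len(a)


# The two columns from the example, top row first
C1 = [0, 1, 1, 1, 1, 0]
C2 = [1, 1, 0, 1, 0, 0]
sim = similarity(C1, C2)     # 2/5
conf = confidence(C1, C2)    # 2/4
```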


Problem Definition

Find all highly similar pairs
  All Ci, Cj with Sim(Ci, Cj) > s*
  s*: similarity threshold


Why Similarity (not Confidence)?

A1: Techniques work only for similarity
A2: High similarity implies high confidence
  |C1 ∩ C2| / |C1 ∪ C2| ≤ |C1 ∩ C2| / |C1|
  All similar pairs are of high confidence


Assumption

Matrix does not fit into main memory
Number of columns is relatively small
  Can store some information in main memory for each item
Number of rows can be very big
Sparse data: mostly 0’s in the matrix


Key Idea?

“Compress” the matrix into a smaller one
  Load the compressed matrix into main memory
Find high-similarity pairs from the compressed matrix
  Much easier than disk-based computation


Min-Hash? LSH? Hamming?

What are they for?
  Min-Hash: compression
  LSH: similarity pair computation
  Hamming LSH: compression + similarity


How To Compress? (1)

“Hash” each column C to a small signature Sig(C) such that
  Sim(C1, C2) is the same as the “similarity” of Sig(C1) and Sig(C2)
  Sig(C) is small enough that we can store the “compressed” matrix in main memory


How To Compress? (2)

Idea 1
  Pick 100 random rows
  Sig(C1) = the 100 bits of the selected rows
  Would it work?
Idea 1 does not work
  The matrix is sparse
  Most of the columns will become “0000…0”
  But the columns are different because their 1’s are in different rows


Min-Hashing

Imagine the rows are permuted randomly
“Hash” function h(C)
  The number of the first row (in the permuted order) with a 1 in column C


Example

C1 C2 C3

1 1 0 1

2 0 1 1

3 1 0 0

4 0 1 0

5 1 0 0

Permutation = (45123)

S1 S2 S3

5 4 1


Important Property

The probability that h(C1) = h(C2) is the same as Sim(C1, C2)

Why?


Row Types

Given C1 and C2, rows can be classified as

C1 C2

a 1 1

b 1 0

c 0 1

d 0 0

a = # of rows of type a (likewise b, c, d)
Sim(C1, C2) = a / (a + b + c)
Q: What’s P{ h(C1) = h(C2) }?
A: a / (a + b + c)
  Look down C1 and C2 (in the permuted order) until we see a 1 in either column
  If that row is of type a, then h(C1) = h(C2)
  If it is of type b or c, then h(C1) ≠ h(C2)


Min-Hash Signature

Pick (say) 100 random permutations of the rows
Get the Min-Hash value from each permutation
Sig(C) = the list of 100 Min-Hash values
Sim( Sig(C1), Sig(C2) ) = fraction of permutations for which the Min-Hash values agree


Example

C1 C2 C3

1 1 0 1

2 0 1 1

3 1 0 0

4 1 0 1

5 0 1 0

Perm1 = (12345)
Perm2 = (54321)
Perm3 = (34512)

Signatures:
         S1   S2   S3
Perm1     1    2    1
Perm2     4    5    4
Perm3     3    5    4

Similarities:
         1-2   1-3   2-3
Matrix   0     0.5   0.25
Sig      0     0.67  0
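The signatures and similarities in this example can be reproduced with a small Python sketch. It uses the three explicit permutations from the example rather than random ones, and the function names are illustrative.

```python
# Rows 1..5 of the example matrix, top to bottom
MATRIX = {
    "C1": [1, 0, 1, 1, 0],
    "C2": [0, 1, 0, 0, 1],
    "C3": [1, 1, 0, 1, 0],
}
# Each permutation lists the original row numbers in their new order
PERMS = [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [3, 4, 5, 1, 2]]


def minhash(column, perm):
    """Number of the first row, in permuted order, with a 1 in the column."""
    for row in perm:
        if column[row - 1] == 1:
            return row
    return None  # all-zero column; does not occur in this example


def signature(column, perms):
    return [minhash(column, p) for p in perms]


def sig_similarity(s1, s2):
    """Fraction of permutations on which the two Min-Hash values agree."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)


sigs = {name: signature(col, PERMS) for name, col in MATRIX.items()}
sim_13 = sig_similarity(sigs["C1"], sigs["C3"])
```

With only three permutations the estimate is coarse (0.67 against a true similarity of 0.5 for columns 1 and 3); more permutations tighten it.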


Basic Idea

“Compress” the matrix into a smaller one
  Min-Hash signature
Find high-similarity pairs from the compressed matrix
  How?


Problem

From the signature matrix (which fits into main memory), identify all similar pairs

Assuming 100,000 items
  Potentially 10^10 similar pairs?
  One counter per pair? No way

How?


Locality Sensitive Hashing

A technique to limit the number of similar pairs to consider
Approach
  Using LSH, identify “candidate similar pairs”
  Scan the Min-Hash signature matrix for verification


Locality Sensitive Hashing

Partition the signature matrix into l bands of r rows each

[Figure: signature matrix with columns C1 to C7 and rows h1 to h6, split into l bands (band 1, band 2, …) of r rows each]


Locality Sensitive Hashing

Hash each column in each band into buckets

[Figure: signature matrix with columns C1 to C7 and rows h1 to h6]


Locality Sensitive Hashing

Two columns are candidate pair if they hash to the same bucket in any band

[Figure: signature matrix with columns C1 to C7 and rows h1 to h6, with a candidate pair marked]


Locality Sensitive Hashing

Final verification
  After identifying candidates, verify each candidate pair (Ci, Cj) by examining Sig(Ci) and Sig(Cj) for similarity
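The banding, bucketing, and verification steps can be sketched in Python as below. This is a simplified illustration under stated assumptions: the band's values themselves are used as the bucket key (so two columns share a bucket exactly when they are identical in that band), and the three 4-value signatures are made up.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, rows_per_band):
    """signatures: non-empty {column name: list of Min-Hash values}, equal lengths."""
    length = len(next(iter(signatures.values())))
    candidates = set()
    for start in range(0, length, rows_per_band):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            # Columns identical within this band land in the same bucket.
            buckets[tuple(sig[start:start + rows_per_band])].append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates


def verify(signatures, pairs, threshold):
    """Keep candidate pairs whose full signatures agree often enough."""
    kept = set()
    for a, b in pairs:
        agree = sum(x == y for x, y in zip(signatures[a], signatures[b]))
        if agree / len(signatures[a]) >= threshold:
            kept.add((a, b))
    return kept


# Hypothetical signatures: 2 bands of 2 rows each
sigs = {"C1": [1, 4, 3, 7], "C2": [2, 5, 5, 8], "C3": [1, 4, 4, 7]}
pairs = candidate_pairs(sigs, rows_per_band=2)
similar = verify(sigs, pairs, threshold=0.7)
```

Only pairs that collide in some band are ever compared in full, which is what keeps the number of comparisons far below one per pair of columns.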


Example

100,000 columns
Signatures of 100 Min-Hash integers (4 bytes each)
Total signature table size
  4 x 100 x 100,000 = 40 MB (not bad)
Potential similar pairs
  100,000 x 100,000 / 2 = 5,000,000,000 (too many!)
20 bands of 5 integers per band
Compute false positive and false negative rates


False Negative: 80% Similar

Probability that C1, C2 are identical in one band
  0.8^5 = 0.328
Probability that C1, C2 are not identical in any of the 20 bands
  (1 – 0.328)^20 = 0.00035
We miss only about 1/3000 of the 80%-similar column pairs!
  Very few false negatives


False Positive: 40% Similar

Probability that C1, C2 are identical in one band
  0.4^5 = 0.01
Probability that C1, C2 are identical in at least one of the 20 bands
  1 – (1 – 0.01)^20 = 0.18
Only about 20% of the 40%-similar pairs are identified as candidate pairs
  False positives are much rarer when similarity << 40%
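Both calculations follow one formula, 1 – (1 – s^r)^l for similarity s, r rows per band, and l bands. A small Python sketch (the function name is illustrative) makes it easy to try other settings of r and l:

```python
def candidate_probability(sim, rows_per_band, bands):
    """Probability that two columns are identical in at least one band,
    given that each Min-Hash value agrees with probability `sim`."""
    return 1 - (1 - sim ** rows_per_band) ** bands


# 20 bands of 5 rows, as in the example
miss_80 = 1 - candidate_probability(0.8, 5, 20)  # false negatives at 80% similarity
hit_40 = candidate_probability(0.4, 5, 20)       # false positives at 40% similarity
hit_20 = candidate_probability(0.2, 5, 20)       # far fewer at 20% similarity
```

Without rounding 0.4^5 to 0.01, the 40% case comes out slightly above 0.18, still roughly one pair in five.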


LSH Summary

Algorithm for identifying similar signature-column pairs
  Split the signature matrix into l bands of r rows each
  Identify almost all similar pairs and a small number of dissimilar pairs
  By adjusting r and l


Hamming LSH

Life is simpler if the matrix has about 50% 1’s
  We can take a random collection of rows
Let us make the matrix denser!
How?
  Construct a series of matrices by OR-ing together pairs of rows
  0’s disappear over time…


Example

00010010
  OR adjacent pairs →
0101
  OR adjacent pairs →
11
  OR adjacent pairs →
1

More 1’s


Hamming LSH

Construct all matrices
  No more than log n matrices for n rows
  Total number of rows in all matrices is 2n
  Twice as much work as the original matrix
Identify similar columns from each matrix
  From each matrix, apply LSH to the columns with density between 30% and 70% 1’s
  Report similar columns
Note that similar columns have similar densities, so they will be considered together in at least one matrix
  No point ever comparing columns whose numbers of 1’s are very different
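The row-folding step can be sketched for a single column in Python. The helper names are illustrative, and padding an odd-length column with a 0 row is an assumption the slides do not address.

```python
def or_fold(column):
    """OR together adjacent pairs of rows, halving the column length."""
    if len(column) % 2:
        column = column + [0]  # assumption: pad odd lengths with a 0 row
    return [column[i] | column[i + 1] for i in range(0, len(column), 2)]


def fold_series(column):
    """The column at every level, from the original down to a single row."""
    series = [column]
    while len(series[-1]) > 1:
        series.append(or_fold(series[-1]))
    return series


def density(column):
    """Fraction of 1's in the column."""
    return sum(column) / len(column)


# The column from the earlier OR example
series = fold_series([0, 0, 0, 1, 0, 0, 1, 0])
# Levels at which this column would be handed to LSH (30% to 70% 1's)
levels = [i for i, col in enumerate(series) if 0.3 <= density(col) <= 0.7]
```

For this column only the first folded level (0101, 50% 1's) falls in the 30% to 70% range.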


Summary

Apriori, Min-Hash, LSH, Hamming LSH
Finding frequent pairs?
  Apriori
Finding similar pairs?
  Min-Hash + LSH, or Hamming LSH
Min-Hash: sparse matrix compression
LSH: similar signature identification
Hamming LSH: amplification of 1’s


Questions

Can we extend the techniques to multiple-column rules such as C1, C2 → C3?


Any Questions?



AprioriTid (1)

Q: What was the main idea?
A: Some transactions may not need to be checked
  Candidate itemsets: {A, B}, {A, C}
  Transaction: {A, D, E, F}?
  We may eliminate many transactions
Q: How do we know {A, D, E, F} is not necessary?
A: When we check {A, B} and {A, C}, we can tell that {A, D, E, F} does not contain any candidate set


AprioriTid (2)

In each pass, substitute each transaction with the set of candidate itemsets it contains
  Candidate set: {A, B, C}, {A, C, D}, {A, C, M}
  Transaction T1: {A, B, C, D, F, G} → T1: {{A, B, C}, {A, C, D}}
Candidate itemset {A, C, D} appears in T1 if {A, C} and {A, D} appear in T1
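The containment check can be sketched in Python as below. Itemsets are represented as sorted tuples, and the set of pairs recorded for T1 in the previous pass is a hypothetical choice made for illustration; the function name is not from the paper.

```python
def contained_candidates(prev_itemsets, candidates):
    """Candidates (sorted k-tuples) contained in a transaction, judged only
    from the (k-1)-itemsets recorded for that transaction in the last pass."""
    result = set()
    for c in candidates:
        without_last = c[:-1]                  # e.g. (A, C) for (A, C, D)
        without_second_last = c[:-2] + c[-1:]  # e.g. (A, D) for (A, C, D)
        if without_last in prev_itemsets and without_second_last in prev_itemsets:
            result.add(c)
    return result


# Hypothetical: pairs recorded for T1 = {A, B, C, D, F, G} in the previous pass
t1_pairs = {("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("C", "D")}
candidates = [("A", "B", "C"), ("A", "C", "D"), ("A", "C", "M")]
t1_triples = contained_candidates(t1_pairs, candidates)
```

The original transaction is never read in this pass; a transaction whose result is empty can be dropped from later passes.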


AprioriTid (3)

Q: Advantage?
A: Many transactions/items may be eliminated
  Especially in later passes
Q: Disadvantage?
A: A transaction may be blown up
  T1: {A, B, C, D} → T1: {{A, B, C}, {A, B, D}}
  Why not just eliminate “infrequent items”?


AprioriHybrid

In earlier passes, use Apriori
In later passes, use AprioriTid
Switching criterion
  Does the generated set of transactions fit in main memory?
  $\sum_{c \in C_k} \mathrm{support}(c)$


History of the paper

Earlier SIGMOD93 paper (AIS algorithm)
  Very difficult to read. Poor organization
  Did not use the “obvious” pruning criteria
  Very naïve and simple heuristics
Techniques in the paper may not be very important
  Much more efficient algorithms were proposed the following year
Even great research starts with small ideas
  As you can see from the history
Learn how a “simple” idea can change things…