Association Rule Mining
Instructor: Qiang Yang
Slides from Jiawei Han and Jian Pei
And from Introduction to Data Mining by Tan, Steinbach, Kumar
Frequent-pattern mining methods
What Is Frequent Pattern Mining?
• Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
• Frequent pattern mining: finding regularities in data
  • What products were often purchased together?
  • What are the subsequent purchases after buying a PC?
Why Is Frequent Pattern Mining an Essential Task in Data Mining?
• Foundation for many essential data mining tasks
  • Association, correlation, causality
  • Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  • Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
• Broad applications
  • Basket data analysis, cross-marketing, catalog design, sale campaign analysis
  • Web log (click stream) analysis, DNA sequence analysis, etc.
Basic Concepts: Frequent Patterns and Association Rules
• Itemset X = {x1, …, xk}
• Find all the rules X → Y with minimum confidence and support
  • support, s: probability that a transaction contains X ∪ Y
  • confidence, c: conditional probability that a transaction having X also contains Y
• Let min_support = 50%, min_conf = 50%:
  • A → C (50%, 66.7%)
  • C → A (50%, 100%)

[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]

Transaction-id  Items bought
10              A, B, C
20              A, C
30              A, D
40              B, E, F
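
The two measures can be checked against the table with a few lines of Python. This is only a minimal sketch: the helper names `support` and `confidence` are illustrative and do not come from the slides.

```python
# Transactions from the slide's table (Transaction-id -> items bought).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)

def confidence(lhs, rhs, db):
    """Estimate Pr(rhs | lhs) as support(lhs U rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support({"A", "C"}, transactions))                  # support of A -> C
print(round(confidence({"A"}, {"C"}, transactions), 3))   # confidence of A -> C
print(confidence({"C"}, {"A"}, transactions))             # confidence of C -> A
```

Both rules share the same support because support depends only on the combined itemset {A, C}; only the confidence depends on the rule's direction.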
Concept: Frequent Itemsets
Outlook   Temperature  Humidity  Play
sunny     hot          high      no
sunny     hot          high      no
overcast  hot          high      yes
rainy     mild         high      yes
rainy     cool         normal    yes
rainy     cool         normal    no
overcast  cool         normal    yes
sunny     mild         high      no
sunny     cool         normal    yes
rainy     mild         normal    yes
sunny     mild         normal    yes
overcast  mild         high      yes
overcast  hot          normal    yes
rainy     mild         high      no

• Minimum support = 2
  • {sunny, hot, no}
  • {sunny, hot, high, no}
  • {rainy, normal}
• Min support = 3?
• How strong is {sunny, no}?
  • Count =
  • Percentage =
Concept: Itemset Rules
• {sunny, hot, no} = {Outlook=sunny, Temp=hot, Play=no}
• Generate a rule: Outlook=sunny and Temp=hot → Play=no
• How strong is this rule?
  • Support of the rule = support of the itemset {sunny, hot, no}
    • Expressed either in count form (2) or in percentage form (Pr({sunny, hot, no}) = 2/14)
  • Confidence = Pr(Play=no | {Outlook=sunny, Temp=hot})
• In general, for LHS → RHS:
  Confidence = Pr(RHS | LHS) = count(LHS and RHS) / count(LHS)
• What is the confidence of Outlook=sunny → Play=no?
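
As a rough sketch, the count-based confidence formula can be evaluated directly on the 14-row weather table. The names `ATTRS`, `count` and `rule_confidence` are hypothetical helpers introduced only for this illustration.

```python
# The 14 rows of the weather table: (Outlook, Temp, Humidity, Play).
ATTRS = ("Outlook", "Temp", "Humidity", "Play")
rows = [
    ("sunny", "hot", "high", "no"),
    ("sunny", "hot", "high", "no"),
    ("overcast", "hot", "high", "yes"),
    ("rainy", "mild", "high", "yes"),
    ("rainy", "cool", "normal", "yes"),
    ("rainy", "cool", "normal", "no"),
    ("overcast", "cool", "normal", "yes"),
    ("sunny", "mild", "high", "no"),
    ("sunny", "cool", "normal", "yes"),
    ("rainy", "mild", "normal", "yes"),
    ("sunny", "mild", "normal", "yes"),
    ("overcast", "mild", "high", "yes"),
    ("overcast", "hot", "normal", "yes"),
    ("rainy", "mild", "high", "no"),
]

def count(conditions, data):
    """Number of rows matching every Attribute=value pair in `conditions`."""
    return sum(
        all(row[ATTRS.index(attr)] == value for attr, value in conditions.items())
        for row in data
    )

def rule_confidence(lhs, rhs, data):
    """count(LHS and RHS) / count(LHS)."""
    return count({**lhs, **rhs}, data) / count(lhs, data)

lhs = {"Outlook": "sunny", "Temp": "hot"}
rhs = {"Play": "no"}
print(count({**lhs, **rhs}, rows))        # support of the rule, in count form
print(rule_confidence(lhs, rhs, rows))    # confidence of the rule
print(rule_confidence({"Outlook": "sunny"}, {"Play": "no"}, rows))
```

The last line evaluates the rule asked about in the final bullet, Outlook=sunny → Play=no.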
Frequent Patterns
• Patterns = itemsets {i1, i2, …, in}, where each item is a pair (Attribute=value)
• Frequent patterns
  • Itemsets whose support >= minimum support
  • Support = count(itemset) / count(database)
Frequent Itemset Generation
[Figure: itemset lattice over the items A, B, C, D, E]
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
• Given d items, there are 2^d possible candidate itemsets
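
The size of the lattice can be confirmed by enumerating it level by level. The sketch below uses only the standard-library `itertools.combinations`; the variable names are illustrative.

```python
from itertools import combinations

items = ["A", "B", "C", "D", "E"]

# Level k of the lattice holds all k-itemsets; level 0 is the empty set ("null").
lattice = [
    ["".join(combo) for combo in combinations(items, k)]
    for k in range(len(items) + 1)
]

for k, level in enumerate(lattice):
    print(k, len(level), level if k else ["null"])

total = sum(len(level) for level in lattice)
print(total, 2 ** len(items))  # both counts include the empty itemset
```

The exponential growth of this count is why the later slides look for ways to avoid generating most of the lattice.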
Max-patterns
• Max-pattern: a frequent pattern without a proper frequent super-pattern
  • BCDE and ACD are max-patterns
  • BCD is not a max-pattern

Min_sup = 2

Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F
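
A brute-force check of this example is sketched below. It enumerates every subset of the item universe, which is only feasible for toy data, and the function names are assumptions made for illustration rather than part of the slides.

```python
from itertools import combinations

db = [set("ABCDE"), set("BCDE"), set("ACDF")]
min_sup = 2

def frequent_itemsets(db, min_sup):
    """Brute force: test every non-empty subset of the item universe."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            sup = sum(set(cand) <= t for t in db)
            if sup >= min_sup:
                result[frozenset(cand)] = sup
    return result

def maximal(frequent):
    """Keep the frequent itemsets that have no proper frequent superset."""
    return {x for x in frequent if not any(x < y for y in frequent)}

freq = frequent_itemsets(db, min_sup)
max_patterns = maximal(freq)
print(sorted("".join(sorted(m)) for m in max_patterns))
```

BCD is frequent here but is dropped by `maximal` because its superset BCDE is also frequent.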
Maximal Frequent Itemset
• An itemset is maximal frequent if none of its immediate supersets is frequent

[Figure: the itemset lattice below, with a border separating the infrequent itemsets from the frequent ones and the maximal itemsets marked]
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
Frequent Max Patterns
• Succinct expression of frequent patterns
  • Let {a, b, c} be frequent
  • Then {a, b}, {b, c}, {a, c} must also be frequent
  • Then {a}, {b}, {c} must also be frequent
• By writing down {a, b, c} once, we save lots of computation
• Max pattern
  • If {a, b, c} is a frequent max pattern, then {a, b, c, x} is NOT a frequent pattern, for any other item x.
Find Frequent Max Patterns
Outlook   Temperature  Humidity  Play
sunny     hot          high      no
sunny     hot          high      no
overcast  hot          high      yes
rainy     mild         high      yes
rainy     cool         normal    yes
rainy     cool         normal    no
overcast  cool         normal    yes
sunny     mild         high      no
sunny     cool         normal    yes
rainy     mild         normal    yes
sunny     mild         normal    yes
overcast  mild         high      yes
overcast  hot          normal    yes
rainy     mild         high      no

• Minimum support = 2
  • {sunny, hot, no} ??
Closed Patterns
• An itemset is closed if none of its immediate supersets has the same support as the itemset
• {a, b}, {a, b, d}, {a, b, c} are closed patterns
  • But {a, b} is not a max pattern
• See where changes happen
• Reduce # of patterns and rules
• N. Pasquier et al. In ICDT’99

TID  Items
10   a, b, c
20   a, b, c
30   a, b, d
40   a, b, d
50   c, e, f
Maximal vs Closed Itemsets
TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: itemset lattice annotated with transaction IDs. The digits beside an itemset are the IDs of the transactions that contain it; "-" marks an itemset not supported by any transaction.]
null
A:124 B:123 C:1234 D:245 E:345
AB:12 AC:124 AD:24 AE:4 BC:123 BD:2 BE:3 CD:24 CE:34 DE:45
ABC:12 ABD:2 ABE:- ACD:24 ACE:4 ADE:4 BCD:2 BCE:3 BDE:- CDE:4
ABCD:2 ABCE:- ABDE:- ACDE:4 BCDE:-
ABCDE:-
Maximal vs Closed Frequent Itemsets
[Figure: the same transaction-ID-annotated itemset lattice, with the closed frequent itemsets highlighted]
• Minimum support = 2
• # Closed = 9
• # Maximal = 4
• Closed and maximal: CE, DE, ABC, ACD
• Closed but not maximal: C, D, E, AC, BC
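
The two counts can be reproduced from the five transactions with a brute-force sketch. It is written for this toy example only; `immediate_supersets` and the other names are illustrative.

```python
from itertools import combinations

db = {1: set("ABC"), 2: set("ABCD"), 3: set("BCE"), 4: set("ACDE"), 5: set("DE")}
min_sup = 2
items = sorted(set().union(*db.values()))

# Support count of every non-empty itemset (brute force over the whole lattice).
support = {
    frozenset(c): sum(set(c) <= t for t in db.values())
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
}
frequent = {x: s for x, s in support.items() if s >= min_sup}

def immediate_supersets(x):
    """All itemsets obtained by adding exactly one item to x."""
    return [x | {i} for i in items if i not in x]

# Closed: no immediate superset has the same support.
closed = {x for x, s in frequent.items()
          if all(support[y] < s for y in immediate_supersets(x))}
# Maximal: no immediate superset is frequent.
maximal = {x for x in frequent
           if all(support[y] < min_sup for y in immediate_supersets(x))}

print(len(closed), len(maximal))
print(sorted("".join(sorted(x)) for x in closed - maximal))
```

Every maximal frequent itemset is also closed, so `maximal` is a subset of `closed`.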
Note on Closed Patterns
• Closed patterns have no need to specify the minimum support
  • Given a dataset, we can find a set of closed patterns from it, so that for any minimum support value, we can immediately find the set of patterns (a subset of the closed patterns).
• Closed frequent patterns
  • Both closed and above the min support
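
One way to see this is that the closed patterns, computed once without any threshold, already determine every support count: the support of an itemset equals the largest support among its closed supersets. The sketch below illustrates this on the five-transaction table from the Closed Patterns slide; the function names are assumptions, and the closure test (an itemset equals the intersection of the transactions containing it) is used as an equivalent way to check closedness.

```python
from itertools import combinations

db = [set("abc"), set("abc"), set("abd"), set("abd"), set("cef")]
items = sorted(set().union(*db))

def sup(x):
    """Support count of itemset x."""
    return sum(x <= t for t in db)

def closure(x):
    """Intersection of all transactions containing x (None if unsupported)."""
    covering = [t for t in db if x <= t]
    return frozenset(set.intersection(*covering)) if covering else None

# Closed patterns, computed once with no minimum support.
closed = {}
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        x = frozenset(c)
        if closure(x) == x:
            closed[x] = sup(x)

def frequent_closed(min_sup):
    """Closed frequent patterns for any threshold: just filter the closed set."""
    return {c: s for c, s in closed.items() if s >= min_sup}

def support_from_closed(x):
    """Support of any itemset, recovered from the closed patterns alone."""
    return max((s for c, s in closed.items() if x <= c), default=0)

print({"".join(sorted(c)): s for c, s in closed.items()})
print(sorted("".join(sorted(c)) for c in frequent_closed(2)))
print(support_from_closed(frozenset("a")))
```

Changing the threshold only changes the filter in `frequent_closed`; the closed set itself does not need to be recomputed.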
Maximal vs Closed Itemsets
[Figure: nested sets. Maximal Frequent Itemsets are contained in Closed Frequent Itemsets, which are contained in Frequent Itemsets]
Mining Association Rules—an Example
Min. support 50%, min. confidence 50%

Transaction-id  Items bought
10              A, B, C
20              A, C
30              A, D
40              B, E, F

Frequent pattern  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%

For rule A → C:
  support = support({A} ∪ {C}) = 50%
  confidence = support({A} ∪ {C}) / support({A}) = 66.7%
Method 1: Apriori: A Candidate Generation-and-Test Approach
• Any subset of a frequent itemset must be frequent
  • If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  • Every transaction having {beer, diaper, nuts} also contains {beer, diaper}
• Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested!
• Method: generate length-(k+1) candidate itemsets from length-k frequent itemsets, and test the candidates against the DB
• Performance studies show its efficiency and scalability
• Agrawal & Srikant 1994; Mannila et al. 1994
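
The generate step can be sketched as a join followed by a prune. The function name `apriori_gen` and the representation of itemsets as sorted tuples are assumptions made for this illustration, not details given on the slide.

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Generate (k+1)-candidates from frequent k-itemsets given as sorted tuples."""
    frequent_k = sorted(frequent_k)
    known = set(frequent_k)
    candidates = []
    for i, a in enumerate(frequent_k):
        for b in frequent_k[i + 1:]:
            # Join step: merge two itemsets that share their first k-1 items.
            if a[:-1] != b[:-1]:
                continue
            cand = a + (b[-1],)
            # Prune step: every k-subset of the candidate must itself be frequent.
            if all(sub in known for sub in combinations(cand, len(a))):
                candidates.append(cand)
    return candidates

# {B, C} and {B, E} join to {B, C, E}; it survives because {C, E} is frequent.
print(apriori_gen([("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]))
# {B, C} and {B, D} join to {B, C, D}, which is pruned because {C, D} is missing.
print(apriori_gen([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]))
```

Only the surviving candidates need to be counted against the database, which is where the pruning principle saves work.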
The Apriori Algorithm — An Example
Database TDB
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan: C1
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1)
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan: C2 with support counts
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3
Itemset
{B, C, E}

3rd scan: L3
Itemset    sup
{B, C, E}  2
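
The three scans can be reproduced with a compact level-wise loop. This is a sketch under two assumptions: the minimum support count of 2 is inferred from {D} (sup 1) being dropped from L1, and the join is done with plain set unions rather than a prefix-based join, which yields the same candidates after pruning on this data.

```python
from itertools import combinations

tdb = {
    10: {"A", "C", "D"},
    20: {"B", "C", "E"},
    30: {"A", "B", "C", "E"},
    40: {"B", "E"},
}
MIN_SUP = 2  # assumption: inferred from the example, not stated on the slide

def count_support(candidates, db):
    """One database scan: count how many transactions contain each candidate."""
    return {c: sum(c <= t for t in db.values()) for c in candidates}

def apriori(db, min_sup):
    """Return [L1, L2, ...], each a dict mapping frequent itemset -> support."""
    levels = []
    candidates = {frozenset([i]) for t in db.values() for i in t}  # C1
    k = 1
    while candidates:
        counts = count_support(candidates, db)
        frequent = {c: s for c, s in counts.items() if s >= min_sup}
        if not frequent:
            break
        levels.append(frequent)
        k += 1
        # Join: unions of two frequent itemsets that have exactly k items.
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune: keep a candidate only if all of its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
    return levels

levels = apriori(tdb, MIN_SUP)
for k, level in enumerate(levels, start=1):
    print(f"L{k}:", {"".join(sorted(c)): s for c, s in level.items()})
```

The loop stops after L3 because no 4-itemset candidate can be formed from a single frequent 3-itemset.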
Speeding up Association rules
Dynamic Hashing and Pruning technique
Thanks to Cheng Hong & Hu Haibo
DHP: Reduce the Number of Candidates
• A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  • Candidates: a, b, c, d, e
  • Hash entries: {ab, ad, ae} {bd, be, de} …
  • Frequent 1-itemset: a, b, d, e
  • ab is not a candidate 2-itemset if the sum of count of {ab, ad, ae} is below support threshold
• J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95
Still challenging, the niche for DHP
• DHP (Park ’95): Dynamic Hashing and Pruning
• The set of candidate large 2-itemsets is huge.
  • DHP: trim it using hashing
• The transaction database is so huge that one scan per iteration is costly
  • DHP: prune both the number of transactions and the number of items in each transaction after each iteration
Hash Table Construction
• Consider 2-itemsets; all items are numbered as i1, i2, …, in. Any pair (x, y) is hashed according to the hash function
  bucket # = h({x y}) = ((order of x)*10 + (order of y)) % 7
• Example: Items = A, B, C, D, E; Order = 1, 2, 3, 4, 5
  • h({C, E}) = (3*10 + 5) % 7 = 0
  • Thus, {C, E} belongs to bucket 0.
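
The hash function translates almost directly into code. In this small sketch, `ORDER` and `NUM_BUCKETS` are illustrative names for the item numbering and the modulus used in the example.

```python
ORDER = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
NUM_BUCKETS = 7

def h(x, y):
    """Bucket number of the 2-itemset {x, y}; the lower-ordered item goes first."""
    x, y = sorted((x, y), key=ORDER.get)
    return (ORDER[x] * 10 + ORDER[y]) % NUM_BUCKETS

print(h("C", "E"))  # (3*10 + 5) % 7
```

Sorting the pair first makes the result independent of the order in which the two items are passed.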
How to trim candidate itemsets
• In the k-th iteration, hash all candidate (k+1)-itemsets into a hash table, and count the itemsets that fall in each bucket.
• In the (k+1)-th iteration, examine each candidate itemset to see whether its corresponding bucket count reaches the support threshold (a necessary condition for being frequent)
Example
TID Items
100 A C D
200 B C E
300 A B C E
400 B E
Figure 1. An example transaction database
Generation of C1 & L1(1st iteration)
C1
Itemset  Sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1
Itemset  Sup
{A}      2
{B}      3
{C}      3
{E}      3
Hash Table Construction
Find all 2-itemsets of each transaction

TID  2-itemsets
100  {A C} {A D} {C D}
200  {B C} {B E} {C E}
300  {A B} {A C} {A E} {B C} {B E} {C E}
400  {B E}
Hash Table Construction (2)
Hash function: h({x y}) = ((order of x)*10 + (order of y)) % 7

Hash table
Bucket  Itemsets             Count
0       {C E} {C E} {A D}    3
1       {A E}                1
2       {B C} {B C}          2
3                            0
4       {B E} {B E} {B E}    3
5       {A B}                1
6       {A C} {C D} {A C}    3
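
The bucket counts can be recomputed by hashing every 2-itemset of every transaction. The sketch below assumes the item order A=1 through E=5 from the hash-function example; the variable names are illustrative.

```python
from itertools import combinations

ORDER = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
db = {100: "ACD", 200: "BCE", 300: "ABCE", 400: "BE"}

def h(pair):
    """Bucket number of a 2-itemset given as an (x, y) tuple in item order."""
    x, y = pair
    return (ORDER[x] * 10 + ORDER[y]) % 7

# Hash every 2-itemset of every transaction into one of the 7 buckets.
buckets = [[] for _ in range(7)]
for tid, items in db.items():
    for pair in combinations(sorted(items, key=ORDER.get), 2):
        buckets[h(pair)].append(pair)

counts = [len(b) for b in buckets]
for number, content in enumerate(buckets):
    print(number, ["".join(p) for p in content], counts[number])
```

Bucket 6 shows why the bucket count is only an upper bound: {A C} and {C D} collide, so the count of 3 overstates the support of each.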
C2 Generation (2nd iteration)
L1*L1   # in the bucket
{A B}   1
{A C}   3
{A E}   1
{B C}   2
{B E}   3
{C E}   3

Resulted C2
{A C}
{B C}
{B E}
{C E}

C2 of Apriori
{A B}
{A C}
{A E}
{B C}
{B E}
{C E}
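
The filtering step amounts to one comparison per candidate pair. In this sketch the bucket counts are copied from the hash table of the example, and the minimum support count of 2 is an assumption inferred from {D} (sup 1) being excluded from L1.

```python
from itertools import combinations

ORDER = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
MIN_SUP = 2                              # assumption: inferred from the example
bucket_counts = [3, 1, 2, 0, 3, 1, 3]    # counts for buckets 0..6 of the hash table
L1 = ["A", "B", "C", "E"]

def h(x, y):
    """Bucket number of the 2-itemset {x, y}, with x ordered before y."""
    return (ORDER[x] * 10 + ORDER[y]) % 7

# Apriori's C2 is simply L1*L1, i.e. every pair of frequent items.
apriori_c2 = list(combinations(L1, 2))

# DHP keeps a pair only if its bucket count reaches the support threshold.
dhp_c2 = [pair for pair in apriori_c2 if bucket_counts[h(*pair)] >= MIN_SUP]

print(len(apriori_c2), ["".join(p) for p in apriori_c2])
print(len(dhp_c2), ["".join(p) for p in dhp_c2])
```

On this example the hash filter removes {A B} and {A E} before any support counting takes place.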
Effective Database Pruning
• Apriori
  • Does not prune the database.
  • Prunes Ck by support counting on the original database.
• DHP
  • More efficient support counting can be achieved on the pruned database.
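
The idea of shrinking the database between iterations can be illustrated with a deliberately simplified trimming rule. This is an assumption made for illustration and not necessarily the exact rule of Park et al.: an item is kept in a transaction only if it occurs in some candidate k-itemset contained in that transaction, and a transaction is dropped when fewer than k+1 items remain. The function name `prune_database` is hypothetical.

```python
from itertools import combinations

db = {100: "ACD", 200: "BCE", 300: "ABCE", 400: "BE"}
# C2 obtained with the hash filter in the C2 Generation example.
c2 = {("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")}

def prune_database(db, candidates, k):
    """Simplified trimming (illustrative rule, see the assumptions above)."""
    pruned = {}
    for tid, items in db.items():
        # Candidate k-itemsets that actually occur in this transaction.
        present = [c for c in combinations(sorted(items), k) if c in candidates]
        # Keep only the items covered by at least one such candidate.
        kept = sorted({item for c in present for item in c})
        # A transaction with fewer than k+1 items cannot hold a (k+1)-itemset.
        if len(kept) >= k + 1:
            pruned[tid] = "".join(kept)
    return pruned

pruned = prune_database(db, c2, k=2)
print(pruned)
print(sum(set("BCE") <= set(t) for t in pruned.values()))  # support of {B, C, E}
```

Under this rule only two of the four transactions remain for the third scan, while the support count of {B, C, E} is unchanged.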
Performance Comparison
Performance Comparison (2)
FP-growth Algorithm
• Uses a compressed representation of the database, the FP-tree
• Once an FP-tree has been constructed, a recursive divide-and-conquer approach is used to mine the frequent itemsets
FP-tree construction
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

After reading TID=1:
null
  A:1
    B:1

After reading TID=2:
null
  A:1
    B:1
  B:1
    C:1
      D:1
FP-Tree Construction
Transaction database: the same ten transactions (TID 1 to 10) as on the FP-tree construction slide.

FP-tree after reading all transactions:
null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1

Header table
Item  Pointer
A
B
C
D
E

• Pointers are used to assist frequent itemset generation
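
A minimal construction sketch follows. It assumes, as the figure does, that items are inserted in the fixed order A, B, C, D, E rather than in descending frequency order; the `Node` class, `build_fp_tree` and `show` are illustrative names, and the header table is modelled as a plain list of nodes per item instead of linked pointers.

```python
class Node:
    """One FP-tree node: an item, its count, and its children keyed by item."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, order):
    root = Node(None)
    header = {item: [] for item in order}   # item -> all nodes carrying that item
    rank = {item: i for i, item in enumerate(order)}
    for items in transactions:
        node = root
        # Insert the transaction as a path, sharing existing prefixes.
        for item in sorted(items, key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

def show(node, depth=0):
    """Print the tree with one level of indentation per depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

transactions = ["AB", "BCD", "ACDE", "ADE", "ABC", "ABCD", "BC", "ABC", "ABD", "BCE"]
root, header = build_fp_tree(transactions, order="ABCDE")
show(root)
```

Summing the counts of the nodes listed under an item in `header` gives that item's support, which is what the header-table pointers make cheap to compute.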
FP-growth

Prefix paths ending in D:
null
  A:4
    B:2
      C:1
        D:1
      D:1
    C:1
      D:1
    D:1
  B:1
    C:1
      D:1

• Conditional pattern base for D:
  (PB | D) = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}
• Recursively apply FP-growth on PB, and then append to D
• Thus, frequent itemsets found from PB|D (with min support = 2):
  AD, BD, CD, ABD, ACD, BCD
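
The pattern base and the resulting itemsets can be checked without building the tree. The sketch below reads each prefix path straight from the transactions (in the fixed item order A to E used by the figure) and then counts sub-patterns by brute-force enumeration, which stands in for the recursive FP-growth step; the function names are assumptions.

```python
from collections import Counter
from itertools import combinations

transactions = ["AB", "BCD", "ACDE", "ADE", "ABC", "ABCD", "BC", "ABC", "ABD", "BCE"]
ORDER = "ABCDE"
MIN_SUP = 2

def conditional_pattern_base(item, db):
    """Prefix (items ordered before `item`) of every transaction containing it."""
    cutoff = ORDER.index(item)
    return [
        tuple(i for i in ORDER[:cutoff] if i in t)
        for t in db if item in t
    ]

def frequent_with_suffix(item, db, min_sup):
    """Frequent itemsets ending in `item` (brute force instead of recursion)."""
    counts = Counter()
    for prefix in conditional_pattern_base(item, db):
        for k in range(1, len(prefix) + 1):
            for combo in combinations(prefix, k):
                counts["".join(combo) + item] += 1
    return {pattern: c for pattern, c in counts.items() if c >= min_sup}

base = conditional_pattern_base("D", transactions)
print(base)
print(frequent_with_suffix("D", transactions, MIN_SUP))
```

ABCD occurs in only one prefix path, so it falls below the threshold and does not appear in the result.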
FP-Growth vs. Apriori: Scalability With the Support Threshold
[Figure: run time (sec.) versus support threshold (%), comparing D1 FP-growth runtime with D1 Apriori runtime. Data set T25I20D10K]