Mining Generalized Association Rules
Ramakrishnan Srikant, Rakesh Agrawal
Data Mining Seminar, spring semester, 2003
Prof. Amos Fiat
Student: Idit Haran
Idit Haran, Data Mining Seminar, 2003
Outline
- Motivation
- Terms & Definitions
- Interest Measure
- Algorithms for mining generalized association rules
- Comparison
- Conclusions
Motivation
Find association rules of the form:
  Diapers → Beer
Different kinds of diapers: Huggies/Pampers, S/M/L, etc.
Different kinds of beers: Heineken/Maccabi, in a bottle/in a can, etc.
The information on the bar-code is of the type:
  Huggies Diapers, M → Heineken Beer in bottle
A rule at this level is not interesting, and probably will not have minimum support.
Taxonomy
Taxonomies are is-a hierarchies:

Clothes                  Footwear
├─ Outwear               ├─ Shoes
│  ├─ Jackets            └─ Hiking Boots
│  └─ Ski Pants
└─ Shirts
Taxonomy - Example
Say we found the rule
  Outwear → Hiking Boots
with minimum support and confidence.
The rule Jackets → Hiking Boots may not have minimum support.
The rule Clothes → Hiking Boots may not have minimum confidence.
Taxonomy
- Users are interested in generating rules that span different levels of the taxonomy.
- Rules at lower levels may not have minimum support.
- The taxonomy can be used to prune uninteresting or redundant rules.
- Multiple taxonomies may be present, for example: category, price (cheap/expensive), "items-on-sale", etc.
- Multiple taxonomies may be modeled as a forest, or as a DAG.
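Such a forest or DAG can be sketched as a map from each item to its list of parents; the names below (including the "Items-on-Sale" grouping) are hypothetical, just to illustrate an item with two parents:

```python
# Hypothetical toy taxonomy. A DAG arises when an item has several parents,
# e.g. "Jacket" under both "Outwear" and an "Items-on-Sale" grouping.
PARENTS = {
    "Jacket": ["Outwear", "Items-on-Sale"],
    "Ski Pants": ["Outwear"],
    "Outwear": ["Clothes"],
    "Shirts": ["Clothes"],
}

def ancestors(item):
    """All ancestors of item, following every parent edge (DAG-safe)."""
    seen, stack = set(), list(PARENTS.get(item, []))
    while stack:
        a = stack.pop()
        if a not in seen:
            seen.add(a)
            stack.extend(PARENTS.get(a, []))
    return seen

print(sorted(ancestors("Jacket")))  # ['Clothes', 'Items-on-Sale', 'Outwear']
```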
Notations
(taxonomy figure: an edge from a parent p to its children c1, c2 denotes an is_a relationship; ancestors of a node, marked with ^, lie above it, and descendants lie below it)
Notations
I = {i1, i2, …, im} - the items.
T - a transaction, a set of items T ⊆ I (we expect the items in T to be leaves of the taxonomy).
D - the set of transactions.
T supports item x if x is in T or x is an ancestor of some item in T.
T supports X ⊆ I if it supports every item in X.
Notations
A generalized association rule X → Y holds if X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X.
The rule X → Y has confidence c in D if c% of the transactions in D that support X also support Y.
The rule X → Y has support s in D if s% of the transactions in D support X ∪ Y.
Problem Statement
Find all generalized association rules whose support and confidence are greater than a user-specified minimum support (minsup) and minimum confidence (minconf), respectively.
Example
Recall the taxonomy:

Clothes                  Footwear
├─ Outwear               ├─ Shoes
│  ├─ Jackets            └─ Hiking Boots
│  └─ Ski Pants
└─ Shirts
Example
minsup = 30%, minconf = 60%

Database D:
Transaction | Items Bought
100         | Shirt
200         | Jacket, Hiking Boots
300         | Ski Pants, Hiking Boots
400         | Shoes
500         | Shoes
600         | Jacket

Frequent Itemsets:
Itemset                  | Support
{Jacket}                 | 2
{Outwear}                | 3
{Clothes}                | 4
{Shoes}                  | 2
{Hiking Boots}           | 2
{Footwear}               | 4
{Outwear, Hiking Boots}  | 2
{Clothes, Hiking Boots}  | 2
{Outwear, Footwear}      | 2
{Clothes, Footwear}      | 2

Rules:
Rule                     | Support | Confidence
Outwear → Hiking Boots   | 33%     | 66.6%
Outwear → Footwear       | 33%     | 66.6%
Hiking Boots → Outwear   | 33%     | 100%
Hiking Boots → Clothes   | 33%     | 100%
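The numbers above can be checked with a short sketch: a transaction supports an itemset iff the itemset is contained in the transaction extended with all ancestors of its items.

```python
# Sketch checking the example's support counts. The taxonomy and database
# are the ones from the slides.
PARENTS = {"Jacket": "Outwear", "Ski Pants": "Outwear", "Shirt": "Clothes",
           "Outwear": "Clothes", "Shoes": "Footwear", "Hiking Boots": "Footwear"}

def extend(t):
    """T' = T plus all ancestors of items in T."""
    out = set(t)
    for x in t:
        p = PARENTS.get(x)
        while p is not None:
            out.add(p)
            p = PARENTS.get(p)
    return out

D = [{"Shirt"}, {"Jacket", "Hiking Boots"}, {"Ski Pants", "Hiking Boots"},
     {"Shoes"}, {"Shoes"}, {"Jacket"}]

def support(itemset):
    """Number of transactions in D supporting the itemset."""
    return sum(itemset <= extend(t) for t in D)

print(support({"Outwear", "Hiking Boots"}))  # 2, i.e. 33% of 6 transactions
print(support({"Outwear"}))                  # 3
# confidence of Outwear -> Hiking Boots is 2/3, i.e. 66.6%
```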
Observation 1
If the set {x, y} has minimum support, so do {x^, y}, {x, y^} and {x^, y^}.
For example: if {Jacket, Shoes} has minsup, so will {Outwear, Shoes}, {Jacket, Footwear}, and {Outwear, Footwear}.
Observation 2
If the rule x → y has minimum support and confidence, only x → y^ is guaranteed to have both minsup and minconf.
For example: the rule Outwear → Hiking Boots has minsup and minconf, so the rule Outwear → Footwear has both minsup and minconf as well.
Observation 2 - cont.
The rules x^ → y and x^ → y^ will have minsup, but they may not have minconf.
For example: the rules Clothes → Hiking Boots and Clothes → Footwear have minsup, but not minconf.
Interesting Rules - Previous Work
A rule X → Y is not interesting if:
  support(X ∪ Y) ≈ support(X) × support(Y)
Previous work does not consider the taxonomy.
This interest measure pruned less than 1% of the rules on a real database.
Interesting Rules - Using the Taxonomy
Milk → Cereal (8% support, 70% confidence)
Milk is the parent of Skim Milk, and 25% of Milk sales are Skim Milk.
We therefore expect Skim Milk → Cereal to have 2% support and 70% confidence.
R-Interesting Rules
A rule X → Y is R-interesting w.r.t. an ancestor rule X^ → Y^ if:
  real support(X → Y) > R × expected support of (X → Y) based on (X^ → Y^), or
  real confidence(X → Y) > R × expected confidence of (X → Y) based on (X^ → Y^).
With R = 1.1, about 40-55% of the rules were pruned.
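A hedged sketch of the support half of the R-interest test, using the Milk/Cereal numbers from the earlier slide; the 10% item support for Milk is a hypothetical figure, chosen only so that Skim Milk is 25% of Milk sales:

```python
# Expected support of a specialized rule: scale the ancestor rule's support
# by supp(item)/supp(ancestor) for each item that was specialized (the
# independence assumption). All numbers below are illustrative.
def expected_support(item_supp, replacements, ancestor_rule_supp):
    """replacements: list of (item, its_ancestor) pairs that were specialized."""
    e = ancestor_rule_supp
    for item, anc in replacements:
        e *= item_supp[item] / item_supp[anc]
    return e

def is_r_interesting(actual, expected, R=1.1):
    return actual > R * expected

item_supp = {"Skim Milk": 0.025, "Milk": 0.10}   # Skim Milk = 25% of Milk sales
exp = expected_support(item_supp, [("Skim Milk", "Milk")], 0.08)
print(round(exp, 4))                 # 0.02: Skim Milk -> Cereal expected at 2%
print(is_r_interesting(0.04, exp))   # True: 4% observed is more than 1.1 x 2%
```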
Problem Statement (new)
Find all generalized R-interesting association rules (where R is a user-specified minimum interest, called min-interest) whose support and confidence are greater than minsup and minconf, respectively.
Algorithms - 3 Steps
1. Find all itemsets whose support is greater than minsup. These itemsets are called frequent itemsets.
2. Use the frequent itemsets to generate the desired rules: if ABCD and AB are frequent, then conf(AB → CD) = support(ABCD) / support(AB).
3. Prune all uninteresting rules from this set.
* All the algorithms presented here implement only step 1.
Algorithms (step 1)
Input: database, taxonomy.
Output: all frequent itemsets.
3 algorithms (same output, different run-time): Basic, Cumulate, EstMerge.
Algorithm Basic - Main Idea
Is itemset X frequent? Does transaction T support X?
(X may contain items from different levels of the taxonomy; T contains only leaves.)
Let T' = T + ancestors(T). Then: T supports X ⟺ X ⊆ T'.
Algorithm Basic

L1 = {frequent 1-itemsets};                // count item occurrences
for (k = 2; L(k-1) ≠ ∅; k++) do begin
    Ck = apriori-gen(L(k-1));              // generate new candidate k-itemsets
    forall transactions t ∈ D do begin
        add-ancestors(t);                  // add all ancestors of each item in t to t,
                                           // removing any duplicates
        Ct = subset(Ck, t);                // find the candidates supported by t
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup};      // take only candidates with support over minsup
end
Answer = ∪k Lk;
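A minimal runnable sketch of Basic (step 1 only), on the toy data from the example slide; the simplified `apriori_gen` here joins (k-1)-itemsets and applies the subset prune in one comprehension:

```python
from itertools import combinations

# Hypothetical sketch of Basic: extend every transaction with all ancestors
# of its items, then run ordinary Apriori counting over the extensions.
PARENTS = {"Jacket": "Outwear", "Ski Pants": "Outwear", "Shirt": "Clothes",
           "Outwear": "Clothes", "Shoes": "Footwear", "Hiking Boots": "Footwear"}

def extend(t):
    """T' = T plus all ancestors of items in T."""
    out = set(t)
    for x in t:
        p = PARENTS.get(x)
        while p is not None:
            out.add(p)
            p = PARENTS.get(p)
    return out

def apriori_gen(prev):
    """Simplified candidate generation: join (k-1)-itemsets, then prune."""
    k = len(next(iter(prev))) + 1
    return {p | q for p in prev for q in prev
            if len(p | q) == k
            and all(frozenset(s) in prev for s in combinations(p | q, k - 1))}

def basic(D, minsup):
    exts = [extend(t) for t in D]
    L = {frozenset([i]) for i in set().union(*exts)
         if sum(i in e for e in exts) >= minsup}
    answer = set(L)
    while L:
        C = apriori_gen(L)
        L = {c for c in C if sum(c <= e for e in exts) >= minsup}
        answer |= L
    return answer

D = [{"Shirt"}, {"Jacket", "Hiking Boots"}, {"Ski Pants", "Hiking Boots"},
     {"Shoes"}, {"Shoes"}, {"Jacket"}]
freq = basic(D, 2)
print(frozenset({"Outwear", "Hiking Boots"}) in freq)  # True
```

Note that Basic also counts redundant itemsets such as {Jacket, Outwear}; Cumulate's Optimization 3 below prunes exactly these.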
Candidate Generation

Join step: p and q are two frequent (k-1)-itemsets identical in their first k-2 items; join them by adding the last item of q to p:

insert into Ck
select p.item1, p.item2, …, p.item(k-1), q.item(k-1)
from L(k-1) p, L(k-1) q
where p.item1 = q.item1, …, p.item(k-2) = q.item(k-2), p.item(k-1) < q.item(k-1);

Prune step: check all the (k-1)-subsets, and remove any candidate with an infrequent subset:

forall itemsets c ∈ Ck do
    forall (k-1)-subsets s of c do
        if (s ∉ L(k-1)) then
            delete c from Ck;
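The join and prune steps can be mirrored over lexicographically sorted tuples; the itemsets in the example below are hypothetical:

```python
from itertools import combinations

# apriori-gen over sorted tuples: join two (k-1)-itemsets agreeing on their
# first k-2 items (join step), then drop candidates that have an infrequent
# (k-1)-subset (prune step).
def apriori_gen(L_prev):
    L_prev = {tuple(sorted(s)) for s in L_prev}
    k = len(next(iter(L_prev))) + 1
    C = set()
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:   # join step
                C.add(p + (q[-1],))
    # prune step: every (k-1)-subset must be frequent
    return {c for c in C
            if all(s in L_prev for s in combinations(c, k - 1))}

L2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
print(apriori_gen(L2))  # {('A', 'B', 'C')}; ('B', 'C', 'D') is pruned
```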
Optimization 1
Filtering the ancestors added to transactions:
We only need to add to transaction t the ancestors that appear in one of the candidates.
If the original item is not in any candidate itemset, it can be dropped from the transaction.
Example: with candidate {Clothes, Shoes}, the transaction t = {Jacket, …} can be replaced with {Clothes, …}.
Optimization 2
Pre-computing ancestors:
Rather than finding the ancestors of each item by traversing the taxonomy graph, we can pre-compute the ancestors of each item.
At the same time, we can drop ancestors that are not contained in any of the candidates.
Optimization 3
Pruning itemsets containing an item and its ancestor:
If we have {Jacket} and {Outwear}, we will get the candidate {Jacket, Outwear}, which is not interesting, since support({Jacket}) = support({Jacket, Outwear}).
Deleting {Jacket, Outwear} at k = 2 ensures such candidates never arise for k > 2 (because of the prune step of the candidate-generation method).
Therefore, we need to prune candidates containing an item and its ancestor only for k = 2; in all subsequent steps no candidate will include an item together with its ancestor.
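The k = 2 prune can be sketched as a filter over candidate pairs:

```python
# Sketch of Optimization 3: drop candidate 2-itemsets pairing an item with
# one of its own ancestors, since support({x}) = support({x, x^}).
PARENTS = {"Jacket": "Outwear", "Outwear": "Clothes"}

def ancestors(x):
    out = set()
    while x in PARENTS:
        x = PARENTS[x]
        out.add(x)
    return out

def prune_item_ancestor(C2):
    kept = set()
    for c in C2:
        a, b = tuple(c)
        if b not in ancestors(a) and a not in ancestors(b):
            kept.add(c)
    return kept

C2 = [frozenset({"Jacket", "Outwear"}), frozenset({"Jacket", "Shoes"}),
      frozenset({"Jacket", "Clothes"})]
print(prune_item_ancestor(C2))  # keeps only {'Jacket', 'Shoes'}
```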
Algorithm Cumulate

Compute T*, the set of ancestors of each item, from T;   // Optimization 2
L1 = {frequent 1-itemsets};                              // count item occurrences
for (k = 2; L(k-1) ≠ ∅; k++) do begin
    Ck = apriori-gen(L(k-1));                            // generate new candidate k-itemsets
    if (k = 2) then prune(C2);                           // Optimization 3: delete candidates
                                                         // consisting of an item and its ancestor
    remove-unnecessary(T*, Ck);                          // Optimization 1: delete ancestors in T*
                                                         // not present in any candidate in Ck
    forall transactions t ∈ D do begin
        foreach item x ∈ t, add the ancestors of x in T* to t;   // Optimization 2
        remove any duplicates from t;
        Ct = subset(Ck, t);
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup};
end
Answer = ∪k Lk;
Stratification
Candidates: {Clothes, Shoes}, {Outwear, Shoes}, {Jacket, Shoes}.
If {Clothes, Shoes} does not have minimum support, we don't need to count either {Outwear, Shoes} or {Jacket, Shoes}.
So we count in steps:
Step 1: count {Clothes, Shoes}; if it has minsup -
Step 2: count {Outwear, Shoes}; if it has minsup -
Step 3: count {Jacket, Shoes}.
Version 1: Stratify
Depth of an itemset:
- Itemsets with no parents are of depth 0.
- Otherwise: depth(X) = max({depth(X^) | X^ is a parent of X}) + 1.
The algorithm:
- Count all itemsets C0 of depth 0.
- Delete candidates that are descendants of the itemsets in C0 that didn't have minsup.
- Count the remaining itemsets at depth 1 (C1).
- Delete candidates that are descendants of the itemsets in C1 that didn't have minsup.
- Count the remaining itemsets at depth 2 (C2), etc.
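The depth computation can be sketched as follows, taking "parent of an itemset" to mean generalizing exactly one item, and depth relative to the candidate set (an assumption consistent with the counting example above):

```python
# Sketch of Stratify's depths over the slide's example candidates.
PARENTS = {"Jacket": "Outwear", "Ski Pants": "Outwear", "Shirt": "Clothes",
           "Outwear": "Clothes", "Shoes": "Footwear", "Hiking Boots": "Footwear"}

def parents_of(X):
    """All itemsets obtained by replacing one item of X with its parent."""
    return {frozenset(X - {x} | {PARENTS[x]}) for x in X if x in PARENTS}

def depths(candidates):
    cands = {frozenset(c) for c in candidates}
    memo = {}
    def d(X):
        if X not in memo:
            ps = parents_of(X) & cands
            memo[X] = 1 + max(map(d, ps)) if ps else 0
        return memo[X]
    for c in cands:
        d(c)
    return memo

cands = [{"Clothes", "Shoes"}, {"Outwear", "Shoes"}, {"Jacket", "Shoes"}]
dd = depths(cands)
print(dd[frozenset({"Clothes", "Shoes"})])  # 0: counted first
print(dd[frozenset({"Jacket", "Shoes"})])   # 2: counted last
```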
Tradeoff & Optimizations
The tradeoff is between the number of candidates counted and the number of passes over the DB, ranging from Cumulate (everything in one pass) to counting each depth in a different pass. Hence:
Optimization 1: count multiple depths together, from a certain level on.
Optimization 2: count more than 20% of the candidates per pass.
Version 2: Estimate
Estimate candidate support using a sample.
1st pass (C'k):
- Count candidates that are expected to have minsup (a candidate counts as expected-frequent if its support in the sample is at least 0.9 × minsup).
- Also count candidates whose parents are expected to have minsup.
2nd pass (C"k):
- Count children of candidates in C'k that were not expected to have minsup.
Example for Estimate
minsup = 5%

                 | Support in | Support in Database
Candidate        | Sample     | Scenario A | Scenario B
{Clothes, Shoes} | 8%         | 7%         | 9%
{Outwear, Shoes} | 4%         | 4%         | 6%
{Jacket, Shoes}  | 2%         |            |
Version 3: EstMerge
Motivation: eliminate the 2nd pass of algorithm Estimate.
Implementation: count the candidates of C"k together with the candidates in C'(k+1).
Restriction: to create C'(k+1) we assume that all candidates in C"k have minsup.
The tradeoff: the extra candidates counted by EstMerge vs. the extra pass made by Estimate.
Algorithm EstMerge

Ds = generate-sample(D);                            // draw a sample of D during the first pass
L1 = {frequent 1-itemsets};                         // count item occurrences
C"1 = ∅;
for (k = 2; L(k-1) ≠ ∅ or C"(k-1) ≠ ∅; k++) do begin
    Ck = generate-candidates(L(k-1) ∪ C"(k-1));     // new candidate k-itemsets
    C'k = expected-frequent-and-sons(Ck, Ds);       // estimate support over Ds: candidates
                                                    // expected to have minsup, plus candidates
                                                    // whose parents are expected to have minsup
    find-support(C'k ∪ C"(k-1), D);                 // one pass over D
    prune-descendants(Ck, C'k);                     // delete candidates in Ck whose ancestors
                                                    // in C'k don't have minsup
    C"k = Ck − C'k;                                 // remaining candidates, counted next pass
    L(k-1) = L(k-1) ∪ {c ∈ C"(k-1) | c.count ≥ minsup};   // add candidates in C"(k-1) with minsup
    Lk = {c ∈ C'k | c.count ≥ minsup};                    // all candidates in C'k with minsup
end
Answer = ∪k Lk;
Stratify - Variants
Size of Sample
Pr[support in sample < a], for an itemset with actual support p:

              p = 5%        p = 1%        p = 0.5%      p = 0.1%
              a=.8p  a=.9p  a=.8p  a=.9p  a=.8p  a=.9p  a=.8p  a=.9p
n = 1000      0.32   0.76   0.80   0.95   0.89   0.97   0.98   0.99
n = 10,000    0.00   0.07   0.11   0.59   0.34   0.77   0.80   0.95
n = 100,000   0.00   0.00   0.00   0.01   0.00   0.07   0.12   0.60
n = 1,000,000 0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.01
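Figures of this kind can be approximated with a Chernoff-style tail bound; whether the table was produced with exactly this bound is an assumption, but exp(-npd²/2) reproduces the trend (risk shrinks rapidly as n or p grows):

```python
import math

# Chernoff-style upper bound on the probability that an itemset with true
# support p shows support below a = frac * p in a random sample of n
# transactions: Pr[X < (1 - d) * n * p] <= exp(-n * p * d^2 / 2), d = 1 - frac.
# That this is the exact formula behind the table is an assumption.
def sample_risk(n, p, frac):
    d = 1.0 - frac
    return math.exp(-n * p * d * d / 2)

for n in (1_000, 10_000, 100_000, 1_000_000):
    row = [round(sample_risk(n, p, 0.9), 2) for p in (0.05, 0.01, 0.005, 0.001)]
    print(n, row)
```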
Size of Sample
Performance Evaluation
Compare the running time of the 3 algorithms: Basic, Cumulate and EstMerge.
On synthetic data: the effect of each parameter on performance.
On real data: supermarket data, department store data.
Synthetic Data Generation

Parameter                                                     | Default Value
|D|  Number of transactions                                   | 1,000,000
|T|  Average size of the transactions                         | 10
|I|  Average size of the maximal potentially frequent itemsets| 4
|I|  Number of maximal potentially frequent itemsets          | 10,000
N    Number of items                                          | 100,000
R    Number of roots                                          | 250
L    Number of levels                                         | 4-5
F    Fanout                                                   | 5
D    Depth-ratio (probability that an item in a rule comes    | 1
     from level i / probability that it comes from level i+1)
Minimum Support
Number of Transactions
Fanout
Number of Items
Reality Check
Supermarket data:
- 548,000 items
- Taxonomy: 4 levels, 118 roots
- ~1.5 million transactions
- Average of 9.6 items per transaction
Department store data:
- 228,000 items
- Taxonomy: 7 levels, 89 roots
- 570,000 transactions
- Average of 4.4 items per transaction
Results
Conclusions
Cumulate and EstMerge were 2 to 5 times faster than Basic on all synthetic datasets; on the supermarket database they were 100 times faster!
EstMerge was ~25-30% faster than Cumulate.
Both EstMerge and Cumulate exhibit linear scale-up with the number of transactions.
Summary
The use of a taxonomy is necessary for finding association rules between items at any level of the hierarchy.
The obvious solution (algorithm Basic) is not very fast.
New algorithms that exploit the taxonomy are much faster.
We can use the taxonomy to prune uninteresting rules.