Association Rule Mining (II)
Instructor: Qiang Yang. Thanks: J. Han and J. Pei
Bottleneck of Frequent-pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
- To find the frequent itemset i1 i2 … i100:
  - # of scans: 100
  - # of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 !
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?
FP-growth: Frequent-pattern Mining Without Candidate Generation
- Heuristic: let P be a frequent itemset, S be the set of transactions containing P, and x be an item. If x is a frequent item in S, then {x} ∪ P must be a frequent itemset
- No candidate generation!
- A compact data structure, the FP-tree, stores the information needed for frequent-pattern mining
- A recursive mining algorithm mines the complete set of frequent patterns
Example
TID  Items Bought
100  f, a, c, d, g, i, m, p
200  a, b, c, f, l, m, o
300  b, f, h, j, o
400  b, c, k, s, p
500  a, f, c, e, l, p, m, n
Min Support = 3
- Scan the database once; list the frequent items, sorted by support (item:support): <(f:4), (c:4), (a:3), (b:3), (m:3), (p:3)>
- The root of the tree is created and labeled "{}"
- Scan the database again; the first transaction leads to the first branch of the tree: <(f:1), (c:1), (a:1), (m:1), (p:1)>
- Items in each transaction are ordered according to descending frequency
Scanning TID=100
Transaction database: TID 100, items f, a, c, d, g, i, m, p
FP-tree after inserting TID=100 (a single branch):
{}
└─ f:1
   └─ c:1
      └─ a:1
         └─ m:1
            └─ p:1
Header table (item : number of tree nodes, each entry heading a node-link chain): f:1, c:1, a:1, m:1, p:1
Scanning TID=200
- Frequent single items: F1 = <f, c, a, b, m, p>
- TID=200 contains items a, b, c, f, l, m, o
- Possible frequent items: intersect with F1, giving f, c, a, b, m
- Along the first branch <f, c, a, m, p>, the shared prefix is <f, c, a>
- Generate two new nodes: <b> as a child of a, then <m> as a child of b
Scanning TID=200
Transaction database: TID 200, frequent items f, c, a, b, m
FP-tree after inserting TID=200:
{}
└─ f:2
   └─ c:2
      └─ a:2
         ├─ m:1
         │  └─ p:1
         └─ b:1
            └─ m:1
Header table (item : number of nodes): f:1, c:1, a:1, b:1, m:2, p:1
The final FP-tree
Transaction database:
TID  Items
100  f, a, c, d, g, i, m, p
200  a, b, c, f, l, m, o
300  b, f, h, j, o
400  b, c, k, s, p
500  a, f, c, e, l, p, m, n
Min support = 3
Frequent 1-items in frequency descending order: f, c, a, b, m, p
{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
Header table (item : number of nodes): f:1, c:2, a:1, b:3, m:2, p:2
FP-Tree Construction
- Scans the database only twice
- Subsequent mining is based on the FP-tree only
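To make the two-pass construction concrete, here is a minimal Python sketch. It is not the authors' code; the class and function names are my own illustration.

```python
from collections import defaultdict

class FPNode:
    """One FP-tree node: item label, count, parent link, children keyed by item."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Pass 1: count item supports; keep only the frequent items.
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    frequent = {i: s for i, s in support.items() if s >= min_sup}

    # Pass 2: insert each transaction, with its frequent items sorted
    # in support-descending order (ties broken alphabetically here).
    root = FPNode(None, None)
    header = defaultdict(list)  # item -> node-link list, in creation order
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, frequent
```

On the slide's five transactions with min_sup = 3 this returns frequent = {f:4, c:4, a:3, b:3, m:3, p:3} and the tree shown above, except that the alphabetical tie-break puts c before f, whereas the slides order F1 as <f, c, a, b, m, p>; any fixed global order works.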
How to Mine an FP-tree?
Step 1: form the conditional pattern base
Step 2: construct the conditional FP-tree
Step 3: recursively mine the conditional FP-trees
Conditional Pattern Base
- Let I be a frequent item
- I's conditional pattern base is the sub-database consisting of the prefix paths in the FP-tree that co-occur with I as the suffix pattern
- Example: m is a frequent item; m's conditional pattern base is
  <f, c, a>: support = 2
  <f, c, a, b>: support = 1
- Mine recursively on such databases
(The prefix paths are read off the final FP-tree above by following m's node-links upward toward the root.)
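A sketch of how the pattern base falls out of the header table's node-links, reusing the FPNode/header structures from the construction sketch above (names are mine):

```python
def conditional_pattern_base(item, header):
    """Collect (prefix-path, count) pairs for every node of `item`,
    walking parent links up to (but not including) the root."""
    base = []
    for node in header[item]:
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((path[::-1], node.count))
    return base

# On the running example, conditional_pattern_base('m', header) yields
# m's two prefix paths with counts 2 and 1, i.e. <f,c,a>:2 and <f,c,a,b>:1
# (the item order inside each path depends on the tie-break chosen above).
```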
Conditional Pattern Tree
- Let I be a suffix item and DB|I be its conditional pattern base
- The frequent-pattern tree Tree_I built from DB|I is known as the conditional pattern tree
- Example: m is a frequent item; m's conditional pattern base is
  <f, c, a>: support = 2
  <f, c, a, b>: support = 1
- m's conditional pattern tree (b is dropped, since its support 1 < 3):
{}
└─ f:3
   └─ c:3
      └─ a:3
Composition of patterns α and β
- Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is frequent in DB if and only if β is frequent in B.
- Example: starting with α = {p}
- {p}'s conditional pattern base (from the tree): B = { (f, c, a, m): 2, (c, b): 1 }
- Let β be {c}. Then α ∪ β = {p, c}, with support = 3.
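A two-line check of the lemma on this example (plain Python, my own illustration):

```python
# p's conditional pattern base: (prefix path, count) pairs from the FP-tree.
B = [(['f', 'c', 'a', 'm'], 2), (['c', 'b'], 1)]
beta = {'c'}
# Support of beta within B equals the support of {p} ∪ beta in the original DB.
support = sum(cnt for path, cnt in B if beta <= set(path))
print(support)  # 3, so {p, c} is frequent in DB (min support = 3)
```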
Single path tree
- Let P be a single-path FP-tree
- Let {I1, I2, …, Ik} be an itemset in the tree
- Let Ij have the lowest support
- Then support({I1, I2, …, Ik}) = support(Ij)
- Example: in m's conditional pattern tree <f:3, c:3, a:3>, every combination of f, c, a (together with m) has support 3
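This is why the single-path case of FP-growth needs no recursion: all patterns can be enumerated directly. A sketch (my own helper, not the paper's code):

```python
from itertools import combinations

def single_path_patterns(path, suffix):
    """Enumerate every pattern from a single-path tree.
    `path` is a list of (item, count) pairs from root to leaf; each
    pattern's support is the minimum count among its chosen nodes."""
    patterns = {}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            itemset = frozenset(i for i, _ in combo) | suffix
            patterns[itemset] = min(c for _, c in combo)
    return patterns

# single_path_patterns([('f', 3), ('c', 3), ('a', 3)], frozenset({'m'}))
# yields the 7 m-patterns {m,f}, {m,c}, ..., {m,f,c,a}, all with support 3.
```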
FP_growth Algorithm (Fig 6.10)
- Recursive algorithm
- Input: a transaction database and min_sup
- Output: the complete set of frequent patterns
- Step 1: FP-tree construction
- Step 2: mine the FP-tree by calling FP_growth(FP_tree, null)
- Key idea: handle single-path and multi-path FP-trees separately; keep splitting until a single-path FP-tree is reached
FP_growth(Tree, α)
If Tree contains a single path P, then
  for each combination β of the nodes in the path P:
    generate pattern β ∪ α with support = the minimum support of the nodes in β
Else, for each item a_i in the header of Tree, do {
  generate pattern β = a_i ∪ α with support = a_i.support;
  construct (1) β's conditional pattern base and (2) β's conditional FP-tree Tree_β;
  if Tree_β is not empty, then call FP_growth(Tree_β, β);
}
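Putting the pieces together, here is a compact runnable sketch of the whole recursion. To stay short it recurses on conditional pattern bases (lists of (prefix, count) pairs) rather than materializing each conditional FP-tree, so it is a simplification of Fig 6.10, not the book's exact algorithm:

```python
from collections import defaultdict

def fp_growth(transactions, min_sup):
    """Return every frequent itemset with its support."""
    patterns = {}

    def mine(paths, suffix):
        # Count item supports inside this conditional database.
        support = defaultdict(int)
        for items, cnt in paths:
            for i in items:
                support[i] += cnt
        for item, sup in support.items():
            if sup < min_sup:
                continue
            new_suffix = suffix | {item}
            patterns[frozenset(new_suffix)] = sup
            # item's conditional pattern base: the part of each path
            # ranked strictly above item in support-descending order.
            base = []
            for items, cnt in paths:
                if item in items:
                    prefix = [i for i in items
                              if (support[i], i) > (support[item], item)]
                    if prefix:
                        base.append((prefix, cnt))
            if base:
                mine(base, new_suffix)

    mine([(list(t), 1) for t in transactions], frozenset())
    return patterns

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
result = fp_growth(db, 3)
print(result[frozenset({'c', 'p'})])            # 3
print(result[frozenset({'f', 'c', 'a', 'm'})])  # 3
```

Because each level uses one strict total order on its items, every frequent itemset is generated exactly once, via its lowest-ranked item.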
FP-Growth vs. Apriori: Scalability With the Support Threshold
[Figure: run time (sec.) vs. support threshold (0-3%); curves: D1 FP-growth runtime and D1 Apriori runtime; data set T25I20D10K]
FP-Growth vs. Tree-Projection: Scalability with the Support Threshold
[Figure: runtime (sec.) vs. support threshold (0-2%); curves: D2 FP-growth and D2 TreeProjection; data set T25I20D100K]
Why Is FP-Growth the Winner?
- Divide-and-conquer: decomposes both the mining task and the DB according to the frequent patterns obtained so far, leading to a focused search of smaller databases
- Other factors:
  - no candidate generation, no candidate test
  - compressed database: the FP-tree structure
  - no repeated scan of the entire database
  - basic ops are counting and FP-tree building, not pattern search and matching
Implications of the Methodology: Papers by Han, et al.
- Mining closed frequent itemsets and max-patterns: CLOSET (DMKD'00)
- Mining sequential patterns: FreeSpan (KDD'00), PrefixSpan (ICDE'01)
- Constraint-based mining of frequent patterns: convertible constraints (KDD'00, ICDE'01)
- Computing iceberg data cubes with complex measures: H-tree and H-cubing algorithm (SIGMOD'01)
Visualization of Association Rules: Pane Graph
Visualization of Association Rules: Rule Graph
Mining Various Kinds of Rules or Regularities
- Multi-level and quantitative association rules, correlation and causality, ratio rules, sequential patterns, emerging patterns, temporal associations, partial periodicity
- Classification, clustering, iceberg cubes, etc.
Multiple-level Association Rules
- Items often form a hierarchy
- Flexible support settings: items at a lower level are expected to have lower support
- Transaction databases can be encoded based on dimensions and levels
- Explore shared multi-level mining
Example hierarchy: Milk [support = 10%], with children 2% Milk [support = 6%] and Skim Milk [support = 4%]
- Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5% (Skim Milk, at 4%, is pruned)
- Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3% (both kinds of milk survive)
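A toy illustration of uniform vs. reduced support thresholds, using the values from the slide (the dictionary layout is my own):

```python
# (item, level) -> observed support, from the hierarchy above.
support = {("Milk", 1): 0.10, ("2% Milk", 2): 0.06, ("Skim Milk", 2): 0.04}

def frequent(min_sup_by_level):
    return [item for (item, lvl), s in support.items()
            if s >= min_sup_by_level[lvl]]

print(frequent({1: 0.05, 2: 0.05}))  # uniform: Skim Milk is pruned
print(frequent({1: 0.05, 2: 0.03}))  # reduced: all three survive
```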
Quantitative Association Rules
- Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
- Cluster "adjacent" association rules to form general rules, using a 2-D grid
- Example: age(X, "34-35") ∧ income(X, "30K-50K") ⇒ buys(X, "high resolution TV")
Redundant Rules [SA95]
Which rule is redundant?
- milk ⇒ wheat bread [support = 8%, confidence = 70%]
- "skim milk" ⇒ wheat bread [support = 2%, confidence = 72%]
The first rule is more general than the second.
A rule is redundant if its support is close to the "expected" value based on a more general rule, and its confidence is close to that of the general rule. Here, if skim milk accounts for roughly a quarter of milk sales, the second rule's support (2% ≈ 8%/4) and confidence (72% ≈ 70%) are both close to expectation, so it is redundant.
Incremental Mining [CHNW96]
- Rules in DB were found, and a set of new tuples db is added to DB
- Task: find the rules in DB + db. Usually, DB is much larger than db
- Properties of itemsets:
  - frequent in DB + db if frequent in both DB and db
  - infrequent in DB + db if infrequent in both DB and db
  - frequent only in DB: merge with the counts in db; no DB scan is needed!
  - frequent only in db: scan DB once to update their itemset counts
- The same principle is applicable to distributed/parallel mining (see the sketch below)
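A minimal sketch of this case analysis in Python. The helper names (count_in_DB, the fp_growth reuse) are my own; this illustrates the idea rather than the exact algorithm of [CHNW96]:

```python
def incremental_frequent(freq_DB, db, n_DB, min_frac, count_in_DB):
    """freq_DB: frozenset -> count, for itemsets frequent in DB.
    db: list of new transactions. count_in_DB(itemset) is the single
    fallback scan of DB, needed only for itemsets frequent in db alone."""
    threshold = min_frac * (n_DB + len(db))

    def count_in_db(itemset):
        return sum(1 for t in db if itemset <= set(t))

    result = {}
    # Case 1: frequent in DB -- add the cheap counts from the small db.
    for s, c in freq_DB.items():
        total = c + count_in_db(s)
        if total >= threshold:
            result[s] = total
    # Case 2: frequent only in db -- one scan of DB fills in the counts.
    # (fp_growth is the earlier sketch; any frequent-itemset miner works.)
    for s, c_db in fp_growth(db, min_frac * len(db)).items():
        if s not in freq_DB:
            total = c_db + count_in_DB(s)
            if total >= threshold:
                result[s] = total
    return result
```

The two cases are exhaustive because an itemset frequent in DB + db must be frequent, relative to the threshold, in at least one of the two parts.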
Correlation Rules
- Association does not measure correlation [BMS97, AY98]
- Among 5000 students: 3000 play basketball, 3750 eat cereal, 2000 do both
- play basketball ⇒ eat cereal [support 40%, confidence 66.7%]
- The conclusion "basketball and cereal are correlated" is misleading, because the overall percentage of students eating cereal is 75%, higher than 66.7%
- Confidence does not always give the correct picture!
Correlation Rules
- If A and B are independent events, then P(A ∧ B) = P(A) · P(B)
- Define corr(A, B) = P(A ∧ B) / (P(A) · P(B)) = P(B|A) · P(A) / (P(A) · P(B)) = P(B|A) / P(B)
- If the value is less than 1, A and B are negatively correlated; otherwise, positively correlated
- P(B|A) / P(B) is also known as the lift of the rule A ⇒ B; if it is less than one, B and A are negatively correlated
- Basketball ⇒ Cereal: (2000/5000) / ((3000/5000) · (3750/5000)) = (2000 · 5000) / (3000 · 3750) ≈ 0.89 < 1
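The arithmetic as a runnable check (the function name is my own):

```python
def lift(n_ab, n_a, n_b, n_total):
    """P(A and B) / (P(A) * P(B)), estimated from raw counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

# Basketball -> cereal, from the slide's numbers:
print(lift(2000, 3000, 3750, 5000))  # 0.888... < 1: negatively correlated
```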
Chi-square Correlation [BMS97]
            Item2   not Item2   row sum
Item1         1         2          3
not Item1     4         2          6
column sum    5         4          9

χ² = Σ (observed − expected)² / expected, with expected = (row sum · column sum) / 9:
χ² = (1 − 3·5/9)²/(3·5/9) + (2 − 3·4/9)²/(3·4/9) + (4 − 6·5/9)²/(6·5/9) + (2 − 6·4/9)²/(6·4/9) = 0.9

The cutoff value at the 95% significance level (1 degree of freedom) is 3.84 > 0.9; thus, we do not reject the independence assumption.
Constraint-based Data Mining
- Finding all the patterns in a database autonomously? Unrealistic! The patterns could be too many, and not focused
- Data mining should be an interactive process: the user directs what is to be mined using a data mining query language (or a graphical user interface)
- Constraint-based mining:
  - User flexibility: the user provides constraints on what is to be mined
  - System optimization: the system exploits such constraints for efficient mining
Constraints in Data Mining
- Knowledge type constraint: classification, association, etc.
- Data constraint (using SQL-like queries): find product pairs sold together in stores in Vancouver in Dec. '00
- Dimension/level constraint: in relevance to region, price, brand, customer category
- Rule (or pattern) constraint: small sales (price < $10) trigger big sales (sum > $200)
- Interestingness constraint: strong rules, e.g., min_support ≥ 3%, min_confidence ≥ 60%
Constrained Mining vs. Constraint-Based Search
- Constrained mining vs. constraint-based search/reasoning in AI:
  - Both aim at reducing the search space
  - Finding all patterns satisfying the constraints vs. finding some (or one) answer in constraint-based search
  - Constraint-pushing vs. heuristic search
  - How to integrate the two is an interesting research problem
- Constrained mining vs. query processing in a DBMS:
  - Database query processing also requires finding all answers
  - Constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing
Constrained Frequent Pattern Mining: A Mining Query Optimization Problem
- Given a frequent-pattern mining query with a set of constraints C, the algorithm should be
  - sound: it finds only frequent sets that satisfy the given constraints C
  - complete: all frequent sets satisfying the given constraints C are found
- A naïve solution: first find all frequent sets, then test them for constraint satisfaction
- More efficient approaches: analyze the properties of the constraints comprehensively, and push them as deeply as possible inside the frequent-pattern computation
Anti-Monotonicity in Constraint-Based Mining
- Anti-monotonicity: if an itemset S satisfies the constraint, so does any of its subsets
- sum(S.Price) ≤ v is anti-monotone; sum(S.Price) ≥ v is not anti-monotone
- Example: C: range(S.profit) ≤ 15 is anti-monotone. Itemset ab violates C (range = 40 − 0 = 40), and so does every superset of ab

TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item  Profit
a       40
b        0
c      -20
d       10
e      -30
f       30
g       20
h      -10
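A sketch of how an anti-monotone constraint is pushed into mining: once an itemset fails the constraint it is pruned, because every superset must fail too. This is a simplified level-wise (Apriori-style) loop of my own, just to show where the constraint prunes; it is not the book's algorithm:

```python
from collections import defaultdict

def mine_with_antimonotone(transactions, min_sup, constraint):
    """Level-wise mining that prunes with an anti-monotone `constraint`:
    candidates that violate it are dropped before growing supersets."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    result = {}
    while level:
        counts = defaultdict(int)
        for t in transactions:
            ts = set(t)
            for cand in level:
                if cand <= ts:
                    counts[cand] += 1
        # Keep only candidates that are frequent AND satisfy the constraint.
        survivors = [c for c in level if counts[c] >= min_sup and constraint(c)]
        result.update({c: counts[c] for c in survivors})
        # Next level: unions of survivors that grow the size by exactly one.
        level = list({a | b for a in survivors for b in survivors
                      if len(a | b) == len(a) + 1})
    return result

profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10,
          'e': -30, 'f': 30, 'g': 20, 'h': -10}
tdb = [['a','b','c','d','f'], ['b','c','d','f','g','h'],
       ['a','c','d','e','f'], ['c','e','f','g']]
in_range = lambda S: max(profit[i] for i in S) - min(profit[i] for i in S) <= 15
print(mine_with_antimonotone(tdb, 2, in_range))
```

On the slide's TDB this keeps pairs such as {f, g} and {c, e} while pruning ab (range 40 > 15) together with all of its supersets.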
Which Constraints Are Anti-Monotone?
Constraint                          Anti-monotone
v ∈ S                               no
S ⊇ V                               no
S ⊆ V                               yes
min(S) ≤ v                          no
min(S) ≥ v                          yes
max(S) ≤ v                          yes
max(S) ≥ v                          no
count(S) ≤ v                        yes
count(S) ≥ v                        no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)          yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)          no
range(S) ≤ v                        yes
range(S) ≥ v                        no
avg(S) θ v, θ ∈ {=, ≤, ≥}           convertible
support(S) ≥ ξ                      yes
support(S) ≤ ξ                      no
Monotonicity in Constraint-Based Mining
- Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets
- sum(S.Price) ≥ v is monotone; min(S.Price) ≤ v is monotone
- Example: C: range(S.profit) ≥ 15. Itemset ab satisfies C (range = 40), and so does every superset of ab
(TDB and profit table as on the anti-monotonicity slide, min_sup = 2)
Which Constraints Are Monotone?
Constraint                          Monotone
v ∈ S                               yes
S ⊇ V                               yes
S ⊆ V                               no
min(S) ≤ v                          yes
min(S) ≥ v                          no
max(S) ≤ v                          no
max(S) ≥ v                          yes
count(S) ≤ v                        no
count(S) ≥ v                        yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)          no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)          yes
range(S) ≤ v                        no
range(S) ≥ v                        yes
avg(S) θ v, θ ∈ {=, ≤, ≥}           convertible
support(S) ≥ ξ                      no
support(S) ≤ ξ                      yes
Succinctness, Convertible, and Inconvertible Constraints in the Book
We will not consider these in this course.
Associative Classification
- Mine possible association rules of the form itemset ⇒ class
  - Itemset: a set of attribute-value pairs
  - Class: a class label
- Build a classifier: organize the rules in decreasing precedence based on confidence and support
- B. Liu, W. Hsu & Y. Ma. Integrating classification and association rule mining. In KDD'98
Classification by Aggregating Emerging Patterns
- Emerging pattern (EP): a pattern frequent in one class of data but infrequent in others
- Example: age ≤ 30 is frequent in class "buys_computer = yes" and infrequent in class "buys_computer = no"
- Rule: age ≤ 30 ⇒ buys_computer = yes
- G. Dong & J. Li. Efficient mining of emerging patterns: discovering trends and differences. In KDD'99