Synthesizing High-Frequency Rules from
Different Data Sources
Xindong Wu and Shichao Zhang
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 15, NO. 2, MARCH/APRIL 2003
Preliminaries
• Knowledge management
• Knowledge discovery
• Data mining
• Data warehouses
Knowledge Management
Building a data warehouse through knowledge management
Knowledge Discovery and Data Mining
Data mining is a tool for knowledge discovery
Why Data Mining?
Simon, a supermarket manager, wants to arrange the commodities in his supermarket. How should he lay them out to increase revenue and customer convenience?
If a customer buys milk, he is likely to buy bread as well, so...
Why Data Mining?
Later, when Simon wants to send advertising letters to customers, accounting for their individual differences is an important task.
Mary always buys diapers and milk powder; she may have a baby, so ….
The Role of Data Mining
Preprocessed data → useful patterns → knowledge and strategy
Mining Association Rules
Example: bread → milk
IF bread is bought THEN milk is bought
Mining Steps
Step 1: define minsup and minconf (e.g., minsup = 50%, minconf = 50%)
Step 2: find the large (frequent) itemsets
Step 3: generate association rules
A brute-force sketch of these steps follows.
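Below is a minimal Python illustration of the three steps, under stated assumptions: the transactions are hypothetical, and the itemset enumeration is brute force with no Apriori-style pruning, which a real miner would need.

```python
# Minimal sketch of the three mining steps (hypothetical transactions).
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "milk", "beer"},
                {"milk", "diaper"}, {"bread", "diaper"}]
minsup, minconf = 0.5, 0.5  # Step 1: define the thresholds
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / n

# Step 2: find the large (frequent) itemsets by brute-force enumeration.
items = sorted(set().union(*transactions))
large = [frozenset(c) for k in range(1, len(items) + 1)
         for c in combinations(items, k)
         if support(frozenset(c)) >= minsup]

# Step 3: generate rules X -> Y with conf = supp(X and Y) / supp(X).
for itemset in (s for s in large if len(s) > 1):
    for k in range(1, len(itemset)):
        for x in map(frozenset, combinations(itemset, k)):
            conf = support(itemset) / support(x)
            if conf >= minconf:
                print(set(x), "->", set(itemset - x),
                      f"supp={support(itemset):.2f}", f"conf={conf:.2f}")
```

On this toy data the sketch prints the two rules bread → milk and milk → bread, each with supp = 0.50 and conf = 0.67.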
Example
Large itemsets
Outline
Introduction
Weights of Data Sources
Rule Selection
Synthesizing High-Frequency Rules Algorithm
Relative Synthesizing Model
Experiments
Conclusion
12
AB→CA→DB→E
AB→CA→DB→E
Introduction Framework
DB1 DB2
...DBn
RD1 RD2 RDn...
GRB Synthesizing High-Frequency Rules
• Weighting
• Ranking
AB→CA→DB→E
Weights of Data Sources
Definitions:
• Di: a data source
• Si: the set of association rules mined from Di
• Ri: an association rule
3 steps (a sketch in code follows this list):
Step 1: union of all Si
Step 2: assign each Ri a weight
Step 3: assign each Di a weight, then normalize
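A minimal Python sketch of the three steps, using the example data from the next slide. Rules are simplified to plain strings, since only the occurrence counts Num(Ri) matter for the weights.

```python
# Sketch of the three weighting steps over the next slide's example.
from collections import Counter

# Si: rules reported by each data source Di
rule_sets = [
    {"AB->C", "A->D", "B->E"},   # S1
    {"B->C",  "A->D", "B->E"},   # S2
    {"AB->C", "A->D"},           # S3
]

# Step 1: union of all Si; Num(Ri) = number of sources reporting Ri
num = Counter(r for s in rule_sets for r in s)

# Step 2: weight of each rule, WRi = Num(Ri) / sum_j Num(Rj)
total = sum(num.values())
w_rule = {r: c / total for r, c in num.items()}

# Step 3: raw weight of each source, WDi = sum of Num(Rj) * WRj over Rj in Si,
# then normalization so the source weights sum to 1
raw = [sum(num[r] * w_rule[r] for r in s) for s in rule_sets]
w_src = [x / sum(raw) for x in raw]

print(w_rule)  # AB->C: 0.25, A->D: 0.375, B->E: 0.25, B->C: 0.125
print([round(x, 3) for x in raw])    # [2.125, 1.75, 1.625]
print([round(w, 3) for w in w_src])  # [0.386, 0.318, 0.295]
```

The printed values match the worked example on the following slides.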
Example
3 data sources (minsupp = 0.2, minconf = 0.3):
S1: AB→C with supp = 0.4, conf = 0.72; A→D with supp = 0.3, conf = 0.64; B→E with supp = 0.34, conf = 0.7
S2: B→C with supp = 0.45, conf = 0.87; A→D with supp = 0.36, conf = 0.7; B→E with supp = 0.4, conf = 0.6
S3: AB→C with supp = 0.5, conf = 0.82; A→D with supp = 0.25, conf = 0.62
Step 1: Union of All Si
S' = S1 ∪ S2 ∪ S3, with each rule's frequency counted across sources:
R1: AB→C, in S1 and S3 (2 times)
R2: A→D, in S1, S2, and S3 (3 times)
R3: B→E, in S1 and S2 (2 times)
R4: B→C, in S2 (1 time)
Step 2: Assigning Each Ri a Weight
WRi = Num(Ri) / Σj Num(Rj):
WR1 = 2 / (2 + 3 + 2 + 1) = 0.25
WR2 = 3 / (2 + 3 + 2 + 1) = 0.375
WR3 = 2 / (2 + 3 + 2 + 1) = 0.25
WR4 = 1 / (2 + 3 + 2 + 1) = 0.125
Step 3: Assigning Each Di a Weight
WDi = Σ of Num(Rj) × WRj over the rules Rj in Si:
WD1 = 2×0.25 + 3×0.375 + 2×0.25 = 2.125
WD2 = 1×0.125 + 3×0.375 + 2×0.25 = 1.75
WD3 = 2×0.25 + 3×0.375 = 1.625
Normalization:
WD1 = 2.125 / (2.125 + 1.75 + 1.625) ≈ 0.386
WD2 = 1.75 / 5.5 ≈ 0.318
WD3 = 1.625 / 5.5 ≈ 0.295
Summary:
Ri | WRi | Num(Ri) | Sources
R1: AB→C | 0.25 | 2 | S1, S3
R2: A→D | 0.375 | 3 | S1, S2, S3
R3: B→E | 0.25 | 2 | S1, S2
R4: B→C | 0.125 | 1 | S2
Why Rule Selection?
Goal: extract high-frequency rules; low-frequency rules are noise.
Solution: if Num(Ri) / n < λ, then rule Ri is wiped out,
where n is the number of data sources, Num(Ri) is the frequency of Ri (how many sources report it), and λ is the minimum frequency threshold.
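As a minimal sketch, the selection test is just a frequency filter; the rule names below are hypothetical placeholders.

```python
# Keep rule Ri only if Num(Ri) / n >= lam (lam stands in for lambda).
def select_high_frequency(num: dict, n: int, lam: float) -> dict:
    return {r: c for r, c in num.items() if c / n >= lam}

print(select_high_frequency({"AB->C": 2, "A->D": 3, "B->E": 2, "B->C": 1},
                            n=3, lam=0.4))  # B->C (1/3 < 0.4) is wiped out
```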
Rule Selection
Example: 10 data sources, λ = 0.8
D1~D9: {R1: X→Y}
D10: {R1: X→Y, R2: X1→Y1, …, R11: X10→Y10}
Num(R1) / 10 = 10/10 = 1 ≥ λ → keep
Num(R2~R11) / 10 = 1/10 = 0.1 < λ → wiped out
After selection, D1~D10 each contain only {R1: X→Y}, so WR1 = 10/10 = 1 and WDi = 10×1 / (10 × 10×1) = 0.1 for every source.
Comparison
Without rule selection: WD1~9 ≈ 0.099, WD10 ≈ 0.109
With rule selection: WD1~10 = 0.1
From the high-frequency-rules point of view, the weight errors are:
D1~D9: |0.1 − 0.099| = 0.001 each
D10: |0.1 − 0.109| = 0.009
Total error = 9 × 0.001 + 0.009 = 0.018
Synthesizing High-Frequency Rules: the Algorithm
5 steps (an end-to-end sketch follows this list):
Step 1: rule selection
Step 2: weights of data sources
Step 2.1: union of all Si
Step 2.2: assign each Ri a weight
Step 2.3: assign each Di a weight, then normalize
Step 3: compute the synthesized supp and conf of each Ri
Step 4: rank all rules by support
Step 5: output the high-frequency rules
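Putting the five steps together, here is a self-contained Python sketch over the example of the next slides. It assumes each source reports its rules as a dict of (supp, conf) pairs; `lam` stands in for the selection threshold λ, and the printed values match the worked example up to rounding.

```python
# End-to-end sketch of the five synthesizing steps (example data below).
from collections import Counter

sources = [
    {"AB->C": (0.40, 0.72), "A->D": (0.30, 0.64), "B->E": (0.34, 0.70)},  # S1
    {"B->C":  (0.45, 0.87), "A->D": (0.36, 0.70), "B->E": (0.40, 0.60)},  # S2
    {"AB->C": (0.50, 0.82), "A->D": (0.25, 0.62)},                        # S3
]
lam, minsupp, minconf = 0.4, 0.2, 0.3
n = len(sources)

# Step 1: rule selection, wiping out low-frequency rules
num = Counter(r for s in sources for r in s)
kept = {r for r, c in num.items() if c / n >= lam}
sources = [{r: v for r, v in s.items() if r in kept} for s in sources]

# Step 2: weights of rules, then normalized weights of data sources
num = Counter(r for s in sources for r in s)
total = sum(num.values())
w_rule = {r: c / total for r, c in num.items()}
raw = [sum(num[r] * w_rule[r] for r in s) for s in sources]
w_src = [x / sum(raw) for x in raw]

# Step 3: synthesized supp/conf of each rule, weighted over its sources
synth = {}
for r in kept:
    synth[r] = tuple(sum(w * s[r][i] for w, s in zip(w_src, sources) if r in s)
                     for i in (0, 1))

# Steps 4-5: rank by synthesized support, output the high-frequency rules
for r, (supp, conf) in sorted(synth.items(), key=lambda kv: -kv[1][0]):
    if supp >= minsupp and conf >= minconf:
        print(f"{r}: supp={supp:.4f}, conf={conf:.3f}")
```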
An Example
3 data sources, λ = 0.4, minsupp = 0.2, minconf = 0.3:
S1: AB→C with supp = 0.4, conf = 0.72; A→D with supp = 0.3, conf = 0.64; B→E with supp = 0.34, conf = 0.7
S2: B→C with supp = 0.45, conf = 0.87; A→D with supp = 0.36, conf = 0.7; B→E with supp = 0.4, conf = 0.6
S3: AB→C with supp = 0.5, conf = 0.82; A→D with supp = 0.25, conf = 0.62
Step 1: Rule Selection
R1: AB→C, in S1 and S3 (2 times); Num(R1)/3 = 2/3 ≈ 0.67 ≥ 0.4 → keep
R2: A→D, in S1, S2, and S3 (3 times); Num(R2)/3 = 1 ≥ 0.4 → keep
R3: B→E, in S1 and S2 (2 times); Num(R3)/3 ≈ 0.67 ≥ 0.4 → keep
R4: B→C, in S2 (1 time); Num(R4)/3 ≈ 0.33 < 0.4 → wiped out
Step 2: Weights of Data Sources
Weights of Ri:
WR1 = 2 / (2 + 3 + 2) ≈ 0.29
WR2 = 3 / (2 + 3 + 2) ≈ 0.42
WR3 = 2 / (2 + 3 + 2) ≈ 0.29
Ri | WRi | Num(Ri) | Sources
R1: AB→C | 0.29 | 2 | S1, S3
R2: A→D | 0.42 | 3 | S1, S2, S3
R3: B→E | 0.29 | 2 | S1, S2
Weights of Di:
WD1 = 2×0.29 + 3×0.42 + 2×0.29 = 2.42
WD2 = 3×0.42 + 2×0.29 = 1.84
WD3 = 2×0.29 + 3×0.42 = 1.84
Normalization:
WD1 = 2.42 / (2.42 + 1.84 + 1.84) ≈ 0.396
WD2 = 1.84 / 6.1 ≈ 0.302
WD3 = 1.84 / 6.1 ≈ 0.302
Step 3: Computing supp and conf of Each Ri
Remaining rule sets after selection, with WD1 = 0.396, WD2 = 0.302, WD3 = 0.302:
S1: AB→C (0.4, 0.72); A→D (0.3, 0.64); B→E (0.34, 0.7)
S2: A→D (0.36, 0.7); B→E (0.4, 0.6)
S3: AB→C (0.5, 0.82); A→D (0.25, 0.62)
Synthesized support:
AB→C: 0.396×0.4 + 0.302×0.5 = 0.3094
A→D: 0.396×0.3 + 0.302×0.36 + 0.302×0.25 ≈ 0.303
B→E: 0.396×0.34 + 0.302×0.4 = 0.255
Synthesized confidence:
AB→C: 0.396×0.72 + 0.302×0.82 = 0.532
A→D: 0.396×0.64 + 0.302×0.7 + 0.302×0.62 ≈ 0.652
B→E: 0.396×0.7 + 0.302×0.6 = 0.458
Step 4 & Step 5
Rank all rules by synthesized support and output those meeting minsupp = 0.2 and minconf = 0.3.
Ranking: 1. AB→C (0.3094) 2. A→D (0.303) 3. B→E (0.255)
Output, 3 rules: AB→C (supp 0.3094, conf 0.532), A→D (0.303, 0.652), B→E (0.255, 0.458)
Relative Synthesizing Model
Framework: when the data sources Di are unknown (e.g., rules gathered over the Internet from the Web, books, and journals), the same rule X→Y may arrive with different confidences (conf = 0.7, 0.72, 0.68, …). What is the synthesized confidence? Synthesizing is done by:
• a clustering method
• a rough method
Synthesizing Methods
Physical meaning: if the confidences are irregularly distributed, use a simple operator:
• Maximum synthesizing operator
• Minimum synthesizing operator
• Average synthesizing operator
If the confidences X follow a normal distribution, cluster them into an interval [a, b] satisfying:
1. P{a ≤ X ≤ b} = m/n ≥ α (m of the n confidences fall in [a, b])
2. |b − a| ≤ ε
3. a, b > minconf
Clustering Method
5 steps (a sketch follows this list):
Step 1: closeness cij = 1 − |confi − confj| → the distance relation table
Step 2: closeness degree measure → the confidence-confidence matrix
Step 3: are two confidences close enough? → the confidence relationship matrix
Step 4: class creation → [a, b], the interval of the confidence of rule X→Y
Step 5: interval verification → does [a, b] satisfy the constraints?
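The slides do not reproduce the confidence-confidence matrix of Step 2, so the sketch below substitutes a simpler variant, which is an assumption: two confidences are linked whenever their Step 1 closeness clears a threshold (0.95 here, a hypothetical stand-in for the 6.9 test of Step 3), classes are the connected components of that relation, and Step 5 checks each class interval against the three constraints. On the example data of the next slide it yields the same classes and the same interval [0.68, 0.72].

```python
# Simplified clustering sketch: link-by-closeness, then verify intervals.
confs = [0.70, 0.72, 0.68, 0.50, 0.71, 0.69, 0.70, 0.91]
alpha, eps, minconf = 0.7, 0.08, 0.65
close_thresh = 0.95  # hypothetical stand-in for the slides' 6.9 test

n = len(confs)
closeness = [[1 - abs(a - b) for b in confs] for a in confs]  # Step 1

# Steps 2-4: grow classes over the "close enough" relation (transitive closure)
classes, seen = [], set()
for i in range(n):
    if i in seen:
        continue
    cls, frontier = {i}, [i]
    while frontier:
        j = frontier.pop()
        for k in range(n):
            if k not in cls and closeness[j][k] >= close_thresh:
                cls.add(k)
                frontier.append(k)
    seen |= cls
    classes.append(sorted(cls))

# Step 5: verify each class interval [a, b] against the three constraints
for cls in classes:
    a = min(confs[i] for i in cls)
    b = max(confs[i] for i in cls)
    if len(cls) / n >= alpha and b - a <= eps and a > minconf:
        print(f"synthesized confidence of X->Y: [{a}, {b}]")
```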
An Example
Assume rule X→Y with
conf1 = 0.7, conf2 = 0.72, conf3 = 0.68, conf4 = 0.5,
conf5 = 0.71, conf6 = 0.69, conf7 = 0.7, conf8 = 0.91
3 parameters: α = 0.7, ε = 0.08, δ = 0.69
Step 1: Closeness
Example: conf1 = 0.7, conf2 = 0.72
c1,2 = 1 − |conf1 − conf2| = 1 − |0.70 − 0.72| = 0.98
Step 2: Closeness Degree Measure
Example: the pairwise closenesses are combined into the confidence-confidence matrix (matrix not reproduced).
Step 3: Close Enough?
Example: threshold = 6.9
Matrix entries > 6.9 mark the two confidences as close enough (1 in the confidence relationship matrix); entries < 6.9 mark them as not close (0).
Step 4: Class Creation
Example:
Class 1: conf1~conf3, conf5~conf7
Class 2: conf4
Class 3: conf8
Step 5: Interval Verification
Example, Class 1: conf1 = 0.7, conf2 = 0.72, conf3 = 0.68, conf5 = 0.71, conf6 = 0.69, conf7 = 0.7
[a, b] = [min, max] = [conf3, conf2] = [0.68, 0.72]
Constraint 1: P{0.68 ≤ X ≤ 0.72} = 6/8 = 0.75 ≥ α (0.7) ✓
Constraint 2: |0.72 − 0.68| = 0.04 ≤ ε (0.08) ✓
Constraint 3: 0.68, 0.72 > minconf (0.65) ✓
In the same way, Class 2 and Class 3 are wiped out.
Result: X→Y with conf = [0.68, 0.72]
Support? Synthesized in the same way, as an interval.
Rough Method
Example: R: AB→C with supp1 = 0.4, conf1 = 0.72 and supp2 = 0.5, conf2 = 0.82
Maximum: max(supp(R)) = max(0.4, 0.5) = 0.5; max(conf(R)) = max(0.72, 0.82) = 0.82
Minimum: (0.4, 0.72)
Average: (0.45, 0.77)
A tiny sketch of the three operators follows.
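The three operators apply componentwise to the reported (supp, conf) pairs; a minimal Python sketch with the slide's numbers:

```python
# The three rough synthesizing operators on (supp, conf) reports for AB->C.
reports = [(0.40, 0.72), (0.50, 0.82)]  # (supp_i, conf_i) from each source

supps, confs = zip(*reports)
print("maximum:", max(supps), max(confs))                            # 0.5, 0.82
print("minimum:", min(supps), min(confs))                            # 0.4, 0.72
print("average:", sum(supps) / len(supps), sum(confs) / len(confs))  # 0.45, 0.77
```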
Experiments
Running time: SWNBS (synthesizing without rule selection) vs. SWBRS (synthesizing with rule selection); SWNBS takes longer (SWNBS > SWBRS).
Error on the first 20 frequent itemsets:
Max = 0.000065
Avg = 0.00003165
Conclusion
Synthesizing model:
• Data sources known → weighting
• Data sources unknown → clustering method or rough method
Future Work
• Sequential patterns
• Combining GA with other techniques