An association rules miner based on run length encoding
Student: Hsieh Tsung Hao    Advisor: Dr. Hung-Pin Chiu
Department of Information Management
The M.I.M. Program, Nan-Hua University
ABSTRACT
Association rules have been a popular research topic in recent years, and
their importance and practicality are gradually gaining attention across
industries. The execution efficiency of the mining algorithms has therefore
become a central concern.
Early approaches store the transaction records in a database, run an
association rule algorithm against it, and then mine the data. Because the
data must be read from the hard disk repeatedly, the execution speed is far
from ideal, which lowers the efficiency of data mining.
To address this problem we propose a method based on Run-Length Encoding
(RLE). The data are compressed with this encoding when they are stored, and
mining is carried out directly on the compressed data without decompressing
or decoding it, which improves mining efficiency.
Keywords: Data Mining, Association Rules, Data Compression
1.1
(Data Mining)
1.2
(Association Rules)
(Classification)(Clustering)(Sequential
Patterns)
(1) (Association Rule)
(Item)
(2) (Classification)
[8]
(Decision Tree)
(3) (Clustering)
K-means
agglomeration
(4) (Sequential Patterns)
Agrawal and Srikant
1.3
(Association rule)
Apriori 1994 R.Agrawal
Apriori (Large Itemsets)
(Run-length Encoding RLE)
1.4
(1)
Run-Length Encoding(RLE List Set)
(2)
Fp-Growth
(Association Rule)Agrawal1993
6% 86%
86%
I = {I1, I2, I3, ..., In}
(items)I(Transaction)
D
(Item)
TID 2-1
(Itemset) 1-
(1-Itemset) 2-(2-Itemset)
Table 2-1
TID    Items
T10    E, F, G
T20    A, B, F
T30    A, B, D
(Frequent itemsets)
(Frequent
Itemset or Large Itemset)
(Complete Large Itemsets)
I
$$\mathrm{Support}(I) = \frac{|\{\,T \in D : I \subseteq T\,\}|}{|D|}$$
I
X Y X Y
X Y
X Y (Confidence)
X Y
XY
$$\mathrm{Confidence}(X \Rightarrow Y) = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)}$$
X Y
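The two measures above can be illustrated with a minimal sketch. It assumes the usual definitions (support as the fraction of transactions containing an itemset, confidence as the ratio of two supports) and uses the three transactions of Table 2-1; the function names `support` and `confidence` are illustrative only.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X => Y: Support(X u Y) / Support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Transactions T10, T20, T30 of Table 2-1.
transactions = [frozenset("EFG"), frozenset("ABF"), frozenset("ABD")]

sup_ab = support("AB", transactions)           # A and B occur together in T20 and T30
conf_a_b = confidence("A", "B", transactions)  # every transaction with A also has B
conf_f_b = confidence("F", "B", transactions)  # F occurs in T10 and T20, B only in T20
```

A rule is normally kept only when both its support and its confidence reach user-chosen thresholds.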
2.1 Apriori
Apriori [8] Agrawal 1994
(Frequent Itemsets)
(Association Rules)
Apriori
2-2 Apriori Apriori
1.
2.
- 2-1 Apriori
L1 = {large 1-itemsets};
for (k = 2; L(k-1) ≠ ∅; k++) do begin
    Ck = apriori-gen(L(k-1));
    for all transactions t ∈ D do begin
        Ct = subset(Ck, t);
        for all candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup}
end
Answer = ∪k Lk;
2-1Apriori
2-2Apriori
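The Apriori procedure of Figure 2-1 can be sketched in Python as follows. This is a minimal illustration rather than the thesis implementation: `apriori_gen` uses the standard join-and-prune step, support is an absolute count, and the four-transaction database is a hypothetical example.

```python
from itertools import combinations


def apriori_gen(prev_large, k):
    """Join large (k-1)-itemsets into k-itemsets, then prune any candidate
    that has a (k-1)-subset which is not large."""
    prev = set(prev_large)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(transactions, minsup):
    """Return {itemset: support count} for every large itemset."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c >= minsup}
    answer = dict(large)
    k = 2
    while large:
        candidates = apriori_gen(large.keys(), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:          # one full database scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {s: c for s, c in counts.items() if c >= minsup}
        answer.update(large)
        k += 1
    return answer


# Hypothetical database, minimum support count 2.
result = apriori(["ACD", "BCE", "ABCE", "BE"], 2)
```

The repeated database scan inside the `while` loop is the I/O cost that the later sections aim to avoid.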
Apriori
(1)(Itemset)
2- 1- 1-
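k large 1-itemsets generate (k-1) + (k-2) + ... + 1 = k*(k-1)/2 candidate 2-itemsets;
for example, 10 large 1-itemsets yield 45 candidates and 1000 yield 499,500.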
2-
(2)
(1)
(support)
Apriori I/O
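2.2 Frequent-Pattern tree (FP-Growth)
FP-Growth was proposed by Han, Pei, and Yin (2000) [2]. It mines frequent
patterns without generating candidate itemsets: the database is compressed
into a Frequent-Pattern tree, which avoids the cost of candidate itemsets
and reduces the disk I/O required by FP-Growth.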
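(I) Building the Frequent-Pattern tree
(1) Scan the database to find the large 1-itemsets.
(2) Sort the large items of each transaction by descending support.
(3) Insert the sorted transactions into the FP-tree.
(II) Mining the FP-Tree
(1) For each node (item) of the FP-Tree, collect its conditional pattern base.
(2) Build the conditional FP-Tree from the conditional pattern base.
(3) Mine the conditional FP-Tree recursively; each conditional FP-Tree
yields Frequent Patterns.
(4) When a conditional FP-Tree contains a single path, enumerate the pattern
combinations on that path.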
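The construction steps and the first mining step can be sketched as below. This is a simplified illustration under assumptions: ties in support are broken alphabetically, the header table is a plain dictionary of node lists, and the four-transaction database is hypothetical. The recursive mining of conditional FP-Trees is omitted.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, minsup):
    """Build an FP-tree; return (root, header) where header maps each
    large item to the list of tree nodes that carry it."""
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= minsup}
    root = FPNode(None, None)
    header = {i: [] for i in freq}
    for t in transactions:
        # Keep only large items, ordered by descending support.
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


def conditional_pattern_base(item, header):
    """Prefix paths leading to `item`, each with the count of its node."""
    base = []
    for node in header[item]:
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((path[::-1], node.count))
    return base


# Hypothetical database, minimum support count 2.
root, header = build_fp_tree(["ACD", "BCE", "ABCE", "BE"], 2)
base_a = conditional_pattern_base("A", header)
```

Only two database scans are needed: one to count the items and one to insert the sorted transactions.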
FP-Growth
FP-Tree
2.3 DLGDirect Large itemsets Generation
DLG[11] Bit-Map
Bit- Map
0 1 01
Bit-Map
AND1
DLG 2 2 Apriori
DLG DLG L
L Directed-Graph{X,Y} L
X Y
DLG
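i. Large 1-itemset generation: scan the database once and build the Bit-Map.
Each item gets a Bit-Vector of n Bits for a database of n transactions; a bit
is 1 if the corresponding transaction contains the item and 0 otherwise.
ii. Large 2-itemset generation: for every pair of large items Im and In
(m < n), AND their Bit-Vectors and count the 1 bits of the result. If
Support >= Minsup, {Im, In} is added to L2.
iii. Graph construction: for each large 2-itemset {X, Y} in L2, add a
directed edge from X to Y.
iv. Large K-itemset generation (K > 2): for each large (K-1)-itemset
{i1, i2, ..., ik} in L(K-1), if the graph has an edge from ik to an item u,
then {i1, i2, ..., ik, u} is a candidate K-itemset in CK. The Bit-Vectors of
its items are combined with AND and the 1 bits counted; candidates whose
count reaches Minsup are the K-Large Itemsets.
v. Step iv is repeated until no further large itemsets are generated.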
DLG
Minimum support = 2

Table 2-2: DLG example database
TID    Items
1      A C D
2      B C E
3      A B C E
4      B E
Table 2-3: DLG L1 Bit-Map
L1    Bit-Vector
A     1010
B     0111
C     1110
E     0111

C2 = {{A B}, {A C}, {A E}, {B C}, {B E}, {C E}}
L2 = {{A C}, {B C}, {B E}, {C E}}
[Figure 2-3: DLG directed graph over the nodes A, B, C and E]
C3 = {{A C E}, {B C E}}
L3 = {{B C E}}
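The DLG example can be reproduced with a short sketch that stores each Bit-Vector in a Python integer, with the first transaction as the leftmost bit. The helper names (`bit_vector`, `popcount`) are illustrative and not taken from the DLG paper.

```python
from itertools import combinations


def bit_vector(item, transactions):
    """Bit i (from the left) is 1 when transaction i contains the item."""
    bits = 0
    for t in transactions:
        bits = (bits << 1) | (1 if item in t else 0)
    return bits


def popcount(v):
    return bin(v).count("1")


transactions = ["ACD", "BCE", "ABCE", "BE"]   # Table 2-2
minsup = 2

vectors = {i: bit_vector(i, transactions) for i in "ABCDE"}
large1 = {i: v for i, v in vectors.items() if popcount(v) >= minsup}

# Large 2-itemsets: AND the two Bit-Vectors and count the 1 bits.
large2 = {}
for x, y in combinations(sorted(large1), 2):
    v = large1[x] & large1[y]
    if popcount(v) >= minsup:
        large2[(x, y)] = v

# Directed graph: one edge X -> Y per large 2-itemset {X, Y}.
edges = {}
for x, y in large2:
    edges.setdefault(x, []).append(y)

# Large 3-itemsets: extend {X, Y} with u whenever the edge Y -> u exists.
large3 = {}
for (x, y), v in large2.items():
    for u in edges.get(y, []):
        w = v & large1[u]
        if popcount(w) >= minsup:
            large3[(x, y, u)] = w
```

After the single scan that builds the Bit-Vectors, every support count is obtained with AND operations in memory.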
DLG
DLG Bit-Vector
DLG
Bit-Vector
2-4
2-4
Apriori
DB
FP Growth
DB
DLG
(Run-Length
Encoding)
3.1 RLE
Run-Length EncodingRLE Apriori
RLE 3-1RLE
Database
() RLE
Run Length Encoding
RLE
RLE
3-1RLE
3-1 RLE
(min sup)L1L1
() RLE Run Length Encoding
Lk-1CkRLE
Lk
3.2 RLE
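Run Length Encoding (RLE) of an item I: if I appears in the consecutive
transactions t, t+1, t+2, ..., t+(l-1), these transactions form a Run of
length l, whose RLE code is (t, l).
RLE List of an item I: if I has k Runs whose RLE codes are
(t1, l1), (t2, l2), ..., (tk, lk), then the RLE List of I is
[(t1, l1), (t2, l2), ..., (tk, lk)].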
3-1 3-2
3-1
Tid Item set
T1 B,C,D
T2 A,F
T3 B,E,F,G
T4 A,B,E
T5 A,E,F
T6 A,D,E,F
T7 B,D,E
T8 B,C,D,G
T9 A,B
T10 A,B,E,F
Table 3-2
Item    Run RLE
A       [(2,1),(4,3),(9,2)]
B       [(1,1),(3,2),(7,4)]
C       [(1,1),(8,1)]
D       [(1,1),(6,3)]
E       [(3,5),(10,1)]
F       [(2,2),(5,2),(10,1)]
G       [(3,1),(8,1)]
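In Table 3-2 the runs of item B are (1,1),(3,2),(7,4). (1,1) means that B
appears from transaction 1 for 1 consecutive transaction, (3,2) that B
appears from transaction 3 for 2 consecutive transactions, and (7,4) that
B appears from transaction 7 for 4 consecutive transactions. The Run list
of B is therefore [(1,1),(3,2),(7,4)].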
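The encoding can be sketched as follows, assuming transaction identifiers are consecutive integers starting at 1 as in Table 3-1. The function names `rle_encode` and `build_rle_lists` are illustrative.

```python
def rle_encode(tids):
    """Encode a list of transaction ids as runs (t, l): l consecutive
    transactions starting at transaction t."""
    runs = []
    for tid in sorted(tids):
        if runs and tid == runs[-1][0] + runs[-1][1]:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)   # extend the current run
        else:
            runs.append((tid, 1))            # start a new run
    return runs


def build_rle_lists(transactions):
    """Scan the database once and return {item: RLE list}."""
    tids = {}
    for tid, items in enumerate(transactions, start=1):
        for item in items:
            tids.setdefault(item, []).append(tid)
    return {item: rle_encode(t) for item, t in tids.items()}


# Transactions T1..T10 of Table 3-1.
TABLE_3_1 = ["BCD", "AF", "BEFG", "ABE", "AEF",
             "ADEF", "BDE", "BCDG", "AB", "ABEF"]
rle_lists = build_rle_lists(TABLE_3_1)
```

The support count of a single item is the sum of its run lengths, so no further database access is needed once the lists are built.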
3.3 RLE
3-2
L1
Lk-1 Ck
RLE
RLE
RLE getSupport() Lk
3-2
3.3.1 RLE
RLE 3-3
RLE
2 3-3
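The comparison distinguishes six cases, A, B, C, D, E and F, according to
the position of the Src run relative to the dest run.
[Figure 3-3: RLE comparison cases A to F; each Src run and the dest run are drawn from their left to their right endpoint]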
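Run 1 of A is taken as the destination run (dest) and Run 1 of B as the
source run (Src). The right endpoints dest_right and Src_right are then
compared: dest_right >= Src_right corresponds to cases A, B and C, and
dest_right < Src_right to cases D, E and F.

When dest_right >= Src_right:
Case A: dest is (5,3) and Src is (1,1). A(1,1) covers transaction 1 only
and dest covers transactions 5 to 7, so they do not overlap (Figure 3-4).
[Figure 3-4: RLE case A. dest (5,3) spans 5 to 7; Src = A(1,1) spans 1 to 1]
Case B: dest is (5,3) and Src is (4,3). B(4,3) overlaps dest in 2
transactions, 5 to 6, so the result for dest,B is (5,2) (Figure 3-5).
[Figure 3-5: RLE case B. dest (5,3) spans 5 to 7; Src = B(4,3) spans 4 to 6]
Case C: dest is (5,3) and Src is (6,1). C(6,1) overlaps dest in 1
transaction, 6 to 6, so the result for dest,C is (6,1) (Figure 3-6).
[Figure 3-6: RLE case C. dest (5,3) spans 5 to 7; Src = C(6,1) spans 6 to 6]

When dest_right < Src_right:
Case D: dest is (5,3) and Src is (6,3). D(6,3) overlaps dest in 2
transactions, 6 to 7, so the result for dest,D is (6,2) (Figure 3-7).
[Figure 3-7: RLE case D. dest (5,3) spans 5 to 7; Src = D(6,3) spans 6 to 8]
Case E: dest is (5,3) and Src is (8,2). E(8,2) covers transactions 8 to 9
and does not overlap dest (Figure 3-8).
[Figure 3-8: RLE case E. dest (5,3) spans 5 to 7; Src = E(8,2) spans 8 to 9]
Case F: dest is (5,3) and Src is (4,5). F(4,5) overlaps dest in 3
transactions, 5 to 7, so the result for dest,F is (5,3) (Figure 3-9).
[Figure 3-9: RLE case F. dest (5,3) spans 5 to 7; Src = F(4,5) spans 4 to 8]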
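All six cases reduce to one interval intersection, which the sketch below makes explicit. It assumes the (t, l) definition of a run from section 3.2, so a run covers transactions t to t+l-1; case B is written as (4,3) because its Src run spans transactions 4 to 6. The function name `intersect_runs` is illustrative.

```python
def intersect_runs(dest, src):
    """Return the run shared by dest and src, or None when they are disjoint."""
    left = max(dest[0], src[0])
    right = min(dest[0] + dest[1] - 1, src[0] + src[1] - 1)
    if left > right:
        return None
    return (left, right - left + 1)


dest = (5, 3)  # transactions 5 to 7
cases = {
    "A": (1, 1),  # entirely to the left of dest
    "B": (4, 3),  # overlaps the left part of dest
    "C": (6, 1),  # contained in dest
    "D": (6, 3),  # overlaps the right part of dest
    "E": (8, 2),  # entirely to the right of dest
    "F": (4, 5),  # contains dest
}
results = {name: intersect_runs(dest, src) for name, src in cases.items()}
```

The comparison of dest_right with Src_right is still needed when two whole lists are merged, because it decides which list advances to its next run.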
3.3.2 RLE
RLE 3-10
3-10 RLE
L1 = find_frequent_1-itemsets(D);
for (k = 2; L(k-1) ≠ ∅; k++)
{
    Lk = ∅;
    Ck = apriori-gen(L(k-1));            // generate the candidate set Ck
    for all candidates c in Ck do
    {
        count = getSupport(c, min_sup);
        if (count >= min_sup) Lk = Lk ∪ {c};
    }
}
return L = ∪k Lk;
3-10 RLE
RLE
Lk-1 Ck
CkRLE ListRLE
And_op()
getSupport() K
int getSupport(K-itemset S, min_sup)
{
    resRLE = RLE(S(1))
    for (I = 2 to k)
    {
        (Count, resRLE) = And_OP(resRLE, RLE(S(I)))
        If(Count
If(SrcRight
// Advance to the next run of the Src list and cache its endpoints.
public void nextSRun()
{
    Sindex++;
    SrcRight = s[Sindex].base + s[Sindex].length - 1;
    SrcLeft = s[Sindex].base;
}
// Advance to the next run of the dest list and cache its endpoints.
public void nextDRun()
{
    Dindex++;
    DestRight = d[Dindex].base + d[Dindex].length - 1;
    DestLeft = d[Dindex].base;
}
3-11 RLE
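Because the listing in Figure 3-11 is incomplete in this copy, the following Python sketch shows one way And_op() and getSupport() could work. It is an assumption-laden reconstruction, not the original C# code: the two lists are merged with one index per list, the list whose run ends first advances, and the early exit when the count drops below min_sup is assumed.

```python
def and_op(dest, src):
    """Intersect two RLE lists; return (support count, resulting RLE list)."""
    result, count = [], 0
    d = s = 0
    while d < len(dest) and s < len(src):
        dest_left, dest_right = dest[d][0], dest[d][0] + dest[d][1] - 1
        src_left, src_right = src[s][0], src[s][0] + src[s][1] - 1
        left, right = max(dest_left, src_left), min(dest_right, src_right)
        if left <= right:                   # the two runs overlap
            result.append((left, right - left + 1))
            count += right - left + 1
        if dest_right >= src_right:
            s += 1                          # Src run is used up: next Src run
        else:
            d += 1                          # dest run is used up: next dest run
    return count, result


def get_support(itemset, rle_lists, min_sup):
    """AND the RLE lists of all items; stop early once below min_sup (assumed)."""
    res = rle_lists[itemset[0]]
    count = sum(length for _, length in res)
    for item in itemset[1:]:
        count, res = and_op(res, rle_lists[item])
        if count < min_sup:
            break
    return count, res


# RLE lists of Table 3-2 for the items used below.
rle_lists = {
    "A": [(2, 1), (4, 3), (9, 2)],
    "B": [(1, 1), (3, 2), (7, 4)],
    "E": [(3, 5), (10, 1)],
    "F": [(2, 2), (5, 2), (10, 1)],
}
ab = get_support("AB", rle_lists, 3)
aef = get_support("AEF", rle_lists, 3)
abf = get_support("ABF", rle_lists, 3)
```

Each call touches every run of the two lists at most once, so the cost grows with the number of runs rather than with the number of transactions.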
3.4 RLE
RLE
L1-itemset RLE
L1 C2
C2 RLE List Set RLE And_op()
L2 RLE List Set
3.4.1 RLE
k- RLE
(K+1)- RLE RLE
3-1 10
3-3 RLE 3-4
RLE list Set
Support =3
3-3 RLE
itemset RLE List Set supcount
A [(2,1),(4,3),(9,2)] 6
B [ (1,1),(3,2),(7,4) ] 7
C [ (1,1),(8,1) ] 2
D [ (1,1),(6,3) ] 4
E [ (3,5),(10,1) ] 6
F [ (2,2),(5,2),(10,1) ] 5
G [ (3,1),(8,1) ] 2
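The support count of an item is the sum of the Run lengths in its RLE List.
Among the 1-itemsets in Table 3-3, C and G do not reach the minimum Support
and are removed; the remaining items form L1, shown in Table 3-4.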
1- Apriori
Table 3-4: large 1-itemsets
itemset    RLE List Set             supcount
A          [(2,1),(4,3),(9,2)]      6
B          [(1,1),(3,2),(7,4)]      7
D          [(1,1),(6,3)]            4
E          [(3,5),(10,1)]           6
F          [(2,2),(5,2),(10,1)]     5
Joining the large 1-itemsets gives the candidate 2-itemsets of Table 3-5.

Table 3-5: candidate 2-itemsets (C2)
2-itemset
AB, AD, AE, AF, BD, BE, BF, DE, DF, EF
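Take AB as an example. The RLE lists of A and B are A = [(2,1),(4,3),(9,2)]
and B = [(1,1),(3,2),(7,4)]. Applying And_op() to the two RLE lists gives
AB = [(4,1),(9,2)]: one transaction starting at transaction 4 and two
transactions starting at transaction 9, so the support count of AB is 3.
Table 3-6 lists the large 2- and 3-itemsets obtained with RLE.

Table 3-6: large 2- and 3-itemsets obtained with RLE
itemset    supcount
AB         3
AE         4
AF         4
BD         3
BE         4
EF         4

itemset    supcount
AEF        3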
RLE
LRE List Set
3.4.2 RLE
= 3
3-12RLE
C1 (candidate 1-itemsets)
Itemset       A  B  C  D  E  F  G
Sup. Count    6  7  2  4  6  5  2

L1 (large 1-itemsets)
Itemset       A  B  D  E  F
Sup. Count    6  7  4  6  5

C2 (candidate 2-itemsets)
Itemset       AB  AD  AE  AF  BD  BE  BF  DE  DF  EF
Sup. Count    3   1   4   4   3   4   2   2   1   4

L2 (large 2-itemsets)
Itemset       AB  AE  AF  BD  BE  EF
Sup. Count    3   4   4   3   4   4

C3 (candidate 3-itemsets)
Itemset       ABE  ABF  AEF  BDE  BDF  BEF
Sup. Count    2    1    3    1    0    2

L3 (large 3-itemsets)
Itemset       AEF
Sup. Count    3
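The worked example can be checked end to end with the sketch below, which mines the database of Table 3-1 at minimum support 3 using only RLE lists after the first scan. It is an illustrative reconstruction under assumptions: candidates come from the standard join-and-prune step, so fewer candidate 3-itemsets are evaluated than Figure 3-12 lists, and the RLE list of a k-itemset is the AND of the lists of two of its (k-1)-subsets.

```python
from itertools import combinations


def rle_encode(tids):
    runs = []
    for tid in sorted(tids):
        if runs and tid == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((tid, 1))
    return runs


def and_op(dest, src):
    """Intersect two RLE lists; return (support count, resulting RLE list)."""
    result, count, d, s = [], 0, 0, 0
    while d < len(dest) and s < len(src):
        d_right = dest[d][0] + dest[d][1] - 1
        s_right = src[s][0] + src[s][1] - 1
        left, right = max(dest[d][0], src[s][0]), min(d_right, s_right)
        if left <= right:
            result.append((left, right - left + 1))
            count += right - left + 1
        if d_right >= s_right:
            s += 1
        else:
            d += 1
    return count, result


def mine(transactions, min_sup):
    """Return {itemset: support count}; the database is scanned once."""
    tid_lists = {}
    for tid, items in enumerate(transactions, start=1):
        for item in items:
            tid_lists.setdefault(item, []).append(tid)
    level = {frozenset([i]): rle_encode(t)
             for i, t in tid_lists.items() if len(t) >= min_sup}
    frequent, k = {}, 1
    while level:
        for itemset, runs in level.items():
            frequent[itemset] = sum(length for _, length in runs)
        k += 1
        next_level = {}
        for a, b in combinations(list(level), 2):
            cand = a | b
            if len(cand) != k or cand in next_level:
                continue
            if any(frozenset(sub) not in level
                   for sub in combinations(cand, k - 1)):
                continue                      # prune: a subset is not large
            count, runs = and_op(level[a], level[b])
            if count >= min_sup:
                next_level[cand] = runs
        level = next_level
    return frequent


TABLE_3_1 = ["BCD", "AF", "BEFG", "ABE", "AEF",
             "ADEF", "BDE", "BCDG", "AB", "ABEF"]
frequent = mine(TABLE_3_1, 3)
by_name = {"".join(sorted(s)): c for s, c in frequent.items()}
```

The resulting large itemsets and support counts agree with L1, L2 and L3 of the example.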
RLE
4.1
4.2
4.3
4.4
4.1
Development tool: Microsoft Visual C# 2005
Hardware: CPU Pentium 4G, 1024MB memory, 80G IDE hard disk
OS: Windows XP
RLE
IBM Almaden Research Center[4]
Apriori[8]
4-1
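Table 4-1: parameters of the synthetic data generator
Parameter    Symbol    Meaning                           T10I4N1KD100K
Ntrans       D         number of transactions            100K
Tlen         T         average transaction length        10
Patlen       I         average length of the patterns    4
nitems       N         number of items                   1000
Data sets are named Ta.Ib.Nc.Dd after these four values.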
4.2
RLE
RLE List Set
4-2
4-2
Apriori
FP-tree
DLG
RLE RLE List Set
4.3
FP Growth
4-3
1000 10 4
13 5 10
Table 4-3
Data set          D         T     I    N
T10I4N1KD10K      10000     10    4    1000
T10I4N1KD30K      30000     10    4    1000
T10I4N1KD50K      50000     10    4    1000
T10I4N1KD100K     100000    10    4    1000
[Figure 4-1: RLE vs. FP Growth, 10000 transactions (T10I4N1KD10K)]
Min. support    10%     5%      3%      2%      1%       0.9%    0.7%
FP              0.32    0.56    0.92    1.46    98.86    123     150
RLE             5.46    0.05    0.02    0.04    0.01     0.01    0.06
[Figure 4-2: RLE vs. FP Growth, 30000 transactions (T10I4N1KD30K)]
Min. support    3.3%     1.6%    1%       0.6%     0.3%     0.16%
FP              0.82     2.17    99.41    83.22    83.88    151.03
RLE             16.06    0.01    0.015    0.031    0.093    0.25
[Figure 4-3: RLE vs. FP Growth, 50000 transactions (T10I4N1KD50K)]
Min. support    2%       1%        0.6%     0.4%     0.2%     0.1%
FP              1.47     100.48    88.27    92.97    87.55    91.48
RLE             26.95    0.031     0.046    0.125    0.39     1.937
[Figure 4-4: RLE vs. FP Growth, 100000 transactions (T10I4N1KD100K)]
Min. support    1%       0.5%     0.3%     0.2%     0.1%     0.09%
FP              92.63    88.81    87.77    92.42    96.02    160.84
RLE             54.39    0.171    0.953    1.515    4.015    3.625
RLE
FP Growth 1% 0
FP Growth
RLE
4-4
10000 10
4 1005001000
Table 4-4
Data set          D        T     I    N
T10I4N100D10K     10000    10    4    100
T10I4N500D10K     10000    10    4    500
T10I4N1KD10K      10000    10    4    1000
[Figure 4-5: RLE vs. FP Growth, 100 items (T10I4N100D10K)]
Min. support    10%      5%       3%       2%       1%       0.9%
FP              0.31     0.42     0.7      2.06     91.64    80.31
RLE             2.218    0.421    0.046    0.093    0.203    0.218
[Figure 4-6: RLE vs. FP Growth, 500 items (T10I4N500D10K)]
Min. support    10%      5%      3%       2%       1%       0.9%
FP              0.27     0.31    0.64     1.2      90.72    100.05
RLE             5.046    0.02    0.013    0.015    0.011    0.015
[Figure 4-7: RLE vs. FP Growth, 1000 items (T10I4N1KD10K)]
Min. support    10%      5%       3%       2%       1%       0.9%
FP              0.32     0.56     0.92     1.46     98.86    123
RLE             5.468    0.011    0.021    0.015    0.017    0.015
RLE
FP Growth
4-5
1000 1
4 3510
Table 4-5
Data set         D        T     I    N
T3I4N1KD10K      10000    3     4    1000
T5I4N1KD10K      10000    5     4    1000
T10I4N1KD10K     10000    10    4    1000
[Figure 4-8: RLE vs. FP Growth, average transaction length 3 (T3I4N1KD10K)]
Min. support    10%     5%       3%       2%       1%       0.9%
FP              0.18    0.12     0.24     0.41     3.59     3.47
RLE             2.75    0.015    0.016    0.015    0.013    0.011
[Figure 4-9: RLE vs. FP Growth, average transaction length 5 (T5I4N1KD10K)]
Min. support    10%      5%       3%       2%       1%       0.9%
FP              0.08     0.16     0.3      0.57     11.42    10.83
RLE             3.234    0.011    0.015    0.018    0.012    0.013
[Figure 4-10: RLE vs. FP Growth, average transaction length 10 (T10I4N1KD10K)]
Min. support    10%      5%       3%       2%       1%       0.9%
FP              0.32     0.56     0.92     1.46     98.86    123
RLE             5.468    0.015    0.011    0.017    0.02     0.013
RLE
4.4
Bit-Vector RLE
Bit-Vector
Bit-Vector
RLE
Bit-Vector
RLE
1
10 4-6 Bit-Vector RLE
Table 4-6
Data set          D         T     I    N
T10I4N1KD10K      10000     10    4    1000
T10I4N1KD30K      30000     10    4    1000
T10I4N1KD50K      50000     10    4    1000
T10I4N1KD100K     100000    10    4    1000
4-11 RLE Bit-Vector
4-1 RLE Bit-Vector
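RLE compression ratio, where B = storage size of the Bit-Vector and
R = storage size of the RLE:
$$\text{Compression ratio} = \frac{B - R}{B} \times 100\% \qquad (4\text{-}1)$$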
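Equation (4-1) can be evaluated with a one-line helper. The sketch below assumes B and R are given in the same unit; the sizes in the example are hypothetical and are not measurements from this study.

```python
def compression_ratio(bit_vector_size, rle_size):
    """Percentage of Bit-Vector storage saved by RLE: (B - R) / B * 100."""
    return (bit_vector_size - rle_size) / bit_vector_size * 100.0


# Hypothetical sizes in KB.
ratio = compression_ratio(1000, 125)
```

A ratio near 100 % means the RLE lists need only a small fraction of the Bit-Vector storage, while a negative value would mean that RLE is larger than the Bit-Vector.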
Table 4-7: RLE compression ratio
T10I4N1KD10K      86.52 %
T10I4N1KD30K      86.74 %
T10I4N1KD50K      87.31 %
T10I4N1KD100K     87.65 %
[Figure 4-11: storage size in KB of RLE and Bit Vector for T10I4N1KD10K, T10I4N1KD30K, T10I4N1KD50K and T10I4N1KD100K]
100
1000 4-8
Bit-Vector RLE
Table 4-8
Data set          D        T     I    N
T10I4N100D10K     10000    10    4    100
T10I4N500D10K     10000    10    4    500
T10I4N1KD10K      10000    10    4    1000
4-12 RLE Bit-Vector
[Figure 4-12: storage size in KB of RLE and Bit Vector for T10I4N100D10K, T10I4N500D10K and T10I4N1KD10K]
Table 4-9: RLE compression ratio
T10I4N100D10K     12.54 %
T10I4N500D10K     77.71 %
T10I4N1KD10K      87.64 %
3
10 4-10
Bit-Vector RLE
Table 4-10
Data set         D        T     I    N
T3I4N1KD10K      10000    3     4    1000
T5I4N1KD10K      10000    5     4    1000
T10I4N1KD10K     10000    10    4    1000
4-13 RLE Bit-Vector
4-11
[Figure 4-13: storage size in KB of RLE and Bit Vector for T3I4N1KD10K, T5I4N1KD10K and T10I4N1KD10K]
Table 4-11: RLE compression ratio
T3I4N1KD10K      94.1 %
T5I4N1KD10K      92.5 %
T10I4N1KD10K     87.64 %
RLE
FP Growth
RLE List Set RLE
RLE List Set Apriori FP
Growth FP Growth
FP-tree FP Growth
tree tree
Run-Length
Encoding(RLE)
RLE List Set
RLE
RLE FP-Growth
[1] S. J. Yen and A. L. P. Chen, "An Efficient Approach to Discovering Knowledge from Large Databases," Proc. IEEE/ACM International Conference on Parallel and Distributed Information Systems (PDIS), (1996).
[2] J. Han, J. Pei and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proceedings of ACM-SIGMOD International Conference on Management of Data, pp. 1-12, May (2000).
[3] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, "H-Mine: Hyper-structure Mining of Frequent Patterns in Large Databases," Proc. The 2001 IEEE International Conference on Data Mining (ICDM01), San Jose, California, November 29-December 2, (2001).
[4] IBM Data Mining Web Site: http://www.almaden.ibm.com/cs/quest/index.html
[5] M. S. Chen, J. Han, and P. S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December (1996).
[6] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," In Proc. 1994 Intl Conf. VLDB, pp. 487-499, Santiago, Chile, Sep. (1994).
[7] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. 21st VLDB, pp. 432-444, (1995).
[8] R. Agrawal, T. Imielinski and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proceedings of SIGMOD, USA, pp. 207-216 (1993).
[9] R. Agrawal, T. Imielinski and A. Swami, "Database Mining: A Performance Perspective," IEEE Transactions on Knowledge and Data Engineering, pp. 914-925 (1993).
[10] R. Agrawal, C. Faloutsos and A. Swami, "Efficient Similarity Search in Sequence Databases," Lecture Notes in Computer Science 730, Springer-Verlag, pp. 69-84 (1993).
[11] ""
(2002)
[12] ""
(2003)