南華大學 - nhuir.nhu.edu.twnhuir.nhu.edu.tw/retrieve/26744/096NHU05396026-001.pdf · 說到關聯規則是這幾年來非常熱門所討論的話題，在各行各業中也

An association rules miner based on run length encoding

97 6

204

VI

(Run-length Encoding RLE)

VII

An association rules miner based on run length encoding

StudentHsieh Tsung Hao AdvisorsDr. Hung-Pin Chiu

Department of Information Management

The M.I.M. Program Nan-Hua University

ABSTRACT

It is very hot topics discussed over these several years to talk about the

related rule, importance and practicability with related and regular attention

slowly too in all trades and professions, but it will be a focal point that

everybody cares about that the treatment performing the algorithm carries out

efficiency.

It stores member's database relevant trade record in the database to

perform algorithms in the related rule in early days, then carry on materials

prospect, because materials punish operation of the hard disk, so in efficiency

is dealt with, it is not all very ideal to carry out the speed, efficiency and

speed that the relative reducing materials prospect.

Question the method can change above-mentioned facing put forward by

us, the method is ' coding method of the continuous length (Run-length

Encoding RLE) ',it is by code skill enter method because mainly whether

among storing device compress by materials, it is prospect not to go on,

needn't be decompressed and encoded to prospect course in it, and then

increase the efficiency prospected!

Key wordsData Mining, Association Rules, Data Compression

VIII

.....................................................II

......................................III

.........................................IV

............................................V

.......................................................VI

..................................................VII

.................................................VIII

.......................................................IX

......................................................X

.....................................................XI

................................................1

.........................................1

...................................2

...................................4

.........................................5

......................................6

Apriori ....................................9

FP-Growth ................................12

DLGDirect Large itemsets Generation...13

...........................................18

RLE ...................19

RLE ........................20

RLE ......................................22

RLE ...............................30

.................................35

..................................35

............................37

............................37

..................46

.....................................52

...................................................53

IX

2-17 2-2DLG 15 2-3DLG L1 Bit-Map16 2-417 3-121 3-221 3-3 RLE31 3-41-32 3-5 2-(C2) 32 3-6 23- RLE33 4-136 4-237 4-338 4-441 4-544 4-647 4-7 RLE 48 4-849 4-9 RLE 49 4-1050 4-11 RLE 51

X

2-1Apriori 10 2-2Apriori 10 2-3DLG 16 3-1RLE 19 3-2 22 3-3RLE 23 3-4RLE A () 24 3-5RLE B () 25 3-6RLE C () 25 3-7RLE D () 26 3-8RLE F () 26 3-9RLE F () 27 3-10 RLE 27 3-11 RLE 28 3-12RLE 34 4-1 1 RLE FP Growth 39 4-2 3 RLE FP Growth 39 4-3 5 RLE FP Growth 40 4-4 10 RLE FP Growth . 40

4-5 100 RLE FP Growth 42



4-83RLEFP Growth44 4-95RLEFP Growth45 4-10 10 RLE FP Growth 45 4-1148 4-1249 4-1351

XI

1.1

(Data Mining)

1

1.2

(Association Rules)

(Classification)(Clustering)(Sequential

Patterns)

(1) (Association Rule)

(Item)

(2) (Classification)

[8]

2

(Decision Tree)

(3) (Clustering)

K-means

agglomeration

(4) (Sequential Patterns)

Agrawal and Srikant

3

1.3

(Association rule)

Apriori 1994 R.Agrawal

Apriori (Large Itemsets)

(Run-length Encoding RLE)

4

1.4

(1)

Run-Length Encoding(RLE List Set)

(2)

Fp-Growth

5

(Association Rule)Agrawal1993

6% 86%

86%

I={I1,I2,I3.,In}

(items)I(Transaction)

D

(Item)

TID 2-1

6

(Itemset) 1-

(1-Itemset) 2-(2-Itemset)

2-1

TID

T10 EFG T20 ABF T30 ABD

(Frequent itemsets)

(Frequent

Itemset or Large Itemset)

(Complete Large Itemsets)

I

Support(I)=

7

I

X Y X Y

X Y

X Y (Confidence)

X Y

XY

Confidence(X Y) X

X Y X Y

X Y

8

2.1 Apriori

Apriori [8] Agrawal 1994

(Frequent Itemsets)

(Association Rules)

Apriori

2-2 Apriori Apriori

1.

2.

- 2-1 Apriori

9

1. L1={large 1-itemsets}

2. for (K = 2 ; LK-1 0 ; K++) do begin

3. CK= apriori-gen (LK-1) 4. for all transactions t D do begin 5. Ct=subset (CK, t) 6. for all candidates c Ct do 7. c.count ++; 8. end

9. LK={c CK| c.count minsup}

10. end

11. Answer = K LK;

2-1Apriori

2-2Apriori

10

Apriori

(1)(Itemset)

2- 1- 1-

k (k-1)+(k-2)++1 2-

k*(k-1)/2 1- 1000 45

2-

(2)

(1)

(support)

Apriori I/O

11

2.2 Frequent-Pattern treeFP-Growth

FP-Growth Han, Pei, and Yin2000

candidate itemsets

Frequent-Pattern tree candidate

itemsets

I/O FP-Growth

() Frequent-Pattern tree

(1) large 1-itemset

(2) large items

(3) FP- tree

() FP-Tree

(1) FP-Tree node conditional pattern base

(2) conditional pattern base conditional FP-Tree

(3) conditional FP-Tree

conditional FP-Tree Frequent Pattern

(4) conditional FP-Tree pattern

12

FP-Growth

FP-Tree

2.3 DLGDirect Large itemsets Generation

DLG[11] Bit-Map

Bit- Map

0 1 01

Bit-Map

AND1

DLG 2 2 Apriori

DLG DLG L

L Directed-Graph{X,Y} L

X Y

13

DLG

i.

Bit- Map Bit-Map

10

nBit-Vector n

Bits

ii. Im Inm < n

AND AND

1

SupportMinsup

L2

iii. L X Y L2

X Y

iv. K-K > 2

Bit-Map C K L

L K-1 K-1{i1 ,i2,,ik } i

14

u 1 2 K-1 K-1 C

{ i1 ,i2,,ik ,u}K 1 2 K-1

AND

K-K-Large

Itemsets

v.

DLG

= 2

2-2DLG

TID Items ()

1

2

3

4

A C D

B C E

A B C E

B E

15

2-3DLGL1Bit-Map

L1 Bit-Vector

A

B

C

E

1010

0111

1110

0111

C2 ={{A B}, {A C}, {A E}, {B C}, {B E}, {C E}}

L2 ={{A C}, {B C}, {B E}, {C E}}

A B

E C

2-3 DLG

C3 ={{A C E}, {B C E}}

L3 ={B C E}

16

DLG

DLG Bit-Vector

DLG

Bit-Vector

2-4

2-4

Apriori

DB

FP Growth

DB

DLG

17

(Run-Length

Encoding)

18

3.1 RLE

Run-Length EncodingRLE Apriori

RLE 3-1RLE

Database

() RLE

Run Length Encoding

RLE

RLE

3-1RLE

19

3-1 RLE

(min sup)L1L1

() RLE Run Length Encoding

Lk-1CkRLE

Lk

3.2 RLE

(Item)I Run Length Encoding (RLE) I t

t,t+1,t+2 . . . ,t+( l -1) I Run l RLE (t,l )

(Item) I RLE List()

I k Run RLE

(t1, l 1 ),(t2, l 2) . . . .(tk, l k) I RLE List [(t1, l1 ),(t2, l2)...(tk, l k)]

3-1 3-2

20

3-1

Tid Item set

T1 B,C,D

T2 A,F

T3 B,E,F,G

T4 A,B,E

T5 A,E,F

T6 A,D,E,F

T7 B,D,E

T8 B,C,D,G

T9 A,B

T10 A,B,E,F

3-2

Run RLE

A [(2,1),(4,3),(9,2)] B [ (1,1),(3,2),(7,4) ] C [ (1,1),(8,1) ] D [ (1,1),(6,3) ] E [ (3,5),(10,1) ] F [ (2,2),(5,2),(10,1) ] G [ (3,1),(8,1) ]

3-2 B

(1,1),(3,2),(7,4)(1,1) B

1 1(3,2) B

2(7,4)

B

4 B Run list

[(1,1),(3,2),(7,4)] Run List

21

3.3 RLE

3-2

L1

Lk-1 Ck

RLE

RLE

RLE getSupport() Lk

3-2

22

3.3.1 RLE

RLE 3-3

RLE

2 3-3

ABC

DEFdest Src ABCDE

Fdest Src AF

RLE

dest

D E B A

F

C

left right

Src

left right

Src

dest

3-3 RLE

23

A Run1 (dest) B Run1 (Src)

dest_right Src_right dest_right >= Src_right

ABC dest_right < Src_right D

EF

dest (5,3)Src (1,1)

dest_right >= Src_rihgt

A = A (1,1) dest

3-4

dest(5,3)

Src=A(1,1)

1 1

5 7

3-4RLE A ()

B = dest (5,3)Src

(4,2)B(4,2) dest 2 dest,B

5 6 dest,B(5,2) 3-5

24

dest(5,3)

Src=B(4,2)

4 6

5 7

3-5RLE B ()

= dest (5,3)Src

(6,1)C(6,1) dest 1 dest,C

6 6 dest,C(6,1) 3-6

dest(5,3)

Src=(6,1)

6 6

5 7

3-6RLE C ()

D =dest (5,3)Src

(6,3)D(6,3) dest 2 dest,D

25

6 7 dest,D(6,2) 3-7

dest(5,3)

Src=D(6,3)

6 8

5 7

3-7RLE D ()

2

E = dest (5,3)Src

(8,2) 3-8

dest(5,3)

Src=E(8,2)

8 9

5 7

3-8RLE F ()

dest_right < Src_rihgt

F =dest (5,3)Src (4,5)

26

F(4,5) dest 3 dest,F 5

7 dest,F(5,3) 3-9

Dest (5,3)

Src=F (4,5)

4 8

5 7

3-9RLE F ()

3

3.3.2 RLE

RLE 3-10

3-10 RLE

1. L1=find_frequent_1-itemsets(D); 2. for(K=2 ; LK-1 0 ; K++) do begin 3. { 4. LK=; 5. CK-itemset = apriori-gen(LK-1) // Ck 6. for all candidates c in CK-itemset do ; 7. { 8. count = getSupport(CK-itemset , min_sup) 9. if(count >= min_sup) LK=LK {C}; 10. } 11. } 12. return L = K LK;

3-10 RLE

27

RLE

Lk-1 Ck

CkRLE ListRLE

And_op()

getSupport() K

Int getSupport(K-itemset S,min_sup) { resRLE RLE(S(1)) for(I=2 to k) { (Count,resRLE) And_OP(resRLE,RLE(S(I))) If(Count

If(SrcRight

Public void nextSRun() { Sindex++; SrcRight=s[Sindex].base+s[Sindex].length-1; SrcLeft=s[Sindex].base; } Public void nextDRun() { Sindex++; DestRight=d[Dindex].base +d[Dindex].length-1; DestLeft=d[Dindex].base;

}

3-11 RLE

3.4 RLE

RLE

L1-itemset RLE

L1 C2

C2 RLE List Set RLE And_op()

L2 RLE List Set

3.4.1 RLE

k- RLE

(K+1)- RLE RLE

3-1 10

30

3-3 RLE 3-4

RLE list Set

Support =3

3-3 RLE

itemset RLE List Set supcount

A [(2,1),(4,3),(9,2)] 6

B [ (1,1),(3,2),(7,4) ] 7

C [ (1,1),(8,1) ] 2

D [ (1,1),(6,3) ] 4

E [ (3,5),(10,1) ] 6

F [ (2,2),(5,2),(10,1) ] 5

G [ (3,1),(8,1) ] 2

RLE List RUN

1- 3-3 CG Support

items L1 3-4

1- Apriori

31

3-41-

itemset RLE List Set supcount

A [(2,1),(4,3),(9,2)] 6 B [ (1,1),(3,2),(7,4) ] 7 D [ (1,1),(6,3) ] 4 E [ (3,5),(10,1) ] 6 F [ (2,2),(5,2),(10,1) ] 5

1- 2-itemset 3-5

3-5 2-(C2)

2-itemset

AB AD AE AF BD BE BF DE DF EF

AB A B RLE A=[(2,1),(4,3),(9,2)]

B=[ (1,1),(3,2),(7,4) ] A B RLE

And_op() AB= (4,1),(9,2) 4

9 2 AB

3 3-6 23- RLE

32

3-6 23- RLE

itemset supcount

AB 3 AE 4 AF 4 BD 3 BE 3 EF 4

itemset supcount

AEF 3

RLE

LRE List Set

33

3.4.2 RLE

= 3

3-12RLE

Itemset Sup. Count A B C D E F G

6 7 2 4 6 5 2

Itemset Sup. Count A B D E F

6 7 4 6 5

C1 1

Itemset Sup. Count AB AD AE AF BD BE BF DE DF EF

3 1 4 4 3 4 2 2 1 4

Itemset Sup. Count AB AE AF BD BE EF

3 4 4 3 4 4

C2 2

Itemset Sup. Count ABE ABF AEF BDE BDF BEF

2 1 3 1 0 2

Itemset Sup. Count AEF

3

C3 3

34

RLE

4.1

4.2

4.3

4.4

4.1

35

Microsoft Visual C# 2005

CPU Pentium 4G1024MB 80G

IDE OS Windows XP

RLE

IBM Almaden Research Center[4]

Apriori[8]

4-1

4-1

Ntrans Tlen Patlen nitems

D T I N Ta.Ib.Nc.Db

T10I4N1KD100K

10

10

4

1000

36

4.2

RLE

RLE List Set

4-2

4-2

Apriori

FP-tree

DLG

RLE RLE List Set

4.3

37

FP Growth

4-3

1000 10 4

13 5 10

4-3

D T I N

10000 10 4 1000

30000 10 4 1000

50000 10 4 1000

100000 10 4 1000

T10I4N1KD10K

T10I4N1KD30K

T10I4N1KD50K

T10I4N1KD100K

38

T10I4N1KD10K

0

20

40

60

80

100

120

140

160

FP

RLE

FP 0.32 0.56 0.92 1.46 98.86 123 150

RLE 5.46 0.05 0.02 0.04 0.01 0.01 0.06

10% 5% 3% 2% 1% 0.9% 0.7%

4-1 10000 RLE FP Growth

T10I4N1KD30K

0

20

40

60

80

100

120

140

160

FP

RLE

FP 0.82 2.17 99.41 83.22 83.88 151.03

RLE 16.06 0.01 0.015 0.031 0.093 0.25

3.3% 1.6% 1% 0.6% 0.3% 0.16%


39

T10I4N1KD50K

0

20

40

60

80

100

120

140

160

FP

RLE

FP 1.47 100.48 88.27 92.97 87.55 91.48

RLE 26.95 0.031 0.046 0.125 0.39 1.937

2% 1% 0.6% 0.4% 0.2% 0.1%


T10I4N1KD100K

0

20

40

60

80

100

120

140

160

FP

RLE

FP 92.63 88.81 87.77 92.42 96.02 160.84

RLE 54.39 0.171 0.953 1.515 4.015 3.625

1% 0.5% 0.3% 0.2% 0.1% 0.09%


40

RLE

FP Growth 1% 0

FP Growth

RLE

4-4

10000 10

4 1005001000

4-4

D T I N

10000 10 4 100

10000 10 4 500

10000 10 4 1000

T10I4N100D10K

T10I4N500D10K

T10I4N1KD10K

41

T10I4N100D10K

0102030405060708090

FP

RLE

FP 0.31 0.42 0.7 2.06 91.64 80.31

RLE 2.218 0.421 0.046 0.093 0.203 0.218

10% 5% 3% 2% 1% 0.9%


T10I4N500D10K

0102030405060708090

100

FP

RLE

FP 0.27 0.31 0.64 1.2 90.72 100.05

RLE 5.046 0.02 0.013 0.015 0.011 0.015

10% 5% 3% 2% 1% 0.9%


42

T10I4N1KD10K

0102030405060708090

100110120130

FP

RLE

FP 0.32 0.56 0.92 1.46 98.86 123

RLE 5.468 0.011 0.021 0.015 0.017 0.015

10% 5% 3% 2% 1% 0.9%


RLE

FP Growth

4-5

1000 1

4 3510

43

4-5

D T I N

10000 3 4 1000

10000 5 4 1000 10000 10 4 1000

T3I4N1KD10K

T5I4N1KD10K

T10I4N1KD10K

T3I4N1KD10K

0

5

FP

RLE

FP 0.18 0.12 0.24 0.41 3.59 3.47

RLE 2.75 0.015 0.016 0.015 0.013 0.011

10% 5% 3% 2% 1% 0.9%

4-8 3 RLE FP Growth

44

T5I4N1KD10K

0

5

10

15

FP

RLE

FP 0.08 0.16 0.3 0.57 11.42 10.83

RLE 3.234 0.011 0.015 0.018 0.012 0.013

10% 5% 3% 2% 1% 0.9%

4-9 5 RLE FP Growth

T10I4N1KD10K

0

20

40

60

80

100

120

FP

RLE

FP 0.32 0.56 0.92 1.46 98.86 123

RLE 5.468 0.015 0.011 0.017 0.02 0.013

10% 5% 3% 2% 1% 0.9%


45

RLE

4.4

Bit-Vector RLE

Bit-Vector

Bit-Vector

RLE

Bit-Vector

RLE

46

1

10 4-6 Bit-Vector RLE

4-6

D T I N

10000 10 4 1000

30000 10 4 1000

50000 10 4 1000

100000 10 4 1000

T10I4N1KD10K

T10I4N1KD30K

T10I4N1KD50K

T10I4N1KD100K

4-11 RLE Bit-Vector

4-1 RLE Bit-Vector

RLE

B=Bit-Vector

R=RLE

B - R B

( 4-1 ) =

47

4-7 RLE

T10I4N1KD10K 86.52 %

T10I4N1KD30K 86.74 %

T10I4N1KD50K 87.31 % T10I4N1KD100K 87.65 %

0

2000000

4000000

6000000

8000000

10000000 12000000 14000000

T10I4N1KD10K T10I4N1KD30K T10I4N1KD50K T10I4N1KD100K

KB

RLE

Bit Vector

4-11

100

1000 4-8

Bit-Vector RLE

48

4-8

D T I N

10000 10 4 100

10000 10 4 500

10000 10 4 1000

T10I4N100D10K

T10I4N500D10K

T10I4N1KD10K

4-12 RLE Bit-Vector

0

200000

400000

600000

800000

1000000 1200000 1400000

T10I4N100D10K T10I4N500D10K T10I4N1KD10K

KB

RLE

Bit Vector

4-12

4-9 RLE

T10I4N100D10K 12.54 % T10I4N500D10K 77.71 % T10I4N1KD10K 87.64 %

49

3

10 4-10

Bit-Vector RLE

4-10

D T I N

10000 3 4 1000

10000 5 4 1000 10000 10 4 1000

T3I4N1KD10K

T5I4N1KD10K

T10I4N1KD10K

4-13 RLE Bit-Vector

4-11

50

0

200000

400000

600000

800000

1000000 1200000 1400000

T3I4N1KD10K T5I4N1KD10K T10I4N1KD10K

K

B

RLE

Bit Vector

4-13

4-11 RLE

T3I4N1KD10K 94.1 % T5I4N1KD10K 92.5 % T10I4N1KD10K 87.64 %

RLE

FP Growth

RLE List Set RLE

RLE List Set Apriori FP

Growth FP Growth

FP-tree FP Growth

tree tree

51

Run-Length

Encoding(RLE)

RLE List Set

RLE

RLE FP-Growth

52

[1] S. J. Yen and A. L. P. Chen. "An Efficient Approach to

Discovering Knowledge from Large Databases. " Proc. IEEE/ACM

International Conference on Parallel and Distributed Information

Systems (PDIS), (1996).

[2] Han, J., Pei J. and Yin Y., "Mining Frequent Patterns without

Candidate Generation " Proceedings of ACM-SIGMOD Interna-tional

Conference on Management of Data, pages 1-12, May (2000).

[3] J. Pei, J. Han. H. Lu, S. Nishio, S. Tang, and D. Yang. "H-Mine:

Hyper-structure Mining of Frequent Patterns in Large

Databases " Proc. The 2001 IEEE International Confer-ence On

Data Mining (ICDM01), San Jose, California, November

29-December 2, (2001).

[4] IBM Data Mining Web Site:

http://www.almaden.ibm.com/cs/quest/index.html

[5] M. S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from

a database Perspective", IEEE Transaction on Knowledge and

Data Engineering, Vol. 8,No. 6,December (1996).

53

[6] R. Agrawal and R. Srikant, "Fast Algorithm for Mining

Assoication Rule in Large Databases", In Proc. 1994 Intl Conf.

VLDB, pp. 487-499, Santiago, Chile, Sep. (1994).

[7] Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining

Association Rule in Large Database", Proc. 21th VLDB,pp. 432-444, (1995).

[8] R. Agrawal, T. Imielinski, & A. Swami, "Mining Association RuSets

of Items in Large Database, " Proceedings of SIGMOD, USA, pp. 207-216

(1993).

[9] R. Agrawal, T. Imielinski and A. Swami, "Database Mining: A

Perspective, " IEEE Transactions on Knowledge and Data

Engi914-925 (1993).

[10] R. Agrawal, C. Faloutsos and A. Swami, "Efficient

SimilaritSequence Databases, " Lecture Notes in Computer

Science 73Verlag, pp. 69-84 (1993).

[11] ""

(2002)

[12] ""

(2003)

54
----------------------------------1.doc--------------------------------------------2.docnew_.doc

Documents

南 華 大 學 - nhuir.nhu.edu.twnhuir.nhu.edu.tw/retrieve/26744/096NHU05396026-001.pdf · 說到關聯規則是這幾年來非常熱門所討論的話題，在各行各業中也

南華大學 - nhuir.nhu.edu.twnhuir.nhu.edu.tw/retrieve/26744/096NHU05396026-001.pdf · 說到關聯規則是這幾年來非常熱門所討論的話題，在各行各業中也