62
資訊管理學系 碩士論文 基於連續長度編碼之關聯法則挖掘法 An association rules miner based on run length encoding 生 :謝宗豪 指 導 教 授:邱宏彬 博士 中華民國 97 年 6 月

南 華 大 學 - nhuir.nhu.edu.twnhuir.nhu.edu.tw/retrieve/26744/096NHU05396026-001.pdf · 說到關聯規則是這幾年來非常熱門所討論的話題,在各行各業中也

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

  • An association rules miner based on run length encoding

    97 6

  • V

  • 204

    VI

  • (Run-length Encoding RLE)

    VII

  • An association rules miner based on run length encoding

    StudentHsieh Tsung Hao AdvisorsDr. Hung-Pin Chiu

    Department of Information Management

    The M.I.M. Program Nan-Hua University

    ABSTRACT

    It is very hot topics discussed over these several years to talk about the

    related rule, importance and practicability with related and regular attention

    slowly too in all trades and professions, but it will be a focal point that

    everybody cares about that the treatment performing the algorithm carries out

    efficiency.

    It stores member's database relevant trade record in the database to

    perform algorithms in the related rule in early days, then carry on materials

    prospect, because materials punish operation of the hard disk, so in efficiency

    is dealt with, it is not all very ideal to carry out the speed, efficiency and

    speed that the relative reducing materials prospect.

    Question the method can change above-mentioned facing put forward by

    us, the method is ' coding method of the continuous length (Run-length

    Encoding RLE) ',it is by code skill enter method because mainly whether

    among storing device compress by materials, it is prospect not to go on,

    needn't be decompressed and encoded to prospect course in it, and then

    increase the efficiency prospected!

    Key wordsData Mining, Association Rules, Data Compression

    VIII

  • .....................................................II

    ......................................III

    .........................................IV

    ............................................V

    .......................................................VI

    ..................................................VII

    .................................................VIII

    .......................................................IX

    ......................................................X

    .....................................................XI

    ................................................1

    .........................................1

    ...................................2

    ...................................4

    .........................................5

    ......................................6

    Apriori ....................................9

    FP-Growth ................................12

    DLGDirect Large itemsets Generation...13

    ...........................................18

    RLE ...................19

    RLE ........................20

    RLE ......................................22

    RLE ...............................30

    .................................35

    ..................................35

    ............................37

    ............................37

    ..................46

    .....................................52

    ...................................................53

    IX

  • 2-17 2-2DLG 15 2-3DLG L1 Bit-Map16 2-417 3-121 3-221 3-3 RLE31 3-41-32 3-5 2-(C2) 32 3-6 23- RLE33 4-136 4-237 4-338 4-441 4-544 4-647 4-7 RLE 48 4-849 4-9 RLE 49 4-1050 4-11 RLE 51

    X

  • 2-1Apriori 10 2-2Apriori 10 2-3DLG 16 3-1RLE 19 3-2 22 3-3RLE 23 3-4RLE A () 24 3-5RLE B () 25 3-6RLE C () 25 3-7RLE D () 26 3-8RLE F () 26 3-9RLE F () 27 3-10 RLE 27 3-11 RLE 28 3-12RLE 34 4-1 1 RLE FP Growth 39 4-2 3 RLE FP Growth 39 4-3 5 RLE FP Growth 40 4-4 10 RLE FP Growth . 40

    4-5 100 RLE FP Growth 42

    4-6 500 RLE FP Growth 42

    4-7 1000 RLE FP Growth 43

    4-83RLEFP Growth44 4-95RLEFP Growth45 4-10 10 RLE FP Growth 45 4-1148 4-1249 4-1351

    XI

  • 1.1

    (Data Mining)

    1

  • 1.2

    (Association Rules)

    (Classification)(Clustering)(Sequential

    Patterns)

    (1) (Association Rule)

    (Item)

    (2) (Classification)

    [8]

    2

  • (Decision Tree)

    (3) (Clustering)

    K-means

    agglomeration

    (4) (Sequential Patterns)

    Agrawal and Srikant

    3

  • 1.3

    (Association rule)

    Apriori 1994 R.Agrawal

    Apriori (Large Itemsets)

    (Run-length Encoding RLE)

    4

  • 1.4

    (1)

    Run-Length Encoding(RLE List Set)

    (2)

    Fp-Growth

    5

  • (Association Rule)Agrawal1993

    6% 86%

    86%

    I={I1,I2,I3.,In}

    (items)I(Transaction)

    D

    (Item)

    TID 2-1

    6

  • (Itemset) 1-

    (1-Itemset) 2-(2-Itemset)

    2-1

    TID

    T10 EFG T20 ABF T30 ABD

    (Frequent itemsets)

    (Frequent

    Itemset or Large Itemset)

    (Complete Large Itemsets)

    I

    Support(I)=

    7

  • I

    X Y X Y

    X Y

    X Y (Confidence)

    X Y

    XY

    Confidence(X Y) X

    X Y X Y

    X Y

    8

  • 2.1 Apriori

    Apriori [8] Agrawal 1994

    (Frequent Itemsets)

    (Association Rules)

    Apriori

    2-2 Apriori Apriori

    1.

    2.

    - 2-1 Apriori

    9

  • 1. L1={large 1-itemsets}

    2. for (K = 2 ; LK-1 0 ; K++) do begin

    3. CK= apriori-gen (LK-1) 4. for all transactions t D do begin 5. Ct=subset (CK, t) 6. for all candidates c Ct do 7. c.count ++; 8. end

    9. LK={c CK| c.count minsup}

    10. end

    11. Answer = K LK;

    2-1Apriori

    2-2Apriori

    10

  • Apriori

    (1)(Itemset)

    2- 1- 1-

    k (k-1)+(k-2)++1 2-

    k*(k-1)/2 1- 1000 45

    2-

    (2)

    (1)

    (support)

    Apriori I/O

    11

  • 2.2 Frequent-Pattern treeFP-Growth

    FP-Growth Han, Pei, and Yin2000

    candidate itemsets

    Frequent-Pattern tree candidate

    itemsets

    I/O FP-Growth

    () Frequent-Pattern tree

    (1) large 1-itemset

    (2) large items

    (3) FP- tree

    () FP-Tree

    (1) FP-Tree node conditional pattern base

    (2) conditional pattern base conditional FP-Tree

    (3) conditional FP-Tree

    conditional FP-Tree Frequent Pattern

    (4) conditional FP-Tree pattern

    12

  • FP-Growth

    FP-Tree

    2.3 DLGDirect Large itemsets Generation

    DLG[11] Bit-Map

    Bit- Map

    0 1 01

    Bit-Map

    AND1

    DLG 2 2 Apriori

    DLG DLG L

    L Directed-Graph{X,Y} L

    X Y

    13

  • DLG

    i.

    Bit- Map Bit-Map

    10

    nBit-Vector n

    Bits

    ii. Im Inm < n

    AND AND

    1

    SupportMinsup

    L2

    iii. L X Y L2

    X Y

    iv. K-K > 2

    Bit-Map C K L

    L K-1 K-1{i1 ,i2,,ik } i

    14

  • u 1 2 K-1 K-1 C

    { i1 ,i2,,ik ,u}K 1 2 K-1

    AND

    K-K-Large

    Itemsets

    v.

    DLG

    = 2

    2-2DLG

    TID Items ()

    1

    2

    3

    4

    A C D

    B C E

    A B C E

    B E

    15

  • 2-3DLGL1Bit-Map

    L1 Bit-Vector

    A

    B

    C

    E

    1010

    0111

    1110

    0111

    C2 ={{A B}, {A C}, {A E}, {B C}, {B E}, {C E}}

    L2 ={{A C}, {B C}, {B E}, {C E}}

    A B

    E C

    2-3 DLG

    C3 ={{A C E}, {B C E}}

    L3 ={B C E}

    16

  • DLG

    DLG Bit-Vector

    DLG

    Bit-Vector

    2-4

    2-4

    Apriori

    DB

    FP Growth

    DB

    DLG

    17

  • (Run-Length

    Encoding)

    18

  • 3.1 RLE

    Run-Length EncodingRLE Apriori

    RLE 3-1RLE

    Database

    () RLE

    Run Length Encoding

    RLE

    RLE

    3-1RLE

    19

  • 3-1 RLE

    (min sup)L1L1

    () RLE Run Length Encoding

    Lk-1CkRLE

    Lk

    3.2 RLE

    (Item)I Run Length Encoding (RLE) I t

    t,t+1,t+2 . . . ,t+( l -1) I Run l RLE (t,l )

    (Item) I RLE List()

    I k Run RLE

    (t1, l 1 ),(t2, l 2) . . . .(tk, l k) I RLE List [(t1, l1 ),(t2, l2)...(tk, l k)]

    3-1 3-2

    20

  • 3-1

    Tid Item set

    T1 B,C,D

    T2 A,F

    T3 B,E,F,G

    T4 A,B,E

    T5 A,E,F

    T6 A,D,E,F

    T7 B,D,E

    T8 B,C,D,G

    T9 A,B

    T10 A,B,E,F

    3-2

    Run RLE

    A [(2,1),(4,3),(9,2)] B [ (1,1),(3,2),(7,4) ] C [ (1,1),(8,1) ] D [ (1,1),(6,3) ] E [ (3,5),(10,1) ] F [ (2,2),(5,2),(10,1) ] G [ (3,1),(8,1) ]

    3-2 B

    (1,1),(3,2),(7,4)(1,1) B

    1 1(3,2) B

    2(7,4)

    B

    4 B Run list

    [(1,1),(3,2),(7,4)] Run List

    21

  • 3.3 RLE

    3-2

    L1

    Lk-1 Ck

    RLE

    RLE

    RLE getSupport() Lk

    3-2

    22

  • 3.3.1 RLE

    RLE 3-3

    RLE

    2 3-3

    ABC

    DEFdest Src ABCDE

    Fdest Src AF

    RLE

    dest

    D E B A

    F

    C

    left right

    Src

    left right

    Src

    dest

    3-3 RLE

    23

  • A Run1 (dest) B Run1 (Src)

    dest_right Src_right dest_right >= Src_right

    ABC dest_right < Src_right D

    EF

    dest (5,3)Src (1,1)

    dest_right >= Src_rihgt

    A = A (1,1) dest

    3-4

    dest(5,3)

    Src=A(1,1)

    1 1

    5 7

    3-4RLE A ()

    B = dest (5,3)Src

    (4,2)B(4,2) dest 2 dest,B

    5 6 dest,B(5,2) 3-5

    24

  • dest(5,3)

    Src=B(4,2)

    4 6

    5 7

    3-5RLE B ()

    = dest (5,3)Src

    (6,1)C(6,1) dest 1 dest,C

    6 6 dest,C(6,1) 3-6

    dest(5,3)

    Src=(6,1)

    6 6

    5 7

    3-6RLE C ()

    D =dest (5,3)Src

    (6,3)D(6,3) dest 2 dest,D

    25

  • 6 7 dest,D(6,2) 3-7

    dest(5,3)

    Src=D(6,3)

    6 8

    5 7

    3-7RLE D ()

    2

    E = dest (5,3)Src

    (8,2) 3-8

    dest(5,3)

    Src=E(8,2)

    8 9

    5 7

    3-8RLE F ()

    dest_right < Src_rihgt

    F =dest (5,3)Src (4,5)

    26

  • F(4,5) dest 3 dest,F 5

    7 dest,F(5,3) 3-9

    Dest (5,3)

    Src=F (4,5)

    4 8

    5 7

    3-9RLE F ()

    3

    3.3.2 RLE

    RLE 3-10

    3-10 RLE

    1. L1=find_frequent_1-itemsets(D); 2. for(K=2 ; LK-1 0 ; K++) do begin 3. { 4. LK=; 5. CK-itemset = apriori-gen(LK-1) // Ck 6. for all candidates c in CK-itemset do ; 7. { 8. count = getSupport(CK-itemset , min_sup) 9. if(count >= min_sup) LK=LK {C}; 10. } 11. } 12. return L = K LK;

    3-10 RLE

    27

  • RLE

    Lk-1 Ck

    CkRLE ListRLE

    And_op()

    getSupport() K

    Int getSupport(K-itemset S,min_sup) { resRLE RLE(S(1)) for(I=2 to k) { (Count,resRLE) And_OP(resRLE,RLE(S(I))) If(Count

  • If(SrcRight

  • Public void nextSRun() { Sindex++; SrcRight=s[Sindex].base+s[Sindex].length-1; SrcLeft=s[Sindex].base; } Public void nextDRun() { Sindex++; DestRight=d[Dindex].base +d[Dindex].length-1; DestLeft=d[Dindex].base;

    }

    3-11 RLE

    3.4 RLE

    RLE

    L1-itemset RLE

    L1 C2

    C2 RLE List Set RLE And_op()

    L2 RLE List Set

    3.4.1 RLE

    k- RLE

    (K+1)- RLE RLE

    3-1 10

    30

  • 3-3 RLE 3-4

    RLE list Set

    Support =3

    3-3 RLE

    itemset RLE List Set supcount

    A [(2,1),(4,3),(9,2)] 6

    B [ (1,1),(3,2),(7,4) ] 7

    C [ (1,1),(8,1) ] 2

    D [ (1,1),(6,3) ] 4

    E [ (3,5),(10,1) ] 6

    F [ (2,2),(5,2),(10,1) ] 5

    G [ (3,1),(8,1) ] 2

    RLE List RUN

    1- 3-3 CG Support

    items L1 3-4

    1- Apriori

    31

  • 3-41-

    itemset RLE List Set supcount

    A [(2,1),(4,3),(9,2)] 6 B [ (1,1),(3,2),(7,4) ] 7 D [ (1,1),(6,3) ] 4 E [ (3,5),(10,1) ] 6 F [ (2,2),(5,2),(10,1) ] 5

    1- 2-itemset 3-5

    3-5 2-(C2)

    2-itemset

    AB AD AE AF BD BE BF DE DF EF

    AB A B RLE A=[(2,1),(4,3),(9,2)]

    B=[ (1,1),(3,2),(7,4) ] A B RLE

    And_op() AB= (4,1),(9,2) 4

    9 2 AB

    3 3-6 23- RLE

    32

  • 3-6 23- RLE

    itemset supcount

    AB 3 AE 4 AF 4 BD 3 BE 3 EF 4

    itemset supcount

    AEF 3

    RLE

    LRE List Set

    33

  • 3.4.2 RLE

    = 3

    3-12RLE

    Itemset Sup. Count A B C D E F G

    6 7 2 4 6 5 2

    Itemset Sup. Count A B D E F

    6 7 4 6 5

    C1 1

    Itemset Sup. Count AB AD AE AF BD BE BF DE DF EF

    3 1 4 4 3 4 2 2 1 4

    Itemset Sup. Count AB AE AF BD BE EF

    3 4 4 3 4 4

    C2 2

    Itemset Sup. Count ABE ABF AEF BDE BDF BEF

    2 1 3 1 0 2

    Itemset Sup. Count AEF

    3

    C3 3

    34

  • RLE

    4.1

    4.2

    4.3

    4.4

    4.1

    35

  • Microsoft Visual C# 2005

    CPU Pentium 4G1024MB 80G

    IDE OS Windows XP

    RLE

    IBM Almaden Research Center[4]

    Apriori[8]

    4-1

    4-1

    Ntrans Tlen Patlen nitems

    D T I N Ta.Ib.Nc.Db

    T10I4N1KD100K

    10

    10

    4

    1000

    36

  • 4.2

    RLE

    RLE List Set

    4-2

    4-2

    Apriori

    FP-tree

    DLG

    RLE RLE List Set

    4.3

    37

  • FP Growth

    4-3

    1000 10 4

    13 5 10

    4-3

    D T I N

    10000 10 4 1000

    30000 10 4 1000

    50000 10 4 1000

    100000 10 4 1000

    T10I4N1KD10K

    T10I4N1KD30K

    T10I4N1KD50K

    T10I4N1KD100K

    38

  • T10I4N1KD10K

    0

    20

    40

    60

    80

    100

    120

    140

    160

    FP

    RLE

    FP 0.32 0.56 0.92 1.46 98.86 123 150

    RLE 5.46 0.05 0.02 0.04 0.01 0.01 0.06

    10% 5% 3% 2% 1% 0.9% 0.7%

    4-1 10000 RLE FP Growth

    T10I4N1KD30K

    0

    20

    40

    60

    80

    100

    120

    140

    160

    FP

    RLE

    FP 0.82 2.17 99.41 83.22 83.88 151.03

    RLE 16.06 0.01 0.015 0.031 0.093 0.25

    3.3% 1.6% 1% 0.6% 0.3% 0.16%

    4-2 30000 RLE FP Growth

    39

  • T10I4N1KD50K

    0

    20

    40

    60

    80

    100

    120

    140

    160

    FP

    RLE

    FP 1.47 100.48 88.27 92.97 87.55 91.48

    RLE 26.95 0.031 0.046 0.125 0.39 1.937

    2% 1% 0.6% 0.4% 0.2% 0.1%

    4-3 50000 RLE FP Growth

    T10I4N1KD100K

    0

    20

    40

    60

    80

    100

    120

    140

    160

    FP

    RLE

    FP 92.63 88.81 87.77 92.42 96.02 160.84

    RLE 54.39 0.171 0.953 1.515 4.015 3.625

    1% 0.5% 0.3% 0.2% 0.1% 0.09%

    4-4 100000 RLE FP Growth

    40

  • RLE

    FP Growth 1% 0

    FP Growth

    RLE

    4-4

    10000 10

    4 1005001000

    4-4

    D T I N

    10000 10 4 100

    10000 10 4 500

    10000 10 4 1000

    T10I4N100D10K

    T10I4N500D10K

    T10I4N1KD10K

    41

  • T10I4N100D10K

    0102030405060708090

    FP

    RLE

    FP 0.31 0.42 0.7 2.06 91.64 80.31

    RLE 2.218 0.421 0.046 0.093 0.203 0.218

    10% 5% 3% 2% 1% 0.9%

    4-5 100 RLE FP Growth

    T10I4N500D10K

    0102030405060708090

    100

    FP

    RLE

    FP 0.27 0.31 0.64 1.2 90.72 100.05

    RLE 5.046 0.02 0.013 0.015 0.011 0.015

    10% 5% 3% 2% 1% 0.9%

    4-6 500 RLE FP Growth

    42

  • T10I4N1KD10K

    0102030405060708090

    100110120130

    FP

    RLE

    FP 0.32 0.56 0.92 1.46 98.86 123

    RLE 5.468 0.011 0.021 0.015 0.017 0.015

    10% 5% 3% 2% 1% 0.9%

    4-7 1000 RLE FP Growth

    RLE

    FP Growth

    4-5

    1000 1

    4 3510

    43

  • 4-5

    D T I N

    10000 3 4 1000

    10000 5 4 1000 10000 10 4 1000

    T3I4N1KD10K

    T5I4N1KD10K

    T10I4N1KD10K

    T3I4N1KD10K

    0

    5

    FP

    RLE

    FP 0.18 0.12 0.24 0.41 3.59 3.47

    RLE 2.75 0.015 0.016 0.015 0.013 0.011

    10% 5% 3% 2% 1% 0.9%

    4-8 3 RLE FP Growth

    44

  • T5I4N1KD10K

    0

    5

    10

    15

    FP

    RLE

    FP 0.08 0.16 0.3 0.57 11.42 10.83

    RLE 3.234 0.011 0.015 0.018 0.012 0.013

    10% 5% 3% 2% 1% 0.9%

    4-9 5 RLE FP Growth

    T10I4N1KD10K

    0

    20

    40

    60

    80

    100

    120

    FP

    RLE

    FP 0.32 0.56 0.92 1.46 98.86 123

    RLE 5.468 0.015 0.011 0.017 0.02 0.013

    10% 5% 3% 2% 1% 0.9%

    4-10 10 RLE FP Growth

    45

  • RLE

    4.4

    Bit-Vector RLE

    Bit-Vector

    Bit-Vector

    RLE

    Bit-Vector

    RLE

    46

  • 1

    10 4-6 Bit-Vector RLE

    4-6

    D T I N

    10000 10 4 1000

    30000 10 4 1000

    50000 10 4 1000

    100000 10 4 1000

    T10I4N1KD10K

    T10I4N1KD30K

    T10I4N1KD50K

    T10I4N1KD100K

    4-11 RLE Bit-Vector

    4-1 RLE Bit-Vector

    RLE

    B=Bit-Vector

    R=RLE

    B - R B

    ( 4-1 ) =

    47

  • 4-7 RLE

    T10I4N1KD10K 86.52 %

    T10I4N1KD30K 86.74 %

    T10I4N1KD50K 87.31 % T10I4N1KD100K 87.65 %

    0

    2000000

    4000000

    6000000

    8000000

    10000000 12000000 14000000

    T10I4N1KD10K T10I4N1KD30K T10I4N1KD50K T10I4N1KD100K

    KB

    RLE

    Bit Vector

    4-11

    100

    1000 4-8

    Bit-Vector RLE

    48

  • 4-8

    D T I N

    10000 10 4 100

    10000 10 4 500

    10000 10 4 1000

    T10I4N100D10K

    T10I4N500D10K

    T10I4N1KD10K

    4-12 RLE Bit-Vector

    0

    200000

    400000

    600000

    800000

    1000000 1200000 1400000

    T10I4N100D10K T10I4N500D10K T10I4N1KD10K

    KB

    RLE

    Bit Vector

    4-12

    4-9 RLE

    T10I4N100D10K 12.54 % T10I4N500D10K 77.71 % T10I4N1KD10K 87.64 %

    49

  • 3

    10 4-10

    Bit-Vector RLE

    4-10

    D T I N

    10000 3 4 1000

    10000 5 4 1000 10000 10 4 1000

    T3I4N1KD10K

    T5I4N1KD10K

    T10I4N1KD10K

    4-13 RLE Bit-Vector

    4-11

    50

  • 0

    200000

    400000

    600000

    800000

    1000000 1200000 1400000

    T3I4N1KD10K T5I4N1KD10K T10I4N1KD10K

    K

    B

    RLE

    Bit Vector

    4-13

    4-11 RLE

    T3I4N1KD10K 94.1 % T5I4N1KD10K 92.5 % T10I4N1KD10K 87.64 %

    RLE

    FP Growth

    RLE List Set RLE

    RLE List Set Apriori FP

    Growth FP Growth

    FP-tree FP Growth

    tree tree

    51

  • Run-Length

    Encoding(RLE)

    RLE List Set

    RLE

    RLE FP-Growth

    52

  • [1] S. J. Yen and A. L. P. Chen. "An Efficient Approach to

    Discovering Knowledge from Large Databases. " Proc. IEEE/ACM

    International Conference on Parallel and Distributed Information

    Systems (PDIS), (1996).

    [2] Han, J., Pei J. and Yin Y., "Mining Frequent Patterns without

    Candidate Generation " Proceedings of ACM-SIGMOD Interna-tional

    Conference on Management of Data, pages 1-12, May (2000).

    [3] J. Pei, J. Han. H. Lu, S. Nishio, S. Tang, and D. Yang. "H-Mine:

    Hyper-structure Mining of Frequent Patterns in Large

    Databases " Proc. The 2001 IEEE International Confer-ence On

    Data Mining (ICDM01), San Jose, California, November

    29-December 2, (2001).

    [4] IBM Data Mining Web Site:

    http://www.almaden.ibm.com/cs/quest/index.html

    [5] M. S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from

    a database Perspective", IEEE Transaction on Knowledge and

    Data Engineering, Vol. 8,No. 6,December (1996).

    53

  • [6] R. Agrawal and R. Srikant, "Fast Algorithm for Mining

    Assoication Rule in Large Databases", In Proc. 1994 Intl Conf.

    VLDB, pp. 487-499, Santiago, Chile, Sep. (1994).

    [7] Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining

    Association Rule in Large Database", Proc. 21th VLDB,pp. 432-444, (1995).

    [8] R. Agrawal, T. Imielinski, & A. Swami, "Mining Association RuSets

    of Items in Large Database, " Proceedings of SIGMOD, USA, pp. 207-216

    (1993).

    [9] R. Agrawal, T. Imielinski and A. Swami, "Database Mining: A

    Perspective, " IEEE Transactions on Knowledge and Data

    Engi914-925 (1993).

    [10] R. Agrawal, C. Faloutsos and A. Swami, "Efficient

    SimilaritSequence Databases, " Lecture Notes in Computer

    Science 73Verlag, pp. 69-84 (1993).

    [11] ""

    (2002)

    [12] ""

    (2003)

    54

    ----------------------------------1.doc--------------------------------------------2.docnew_.doc