Association Rules
Javier Béjar (LSI - Master AI), TAVA, Term 2010/2011

Association rules Introduction

Association rules

A binary database is a database where each row is composed of binary attributes:

      A  B  C  D  ...
T1    1  0  0  1  ...
T2    0  1  1  1  ...
T3    1  0  1  0  ...
T4    0  0  1  0  ...

From this database, frequent patterns of occurrence can be discovered.


This kind of database appears, for example, in transactional data (market basket analysis).

An association rule represents a cooccurrence of events in the database:

TID   Items
1     Bread, Chips, Beer, Yogurt, Eggs
2     Flour, Beer, Eggs, Butter
3     Bread, Ham, Eggs, Butter, Milk
4     Flour, Eggs, Butter, Chocolate
5     Beer, Chips, Bacon, Eggs

{Flour} → {Eggs}
{Beer} → {Chips}
{Bacon} → {Eggs}


    Association rules Definitions

Definitions (I)

We define R as the set of attributes of the database.

We define X as a subset of attributes from R.

We say that X is a pattern of the DB if there is any row where all the attributes of X are 1.

We define the support (frequency) of a pattern X as:

fr(X) = |M(X, r)| / |r|

where |M(X, r)| is the number of rows of r in which X appears and |r| is the size of the DB.

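As an illustration (a minimal sketch, not part of the original slides; the function and variable names are mine), support can be computed directly from a binary database:

```python
def support(X, rows):
    """fr(X): fraction of rows in which every attribute of the pattern X is 1."""
    matches = sum(1 for row in rows if all(row[a] == 1 for a in X))
    return matches / len(rows)

# The binary database from the introduction (T1..T4).
rows = [
    {"A": 1, "B": 0, "C": 0, "D": 1},
    {"A": 0, "B": 1, "C": 1, "D": 1},
    {"A": 1, "B": 0, "C": 1, "D": 0},
    {"A": 0, "B": 0, "C": 1, "D": 0},
]
print(support({"C"}, rows))       # 0.75
print(support({"C", "D"}, rows))  # 0.25
```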


Definitions (II)

Given a minimum support (min_sup ∈ [0, 1]), we say that the pattern X is frequent if: fr(X) ≥ min_sup

F(r, min_sup) is the set of patterns that are frequent in r:

F(r, min_sup) = {X ⊆ R | fr(X) ≥ min_sup}


Definitions (III)

Given a DB with attribute set R, and X and Y attribute subsets with X ∩ Y = ∅, an association rule is the expression:

X → Y

conf(X → Y) is the confidence of an association rule, computed as:

conf(X → Y) = fr(X ∪ Y) / fr(X)

We consider a minimum value of confidence (min_conf ∈ [0, 1]).

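Continuing the sketch above (same caveats; it reuses support and rows from the previous block):

```python
def confidence(X, Y, rows):
    """conf(X -> Y) = fr(X ∪ Y) / fr(X)."""
    return support(X | Y, rows) / support(X, rows)

print(confidence({"C"}, {"D"}, rows))  # fr({C,D}) / fr({C}) = 0.25 / 0.75 ≈ 0.33
```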


    Apriori algorithm Finding association rules

Finding association rules

Given a minimum confidence and a minimum support, an association rule (X → Y) holds in a DB if:

(fr(X ∪ Y) ≥ min_sup) ∧ (conf(X → Y) ≥ min_conf)

Trivial approach:

It is enough to find all the frequent subsets of R. We have to explore, for every X and Y with X ⊆ R and Y ⊂ X, the association rule (X \ Y) → Y.

There are 2^|R| candidates.


Properties of frequent sets

Given X and Y with Y ⊆ X:

Since fr(Y) ≥ fr(X), if X is frequent then Y is also frequent. If any subset Y of X is not frequent, then X is not frequent.

The feasible approach is to find the frequent sets starting from size 1 and increasing the size.


Finding frequent sets

F_l(r, min_sup) is the set of frequent sets from R of size l.

Given the frequent sets of size l, the only candidates of size l + 1 are those sets all of whose subsets of size l are frequent:

C(F_{l+1}(r)) = {X ⊆ R | |X| = l + 1 ∧ ∀Y ⊂ X with |Y| = l, Y ∈ F_l(r)}

The computation of the association rules can be done iteratively, starting with the smallest frequent sets until the complete set is analyzed. A sketch of the candidate-generation step follows.
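A minimal sketch of this candidate-generation step (illustrative names, not the slides' code):

```python
from itertools import combinations

def candidates(freq_l, l):
    """C(F_{l+1}): the sets of size l+1 all of whose l-subsets are in F_l."""
    freq_l = set(freq_l)
    items = sorted(set().union(*freq_l))  # items occurring in some frequent set
    return [frozenset(c) for c in combinations(items, l + 1)
            if all(frozenset(s) in freq_l for s in combinations(c, l))]

F1 = [frozenset({x}) for x in "ABCDEF"]
print(len(candidates(F1, 1)))  # 15 candidate pairs, as in the example below
```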


Apriori algorithm Example: Properties

[Figure, slides 9-13: the lattice of itemsets over {A, B, C, D}, drawn by level from L=1 to L=4 (pairs AB, AC, AD, BC, BD, CD; triples ABC, ABD, ACD, BCD; and ABCD). The slides prune it step by step: once B turns out infrequent at L=1, every superset of B is discarded, leaving AC, AD, CD at L=2 and ACD at L=3; finally ABCD is discarded at L=4.]

    Apriori algorithm Algorithm


Association Rules Algorithm

Algorithm: Apriori(R, min_sup, min_conf)
  C, CCan, CTemp: sets of frequent subsets
  RAs: set of association rules; L: integer
  L = 1
  CCan = Frequent_sets_1(R, min_sup)
  RAs = ∅
  while CCan ≠ ∅ do
    L = L + 1
    CTemp = Candidate_sets(L, R, CCan)        (= C(F_L(r)))
    C = Frequent_sets(CTemp, min_sup)
    RAs = RAs ∪ Confidence_rules(C, min_conf)
    CCan = C
  end
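Putting the pieces together, here is a compact runnable sketch of the whole loop (the structure and names are mine; a sketch under the definitions above, not the slides' implementation):

```python
from itertools import combinations

def apriori(transactions, min_sup, min_conf):
    """Return (frequent itemsets with their support, rules as (X, Y, conf))."""
    n = len(transactions)
    sup = lambda X: sum(1 for t in transactions if X <= t) / n
    # Frequent sets of size 1.
    items = sorted(set().union(*transactions))
    freq = {frozenset({i}) for i in items if sup(frozenset({i})) >= min_sup}
    all_freq, l = {}, 1
    while freq:
        all_freq.update({X: sup(X) for X in freq})
        # Candidates of size l+1 whose l-subsets are all frequent.
        univ = sorted(set().union(*freq))
        cand = {frozenset(c) for c in combinations(univ, l + 1)
                if all(frozenset(s) in freq for s in combinations(c, l))}
        freq = {X for X in cand if sup(X) >= min_sup}
        l += 1
    # Rules (X \ Y) -> Y from every frequent set X; its subsets are frequent too.
    rules = []
    for X in all_freq:
        for r in range(1, len(X)):
            for Y in map(frozenset, combinations(X, r)):
                conf = all_freq[X] / all_freq[X - Y]
                if conf >= min_conf:
                    rules.append((set(X - Y), set(Y), conf))
    return all_freq, rules

# The market-basket transactions from the introduction.
ts = [frozenset(t) for t in (
    {"Bread", "Chips", "Beer", "Yogurt", "Eggs"},
    {"Flour", "Beer", "Eggs", "Butter"},
    {"Bread", "Ham", "Eggs", "Butter", "Milk"},
    {"Flour", "Eggs", "Butter", "Chocolate"},
    {"Beer", "Chips", "Bacon", "Eggs"},
)]
F, R = apriori(ts, min_sup=0.4, min_conf=0.8)
```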


Effect of Support

The specific value used as min_sup affects the computational cost of the search.

If it is too high, only a few patterns will appear (we could miss interesting rare occurrences). If it is too low, the computational cost will be too high (too many associations will appear).

Unfortunately, the threshold value cannot be known beforehand.

Sometimes multiple minimum supports are needed (different for different groups of items).


    Apriori algorithm Measures of interestingness


Measures for association rule interestingness

Sometimes the confidence of a rule does not represent its interestingness well:

        Y    ¬Y
 X      15    5    20
¬X      75    5    80
        90   10   100

conf(X → Y) = 0.75, but conf(¬X → Y) = 0.9375


Other measures can be computed using probabilistic independence:

Lift(X → Y) = P(Y|X) / P(Y)

Interest(X → Y) = P(X, Y) / (P(X) P(Y))

PS(X → Y) = P(X, Y) - P(X) P(Y)

There are plenty of measures in the literature, some good for certain applications but not for others.

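An illustrative sketch of these measures, reusing the support function from the earlier block (note that Lift and Interest coincide numerically; they are two readings of the same ratio):

```python
def rule_measures(X, Y, rows):
    """Lift, Interest and PS for the rule X -> Y, estimated from the data."""
    pX, pY, pXY = support(X, rows), support(Y, rows), support(X | Y, rows)
    lift = (pXY / pX) / pY        # P(Y|X) / P(Y)
    interest = pXY / (pX * pY)    # P(X,Y) / (P(X) P(Y))
    ps = pXY - pX * pY            # P(X,Y) - P(X) P(Y)
    return lift, interest, ps
```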

    Apriori algorithm Example


Example (1)

      A  B  C  D  E  F
 1    1  1  0  0  1  0
 2    1  0  1  0  1  1
 3    0  1  1  1  0  1
 4    1  1  1  0  0  0
 5    0  1  0  0  1  1
 6    1  0  1  1  0  0
 7    1  1  0  0  1  0
 8    1  1  0  0  0  0
 9    1  0  1  1  0  1
10    1  1  1  0  1  1
11    1  1  0  0  1  0
12    1  1  0  1  1  0
13    1  1  1  1  1  1
14    0  1  1  0  0  0
15    1  1  0  1  0  0
16    1  1  0  0  1  0
17    0  1  0  0  1  0
18    0  1  1  0  1  0
19    1  0  0  0  1  0
20    1  1  0  0  1  0

fr([A]) = 15/20
fr([B]) = 16/20
fr([C]) = 9/20
fr([D]) = 6/20
fr([E]) = 13/20
fr([F]) = 6/20


Example (2)

Suppose a minimum support of 25%. All columns have a larger frequency, thus:

F1 = {[A],[B],[C],[D],[E],[F]}

From this we can compute the candidates for frequent sets (all combinations):

C(F1) = {[A,B],[A,C],[A,D],[A,E],[A,F],[B,C],[B,D],[B,E],[B,F],[C,D],[C,E],[C,F],[D,E],[D,F],[E,F]}


[Figure, slide 20: the candidate lattice for the example: the items A-F at level 1 and the 15 candidate pairs at level 2, with the 25% support threshold marked.]


Example (3)

fr([A,B]) = 11/20    fr([B,C]) = 6/20     fr([C,D]) = 4/20    fr([D,E]) = 2/20
fr([A,C]) = 6/20     fr([B,D]) = 4/20     fr([C,E]) = 4/20    fr([D,F]) = 2/20
fr([A,D]) = 4/20     fr([B,E]) = 11/20    fr([C,F]) = 5/20    fr([E,F]) = 4/20
fr([A,E]) = 10/20    fr([B,F]) = 4/20
fr([A,F]) = 4/20

Of the candidates, the ones that have a support of at least 25% are:

F2 = {[A,B],[A,C],[A,E],[B,C],[B,E],[C,F]}

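These counts are easy to check mechanically; a quick sketch using the apriori function from the earlier block (the bit-string encoding of the table is mine):

```python
data = [
    "110010", "101011", "011101", "111000", "010011",
    "101100", "110010", "110000", "101101", "111011",
    "110010", "110110", "111111", "011000", "110100",
    "110010", "010010", "011010", "100010", "110010",
]
ts = [frozenset(a for a, bit in zip("ABCDEF", row) if bit == "1") for row in data]
F, _ = apriori(ts, min_sup=0.25, min_conf=2/3)
print(sorted("".join(sorted(X)) for X in F if len(X) == 2))
# ['AB', 'AC', 'AE', 'BC', 'BE', 'CF'] -- matches F2 above
```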


[Figure, slide 22: the lattice after pruning at 25% support: AB, AC, AE, BC, BE, CF survive at level 2 (AD, AF, BD, BF, CD, CE, DE, DF, EF are discarded), and ABC, ABE are the only level-3 candidates.]


Example (4)

The confidences of the association rules are:

conf(A → B) = 11/15    conf(A → E) = 10/15    conf(B → E) = 11/16
conf(B → A) = 11/16    conf(E → A) = 10/13    conf(E → B) = 11/13
conf(A → C) = 6/15     conf(B → C) = 6/16     conf(C → F) = 5/9
conf(C → A) = 6/9      conf(C → B) = 6/9      conf(F → C) = 5/13

If we choose 2/3 as the confidence threshold, the significant rules are:

{A → B, B → A, C → A, A → E, E → A, C → B, B → E, E → B}


Example (5)

From the set F2 we can compute the set of candidates for F3:

C(F2) = {[A,B,C],[A,B,E]}

fr([A,B,C]) = 3/20     fr([A,B,E]) = 8/20

Of the candidates, the ones that have a support greater than 25% are:

F3 = {[A,B,E]}


[Figure, slide 25: the lattice once level 3 is evaluated: of the candidates ABC and ABE, only ABE is frequent at 25% support.]


Example (6)

The confidences of the corresponding association rules are:

conf(A,B → E) = 8/11
conf(A,E → B) = 8/10
conf(B,E → A) = 8/11

If we choose 2/3 as the confidence threshold, the significant rules are:

{A,B → E; A,E → B; B,E → A}


Example (7)

Now we choose a minimum support of 50%. The only columns with a larger frequency are A, B and E. Therefore:

F1 = {[A],[B],[E]}

From this we can compute the candidate sets:

C(F1) = {[A,B],[A,E],[B,E]}


Example (8)

All have a support of at least 50% (fr([A,B]) = 11/20, fr([A,E]) = 10/20, fr([B,E]) = 11/20), therefore:

F2 = {[A,B],[A,E],[B,E]}

And the candidates are:

C(F2) = {[A,B,E]}

But fr([A,B,E]) = 8/20, below the 50% threshold, thus:

F3 = ∅

FP-Growth Introduction

FP-Growth

Han, Pei, Yin, Mao: Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach (2004)

The problem of the Apriori approach is that the number of candidates to explore in order to find long patterns can be very large. This approach obtains patterns from transaction databases without candidate generation.

It is based on a specialized data structure (FP-tree) that summarizes the frequency of the patterns in the DB.

The patterns are explored incrementally, adding prefixes, without candidate generation.

FP-Growth FP-tree

FP-Tree

The goal of this structure is to avoid querying the DB to compute the frequency of patterns.

It assumes that there is an order among the elements of a transaction. This order allows common prefixes to be obtained from the transactions.

Transactions with common prefixes are merged in the structure, maintaining the frequency of each subprefix.


FP-Tree - Algorithm

1. Compute the frequency of each individual item in the DB
2. Create the tree root (empty prefix)
3. For each transaction of the DB:
   1. Keep only its frequent items
   2. Order the items by their global frequency
   3. Insert the items into the tree one by one:
      If the current node has a child for the item, increase its frequency; otherwise, create a new node

A sketch of this construction follows.
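A minimal sketch of this construction (the node layout and names are assumptions of mine, not the paper's exact structure; ties in the frequency order are broken alphabetically):

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}  # item -> Node

def build_fp_tree(transactions, min_count):
    # 1. Frequency of each individual item.
    freq = {}
    for t in transactions:
        for i in t:
            freq[i] = freq.get(i, 0) + 1
    # 2. Tree root (empty prefix).
    root = Node(None, None)
    # 3. Insert each transaction, filtered and ordered by global frequency.
    for t in transactions:
        items = sorted((i for i in t if freq[i] >= min_count),
                       key=lambda i: (-freq[i], i))
        node = root
        for i in items:
            if i not in node.children:       # otherwise: create a new node
                node.children[i] = Node(i, node)
            node = node.children[i]
            node.count += 1                  # shared prefix: bump the counter
    return root, freq

# The transactions of the example below, with minimum frequency 4.
root, freq = build_fp_tree(["aef", "acbfg", "baed", "deabc", "cdfb", "cfad"], 4)
```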


FP-Tree - Example

DB transactions:

1: a, e, f
2: a, c, b, f, g
3: b, a, e, d
4: d, e, a, b, c
5: c, d, f, b
6: c, f, a, d

Item frequencies: a: 5, b: 4, c: 4, d: 4, e: 3, f: 4, g: 1

DB transactions filtered and reordered (fr = 4):

1: a, f
2: a, c, b, f
3: a, b, d
4: a, b, c, d
5: b, c, d
6: a, c, d, f

FP-Growth FP-tree

FP-Tree - Example (continued)

[Figure, slides 33-38: the FP-tree after each insertion, with a header table listing a: 5, b: 4, c: 4, d: 4, f: 4 and links to the corresponding nodes. Transaction (a, f) creates the path a:1-f:1; (a, b, c, f) raises a to 2 and adds b:1-c:1-f:1 under it; (a, b, d) raises a to 3 and b to 2 and adds d:1; (a, b, c, d) raises a to 4, b to 3, c to 2 and adds d:1; (b, c, d, f) starts a new branch b:1-c:1-d:1-f:1 from the root; (a, c, d, f) raises a to 5 and adds c:1-d:1-f:1 under it.]

FP-Growth Pattern extraction

    Pattern extraction

From this data structure we can obtain all the frequent patterns using the properties of the FP-tree:

For any given frequent item a_i, we can obtain all the frequent patterns that contain the item by following the links of a_i in the tree.

To compute the frequent patterns with a_i as suffix, only the prefix subpaths of the a_i nodes have to be used; the frequency of each subpath corresponds to the frequency of its a_i node.

The frequent patterns can be obtained by a recursive traversal of the paths of the FP-tree and the combination of subpatterns, as sketched below.

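A hedged sketch of this recursive extraction, reusing Node and build_fp_tree from the FP-tree sketch above (it rebuilds a conditional tree from the prefix paths of each item, which follows the idea described here but not the paper's exact bookkeeping; the traversal stands in for the header-table links):

```python
def nodes_for(node, item, acc):
    """Collect every node of `item` (a stand-in for the header-table links)."""
    if node.item == item:
        acc.append(node)
    for child in node.children.values():
        nodes_for(child, item, acc)
    return acc

def fp_growth(transactions, min_count, suffix=frozenset()):
    root, freq = build_fp_tree(transactions, min_count)
    patterns = {}
    for item, f in freq.items():
        if f < min_count:
            continue
        pattern = suffix | {item}
        patterns[pattern] = f
        # Conditional pattern base: the prefix subpath of every `item` node,
        # weighted by the frequency of that node.
        base = []
        for node in nodes_for(root, item, []):
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            base.extend([path] * node.count)
        if any(base):  # recurse on the conditional tree with a longer suffix
            patterns.update(fp_growth(base, min_count, pattern))
    return patterns

print(fp_growth(["aef", "acbfg", "baed", "deabc", "cdfb", "cfad"], 4))
# {a}: 5, {b}: 4, {c}: 4, {d}: 4, {f}: 4 -- no larger pattern reaches frequency 4
```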