Association Rule -data mining


ASSOCIATION RULE MINING

Introduction

As noted earlier, a huge amount of data is stored electronically in many retail outlets due to bar-coding of the goods sold. It is natural to try to find some useful information from these mountains of data. A conceptually simple yet interesting technique is to find association rules from these large databases. The problem was introduced by Rakesh Agrawal at IBM.

Association rule mining (or market basket analysis) searches for interesting customer habits by looking at associations. It has applications in marketing, store layout, customer segmentation, medicine, finance, and many more areas.

Question: Suppose you are able to find that two products x and y (say bread and milk, or crackers and cheese) are frequently bought together. How can you use that information?

Applications

Market Basket Analysis: given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items which are frequently purchased together.
Telecommunication (each customer is a transaction containing the set of phone calls)
Credit Cards / Banking Services (each card/account is a transaction containing the set of customer payments)
Medical Treatments (each patient is represented as a transaction containing the ordered set of diseases)

A Simple Example

Consider the following ten transactions of a shop selling only nine items. We will return to it after explaining some terminology.

Transaction ID  Items
10   Bread, Cheese, Newspaper
20   Bread, Cheese, Juice
30   Bread, Milk
40   Cheese, Juice, Milk, Coffee
50   Sugar, Tea, Coffee, Biscuits, Newspaper
60   Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70   Bread, Cheese
80   Bread, Cheese, Juice, Coffee
90   Bread, Milk
100  Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper

Terminology

Let the number of different items sold be n.
Let the number of transactions be N.
Let the set of items be I = {i1, i2, …, in}. The number of items may be large, perhaps several thousand.
Let the set of transactions be T = {t1, t2, …, tN}. Each transaction ti contains a subset of items from the itemset {i1, i2, …, in}. These are the things a customer buys when they visit the supermarket. N is assumed to be large, perhaps in the millions.
We do not consider the quantities of items bought.

We want to find groups of items that tend to occur together frequently. Association rules are often written as X → Y, meaning that whenever X appears, Y also tends to appear. X and Y may be single items or sets of items, but the same item does not appear in both.

Suppose X and Y appear together in only 1% of the transactions, but whenever X appears there is an 80% chance that Y also appears. The 1% presence of X and Y together is called the support (or prevalence) of the rule, and 80% is called the confidence (or predictability) of the rule. These are measures of the interestingness of the rule.

Confidence denotes the strength of the association between X and Y. Support indicates the frequency of the pattern. A minimum support is necessary if an association is going to be of some business value.

If the chance of finding an item X in the N transactions is x%, then we can say the probability of X is P(X) = x/100.

Question: Why is minimum support necessary? Why is minimum confidence necessary?

Transaction data: supermarket data

Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
…
tn: {biscuit, eggs, milk}

Concepts:
An item: an item/article in a basket
I: the set of all items sold in the store
A transaction: the items purchased in a basket; it may have a TID (transaction ID)
A transactional dataset: a set of transactions

Transaction data: a set of documents

A text document data set. Each document is treated as a bag of keywords:

    doc1: Student, Teach, School

    doc2: Student, School

    doc3: Teach, School, City, Game

    doc4: Baseball, Basketball

    doc5: Basketball, Player, Spectator

    doc6: Baseball, Coach, Game, Team

    doc7: Basketball, Team, City, Game

The model: rules

A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
An association rule is an implication of the form X → Y, where X, Y ⊂ I and X ∩ Y = ∅.
An itemset is a set of items. E.g., X = {milk, bread, cereal} is an itemset.
A k-itemset is an itemset with k items. E.g., {milk, bread, cereal} is a 3-itemset.

Rule strength measures

Support: the rule holds with support sup in T (the transaction data set) if sup% of transactions contain X ∪ Y.
Confidence: the rule holds in T with confidence conf if conf% of transactions that contain X also contain Y.
An association rule is a pattern that states that when X occurs, Y occurs with a certain probability.

Support and Confidence

Support count: the support count of an itemset X, denoted by X.count, in a data set T is the number of transactions in T that contain X. Assume T has n transactions. Then,

support = (X ∪ Y).count / n
confidence = (X ∪ Y).count / X.count

Suppose we know P(X ∪ Y); then what is the probability of Y appearing if we know that X already exists? It is written as P(Y|X).
The support for X → Y is the probability of both X and Y appearing together, that is P(X ∪ Y).
The confidence of X → Y is the conditional probability of Y appearing given that X exists. It is written as P(Y|X) and read as "P of Y given X".
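To make these definitions concrete, here is a minimal Python sketch (the function name and the small transaction list are illustrative, not from the slides) that computes the support and confidence of a rule X → Y over a list of transactions.

def support_and_confidence(transactions, X, Y):
    """Return (support, confidence) of the rule X -> Y.

    support    = count(X and Y together) / N
    confidence = count(X and Y together) / count(X)
    """
    X, Y = set(X), set(Y)
    n = len(transactions)
    count_x = sum(1 for t in transactions if X <= set(t))
    count_xy = sum(1 for t in transactions if (X | Y) <= set(t))
    support = count_xy / n
    confidence = count_xy / count_x if count_x else 0.0
    return support, confidence

# Illustrative use with a few of the shop transactions shown earlier
transactions = [
    {"Bread", "Cheese", "Newspaper"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk", "Coffee"},
]
print(support_and_confidence(transactions, {"Bread"}, {"Cheese"}))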

Association Rule Problem

Given:
a set I of all the items;
a database T of transactions;
a minimum support p;
a minimum confidence q.

Find:
all association rules X → Y with minimum support p and confidence q.

Problem Decomposition

1. Find all sets of items that have minimum support (frequent itemsets).
2. Use the frequent itemsets to generate the desired rules.

The task

We want to find all associations which have at least p% support and at least q% confidence, such that:
all rules satisfying any user constraints are found;
the rules are found efficiently from large databases;
the rules found are actionable.

Many mining algorithms

There are a large number of them!
They use different strategies and data structures.
Their resulting sets of rules are all the same: given a transaction data set T, a minimum support and a minimum confidence, the set of association rules existing in T is uniquely determined.
Any algorithm should find the same set of rules, although their computational efficiencies and memory requirements may differ.
We study only one: the Apriori algorithm.

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold

Brute-force approach:
List all possible association rules.
Compute the support and confidence for each rule.
Prune rules that fail the minsup and minconf thresholds.

Example

We have 9 items and 10 transactions. Find the support and confidence for each pair. How many pairs are there?

Transaction ID  Items
10   Bread, Cheese, Newspaper
20   Bread, Cheese, Juice
30   Bread, Milk
40   Cheese, Juice, Milk, Coffee
50   Sugar, Tea, Coffee, Biscuits, Newspaper
60   Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70   Bread, Cheese
80   Bread, Cheese, Juice, Coffee
90   Bread, Milk
100  Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper

Example

There are 9 × 8 = 72 ordered pairs, half of them duplicates. 36 is too many to analyse, so we use an even simpler example with n = 4 (Bread, Cheese, Juice, Milk) and N = 4. We want to find rules with a minimum support of 50% and a minimum confidence of 75%.

Transaction ID  Items
100  Bread, Cheese
200  Bread, Cheese, Juice
300  Bread, Milk
400  Cheese, Juice, Milk

Example

N = 4 results in the following frequencies. All items have the 50% support we require. Only two pairs have the 50% support, and no 3-itemset has the support. We will look at the two pairs that have the minimum support to find out whether they have the required confidence.

Itemset  Frequency
Bread  3
Cheese  3
Juice  2
Milk  2
(Bread, Cheese)  2
(Bread, Juice)  1
(Bread, Milk)  1
(Cheese, Juice)  2
(Cheese, Milk)  1
(Juice, Milk)  1
(Bread, Cheese, Juice)  1
(Bread, Cheese, Milk)  0
(Bread, Juice, Milk)  0
(Cheese, Juice, Milk)  1
(Bread, Cheese, Juice, Milk)  0

Terminology

Items or itemsets that have the minimum support are called frequent. In our example, all four items and two pairs are frequent.
We will now determine whether the two pairs {Bread, Cheese} and {Cheese, Juice} lead to association rules with 75% confidence.
Every pair {A, B} can lead to two rules, A → B and B → A, if both satisfy the minimum confidence. The confidence of A → B is given by the support for A and B together divided by the support of A.

Rules

We have four possible rules, and their confidences are as follows:
Bread → Cheese with confidence of 2/3 = 67%
Cheese → Bread with confidence of 2/3 = 67%
Cheese → Juice with confidence of 2/3 = 67%
Juice → Cheese with confidence of 2/2 = 100%
Therefore only the last rule, Juice → Cheese, has confidence above the minimum of 75% and qualifies. Rules that have more than the user-specified minimum confidence are called confident.

Problems with Brute Force

This simple algorithm works well with four items, since there were only a total of 16 combinations that we needed to look at, but if the number of items is, say, 100, the number of combinations is astronomically larger.
The number of combinations becomes about a million with 20 items, since the number of combinations is 2^n with n items (why?). The naive algorithm can be improved to deal more effectively with larger data sets.

Frequent Itemset Generation

[Itemset lattice over five items A–E: the null set, all 1-itemsets, 2-itemsets, 3-itemsets, 4-itemsets, and ABCDE.]
Given d items, there are 2^d possible candidate itemsets.

Problems with Brute Force

Question: Can you think of an improvement over the brute-force method, in which we looked at every possible itemset combination?
Naive algorithms, even with improvements, don't work efficiently enough to deal with large numbers of items and transactions. We now define a better algorithm.

Improved Brute Force

Rather than counting all possible item combinations, we can look at each transaction and count only the combinations that actually occur, as shown below (that is, we don't count itemsets with zero frequency).

Transaction ID  Items  Combinations
100  Bread, Cheese  {Bread, Cheese}
200  Bread, Cheese, Juice  {Bread, Cheese}, {Bread, Juice}, {Cheese, Juice}, {Bread, Cheese, Juice}
300  Bread, Milk  {Bread, Milk}
400  Cheese, Juice, Milk  {Cheese, Juice}, {Cheese, Milk}, {Juice, Milk}, {Cheese, Juice, Milk}

Improved Brute Force

Removing zero frequencies leads to the smaller table below. We then proceed as before, but the improvement is not large and we need better techniques.

Itemset  Frequency
Bread  3
Cheese  3
Juice  2
Milk  2
(Bread, Cheese)  2
(Bread, Juice)  1
(Bread, Milk)  1
(Cheese, Milk)  1
(Cheese, Juice)  2
(Juice, Milk)  1
(Bread, Cheese, Juice)  1
(Cheese, Juice, Milk)  1
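The counting scheme above can be sketched in a few lines of Python, enumerating only the item combinations that actually occur in each transaction and tallying them (a sketch; only standard-library modules are used):

from collections import Counter
from itertools import combinations

def count_occurring_itemsets(transactions):
    """Count every itemset that actually occurs in some transaction,
    so itemsets with zero frequency are never materialised."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return counts

txns = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
for itemset, freq in sorted(count_occurring_itemsets(txns).items()):
    print(itemset, freq)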

Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
  Complete search: M = 2^d
  Use pruning techniques to reduce M
Reduce the number of transactions (N)
  Reduce the size of N as the size of the itemset increases
Reduce the number of comparisons (NM)
  Use efficient data structures to store the candidates or transactions
  No need to match every candidate against every transaction

The Apriori algorithm

Probably the best-known algorithm. Two steps:
1. Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
2. Use frequent itemsets to generate rules.

E.g., a frequent itemset
  {Chicken, Clothes, Milk}  [sup = 3/7]
and one rule from the frequent itemset
  Clothes → Milk, Chicken  [sup = 3/7, conf = 3/3]

Step 1: Mining all frequent itemsets

A frequent itemset is an itemset whose support is ≥ minsup.
Key idea: the Apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset.
[Lattice over items A–D illustrating that the subsets of a frequent itemset are themselves frequent.]

The Apriori Algorithm

Frequent Itemset Property: any subset of a frequent itemset is frequent.
Contrapositive: if an itemset is not frequent, none of its supersets are frequent.

Reducing the Number of Candidates

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
The Apriori principle holds due to the following property of the support measure: the support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.
(A monotonic function, or monotone function, is a function between ordered sets that preserves the given order.)

∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)


Illustrating the Apriori Principle

Minimum Support = 3

Items (1-itemsets):
Item  Count
Bread  4
Coke  2
Milk  4
Beer  3
Diaper  4
Eggs  1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset  Count
{Bread, Milk}  3
{Bread, Beer}  2
{Bread, Diaper}  3
{Milk, Beer}  2
{Milk, Diaper}  3
{Beer, Diaper}  3

Triplets (3-itemsets):
Itemset  Count
{Bread, Milk, Diaper}  3

If every subset is considered, 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning, 6 + 6 + 1 = 13.

The Apriori Algorithm: Basics

The Apriori algorithm is an influential algorithm for mining frequent itemsets for boolean association rules.
Key concepts:
Frequent itemsets: the sets of items that have minimum support (denoted by Lk for the k-itemsets).
Apriori property: any subset of a frequent itemset must be frequent.
Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.

The Apriori Algorithm

To find associations, this classical Apriori algorithm may be simply described by a two-step approach:
Step 1: discover all frequent itemsets that have support above the minimum support required.
Step 2: use the set of frequent itemsets to generate the association rules that have a high enough confidence level.

Apriori Algorithm Terminology

A k-itemset is a set of k items.
The set Ck is a set of candidate k-itemsets that are potentially frequent.
The set Lk is a subset of Ck and is the set of k-itemsets that are frequent.

Step-by-step Apriori Algorithm

Step 1: Computing L1
Scan all transactions. Find all frequent items that have support above the required p%. Let these frequent items be labeled L1.

Step 2: Apriori-gen Function
Use the frequent items in L1 to build all possible item pairs, like {Bread, Cheese} if Bread and Cheese are in L1. The set of these item pairs is called C2, the candidate set.

Step 3: Pruning
Scan all transactions and find all pairs in the candidate pair set C2 that are frequent. Let these frequent pairs be L2.

Step 4: General rule
A generalization of Step 2. Build the candidate set of k-itemsets, Ck, by combining frequent itemsets in the set Lk-1.

Step 5: Pruning
A generalization of Step 3. Scan all transactions and find all itemsets in Ck that are frequent. Let these frequent itemsets be Lk.

Step 6: Continue
Continue with Step 4 unless Lk is empty.

Step 7: Stop
Stop when Lk is empty.

The Apriori Algorithm in a Nutshell

Find the frequent itemsets: the sets of items that have minimum support.
A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Use the frequent itemsets to generate association rules.

The Apriori Algorithm: Pseudocode

Join Step: Ck is generated by joining Lk-1 with itself.
Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
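A compact Python rendering of this pseudocode follows. It is a sketch under the same join/prune/count structure; the function and variable names are my own, not from the slides.

from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: return {frequent itemset (frozenset): support count}."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent single items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {iset: c for iset, c in counts.items() if c / n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets
        prev = list(current)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:
                    # Prune step: every (k-1)-subset must itself be frequent
                    if all(frozenset(s) in current for s in combinations(union, k - 1)):
                        candidates.add(union)
        # Count candidates by scanning all transactions
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Example with the four-transaction data set used earlier (50% support)
txns = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
for itemset, count in sorted(apriori(txns, 0.5).items(),
                             key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), count)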

The Apriori Algorithm: Example

Consider a database, D, consisting of 9 transactions.
Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 = 22%).
Let the minimum confidence required be 70%.
We first have to find the frequent itemsets using the Apriori algorithm. Then, association rules will be generated using the minimum support and minimum confidence.

TID   List of Items
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3

Step 1: Generating the 1-itemset Frequent Pattern

Scan D for the count of each candidate (C1), then compare each candidate support count with the minimum support count to obtain L1.

Itemset  Sup. Count
{I1}  6
{I2}  7
{I3}  6
{I4}  2
{I5}  2

In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support. (Here C1 and L1 contain the same itemsets, since every item meets the minimum support count of 2.)

Step 2: Generating the 2-itemset Frequent Pattern

Generate the C2 candidates from L1, scan D for the count of each candidate, then compare the candidate support counts with the minimum support count to obtain L2.

C2 (with support counts):
{I1, I2}  4
{I1, I3}  4
{I1, I4}  1
{I1, I5}  2
{I2, I3}  4
{I2, I4}  2
{I2, I5}  2
{I3, I4}  0
{I3, I5}  1
{I4, I5}  0

L2:
{I1, I2}  4
{I1, I3}  4
{I1, I5}  2
{I2, I3}  4
{I2, I4}  2
{I2, I5}  2

Step 3: Generating the 3-itemset Frequent Pattern [Cont.]

Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How?
For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3.
Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. The 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}. But {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori property. Thus we have to remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the join operation for pruning.
Now the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.

Step 4: Generating the 4-itemset Frequent Pattern

The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.
Thus C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes the Apriori algorithm.

What's next?
These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).

Step 5: Generating Association Rules from Frequent Itemsets

Procedure:
For each frequent itemset l, generate all nonempty proper subsets of l.
For every nonempty subset s of l, output the rule s → (l − s) if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.

Back to the example:
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}. Let's take l = {I1, I2, I5}.
Its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
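This procedure can be sketched directly in Python. The helper below and the small dictionary of support counts (only the itemsets needed for this illustration) are my own names, not from the slides.

from itertools import combinations

def generate_rules(frequent, min_conf):
    """Generate rules s -> (l - s) from every frequent itemset l.

    `frequent` maps frozenset itemsets to their support counts."""
    rules = []
    for l, l_count in frequent.items():
        if len(l) < 2:
            continue
        # all nonempty proper subsets of l
        for size in range(1, len(l)):
            for s in combinations(l, size):
                s = frozenset(s)
                conf = l_count / frequent[s]
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

# Illustrative support counts taken from the 9-transaction example above
sc = {frozenset(k): v for k, v in {
    ("I1",): 6, ("I2",): 7, ("I5",): 2,
    ("I1", "I2"): 4, ("I1", "I5"): 2, ("I2", "I5"): 2,
    ("I1", "I2", "I5"): 2}.items()}
for lhs, rhs, conf in generate_rules(sc, 0.7):
    print(lhs, "->", rhs, f"conf={conf:.0%}")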

Step 5: Generating Association Rules from Frequent Itemsets [Cont.]

Let the minimum confidence threshold be, say, 70%. The resulting association rules are shown below, each listed with its confidence.

R1: I1 ∧ I2 → I5
  Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%. R1 is rejected.
R2: I1 ∧ I5 → I2
  Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%. R2 is selected.
R3: I2 ∧ I5 → I1
  Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%. R3 is selected.
R4: I1 → I2 ∧ I5
  Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
R5: I2 → I1 ∧ I5
  Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
R6: I5 → I1 ∧ I2
  Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%. R6 is selected.

In this way, we have found three strong association rules.

Another Example

Database TDB (Supmin = 2):
Tid  Items
10   A, B, D
20   A, C, E
30   A, B, C, E
40   C, E

1st scan: C1 → L1
Itemset  sup
{A}  3
{B}  2
{C}  3
{D}  1
{E}  3
L1: {A} 3, {B} 2, {C} 3, {E} 3

2nd scan: C2 → L2
Itemset  sup
{A, B}  2
{A, C}  2
{A, E}  2
{B, C}  1
{B, E}  1
{C, E}  3
L2: {A, B} 2, {A, C} 2, {A, E} 2, {C, E} 3

3rd scan: C3 → L3
C3: {A, C, E}
L3: {A, C, E}  2

An Example

This example is similar to the last example, but we have added two more items and another transaction, making five transactions and six items. We want to find association rules with 50% support and 75% confidence.

Transaction ID  Items
100  Bread, Cheese, Eggs, Juice
200  Bread, Cheese, Juice
300  Bread, Milk, Yogurt
400  Bread, Juice, Milk
500  Cheese, Juice, Milk

Example

First find L1. 50% support requires that each frequent item appear in at least three transactions. Therefore L1 is given by:

Item  Frequency
Bread  4
Cheese  3
Juice  4
Milk  3

Example

The candidate 2-itemsets, C2, therefore contain six pairs (why?). These pairs and their frequencies are:

Item Pair  Frequency
(Bread, Cheese)  2
(Bread, Juice)  3
(Bread, Milk)  2
(Cheese, Juice)  3
(Cheese, Milk)  1
(Juice, Milk)  2

Question: Why did we not select (Eggs, Milk)?

Deriving Rules

L2 has only two frequent item pairs, {Bread, Juice} and {Cheese, Juice}. After these two frequent pairs, there are no candidate 3-itemsets (why?), since we do not have two 2-itemsets that share the same first item (why?).
The two frequent pairs lead to the following possible rules:
Bread → Juice
Juice → Bread
Cheese → Juice
Juice → Cheese

Deriving Rules

The confidence of these rules is obtained by dividing the support for both items in the rule by the support of the item on the left-hand side of the rule.
The confidences of the four rules therefore are:
3/4 = 75% (Bread → Juice)
3/4 = 75% (Juice → Bread)
3/3 = 100% (Cheese → Juice)
3/4 = 75% (Juice → Cheese)
Since all of them have at least the minimum 75% confidence, they all qualify. We are done, since there are no 3-itemsets.

Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.
If {A, B, C, D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).

Rule Generation

How do we efficiently generate rules from frequent itemsets?
In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D).
But the confidence of rules generated from the same itemset does have an anti-monotone property.
E.g., for L = {A, B, C, D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.

Rule Generation for the Apriori Algorithm

[Lattice of rules for the frequent itemset {A, B, C, D}, from ABCD => {} down to rules with a single item in the antecedent such as A => BCD. If a low-confidence rule is found (e.g., BCD => A), all rules below it in the lattice, which move further items into the consequent, can be pruned.]

Rule Generation for the Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent.
join(CD => AB, BD => AC) would produce the candidate rule D => ABC.
Prune rule D => ABC if its subset AD => BC does not have high confidence.

Rule generation algorithm

Key fact: moving items from the antecedent to the consequent never changes support, and never increases confidence.

Algorithm:
For each itemset I with minsup:
  Find all minconf rules with a single consequent, of the form (I − L1 → L1).
  Repeat:
    Guess candidate consequents Ck by appending items from I − Lk−1 to Lk−1.
    Verify the confidence of each rule I − Ck → Ck using known itemset support values.

Association Rules

Consider: AB → C (90% confidence) and A → C (92% confidence).
Clearly the first rule is of no use. We should look for more complex rules only if they are better than simple rules.

Pattern Evaluation

Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant.
Redundant if {A,B,C} → {D} and {A,B} → {D} have the same support and confidence.
Interestingness measures can be used to prune/rank the derived patterns.
In the original formulation of association rules, support and confidence are the only measures used.

Association & Correlation

As we can see, the support-confidence framework can be misleading; it can identify a rule (A => B) as interesting (strong) when, in fact, the occurrence of A might not imply the occurrence of B.
Correlation analysis provides an alternative framework for finding interesting relationships, or for improving our understanding of the meaning of some association rules (the lift of an association rule).

Computing Interestingness Measure

Given a rule X → Y, the information needed to compute the rule's interestingness can be obtained from a contingency table.

Contingency table for X → Y:
        Y     ¬Y
 X     f11   f10   | f1+
¬X     f01   f00   | f0+
       f+1   f+0   | |T|

f11: support count of X and Y
f10: support count of X and ¬Y
f01: support count of ¬X and Y
f00: support count of ¬X and ¬Y

These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.
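A small Python helper (illustrative, not from the slides) that derives these four counts from a transaction list for a rule X → Y:

def contingency_counts(transactions, X, Y):
    """Return (f11, f10, f01, f00) for the rule X -> Y."""
    X, Y = set(X), set(Y)
    f11 = f10 = f01 = f00 = 0
    for t in transactions:
        t = set(t)
        has_x, has_y = X <= t, Y <= t
        if has_x and has_y:
            f11 += 1          # X and Y
        elif has_x:
            f10 += 1          # X and not Y
        elif has_y:
            f01 += 1          # not X and Y
        else:
            f00 += 1          # neither X nor Y
    return f11, f10, f01, f00

txns = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
print(contingency_counts(txns, {"Bread"}, {"Cheese"}))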

Drawback of Confidence

           Coffee   ¬Coffee
 Tea         15        5     | 20
¬Tea         75        5     | 80
             90       10     | 100

Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9.
Although the confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 75/80 = 0.9375.

Correlation Concepts

Two itemsets A and B are independent (the occurrence of A is independent of the occurrence of itemset B) iff P(A ∪ B) = P(A) P(B).
Otherwise A and B are dependent and correlated.
The measure of correlation, or correlation between A and B, is given by the formula:
Corr(A,B) = P(A ∪ B) / (P(A) · P(B))

Correlation Concepts [Cont.]

corr(A,B) > 1 means that A and B are positively correlated, i.e. the occurrence of one implies the occurrence of the other.
corr(A,B) < 1 means that the occurrence of A is negatively correlated with (or discourages) the occurrence of B.
corr(A,B) = 1 means that A and B are independent and there is no correlation between them.

Statistical Independence

Population of 1000 students:
600 students know how to swim (S)
700 students know how to bike (B)
420 students know how to swim and bike (S, B)

P(S ∧ B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42

P(S ∧ B) = P(S) × P(B) => statistical independence
P(S ∧ B) > P(S) × P(B) => positively correlated
P(S ∧ B) < P(S) × P(B) => negatively correlated

Association & Correlation

The correlation formula can be rewritten as Corr(A,B) = P(B|A) / P(B).
We already know that Support(A → B) = P(A ∪ B) and Confidence(A → B) = P(B|A).
That means that Confidence(A → B) = corr(A,B) × P(B).
So correlation, support and confidence are all different, but the correlation provides extra information about the association rule (A → B).
We say that the correlation corr(A,B) provides the LIFT of the association rule (A => B), i.e. A is said to increase (or LIFT) the likelihood of B by the factor returned by the formula for corr(A,B).
Lift measures how much more likely the RHS is given the LHS than merely the RHS.

Statistical-based Measures

Measures that take into account statistical dependence:

Lift = P(Y|X) / P(Y)
Interest = P(X,Y) / (P(X) P(Y))
PS = P(X,Y) − P(X) P(Y)
φ-coefficient = (P(X,Y) − P(X) P(Y)) / sqrt( P(X)[1 − P(X)] P(Y)[1 − P(Y)] )
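These measures are straightforward to compute from the contingency counts defined earlier; a brief Python sketch (the function name is mine):

from math import sqrt

def interestingness(f11, f10, f01, f00):
    """Compute lift/interest, PS and the phi-coefficient for a rule X -> Y."""
    n = f11 + f10 + f01 + f00
    p_xy = f11 / n          # P(X, Y)
    p_x = (f11 + f10) / n   # P(X)
    p_y = (f11 + f01) / n   # P(Y)
    lift = p_xy / (p_x * p_y)                 # equals P(Y|X) / P(Y)
    ps = p_xy - p_x * p_y                     # Piatetsky-Shapiro
    phi = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return {"lift": lift, "PS": ps, "phi": phi}

# Tea -> Coffee contingency table from the slide: f11=15, f10=5, f01=75, f00=5
print(interestingness(15, 5, 75, 5))   # lift = 0.8333...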

Example: Lift/Interest

           Coffee   ¬Coffee
 Tea         15        5     | 20
¬Tea         75        5     | 80
             90       10     | 100

Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9.
Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated).

Rule Evaluation - Lift

Transaction No.  Item 1  Item 2     Item 3     Item 4
100              Beer    Diaper     Chocolate
101              Milk    Chocolate  Shampoo
102              Beer    Milk       Vodka      Chocolate
103              Beer    Milk       Diaper     Chocolate
104              Milk    Diaper     Beer

What are the support and confidence for the rule {Chocolate} → {Milk}?
Support = 3/5, Confidence = 3/4. Very high support and confidence.
Does Chocolate really lead to a Milk purchase?
No! Because Milk occurs in 4 out of 5 transactions. Chocolate even decreases the chance of a Milk purchase (3/4 < 4/5).
Lift = (3/4) / (4/5) = 0.9375 < 1
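The same lift calculation can be checked directly against the five transactions above (a quick sketch):

txns = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Milk", "Vodka", "Chocolate"},
    {"Beer", "Milk", "Diaper", "Chocolate"},
    {"Milk", "Diaper", "Beer"},
]
n = len(txns)
choc = sum(1 for t in txns if "Chocolate" in t)
both = sum(1 for t in txns if {"Chocolate", "Milk"} <= t)
milk = sum(1 for t in txns if "Milk" in t)
confidence = both / choc
lift = confidence / (milk / n)
print(f"support={both/n:.2f} confidence={confidence:.2f} lift={lift:.4f}")
# prints support=0.60 confidence=0.75 lift=0.9375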

There are lots of measures proposed in the literature.
Some measures are good for certain applications, but not for others.
What criteria should we use to determine whether a measure is good or bad?

Efficiency

Consider an example of a supermarket database which might have several thousand items, including 1000 frequent items, and several million transactions.
Which part of the Apriori algorithm will be the most expensive to compute? Why?

Efficiency

The algorithm used to construct the candidate set Ck in order to find the frequent set Lk is crucial to the performance of the Apriori algorithm. The larger the candidate set, the higher the processing cost of discovering the frequent itemsets, since the transactions must be scanned for each. Given that the numbers of early candidate itemsets are very large, the initial iterations dominate the cost.
In a supermarket database with about 1000 frequent items, there will be almost half a million candidate pairs in C2 that need to be searched for to find L2.

Efficiency

Generally the number of frequent pairs out of half a million pairs will be small, and therefore (why?) the number of 3-itemsets should be small.
Therefore, it is the generation of the frequent set L2 that is the key to improving the performance of the Apriori algorithm.

Comment

In class examples we usually require high support, for example 25%, 30% or even 50%. These support values are very high if the number of items and the number of transactions are large. For example, 25% support in a supermarket transaction database means searching for items that have been purchased by one in four customers! Not many items would probably qualify.
Practical applications therefore deal with much smaller support values.

Transactions Storage

Consider only eight transactions, with transaction IDs {10, 20, 30, 40, 50, 60, 70, 80}. This set of eight transactions with six items can be represented in at least three different ways, as follows.

Transaction ID  Items
10  A, B, D
20  D, E, F
30  A, F
40  B, C, D
50  E, F
60  D, E, F
70  C, D, F
80  A, C, D, F

Horizontal Representation

TID  A  B  C  D  E  F
10   1  1  0  1  0  0
20   0  0  0  1  1  1
30   1  0  0  0  0  1
40   0  1  1  1  0  0
50   0  0  0  0  1  1
60   0  0  0  1  1  1
70   0  0  1  1  0  1
80   1  0  1  1  0  1

Vertical Representation

Item  10  20  30  40  50  60  70  80
A      1   0   1   0   0   0   0   1
B      1   0   0   1   0   0   0   0
C      0   0   0   1   0   0   1   1
D      1   1   0   1   0   1   1   1
E      0   1   0   0   1   1   0   0
F      0   1   1   0   1   1   1   1
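The switch between the horizontal and vertical layouts is easy to express in code. The sketch below (names are illustrative) builds the vertical form, item-to-TID lists, from the horizontal transaction list:

from collections import defaultdict

def to_vertical(transactions):
    """Map each item to the sorted list of transaction IDs containing it."""
    tidlists = defaultdict(list)
    for tid, items in transactions.items():
        for item in items:
            tidlists[item].append(tid)
    return {item: sorted(tids) for item, tids in tidlists.items()}

horizontal = {10: {"A", "B", "D"}, 20: {"D", "E", "F"}, 30: {"A", "F"},
              40: {"B", "C", "D"}, 50: {"E", "F"}, 60: {"D", "E", "F"},
              70: {"C", "D", "F"}, 80: {"A", "C", "D", "F"}}
print(to_vertical(horizontal))
# e.g. 'A': [10, 30, 80], 'D': [10, 20, 40, 60, 70, 80]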

Bigger Example

TID  Items
1   Biscuits, Bread, Cheese, Coffee, Yogurt
2   Bread, Cereal, Cheese, Coffee
3   Cheese, Chocolate, Donuts, Juice, Milk
4   Bread, Cheese, Coffee, Cereal, Juice
5   Bread, Cereal, Chocolate, Donuts, Juice
6   Milk, Tea
7   Biscuits, Bread, Cheese, Coffee, Milk
8   Eggs, Milk, Tea
9   Bread, Cereal, Cheese, Chocolate, Coffee
10  Bread, Cereal, Chocolate, Donuts, Juice
11  Bread, Cheese, Juice
12  Bread, Cheese, Coffee, Donuts, Juice
13  Biscuits, Bread, Cereal
14  Cereal, Cheese, Chocolate, Donuts, Juice
15  Chocolate, Coffee
16  Donuts
17  Donuts, Eggs, Juice
18  Biscuits, Bread, Cheese, Coffee
19  Bread, Cereal, Chocolate, Donuts, Juice
20  Cheese, Chocolate, Donuts, Juice
21  Milk, Tea, Yogurt
22  Bread, Cereal, Cheese, Coffee
23  Chocolate, Donuts, Juice, Milk, Newspaper
24  Newspaper, Pastry, Rolls
25  Rolls, Sugar, Tea

Frequency of Items

Item No.  Item name  Frequency
1   Biscuits   4
2   Bread      13
3   Cereal     10
4   Cheese     11
5   Chocolate  9
6   Coffee     9
7   Donuts     10
8   Eggs       2
9   Juice      11
10  Milk       6
11  Newspaper  2
12  Pastry     1
13  Rolls      2
14  Sugar      1
15  Tea        4
16  Yogurt     2

Frequent Items

Assume 25% support. In 25 transactions, a frequent item must occur in at least 7 transactions. The frequent 1-itemsets, L1, are given below. How many candidates are in C2? List them.

Item       Frequency
Bread      13
Cereal     10
Cheese     11
Chocolate  9
Coffee     9
Donuts     10
Juice      11

L2

The following pairs are frequent. Now find C3, then L3, and the rules.

Frequent 2-itemset    Frequency
{Bread, Cereal}       9
{Bread, Cheese}       8
{Bread, Coffee}       8
{Cheese, Coffee}      9
{Chocolate, Donuts}   7
{Chocolate, Juice}    7
{Donuts, Juice}       9

Rules

The full set of rules is given below. Could some rules be removed?

Cheese → Bread
Cheese → Coffee
Coffee → Bread
Coffee → Cheese
Cheese, Coffee → Bread
Bread, Coffee → Cheese
Bread, Cheese → Coffee
Chocolate → Donuts
Chocolate → Juice
Donuts → Chocolate
Donuts → Juice
Donuts, Juice → Chocolate
Chocolate, Juice → Donuts
Chocolate, Donuts → Juice
Bread → Cereal
Cereal → Bread

Study the above rules carefully.

Improving the Apriori Algorithm

Many techniques for improving the efficiency have been proposed:
Pruning (already mentioned)
Hashing-based technique
Transaction reduction
Partitioning
Sampling
Dynamic itemset counting

Pruning

Pruning can reduce the size of the candidate set Ck. We want to transform Ck into the set of frequent itemsets Lk. To reduce the work of checking, we may use the rule that all subsets of any candidate in Ck must also be frequent.

Example

Suppose the items are A, B, C, D, E, F, …, X, Y, Z.
Suppose L1 is A, C, E, P, Q, S, T, V, W, X.
Suppose L2 is {A, C}, {A, F}, {A, P}, {C, P}, {E, P}, {E, G}, {E, V}, {H, J}, {K, M}, {Q, S}, {Q, X}.
Are you able to identify errors in the L2 list?
What is C3? How do we prune C3?
C3 is {A, C, P}, {E, P, V}, {Q, S, X}.

Hash Tables

Hash tables are a common approach to the storing/searching problem.

What is a Hash Table?

The simplest kind of hash table is an array of records. This example has 701 records (an array indexed [0] … [700]).

Each record has a special field, called its key. In this example, the key is a long integer field called Number (e.g., Number 506643548).

The number might be a person's identification number, and the rest of the record has information about the person.

When a hash table is in use, some spots contain valid records, and other spots are "empty".

Inserting a New Record

In order to insert a new record, the key must somehow be converted to an array index. The index is called the hash value of the key.

A typical way to create a hash value is (Number mod 701).
What is (580625685 mod 701)? It is 3.

The hash value is used as the location of the new record: the record with key 580625685 is stored at index [3].

Collisions

Here is another new record to insert, Number 701466868, with a hash value of [2].
This is called a collision, because there is already another valid record at [2].
When a collision occurs, move forward until you find an empty spot. The new record goes in the empty spot.

Searching for a Key

The data that is attached to a key can be found fairly quickly.
Calculate the hash value and check that location of the array for the key.
Keep moving forward until you find the key, or you reach an empty spot.
When the item is found, the information can be copied to the necessary location.

Deleting a Record

Records may also be deleted from a hash table.
But the location must not be left as an ordinary "empty spot", since that could interfere with searches.
The location must be marked in some special way, so that a search can tell that the spot used to have something in it.
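A minimal open-addressing hash table in Python mirroring the description above (701 slots, hash = key mod 701, linear probing, and a special marker for deleted slots). This is a generic sketch, not code from the slides.

EMPTY, DELETED = object(), object()  # sentinels for never-used and deleted slots

class HashTable:
    def __init__(self, size=701):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        """Yield slot indices starting at key mod table size (linear probing)."""
        size = len(self.slots)
        start = key % size
        for i in range(size):
            yield (start + i) % size

    def insert(self, key, value):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY or slot is DELETED or slot[0] == key:
                self.slots[i] = (key, value)
                return

    def search(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:            # a never-used slot ends the search
                return None
            if slot is not DELETED and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return
            if slot is not DELETED and slot[0] == key:
                self.slots[i] = DELETED  # mark it, don't leave an ordinary empty spot
                return

table = HashTable()
table.insert(580625685, "record A")   # hashes to 580625685 % 701 == 3
table.insert(701466868, "record B")
print(table.search(701466868))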

Example

Consider the transaction database in the first table below, used in an earlier example. The second table below shows all possible 2-itemsets for each transaction.

Transaction ID  Items
100  Bread, Cheese, Eggs, Juice
200  Bread, Cheese, Juice
300  Bread, Milk, Yogurt
400  Bread, Juice, Milk
500  Cheese, Juice, Milk

TID  2-itemsets
100  (B, C) (B, E) (B, J) (C, E) (C, J) (E, J)
200  (B, C) (B, J) (C, J)
300  (B, M) (B, Y) (M, Y)
400  (B, J) (B, M) (J, M)
500  (C, J) (C, M) (J, M)

Hashing Example

The possible 2-itemsets in the last table are now hashed to the hash table below. The last column is not required in the hash table, but we have included it to explain the technique.

Bit vector  Bucket number  Count  Pairs
1           0              3      (C, J) (B, Y) (M, Y) (C, J)
0           1              1      (C, M)
0           2              1      (E, J)
0           3              0
0           4              2      (B, C)
1           5              3      (B, E) (J, M) (J, M)
1           6              3      (B, J) (B, J)
1           7              3      (C, E) (B, M) (B, M)

Find C2

The major aim of the algorithm is to reduce the size of C2. It is therefore essential that the hash table is large enough that collisions are few. Collisions result in a loss of effectiveness of the hash table. This is what happened in the example, in which we had collisions in three of the eight rows of the hash table, which required us to find which pair in each of those buckets was actually frequent.
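A sketch of this hashing idea in Python (the bucket count and hash function here are illustrative, not the ones used in the slides): hash every 2-itemset of every transaction into a small number of buckets during the first scan, then keep as candidates only the pairs of frequent items whose bucket count reaches the minimum support count.

from itertools import combinations

def hash_filtered_c2(transactions, frequent_items, min_count, n_buckets=8):
    """Return candidate pairs of frequent items whose hash bucket is 'hot'."""
    bucket_counts = [0] * n_buckets

    def bucket(pair):
        # illustrative deterministic hash: sum of character codes mod n_buckets
        return sum(ord(c) for item in pair for c in item) % n_buckets

    # First scan: count every occurring pair only via its bucket
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            bucket_counts[bucket(pair)] += 1

    # A pair can be frequent only if both items are frequent
    # and its bucket count reaches min_count
    candidates = []
    for pair in combinations(sorted(frequent_items), 2):
        if bucket_counts[bucket(pair)] >= min_count:
            candidates.append(pair)
    return candidates

txns = [{"Bread", "Cheese", "Eggs", "Juice"}, {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk", "Yogurt"}, {"Bread", "Juice", "Milk"},
        {"Cheese", "Juice", "Milk"}]
print(hash_filtered_c2(txns, {"Bread", "Cheese", "Juice", "Milk"}, min_count=3))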

Transaction Reduction

As discussed earlier, any transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, and such a transaction may be marked or removed.

Partitioning

The set of transactions may be divided into a number of disjoint subsets. Then each partition is searched for frequent itemsets. These frequent itemsets are called local frequent itemsets.

Phase 1:
Divide the n transactions into m partitions.
Find the frequent itemsets in each partition.
Combine all local frequent itemsets to form the candidate itemsets.

Phase 2:
Find the global frequent itemsets.
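A sketch of the two phases in Python. The helper names, the fixed-size partitioning, and the naive local miner are assumptions for illustration only; in practice each partition would be mined with Apriori or a similar algorithm.

from itertools import combinations

def partitioned_frequent_itemsets(transactions, min_support, m=2):
    """Phase 1: local frequent itemsets per partition; Phase 2: global recount."""
    n = len(transactions)
    size = (n + m - 1) // m
    partitions = [transactions[i:i + size] for i in range(0, n, size)]

    # Phase 1: the union of local frequent itemsets becomes the global candidate set
    candidates = set()
    for part in partitions:
        candidates |= set(local_frequent_itemsets(part, min_support))

    # Phase 2: one more scan of the full database to count the candidates globally
    result = {}
    for c in candidates:
        count = sum(1 for t in transactions if c <= set(t))
        if count / n >= min_support:
            result[c] = count
    return result

def local_frequent_itemsets(transactions, min_support):
    """Naive local miner used only for illustration: count itemsets that occur."""
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return [c for c, cnt in counts.items() if cnt / len(transactions) >= min_support]

txns = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
print(partitioned_frequent_itemsets(txns, 0.5))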

Sampling

A random sample (usually large enough to fit in main memory) may be obtained from the overall set of transactions, and the sample is searched for frequent itemsets. These frequent itemsets are called sample frequent itemsets.
This is not guaranteed to be accurate, but we sacrifice accuracy for efficiency. A lower support threshold may be used for the sample to reduce the chance of missing any frequent itemsets.
The actual frequencies of the sample frequent itemsets are then obtained.
More than one sample could be used to improve accuracy.
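A brief sketch of the sampling idea in Python; the sample fraction, the lowered threshold factor, and the naive itemset counter are assumptions chosen for illustration.

import random
from collections import Counter
from itertools import combinations

def sample_frequent_itemsets(transactions, min_support,
                             sample_frac=0.5, threshold_factor=0.8, seed=0):
    """Mine a random sample with a lowered threshold, then verify the actual
    support of each sample-frequent itemset against the full data set."""
    rng = random.Random(seed)
    k = max(1, int(len(transactions) * sample_frac))
    sample = rng.sample(list(transactions), k)

    # Count every itemset that occurs in the sample (naive, for illustration only)
    counts = Counter()
    for t in sample:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            counts.update(frozenset(c) for c in combinations(items, size))

    # A lowered threshold on the sample reduces the risk of missing frequent itemsets
    cutoff = min_support * threshold_factor * len(sample)
    sample_frequent = [c for c, cnt in counts.items() if cnt >= cutoff]

    # Then obtain the actual frequencies of the sample-frequent itemsets
    return {c: sum(1 for t in transactions if c <= set(t)) for c in sample_frequent}

txns = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
print(sample_frequent_itemsets(txns, 0.5))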

Exercise

Given the following list of transactions, do the following:
1) Find all the frequent itemsets (minimum support 40%).
2) Find all the association rules (minimum confidence 70%).
3) For the discovered association rules, calculate the lift.

Transaction No.  Item 1  Item 2     Item 3     Item 4
100              Beer    Diaper     Chocolate
101              Milk    Chocolate  Shampoo
102              Beer    Soap       Vodka
103              Beer    Cheese     Wine
104              Milk    Diaper     Beer       Chocolate