ASSOCIATION RULE MINING
Introduction

As noted earlier, huge amounts of data are stored electronically in many retail outlets due to barcoding of goods sold. It is natural to try to find some useful information from these mountains of data.

A conceptually simple yet interesting technique is to find association rules from these large databases.

The problem was invented by Rakesh Agrawal at IBM.
Introduction

Association rule mining (or market basket analysis) searches for interesting customer habits by looking at associations.

Applications in marketing, store layout, customer segmentation, medicine, finance, and many more.

Question: Suppose you are able to find that two products x and y (say bread and milk, or crackers and cheese) are frequently bought together. How can you use that information?
Applications

Market Basket Analysis: given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items which are frequently purchased together.

Telecommunication (each customer is a transaction containing the set of phone calls).

Credit Cards / Banking Services (each card/account is a transaction containing the set of the customer's payments).

Medical Treatments (each patient is represented as a transaction containing the ordered set of diseases).
A Simple Example

Consider the following ten transactions of a shop selling only nine items. We will return to it after explaining some terminology.

Transaction ID | Items
10  | Bread, Cheese, Newspaper
20  | Bread, Cheese, Juice
30  | Bread, Milk
40  | Cheese, Juice, Milk, Coffee
50  | Sugar, Tea, Coffee, Biscuits, Newspaper
60  | Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70  | Bread, Cheese
80  | Bread, Cheese, Juice, Coffee
90  | Bread, Milk
100 | Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper
Terminology

Let the number of different items sold be n.

Let the number of transactions be N.

Let the set of items be I = {i1, i2, ..., in}. The number of items may be large, perhaps several thousand.

Let the set of transactions be T = {t1, t2, ..., tN}. Each transaction ti contains a subset of items from the itemset {i1, i2, ..., in}. These are the things a customer buys when they visit the supermarket. N is assumed to be large, perhaps in the millions.

We are not considering the quantities of items bought.
Terminology

We want to find groups of items that tend to occur together frequently.

Association rules are often written as X → Y, meaning that whenever X appears, Y also tends to appear. X and Y may be single items or sets of items, but the same item does not appear in both.
Terminology

Suppose X and Y appear together in only 1% of the transactions, but whenever X appears there is an 80% chance that Y also appears.

The 1% presence of X and Y together is called the support (or prevalence) of the rule, and 80% is called the confidence (or predictability) of the rule.

These are measures of the interestingness of the rule.
Terminology

Confidence denotes the strength of the association between X and Y. Support indicates the frequency of the pattern. A minimum support is necessary if an association is going to be of some business value.

Let the chance of finding an item X in the N transactions be x%. Then we can say the probability of X is P(X) = x/100.

Question: Why is minimum support necessary? Why is minimum confidence necessary?
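To make these definitions concrete, here is a minimal sketch in Python that computes support and confidence over the ten-transaction example above. The encoding of transactions as sets and the rule Bread → Cheese are illustrative choices, not part of the original slides:

```python
# A minimal sketch: support and confidence over the ten-transaction example.
transactions = [
    {"Bread", "Cheese", "Newspaper"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk", "Coffee"},
    {"Sugar", "Tea", "Coffee", "Biscuits", "Newspaper"},
    {"Sugar", "Tea", "Coffee", "Biscuits", "Milk", "Juice", "Newspaper"},
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice", "Coffee"},
    {"Bread", "Milk"},
    {"Sugar", "Tea", "Coffee", "Bread", "Milk", "Juice", "Newspaper"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(X, Y, transactions):
    """Fraction of transactions containing X that also contain Y."""
    return support(X | Y, transactions) / support(X, transactions)

# Rule: Bread -> Cheese
print(support({"Bread", "Cheese"}, transactions))      # 0.4   (4 of 10)
print(confidence({"Bread"}, {"Cheese"}, transactions)) # 0.571 (4 of 7)
```

Later sketches in these notes reuse support() and confidence() from this block.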
Transaction data: supermarket data

Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
...
tn: {biscuit, eggs, milk}

Concepts:
An item: an item/article in a basket.
I: the set of all items sold in the store.
A transaction: items purchased in a basket; it may have a TID (transaction ID).
A transactional dataset: a set of transactions.
Transaction data: a set of documents

A text document data set. Each document is treated as a bag of keywords:

doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game
The model: rules

A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.

An association rule is an implication of the form X → Y, where X, Y ⊂ I and X ∩ Y = ∅.

An itemset is a set of items. E.g., X = {milk, bread, cereal} is an itemset.

A k-itemset is an itemset with k items. E.g., {milk, bread, cereal} is a 3-itemset.
Rule strength measures

Support: the rule holds with support sup in T (the transaction data set) if sup% of transactions contain X ∪ Y.

Confidence: the rule holds in T with confidence conf if conf% of transactions that contain X also contain Y.

An association rule is a pattern that states that when X occurs, Y occurs with a certain probability.
Support and Confidence

Support count: the support count of an itemset X, denoted by X.count, in a data set T is the number of transactions in T that contain X. Assume T has n transactions. Then:

    support = (X ∪ Y).count / n

    confidence = (X ∪ Y).count / X.count
Terminology

Suppose we know P(X ∪ Y); then what is the probability of Y appearing if we know that X already exists? It is written as P(Y|X).

The support for X → Y is the probability of both X and Y appearing together, that is, P(X ∪ Y).

The confidence of X → Y is the conditional probability of Y appearing given that X exists. It is written as P(Y|X) and read as "P of Y given X".
Association Rule Problem

Given:
a set I of all the items;
a database T of transactions;
minimum support p;
minimum confidence q.

Find:
all association rules X → Y with a minimum support p and confidence q.
Problem Decomposition

1. Find all sets of items that have minimum support (frequent itemsets).
2. Use the frequent itemsets to generate the desired rules.
The task

We want to find all associations which have at least p% support with at least q% confidence, such that:

all rules satisfying any user constraints are found;
the rules are found efficiently from large databases;
the rules found are actionable.
Many mining algorithms

There are a large number of them!!
They use different strategies and data structures.
Their resulting sets of rules are all the same: given a transaction data set T, a minimum support and a minimum confidence, the set of association rules existing in T is uniquely determined.
Any algorithm should find the same set of rules, although their computational efficiencies and memory requirements may differ.
We study only one: the Apriori Algorithm.
Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having:
support ≥ minsup threshold
confidence ≥ minconf threshold

Brute-force approach:
List all possible association rules.
Compute the support and confidence for each rule.
Prune rules that fail the minsup and minconf thresholds.
Example

We have 9 items and 10 transactions. Find the support and confidence for each pair. How many pairs are there?

Transaction ID | Items
10  | Bread, Cheese, Newspaper
20  | Bread, Cheese, Juice
30  | Bread, Milk
40  | Cheese, Juice, Milk, Coffee
50  | Sugar, Tea, Coffee, Biscuits, Newspaper
60  | Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70  | Bread, Cheese
80  | Bread, Cheese, Juice, Coffee
90  | Bread, Milk
100 | Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper
Example

There are 9 × 8 = 72 ordered pairs, half of them duplicates, leaving 36. That is too many to analyse, so we use an even simpler example of n = 4 (Bread, Cheese, Juice, Milk) and N = 4. We want to find rules with minimum support of 50% and minimum confidence of 75%.

Transaction ID | Items
100 | Bread, Cheese
200 | Bread, Cheese, Juice
300 | Bread, Milk
400 | Cheese, Juice, Milk
Example

N = 4 results in the following frequencies. All items have the 50% support we require. Only two pairs have the 50% support, and no 3-itemset has the support. We will look at the two pairs that have the minimum support to find if they have the required confidence.

Itemset | Frequency
Bread | 3
Cheese | 3
Juice | 2
Milk | 2
(Bread, Cheese) | 2
(Bread, Juice) | 1
(Bread, Milk) | 1
(Cheese, Juice) | 2
(Cheese, Milk) | 1
(Juice, Milk) | 1
(Bread, Cheese, Juice) | 1
(Bread, Cheese, Milk) | 0
(Bread, Juice, Milk) | 0
(Cheese, Juice, Milk) | 1
(Bread, Cheese, Juice, Milk) | 0
Terminology

Items or itemsets that have the minimum support are called frequent. In our example, all four items and two pairs are frequent.

We will now determine if the two pairs {Bread, Cheese} and {Cheese, Juice} lead to association rules with 75% confidence.

Every pair {A, B} can lead to two rules, A → B and B → A, if both satisfy the minimum confidence. The confidence of A → B is given by the support for A and B together divided by the support of A.
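A small sketch of this check, reusing the support() helper from the earlier code block. The four-transaction table and the 75% threshold are the ones used in this example; the function name is our own:

```python
# Check both rules derivable from a frequent pair {A, B}.
transactions4 = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]

def rules_from_pair(a, b, transactions, minconf=0.75):
    both = support({a, b}, transactions)
    for lhs, rhs in ((a, b), (b, a)):
        conf = both / support({lhs}, transactions)
        status = "qualifies" if conf >= minconf else "rejected"
        print(f"{lhs} -> {rhs}: confidence {conf:.0%} ({status})")

rules_from_pair("Cheese", "Juice", transactions4)
# Cheese -> Juice: confidence 67% (rejected)
# Juice -> Cheese: confidence 100% (qualifies)
```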
Rules

We have four possible rules and their confidences are given as follows:

Bread → Cheese with confidence of 2/3 = 67%
Cheese → Bread with confidence of 2/3 = 67%
Cheese → Juice with confidence of 2/3 = 67%
Juice → Cheese with confidence of 2/2 = 100%

Therefore only the last rule, Juice → Cheese, has confidence above the minimum 75% and qualifies. Rules that have more than the user-specified minimum confidence are called confident.
Problems with Brute Force

This simple algorithm works well with four items, since there were only a total of 16 combinations that we needed to look at, but if the number of items is, say, 100, the number of combinations is much larger, in the billions.

The number of combinations becomes about a million with 20 items, since the number of combinations is 2^n with n items (why?). The naive algorithm can be improved to deal more effectively with larger data sets.
Frequent Itemset Generation

[Lattice of itemsets over items A-E: null at the top; 1-itemsets A-E; 2-itemsets AB-DE; 3-itemsets ABC-CDE; 4-itemsets ABCD-BCDE; ABCDE at the bottom.]

Given d items, there are 2^d possible candidate itemsets.
Problems with Brute Force

Question: Can you think of an improvement over the brute force method in which we looked at every itemset combination possible?

Naive algorithms, even with improvements, don't work efficiently enough to deal with large numbers of items and transactions. We now define a better algorithm.
Improved Brute Force

Rather than counting all possible item combinations, we can look at each transaction and count only the combinations that actually occur, as shown below (that is, we don't count itemsets with zero frequency).

Transaction ID | Items | Combinations
100 | Bread, Cheese | {Bread, Cheese}
200 | Bread, Cheese, Juice | {Bread, Cheese}, {Bread, Juice}, {Cheese, Juice}, {Bread, Cheese, Juice}
300 | Bread, Milk | {Bread, Milk}
400 | Cheese, Juice, Milk | {Cheese, Juice}, {Cheese, Milk}, {Juice, Milk}, {Cheese, Juice, Milk}
Improved Brute Force

Removing zero frequencies leads to the smaller table below. We then proceed as before, but the improvement is not large and we need better techniques.

Itemset | Frequency
Bread | 3
Cheese | 3
Juice | 2
Milk | 2
(Bread, Cheese) | 2
(Bread, Juice) | 1
(Bread, Milk) | 1
(Cheese, Milk) | 1
(Cheese, Juice) | 2
(Juice, Milk) | 1
(Bread, Cheese, Juice) | 1
(Cheese, Juice, Milk) | 1
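A sketch of this per-transaction counting in Python, reusing transactions4 from the earlier block; itertools.combinations enumerates only the itemsets that actually occur:

```python
from itertools import combinations
from collections import Counter

def count_occurring_itemsets(transactions, max_size=3):
    """Count only itemsets that actually occur in some transaction."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for k in range(1, min(max_size, len(items)) + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return counts

for itemset, freq in count_occurring_itemsets(transactions4).items():
    print(itemset, freq)
```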
Frequent Itemset Generation Strategies

Reduce the number of candidates (M):
Complete search: M = 2^d.
Use pruning techniques to reduce M.

Reduce the number of transactions (N):
Reduce the size of N as the size of the itemset increases.

Reduce the number of comparisons (NM):
Use efficient data structures to store the candidates or transactions.
No need to match every candidate against every transaction.
The Apriori algorithm

Probably the best known algorithm. Two steps:
Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
Use frequent itemsets to generate rules.

E.g., a frequent itemset
{Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
Step 1: Mining all frequent itemsets

A frequent itemset is an itemset whose support is ≥ minsup.

Key idea: the apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset.

[Lattice fragment over items A-D illustrating downward closure: 1-itemsets A-D, 2-itemsets AB-CD, 3-itemsets ABC-BCD.]
The Apriori Algorithm

Frequent Itemset Property:
Any subset of a frequent itemset is frequent.

Contrapositive:
If an itemset is not frequent, none of its supersets are frequent.
Reducing Number of Candidates

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.

The Apriori principle holds due to the following property of the support measure: the support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support:

    ∀X, Y: (X ⊆ Y) => s(X) ≥ s(Y)

(A monotonic function, or monotone function, is a function between ordered sets that preserves the given order.)
Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets):
Item | Count
Bread | 4
Coke | 2
Milk | 4
Beer | 3
Diaper | 4
Eggs | 1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
Itemset | Count
{Bread, Milk} | 3
{Bread, Beer} | 2
{Bread, Diaper} | 3
{Milk, Beer} | 2
{Milk, Diaper} | 3
{Beer, Diaper} | 3

Triplets (3-itemsets):
Itemset | Count
{Bread, Milk, Diaper} | 3

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13.
The Apriori Algorithm: Basics

The Apriori Algorithm is an influential algorithm for mining frequent itemsets for boolean association rules.

Key Concepts:
Frequent Itemsets: the sets of items which have minimum support (denoted by Li for the i-th itemset).
Apriori Property: any subset of a frequent itemset must be frequent.
Join Operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
The Apriori Algorithm

To find associations, this classical Apriori algorithm may be simply described by a two-step approach:

Step 1: discover all frequent (single) items that have support above the minimum support required.
Step 2: use the set of frequent items to generate the association rules that have a high enough confidence level.
Apriori Algorithm Terminology

A k-itemset is a set of k items.
The set Ck is a set of candidate k-itemsets that are potentially frequent.
The set Lk is a subset of Ck and is the set of k-itemsets that are frequent.
Step-by-step Apriori Algorithm

Step 1: Computing L1
Scan all transactions. Find all frequent items that have support above the required p%. Let these frequent items be labeled L1.

Step 2: Apriori-gen Function
Use the frequent items L1 to build all possible item pairs, like {Bread, Cheese} if Bread and Cheese are in L1. The set of these item pairs is called C2, the candidate set.
The Apriori Algorithm

Step 3: Pruning
Scan all transactions and find all pairs in the candidate pair set C2 that are frequent. Let these frequent pairs be L2.

Step 4: General rule
A generalization of Step 2. Build the candidate set of k-itemsets, Ck, by combining frequent itemsets in the set Lk-1.
The Apriori Algorithm

Step 5: Pruning
A generalization of Step 3. Scan all transactions and find all itemsets in Ck that are frequent. Let these frequent itemsets be Lk.

Step 6: Continue
Continue with Step 4 unless Lk is empty.

Step 7: Stop
Stop when Lk is empty.
The Apriori Algorithm in a Nutshell

Find the frequent itemsets: the sets of items that have minimum support.
A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} should be frequent itemsets.
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Use the frequent itemsets to generate association rules.
The Apriori Algorithm: Pseudocode

Join Step: Ck is generated by joining Lk-1 with itself.
Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
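For concreteness, here is a compact runnable sketch of this loop in Python. It is a minimal illustration under our own naming, not an optimized implementation, and it reuses transactions4 from the earlier block:

```python
from itertools import combinations
from collections import Counter

def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets.

    transactions: list of sets of items; min_support: absolute count.
    """
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent single items
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([i]) for i, c in counts.items() if c >= min_support}
    frequent = {s: counts[next(iter(s))] for s in current}
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Count candidates by scanning the transactions
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        current = {c for c in candidates if counts[c] >= min_support}
        frequent.update({c: counts[c] for c in current})
        k += 1
    return frequent

# Example: the four-transaction data set with 50% support (count >= 2)
for itemset, count in apriori(transactions4, 2).items():
    print(set(itemset), count)
```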
The Apriori Algorithm: Example

Consider a database, D, consisting of 9 transactions.
Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 = 22%).
Let the minimum confidence required be 70%.
We first have to find the frequent itemsets using the Apriori algorithm.
Then, association rules will be generated using minimum support and minimum confidence.

TID | List of Items
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3
Step 1: Generating 1-itemset Frequent Pattern

In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. Scan D for the count of each candidate:

C1:
Itemset | Sup. Count
{I1} | 6
{I2} | 7
{I3} | 6
{I4} | 2
{I5} | 2

Compare each candidate's support count with the minimum support count. The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support; here every candidate qualifies, so L1 = C1.
Step 2: Generating 2-itemset Frequent Pattern

Generate the C2 candidates from L1 (L1 join L1):
{I1, I2}, {I1, I3}, {I1, I4}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}, {I3, I4}, {I3, I5}, {I4, I5}

Scan D for the count of each candidate:

C2:
Itemset | Sup. Count
{I1, I2} | 4
{I1, I3} | 4
{I1, I4} | 1
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2
{I3, I4} | 0
{I3, I5} | 1
{I4, I5} | 0

Compare candidate support counts with the minimum support count to obtain L2:

L2:
Itemset | Sup. Count
{I1, I2} | 4
{I1, I3} | 4
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2
Step 3: Generating 3-itemset Frequent Pattern [Cont.]

The join L2 ⋈ L2 gives the candidate 3-itemsets {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}.

Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How?

For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3.

Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}. But {I3, I5} is not a member of L2 and hence it is not frequent, violating the Apriori property. Thus we have to remove {I2, I3, I5} from C3.

Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the join operation for pruning.

Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
Step 4: Generating 4-itemset Frequent Pattern

The algorithm uses L3 join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.

Thus C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori algorithm.

What's Next?

These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).
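The whole walkthrough can be checked mechanically with the apriori() sketch from earlier (output order may differ):

```python
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
for itemset, count in sorted(apriori(D, 2).items(), key=lambda kv: len(kv[0])):
    print(sorted(itemset), count)
# Ends with the 3-itemsets ['I1', 'I2', 'I3'] 2 and ['I1', 'I2', 'I5'] 2
```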
Step 5: Generating Association Rules from Frequent Itemsets

Procedure:
For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the rule s → (l − s) if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.

Back to the example:
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
Let's take l = {I1, I2, I5}. Its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
Step 5: Generating Association Rules from Frequent Itemsets [Cont.]

Let the minimum confidence threshold be, say, 70%. The resulting association rules are shown below, each listed with its confidence.

R1: I1 ∧ I2 → I5
Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%. R1 is rejected.

R2: I1 ∧ I5 → I2
Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%. R2 is selected.

R3: I2 ∧ I5 → I1
Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%. R3 is selected.
Step 5: Generating Association Rules from Frequent Itemsets [Cont.]

R4: I1 → I2 ∧ I5
Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%. R4 is rejected.

R5: I2 → I1 ∧ I5
Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%. R5 is rejected.

R6: I5 → I1 ∧ I2
Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%. R6 is selected.

In this way, we have found three strong association rules.
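A minimal sketch of this rule-generation step in Python, reusing apriori() and transactions4 from the earlier blocks (the function name is our own):

```python
from itertools import combinations

def generate_rules(frequent, min_conf=0.7):
    """Yield (antecedent, consequent, confidence) for strong rules."""
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = count / frequent[antecedent]  # subsets are always present
                if conf >= min_conf:
                    yield antecedent, itemset - antecedent, conf

frequent = apriori(transactions4, 2)
for lhs, rhs, conf in generate_rules(frequent, 0.75):
    print(f"{set(lhs)} -> {set(rhs)} (conf {conf:.0%})")
# Prints only: {'Juice'} -> {'Cheese'} (conf 100%)
```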
Another Example

Database TDB (minimum support count Supmin = 2):

Tid | Items
10 | A, B, D
20 | A, C, E
30 | A, B, C, E
40 | C, E

1st scan gives C1:
Itemset | sup
{A} | 3
{B} | 2
{C} | 3
{D} | 1
{E} | 3

L1:
Itemset | sup
{A} | 3
{B} | 2
{C} | 3
{E} | 3

C2 candidates from L1: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan gives C2:
Itemset | sup
{A, B} | 2
{A, C} | 2
{A, E} | 2
{B, C} | 1
{B, E} | 1
{C, E} | 3

L2:
Itemset | sup
{A, B} | 2
{A, C} | 2
{A, E} | 2
{C, E} | 3

C3: {A, C, E}

3rd scan gives L3:
Itemset | sup
{A, C, E} | 2
An Example

This example is similar to the earlier one, but we have added two more items and another transaction, making five transactions and six items. We want to find association rules with 50% support and 75% confidence.

Transaction ID | Items
100 | Bread, Cheese, Eggs, Juice
200 | Bread, Cheese, Juice
300 | Bread, Milk, Yogurt
400 | Bread, Juice, Milk
500 | Cheese, Juice, Milk
Example

First find L1. 50% support requires that each frequent item appear in at least three transactions. Therefore L1 is given by:

Item | Frequency
Bread | 4
Cheese | 3
Juice | 4
Milk | 3
Example

The candidate 2-itemset, C2, therefore has six pairs (why?). These pairs and their frequencies are:

Item Pairs | Frequency
(Bread, Cheese) | 2
(Bread, Juice) | 3
(Bread, Milk) | 2
(Cheese, Juice) | 3
(Cheese, Milk) | 1
(Juice, Milk) | 2

Question: Why did we not select (Eggs, Milk)?
Deriving Rules

L2 has only two frequent item pairs, {Bread, Juice} and {Cheese, Juice}. After these two frequent pairs, there are no candidate 3-itemsets (why?), since we do not have two 2-itemsets that have the same first item (why?).

The two frequent pairs lead to the following possible rules:
Bread → Juice
Juice → Bread
Cheese → Juice
Juice → Cheese
Deriving Rules

The confidence of these rules is obtained by dividing the support for both items in the rule by the support of the item on the left hand side of the rule.

The confidences of the four rules therefore are:
3/4 = 75% (Bread → Juice)
3/4 = 75% (Juice → Bread)
3/3 = 100% (Cheese → Juice)
3/4 = 75% (Juice → Cheese)

Since all of them have a minimum 75% confidence, they all qualify. We are done, since there are no 3-itemsets.
Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.

If {A, B, C, D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD,
BD → AC, CD → AB

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).
Rule Generation

How do we efficiently generate rules from frequent itemsets?

In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D).

But the confidence of rules generated from the same itemset has an anti-monotone property. E.g., for L = {A, B, C, D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.
Rule Generation for Apriori Algorithm

[Lattice of rules for the frequent itemset {A, B, C, D}: ABCD => {} at the top; then BCD => A, ACD => B, ABD => C, ABC => D; then CD => AB, BD => AC, BC => AD, AD => BC, AC => BD, AB => CD; then D => ABC, C => ABD, B => ACD, A => BCD at the bottom. If a rule in the lattice has low confidence, the rules below it can be pruned.]
Rule Generation for Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent.

For example, join(CD => AB, BD => AC) would produce the candidate rule D => ABC.

Prune rule D => ABC if its subset AD => BC does not have high confidence.
Rule generation algorithm

Key fact: moving items from the antecedent to the consequent never changes support, and never increases confidence.

Algorithm:
For each itemset I with minsup:
    Find all minconf rules with a single consequent, of the form I − L1 → L1.
    Repeat:
        Guess candidate consequents Ck by appending items from I − Lk-1 to Lk-1.
        Verify the confidence of each rule I − Ck → Ck using known itemset support values.
Association Rules

Consider:
AB → C (90% confidence)
and A → C (92% confidence)

Clearly the first rule is of no use. We should look for more complex rules only if they are better than simple rules.
Pattern Evaluation

Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant. A rule is redundant if, for example, {A,B,C} → {D} and {A,B} → {D} have the same support and confidence.

Interestingness measures can be used to prune/rank the derived patterns.

In the original formulation of association rules, support and confidence are the only measures used.
Association & Correlation

As we can see, the support-confidence framework can be misleading; it can identify a rule (A => B) as interesting (strong) when, in fact, the occurrence of A might not imply the occurrence of B.

Correlation analysis provides an alternative framework for finding interesting relationships, or to improve understanding of the meaning of some association rules (the lift of an association rule).
Computing Interestingness Measure

Given a rule X → Y, the information needed to compute the rule's interestingness can be obtained from a contingency table.

Contingency table for X → Y:

       |  Y  | ¬Y  |
X      | f11 | f10 | f1+
¬X     | f01 | f00 | f0+
       | f+1 | f+0 | |T|

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.
Drawback of Confidence

       | Coffee | ¬Coffee |
Tea    |   15   |    5    |  20
¬Tea   |   75   |    5    |  80
       |   90   |   10    | 100

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 0.75,
but P(Coffee) = 0.9.

Although confidence is high, the rule is misleading:
P(Coffee|¬Tea) = 0.9375.
Correlation Concepts

Two itemsets A and B are independent (the occurrence of A is independent of the occurrence of itemset B) iff
P(A ∪ B) = P(A) · P(B)

Otherwise A and B are dependent and correlated.

The measure of correlation between A and B is given by the formula:
Corr(A, B) = P(A ∪ B) / (P(A) · P(B))
Correlation Concepts [Cont.]

corr(A, B) > 1 means that A and B are positively correlated, i.e. the occurrence of one implies the occurrence of the other.

corr(A, B) < 1 means that the occurrence of A is negatively correlated with (or discourages) the occurrence of B.

corr(A, B) = 1 means that A and B are independent and there is no correlation between them.
Statistical Independence

Population of 1000 students:
600 students know how to swim (S)
700 students know how to bike (B)
420 students know how to swim and bike (S, B)

P(S ∧ B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42

P(S ∧ B) = P(S) × P(B) => statistical independence
P(S ∧ B) > P(S) × P(B) => positively correlated
P(S ∧ B) < P(S) × P(B) => negatively correlated
Association & Correlation

The correlation formula can be re-written as:
Corr(A, B) = P(B|A) / P(B)

We already know that:
Support(A => B) = P(A ∪ B)
Confidence(A => B) = P(B|A)

That means that:
Confidence(A => B) = corr(A, B) × P(B)

So correlation, support and confidence are all different, but correlation provides extra information about the association rule (A => B).

We say that the correlation corr(A, B) provides the LIFT of the association rule (A => B), i.e. A is said to increase (or LIFT) the likelihood of B by the factor returned by the formula for corr(A, B). It measures how much more likely the RHS is given the LHS than the RHS alone.
Statistical-based Measures

Measures that take into account statistical dependence:

    Lift = P(Y|X) / P(Y)

    Interest = P(X, Y) / (P(X) P(Y))

    PS = P(X, Y) − P(X) P(Y)

    φ-coefficient = (P(X, Y) − P(X) P(Y)) / sqrt( P(X)[1 − P(X)] · P(Y)[1 − P(Y)] )
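These measures are easy to compute from the contingency-table counts. A small sketch, with argument names following the f-notation above (applied here to the earlier Tea/Coffee table):

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    """Compute lift/interest, PS and the phi-coefficient from counts."""
    n = f11 + f10 + f01 + f00
    p_x, p_y, p_xy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    interest = p_xy / (p_x * p_y)          # equals lift
    ps = p_xy - p_x * p_y
    phi = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return interest, ps, phi

print(measures(15, 5, 75, 5))  # Tea/Coffee table: lift = 0.833, PS = -0.03, phi = -0.25
```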
Example: Lift/Interest

       | Coffee | ¬Coffee |
Tea    |   15   |    5    |  20
¬Tea   |   75   |    5    |  80
       |   90   |   10    | 100

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9.

Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated).
Rule Evaluation - Lift

Transaction No. | Item 1 | Item 2 | Item 3 | Item 4
100 | Beer | Diaper | Chocolate |
101 | Milk | Chocolate | Shampoo |
102 | Beer | Milk | Vodka | Chocolate
103 | Beer | Milk | Diaper | Chocolate
104 | Milk | Diaper | Beer |

What are the support and confidence for the rule {Chocolate} → {Milk}?
Support = 3/5; Confidence = 3/4.
Very high support and confidence.

Does Chocolate really lead to Milk purchases?
No! Milk occurs in 4 out of 5 transactions. Chocolate is even decreasing the chance of a Milk purchase (3/4 < 4/5).
Lift = (3/4)/(4/5) = 0.9375 < 1
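A quick sketch of this lift computation, reusing support() and confidence() from the first code block (the basket encoding mirrors the table above):

```python
def lift(X, Y, transactions):
    """Lift of X -> Y: confidence divided by the baseline P(Y)."""
    return confidence(X, Y, transactions) / support(Y, transactions)

baskets = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Milk", "Vodka", "Chocolate"},
    {"Beer", "Milk", "Diaper", "Chocolate"},
    {"Milk", "Diaper", "Beer"},
]
print(lift({"Chocolate"}, {"Milk"}, baskets))  # 0.9375 -> negatively associated
```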
There are lots of measures proposed in the literature.

Some measures are good for certain applications, but not for others.

What criteria should we use to determine whether a measure is good or bad?
Efficiency

Consider an example of a supermarket database which might have several thousand items, including 1000 frequent items, and several million transactions.

Which part of the Apriori algorithm will be the most expensive to compute? Why?
Efficiency

The algorithm to construct the candidate set Ci to find the frequent set Li is crucial to the performance of the Apriori algorithm. The larger the candidate set, the higher the processing cost of discovering the frequent itemsets, since the transactions must be scanned for each. Given that the numbers of early candidate itemsets are very large, the initial iterations dominate the cost.

In a supermarket database with about 1000 frequent items, there will be almost half a million candidate pairs C2 that need to be searched to find L2.
Efficiency

Generally, the number of frequent pairs out of half a million pairs will be small, and therefore (why?) the number of 3-itemsets should be small.

Therefore, it is the generation of the frequent set L2 that is the key to improving the performance of the Apriori algorithm.
Comment

In class examples we usually require high support, for example, 25%, 30% or even 50%. These support values are very high if the number of items and the number of transactions are large. For example, 25% support in a supermarket transaction database means searching for items that have been purchased by one in four customers! Not many items would probably qualify.

Practical applications therefore deal with much smaller support, sometimes even lower.
Transactions Storage

Consider only eight transactions, with transaction IDs {10, 20, 30, 40, 50, 60, 70, 80}. This set of eight transactions with six items can be represented in at least three different ways, as follows.

Transaction ID | Items
10 | A, B, D
20 | D, E, F
30 | A, F
40 | B, C, D
50 | E, F
60 | D, E, F
70 | C, D, F
80 | A, C, D, F
Horizontal Representation (one row per transaction, one column per item)

TID | A | B | C | D | E | F
10  | 1 | 1 | 0 | 1 | 0 | 0
20  | 0 | 0 | 0 | 1 | 1 | 1
30  | 1 | 0 | 0 | 0 | 0 | 1
40  | 0 | 1 | 1 | 1 | 0 | 0
50  | 0 | 0 | 0 | 0 | 1 | 1
60  | 0 | 0 | 0 | 1 | 1 | 1
70  | 0 | 0 | 1 | 1 | 0 | 1
80  | 1 | 0 | 1 | 1 | 0 | 1
Vertical Representation (one row per item, one column per transaction)

Item | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
A    | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1
B    | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0
C    | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1
D    | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1
E    | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0
F    | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1
Bigger Example

TID | Items
1 | Biscuits, Bread, Cheese, Coffee, Yogurt
2 | Bread, Cereal, Cheese, Coffee
3 | Cheese, Chocolate, Donuts, Juice, Milk
4 | Bread, Cheese, Coffee, Cereal, Juice
5 | Bread, Cereal, Chocolate, Donuts, Juice
6 | Milk, Tea
7 | Biscuits, Bread, Cheese, Coffee, Milk
8 | Eggs, Milk, Tea
9 | Bread, Cereal, Cheese, Chocolate, Coffee
10 | Bread, Cereal, Chocolate, Donuts, Juice
11 | Bread, Cheese, Juice
12 | Bread, Cheese, Coffee, Donuts, Juice
13 | Biscuits, Bread, Cereal
14 | Cereal, Cheese, Chocolate, Donuts, Juice
15 | Chocolate, Coffee
16 | Donuts
17 | Donuts, Eggs, Juice
18 | Biscuits, Bread, Cheese, Coffee
19 | Bread, Cereal, Chocolate, Donuts, Juice
20 | Cheese, Chocolate, Donuts, Juice
21 | Milk, Tea, Yogurt
22 | Bread, Cereal, Cheese, Coffee
23 | Chocolate, Donuts, Juice, Milk, Newspaper
24 | Newspaper, Pastry, Rolls
25 | Rolls, Sugar, Tea
Frequency of Items

Item No | Item name | Frequency
1 | Biscuits | 4
2 | Bread | 13
3 | Cereal | 10
4 | Cheese | 11
5 | Chocolate | 9
6 | Coffee | 9
7 | Donuts | 10
8 | Eggs | 2
9 | Juice | 11
10 | Milk | 6
11 | Newspaper | 2
12 | Pastry | 1
13 | Rolls | 2
14 | Sugar | 1
15 | Tea | 4
16 | Yogurt | 2
Frequent Items

Assume 25% support. In 25 transactions, a frequent item must occur in at least 7 transactions. The frequent 1-itemset, or L1, is given below. How many candidates are in C2? List them.

Item | Frequency
Bread | 13
Cereal | 10
Cheese | 11
Chocolate | 9
Coffee | 9
Donuts | 10
Juice | 11
L2

The following pairs are frequent. Now find C3, and then L3 and the rules.

Frequent 2-itemset | Frequency
{Bread, Cereal} | 9
{Bread, Cheese} | 8
{Bread, Coffee} | 8
{Cheese, Coffee} | 9
{Chocolate, Donuts} | 7
{Chocolate, Juice} | 7
{Donuts, Juice} | 9
Rules

The full set of rules is given below. Could some rules be removed?

Cheese → Bread
Cheese → Coffee
Coffee → Bread
Coffee → Cheese
Cheese, Coffee → Bread
Bread, Coffee → Cheese
Bread, Cheese → Coffee
Chocolate → Donuts
Chocolate → Juice
Donuts → Chocolate
Donuts → Juice
Donuts, Juice → Chocolate
Chocolate, Juice → Donuts
Chocolate, Donuts → Juice
Bread → Cereal
Cereal → Bread

Study the above rules carefully.
Improving the Apriori Algorithm

Many techniques for improving the efficiency have been proposed:
Pruning (already mentioned)
Hashing-based technique
Transaction reduction
Partitioning
Sampling
Dynamic itemset counting
Pruning

Pruning can reduce the size of the candidate set Ck. We want to transform Ck into a set of frequent itemsets Lk. To reduce the work of checking, we may use the rule that all subsets of the candidate itemsets in Ck must also be frequent.
Example

Suppose the items are A, B, C, D, E, F, ..., X, Y, Z.

Suppose L1 is A, C, E, P, Q, S, T, V, W, X.

Suppose L2 is {A, C}, {A, F}, {A, P}, {C, P}, {E, P}, {E, G}, {E, V}, {H, J}, {K, M}, {Q, S}, {Q, X}.

Are you able to identify errors in the L2 list?
What is C3?
How do we prune C3?

C3 is {A, C, P}, {E, P, V}, {Q, S, X}.
Hash Tables

Hash tables are a common approach to the storing/searching problem.
What is a Hash Table?

The simplest kind of hash table is an array of records. This example has 701 records, at indices [0] through [700].

Each record has a special field, called its key. In this example, the key is a long integer field called Number (e.g. Number 506643548).

The number might be a person's identification number, and the rest of the record has information about the person.

When a hash table is in use, some spots contain valid records, and other spots are "empty".
Inserting a New Record

In order to insert a new record, the key must somehow be converted to an array index. The index is called the hash value of the key.

A typical way to create a hash value: (Number mod 701).
What is (580625685 mod 701)? It is 3, so the hash value is used for the location of the new record: it is placed at index [3].
Collisions

Here is another new record to insert (Number 701466868), with a hash value of 2. This is called a collision, because there is already another valid record at [2].

When a collision occurs, move forward until you find an empty spot. The new record goes in the empty spot.
Searching for a Key

The data that's attached to a key can be found fairly quickly.

Calculate the hash value and check that location of the array for the key. Keep moving forward until you find the key, or you reach an empty spot.

When the item is found, the information can be copied to the necessary location.
Deleting a Record

Records may also be deleted from a hash table. But the location must not be left as an ordinary "empty spot", since that could interfere with searches.

The location must be marked in some special way, so that a search can tell that the spot used to have something in it.
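A minimal sketch of such an open-addressing hash table in Python, using linear probing and a tombstone marker for deletions. The table size of 701 matches the slides; the class and method names are illustrative choices:

```python
EMPTY, DELETED = object(), object()  # sentinels: never-used slot vs tombstone

class LinearProbingTable:
    """Open-addressing hash table with linear probing."""
    def __init__(self, size=701):
        self.slots = [EMPTY] * size

    def _hash(self, key):
        return key % len(self.slots)

    def insert(self, key, value):
        i = self._hash(key)
        while self.slots[i] not in (EMPTY, DELETED):
            i = (i + 1) % len(self.slots)   # collision: move forward
        self.slots[i] = (key, value)

    def search(self, key):
        i = self._hash(key)
        while self.slots[i] is not EMPTY:   # DELETED slots don't stop the probe
            if self.slots[i] is not DELETED and self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        return None

    def delete(self, key):
        i = self._hash(key)
        while self.slots[i] is not EMPTY:
            if self.slots[i] is not DELETED and self.slots[i][0] == key:
                self.slots[i] = DELETED     # mark, don't leave an empty spot
                return
            i = (i + 1) % len(self.slots)

table = LinearProbingTable()
table.insert(580625685, "record A")   # hashes to 3, as in the slides
table.insert(701466868, "record B")   # hashes to 2, as in the slides
print(table.search(701466868))        # record B
```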
Example

Consider the transaction database in the first table below, used in an earlier example. The second table shows all possible 2-itemsets for each transaction.

Transaction ID | Items
100 | Bread, Cheese, Eggs, Juice
200 | Bread, Cheese, Juice
300 | Bread, Milk, Yogurt
400 | Bread, Juice, Milk
500 | Cheese, Juice, Milk

TID | 2-itemsets
100 | (B, C) (B, E) (B, J) (C, E) (C, J) (E, J)
200 | (B, C) (B, J) (C, J)
300 | (B, M) (B, Y) (M, Y)
400 | (B, J) (B, M) (J, M)
500 | (C, J) (C, M) (J, M)
Hashing Example

The possible 2-itemsets in the last table are now hashed to the hash table below. The last column is not required in the hash table, but we have included it to explain the technique.

Bit vector | Bucket number | Count | Pairs
1 | 0 | 3 | (C, J) (B, Y) (M, Y) (C, J)
0 | 1 | 1 | (C, M)
0 | 2 | 1 | (E, J)
0 | 3 | 0 |
0 | 4 | 2 | (B, C)
1 | 5 | 3 | (B, E) (J, M) (J, M)
1 | 6 | 3 | (B, J) (B, J)
1 | 7 | 3 | (C, E) (B, M) (B, M)
Find C2

The major aim of the algorithm is to reduce the size of C2. It is therefore essential that the hash table is large enough so that collisions are low. Collisions result in loss of effectiveness of the hash table. This is what happened in the example, in which we had collisions in three of the eight rows of the hash table, which required us to find which pair was frequent.
Transaction Reduction

As discussed earlier, any transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, and such a transaction may be marked or removed.
Partitioning

The set of transactions may be divided into a number of disjoint subsets. Then each partition is searched for frequent itemsets. These frequent itemsets are called local frequent itemsets.
Partitioning

Phase 1:
Divide the n transactions into m partitions.
Find the frequent itemsets in each partition.
Combine all local frequent itemsets to form the candidate itemsets.

Phase 2:
Find the global frequent itemsets.
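A sketch of this two-phase idea, reusing the apriori() function from the earlier block. The partition count, thresholds and function name are illustrative assumptions; the sketch relies on the fact that any globally frequent itemset must be locally frequent in at least one partition:

```python
def partitioned_apriori(transactions, min_support_frac, m=2):
    """Phase 1: local frequent itemsets per partition;
    Phase 2: count the combined candidates over the whole database."""
    size = (len(transactions) + m - 1) // m
    candidates = set()
    for i in range(0, len(transactions), size):
        part = transactions[i:i + size]
        local_min = max(1, int(min_support_frac * len(part)))
        candidates |= set(apriori(part, local_min))  # keys are frozensets
    global_min = min_support_frac * len(transactions)
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= global_min}
```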
Sampling

A random sample (usually large enough to fit in main memory) may be obtained from the overall set of transactions, and the sample is searched for frequent itemsets. These frequent itemsets are called sample frequent itemsets.

This is not guaranteed to be accurate: we sacrifice accuracy for efficiency. A lower support threshold may be used for the sample to ensure we do not miss any frequent itemsets.

The actual frequencies of the sample frequent itemsets are then obtained.

More than one sample could be used to improve accuracy.
Exercise

Given the following list of transactions:

Transaction No. | Item 1 | Item 2 | Item 3 | Item 4
100 | Beer | Diaper | Chocolate |
101 | Milk | Chocolate | Shampoo |
102 | Beer | Soap | Vodka |
103 | Beer | Cheese | Wine |
104 | Milk | Diaper | Beer | Chocolate

1) Find all the frequent itemsets (minimum support 40%).
2) Find all the association rules (minimum confidence 70%).
3) For the discovered association rules, calculate the lift.