Association Rule -data mining


ASSOCIATION RULE MINING

Introduction

As noted earlier, a huge amount of data is stored electronically in many retail outlets due to bar-coding of the goods sold. It is natural to try to find some useful information from these mountains of data. A conceptually simple yet interesting technique is to find association rules from these large databases. The problem was introduced by Rakesh Agrawal at IBM.

Association rule mining (or market basket analysis) searches for interesting customer habits by looking at associations. It has applications in marketing, store layout, customer segmentation, medicine, finance, and many more areas.

Question: Suppose you are able to find that two products x and y (say bread and milk, or crackers and cheese) are frequently bought together. How can you use that information?

Applications

Market Basket Analysis: given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items which are frequently purchased together.
Telecommunication (each customer is a transaction containing the set of phone calls)
Credit Cards / Banking Services (each card/account is a transaction containing the set of customer payments)
Medical Treatments (each patient is represented as a transaction containing the ordered set of diseases)

A Simple Example

Consider the following ten transactions of a shop selling only nine items. We will return to it after explaining some terminology.

Transaction ID  Items
10   Bread, Cheese, Newspaper
20   Bread, Cheese, Juice
30   Bread, Milk
40   Cheese, Juice, Milk, Coffee
50   Sugar, Tea, Coffee, Biscuits, Newspaper
60   Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70   Bread, Cheese
80   Bread, Cheese, Juice, Coffee
90   Bread, Milk
100  Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper

Terminology

Let the number of different items sold be n.
Let the number of transactions be N.
Let the set of items be I = {i1, i2, …, in}. The number of items may be large, perhaps several thousand.
Let the set of transactions be T = {t1, t2, …, tN}. Each transaction ti contains a subset of items from the itemset {i1, i2, …, in}. These are the things a customer buys when they visit the supermarket. N is assumed to be large, perhaps in the millions.
We do not consider the quantities of items bought.

We want to find groups of items that tend to occur together frequently. Association rules are often written as X → Y, meaning that whenever X appears, Y also tends to appear. X and Y may be single items or sets of items, but the same item does not appear in both.

Suppose X and Y appear together in only 1% of the transactions, but whenever X appears there is an 80% chance that Y also appears. The 1% presence of X and Y together is called the support (or prevalence) of the rule, and 80% is called the confidence (or predictability) of the rule. These are measures of the interestingness of the rule.

Confidence denotes the strength of the association between X and Y. Support indicates the frequency of the pattern. A minimum support is necessary if an association is going to be of some business value.

If the chance of finding an item X in the N transactions is x%, then we can say the probability of X is P(X) = x/100.

Question: Why is minimum support necessary? Why is minimum confidence necessary?

Transaction data: supermarket data

Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
…
tn: {biscuit, eggs, milk}

Concepts:
An item: an item/article in a basket
I: the set of all items sold in the store
A transaction: the items purchased in a basket; it may have a TID (transaction ID)
A transactional dataset: a set of transactions

Transaction data: a set of documents

A text document data set. Each document is treated as a bag of keywords:

    doc1: Student, Teach, School

    doc2: Student, School

    doc3: Teach, School, City, Game

    doc4: Baseball, Basketball

    doc5: Basketball, Player, Spectator

    doc6: Baseball, Coach, Game, Team

    doc7: Basketball, Team, City, Game

The model: rules

A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
An association rule is an implication of the form X → Y, where X, Y ⊂ I and X ∩ Y = ∅.
An itemset is a set of items. E.g., X = {milk, bread, cereal} is an itemset.
A k-itemset is an itemset with k items. E.g., {milk, bread, cereal} is a 3-itemset.

Rule strength measures

Support: the rule holds with support sup in T (the transaction data set) if sup% of transactions contain X ∪ Y.
Confidence: the rule holds in T with confidence conf if conf% of transactions that contain X also contain Y.
An association rule is a pattern that states that when X occurs, Y occurs with a certain probability.

Support and Confidence

Support count: the support count of an itemset X, denoted by X.count, in a data set T is the number of transactions in T that contain X. Assume T has n transactions. Then,

support = (X ∪ Y).count / n
confidence = (X ∪ Y).count / X.count

Suppose we know P(X ∪ Y); then what is the probability of Y appearing if we know that X already exists? It is written as P(Y|X).
The support for X → Y is the probability of both X and Y appearing together, that is P(X ∪ Y).
The confidence of X → Y is the conditional probability of Y appearing given that X exists. It is written as P(Y|X) and read as "P of Y given X".
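To make these definitions concrete, here is a minimal Python sketch (the function name and the small transaction list are illustrative, not from the slides) that computes the support and confidence of a rule X → Y over a list of transactions.

def support_and_confidence(transactions, X, Y):
    """Return (support, confidence) of the rule X -> Y.

    support    = count(X and Y together) / N
    confidence = count(X and Y together) / count(X)
    """
    X, Y = set(X), set(Y)
    n = len(transactions)
    count_x = sum(1 for t in transactions if X <= set(t))
    count_xy = sum(1 for t in transactions if (X | Y) <= set(t))
    support = count_xy / n
    confidence = count_xy / count_x if count_x else 0.0
    return support, confidence

# Illustrative use with a few of the shop transactions shown earlier
transactions = [
    {"Bread", "Cheese", "Newspaper"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk", "Coffee"},
]
print(support_and_confidence(transactions, {"Bread"}, {"Cheese"}))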

Association Rule Problem

Given:
a set I of all the items;
a database T of transactions;
a minimum support p;
a minimum confidence q.

Find:
all association rules X → Y with minimum support p and confidence q.

Problem Decomposition

1. Find all sets of items that have minimum support (frequent itemsets).
2. Use the frequent itemsets to generate the desired rules.

The task

We want to find all associations which have at least p% support and at least q% confidence, such that:
all rules satisfying any user constraints are found;
the rules are found efficiently from large databases;
the rules found are actionable.

Many mining algorithms

There are a large number of them!
They use different strategies and data structures.
Their resulting sets of rules are all the same: given a transaction data set T, a minimum support and a minimum confidence, the set of association rules existing in T is uniquely determined.
Any algorithm should find the same set of rules, although their computational efficiencies and memory requirements may differ.
We study only one: the Apriori algorithm.

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold

Brute-force approach:
List all possible association rules.
Compute the support and confidence for each rule.
Prune rules that fail the minsup and minconf thresholds.

Example

We have 9 items and 10 transactions. Find the support and confidence for each pair. How many pairs are there?

Transaction ID  Items
10   Bread, Cheese, Newspaper
20   Bread, Cheese, Juice
30   Bread, Milk
40   Cheese, Juice, Milk, Coffee
50   Sugar, Tea, Coffee, Biscuits, Newspaper
60   Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70   Bread, Cheese
80   Bread, Cheese, Juice, Coffee
90   Bread, Milk
100  Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper

Example

There are 9 × 8 = 72 ordered pairs, half of them duplicates. 36 is too many to analyse, so we use an even simpler example with n = 4 (Bread, Cheese, Juice, Milk) and N = 4. We want to find rules with a minimum support of 50% and a minimum confidence of 75%.

Transaction ID  Items
100  Bread, Cheese
200  Bread, Cheese, Juice
300  Bread, Milk
400  Cheese, Juice, Milk

Example

N = 4 results in the following frequencies. All items have the 50% support we require. Only two pairs have the 50% support, and no 3-itemset has the support. We will look at the two pairs that have the minimum support to find out whether they have the required confidence.

Itemset  Frequency
Bread  3
Cheese  3
Juice  2
Milk  2
(Bread, Cheese)  2
(Bread, Juice)  1
(Bread, Milk)  1
(Cheese, Juice)  2
(Cheese, Milk)  1
(Juice, Milk)  1
(Bread, Cheese, Juice)  1
(Bread, Cheese, Milk)  0
(Bread, Juice, Milk)  0
(Cheese, Juice, Milk)  1
(Bread, Cheese, Juice, Milk)  0

Terminology

Items or itemsets that have the minimum support are called frequent. In our example, all four items and two pairs are frequent.
We will now determine whether the two pairs {Bread, Cheese} and {Cheese, Juice} lead to association rules with 75% confidence.
Every pair {A, B} can lead to two rules, A → B and B → A, if both satisfy the minimum confidence. The confidence of A → B is given by the support for A and B together divided by the support of A.

Rules

We have four possible rules, and their confidences are as follows:
Bread → Cheese with confidence of 2/3 = 67%
Cheese → Bread with confidence of 2/3 = 67%
Cheese → Juice with confidence of 2/3 = 67%
Juice → Cheese with confidence of 2/2 = 100%
Therefore only the last rule, Juice → Cheese, has confidence above the minimum of 75% and qualifies. Rules that have more than the user-specified minimum confidence are called confident.

Problems with Brute Force

This simple algorithm works well with four items, since there were only a total of 16 combinations that we needed to look at, but if the number of items is, say, 100, the number of combinations is astronomically larger.
The number of combinations becomes about a million with 20 items, since the number of combinations is 2^n with n items (why?). The naive algorithm can be improved to deal more effectively with larger data sets.

Frequent Itemset Generation

[Itemset lattice over five items A–E: the null set, all 1-itemsets, 2-itemsets, 3-itemsets, 4-itemsets, and ABCDE.]
Given d items, there are 2^d possible candidate itemsets.

Problems with Brute Force

Question: Can you think of an improvement over the brute-force method, in which we looked at every possible itemset combination?
Naive algorithms, even with improvements, don't work efficiently enough to deal with large numbers of items and transactions. We now define a better algorithm.

Improved Brute Force

Rather than counting all possible item combinations, we can look at each transaction and count only the combinations that actually occur, as shown below (that is, we don't count itemsets with zero frequency).

Transaction ID  Items  Combinations
100  Bread, Cheese  {Bread, Cheese}
200  Bread, Cheese, Juice  {Bread, Cheese}, {Bread, Juice}, {Cheese, Juice}, {Bread, Cheese, Juice}
300  Bread, Milk  {Bread, Milk}
400  Cheese, Juice, Milk  {Cheese, Juice}, {Cheese, Milk}, {Juice, Milk}, {Cheese, Juice, Milk}

Improved Brute Force

Removing zero frequencies leads to the smaller table below. We then proceed as before, but the improvement is not large and we need better techniques.

Itemset  Frequency
Bread  3
Cheese  3
Juice  2
Milk  2
(Bread, Cheese)  2
(Bread, Juice)  1
(Bread, Milk)  1
(Cheese, Milk)  1
(Cheese, Juice)  2
(Juice, Milk)  1
(Bread, Cheese, Juice)  1
(Cheese, Juice, Milk)  1
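The counting scheme above can be sketched in a few lines of Python, enumerating only the item combinations that actually occur in each transaction and tallying them (a sketch; only standard-library modules are used):

from collections import Counter
from itertools import combinations

def count_occurring_itemsets(transactions):
    """Count every itemset that actually occurs in some transaction,
    so itemsets with zero frequency are never materialised."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return counts

txns = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
for itemset, freq in sorted(count_occurring_itemsets(txns).items()):
    print(itemset, freq)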

Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
  Complete search: M = 2^d
  Use pruning techniques to reduce M
Reduce the number of transactions (N)
  Reduce the size of N as the size of the itemset increases
Reduce the number of comparisons (NM)
  Use efficient data structures to store the candidates or transactions
  No need to match every candidate against every transaction

The Apriori algorithm

Probably the best-known algorithm. Two steps:
1. Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
2. Use frequent itemsets to generate rules.

E.g., a frequent itemset
  {Chicken, Clothes, Milk}  [sup = 3/7]
and one rule from the frequent itemset
  Clothes → Milk, Chicken  [sup = 3/7, conf = 3/3]

Step 1: Mining all frequent itemsets

A frequent itemset is an itemset whose support is ≥ minsup.
Key idea: the Apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset.
[Lattice over items A–D illustrating that the subsets of a frequent itemset are themselves frequent.]

The Apriori Algorithm

Frequent Itemset Property: any subset of a frequent itemset is frequent.
Contrapositive: if an itemset is not frequent, none of its supersets are frequent.

Reducing the Number of Candidates

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
The Apriori principle holds due to the following property of the support measure: the support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.
(A monotonic function, or monotone function, is a function between ordered sets that preserves the given order.)

∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)


Illustrating the Apriori Principle

Minimum Support = 3

Items (1-itemsets):
Item  Count
Bread  4
Coke  2
Milk  4
Beer  3
Diaper  4
Eggs  1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset  Count
{Bread, Milk}  3
{Bread, Beer}  2
{Bread, Diaper}  3
{Milk, Beer}  2
{Milk, Diaper}  3
{Beer, Diaper}  3

Triplets (3-itemsets):
Itemset  Count
{Bread, Milk, Diaper}  3

If every subset is considered, 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning, 6 + 6 + 1 = 13.

The Apriori Algorithm: Basics

The Apriori algorithm is an influential algorithm for mining frequent itemsets for boolean association rules.
Key concepts:
Frequent itemsets: the sets of items that have minimum support (denoted by Lk for the k-itemsets).
Apriori property: any subset of a frequent itemset must be frequent.
Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.

The Apriori Algorithm

To find associations, this classical Apriori algorithm may be simply described by a two-step approach:
Step 1: discover all frequent itemsets that have support above the minimum support required.
Step 2: use the set of frequent itemsets to generate the association rules that have a high enough confidence level.

Apriori Algorithm Terminology

A k-itemset is a set of k items.
The set Ck is a set of candidate k-itemsets that are potentially frequent.
The set Lk is a subset of Ck and is the set of k-itemsets that are frequent.

Step-by-step Apriori Algorithm

Step 1: Computing L1
Scan all transactions. Find all frequent items that have support above the required p%. Let these frequent items be labeled L1.

Step 2: Apriori-gen Function
Use the frequent items in L1 to build all possible item pairs, like {Bread, Cheese} if Bread and Cheese are in L1. The set of these item pairs is called C2, the candidate set.

Step 3: Pruning
Scan all transactions and find all pairs in the candidate pair set C2 that are frequent. Let these frequent pairs be L2.

Step 4: General rule
A generalization of Step 2. Build the candidate set of k-itemsets, Ck, by combining frequent itemsets in the set Lk-1.

Step 5: Pruning
A generalization of Step 3. Scan all transactions and find all itemsets in Ck that are frequent. Let these frequent itemsets be Lk.

Step 6: Continue
Continue with Step 4 unless Lk is empty.

Step 7: Stop
Stop when Lk is empty.

The Apriori Algorithm in a Nutshell

Find the frequent itemsets: the sets of items that have minimum support.
A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Use the frequent itemsets to generate association rules.

The Apriori Algorithm: Pseudocode

Join Step: Ck is generated by joining Lk-1 with itself.
Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
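A compact Python rendering of this pseudocode follows. It is a sketch under the same join/prune/count structure; the function and variable names are my own, not from the slides.

from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: return {frequent itemset (frozenset): support count}."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent single items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {iset: c for iset, c in counts.items() if c / n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets
        prev = list(current)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:
                    # Prune step: every (k-1)-subset must itself be frequent
                    if all(frozenset(s) in current for s in combinations(union, k - 1)):
                        candidates.add(union)
        # Count candidates by scanning all transactions
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Example with the four-transaction data set used earlier (50% support)
txns = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
for itemset, count in sorted(apriori(txns, 0.5).items(),
                             key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), count)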

The Apriori Algorithm: Example

Consider a database, D, consisting of 9 transactions.
Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 = 22%).
Let the minimum confidence required be 70%.
We first have to find the frequent itemsets using the Apriori algorithm. Then, association rules will be generated using the minimum support and minimum confidence.

TID   List of Items
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3

Step 1: Generating the 1-itemset Frequent Pattern

Scan D for the count of each candidate (C1), then compare each candidate support count with the minimum support count to obtain L1.

Itemset  Sup. Count
{I1}  6
{I2}  7
{I3}  6
{I4}  2
{I5}  2

In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support. (Here C1 and L1 contain the same itemsets, since every item meets the minimum support count of 2.)

Step 2: Generating the 2-itemset Frequent Pattern

Generate the C2 candidates from L1, scan D for the count of each candidate, then compare the candidate support counts with the minimum support count to obtain L2.

C2 (with support counts):
{I1, I2}  4
{I1, I3}  4
{I1, I4}  1
{I1, I5}  2
{I2, I3}  4
{I2, I4}  2
{I2, I5}  2
{I3, I4}  0
{I3, I5}  1
{I4, I5}  0

L2:
{I1, I2}  4
{I1, I3}  4
{I1, I5}  2
{I2, I3}  4
{I2, I4}  2
{I2, I5}  2

Step 3: Generating the 3-itemset Frequent Pattern [Cont.]

Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How?
For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3.
Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. The 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}. But {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori property. Thus we have to remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the join operation for pruning.
Now the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.

Step 4: Generating the 4-itemset Frequent Pattern

The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.
Thus C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes the Apriori algorithm.

What's next?
These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).

Step 5: Generating Association Rules from Frequent Itemsets

Procedure:
For each frequent itemset l, generate all nonempty proper subsets of l.
For every nonempty subset s of l, output the rule s → (l − s) if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.

Back to the example:
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}. Let's take l = {I1, I2, I5}.
Its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
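This procedure can be sketched directly in Python. The helper below and the small dictionary of support counts (only the itemsets needed for this illustration) are my own names, not from the slides.

from itertools import combinations

def generate_rules(frequent, min_conf):
    """Generate rules s -> (l - s) from every frequent itemset l.

    `frequent` maps frozenset itemsets to their support counts."""
    rules = []
    for l, l_count in frequent.items():
        if len(l) < 2:
            continue
        # all nonempty proper subsets of l
        for size in range(1, len(l)):
            for s in combinations(l, size):
                s = frozenset(s)
                conf = l_count / frequent[s]
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

# Illustrative support counts taken from the 9-transaction example above
sc = {frozenset(k): v for k, v in {
    ("I1",): 6, ("I2",): 7, ("I5",): 2,
    ("I1", "I2"): 4, ("I1", "I5"): 2, ("I2", "I5"): 2,
    ("I1", "I2", "I5"): 2}.items()}
for lhs, rhs, conf in generate_rules(sc, 0.7):
    print(lhs, "->", rhs, f"conf={conf:.0%}")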

Step 5: Generating Association Rules from Frequent Itemsets [Cont.]

Let the minimum confidence threshold be, say, 70%. The resulting association rules are shown below, each listed with its confidence.

R1: I1 ∧ I2 → I5
  Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%. R1 is rejected.
R2: I1 ∧ I5 → I2
  Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%. R2 is selected.
R3: I2 ∧ I5 → I1
  Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%. R3 is selected.
R4: I1 → I2 ∧ I5
  Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
R5: I2 → I1 ∧ I5
  Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
R6: I5 → I1 ∧ I2
  Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%. R6 is selected.

In this way, we have found three strong association rules.

Another Example

Database TDB (Supmin = 2):
Tid  Items
10   A, B, D
20   A, C, E
30   A, B, C, E
40   C, E

1st scan: C1 → L1
Itemset  sup
{A}  3
{B}  2
{C}  3
{D}  1
{E}  3
L1: {A} 3, {B} 2, {C} 3, {E} 3

2nd scan: C2 → L2
Itemset  sup
{A, B}  2
{A, C}  2
{A, E}  2
{B, C}  1
{B, E}  1
{C, E}  3
L2: {A, B} 2, {A, C} 2, {A, E} 2, {C, E} 3

3rd scan: C3 → L3
C3: {A, C, E}
L3: {A, C, E}  2

An Example

This example is similar to the last example, but we have added two more items and another transaction, making five transactions and six items. We want to find association rules with 50% support and 75% confidence.

Transaction ID  Items
100  Bread, Cheese, Eggs, Juice
200  Bread, Cheese, Juice
300  Bread, Milk, Yogurt
400  Bread, Juice, Milk
500  Cheese, Juice, Milk

Example

First find L1. 50% support requires that each frequent item appear in at least three transactions. Therefore L1 is given by:

Item  Frequency
Bread  4
Cheese  3
Juice  4
Milk  3

Example

The candidate 2-itemsets, C2, therefore contain six pairs (why?). These pairs and their frequencies are:

Item Pair  Frequency
(Bread, Cheese)  2
(Bread, Juice)  3
(Bread, Milk)  2
(Cheese, Juice)  3
(Cheese, Milk)  1
(Juice, Milk)  2

Question: Why did we not select (Eggs, Milk)?

Deriving Rules

L2 has only two frequent item pairs, {Bread, Juice} and {Cheese, Juice}. After these two frequent pairs, there are no candidate 3-itemsets (why?), since we do not have two 2-itemsets that share the same first item (why?).
The two frequent pairs lead to the following possible rules:
Bread → Juice
Juice → Bread
Cheese → Juice
Juice → Cheese

Deriving Rules

The confidence of these rules is obtained by dividing the support for both items in the rule by the support of the item on the left-hand side of the rule.
The confidences of the four rules therefore are:
3/4 = 75% (Bread → Juice)
3/4 = 75% (Juice → Bread)
3/3 = 100% (Cheese → Juice)
3/4 = 75% (Juice → Cheese)
Since all of them have at least the minimum 75% confidence, they all qualify. We are done, since there are no 3-itemsets.

Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.
If {A, B, C, D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).

Rule Generation

How do we efficiently generate rules from frequent itemsets?
In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D).
But the confidence of rules generated from the same itemset does have an anti-monotone property.
E.g., for L = {A, B, C, D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.

Rule Generation for the Apriori Algorithm

[Lattice of rules for the frequent itemset {A, B, C, D}, from ABCD => {} down to rules with a single item in the antecedent such as A => BCD. If a low-confidence rule is found (e.g., BCD => A), all rules below it in the lattice, which move further items into the consequent, can be pruned.]

Rule Generation for the Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent.
join(CD => AB, BD => AC) would produce the candidate rule D => ABC.
Prune rule D => ABC if its subset AD => BC does not have high confidence.

Rule generation algorithm

Key fact: moving items from the antecedent to the consequent never changes support, and never increases confidence.

Algorithm:
For each itemset I with minsup:
  Find all minconf rules with a single consequent, of the form (I − L1 → L1).
  Repeat:
    Guess candidate consequents Ck by appending items from I − Lk−1 to Lk−1.
    Verify the confidence of each rule I − Ck → Ck using known itemset support values.

Association Rules

Consider: AB → C (90% confidence) and A → C (92% confidence).
Clearly the first rule is of no use. We should look for more complex rules only if they are better than simple rules.

Pattern Evaluation

Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant.
Redundant if {A,B,C} → {D} and {A,B} → {D} have the same support and confidence.
Interestingness measures can be used to prune/rank the derived patterns.
In the original formulation of association rules, support and confidence are the only measures used.

Association & Correlation

As we can see, the support-confidence framework can be misleading; it can identify a rule (A => B) as interesting (strong) when, in fact, the occurrence of A might not imply the occurrence of B.
Correlation analysis provides an alternative framework for finding interesting relationships, or for improving our understanding of the meaning of some association rules (the lift of an association rule).

Computing Interestingness Measure

Given a rule X → Y, the information needed to compute the rule's interestingness can be obtained from a contingency table.

Contingency table for X → Y:
        Y     ¬Y
 X     f11   f10   | f1+
¬X     f01   f00   | f0+
       f+1   f+0   | |T|

f11: support count of X and Y
f10: support count of X and ¬Y
f01: support count of ¬X and Y
f00: support count of ¬X and ¬Y

These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.
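A small Python helper (illustrative, not from the slides) that derives these four counts from a transaction list for a rule X → Y:

def contingency_counts(transactions, X, Y):
    """Return (f11, f10, f01, f00) for the rule X -> Y."""
    X, Y = set(X), set(Y)
    f11 = f10 = f01 = f00 = 0
    for t in transactions:
        t = set(t)
        has_x, has_y = X <= t, Y <= t
        if has_x and has_y:
            f11 += 1          # X and Y
        elif has_x:
            f10 += 1          # X and not Y
        elif has_y:
            f01 += 1          # not X and Y
        else:
            f00 += 1          # neither X nor Y
    return f11, f10, f01, f00

txns = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
print(contingency_counts(txns, {"Bread"}, {"Cheese"}))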

Drawback of Confidence

           Coffee   ¬Coffee
 Tea         15        5     | 20
¬Tea         75        5     | 80
             90       10     | 100

Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9.
Although the confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 75/80 = 0.9375.

Correlation Concepts

Two itemsets A and B are independent (the occurrence of A is independent of the occurrence of itemset B) iff P(A ∪ B) = P(A) P(B).
Otherwise A and B are dependent and correlated.
The measure of correlation, or correlation between A and B, is given by the formula:
Corr(A,B) = P(A ∪ B) / (P(A) · P(B))

Correlation Concepts [Cont.]

corr(A,B) > 1 means that A and B are positively correlated, i.e. the occurrence of one implies the occurrence of the other.
corr(A,B) < 1 means that the occurrence of A is negatively correlated with (or discourages) the occurrence of B.
corr(A,B) = 1 means that A and B are independent and there is no correlation between them.

Statistical Independence

Population of 1000 students:
600 students know how to swim (S)
700 students know how to bike (B)
420 students know how to swim and bike (S, B)

P(S ∧ B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42

P(S ∧ B) = P(S) × P(B) => statistical independence
P(S ∧ B) > P(S) × P(B) => positively correlated
P(S ∧ B) < P(S) × P(B) => negatively correlated

Association & Correlation

The correlation formula can be rewritten as Corr(A,B) = P(B|A) / P(B).
We already know that Support(A → B) = P(A ∪ B) and Confidence(A → B) = P(B|A).
That means that Confidence(A → B) = corr(A,B) × P(B).
So correlation, support and confidence are all different, but the correlation provides extra information about the association rule (A → B).
We say that the correlation corr(A,B) provides the LIFT of the association rule (A => B), i.e. A is said to increase (or LIFT) the likelihood of B by the factor returned by the formula for corr(A,B).
Lift measures how much more likely the RHS is given the LHS than merely the RHS.

Statistical-based Measures

Measures that take into account statistical dependence:

Lift = P(Y|X) / P(Y)
Interest = P(X,Y) / (P(X) P(Y))
PS = P(X,Y) − P(X) P(Y)
φ-coefficient = (P(X,Y) − P(X) P(Y)) / sqrt( P(X)[1 − P(X)] P(Y)[1 − P(Y)] )
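These measures are straightforward to compute from the contingency counts defined earlier; a brief Python sketch (the function name is mine):

from math import sqrt

def interestingness(f11, f10, f01, f00):
    """Compute lift/interest, PS and the phi-coefficient for a rule X -> Y."""
    n = f11 + f10 + f01 + f00
    p_xy = f11 / n          # P(X, Y)
    p_x = (f11 + f10) / n   # P(X)
    p_y = (f11 + f01) / n   # P(Y)
    lift = p_xy / (p_x * p_y)                 # equals P(Y|X) / P(Y)
    ps = p_xy - p_x * p_y                     # Piatetsky-Shapiro
    phi = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return {"lift": lift, "PS": ps, "phi": phi}

# Tea -> Coffee contingency table from the slide: f11=15, f10=5, f01=75, f00=5
print(interestingness(15, 5, 75, 5))   # lift = 0.8333...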

Example: Lift/Interest

           Coffee   ¬Coffee
 Tea         15        5     | 20
¬Tea         75        5     | 80
             90       10     | 100

Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9.
Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated).

Rule Evaluation - Lift

Transaction No.  Item 1  Item 2     Item 3     Item 4
100              Beer    Diaper     Chocolate
101              Milk    Chocolate  Shampoo
102              Beer    Milk       Vodka      Chocolate
103              Beer    Milk       Diaper     Chocolate
104              Milk    Diaper     Beer

What are the support and confidence for the rule {Chocolate} → {Milk}?
Support = 3/5, Confidence = 3/4. Very high support and confidence.
Does Chocolate really lead to a Milk purchase?
No! Because Milk occurs in 4 out of 5 transactions. Chocolate even decreases the chance of a Milk purchase (3/4 < 4/5).
Lift = (3/4) / (4/5) = 0.9375 < 1
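The same lift calculation can be checked directly against the five transactions above (a quick sketch):

txns = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Milk", "Vodka", "Chocolate"},
    {"Beer", "Milk", "Diaper", "Chocolate"},
    {"Milk", "Diaper", "Beer"},
]
n = len(txns)
choc = sum(1 for t in txns if "Chocolate" in t)
both = sum(1 for t in txns if {"Chocolate", "Milk"} <= t)
milk = sum(1 for t in txns if "Milk" in t)
confidence = both / choc
lift = confidence / (milk / n)
print(f"support={both/n:.2f} confidence={confidence:.2f} lift={lift:.4f}")
# prints support=0.60 confidence=0.75 lift=0.9375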

There are lots of measures proposed in the literature.
Some measures are good for certain applications, but not for others.
What criteria should we use to determine whether a measure is good or bad?

Efficiency

Consider an example of a supermarket database which might have several thousand items, including 1000 frequent items, and several million transactions.
Which part of the Apriori algorithm will be the most expensive to compute? Why?

Efficiency

The algorithm used to construct the candidate set Ck in order to find the frequent set Lk is crucial to the performance of the Apriori algorithm. The larger the candidate set, the higher the processing cost of discovering the frequent itemsets, since the transactions must be scanned for each. Given that the numbers of early candidate itemsets are very large, the initial iterations dominate the cost.
In a supermarket database with about 1000 frequent items, there will be almost half a million candidate pairs in C2 that need to be searched for to find L2.

Efficiency

Generally the number of frequent pairs out of half a million pairs will be small, and therefore (why?) the number of 3-itemsets should be small.
Therefore, it is the generation of the frequent set L2 that is the key to improving the performance of the Apriori algorithm.

Comment

In class examples we usually require high support, for example 25%, 30% or even 50%. These support values are very high if the number of items and the number of transactions are large. For example, 25% support in a supermarket transaction database means searching for items that have been purchased by one in four customers! Not many items would probably qualify.
Practical applications therefore deal with much smaller support values.

Transactions Storage

Consider only eight transactions, with transaction IDs {10, 20, 30, 40, 50, 60, 70, 80}. This set of eight transactions with six items can be represented in at least three different ways, as follows.

Transaction ID  Items
10  A, B, D
20  D, E, F
30  A, F
40  B, C, D
50  E, F
60  D, E, F
70  C, D, F
80  A, C, D, F

Horizontal Representation

TID  A  B  C  D  E  F
10   1  1  0  1  0  0
20   0  0  0  1  1  1
30   1  0  0  0  0  1
40   0  1  1  1  0  0
50   0  0  0  0  1  1
60   0  0  0  1  1  1
70   0  0  1  1  0  1
80   1  0  1  1  0  1

Vertical Representation

Item  10  20  30  40  50  60  70  80
A      1   0   1   0   0   0   0   1
B      1   0   0   1   0   0   0   0
C      0   0   0   1   0   0   1   1
D      1   1   0   1   0   1   1   1
E      0   1   0   0   1   1   0   0
F      0   1   1   0   1   1   1   1
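The switch between the horizontal and vertical layouts is easy to express in code. The sketch below (names are illustrative) builds the vertical form, item-to-TID lists, from the horizontal transaction list:

from collections import defaultdict

def to_vertical(transactions):
    """Map each item to the sorted list of transaction IDs containing it."""
    tidlists = defaultdict(list)
    for tid, items in transactions.items():
        for item in items:
            tidlists[item].append(tid)
    return {item: sorted(tids) for item, tids in tidlists.items()}

horizontal = {10: {"A", "B", "D"}, 20: {"D", "E", "F"}, 30: {"A", "F"},
              40: {"B", "C", "D"}, 50: {"E", "F"}, 60: {"D", "E", "F"},
              70: {"C", "D", "F"}, 80: {"A", "C", "D", "F"}}
print(to_vertical(horizontal))
# e.g. 'A': [10, 30, 80], 'D': [10, 20, 40, 60, 70, 80]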

Bigger Example

TID  Items
1   Biscuits, Bread, Cheese, Coffee, Yogurt
2   Bread, Cereal, Cheese, Coffee
3   Cheese, Chocolate, Donuts, Juice, Milk
4   Bread, Cheese, Coffee, Cereal, Juice
5   Bread, Cereal, Chocolate, Donuts, Juice
6   Milk, Tea
7   Biscuits, Bread, Cheese, Coffee, Milk
8   Eggs, Milk, Tea
9   Bread, Cereal, Cheese, Chocolate, Coffee
10  Bread, Cereal, Chocolate, Donuts, Juice
11  Bread, Cheese, Juice
12  Bread, Cheese, Coffee, Donuts, Juice
13  Biscuits, Bread, Cereal
14  Cereal, Cheese, Chocolate, Donuts, Juice
15  Chocolate, Coffee
16  Donuts
17  Donuts, Eggs, Juice
18  Biscuits, Bread, Cheese, Coffee
19  Bread, Cereal, Chocolate, Donuts, Juice
20  Cheese, Chocolate, Donuts, Juice
21  Milk, Tea, Yogurt
22  Bread, Cereal, Cheese, Coffee
23  Chocolate, Donuts, Juice, Milk, Newspaper
24  Newspaper, Pastry, Rolls
25  Rolls, Sugar, Tea

Frequency of Items

Item No.  Item name  Frequency
1   Biscuits   4
2   Bread      13
3   Cereal     10
4   Cheese     11
5   Chocolate  9
6   Coffee     9
7   Donuts     10
8   Eggs       2
9   Juice      11
10  Milk       6
11  Newspaper  2
12  Pastry     1
13  Rolls      2
14  Sugar      1
15  Tea        4
16  Yogurt     2

Frequent Items

Assume 25% support. In 25 transactions, a frequent item must occur in at least 7 transactions. The frequent 1-itemsets, L1, are given below. How many candidates are in C2? List them.

Item       Frequency
Bread      13
Cereal     10
Cheese     11
Chocolate  9
Coffee     9
Donuts     10
Juice      11

L2

The following pairs are frequent. Now find C3, then L3, and the rules.

Frequent 2-itemset    Frequency
{Bread, Cereal}       9
{Bread, Cheese}       8
{Bread, Coffee}       8
{Cheese, Coffee}      9
{Chocolate, Donuts}   7
{Chocolate, Juice}    7
{Donuts, Juice}       9

Rules

The full set of rules is given below. Could some rules be removed?

Cheese → Bread
Cheese → Coffee
Coffee → Bread
Coffee → Cheese
Cheese, Coffee → Bread
Bread, Coffee → Cheese
Bread, Cheese → Coffee
Chocolate → Donuts
Chocolate → Juice
Donuts → Chocolate
Donuts → Juice
Donuts, Juice → Chocolate
Chocolate, Juice → Donuts
Chocolate, Donuts → Juice
Bread → Cereal
Cereal → Bread

Study the above rules carefully.

Improving the Apriori Algorithm

Many techniques for improving the efficiency have been proposed:
Pruning (already mentioned)
Hashing-based technique
Transaction reduction
Partitioning
Sampling
Dynamic itemset counting

Pruning

Pruning can reduce the size of the candidate set Ck. We want to transform Ck into the set of frequent itemsets Lk. To reduce the work of checking, we may use the rule that all subsets of any candidate in Ck must also be frequent.

Example

Suppose the items are A, B, C, D, E, F, …, X, Y, Z.
Suppose L1 is A, C, E, P, Q, S, T, V, W, X.
Suppose L2 is {A, C}, {A, F}, {A, P}, {C, P}, {E, P}, {E, G}, {E, V}, {H, J}, {K, M}, {Q, S}, {Q, X}.
Are you able to identify errors in the L2 list?
What is C3? How do we prune C3?
C3 is {A, C, P}, {E, P, V}, {Q, S, X}.

Hash Tables

Hash tables are a common approach to the storing/searching problem.

What is a Hash Table?

The simplest kind of hash table is an array of records. This example has 701 records (an array indexed [0] … [700]).

Each record has a special field, called its key. In this example, the key is a long integer field called Number (e.g., Number 506643548).

The number might be a person's identification number, and the rest of the record has information about the person.

When a hash table is in use, some spots contain valid records, and other spots are "empty".

Inserting a New Record

In order to insert a new record, the key must somehow be converted to an array index. The index is called the hash value of the key.

A typical way to create a hash value is (Number mod 701).
What is (580625685 mod 701)? It is 3.

The hash value is used as the location of the new record: the record with key 580625685 is stored at index [3].

Collisions

Here is another new record to insert, Number 701466868, with a hash value of [2].
This is called a collision, because there is already another valid record at [2].
When a collision occurs, move forward until you find an empty spot. The new record goes in the empty spot.

Searching for a Key

The data that is attached to a key can be found fairly quickly.
Calculate the hash value and check that location of the array for the key.
Keep moving forward until you find the key, or you reach an empty spot.
When the item is found, the information can be copied to the necessary location.

Deleting a Record

Records may also be deleted from a hash table.
But the location must not be left as an ordinary "empty spot", since that could interfere with searches.
The location must be marked in some special way, so that a search can tell that the spot used to have something in it.
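A minimal open-addressing hash table in Python mirroring the description above (701 slots, hash = key mod 701, linear probing, and a special marker for deleted slots). This is a generic sketch, not code from the slides.

EMPTY, DELETED = object(), object()  # sentinels for never-used and deleted slots

class HashTable:
    def __init__(self, size=701):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        """Yield slot indices starting at key mod table size (linear probing)."""
        size = len(self.slots)
        start = key % size
        for i in range(size):
            yield (start + i) % size

    def insert(self, key, value):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY or slot is DELETED or slot[0] == key:
                self.slots[i] = (key, value)
                return

    def search(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:            # a never-used slot ends the search
                return None
            if slot is not DELETED and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return
            if slot is not DELETED and slot[0] == key:
                self.slots[i] = DELETED  # mark it, don't leave an ordinary empty spot
                return

table = HashTable()
table.insert(580625685, "record A")   # hashes to 580625685 % 701 == 3
table.insert(701466868, "record B")
print(table.search(701466868))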

Example

Consider the transaction database in the first table below, used in an earlier example. The second table below shows all possible 2-itemsets for each transaction.

Transaction ID  Items
100  Bread, Cheese, Eggs, Juice
200  Bread, Cheese, Juice
300  Bread, Milk, Yogurt
400  Bread, Juice, Milk
500  Cheese, Juice, Milk

TID  2-itemsets
100  (B, C) (B, E) (B, J) (C, E) (C, J) (E, J)
200  (B, C) (B, J) (C, J)
300  (B, M) (B, Y) (M, Y)
400  (B, J) (B, M) (J, M)
500  (C, J) (C, M) (J, M)

Hashing Example

The possible 2-itemsets in the last table are now hashed to the hash table below. The last column is not required in the hash table, but we have included it to explain the technique.

Bit vector  Bucket number  Count  Pairs
1           0              3      (C, J) (B, Y) (M, Y) (C, J)
0           1              1      (C, M)
0           2              1      (E, J)
0           3              0
0           4              2      (B, C)
1           5              3      (B, E) (J, M) (J, M)
1           6              3      (B, J) (B, J)
1           7              3      (C, E) (B, M) (B, M)

Find C2

The major aim of the algorithm is to reduce the size of C2. It is therefore essential that the hash table is large enough that collisions are few. Collisions result in a loss of effectiveness of the hash table. This is what happened in the example, in which we had collisions in three of the eight rows of the hash table, which required us to find which pair in each of those buckets was actually frequent.
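A sketch of this hashing idea in Python (the bucket count and hash function here are illustrative, not the ones used in the slides): hash every 2-itemset of every transaction into a small number of buckets during the first scan, then keep as candidates only the pairs of frequent items whose bucket count reaches the minimum support count.

from itertools import combinations

def hash_filtered_c2(transactions, frequent_items, min_count, n_buckets=8):
    """Return candidate pairs of frequent items whose hash bucket is 'hot'."""
    bucket_counts = [0] * n_buckets

    def bucket(pair):
        # illustrative deterministic hash: sum of character codes mod n_buckets
        return sum(ord(c) for item in pair for c in item) % n_buckets

    # First scan: count every occurring pair only via its bucket
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            bucket_counts[bucket(pair)] += 1

    # A pair can be frequent only if both items are frequent
    # and its bucket count reaches min_count
    candidates = []
    for pair in combinations(sorted(frequent_items), 2):
        if bucket_counts[bucket(pair)] >= min_count:
            candidates.append(pair)
    return candidates

txns = [{"Bread", "Cheese", "Eggs", "Juice"}, {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk", "Yogurt"}, {"Bread", "Juice", "Milk"},
        {"Cheese", "Juice", "Milk"}]
print(hash_filtered_c2(txns, {"Bread", "Cheese", "Juice", "Milk"}, min_count=3))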

Transaction Reduction

As discussed earlier, any transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, and such a transaction may be marked or removed.

Partitioning

The set of transactions may be divided into a number of disjoint subsets. Then each partition is searched for frequent itemsets. These frequent itemsets are called local frequent itemsets.

Phase 1:
Divide the n transactions into m partitions.
Find the frequent itemsets in each partition.
Combine all local frequent itemsets to form the candidate itemsets.

Phase 2:
Find the global frequent itemsets.
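A sketch of the two phases in Python. The helper names, the fixed-size partitioning, and the naive local miner are assumptions for illustration only; in practice each partition would be mined with Apriori or a similar algorithm.

from itertools import combinations

def partitioned_frequent_itemsets(transactions, min_support, m=2):
    """Phase 1: local frequent itemsets per partition; Phase 2: global recount."""
    n = len(transactions)
    size = (n + m - 1) // m
    partitions = [transactions[i:i + size] for i in range(0, n, size)]

    # Phase 1: the union of local frequent itemsets becomes the global candidate set
    candidates = set()
    for part in partitions:
        candidates |= set(local_frequent_itemsets(part, min_support))

    # Phase 2: one more scan of the full database to count the candidates globally
    result = {}
    for c in candidates:
        count = sum(1 for t in transactions if c <= set(t))
        if count / n >= min_support:
            result[c] = count
    return result

def local_frequent_itemsets(transactions, min_support):
    """Naive local miner used only for illustration: count itemsets that occur."""
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return [c for c, cnt in counts.items() if cnt / len(transactions) >= min_support]

txns = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
print(partitioned_frequent_itemsets(txns, 0.5))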

Sampling

A random sample (usually large enough to fit in main memory) may be obtained from the overall set of transactions, and the sample is searched for frequent itemsets. These frequent itemsets are called sample frequent itemsets.
This is not guaranteed to be accurate, but we sacrifice accuracy for efficiency. A lower support threshold may be used for the sample to reduce the chance of missing any frequent itemsets.
The actual frequencies of the sample frequent itemsets are then obtained.
More than one sample could be used to improve accuracy.
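A brief sketch of the sampling idea in Python; the sample fraction, the lowered threshold factor, and the naive itemset counter are assumptions chosen for illustration.

import random
from collections import Counter
from itertools import combinations

def sample_frequent_itemsets(transactions, min_support,
                             sample_frac=0.5, threshold_factor=0.8, seed=0):
    """Mine a random sample with a lowered threshold, then verify the actual
    support of each sample-frequent itemset against the full data set."""
    rng = random.Random(seed)
    k = max(1, int(len(transactions) * sample_frac))
    sample = rng.sample(list(transactions), k)

    # Count every itemset that occurs in the sample (naive, for illustration only)
    counts = Counter()
    for t in sample:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            counts.update(frozenset(c) for c in combinations(items, size))

    # A lowered threshold on the sample reduces the risk of missing frequent itemsets
    cutoff = min_support * threshold_factor * len(sample)
    sample_frequent = [c for c, cnt in counts.items() if cnt >= cutoff]

    # Then obtain the actual frequencies of the sample-frequent itemsets
    return {c: sum(1 for t in transactions if c <= set(t)) for c in sample_frequent}

txns = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
print(sample_frequent_itemsets(txns, 0.5))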

Exercise

Given the following list of transactions, do the following:
1) Find all the frequent itemsets (minimum support 40%).
2) Find all the association rules (minimum confidence 70%).
3) For the discovered association rules, calculate the lift.

Transaction No.  Item 1  Item 2     Item 3     Item 4
100              Beer    Diaper     Chocolate
101              Milk    Chocolate  Shampoo
102              Beer    Soap       Vodka
103              Beer    Cheese     Wine
104              Milk    Diaper     Beer       Chocolate