COMP6237 Data Mining
“Market Basket” Analysis & Association Rule Mining
Jonathon Hare ([email protected])

Content based on material from slides by Evgueni Smirnov at the University of Maastricht, slides by João Mendes Moreira & José Luís Borges at the University of Porto, and notes by Nitin Patel at MIT.



Introduction

• Association Rule Problem

• Applications

• The Apriori Algorithm

• Discovering Association Rules

• Measures for Association Rules

Association Rule

X ⇒ Y  (“if X then Y”)

Association Rule Problem

• Given a database of transactions:

Find all the association rules:

Applications 1

• Market Basket Analysis:

• given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items that are frequently purchased together.

• One basket tells you about what one customer purchased at one time.

Why analyse market baskets?

• Identify who customers are (not by name)

• Understand why they make certain purchases

• Gain insight about products:

• Fast and slow movers

• Products which are purchased together

• Products which might benefit from promotion

• Take action:

• Store layouts

• Which products to put on specials, promote, coupons...

More than just the contents of a shopping basket…

• What customers do not purchase, and why:

• If customers purchase baking powder, but no flour, what are they baking?

• If customers purchase a mobile phone, but no case, are you missing an opportunity?

• Key drivers of purchases:

• e.g. a gourmet mustard that seems to sit on the shelf collecting dust, until a customer buys that particular brand of special gourmet mustard in a shopping excursion that includes hundreds of pounds’ worth of other products.

• Would eliminating the mustard (to replace it with a better‐selling item) threaten the entire customer relationship?

A story about MBA


The classic example: a supermarket finds that diapers and beer are frequently bought together. Probably mom was calling dad at work to buy diapers on the way home, and he decided to buy a six-pack as well.

The retailer could move diapers and beers to separate places and position high‐profit items of interest to young fathers along the path.


Applications 2

• Telecommunication (each customer is a transaction containing the set of phone calls)

• Credit Cards/ Banking Services (each card/account is a transaction containing the set of customer’s payments)

• Medical Treatments (each patient is represented as a transaction containing the ordered set of diseases)

• Basketball-Game Analysis (each game is represented as a transaction containing the ordered set of ball passes)

• … more coming later…

Mining Association Rules

Association Rule Definitions

• I={i1, i2, …, in}: a set of all the items

• Transaction T: a set of items such that T ⊆ I

• Transaction Database D: a set of transactions

• A transaction T contains an itemset X ⊆ I if X ⊆ T

• An association rule is an implication of the form X ⟹ Y, where X, Y ⊆ I

Association Rule Definitions

• A set of items is referred to as an itemset.

• An itemset that contains k items is a k-itemset.

• The support s of an itemset X is the percentage of transactions in the transaction database D that contain X.

• The support of the rule X ⟹ Y in the transaction database D is the support of the items set X ∪ Y in D.

• The confidence of the rule X ⟹ Y in the transaction database D is the ratio of the number of transactions in D that contain X ∪ Y to the number of transactions that contain X in D.
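The two definitions above can be sketched directly in code. A minimal illustration (not from the slides), assuming transactions are represented as Python sets, using the clothing-shop database that appears in the example later in this lecture:

```python
# Illustrative sketch: support and confidence over a toy transaction database.

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """support(X ∪ Y) / support(X) -- an estimate of p(Y | X)."""
    return support(X | Y, transactions) / support(X, transactions)

D = [{"Shoes", "Shirt", "Jacket"},
     {"Shoes", "Jacket"},
     {"Shoes", "Jeans"},
     {"Shirt", "Sweatshirt"}]

print(support({"Shoes", "Jacket"}, D))       # 0.5
print(confidence({"Shoes"}, {"Jacket"}, D))  # 2/3 ≈ 0.667
print(confidence({"Jacket"}, {"Shoes"}, D))  # 1.0
```

These numbers match the shoes/jacket example worked through below.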

Association Rule Problem

• Given:

• a set I of all the items;

• a database D of transactions;

• minimum support s;

• minimum confidence c;

• Find:

• all association rules X ⇒ Y with support at least s and confidence at least c.

Example

• Given a database of transactions:

Find all the association rules:

Problem Decomposition

1. Find all sets of items that have minimum support (frequent itemsets)

2. Use the frequent itemsets to generate the desired rules

Transaction ID   Items Bought
1                Shoes, Shirt, Jacket
2                Shoes, Jacket
3                Shoes, Jeans
4                Shirt, Sweatshirt

If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.

Frequent Itemset   Support
{Shoes}            75%
{Shirt}            50%
{Jacket}           50%
{Shoes, Jacket}    50%

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:

Shoes ⇒ Jacket   (support = 50%, confidence = 66%)
Jacket ⇒ Shoes   (support = 50%, confidence = 100%)


The Apriori Algorithm

• Frequent itemset property:

• Any subset of a frequent itemset is frequent.

• Contrapositive:

• If an itemset is not frequent, none of its supersets are frequent.

Frequent Itemset Property

The Apriori Algorithm

• Lk: Set of frequent itemsets of size k (with min support)

• Ck: Set of candidate itemset of size k (potentially frequent itemsets)

L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
return ⋃k Lk;
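The level-wise loop above can be sketched compactly in Python. This is an illustration rather than the course's reference implementation; the function name `apriori` and the representation of transactions as sets are my own choices:

```python
# Illustrative sketch of the Apriori level-wise search.
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)

    def is_frequent(itemset):
        return sum(1 for t in transactions if itemset <= t) / n >= min_support

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if is_frequent(frozenset([i]))}
    frequent = set(Lk)
    k = 1
    while Lk:
        # Join pairs of k-itemsets into (k+1)-itemsets, pruning any
        # candidate that has an infrequent k-subset.
        candidates = {a | b for a in Lk for b in Lk
                      if len(a | b) == k + 1
                      and all(frozenset(s) in Lk
                              for s in combinations(a | b, k))}
        Lk = {c for c in candidates if is_frequent(c)}
        frequent |= Lk
        k += 1
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, 0.5)  # contains {2, 3, 5}, as in the worked example
```

On the example database below (min support 50%) this yields the same L1, L2, and L3 as the slide walkthrough.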

The Apriori Algorithm — Example

Database D (min support = 50%):

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1:

itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1 (candidates meeting min support):

itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts:

itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2:

itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3: {2 3 5}

Scan D → L3:

itemset   sup
{2 3 5}   2

How to Generate Candidates

Input: Li-1, the set of frequent itemsets of size i-1
Output: Ci, the set of candidate itemsets of size i

Ci = ∅;
for each itemset J in Li-1 do
    for each itemset K in Li-1 s.t. K ≠ J do
        if i-2 of the elements in J and K are equal then
            if all (i-1)-subsets of J ∪ K are in Li-1 then
                Ci = Ci ∪ {J ∪ K};
return Ci;
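The join-and-prune procedure above can be sketched as follows. This is a minimal illustration; the name `gen_candidates` and the frozenset representation are assumptions, not from the slides:

```python
# Illustrative sketch of Apriori candidate generation (join + prune).
from itertools import combinations

def gen_candidates(L_prev, i):
    """Build C_i from L_{i-1}: join pairs of (i-1)-itemsets sharing i-2
    items, then prune any candidate with an infrequent (i-1)-subset."""
    Ci = set()
    for J in L_prev:
        for K in L_prev:
            union = J | K
            if K != J and len(union) == i:          # J, K share i-2 items
                if all(frozenset(s) in L_prev       # prune step
                       for s in combinations(union, i - 1)):
                    Ci.add(union)
    return Ci

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
C4 = gen_candidates(L3, 4)
print(C4)  # only {a, b, c, d} survives; acde is pruned since ade ∉ L3
```

This reproduces the worked example in the next slide: C4 = {abcd}.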

Example of Generating Candidates

• L3 = {abc, abd, acd, ace, bcd}

• Generating C4 from L3

• abcd from abc and abd (or abc and acd, …)

• acde from acd and ace

• Pruning:

• acde is removed because ade is not in L3

• C4 = {abcd}

How can we generate rules?

• Let us consider the 3-itemset {I1, I2, I5}:

• I1 ⋀ I2 ⟹ I5

• I1 ⋀ I5 ⟹ I2

• I2 ⋀ I5 ⟹ I1

• I1 ⟹ I2 ⋀ I5

• I2 ⟹ I1 ⋀ I5

• I5 ⟹ I1 ⋀ I2

Consider every way of splitting the itemset into a non-empty antecedent and consequent.

Discovering Rules

for each frequent itemset I do
    for each non-empty proper subset C of I do
        if (support(I) / support(I - C) >= minconf) then
            output the rule (I - C) ⟹ C,
                with confidence = support(I) / support(I - C)
                and support = support(I)
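The rule-discovery loop above can be sketched directly. An illustrative version (the name `gen_rules` and the set-based representation are my own):

```python
# Illustrative sketch of rule discovery from frequent itemsets.
from itertools import combinations

def gen_rules(frequent_itemsets, transactions, minconf):
    """Yield (antecedent, consequent, support, confidence) for each rule
    (I - C) => C over every non-empty proper subset C of each itemset I."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    for I in frequent_itemsets:
        for r in range(1, len(I)):
            for C in map(frozenset, combinations(I, r)):
                conf = support(I) / support(I - C)
                if conf >= minconf:
                    yield (I - C, C, support(I), conf)

D = [{"Shoes", "Shirt", "Jacket"}, {"Shoes", "Jacket"},
     {"Shoes", "Jeans"}, {"Shirt", "Sweatshirt"}]
rules = list(gen_rules([frozenset({"Shoes", "Jacket"})], D, 0.5))
for X, Y, s, c in rules:
    print(set(X), "=>", set(Y), f"support={s:.0%} confidence={c:.0%}")
```

On the clothing-shop database this recovers exactly the two rules from the earlier example: Shoes ⇒ Jacket (confidence 66%) and Jacket ⇒ Shoes (confidence 100%).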

Considerations of the Apriori algorithm

• Advantages:

• Exploits the frequent itemset property to prune the search

• Easily parallelised

• Easy to implement

• Disadvantages:

• Assumes transaction database is memory resident

• Requires many database scans

Better measures for rules

Problems with confidence

• The confidence of X ⟹ Y in database D is the ratio of the number of transactions containing X ⋃ Y to the number of transactions that contain X:

• But, when Y is independent of X: p(Y) = p(Y | X).

• If p(Y) is high we’ll have a rule with high confidence that associates independent itemsets!

• For example, if p(“milk”) = 80% and “milk” is independent from “salmon”, then the rule “salmon” ⟹ “milk” will have confidence 80%!

conf(X ⟹ Y) = (numTrans(X ∧ Y) / |D|) / (numTrans(X) / |D|) = p(X ∧ Y) / p(X) = p(Y | X)

Alternative Measures for Association Rules

• The lift measure indicates the departure from independence of X and Y.

• The lift of X ⟹ Y is:

• But, the lift measure is symmetric; i.e., it does not take into account the direction of implications!

lift(X ⟹ Y) = conf(X ⟹ Y) / p(Y) = (p(X ∧ Y) / p(X)) / p(Y) = p(X ∧ Y) / (p(X) · p(Y))

Alternative Measures for Association Rules

• The conviction measure indicates the departure from independence of X and Y taking into account the implication direction.

• The conviction of X ⟹ Y is:

conv(X ⟹ Y) = (p(X) · p(¬Y)) / p(X ∧ ¬Y)

Further applications (not shopping related)

Finding Linked Concepts

• “Baskets” = documents

• “items” = words in those documents

• Lets us find words that appear together unusually frequently, i.e., linked concepts.

• Word4 ⟹ Word3

• When Word4 occurs in a document, there is a high probability that Word3 also occurs

       Word1  Word2  Word3  Word4
Doc1   1      0      1      1
Doc2   0      0      1      1
Doc3   1      1      1      0

Detecting Plagiarism

• “Baskets” = sentences

• “items” = documents containing those sentences

• Items that appear together too often could represent plagiarism.

• Doc4 ⟹ Doc3

• When a sentence occurs in document 4, there is a high probability that it also occurs in document 3

        Doc1  Doc2  Doc3  Doc4
Sent1   1     0     1     1
Sent2   0     0     1     1
Sent3   1     1     1     0

Working with webpages

• “Baskets” = Web pages

• “items” = linked pages

• Pairs of pages with many common references may be about the same topic.

• “Baskets” = Web pages, pi

• “items” = pages that link to pi

• Pages with many of the same links may be mirrors or about the same topic.

Summary

• Association Rules are a very widely applied data mining approach

• lots of potential uses

• Association Rules are derived from frequent itemsets

• The Apriori algorithm is an efficient algorithm for finding all frequent itemsets

• implements level-wise search using the frequent itemset property

• There are many measures for association rules.