Mining Association Rules Apriori Algorithmpaginas.fe.up.pt/~ec/files_1011/week 04 - Association...The Apriori Algorithm Ck: Candidate itemset of size k Lk: frequent itemset of size

Mining Association Rulesg

1

Mining Association RulesMining Association Rules

What is Association rule mining

Apriori Algorithm

Measures of rule interestingness

Advanced Techniques

2

h i i l i i ?What Is Association Rule Mining?

l Association rule mining

Finding frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases

Understand customer buying habits by finding associations and l ti b t th diff t it th t t l i th icorrelations between the different items that customers place in their

“shopping basket”

Applications

Basket data analysis, cross-marketing, catalog design, loss-leader analysis, web log analysis, fraud detection (supervisor->examiner)

3

Wh t I A i ti R l Mi i ?What Is Association Rule Mining?

Rule form

Antecedent → Consequent [support, confidence]Antecedent → Consequent [support, confidence]

(support and confidence are user defined measures of interestingness)pp f f f g

Examples

buys(x, “computer”) → buys(x, “financial management software”) [0.5%, 60%]

age(x, “30..39”) ^ income(x, “42..48K”) buys(x, “car”) [1%,75%][ , ]

4

How can Association Rules be used?

Probably mom was


Probably mom was calling dad at work to buy diapers on way home and hehome and he decided to buy a six-pack as well.

The retailer could move diapers and beers to separate places and position high-profit items of g pinterest to young fathers along the path

5

path.


Let the rule discovered be


{Bagels,...} → {Potato Chips}

Potato chips as consequent => Can be used to determine what should be done to boost its sales

Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagelsbe affected if the store discontinues selling bagels

Bagels in antecedent and Potato chips in the consequent => Can be Bagels in antecedent and Potato chips in the consequent > Can be used to see what products should be sold with Bagels to promote sale of Potato Chips

6

Basic Concepts

Given Given:

(1) database of transactions,

(2) each transaction is a list of items purchased by a customer in a visit

Find: Find:

all rules that correlate the presence of one set of items (itemset) with that of another set of items(itemset) with that of another set of items

7

E.g., 35% of people who buys salmon also buys cheese

TX1 Shoes Socks Tie BeltTX1 Shoes, Socks, Tie, Belt

TX2 Shoes, Socks, Tie, Belt, Shirt, Hat

TX3 Shoes, Tie

TX4 Shoes, Socks, Belt

Transaction Shoes Socks Tie Belt Shirt Scarf HatTransaction Shoes Socks Tie Belt Shirt Scarf Hat

1 1 1 1 0 0 0

2 1 1 1 1 1 0 1

3 1 0 1 0 0 0 0

4 1 1 0 1 0 0 0

...

Support is 50% (2/4)

TiS k8

Confidence is 66.67% (2/3)TieSocks

Rule Basic Measures

A B [ ]

Rule Basic Measures

A B [ s, c ]

Support: denotes the frequency of the rule within transactions. A high value means that the rule involves a great part of database.

support(A B [ s, c ]) = p(A ∪ B)

Confidence: denotes the percentage of transactions containing A which also contain B It is an estimation of conditioned probabilitywhich also contain B. It is an estimation of conditioned probability .

confidence(A B [ s, c ]) = p(B|A) = sup(A,B)/sup(A).

9

ctio

nstr

ansa

Rule C => D support

f d /10

confidence /

ApplicationsApplications

“Baskets” = documents;

“items” = words in those documents.

Lets us find words that appear together unusually frequently, i.e., linked concepts.

Word 1 Word 2 Word 3 Word 4

Doc 1 1 0 1 1

Doc 2 0 0 1 1

Doc 3 1 1 1 0

Word 4 => Word 3

11

When word 4 occurs in a document there a big probability of word 3 occurring


“Baskets” = sentences, Baskets sentences,

“items” = documents containing those sentences.

Items that appear together too often could represent plagiarism.p g

Doc 1 Doc 2 Doc 3 Doc 4

Sent 1 1 0 1 1

Sent 2 0 0 1 1

Sent 3 1 1 1 0

Doc 4 => Doc 3

12

When a sentence occurs in document 4 there is a big probability of occurring in document 3


“Baskets” = Web pages; p g ;

“items” = linked pages.

Pairs of pages with many common references may be about the same topic.

“Baskets” = Web pages pi ;

“items” = pages that link to pi

Pages with many of the same links may be mirrors or about Pages with many of the same links may be mirrors or about the same topic. wp a wp b wp c wp d

wp1

13

wp1

wp2

Example Definitions:pItemset:

A B B E F

Trans. Id Purchased Items

1 A D

Definitions:

A,B or B,E,F

Support of an itemset:

1 A,D

2 A,C

3 A B C Sup(A,B)=1

Sup(A,C)=2

3 A,B,C

4 B,E,F

Frequent pattern:

Given min. sup=2, {A,C} is a f t tt

For minimum support = 50% and minimum confidence = 50%, we have the

frequent pattern

following rules

A => C with 50% support and 66% confidence

14

C => A with 50% support and 100% confidence

15Randall Matignon 2007, Data Mining Using SAS Enterprise Miner, Wiley (book).

The support probability of each rule is identified by the color of the symbols and the confidence probability of each rule is identified by the shape of the symbols. The items that have the highest confidence are

16

Heineken beer, crackers, chicken, and peppers.

Randall Matignon 2007, Data Mining Using SAS Enterprise Miner, Wiley (book).



Apriori Algorithm Apriori Algorithm


Advanced Techniques

17

Boolean association rulesBoolean association rules

Each transaction is converted to a B l tBoolean vector

18

An ExampleAn Example

Transaction ID Items Bought Min. support 50%Transaction ID Items Bought2000 A,B,C1000 A,C F t It t S t

ppMin. confidence 50%

,4000 A,D5000 B,E,F

Frequent Itemset Support{A} 75%{B} 50%{ }{C} 50%

{A,C} 50%

For rule A C :

support = support({A , C }) = 50%

confidence = support({A , C }) / support({A }) = 66.6%

19

Finding Association RulesFinding Association Rules

Approach:pp

1) Find all itemsets that have high support

- These are known as frequent itemsets

2) Generate association rules from frequent itemsets2) Generate association rules from frequent itemsets

20

Mining Frequent Itemsets (the Key Step)Mining Frequent Itemsets (the Key Step)

Find the frequent itemsets: the sets of items that have (at least) minimum support

A subset of a frequent itemset must also be a frequent itemsetitemset

Generate length (k+1) candidate itemsets from length k f t it t dk frequent itemsets, and

Test the candidates against DB to determine which are in fact frequent

21

Apriori principleApriori principle

Any subset of a frequent itemset must be frequent

A t ti t i i {b di t } l t i A transaction containing {beer, diaper, nuts} also contains {beer, diaper}

{beer, diaper, nuts} is frequent {beer, diaper} must also be frequent

22

Apriori principleApriori principle

No superset of any infrequent itemset should begenerated or testedgenerated or tested

Many item combinations can be prunedy p

23

Itemset Latticenull

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDEABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

24

ABCDE

Apriori principle for pruning candidatesp p p p g

If i i i fnull

If an itemset is infrequent, then all of its supersets must also be infrequent

A B C D E

AB AC AD AE BC BD BE CD CE DE

Found to be Infrequent ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDEPruned supersets

25

ABCDE

How to Generate Candidates?

The items in Lk-1 are listed in an orderk 1

Step 1: self-joining Lk-1

insert into Ck

select p.item1, p.item2, …, p.itemk-1, q.itemk-1

from Lk-1 p, Lk-1 q

where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1

A D EA D E

A D F= = <

26

The Apriori Algorithm — Example

Database D itemset sup.{1} 2 itemset supC1 L

Min. support: 2 transactions

TID Items100 1 3 4200 2 3 5

{1} 2{2} 3{3} 3

itemset sup.{1} 2{2} 3Scan D

1 L1

300 1 2 3 5400 2 5

{4} 1{5} 3

{3} 3{5} 3

Scan D

itemset{1 2}{1 3}

itemset sup{1 2} 1{1 3} 2

itemset sup{1 3} 2

L2

C2 C2

{1 3}{1 5}{2 3}{2 5}

{1 3} 2{1 5} 1{2 3} 2

{1 3} 2{2 3} 2{2 5} 3

Scan D

{2 5}{3 5}

{2 5} 3{3 5} 2

{ }{3 5} 2

C Litemset itemset sup

27

C3 L3itemset{2 3 5}

Scan D itemset sup{2 3 5} 2

How to Generate Candidates?

Step 2: pruning

for all itemsets c in Ck dok

for all (k-1)-subsets s of c do

if (s is not in L ) then delete c from Cif (s is not in Lk-1) then delete c from Ck

A D E FA D E F

28

Example of Generating CandidatesExample of Generating Candidates

L = {abc abd acd ace bcd} L3 = {abc, abd, acd, ace, bcd}

Self-joining: L3*L3 Self joining: L3 L3

joining abc and abd gives abcd joining abc and abd gives abcd

joining acd and ace gives acde

29

Example of Generating CandidatesExample of Generating Candidates

L3 = {abc, abd, acd, ace, bcd} L3 {abc, abd, acd, ace, bcd}

Pruning (before counting its support): Pruning (before counting its support):

abcd: check if abc, abd, bcd are in L3 abcd: check if abc, abd, bcd are in L3

acde: check if acd, ace, ade are in L3, ,

acde is removed because ade is not in L3

Thus C4={abcd}

30

4

The Apriori Algorithm Ck: Candidate itemset of size k Lk : frequent itemset of size k

Join Step: C is generated by joining L with itself Join Step: Ck is generated by joining Lk-1 with itself

Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a

frequent k-itemset

Algorithm:

L1 = {frequent items};

for (k = 1; Lk !=∅; k++) do beginC = candidates generated from L ;Ck+1 = candidates generated from Lk;for each transaction t in database do

increment the count of all candidates in Ck+1 that are k+1

contained in tLk+1 = candidates in Ck+1 with min_support

31

endreturn L = ∪k Lk;

Counting Support of CandidatesCounting Support of Candidates

Why counting supports of candidates a problem? Why counting supports of candidates a problem? The total number of candidates can be very large

One transaction may contain many candidates

Method: Candidate itemsets are stored in a hash-tree Candidate itemsets are stored in a hash-tree

Match candidates to transactions supporting it

32

Counting Support of Candidate ItemsetsCounting Support of Candidate Itemsets Frequent Itemsets are found by counting candidates

Naive way:

Search for each candidate in each transaction Expensive!!! Search for each candidate in each transaction Expensive!!!

Candidates Count

A B 0A B 1A B 1A B 2Transactions

A B C D

A B 0

A C 0

A D 01A D

A C

A B

1

1

1A D

A C

A B

2

1

2A D

A C

A B

2

2

Naïve approachrequires O(NM)comparisons

M

A C E

B C D

A E 0

B C 0

B D 0

0A E

B D

B C

1

1

1A E

B D

B C

1

1

Reduce the number

2A E

B D

B C

4

3

comparisons

N MA B D E

B C E

B D 0

A B E 0

B C D 0

A B D E 0B C D

A B E

B D

1

0

1

B C D

A B E

B D

1

0

1 educe t e u beof comparisons(NM) by using hashtables to store theB C D

A B E

B D

2

2

4

33

C

B DA B D E 0

A B C D E 0A B C D E

A B D E

0

0

A B C D E

A B D E

0

0 candidate itemsetsA B C D E

A B D E

0

1

Reducing Number of ComparisonsReducing Number of Comparisons

Store the candidate itemsets in a hash structure

Hash each transaction to its appropriate bucketsHash each transaction to its appropriate buckets

Match transaction only to candidates stored in the hashed bucket(s)bucket(s)

Transactions

TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

34

Example

I: itemset{cucumber parsley onion tomato salt bread olives cheese butter}{cucumber, parsley, onion, tomato, salt, bread, olives, cheese, butter}

D: set of transactions1 {{cucumber parsley onion tomato salt bread}1 {{cucumber, parsley, onion, tomato, salt, bread},

2 {tomato, cucumber, parsley},

3 {tomato, cucumber, olives, onion, parsley},

4 {tomato, cucumber, onion, bread},

5 {tomato, salt, onion},

6 {b d h }6 {bread, cheese}

7 {tomato, cheese, cucumber}

8 {bread, butter}}{ , }}

35from: http://www.cs.bilkent.edu.tr/~guvenir/courses/CS558/Association%20Rules.ppt

For Support = 30% -> at least 3 transactionsL = {{bread} {cucumber} {onion} {parsley} {tomato}}L1 = {{bread}, {cucumber}, {onion}, {parsley}, {tomato}}

C2 = {{bread cucumber} {bread onion} {bread parsley}C2 {{bread, cucumber}, {bread, onion}, {bread, parsley}, {bread, tomato}, {cucumber, onion}, {cucumber, parsley},{cucumber, tomato}, {onion, parsley},{ , }, { , p y},{onion, tomato}, {parsley, tomato}}

L2 = {{cucumber, onion}, {cucumber, parsley}, {cucumber, tomato}, {onion, tomato}, {parsley, tomato}}

L3 = {{cucumber, parsley, tomato}}L = L ∪ L ∪ LL = L1 ∪ L2 ∪ L3


C2 = {{bread, cucumber}, {bread, onion}, {bread, parsley}, {bread tomato} {cucumber onion} {cucumber parsley}{bread, tomato}, {cucumber, onion}, {cucumber, parsley},{cucumber, tomato}, {onion, parsley},{onion tomato} {parsley tomato}}

The hash-tree constructed for C2:

{onion, tomato}, {parsley, tomato}}

2

root|_______________________|__________________________

| | | |bread cucumber onion parsley

| | | |{bread, cucumber} {cucumber, onion} {onion, parsley} {parsley, tomato}

{bread, onion} {cucumber, parsley} {onion, tomato}{bread, parsley} {cucumber, tomato}{bread, tomato}

For any itemset c contained in transaction t the first item of c must be in t


For any itemset c contained in transaction t, the first item of c must be in t.At root, by hashing on every item in t, we ensure that we only ignore itemsets that start with an item not in t.

Generating AR from frequent intemsetsGenerating AR from frequent intemsets

Confidence (AB) = P(B|A) = B})nt({As pport co Confidence (AB) = P(B|A) = unt({A})support_co

B})unt({A,support_co

For every frequent itemset x, generate all non-empty subsets of xsubsets of x

For every non-empty subset s of x, output the rule

“ s (x-s) ” if

min confunt({x})support_co ≥

38

min_confunt({s})support_co

≥

Rule GenerationRule Generation

How to efficiently generate rules from frequent y g qitemsets?

I l fid d t h ti t In general, confidence does not have an anti-monotone property

But confidence of rules generated from the same itemset has an anti-monotone property

L = {A,B,C,D}:

c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

Confidence is non-increasing as number of items in rule

39

Confidence is non increasing as number of items in rule consequent increases

From Frequent Itemsets to Association RulesFrom Frequent Itemsets to Association Rules

Q: Given frequent set {A B E} what are Q: Given frequent set {A,B,E}, what are possible association rules? A => B, E

A, B => E

A, E => B

B => A, E

B, E => A

E => A, B

40

__ => A,B,E (empty rule), or true => A,B,E

Generating Rules: exampleg p

Trans-ID Items Min_support: 60%

1 ACD2 BCE

Min_confidence: 75%

3 ABCE4 BE5 ABCE Frequent Itemset Support

{BCE},{AC} 60%{BC} {CE} {A} 60%R le Conf {BC},{CE},{A} 60%{BE},{B},{C},{E} 80%

Rule Conf.{BC} =>{E} 100%{BE} =>{C} 75%{BE} {C} 75%{CE} =>{B} 100%{B} =>{CE} 75%

41

{C} =>{BE} 75%{E} =>{BC} 75%

ExerciceExerciceTID Items1 Bread Milk Chips Mustard1 Bread, Milk, Chips, Mustard2 Beer, Diaper, Bread, Eggs3 Beer, Coke, Diaper, Milk

Converta os dados para o formato booleano e para um suporte de 40% aplique o

4 Beer, Bread, Diaper, Milk, Chips5 Coke, Bread, Diaper, Milk6 B B d Di Milk M t d

suporte de 40%, aplique o algoritmo apriori.

6 Beer, Bread, Diaper, Milk, Mustard7 Coke, Bread, Diaper, Milk

Bread Milk Chips Mustard Beer Diaper Eggs Coke1 1 1 1 0 0 0 01 0 0 0 1 1 1 00 1 0 0 1 1 0 11 1 1 0 1 1 0 01 1 0 0 0 1 0 11 1 0 1 1 1 0 0

42

1 1 0 1 1 1 0 01 1 0 0 0 1 0 1

0.4*7= 2.8

C1 L1 C2 L2Bread 6 Bread 6 Bread,Milk 5 Bread,Milk 5Milk 6 Milk 6 Bread Beer 3 Bread Beer 3Milk 6 Milk 6 Bread,Beer 3 Bread,Beer 3Chips 2 Beer 4 Bread,Diaper 5 Bread,Diaper 5Mustard 2 Diaper 6 Bread,Coke 2 Milk,Beer 3Beer 4 Coke 3 Milk,Beer 3 Milk,Diaper 5, , pDiaper 6 Milk,Diaper 5 Milk,Coke 3Eggs 1 Milk,Coke 3 Beer,Diaper 4Coke 3 Beer,Diaper 4 Diaper,Coke 3

Beer,Coke 1Diaper,Coke 3

C3 L3C3 L3Bread,Milk,Beer 2 Bread,Milk,Diaper 4Bread,Milk,Diaper 4 Bread,Beer,Diaper 3Bread Beer Diaper 3 Milk Beer Diaper 3Bread,Beer,Diaper 3 Milk,Beer,Diaper 3Milk,Beer,Diaper 3 Milk,Diaper,Coke 3Milk,Beer,CokeMilk,Diaper,Coke 3

43

, p ,8 8

2 38 92 24+ + = >>C C





Advanced Techniques

44

Interestingness Measurementsg

Are all of the strong association rules discovered interesting

enough to present to the user?

How can we measure the interestingness of a rule? How can we measure the interestingness of a rule?

Subjective measures

A rule (pattern) is interesting if

• it is unexpected (surprising to the user); and/or

• actionable (the user can do something with it)

(only the user can judge the interestingness of a rule)

45

• (only the user can judge the interestingness of a rule)

Objective measures of rule interestObjective measures of rule interest

Support Support

Confidence or strengthg

Lift or Interest or Correlation

Conviction

L Pi k Sh i Leverage or Piatetsky-Shapiro

Coverage Coverage

46

Criticism to Support and Confidencepp

Example 1: (Aggarwal & Yu, PODS98)

Among 5000 students

3000 play basketball

basketball not basketball sum(row)cereal 2000 1750 3750 75%

not cereal 1000 250 1250 2 5%

3750 eat cereal

2000 both play basket

not cereal 1000 250 1250 2 5%

sum(col.) 3000 2000 50006 0 % 4 0 %

p yball and eat cereal

play basketball eat cereal [40% 66 7%]play basketball eat cereal [40%, 66.7%]

misleading because the overall percentage of students eating cereal is 75% which is higher than 66 7%is higher than 66.7%.

play basketball not eat cereal [20%, 33.3%]

47

is more accurate, although with lower support and confidence

Lift of a Rule

Lift (Correlation, Interest)

)|()( ABpBAsup)(

)|()()(

),()(Bp

ABpBA

BABALIFT ==→

supsupsup

A and B negatively correlated, if the value is less than 1;

otherwise A and B positively correlatedotherwise A and B positively correlated

X 1 1 1 1 0 0 0 0Y 1 1 0 0 0 0 0 0

rule Support LiftX Y 25% 2.00XZ 37 50% 0 86

48

Z 0 1 1 1 1 1 1 1XZ 37.50% 0.86YZ 12.50% 0.57

Lift of a RuleLift of a Rule

Example 1 (cont)p

play basketball eat cereal [40%, 66.7%] 89037503000

50002000

.==LIFTplay basketball eat cereal [40%, 66.7%]

50003750

50003000 ×

play basketball not eat cereal [20%, 33.3%] 33112503000

50001000

.==LIFTplay basketball not eat cereal [20%, 33.3%]

basketball not basketball sum(row)

50001250

50003000 ×

cereal 2000 1750 3750

not cereal 1000 250 1250

49

sum(col.) 3000 2000 5000

Problems With LiftProblems With Lift

R les that hold 100% of the time ma not ha e the highest Rules that hold 100% of the time may not have the highest

possible lift. For example, if 5% of people are Vietnam veterans

d f h l h ld l f fand 90% of the people are more than 5 years old, we get a lift of

0.05/(0.05*0.9)=1.11 which is only slightly above 1 for the rule

Vietnam veterans -> more than 5 years old.

A d lif i i And, lift is symmetric:

not eat cereal play basketball [20%, 80%] p y

33150001000

LIFT

50

331

50003000

50001250

5000 .=×

=LIFT

Conviction of a RuleConviction of a Rule

1sup( ) sup( ) ( ) ( ) ( )( ( ))( ) A B P A P B P A P BC A B ⋅ ⋅ −1sup( ) sup( ) ( ) ( ) ( )( ( ))( )( ) ( , )sup( , ) ( , )

A B P A P B P A P BConv A BP A P A BA B P A B

→ = = =−

Conviction is a measure of the implication and has value 1 if items are unrelated.

play basketball eat cereal [40%, 66.7%] eat cereal play basketball conv:0.85

750

50002000

50003000

50003750

150003000

.=−

−

=Conv

play basketball not eat cereal [20%, 33.3%]not eat cereal pla basketball con 1 43

12511000300050001250

150003000

.=

−

=Conv

51

not eat cereal play basketball conv:1.43

50001000

50003000 −

ConvictionConviction

conviction of X=>Y can be interpreted as the conviction of X=>Y can be interpreted as the

ratio of the expected frequency that X occurs without Y (that is to p q y (say, the frequency that the rule makes an incorrect prediction) if X and Y were independent

divided by the observed frequency of incorrect predictions.

A conviction value of 1.2 shows that the rule would be incorrect 20% more often (1.2 times as often) if the association between X20% more often (1.2 times as often) if the association between X and Y was purely random chance.

52

Leverage of a RuleLeverage of a Rule

Leverage or Piatetsky Shapiro Leverage or Piatetsky-Shapiro

P ( ) sup( , ) sup( ) sup( )S A B A B A B→ = − ⋅

PS ( L ) PS (or Leverage):

is the proportion of additional elements covered by both theis the proportion of additional elements covered by both the

premise and consequence above the expected if independent.

53

Coverage of a RuleCoverage of a Rule

( ) s p( )A B Acoverage( ) sup( )A B A→ =

54

CommentComment

Traditional methods such as database queries: Traditional methods such as database queries:

support hypothesis verification about a relationship such as the co-occurrence of diapers & beer.

Data Mining methods automatically discover significant l f dassociations rules from data.

Find whatever patterns exist in the database, without the p ,user having to specify in advance what to look for (data driven).

55

Therefore allow finding unexpected correlations

APRIORI EXTENSIONS

56

Challenges of Frequent Pattern MiningChallenges of Frequent Pattern Mining

Challenges Challenges

Multiple scans of transaction database

Huge number of candidates Huge number of candidates

Tedious workload of support counting for candidates

Improving Apriori: general ideas

Reduce number of transaction database scans Reduce number of transaction database scans

Shrink number of candidates

Facilitate support counting of candidates Facilitate support counting of candidates

57

Improving Apriori’s EfficiencyImproving Apriori s Efficiency

Problem with Apriori: every pass goes over whole data. Problem with Apriori: every pass goes over whole data.

AprioriTID: Generates candidates as apriori but DB is used for

l h fcounting support only on the first pass.

Needs much more memory than Apriori

Builds a storage set C^k that stores in memory the frequent sets per

transaction

AprioriHybrid: Use Apriori in initial passes; Estimate the size of

C^k; Switch to AprioriTid when C^k is expected to fit in memoryk; p k p y

The switch takes time, but it is still better in most cases

58

ItemsTID Set-of-itemsetsTID SupportItemset

C^1 L1ItemsTID

1 3 4100

2 3 5200

Set of itemsetsTID

{ {1},{3},{4} }100

{ {2},{3},{5} }200

SupportItemset

2{1}

3{2}

1 2 3 5300

2 5400{ {1},{2},{3},{5} }300

{ {2},{5} }400

3{3}

3{5}

itemsetSI

L2Set-of-itemsetsTID

C2 C^2

{1 2}

{1 3}

{1 5}

SupportItemset

2{1 3}

3{2 3}

{ {1 3} }100

{ {2 3},{2 5} {3 5} }200

{ {1 2} {1 3} {1 5}300{1 5}

{2 3}

{2 5}

3{2 3}

3{2 5}

2{3 5}

{ {1 2},{1 3},{1 5},

{2 3}, {2 5}, {3 5} }

300

{ {2 5} }400{3 5}

SupportItemset

{ { } }

C^3 L3C3 Set-of-itemsetsTID

59

itemset

{2 3 5}

SupportItemset

2{2 3 5}{ {2 3 5} }200

{ {2 3 5} }300

Improving Apriori’s EfficiencyImproving Apriori s Efficiency

Transaction reduction: A transaction that does not Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scansscans

Sampling: mining on a subset of given data.

The sample should fit in memory

Use lower support threshold to reduce the probability of pp p ymissing some itemsets.

The rest of the DB is used to determine the actual itemset

60

The rest of the DB is used to determine the actual itemset count.

Improving Apriori’s Efficiency

Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB (2 DB scans)

(support in a partition is lowered to be proportional to the number of elements in (support in a partition is lowered to be proportional to the number of elements in the partition)

Phase I

Divide D Fi d f t Combine Fi d l b l

Phase IPhase II

Trans.in D

Divide D into nNon-

overlappin

Find frequent itemsets local

to each partition

Combine results to

form a global set of

did t

Find global frequent itemsets among

Freq. itemsets

in Dgpartitions

partition(parallel alg.) candidate

itemsets

among candidates

in D

61

Improving Apriori’s Efficiencyp g p y

Dynamic itemset counting: partitions the DB into several blocks each marked by a start point.

At each start point, DIC estimates the support of all itemsets that are currently

counted and adds new itemsets to the set of candidate itemsets if all its subsets

are estimated to be frequent.

If DIC adds all frequent itemsets to the set of candidate itemsets during the first

ll h d h ’ d hscan, it will have counted each itemset’s exact support at some point during the

second scan;

h C l i thus DIC can complete in two scans.

62

Association Rules VisualizationAssociation Rules Visualization

The coloured column indicates the association rule B→C. Diff t i l d t h diff t t d t l f

63

Different icon colours are used to show different metadata values of the association rule.


64

Association Rules VisualizationAssociation Rules VisualizationAssociation Rules VisualizationAssociation Rules Visualization

Size of ball equates to total support

HeightHeight equates to confidence

65

Association Rules Visualization Association Rules Visualization -- Ball graphBall graph

66

The Ball graph ExplainedThe Ball graph Explained

A ball graph consists of a set of nodes and arrows. All the nodes are yellow, green or blue. The g p y , gblue nodes are active nodes representing the items in the rule in which the user is interested. The yellow nodes are passive representing items related to the active nodes in some way. The green nodes merely assist in visualizing two or more items in either the head or the body of the rule.

A circular node represents a frequent (large) data item. The volume of the ball represents the support of the item. Only those items which occur sufficiently frequently are shown

A b t t d t th l i li ti b t th t it A An arrow between two nodes represents the rule implication between the two items. An arrow will be drawn only when the support of a rule is no less than the minimum support

67


68




FP-tree Algorithm

Additional Measures of rule interestingness

Advanced Techniques

69

Multiple-Level Association RulespFoodStuff

Frozen Refrigerated Fresh Bakery Etc...

Vegetable Fruit Dairy Etc....

Banana Apple Orange Etc...

Fresh Bakery [20%, 60%]

Dairy Bread [6%, 50%]

Fruit Bread [1%, 50%] is not valid

Items often form hierarchy.

70

Flexible support settings: Items at the lower level are expected to have lower support.Transaction database can be encoded based on dimensions and levels explore shared multi-level mining

Multi-Dimensional Association RulesMulti Dimensional Association Rules

Single-dimensional rules:

buys(X, “milk”) buys(X, “bread”)

Multi-dimensional rules: ≥ 2 dimensions or predicates Multi dimensional rules: ≥ 2 dimensions or predicates

Inter-dimension association rules (no repeated predicates)

(X ”19 25”) ti (X “ t d t”) b (X “ k ”) age(X,”19-25”) ∧ occupation(X,“student”) buys(X,“coke”)

hybrid-dimension association rules (repeated predicates)

age(X,”19-25”) ∧ buys(X, “popcorn”) buys(X, “coke”

71

lQuantitative Association Rules

age(X,”30-34”) ∧ income(X,”24K - 48K”) buys(X,”high resolution TV”)

Mining Sequential Patterns

10% of customers bought

“Foundation” and “Ringworld” in one transaction,

followed by

“Ri ld E i ” in another transaction

72

“Ringworld Engineers” in another transaction.

Application DifficultiesApplication Difficulties

Wal Mart knows that customers who buy Barbie dolls (it Wal-Mart knows that customers who buy Barbie dolls (it sells one every 20 seconds) have a 60% likelihood of b i f h f d b Wh dbuying one of three types of candy bars. What does Wal-Mart do with information like that?

'I don't have a clue,' says Wal-Mart's chief of merchandising Lee Scottmerchandising, Lee Scott.

See - KDnuggets 98:01 for many ideas kd t / /98/ 01 ht l

73

www.kdnuggets.com/news/98/n01.html

Some SuggestionsSome Suggestions

By increasing the price of Barbie doll and giving the type of candy bar free, wal-mart By increasing the price of Barbie doll and giving the type of candy bar free, wal mart can reinforce the buying habits of that particular types of buyer

Highest margin candy to be placed near dolls.

Special promotions for Barbie dolls with candy at a slightly higher margin.

Take a poorly selling product X and incorporate an offer on this which is based on buying Barbie and Candy. If the customer is likely to buy these two products anyway then why not try to increase sales on X?

Probably they can not only bundle candy of type A with Barbie dolls but can also Probably they can not only bundle candy of type A with Barbie dolls, but can also introduce new candy of Type N in this bundle while offering discount on whole bundle. As bundle is going to sell because of Barbie dolls & candy of type A, candy of type N can get free ride to customers houses And with the fact that you like something if you seeget free ride to customers houses. And with the fact that you like something, if you see it often, Candy of type N can become popular.

74

ReferencesReferences

Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 2000

Vipin Kumar and Mahesh Joshi, “Tutorial on High Performance Data Mining ”, 1999g

Rakesh Agrawal, Ramakrishnan Srikan, “Fast Algorithms f Mi i A i i R l ” P VLDB 1994for Mining Association Rules”, Proc VLDB, 1994(http://www.cs.tau.ac.il/~fiat/dmsem03/Fast%20Algorithms%20for%20Mining%20Association%20Rules.ppt)

Alípio Jorge, “selecção de regras: medidas de interesse e meta queries”,

75

q ,(http://www.liacc.up.pt/~amjorge/Aulas/madsad/ecd2/ecd2_Aulas_AR_3_2003.pdf)

Documents

Mining Association Rules Apriori Algorithmpaginas.fe.up.pt/~ec/files_1011/week 04 - Association...The Apriori Algorithm Ck: Candidate itemset of size k Lk: frequent itemset of size