Frequent Pattern Mining - Simon Fraser University

Frequent Pattern Mining

How Many Words Is a Picture Worth?

Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 2

E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013

Burnt or Burned?

E. Aiden and J-B Michel: Uncharted. Riverhead Books, 2013

Store Layout Design

http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

Transaction Data

•  Alphabet: a set of items
    –  Example: all products sold in a store
•  A transaction: a set of items involved in an activity
    –  Example: the items purchased by a customer in a visit
•  Other information is often associated
    –  Timestamp, price, salesperson, customer-id, store-id, …

Examples of Transaction Data

How to Store Transaction Data?

•  Transaction-id (t123, a, b, c) (t236, b, d)
•  Relational storage
•  Transaction-based storage
•  Item-based (vertical) storage
    –  Item a: …, t123, …
    –  Item b: …, t123, …, t236, …
    –  …

Tid    Item
t123   a
t123   b
t123   c
…      …
t236   b
t236   d
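The three storage layouts can be compared on the two transactions shown on the slide. This is a minimal sketch, assuming the database holds only t123 and t236 (the slide elides the others); the variable names `horizontal`, `relational`, and `vertical` are illustrative, not part of the course material.

```python
from collections import defaultdict

# Transaction-based storage: one record per transaction.
horizontal = {"t123": ["a", "b", "c"], "t236": ["b", "d"]}

# Relational storage: one (Tid, Item) row per item in a transaction.
relational = [(tid, item) for tid, items in horizontal.items() for item in items]

# Item-based (vertical) storage: for each item, the ids of the
# transactions that contain it.
vertical = defaultdict(list)
for tid, item in relational:
    vertical[item].append(tid)

print(relational)
print(dict(vertical))
```

The vertical layout makes the support of a single item the length of its id list, while the transaction-based layout keeps each basket together for subset tests.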

Transaction Data Analysis

•  Transactions: customers’ purchases of commodities
    –  {bread, milk, cheese} if they are bought together

•  Frequent patterns: product combinations that are frequently purchased together by customers

•  Frequent patterns: patterns (set of items, sequence, etc.) that occur frequently in a database [AIS93]

Why Frequent Patterns?

•  What products were often purchased together?

•  What are the frequent subsequent purchases after buying an iPod?

•  What kinds of genes are sensitive to this new drug?

•  What key-word combinations are frequently associated with web pages about game-evaluation?

Why Frequent Pattern Mining?

•  Foundation for many data mining tasks
    –  Association rules, correlation, causality, sequential patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cube, …
•  Broad applications
    –  Basket data analysis, cross-marketing, catalog design, sale campaign analysis, web log (click stream) analysis, …

Frequent Itemsets

•  Itemset: a set of items
    –  E.g., acm = {a, c, m}
•  Support of itemsets
    –  Sup(acm) = 3
•  Given min_sup = 3, acm is a frequent pattern
•  Frequent pattern mining: finding all frequent patterns in a database

Transaction database TDB

TID   Items bought
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n
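The support count on this slide can be reproduced directly. The following is a minimal sketch, assuming each item is a single lowercase letter so that a transaction can be written as a string; the function name `sup` mirrors the slide's notation but is otherwise hypothetical.

```python
# Transaction database TDB from the slide, one set of items per TID.
TDB = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}

def sup(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db.values() if items <= t)

min_sup = 3
support_acm = sup("acm", TDB)
print(support_acm, support_acm >= min_sup)
```

Transactions 100, 200 and 500 contain a, c and m, so acm meets the threshold.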

A Naïve Attempt

•  Generate all possible itemsets, test their supports against the database
•  How to hold a large number of itemsets in main memory?
    –  100 items → $2^{100} - 1$ possible itemsets
•  How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions?
    –  A transaction of length 20 needs to update the support of $2^{20} - 1$ = 1,048,575 itemsets
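The two counts above follow from the fact that a set of n items has $2^n - 1$ non-empty subsets. A small sketch that checks the closed form against explicit enumeration for a tiny case; the helper names are illustrative only.

```python
from itertools import combinations

def nonempty_subsets(n):
    """Number of non-empty subsets of an n-item set."""
    return 2 ** n - 1

def enumerate_subsets(items):
    """Explicitly list every non-empty subset (feasible only for tiny inputs)."""
    return [s for r in range(1, len(items) + 1) for s in combinations(items, r)]

print(nonempty_subsets(20))    # itemsets touched by one transaction of length 20
print(nonempty_subsets(100))   # candidate itemsets over 100 items
print(len(enumerate_subsets("abcd")) == nonempty_subsets(4))
```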

Transactions in Real Applications

•  A large department store often carries more than 100 thousand different kinds of items
    –  Amazon.com carries more than 17,000 books relevant to data mining
•  Walmart has more than 20 million transactions per day, AT&T produces more than 275 million calls per day
•  Mining large transaction databases of many items is a real demand

How to Get an Efficient Method?

•  Reducing the number of itemsets that need to be checked

•  Checking the supports of selected itemsets efficiently

Candidate Generation & Test

•  Any subset of a frequent itemset must also be frequent (an anti-monotonic property)
    –  A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
    –  {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent
•  In other words, any superset of an infrequent itemset must also be infrequent
    –  No superset of any infrequent itemset should be generated or tested
    –  Many item combinations can be pruned!
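The anti-monotonic property can be checked on a toy database. This is only a sketch: the four transactions below are invented for illustration and do not come from the slides, and `support` is a hypothetical helper.

```python
from itertools import combinations

# Hypothetical toy transactions, invented for this illustration.
transactions = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "nuts", "milk"},
    {"diaper", "milk"},
]

def support(itemset, db):
    """Number of transactions containing every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db if items <= t)

big = {"beer", "diaper", "nuts"}

# Every non-empty proper subset of `big` occurs in at least as many
# transactions as `big` itself, because each transaction containing
# `big` also contains the subset.
holds = all(
    support(sub, transactions) >= support(big, transactions)
    for r in range(1, len(big))
    for sub in combinations(sorted(big), r)
)
print(support(big, transactions), support({"beer", "diaper"}, transactions), holds)
```

Read in the other direction, once {beer, diaper} falls below min_sup, no itemset containing it needs to be counted.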

Apriori-Based Mining

•  Generate length (k+1) candidate itemsets from length k frequent itemsets, and

•  Test the candidates against DB

The Apriori Algorithm [AgSr94]

Database D (Min_sup = 2)

TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Scan D → 1-candidates

Itemset   Sup
a         2
b         3
c         3
d         1
e         3

Freq 1-itemsets

Itemset   Sup
a         2
b         3
c         3
e         3

2-candidates

Itemset: ab, ac, ae, bc, be, ce

Scan D → Counting

Itemset   Sup
ab        1
ac        2
ae        1
bc        2
be        3
ce        2

Freq 2-itemsets

Itemset   Sup
ac        2
bc        2
be        3
ce        2

3-candidates

Itemset: bce

Scan D → Freq 3-itemsets

Itemset   Sup
bce       2

The Apriori Algorithm

Level-wise, candidate generation and test
•  C_k: candidate itemsets of size k
•  L_k: frequent itemsets of size k

•  L_1 = {frequent items};
•  for (k = 1; L_k != ∅; k++) do
    –  C_{k+1} = candidates generated from L_k;   [candidate generation]
    –  for each transaction t in database do increment the count of all candidates in C_{k+1} that are contained in t   [test]
    –  L_{k+1} = candidates in C_{k+1} with min_support
•  return ∪_k L_k;
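The level-wise loop can be sketched in Python and run on the four-transaction database D from the worked example (Min_sup = 2). This is a simplified illustration, not the course's reference implementation: candidates are formed by taking unions of pairs of frequent k-itemsets rather than by the ordered self-join, and supports are counted by a plain scan instead of a hash-tree. The function name `apriori` and its signature are assumptions.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {frozenset: support} for every frequent itemset in db."""
    # L1: count single items in one scan.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)
    k = 1
    while level:
        # Candidate generation: (k+1)-itemsets built from two frequent
        # k-itemsets, kept only if all of their k-subsets are frequent.
        candidates = set()
        for a, b in combinations(level, 2):
            c = a | b
            if len(c) == k + 1 and all(
                frozenset(s) in level for s in combinations(c, k)
            ):
                candidates.add(c)
        # Test: one scan of the database to count each candidate.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

D = [set("acd"), set("bce"), set("abce"), set("be")]
result = apriori(D, 2)
for itemset, count in sorted(result.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print("".join(sorted(itemset)), count)
```

On D this yields the frequent itemsets a, b, c, e, ac, bc, be, ce and bce, matching the tables in the worked example.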

Important Steps in Apriori

•  How to find frequent 1- and 2-itemsets?
•  How to generate candidates?
    –  Step 1: self-joining L_k
    –  Step 2: pruning
•  How to count supports of candidates?

Finding Frequent 1- & 2-itemsets

•  Finding frequent 1-itemsets (i.e., frequent items) using a one-dimensional array
    –  Initialize c[item] = 0 for each item
    –  For each transaction T, for each item in T, c[item]++;
    –  If c[item] >= min_sup, item is frequent
•  Finding frequent 2-itemsets using a 2-dimensional triangle matrix
    –  For items i, j (i < j), c[i, j] is the count of itemset ij
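The one-dimensional counting array for frequent items might look as follows. A minimal sketch, assuming the items are known in advance and mapped to array positions through a hypothetical `index` dictionary; the database is D from the Apriori worked example with min_sup = 2.

```python
items = ["a", "b", "c", "d", "e"]
index = {item: pos for pos, item in enumerate(items)}  # item -> array slot

D = [set("acd"), set("bce"), set("abce"), set("be")]
min_sup = 2

# Initialize c[item] = 0 for each item.
c = [0] * len(items)

# For each transaction T, for each item in T, c[item]++.
for T in D:
    for item in T:
        c[index[item]] += 1

# An item is frequent if c[item] >= min_sup.
frequent_items = [item for item in items if c[index[item]] >= min_sup]
print(c, frequent_items)
```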

Counting Array

•  A 2-dimensional triangle matrix can be implemented using a 1-dimensional array

        1   2   3   4   5
    1       1   2   3   4
    2           5   6   7
    3               8   9
    4                  10
    5

There are n items
For items i, j (i < j), $c[i, j] = c[(i-1)(2n-i)/2 + j - i]$
Example: $c[3, 5] = c[(3-1)(2 \cdot 5 - 3)/2 + 5 - 3] = c[9]$

1-dimensional array positions: 1 2 3 4 5 6 7 8 9 10
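The index formula can be checked in code and then used to count 2-itemsets. This is a sketch under stated assumptions: items are numbered 1 to n, slot 0 of the array is left unused so that positions stay 1-based as on the slide, and the names `tri_index` and `item_no` are hypothetical. The data is database D from the Apriori worked example.

```python
from itertools import combinations

def tri_index(i, j, n):
    """1-based position of pair (i, j), with 1 <= i < j <= n, in the flat array."""
    return (i - 1) * (2 * n - i) // 2 + j - i

n = 5
print(tri_index(3, 5, n))  # the slide's example, c[3, 5]

# Positions of all pairs in row-major order of the triangle.
positions = [tri_index(i, j, n) for i in range(1, n) for j in range(i + 1, n + 1)]
print(positions)

# Count 2-itemsets of D in the flat array.
item_no = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
D = [set("acd"), set("bce"), set("abce"), set("be")]
c = [0] * (n * (n - 1) // 2 + 1)  # slot 0 unused
for T in D:
    for x, y in combinations(sorted(item_no[item] for item in T), 2):
        c[tri_index(x, y, n)] += 1
print(c[tri_index(item_no["b"], item_no["e"], n)])  # count of itemset be
```

The product (i-1)(2n-i) is always even, so the integer division is exact.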

Example of Candidate-generation

•  L3 = {abc, abd, acd, ace, bcd}
•  Self-joining: L3 * L3
    –  abcd ← abc * abd
    –  acde ← acd * ace
•  Pruning:
    –  acde is removed because ade is not in L3
•  C4 = {abcd}

How to Generate Candidates?

•  Suppose the items in L_{k-1} are listed in an order
•  Step 1: self-join L_{k-1}

    INSERT INTO C_k
    SELECT p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    FROM L_{k-1} p, L_{k-1} q
    WHERE p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

•  Step 2: pruning
    –  For each itemset c in C_k do
        •  For each (k-1)-subset s of c do: if (s is not in L_{k-1}) then delete c from C_k
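The two steps can be sketched in Python, with each itemset represented as a sorted tuple so that "the first k-2 items" is a tuple prefix. The function name `apriori_gen` and the tuple representation are assumptions made for this illustration; the input is the L3 from the candidate-generation example.

```python
def apriori_gen(L_prev):
    """Generate C_k from L_{k-1}, given as a collection of sorted tuples."""
    L_prev = set(L_prev)
    # Step 1: self-join. Pair itemsets that agree on their first k-2 items
    # and order them by last item so each candidate is produced once.
    joined = {
        p + (q[-1],)
        for p in L_prev
        for q in L_prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    }
    # Step 2: pruning. Drop a candidate if any (k-1)-subset is not in L_{k-1}.
    return {
        c for c in joined
        if all(c[:i] + c[i + 1:] in L_prev for i in range(len(c)))
    }

L3 = {tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
C4 = apriori_gen(L3)
print(sorted("".join(c) for c in C4))
```

The join produces abcd and acde; acde is pruned because some of its 3-subsets (ade, cde) are not in L3, leaving only abcd.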

How to Count Supports?

•  Why is counting supports of candidates a problem?
    –  The total number of candidates can be huge
    –  One transaction may contain many candidates
•  Method
    –  Candidate itemsets are stored in a hash-tree
    –  A leaf node of the hash-tree contains a list of itemsets and counts
    –  An interior node contains a hash table
    –  Subset function: finds all the candidates contained in a transaction

Example: Counting Supports

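[Figure: hash-tree of candidate 3-itemsets. Each interior node hashes an item into one of three branches: {1, 4, 7}, {2, 5, 8}, {3, 6, 9}. The leaves hold the candidates 124, 125, 136, 145, 159, 234, 345, 356, 357, 367, 368, 457, 458, 567, 689.]

Transaction: 1 2 3 5 6

Subset function: the transaction is expanded as 1 + 2 3 5 6, and then further as 1 2 + 3 5 6 and 1 3 + 5 6.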
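A small hash-tree with a subset function can be sketched as follows. This is an illustration under stated assumptions rather than a reconstruction of the figure: the hash function (item - 1) mod 3 reproduces the three branches {1, 4, 7}, {2, 5, 8}, {3, 6, 9}, but the leaf capacity of 3 is a guess, so the tree shape may differ from the slide. The names `Node`, `bucket`, `insert`, and `subset` are hypothetical.

```python
class Node:
    def __init__(self):
        self.is_leaf = True
        self.itemsets = []   # candidates stored here while the node is a leaf
        self.children = {}   # bucket -> Node once the node becomes interior

def bucket(item):
    return (item - 1) % 3    # 1,4,7 -> 0; 2,5,8 -> 1; 3,6,9 -> 2

def insert(node, itemset, depth=0, max_leaf=3):
    """Insert a sorted candidate tuple; split a full leaf by hashing item[depth]."""
    if not node.is_leaf:
        child = node.children.setdefault(bucket(itemset[depth]), Node())
        insert(child, itemset, depth + 1, max_leaf)
        return
    node.itemsets.append(itemset)
    if len(node.itemsets) > max_leaf and depth < len(itemset):
        node.is_leaf = False
        pending, node.itemsets = node.itemsets, []
        for s in pending:
            child = node.children.setdefault(bucket(s[depth]), Node())
            insert(child, s, depth + 1, max_leaf)

def subset(node, transaction, start=0, found=None):
    """Collect all stored candidates contained in the sorted transaction."""
    if found is None:
        found = set()
    if node.is_leaf:
        items = set(transaction)
        found.update(s for s in node.itemsets if items.issuperset(s))
        return found
    # Hash on each remaining item and descend with the items after it,
    # e.g. 1 + 2 3 5 6, then 1 2 + 3 5 6, 1 3 + 5 6, ...
    for pos in range(start, len(transaction)):
        child = node.children.get(bucket(transaction[pos]))
        if child is not None:
            subset(child, transaction, pos + 1, found)
    return found

candidates = [(1, 2, 4), (1, 2, 5), (1, 3, 6), (1, 4, 5), (1, 5, 9),
              (2, 3, 4), (3, 4, 5), (3, 5, 6), (3, 5, 7), (3, 6, 7),
              (3, 6, 8), (4, 5, 7), (4, 5, 8), (5, 6, 7), (6, 8, 9)]
root = Node()
for cand in candidates:
    insert(root, cand)

found = subset(root, (1, 2, 3, 5, 6))
print(sorted(found))
```

Because the same leaf can be reached along several hash paths, matches are collected in a set; a real counter would need a similar guard to avoid incrementing a candidate twice for one transaction.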

Association Rules

•  Rule c → am
•  Support: 3 (i.e., the support of acm)
•  Confidence: 75% (i.e., sup(acm) / sup(c))
•  Given a minimum support threshold and a minimum confidence threshold, find all association rules whose support and confidence pass the thresholds

Transaction database TDB

TID   Items bought
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n
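The support and confidence of the rule c → am can be recomputed from TDB. A minimal sketch, assuming single-letter items written as strings; `sup` and `confidence` are hypothetical helper names.

```python
TDB = [
    set("facdgimp"),   # TID 100
    set("abcflmo"),    # TID 200
    set("bfhjo"),      # TID 300
    set("bcksp"),      # TID 400
    set("afcelpmn"),   # TID 500
]

def sup(itemset, db):
    """Number of transactions containing every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db if items <= t)

def confidence(lhs, rhs, db):
    """Confidence of the rule lhs -> rhs: sup(lhs ∪ rhs) / sup(lhs)."""
    return sup(set(lhs) | set(rhs), db) / sup(lhs, db)

rule_support = sup("acm", TDB)
rule_confidence = confidence("c", "am", TDB)
print(rule_support, rule_confidence)
```

Item c appears in four transactions (100, 200, 400, 500) and three of them also contain a and m, which gives 3 / 4 = 75%.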

To-Do List

•  Read Sections 6.1, 6.2.1 and 6.2.2 in the textbook
•  Understand the concept of frequent itemsets and association rules
•  Understand algorithm Apriori
•  Figure out how to use Weka to mine frequent itemsets

For Thesis-based Students Only

•  Find out in the source code of Weka how transaction data are stored

•  If you are asked to implement Apriori in SQL, what is the major bottleneck? How can you overcome it, or why can it not be overcome?
