Transcript
Page 1: Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center

Efficiently Mining Long Patterns from Databases

Roberto J. Bayardo Jr.

IBM Almaden Research Center

Page 2: Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center

Abstract

• Max-Miner : scale roughly linearly in the number of maximal patterns, irrespective of the length of the longest pattern

• previous algorithms : scale exponentially with longest pattern length

Page 3: Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center

Introduction (Cont’d)

• pattern mining algorithms have been developed to operate on databases where the longest patterns are relatively short

• Interesting data-sets with long patterns– sales transactions detailing the purchases made

by regular customers over a large time window– biological data from the fields of DNA and

protein analysis

Page 4: Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center

Introduction (Cont’d)

• Apriori-like algorithms are inadequate on data-sets with long patterns

• a bottom-up search– enumerates every single frequent itemset– exponential complexity

• in order to produce a frequent itemset of length l, it must produce all 2l of its subsets since they too must be frequent

– restricts Apriori-like algorithms to discovering only short patterns

Page 5: Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center

Introduction

• Max-Miner algorithm– for efficiently extracting only the maximal freq

uent itemsets– roughly linear in the number of maximal freque

nt itemsets– “look ahead” , not bottom-up search– can prune all its subsets from consideration, by

identifying a long frequent itemset early on

Page 6: Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center

Max-Miner (Cont’d)

• Rymon’s generic set-enumeration tree search frame work

• ex. Figure 1.• breadth-first search

– in order to limit the number of passes

• pruning strategies– subset infrequency pruning (as does Apriori)– superset frequency pruning

Page 7: Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center

Figure 1. Complete set-enumeration tree over four items

{ }

1 2 3 4

1,2 1,3 1,4 2,3 2,4 3,4

1,2,3 1,3,4 2,3,4

1,2,3,4

Page 8: Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center

Max-Miner (Cont’d)

• candidate group g,– head, h(g)

• represents the itemsets enumerated by the node

– tail, t(g) • an ordered set

• contains all items not in h(g) that can potentially appear in any sub-node

– ex. the node enumerating itemset {1} h(g) = {1}, t(g) = {2, 3, 4}

Page 9: Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center

Max-Miner

• counting the support of a candidate group g,– computing the support of itemsets h(g), h

(g) t(g) and h(g) {i} for all i t(g)

• superset-frequency pruning– halting sub-node expansion at any candidate gr

oup g for which h(g) t(g) is frequent

• subset-infrequency pruning– removing any such tail item from candidate gro

up before expanding its sub-nodes

Page 10: Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center

Figure 2. Max-Miner at its top level

MAX-MINER (Data-set T) ;; Returns the set of maximal frequent itemsets present in T Set of Candidate Groups C { } Set of Itemsets F {GEN-INITIAL-GROUP(T, C)} while C is non-empty do

scan T to count the support of all candidate groups in Cfor eachfor each g g C such that h(g) C such that h(g) t(g) is frequent t(g) is frequent dodo

F F F F {h(g) {h(g) t(g)} t(g)}Set of Candidate Groups Cnew { }for each g C such that h(g) t(g) is infrequent do

F F {GEN-SUB-NODES(g, Cnew)} C Cnew

remove from F any itemset with a proper superset in Fremove from C any group g such that h(g) t(g) has a

superset in F return F

Page 11: Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center

Figure 3. Generating the initial candidate groups

GEN-INITIAL-GROUPS (Data-set T, Set of Candidate Groups C)

;; C is passed by reference and returns the candidate groups

;; The return value of the function is a frequent 1-itemset

scan T to obtain F1, the set of frequent 1-itemsets

impose an ordering on the items in F1

for each item i in F1 other than the greatest item do

let g be a new candidate with h(g) = {i}

and t(g) = {j | j follows i in the ordering}

C C {g}

return the itemset in F1 containing the greatest item

Page 12: Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center

Figure 4. Generating sub-nodes

GEN-SUB-NODES (Candidate Group g, Set of Cand. Groups C) ;; C is passed by reference and returns the sub-nodes of g ;; The return value of the function is a frequent itemset remove any item i from t(g) if h(g) remove any item i from t(g) if h(g) {i} is infrequent {i} is infrequent reorder the items in t(g) for each i t(g) other than greatest do

let g’ be a new candidate with h(g’) = h(g) {i} and t(g’) = { j | j t(g) and j follows i in t(g)}

C C {g’} return h(g) {m} where m is the greatest item in t(g),

or h(g) if t(g) is empty

Page 13: Efficiently Mining Long Patterns from Databases Roberto J. Bayardo Jr. IBM Almaden Research Center

Item Ordering Policies

• to increase the effectiveness of superset-frequency pruning

• to position the most frequent items last• ordering : Gen-Initial-Group

– in increasing order of sup({i})

• reordering : Gen-Sub-Nodes– in increasing order of sup(h(g) {i})

• consider only the subset of transactions relevant to the given node