8/10/2019 Sequence Databases and Sequential Pattern Analysis http://slidepdf.com/reader/full/sequence-databases-and-sequential-pattern-analysis 1/55  Sequential Pattern Analysis  ar gung [email protected] Departemen Ilmu Komputer FMIPA IPB November 17, 2007 Data Mining: Concepts and Techniques 1

Sequence Databases and Sequential Pattern Analysis


 Sequential Pattern Analysis

 ar [email protected]

Departemen Ilmu Komputer FMIPA IPB

November 17, 2007

Data Mining: Concepts and Techniques


 

Sequence Database:

Object  Timestamp  Events
…       …          …
A       20         6, 1
A       23         1
B       11         4, 5, 6
B       21         7, 8, 1, 2
B       28         1, 6
C       14         1, 8, 7


Formal Definition of a Sequence

A sequence is an ordered list of elements (transactions):

  s = <e1 e2 e3 …>

Each element contains a collection of events (items):

  ei = {i1, i2, …, ik}

Each element is attributed to a specific time or location.

The length of a sequence, |s|, is the number of elements in the sequence.

A k-sequence is a sequence that contains k events (items).
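The difference between the length |s| (number of elements) and k (number of events) is easy to mix up. The following is a minimal sketch; the list-of-frozensets representation and the helper names `length` and `k_of` are assumptions made for illustration, not part of the slides.

```python
# A sequence is an ordered list of elements; each element is a set of events.
# Representation (an assumption for illustration): a list of frozensets.
s = [frozenset({"a", "b"}), frozenset({"c", "d", "e"}),
     frozenset({"f"}), frozenset({"g", "h", "i"})]


def length(seq):
    """|s|: the number of elements (transactions) in the sequence."""
    return len(seq)


def k_of(seq):
    """k: the total number of events (items), so seq is a k-sequence."""
    return sum(len(element) for element in seq)


print(length(s))  # number of elements
print(k_of(s))    # number of events
```

Here the sequence has 4 elements but 9 events, so it has length 4 and is a 9-sequence.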


Examples of Sequence Data

Sequence database: …
  Sequence: … customer
  Element (transaction): … a customer at time t
  Event (item): …, CDs, etc.

Sequence database: Web data
  Sequence: Browsing activity of a …
  Element (transaction): A collection of files … after a single mouse click
  Event (item): Home page, index …

Sequence database: Event data
  Sequence: History of events generated by a given sensor
  Element (transaction): Events triggered by a sensor at time t
  Event (item): Types of alarms generated by sensors

Sequence database: Genome sequences
  Sequence: DNA sequence of a particular species
  Element (transaction): An element of the DNA sequence
  Event (item): Bases A, T, G, C

[Figure: a sequence shown as the ordered elements (transactions) e1 e2 e3 e4 e5, each element holding one or more events (items) such as i1, i2, i3.]


Examples of Sequence

Web sequence:
  <{Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping}>

Sequence of initiating events at Three Mile Island
(http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm):
  <{condenser polisher outlet valve shut} {booster pumps trip} {main waterpump trips} {main turbine trips} {reactor pressure increases}>

Sequence of books checked out at a library:
  <{Fellowship of the Ring} {The Two Towers} {Return of the King}>


Formal Definition of a Subsequence

A sequence A = <a1 a2 … an> is contained in another sequence B = <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin. A is then a subsequence of B.

Data sequence            Subsequence    Contain?
<{2,4} {3,5,6} {8}>      <{2} {3,5}>    Yes
<{1,2} {3,4}>            <{1} {2}>      No
<{2,4} {2,4} {2,5}>      <{2} {4}>      Yes

The support of a subsequence w is defined as the fraction of data sequences that contain w.

A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup).
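The containment test and the support definition can be sketched as follows. This is an illustrative sketch only: the function names `contains` and `support` are assumptions, and the small database `db` simply reuses the three data sequences from the table as a hypothetical example.

```python
def contains(data_seq, pattern):
    """True if `pattern` is a subsequence of `data_seq` (elements are sets)."""
    pos = 0
    for wanted in pattern:
        # Greedily take the earliest remaining element that is a superset.
        while pos < len(data_seq) and not set(wanted) <= set(data_seq[pos]):
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1  # the next pattern element must match a strictly later element
    return True


def support(database, pattern):
    """Fraction of data sequences that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database) / len(database)


# Hypothetical mini database built from the three rows of the table.
db = [
    [{2, 4}, {3, 5, 6}, {8}],
    [{1, 2}, {3, 4}],
    [{2, 4}, {2, 4}, {2, 5}],
]
print(contains(db[0], [{2}, {3, 5}]))  # True
print(contains(db[1], [{1}, {2}]))     # False
print(contains(db[2], [{2}, {4}]))     # True
print(support(db, [{2}, {4}]))
```

Matching each pattern element to the earliest possible data element is sufficient here because no timing constraints are involved.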


Sequential Pattern Mining: Definition

Given:
  a database of sequences
  a user-specified minimum support threshold, minsup

Task:
  Find all subsequences with support ≥ minsup


 

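Sequential Pattern Mining: Challenge

Given a sequence: <{a b} {c d e} {f} {g h i}>
  Examples of subsequences: <{a} {c d} {f} {g}>, <{c d e}>, <{b} {g}>, etc.

How many k-subsequences can be extracted from a given n-sequence?

  <{a b} {c d e} {f} {g h i}>   n = 9
  k = 4:  Y _ _ Y Y _ _ _ Y   gives   <{a} {d e} {i}>

Answer:

  $\binom{n}{k} = \binom{9}{4} = \frac{6 \cdot 7 \cdot 8 \cdot 9}{1 \cdot 2 \cdot 3 \cdot 4} = 126$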
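The count can be checked by brute force. The sketch below enumerates every way of keeping k of the n events and regroups the kept events by their original element; the helper name `k_subsequences` and the tuple representation are assumptions for illustration.

```python
from itertools import combinations
from math import comb

sequence = [("a", "b"), ("c", "d", "e"), ("f",), ("g", "h", "i")]
# Flatten to (element index, event) pairs: n = 9 events in total.
events = [(i, x) for i, element in enumerate(sequence) for x in element]


def k_subsequences(events, k):
    """Yield every k-subsequence by choosing k of the n events."""
    for chosen in combinations(events, k):
        grouped = {}
        for idx, x in chosen:
            grouped.setdefault(idx, []).append(x)
        # Elements that lost all their events disappear from the subsequence.
        yield tuple(tuple(items) for _, items in sorted(grouped.items()))


subs = list(k_subsequences(events, 4))
print(len(subs), comb(9, 4))                 # both 126
print((("a",), ("d", "e"), ("i",)) in subs)  # the Y _ _ Y Y _ _ _ Y choice
```

Because all nine events are distinct, each choice yields a different subsequence, which is why the count equals the binomial coefficient exactly.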


Challenges on Sequential Pattern Mining

A huge number of possible sequential patterns are hidden in databases.

A mining algorithm should
  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  be highly efficient, scalable, involving only a small number of database scans
  be able to incorporate various kinds of user-specific constraints


A Basic Property of Sequential Patterns: Apriori

A basic property: Apriori (Agrawal & Srikant ’94)
  If a sequence S is not frequent, then none of the super-sequences of S is frequent.
  E.g., <hb> is infrequent, so <hab> and <(ah)b> are infrequent too.

Given support threshold min_sup = 2:

Seq. ID  Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>


GSP: A Generalized Sequential Pattern Mining Algorithm

GSP (Generalized Sequential Pattern) mining algorithm
  proposed by Agrawal and Srikant, EDBT'96

Outline of the method
  Initially, every item in DB is a candidate of length-1
  for each level (i.e., sequences of length-k) do
    scan database to collect support count for each candidate sequence
    generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
  repeat until no frequent sequence or no candidate can be found

Major strength: Candidate pruning by Apriori


Finding Length-1 Sequential Patterns

Examine GSP using an example

Initial candidates: all singleton sequences
  <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>

Scan database once, count support for candidates (min_sup = 2):

Cand  Sup
<a>   3
<b>   5
<c>   4
<d>   3
<e>   3
<f>   2
<g>   1
<h>   1

Seq. ID  Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>
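The first scan can be sketched in a few lines. The dictionary layout of `db` (sequence ID mapped to a list of item tuples) is an assumption chosen for illustration; the data is the five-sequence example database.

```python
from collections import Counter

# The five-sequence example database; each element is a tuple of items.
db = {
    10: [("b", "d"), ("c",), ("b",), ("a", "c")],
    20: [("b", "f"), ("c", "e"), ("b",), ("f", "g")],
    30: [("a", "h"), ("b", "f"), ("a",), ("b",), ("f",)],
    40: [("b", "e"), ("c", "e"), ("d",)],
    50: [("a",), ("b", "d"), ("b",), ("c",), ("b",), ("a", "d", "e")],
}
min_sup = 2

support = Counter()
for seq in db.values():
    # Count each item at most once per data sequence.
    support.update({item for element in seq for item in element})

frequent = sorted(item for item, n in support.items() if n >= min_sup)
print(dict(sorted(support.items())))
print(frequent)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Items g and h each occur in only one sequence, so they fall below min_sup = 2 and are dropped before length-2 candidates are formed.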


Generating Length-2 Candidates

51 length-2 candidates

Candidates of the form <xy> (two elements):

       <a>   <b>   <c>   <d>   <e>   <f>
<a>    <aa>  <ab>  <ac>  <ad>  <ae>  <af>
<b>    <ba>  <bb>  <bc>  <bd>  <be>  <bf>
<c>    <ca>  <cb>  <cc>  <cd>  <ce>  <cf>
<d>    <da>  <db>  <dc>  <dd>  <de>  <df>
<e>    <ea>  <eb>  <ec>  <ed>  <ee>  <ef>
<f>    <fa>  <fb>  <fc>  <fd>  <fe>  <ff>

Candidates of the form <(xy)> (one element):

       <a>  <b>     <c>     <d>     <e>     <f>
<a>         <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>                 <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                         <(cd)>  <(ce)>  <(cf)>
<d>                                 <(de)>  <(df)>
<e>                                         <(ef)>
<f>

Without Apriori property, 8*8 + 8*7/2 = 92 candidates

Apriori prunes 44.57% candidates
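The candidate counts on this slide can be reproduced with a short sketch. The function name `length2_candidates` and the nested-tuple representation are assumptions for illustration.

```python
from itertools import combinations, product


def length2_candidates(items):
    """All <xy> (two elements) plus all <(xy)> (one element, x < y)."""
    sequential = [((x,), (y,)) for x, y in product(items, repeat=2)]
    itemsets = [((x, y),) for x, y in combinations(sorted(items), 2)]
    return sequential + itemsets


frequent_items = list("abcdef")   # the 6 length-1 sequential patterns
all_items = list("abcdefgh")      # all 8 items, ignoring Apriori

with_apriori = len(length2_candidates(frequent_items))
without_apriori = len(length2_candidates(all_items))
pruned = 1 - with_apriori / without_apriori
print(with_apriori, without_apriori, f"{pruned:.2%}")
```

With 6 frequent items there are 6*6 + 6*5/2 = 51 candidates, against 92 when all 8 items are used, which is the 44.57% reduction quoted on the slide.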


Finding Length-2 Sequential Patterns

Scan database one more time, collect support count for each length-2 candidate

There are 19 length-2 candidates which pass the minimum support threshold
  They are length-2 sequential patterns


Support Count Length-2 Candidate

[Figure: the <xy> and <(xy)> length-2 candidate tables over items a to f, marked by support count (the labels 0, 1, 3, 4 appear), shown beside the example database (Seq. ID 10 to 50) with min_sup = 2.]


Length-2 Sequential Patterns

[Figure: the same <xy> and <(xy)> length-2 candidate tables over items a to f, marked by support count (the labels 0, 1, 3, 4 appear), shown beside the example database (Seq. ID 10 to 50) with min_sup = 2.]


Generating Length-3 Candidates and Finding Length-3 Patterns

Generate Length-3 Candidates
  Self-join length-2 sequential patterns, based on the Apriori property
    <ab>, <aa> and <ba> are all length-2 sequential patterns, so <aba> is a length-3 candidate
    <(bd)>, <bb> and <db> are all length-2 sequential patterns, so <(bd)b> is a length-3 candidate
  46 candidates are generated

Find Length-3 Sequential Patterns
  Scan database once more, collect support counts for candidates
  19 out of 46 candidates pass support threshold
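The Apriori check behind these two examples can be sketched as follows: a length-3 candidate survives only if every sequence obtained by deleting one of its items is already a known sequential pattern. The names `drop_one_item` and `survives_pruning` and the nested-tuple representation are assumptions for illustration; this shows the pruning test only, not the self-join that proposes the candidates.

```python
def drop_one_item(pattern):
    """All (k-1)-subsequences obtained by deleting one item from `pattern`."""
    result = set()
    for i, element in enumerate(pattern):
        for j in range(len(element)):
            shorter = element[:j] + element[j + 1:]
            # An element that becomes empty disappears from the sequence.
            middle = (shorter,) if shorter else ()
            result.add(pattern[:i] + middle + pattern[i + 1:])
    return result


def survives_pruning(candidate, frequent_shorter):
    """Apriori check: every (k-1)-subsequence must already be frequent."""
    return drop_one_item(candidate) <= frequent_shorter


aba = (("a",), ("b",), ("a",))   # <aba>
bd_b = (("b", "d"), ("b",))      # <(bd)b>
print(drop_one_item(aba))        # the set {<ba>, <aa>, <ab>}, in arbitrary order
print(drop_one_item(bd_b))       # the set {<db>, <bb>, <(bd)>}, in arbitrary order
```

If any one of the three shorter sequences were missing from the length-2 patterns, the candidate could be discarded without scanning the database.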


Support Count Length-3 Candidate

[Figure: a grid of length-3 candidates formed from the length-2 patterns <aa>, <ab>, <ba>, <bb>, <bc>, <bd>, marked by support count (the labels 0, 1, 3, 4 appear), shown beside the example database (Seq. ID 10 to 50) with min_sup = 2.]


The GSP Mining Process

min_sup = 2

1st scan: 8 cand., 6 length-1 seq. pat.
  <a> <b> <c> <d> <e> <f> <g> <h>
2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all
  <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all
  <abb> <aab> <aba> <baa> <bab> …
4th scan: 8 cand., 6 length-4 seq. pat.
  <abba> <(bd)bc> …
5th scan: 1 cand., 1 length-5 seq. pat.
  <(bd)cba>

Legend: cand. cannot pass sup. threshold; cand. not in DB at all

Seq. ID  Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>


The GSP Algorithm

Take sequences in form of <x> as length-1 candidates
Scan database once, find F1, the set of length-1 sequential patterns
Let k=1; while Fk is not empty do
  Form Ck+1, the set of length-(k+1) candidates from Fk;
  If Ck+1 is not empty, scan database once, find Fk+1, the set of length-(k+1) sequential patterns
  Let k=k+1;
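The level-wise loop above can be sketched in Python. One caveat is stated loudly: the candidate generator `extend` below is a simplified stand-in (it grows each frequent pattern by one frequent item) and is not the GSP join, so it proposes more candidates than GSP would. All names (`contains`, `extend`, `mine`) and the data layout are assumptions for illustration.

```python
def contains(seq, pattern):
    """True if `pattern` is a subsequence of `seq` (greedy earliest match)."""
    pos = 0
    for wanted in pattern:
        while pos < len(seq) and not set(wanted) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def extend(pattern, items):
    """Simplified candidate generator (NOT the GSP join): grow by one item."""
    last = pattern[-1]
    for x in items:
        yield pattern + ((x,),)                  # x starts a new element
        if x > last[-1]:
            yield pattern[:-1] + (last + (x,),)  # x joins the last element


def mine(db, min_sup):
    items = sorted({x for seq in db for element in seq for x in element})
    candidates = [((x,),) for x in items]        # C1: all singleton sequences
    frequent_items, patterns = [], {}
    while candidates:                            # one database scan per level
        level = {}
        for c in candidates:
            count = sum(contains(seq, c) for seq in db)
            if count >= min_sup:
                level[c] = count                 # Fk with support counts
        if not frequent_items:
            frequent_items = [p[0][0] for p in level]
        patterns.update(level)
        candidates = [c for p in level for c in extend(p, frequent_items)]
    return patterns


db = [
    [("b", "d"), ("c",), ("b",), ("a", "c")],
    [("b", "f"), ("c", "e"), ("b",), ("f", "g")],
    [("a", "h"), ("b", "f"), ("a",), ("b",), ("f",)],
    [("b", "e"), ("c", "e"), ("d",)],
    [("a",), ("b", "d"), ("b",), ("c",), ("b",), ("a", "d", "e")],
]
patterns = mine(db, min_sup=2)
print(patterns[(("b", "d"), ("c",), ("b",), ("a",))])  # support of <(bd)cba>
```

The loop stops when a level produces no frequent pattern, mirroring "while Fk is not empty".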


Bottlenecks of GSP

A huge set of candidates could be generated
  1,000 frequent length-1 sequences generate

    $1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$

  length-2 candidates!

Multiple scans of database in mining

Real challenge: mining long sequential patterns
  An exponential number of short candidates
  A length-100 sequential pattern needs

    $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$

  candidate sequences!
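Both figures on this slide are plain arithmetic and can be checked directly. This is a small verification sketch; the variable names are illustrative.

```python
from math import comb

n = 1000
# n*n candidates of the form <xy> plus n*(n-1)/2 of the form <(xy)>.
length2 = n * n + n * (n - 1) // 2
print(f"{length2:,}")  # 1,499,500

# Number of non-empty sub-patterns of a length-100 pattern.
short_candidates = sum(comb(100, i) for i in range(1, 101))
print(short_candidates == 2**100 - 1)  # True
print(f"{short_candidates:.2e}")
```

The second sum is the number of non-empty subsets of 100 items, which is why it collapses to 2^100 - 1.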


FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining

A divide-and-conquer approach
  Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns
  Mine each projected database to find its patterns

J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, FreeSpan: Frequent pattern-projected sequential pattern mining. In KDD’00.

Sequence Database SDB
  <(bd) c b (ac)>
  <(bf) (ce) b (fg)>
  <(ah) (bf) a b f>
  <(be) (ce) d>
  <a (bd) b c b (ade)>

f_list: b:5, c:4, a:3, d:3, e:3, f:2

All seq. pat. can be divided into 6 subsets:
  Those containing item f
  Those containing e but no f
  Those containing d but no e nor f
  Those containing a but no d, e or f
  Those containing c but no a, d, e or f
  Those containing only item b


From FreeSpan to PrefixSpan: Why?

FreeSpan:
  Projection-based: no candidate sequence needs to be generated
  But projection can be performed at any point in the sequence, and the projected sequences will not shrink much

PrefixSpan:
  Projection-based
  But only prefix-based projection: fewer projections and quickly shrinking sequences


Prefix and Suffix (Projection)

<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>

Given sequence <a(abc)(ac)d(cf)>:

Prefix  Suffix (Prefix-Based Projection)
<a>     <(abc)(ac)d(cf)>
<aa>    <(_bc)(ac)d(cf)>
<ab>    <(_c)(ac)d(cf)>
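The suffixes in the table can be reproduced with a small sketch. It is deliberately limited: `project` only handles prefixes whose elements are single items (such as <a>, <aa>, <ab>), it assumes items inside an element are kept in alphabetical order, and both `project` and `show` are hypothetical helper names.

```python
def project(sequence, prefix_items):
    """Suffix of `sequence` w.r.t. a prefix of single-item elements."""
    start = 0     # index of the first element still fully available
    rest = None   # leftover items of the element matched by the last prefix item
    for item in prefix_items:
        for i in range(start, len(sequence)):
            if item in sequence[i]:
                rest = tuple(x for x in sequence[i] if x > item)
                start = i + 1
                break
        else:
            return None  # prefix does not occur in the sequence
    suffix = [("_",) + rest] if rest else []
    return suffix + list(sequence[start:])


def show(seq):
    """Render in the slide notation, e.g. <(_bc)(ac)d(cf)>."""
    return "<" + "".join(
        e[0] if len(e) == 1 else "(" + "".join(e) + ")" for e in seq) + ">"


s = [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")]
print(show(project(s, "a")))   # <(abc)(ac)d(cf)>
print(show(project(s, "aa")))  # <(_bc)(ac)d(cf)>
print(show(project(s, "ab")))  # <(_c)(ac)d(cf)>
```

The underscore marks items that share an element with the last item of the prefix, which is what distinguishes <(_bc)…> from <(bc)…>.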


Mining Sequential Patterns by Prefix Projections

Step 1: find length-1 sequential patterns
  <a>, <b>, <c>, <d>, <e>, <f>

Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
  The ones having prefix <a>;
  The ones having prefix <b>;
  …
  The ones having prefix <f>

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>


Finding Seq. Patterns with Prefix <a>

Only need to consider projections w.r.t. <a>
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  Further partition into 6 subsets
    Having prefix <aa>;
    …
    Having prefix <af>

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>


Completeness of PrefixSpan

SDB

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>

  Having prefix <a>: <a>-projected database
    <(abc)(ac)d(cf)>
    <(_d)c(bc)(ae)>
    <(_b)(df)cb>
    <(_f)cbc>

    Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>

      Having prefix <aa>: <aa>-proj. db …
      …
      Having prefix <af>: <af>-proj. db …

  Having prefix <b>: <b>-projected database …

  Having prefix <c>, …, <f>: …


Efficiency of PrefixSpan

No candidate sequence needs to be generated

Projected databases keep shrinking

Major cost of PrefixSpan: constructing projected databases
  Can be improved by bi-level projections


 

Physical projection vs. pseudo-projection
  Pseudo-projection may reduce the effort of projection when the projected database fits in main memory

Parallel projection vs. partition projection
  Partition projection may avoid the blowup of disk space


Speed-up by Pseudo-projection

Postfixes of sequences often appear repeatedly in recursive projected databases

When (projected) database can be held in main memory, use pointers to form projections
  Pointer to the sequence
  Offset of the postfix

s = <a(abc)(ac)d(cf)>
  s|<a>:  ( , 2)   <(abc)(ac)d(cf)>
  s|<ab>: ( , 4)   <(_c)(ac)d(cf)>
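A pseudo-projection stores only a reference to the original sequence and an offset instead of copying the postfix. The sketch below assumes the offset is the 1-based position, in the flattened item list, of the first item of the postfix; that convention is inferred from the values 2 and 4 on the slide. The function name `pseudo_project` and the string key used as the "pointer" are hypothetical, and only prefixes made of single-item elements are handled.

```python
def pseudo_project(sequence, prefix_items):
    """Return the 1-based item offset where the postfix starts, or None."""
    starts, total = [], 0
    for element in sequence:      # item offset at which each element begins
        starts.append(total)
        total += len(element)
    first_free, consumed = 0, 0
    for item in prefix_items:
        for i in range(first_free, len(sequence)):
            if item in sequence[i]:
                consumed = starts[i] + sequence[i].index(item) + 1
                first_free = i + 1
                break
        else:
            return None           # prefix does not occur in the sequence
    return consumed + 1


s = [("a",), ("a", "b", "c"), ("a", "c"), ("d",), ("c", "f")]
database = {"s": s}               # the key "s" stands in for a pointer
pointer_a = ("s", pseudo_project(database["s"], "a"))
pointer_ab = ("s", pseudo_project(database["s"], "ab"))
print(pointer_a, pointer_ab)      # ('s', 2) ('s', 4)
```

Each projected entry is then a constant-size pair regardless of how long the postfix is, which is where the memory saving comes from.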


Pseudo-Projection vs. Physical Projection

Pseudo-projection avoids physically copying postfixes
  Efficient in running time and space when database can be held in main memory
  However, it is not efficient when database cannot fit in main memory
    Disk-based random accessing is very costly

Suggested approach:
  Integration of physical and pseudo-projection
  Swapping to pseudo-projection when the data set fits in memory


PrefixSpan Is Faster than GSP and FreeSpan

[Figure: runtime (seconds) versus support threshold (%), from 0.00 to 3.00, with curves for PrefixSpan-2, FreeSpan and GSP.]


Effect of Pseudo-Projection

[Figure: runtime (seconds) versus support threshold (%), from 0.20 to 0.60, with curves for PrefixSpan-1, PrefixSpan-2, PrefixSpan-1 (Pseudo) and PrefixSpan-2 (Pseudo).]


Compact Representation of Frequent Itemsets

Some itemsets are redundant because they have identical support

TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10

1   1 1 1 1 1 1 1 1 1 1   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

2   1 1 1 1 1 1 1 1 1 1   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

3   1 1 1 1 1 1 1 1 1 1   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

4   1 1 1 1 1 1 1 1 1 1   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5   1 1 1 1 1 1 1 1 1 1   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

6 0 0 0 0 0 0 0 0 0 0   1 1 1 1 1 1 1 1 1 1   0 0 0 0 0 0 0 0 0 0

7 0 0 0 0 0 0 0 0 0 0   1 1 1 1 1 1 1 1 1 1   0 0 0 0 0 0 0 0 0 0

8 0 0 0 0 0 0 0 0 0 0   1 1 1 1 1 1 1 1 1 1   0 0 0 0 0 0 0 0 0 0

9 0 0 0 0 0 0 0 0 0 0   1 1 1 1 1 1 1 1 1 1   0 0 0 0 0 0 0 0 0 0

10 0 0 0 0 0 0 0 0 0 0   1 1 1 1 1 1 1 1 1 1   0 0 0 0 0 0 0 0 0 0

12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0   1 1 1 1 1 1 1 1 1 1

13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0   1 1 1 1 1 1 1 1 1 1

14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0   1 1 1 1 1 1 1 1 1 1

15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0   1 1 1 1 1 1 1 1 1 1

Number of frequent itemsets

  $= 3 \times \sum_{k=1}^{10} \binom{10}{k}$

Need a compact representation
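Evaluating the expression shows why a compact representation is needed. The sketch assumes, as the formula implies, that every non-empty subset of each of the three item groups (A1 to A10, B1 to B10, C1 to C10) is frequent; the variable names are illustrative.

```python
from math import comb

groups = 3   # item groups A, B and C
size = 10    # items per group

# Every non-empty subset of one group appears in the same block of transactions.
count = groups * sum(comb(size, k) for k in range(1, size + 1))
print(count)  # 3069
```

All 1,023 subsets within a group share the same support, so a single itemset per group (the full 10-item set) carries the same information.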


An itemset is maximal frequent if none of its immediate supersets is frequent

[Figure: itemset lattice marking the maximal itemsets, the infrequent itemsets and the border between frequent and infrequent itemsets.]


An itemset is closed if none of its immediate supersets has the same support as the itemset

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset  Support
{A}      4
{B}      5
{C}      3
{D}      4
{A,B}    4
{A,C}    2
{A,D}    3
{B,C}    3
{B,D}    4
{C,D}    3

Itemset    Support
{A,B,C}    2
{A,B,D}    3
{A,C,D}    2
{B,C,D}    3
{A,B,C,D}  2
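The definition can be applied mechanically to the five transactions above. This is a brute-force sketch for illustration only (it enumerates every itemset, which is only feasible for tiny examples); the names `support`, `supports` and `is_closed` are assumptions.

```python
from itertools import combinations

transactions = {1: "AB", 2: "BCD", 3: "ABCD", 4: "ABD", 5: "ABCD"}
items = sorted(set("".join(transactions.values())))


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(set(itemset) <= set(t) for t in transactions.values())


supports = {
    combo: support(combo)
    for r in range(1, len(items) + 1)
    for combo in combinations(items, r)
}


def is_closed(itemset):
    """No immediate superset has the same support."""
    return all(
        supports[tuple(sorted(itemset + (x,)))] < supports[itemset]
        for x in items if x not in itemset
    )


closed = ["".join(s) for s in supports if supports[s] > 0 and is_closed(s)]
print(closed)  # ['B', 'AB', 'BD', 'ABD', 'BCD', 'ABCD']
```

For example, {A} is not closed because {A,B} has the same support of 4, whereas {A,B} is closed because both {A,B,C} and {A,B,D} have lower support.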


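TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: itemset lattice from null through A, B, C, D, E; AB, AC, AD, AE, BC, BD, BE, CD, CE, DE; ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE; ABCD, ABCE, ABDE, ACDE, BCDE; to ABCDE. Each node is labeled with the IDs of the transactions that contain it (A: 124, B: 123, C: 1234, D: 245, E: 345; ABCD: 2; ACDE: 4). Some itemsets are not supported by any transactions.]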


Sequential Pattern Mining: Example

Minsup = …

Object  Timestamp  Events
…       …          …
A       2          2, 3
A       3          5
B       1          1, 2
B       2          2, 3, 4
C       1          1, 2
C       2          2, 3, 4
…       …          …
D       1          2
D       2          3, 4
D       3          4, 5
E       1          1, 3
E       2          2, 4, 5

Examples of Frequent Subsequences:

<{2,3}>        s=60%
<{2,4}>        s=80%
<{3} {5}>      s=80%
<{2} {2}>      s=60%
<{1} {2,3}>    s=60%
<{2} {2,3}>    s=60%
…


Given n events: i1, i2, i3, …, in

Candidate 1-subsequences:
  <{i1}>, <{i2}>, <{i3}>, …, <{in}>

Candidate 2-subsequences:
  <{i1, i2}>, <{i1, i3}>, …, <{i1} {i1}>, <{i1} {i2}>, …, <{in-1} {in}>

Candidate 3-subsequences:
  <{i1, i2, i3}>, <{i1, i2, i4}>, …, <{i1, i2} {i1}>, <{i1, i2} {i2}>, …,
  <{i1} {i1, i2}>, <{i1} {i1, i3}>, …, <{i1} {i1} {i1}>, <{i1} {i1} {i2}>, …


Generalized Sequential Pattern (GSP)

Step 1:
  Make the first pass over the sequence database D to yield all the 1-element frequent sequences

Step 2:
  Repeat until no new frequent sequences are found
    Candidate Generation:
      Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items
    Candidate Pruning:
      Prune candidate k-sequences that contain infrequent (k-1)-subsequences
    Support Counting:
      Make a new pass over the sequence database D to find the support for these candidate sequences
    Candidate Elimination:
      Eliminate candidate k-sequences whose actual support is less than minsup


Base case (k=2):
  Merging two frequent 1-sequences <{i1}> and <{i2}> will produce two candidate 2-sequences: <{i1} {i2}> and <{i1 i2}>

General case (k>2):
  A frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2
    The resulting candidate after merging is given by the sequence w1 extended with the last event of w2.
      If the last two events in w2 belong to the same element, then the last event in w2 becomes part of the last element in w1
      Otherwise, the last event in w2 becomes a separate element appended to the end of w1
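The general-case merge rule can be sketched as follows. Sequences are represented as tuples of item tuples with items in sorted order inside each element, so the "first event" is the first item of the first element and the "last event" is the last item of the last element; that representation and the names `drop_first`, `drop_last` and `merge` are assumptions for illustration. The base case k=2, which yields two candidates per pair, is not covered by this sketch.

```python
def drop_first(seq):
    """Remove the first event; an element left empty disappears."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last event; an element left empty disappears."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def merge(w1, w2):
    """Candidate from merging w1 with w2 (general case k > 2), or None."""
    if drop_first(w1) != drop_last(w2):
        return None
    last = w2[-1][-1]
    if len(w2[-1]) > 1:
        # Last two events of w2 share an element: extend w1's last element.
        return w1[:-1] + (w1[-1] + (last,),)
    # Otherwise append the last event of w2 as a separate element.
    return w1 + ((last,),)


w1 = ((1,), (2, 3), (4,))
print(merge(w1, ((2, 3), (4, 5))))     # ((1,), (2, 3), (4, 5))
print(merge(w1, ((2, 3), (4,), (5,)))) # ((1,), (2, 3), (4,), (5,))
print(merge(((1,), (2, 6), (4,)), ((1,), (2,), (4, 5))))  # None
```

Requiring the overlap check before merging is what keeps each candidate from being generated more than once.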


Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4 5}> will produce the candidate sequence <{1} {2 3} {4 5}> because the last two events in w2 (4 and 5) belong to the same element

Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4} {5}> will produce the candidate sequence <{1} {2 3} {4} {5}> because the last two events in w2 (4 and 5) do not belong to the same element

We do not have to merge the sequences w1 = … and w2 = … to produce the candidate <{1} {2 6} {4 5}> because if the latter is a viable candidate, then it can be obtained by merging w1 with <{1} {2 6} {5}>


Timing Constraints (I)

[Figure: pattern {A B} {C} {D E}; the gap between consecutive elements is > ng and <= xg, and the whole pattern spans <= ms.]

xg: max-gap
ng: min-gap
ms: maximum span

xg = 2, ng = 0, ms = 4

Data sequence                           Subsequence      Contain?
<{2,4} {3,5,6} {4,7} {4,5} {8}>         <{6} {5}>        Yes
<{1} {2} {3} {4} {5}>                   <{1} {4}>        No
<{1} {2,3} {3,4} {4,5}>                 <{2} {3} {5}>    Yes
<{1,2} {3} {2,3} {3,4} {2,4} {4,5}>     <{1,2} {5}>      No
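The four rows of the table can be checked with a small backtracking search. The sketch assumes that the i-th element of a data sequence occurs at time i (the slides do not give explicit timestamps for these rows), and the function name `contains_timed` is hypothetical.

```python
def contains_timed(data_seq, pattern, xg, ng, ms):
    """Containment under max-gap xg, min-gap ng and maximum span ms.

    Assumption: element i of data_seq (0-based) occurs at time i + 1.
    """
    def search(k, prev_t, first_t):
        if k == len(pattern):
            return True
        for t, element in enumerate(data_seq, start=1):
            if not set(pattern[k]) <= set(element):
                continue
            if k == 0:
                if search(1, t, t):
                    return True
            elif ng < t - prev_t <= xg and t - first_t <= ms:
                if search(k + 1, t, first_t):
                    return True
        return False

    return search(0, None, None)


rows = [
    ([{2, 4}, {3, 5, 6}, {4, 7}, {4, 5}, {8}], [{6}, {5}]),
    ([{1}, {2}, {3}, {4}, {5}], [{1}, {4}]),
    ([{1}, {2, 3}, {3, 4}, {4, 5}], [{2}, {3}, {5}]),
    ([{1, 2}, {3}, {2, 3}, {3, 4}, {2, 4}, {4, 5}], [{1, 2}, {5}]),
]
for data_seq, pattern in rows:
    print(contains_timed(data_seq, pattern, xg=2, ng=0, ms=4))
# True, False, True, False
```

Unlike the unconstrained case, a greedy earliest match is not enough here, because an early match can violate the max-gap while a later one satisfies it; hence the backtracking.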


Mining Sequential Patterns with Timing Constraints

Approach 1:

Mine sequential patterns without timing constraints

Postprocess the discovered patterns

Approach 2:

Modify GSP to directly prune candidates that violate timing constraints

Question: Does Apriori principle still hold?


Object  Timestamp  Events
…       …          …
A       2          2, 3
A       3          5
B       1          1, 2
B       2          2, 3, 4
C       1          1, 2
C       2          2, 3, 4
…       …          …
D       1          2
D       2          3, 4
D       3          4, 5
E       1          1, 3
E       2          2, 4, 5

Suppose:
  xg = 1 (max-gap)
  ng = 0 (min-gap)
  ms = 5 (maximum span)
  minsup = 60%

<{2} {5}> support = 40%
but
<{2} {3} {5}> support = 60%

Problem exists because of max-gap constraint

No such problem if max-gap is infinite


s is a contiguous subsequence of

  w = <e1 e2 … ek>

if any of the following conditions hold:
  1. s is obtained from w by deleting an item from either e1 or ek
  2. s is obtained from w by deleting an item from any element ei that contains at least 2 items
  3. s is a contiguous subsequence of s' and s' is a contiguous subsequence of w (recursive definition)

Examples: s = <{1} {2}>
  is a contiguous subsequence of <{1} {2 3}>, <{1 2} {2} {3}>, and <{3 4} {1 2} {2 3} {4}>
  is not a contiguous subsequence of <{1} {3} {2}> and <{2} {1} {3} {2}>


Without maxgap constraint:
  A candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent

With maxgap constraint:
  A candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent


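[Figure: pattern {A B} {C} {D E}; events of one element may be spread over a window <= ws, the gap between consecutive elements is > ng and <= xg, and the whole pattern spans <= ms.]

xg: max-gap
ng: min-gap
ws: window size
ms: maximum span

xg = 2, ng = 0, ws = 1, ms = 5

Data sequence                        Subsequence       Contain?
<{2,4} {3,5,6} {4,7} {4,6} {8}>      <{3} {5}>         No
<{1} {2} {3} {4} {5}>                <{1,2} {3}>       Yes
<{1,2} {2,3} {3,4} {4,5}>            <{1,2} {3,4}>     Yes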


Given a candidate pattern: <{a, c}>

Any data sequences that contain
  <… {a c} …>,
  <… {a} … {c} …> (where time({c}) - time({a}) ≤ ws)
  <… {c} … {a} …> (where time({a}) - time({c}) ≤ ws)
will contribute to the support count of candidate pattern


In some domains, we may have only one very long time series
  Example:
    monitoring network traffic events for attacks
    monitoring telecommunication alarm signals

Goal is to find frequent sequences of events in the time series
  This problem is also known as frequent episode mining

[Figure: a single timeline on which events E1, E2, E3, E4 and E5 occur repeatedly.]

Pattern: <E1> <E3>

General Support Counting Schemes


Assume:

xg = 2 (max-gap)

ng = 0 (min-gap)

ws = 0 (window size)

ms = 2 (maximum span)