Mining Sequential Patterns Rakesh Agrawal Ramakrishnan Srikant Proc. of the Int’l Conference on Data Engineering (ICDE) March 1995 Presenter: Phil Schlosser

Mining Sequential Mining Sequential PatternsPatternsRakesh AgrawalRakesh Agrawal

Ramakrishnan SrikantRamakrishnan SrikantProc. of the Int’l Conference on Proc. of the Int’l Conference on Data Engineering (ICDE) March Data Engineering (ICDE) March

19951995

Presenter: Phil SchlosserPresenter: Phil Schlosser

TopicsTopics

► IntroductionIntroduction►Problem StatementProblem Statement►Finding Sequential Patterns (The Finding Sequential Patterns (The

Algorithm)Algorithm)►Sequence Phase (AprioriAll, Sequence Phase (AprioriAll,

AprioriSome, DynamicSomeAprioriSome, DynamicSome►Conclusion (Performance/FE Conclusion (Performance/FE

Questions/Future work).Questions/Future work).

IntroductionIntroduction

►Bar code data allows us to store Bar code data allows us to store massive amounts of sales data (basket massive amounts of sales data (basket data).data).

►We would like to extract sequential We would like to extract sequential buying patterns.buying patterns.

►E.g. Video rental: Star Wars; Empire E.g. Video rental: Star Wars; Empire Strikes Back; Return of the Jedi.Strikes Back; Return of the Jedi.

►Can include multiple elements.Can include multiple elements.►E.g. Sheet and pillow case; comforter; E.g. Sheet and pillow case; comforter;

drapes.drapes.

Problem StatementProblem Statement

► Given a database of transactions:Given a database of transactions:► Each transaction:Each transaction:

- Customer ID- Customer ID- Transaction Time- Transaction Time- Set of items purchased- Set of items purchased

* No two transactions have same customer ID * No two transactions have same customer ID and transaction timeand transaction time

* Don’t consider quantity of items purchased.* Don’t consider quantity of items purchased. Item was either purchased or not purchasedItem was either purchased or not purchased

Problem Statement (terms)Problem Statement (terms)

► Itemset: non-empty set of items. Each item set Itemset: non-empty set of items. Each item set mapped to an integer.mapped to an integer.

► Sequence: Ordered list of itemsets.Sequence: Ordered list of itemsets.► Customer-Sequence: List of customer transactions Customer-Sequence: List of customer transactions

ordered by increasing transaction time.ordered by increasing transaction time.► (A customer supports a sequence if the sequence is (A customer supports a sequence if the sequence is

contained in the customer-sequence.)contained in the customer-sequence.)► Support for a sequence: Fraction of total customers Support for a sequence: Fraction of total customers

that support a sequence.that support a sequence.► Maximal Sequence: A sequence that is not Maximal Sequence: A sequence that is not

contained in any other sequence.contained in any other sequence.► Large Sequence: Sequence that meets minisup.Large Sequence: Sequence that meets minisup.

ExampleExampleCustomerCustomer

IDIDTransactionTransactionTime Time

ItemsItemsBoughtBought

11 June 25 '93June 25 '93 3030

11 June 30 '93June 30 '93 9090

22 June 10 '93June 10 '93 10,2010,20

22 June 15 '93June 15 '93 3030

22 June 20 '93June 20 '93 40,60,7040,60,70

33 June 25 '93June 25 '93 30,50,7030,50,70

44 June 25 '93June 25 '93 3030

44 June 30 '93June 30 '93 40,7040,70

44 July 25 '93July 25 '93 9090

55 June 12 '93June 12 '93 9090

CustomerCustomer IDID Customer SequenceCustomer Sequence

11 { (30) (90) }{ (30) (90) }

22{ (10 20) (30) (40 60 { (10 20) (30) (40 60 70) }70) }

33 { (30) (50) (70) }{ (30) (50) (70) }

44 { (30) (40 70) (90) }{ (30) (40 70) (90) }

55 { (90) }{ (90) }

Maximal sequences with support > Maximal sequences with support > 25%25%

{ (30) (90) }{ (30) (90) }

{ (30) (40 70) }{ (30) (40 70) }

Note: Use Minisup of 25%{ (10 20) (30) } Does not have minisup (Only supported by Cust. 2){ (30) }, { (70) }, { (30) (40) } … are not maximal.

The AlgorithmThe Algorithm

►Sort PhaseSort Phase►Litemset PhaseLitemset Phase►Transformation PhaseTransformation Phase►Sequence Phase (The Biggie)Sequence Phase (The Biggie)►Maximal PhaseMaximal Phase

Sort PhaseSort Phase

► Sort the database:Sort the database:► Customer ID as the Customer ID as the

major key.major key.► Transaction-Time as Transaction-Time as

the minor key.the minor key.

CustomerCustomerIDID

TransactionTransactionTime Time


11 June 25 '93June 25 '93 3030

11 June 30 '93June 30 '93 9090

22 June 10 '93June 10 '93 10,2010,20

22 June 15 '93June 15 '93 3030

22 June 20 '93June 20 '93 40,60,7040,60,70

33 June 25 '93June 25 '93 30,50,7030,50,70

44 June 25 '93June 25 '93 3030

44 June 30 '93June 30 '93 40,7040,70

44 July 25 '93July 25 '93 9090

55 June 12 '93June 12 '93 9090

Litemset PhaseLitemset Phase

► Litemset (Large Litemset (Large Itemset):Itemset):

► Supported by Supported by fraction of fraction of customers greater customers greater than minisup.than minisup.

CustomerCustomerIDID

TransactionTransactionTime Time


11 June 25 '93June 25 '93 3030

11 June 30 '93June 30 '93 9090

22 June 10 '93June 10 '93 10,2010,20

22 June 15 '93June 15 '93 3030

22 June 20 '93June 20 '93 40,60,7040,60,70

33 June 25 '93June 25 '93 30,50,7030,50,70

44 June 25 '93June 25 '93 3030

44 June 30 '93June 30 '93 40,7040,70

44 July 25 '93July 25 '93 9090

55 June 12 '93June 12 '93 9090

Large ItemsetsLarge Itemsets Mapped ToMapped To

(30)(30) 11

(40)(40) 22

(70)(70) 33

(40 70)(40 70) 44

(90)(90) 55

Transformation PhaseTransformation Phase

Customer Customer IDID

Original Customer Original Customer SequenceSequence

Transformed Custumer Transformed Custumer SequenceSequence After MappingAfter Mapping

11 { (30) (90) }{ (30) (90) } <{(30)} {(90)}><{(30)} {(90)}> <{1} {5}><{1} {5}>

22 { (10 20) (30) (40 60 70) }{ (10 20) (30) (40 60 70) } <{(30)} {(40),(70),(40 70)}><{(30)} {(40),(70),(40 70)}> <{1} {2,3,4}><{1} {2,3,4}>

33 { (30) (50) (70) }{ (30) (50) (70) } <{(30),(70)}><{(30),(70)}> <{1,3}><{1,3}>

44 { (30) (40 70) (90) }{ (30) (40 70) (90) }<{(30)} {(40),(70),(40 70)} <{(30)} {(40),(70),(40 70)}

{(90)}>{(90)}><{1} {2,3,4} <{1} {2,3,4}

{5}>{5}>

55 { (90) }{ (90) } <{(90)}><{(90)}> <{5}><{5}>

► Replace each transaction with all litemsets contained in the Replace each transaction with all litemsets contained in the transaction.transaction.

► Transactions with no litemsets are dropped. (Still considered Transactions with no litemsets are dropped. (Still considered for support counts)for support counts)

Note: (10 20) dropped because of lack of support.(40 60 70) replaced with set of litemsets {(40),(70),(40 70)} (60 does not have minisup)

Sequence PhaseSequence Phase

►Finds Large Sequences:Finds Large Sequences:►AprioriAllAprioriAll►AprioriSomeAprioriSome►DynamicSomeDynamicSome►Stay Tuned….Stay Tuned….

Maximal PhaseMaximal Phase

►Find maximal sequences among large Find maximal sequences among large sequences.sequences.

►k-sequence: sequence of length kk-sequence: sequence of length k►S set of all large sequencesS set of all large sequences► for (k=n; k>1; k--) dofor (k=n; k>1; k--) do

foreach k-sequence sforeach k-sequence skk do do

Delete from S all subsequences of sDelete from S all subsequences of skk

Authors claim data-structures and an algorithm Authors claim data-structures and an algorithm exist to do this efficiently. (hash trees)exist to do this efficiently. (hash trees)

Sequence PhaseSequence Phase

►Sequence phase finds the large Sequence phase finds the large sequences.sequences.

►Two families of algorithms:Two families of algorithms:

*****CountSome**********CountSome*****

*****CountAll**********CountAll*****

Final Exam Question #2Final Exam Question #2

► There were two types of algorithms presented There were two types of algorithms presented to find sequential patterns, CountSome and to find sequential patterns, CountSome and CountAll. What was the main difference CountAll. What was the main difference between the two algorithms?between the two algorithms?

► CountAllCountAll ( (AprioriAllAprioriAll) is careful with respect to ) is careful with respect to minimum support, careless with respect to minimum support, careless with respect to maximality.maximality.

CountSomeCountSome ( (AprioriSomeAprioriSome) is careful with ) is careful with respect to maximality, careless with respect respect to maximality, careless with respect to minimum support.to minimum support.

AprioriAllAprioriAllL1 = {large 1-sequences}for (k = 2; Lk-1 ≠ {}; k++) do begin Ck = New candidates generated from Lk-1

foreach customer-sequence c in the database do Increment the count of all candidates in Ck that are contained in c. Lk = Candidates in Ck with minimum support. endAnswer = Maximal Sequences in UkLk

Notation:Lk: Set of all large k-sequencesCk: Set of candidate k-sequences

AprioriAll (Example)AprioriAll (Example)Customer Customer

SequencesSequences

<{1 5} {2} {3} <{1 5} {2} {3} {4} >{4} >

<{1} {3} {4} {3 <{1} {3} {4} {3 5}>5}>

<{1} {2} {3} <{1} {2} {3} {4}>{4}>

<{1} {3} {5}><{1} {3} {5}>

<{4} {5}><{4} {5}>

L1L1

1-Seq1-SeqSupporSuppor

tt

<1><1> 44

<2><2> 22

<3><3> 44

<4><4> 44

<5><5> 44

L2L2

2-Seq2-Seq SupportSupport

<1 2><1 2> 22

<1 3><1 3> 44

<1 4><1 4> 33

<1 5><1 5> 33

<2 3><2 3> 22

<2 4><2 4> 22

<3 4><3 4> 33

<3 5><3 5> 22

<4 5><4 5> 22

L3L3


<1 2 3><1 2 3> 22

<1 2 4><1 2 4> 22

<1 3 4><1 3 4> 33

<1 3 5><1 3 5> 22

<2 3 4><2 3 4> 22

L4L4


<1 2 3 4><1 2 3 4> 22

<3 4 5><3 4 5> 11

<2 5><2 5> 00

Minisup = 25%

Answer: <1 2 3 4>, <1 3 5>, <4 5>

AprioriSome (Forward Phase)AprioriSome (Forward Phase)L1 = {large 1-sequences}C1 = L1

last = 1for (k = 2; Ck-1 ≠ {} and Llast ≠ {}; k++) do begin if (Lk-1 known) then Ck = New candidates generated from Lk-1

else Ck = New candidates generated from Ck-1

if (k==next(last)) then begin // (next k to count?) foreach customer-sequence c in the database do Increment the count of all candidates in Ck that are contained in c. Lk = Candidates in Ck with minimum support. last = k; endend

AprioriSome (Backward AprioriSome (Backward Phase)Phase)for (k--; k>=1; k--) do

if (Lk not found in forward phase) then begin Delete all sequences in Ck contained in some Li i>k; foreach customer-sequence c in DT do Increment the count of all candidates in Ck that are contained in c Lk = Candidates in Ck with minimum support end else // lk already known Delete all sequences in Ck contained in some Li i>k;

Answer = UkLk //(Maximal Phase not Needed)

Notation: DT; Transformed database

DynamicSome (Just the DynamicSome (Just the basics)basics)

►Similar to AprioriSomeSimilar to AprioriSome►AprioriSome generates CAprioriSome generates Ckk from C from Ck-1k-1

►DynamicSome generates CDynamicSome generates Ckk “on the “on the fly”fly”


►What was the greatest hardware What was the greatest hardware concern regarding the algorithms concern regarding the algorithms contained in the paper?contained in the paper?

►Main memory capacity. When there is Main memory capacity. When there is little main memory, or many little main memory, or many potentially large sequences, the potentially large sequences, the benefits of benefits of AprioriSomeAprioriSome vanish. vanish.


►How did the two best sequence mining How did the two best sequence mining algorithms (algorithms (AprioriAllAprioriAll and and AprioriSomeAprioriSome) ) perform compared with each other? perform compared with each other? Take into consideration memory, Take into consideration memory, speed, and usefulness of the data.speed, and usefulness of the data.


►Memory:Memory:In terms of main memory usage, In terms of main memory usage, AprioriAllAprioriAll is better. is better.In terms of secondary storage access, In terms of secondary storage access, AprioriSomeAprioriSome is better. is better.


►Speed:Speed:With sufficient memory, as minimum With sufficient memory, as minimum support decreases the difference support decreases the difference between between AprioriAllAprioriAll and and AprioriSomeAprioriSome increases. (increases. (AprioriSomeAprioriSome is better.) is better.)

More large sequences not maximal are More large sequences not maximal are generated.generated.


► Usefulness of the data:Usefulness of the data:For the problem of finding maximal large For the problem of finding maximal large sequences, the answer is “Precisely the sequences, the answer is “Precisely the same.”.same.”.However, However, AprioriAllAprioriAll finds all large sequences, finds all large sequences, while while AprioriSomeAprioriSome discards some large discards some large sequences that aren’t maximal. sequences that aren’t maximal. AprioriAllAprioriAll, , then, generates more “useful” data.then, generates more “useful” data.“The user may want to know the ratio of the “The user may want to know the ratio of the number of people who bought the first number of people who bought the first kk + 1 + 1 items in a sequence to the number of people items in a sequence to the number of people who bought the first who bought the first kk items.” items.”

Documents

Mining Sequential Patterns Rakesh Agrawal Ramakrishnan Srikant Proc. of the Int’l Conference on Data Engineering (ICDE) March 1995 Presenter: Phil Schlosser