Page 1

OLAP on Sequence Data

Presenter: Chun Kit Chui (Kit), The University of Hong Kong, ckchui@cs.hku.hk

Authors: Eric Lo, Ben Kao, Wai-Shing Ho, Sau-Dan Lee, Chun Kit Chui and David W. Cheung

Published in SIGMOD 2008, Vancouver, Canada.

Page 2

OLAP on Sequence Data

Problem Motivation

Sequence Data Cube and Cuboids

Experimental evaluations

New OLAP operations

System architecture

Future works

Page 3

OLAP on Sequence Data

Many kinds of real-life data exhibit logical ordering among their data items and are thus sequential in nature.

Web server access logs

Stock market data (charts: U.S. OIL FUND ETF, MEXCO ENERGY CORP)

Page 4

Sequence Data

Web server access logs (web retailer selling sportswear products)

Time | Member ID | URL | Product | Product type | Brand
2008-1-01 00:01 | 688 | /product.html?pid=12800 | 12800 | Nike shoes | Nike
…
2008-1-01 00:02 | 688 | /product.html?pid=13250 | 13250 | Adidas shoes | Adidas
2008-1-01 00:10 | 14230 | /product.html?pid=324 | 324 | Puma shoes | Puma
…
2008-1-01 02:45 | 688 | /product.html?pid=12800 | 12800 | Nike shoes | Nike
…
2008-1-01 03:49 | 688 | /product.html?pid=329 | 329 | Adidas T-shirts | Adidas
…
2008-1-01 03:45 | 14230 | /checkout.xhtml | Nil | Nil | Nil

The product dimension is associated with a concept hierarchy in which the finest level of abstraction is product ID, followed by product type and brand.

Page 5

Sequence Data

From the access logs we can trace back the browsing sequences of all members.

Browsing sequence of member 688: Nike shoes, Adidas shoes, Nike shoes

Page 6

Sequence Data

Manager: I would like to know the number of members who did comparison shopping, and their distribution over all pairs of product web pages, within 2008 Quarter 1.

Page 7

Sequence Data

The query refers to a particular kind of pattern in the browsing sequences.

The comparison shopping semantics can be expressed by the pattern template < X, Y, X >.

< X, Y, X > | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes > | ?
< Nike Shoes, Puma Shoes, Nike Shoes > | 5,432
< Nike Shoes, Nike Shoes, Nike Shoes > | 13,200
… | …
< Adidas Shoes, Nike Shoes, Adidas Shoes > | 1,020
< Adidas Shoes, Puma Shoes, Adidas Shoes > | 4,331

Page 8

Sequence Data

<Nike Shoes, Adidas Shoes, Nike Shoes> is one instantiation of the pattern template (an instantiated pattern). Since the browsing sequence of member 688 contains the pattern, that sequence contributes a count of 1 to the corresponding cell.

< X, Y, X > | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes > | 1
< Nike Shoes, Puma Shoes, Nike Shoes > | ?
< Nike Shoes, Nike Shoes, Nike Shoes > | ?
… | …
< Adidas Shoes, Nike Shoes, Adidas Shoes > | ?
< Adidas Shoes, Puma Shoes, Adidas Shoes > | ?
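The counting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's implementation: matching is assumed to be on contiguous windows (substring matching), each member is counted at most once per instantiated pattern, and the function names and the one-item sequence for member 14230 are made up for the example.

```python
from collections import Counter


def instantiations(sequence, template):
    """Return the instantiated patterns of `template` that occur as
    contiguous windows of `sequence` (substring matching is an assumption)."""
    k = len(template)
    found = set()
    for i in range(len(sequence) - k + 1):
        window = tuple(sequence[i:i + k])
        binding = {}
        # A window matches if every symbol keeps a single value throughout.
        if all(binding.setdefault(sym, val) == val
               for sym, val in zip(template, window)):
            found.add(window)
    return found


def count_members(sequences, template):
    """Count, per instantiated pattern, the members whose sequence contains it."""
    counts = Counter()
    for seq in sequences.values():
        counts.update(instantiations(seq, template))
    return counts


# Browsing sequences at the "product type" level; member 688 follows the access log.
browsing = {
    688: ["Nike shoes", "Adidas shoes", "Nike shoes", "Adidas T-shirts"],
    14230: ["Puma shoes"],  # hypothetical: too short to match the template
}
cells = count_members(browsing, ("X", "Y", "X"))
print(cells)
```

Note that X and Y may bind to the same value in this sketch, which is what allows a cell such as < Nike Shoes, Nike Shoes, Nike Shoes > to exist.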

Page 9

Sequence Data

The number of members is aggregated per cell, and a tabulated view of the sequence data should be returned.

< X, Y, X > | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes > | 200,000
< Nike Shoes, Puma Shoes, Nike Shoes > | 5,432
< Nike Shoes, Nike Shoes, Nike Shoes > | 13,200
… | …
< Adidas Shoes, Nike Shoes, Adidas Shoes > | 1,020
< Adidas Shoes, Puma Shoes, Adidas Shoes > | 4,331

Page 10

Sequence OLAP system

• Support “pattern based” grouping and aggregation.

Query (Manager): the comparison shopping query within 2008 Quarter 1.

Result: the tabulated < X, Y, X > view with the number of members per instantiated pattern.

Page 11

Sequence OLAP system

• Support “pattern based” grouping and aggregation.

• Obtain query results in real time (OLAP feature).

Follow-up query (Manager): So many members did comparison shopping between Nike shoes and Adidas shoes. I would like to investigate further whether those members browse one more product and, if so, which product.

The new query can be expressed by appending a pattern symbol “Z” to form a new pattern template <X,Y,X,Z>. The result shows the statistics of one more browsing step after the comparison shopping between Nike Shoes and Adidas Shoes.

Result:

< X, Y, X, Z > with X=“Nike Shoes”, Y=“Adidas Shoes”, Z=Any | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes, Nike Shoes > | 15,000
< Nike Shoes, Adidas Shoes, Nike Shoes, Adidas T-shirts > | 180,000
< Nike Shoes, Adidas Shoes, Nike Shoes, Puma Shoes > | 9,000
… | …

Page 12

Sequence OLAP system

The manager finds that the Adidas T-shirts page is the most popular next page among the members who did comparison shopping between the Nike shoes and Adidas shoes pages.
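A follow-up query of this kind can be sketched as the same window matching with some symbols restricted in advance. The code below is a hedged illustration, not the system's method: substring matching is assumed, and the member IDs 701 to 703, their sequences, and the names `matches` and `next_step_counts` are hypothetical.

```python
from collections import Counter


def matches(window, template, fixed):
    """True if `window` instantiates `template` under the `fixed` bindings."""
    binding = dict(fixed)  # start from the restricted symbols (X and Y here)
    return all(binding.setdefault(sym, val) == val
               for sym, val in zip(template, window))


def next_step_counts(sequences, template, fixed):
    """Count members per value bound to the last symbol (Z) of the template."""
    k = len(template)
    counts = Counter()
    for seq in sequences.values():
        next_pages = {seq[i + k - 1]
                      for i in range(len(seq) - k + 1)
                      if matches(seq[i:i + k], template, fixed)}
        counts.update(next_pages)  # each member counts once per next page
    return counts


browsing = {  # hypothetical members and sequences
    688: ["Nike shoes", "Adidas shoes", "Nike shoes", "Adidas T-shirts"],
    701: ["Nike shoes", "Adidas shoes", "Nike shoes", "Adidas T-shirts"],
    702: ["Nike shoes", "Adidas shoes", "Nike shoes", "Nike shoes"],
    703: ["Puma shoes", "Nike shoes", "Puma shoes", "Adidas shoes"],
}
fixed = {"X": "Nike shoes", "Y": "Adidas shoes"}
result = next_step_counts(browsing, ("X", "Y", "X", "Z"), fixed)
print(result.most_common())
```

Leaving Z out of `fixed` corresponds to Z=Any: the symbol binds to whatever page follows the comparison shopping steps.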

Page 13

Sequence OLAP system

• Support “pattern based” grouping and aggregation.

• Obtain query results in real time (OLAP feature).

• Provide OLAP operations to ease sequence analysis.

Query (Manager): The comparison shopping patterns displayed at the “product type” abstraction level are too detailed. I would like to view some higher-level statistics.

Concept hierarchy example: the brand Nike covers the product types Nike shoes, Nike T-shirts, Nike Basketballs and Nike socks.

A simple “roll up” operation on the pattern template transforms the summary statistics to the brand abstraction level.

“Product type” abstraction level:

< X, Y, X > | # Members
< Nike Shoes, Adidas Shoes, Nike Shoes > | 200,000
< Nike Shoes, Puma Shoes, Nike Shoes > | 5,432
< Nike Shoes, Nike Shoes, Nike Shoes > | 13,200
… | …
< Adidas Shoes, Nike Shoes, Adidas Shoes > | 1,020
< Adidas Shoes, Puma Shoes, Adidas Shoes > | 4,331

“Brand” abstraction level:

< X, Y, X > | # Members
< Nike, Adidas, Nike > | 3,150,000
< Nike, Puma, Nike > | 2,180,000
< Nike, Nike, Nike > | 19,000,000
… | …
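The roll-up can be illustrated by mapping every item of a sequence to its parent in the concept hierarchy and matching the template again. This is a minimal sketch under assumptions: the `BRAND_OF` mapping, the two sample sequences and the function names are hypothetical, and matching is on contiguous windows.

```python
from collections import Counter

# Hypothetical slice of the concept hierarchy: product type -> brand.
BRAND_OF = {
    "Nike shoes": "Nike",
    "Nike T-shirts": "Nike",
    "Adidas shoes": "Adidas",
    "Adidas T-shirts": "Adidas",
    "Puma shoes": "Puma",
}


def roll_up(sequence, hierarchy):
    """Rewrite a sequence at the next coarser abstraction level."""
    return [hierarchy[item] for item in sequence]


def count_xyx(sequences):
    """Count sequences per instantiation of <X, Y, X> (contiguous windows)."""
    counts = Counter()
    for seq in sequences:
        seen = {tuple(seq[i:i + 3])
                for i in range(len(seq) - 2)
                if seq[i] == seq[i + 2]}
        counts.update(seen)  # one count per sequence and pattern
    return counts


by_type = [
    ["Nike shoes", "Adidas shoes", "Nike shoes"],
    ["Nike shoes", "Adidas T-shirts", "Nike T-shirts"],
]
type_level = count_xyx(by_type)
brand_level = count_xyx([roll_up(s, BRAND_OF) for s in by_type])
print(type_level, brand_level)
```

In this toy data the second sequence matches no cell at the product type level but does match < Nike, Adidas, Nike > at the brand level, so the sketch recomputes the coarser view from the sequences instead of adding up the finer cells.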

Page 14

Research Objective

To design and implement an OLAP system that is able to:

• support “pattern based” grouping and aggregation;
• obtain query results in real time, especially optimized for interactive/iterative queries;
• provide OLAP operations to ease explorative analysis of sequence data.

Sequence OLAP

< X, Y, X > | # Members
< Nike, Adidas, Nike > | 315,000
< Nike, Puma, Nike > | 2,180,000
< Nike, Nike, Nike > | 189,000
… | …

< X, Y > | # Members
< Nike, Adidas > | 1,315,000
< Nike, Puma > | 6,480,000
< Nike, Nike > | 3,189,000
… | …

Page 15

RFID Logs

Radio-frequency identification (RFID) is an automatic identification method, relying on storing and remotely retrieving data using devices called RFID tags.

The smart card system in public transit:

• Octopus card in Hong Kong, Orca card in Seattle (2009), etc.
• Electronic money.
• Travel histories of passengers are logged in a database.
• Generates a massive amount of sequence data.

Page 16

RFID Logs

Payment can be done easily by waving the smart card over the card reader.

Event Database

Time | Card-ID | Location | Action | Amount
2008-7-25 09:01 | Kit | Shatin | in | -
2008-7-25 09:25 | Kit | Central | out | - $5
…
2008-7-25 18:23 | Kit | Central Machine #10 | Add value | + $100
2008-7-25 18:25 | Kit | Central | in | -
…
2008-7-25 18:49 | Kit | Shatin | out | - $5
…

Page 17

Marketing Manager: The number of round-trip passengers and their distributions over all origin-destination station pairs within 2008 Quarter 4.

Page 18

Sequence OLAP system

• Support “pattern based” grouping and aggregation.

• Obtain query results in real time.

• Provide OLAP operations to ease explorative analysis.

Result: round trip statistics (station level)

< X, Y, Y, X > | # Users
< Shatin, Central, Central, Shatin > | 2,032
< Shatin, Admiralty, Admiralty, Shatin > | 1,982
… | …
< Admiralty, Central, Central, Admiralty > | 22,822
< Admiralty, Kowloon, Kowloon, Admiralty > | 10,020

Page 19

Sequence Data Cuboid

A logical view of sequence data at a particular degree of summarization.

Page 20

Preliminary

Sequence Cuboid (S-Cuboid): a logical view of sequence data at a particular degree of summarization.

Sequences can be characterized by:

• the attribute values of the events in the sequence (e.g. time, spending, product type);
• the subsequence/substring patterns they possess (e.g. <X,Y,X>, <X,Y,Y,X>).

An S-Cuboid:

< X, Y, Y, X > | # Users
< Shatin, Central, Central, Shatin > | 2
< Kowloon, Admiralty, Admiralty, Kowloon > | 9
… | …
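The difference between substring and subsequence patterns can be made concrete with a small Python sketch. The helper names and the sample trip are assumptions for illustration only: a substring must occur as consecutive events, while a subsequence only has to preserve the order.

```python
def is_substring(pattern, sequence):
    """True if `pattern` occurs as consecutive items of `sequence`."""
    k = len(pattern)
    return any(list(sequence[i:i + k]) == list(pattern)
               for i in range(len(sequence) - k + 1))


def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in order in `sequence`, gaps allowed."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match.
    return all(item in remaining for item in pattern)


# Hypothetical trip with an extra stop at Kowloon in the middle.
trip = ["Shatin", "Central", "Kowloon", "Central", "Shatin"]
round_trip = ["Shatin", "Central", "Central", "Shatin"]
print(is_substring(round_trip, trip), is_subsequence(round_trip, trip))
```

With the extra stop, the round trip is still found as a subsequence but no longer as a substring, so the choice of matching semantics changes which cell a sequence is counted in.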

Page 21

Phase 1. Sequence Formation

An event selection step selects a set of relevant records and attributes.

Event Database

Time | Card-ID | Location | Action | Amount
2008-6-09 00:01 | Kit | Shatin | in | 0
2008-6-09 02:25 | Kit | Central | out | -5
…
2008-6-14 02:23 | Kit | Central Machine #10 | Add value | +100
2008-6-14 02:25 | Kit | Central | in | 0
…
2008-6-14 18:49 | Kit | Shatin | out | -5
…

After Event Selection

Time | Card-ID | Location | Action | Amount
2008-6-09 00:01 | Kit | Shatin | in | 0
2008-6-09 02:25 | Kit | Central | out | -5
…
2008-6-14 02:25 | Kit | Central | in | 0
…
2008-6-14 18:49 | Kit | Shatin | out | -5
…

Page 22

Phase 1. Sequence Formation

A sequence formation step forms sequences from the event dataset.

Sequences can be formed per day and for each individual user. By doing this, we obtain a number of daily travel sequences for each user. E.g. S1 is Kit’s trip on Monday.

Sequence Formation (User : Individual, Time : Day)

Seq ID | Sequence of events
S1 | < e1, e2, e102, e180 >
S2 | < e3, e7, e8, e12, e19, e232, e234, e235 >
S3 | < e4, e5, e9, e13, e14, e290, e292, e352 >
… | …
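Event selection followed by sequence formation at the (User : Individual, Time : Day) level could look roughly like the sketch below. It is an assumption-laden illustration: the tuple layout, the zero-padded timestamps, the choice to keep only "in"/"out" events, and the function names are not taken from the slides.

```python
from collections import defaultdict
from datetime import datetime

# (time, card_id, location, action, amount), modelled on the event database.
events = [
    ("2008-06-09 00:01", "Kit", "Shatin", "in", 0),
    ("2008-06-09 02:25", "Kit", "Central", "out", -5),
    ("2008-06-14 02:23", "Kit", "Central", "Add value", 100),
    ("2008-06-14 02:25", "Kit", "Central", "in", 0),
    ("2008-06-14 18:49", "Kit", "Shatin", "out", -5),
]


def select_events(events):
    """Event selection: keep only travel events (assumed to be 'in'/'out')."""
    return [e for e in events if e[3] in ("in", "out")]


def form_sequences(events):
    """Sequence formation: one time-ordered sequence per (card, day)."""
    groups = defaultdict(list)
    for time, card, location, action, _amount in events:
        ts = datetime.strptime(time, "%Y-%m-%d %H:%M")
        groups[(card, ts.date())].append((ts, location, action))
    return {key: [(loc, act) for _ts, loc, act in sorted(evts)]
            for key, evts in groups.items()}


daily = form_sequences(select_events(events))
for key, seq in daily.items():
    print(key, seq)
```

Forming sequences at the year level instead would only change the grouping key, for example `(card, ts.year)` in place of `(card, ts.date())`.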

Page 23

Phase 1. Sequence Formation

Sequences can also be formed according to the time dimension at the abstraction level of year, per individual user. E.g. S1 is then Kit’s trips in 2008.

Sequence Formation (User : Individual, Time : Year)

Seq ID | Sequence of events
S1 | < e1, e2, e102, e180, e1002, e1800, e1801, … >
S2 | < e3, e7, e8, e12, e19, e232, e234, e235, e2134, e2135 >
S3 | < e4, e5, e9, e13, e14, e290, e292, e352, e3252, … >
… | …

Page 24

Phase 2. S-Cuboid construction

[Figure: the sequences formed with User : Individual and Time : Day (S1, S2, S3, …) are arranged in a grid with axes Time : day (e.g. Monday) and User : individual (Kit, Ben, Shing); S1 is Kit’s trip on Monday.]

Page 25

Phase 2. S-Cuboid construction

A sequence grouping step groups the sequences that share the same dimension values into a sequence group. E.g. travel sequences are grouped according to their fare groups.

[Figure: Sequence Grouping maps the grid with axes Time : day and User : individual (Kit, Ben, Shing) to a grid with axes Time : day and User : fare-group (e.g. the Regular group on Monday).]
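Sequence grouping amounts to bucketing sequences by the values of the chosen dimensions. A minimal sketch, assuming a hypothetical `FARE_GROUP` lookup (only the "Regular" group appears on the slide; "Student" and the sequence-to-user assignments are invented for the example):

```python
from collections import defaultdict

# Hypothetical mapping from individual users to fare groups.
FARE_GROUP = {"Kit": "Regular", "Ben": "Regular", "Shing": "Student"}

# (sequence id, user, day) for sequences formed per individual and per day.
sequences = [
    ("S1", "Kit", "Monday"),
    ("S2", "Ben", "Monday"),
    ("S3", "Shing", "Monday"),
    ("S4", "Kit", "Tuesday"),
]


def group_sequences(sequences, fare_group):
    """Group sequence ids that share the same (fare-group, day) values."""
    groups = defaultdict(list)
    for seq_id, user, day in sequences:
        groups[(fare_group[user], day)].append(seq_id)
    return dict(groups)


groups = group_sequences(sequences, FARE_GROUP)
print(groups)
```

Each key of `groups` plays the role of one sequence group, inside which the pattern-based steps are then applied.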

Page 26

Phase 2. S-Cuboid construction

The pattern grouping step further groups the sequences according to the “patterns” they possess.

[Figure: Pattern Grouping with pattern template X,Y,Y,X; within each sequence group, sequences are arranged on the axes X (Location : station) and Y (Location : station).]

Page 27

Phase 2. S-Cuboid construction

Each cell represents an instantiated pattern, e.g. <Shatin, Central, Central, Shatin>. We assign a sequence to a cell if that sequence contains the instantiated pattern.

Event | Time | Card-ID | Location | Action | Amount
e1 | 2008-6-09 00:01 | Kit | Shatin | in | 0
e2 | 2008-6-09 02:25 | Kit | Central | out | -5
…
e102 | 2008-6-09 22:25 | Kit | Central | in | 0
…
e180 | 2008-6-09 23:49 | Kit | Shatin | out | -5
…

[Figure: for pattern X,Y,Y,X, the cell X = Shatin, Y = Central contains S1 and S3.]

Page 28

Phase 2. S-Cuboid construction

Finally, an aggregation function is applied to the sequences in each cuboid cell.

[Figure: for pattern X,Y,Y,X, the cell X = Shatin, Y = Central contains S1 and S3; aggregated value: Count: 2.]
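Pattern grouping and aggregation can be sketched as two small steps: assign sequences to cells, then apply the aggregate per cell. The code is an illustration under assumptions (substring matching on station sequences, COUNT as the aggregate, hypothetical function names, and an invented sequence S2), not the system's algorithm.

```python
from collections import defaultdict


def matches(window, template):
    """True if each template symbol keeps one value across the window."""
    binding = {}
    return all(binding.setdefault(sym, val) == val
               for sym, val in zip(template, window))


def pattern_grouping(sequences, template):
    """Assign each sequence id to the cells (instantiated patterns) it contains."""
    cells = defaultdict(set)
    k = len(template)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            window = tuple(seq[i:i + k])
            if matches(window, template):
                cells[window].add(seq_id)
    return cells


def aggregate_count(cells):
    """Aggregation step: COUNT of sequences in each cell."""
    return {pattern: len(ids) for pattern, ids in cells.items()}


# Station-level location sequences; S1 and S3 follow the example cell.
sequences = {
    "S1": ["Shatin", "Central", "Central", "Shatin"],
    "S2": ["Shatin", "Kowloon"],  # hypothetical one-way trip
    "S3": ["Shatin", "Central", "Central", "Shatin"],
}
cells = pattern_grouping(sequences, ("X", "Y", "Y", "X"))
counts = aggregate_count(cells)
print(counts)
```

Keeping the sequence ids per cell, rather than only the counts, is a design choice of this sketch; it would let a different aggregation function be applied later without matching again.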

Page 29

Phase 2. S-Cuboid construction

4D S-Cuboid

< X, Y, Y, X > | # Users
< Shatin, Central, Central, Shatin > | 2
< Shatin, Kowloon, Kowloon, Shatin > | 9
… | …

Page 30

Phase 2. S-Cuboid construction

4D S-Cuboid

Global Dimensions: time : day, User : fare-group

Pattern Dimensions: X (Location : station), Y (Location : station)

Page 31

Sequence Cuboid query language

The number of round-trip passengers and their distributions over all origin-destination station pairs within 2007 Quarter 4.

This query specifies the construction of the S-Cuboid that answers the round trip query in the running example.

[Figure: the query annotated with its Sequence Formation, Sequence Grouping and Pattern Grouping parts, producing the 4D S-Cuboid.]

Page 32

Sequence Cuboid query language

Query: the number of round-trip passengers and their distributions over all origin-destination station pairs within 2007 Quarter 4.

SequenceFormation: form individual daily travel sequences.
SequenceGrouping: we specify the global dimensions in the sequence grouping step. Group the sequences with the same fare-group within the same day.
PatternGrouping: group the sequences according to the pattern template <X,Y,Y,X>, where X and Y refer to the location dimension at the station abstraction level.

Result: 4D S-Cuboid
  < X, Y, Y, X >                          # Users
  < Shatin, Central, Central, Shatin >    2
  < Shatin, Kowloon, Kowloon, Shatin >    9
  …                                       …
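
The three steps named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the query language or the implementation from the paper: the event layout `(card, day, fare_group, station)`, the function names and the sample trips are all hypothetical, events are assumed to arrive in time order, and a sequence is assumed to add 1 to each cell it matches at least once.

```python
from collections import defaultdict

def matches(window, template):
    """True if the window is a consistent instantiation of the template."""
    binding = {}
    return all(binding.setdefault(sym, val) == val
               for sym, val in zip(template, window))

def build_s_cuboid(events, template):
    # Step 1 (sequence formation): one sequence per card per day.
    sequences = defaultdict(list)
    group_of = {}
    for card, day, fare_group, station in events:  # assumed time-ordered
        sequences[(card, day)].append(station)
        group_of[(card, day)] = (day, fare_group)

    # Step 2 (sequence grouping): global dimensions day and fare-group.
    groups = defaultdict(list)
    for key, seq in sequences.items():
        groups[group_of[key]].append(seq)

    # Step 3 (pattern grouping): each sequence adds 1 to every cell it matches.
    n = len(template)
    cuboid = {}
    for group, seqs in groups.items():
        counts = defaultdict(int)
        for seq in seqs:
            cells = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)
                     if matches(seq[i:i + n], template)}
            for cell in cells:
                counts[cell] += 1
        cuboid[group] = dict(counts)
    return cuboid

# Hypothetical sample data: three cards travelling on the same day.
trips = {
    "card1": ["Shatin", "Central", "Central", "Shatin"],
    "card2": ["Shatin", "Central", "Central", "Shatin"],
    "card3": ["Shatin", "Kowloon", "Kowloon", "Shatin"],
}
events = [(card, "2007-10-01", "adult", station)
          for card, path in trips.items() for station in path]
cuboid = build_s_cuboid(events, ("X", "Y", "Y", "X"))
print(cuboid[("2007-10-01", "adult")])
```

Note that this sketch lets X and Y bind to the same station; whether that should be allowed is a design choice the slides do not settle.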

Page 33

Sequence Cuboid query language

What exactly is a round-trip pattern? The predicates further increase the expressive power of pattern matching in the query language.

SequenceFormation: form individual daily travel sequences.
SequenceGrouping: we specify the global dimensions in the sequence grouping step. Group the sequences with the same fare-group within the same day.
PatternGrouping: group the sequences according to the pattern template <X,Y,Y,X>, where X and Y refer to the location dimension at the station abstraction level.

Result: 4D S-Cuboid
  < X, Y, Y, X >                          # Users
  < Shatin, Central, Central, Shatin >    2
  < Shatin, Kowloon, Kowloon, Shatin >    9
  …                                       …

Page 34

Sequence Cuboid query language

Query parts: SequenceFormation, SequenceGrouping (global dimensions), PatternGrouping (pattern template, pattern dimensions)

Result: 4D S-Cuboid
  < X, Y, Y, X >                          # Users
  < Shatin, Central, Central, Shatin >    2
  < Shatin, Kowloon, Kowloon, Shatin >    9
  …                                       …

The cell restriction defines how to deal with situations in which a data sequence contains multiple occurrences of a cell’s pattern. E.g. a sequence contributes a count of 1 whenever at least one match of the pattern can be found in the sequence.

E.g. Kit: < Shatin, Central, Central, Shatin, Shatin, Central, Central, Shatin >

Page 35

Sequence Cuboid query language

Query parts: SequenceFormation, SequenceGrouping (global dimensions), PatternGrouping (pattern template, pattern dimensions)

Result: 4D S-Cuboid
  < X, Y, Y, X >                          # Users
  < Shatin, Central, Central, Shatin >    2
  < Shatin, Kowloon, Kowloon, Shatin >    9
  …                                       …

The cell restriction defines how to deal with situations in which a data sequence contains multiple occurrences of a cell’s pattern. E.g. a sequence contributes a count of 1 whenever at least one match of the pattern can be found in the sequence.

E.g. Kit: < Shatin, Central, Central, Shatin, Shatin, Central, Central, Shatin >

Any change to the cuboid specification transforms the S-Cuboid into another one. E.g. changing the pattern template to (X,Y,Y,X,Z) generates another S-Cuboid.
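
To see why a cell restriction is needed, the sketch below counts the matches of the template (X,Y,Y,X) in Kit's sequence from this slide. The function name and the two counting policies are illustrative assumptions; the slides name the cell restriction but do not spell out its exact semantics.

```python
from collections import Counter

def occurrences(seq, template):
    """Count every window of seq that instantiates the template, per cell."""
    n = len(template)
    found = Counter()
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        binding = {}
        if all(binding.setdefault(sym, val) == val
               for sym, val in zip(template, window)):
            found[tuple(window)] += 1
    return found

kit = ["Shatin", "Central", "Central", "Shatin",
       "Shatin", "Central", "Central", "Shatin"]
occ = occurrences(kit, ("X", "Y", "Y", "X"))

# Policy A (assumed reading of the slide): at most 1 count per sequence and cell.
once_per_sequence = {cell: 1 for cell in occ}
# Policy B (alternative, for contrast): every occurrence counts.
every_occurrence = dict(occ)

round_trip = ("Shatin", "Central", "Central", "Shatin")
print(once_per_sequence[round_trip], every_occurrence[round_trip])
```

Kit's sequence contains the round trip twice, so the two policies put different values into the same cell. The window < Central, Shatin, Shatin, Central > also matches the bare template, which is the kind of match that predicates on the events are meant to rule out.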

Page 36

Properties of S-Cuboids

Exponential number of S-cuboids
  The length of the pattern template is infinite.
  Pattern Template (X,Y,Y,X,A,B,…)
  Recall that changing the pattern template essentially changes the cuboid specification and thus generates a new cuboid.

Non-summarizable

Page 37

Properties of S-Cuboids

Exponential number of S-cuboids
  The length of the pattern template is infinite.
  Pattern Template (X,Y,Y,X,A,B,…)

Non-summarizable

In traditional OLAP systems, data are summarizable, i.e. summaries at a finer abstraction level can be used to construct the summary at a higher abstraction level.

Traditional OLAP: Summarizable!
  Finer summaries, # Sales per day: Mon 1, Tue 1, Wed 1, Thur 1, Fri 1, Sat 1, Sun 1
  Coarser summary, # Sales for the whole week: 7

Page 38

Properties of S-Cuboids

Infinite number of S-cuboids
  The number of pattern dimensions is infinite.
  Pattern Template (X,Y,Y,X,A,B,…)

Non-summarizable

Traditional OLAP: Summarizable!
  Finer summaries, # Sales per day: Mon 1, Tue 1, Wed 1, Thur 1, Fri 1, Sat 1, Sun 1
  Coarser summary, # Sales for the whole week: 7

Sequence OLAP
  Finer summaries, # Sequences: < A, B, A > 1, < A, B, B > 1
  Coarser summary, # Sequences: < A, B > ?

Sequence Database
  Seq ID   Sequence of events
  Kit      < Kowloon, Central, Kowloon, Central >
  Ben      < Kowloon, Central, Central, Kowloon >

S-Cuboid (finer aggregates): the S-Cuboid with pattern template <X,Y,Z>
  < X, Y, Z >                      Count
  < Kowloon, Central, Kowloon >    1
  < Kowloon, Central, Central >    1

Page 39

Properties of S-Cuboids

Infinite number of S-cuboids
  The number of pattern dimensions is infinite.
  Pattern Template (X,Y,Y,X,A,B,…)

Non-summarizable

Traditional OLAP: Summarizable!
  Finer summaries, # Sales per day: Mon 1, Tue 1, Wed 1, Thur 1, Fri 1, Sat 1, Sun 1
  Coarser summary, # Sales for the whole week: 7

Sequence OLAP
  Finer summaries, # Sequences: < A, B, A > 1, < A, B, B > 1
  Coarser summary, # Sequences: < A, B > ?

Sequence Database
  Seq ID   Sequence of events
  Kit      < Kowloon, Central, Kowloon, Central >
  Ben      < Kowloon, Central, Central, Kowloon >

S-Cuboid (finer aggregates): the S-Cuboid with pattern template <X,Y,Z>
  < X, Y, Z >                      Count
  < Kowloon, Central, Kowloon >    1
  < Kowloon, Central, Central >    1

S-Cuboid (coarser aggregates)
  < X, Y >                 Count
  < Kowloon, Central >     ?

Can we compute the S-Cuboid with pattern <X,Y> (coarser summary) from the S-Cuboid with pattern <X,Y,Z> (finer summary) without looking at the sequence database?

Page 40

Properties of S-Cuboids

Infinite number of S-cuboids
  The number of pattern dimensions is infinite.
  Pattern Template (X,Y,Y,X,A,B,…)

Non-summarizable

Traditional OLAP: Summarizable!
  Finer summaries, # Sales per day: Mon 1, Tue 1, Wed 1, Thur 1, Fri 1, Sat 1, Sun 1
  Coarser summary, # Sales for the whole week: 7

Sequence OLAP
  Finer summaries, # Sequences: < A, B, A > 1, < A, B, B > 1
  Coarser summary, # Sequences: < A, B > ?

Can we compute the S-Cuboid with pattern <X,Y> (coarser summary) from the S-Cuboid with pattern <X,Y,Z> (finer summary) without looking at the sequence database?

First sequence database
  Seq ID   Sequence of events
  Kit      < Kowloon, Central, Kowloon, Central >
  Ben      < Kowloon, Central, Central, Kowloon >
S-Cuboid (finer aggregates)
  < X, Y, Z >                      Count
  < Kowloon, Central, Kowloon >    1
  < Kowloon, Central, Central >    1
S-Cuboid (coarser aggregates)
  < X, Y >                 Count
  < Kowloon, Central >     2

Second sequence database
  Seq ID   Sequence of events
  Kit      < Kowloon, Central, Kowloon, Central, Central >
  Ben      < Kowloon, Admiralty >
S-Cuboid (finer aggregates)
  < X, Y, Z >                      Count
  < Kowloon, Central, Kowloon >    1
  < Kowloon, Central, Central >    1
S-Cuboid (coarser aggregates)
  < X, Y >                 Count
  < Kowloon, Central >     1

The problem is that we don’t know whether the counts in these two patterns are generated from the same sequence or from two different sequences.

Page 41

Properties of S-Cuboids

Infinite number of S-cuboids
  The number of pattern dimensions is infinite.
  Pattern Template (X,Y,Y,X,A,B,…)

Non-summarizable

Traditional OLAP: Summarizable!
  Finer summaries, # Sales per day: Mon 1, Tue 1, Wed 1, Thur 1, Fri 1, Sat 1, Sun 1
  Coarser summary, # Sales for the whole week: 7

Sequence OLAP: Non-Summarizable!
  Finer summaries, # Sequences: < A, B, A > 1, < A, B, B > 1
  Coarser summary, # Sequences: < A, B >

Can we compute the S-Cuboid with pattern <X,Y> (coarser summary) from the S-Cuboid with pattern <X,Y,Z> (finer summary) without looking at the sequence database?

First sequence database
  Seq ID   Sequence of events
  Kit      < Kowloon, Central, Kowloon, Central >
  Ben      < Kowloon, Central, Central, Kowloon >
S-Cuboid (finer aggregates)
  < X, Y, Z >                      Count
  < Kowloon, Central, Kowloon >    1
  < Kowloon, Central, Central >    1
S-Cuboid (coarser aggregates)
  < X, Y >                 Count
  < Kowloon, Central >     2

Second sequence database
  Seq ID   Sequence of events
  Kit      < Kowloon, Central, Kowloon, Central, Central >
  Ben      < Kowloon, Admiralty >
S-Cuboid (finer aggregates)
  < X, Y, Z >                      Count
  < Kowloon, Central, Kowloon >    1
  < Kowloon, Central, Central >    1
S-Cuboid (coarser aggregates)
  < X, Y >                 Count
  < Kowloon, Central >     1

The problem is that we don’t know whether the counts in these two patterns are generated from the same sequence or from two different sequences.

Page 42

Properties of S-Cuboids

Infinite number of S-cuboids
  The number of pattern dimensions is infinite.
  Pattern Template (X,Y,Y,X,A,B,…)

Non-summarizable
  Coarser aggregates cannot be computed solely from the corresponding finer aggregates.

Can we compute the S-Cuboid with pattern <X,Y> (coarser summary) from the S-Cuboid with pattern <X,Y,Z> (finer summary) without looking at the sequence database?

First sequence database
  Seq ID   Sequence of events
  Kit      < Kowloon, Central, Kowloon, Central >
  Ben      < Kowloon, Central, Central, Kowloon >
S-Cuboid (finer aggregates)
  < X, Y, Z >                      Count
  < Kowloon, Central, Kowloon >    1
  < Kowloon, Central, Central >    1
S-Cuboid (coarser aggregates)
  < X, Y >                 Count
  < Kowloon, Central >     2

Second sequence database
  Seq ID   Sequence of events
  Kit      < Kowloon, Central, Kowloon, Central, Central >
  Ben      < Kowloon, Admiralty >
S-Cuboid (finer aggregates)
  < X, Y, Z >                      Count
  < Kowloon, Central, Kowloon >    1
  < Kowloon, Central, Central >    1
S-Cuboid (coarser aggregates)
  < X, Y >                 Count
  < Kowloon, Central >     1

The problem is that we don’t know whether the counts in these two patterns are generated from the same sequence or from two different sequences.
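
The counter-example on this slide can be checked mechanically. The sketch below recomputes the cells shown for both sequence databases; the helper name `cell_count` is hypothetical, and a cell is assumed to count the sequences that contain its pattern as a contiguous substring.

```python
def cell_count(db, pattern):
    """Number of sequences in db containing pattern as a contiguous substring."""
    n = len(pattern)
    return sum(
        any(tuple(seq[i:i + n]) == pattern for i in range(len(seq) - n + 1))
        for seq in db.values()
    )

db1 = {
    "Kit": ["Kowloon", "Central", "Kowloon", "Central"],
    "Ben": ["Kowloon", "Central", "Central", "Kowloon"],
}
db2 = {
    "Kit": ["Kowloon", "Central", "Kowloon", "Central", "Central"],
    "Ben": ["Kowloon", "Admiralty"],
}

fine_cells = [("Kowloon", "Central", "Kowloon"), ("Kowloon", "Central", "Central")]
coarse_cell = ("Kowloon", "Central")

fine1 = [cell_count(db1, c) for c in fine_cells]
fine2 = [cell_count(db2, c) for c in fine_cells]
coarse1 = cell_count(db1, coarse_cell)
coarse2 = cell_count(db2, coarse_cell)
print(fine1, fine2, coarse1, coarse2)
```

The two finer cells shown on the slide are identical in both databases, yet the coarser cell differs, so no function of those finer counts alone can return the right coarser count in both cases.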

Page 43

Properties of S-Cuboids

Exponential number of S-cuboids
  The length of the pattern template is infinite.
  Pattern Template (X,Y,Y,X,A,B,…)
  Full materialization is impossible!

Non-summarizable
  Coarser aggregates cannot be computed solely from the corresponding finer aggregates.
  Partial materialization is infeasible!

Page 44

Properties of S-Cuboids

Research direction
  Precompute some other auxiliary data structures so that queries can be computed online using the pre-built data structures.

Page 45

S-OLAP Specific Operations

Assist exploratory analysis of the sequence data

Page 46

S-OLAP specific operations

Navigate between cuboids with ease

Traditional OLAP operations for Global Dimensions
  SLICE, DICE, ROLL-UP, DRILL-DOWN, etc.

New S-OLAP operations for Pattern Dimensions / Pattern Template
  APPEND(X)               (X,Y,Y) → (X,Y,Y,X)
  DE-TAIL                 (X,Y,Y,X) → (X,Y,Y)
  PREPEND(Z)              (X,Y,Y,X) → (Z,X,Y,Y,X)
  DE-HEAD                 (Q,Y,Y,X) → (Y,Y,X)
  PATTERN-ROLL-UP(X)      (X,Y,Y,X) → (X,Y,Y,X), with X at a coarser abstraction level
  PATTERN-DRILL-DOWN(X)   (X,Y,Y,X) → (x,Y,Y,x), with X at a finer abstraction level
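
Viewed as transformations of a cuboid specification, the pattern operations above are small. The sketch below treats a pattern template as a tuple of symbols and keeps the abstraction level of each symbol in a dictionary. The function names, the `LEVELS` hierarchy and the dictionary representation are assumptions made for illustration, not the system's API.

```python
LEVELS = ["station", "line"]  # assumed hierarchy, finest level first

def append(template, symbol):
    return template + (symbol,)

def de_tail(template):
    return template[:-1]

def prepend(template, symbol):
    return (symbol,) + template

def de_head(template):
    return template[1:]

def pattern_roll_up(levels, symbol):
    """Move one symbol to the next coarser abstraction level."""
    i = LEVELS.index(levels[symbol])
    return {**levels, symbol: LEVELS[min(i + 1, len(LEVELS) - 1)]}

def pattern_drill_down(levels, symbol):
    """Move one symbol to the next finer abstraction level."""
    i = LEVELS.index(levels[symbol])
    return {**levels, symbol: LEVELS[max(i - 1, 0)]}

template = ("X", "Y", "Y")
round_trip = append(template, "X")
levels = {"X": "station", "Y": "station"}
coarser = pattern_roll_up(levels, "X")
print(round_trip, de_tail(round_trip), prepend(round_trip, "Z"), coarser)
```

Each call yields a new specification and therefore a different S-Cuboid, which the engine still has to evaluate.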

Page 47

Sequence OLAP

Pattern template: < X, Y >

Tell me the summary statistics of the single-trip travel patterns of passengers among different Rail Lines, please.

CUBOID by SUBSTRING(X,Y)
  WITH X as location at “Rail Lines”,
       Y as location at “Rail Lines”
  LEFT-MAXIMALITY (x1, y1)
  WITH x1.action = “in” AND y1.action = “out”

Page 48

Sequence OLAP

Pattern template: < X, Y >

S-Cuboid 1 (10 * 10 cells)
  < X, Y >, X and Y at Line level        # Passengers
  < Tsuen Wan Line, Island Line >        120,000
  < Island Line, Tsuen Wan Line >        8,000
  …                                      …

Page 49

Sequence OLAP

Pattern template: < X, Y >

S-Cuboid 1 (10 * 10 cells)
  < X, Y >, X and Y at Line level        # Passengers
  < Tsuen Wan Line, Island Line >        120,000
  < Island Line, Tsuen Wan Line >        8,000
  …                                      …

More detailed statistics of passengers traveling from the Tsuen Wan Line to each of the Island Line stations, please.

Page 50

Sequence OLAP

Pattern template: < X, Y >

S-Cuboid 1 (10 * 10 cells)
  < X, Y >, X and Y at Line level        # Passengers
  < Tsuen Wan Line, Island Line >        120,000
  < Island Line, Tsuen Wan Line >        8,000
  …                                      …

Slice, P-DRILL-DOWN

S-Cuboid 2 (1 * 14 cells)
  < X, Y >, X at Line level, Y at Station level
  X = “Tsuen Wan Line”, Y = “Island Line”      # Passengers
  < Tsuen Wan Line, Central >                  100,000
  < Tsuen Wan Line, Admiralty >                8,300
  < Tsuen Wan Line, Wan Chai >                 4,030
  < Tsuen Wan Line, Causeway Bay >             12,430
  …                                            …

Instead of specifying the S-Cuboid construction query, a SLICE plus a P-DRILL-DOWN(Y) is done.

Page 51

Sequence OLAP

Pattern template: < X, Y >

S-Cuboid 1 (10 * 10 cells)
  < X, Y >, X and Y at Line level        # Passengers
  < Tsuen Wan Line, Island Line >        120,000
  < Island Line, Tsuen Wan Line >        8,000
  …                                      …

Slice, P-DRILL-DOWN

S-Cuboid 2 (1 * 14 cells)
  < X, Y >, X at Line level, Y at Station level
  X = “Tsuen Wan Line”, Y = “Island Line”      # Passengers
  < Tsuen Wan Line, Central >                  100,000
  < Tsuen Wan Line, Admiralty >                8,300
  < Tsuen Wan Line, Wan Chai >                 4,030
  < Tsuen Wan Line, Causeway Bay >             12,430
  …                                            …

APPEND (Y)

S-Cuboid 3 (1 * 14 * 14 cells)
  < X, Y, Y >, X at Line level, Y at Station level
  X = “Tsuen Wan Line”, Y = “Island Line”      # Passengers
  < Tsuen Wan Line, Central, Central >         90,000
  < Tsuen Wan Line, Admiralty, Admiralty >     8,300
  < Tsuen Wan Line, Wan Chai, Wan Chai >       4,030
  < Tsuen Wan Line, Admiralty, Admiralty >     2,430
  …                                            …

Page 52

Sequence OLAP

Pattern template: < X, Y >

S-Cuboid 1 (10 * 10 cells)
  < X, Y >, X and Y at Line level        # Passengers
  < Tsuen Wan Line, Island Line >        120,000
  < Island Line, Tsuen Wan Line >        8,000
  …                                      …

Slice, P-DRILL-DOWN

S-Cuboid 2 (1 * 14 cells)
  < X, Y >, X at Line level, Y at Station level
  X = “Tsuen Wan Line”, Y = “Island Line”      # Passengers
  < Tsuen Wan Line, Central >                  100,000
  < Tsuen Wan Line, Admiralty >                8,300
  < Tsuen Wan Line, Wan Chai >                 4,030
  < Tsuen Wan Line, Causeway Bay >             12,430
  …                                            …

APPEND (Y), DE-TAIL

S-Cuboid 3 (1 * 14 * 14 cells)
  < X, Y, Y >, X at Line level, Y at Station level
  X = “Tsuen Wan Line”, Y = “Island Line”      # Passengers
  < Tsuen Wan Line, Central, Central >         90,000
  < Tsuen Wan Line, Admiralty, Admiralty >     8,300
  < Tsuen Wan Line, Wan Chai, Wan Chai >       4,030
  < Tsuen Wan Line, Admiralty, Admiralty >     2,430
  …                                            …

The S-OLAP operations not only assist the exploratory analysis of the sequence data but also hide all the technical details of specifying the S-Cuboid query from the business users.

Page 53

System Architecture

Page 54

System Architecture

Components: Event Dataset

The raw data of an S-OLAP system is a set of events that are deposited in an Event Dataset.

Page 55

System Architecture

Components: Event Dataset, Sequence Query Engine, Sequence Cache

The raw data of an S-OLAP system is a set of events that are deposited in an Event Dataset.

The job of the Sequence Query Engine is to compose sets of event sequences out of the event dataset (Phase 1 in S-Cuboid construction).

Page 56

System Architecture

Components: Event Dataset, Sequence Query Engine, Sequence Cache, User Interface (issues Queries)

The raw data of an S-OLAP system is a set of events that are deposited in an Event Dataset.

The job of the Sequence Query Engine is to compose sets of event sequences out of the event dataset (Phase 1 in S-Cuboid construction).

The User Interface provides certain user-friendly components to help a user specify an S-cuboid.

Page 57

System Architecture

Components: Event Dataset, Sequence Query Engine, Sequence Cache, Sequence OLAP Engine, Cuboid Repository, User Interface (sends Queries, receives Results)

The raw data of an S-OLAP system is a set of events that are deposited in an Event Dataset.

The User Interface provides certain user-friendly components to help a user specify an S-cuboid.

Given an S-Cuboid query, the S-OLAP Engine consults a Cuboid Repository to see if such an S-cuboid has been previously computed and stored.

Page 58

System Architecture

Components: Event Dataset, Sequence Query Engine, Sequence Cache, Sequence OLAP Engine, Cuboid Repository, Auxiliary Data Structures, User Interface (sends Queries, receives Results)

The raw data of an S-OLAP system is a set of events that are deposited in an Event Dataset.

The User Interface provides certain user-friendly components to help a user specify an S-cuboid.

Given an S-Cuboid query, the S-OLAP Engine consults a Cuboid Repository to see if such an S-cuboid has been previously computed and stored.

The S-OLAP Engine computes the S-cuboid with the help of certain Auxiliary Data Structures.

Page 59

System Architecture

Components: Event Dataset, Sequence Query Engine, Sequence Cache, Sequence OLAP Engine, Cuboid Repository, Auxiliary Data Structures, User Interface (sends Queries, receives Results)

The raw data of an S-OLAP system is a set of events that are deposited in an Event Dataset.

The User Interface provides certain user-friendly components to help a user specify an S-cuboid.

Given an S-Cuboid query, the S-OLAP Engine consults a Cuboid Repository to see if such an S-cuboid has been previously computed and stored.

The S-OLAP Engine computes the S-cuboid with the help of certain Auxiliary Data Structures.
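
The interplay between the S-OLAP Engine and the Cuboid Repository amounts to a lookup before computation. The following sketch shows that control flow only; the class name, the `compute` callback and the use of a plain dictionary as repository are hypothetical stand-ins rather than the prototype's design.

```python
class SOLAPEngine:
    """Answers S-cuboid queries, reusing stored results where possible."""

    def __init__(self, compute):
        self.compute = compute      # hypothetical function: specification -> S-cuboid
        self.repository = {}        # cuboid specification -> stored S-cuboid
        self.computations = 0

    def query(self, spec):
        if spec not in self.repository:      # not previously computed and stored
            self.repository[spec] = self.compute(spec)
            self.computations += 1
        return self.repository[spec]

# Placeholder computation; a real engine would use auxiliary data structures.
engine = SOLAPEngine(compute=lambda spec: {"spec": spec, "cells": {}})
first = engine.query(("X", "Y", "Y", "X"))
second = engine.query(("X", "Y", "Y", "X"))
print(engine.computations)
```

Only the first query triggers a computation; the repeated query is served from the repository.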

Page 60

Auxiliary Data Structures

Counter-based approach
Inverted indices approach

Page 61

Counter-Based approach

Each cell in an S-cuboid is associated with a counter.
To determine the counters’ values, the entire set of sequences is scanned.
For each sequence s, we determine the cells whose associated patterns are contained in s and increment each such counter by 1.
Basic and simple, but processing iterative queries requires counting from scratch.
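
A minimal sketch of the counter-based approach follows, assuming substring pattern templates and in-memory sequences. The function name and the `scanned` bookkeeping are additions made for illustration; they are not taken from the prototype.

```python
from collections import defaultdict

def counter_based(sequences, template):
    """Scan every sequence; return (cell counters, number of sequences scanned)."""
    n = len(template)
    counters = defaultdict(int)
    scanned = 0
    for seq in sequences:
        scanned += 1
        cells = set()
        for i in range(len(seq) - n + 1):
            window = seq[i:i + n]
            binding = {}
            if all(binding.setdefault(s, v) == v for s, v in zip(template, window)):
                cells.add(tuple(window))
        for cell in cells:          # each matching cell is incremented once
            counters[cell] += 1
    return dict(counters), scanned

sequences = [
    ["Kowloon", "Central", "Kowloon", "Central"],
    ["Kowloon", "Central", "Central", "Kowloon"],
]
qa, scanned_a = counter_based(sequences, ("X", "Y"))
qb, scanned_b = counter_based(sequences, ("X", "Y", "Z"))
print(qa, scanned_a, scanned_b)
```

The second query is just the first one after an APPEND, yet it scans every sequence again. That is the "counting from scratch" cost noted on the slide.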

Page 62

S-OLAP query evaluation

Inverted-Index Approach
Based on the fragment cube concept (X. Li, J. Han, and H. Gonzalez, VLDB 2004).
A set of inverted indices is created by pre-processing the data offline.
  Algorithm BuildIndex (see paper)
During query processing, the relevant inverted indices are joined in real time, based on the matching pattern.
  Algorithm QueryIndices (see paper)
A by-product of answering a query is the creation of new inverted indices.
  Newly built indices are useful for processing iterative S-OLAP operations (see paper for algorithms).
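
The idea can be illustrated with a simplified sketch. It is not the paper's BuildIndex or QueryIndices algorithm. It assumes substring patterns, builds an index from each length-2 pattern to the sequences containing it, and joins two index lists to get candidates for a length-3 pattern. All names are hypothetical.

```python
from collections import defaultdict

def contains(seq, pattern):
    n = len(pattern)
    return any(tuple(seq[i:i + n]) == pattern for i in range(len(seq) - n + 1))

def build_index(sequences, length=2):
    """Map each substring of the given length to the ids of sequences containing it."""
    index = defaultdict(set)
    for sid, seq in sequences.items():
        for i in range(len(seq) - length + 1):
            index[tuple(seq[i:i + length])].add(sid)
    return index

def extend_index(sequences, index):
    """Join overlapping patterns into patterns one symbol longer."""
    longer = defaultdict(set)
    for left, left_ids in index.items():
        for right, right_ids in index.items():
            if left[1:] != right[:-1]:
                continue
            pattern = left + right[-1:]
            # Only sequences on both lists can match; verify adjacency on those.
            for sid in left_ids & right_ids:
                if contains(sequences[sid], pattern):
                    longer[pattern].add(sid)
    return longer

sequences = {
    "Kit": ["Kowloon", "Central", "Kowloon", "Central"],
    "Ben": ["Kowloon", "Central", "Central", "Kowloon"],
}
index2 = build_index(sequences)
index3 = extend_index(sequences, index2)
counts3 = {pattern: len(ids) for pattern, ids in index3.items()}
print(counts3)
```

Ben appears on the lists for both < Kowloon, Central > and < Central, Kowloon >, but his sequence does not contain < Kowloon, Central, Kowloon >, so the join only produces candidates that still need checking. The resulting length-3 index is itself a new inverted index that a later APPEND could reuse.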

Page 63

Experiments

A prototype S-OLAP system was implemented using C++.

Real Data
  Passenger traveling history.
  KDD Cup 2000
    Clickstream data from a web retailer selling legwear and legcare products.
    50,524 sequences.

KDD Cup 2000 Question 1
  Look for page-click patterns.
  We answer this question in an exploratory way via three iterative queries.

Page 64

Experiments

KDD Cup 2000 Question 1
  Look for page-click patterns.
  We answer this question in an exploratory way via three iterative queries.

Qa: Look for the statistics of all 2-step navigations at the “page category” level.
The corresponding pattern template to capture the 2-step navigation semantics is <X,Y>.

Cuboid Qa (44*44 cells)
  < X, Y >, X, Y at “page category” level     # User sessions
  < Main page, Product Catalog >              6,524
  …                                           …
  < Product Catalog, Legwear Product >        2,201
  …                                           …
  < Main page, Promotion ad >                 852
  …                                           …
  < Product Catalog, Legcare Product >        150

Comparatively speaking, very few visitors browse from a product catalog page to a Legcare product page.

Page 65

Experiments

Qa: Look for the statistics of all 2-step navigations at the “page category” level.

Cuboid Qa (44*44 cells)
  < X, Y >, X, Y at “page category” level     # User sessions
  < Main page, Product Catalog >              6,524
  …                                           …
  < Product Catalog, Legwear Product >        2,201
  …                                           …
  < Main page, Promotion ad >                 852
  …                                           …
  < Product Catalog, Legcare Product >        150

Qb: Since many visitors browse from the product catalog to a legwear product page, what exactly are the products they browse?
  1. SLICE
  2. P-DRILL-DOWN

Cuboid Qb (1*279 cells)
  < X, Y > (sliced), X at “page category” level; Y at “page” level     # User sessions
  < Product Catalog, Null >                                            181
  < Product Catalog, PID - 34839 >                                     172
  < Product Catalog, PID - 34897 >                                     163
  …                                                                    …

The most popular product that visitors browse from the catalog page is product 34839 (DKNY skin legwear collection product).

Page 66

Experiments

Qa: Look for the statistics of all 2-step navigations at the “page category” level.

Cuboid Qa (44*44 cells)
  < X, Y >, X, Y at “page category” level     # User sessions
  < Main page, Product Catalog >              6,524
  …                                           …
  < Product Catalog, Legwear Product >        2,201
  …                                           …
  < Main page, Promotion ad >                 852
  …                                           …
  < Product Catalog, Legcare Product >        150

Qb: Since many visitors browse from the product catalog to a legwear product page, what exactly are the products they browse?
  1. SLICE
  2. P-DRILL-DOWN

Cuboid Qb (1*279 cells)
  < X, Y > (sliced), X at “page category” level; Y at “page” level     # User sessions
  < Product Catalog, Null >                                            181
  < Product Catalog, PID - 34839 >                                     172
  < Product Catalog, PID - 34897 >                                     163
  …                                                                    …

Qc: APPEND(Z)

Cuboid Qc (1*279*279 cells)
  < X, Y, Z > (sliced), X at “page category” level; Y, Z at “page” level     # User sessions
  …                                                                          …
  < Product Catalog, PID - 34839, PID - 34839 >                              17
  < Product Catalog, PID - 34839, PID - 34897 >                              14
  …                                                                          …

The runtime of II is higher than that of CB in Qa because we include the index precomputation time in Qa.

Page 67

Experiments

Qa: Look for the statistics of all 2-step navigations at the “page category” level.

Cuboid Qa (44*44 cells)
  < X, Y >, X, Y at “page category” level     # User sessions
  < Main page, Product Catalog >              6,524
  …                                           …
  < Product Catalog, Legwear Product >        2,201
  …                                           …
  < Main page, Promotion ad >                 852
  …                                           …
  < Product Catalog, Legcare Product >        150

Qb: Since many visitors browse from the product catalog to a legwear product page, what exactly are the products they browse?
  1. SLICE
  2. P-DRILL-DOWN

Cuboid Qb (1*279 cells)
  < X, Y > (sliced), X at “page category” level; Y at “page” level     # User sessions
  < Product Catalog, Null >                                            181
  < Product Catalog, PID - 34839 >                                     172
  < Product Catalog, PID - 34897 >                                     163
  …                                                                    …

Qc: APPEND(Z)

Cuboid Qc (1*279*279 cells)
  < X, Y, Z > (sliced), X at “page category” level; Y, Z at “page” level     # User sessions
  …                                                                          …
  < Product Catalog, PID - 34839, PID - 34839 >                              17
  < Product Catalog, PID - 34839, PID - 34897 >                              14
  …                                                                          …

The runtime of II is higher than that of CB in Qa because we include the index precomputation time in Qa.

For the iterative queries, II takes advantage of processing only the sequences that possess the pattern < Product Catalog, Legwear Product >.

Page 68

Experiments on synthetic data

Study the scalability of the Counter-Based approach (CB) and the Inverted-Index approach (II) under a series of APPEND operations.

QA1: SUBSTRING(X,Y)
  SLICE + APPEND
QA2: (X,Y,Z)
  SLICE + APPEND
QA3: (X,Y,Z,A)
  SLICE + APPEND
QA4: (X,Y,Z,A,B)
  SLICE + APPEND
QA5: (X,Y,Z,A,B,C)

Page 69

Experiments on synthetic data

[Figure: cumulative runtime]

Both CB and II scale linearly w.r.t. the number of sequences. II outperformed CB on all datasets in this experiment.

II precomputation time: less than 4 secs in all cases

Page 70

Experiments on synthetic data

[Figures: cumulative runtime; cumulative # sequences scanned]

Both CB and II scale linearly w.r.t. the number of sequences. II outperformed CB on all datasets in this experiment.

CB scans the entire dataset once for each iterative query. For QA1, II does not need to scan any data sequences because the query can be answered by the inverted indices directly.

II precomputation time: less than 4 secs in all cases

Page 71

Experiments on synthetic data

Vary
  Average sequence length (L)
  Data distribution (Skew factor)
  Domain of the events (I)

P-ROLL-UP operation
P-DRILL-DOWN operation
<X,Y,Y,X> pattern templates
Substring / Subsequence pattern templates
(See technical report)

Page 72

Conclusion

We propose a new online analytical processing system for sequence data analysis (the S-OLAP system).

The proposed system is motivated by real-life problems.
  Page click analysis
  RFID log analysis
  … etc.

We defined basic concepts
  S-Cuboid, S-Cube

Identified two properties of S-Cube
  Infinite number of S-Cuboids
  Non-summarizable

Illustrated the usability of the proposed S-OLAP system through a prototype system that works on real data.

Page 73

The End
Thank you!

Page 74

Synthetic dataset generator

Synthetic sequence databases are synthesized in the following manner:

The generated sequence database has D sequences.
Each sequence s in a dataset is generated independently.
  The sequence length l, with mean L, is first determined by a random variable following a Poisson distribution.
  Then, we repeatedly add events to the sequence until the target length l is reached.
  The first event symbol is randomly selected according to a pre-determined distribution following Zipf’s law with parameters I and Θ, where I is the number of possible symbols and Θ is the skew factor.
  Subsequent events are generated one after the other using a Markov chain of degree 1. The conditional probabilities are pre-determined and are skewed according to Zipf’s law.
All the generated sequences form a single sequence group, which serves as the input data to the algorithms.
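
The procedure above can be sketched with the Python standard library. Several details are assumptions made for this sketch because the slide leaves them open: sequences are clamped to at least one event, symbols are the integers 0 to I-1, and the transition row for a symbol reuses the Zipf weights over a rotated ordering of the symbols. The function names are hypothetical.

```python
import math
import random

def zipf_weights(n, theta):
    """Unnormalised Zipf weights for ranks 1..n with skew factor theta."""
    return [1.0 / (rank ** theta) for rank in range(1, n + 1)]

def poisson(rng, mean):
    """Draw from a Poisson distribution (Knuth's method, fine for small means)."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def generate(D, L, I, theta, seed=0):
    rng = random.Random(seed)
    symbols = list(range(I))
    weights = zipf_weights(I, theta)
    database = []
    for _ in range(D):
        length = max(1, poisson(rng, L))             # assumption: no empty sequences
        current = rng.choices(symbols, weights)[0]   # first event: Zipf(I, theta)
        seq = [current]
        while len(seq) < length:
            # Assumed transition row: Zipf weights over symbols rotated by `current`.
            rotated = symbols[current:] + symbols[:current]
            current = rng.choices(rotated, weights)[0]
            seq.append(current)
        database.append(seq)
    return database

db = generate(D=100, L=5, I=10, theta=0.9, seed=1)
print(len(db), db[0])
```

Fixing the seed makes a generated database reproducible, which helps when comparing two evaluation approaches on the same input.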

Page 75

Related Work

Sequence Databases
  PREDATOR (Seshadri, Livny, and Ramakrishnan; SIGMOD 94, VLDB 96)
  DEVise (Ramakrishnan et al.; SSDBM 98)
  TS-SQL (Sadri et al.; PODS 01)

OLAP
  Data-cube operator (Gray et al.; 95), iceberg-cube, star-schema, …, etc.

OLAP on unconventional data
  RFID-cube (Gonzalez, Han, and Li; VLDB 06)
  Stream-cube (Chen et al.; VLDB 02)
  XML-cube (Wiwatwattana et al.; ICDE 07)