[email protected] Proprietary and Confidential
NOTICE: Proprietary and Confidential
This material is proprietary to A. Teredesai and GCCIS, RIT.
Slide 1
A Comprehensive Look at Mining Time-Series and
Sequential Patterns
Ankur Teredesai, Department of Computer Science, Rochester Institute of Technology
Definition of Time-Series
[Figure: a time series of ~500 sensor readings plotted over time, values roughly in the range 23-29 (e.g., 25.1750, 25.2250, ..., 24.6250, 24.7500)]
A time series is a collection of observations made sequentially in time.
[email protected] Dr. Ankur M. Teredesai P2
Sample Example for Time-Series (cont.)
People measure things:
• The president's approval rating
• Their blood pressure
• The annual rainfall in Los Angeles
• The value of their Yahoo stock
• The number of web hits per second
... and things change over time.
Thus time series occur in virtually every medical, scientific and business domain.
What Can We Do with Time-Series?
• Clustering
• Classification
• Query by Content
• Rule Discovery (e.g., a rule with support s = 0.5 and confidence c = 0.3)
• Motif Discovery
• Novelty Detection
Sample Model for Information Streams Mining
• Information streams vs. time series:
• In many emerging science and business applications, data takes the form of streams rather than static datasets.
• An information stream can be defined as continuously arriving dynamic data, in contrast to static time-series data.
*MIESIS (MIning from Earth Science Information Streams)
Information Streams Segmentation
We need segmentation for:
• Symbolization
• Dimensionality reduction
Using a fixed-length sliding window
[Figure: a stream segmented by a sliding window, with segments labeled by the symbols a, b, c]
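The fixed-length sliding-window segmentation described above can be sketched in a few lines. This is a minimal illustration; the window width, step, and data values are arbitrary choices, not values from the slides:

```python
def sliding_windows(series, width, step):
    """Cut a stream into fixed-length windows; each window becomes one segment."""
    return [series[i:i + width] for i in range(0, len(series) - width + 1, step)]

# Hypothetical sensor readings in the style of the earlier example.
data = [25.1, 25.2, 25.3, 25.2, 25.0, 24.9, 24.8, 24.9]
print(sliding_windows(data, width=4, step=2))
# [[25.1, 25.2, 25.3, 25.2], [25.3, 25.2, 25.0, 24.9], [25.0, 24.9, 24.8, 24.9]]
```

Each window can then be symbolized or reduced in dimensionality independently.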
Information Streams Segmentation (cont.)
Using turning points
[Figure: information stream data from a sensor, value vs. time]
Clustering
Feature extraction
• For dimensionality reduction, we need to extract features from raw information streams
• DFT (Discrete Fourier Transform), DWT (Discrete Wavelet Transform), PAA (Piecewise Aggregate Approximation), etc.
Similarity measure
• Defining the similarity between two raw information streams or two feature vectors
• Euclidean distance metric, Pearson's correlation coefficient, etc.
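A minimal sketch of two of the ingredients just listed: PAA for feature extraction and Euclidean distance as the similarity measure (the segment count and data are illustrative):

```python
import math

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-share segment."""
    n = len(series)
    out = []
    for i in range(n_segments):
        lo, hi = i * n // n_segments, (i + 1) * n // n_segments
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(paa([1.0, 1.0, 2.0, 2.0, 3.0, 3.0], 3))  # [1.0, 2.0, 3.0]
print(euclidean([0.0, 0.0], [3.0, 4.0]))       # 5.0
```

Distances are then computed between the short PAA vectors instead of the raw streams.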
Clustering (cont.)
Hierarchical Clustering
Partitional Clustering (e.g. K-means)
Symbolic Representation
[Figure, Example 1: a feature space partitioned into regions R1-R9]
[Figure, Example 2: stream segments labeled with the symbols a, b, c]
Symbolic Representation (cont.)
Express the information stream as a sequence of symbols.
Now we can work in a lower-dimensional space than the raw information stream data. We can also use well-known string-processing data structures such as inverted indexes, HMMs, or suffix trees.
aaabaabcbabccb
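One simple way to obtain such a symbol string, shown here as a hedged sketch (the breakpoints are invented for illustration, not the discretization used in the slides), is to map each value to a letter according to which region of the value range it falls in:

```python
def symbolize(series, breakpoints, alphabet="abc"):
    """Map each value to the symbol of the value region it falls in."""
    return "".join(alphabet[sum(v > bp for bp in breakpoints)] for v in series)

# Two breakpoints split the value range into three regions: a, b, c.
print(symbolize([0.1, 0.5, 0.9, 0.4], breakpoints=[0.33, 0.66]))  # abcb
```

The resulting string is what gets fed to inverted indexes, HMMs, or suffix trees.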
Possible Mining Operations
Novelty detection
• Can be used to identify potential anomaly events
• Also referred to as the detection of "Aberrant Behavior", "Anomalies", "Faults", "Surprises", "Deviants", "Temporal Change", and "Outliers"
• As these terms suggest, we can detect previously unseen patterns or sequences in an incoming information stream, relative to a training dataset
Prediction
• The utility of a prediction model lies in detecting events rather than predicting numerical values. An event is a meaningful object to which we can assign semantics, e.g., an earthquake or flood.
Finding correlation between clusters
• We can detect spatial/temporal correlations between clusters or information streams
Mining Time-Series and Sequence Data
Time-series database
• Consists of sequences of values or events changing with time
Applications
• Financial: stock price, inflation
• Biomedical: blood pressure
• Meteorological: precipitation
Mining Time-Series and Sequence Data: Trend analysis
Categories of Time-Series Movements
• Long-term or trend movements (trend curve)
• Cyclic movements or cyclic variations, e.g., business cycles
• Seasonal movements or seasonal variations
  – i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years
• Irregular or random movements
Estimation of the Trend Curve
The freehand method
• Fit the curve by looking at the graph
• Costly and barely reliable for large-scale data mining
The least-squares method
• Find the curve minimizing the sum of squared deviations of points on the curve from the corresponding data points
The moving-average method
• Eliminate cyclic, seasonal and irregular patterns
• Loss of end data
• Sensitive to outliers
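The moving-average smoothing described above is simple to state in code; note that the output is shorter than the input, which is exactly the "loss of end data" drawback:

```python
def moving_average(series, k):
    """Average of each window of k consecutive points; drops k-1 end points."""
    return [sum(series[i:i + k]) / k for i in range(len(series) - k + 1)]

print(moving_average([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]
```

A single large outlier shifts every window containing it, which is the sensitivity-to-outliers drawback.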
Discovery of Trend in Time-Series (1): Estimation of Seasonal Variations
• Seasonal index
  – A set of numbers showing the relative values of a variable during the months of the year
  – E.g., if the sales during October, November, and December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are the seasonal index numbers for these months
• Deseasonalized data
  – Data adjusted for seasonal variations
  – E.g., divide the original monthly data by the seasonal index numbers for the corresponding months
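Working the seasonal-index idea numerically (only the 80/120/140 indexes come from the slide; the raw sales figures are hypothetical):

```python
# Seasonal indexes from the slide, expressed as fractions of the monthly average.
seasonal_index = {"Oct": 0.80, "Nov": 1.20, "Dec": 1.40}
monthly_sales = {"Oct": 96.0, "Nov": 150.0, "Dec": 168.0}  # hypothetical raw data

# Deseasonalize: divide each month's figure by its seasonal index.
deseasonalized = {m: monthly_sales[m] / seasonal_index[m] for m in monthly_sales}
for m, v in deseasonalized.items():
    print(m, round(v, 1))  # Oct ~120, Nov ~125, Dec ~120
```

After the adjustment the three months are directly comparable: the apparent December spike was almost entirely seasonal.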
Similarity Search in Time-Series Analysis
A normal database query finds exact matches. A similarity search finds data sequences that differ only slightly from the given query sequence.
Two categories of similarity queries
• Whole matching: find a sequence that is similar to the query sequence
• Subsequence matching: find all pairs of similar sequences
Typical applications
• Financial market
• Market basket data analysis
• Scientific databases
• Medical diagnosis
Data Transformation
Many techniques for signal analysis require the data to be in the frequency domain. Usually data-independent transformations are used:
• The transformation matrix is determined a priori
  – E.g., discrete Fourier transform (DFT), discrete wavelet transform (DWT)
• The Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain
• DFT does a good job of concentrating energy in the first few coefficients
• If we keep only the first few DFT coefficients, we can compute a lower bound on the actual distance
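The lower-bound property can be checked directly with a small orthonormal DFT written from scratch (a sketch; real systems keep only low-frequency coefficients and use optimized FFTs):

```python
import cmath
import math

def dft(x):
    """Orthonormal DFT (scaled by 1/sqrt(n)), so Euclidean distances are preserved."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)) / math.sqrt(n)
            for f in range(n)]

def dist(u, v):
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(u, v)))

x = [1.0, 2.0, 1.0, 0.0]
y = [1.0, 1.0, 1.0, 1.0]
# Parseval: the full transform preserves distance; truncating to the first
# k coefficients can only shrink it, giving a lower bound (no false dismissals).
assert abs(dist(dft(x), dft(y)) - dist(x, y)) < 1e-9
assert dist(dft(x)[:2], dft(y)[:2]) <= dist(x, y)
```

This is why an index built on the first few coefficients never discards a true match; false hits are removed in a postprocessing step.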
Finding Surprising Patterns in a Time Series Database in Linear Time and Space
Paper by Eamonn Keogh, Stefano Lonardi, and Bill Chiu; presented at ACM SIGKDD 2002
Main Purpose
Novelty Detection
• The authors note that this problem should not be confused with the relatively simple problem of outlier detection.
• They focused on finding surprising patterns, not on finding individually surprising datapoints.
• The blue time series at the top is a normal, healthy human heartbeat with an artificial "flatline" added. The red sequence at the bottom indicates how surprising local subsections of the time series are.
Basic Ideas
A pattern is surprising if its frequency of occurrence differs greatly from what we expected.
Their notion of a pattern's surprisingness is not tied exclusively to its shape; instead it depends on the difference between the shape's expected frequency and its observed frequency.
Example: consider the head-and-shoulders pattern shown below
• This pattern occurs in a stock market time series an average of three times a year
• If it occurred ten times this year: surprising
• If its frequency of occurrence is less than expected: also a surprising pattern
Approach
Formal definition of a surprising pattern
• A time-series pattern P, extracted from database X, is surprising relative to a database R if the probability of its occurrence differs greatly from that expected by chance, assuming that R and X are created by the same underlying process.
Example: if x = principalskinner
• Σ is {a, c, e, i, k, l, n, p, r, s}
• |x| is 16
• skin is a substring of x
• prin is a prefix of x
• ner is a suffix of x
• If y = in, then fx(y) = 2
• If y = pal, then fx(y) = 1
How about y = eik?
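The substring-counting function fx(y) used in the example is just an overlapping-occurrence count:

```python
def f(x, y):
    """fx(y): number of (possibly overlapping) occurrences of substring y in x."""
    return sum(x[i:i + len(y)] == y for i in range(len(x) - len(y) + 1))

x = "principalskinner"
print(len(x), f(x, "in"), f(x, "pal"), f(x, "eik"))  # 16 2 1 0
```

So fx("eik") = 0: the string never occurs, which is exactly the case where estimating its expected frequency from a model becomes interesting.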
Approach (cont.)
Steps (TARZAN algorithm)
• Discretize the time series into symbolic strings
  – Fixed-size sliding window
  – Slope of the best-fitting line
• Calculate the probability of any pattern, including ones never seen before, using Markov models
• To maintain the linear time and space property, use a suffix tree data structure
• Compute scores by comparing the trees built from the reference data and from the incoming information stream
aaabaabcbabccb
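A hedged sketch of the Markov estimate such algorithms rely on (the paper's actual scoring uses suffix trees and normalization; only the core approximation E[f(w)] ≈ f(w[:-1]) · f(w[1:]) / f(w[1:-1]) is shown here):

```python
def count(x, y):
    """Overlapping occurrences of substring y in x."""
    return sum(x[i:i + len(y)] == y for i in range(len(x) - len(y) + 1))

def expected_count(reference, w):
    """Markov estimate of how often w 'should' occur, from its substrings' counts."""
    denom = count(reference, w[1:-1])
    return count(reference, w[:-1]) * count(reference, w[1:]) / denom if denom else 0.0

r = "aaabaabcbabccb"  # the symbol string from the slide
print(count(r, "abc"), expected_count(r, "abc"))  # observed 2 vs. expected 1.2
```

A large gap between the observed and expected counts is what marks a pattern as surprising.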
Experimental Evaluation
Two features
• Sensitivity (high true-positive rate)
  – The algorithm can find truly surprising patterns in a time series
  – This corresponds to recall
• Selectivity (low false-positive rate)
  – The algorithm will not find spurious "surprising" patterns in a time series
  – This corresponds to precision
The goal is to maintain both high precision and high recall
• They achieved high sensitivity
• But selectivity?
Experimental Evaluation (cont.): Shock ECG
[Figure: training data, test data (subset), and Tarzan's level of surprise, plotted over ~1600 points]
Experimental Evaluation (cont.): Power Demand
• They consider a dataset that contains the power demand for a Dutch research facility for the entire year of 1997. The data is sampled as 15-minute averages, and thus contains 35,040 points.
• [Figure: the first 3 weeks of the power demand dataset.] Note the repeating pattern of a strong peak on each of the five weekdays, followed by relatively quiet weekends.
Experimental Evaluation (Power Demand cont.)
They used the period from Monday, January 6th to Sunday, March 23rd as reference data. This period includes national holidays. They tested on the remainder of the year.
They showed the 3 most surprising subsequences found by each algorithm. For each of the 3 approaches they showed the entire week (beginning Monday) in which the 3 largest values of surprise fell.
Both TSA-tree and IMM returned sequences that appear to be normal workweeks.
Tarzan returned 3 sequences that correspond to the weeks that contain national holidays.
[Figure: the most surprising weeks found by Tarzan, TSA-tree, and IMM]
Experimental Evaluation (cont.)
• The previous experiments demonstrate the ability of Tarzan to find surprising patterns (sensitivity)
• However, they also need to consider Tarzan's selectivity
  – To reduce false alarms, they attempted to scale to massive datasets
• If Tarzan is trained on a short random-walk dataset, the chance that patterns similar to those in the test data exist in the short training database is very small, producing many false alarms
  – Solution: increase the size of the training data; the surprisingness of the test data should then decrease
  – The more training on huge random-walk data, the fewer surprising patterns are detected
Possible Future Research Opportunities
Mentioned in the paper
• Incorporating user feedback and domain-based constraints
• Applying different feature extraction techniques
Information streams + ontology
• Finding methods to combine information stream mining with an ontology
• Intuitively, if we can extract general/abnormal patterns in information streams and generate clusters, we can attach semantics to the patterns or clusters.
• For example, using a "News-Stock Ontology Model" we could relate news stream data about the "War in Iraq" to the stock price changes of an oil company: a rapid increase in the amount of news regarding "war" and "Iraq" at time t can be linked to a rapid increase/decrease of the oil company's stock price at time t+α.
Suffix Tree Data Structure
Multidimensional Indexing
Multidimensional index
• Constructed for efficient access using the first few Fourier coefficients
• Use the index to retrieve the sequences that are at most a certain small distance away from the query sequence
• Perform postprocessing by computing the actual distance between sequences in the time domain and discarding any false matches
B-Trees
Generalizes the multilevel index
• The number of levels varies with the size of the data file, but is quite often 3
• Height balanced
  – Equal-length access paths to different records
• Adapts well to insertions and deletions
DBMSs typically use a variant called a B+tree
• All nodes have the same format: n keys, n+1 pointers, at least half of them in use
Useful for primary and secondary indexes, on primary keys and non-keys
B+Tree Example
[Figure: an example B+tree with n = 3; internal keys include 30, 100, 120, 150, 180, over leaves holding 3 5 11 | 30 35 | 100 101 110 | 120 130 | 150 156 179 | 180 200]
Sample Non-Leaf Node
[Figure: a non-leaf node with keys 120, 150, 180; its four pointers lead to keys k < 120, 120 ≤ k < 150, 150 ≤ k < 180, and k ≥ 180]
Sample Leaf Node
[Figure: a leaf node (reached from a non-leaf node) holding keys 120 and 130 with one slot unused; each key's pointer leads to the record with that key, and a sequence pointer leads to the next leaf in sequence]
Nodes Must Not Be Too Empty
Number of pointers in use:
• At internal nodes, at least ⌈(n+1)/2⌉ (to child nodes)
• At leaves, at least ⌊(n+1)/2⌋ (to data records/blocks)
Node Bounds
[Figure, n = 3: a full non-leaf node (keys 120 150 180) vs. a minimum non-leaf node (key 30); a full leaf (3 5 11) vs. a minimum leaf (30 35)]
B+tree Rules
All leaves at the same (lowest) level
• Balanced tree
Pointers in leaves point to records
• Except for the "sequence pointer"
Number of pointers/keys for a B+tree:
[email protected] Dr. Ankur M. Teredesai P39
                      Max ptrs   Max keys   Min ptrs (→data at leaves)   Min keys
Non-leaf (non-root)   n+1        n          ⌈(n+1)/2⌉                    ⌈(n+1)/2⌉ − 1
Leaf (non-root)       n+1        n          ⌊(n+1)/2⌋                    ⌊(n+1)/2⌋
Root                  n+1        n          2♠                           1

♠ Can be 1 if there is only one record in the file
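The minimum-occupancy entries in the table are easy to verify for the running examples n = 3 and n = 4:

```python
import math

n = 3
print(math.ceil((n + 1) / 2))      # 2: min pointers at a non-leaf node
print(math.ceil((n + 1) / 2) - 1)  # 1: min keys at a non-leaf node
print(math.floor((n + 1) / 2))     # 2: min keys/data pointers at a leaf

n = 4  # the value used in the deletion examples later
print(math.floor((n + 1) / 2))     # 2: min keys in a leaf
print(math.ceil((n + 1) / 2) - 1)  # 2: min keys in a non-leaf
```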
B+tree Insertions
Search for the key being inserted. Four cases:
• Leaf has space
  – Just insert (key, pointer-to-record)
• Leaf overflow
• Non-leaf overflow
• New root
Leaf Overflow: insert key 7 (n = 3)
[Figure: the leaf 3 5 11 overflows; it splits into 3 5 and 7 11, and the separator key 7 is copied up into the parent]
Non-Leaf Overflow: insert key 160 (n = 3)
[Figure: inserting 160 overflows the non-leaf node with keys 120 150 180; the node splits and a key moves up into the root, redistributing the leaves 150 156, 160 179, 180 200]
New Root: insert 45 (n = 3)
[Figure: inserting 45 overflows a leaf, the split propagates upward, and a new root (key 40) is created]
The tree grows at the root and maintains balance.
B+tree Deletions
Search for the key being deleted
• If found, delete it
Three broad cases:
• Leaf does not underflow
• Borrow keys from an adjacent sibling, if that does not cause the sibling to underflow
• Coalesce with a sibling node
  – Repeat if needed
It is sometimes acceptable to let a B-tree leaf become sub-minimum (no mergers), violating the B-tree definition.
Leaf Does Not Underflow: delete key 35
(n = 4; minimum number of keys in a leaf = ⌊5/2⌋ = 2)
[Figure: deleting 35 from the leaf 10 20 30 35 leaves three keys, still above the minimum, so nothing else changes]
Borrow Keys: delete key 50
(n = 4; minimum number of keys in a leaf = ⌊5/2⌋ = 2)
[Figure: deleting 50 leaves the leaf 40 underfull; it borrows key 35 from its sibling 10 20 30 35, and the parent separator 40 is updated to 35]
Coalesce with Sibling: delete key 50 (n = 4)
[Figure: deleting 50 leaves the leaf 40 underfull; it coalesces with its sibling 20 30 into 20 30 40, and the separator 40 is removed from the parent]
Coalesce Non-Leaf: delete 37
(n = 4; minimum number of keys in a non-leaf = ⌈(n+1)/2⌉ − 1 = 3 − 1 = 2)
[Figure: deleting 37 triggers a leaf coalesce whose underflow propagates: non-leaf nodes coalesce too, and a new root emerges above leaves 1 3, 10 14, 20 22, 25 26, 30 40 45]
Tree shrinks at root
B+tree Deletions in Practice
Coalescing is often not implemented
• Too hard and usually not worth it!
• Subsequent insertions may return the node to the required minimum size
• Compromise
  – Try redistributing keys with a sibling
  – If not possible, leave the node as is
• If all accesses to records go through the B-tree
  – Place a "tombstone" for the deleted record at the leaf
Traditional B-Trees
A B-tree is similar to a B+tree
• Each search key appears only once
  – No redundant storage of search keys
• Additional pointer field for each search key in a non-leaf node
  – Points directly to the record
Node layouts:
P1 K1 P2 ... Pn−1 Kn−1 Pn   (B+tree non-leaf)
versus
P1 R1 K1 P2 R2 K2 ... Pn−1 Rn−1 Kn−1 Pn   (B-tree non-leaf, with record pointers Ri)
B-Tree Advantages and Disadvantages
Advantages
• Fewer nodes than the corresponding B+tree
• Possible to find a key before hitting a leaf node
Disadvantages
• Only a small fraction of all keys are found early
• Non-leaf nodes are larger, so reduced fan-out
  – A B-tree is often deeper than the corresponding B+tree
• More complex than B+trees
  – Insertion/deletion and overall implementation
• B+trees are usually better than B-trees
B+ Trees in Practice
Typical order: 100
• Typical fill factor around 67%
• Average fanout around 133
Typical capacities:
• Height 4: 133^4 = 312,900,721 records
• Height 3: 133^3 = 2,352,637 records
Can often hold the top levels in the buffer pool:
• Level 1 = 1 page = 8 KB
• Level 2 = 133 pages ≈ 1 MB
• Level 3 = 17,689 pages ≈ 133 MB
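The capacity figures above follow directly from the fanout arithmetic:

```python
fanout = 133
print(fanout ** 3)  # 2352637: records reachable through 3 levels
print(fanout ** 4)  # 312900721: records reachable through 4 levels
print(fanout ** 2)  # 17689: pages at the third buffered level
```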
Tree-Structured Indexes
Ideal for range searches and equality searches
ISAM: static structure
• Only leaf pages are modified
• Overflow pages degrade performance
B+tree: dynamic structure
• Inserts/deletes leave the tree height-balanced, and it offers graceful growth and shrinking
  – High fanout (F) ⇒ depth rarely > 3 or 4
  – 67% occupancy on average
• Preferable to ISAM, modulo locking considerations
• Widely used DBMS index structure and one of the most optimized DBMS components
Multidimensional Data
Geographic and multidimensional data applications
• Sale (store, day, item, color, size, etc.)
  – Each sale is a point in 5-dimensional space
• Customer: (age, salary, zip, married, ...)
Typical queries
• Range queries
  – Find employees in the Toy department who make at least 25K dollars
• Nearest neighbor
  – I am here: where's the nearest MacGregors?
• Is this expressible in SQL?
Big Impediment
For these queries there is no clean way to eliminate many of the records that don't meet the WHERE condition.
Approaches
• Index on one attribute
  – Get data for that attribute and filter out the rest
• Index on attributes independently
  – Intersect pointers in main memory to save disk I/O
  – Does this help with nearest neighbor?
• Multiple-key index
  – An index on one attribute provides a pointer to an index on the other
2-Level Indexing
[Figure: an index I1 on the first attribute points to indexes I2 and I3 on the second attribute]
Example
[Figure: a Dept index (Art, Sales, Toy) points into per-department Salary indexes (10k 15k 17k 21k; 12k 15k 15k 19k); a sample employee record has Name = Joe, Dept = Sales, Salary = 15k]
Some Queries
Question
• For what kinds of conditions on dept and salary will a multiple-key index (dept first) significantly reduce the number of disk I/Os?
How about finding records where …
• Dept = “Sales” and Salary = 20k
• Dept = “Sales” and Salary > 20k
• Dept = “Sales”
• Salary = 20k
Interesting Application: Geographic Data
[Figure: points in the x-y plane]
Data
• <X1, Y1, attributes>
• <X2, Y2, attributes>
• ...
Queries
• What city is at <Xi, Yi>?
• What is within 5 miles of <Xi, Yi>?
• Which point is closest to <Xi, Yi>?
Example
[Figure: points a-o scattered in a 40×40 region that is recursively partitioned into rectangles; searches for points near f and for points near b each need to visit only a few partitions]
Queries
• Find points with Yi > 20
• Find points with Xi < 5
• Find points "close" to i = <12, 38>
• Find points "close" to b = <7, 24>
Other Structures
Other geographic index structures
• Quad trees
• R-trees
More multikey indexes
• Grid
• Partitioned hash
Grid Index
[Figure: a grid with key 1 values V1...Vn as rows and key 2 values X1...Xn as columns; each cell points to records, e.g., the cell (V3, X2) leads to records with key1 = V3, key2 = X2]
Claim
We can quickly find records with
• key 1 = Vi and key 2 = Xj
• key 1 = Vi
• key 2 = Xj
And also ranges ...
• E.g., key 1 ≥ Vi and key 2 < Xj
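A toy grid index makes the claim concrete (the partition boundaries here are invented for illustration):

```python
key1_bounds = [10, 20, 30]  # row boundaries for key 1 (the V regions)
key2_bounds = [100, 200]    # column boundaries for key 2 (the X regions)

def cell(key1, key2):
    """Locate the grid cell by counting how many boundaries each key passes."""
    return (sum(key1 >= b for b in key1_bounds),
            sum(key2 >= b for b in key2_bounds))

grid = {}  # (row, col) -> bucket of records
grid.setdefault(cell(15, 250), []).append(("some record", 15, 250))
print(cell(15, 250))  # (1, 2)
```

An equality query on both keys touches one cell; fixing only key 1 scans one row of cells; and a range query scans a contiguous block of cells.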
Storing Grid Indexes
The catch with grid indexes: how is the grid index stored on disk?
Problem
• We need regularity to compute the position of the <Vi, Xj> entry, like an array
[Figure: the grid laid out row by row on disk: row V1's cells X1...X4, then row V2's, then row V3's]
Solution: Use Indirection
[Figure: the grid cells (rows V1...V4, columns X1...X3) contain only pointers to buckets; several cells may share a bucket]
The grid only contains pointers to buckets.
Indexing Grid on Value Ranges
[Figure: a salary grid indexed through linear scales: Dept values Toy, Sales, Personnel map to columns 1, 2, 3; salary ranges 0-20K, 20K-50K, 50K- map to rows 1, 2, 3]
The grid can be regular without wasting space, though we do pay the price of indirection.
Partitioned Hashing
Hash function
• Combines several attributes
• Great when attribute values are fully specified
A partitioned hash function devotes some bits of the bucket number to each attribute independently.
[Figure: key 1 hashes via h1 to one group of bits and key 2 via h2 to another; their concatenation addresses the bucket]
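A toy partitioned hash consistent with the examples that follow (assumed bit widths: 1 bit from the department, 2 bits from the salary; the h1/h2 values mirror the example tables):

```python
h1 = {"toy": 0b0, "sales": 0b1, "art": 0b1}                # 1 bit per department
h2 = {"10k": 0b01, "20k": 0b11, "30k": 0b01, "40k": 0b00}  # 2 bits per salary

def bucket(dept, salary):
    """Concatenate each attribute's independent bits into the bucket number."""
    return (h1[dept] << 2) | h2[salary]

print(format(bucket("sales", "40k"), "03b"))  # 100: one bucket for a fully specified query
# Fixing only the department leaves the salary bits free: search all 1xx buckets.
print([format(0b100 | b, "03b") for b in range(4)])  # ['100', '101', '110', '111']
```

This is the mechanism behind the four examples below: a fully specified query hits one bucket, while a partially specified query hits the subset of buckets matching its bits.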
Example (1): insert <Fred, toy, 10k>, <Joe, sales, 10k>, <Sally, art, 30k>
h1(toy) = 0, h1(sales) = 1, h1(art) = 1, ...
h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00, ...
[Figure: buckets 000-111; Fred lands in bucket 001, Joe and Sally both in bucket 101]
Example (2): find employees with Dept = Sales and Sal = 40k
h1(sales) = 1 and h2(40k) = 00, so look only in bucket 100.
[Figure: buckets 000-111 holding <Fred><Joe><Jan>, <Mary>, <Sally>, <Tom><Bill><Andy>]
Example (3): find employees with Sal = 30k
h2(30k) = 01, so look in every bucket whose last two bits are 01: buckets 001 and 101.
[Figure: the same buckets, with 001 and 101 highlighted]
Example (4): find employees with Dept = Sales
h1(sales) = 1, so look in every bucket whose first bit is 1: buckets 100, 101, 110, 111.
[Figure: the same buckets, with the 1xx half highlighted]
R Trees
A Dynamic Index Structure for Spatial Representation
Why R-Trees?
• Multi-dimensional spaces are not well represented by point locations
• We need to be able to perform range searches
• One-dimensional indexes are not suitable for multi-dimensional spaces
• Ex.: find all the counties within a 20 mi radius of Georgia Tech
Main Concepts
• Height-balanced tree similar to a B-tree
• Index records in leaf nodes point to data objects
• The index is dynamic; no periodic reorganization is required
• Index records (at leaf nodes) are of the form (I, tuple-identifier), where
  I is an n-dimensional bounding rectangle, i.e., I = (I0, I1, ..., In−1), where n is the number of dimensions and each Ii = [a, b] is a closed bounded interval
More Concepts ...
Non-leaf nodes are of the form (I, child-pointer), where child-pointer is the address of a lower node and I is the smallest rectangle covering all the rectangles in the lower node's entries.
M = maximum number of entries in one node
More Concepts ...
• m = parameter specifying the minimum number of entries in a node; m can be tuned and is ≤ M/2
• An R-tree containing N index records has height at most ⌈log_m N⌉ − 1
• Worst-case space utilization: m/M
• Maximum number of nodes: ⌈N/m⌉ + ⌈N/m²⌉ + ... + 1
Searching
Denote the rectangle part of an entry E by E.I and the child-pointer part by E.p.
Algorithm Search: given an R-tree with root node T, find all index records whose rectangles overlap a search rectangle S.
Step 1 [Search subtrees]: If T is not a leaf, check each entry E to determine whether E.I overlaps S. For all overlapping entries, invoke Search on the tree whose root is E.p.
Step 2 [Search leaf node]: If T is a leaf, check each entry E to determine whether E.I overlaps S. If so, E is a qualifying record; return E.
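The overlap test that Search depends on is a per-dimension interval check; a sketch for 2-D rectangles stored as ((xlo, xhi), (ylo, yhi)):

```python
def overlaps(a, b):
    """True iff rectangles a and b intersect in every dimension."""
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

r1 = ((0, 10), (0, 10))
r2 = ((5, 15), (8, 20))
r3 = ((11, 12), (0, 1))
print(overlaps(r1, r2), overlaps(r1, r3))  # True False
```

Search recurses into exactly the child entries whose bounding rectangles pass this test.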
Insertion into an R-Tree
• Similar to B-tree insertion
• New index records are added to leaves
• Overflowing nodes are split
• Splits propagate up the tree
Algorithm Insert (details)
1. Invoke ChooseLeaf to select a leaf node L in which to place E
2. If L has room for another entry, install E; else invoke SplitNode on L, obtaining L and LL
3. Invoke AdjustTree on L (and LL if a split was performed)
4. If the root was split, create a new root whose two children are the halves of the old root
Algorithm ChooseLeaf
1. Set N to be the root
2. If N is a leaf, return N
3. If N is not a leaf, let F be the entry in N whose rectangle needs the least enlargement to include E.I
4. Set N to the child node pointed to by F.p and repeat from step 2
Algorithm AdjustTree
1. Set N = L; if L was split, set NN = LL
2. If N is the root, stop
3. Let P be N's parent and EN be N's entry in P; adjust EN.I so that it tightly encloses all entries in N
4. If NN exists, create a new entry ENN with ENN.p pointing to NN and ENN.I enclosing all rectangles in NN; add ENN to P if there is room, otherwise invoke SplitNode to produce P and PP
5. Move up to the next level and repeat the process
Node Splitting
• A "full" node must be split when a new entry needs to be added
• We must ensure that on any subsequent search, with high probability only one of the two nodes needs to be explored
• So the total area of the two covering rectangles should be minimized
• Exhaustive search over all splits has exponential complexity
Quadratic-Cost Algorithm
1. Use PickSeeds to choose two entries to be the first elements of the two groups
2. Repeat step 3 until all entries have been assigned to one of the groups
3. Invoke PickNext to choose the next entry to assign; add it to the group whose covering rectangle needs to be expanded the least
Algorithm PickSeeds
1. For each pair of entries E1 and E2, let J be the smallest rectangle including E1.I and E2.I; calculate d = area(J) − area(E1.I) − area(E2.I)
2. Choose the pair with the largest d value
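PickSeeds' "wasted area" measure d, sketched for 2-D entries:

```python
def area(r):
    return (r[0][1] - r[0][0]) * (r[1][1] - r[1][0])

def cover(a, b):
    """Smallest rectangle J enclosing both a and b."""
    return tuple((min(lo1, lo2), max(hi1, hi2))
                 for (lo1, hi1), (lo2, hi2) in zip(a, b))

def waste(a, b):
    return area(cover(a, b)) - area(a) - area(b)

e1 = ((0, 1), (0, 1))
e2 = ((5, 6), (5, 6))
print(waste(e1, e2))  # 34: two distant unit squares waste almost the whole 6x6 cover
```

Choosing the pair with the largest d seeds the two groups with the entries that would be most wasteful to keep together.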
Algorithm PickNext
1. For each entry E not yet in a group, calculate d1, the area increase required in the covering rectangle of group 1 to include E.I; calculate d2 similarly for group 2
2. Choose the entry with the maximum difference between d1 and d2
Algorithm LinearPickSeeds
1. Along each dimension, find the entry whose rectangle has the highest low side and the one with the lowest high side; record the separation between them
2. Normalize the separations by dividing by the width of the entire set along the corresponding dimension
3. Choose the pair with the greatest normalized separation along any dimension
Algorithm Delete
1. Invoke FindLeaf to locate the leaf node L containing E; remove E from L
2. Invoke CondenseTree on L
3. If the root node has only one child, make the child the new root
Algorithm FindLeaf
1. Set T to be the root of the tree
2. If T is not a leaf, check each entry F in T to determine whether F.I overlaps E.I; for each such entry, invoke FindLeaf on the tree pointed to by F.p
3. If T is a leaf, check each entry to see if it matches E; if E is found, return T
Algorithm CondenseTree
1. Set N = L; set Q, the set of eliminated nodes, to the empty set
2. If N is the root, go to step 6; else let P be the parent of N and EN be N's entry in P
3. If N has fewer than m entries, delete EN from P and add N to Q
Algorithm CondenseTree (contd.)
4. If N has not been eliminated, adjust EN.I to tightly contain all entries in N
5. Set N = P and repeat from step 2
6. Reinsert all entries of nodes in Q: entries from eliminated leaf nodes are reinserted as in algorithm Insert; entries from higher-level nodes must be placed higher in the tree
Multi-dimensional Sequential Pattern Mining
Outline
• Why multidimensional sequential pattern mining?
• Problem definition
• Algorithms
• Experimental results
• Conclusions
Why Sequential Pattern Mining?
Sequential pattern mining: finding time-related frequent patterns (frequent subsequences)
Many data and applications are time-related
• Customer shopping patterns, telephone calling patterns
  – E.g., first buy a computer, then CD-ROMs, then software, within 3 months
• Natural disasters (e.g., earthquakes, hurricanes)
• Disease and treatment
• Stock market fluctuations
• Weblog click-stream analysis
• DNA sequence analysis
Sequential Pattern Mining
Mining of frequently occurring patterns related to time or other sequences
Examples
• Renting "Star Wars", then "Empire Strikes Back", then "Return of the Jedi", in that order
• A collection of ordered events within an interval
Applications
• Targeted marketing
• Customer retention
• Weather prediction
Motivating Example
Sequential patterns are useful
• "free internet access, then buy package 1, then upgrade to package 2"
• Marketing, product design and development
Problem: lack of focus
• Various groups of customers may have different patterns
MD-sequential pattern mining: integrate multi-dimensional analysis and sequential pattern mining
Sequences and Patterns
Given a set of sequences, find the complete set of frequent subsequences.
A sequence: <(ef)(ab)(df)cb>   (items within an element are listed alphabetically)
A sequence database:
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
Sequential Pattern: Basics
A sequence: <(bd)cb(ac)>
A sequence database:
SID  sequence
10   <(bd)cb(ac)>
20   <(bf)(ce)b(fg)>
30   <(ah)(bf)abf>
40   <(be)(ce)d>
50   <a(bd)bcb(ade)>
<ad(ae)> is a subsequence of <a(bd)bcb(ade)>
Given support threshold min_sup = 2, <(bd)cb> is a sequential pattern.
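The containment test behind "subsequence" (and hence behind support counting) can be sketched with a greedy scan: each pattern element, an itemset, must be a subset of a later and later element of the data sequence:

```python
def is_subsequence(pattern, sequence):
    """Greedy check that pattern's itemsets embed, in order, into sequence."""
    i = 0
    for element in sequence:
        if i < len(pattern) and pattern[i] <= element:  # subset test
            i += 1
    return i == len(pattern)

# The slide's example: <ad(ae)> is a subsequence of <a(bd)bcb(ade)>.
seq = [{"a"}, {"b", "d"}, {"b"}, {"c"}, {"b"}, {"a", "d", "e"}]
print(is_subsequence([{"a"}, {"d"}, {"a", "e"}], seq))  # True
```

Counting, over all data sequences, how many contain a candidate pattern gives that pattern's support, which is then compared against min_sup.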
Enhanced Similarity Search Methods
• Allow for gaps within a sequence or differences in offsets or amplitudes
• Normalize sequences with amplitude scaling and offset translation
• Two subsequences are considered similar if one lies within an envelope of width ε around the other, ignoring outliers
• Two sequences are said to be similar if they have enough non-overlapping, time-ordered pairs of similar subsequences
• Parameters specified by a user or expert: sliding window size, width of the envelope for similarity, maximum gap, and matching fraction
Subsequence Matching

• Break each sequence into a set of pieces of window with length w
• Extract the features of the subsequence inside the window
• Map each sequence to a “trail” in the feature space
• Divide the trail of each sequence into “subtrails” and represent each of them with a minimum bounding rectangle
• Use a multipiece assembly algorithm to search for longer sequence matches
Sequential pattern mining: Cases and Parameters

Duration of a time sequence T
• Sequential pattern mining can then be confined to the data within a specified duration
• Ex. Subsequence corresponding to the year of 1999
• Ex. Partitioned sequences, such as every year, or every week after stock crashes, or every two weeks before and after a volcano eruption
Event folding window w
• If w = T, time-insensitive frequent patterns are found
• If w = 0 (no event sequence folding), sequential patterns are found where each event occurs at a distinct time instant
• If 0 < w < T, sequences occurring within the same period w are folded in the analysis
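The effect of the event folding window w can be sketched as follows; `fold_events` is a hypothetical helper that greedily folds events occurring within w time units of the first event of the current element:

```python
def fold_events(events, w):
    """Group timestamped events (time, item) into elements: events whose
    times fall within w of the first event of the current element are
    folded together. w = 0 keeps each distinct time instant separate."""
    folded, current, start = [], [], None
    for t, item in sorted(events):
        if current and t - start > w:
            folded.append(current)  # close the current element
            current = []
        if not current:
            start = t  # first event of a new element
        current.append(item)
    if current:
        folded.append(current)
    return folded

events = [(1, "a"), (2, "b"), (5, "c"), (6, "d")]
print(fold_events(events, w=1))  # [['a', 'b'], ['c', 'd']]
print(fold_events(events, w=0))  # [['a'], ['b'], ['c'], ['d']]
```

With a large enough w every event falls into one element, recovering the time-insensitive case.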
Sequential pattern mining: Cases and Parameters (2)

Time interval, int, between events in the discovered pattern
• int = 0: no interval gap is allowed, i.e., only strictly consecutive sequences are found
  – Ex. “Find frequent patterns occurring in consecutive weeks”
• min_int ≤ int ≤ max_int: find patterns that are separated by at least min_int but at most max_int
  – Ex. “If a person rents movie A, it is likely she will rent movie B within 30 days” (int ≤ 30)
• int = c ≠ 0: find patterns carrying an exact interval
  – Ex. “Every time when Dow Jones drops more than 5%, what will happen exactly two days later?” (int = 2)
Episodes and Sequential Pattern Mining Methods

Other methods for specifying the kinds of patterns
• Serial episodes: A → B
• Parallel episodes: A & B
• Regular expressions: (A | B)C*(D → E)

Methods for sequential pattern mining
• Variations of Apriori-like algorithms
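Serial and parallel episodes can be checked with a small sketch; the window semantics (all events within `window` time units) is an assumption, and both function names are illustrative:

```python
def has_serial_episode(events, first, second, window):
    """Serial episode first -> second: `second` must follow `first`
    within `window` time units. `events` is a list of (time, item)."""
    times_first = [t for t, e in events if e == first]
    for t, e in events:
        if e == second and any(0 < t - tf <= window for tf in times_first):
            return True
    return False

def has_parallel_episode(events, items, window):
    """Parallel episode: all `items` occur, in any order, inside some
    window of `window` time units."""
    evs = sorted(events)
    for t0, _ in evs:
        seen = {e for t, e in evs if t0 <= t <= t0 + window}
        if set(items) <= seen:
            return True
    return False

events = [(1, "A"), (3, "B"), (10, "B")]
print(has_serial_episode(events, "A", "B", window=5))      # True (A at 1, B at 3)
print(has_parallel_episode(events, ["A", "B"], window=2))  # True
```

A serial episode imposes order on its events; a parallel episode only requires co-occurrence inside the window.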
Click Streams

Client click-stream analysis is a click-by-click view of a visitor's journey (or journeys) through a web site. By viewing a click-stream report, you can follow the exact pathway a visitor took through a web site, even down to the length of time they spent looking at each particular page.
Click Streams…Continued

The people most interested in this report would typically be involved in marketing, web design or web development. The information presented provides a click-by-click view of how visitors are interacting with and navigating through their web site.
Periodicity Analysis

Periodicity is everywhere: tides, seasons, daily power consumption, etc.

Full periodicity
• Every point in time contributes (precisely or approximately) to the periodicity
Partial periodicity: A more general notion
• Only some segments contribute to the periodicity
  – Jim reads NY Times 7:00–7:30 am every weekday
Cyclic association rules
• Associations which form cycles

Methods
• Full periodicity: FFT, other statistical analysis methods
• Partial and cyclic periodicity: Variations of Apriori-like mining methods
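Partial periodicity can be illustrated with a toy check: given a candidate period and offset, count how often a symbol occupies that slot. The function name and the support-ratio formulation are assumptions for illustration:

```python
def partial_period_support(seq, period, offset, symbol):
    """Fraction of full periods in `seq` where `symbol` appears at
    position `offset` within the period (a partial-periodicity check:
    only one slot of the period needs to be regular)."""
    hits = total = 0
    for start in range(0, len(seq) - period + 1, period):
        total += 1
        if seq[start + offset] == symbol:
            hits += 1
    return hits / total if total else 0.0

# A day split into 4 slots; 'N' (reads NY Times) occupies slot 0 on weekdays only.
week = list("Nxyz") * 5 + list("wxyz") * 2  # 5 weekdays + 2 weekend days
print(partial_period_support(week, period=4, offset=0, symbol="N"))  # 5/7
```

Full periodicity would demand every slot of the period be regular; here only slot 0 is, which is exactly the "Jim reads NY Times 7:00-7:30 am" style of pattern.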
MD Sequence Database

P = (*, Chicago, *, <bf>) matches tuples 20 and 30
If support = 2, P is an MD sequential pattern
cid Cust_grp City Age_grp sequence
10 Business Boston Middle <(bd)cba>
20 Professional Chicago Young <(bf)(ce)(fg)>
30 Business Chicago Middle <(ah)abf>
40 Education New York Retired <(be)(ce)>
Mining of MD Seq. Pat.

Embedding MD information into sequences
• Using a uniform seq. pat. mining method
Integration of seq. pat. mining and MD analysis methods
UNISEQ

Embed MD information into sequences
cid Cust_grp City Age_grp sequence
10 Business Boston Middle <(bd)cba>
20 Professional Chicago Young <(bf)(ce)(fg)>
30 Business Chicago Middle <(ah)abf>
40 Education New York Retired <(be)(ce)>
Mine the extended sequence database
using sequential pattern mining methods
cid MD-extension of sequences
10 <(Business,Boston,Middle)(bd)cba>
20 <(Professional,Chicago,Young)(bf)(ce)(fg)>
30 <(Business,Chicago,Middle)(ah)abf>
40 <(Education,New York,Retired)(be)(ce)>
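The UNISEQ embedding can be sketched directly from the table above; `embed_md` is an illustrative name, and itemsets are modeled as Python sets:

```python
def embed_md(tuples):
    """UNISEQ-style embedding: prepend the MD attribute values of each
    customer as an extra leading element of the sequence, so the result
    can be mined with any ordinary sequential pattern mining method."""
    extended = []
    for cid, cust_grp, city, age_grp, seq in tuples:
        extended.append((cid, [{cust_grp, city, age_grp}] + seq))
    return extended

db = [
    (10, "Business", "Boston", "Middle", [{"b", "d"}, {"c"}, {"b"}, {"a"}]),
    (20, "Professional", "Chicago", "Young", [{"b", "f"}, {"c", "e"}, {"f", "g"}]),
]
for cid, seq in embed_md(db):
    print(cid, seq)
```

After embedding, an MD-pattern like (*, Chicago, *, <bf>) is just an ordinary sequential pattern whose first element contains "Chicago".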
Mine Sequential Patterns by Prefix Projections
Step 1: find length-1 sequential patterns
• <a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
• The ones having prefix <a>;
• The ones having prefix <b>;
• …
• The ones having prefix <f>

SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
Find Seq. Patterns with Prefix <a>
Only need to consider projections w.r.t. <a>
• <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
• Further partition into 6 subsets
  – Having prefix <aa>
  – …
  – Having prefix <af>
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
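The <a>-projection step can be sketched as follows; `project` is a hypothetical helper that reproduces the four projected sequences listed above, using the convention that items within an element are listed alphabetically (so the '_' marker keeps the items that follow a in the same element):

```python
def project(db, item):
    """Build the <item>-projected database: for each sequence, keep the
    suffix after the first occurrence of `item`. If other items remain in
    the same element, keep them as a '_'-prefixed partial element."""
    projected = []
    for seq in db:
        for i, element in enumerate(seq):
            if item in element:
                rest = sorted(x for x in element if x > item)  # items after `item`
                suffix = ([["_"] + rest] if rest else []) + [sorted(e) for e in seq[i + 1:]]
                if suffix:
                    projected.append(suffix)
                break  # only the first occurrence starts the projection
    return projected

db = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]
for s in project(db, "a"):
    print(s)
```

Running this yields the four projected sequences <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, and <(_f)cbc> from the slide.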
Completeness of PrefixSpan
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
SDB
Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
Having prefix <a>
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
    Having prefix <aa>: <aa>-proj. db … Having prefix <af>: <af>-proj. db
Having prefix <b>
  <b>-projected database …
Having prefix <c>, …, <f>
  …
Efficiency of PrefixSpan
No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected databases
• Can be improved by bi-level projections
Mining MD-Patterns
All
(cust-grp,*,*)   (*,city,*)   (*,*,age-grp)
(cust-grp,city,*)   (cust-grp,*,age-grp)   (*,city,age-grp)
(cust-grp,city,age-grp)
cid Cust_grp City Age_grp sequence
10 Business Boston Middle <(bd)cba>
20 Professional Chicago Young <(bf)(ce)(fg)>
30 Business Chicago Middle <(ah)abf>
40 Education New York Retired <(be)(ce)>
MD pattern: (*,Chicago,*), found by BUC processing
Dim-Seq

First find MD-patterns
• E.g., (*,Chicago,*)
Form the projected sequence database
• E.g., <(bf)(ce)(fg)> and <(ah)abf> for (*,Chicago,*)
Find seq. pat. in the projected database
• E.g., (*,Chicago,*,<bf>)
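The Dim-Seq projection step can be sketched as a simple filter; `md_project` is an illustrative name, and '*' is treated as a wildcard on each dimension:

```python
def md_project(tuples, pattern):
    """Dim-Seq projection: given an MD-pattern such as (*, 'Chicago', *),
    collect the sequences of the tuples whose dimension values match.
    Sequential patterns are then mined within this projected database."""
    projected = []
    for dims, seq in tuples:
        if all(p == "*" or p == d for p, d in zip(pattern, dims)):
            projected.append(seq)
    return projected

# The four customers from the slide's table.
db = [
    (("Business", "Boston", "Middle"), "<(bd)cba>"),
    (("Professional", "Chicago", "Young"), "<(bf)(ce)(fg)>"),
    (("Business", "Chicago", "Middle"), "<(ah)abf>"),
    (("Education", "New York", "Retired"), "<(be)(ce)>"),
]
print(md_project(db, ("*", "Chicago", "*")))  # ['<(bf)(ce)(fg)>', '<(ah)abf>']
```

Seq-Dim simply reverses the two phases: first mine sequential patterns, then project on the MD attributes of the supporting tuples.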
cid Cust_grp City Age_grp sequence
10 Business Boston Middle <(bd)cba>
20 Professional Chicago Young <(bf)(ce)(fg)>
30 Business Chicago Middle <(ah)abf>
40 Education New York Retired <(be)(ce)>
Seq-Dim

Find sequential patterns
• E.g., <bf>
Form the projected MD-database
• E.g., (Professional,Chicago,Young) and (Business,Chicago,Middle) for <bf>
Mine MD-patterns
• E.g., (*,Chicago,*,<bf>)
cid Cust_grp City Age_grp sequence
10 Business Boston Middle <(bd)cba>
20 Professional Chicago Young <(bf)(ce)(fg)>
30 Business Chicago Middle <(ah)abf>
40 Education New York Retired <(be)(ce)>
Scalability Over Dimensionality
Scalability Over Cardinality
Scalability Over Support Threshold
Scalability Over Database Size
Pros & Cons of Algorithms

Seq-Dim is efficient and scalable
• Fastest in most cases
UniSeq is also efficient and scalable
• Fastest with low dimensionality
Dim-Seq has poor scalability
Conclusions

MD seq. pat. mining is interesting and useful
Mining MD seq. pat. efficiently
• UniSeq, Dim-Seq, and Seq-Dim
Future work
• Applications of sequential pattern mining