[email protected] Proprietary and Confidential
NOTICE: Proprietary and Confidential
This material is proprietary to A. Teredesai and GCCIS, RIT.
Slide 1
A Comprehensive Look at Mining Time-Series and
Sequential Patterns
Ankur Teredesai, Department of Computer Science, Rochester Institute of Technology
Definition of Time-Series
[Figure: a time series of ~500 sensor readings plotted over time, values roughly in the range 23-29 (e.g., 25.1750, 25.2250, ..., 24.6250, 24.7500)]
A time series is a collection of observations made sequentially in time.
[email protected] Dr. Ankur M. Teredesai P2
Sample Example for Time-Series (cont.)
People measure things:
• The president's approval rating
• Their blood pressure
• The annual rainfall in Los Angeles
• The value of their Yahoo stock
• The number of web hits per second
... and things change over time.
Thus time series occur in virtually every medical, scientific and business domain.
What Can We Do with Time-Series?
• Clustering
• Classification
• Query by Content
• Rule Discovery (e.g., a rule with support s = 0.5 and confidence c = 0.3)
• Motif Discovery
• Novelty Detection
Sample Model for Information Streams Mining
• Information streams vs. time series:
• In many emerging science and business applications, data takes the form of streams rather than static datasets.
• An information stream can be defined as continuously arriving dynamic data, in contrast to static time-series data.
*MIESIS (MIning from Earth Science Information Streams)
Information Streams Segmentation
We need segmentation for:
• Symbolization
• Dimensionality reduction
Using a fixed-length sliding window
[Figure: a stream segmented by a sliding window, with segments labeled by the symbols a, b, c]
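The fixed-length sliding-window segmentation described above can be sketched in a few lines. This is a minimal illustration; the window width, step, and data values are arbitrary choices, not values from the slides:

```python
def sliding_windows(series, width, step):
    """Cut a stream into fixed-length windows; each window becomes one segment."""
    return [series[i:i + width] for i in range(0, len(series) - width + 1, step)]

# Hypothetical sensor readings in the style of the earlier example.
data = [25.1, 25.2, 25.3, 25.2, 25.0, 24.9, 24.8, 24.9]
print(sliding_windows(data, width=4, step=2))
# [[25.1, 25.2, 25.3, 25.2], [25.3, 25.2, 25.0, 24.9], [25.0, 24.9, 24.8, 24.9]]
```

Each window can then be symbolized or reduced in dimensionality independently.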
Information Streams Segmentation (cont.)
Using turning points
[Figure: information stream data from a sensor, value vs. time]
Clustering
Feature extraction
• For dimensionality reduction, we need to extract features from raw information streams
• DFT (Discrete Fourier Transform), DWT (Discrete Wavelet Transform), PAA (Piecewise Aggregate Approximation), etc.
Similarity measure
• Defining the similarity between two raw information streams or two feature vectors
• Euclidean distance metric, Pearson's correlation coefficient, etc.
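A minimal sketch of two of the ingredients just listed: PAA for feature extraction and Euclidean distance as the similarity measure (the segment count and data are illustrative):

```python
import math

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-share segment."""
    n = len(series)
    out = []
    for i in range(n_segments):
        lo, hi = i * n // n_segments, (i + 1) * n // n_segments
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(paa([1.0, 1.0, 2.0, 2.0, 3.0, 3.0], 3))  # [1.0, 2.0, 3.0]
print(euclidean([0.0, 0.0], [3.0, 4.0]))       # 5.0
```

Distances are then computed between the short PAA vectors instead of the raw streams.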
Clustering (cont.)
Hierarchical Clustering
Partitional Clustering (e.g. K-means)
Symbolic Representation
[Figure, Example 1: a feature space partitioned into regions R1-R9]
[Figure, Example 2: stream segments labeled with the symbols a, b, c]
Symbolic Representation (cont.)
Express the information stream as a sequence of symbols.
Now we can work in a lower-dimensional space than the raw information stream data. We can also use well-known string-processing data structures such as inverted indexes, HMMs, or suffix trees.
aaabaabcbabccb
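One simple way to obtain such a symbol string, shown here as a hedged sketch (the breakpoints are invented for illustration, not the discretization used in the slides), is to map each value to a letter according to which region of the value range it falls in:

```python
def symbolize(series, breakpoints, alphabet="abc"):
    """Map each value to the symbol of the value region it falls in."""
    return "".join(alphabet[sum(v > bp for bp in breakpoints)] for v in series)

# Two breakpoints split the value range into three regions: a, b, c.
print(symbolize([0.1, 0.5, 0.9, 0.4], breakpoints=[0.33, 0.66]))  # abcb
```

The resulting string is what gets fed to inverted indexes, HMMs, or suffix trees.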
Possible Mining Operations
Novelty detection
• Can be used to identify potential anomaly events
• Also referred to as the detection of "Aberrant Behavior", "Anomalies", "Faults", "Surprises", "Deviants", "Temporal Change", and "Outliers"
• As these terms suggest, we can detect previously unseen patterns or sequences in an incoming information stream, relative to a training dataset
Prediction
• The utility of a prediction model lies in detecting events rather than predicting numerical values. An event is a meaningful object to which we can assign semantics, e.g., an earthquake or flood.
Finding correlation between clusters
• We can detect spatial/temporal correlations between clusters or information streams
Mining Time-Series and Sequence Data
Time-series database
• Consists of sequences of values or events changing with time
Applications
• Financial: stock price, inflation
• Biomedical: blood pressure
• Meteorological: precipitation
Mining Time-Series and Sequence Data: Trend analysis
Categories of Time-Series Movements
• Long-term or trend movements (trend curve)
• Cyclic movements or cyclic variations, e.g., business cycles
• Seasonal movements or seasonal variations
  – i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years
• Irregular or random movements
Estimation of the Trend Curve
The freehand method
• Fit the curve by looking at the graph
• Costly and barely reliable for large-scale data mining
The least-squares method
• Find the curve minimizing the sum of squared deviations of points on the curve from the corresponding data points
The moving-average method
• Eliminate cyclic, seasonal and irregular patterns
• Loss of end data
• Sensitive to outliers
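The moving-average smoothing described above is simple to state in code; note that the output is shorter than the input, which is exactly the "loss of end data" drawback:

```python
def moving_average(series, k):
    """Average of each window of k consecutive points; drops k-1 end points."""
    return [sum(series[i:i + k]) / k for i in range(len(series) - k + 1)]

print(moving_average([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]
```

A single large outlier shifts every window containing it, which is the sensitivity-to-outliers drawback.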
Discovery of Trend in Time-Series (1): Estimation of Seasonal Variations
• Seasonal index
  – A set of numbers showing the relative values of a variable during the months of the year
  – E.g., if the sales during October, November, and December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are the seasonal index numbers for these months
• Deseasonalized data
  – Data adjusted for seasonal variations
  – E.g., divide the original monthly data by the seasonal index numbers for the corresponding months
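Working the seasonal-index idea numerically (only the 80/120/140 indexes come from the slide; the raw sales figures are hypothetical):

```python
# Seasonal indexes from the slide, expressed as fractions of the monthly average.
seasonal_index = {"Oct": 0.80, "Nov": 1.20, "Dec": 1.40}
monthly_sales = {"Oct": 96.0, "Nov": 150.0, "Dec": 168.0}  # hypothetical raw data

# Deseasonalize: divide each month's figure by its seasonal index.
deseasonalized = {m: monthly_sales[m] / seasonal_index[m] for m in monthly_sales}
for m, v in deseasonalized.items():
    print(m, round(v, 1))  # Oct ~120, Nov ~125, Dec ~120
```

After the adjustment the three months are directly comparable: the apparent December spike was almost entirely seasonal.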
Similarity Search in Time-Series Analysis
A normal database query finds exact matches. A similarity search finds data sequences that differ only slightly from the given query sequence.
Two categories of similarity queries
• Whole matching: find a sequence that is similar to the query sequence
• Subsequence matching: find all pairs of similar sequences
Typical applications
• Financial market
• Market basket data analysis
• Scientific databases
• Medical diagnosis
Data Transformation
Many techniques for signal analysis require the data to be in the frequency domain. Usually data-independent transformations are used:
• The transformation matrix is determined a priori
  – E.g., discrete Fourier transform (DFT), discrete wavelet transform (DWT)
• The Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain
• DFT does a good job of concentrating energy in the first few coefficients
• If we keep only the first few DFT coefficients, we can compute a lower bound on the actual distance
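The lower-bound property can be checked directly with a small orthonormal DFT written from scratch (a sketch; real systems keep only low-frequency coefficients and use optimized FFTs):

```python
import cmath
import math

def dft(x):
    """Orthonormal DFT (scaled by 1/sqrt(n)), so Euclidean distances are preserved."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)) / math.sqrt(n)
            for f in range(n)]

def dist(u, v):
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(u, v)))

x = [1.0, 2.0, 1.0, 0.0]
y = [1.0, 1.0, 1.0, 1.0]
# Parseval: the full transform preserves distance; truncating to the first
# k coefficients can only shrink it, giving a lower bound (no false dismissals).
assert abs(dist(dft(x), dft(y)) - dist(x, y)) < 1e-9
assert dist(dft(x)[:2], dft(y)[:2]) <= dist(x, y)
```

This is why an index built on the first few coefficients never discards a true match; false hits are removed in a postprocessing step.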
Finding Surprising Patterns in a Time Series Database in Linear Time and Space
Paper by Eamonn Keogh, Stefano Lonardi, and Bill Chiu; presented at ACM SIGKDD 2002
Main Purpose
Novelty Detection
• The authors note that this problem should not be confused with the relatively simple problem of outlier detection.
• They focused on finding surprising patterns, not on finding individually surprising datapoints.
• The blue time series at the top is a normal, healthy human heartbeat with an artificial "flatline" added. The red sequence at the bottom indicates how surprising local subsections of the time series are.
Basic Ideas
A pattern is surprising if its frequency of occurrence differs greatly from what we expected.
Their notion of a pattern's surprisingness is not tied exclusively to its shape; instead it depends on the difference between the shape's expected frequency and its observed frequency.
Example: consider the head-and-shoulders pattern shown below
• This pattern occurs in a stock market time series an average of three times a year
• If it occurred ten times this year: surprising
• If its frequency of occurrence is less than expected: also a surprising pattern
Approach
Formal definition of a surprising pattern
• A time-series pattern P, extracted from database X, is surprising relative to a database R if the probability of its occurrence differs greatly from that expected by chance, assuming that R and X are created by the same underlying process.
Example: if x = principalskinner
• Σ is {a, c, e, i, k, l, n, p, r, s}
• |x| is 16
• skin is a substring of x
• prin is a prefix of x
• ner is a suffix of x
• If y = in, then fx(y) = 2
• If y = pal, then fx(y) = 1
How about y = eik?
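The substring-counting function fx(y) used in the example is just an overlapping-occurrence count:

```python
def f(x, y):
    """fx(y): number of (possibly overlapping) occurrences of substring y in x."""
    return sum(x[i:i + len(y)] == y for i in range(len(x) - len(y) + 1))

x = "principalskinner"
print(len(x), f(x, "in"), f(x, "pal"), f(x, "eik"))  # 16 2 1 0
```

So fx("eik") = 0: the string never occurs, which is exactly the case where estimating its expected frequency from a model becomes interesting.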
Approach (cont.)
Steps (TARZAN algorithm)
• Discretize the time series into symbolic strings
  – Fixed-size sliding window
  – Slope of the best-fitting line
• Calculate the probability of any pattern, including ones never seen before, using Markov models
• To maintain the linear time and space property, use a suffix tree data structure
• Compute scores by comparing the trees built from the reference data and from the incoming information stream
aaabaabcbabccb
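A hedged sketch of the Markov estimate such algorithms rely on (the paper's actual scoring uses suffix trees and normalization; only the core approximation E[f(w)] ≈ f(w[:-1]) · f(w[1:]) / f(w[1:-1]) is shown here):

```python
def count(x, y):
    """Overlapping occurrences of substring y in x."""
    return sum(x[i:i + len(y)] == y for i in range(len(x) - len(y) + 1))

def expected_count(reference, w):
    """Markov estimate of how often w 'should' occur, from its substrings' counts."""
    denom = count(reference, w[1:-1])
    return count(reference, w[:-1]) * count(reference, w[1:]) / denom if denom else 0.0

r = "aaabaabcbabccb"  # the symbol string from the slide
print(count(r, "abc"), expected_count(r, "abc"))  # observed 2 vs. expected 1.2
```

A large gap between the observed and expected counts is what marks a pattern as surprising.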
Experimental Evaluation
Two features
• Sensitivity (high true-positive rate)
  – The algorithm can find truly surprising patterns in a time series
  – This corresponds to recall
• Selectivity (low false-positive rate)
  – The algorithm will not find spurious "surprising" patterns in a time series
  – This corresponds to precision
The goal is to maintain both high precision and high recall
• They achieved high sensitivity
• But selectivity?
Experimental Evaluation (cont.): Shock ECG
[Figure: training data, test data (subset), and Tarzan's level of surprise, plotted over ~1600 points]
Experimental Evaluation (cont.): Power Demand
• They consider a dataset that contains the power demand for a Dutch research facility for the entire year of 1997. The data is sampled as 15-minute averages, and thus contains 35,040 points.
• [Figure: the first 3 weeks of the power demand dataset.] Note the repeating pattern of a strong peak on each of the five weekdays, followed by relatively quiet weekends.
Experimental Evaluation (Power Demand cont.)
They used the period from Monday, January 6th to Sunday, March 23rd as reference data. This period includes national holidays. They tested on the remainder of the year.
They showed the 3 most surprising subsequences found by each algorithm. For each of the 3 approaches they showed the entire week (beginning Monday) in which the 3 largest values of surprise fell.
Both TSA-tree and IMM returned sequences that appear to be normal workweeks.
Tarzan returned 3 sequences that correspond to the weeks that contain national holidays.
[Figure: the most surprising weeks found by Tarzan, TSA-tree, and IMM]
Experimental Evaluation (cont.)
• The previous experiments demonstrate the ability of Tarzan to find surprising patterns (sensitivity)
• However, they also need to consider Tarzan's selectivity
  – To reduce false alarms, they attempted to scale to massive datasets
• If Tarzan is trained on a short random-walk dataset, the chance that patterns similar to those in the test data exist in the short training database is very small, producing many false alarms
  – Solution: increase the size of the training data; the surprisingness of the test data should then decrease
  – The more training on huge random-walk data, the fewer surprising patterns are detected
Possible Future Research Opportunities
Mentioned in the paper
• Incorporating user feedback and domain-based constraints
• Applying different feature extraction techniques
Information streams + ontology
• Finding methods to combine information stream mining with an ontology
• Intuitively, if we can extract general/abnormal patterns in information streams and generate clusters, we can attach semantics to the patterns or clusters.
• For example, using a "News-Stock Ontology Model" we could relate news stream data about the "War in Iraq" to the stock price changes of an oil company: a rapid increase in the amount of news regarding "war" and "Iraq" at time t can be linked to a rapid increase/decrease of the oil company's stock price at time t+α.
Suffix Tree Data Structure
Multidimensional Indexing
Multidimensional index
• Constructed for efficient access using the first few Fourier coefficients
• Use the index to retrieve the sequences that are at most a certain small distance away from the query sequence
• Perform postprocessing by computing the actual distance between sequences in the time domain and discarding any false matches
B-Trees
Generalizes the multilevel index
• The number of levels varies with the size of the data file, but is quite often 3
• Height balanced
  – Equal-length access paths to different records
• Adapts well to insertions and deletions
DBMSs typically use a variant called a B+tree
• All nodes have the same format: n keys, n+1 pointers, at least half of them in use
Useful for primary and secondary indexes, on primary keys and non-keys
B+Tree Example
[Figure: an example B+tree with n = 3; internal keys include 30, 100, 120, 150, 180, over leaves holding 3 5 11 | 30 35 | 100 101 110 | 120 130 | 150 156 179 | 180 200]
Sample Non-Leaf Node
[Figure: a non-leaf node with keys 120, 150, 180; its four pointers lead to keys k < 120, 120 ≤ k < 150, 150 ≤ k < 180, and k ≥ 180]
Sample Leaf Node
[Figure: a leaf node (reached from a non-leaf node) holding keys 120 and 130 with one slot unused; each key's pointer leads to the record with that key, and a sequence pointer leads to the next leaf in sequence]
Nodes Must Not Be Too Empty
Number of pointers in use:
• At internal nodes, at least ⌈(n+1)/2⌉ (to child nodes)
• At leaves, at least ⌊(n+1)/2⌋ (to data records/blocks)
Node Bounds
[Figure, n = 3: a full non-leaf node (keys 120 150 180) vs. a minimum non-leaf node (key 30); a full leaf (3 5 11) vs. a minimum leaf (30 35)]
B+tree Rules
All leaves at the same (lowest) level
• Balanced tree
Pointers in leaves point to records
• Except for the "sequence pointer"
Number of pointers/keys for a B+tree:
[email protected] Dr. Ankur M. Teredesai P39
                      Max ptrs   Max keys   Min ptrs (→data at leaves)   Min keys
Non-leaf (non-root)   n+1        n          ⌈(n+1)/2⌉                    ⌈(n+1)/2⌉ − 1
Leaf (non-root)       n+1        n          ⌊(n+1)/2⌋                    ⌊(n+1)/2⌋
Root                  n+1        n          2♠                           1

♠ Can be 1 if there is only one record in the file
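The minimum-occupancy entries in the table are easy to verify for the running examples n = 3 and n = 4:

```python
import math

n = 3
print(math.ceil((n + 1) / 2))      # 2: min pointers at a non-leaf node
print(math.ceil((n + 1) / 2) - 1)  # 1: min keys at a non-leaf node
print(math.floor((n + 1) / 2))     # 2: min keys/data pointers at a leaf

n = 4  # the value used in the deletion examples later
print(math.floor((n + 1) / 2))     # 2: min keys in a leaf
print(math.ceil((n + 1) / 2) - 1)  # 2: min keys in a non-leaf
```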
B+tree Insertions
Search for the key being inserted. Four cases:
• Leaf has space
  – Just insert (key, pointer-to-record)
• Leaf overflow
• Non-leaf overflow
• New root
Leaf Overflow: insert key 7 (n = 3)
[Figure: the leaf 3 5 11 overflows; it splits into 3 5 and 7 11, and the separator key 7 is copied up into the parent]
Non-Leaf Overflow: insert key 160 (n = 3)
[Figure: inserting 160 overflows the non-leaf node with keys 120 150 180; the node splits and a key moves up into the root, redistributing the leaves 150 156, 160 179, 180 200]
New Root: insert 45 (n = 3)
[Figure: inserting 45 overflows a leaf, the split propagates upward, and a new root (key 40) is created]
The tree grows at the root and maintains balance.
B+tree Deletions
Search for the key being deleted
• If found, delete it
Three broad cases:
• Leaf does not underflow
• Borrow keys from an adjacent sibling, if that does not cause the sibling to underflow
• Coalesce with a sibling node
  – Repeat if needed
It is sometimes acceptable to let a B-tree leaf become sub-minimum (no mergers), violating the B-tree definition.
Leaf Does Not Underflow: delete key 35
(n = 4; minimum number of keys in a leaf = ⌊5/2⌋ = 2)
[Figure: deleting 35 from the leaf 10 20 30 35 leaves three keys, still above the minimum, so nothing else changes]
Borrow Keys: delete key 50
(n = 4; minimum number of keys in a leaf = ⌊5/2⌋ = 2)
[Figure: deleting 50 leaves the leaf 40 underfull; it borrows key 35 from its sibling 10 20 30 35, and the parent separator 40 is updated to 35]
Coalesce with Sibling: delete key 50 (n = 4)
[Figure: deleting 50 leaves the leaf 40 underfull; it coalesces with its sibling 20 30 into 20 30 40, and the separator 40 is removed from the parent]
Coalesce Non-Leaf: delete 37
(n = 4; minimum number of keys in a non-leaf = ⌈(n+1)/2⌉ − 1 = 3 − 1 = 2)
[Figure: deleting 37 triggers a leaf coalesce whose underflow propagates: non-leaf nodes coalesce too, and a new root emerges above leaves 1 3, 10 14, 20 22, 25 26, 30 40 45]
Tree shrinks at root
B+tree Deletions in Practice
Coalescing is often not implemented
• Too hard and usually not worth it!
• Subsequent insertions may return the node to the required minimum size
• Compromise
  – Try redistributing keys with a sibling
  – If not possible, leave the node as is
• If all accesses to records go through the B-tree
  – Place a "tombstone" for the deleted record at the leaf
Traditional B-Trees
A B-tree is similar to a B+tree
• Each search key appears only once
  – No redundant storage of search keys
• Additional pointer field for each search key in a non-leaf node
  – Points directly to the record
Node layouts:
P1 K1 P2 ... Pn−1 Kn−1 Pn   (B+tree non-leaf)
versus
P1 R1 K1 P2 R2 K2 ... Pn−1 Rn−1 Kn−1 Pn   (B-tree non-leaf, with record pointers Ri)
B-Tree Advantages and Disadvantages
Advantages
• Fewer nodes than the corresponding B+tree
• Possible to find a key before hitting a leaf node
Disadvantages
• Only a small fraction of all keys are found early
• Non-leaf nodes are larger, so reduced fan-out
  – A B-tree is often deeper than the corresponding B+tree
• More complex than B+trees
  – Insertion/deletion and overall implementation
• B+trees are usually better than B-trees
B+ Trees in Practice
Typical order: 100
• Typical fill factor around 67%
• Average fanout around 133
Typical capacities:
• Height 4: 133^4 = 312,900,721 records
• Height 3: 133^3 = 2,352,637 records
Can often hold the top levels in the buffer pool:
• Level 1 = 1 page = 8 KB
• Level 2 = 133 pages ≈ 1 MB
• Level 3 = 17,689 pages ≈ 133 MB
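The capacity figures above follow directly from the fanout arithmetic:

```python
fanout = 133
print(fanout ** 3)  # 2352637: records reachable through 3 levels
print(fanout ** 4)  # 312900721: records reachable through 4 levels
print(fanout ** 2)  # 17689: pages at the third buffered level
```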
Tree-Structured Indexes
Ideal for range searches and equality searches
ISAM: static structure
• Only leaf pages are modified
• Overflow pages degrade performance
B+tree: dynamic structure
• Inserts/deletes leave the tree height-balanced, and it offers graceful growth and shrinking
  – High fanout (F) ⇒ depth rarely > 3 or 4
  – 67% occupancy on average
• Preferable to ISAM, modulo locking considerations
• Widely used DBMS index structure and one of the most optimized DBMS components
Multidimensional Data
Geographic and multidimensional data applications
• Sale (store, day, item, color, size, etc.)
  – Each sale is a point in 5-dimensional space
• Customer: (age, salary, zip, married, ...)
Typical queries
• Range queries
  – Find employees in the Toy department who make at least 25K dollars
• Nearest neighbor
  – I am here: where's the nearest MacGregors?
• Is this expressible in SQL?
Big Impediment
For these queries there is no clean way to eliminate many of the records that don't meet the WHERE condition.
Approaches
• Index on one attribute
  – Get data for that attribute and filter out the rest
• Index on attributes independently
  – Intersect pointers in main memory to save disk I/O
  – Does this help with nearest neighbor?
• Multiple-key index
  – An index on one attribute provides a pointer to an index on the other
2-Level Indexing
[Figure: an index I1 on the first attribute points to indexes I2 and I3 on the second attribute]
Example
[Figure: a Dept index (Art, Sales, Toy) points into per-department Salary indexes (10k 15k 17k 21k; 12k 15k 15k 19k); a sample employee record has Name = Joe, Dept = Sales, Salary = 15k]
Some Queries
Question
• For what kinds of conditions on dept and salary will a multiple-key index (dept first) significantly reduce the number of disk I/Os?
How about finding records where …
• Dept = “Sales” and Salary = 20k
• Dept = “Sales” and Salary > 20k
• Dept = “Sales”
• Salary = 20k
Interesting Application: Geographic Data
[Figure: points in the x-y plane]
Data
• <X1, Y1, attributes>
• <X2, Y2, attributes>
• ...
Queries
• What city is at <Xi, Yi>?
• What is within 5 miles of <Xi, Yi>?
• Which point is closest to <Xi, Yi>?
Example
[Figure: points a-o scattered in a 40×40 region that is recursively partitioned into rectangles; searches for points near f and for points near b each need to visit only a few partitions]
Queries
• Find points with Yi > 20
• Find points with Xi < 5
• Find points "close" to i = <12, 38>
• Find points "close" to b = <7, 24>
Other Structures
Other geographic index structures
• Quad trees
• R-trees
More multikey indexes
• Grid
• Partitioned hash
Grid Index
[Figure: a grid with key 1 values V1...Vn as rows and key 2 values X1...Xn as columns; each cell points to records, e.g., the cell (V3, X2) leads to records with key1 = V3, key2 = X2]
Claim
We can quickly find records with
• key 1 = Vi and key 2 = Xj
• key 1 = Vi
• key 2 = Xj
And also ranges ...
• E.g., key 1 ≥ Vi and key 2 < Xj
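A toy grid index makes the claim concrete (the partition boundaries here are invented for illustration):

```python
key1_bounds = [10, 20, 30]  # row boundaries for key 1 (the V regions)
key2_bounds = [100, 200]    # column boundaries for key 2 (the X regions)

def cell(key1, key2):
    """Locate the grid cell by counting how many boundaries each key passes."""
    return (sum(key1 >= b for b in key1_bounds),
            sum(key2 >= b for b in key2_bounds))

grid = {}  # (row, col) -> bucket of records
grid.setdefault(cell(15, 250), []).append(("some record", 15, 250))
print(cell(15, 250))  # (1, 2)
```

An equality query on both keys touches one cell; fixing only key 1 scans one row of cells; and a range query scans a contiguous block of cells.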
Storing Grid Indexes
The catch with grid indexes: how is the grid index stored on disk?
Problem
• We need regularity to compute the position of the <Vi, Xj> entry, like an array
[Figure: the grid laid out row by row on disk: row V1's cells X1...X4, then row V2's, then row V3's]
Solution: Use Indirection
[Figure: the grid cells (rows V1...V4, columns X1...X3) contain only pointers to buckets; several cells may share a bucket]
The grid only contains pointers to buckets.
Indexing Grid on Value Ranges
[Figure: a salary grid indexed through linear scales: Dept values Toy, Sales, Personnel map to columns 1, 2, 3; salary ranges 0-20K, 20K-50K, 50K- map to rows 1, 2, 3]
The grid can be regular without wasting space, though we do pay the price of indirection.
Partitioned Hashing
Hash function
• Combines several attributes
• Great when attribute values are fully specified
A partitioned hash function devotes some bits of the bucket number to each attribute independently.
[Figure: key 1 hashes via h1 to one group of bits and key 2 via h2 to another; their concatenation addresses the bucket]
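A toy partitioned hash consistent with the examples that follow (assumed bit widths: 1 bit from the department, 2 bits from the salary; the h1/h2 values mirror the example tables):

```python
h1 = {"toy": 0b0, "sales": 0b1, "art": 0b1}                # 1 bit per department
h2 = {"10k": 0b01, "20k": 0b11, "30k": 0b01, "40k": 0b00}  # 2 bits per salary

def bucket(dept, salary):
    """Concatenate each attribute's independent bits into the bucket number."""
    return (h1[dept] << 2) | h2[salary]

print(format(bucket("sales", "40k"), "03b"))  # 100: one bucket for a fully specified query
# Fixing only the department leaves the salary bits free: search all 1xx buckets.
print([format(0b100 | b, "03b") for b in range(4)])  # ['100', '101', '110', '111']
```

This is the mechanism behind the four examples below: a fully specified query hits one bucket, while a partially specified query hits the subset of buckets matching its bits.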
Example (1): insert <Fred, toy, 10k>, <Joe, sales, 10k>, <Sally, art, 30k>
h1(toy) = 0, h1(sales) = 1, h1(art) = 1, ...
h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00, ...
[Figure: buckets 000-111; Fred lands in bucket 001, Joe and Sally both in bucket 101]
Example (2): find employees with Dept = Sales and Sal = 40k
h1(sales) = 1 and h2(40k) = 00, so look only in bucket 100.
[Figure: buckets 000-111 holding <Fred><Joe><Jan>, <Mary>, <Sally>, <Tom><Bill><Andy>]
Example (3): find employees with Sal = 30k
h2(30k) = 01, so look in every bucket whose last two bits are 01: buckets 001 and 101.
[Figure: the same buckets, with 001 and 101 highlighted]
Example (4): find employees with Dept = Sales
h1(sales) = 1, so look in every bucket whose first bit is 1: buckets 100, 101, 110, 111.
[Figure: the same buckets, with the 1xx half highlighted]
R Trees
A Dynamic Index Structure for Spatial Representation
Why R-Trees?
• Multi-dimensional spaces are not well represented by point locations
• We need to be able to perform range searches
• One-dimensional indexes are not suitable for multi-dimensional spaces
• Ex.: find all the counties within a 20 mi radius of Georgia Tech
Main Concepts
• Height-balanced tree similar to a B-tree
• Index records in leaf nodes point to data objects
• The index is dynamic; no periodic reorganization is required
• Index records (at leaf nodes) are of the form (I, tuple-identifier), where
  I is an n-dimensional bounding rectangle, i.e., I = (I0, I1, ..., In−1), where n is the number of dimensions and each Ii = [a, b] is a closed bounded interval
More Concepts ...
Non-leaf nodes are of the form (I, child-pointer), where child-pointer is the address of a lower node and I is the smallest rectangle covering all the rectangles in the lower node's entries.
M = maximum number of entries in one node
More Concepts ...
• m = parameter specifying the minimum number of entries in a node; m can be tuned and is ≤ M/2
• An R-tree containing N index records has height at most ⌈log_m N⌉ − 1
• Worst-case space utilization: m/M
• Maximum number of nodes: ⌈N/m⌉ + ⌈N/m²⌉ + ... + 1
Searching
Denote the rectangle part of an entry E by E.I and the child-pointer part by E.p.
Algorithm Search: given an R-tree with root node T, find all index records whose rectangles overlap a search rectangle S.
Step 1 [Search subtrees]: If T is not a leaf, check each entry E to determine whether E.I overlaps S. For all overlapping entries, invoke Search on the tree whose root is E.p.
Step 2 [Search leaf node]: If T is a leaf, check each entry E to determine whether E.I overlaps S. If so, E is a qualifying record; return E.
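The overlap test that Search depends on is a per-dimension interval check; a sketch for 2-D rectangles stored as ((xlo, xhi), (ylo, yhi)):

```python
def overlaps(a, b):
    """True iff rectangles a and b intersect in every dimension."""
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

r1 = ((0, 10), (0, 10))
r2 = ((5, 15), (8, 20))
r3 = ((11, 12), (0, 1))
print(overlaps(r1, r2), overlaps(r1, r3))  # True False
```

Search recurses into exactly the child entries whose bounding rectangles pass this test.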
Insertion into an R-Tree
• Similar to B-tree insertion
• New index records are added to leaves
• Overflowing nodes are split
• Splits propagate up the tree
Algorithm Insert (details)
1. Invoke ChooseLeaf to select a leaf node L in which to place E
2. If L has room for another entry, install E; else invoke SplitNode on L, obtaining L and LL
3. Invoke AdjustTree on L (and LL if a split was performed)
4. If the root was split, create a new root whose two children are the halves of the old root
Algorithm ChooseLeaf
1. Set N to be the root
2. If N is a leaf, return N
3. If N is not a leaf, let F be the entry in N whose rectangle needs the least enlargement to include E.I
4. Set N to the child node pointed to by F.p and repeat from step 2
Algorithm AdjustTree
1. Set N = L; if L was split, set NN = LL
2. If N is the root, stop
3. Let P be N's parent and EN be N's entry in P; adjust EN.I so that it tightly encloses all entries in N
4. If NN exists, create a new entry ENN with ENN.p pointing to NN and ENN.I enclosing all rectangles in NN; add ENN to P if there is room, otherwise invoke SplitNode to produce P and PP
5. Move up to the next level and repeat the process
Node Splitting
• A "full" node must be split when a new entry needs to be added
• We must ensure that on any subsequent search, with high probability only one of the two nodes needs to be explored
• So the total area of the two covering rectangles should be minimized
• Exhaustive search over all splits has exponential complexity
Quadratic-Cost Algorithm
1. Use PickSeeds to choose two entries to be the first elements of the two groups
2. Repeat step 3 until all entries have been assigned to one of the groups
3. Invoke PickNext to choose the next entry to assign; add it to the group whose covering rectangle needs to be expanded the least
Algorithm PickSeeds
1. For each pair of entries E1 and E2, let J be the smallest rectangle including E1.I and E2.I; calculate d = area(J) − area(E1.I) − area(E2.I)
2. Choose the pair with the largest d value
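PickSeeds' "wasted area" measure d, sketched for 2-D entries:

```python
def area(r):
    return (r[0][1] - r[0][0]) * (r[1][1] - r[1][0])

def cover(a, b):
    """Smallest rectangle J enclosing both a and b."""
    return tuple((min(lo1, lo2), max(hi1, hi2))
                 for (lo1, hi1), (lo2, hi2) in zip(a, b))

def waste(a, b):
    return area(cover(a, b)) - area(a) - area(b)

e1 = ((0, 1), (0, 1))
e2 = ((5, 6), (5, 6))
print(waste(e1, e2))  # 34: two distant unit squares waste almost the whole 6x6 cover
```

Choosing the pair with the largest d seeds the two groups with the entries that would be most wasteful to keep together.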
Algorithm PickNext
1. For each entry E not yet in a group, calculate d1, the area increase required in the covering rectangle of group 1 to include E.I; calculate d2 similarly for group 2
2. Choose the entry with the maximum difference between d1 and d2
Algorithm LinearPickSeeds
1. Along each dimension, find the entry whose rectangle has the highest low side and the one with the lowest high side; record the separation between them
2. Normalize the separations by dividing by the width of the entire set along the corresponding dimension
3. Choose the pair with the greatest normalized separation along any dimension
Algorithm Delete
1. Invoke FindLeaf to locate the leaf node L containing E; remove E from L
2. Invoke CondenseTree on L
3. If the root node has only one child, make the child the new root
Algorithm FindLeaf
1. Set T to be the root of the tree
2. If T is not a leaf, check each entry F in T to determine whether F.I overlaps E.I; for each such entry, invoke FindLeaf on the tree pointed to by F.p
3. If T is a leaf, check each entry to see if it matches E; if E is found, return T
Algorithm CondenseTree
1. Set N = L; set Q, the set of eliminated nodes, to the empty set
2. If N is the root, go to step 6; else let P be the parent of N and EN be N's entry in P
3. If N has fewer than m entries, delete EN from P and add N to Q
Algorithm CondenseTree (contd.)
4. If N has not been eliminated, adjust EN.I to tightly contain all entries in N
5. Set N = P and repeat from step 2
6. Reinsert all entries of nodes in Q: entries from eliminated leaf nodes are reinserted as in algorithm Insert; entries from higher-level nodes must be placed higher in the tree
Multi-dimensional Sequential Pattern Mining
Outline
• Why multidimensional sequential pattern mining?
• Problem definition
• Algorithms
• Experimental results
• Conclusions
Why Sequential Pattern Mining?
Sequential pattern mining: finding time-related frequent patterns (frequent subsequences)
Many data and applications are time-related
• Customer shopping patterns, telephone calling patterns
  – E.g., first buy a computer, then CD-ROMs, then software, within 3 months
• Natural disasters (e.g., earthquakes, hurricanes)
• Disease and treatment
• Stock market fluctuations
• Weblog click-stream analysis
• DNA sequence analysis
Sequential Pattern Mining
Mining of frequently occurring patterns related to time or other sequences
Examples
• Renting "Star Wars", then "Empire Strikes Back", then "Return of the Jedi", in that order
• A collection of ordered events within an interval
Applications
• Targeted marketing
• Customer retention
• Weather prediction
Motivating Example
Sequential patterns are useful
• "free internet access, then buy package 1, then upgrade to package 2"
• Marketing, product design and development
Problem: lack of focus
• Various groups of customers may have different patterns
MD-sequential pattern mining: integrate multi-dimensional analysis and sequential pattern mining
Sequences and Patterns
Given a set of sequences, find the complete set of frequent subsequences.
A sequence: <(ef)(ab)(df)cb>   (items within an element are listed alphabetically)
A sequence database:
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
Sequential Pattern: Basics
A sequence: <(bd)cb(ac)>
A sequence database:
SID  sequence
10   <(bd)cb(ac)>
20   <(bf)(ce)b(fg)>
30   <(ah)(bf)abf>
40   <(be)(ce)d>
50   <a(bd)bcb(ade)>
<ad(ae)> is a subsequence of <a(bd)bcb(ade)>
Given support threshold min_sup = 2, <(bd)cb> is a sequential pattern.
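The containment test behind "subsequence" (and hence behind support counting) can be sketched with a greedy scan: each pattern element, an itemset, must be a subset of a later and later element of the data sequence:

```python
def is_subsequence(pattern, sequence):
    """Greedy check that pattern's itemsets embed, in order, into sequence."""
    i = 0
    for element in sequence:
        if i < len(pattern) and pattern[i] <= element:  # subset test
            i += 1
    return i == len(pattern)

# The slide's example: <ad(ae)> is a subsequence of <a(bd)bcb(ade)>.
seq = [{"a"}, {"b", "d"}, {"b"}, {"c"}, {"b"}, {"a", "d", "e"}]
print(is_subsequence([{"a"}, {"d"}, {"a", "e"}], seq))  # True
```

Counting, over all data sequences, how many contain a candidate pattern gives that pattern's support, which is then compared against min_sup.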
Enhanced Similarity Search Methods
• Allow for gaps within a sequence or differences in offsets or amplitudes
• Normalize sequences with amplitude scaling and offset translation
• Two subsequences are considered similar if one lies within an envelope of width ε around the other, ignoring outliers
• Two sequences are said to be similar if they have enough non-overlapping, time-ordered pairs of similar subsequences
• Parameters specified by a user or expert: sliding window size, width of the envelope for similarity, maximum gap, and matching fraction
Subsequence Matching

• Break each sequence into a set of pieces of window with length w
• Extract the features of the subsequence inside the window
• Map each sequence to a “trail” in the feature space
• Divide the trail of each sequence into “subtrails” and represent each of them with a minimum bounding rectangle
• Use a multipiece assembly algorithm to search for longer sequence matches
Sequential pattern mining: Cases and Parameters

Duration of a time sequence T
• Sequential pattern mining can then be confined to the data within a specified duration
• Ex. Subsequence corresponding to the year of 1999
• Ex. Partitioned sequences, such as every year, or every week after stock crashes, or every two weeks before and after a volcano eruption
Event folding window w
• If w = T, time-insensitive frequent patterns are found
• If w = 0 (no event sequence folding), sequential patterns are found where each event occurs at a distinct time instant
• If 0 < w < T, sequences occurring within the same period w are folded in the analysis
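The effect of the event folding window w can be sketched as follows; `fold_events` is a hypothetical helper that greedily folds events occurring within w time units of the first event of the current element:

```python
def fold_events(events, w):
    """Group timestamped events (time, item) into elements: events whose
    times fall within w of the first event of the current element are
    folded together. w = 0 keeps each distinct time instant separate."""
    folded, current, start = [], [], None
    for t, item in sorted(events):
        if current and t - start > w:
            folded.append(current)  # close the current element
            current = []
        if not current:
            start = t  # first event of a new element
        current.append(item)
    if current:
        folded.append(current)
    return folded

events = [(1, "a"), (2, "b"), (5, "c"), (6, "d")]
print(fold_events(events, w=1))  # [['a', 'b'], ['c', 'd']]
print(fold_events(events, w=0))  # [['a'], ['b'], ['c'], ['d']]
```

With a large enough w every event falls into one element, recovering the time-insensitive case.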
Sequential pattern mining: Cases and Parameters (2)

Time interval, int, between events in the discovered pattern
• int = 0: no interval gap is allowed, i.e., only strictly consecutive sequences are found
  – Ex. “Find frequent patterns occurring in consecutive weeks”
• min_int ≤ int ≤ max_int: find patterns that are separated by at least min_int but at most max_int
  – Ex. “If a person rents movie A, it is likely she will rent movie B within 30 days” (int ≤ 30)
• int = c ≠ 0: find patterns carrying an exact interval
  – Ex. “Every time when Dow Jones drops more than 5%, what will happen exactly two days later?” (int = 2)
Episodes and Sequential Pattern Mining Methods

Other methods for specifying the kinds of patterns
• Serial episodes: A → B
• Parallel episodes: A & B
• Regular expressions: (A | B)C*(D → E)

Methods for sequential pattern mining
• Variations of Apriori-like algorithms
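Serial and parallel episodes can be checked with a small sketch; the window semantics (all events within `window` time units) is an assumption, and both function names are illustrative:

```python
def has_serial_episode(events, first, second, window):
    """Serial episode first -> second: `second` must follow `first`
    within `window` time units. `events` is a list of (time, item)."""
    times_first = [t for t, e in events if e == first]
    for t, e in events:
        if e == second and any(0 < t - tf <= window for tf in times_first):
            return True
    return False

def has_parallel_episode(events, items, window):
    """Parallel episode: all `items` occur, in any order, inside some
    window of `window` time units."""
    evs = sorted(events)
    for t0, _ in evs:
        seen = {e for t, e in evs if t0 <= t <= t0 + window}
        if set(items) <= seen:
            return True
    return False

events = [(1, "A"), (3, "B"), (10, "B")]
print(has_serial_episode(events, "A", "B", window=5))      # True (A at 1, B at 3)
print(has_parallel_episode(events, ["A", "B"], window=2))  # True
```

A serial episode imposes order on its events; a parallel episode only requires co-occurrence inside the window.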
Click Streams

Client click-stream analysis is a click-by-click view of a visitor's journey (or journeys) through a web site. By viewing a click-stream report, you can follow the exact pathway a visitor took through a web site, even down to the length of time they spent looking at each particular page.
Click Streams…Continued

The people most interested in this report would typically be involved in marketing, web design or web development. The information presented provides a click-by-click view of how visitors are interacting with and navigating through their web site.
Periodicity Analysis

Periodicity is everywhere: tides, seasons, daily power consumption, etc.

Full periodicity
• Every point in time contributes (precisely or approximately) to the periodicity
Partial periodicity: A more general notion
• Only some segments contribute to the periodicity
  – Jim reads NY Times 7:00–7:30 am every weekday
Cyclic association rules
• Associations which form cycles

Methods
• Full periodicity: FFT, other statistical analysis methods
• Partial and cyclic periodicity: Variations of Apriori-like mining methods
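Partial periodicity can be illustrated with a toy check: given a candidate period and offset, count how often a symbol occupies that slot. The function name and the support-ratio formulation are assumptions for illustration:

```python
def partial_period_support(seq, period, offset, symbol):
    """Fraction of full periods in `seq` where `symbol` appears at
    position `offset` within the period (a partial-periodicity check:
    only one slot of the period needs to be regular)."""
    hits = total = 0
    for start in range(0, len(seq) - period + 1, period):
        total += 1
        if seq[start + offset] == symbol:
            hits += 1
    return hits / total if total else 0.0

# A day split into 4 slots; 'N' (reads NY Times) occupies slot 0 on weekdays only.
week = list("Nxyz") * 5 + list("wxyz") * 2  # 5 weekdays + 2 weekend days
print(partial_period_support(week, period=4, offset=0, symbol="N"))  # 5/7
```

Full periodicity would demand every slot of the period be regular; here only slot 0 is, which is exactly the "Jim reads NY Times 7:00-7:30 am" style of pattern.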
MD Sequence Database

P = (*, Chicago, *, <bf>) matches tuples 20 and 30
If support = 2, P is an MD sequential pattern
cid Cust_grp City Age_grp sequence
10 Business Boston Middle <(bd)cba>
20 Professional Chicago Young <(bf)(ce)(fg)>
30 Business Chicago Middle <(ah)abf>
40 Education New York Retired <(be)(ce)>
Mining of MD Seq. Pat.

Embedding MD information into sequences
• Using a uniform seq. pat. mining method
Integration of seq. pat. mining and MD analysis methods
UNISEQ

Embed MD information into sequences
cid Cust_grp City Age_grp sequence
10 Business Boston Middle <(bd)cba>
20 Professional Chicago Young <(bf)(ce)(fg)>
30 Business Chicago Middle <(ah)abf>
40 Education New York Retired <(be)(ce)>
Mine the extended sequence database
using sequential pattern mining methods
cid MD-extension of sequences
10 <(Business,Boston,Middle)(bd)cba>
20 <(Professional,Chicago,Young)(bf)(ce)(fg)>
30 <(Business,Chicago,Middle)(ah)abf>
40 <(Education,New York,Retired)(be)(ce)>
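The UNISEQ embedding can be sketched directly from the table above; `embed_md` is an illustrative name, and itemsets are modeled as Python sets:

```python
def embed_md(tuples):
    """UNISEQ-style embedding: prepend the MD attribute values of each
    customer as an extra leading element of the sequence, so the result
    can be mined with any ordinary sequential pattern mining method."""
    extended = []
    for cid, cust_grp, city, age_grp, seq in tuples:
        extended.append((cid, [{cust_grp, city, age_grp}] + seq))
    return extended

db = [
    (10, "Business", "Boston", "Middle", [{"b", "d"}, {"c"}, {"b"}, {"a"}]),
    (20, "Professional", "Chicago", "Young", [{"b", "f"}, {"c", "e"}, {"f", "g"}]),
]
for cid, seq in embed_md(db):
    print(cid, seq)
```

After embedding, an MD-pattern like (*, Chicago, *, <bf>) is just an ordinary sequential pattern whose first element contains "Chicago".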
Mine Sequential Patterns by Prefix Projections
Step 1: find length-1 sequential patterns
• <a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
• The ones having prefix <a>;
• The ones having prefix <b>;
• …
• The ones having prefix <f>

SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
Find Seq. Patterns with Prefix <a>
Only need to consider projections w.r.t. <a>
• <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
• Further partition into 6 subsets
  – Having prefix <aa>
  – …
  – Having prefix <af>
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
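The <a>-projection step can be sketched as follows; `project` is a hypothetical helper that reproduces the four projected sequences listed above, using the convention that items within an element are listed alphabetically (so the '_' marker keeps the items that follow a in the same element):

```python
def project(db, item):
    """Build the <item>-projected database: for each sequence, keep the
    suffix after the first occurrence of `item`. If other items remain in
    the same element, keep them as a '_'-prefixed partial element."""
    projected = []
    for seq in db:
        for i, element in enumerate(seq):
            if item in element:
                rest = sorted(x for x in element if x > item)  # items after `item`
                suffix = ([["_"] + rest] if rest else []) + [sorted(e) for e in seq[i + 1:]]
                if suffix:
                    projected.append(suffix)
                break  # only the first occurrence starts the projection
    return projected

db = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]
for s in project(db, "a"):
    print(s)
```

Running this yields the four projected sequences <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, and <(_f)cbc> from the slide.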
Completeness of PrefixSpan
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
SDB
Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
Having prefix <a>
  <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
    Having prefix <aa>: <aa>-proj. db … Having prefix <af>: <af>-proj. db
Having prefix <b>
  <b>-projected database …
Having prefix <c>, …, <f>
  …
Efficiency of PrefixSpan
No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected databases
• Can be improved by bi-level projections
Mining MD-Patterns
All
(cust-grp,*,*)   (*,city,*)   (*,*,age-grp)
(cust-grp,city,*)   (cust-grp,*,age-grp)   (*,city,age-grp)
(cust-grp,city,age-grp)
cid Cust_grp City Age_grp sequence
10 Business Boston Middle <(bd)cba>
20 Professional Chicago Young <(bf)(ce)(fg)>
30 Business Chicago Middle <(ah)abf>
40 Education New York Retired <(be)(ce)>
MD pattern: (*,Chicago,*), found by BUC processing
Dim-Seq

First find MD-patterns
• E.g., (*,Chicago,*)
Form the projected sequence database
• E.g., <(bf)(ce)(fg)> and <(ah)abf> for (*,Chicago,*)
Find seq. pat. in the projected database
• E.g., (*,Chicago,*,<bf>)
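The Dim-Seq projection step can be sketched as a simple filter; `md_project` is an illustrative name, and '*' is treated as a wildcard on each dimension:

```python
def md_project(tuples, pattern):
    """Dim-Seq projection: given an MD-pattern such as (*, 'Chicago', *),
    collect the sequences of the tuples whose dimension values match.
    Sequential patterns are then mined within this projected database."""
    projected = []
    for dims, seq in tuples:
        if all(p == "*" or p == d for p, d in zip(pattern, dims)):
            projected.append(seq)
    return projected

# The four customers from the slide's table.
db = [
    (("Business", "Boston", "Middle"), "<(bd)cba>"),
    (("Professional", "Chicago", "Young"), "<(bf)(ce)(fg)>"),
    (("Business", "Chicago", "Middle"), "<(ah)abf>"),
    (("Education", "New York", "Retired"), "<(be)(ce)>"),
]
print(md_project(db, ("*", "Chicago", "*")))  # ['<(bf)(ce)(fg)>', '<(ah)abf>']
```

Seq-Dim simply reverses the two phases: first mine sequential patterns, then project on the MD attributes of the supporting tuples.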
cid Cust_grp City Age_grp sequence
10 Business Boston Middle <(bd)cba>
20 Professional Chicago Young <(bf)(ce)(fg)>
30 Business Chicago Middle <(ah)abf>
40 Education New York Retired <(be)(ce)>
Seq-Dim

Find sequential patterns
• E.g., <bf>
Form the projected MD-database
• E.g., (Professional,Chicago,Young) and (Business,Chicago,Middle) for <bf>
Mine MD-patterns
• E.g., (*,Chicago,*,<bf>)
cid Cust_grp City Age_grp sequence
10 Business Boston Middle <(bd)cba>
20 Professional Chicago Young <(bf)(ce)(fg)>
30 Business Chicago Middle <(ah)abf>
40 Education New York Retired <(be)(ce)>
Scalability Over Dimensionality
Scalability Over Cardinality
Scalability Over Support Threshold
Scalability Over Database Size
Pros & Cons of Algorithms

Seq-Dim is efficient and scalable
• Fastest in most cases
UniSeq is also efficient and scalable
• Fastest with low dimensionality
Dim-Seq has poor scalability
Conclusions

MD seq. pat. mining is interesting and useful
Mining MD seq. pat. efficiently
• UniSeq, Dim-Seq, and Seq-Dim
Future work
• Applications of sequential pattern mining