Multimedia DBs
Time Series Data
A time series is a collection of observations made sequentially in time.
Sample values: 25.1750 25.1750 25.2250 25.2500 25.2500 25.2750 25.3250 25.3500 25.3500 25.4000 25.4000 25.3250 25.2250 25.2000 25.1750 .. .. 24.6250 24.6750 24.6750 24.6250 24.6250 24.6250 24.6750 24.7500
[Figure: the series plotted against a time axis (0 to 500) and a value axis (23 to 29)]
PAA and APCA
Feature extraction for GEMINI:
Fourier
Wavelets
Another approach: segment the time series into equal parts, store the average value for each part.
Use an index to store the averages and the segment end points
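The equal-length averaging described above can be sketched as follows. This is a minimal illustration, not the deck's own code: the function name `paa`, the sample values and the requirement that the series length be a multiple of the number of segments are all assumptions made for brevity.

```python
def paa(series, n_segments):
    """Mean of each equal-length segment of `series`.

    Assumption: len(series) is a positive multiple of n_segments.
    """
    n = len(series)
    if n_segments <= 0 or n % n_segments != 0:
        raise ValueError("len(series) must be a positive multiple of n_segments")
    width = n // n_segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


# Hypothetical 8-point series reduced to 4 averages.
series = [1.0, 3.0, 2.0, 2.0, 10.0, 12.0, 7.0, 5.0]
averages = paa(series, 4)
# Right end point (1-based) of each segment; fixed by the segment width.
end_points = [2, 4, 6, 8]
```

With equal-length segments the end points are implied by the segment width, so only the averages need to go into the index.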
Feature Spaces
[Figure: a time series X and its approximation X' in three feature spaces, each plotted over time 0 to 140]
DFT: Agrawal, Faloutsos, Swami 1993
DWT (basis functions Haar 0 to Haar 7): Chan & Fu 1999
SVD (basis functions eigenwave 0 to eigenwave 7): Korn, Jagadish, Faloutsos 1997
Piecewise Aggregate Approximation (PAA)
Original time series (n-dimensional vector): S = {s1, s2, …, sn}
n’-segment PAA representation (n’-d vector): S = {sv1, sv2, …, svn’}
[Figure: the series on time and value axes with its 8-segment PAA approximation, segment averages sv1 to sv8]
PAA representation satisfies the lower bounding lemma (Keogh, Chakrabarti, Mehrotra and Pazzani, 2000; Yi and Faloutsos 2000)
Adaptive Piecewise Constant Approximation (APCA)
Can we improve upon PAA?
n’-segment PAA representation (n’-d vector): S = {sv1, sv2, …, svn’}
n’/2-segment APCA representation (n’-d vector): S = {sv1, sr1, sv2, sr2, …, svM, srM}
(M is the number of segments = n’/2)
[Figure: the same series approximated by 8 equal-length PAA segments (sv1 to sv8) and by 4 variable-length APCA segments, each with an average value svi and a right end point sri]
APCA approximates original signal better than PAA
Improvement factor = (Reconstruction error PAA) / (Reconstruction error APCA)
[Figure: improvement factors for the example signals shown: 1.69, 3.02, 1.21, 1.75, 3.77, 1.03]
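To make the improvement factor concrete, here is a small sketch that compares the two approximations on a toy series using the same number of stored values (4 PAA averages versus 2 APCA segments with 2 values each). The series, the hand-picked APCA end points and the use of Euclidean distance as the reconstruction error are assumptions for illustration only.

```python
def piecewise_constant(series, right_end_points):
    """Replace each segment (given by 1-based right end points) with its mean."""
    recon, start = [], 0
    for end in right_end_points:
        segment = series[start:end]
        mean = sum(segment) / len(segment)
        recon.extend([mean] * len(segment))
        start = end
    return recon


def reconstruction_error(series, recon):
    """Assumed error measure: Euclidean distance to the reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(series, recon)) ** 0.5


series = [1.0, 1.0, 1.0, 1.0, 2.0, 9.0, 9.0, 9.0]
paa_recon = piecewise_constant(series, [2, 4, 6, 8])   # 4 equal segments
apca_recon = piecewise_constant(series, [5, 8])        # 2 hand-picked segments

paa_error = reconstruction_error(series, paa_recon)
apca_error = reconstruction_error(series, apca_recon)
improvement = paa_error / apca_error
```

Here the PAA segment boundary at position 6 cuts through the jump in the series, while the adaptive boundary at position 5 follows it, which is why the factor comes out above 1.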
APCA representation can be computed efficiently
Near-optimal representation can be computed in $O(n \log n)$ time
Optimal representation can be computed in $O(n^2 M)$ time (Koudas et al.)
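As an illustration of where an $O(n^2 M)$ bound can come from, the sketch below uses a textbook dynamic program over (number of segments, prefix length). It is an assumption that this matches the spirit of the cited method; it is not taken from Koudas et al., and the name `optimal_segments` is hypothetical.

```python
def optimal_segments(series, m):
    """Minimum squared-error split of `series` into m constant segments.

    Returns (right_end_points, squared_error); end points are 1-based.
    """
    n = len(series)
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= len(series)")
    # Prefix sums give the squared error of any segment in O(1).
    ps, ps2 = [0.0], [0.0]
    for x in series:
        ps.append(ps[-1] + x)
        ps2.append(ps2[-1] + x * x)

    def sse(i, j):
        """Squared error of series[i:j] around its own mean."""
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[k][j]: best error for covering series[:j] with k segments.
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    back = [[0] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for k in range(1, m + 1):          # M values of k
        for j in range(k, n + 1):      # n prefix lengths
            for i in range(k - 1, j):  # n candidate split points
                c = cost[k - 1][i] + sse(i, j)
                if c < cost[k][j]:
                    cost[k][j], back[k][j] = c, i
    # Follow the back-pointers to recover the segment end points.
    ends, j = [], n
    for k in range(m, 0, -1):
        ends.append(j)
        j = back[k][j]
    return ends[::-1], cost[m][n]


ends, err = optimal_segments([1.0, 1.0, 1.0, 5.0, 5.0, 5.0], 2)
```

The three nested loops are what give the $n \cdot n \cdot M$ cost; the prefix sums keep each inner step constant-time.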
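Distance Measure
Exact (Euclidean) distance D(Q,S):
$D(Q,S) = \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2}$
Lower bounding distance DLB(Q’,S):
$D_{LB}(Q',S) = \sqrt{\sum_{i=1}^{M} (sr_i - sr_{i-1})\,(qv_i - sv_i)^2}$
Here $sr_0 = 0$, and Q’ is the query Q averaged over the segments of S, so $qv_i$ is the mean of Q between positions $sr_{i-1}+1$ and $sr_i$.
[Figure: Q and S compared point by point for D(Q,S); Q’ and the APCA segments of S compared for DLB(Q’,S)]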
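The two distances can be compared on a toy example. The sketch below assumes an APCA representation stored as a list of (average, right end point) pairs with 1-based end points; the helper names and the data are illustrative, not from the slides.

```python
import math


def euclidean(q, s):
    """Exact distance D(Q, S)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, s)))


def to_apca(series, right_end_points):
    """APCA pairs [(sv_i, sr_i), ...] for given 1-based right end points."""
    pairs, start = [], 0
    for end in right_end_points:
        segment = series[start:end]
        pairs.append((sum(segment) / len(segment), end))
        start = end
    return pairs


def lower_bound(q, s_apca):
    """D_LB(Q', S): Q is averaged over the segments of S before comparing."""
    total, start = 0.0, 0
    for sv, sr in s_apca:
        qv = sum(q[start:sr]) / (sr - start)      # mean of Q on segment i
        total += (sr - start) * (qv - sv) ** 2    # weighted by segment length
        start = sr
    return math.sqrt(total)


s = [1.0, 2.0, 3.0, 8.0, 9.0, 7.0]
q = [2.0, 2.0, 5.0, 6.0, 9.0, 9.0]
s_apca = to_apca(s, [3, 6])

exact = euclidean(q, s)
bound = lower_bound(q, s_apca)
```

In this example the bound stays below the exact distance, which is the property the index relies on to avoid false dismissals.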
Index on 2M-dimensional APCA space
Any feature-based index structure can be used (e.g., R-tree, X-tree, Hybrid Tree)
[Figure: points S1 to S9 in the 2M-dimensional APCA space grouped into MBRs R1 to R4, with the corresponding index tree]
k-nearest neighbor Algorithm
[Figure: query Q, MBRs R1 to R4 containing S1 to S9, and the distances MINDIST(Q,R2), MINDIST(Q,R3), MINDIST(Q,R4)]
For any node U of the index structure with MBR R, MINDIST(Q,R) ≤ D(Q,S) for any data item S under U
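The lower-bounding property is what lets a best-first search stop early. The sketch below is a generic best-first k-NN over a tiny hand-built tree, offered as an assumption about how such a search is commonly organised rather than as the deck's algorithm. To stay short it uses plain 2-d points and the usual point-to-box MINDIST instead of APCA rectangles, and it computes exact distances as soon as an item is reached.

```python
import heapq
import math


def dist(q, p):
    """Exact Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, p)))


def mindist_rect(q, low, high):
    """Smallest distance from q to the box [low, high]; 0 if q is inside."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, low, high)))


def knn(root, q, k):
    """Best-first search. Entries are ("node", low, high, children)
    or ("item", name, point)."""
    heap = [(0.0, 0, root)]          # (distance bound, tie-breaker, entry)
    counter, result = 1, []
    while heap and len(result) < k:
        d, _, entry = heapq.heappop(heap)
        if entry[0] == "item":
            # Everything left in the heap has a bound >= d, so this is next.
            result.append((entry[1], d))
            continue
        for child in entry[3]:
            if child[0] == "item":
                bound = dist(q, child[2])
            else:
                bound = mindist_rect(q, child[1], child[2])
            heapq.heappush(heap, (bound, counter, child))
            counter += 1
    return result


# Hypothetical two-level tree.
r2 = ("node", (0.0, 0.0), (2.0, 2.0),
      [("item", "S1", (1.0, 1.0)), ("item", "S2", (2.0, 2.0))])
r3 = ("node", (5.0, 5.0), (7.0, 7.0),
      [("item", "S3", (5.0, 5.0)), ("item", "S4", (7.0, 6.0))])
r1 = ("node", (0.0, 0.0), (7.0, 7.0), [r2, r3])

neighbours = knn(r1, (4.0, 4.0), 2)
```

Because a node's bound never exceeds the distance to any item beneath it, an item popped from the queue cannot be beaten by anything still waiting in it.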
Index Modification for MINDIST Computation
APCA point S = { sv1, sr1, sv2, sr2, …, svM, srM }
APCA rectangle S = (L,H) where
L = { smin1, sr1, smin2, sr2, …, sminM, srM } and
H = { smax1, sr1, smax2, sr2, …, smaxM, srM }
[Figure: a 4-segment APCA representation (sv1 to sv4, sr1 to sr4) with smin1 to smin4 and smax1 to smax4 marking the smallest and largest value of the series in each segment; the items S1 to S9 drawn as rectangles inside the MBRs R1 to R4]
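A minimal sketch of building the APCA rectangle from a raw series, assuming the segment end points are already known and that smin and smax are the smallest and largest raw values within each segment. The function name and data are illustrative.

```python
def apca_rectangle(series, right_end_points):
    """Return (L, H) with L = [smin1, sr1, ...] and H = [smax1, sr1, ...]."""
    low, high, start = [], [], 0
    for end in right_end_points:
        segment = series[start:end]
        low += [min(segment), end]    # value bound, then the shared end point
        high += [max(segment), end]
        start = end
    return low, high


series = [1.0, 2.0, 3.0, 8.0, 9.0, 7.0]
L, H = apca_rectangle(series, [3, 6])
```

The end points appear unchanged in both L and H, so the rectangle has extent only along the value dimensions.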
MBR Representation in time-value space
We can view the MBR R=(L,H) of any node U as two APCA representations
L = { l1, l2, …, l(N-1), lN } and H = { h1, h2, …, h(N-1), hN }
[Figure: example with L = { l1, l2, l3, l4, l5, l6 } and H = { h1, h2, h3, h4, h5, h6 } drawn on time and value axes, forming REGION 1, REGION 2 and REGION 3]
Regions
M regions associated with each MBR; boundaries of ith region:
value axis: from l(2i-1) to h(2i-1)
time axis: from l(2i-2)+1 to h(2i)
[Figure: REGION 1, REGION 2 and REGION 3 drawn on time and value axes from the bounds l1 to l6 and h1 to h6]
[Figure: REGION 1, REGION 2 and REGION 3 with two time instants t1 and t2 marked on the time axis]
ith region is active at time instant t if it spans across t
The value st of any time series S under node U at time instant t must lie in one of the regions active at t (Lemma 2)
MINDIST Computation
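For time instant t:
$\mathrm{MINDIST}(Q,R,t) = \min_{\text{region } G \text{ active at } t} \mathrm{MINDIST}(Q,G,t)$
[Figure: REGION 1, REGION 2 and REGION 3 with time instant t1, at which REGION 1 and REGION 2 are active]
Example: $\mathrm{MINDIST}(Q,R,t_1) = \min(\mathrm{MINDIST}(Q,\mathrm{Region}_1,t_1), \mathrm{MINDIST}(Q,\mathrm{Region}_2,t_1)) = \min((q_{t_1} - h_1)^2, (q_{t_1} - h_3)^2) = (q_{t_1} - h_1)^2$
$\mathrm{MINDIST}(Q,R) = \sqrt{\sum_{t=1}^{n} \mathrm{MINDIST}(Q,R,t)}$
Lemma 3: MINDIST(Q,R) ≤ D(Q,C) for any time series C under node U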
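The region construction and the MINDIST sum can be put together in a short sketch. It assumes 1-based time instants, l0 = 0, and a per-region distance equal to the squared gap between the query value and the region's value range (zero when the query value falls inside), which is consistent with the worked example above but is not spelled out on the slides. The MBR values are hypothetical.

```python
import math


def regions(L, H):
    """Regions of an MBR (L, H) as (t_start, t_end, v_low, v_high)."""
    out = []
    for i in range(len(L) // 2):
        # Time span: l(2i-2) + 1 to h(2i) in the 1-based slide notation.
        t_start = (L[2 * i - 1] if i > 0 else 0) + 1
        out.append((t_start, H[2 * i + 1], L[2 * i], H[2 * i]))
    return out


def mindist_at(q_t, t, regs):
    """MINDIST(Q, R, t): smallest squared gap to a region active at t."""
    best = math.inf
    for t_start, t_end, v_low, v_high in regs:
        if t_start <= t <= t_end:                  # region spans across t
            gap = max(v_low - q_t, 0.0, q_t - v_high)
            best = min(best, gap * gap)
    return best


def mindist(q, L, H):
    """MINDIST(Q, R): square root of the per-instant sum."""
    regs = regions(L, H)
    return math.sqrt(sum(mindist_at(q_t, t, regs)
                         for t, q_t in enumerate(q, start=1)))


# Hypothetical MBR enclosing the APCA rectangles of two length-6 series.
s1 = [1.0, 2.0, 3.0, 8.0, 9.0, 7.0]   # segments end at 3 and 6
s2 = [2.0, 2.0, 2.0, 2.0, 6.0, 6.0]   # segments end at 4 and 6
L = [1.0, 3, 6.0, 6]
H = [3.0, 4, 9.0, 6]

q = [0.0, 0.0, 0.0, 5.0, 10.0, 10.0]
bound = mindist(q, L, H)
```

At t = 4 both regions are active, and the smaller of the two gaps is used, mirroring the min over active regions in the definition.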
Approximate Search
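A simpler definition of the distance in the feature space is the following:
$\sqrt{\sum_{i=1}^{M} \sum_{k=1}^{cr_i - cr_{i-1}} (cv_i - q_{k + cr_{i-1}})^2}$
Here $cv_i$ and $cr_i$ are the average value and right end point of the ith segment of the stored series C, and $cr_0 = 0$.
But there is one problem… what?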
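One property worth checking for such a measure is whether it still lower-bounds the exact distance D. The sketch below evaluates the double sum on a toy case where the query equals the stored series, so the exact distance is 0. The data and the function name are assumptions for illustration.

```python
import math


def approx_distance(q, c_apca):
    """Distance between raw Q and the piecewise-constant version of C.

    c_apca is a list of (cv_i, cr_i) pairs with 1-based right end points.
    """
    total, start = 0.0, 0
    for cv, cr in c_apca:
        total += sum((cv - q_k) ** 2 for q_k in q[start:cr])
        start = cr
    return math.sqrt(total)


c = [1.0, 3.0, 8.0, 6.0]
c_apca = [(2.0, 2), (7.0, 4)]   # averages and end points of two segments
q = list(c)                     # query identical to the stored series

approx = approx_distance(q, c_apca)
exact = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))
```

In this case the approximate value is larger than the exact distance, so a search that prunes on it can discard true answers; that is why it belongs under approximate search.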
Multimedia dbs
A multimedia database also stores images
Again similarity queries (content based retrieval)
Extract features, index in feature space, answer similarity queries using GEMINI
Again, average values help!
Images - color
What is an image? A: a 2-d array
Color histograms, and distance function
Mathematically, the distance function is:
$d_{hist}^2(x,y) = (x - y)^T A\,(x - y)$, with x and y the two color histograms
Problem: ‘cross-talk’: features are not orthogonal -> SAMs (spatial access methods) will not work properly
Q: what to do? A: feature-extraction question
Possible answers: avg red, avg green, avg blue
It turns out that this lower-bounds the histogram distance -> no cross-talk; SAMs are applicable
Performance:
[Figure: time versus selectivity, comparing search w/ avg RGB against seq scan]
Images - shapes
Distance function: Euclidean, on the area, perimeter, and 20 ‘moments’
Q: how to normalize them? A: divide by standard deviation
Q: other ‘features’ / distance functions? A1: turning angle A2: dilations/erosions A3: ...
Q: how to do dim. reduction? A: Karhunen-Loeve (= centered PCA/SVD)
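The normalization answer can be illustrated with a short sketch: each feature is divided by its standard deviation over the collection before the Euclidean distance is taken, so that a large-scale feature such as area does not swamp the others. The two-feature shapes and the choice of population standard deviation are assumptions for illustration.

```python
import math
import statistics


def normalize_columns(rows):
    """Divide every feature (column) by its standard deviation across rows."""
    stds = [statistics.pstdev(col) for col in zip(*rows)]
    # A constant feature carries no information; map it to 0.
    return [[x / s if s > 0 else 0.0 for x, s in zip(row, stds)]
            for row in rows]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical shapes described by (area, perimeter).
shapes = [[100.0, 4.0], [300.0, 6.0], [200.0, 8.0]]
scaled = normalize_columns(shapes)

raw_distance = euclidean(shapes[0], shapes[1])
scaled_distance = euclidean(scaled[0], scaled[1])
```

After scaling, every feature has unit standard deviation, so area and perimeter contribute on comparable terms.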
Performance: ~10x faster
[Figure: log(# of I/Os) versus # of features kept, with the ‘all kept’ case for comparison]
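Is $d(u,v) = \sqrt{(u-v)^T A\,(u-v)}$ a metric?
$x^T A x = \sum_{i,j} x_i x_j A_{ij} = \sum_i \lambda_i \hat{x}_i^2$
where $\lambda_i$ is the ith eigenvalue of A and $\hat{x}_i$ is the projection of x along the ith eigenvector (A symmetric, $\lambda_i \ge 0$).
Writing $u_i$, $v_i$, $w_i$ for these projections:
$d(u,v) = \sqrt{(u-v)^T A\,(u-v)} = \sqrt{\sum_i \lambda_i (u_i - v_i)^2}$
$d(u,v) \ge 0$, $d(u,u) = 0$, $d(u,v) = d(v,u)$
$d(u,w) \le d(u,v) + d(v,w)$, provided
$\sqrt{\sum_i \lambda_i (u_i - w_i)^2} \le \sqrt{\sum_i \lambda_i (u_i - v_i)^2} + \sqrt{\sum_i \lambda_i (v_i - w_i)^2}$
that is,
$\sqrt{\sum_i (\sqrt{\lambda_i} u_i - \sqrt{\lambda_i} w_i)^2} \le \sqrt{\sum_i (\sqrt{\lambda_i} u_i - \sqrt{\lambda_i} v_i)^2} + \sqrt{\sum_i (\sqrt{\lambda_i} v_i - \sqrt{\lambda_i} w_i)^2}$
This is the metric condition (triangle inequality) for the Lp norm with p = 2, applied to the vectors with components $\sqrt{\lambda_i} u_i$, $\sqrt{\lambda_i} v_i$, $\sqrt{\lambda_i} w_i$.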
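A quick numerical check of the argument above, as a sketch: it evaluates the quadratic-form distance for a small symmetric matrix with positive eigenvalues and tests the triangle inequality on every ordered triple of a few points. The matrix and points are hypothetical, and a check on sample points illustrates the claim rather than proving it.

```python
import itertools
import math


def quad_dist(u, v, A):
    """d(u, v) = sqrt((u - v)^T A (u - v)) for a symmetric matrix A."""
    z = [a - b for a, b in zip(u, v)]
    n = len(z)
    return math.sqrt(sum(z[i] * A[i][j] * z[j]
                         for i in range(n) for j in range(n)))


# Hypothetical symmetric matrix; its eigenvalues are 1 and 3.
A = [[2.0, 1.0], [1.0, 2.0]]
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]

triangle_ok = all(
    quad_dist(u, w, A) <= quad_dist(u, v, A) + quad_dist(v, w, A) + 1e-12
    for u, v, w in itertools.permutations(points, 3)
)
```

The off-diagonal entries of A are what couple different coordinates; the eigenvector change of basis in the derivation removes that coupling.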
Filtering in QBIC
Histogram column vectors x, y of length n, with $\sum_i x_i = 1$, $\sum_i y_i = 1$
Difference $z = (x - y)$, so $\sum_i z_i = 0$
Contribution of each color bin to a smaller set of colors: $V^T = (c_1, c_2, .., c_n)$, each $c_i$ is a column vector of length 3
$x_{avg} = V^T x$, $y_{avg} = V^T y$, column vectors of length 3
Distances
$d_{avg}^2 = (x_{avg} - y_{avg})^T (x_{avg} - y_{avg}) = (V^T z)^T (V^T z) = z^T V V^T z = z^T W z$, with $W = V V^T$
$d_{hist}^2 = z^T A z$
$d_{hist}^2 \ge \lambda_1 d_{avg}^2$, where $\lambda_1$ is the smallest eigenvalue of $A' z' = \lambda W' z'$
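The identity $d_{avg}^2 = z^T W z$ can be checked on a toy histogram. In this sketch row k of V holds the RGB color assumed for bin k, so that $V^T$ has the colors as columns; the 4-bin palette and the two histograms are hypothetical.

```python
def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical palette: one RGB colour per histogram bin (n = 4).
V = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0]]
Vt = [list(col) for col in zip(*V)]            # 3 x n, columns are c_1..c_n
W = [[dot(r1, r2) for r2 in V] for r1 in V]    # W = V V^T, n x n

x = [0.5, 0.25, 0.25, 0.0]                     # histograms sum to 1
y = [0.25, 0.25, 0.0, 0.5]
z = [a - b for a, b in zip(x, y)]              # difference sums to 0

x_avg, y_avg = matvec(Vt, x), matvec(Vt, y)    # average colours, length 3
d_avg_sq = sum((a - b) ** 2 for a, b in zip(x_avg, y_avg))
z_W_z = dot(z, matvec(W, z))
```

Computing the distance from the two 3-vectors avoids touching the n x n matrix W at query time, which is the point of using the average color as a filter.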
Rewrite z to remove the extra condition that $\sum_i z_i = 0$.
z’ becomes an (n-1)-dimensional column vector
$z^T A z = z'^T A' z'$ and $z^T W z = z'^T W' z'$
A’ and W’ are (n-1)x(n-1) matrices
Show that $z'^T A' z' \ge \lambda_1 z'^T W' z'$
Proof of $z'^T A' z' \ge \lambda_1 z'^T W' z'$
Minimize $z'^T A' z'$ wrt z’, subject to the constraint $z'^T W' z' = C$.
Same as minimizing wrt z’: $z'^T A' z' - \lambda (z'^T W' z' - C)$
Differentiate wrt z’ and set to 0: $A' z' = \lambda W' z'$
λ and z’ must be eigenvalues and eigenvectors resp. of $A' z' = \lambda W' z'$
Then $z'^T A' z' = \lambda z'^T W' z' = \lambda C$
To minimize $z'^T A' z'$, we must choose the smallest eigenvalue $\lambda_1$.
The minimum of $z'^T A' z'$ over z’, subject to the constraint $z'^T W' z' = C$, equals $\lambda_1 C$
If $z'^T W' z' = C > 0$ then $z'^T A' z' \ge \lambda_1 C$
If $z'^T W' z' = 0$ then $z'^T A' z' \ge 0$, since A’ is positive semi-definite