Multimedia DBs

Time Series Data

[Figure: plot of a time series; time axis from 0 to 500, value axis from 23 to 29.]

Sample values: 25.1750 25.1750 25.2250 25.2500 25.2500 25.2750 25.3250 25.3500 25.3500 25.4000 25.4000 25.3250 25.2250 25.2000 25.1750 .. .. 24.6250 24.6750 24.6750 24.6250 24.6250 24.6250 24.6750 24.7500

A time series is a collection of observations made sequentially in time.

PAA and APCA

Feature extraction for GEMINI: Fourier, wavelets.

Another approach: segment the time series into equal parts and store the average value for each part.

Use an index to store the averages and the segment end points.

Feature Spaces

[Figure: a time series X and its approximation X' in three feature spaces, each shown with basis functions 0 to 7 and plotted over time 0 to 140: DFT (Agrawal, Faloutsos, Swami 1993), DWT with basis functions Haar 0 to Haar 7 (Chan & Fu 1999), and SVD with basis functions eigenwave 0 to eigenwave 7 (Korn, Jagadish, Faloutsos 1997).]

Piecewise Aggregate Approximation (PAA)

Original time series (n-dimensional vector): S = {s1, s2, …, sn}

n’-segment PAA representation (n’-d vector): S = {sv1, sv2, …, svn’}

[Figure: a time series and its 8-segment PAA representation, with segment averages sv1 to sv8; value axis against time axis.]

The PAA representation satisfies the lower bounding lemma (Keogh, Chakrabarti, Mehrotra and Pazzani, 2000; Yi and Faloutsos 2000).
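As a minimal sketch of the PAA idea above, the following computes segment averages and a scaled distance between two PAA vectors. The function names are illustrative, the sketch assumes the number of segments divides the series length, and the scaling by the segment width is the usual form of the PAA lower bound rather than something stated on the slide.

```python
def paa(series, n_segments):
    """Return the average of each of n_segments equal-length parts of series."""
    n = len(series)
    if n % n_segments != 0:
        raise ValueError("this sketch assumes n_segments divides len(series)")
    width = n // n_segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


def paa_lower_bound(q_paa, s_paa, n):
    """Distance between two PAA vectors, scaled by the segment width so that
    it does not exceed the Euclidean distance between the length-n originals."""
    width = n // len(q_paa)
    return (width * sum((a - b) ** 2 for a, b in zip(q_paa, s_paa))) ** 0.5


def euclidean(q, s):
    """Exact Euclidean distance between two equal-length series."""
    return sum((a - b) ** 2 for a, b in zip(q, s)) ** 0.5


# Toy data (made up for illustration).
s = [1, 3, 5, 7, 2, 2, 6, 8]
q = [0, 0, 0, 0, 0, 0, 0, 0]
s_paa = paa(s, 4)
q_paa = paa(q, 4)
lb = paa_lower_bound(q_paa, s_paa, len(s))
exact = euclidean(q, s)
print(s_paa, lb, exact)
```

Because the bound is computed from the short vectors only, an index can prune candidates without reading the full series.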

Can we improve upon PAA?

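n’-segment PAA representation (n’-d vector): S = {sv1, sv2, …, svn’}

n’/2-segment APCA representation (n’-d vector): S = {sv1, sr1, sv2, sr2, …, svM, srM}

(M is the number of segments = n’/2; svi is the value of the ith segment and sri is its right end point)

[Figure: the same time series approximated by 8 equal-length PAA segments (sv1 to sv8) and by 4 variable-length APCA segments (values sv1 to sv4, right end points sr1 to sr4).]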

Adaptive Piecewise Constant Approximation (APCA)

APCA approximates the original signal better than PAA.

Improvement factor = (Reconstruction error of PAA) / (Reconstruction error of APCA)

[Figure: example time series annotated with the values 1.69, 3.02, 1.21, 1.75, 3.77 and 1.03.]

APCA representation can be computed efficiently

Near-optimal representation can be computed in $O(n \log n)$ time

Optimal representation can be computed in $O(n^2 M)$ time (Koudas et al.)
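To make the representation and the improvement factor concrete, here is a small sketch that builds an APCA representation for segment end points supplied by hand, then compares reconstruction errors. It does not implement the near-optimal or optimal algorithms mentioned above; choosing the end points is exactly the part those algorithms solve. All names and the toy series are assumptions for illustration.

```python
def apca_from_endpoints(series, right_endpoints):
    """Build [(sv_1, sr_1), ..., (sv_M, sr_M)] for the given right end points.

    right_endpoints are 1-based, strictly increasing, and end at len(series).
    Picking good end points is the hard part and is NOT done here.
    """
    rep, start = [], 0
    for sr in right_endpoints:
        segment = series[start:sr]
        rep.append((sum(segment) / len(segment), sr))
        start = sr
    return rep


def reconstruct(rep):
    """Expand a representation back into a piecewise-constant series."""
    out, start = [], 0
    for sv, sr in rep:
        out.extend([sv] * (sr - start))
        start = sr
    return out


def reconstruction_error(series, approx):
    """Euclidean distance between a series and its approximation."""
    return sum((a - b) ** 2 for a, b in zip(series, approx)) ** 0.5


series = [1, 1, 1, 9, 9, 9, 9, 8]          # toy data with one sharp jump
# PAA is the special case of equal-length segments: 4 segments, 4 numbers.
paa_rep = apca_from_endpoints(series, [2, 4, 6, 8])
# APCA with 2 segments also stores 4 numbers (2 values + 2 end points).
apca_rep = apca_from_endpoints(series, [3, 8])

err_paa = reconstruction_error(series, reconstruct(paa_rep))
err_apca = reconstruction_error(series, reconstruct(apca_rep))
improvement = err_paa / err_apca
print(apca_rep, err_paa, err_apca, improvement)
```

On this toy series the hand-picked end point at the jump lets two adaptive segments beat four fixed ones at the same storage cost.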

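Distance Measure

Exact (Euclidean) distance D(Q,S):

$$D(Q,S) = \sqrt{\sum_{i=1}^{n} (q_i - s_i)^2}$$

Lower bounding distance DLB(Q’,S):

$$D_{LB}(Q',S) = \sqrt{\sum_{i=1}^{M} (sr_i - sr_{i-1})\,(qv_i - sv_i)^2}$$

where Q’ is the query Q projected onto the segments of S: $qv_i$ is the mean of Q over the ith segment of S, and $sr_0 = 0$.

[Figure: query Q and series S with the exact distance D(Q,S), and the projected query Q’ with the APCA representation of S and the lower bounding distance DLB(Q’,S).]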
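The two distances can be sketched as follows. The APCA representation is passed as a list of (sv_i, sr_i) pairs with 1-based right end points, which is an assumed encoding; the projection of Q onto the segments of S takes the mean of Q over each segment.

```python
def euclidean(q, s):
    """Exact distance D(Q,S)."""
    return sum((a - b) ** 2 for a, b in zip(q, s)) ** 0.5


def project_query(q, s_apca):
    """Q': mean of q over each segment of S (segments given by sr_i)."""
    qv, start = [], 0
    for _, sr in s_apca:
        segment = q[start:sr]
        qv.append(sum(segment) / len(segment))
        start = sr
    return qv


def d_lb(q, s_apca):
    """Lower bounding distance DLB(Q',S) from the APCA representation of S."""
    total, start = 0.0, 0
    for (sv, sr), qv in zip(s_apca, project_query(q, s_apca)):
        total += (sr - start) * (qv - sv) ** 2   # (sr_i - sr_{i-1}) * (qv_i - sv_i)^2
        start = sr
    return total ** 0.5


# Toy data: S and its 2-segment APCA representation (segment means 1.0 and 8.8).
s = [1, 1, 1, 9, 9, 9, 9, 8]
s_apca = [(1.0, 3), (8.8, 8)]
q = [2, 0, 1, 7, 9, 10, 8, 9]

exact = euclidean(q, s)
bound = d_lb(q, s_apca)
print(bound, exact)
```

The bound needs the raw query but only the compact representation of S, which is what makes it usable inside an index.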

Index on 2M-dimensional APCA space

Any feature-based index structure can be used (e.g., R-tree, X-tree, Hybrid Tree)

[Figure: APCA points S1 to S9 in the 2M-dimensional APCA space, grouped into MBRs R1 to R4, next to the corresponding index tree with S1 to S9 at the leaves.]

k-nearest neighbor Algorithm

[Figure: query Q and MBRs R1 to R4 containing points S1 to S9, with MINDIST(Q,R2), MINDIST(Q,R3) and MINDIST(Q,R4) marked.]

For any node U of the index structure with MBR R, MINDIST(Q,R) ≤ D(Q,S) for any data item S under U.
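One common way to exploit this property is a best-first search that always expands the entry with the smallest bound. The sketch below is generic and not necessarily the exact algorithm of the slides: the tree layout (dicts with "mbr", "children", "items") is hypothetical, and 1-D intervals stand in for MBRs so the example stays small.

```python
import heapq
import itertools


def knn(query, root, k, mindist, dist):
    """Best-first k-NN search over an index tree.

    mindist(query, mbr) must never exceed dist(query, item) for any item
    stored under the node with that MBR.
    """
    tie = itertools.count()                  # tie-breaker: heap never compares nodes
    heap = [(0.0, next(tie), False, root)]
    result = []
    while heap and len(result) < k:
        d, _, is_item, entry = heapq.heappop(heap)
        if is_item:
            result.append((d, entry))        # no remaining entry can be closer
        elif "items" in entry:               # leaf node: queue exact distances
            for item in entry["items"]:
                heapq.heappush(heap, (dist(query, item), next(tie), True, item))
        else:                                # inner node: queue lower bounds
            for child in entry["children"]:
                bound = mindist(query, child["mbr"])
                heapq.heappush(heap, (bound, next(tie), False, child))
    return result


def interval_mindist(q, mbr):
    """Distance from q to the interval [lo, hi]; 0 if q lies inside."""
    lo, hi = mbr
    return max(lo - q, 0.0, q - hi)


def point_dist(q, p):
    return abs(q - p)


tree = {"mbr": (1, 20), "children": [
    {"mbr": (1, 4), "items": [1, 3, 4]},
    {"mbr": (9, 12), "items": [9, 12]},
    {"mbr": (15, 20), "items": [15, 20]},
]}
nearest = knn(7, tree, 2, interval_mindist, point_dist)
print(nearest)
```

In this toy run the node with MBR (15, 20) is never opened, because its bound is larger than the distances of the two answers already found.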

Index Modification for MINDIST Computation

APCA point S = {sv1, sr1, sv2, sr2, …, svM, srM}

APCA rectangle S = (L,H) where

L = {smin1, sr1, smin2, sr2, …, sminM, srM} and

H = {smax1, sr1, smax2, sr2, …, smaxM, srM}

[Figure: a 4-segment APCA representation (sv1 to sv4, sr1 to sr4) with the minimum smini and maximum smaxi of the series marked in each segment, and the APCA rectangles S1 to S9 grouped into MBRs R1 to R4.]

MBR Representation in time-value space

We can view the MBR R=(L,H) of any node U as two APCA representations

L = {l1, l2, …, l(N-1), lN} and H = {h1, h2, …, h(N-1), hN}

[Figure: an MBR with N = 6 drawn in time-value space (value axis against time axis): L = {l1, l2, l3, l4, l5, l6} and H = {h1, h2, h3, h4, h5, h6}.]

Regions

M regions associated with each MBR; boundaries of ith region:

value axis: from l(2i-1) up to h(2i-1)

time axis: from l(2i-2)+1 to h(2i) (taking l0 = 0)

[Figure: REGION 1, REGION 2 and REGION 3 of an MBR with L = {l1, …, l6} and H = {h1, …, h6}, drawn in time-value space.]

Regions

[Figure: REGION 1, REGION 2 and REGION 3 in time-value space, with two time instants t1 and t2 marked.]

The ith region is active at time instant t if it spans across t.

The value st of any time series S under node U at time instant t must lie in one of the regions active at t (Lemma 2).

MINDIST Computation

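For time instant t,

$$\mathrm{MINDIST}(Q,R,t) = \min_{\text{region } G \text{ active at } t} \mathrm{MINDIST}(Q,G,t)$$

[Figure: REGION 1, REGION 2 and REGION 3 of the MBR, with time instant t1 marked where REGION 1 and REGION 2 are active.]

Example:

$$\mathrm{MINDIST}(Q,R,t_1) = \min\big(\mathrm{MINDIST}(Q,\text{Region1},t_1),\ \mathrm{MINDIST}(Q,\text{Region2},t_1)\big) = \min\big((q_{t_1} - h_1)^2,\ (q_{t_1} - h_3)^2\big) = (q_{t_1} - h_1)^2$$

Over the whole series,

$$\mathrm{MINDIST}(Q,R) = \sqrt{\sum_{t=1}^{n} \mathrm{MINDIST}(Q,R,t)}$$

Lemma 3: MINDIST(Q,R) ≤ D(Q,C) for any time series C under node U.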
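The MINDIST computation can be sketched as below. Two things are assumptions rather than statements from the slides: the distance from q_t to a region is taken as the squared gap to the region's value range (0 if q_t falls inside it), and l0 is taken as 0 for the first region. The MBR in the example was built by hand from two toy series.

```python
def regions(L, H):
    """Return (t_start, t_end, v_low, v_high) for each of the M regions.

    L and H are Python lists holding l1..l2M and h1..h2M; times are 1-based.
    """
    out = []
    for i in range(1, len(L) // 2 + 1):
        t_start = (L[2 * i - 3] if i > 1 else 0) + 1   # l(2i-2) + 1, with l0 = 0
        t_end = H[2 * i - 1]                            # h(2i)
        v_low = L[2 * i - 2]                            # l(2i-1)
        v_high = H[2 * i - 2]                           # h(2i-1)
        out.append((t_start, t_end, v_low, v_high))
    return out


def mindist(q, L, H):
    """MINDIST(Q,R): sum the per-instant minimum over active regions."""
    regs = regions(L, H)
    total = 0.0
    for t, q_t in enumerate(q, start=1):
        gaps = []
        for t_start, t_end, v_low, v_high in regs:
            if t_start <= t <= t_end:                   # region is active at t
                gap = max(v_low - q_t, 0.0, q_t - v_high)
                gaps.append(gap ** 2)
        total += min(gaps)    # a valid MBR has at least one active region per t
    return total ** 0.5


def euclidean(q, s):
    return sum((a - b) ** 2 for a, b in zip(q, s)) ** 0.5


# Two toy series under one node, with segments ending at (3, 8) and (5, 8).
c1 = [1, 1, 1, 9, 9, 9, 9, 8]    # APCA rectangle: L=[1,3,8,8], H=[1,3,9,8]
c2 = [2, 2, 2, 2, 2, 7, 7, 7]    # APCA rectangle: L=[2,5,7,8], H=[2,5,7,8]
L = [1, 3, 7, 8]                 # element-wise minimum of the two L vectors
H = [2, 5, 9, 8]                 # element-wise maximum of the two H vectors
q = [0, 0, 0, 5, 5, 5, 5, 5]

bound = mindist(q, L, H)
print(regions(L, H), bound, euclidean(q, c1), euclidean(q, c2))
```

Here the two regions overlap at time instants 4 and 5, and the smaller of the two gaps is used there, as in the t1 example above.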

Approximate Search

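A simpler definition of the distance in the feature space is the following:

$$\sqrt{\sum_{i=1}^{M} \sum_{k=cr_{i-1}+1}^{cr_i} (cv_i - q_k)^2}$$

where C = {cv1, cr1, …, cvM, crM} is the APCA representation of a data series and $cr_0 = 0$.

But there is one problem… what?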
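One way to probe the question is to compute this simpler distance next to the exact Euclidean distance on a tiny example. The sketch below assumes the APCA representation is a list of (cv_i, cr_i) pairs with 1-based right end points; the function name `d_approx` and the data are made up for illustration.

```python
def euclidean(q, c):
    """Exact distance between the query and the original series."""
    return sum((a - b) ** 2 for a, b in zip(q, c)) ** 0.5


def d_approx(q, c_apca):
    """Distance between q and the piecewise-constant reconstruction of C."""
    total, start = 0.0, 0
    for cv, cr in c_apca:
        total += sum((cv - q_k) ** 2 for q_k in q[start:cr])
        start = cr
    return total ** 0.5


c = [0, 2]
c_apca = [(1.0, 2)]     # one segment with mean 1.0 ending at position 2
q = [0, 2]              # the query is identical to the original series

exact = euclidean(q, c)
approx = d_approx(q, c_apca)
print(exact, approx)
```

In this example the exact distance is 0 while the simpler distance is larger, so a search that prunes on the simpler distance could discard a true answer.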

Multimedia dbs

A multimedia database also stores images

Again similarity queries (content based retrieval)

Extract features, index in feature space, answer similarity queries using GEMINI

Again, average values help!

Images - color

Q: what is an image? A: 2-d array

Images - color

Color histograms, and distance function

Images - color

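Mathematically, the distance function is:

$$d_{hist}^2(x,y) = (x - y)^T A\,(x - y)$$

where x and y are the color histograms of the two images and A is the matrix that weights each pair of color bins.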

Images - color

Problem: ‘cross-talk’: Features are not orthogonal -> SAMs will not work properly

Q: what to do? A: feature-extraction question

Images - color

possible answers: avg red, avg green, avg blue

it turns out that this lower-bounds the histogram distance -> no cross-talk; SAMs are applicable

Images - color

performance:

[Figure: response time against selectivity, for filtering w/ avg RGB and for seq scan.]

Images - shapes

distance function: Euclidean, on the area, perimeter, and 20 ‘moments’

(Q: how to normalize them? A: divide by standard deviation)

(Q: other ‘features’ / distance functions? A1: turning angle A2: dilations/erosions A3: ...)
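The normalization answer above can be sketched as follows: each feature is divided by its standard deviation over the collection, so that features with large raw ranges (such as area) do not dominate the Euclidean distance. The feature values are made up, and the use of the population standard deviation is an assumption.

```python
from statistics import pstdev


def normalize_features(vectors):
    """Divide each feature (column) by its standard deviation over all vectors."""
    columns = list(zip(*vectors))
    # A constant feature has standard deviation 0; leave it unscaled.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [[x / s for x, s in zip(vec, sigmas)] for vec in vectors]


# Hypothetical shape features: (area, perimeter) for three shapes.
shapes = [[100.0, 40.0], [400.0, 80.0], [250.0, 60.0]]
normalized = normalize_features(shapes)
print(normalized)
```

After this step every feature has unit standard deviation, so each contributes on a comparable scale to the distance.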



Q: how to do dim. reduction? A: Karhunen-Loeve (= centered PCA/SVD)

Images - shapes

Performance: ~10x faster

[Figure: log(# of I/Os) against # of features kept, compared with all features kept.]

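Is $d(u,v) = \sqrt{(u-v)^T A (u-v)}$ a metric?

$$x^T A x = \sum_{i,j} x_i x_j A_{ij} = \sum_i \lambda_i x_i^2$$

where $\lambda_i$ is the ith eigenvalue of A and, in the last sum, $x_i$ is the projection of x along the ith eigenvector. Hence

$$d(u,v) = \sqrt{(u-v)^T A (u-v)} = \sqrt{\sum_i \lambda_i (u_i - v_i)^2}$$

$d(u,v) \ge 0$, $d(u,u) = 0$, $d(u,v) = d(v,u)$

$d(u,w) \le d(u,v) + d(v,w)$, provided

$$\sqrt{\sum_i \lambda_i (u_i - w_i)^2} \le \sqrt{\sum_i \lambda_i (u_i - v_i)^2} + \sqrt{\sum_i \lambda_i (v_i - w_i)^2}$$

that is, provided

$$\sqrt{\sum_i (\sqrt{\lambda_i}\,u_i - \sqrt{\lambda_i}\,w_i)^2} \le \sqrt{\sum_i (\sqrt{\lambda_i}\,u_i - \sqrt{\lambda_i}\,v_i)^2} + \sqrt{\sum_i (\sqrt{\lambda_i}\,v_i - \sqrt{\lambda_i}\,w_i)^2}$$

which is the metric condition for the Lp norm (here p = 2), applied to the coordinates scaled by $\sqrt{\lambda_i}$; this requires every $\lambda_i \ge 0$.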
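A quick numerical check of these properties on one example is sketched below. It is not a proof, only an illustration; the matrix A is a made-up symmetric matrix with eigenvalues 1 and 3, so the non-negativity condition on the eigenvalues holds.

```python
def quad_form_distance(u, v, A):
    """d(u,v) = sqrt((u-v)^T A (u-v)) for a symmetric matrix A given as nested lists."""
    z = [a - b for a, b in zip(u, v)]
    Az = [sum(a_ij * z_j for a_ij, z_j in zip(row, z)) for row in A]
    value = sum(z_i * az_i for z_i, az_i in zip(z, Az))
    return max(value, 0.0) ** 0.5    # guard against tiny negative rounding errors


A = [[2.0, 1.0],
     [1.0, 2.0]]                     # symmetric, eigenvalues 1 and 3
u, v, w = [0.0, 0.0], [1.0, 0.0], [1.0, 1.0]

d_uv = quad_form_distance(u, v, A)
d_vw = quad_form_distance(v, w, A)
d_uw = quad_form_distance(u, w, A)
print(d_uv, d_vw, d_uw)
```

On these three points the distance is symmetric, zero from a point to itself, and satisfies the triangle inequality.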

Filtering in QBIC

Histogram column vectors x, y of length n

$\sum_i x_i = 1$, $\sum_i y_i = 1$

Difference $z = x - y$, so $\sum_i z_i = 0$

Contribution of each color bin to a smaller set of colors: $V^T = (c_1, c_2, \ldots, c_n)$, where each $c_i$ is a column vector of length 3

$x_{avg} = V^T x$, $y_{avg} = V^T y$, column vectors of length 3

Filtering in QBIC

Distances

$$d_{avg}^2 = (x_{avg} - y_{avg})^T (x_{avg} - y_{avg}) = (V^T z)^T (V^T z) = z^T V V^T z = z^T W z$$

with $W = V V^T$.

$$d_{hist}^2 = z^T A z$$

$d_{hist}^2 \ge \lambda_1 d_{avg}^2$, where $\lambda_1$ is the smallest eigenvalue of $A' z' = \lambda W' z'$
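The identity for the average-color distance can be checked numerically with a small sketch. The four bin colors and the two histograms are hypothetical, and plain nested lists are used in place of a matrix library.

```python
def mat_vec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x_i for m, x_i in zip(row, x)) for row in M]


# c_1..c_4: assumed RGB color of each histogram bin.
bin_colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 0.0)]
V_T = [list(channel) for channel in zip(*bin_colors)]     # 3 x n, columns are c_i
n = len(bin_colors)
# W = V V^T, so W[i][j] is the dot product of c_i and c_j.
W = [[sum(V_T[k][i] * V_T[k][j] for k in range(3)) for j in range(n)]
     for i in range(n)]

x = [0.5, 0.25, 0.25, 0.0]       # two histograms, each summing to 1
y = [0.25, 0.25, 0.0, 0.5]
z = [a - b for a, b in zip(x, y)]

x_avg = mat_vec(V_T, x)          # average color of each image, length 3
y_avg = mat_vec(V_T, y)
d_avg_sq = sum((a - b) ** 2 for a, b in zip(x_avg, y_avg))
z_W_z = sum(z_i * wz_i for z_i, wz_i in zip(z, mat_vec(W, z)))
print(x_avg, y_avg, d_avg_sq, z_W_z)
```

Both routes give the same number, which is why the 3-dimensional average colors can stand in for the quadratic form in W during filtering.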

Filtering in QBIC

Rewrite z to remove the extra condition that $\sum_i z_i = 0$.

z’ becomes an (n-1)-dimensional column vector

$z^T A z = z'^{T} A' z'$ and $z^T W z = z'^{T} W' z'$

A’ and W’ are (n-1)x(n-1) matrices

Show that $z'^{T} A' z' \ge \lambda_1 z'^{T} W' z'$

Proof of $z'^{T} A' z' \ge \lambda_1 z'^{T} W' z'$

Minimize $z'^{T} A' z'$ wrt z’, subject to the constraint $z'^{T} W' z' = C$.

Same as minimizing, wrt z’, $z'^{T} A' z' - \lambda (z'^{T} W' z' - C)$

Differentiate wrt z’ and set to 0:

$$A' z' = \lambda W' z'$$

λ and z’ must be an eigenvalue and an eigenvector, resp., of $A' z' = \lambda W' z'$

Proof of $z'^{T} A' z' \ge \lambda_1 z'^{T} W' z'$

$z'^{T} A' z' = \lambda z'^{T} W' z' = \lambda C$

To minimize $z'^{T} A' z'$, we must choose the smallest eigenvalue $\lambda_1$.

The minimum of $z'^{T} A' z'$ over z’, subject to the constraint $z'^{T} W' z' = C$, equals $\lambda_1 C$

If $z'^{T} W' z' = C > 0$ then $z'^{T} A' z' \ge \lambda_1 C$

If $z'^{T} W' z' = 0$ then $z'^{T} A' z' \ge 0$, since A’ is positive semi-definite

Documents