Multimedia DBs

A multimedia database stores text, strings and images

Similarity queries (content based retrieval)

Given an image, find the images in the database that are similar (or you can “describe” the query image)

Extract features, index in feature space, answer similarity queries using GEMINI
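A minimal sketch of the filter-and-refine idea behind GEMINI, under stated assumptions: objects are plain numeric vectors, the "expensive" distance is Euclidean over all coordinates, and the hypothetical feature map simply keeps the first k coordinates (which can never exceed the full distance, so nothing is falsely dismissed). All function names are illustrative, and a linear scan stands in for the index.

```python
import math

def full_distance(a, b):
    """'Expensive' distance: Euclidean distance over all coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def extract_features(obj, k=2):
    """Hypothetical feature map: keep the first k coordinates."""
    return obj[:k]

def feature_distance(fa, fb):
    """Distance in feature space; never exceeds full_distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))

def range_query(db, query, eps, k=2):
    """Return indices of objects within eps of the query."""
    fq = extract_features(query, k)
    # Filter step: a linear scan stands in for the spatial index here.
    candidates = [i for i, obj in enumerate(db)
                  if feature_distance(extract_features(obj, k), fq) <= eps]
    # Refine step: discard false alarms with the full distance.
    return [i for i in candidates if full_distance(db[i], query) <= eps]

db = [[0.0, 0.0, 0.0, 0.0],
      [0.1, 0.1, 0.9, 0.9],   # close in feature space, far overall
      [0.1, 0.0, 0.1, 0.0],
      [2.0, 2.0, 2.0, 2.0]]
query = [0.0, 0.0, 0.0, 0.0]
hits = range_query(db, query, eps=0.5)
```

The second object passes the filter but is rejected in the refine step: a false alarm costs extra work, but it never changes the answer.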

Again, average values help!

Image Features

Features extracted from an image are based on:

Color distribution

Shapes and structure

…

Images - color

Q: what is an image? A: a 2-d RGB array

Images - color

Color histograms, and distance function

Images - color

Mathematically, the distance function between a vector x and a query q is:

$$D(x, q) = (x - q)^T A (x - q) = \sum_{i} \sum_{j} a_{ij} (x_i - q_i)(x_j - q_j)$$

A = I?
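To make the role of A concrete, here is a small sketch of the quadratic-form distance. The 3-bin histograms and the similarity matrix are made-up values for illustration only. With A = I the formula reduces to the squared Euclidean distance; with off-diagonal entries, perceptually close bins are penalised less.

```python
def quadratic_form_distance(x, q, A):
    """D(x, q) = (x - q)^T A (x - q) for histograms x, q and matrix A."""
    diff = [xi - qi for xi, qi in zip(x, q)]
    n = len(diff)
    return sum(A[i][j] * diff[i] * diff[j] for i in range(n) for j in range(n))

def identity(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Toy 3-bin histograms (say red, orange, blue); values are made up.
x = [1.0, 0.0, 0.0]   # all red
y = [0.0, 1.0, 0.0]   # all orange
z = [0.0, 0.0, 1.0]   # all blue

# Hypothetical similarity matrix: red and orange are treated as close.
A = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

d_xy_identity = quadratic_form_distance(x, y, identity(3))  # squared Euclidean
d_xy_cross = quadratic_form_distance(x, y, A)   # smaller: red ~ orange
d_xz_cross = quadratic_form_distance(x, z, A)   # unchanged: red vs. blue
```

The off-diagonal entries of A are exactly the "cross-talk" between bins.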

Images - color

Problem: ‘cross-talk’: Features are not orthogonal -> SAMs (spatial access methods) will not work properly

Q: what to do? A: feature-extraction question

Images - color

possible answers: avg red, avg green, avg blue

it turns out that this lower-bounds the histogram distance -> no cross-talk; SAMs are applicable
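A sketch of the avg-RGB feature, assuming an image is a list of rows of (r, g, b) tuples with made-up pixel values. It only shows how the 3-d feature vector and its Euclidean distance are computed; the exact scaling under which this distance lower-bounds the histogram distance is not given in these slides and is not reproduced here.

```python
import math

def avg_rgb(image):
    """Average (R, G, B) of an image given as rows of (r, g, b) pixels."""
    pixels = [p for row in image for p in row]
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def avg_rgb_distance(img_a, img_b):
    """Euclidean distance between the two 3-d average-colour vectors."""
    return math.dist(avg_rgb(img_a), avg_rgb(img_b))

# Two tiny 2x2 images with made-up pixel values in [0, 255].
reddish = [[(200, 10, 10), (220, 30, 10)],
           [(180, 20, 30), (200, 20, 30)]]
bluish = [[(10, 10, 200), (30, 10, 220)],
          [(20, 30, 180), (20, 30, 200)]]

features_red = avg_rgb(reddish)
distance = avg_rgb_distance(reddish, bluish)
```

Each image is reduced to a single 3-d point, so any SAM can index the collection.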

Images - color

performance:

[Plot: time vs. selectivity, comparing ‘w/ avg RGB’ and ‘seq scan’]

Images - shapes

distance function: Euclidean, on the area, perimeter, and 20 ‘moments’

(Q: how to normalize them? A: divide by standard deviation)

(Q: other ‘features’ / distance functions? A1: turning angle A2: dilations/erosions A3: ...)
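A small sketch of the normalization answer above, assuming each shape is a list of feature values (here just a made-up [area, perimeter] pair rather than the full 22 features). Dividing each feature by its standard deviation keeps a large-valued feature such as area from dominating the Euclidean distance.

```python
import math
import statistics

def normalize_by_std(vectors):
    """Divide each feature (column) by its standard deviation."""
    columns = list(zip(*vectors))
    stds = [statistics.pstdev(col) for col in columns]
    # Leave constant features untouched to avoid dividing by zero.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[v / s for v, s in zip(vec, stds)] for vec in vectors]

# Made-up [area, perimeter] values for three shapes.
shapes = [[100.0, 40.0], [400.0, 80.0], [900.0, 120.0]]
normalized = normalize_by_std(shapes)

# Euclidean distance on the normalized features.
d01 = math.dist(normalized[0], normalized[1])
```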



Q: how to do dim. reduction? A: Karhunen-Loeve (= centered PCA/SVD)
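A minimal sketch of the "centered PCA" reading of Karhunen-Loeve, keeping only the first axis. Power iteration is swapped in here for a full SVD to stay dependency-free; it assumes the all-ones starting vector is not orthogonal to the dominant direction. The data points are made up.

```python
import math

def center(points):
    """Subtract the per-dimension mean from every point."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    return [[p[j] - mean[j] for j in range(d)] for p in points]

def covariance(centered):
    """Covariance matrix of already-centered points."""
    n, d = len(centered), len(centered[0])
    return [[sum(p[i] * p[j] for p in centered) / n for j in range(d)]
            for i in range(d)]

def first_principal_axis(points, iterations=200):
    """Dominant eigenvector of the covariance matrix via power iteration."""
    cov = covariance(center(points))
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all points identical: no preferred direction
        v = [x / norm for x in w]
    return v

def project(points, axis):
    """1-d coordinates of the centered points along the given axis."""
    return [sum(x * a for x, a in zip(p, axis)) for p in center(points)]

# Made-up 2-d points lying on the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
axis = first_principal_axis(points)
coords = project(points, axis)
```

Because these points lie on a line, one coordinate per point preserves all pairwise distances.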

Images - shapes

Performance: ~10x faster

[Plot: log(# of I/Os) vs. # of features kept, with the ‘all kept’ case marked]

Dimensionality Reduction

Many problems (like time-series and image similarity) can be expressed as proximity problems in a high dimensional space

Given a query point we try to find the points that are close…

But in high-dimensional spaces things are different!

Effects of High-dimensionality

Assume a uniformly distributed set of points in high dimensions, $[0,1]^d$

Let’s have a query with side length 0.1 in each dimension: query selectivity in 100-d is $10^{-100}$

If we want constant selectivity (0.1), the length of the side must be ~1! ($0.1^{1/100} \approx 0.977$)
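The arithmetic behind these two claims can be checked with a few lines, assuming uniformly distributed points in the unit hypercube and a hypercube-shaped query (function names are illustrative):

```python
def selectivity(side, d):
    """Fraction of [0,1]^d covered by a hypercube query of the given side."""
    return side ** d

def side_for_selectivity(target, d):
    """Side length needed to cover the target fraction of [0,1]^d."""
    return target ** (1.0 / d)

# Print selectivity of a 0.1-side query and the side needed for selectivity 0.1.
for d in (2, 10, 100):
    print(d, selectivity(0.1, d), round(side_for_selectivity(0.1, d), 3))
```

As d grows, a fixed-side query covers a vanishing fraction of the space, while a fixed-selectivity query spans almost the full range of every dimension.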


Surface is everything!

Probability that a uniformly distributed point is closer than 0.1 to a (d-1)-dimensional surface of the unit hypercube, $1 - 0.8^d$: d = 2: 0.36; d = 10: ~0.89; d = 100: ~1
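A short sketch of where these probabilities come from, assuming points uniform in the unit hypercube: a point stays away from every face only if each of its d coordinates falls in [0.1, 0.9].

```python
def prob_near_surface(d, eps=0.1):
    """P(a uniform point in [0,1]^d lies within eps of some face)."""
    # The point avoids every face only if all d coordinates
    # fall in [eps, 1 - eps], which has probability (1 - 2*eps)^d.
    return 1.0 - (1.0 - 2.0 * eps) ** d

probabilities = {d: prob_near_surface(d) for d in (2, 10, 100)}
```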


Number of grid cells and surfaces

Number of k-dimensional surfaces in a d-dimensional hypercube: $\binom{d}{k} 2^{d-k}$

Binary partitioning: $2^d$ cells

Indexing in high dimensions is extremely difficult: the “curse of dimensionality”

Dimensionality Reduction

The main idea: reduce the dimensionality of the space.

Project the d-dimensional points in a k-dimensional space so that:

k << d

distances are preserved as well as possible

Solve the problem in low dimensions (the GEMINI idea of course…)

DR requirements

The ideal mapping should:

1. Be fast to compute: O(N) or O(N log N), but not O(N^2)

2. Preserve distances, leading to small discrepancies

3. Provide a fast algorithm to map a new query (why?)

MDS (multidimensional scaling)

Input: a set of N items, the pair-wise (dis)similarities and the dimensionality k

Optimization criterion:

$$\text{stress} = \left( \frac{\sum_{i,j} \left( D(S_i, S_j) - D(S_i^k, S_j^k) \right)^2}{\sum_{i,j} D(S_i, S_j)^2} \right)^{1/2}$$

where $D(S_i, S_j)$ is the distance between time series $S_i$, $S_j$, and $D(S_i^k, S_j^k)$ is the Euclidean distance of their k-dim representations

Steepest descent algorithm: start with an assignment (time series to k-dim point), then minimize stress by moving points
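A rough sketch of the stress criterion and one plain gradient-descent variant of "moving points". The fixed step size, the iteration count, the starting layout and the 3-4-5 target distances are all assumptions chosen for illustration; practical MDS implementations use more careful step-size control.

```python
import math

def stress(D, Y):
    """Stress of embedding Y (list of k-dim points) against target distances D."""
    n = len(Y)
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            num += (D[i][j] - math.dist(Y[i], Y[j])) ** 2
            den += D[i][j] ** 2
    return math.sqrt(num / den)

def descent_step(D, Y, lr=0.02):
    """One gradient step on the stress numerator (the denominator is constant)."""
    n, k = len(Y), len(Y[0])
    new_Y = []
    for i in range(n):
        grad = [0.0] * k
        for j in range(n):
            if i == j:
                continue
            d = math.dist(Y[i], Y[j])
            if d == 0.0:
                continue  # direction undefined for coincident points
            coeff = -2.0 * (D[i][j] - d) / d
            for c in range(k):
                grad[c] += coeff * (Y[i][c] - Y[j][c])
        new_Y.append([Y[i][c] - lr * grad[c] for c in range(k)])
    return new_Y

# Target distances of a 3-4-5 right triangle (made-up example).
D = [[0.0, 3.0, 4.0],
     [3.0, 0.0, 5.0],
     [4.0, 5.0, 0.0]]
Y = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # arbitrary starting assignment
initial = stress(D, Y)
for _ in range(500):
    Y = descent_step(D, Y)
final = stress(D, Y)
```

Every step touches all pairs of points, which is where the quadratic cost per iteration comes from.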

MDS Disadvantages:

Running time is O(N^2), because of slow convergence

Also it requires O(N) time to insert a new point, not practical for queries

FastMap [Faloutsos and Lin, 1995]

Maps objects to k-dimensional points so that distances are preserved well

It is an approximation of Multidimensional Scaling

Works even when only distances are known

Is efficient, and allows efficient query transformation

FastMap

Find two objects that are far away

Project all points on the line the two objects define, to get the first coordinate
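A sketch of these two steps, using only a distance function. The projection uses the cosine law, $x_i = (d(a,i)^2 + d(a,b)^2 - d(b,i)^2) / (2\, d(a,b))$, where a and b are the two far-apart objects. The pivot choice below is a simple "farthest from an arbitrary start, then farthest from that" heuristic, assumed here for illustration; the sample points are made up.

```python
import math

def choose_pivots(objects, dist):
    """Simple heuristic (an assumption here): pick two objects that are far apart."""
    a = objects[0]
    b = max(objects, key=lambda o: dist(a, o))   # farthest from an arbitrary start
    a = max(objects, key=lambda o: dist(b, o))   # farthest from that one
    return a, b

def first_coordinate(objects, dist):
    """Project every object on the line through the two pivots."""
    a, b = choose_pivots(objects, dist)
    d_ab = dist(a, b)
    if d_ab == 0.0:
        return [0.0 for _ in objects]  # all objects coincide
    # Cosine law: x_i = (d(a,i)^2 + d(a,b)^2 - d(b,i)^2) / (2 d(a,b))
    return [(dist(a, o) ** 2 + d_ab ** 2 - dist(b, o) ** 2) / (2.0 * d_ab)
            for o in objects]

# Made-up points on a line; only their pairwise distances are used.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
coords = first_coordinate(points, math.dist)
```

Only calls to `dist` are needed, so the same code works when the objects are not vectors at all.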

FastMap - next iteration

Results

Documents / cosine similarity -> Euclidean distance (how?)
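One common answer to the "how?", sketched under the assumption that document vectors are scaled to unit length: for unit vectors, $\|u - v\|^2 = 2 - 2\cos\theta$, so a cosine similarity converts directly into a Euclidean distance. The vectors below are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_to_euclidean(cos_sim):
    """Distance between the unit-length versions of the two vectors."""
    # ||u - v||^2 = 2 - 2 cos(theta) when ||u|| = ||v|| = 1.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos_sim)))

u, v = [3.0, 4.0], [4.0, 3.0]
distance = cosine_to_euclidean(cosine_similarity(u, v))
```

Identical directions map to distance 0 and orthogonal documents to $\sqrt{2}$, so more similar documents are always closer.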