Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 6 (book chapter 12): Multimedia IR: Indexing and Searching
Alexander Gelbukh
www.Gelbukh.com




Previous Chapter: Conclusions

Basically, images are handled as the text that describes them
  Namely, feature vectors (or feature hierarchies)
  Context can be used when available to determine features
Also, queries by example are common
From the point of view of DBMS, integration with IR and multimedia-specific techniques is needed
  Object-oriented technology is adequate


Previous Chapter: Research topics

How can a similarity function be defined?
What features of images (video, sound) are there?
How to better specify the importance of individual features?
  (Give me similar houses: similar = size? color? structure? architectural style?)
How to determine the objects in an image?
Integration with DBMSs and SQL for fast access and rich semantics
Integration with XML
Ranking: by similarity, taking into account history, profile


The problem

Data examples:
  2D/3D color/grayscale images: e.g., brain scans, scientific databases of vector fields
  (2D) video, (1D) voice/music
  (1D) time series: e.g., financial/marketing time series; DNA/genomic databases
Query examples:
  find photographs with the same color distribution as this
  find companies whose stock prices move as this one
  find brain scans with a texture of a tumor
Applications: search; data mining


Solution

Reduce the problem to search for multi-dimensional points (feature vectors, but vector space is not used)
Define a distance measure
  for time series: e.g., Euclidean distance between vectors
  for images: e.g., color distribution (Euclidean distance); another approach: mathematical morphology
Other features as vectors
For search within distance, the vectors are organized in R-trees
Clustering plays an important role


Types of queries

All within given distance
  Find all images that are within 0.05 distance from this one
Nearest-neighbor
  Find 5 stocks most similar to IBM
All pairs within given distance
  Further: clustering
Whole object vs. sub-pattern match
  Find parts of image that are...
  E.g., in 512 × 512 brain scans, find pieces similar to the given 16 × 16 typical X-ray of a tumor
  Like passage retrieval for text documents
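The first three query types can be written down as brute-force scans, which is the baseline the later slides try to beat. This is only a sketch: the tiny 2-D database `db`, the function names, and the use of Euclidean distance are assumptions made for illustration.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(db, q, eps):
    """All objects within distance eps of the query q."""
    return [name for name, v in db.items() if euclidean(v, q) <= eps]

def nearest_neighbors(db, q, k):
    """The k objects most similar to the query q."""
    return sorted(db, key=lambda name: euclidean(db[name], q))[:k]

def all_pairs(db, eps):
    """All pairs of objects within distance eps of each other."""
    return [(a, b) for a, b in combinations(sorted(db), 2)
            if euclidean(db[a], db[b]) <= eps]

# Hypothetical database of 2-D feature points.
db = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (0.0, 1.0), "D": (10.0, 10.0)}
query = (0.0, 0.0)

in_range = range_query(db, query, 1.0)
closest = nearest_neighbors(db, query, 2)
pairs = all_pairs(db, 1.0)
```

Each function touches every object (or every pair), so the cost grows with the size of the database; the indexing methods below aim to return the same answers without the full scan.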


Neighbor and pairs types of queries

The objects are organized in R-trees
For neighbor queries: branch-and-bound algorithm
For pairs: recently discovered algorithms
These types of queries are not discussed here


Desiderata for a method

Fast
  No sequential search with all objects
Correct
  100% recall
  Precision is less important, though false alarms should be kept few
  False alarms are easy to discard in a post-processing step
Little space overhead
Dynamic
  easy to insert, delete, update


Types of methods

Linear quadtrees
  Complexity = hypersurface of the query region
  Grows exponentially with dimensionality
Grid files
  Complexity grows exponentially with dimensionality
R-tree methods, such as R*-trees
  Most used due to lower complexity


R-tree

Objects and parts of images are represented as Minimal Bounding Rectangles (MBRs)
  Can overlap for different objects
Larger objects contain smaller objects
  MBRs are nested
MBRs are arranged into a tree
In storage, an index of disk blocks is maintained
  Disk blocks are fetched at once at hardware level
For better insertion/deletion, tight MBRs are needed
  Good clustering is needed


File structure of R-tree

Nodes correspond to disk blocks
Fanout = 3: number of parts to group


R-tree (figure)


Search in R-tree

Range queries: find objects within a given distance from the query object
  = Find MBRs that intersect with the query's MBR
  Determine the MBR of the query
  Descend the tree, discarding all MBRs that do not intersect with the query's MBR
Many variations of the R-tree method have been proposed
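The descent can be sketched on a hand-built tree. This is not a full R-tree (no insertion, splitting, or disk blocks); the `Node` class, the helper names, and the six example rectangles are hypothetical, with fanout 3 chosen only to mirror the file-structure slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def intersects(a: Rect, b: Rect) -> bool:
    """True if two axis-aligned rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def enclose(rects: List[Rect]) -> Rect:
    """Minimal bounding rectangle of a list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)
    label: Optional[str] = None  # set only for leaf entries (actual objects)

def leaf(label: str, rect: Rect) -> Node:
    return Node(mbr=rect, label=label)

def parent(children: List[Node]) -> Node:
    return Node(mbr=enclose([c.mbr for c in children]), children=children)

def range_search(node: Node, query: Rect) -> List[str]:
    """Descend the tree, discarding subtrees whose MBR misses the query MBR."""
    if not intersects(node.mbr, query):
        return []                      # prune the whole subtree
    if node.label is not None:
        return [node.label]            # leaf entry whose MBR meets the query
    hits: List[str] = []
    for child in node.children:
        hits.extend(range_search(child, query))
    return hits

left = parent([leaf("a", (0.0, 0.0, 1.0, 1.0)),
               leaf("b", (2.0, 0.0, 3.0, 1.0)),
               leaf("c", (1.0, 2.0, 2.0, 3.0))])
right = parent([leaf("d", (6.0, 6.0, 7.0, 7.0)),
                leaf("e", (8.0, 6.0, 9.0, 8.0)),
                leaf("f", (7.0, 8.0, 8.0, 9.0))])
root = parent([left, right])

found = range_search(root, (1.5, 0.5, 2.5, 2.5))
```

Here the whole `right` subtree is rejected with a single rectangle test. For a query of the form "within distance eps of a point", the query MBR is the square of side 2·eps around the point, so the hits are candidates that still need an exact distance check.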


Indexing

Only whole-match queries are considered here
  Given a collection of objects and a distance function
  Find objects within a given distance from a given object Q
Problems:
  1. Slow comparison of two objects
  2. Huge database
GEMINI approach
  GEneric Multimedia object INdexIng
  Attempts to solve both problems


GEMINI indexing

Quick-and-dirty test to quickly discard bad objects
Uses clusters to avoid sequential search
Quick test
  Single-valued feature, e.g., average for series
  Averages differ much ⇒ objects differ much
  Not vice versa. False alarms are OK
  Several features, but fewer than all data. E.g., deviation for series


Algorithm

Map the actual objects into f-dimensional feature space
Use clusters (e.g., R-trees) to search
Retrieve objects, compute the actual distances, and discard false alarms


Feature selection

Features should reflect distances
Allow no misses (100% recall)
  features should make things look closer
Lower Bound lemma: if distance in feature space ≤ actual distance, then 100% recall
  (we speak about whole-match queries)
  Holds for distance search, nearest-neighbor, pair search
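The lemma follows in one step. The notation is assumed here and does not appear on the slides: D is the actual distance, D_f the distance in feature space, F the feature mapping, and ε the query radius.

```latex
\[
D_f\bigl(F(O), F(Q)\bigr) \le D(O, Q) \quad \text{for all objects } O, Q
\]
\[
D(O, Q) \le \varepsilon
\;\Longrightarrow\;
D_f\bigl(F(O), F(Q)\bigr) \le D(O, Q) \le \varepsilon
\]
% Example: for series x, y of length n with averages \bar{x}, \bar{y},
% the Cauchy-Schwarz inequality gives
\[
\sqrt{n}\,\lvert \bar{x} - \bar{y} \rvert
= \frac{1}{\sqrt{n}} \Bigl\lvert \sum_{i=1}^{n} (x_i - y_i) \Bigr\rvert
\le \Bigl( \sum_{i=1}^{n} (x_i - y_i)^2 \Bigr)^{1/2}
\]
```

So every object that qualifies under the actual distance also qualifies in feature space and cannot be missed; the price is that some non-qualifying objects may pass as well. The last line shows that the average, scaled by √n, is a valid single feature for the Euclidean distance.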


Algorithm (more detail)

Determine distance
Choose features
Prove that distance in feature space ≤ distance for actual objects
Use quick method (R-tree) to search in feature space
For found objects, compute the actual distances (this can be expensive)
Discard false alarms
  objects with greater actual distances, even if in feature space the distance is OK
  Example: similar averages, but different series
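The filter-and-refine steps can be sketched for a whole-match range query over short series. The assumptions are labeled: the function names and the four-series database are hypothetical, the single feature is the average scaled by sqrt(n) (which keeps the feature distance below the Euclidean distance), and a linear scan stands in for the R-tree search.

```python
import math

def euclidean(a, b):
    """Actual (expensive) distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def feature(series):
    """1-D feature: the average scaled by sqrt(n), so that the feature
    distance never exceeds the Euclidean distance."""
    n = len(series)
    return math.sqrt(n) * (sum(series) / n)

def gemini_range_query(db, query, eps):
    """Return (answers, false_alarms) for a whole-match range query."""
    fq = feature(query)
    # Quick-and-dirty filter in feature space (a linear scan here;
    # an R-tree over the features would take its place).
    candidates = [name for name, s in db.items()
                  if abs(feature(s) - fq) <= eps]
    # Refinement: compute the actual distances and discard false alarms.
    answers = [c for c in candidates if euclidean(db[c], query) <= eps]
    false_alarms = [c for c in candidates if c not in answers]
    return answers, false_alarms

# Hypothetical database of short series.
db = {
    "flat":   [5.0, 5.0, 5.0, 5.0],
    "near":   [5.0, 5.5, 4.5, 5.0],
    "zigzag": [1.0, 9.0, 1.0, 9.0],  # same average as "flat", different series
    "high":   [9.0, 9.0, 9.0, 9.0],
}
query = [5.0, 5.0, 5.0, 5.0]
answers, false_alarms = gemini_range_query(db, query, eps=1.0)
```

In this run "high" is rejected by the feature test alone, while "zigzag" passes it (similar averages) and is discarded only after the actual distance is computed.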


Discussion

The method does NOT improve quality
  Provides the SAME quality as sequential search, but faster
Distance definition requires a domain/application expert
  How much do the two images differ?
  What is important/unimportant for the specific application?
Feature selection requires a good knowledge engineer
  Choose the most characteristic feature: discriminative
  If needed, choose the second best, etc.
  Good features should be orthogonal: their combination adds information


Example: Time series

In yearly stock movements, find ones similar to IBM
Distance: Euclidean (365-D vectors); others exist
Features:
  First feature is average
  If needed, Discrete Fourier Transform (DFT) coefficients
  Or Discrete Cosine Transform, wavelet transform, etc.
Lower-bound lemma:
  Parseval theorem: DFT preserves distances (DCT, WT too)
  First several coefficients give a distance ≤ the actual distance
  Transforms "concentrate energy" in the first coefficients
  Thus, a more realistic prediction of the distance
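A small numerical check of both claims, using a plain O(n²) DFT so that only the standard library is needed. The two 8-point series and the function names are made up for illustration, and the 1/sqrt(n) scaling is an assumption chosen so that Parseval's theorem holds without extra constants.

```python
import cmath
import math

def dft(series):
    """Orthonormal DFT (scaled by 1/sqrt(n))."""
    n = len(series)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(series)) / math.sqrt(n)
            for k in range(n)]

def dist(a, b):
    """Euclidean distance; works for real and complex vectors."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

def features(series, k):
    """Keep only the first k DFT coefficients as the feature vector."""
    return dft(series)[:k]

# Two hypothetical 8-point series.
x = [1.0, 2.0, 4.0, 7.0, 8.0, 7.0, 5.0, 3.0]
y = [2.0, 2.0, 3.0, 6.0, 9.0, 8.0, 4.0, 2.0]

actual = dist(x, y)
full = dist(dft(x), dft(y))  # all coefficients: same distance (Parseval)
lower = [dist(features(x, k), features(y, k)) for k in range(1, 9)]
```

Each entry of `lower` stays at or below `actual`, and the values grow as more coefficients are kept. With this scaling the first coefficient is sqrt(n) times the average, which ties in with "first feature is average".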


Time series: Applications

Such feature selection is effective for many skewed spectrum distributions
Colored noises: the energy decreases as F^(-b)
  b = 0: white noise: unpredictable. Method useless.
  b = 1: pink noise: works of art
  b = 2: brown noise: stock movements
  b > 2: black noise: river levels, rainfall patterns
The greater b, the better the first coefficients of the transform predict the actual distance
Some other n-D signals show similar properties
  JPEG compression ignores higher coefficients
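The effect of b can be illustrated with an idealized calculation. It assumes that the energy at frequency f is exactly proportional to f^(-b) and ignores the zero-frequency term; the numbers n = 182 and k = 3 are hypothetical choices, not values from the lecture.

```python
def energy_fraction(b, k, n):
    """Fraction of the total energy carried by the first k of n frequencies
    when the energy at frequency f is proportional to f ** -b."""
    energy = [f ** -b for f in range(1, n + 1)]
    return sum(energy[:k]) / sum(energy)

n, k = 182, 3  # hypothetical: 182 positive frequencies, 3 coefficients kept
for b, name in [(0, "white"), (1, "pink"), (2, "brown"), (3, "black")]:
    print(f"b={b} ({name}): {energy_fraction(b, k, n):.3f}")
```

For b = 0 the first k coefficients carry only k/n of the energy, so they say almost nothing about the actual distance; the fraction rises steadily as b grows.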


Time series: Performance

Fewer features ⇒ more false alarms ⇒ time lost
More features ⇒ more complex computation
Optimal number of features proves to be about 1..3 for skewed enough distributions
  JPEG compression shows that photographs have such distributions


Time series: Sub-pattern search

Use sliding window
Encode each window with few features
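A minimal sketch of the sliding-window encoding. The choice of mean and standard deviation as the "few features" follows the average and deviation mentioned earlier for series; the function name, window length, and example series are hypothetical.

```python
def window_features(series, w):
    """Slide a window of length w over the series and encode each window
    by two features: its mean and its standard deviation."""
    feats = []
    for start in range(len(series) - w + 1):
        window = series[start:start + w]
        mean = sum(window) / w
        var = sum((x - mean) ** 2 for x in window) / w
        feats.append((start, mean, var ** 0.5))
    return feats

# Hypothetical series with a bump in the middle.
series = [1.0, 1.0, 1.0, 5.0, 9.0, 5.0, 1.0, 1.0]
feats = window_features(series, w=3)
```

Each window becomes one point in a low-dimensional feature space, tagged with its start offset, so a sub-pattern query reduces to the same kind of point search as a whole-match query.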


Example: Color images

Give me images with a texture of tumor like this one
Give me images with blue at top and red at bottom
Handles color, texture, shape, position, dominant edges


Color images: Color representation

Compute color histogram
Distance: use color similarity matrix
  Very expensive computationally: cross-talk between features (compare all to all features)
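One common way to use a color similarity matrix is a quadratic-form distance; that this is the exact form meant on the slide is an assumption. The 3-bin histograms and the entries of the matrix `A` below are invented for illustration.

```python
import math

def quadratic_distance(x, y, A):
    """sqrt((x - y)^T A (x - y)) for histograms x, y and similarity matrix A."""
    d = [xi - yi for xi, yi in zip(x, y)]
    total = sum(d[i] * A[i][j] * d[j]
                for i in range(len(d)) for j in range(len(d)))
    return math.sqrt(total)

# Hypothetical bins: (red, orange, blue). Red and orange are rated similar.
A = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

red = [1.0, 0.0, 0.0]
orange = [0.0, 1.0, 0.0]
blue = [0.0, 0.0, 1.0]

d_red_orange = quadratic_distance(red, orange, A)
d_red_blue = quadratic_distance(red, blue, A)
```

The off-diagonal entries are the cross-talk: every bin is compared with every other bin, so one comparison costs on the order of k² operations for k colors. With the identity matrix in place of `A` the measure reduces to the Euclidean distance, under which an all-red image would be equally far from an all-orange and an all-blue one.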


Color images: Feature mapping

The GEMINI question again: what single feature is the most representative?
  Take average R, G, B
Lower bound? Yes: Quadratic Distance Bounding theorem
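The feature mapping itself is short. In this sketch an image is assumed to be a flat list of (R, G, B) tuples, and the 4-pixel image is hypothetical. The constant that the Quadratic Distance Bounding theorem puts between the two distances is not given on the slides, so it is not reproduced here.

```python
def average_rgb(pixels):
    """Map an image (a list of (R, G, B) tuples) to a 3-D feature vector."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

# Hypothetical 4-pixel image: half red, half blue.
image = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 0, 255)]
avg = average_rgb(image)
```

Whatever the number of histogram bins, each image is reduced to three numbers, which is few enough to index with an R-tree.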


Automatic feature selection

Features can be selected automatically
  In texts: Latent Semantic Indexing (LSI)
  Many methods: principal component analysis (= LSI), ...
In fact, they can reduce features, but not define them
  Of colors, one can select characteristic combinations
  But not classify into faces and flowers
So description of the objects is still up to human researchers


Research topics

Object detection (pattern and image recognition)
Automatic feature selection
Spatial indexing data structures (more than 1D)
New types of data
  What features to select? How to determine them?
Mixed-type data (e.g., webpages, or images with sound and description)
What clustering/IR methods are better suited for what features? (What features for what methods?)
Similar methods in data mining, ...


Conclusions

How to accelerate search? Same results as sequential
Ideas:
  Quick-and-dirty rejection of bad objects, 100% recall
  Fast data structure for search (based on clustering)
  Careful check of all found candidates
Solution: mapping into fewer-D feature space
  Condition: lower-bounding of the distance
  Assumption: skewed spectrum distribution
  Few coefficients concentrate energy, rest are less important


Thank you!
Till Tuesday 11, 6 pm