CS 766: Computer Vision, Computer Sciences Department, University of Wisconsin-Madison. Indexing and Retrieval. James Hill, Ozcan Ilikhan, Mark Lenz, {jshill4, ilikhan, mlenz}@cs.wisc.edu

Page 1: Indexing and Retrieval

CS 766: Computer Vision

Computer Sciences Department, University of Wisconsin-Madison

Indexing and Retrieval

James Hill, Ozcan Ilikhan, Mark Lenz

{jshill4, ilikhan, mlenz}@cs.wisc.edu

1

Page 2: Indexing and Retrieval

Presentation Outline

1- Introduction

2- Common methods used in the papers

* SIFT descriptor

* k-means clustering

* TF-IDF weight

3- Video Google

4- Scalable Recognition with a Vocabulary Tree

5- City-Scale Location Recognition

2

Page 3: Indexing and Retrieval

Introduction

Find identical objects in multiple images

Difficulties with changes in

– Scale

– Orientation

– Viewpoint

– Lighting

Search time and storage space

3

Page 4: Indexing and Retrieval

Indexing and Retrieval

Common Solutions

Invariant features (e.g. SIFT)

kd-trees

Best Bin First

4

Page 5: Indexing and Retrieval

SIFT - Scale-Invariant Feature Transform

Key Steps

5

1) Difference of Gaussians in scale space

2) Maxima and minima are feature points

3) Remove low-contrast and non-robust edge points

4) Assign each point an orientation

5) Create a descriptor from windowed region
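A minimal sketch of step 1 only, assuming a plain NumPy separable blur. The function names, the toy image, and the scale ratio of 1.6 are illustrative choices, not values taken from the slides. A single bright pixel produces its strongest (most negative) difference-of-Gaussians response at its own location, which is the kind of extremum step 2 looks for.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, normalized to sum to 1."""
    radius = int(3 * sigma) + 1
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(image, sigma):
    """Separable Gaussian blur (zero padding at the borders)."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def difference_of_gaussians(image, sigma, ratio=1.6):
    """DoG response between two neighboring scales (ratio is an assumed value)."""
    return blur(image, sigma * ratio) - blur(image, sigma)

# Toy image: one bright pixel in the middle of a dark frame.
image = np.zeros((21, 21))
image[10, 10] = 1.0

dog = difference_of_gaussians(image, sigma=1.0)
extremum = np.unravel_index(np.argmin(dog), dog.shape)
```

A full SIFT detector would repeat this over several scales and octaves and compare each sample with its neighbors in both space and scale.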

Page 6: Indexing and Retrieval

SIFT - Scale-Invariant Feature Transform

Key Benefits

6

Feature points invariant to scale and translation

Orientations provide invariance to rotation

Distinctive descriptors are partially invariant to changes in illumination and viewpoint

Robust to background clutter and occlusion

Page 7: Indexing and Retrieval

k-means clustering

Motivation (what are we trying to do)

We want to develop a method for finding the centers of different clusters in a set of data.

7

Page 8: Indexing and Retrieval

k-means clustering

8

Page 9: Indexing and Retrieval

k-means clustering

9

Page 10: Indexing and Retrieval

k-means clustering

10

Page 11: Indexing and Retrieval

k-means clustering

11

Page 12: Indexing and Retrieval

k-means clustering

How do we find these means?

We need to perform a minimization on:

12

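$$\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

where $S_i$ is the i-th cluster and $\mu_i$ is the mean of the points in $S_i$.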
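The minimization can be sketched with the usual alternating procedure (Lloyd's algorithm): assign every point to its nearest mean, then recompute each mean. This is a hedged illustration; the function name, the toy data, and the choice of initial centers are assumptions, not part of the slides.

```python
import numpy as np

def kmeans(points, centers, iters=50):
    """Lloyd's algorithm. Returns (centers, labels, objective)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(iters):
        # Squared distance from every point to every center.
        sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        new_centers = centers.copy()
        for i in range(len(centers)):
            members = points[labels == i]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[i] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and the within-cluster sum of squared distances.
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)
    objective = float(sq_dist.min(axis=1).sum())
    return centers, labels, objective

points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
# Assumed initialization: one point from each visible group.
centers, labels, objective = kmeans(points, points[[0, 3]])
```

The result depends on the initial centers, so practical implementations usually try several initializations and keep the one with the lowest objective.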

Page 13: Indexing and Retrieval

k-means clustering

How do we extend this?

With Hierarchical k-means Clustering!

13

Page 14: Indexing and Retrieval

k-means clustering

14

Page 15: Indexing and Retrieval

k-means clustering

15

Page 16: Indexing and Retrieval

k-means clustering

16

Page 17: Indexing and Retrieval

k-means clustering

Now that we can cluster our data, how can we use this information to quickly find the closest vector in our data given some test vector?

17

Page 18: Indexing and Retrieval

k-means clustering

We will build a vocabulary tree using this clustering method.

Each vector in our data (including the means) will be considered a “word” in our vocabulary.

We will build a tree using the means of our data.
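One way to picture the construction, as a sketch under stated assumptions: run k-means on all vectors, then recurse on each resulting cluster until a chosen depth is reached. The dictionary-based node layout, the deterministic "first k points" initialization, and the toy 1-D data are illustrative only.

```python
import numpy as np

def kmeans_labels(points, k, iters=20):
    """Plain k-means; initial centers are the first k points (assumed for determinism)."""
    centers = points[:k].copy()
    for _ in range(iters):
        sq = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq.argmin(axis=1)
        for i in range(k):
            if np.any(labels == i):
                centers[i] = points[labels == i].mean(axis=0)
    sq = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq.argmin(axis=1)

def build_tree(points, k, depth):
    """Each node stores the mean of its points; children come from k-means."""
    node = {"center": points.mean(axis=0), "children": []}
    if depth > 0 and len(points) >= k:
        labels = kmeans_labels(points, k)
        for i in range(k):
            subset = points[labels == i]
            if len(subset):
                node["children"].append(build_tree(subset, k, depth - 1))
    return node

def leaf_centers(node):
    """Collect the centers stored at the leaves."""
    if not node["children"]:
        return [node["center"]]
    return [c for child in node["children"] for c in leaf_centers(child)]

# Toy data: four tight groups on a line.
data = np.array([[0.0], [1.0], [10.0], [11.0], [100.0], [101.0], [110.0], [111.0]])
tree = build_tree(data, k=2, depth=2)
leaves = sorted(float(c[0]) for c in leaf_centers(tree))
```

With branching factor 2 and depth 2 the toy tree ends up with four leaves, one per group.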

18

Page 19: Indexing and Retrieval

k-means clustering

19

Page 20: Indexing and Retrieval

k-means clustering

20

Page 21: Indexing and Retrieval

k-means clustering

21

Page 22: Indexing and Retrieval

TF-IDF

Term frequency–inverse document frequency (tf–idf) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.

A standard weight often used in information retrieval and text mining.

22

Page 23: Indexing and Retrieval

TF-IDF

23

$n_{id}$: the number of occurrences of word i in document d.

$n_d$: the total number of words in document d.

$N_i$: the number of documents containing term i.

$N$: the total number of documents in the whole database.

Page 24: Indexing and Retrieval

TF-IDF

24

$$t_i = \underbrace{\frac{n_{id}}{n_d}}_{\text{word frequency}} \times \underbrace{\log \frac{N}{N_i}}_{\text{inverse document frequency}}$$

Each document is represented by a vector

Then vectors are organized as an inverted file.
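A small sketch of both ideas, assuming the weight is the word frequency (occurrences of word i in document d over the words in d) multiplied by the logarithm of the total number of documents over the number of documents containing word i. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(documents):
    """One sparse tf-idf vector (dict word -> weight) per document."""
    n_docs = len(documents)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

def inverted_file(documents):
    """Map each word to the list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id, doc in enumerate(documents):
        for word in sorted(set(doc)):
            index[word].append(doc_id)
    return dict(index)

documents = [["a", "b", "a"], ["b", "c"]]
vectors = tfidf_vectors(documents)
index = inverted_file(documents)
```

The word "b" appears in every document, so its weight is zero, while the inverted file lets a query for "a" jump straight to document 0 without scanning the others.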

Page 25: Indexing and Retrieval

TF-IDF

25

Image credit: http://www.lovdata.no/litt/hand/hand-1991-2.html

Page 26: Indexing and Retrieval

Video Google

26

A Text Retrieval Approach to Object Matching in Videos

Josef Sivic and Andrew Zisserman

Visual Geometry Group, Department of Engineering Science

University of Oxford, United Kingdom

Proceedings of the International Conference on Computer Vision (2003)

Page 27: Indexing and Retrieval

Video Google

27

Efficient Visual Search of Videos Cast as Text Retrieval

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 31, Number 4, pages 591-606, 2009

Fundamental idea of paper:

Retrieve key frames and shots of a video containing a particular object with the ease, speed, and accuracy with which Google retrieves text documents (web pages) containing particular words.

Page 28: Indexing and Retrieval

Video Google

28

Recall Text Retrieval (preprocessing)

1. Parse documents into words

2. Stemming: “walk” = {“walk”, “walking”, “walks”, …}

3. Stop list to reject very common words, such as “the” and “an”.

4. Each document is represented by a vector with components given by the frequency of occurrence of the words the document contains

5. Store vector in an inverted file.

Page 29: Indexing and Retrieval

Video Google

29

Can we treat video the same way?

What and where are the words of a video?

Page 30: Indexing and Retrieval

Video Google

30

The Video Google algorithm:

a) Pre-processing (off-line):

1. Detect affine covariant regions in each key frame of video

2. Reject unstable regions

3. Build visual vocabulary

4. Remove stop-listed words

5. Compute weighted document frequency

6. Build the index (inverted file)

Page 31: Indexing and Retrieval

Video Google

31

Building a Visual Vocabulary

Step 1. Calculate viewpoint invariant regions:

Shape Adapted (SA) regions: centered on corner-like features

Maximally Stable (MS) regions: correspond to blobs of high contrast with respect to their surroundings, such as a dark window on a gray wall.

Each region is represented by a 128-dimensional vector using the SIFT descriptor

A 720 x 576 pixel video frame yields ≈ 1200 regions

Page 32: Indexing and Retrieval

Video Google

32

Page 33: Indexing and Retrieval

Video Google

33

Step 2. Reject unstable regions:

Any region that does not survive for more than 3 frames is rejected.

This “stability check” significantly reduces the number of regions to about 600 per frame.

Page 34: Indexing and Retrieval

Video Google

34

Step 3. Build Visual Vocabulary:

Use K-Means clustering to vector quantize descriptors into clusters

Mahalanobis distance:
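A minimal sketch of the distance, assuming the standard definition d(x1, x2) = sqrt((x1 - x2)^T Σ^{-1} (x1 - x2)), where Σ is the covariance matrix of the descriptors. The function name and the toy 2-D covariance are illustrative.

```python
import numpy as np

def mahalanobis(x1, x2, cov):
    """Mahalanobis distance between two vectors for a given covariance matrix."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    # Solve cov @ y = diff instead of forming the explicit inverse.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Toy covariance: the first dimension varies four times as much as the second.
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
d_x = mahalanobis([0, 0], [2, 0], cov)  # step along the high-variance axis
d_y = mahalanobis([0, 0], [0, 2], cov)  # same step along the low-variance axis
```

The same Euclidean step counts for less along a dimension that naturally varies more, which is why this distance is used instead of the plain Euclidean one when clustering descriptors.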

Page 35: Indexing and Retrieval

Video Google

35

Step 4. Remove stop-listed visual words:

The most frequent visual words, which occur in almost all images (such as highlights that occur in many frames), are rejected.
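The stop list can be sketched as a frequency cut. The fraction of words to suppress is not given on the slide, so `top_fraction` is a hypothetical parameter, and the integer word ids in the toy frames are made up for illustration.

```python
from collections import Counter

def build_stop_list(frames, top_fraction):
    """Return the most frequent visual words (assumed cut: a fraction of the vocabulary)."""
    counts = Counter(word for frame in frames for word in frame)
    n_stop = int(len(counts) * top_fraction)
    return {word for word, _ in counts.most_common(n_stop)}

def apply_stop_list(frames, stop_words):
    """Drop stop-listed words from every frame."""
    return [[w for w in frame if w not in stop_words] for frame in frames]

# Toy frames: word 7 shows up everywhere, like a highlight.
frames = [[7, 7, 3, 1], [7, 2, 7], [7, 4, 5, 7]]
stop_words = build_stop_list(frames, top_fraction=0.2)
filtered = apply_stop_list(frames, stop_words)
```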

Page 36: Indexing and Retrieval

Video Google

36

Step 5. Compute tf-idf weighted document frequency vector:

Variations of tf-idf may be used.

Step 6. Build inverted-file indexing structure:

Page 37: Indexing and Retrieval

Video Google

37

The Video Google algorithm:

b) Run-time (on-line):

1. Determine the set of visual words within the query region

2. Retrieve keyframes based on visual word frequencies

3. Re-rank the top keyframes using spatial consistency

Page 38: Indexing and Retrieval

Video Google

38

Spatial consistency:

Matched covariant regions in the retrieved frames should have a similar spatial arrangement to those of the outlined region in the query image.

Page 39: Indexing and Retrieval

Video Google

39

How it works:

Query region and its close-up.

Page 40: Indexing and Retrieval

Video Google

40

How it works:

Original matches based on visual words

Page 41: Indexing and Retrieval

Video Google

41

How it works:

Original matches based on visual words

Page 42: Indexing and Retrieval

Video Google

42

How it works:

Matches after using the stop-list

Page 43: Indexing and Retrieval

Video Google

43

How it works:

Final set of matches after filtering on spatial consistency

Page 44: Indexing and Retrieval

Video Google

44

Page 45: Indexing and Retrieval

Video Google

45

Page 46: Indexing and Retrieval

Video Google

46

Real-time demo

Page 47: Indexing and Retrieval

CS 766: Computer Vision

Computer Sciences Department, University of Wisconsin-Madison

Scalable Recognition With a Vocabulary Tree

James Hill, Ozcan Ilikhan, Mark Lenz

{jshill4, ilikhan, mlenz}@cs.wisc.edu

47

Page 48: Indexing and Retrieval

The Paper

Scalable Recognition with a Vocabulary Tree

David Nister and Henrik Stewenius

Center for Visualization and Virtual Environments

Department of Computer Science, University of Kentucky

Published in 2006

Appeared in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition

48

Page 49: Indexing and Retrieval

What are we trying to do?

Provide an indexing scheme that:

Scales to large image databases (1 million).

Retrieves images in an acceptable amount of time.

49

Page 50: Indexing and Retrieval

Inspiration

Sivic and Zisserman (what you just saw)

Used k-means to partition the descriptors in several pictures.

Used TF-IDF to score an image and find a close match.

50

Page 51: Indexing and Retrieval

What’s new?

The idea of a vocabulary tree.

Using a larger vocabulary tree speeds things up and improves match quality

Can use many more training images (35000 vs 400)

Can insert new images into the Database quickly (0.2s vs 10s)

51

Page 52: Indexing and Retrieval

How do we do it?

Follow these three steps:

1. Build the vocabulary tree using the image descriptors.

2. Generate a score for a given query image.

3. Find the images in the database that best match that score.

52

Page 53: Indexing and Retrieval

Recap the Vocabulary Tree

1. For each image in our database, we calculate a set of feature point descriptors.

2. Each of these descriptors is a vector of numbers which exists in some space (128 dimensions).

3. Consider each of these vectors to be a “word” in the vocabulary of our database.

53

Page 54: Indexing and Retrieval

Recap the Vocabulary Tree

Build the vocabulary tree using hierarchical k-means clustering.

54

Page 55: Indexing and Retrieval

Recap the Vocabulary Tree

55

Page 56: Indexing and Retrieval

Recap the Vocabulary Tree

56

Page 57: Indexing and Retrieval

Recap the Vocabulary Tree

57

Page 58: Indexing and Retrieval

What’s it good for?

Now that we have a vocabulary tree, we can generate a path down the vocabulary tree which is stored in an integer for scoring.

At each level of the tree, the descriptor is compared to each of the k children using a dot product. The closest is the path that is followed.
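The descent can be sketched as follows, assuming all vectors are normalized to unit length so that the largest dot product identifies the closest child. The node layout, the hand-built toy tree, and the base-k packing of the path into an integer are illustrative assumptions.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def node(center, children=()):
    """Hypothetical tree node: a unit-length center plus child nodes."""
    return {"center": unit(center), "children": list(children)}

def descend(root, descriptor, k):
    """Follow the best child at every level; pack the choices into one integer."""
    descriptor = unit(descriptor)
    path, current = 0, root
    while current["children"]:
        scores = [float(descriptor @ child["center"]) for child in current["children"]]
        best = int(np.argmax(scores))
        path = path * k + best  # one base-k digit per level
        current = current["children"][best]
    return path

# Toy tree with branching factor 2 and two levels below the root.
root = node([1, 1], [
    node([1, 0], [node([1, -0.5]), node([1, 0.2])]),
    node([0, 1], [node([0.2, 1]), node([-0.5, 1])]),
])
path_a = descend(root, [0.1, 0.9], k=2)  # second child, then first child
path_b = descend(root, [1.0, 0.0], k=2)  # first child, then second child
```

Because every descriptor in a tree of fixed depth produces the same number of digits, the integer identifies the leaf unambiguously.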

58

Page 59: Indexing and Retrieval

Scoring

We have a bunch of paths through the tree. How do we compare the query image to a database image?

At each node, we define a weight wi.

The paper suggests two methods

• Use a constant weighting scheme.

• Use an entropy weighting scheme such as

$$w_i = \ln \frac{N}{N_i}$$

59

Page 60: Indexing and Retrieval

Scoring (continued)

$$w_i = \ln \frac{N}{N_i}$$

where

N is the number of images in the database

N_i is the number of images in the database with at least one descriptor vector path through node i.

60
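As a quick illustration of the entropy weight, the sketch below computes ln(N / N_i) for a few nodes. The node names and counts are invented for the example.

```python
import math

def entropy_weights(images_per_node, n_images):
    """Weight ln(N / N_i) for every node that at least one image passes through."""
    return {
        node: math.log(n_images / n_i)
        for node, n_i in images_per_node.items()
        if n_i > 0
    }

# Hypothetical counts for a database of 4 images.
weights = entropy_weights({"root": 4, "left": 2, "leaf": 1}, n_images=4)
```

A node visited by every image gets weight zero, and the weight grows as the node becomes rarer, which matches the inverse-document-frequency idea.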

Page 61: Indexing and Retrieval

Scoring (continued)

This scoring mechanism results in a TF-IDF scheme.

So we should see a higher score if more nodes are shared by more descriptors.

61

Page 62: Indexing and Retrieval

Scoring (continued)

To compare two scores, we use the normalized difference between the query score and the database score.

62

$$s(q, d) = \left\lVert \frac{q}{\lVert q \rVert} - \frac{d}{\lVert d \rVert} \right\rVert$$
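A hedged sketch of the normalized difference, with the norm order exposed as a parameter so the L1 and L2 variants can be compared. The vectors stand for weighted visit counts per tree node, and the toy values are made up.

```python
import numpy as np

def normalized_difference(q, d, p=1):
    """Norm of the difference between the two vectors after each is normalized."""
    q = np.asarray(q, dtype=float)
    d = np.asarray(d, dtype=float)
    q = q / np.linalg.norm(q, ord=p)
    d = d / np.linalg.norm(d, ord=p)
    return float(np.linalg.norm(q - d, ord=p))

query = [2.0, 0.0, 2.0]
same = normalized_difference(query, [1.0, 0.0, 1.0])      # same distribution over nodes
disjoint = normalized_difference(query, [0.0, 3.0, 0.0])  # no node in common
```

With this definition a smaller value means a better match: identical distributions score 0, and with the L1 norm two vectors with no shared nodes score 2.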

Page 63: Indexing and Retrieval

Scoring (continued)

The researchers found that the most important factors for retrieval quality were:

• A large vocabulary tree.

• Stronger weights towards the leaves of the tree.

• Using the L1 norm in the previous equation.

63

Page 64: Indexing and Retrieval

Scoring Implementation

Scoring is implemented using inverted files

• At each node create an inverted file

• Each file contains a list of images in which the current node appears.

• The inverted file of an inner node is simply the concatenation of its children’s inverted files.

• Database image scores are pre-computed and pre-normalized.

64

Page 65: Indexing and Retrieval

Testing

This method was tested using a database of 40000 CD album covers.

Pictures of CD album covers were then used as query images and run against the database.

Also tested using 6376 images in groups of 4.

Each image was queried in the hopes that the other 3 images would produce the top scores.

Have tested on databases with image counts as high as 1 million (highest at time of writing)

65

Page 66: Indexing and Retrieval

Testing

66

Page 67: Indexing and Retrieval

Results

67

Page 68: Indexing and Retrieval

Conclusions

The main conclusions of the paper are:

• Using a larger vocabulary tree improves retrieval quality.

• Using an L1 norm in the normalized difference of the scores produces better results than the L2 norm

• This method can scale up to 1 million images and still run in near real time.

68

Page 69: Indexing and Retrieval

CS 766: Computer Vision

Computer Sciences Department, University of Wisconsin-Madison

City-Scale Location Recognition

James Hill, Ozcan Ilikhan, Mark Lenz

{jshill4, ilikhan, mlenz}@cs.wisc.edu

69

Page 70: Indexing and Retrieval

City-Scale Location Recognition

70

Estimate location by matching features from a large set of images

Page 71: Indexing and Retrieval

City-Scale Location Recognition

71

City-wide database of photos labeled with location

Page 72: Indexing and Retrieval

Image Features

72

SIFT features invariant to

– Translation

– Scale

– Orientation

– Illumination (partially)

Page 73: Indexing and Retrieval

Difficulties Matching Features

73

Storage space

– 30,000 images ≈ 100,000,000 SIFT features ≈ 12 GB

Search time

kd-trees and Best Bin First require the original descriptors to be stored

Page 74: Indexing and Retrieval

Method

74

Cluster features into visual words

Build vocabulary tree from clusters

Search tree to score matches

Location of image with top score

Page 75: Indexing and Retrieval

Method

75

Build trees with informative features

Create trees of varying branching factor

Vary number of comparisons during search

Page 76: Indexing and Retrieval

Vocabulary Tree

76

Visual word = region of an object

Just need the distance between a query feature and each node

Only leaf nodes are words

Page 77: Indexing and Retrieval

Informative Features

77

Cluster small subsets into visual words

Compute information gain of features

Select most informative features to build tree

Page 78: Indexing and Retrieval

Information Gain

78

Informative Feature

– Found in all images of a location

– Not in any image of another location

Information gain: measure of how much new information reduces uncertainty

Page 79: Indexing and Retrieval

Information Gain

79

N_DB = number of images in the database

N_L = number of images at location l_i

a = number of images at location l_i in which visual word w_j occurs

b = number of images at other locations in which visual word w_j occurs
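The slide's own formula is not reproduced in the text, so the sketch below is an assumption: it uses the textbook definition of information gain, the entropy of "is this image at location l_i?" minus the conditional entropy given whether word w_j occurs, computed from the four counts above. The function names and the example counts are hypothetical.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def information_gain(a, b, n_l, n_db):
    """H(location) - H(location | word present or absent), from image counts."""
    h_location = entropy([n_l / n_db, 1 - n_l / n_db])
    present = a + b          # images that contain the word
    absent = n_db - present  # images that do not
    h_conditional = 0.0
    if present:
        h_conditional += present / n_db * entropy([a / present, b / present])
    if absent:
        h_conditional += absent / n_db * entropy(
            [(n_l - a) / absent, (n_db - n_l - b) / absent]
        )
    return h_location - h_conditional

# A word seen in all 5 images of the location and nowhere else.
perfect = information_gain(a=5, b=0, n_l=5, n_db=100)
# A word seen in every image of the database.
useless = information_gain(a=5, b=95, n_l=5, n_db=100)
```

Under this definition the first word removes all uncertainty about the location, while the second removes none, which matches the two properties of an informative feature listed on the previous slide.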

Page 80: Indexing and Retrieval

Building the Tree

80

Hierarchical k-means to cluster features

Nodes are the centroids

Leaves are the visual words

Page 81: Indexing and Retrieval

Branching Factor

81

Vary number of nodes compared to increase search accuracy

Fixed vocabulary size M

Branching factor k, depth L

k^L ≈ M

Page 82: Indexing and Retrieval

Greedy N-Best Paths

82

Approximate nearest neighbor

Similar to Best Bin First

Generalization of vocab tree search

Search multiple branches at each level

Page 83: Indexing and Retrieval

Greedy N-Best Paths

83

k + kN(L-1) comparisons
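The comparison count can be checked with a small sketch of the search: keep the N best candidates at every level and expand only their children. The random toy tree stands in for a trained vocabulary tree, and the node layout and function names are assumptions.

```python
import numpy as np

def make_full_tree(k, depth, dim, rng):
    """Full k-ary tree of random centers (toy stand-in for a trained tree)."""
    node = {"center": rng.normal(size=dim), "children": []}
    if depth > 0:
        node["children"] = [make_full_tree(k, depth - 1, dim, rng) for _ in range(k)]
    return node

def greedy_n_best(root, query, n_best):
    """Return (closest leaf found, number of distance comparisons made)."""
    frontier = [root]
    comparisons = 0
    while frontier[0]["children"]:
        # Expand every node kept from the previous level.
        candidates = [c for node in frontier for c in node["children"]]
        dists = [np.linalg.norm(query - c["center"]) for c in candidates]
        comparisons += len(candidates)
        order = np.argsort(dists)[:n_best]
        frontier = [candidates[i] for i in order]
    return frontier[0], comparisons

rng = np.random.default_rng(0)
k, levels = 3, 3
tree = make_full_tree(k, levels, dim=4, rng=rng)
query = rng.normal(size=4)

_, count_n2 = greedy_n_best(tree, query, n_best=2)
_, count_n1 = greedy_n_best(tree, query, n_best=1)
```

With k = 3 and L = 3 the search makes 3 + 3·2·2 = 15 comparisons for N = 2, and N = 1 reduces to the ordinary vocabulary tree descent with kL = 9 comparisons.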

Page 84: Indexing and Retrieval

Matching

84

Votes for image d = Cd

Computed in linear time in # of features

Page 85: Indexing and Retrieval

Results

85

30,000 images covering 20 km

278 GPS-labelled query images

Performance = % query images within 10m of ground truth

Page 86: Indexing and Retrieval

Results

86

Informative Features vs. Uniform

Page 87: Indexing and Retrieval

Results

87

Greedy N-Best Paths vs. Best Bin First

Page 88: Indexing and Retrieval

Results

88

Top n matches

Page 89: Indexing and Retrieval

Conclusion

89

Vocabulary tree structure affects performance in recognition tasks

Structure becomes more critical as database size increases

Number of comparisons drives performance, not branching factor

Page 90: Indexing and Retrieval

CS 766: Computer Vision

Computer Sciences Department, University of Wisconsin-Madison

Q & A, Discussion

Monday, November 29, 2010

Page 91: Indexing and Retrieval

CS 766: Computer Vision

Computer Sciences Department, University of Wisconsin-Madison

Acknowledgements

Many thanks to Prof. Andrew Zisserman and Dr. Josef Sivic for providing us with extra materials for the presentation.

91