
Page 1: Video Google: A Text Retrieval Approach to Object Matching in Videos

Authors: Josef Sivic and Andrew Zisserman, University of Oxford

ICCV 2003

Page 2: Motivation

• Retrieve key frames and shots of a video containing a particular object with the ease, speed, and accuracy with which Google retrieves web pages containing particular words
• Investigate whether a text retrieval approach is applicable to object recognition
• Visual analogy of a word: vector-quantized descriptor vectors

Page 3: Benefits

• Matches are pre-computed, so at run time frames and shots containing a particular object can be retrieved with no delay
• Any object (or conjunction of objects) occurring in a video can be retrieved, even though there was no explicit interest in the object when the descriptors were built

Page 4: Text Retrieval Approach

• Documents are parsed into words
• Words are represented by their stems
• A stop list rejects common words
• Remaining words are assigned unique identifiers
• Each document is represented by a vector of weighted word frequencies
• Vectors are organized in inverted files
• Retrieval returns the documents whose vectors are closest (in angle) to the query
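As a concrete illustration, here is a minimal Python sketch of that pipeline, with a toy stop list and crude suffix-stripping standing in for real stemming (function names such as `build_index` and `search` are ours, not the paper's):

```python
import math
from collections import Counter, defaultdict

STOP_LIST = {"the", "a", "of", "and", "to", "in"}   # toy stop list

def tokenize(doc):
    # parse into words, reject stop words, crude "stemming" by stripping "s"
    return [w.lower().rstrip("s") for w in doc.split()
            if w.lower() not in STOP_LIST]

def build_index(docs):
    """tf-idf vectors plus an inverted file mapping word -> doc ids."""
    tfs = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(w for tf in tfs for w in tf)             # document frequency
    idf = {w: math.log(n / df[w]) for w in df}            # rare words weigh more
    vecs = [{w: tf[w] / sum(tf.values()) * idf[w] for w in tf} for tf in tfs]
    inverted = defaultdict(list)
    for i, vec in enumerate(vecs):
        for w in vec:
            inverted[w].append(i)
    return vecs, idf, inverted

def search(query, vecs, idf, inverted):
    """Rank documents by the angle (cosine) between tf-idf vectors."""
    qtf = Counter(tokenize(query))
    q = {w: qtf[w] / sum(qtf.values()) * idf.get(w, 0.0) for w in qtf}
    candidates = {i for w in q for i in inverted.get(w, [])}

    def cosine(a, b):
        dot = sum(a[w] * b.get(w, 0.0) for w in a)
        na = math.sqrt(sum(x * x for x in a.values()))
        nb = math.sqrt(sum(x * x for x in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    return sorted(candidates, key=lambda i: -cosine(q, vecs[i]))

docs = ["the quick brown fox", "quick foxes in boxes", "lazy dogs sleep"]
print(search("quick fox", *build_index(docs)))   # doc ids, best match first
```

Because only the words present in the query are looked up in the inverted file, scoring touches just the documents that share at least one word with the query, which is what makes retrieval fast at run time.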

Page 5: Viewpoint Invariant Description

• Two types of viewpoint covariant regions are computed for each frame:
  - Shape Adapted (SA) regions, Mikolajczyk & Schmid
  - Maximally Stable Extremal Regions (MSER), Matas et al.
• The two detectors select different image areas and provide complementary representations of the frame
• Regions are computed at twice their originally detected size to be more discriminating

Page 6: Shape Adapted Regions: the Harris-Affine Operator

• Elliptical shape adaptation about an interest point
• Iteratively determines the ellipse's center, scale, and shape
• Scale is determined by the local extremum (across scale) of the Laplacian
• Shape is determined by maximizing intensity-gradient isotropy over the elliptical region
• Regions are centered on corner-like features
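A schematic numpy sketch of a single adaptation step, under our simplifications (re-localization of the interest point and re-warping of the image between iterations are omitted; a full implementation loops until the isotropy score approaches 1):

```python
import numpy as np

def second_moment_matrix(patch):
    """2x2 structure tensor of an image patch, averaged over the window."""
    gy, gx = np.gradient(patch.astype(float))
    return np.array([[np.mean(gx * gx), np.mean(gx * gy)],
                     [np.mean(gx * gy), np.mean(gy * gy)]])

def shape_adaptation_step(patch, U):
    """One iteration of shape adaptation.

    patch is the neighborhood warped by the current transform U. Returns the
    updated U and an isotropy score (1.0 = gradient distribution perfectly
    isotropic, at which point the iteration stops).
    """
    M = second_moment_matrix(patch)
    w, V = np.linalg.eigh(M)                   # eigen-decomposition of M
    w = np.maximum(w, 1e-12)                   # guard against flat patches
    isotropy = w.min() / w.max()
    M_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T  # whitening transform M^(-1/2)
    U = M_inv_sqrt @ U
    U /= np.sqrt(np.linalg.det(U))             # fix area: scale is selected by
    return U, isotropy                         # the Laplacian extremum, not here
```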

Page 7: Examples of the Harris-Affine Operator

(figure)

Page 8: Maximally Stable Regions

• Uses intensity watershed image segmentation
• Selects areas that are approximately stationary as the intensity threshold is varied
• These correspond to blobs of high contrast with respect to their surroundings
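OpenCV ships an MSER detector, so this step can be sketched directly; `frame.png` and the detector parameters are placeholders, not the paper's settings:

```python
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder frame path

# positional args: delta, min_area, max_area (illustrative values)
mser = cv2.MSER_create(5, 60, 14400)
regions, bboxes = mser.detectRegions(img)            # pixel lists per blob

# fit ellipses so MS regions are represented like the elliptical SA regions
ellipses = [cv2.fitEllipse(r) for r in regions if len(r) >= 5]
```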

Page 9: Examples of Maximally Stable Regions

(figure)

Page 10: Feature Descriptor

• Each elliptical affine-invariant region is represented by a 128-dimensional vector using the SIFT descriptor
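With OpenCV the step might look as follows; note this is an approximation, since OpenCV's SIFT works on circular keypoints while the paper computes SIFT over the affine-normalized elliptical SA/MS regions:

```python
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder frame path
sift = cv2.SIFT_create()

# OpenCV detects its own keypoints here, purely for illustration; each
# descriptor is a 128-dimensional vector, as on the slide
keypoints, descriptors = sift.detectAndCompute(img, None)
print(descriptors.shape)  # (number_of_regions, 128)
```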

Page 11: Noise Removal

• Information is aggregated over a sequence of frames
• Regions detected in each frame are tracked using a simple constant-velocity dynamical model and correlation
• Regions that do not survive for more than 3 frames are rejected
• The descriptor estimate for each region is computed by averaging its descriptors throughout the track
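A minimal sketch of the rejection-and-averaging step, assuming a tracker (not shown) has already grouped per-frame descriptors into tracks:

```python
import numpy as np

def stable_descriptors(tracks, min_frames=3):
    """Reject short tracks, then average each surviving track's descriptors.

    tracks: one array per tracked region, of shape (n_frames, 128) -- the
    region's SIFT descriptor in every frame the tracker followed it.
    """
    return np.array([np.asarray(t).mean(axis=0)   # mean 128-D descriptor
                     for t in tracks
                     if len(t) > min_frames])     # must survive > 3 frames
```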

Page 12: Noise Removal

(figure: tracking a region over 70 frames)

Page 13: Visual Vocabulary

• Goal: vector quantize the descriptors into clusters (visual words)
• When a new frame is observed, each descriptor of the frame is assigned to the nearest cluster; this immediately generates matches with all previous frames containing the same visual word

Page 14: Visual Vocabulary: Implementation

• K-means clustering of the region descriptors
• Regions are tracked through contiguous frames and an average description is computed for each track
• The 10% of tracks with the highest variance are eliminated, leaving about 1000 regions per frame
• A subset of 48 shots (~10%) is selected for clustering
• Distance function: Mahalanobis
• Result: 6000 SA clusters and 10000 MS clusters
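K-means under a Mahalanobis metric is equivalent to whitening the descriptors with the inverse covariance and running ordinary Euclidean k-means, which gives a compact sketch (scikit-learn is our choice of library, not the paper's; the commented variables are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, n_words):
    """Cluster an (n, 128) descriptor array into visual words (Mahalanobis)."""
    cov = np.cov(descriptors, rowvar=False)
    w, V = np.linalg.eigh(cov)
    # whitening transform Sigma^(-1/2): Euclidean distance on whitened points
    # equals Mahalanobis distance on the original descriptors
    W = V @ np.diag(np.maximum(w, 1e-10) ** -0.5) @ V.T
    km = KMeans(n_clusters=n_words, n_init=10).fit(descriptors @ W)
    return km, W

def quantize(km, W, descriptors):
    """Assign descriptors from a new frame to their nearest visual words."""
    return km.predict(descriptors @ W)

# separate vocabularies as on the slide: 6000 SA words and 10000 MS words
# (sa_descriptors / ms_descriptors are hypothetical (n, 128) arrays)
# km_sa, W_sa = build_vocabulary(sa_descriptors, 6000)
# km_ms, W_ms = build_vocabulary(ms_descriptors, 10000)
```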

Page 15: Visual Vocabulary

(figure)

Page 16: Experiments: Setup

• Goal: match scene locations within a closed world of shots
• Data: 164 frames from 48 shots taken at 19 different 3D locations; 4-9 frames from each location

Page 17: Experiments: Retrieval

• The entire frame is used as the query
• Each of the 164 frames serves as the query region in turn
• Correct retrieval: the other frames that show the same location
• Retrieval performance is measured by the average normalized rank of the relevant images, where:
  - Nrel = number of images relevant to the query image
  - N = size of the image set
  - Ri = rank of the i-th relevant image
• The normalized rank lies between 0 and 1: it is 0 if all relevant images are returned ahead of any others, and about 0.5 for random retrieval.
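The formula itself was an image on the original slide; from these definitions it is the standard average normalized rank, reconstructed here:

```latex
\widetilde{\mathrm{Rank}} = \frac{1}{N\,N_{\mathrm{rel}}}
  \left( \sum_{i=1}^{N_{\mathrm{rel}}} R_i - \frac{N_{\mathrm{rel}}(N_{\mathrm{rel}}+1)}{2} \right)
```

The subtracted term is the smallest possible sum of ranks, so the score is exactly 0 when every relevant image is returned first; for Nrel much smaller than N, a random ordering gives roughly (N - Nrel)/(2N), i.e. about 0.5.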

Page 18: Experiments: Results

(figure of average normalized ranks; zero is good!)

Page 19: Experiments: Results

• Precision = # relevant frames retrieved / total # frames retrieved
• Recall = # relevant frames retrieved / # relevant frames in the set
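In code, for one query (a hypothetical helper, with `retrieved` the ranked frames returned and `relevant` the frames showing the same location):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a single query."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# e.g. 2 of the 4 retrieved frames are relevant, out of 3 relevant in total
print(precision_recall([3, 7, 1, 9], {3, 1, 5}))  # (0.5, 0.666...)
```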

Page 20: Stop List

• The most frequent 5% and the least frequent 10% of the visual words are stopped

Page 21: Spatial Consistency

• Matched regions in the retrieved frames should have a spatial arrangement similar to that of the outlined region in the query
• Frames are first retrieved using the weighted frequency vector and then re-ranked based on spatial consistency
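A sketch of the re-ranking idea: each region match votes for the other matches that fall within its spatial neighborhood in both frames (the paper uses the 15 nearest neighbours; the data layout here is our assumption):

```python
import numpy as np

def spatial_score(matches, query_xy, result_xy, k=15):
    """Spatial-consistency score for one retrieved frame.

    matches: (i, j) pairs -- region i in the query matched region j in the
    retrieved frame. query_xy / result_xy: (n, 2) arrays of region centers.
    Each match votes for every other match lying among its k nearest matched
    regions in *both* frames.
    """
    q_pts = query_xy[[i for i, _ in matches]]   # centers of matched regions
    r_pts = result_xy[[j for _, j in matches]]
    score = 0
    for i, j in matches:
        qd = np.linalg.norm(q_pts - query_xy[i], axis=1)
        rd = np.linalg.norm(r_pts - result_xy[j], axis=1)
        q_near = set(np.argsort(qd)[1:k + 1])   # [1:] skips the match itself
        r_near = set(np.argsort(rd)[1:k + 1])
        score += len(q_near & r_near)           # neighbours agreeing in both
    return score

# frames retrieved with the tf-idf vectors are then re-ranked by this score
```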

Page 22: More Results

(figure)

Page 23: Related Web Pages

• http://www.robots.ox.ac.uk/~vgg/research/vgoogle/how/method/method_a.html
• http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html