ADBIS 2013 Conference, Genoa, Italy, Sep 1-4, 2013
Compact and Distinctive Visual Vocabularies for Efficient Multimedia Data Indexing
Dimitris Kastrinakis 1, Symeon Papadopoulos 2, Athena Vakali 1
1 Aristotle University of Thessaloniki, Department of Informatics (AUTH)
2 Centre for Research and Technology Hellas, Information Technologies Institute (CERTH-ITI)

DESCRIPTION

Paper presentation at ADBIS 2013. Abstract: Multimedia data indexing for content-based retrieval has attracted significant attention in recent years due to the commoditization of multimedia capturing equipment and the widespread adoption of social networking platforms as a means for sharing media content online. Due to the very large amounts of multimedia content, notably images, produced and shared online by people, a very important requirement for multimedia indexing approaches pertains to their efficiency both in terms of computation and memory usage. A common approach to support query-by-example image search is based on the extraction of visual words from images and their indexing by means of inverted indices, a method proposed and popularized in the field of text retrieval. The main challenge that visual word indexing systems currently face arises from the fact that it is necessary to build very large visual vocabularies (hundreds of thousands or even millions of words) to support sufficiently precise search. However, when the visual vocabulary is large, the image indexing process becomes computationally expensive because the local image descriptors (e.g. SIFT) need to be quantized to the nearest visual words. To this end, this paper proposes a novel method that significantly decreases the time required for the above quantization process. Instead of using hundreds of thousands of visual words for quantization, the proposed method manages to preserve retrieval quality by using a much smaller number of words for indexing. This is achieved by the concept of composite words, i.e. assigning multiple words to a local descriptor in ascending order of distance. We evaluate the proposed method on the Oxford and Paris buildings datasets to demonstrate the validity of the proposed approach.

Overview

• Problem formulation

• Related work

• Proposed method

• Evaluation

• Conclusions

Motivation

• Multimedia collections are ever-growing
  – Personal photo collections can easily reach several thousands of photos
  – Professional photo archives are typically in the range of hundreds of thousands to many millions of photos
  – Online photos number in the many billions
  – It is estimated that a total of 3.5 trillion photos have been captured by people so far
• Need for effective and efficient search!
  – Prevalent paradigm: Content-based image search

http://blog.1000memories.com/94-number-of-photos-ever-taken-digital-and-analog-in-shoebox

Problem Formulation (1)

Content-based image search (CBIR)
• Also known as “example-based search”, “similarity-based search”
• Given an indexed collection of images and a query image (typically not part of the collection), fetch the N images of the collection that are most similar to the query image
  – Similarity is typically computed on the basis of Euclidean distance between feature vectors extracted from the visual content of images
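The retrieval task stated above can be sketched in a few lines. This is only an illustration under simple assumptions: each image is reduced to a single feature vector, the collection is a plain dictionary, and the names `euclidean`, `top_n_similar` and `collection` are hypothetical, not taken from the paper.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_n_similar(query, collection, n):
    """Return the ids of the n images whose vectors are closest to the query.

    `collection` maps an image id to its feature vector (hypothetical layout).
    """
    return heapq.nsmallest(
        n, collection, key=lambda image_id: euclidean(query, collection[image_id])
    )


# Toy 2-D "feature vectors"; real descriptors have far more dimensions.
collection = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
print(top_n_similar([0.9, 1.2], collection, n=2))
```

A brute-force scan like this compares the query against every image, which is what index structures are meant to avoid on large collections.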

Problem Formulation (2)

• Recent CBIR systems make use of local descriptors (e.g. SIFT, SURF) extracted from images, according to the bag-of-words (BoW) retrieval paradigm.
  – Each image is represented as a set (“bag”) of visual words
    • Visual words are the result of a learning process (clustering + quantization) on a large set of images
  – An inverted index structure is used to speed up retrieval
• Such systems achieve good search accuracy, but:
  – For this to happen, the number of visual words in the vocabulary needs to be very large (~10^5 to 10^6)
  – With very large vocabularies, we face two computational problems: (a) creating the vocabularies (offline), and (b) quantizing the local descriptors of a new image to the vocabulary words (online)
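The BoW indexing steps described above can be sketched as follows. This is a minimal illustration, assuming a vocabulary that has already been learned (a list of cluster centres) and a brute-force nearest-word search; all function names and the toy data are hypothetical. The linear scan in `nearest_word` is the online quantization cost that grows with vocabulary size.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance (enough for finding the nearest word)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary word closest to the descriptor: O(|V|) per call."""
    return min(
        range(len(vocabulary)),
        key=lambda i: squared_distance(descriptor, vocabulary[i]),
    )


def build_inverted_index(images, vocabulary):
    """Map each visual word id to the set of image ids that contain it."""
    index = defaultdict(set)
    for image_id, descriptors in images.items():
        for descriptor in descriptors:
            index[nearest_word(descriptor, vocabulary)].add(image_id)
    return index


def search(query_descriptors, index, vocabulary):
    """Rank images by how many distinct query words they share (a simple vote)."""
    votes = defaultdict(int)
    for word in {nearest_word(d, vocabulary) for d in query_descriptors}:
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return sorted(votes, key=lambda image_id: (-votes[image_id], image_id))


# Toy data: three 2-D "visual words" and two images with a few descriptors each.
vocabulary = [[0, 0], [10, 0], [0, 10]]
images = {"img1": [[1, 1], [9, 1]], "img2": [[1, 9]]}
index = build_inverted_index(images, vocabulary)
print(search([[9, 0], [1, 0]], index, vocabulary))
```

Real systems typically weight the votes (for example with tf-idf) rather than counting shared words, so the scoring here is a deliberate simplification.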

Overview of BoW indexing

[Diagram: BoW indexing pipeline in three stages. VOCABULARY LEARNING: TRAINING COLLECTION, feature extraction (e.g. SIFT, SURF), feature clustering + quantization, visual vocabulary. INDEXING: COLLECTION TO INDEX, feature extraction (e.g. SIFT, SURF), feature-to-vocabulary mapping, collection index. RETRIEVAL: QUERY IMAGE, feature extraction (e.g. SIFT, SURF), feature-to-vocabulary mapping.]

Our contribution

• Propose the concept of Composite Visual Words (CVW), which reduces the need for many visual words
  – CVW are permutations of plain visual words
  – Using CVW instead of plain visual words makes it possible to achieve similar search accuracy with far fewer visual words in the vocabulary
• Experimentally validate our approach on two standard datasets (Oxford and Paris buildings)
  – With a vocabulary of 200 visual words and the use of CVW, we manage to match the retrieval performance of approaches that use two to three orders of magnitude more visual words.

Our contribution

[Diagram: the three-stage BoW pipeline (VOCABULARY LEARNING, INDEXING, RETRIEVAL) with feature extraction (e.g. SIFT, SURF), feature clustering + quantization, visual vocabulary, feature-to-vocabulary mapping and collection index, shown with an added CVW label.]

Notation

Related Work

• First use of BoW approach for image similarity search (Sivic & Zisserman, 2003)
• Approaches to speed up the original method:
  – Hierarchical vocabulary tree (Nister & Stewenius, 2006)
  – Approximate k-means clustering (Philbin et al., 2007)
• More expressive (and expensive!) representations:
  – Soft assignment of descriptors to words (Philbin et al., 2008): requires more index space and time
  – Visual phrases (Yuan et al., 2007): does not take into account the order of words, expensive mining phase
  – Words + phrases (Zhang et al., 2009): high accuracy, but computationally intensive learning and still a large vocabulary
  – Vocabulary that maintains spatial layout of words (Zhang et al., 2010), bundled features (Wu et al., 2009): complex vocabulary generation process + not high coverage of feature space

Proposed Approach (1)

• Instead of indexing using plain visual words, we index using permutations of visual words based on their relative distance from the image to be indexed

[Figure panels: PLAIN VISUAL WORDS; COMPOSITE VISUAL WORDS]
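The idea of indexing with an ordered permutation of words can be sketched as below. This is a minimal illustration, assuming (as the abstract states) that words are assigned to a local descriptor in ascending order of distance; the function name `composite_word`, the parameter `b` and the toy vocabulary are hypothetical.

```python
import math


def composite_word(descriptor, vocabulary, b):
    """Return the ids of the b nearest words, ordered by ascending distance.

    The ordered tuple acts as a single index term, so (0, 1) and (0, 2)
    are different composite words even though both start with word 0.
    """
    ranked = sorted(
        range(len(vocabulary)),
        key=lambda i: math.dist(descriptor, vocabulary[i]),
    )
    return tuple(ranked[:b])


# Four toy 2-D words at the corners of a square.
vocabulary = [[0, 0], [4, 0], [0, 4], [4, 4]]

# Both descriptors quantize to plain word 0, yet get distinct composite words.
print(composite_word([1, 0.5], vocabulary, b=2))
print(composite_word([0.5, 1], vocabulary, b=2))
```

With `b=1` the function reduces to plain visual word assignment, which is why a small vocabulary can behave like a much larger one once `b` grows.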

Proposed Approach (2)

• Important distinction:
  – Original visual vocabulary (V): the visual words that result from the clustering and quantization process on the training collection.
  – Effective visual vocabulary (V’): the composite visual words that can be used for indexing.
    Maximum theoretical size: |V’|max = |V|! / (|V| − B)!
  – For instance, if |V| = 100 and B = 3 (the maximum length of a permutation), then |V’|max = 100 · 99 · 98 = 970,200

This is the main way to increase the distinctive capability of the vocabulary without increasing complexity.
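The example figure above can be checked numerically. The sketch below assumes that the maximum counts ordered selections of exactly B distinct words out of |V|, which is what the value 970,200 for |V| = 100 and B = 3 suggests; the function name is hypothetical.

```python
import math


def max_effective_vocabulary_size(num_words, b):
    """Number of ordered selections of b distinct words out of num_words."""
    return math.perm(num_words, b)


# |V| = 100 and B = 3, as in the example on the slide.
print(max_effective_vocabulary_size(100, 3))
```

Under the same assumption, a 200-word vocabulary with B = 2 already yields 39,800 possible composite words.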

Proposed Approach (3)

• Caveat: If the maximum effective vocabulary size grows very large (e.g. several millions), retrieval performance may suffer, since many CVWs will occur very rarely.

• To address this, we employ a thresholding strategy to make sure that the resulting CVWs are of high quality.

where d(u,f) is the Euclidean distance between local feature f and word u, while max d(u’,f) is the maximum distance between feature f and any word of V. Parameter α controls the “strictness” of the threshold (larger α means stricter thresholding).
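The thresholding idea can be illustrated with a sketch, with one important caveat: the exact inequality is not reproduced in this text, so the rule used below is purely an assumption. It keeps a word u only when d(u,f) ≤ max d(u’,f) / α, a form chosen so that a larger α is stricter, in line with the description above. The function name and toy data are hypothetical.

```python
import math


def thresholded_composite_word(descriptor, vocabulary, b, alpha):
    """Composite word built only from words that pass a distance threshold.

    ASSUMPTION (not the paper's formula): word u is kept when
    d(u, f) <= max_d / alpha, where max_d is the largest distance from the
    descriptor f to any word. Larger alpha therefore keeps fewer words.
    """
    distances = [math.dist(descriptor, word) for word in vocabulary]
    limit = max(distances) / alpha
    ranked = sorted(range(len(vocabulary)), key=lambda i: distances[i])
    return tuple(i for i in ranked[:b] if distances[i] <= limit)


vocabulary = [[0, 0], [4, 0], [0, 4], [4, 4]]
for alpha in (1.0, 1.4, 2.0):
    print(alpha, thresholded_composite_word([1, 0.5], vocabulary, b=3, alpha=alpha))
```

Whatever the exact rule, the effect is the same in kind: distant words are dropped from the permutation, so descriptors yield shorter and less sparse composite words.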

Algorithmic description of approach

Evaluation

• Datasets: Oxford and Paris buildings
• Implementation:
  – Local descriptors: SIFT, 2000 features/image (Oxford), 1000 features/image (Paris)
  – Inverted index: Apache Solr
• Evaluation measure: mean Average Precision (mAP)
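For reference, mAP can be computed as in the sketch below. It uses the common rank-based definition of average precision (mean of the precision values at the ranks of the relevant images); the official evaluation scripts of the datasets may differ in details, so treat this as an illustration rather than the protocol used in the paper. Function names are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k over the ranks k at which a relevant image appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant images that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(results):
    """`results` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)


queries = [
    (["a", "x", "b"], {"a", "b"}),  # relevant images at ranks 1 and 3
    (["x", "a"], {"a"}),            # relevant image at rank 2
]
print(mean_average_precision(queries))
```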

Results (1)

[Result plots: OXFORD; PARIS]

Results (2)

[Figure: EXAMPLE RESULTS (|V|=200)]

[Table: Comparison to state of the art (SoA)]

Conclusions

SUMMARY
• Approach to increase the distinctive capability of the visual vocabulary without harming efficiency.
• Validation on standard datasets demonstrates a significant reduction in vocabulary size while maintaining state-of-the-art retrieval accuracy.

FUTURE WORK
• Tests on larger datasets (e.g. using millions of images as distractors)
• Alternative thresholding or CVW filtering strategies.

References (1)

• Classic BoW indexing
  Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. Ninth IEEE International Conference on Computer Vision, ICCV (2003)

• Efficient BoW indexing
  Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2. (2006)

  Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. IEEE Conference on Computer Vision and Pattern Recognition, CVPR’07 (2007)

References (2)

• Richer BoW-based representations
  Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: Improving particular object retrieval in large scale image databases. IEEE Conference on Computer Vision and Pattern Recognition, CVPR (2008)

  Wu, Z., Ke, Q., Isard, M., Sun, J.: Bundling features for large scale partial-duplicate web image search. IEEE Conference on Computer Vision and Pattern Recognition, CVPR (2009)

  Yuan, J., Wu, Y., Yang, M.: Discovery of collocation patterns: from visual words to visual phrases. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’07, 1–8, IEEE (2007)

  Zhang, S., Tian, Q., Hua, G., Huang, Q., Li, S.: Descriptive visual words and visual phrases for image applications. In Proceedings of the 17th ACM international conference on Multimedia, 75–84, ACM (2009)

  Zhang, S., Huang, Q., Hua, G., Jiang, S., Gao, W., Tian, Q.: Building contextual visual vocabulary for large-scale image applications. In Proceedings of the international conference on Multimedia, 501–510, ACM (2010)

Questions

Further contact:

[email protected]

Acknowledgement:

www.socialsensor.eu