
Compact Discrete Representations for Scalable Similarity Search

by

Mohammad Norouzi

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Graduate Department of Computer Science
University of Toronto

© Copyright 2016 by Mohammad Norouzi

Abstract

Compact Discrete Representations for Scalable Similarity Search

Mohammad Norouzi

Doctor of Philosophy

Graduate Department of Computer Science

University of Toronto

2016

Scalable similarity search on images, documents, and user activities benefits generic search, data

visualization, and recommendation systems. This thesis concerns the design of algorithms and

machine learning tools for faster and more accurate similarity search. The proposed techniques

advocate the use of discrete codes for representing the similarity structure of data in a compact

way. In particular, we will discuss how one can learn to map high-dimensional data onto

binary codes with a metric learning approach. Then, we will describe a simple algorithm for

fast exact nearest neighbor search in Hamming distance, which exhibits sub-linear query time

performance. Going beyond binary codes, we will highlight a compositional generalization of k-

means clustering which maps data points onto integer codes with storage and search costs that

grow sub-linearly in the number of cluster centers. This representation improves upon binary

codes, and provides an even more precise approximation of Euclidean distance. Experimental

results are reported on multiple datasets including a dataset of SIFT descriptors with 1B entries.


Acknowledgements

I would like to thank my extraordinary advisor and mentor, David Fleet, for his continuous

support and encouragement, his perfectionism, clarity, great intuitions, openness to ideas, as

well as his chill attitude, modesty, and superb sense of humor. I was truly blessed to have David

as my teacher.

I am grateful to the rest of my advisory committee including Radford Neal, Ruslan Salakhutdinov, and Kyros Kutulakos, whose insightful comments and great questions inspired me to

extend some of the research findings and helped me improve the exposition of the ideas. I

am grateful to Thorsten Joachims and Raquel Urtasun for serving as my thesis examiners and

offering me their thoughtful comments and feedback.

I thank my fellow labmates and members of the AI group for stimulating discussions and

the fun we had together at UofT including Abdel-Rahman Mohamed, Aida Nematzadeh, Ali

Punjani, Amin Tootoonchian, Charlie Tang, Fartash Faghri, Fernando Flores-Mangas, George

Dahl, Ilya Sutskever, Jonathan Taylor, Kaustav Kundu, Marcus Brubaker, Navdeep Jaitly, Nitish Srivastava, Sarah Sabour, Shenlong Wang, Siavosh Benabbas, Tom Lee, Varada Kolhatka,

Vlad Mnih, Wenjie Luo, Yanshuai Cao, Yuval Filmus, and others. I am sorry about forgetting

to include some of the names. You are going to be missed.

My sincere gratitude goes to Peyman Sarrafi and Ali Ashasi for putting up with me. My

thanks also goes to my awesome friends for cheering me up, including Afshar Ganjali, Ahmad

Sobhani, Ali Kalantarian, Ali Naseri, Alireza Sahraei, Amir Aghaei, Asghar Zahedi, Ebrahim

Bagheri, Emad Zilouchian, Faezeh Ensan, Hamed Parham, Hamideh Zakeri, Hossein Kaffash,

Kaveh Ghasemloo, Kianoosh Mokhtarian, Mandana Einolghozati, Mansour Safdari, Mohammad Derakhshani, Mohammad Rashidian, Mona Sobhani, Morteza Zadimoghaddam, Nima

Zarian, Safa Akbarzadeh, Samira Karimelahi, Zeynab Ziaie, and others that I forgot to include.

I particularly thank Nazanin Montazeri for her constant support and positive energy.

I thank Relu Patrascu for his interesting conversations and for keeping our computers and

servers up and running. I also thank Luna Keshwah for her excellent administrative support.

I express my deepest gratitude to my special family, my best parents in the world, Mansoureh

and Sadegh, and my best brothers in the universe, Mahdi and Sajad. My family supported me,

shared with me their advice, and made me feel unconditionally loved from overseas.


Contents

1 Introduction
1.1 Nearest neighbor search
1.2 Keyword search in text documents
1.3 Hashing for nearest neighbor search
1.4 Sketching with compact discrete codes
1.5 Our approach to learning hash functions
1.6 Search in Hamming space
1.7 Vector quantization for nearest neighbor search
1.8 Thesis Outline
1.9 Relationship to Published Papers

2 Minimal Loss Hashing
2.1 Formulation
2.1.1 Pairwise hinge loss
2.1.2 Binary Reconstructive Embedding
2.2 Bound on empirical loss
2.2.1 Structural SVM
2.2.2 Convex-concave bound for hashing
2.2.3 Tightness of the bound and regularization
2.3 Optimization
2.3.1 Loss-augmented inference with pairwise hashing loss
2.3.2 Perceptron-like learning with pairwise loss
2.4 Implementation details
2.5 Experiments
2.5.1 Six datasets
2.5.2 Euclidean 22K LabelMe
2.5.3 Semantic 22K LabelMe
2.6 Hashing for very high-dimensional data
2.7 Summary
2.A Proof of the inequality on the tightness of the bound

3 Hamming Distance Metric Learning
3.1 Formulation
3.1.1 Triplet loss function
3.2 Optimization through an upper bound
3.2.1 Loss-augmented inference with triplet hashing loss
3.2.2 Perceptron-like learning with triplet loss
3.3 Asymmetric Hamming distance
3.4 Implementation details
3.5 Experiments
3.6 Summary

4 Fast Exact Search in Hamming Space with Multi-Index Hashing
4.0.1 Background: problem and related work
4.1 Multi-Index Hashing
4.1.1 Substring search radii
4.1.2 Multi-Index Hashing for r-neighbor search
4.2 Performance analysis
4.2.1 Choosing an effective substring length
4.2.2 Run-time complexity
4.2.3 Storage complexity
4.3 k-Nearest neighbor search
4.4 Experiments
4.4.1 Datasets
4.4.2 Experimental results
4.4.3 Multi-Index Hashing vs. Linear Scan
4.4.4 Direct lookups with a single hash table
4.4.5 Substring Optimization
4.5 Implementation details
4.6 Conclusion

5 Cartesian k-means
5.1 k-means
5.2 Orthogonal k-means with 2^m centers
5.2.1 Learning ok-means
5.2.2 Distance estimation for approximate nearest neighbor search
5.2.3 Experiments with ok-means
5.3 Cartesian k-means
5.3.1 Learning ck-means
5.4 Relations and related work
5.4.1 Iterative Quantization vs. Orthogonal k-means
5.4.2 Orthogonal k-means vs. Cartesian k-means
5.4.3 Product Quantization vs. Cartesian k-means
5.5 Experiments
5.5.1 Euclidean distance estimation for approximate NNS
5.5.2 Learning visual codebooks
5.6 More recent quantization techniques
5.7 Summary

6 Conclusion

Bibliography

List of Tables

3.1 Classification error rates on the MNIST test set
3.2 Recognition accuracy on the CIFAR-10 test set
4.1 Summary of run-time results on AMD machine
4.2 Summary of run-time results on Intel machine
4.3 Run-time improvements from optimization of substring bit assignments
4.4 Selected number of substrings used for the experiments
5.1 Summary of quantization models in terms of encoding and storage
5.2 Recognition accuracy on CIFAR-10 using different codebook learning algorithms

List of Figures

1.1 An illustration of binary sketching for similarity search
1.2 Visualization of pairwise hinge loss for learning binary hash functions
1.3 An illustration of training data organized into triplets
1.4 Illustration of Hamming ball with a radius of r
2.1 The upper bound and empirical loss as functions of optimization step
2.2 Precision for near neighbors within Hamming radii of 1 and 5
2.3 Precision of near neighbors within a Hamming radius of 3 bits
2.4 Precision-recall curves for different methods on MNIST and LabelMe
2.5 Precision-recall curves for different methods on four other datasets
2.6 Precision-recall curves for different code lengths on Euclidean 22K LabelMe
2.7 Comparison of MLH, NNCA, and NN baseline on semantic 22K LabelMe
2.8 Qualitative results on semantic 22K LabelMe
3.1 MNIST precision@k
3.2 Precision@k plots for Hamming distance on CIFAR-10
3.3 Qualitative retrieval results for four CIFAR-10 images
4.1 The number of hash table buckets within a Hamming ball, and the expected search radius required for kNN
4.2 Search cost and its upper bound as functions of substring length
4.3 Histograms of the search radii required to find kNN on binary codes
4.4 Memory footprint of our Multi-Index Hashing implementation
4.5 Recall rates for BIGANN dataset
4.6 Run-times on AMD on 1B 64-bit codes by LSH
4.7 Run-times on AMD on 1B 128-bit codes by LSH
4.8 Run-times on AMD on 1B 256-bit codes by LSH
4.9 Number of lookups for exact kNN on binary codes using a single hash table
4.10 Run-times for multi-index-hashing with consecutive vs. optimized substrings
5.1 Depiction of ok-means clusters on 2D data
5.2 Euclidean approximate NNS results on 1M SIFT dataset
5.3 Depiction of Cartesian quantization on 4D data
5.4 Euclidean approximate NNS results on 1M SIFT, 1M GIST, and 1B SIFT
5.5 PQ and ck-means results using natural, structured, and random ordering
5.6 PQ and ck-means results using different number of bits for encoding

Chapter 1

Introduction

Staggering numbers of new images and videos appear on the world wide web every day. According to a report from May 2014 [KPCB14], more than 1.8 billion photos are uploaded and shared per day on selected platforms including Flickr, Snapchat, Instagram, Facebook, and WhatsApp. The availability of digital cameras and the ease of sharing digital content on the Internet have contributed significantly to the creation of massive image and video datasets, which are growing rapidly. Computer software must improve significantly to enable indexing, searching, processing, and organizing such a quickly growing volume of visual data.

Better and faster algorithms for indexing and searching digital content will be fundamental enablers for search engines and myriad big data and multimedia applications. For example, consider data-driven approaches to computer vision, which are now becoming successful in tasks such as object instance recognition [Low04], image restoration and inpainting [FJP02, CPT04, HE07], pose estimation [SVD03], 3D structure from motion [SSS08], and object segmentation [KGF12]. A key element of these approaches is content-based similarity search, in which unseen test queries are matched against large datasets of images and visual features. Then, often, labeled information contained in visually similar data is aggregated and transferred to label the query images. The problem is that current similarity search techniques do not easily scale to more than several million data points, where storage overheads and similarity computations become prohibitive. As a consequence, massive image and video collections

on the web remain unexplored for most applications.

In computer vision, one extracts high-dimensional feature vectors from visual data. Storage costs associated with large high-dimensional datasets pose a major challenge for large-scale similarity search. Dimensionality reduction techniques, such as PCA, tend to simplify matters by reducing the storage cost, but such techniques do not specifically target similarity search applications. One hopes to obtain much better efficiency by designing compression techniques specifically for similarity search. In this thesis, we advocate the development of compact discrete representations that facilitate fast near neighbor retrieval. As will be shown, compact discrete codes can be used as hash keys for fast retrieval of candidate near neighbors, or can be used for fast distance estimation based only on compressed codes. Our ultimate goal is to develop



content-based similarity search tools and algorithms with minimal memory and computation

costs, to facilitate the use of web-scale datasets in computer vision and machine learning.

1.1 Nearest neighbor search

The problem of nearest neighbor search (NNS) is expressed as follows: Given a dataset of n

data points, construct a data structure such that, given any query data point, dataset points

that are nearest to the query based on a pre-specified distance can be found quickly. One may

be interested in one-nearest neighbor or k-nearest neighbors. We expect the indexing data

structure to be storage efficient.

Suppose we are given a dataset of p-dimensional feature vectors, denoted D ≡ {x_i}_{i=1}^n where x_i ∈ Rp. Let z ∈ Rp denote a query feature vector, and suppose we are interested in Euclidean distance as our pairwise distance function. The one-nearest neighbor of a query z is defined as

NN(z) = argmin_{1 ≤ i ≤ n} ‖z − x_i‖_2 .        (1.1)

As an example, consider the one-dimensional Euclidean NNS problem, i.e., p = 1. One can solve this problem efficiently by organizing the dataset points into a sorted array. Given a query, one resorts to binary search to find the nearest elements in the sorted array. Hence, with a pre-processing cost of O(n log n) for sorting, and a storage cost of O(n), each query can be answered in O(log n) time. Even for p = 2, one can design an efficient NNS algorithm with linear storage and logarithmic query time based on Voronoi diagrams and a point-location data structure. However, for p ≥ 3, comparably simple algorithms do not exist. For low-dimensional NNS problems (with p up to about 10 or 20), one can obtain good practical performance by using k-d trees [Ben75] or other space-partitioning data structures [Sam06], but no satisfactory worst-case query time can be guaranteed. For relatively high-dimensional data, the NNS problem remains unsolved, both in theory and practice.
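To make the one-dimensional case concrete, the following Python sketch (our own illustration, not part of the thesis; the function names are hypothetical) sorts the dataset once and answers each query by binary search, matching the O(n log n) preprocessing and O(log n) query time noted above.

    import bisect

    def build_index(points):
        # O(n log n) preprocessing: store the points in a sorted array.
        return sorted(points)

    def nearest_neighbor_1d(index, z):
        # O(log n) query: find where z would be inserted, then compare
        # the at most two candidates that bracket that position.
        j = bisect.bisect_left(index, z)
        candidates = index[max(j - 1, 0):j + 1]
        return min(candidates, key=lambda x: abs(x - z))

    index = build_index([4.2, -1.0, 7.5, 3.3, 0.1])
    print(nearest_neighbor_1d(index, 3.0))   # -> 3.3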

The brute-force linear scan (exhaustive search) solution to the Euclidean NNS problem requires a query time of O(np). For large datasets, one cannot tolerate a linear query time. One may be willing to spend O(np), or slightly more, in a pre-processing stage to create a suitable data structure, but at query time, we expect a running time sublinear in n. Unfortunately, for moderate feature dimensionality (e.g., p ≥ 20), exact sub-linear NNS solutions require storage cost or query time exponential in p. To this day, we do not know of any algorithm with polynomial pre-processing and storage costs that guarantees sublinear query time performance, even for the simplest distance measures such as Hamming distance. Therefore, some recent work has focused on approximate rather than exact techniques (e.g., [IM98, GIM99, AI08, And09, ML14]).

There are two lines of research addressing the approximate NNS problem: theoretical (such as [IM98, AI08]) and applied (such as [JDS11, ML14]). Theoretical research aims at improving the approximation ratios of NNS solutions, and their space and worst-case query time complexity. In addition, theoreticians try to develop hardness results for NNS under different metrics. Applied research, such as the current thesis, mainly concerns experimental evaluation of techniques; while it draws inspiration from theory, it compares methods not by their worst-case query time complexity, but by their average query time performance and precision/recall curves on standard benchmarks. From an applied perspective, ideally, one should compare different NNS solutions based on their impact on a specific final task such as

image restoration and inpainting.

Different distance functions and similarity measures have been used within NNS applications

in the literature. An incomplete list of metrics includes Euclidean distance, Hamming distance,

α-norm distances including ℓ1 and ℓ∞, cosine similarity, the Jaccard index, edit distance, and earth mover's distance. Ideally, one aims to devise a common approach to NNS under different metrics instead of hand-crafting solutions for each metric separately. We advocate the use of

machine learning techniques to reduce any arbitrary metric to a host metric that is amenable

to efficient NNS solutions. The host metric that this thesis focuses on is Hamming distance. In

Chapters 2 and 3 we develop a method to learn a proper mapping of data points to binary codes,

under which Hamming distance preserves a form of similarity structure in the original space.

In Chapter 4 we discuss how efficient search in Hamming distance can be conducted. Another

convenient host metric is Euclidean distance for which many machine learning algorithms for

distance metric learning exist [BHS13]. In Chapter 5 we discuss methods for Euclidean NNS.

An important common characteristic of NNS algorithms is that they perform some form of

space partitioning to make the search problem more manageable. Space partitioning may be

performed via hierarchical subdivision of the space in k-d trees and variants, or via random

hyperplanes and lattices in hashing approaches, or via Voronoi diagrams in extensions of k-

means clustering. A common theme of this thesis is also space partitioning. We focus on

designing machine learning techniques that optimize different forms of space partitioning based

on different objectives useful for different applications.

1.2 Keyword search in text documents

Perhaps one path to the development of effective solutions for NNS is to follow established

methods for text search. We use search engines such as Google on a daily basis to perform

keyword search in text documents. For example, one may look up “nearest neighbor search

computer vision applications” to find web pages and text documents elaborating on this

topic. Simply put, the goal is to find all of the documents on the Internet that contain all of

the query keywords.

One can represent each document by a binary high-dimensional vector, where presence

(absence) of each word is represented by a bit. The number of bits depends on the number

of words in the vocabulary. Each query can be represented in the same way, but we know

that queries have only a few non-zero bits, as the number of words in a query is much smaller than in a document. This is a specific search problem in which queries and documents are both

high-dimensional and sparse, but they have different sparsity patterns.

The text search problem has a fairly standard solution based on a simple data structure

called inverted index, a.k.a. inverted file. An inverted index stores a mapping from each keyword

to a set of document IDs that include that keyword. Thus, one can quickly look up all of the

documents containing a keyword. Given multiple words in a query, one can take the intersection

of the sets of document IDs corresponding to the keywords to find the solution. We are making

many simplifying assumptions about the text search problem here, such as ignoring the tf-idf

weighting of the words, but the inverted index is one of the key ideas behind current search

engines. With some smart modifications [ZM06], one can make this idea work remarkably well

on billions of documents and quite a few words within each query. There exist well-known open

source packages such as Apache Lucene addressing this task [LUC].
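The core of an inverted index fits in a few lines of Python (our own illustrative sketch, ignoring tf-idf weighting and ranking, as in the simplified description above): map each word to the set of document IDs containing it, and answer a multi-keyword query by intersecting those sets.

    from collections import defaultdict

    def build_inverted_index(docs):
        # docs: dict mapping document ID -> document text.
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for word in set(text.lower().split()):
                index[word].add(doc_id)
        return index

    def keyword_search(index, query):
        # IDs of documents that contain every query keyword.
        sets = [index.get(w, set()) for w in query.lower().split()]
        return set.intersection(*sets) if sets else set()

    docs = {1: "nearest neighbor search", 2: "neighbor graphs", 3: "text search engines"}
    index = build_inverted_index(docs)
    print(keyword_search(index, "neighbor search"))   # -> {1}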

If we could represent images and videos as sparse high-dimensional vectors, then one might

be able to make use of text search systems to also search visual data. An image search application

based on current text search engines easily scales to billions of data points, as our text search

systems have been optimized over the past decades to be fast, scalable, and accurate. The main

problem, however, is that the nature of NNS with dense features where query and dataset points

come from the same distribution is quite different from the nature of text search. That said,

reducing the dense NNS problem to sparse keyword search is an interesting research direction

which deserves further investigation.

Sivic and Zisserman in their seminal work [SZ03] propose to use vector quantization methods

to define visual words [CDF+04] for regions of images and videos. They represent images and

videos by histograms of visual words, and they use an inverted index data structure to carry

out the retrieval. Even though they do not directly address scalability to massive datasets,

one can hope to improve their approach to make it more scalable by sparsifying the feature

representations. However, recent work has shown that quantizing feature vectors using k-means

and its variants significantly degrades performance, and one can obtain better results with real-

valued representations [CLVZ11]. This creates a serious concern regarding the use of text search

approaches for visual data. Hence, the algorithms developed in this thesis aim to address NNS

problems for dense vectors, and we do not assume sparsity in the input representations.

1.3 Hashing for nearest neighbor search

A common approach to NNS, advocated by Indyk and Motwani [IM98, GIM99], hinges on using

several hash functions for which nearby points have a higher probability of collision than distant

points. Following this approach, one pre-processes a dataset by creating multiple hash tables

and populating them with the dataset points using their hash keys. Then, at query time, one

applies the hash functions to the query and retrieves the dataset entries that fall into the same

hash buckets as the query. This provides a set of approximate near neighbors for the query.


A key challenge for the hashing approach (a.k.a. cell probing) is to find an appropriate family of hash functions that guarantees a higher probability of collision for close points under a given metric. Indyk and Motwani [IM98] formalize this desired property of hash functions by defining the concept of locality sensitive hashing (LSH) as follows: A family F of hash functions f(·) is called (r1, r2, p1, p2)-sensitive if for any x, z ∈ Rp the following statements hold:

• if ‖x − z‖ ≤ r1, then P_{f∼F}[f(x) = f(z)] ≥ p1;

• if ‖x − z‖ ≥ r2, then P_{f∼F}[f(x) = f(z)] ≤ p2.

Note that the probability of collision is calculated for a random draw of a hash function f(.)

from F . In order for a locality sensitive hash function to be useful, it has to satisfy r1 < r2

and p1 > p2. Previous work [IM98, GIM99] proposed locality sensitive hash functions for NNS

on binary codes in Hamming distance. Such methods also extend to Euclidean distance by

embedding Euclidean structure into Hamming space. Later, [DIIM04] proposed an LSH scheme

based on p-stable distributions that works directly on points in Euclidean space. The follow-up

work of [AI06] improved the running time and space complexity of LSH-based approximate

Euclidean NNS.
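As a concrete example of a locality sensitive family, albeit for cosine similarity rather than Euclidean distance, one can hash a point to the sign of a random projection; the closer two directions are, the more likely they fall on the same side of a random hyperplane. The Python sketch below is our own illustration and is not code from the thesis.

    import numpy as np

    def sample_hyperplane_hash(p, rng):
        # One hash function f(x) = sign(r . x) with r drawn from N(0, I_p).
        r = rng.standard_normal(p)
        return lambda x: int(np.dot(r, x) >= 0)

    rng = np.random.default_rng(0)
    p = 32
    hashes = [sample_hyperplane_hash(p, rng) for _ in range(16)]

    x = rng.standard_normal(p)
    z = x + 0.1 * rng.standard_normal(p)   # a nearby point
    y = rng.standard_normal(p)             # an unrelated point

    def collisions(a, b):
        return sum(f(a) == f(b) for f in hashes)

    print(collisions(x, z), collisions(x, y))   # the nearby pair collides on more functions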

LSH schemes, such as the ones above, make no prior assumption about the data distribution,

and come with theoretical guarantees that the LSH property holds for a specific metric under

any data distribution. In contrast, we advocate machine learning methods that explicitly exploit

empirical data distributions. In particular, we advocate the formulation of techniques to learn

similarity preserving hash functions from the data, which provide compact hash codes that are

extremely effective for a specific dataset. Not surprisingly, there has been a surge of recent

research on learning hash functions [SVD03, SH09, TFW08, KD09, BTF11b], thereby taking

advantage of the data distribution. These techniques typically outperform LSH and its variants,

at the expense of a training phase.

As a simple example, consider Euclidean NNS and a hash function based on k-means clustering. One can run k-means on a set of training points to divide the space into several Voronoi cells, and a hash function can simply map points to their corresponding Voronoi cell IDs. Obviously, points that fall into the same cell (mapped to the same hash code) are more likely to have a small Euclidean distance than points that fall in different cells. This simple hashing method can act as a filtering stage to return a short list of near neighbor candidates for further inspection by more advanced techniques. Despite its simplicity, this is the basis for some of the current state-of-the-art Euclidean NNS algorithms [JDS11, BL12].
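A minimal sketch of this k-means hashing idea, using plain NumPy (our own illustration with hypothetical helper names, not the thesis implementation): learn centers on the data, bucket the database points by nearest-center ID, and at query time inspect only the bucket of the query's cell.

    import numpy as np

    def assign(X, centers):
        # Nearest-center (Voronoi cell) ID for each row of X.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return d2.argmin(axis=1)

    def kmeans(X, k, iters=20, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)].copy()
        for _ in range(iters):
            a = assign(X, centers)
            for j in range(k):
                if np.any(a == j):
                    centers[j] = X[a == j].mean(axis=0)
        return centers

    rng = np.random.default_rng(1)
    X = rng.standard_normal((1000, 16))            # database points
    centers = kmeans(X, k=32)
    buckets = assign(X, centers)                   # hash code = cell ID per point

    q = rng.standard_normal(16)                    # query
    cell = assign(q[None, :], centers)[0]
    candidates = np.flatnonzero(buckets == cell)   # short list for further inspection
    print(len(candidates))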

1.4 Sketching with compact discrete codes

A key challenge facing scalable similarity search is the large storage cost associated with massive

datasets of high dimensional data. There is a natural trade-off between storage cost and query

time in most nearest neighbor search algorithms, i.e., algorithms which consume more storage

tend to be faster and vice versa. For most practical applications, however, we are not even

able to store the entire raw dataset, which requires O(np) storage, in memory, let alone run algorithms that require superlinear storage with large exponents and constants. Unfortunately, many approximate nearest neighbor search algorithms require superlinear storage [GIM99, AI06], and hence their practical impact has been limited despite their theoretical allure.

Figure 1.1: An illustration of binary sketching for similarity search. The sketch function maps similar items to nearby codes, and dissimilar items to distant codes.

A family of search algorithms that has received increasing interest in computer vision and machine learning develops dimensionality reduction techniques that produce compact and discrete representations of the data. These methods exploit similarity-preserving sketch

functions to map data points to compact fingerprint codes, while maintaining the similarity

structure of the data.

The idea of sketching is almost the same as hashing, and their subtle difference is often ignored in applied fields, where the term hash function is often used to refer to a sketch function. Sketch functions map data points to short codes or fingerprints, which provide sufficient statistics for differentiating close and distant pairs of points. Suppose points x and z are mapped to sketches f(x) and f(z). Then, f(x) and f(z) should be sufficient to approximate ‖x − z‖, or at least to answer whether ‖x − z‖ ≤ r1 or ‖x − z‖ ≥ r2 for r2 > r1. Hence, hash functions can be thought of as a form of sketch function, on which we are only allowed to check for collision, or the equality of codes, i.e., whether f(x) = f(z). Sketch functions, in contrast, support more involved calculations on f(x) and f(z), such as computing ‖f(x) − f(z)‖H, the Hamming distance between f(x) and f(z), when the output of f is binary.
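For binary sketches packed into machine words, the Hamming distance ‖f(x) − f(z)‖H reduces to an XOR followed by a bit count, which is why such comparisons are very fast in practice. A tiny Python illustration (our own, with arbitrary example codes):

    def hamming_distance(a, b):
        # Number of bit positions at which the two codes differ.
        return bin(a ^ b).count("1")

    h = 0b110010101
    g = 0b100010111
    print(hamming_distance(h, g))   # -> 2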

Fig. 1.1 shows an example of binary sketch functions that map images to binary codes

such that images with similar content are mapped to codes with small Hamming distance, and

dissimilar images are mapped to distant binary codes. The benefit of this general approach

is twofold. First, compact discrete codes require much less storage compared to the original

high-dimensional data. Second, discrete codes can be used as hash keys to enable fast hash

indexing, keeping in mind that hash cells with nearby hash keys will contain similar data

points.

Figure 1.2: Visualization of the pairwise hinge loss for learning binary hash (sketch) functions, plotted as a function of the Hamming distance ‖h − g‖H: ℓ_hinge(h, g, 1) for similar data points (with threshold ρ − 1) and ℓ_hinge(h, g, 0) for dissimilar data points (with threshold ρ + 1).

1.5 Our approach to learning hash functions

In this thesis, we advocate the use of compact binary codes for scalable similarity search. We

discuss three ways of learning binary sketch functions from data in Chapters 2, 3, and 5. In the

remainder of the thesis, we use the term hash function to refer to a sketch function in order to be

consistent with the literature that does not differentiate sketching and hashing [SVD03, SH09,

TFW08, KD09].

In Chapter 2, we propose a method for learning binary linear threshold functions that map

high dimensional data onto binary codes. Our formulation is based on structured prediction

with latent variables and a pairwise hinge loss function. We assume that training data is

organized into pairs of similar and dissimilar points, which should be mapped to codes that

preserve the similarity labels. Given a hyper-parameter ρ, binary codes h and g are considered

similar if their Hamming distance is smaller than ρ, i.e., ‖h− g‖H ≤ ρ− 1. Conversely, codes

h and g are considered dissimilar if ‖h−g‖H ≥ ρ+ 1. We use a pairwise hinge loss depicted in

Fig. 1.2 to optimize the parameters of the hash function for pairs of similar and dissimilar training

data points. The proposed learning algorithm is efficient to train for large datasets, scales well

to large code lengths, and outperforms state-of-the-art methods.

For some tasks and datasets, classifying the training data into similar vs. dissimilar pairs is

nearly impossible. Moreover, compared to more advanced projections by multilayer neural networks, linear threshold functions are limited in their expressive power. We address both of these

concerns by presenting a framework for learning a broad family of non-linear mapping functions

using a flexible form of triplet ranking loss. The training dataset D, as shown in Fig. 1.3, is

organized into triplets of exemplars, D = {(x_i, x_i^+, x_i^−)}_{i=1}^n, such that x_i is more similar to x_i^+ than to x_i^−. We define a triplet ranking loss function that penalizes a triplet of binary codes when the Hamming distance between the more similar pair is larger than the Hamming distance between the less similar pair. Employing this loss function, we aim to learn hash functions that satisfy as many triplet ranking constraints as possible. We overcome the discontinuous optimization of the discrete mapping by minimizing a piecewise smooth upper bound on the empirical loss. A new loss-augmented inference algorithm that is quadratic in the code length is proposed. We use stochastic gradient descent for scalable optimization.

Figure 1.3: An illustration of training data organized into triplets, D = {(x_i, x_i^+, x_i^−)}_{i=1}^n, such that x_i is more similar to x_i^+ than to x_i^−.

Figure 1.4: Illustration of a Hamming ball with a radius of r bits in the vicinity of the code 0000.

1.6 Search in Hamming space

There has been growing interest in representing image data and feature descriptors in terms

of compact binary codes, often to facilitate fast near neighbor search and feature matching

in vision applications (e.g., [AOV12, CLSF10, SVD03, SBBF12, TFW08, KGF12]). Nearest

neighbor search (NNS) on binary codes is used for image search [RL09, TFW08, WTF08],

matching local features [AOV12, CLSF10, JDS08, SBBF12], image classification [BTF11a], etc.

Sometimes the binary codes are generated directly as feature descriptors for images or image

patches, such as BRIEF or FREAK [CLSF10, BTF11a, AOV12, TCFL12], and sometimes

binary corpora are generated by similarity-preserving hash functions from high-dimensional

data, as discussed above. Regardless of the algorithm used to generate the binary codes, one

has to develop algorithms for search in Hamming space that scale to massive datasets.

To facilitate NNS in Hamming space, previous work suggests creating hash tables on the binary codes in the dataset, and retrieving the contents of the hash buckets in the vicinity of a query code to find near neighbors. The problem is that the number of hash buckets within a Hamming ball around a query, which one might have to examine in order to find near neighbors (see Fig. 1.4), grows near-exponentially with the search radius. When binary codes are longer than 64 bits, even with a small search radius, the number of buckets to examine may be larger than the number of items in the database, making this approach slower than a linear scan.
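To make this growth concrete, the number of distinct buckets within Hamming radius r of a q-bit query is the Hamming ball volume ∑_{i=0}^{r} C(q, i). The short check below (our own illustration) shows how quickly this count approaches, and then exceeds, the size of a billion-entry database for 64-bit codes.

    from math import comb

    def buckets_within_radius(q, r):
        # Number of q-bit codes within Hamming distance r of a fixed code.
        return sum(comb(q, i) for i in range(r + 1))

    for r in (3, 6, 10):
        print(64, r, buckets_within_radius(64, r))
    # For q = 64, the count passes 10^11 by r = 10, i.e., more buckets
    # than items in a billion-entry database.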

In Chapter 4 of the thesis, inspired by the work of Greene, Parnas, and Yao [GPY94],

we introduce a rigorous way to build multiple hash tables on binary code substrings to enable

exact k-nearest neighbor search in Hamming distance. Our approach, called multi-index hashing

(MIH), is storage efficient and straightforward to implement. We present a theoretical analysis that shows that the algorithm exhibits sublinear run-time behavior for uniformly distributed codes. In addition, our empirical results with non-uniformly distributed codes show dramatic speedups

over a linear scan baseline for datasets of up to one billion codes of 64, 128, or 256 bits.

The algorithm for searching binary codes complements the methods for learning binary hash

functions to build a full system for large-scale similarity search.

1.7 Vector quantization for nearest neighbor search

Sketch functions map data points to short codes that provide sufficient statistics to differentiate

close and distant pairs of points. Vector quantizers map data points to compressed codes

sufficient to approximately reconstruct the original data. They can be thought as a form of

sketch functions too. Let f(x) denote the quantization of x, and suppose x can be reconstructed

by f−1(f(x)). Then, we can compute the distance between x and z by simply computing

‖f−1(f(x))−f−1(f(z))‖. The benefit is that we are no longer required to keep high-dimensional

vectors x and z in memory, but only having access to compressed codes f(x) and f(z) suffices

to estimate ‖x − z‖. For this approach to be effective, we expect estimation of distance given

f(x) and f(z) to be faster than simply computing ‖x− z‖, so we are interested in a sub-family

of vector quantizers that allow fast distance estimation.

As an example, consider quantization by mapping data points to their k-means cluster

centers. Given a set of k centers denoted {C(i)}_{i=1}^k, the quantizer f(x) is defined as:

f(x) = argmin_i ‖x − C(i)‖_2^2 .        (1.2)

One can approximate the Euclidean distance ‖x − z‖_2^2 by the distance between the cluster centers associated with x and z, i.e., ‖C(f(x)) − C(f(z))‖_2^2. Pairwise distances between cluster centers

can be precomputed and stored in a lookup table to provide a fast way for distance estimation.

One way to reduce the error in distance estimation is to only quantize the database points

and not the query [JDS11]. This way, one can approximate distance between x and z by

‖C(f(x)) − z‖_2^2. For each query z, a query-specific lookup table can be created that stores

the distance between z and all of the k cluster centers. As long as k is much smaller than n,

creating the lookup table for distance estimation with k-means is effective.
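A minimal NumPy sketch of this asymmetric scheme (our own illustration; the codebook here is simply a random subset of the data standing in for a properly trained k-means codebook): database points are stored only as center IDs, and each query builds a k-entry table of its distances to the centers, so each distance estimate becomes a single table lookup.

    import numpy as np

    rng = np.random.default_rng(0)
    k, p, n = 256, 8, 5000
    X = rng.standard_normal((n, p))                       # database points
    centers = X[rng.choice(n, k, replace=False)].copy()   # stand-in codebook

    # Encoding: keep only the nearest-center ID for each database point.
    d2 = (X ** 2).sum(1, keepdims=True) - 2 * X @ centers.T + (centers ** 2).sum(1)
    codes = d2.argmin(axis=1)

    # Query time: one lookup table of squared distances from z to all k centers.
    z = rng.standard_normal(p)
    table = ((centers - z) ** 2).sum(-1)

    est = table[codes]                       # estimated ||x_i - z||^2, one lookup per point
    exact = ((X - z) ** 2).sum(-1)
    print(np.corrcoef(est, exact)[0, 1])     # estimates correlate with the true distances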

We note that vector quantizers can be used for hashing as well as distance estimation.

When used for hashing, as described in Section 1.3, all of the data points sharing the same

quantization code are mapped to a hash bucket, which can be accessed efficiently by a query lookup. In this case, we need to keep the number of quantization regions small, roughly

in the range of n, so we do not allocate many hash buckets that are empty. However, when

quantization is used for distance estimation, in order to reduce the estimation error we need

to increase the number of quantization regions, so long as the memory footprint is acceptable.

This has led to hybrid approaches to Euclidean NNS [JDS11, BL12]: A coarse quantization of


the space is used with hashing to reduce the search problem to a reasonably short list of near

neighbor candidates. A fine quantization of the space is used for distance estimation to rank

the short list candidates and find the nearest neighbors. The fine quantization allows one to

save memory by discarding the original high-dimensional vectors, if keeping only the quantized

vectors yields sufficiently good approximate results.

For the quantization approach to be useful for distance estimation, we need small vector

quantization errors that yield small error in distance estimation. Increasing the number of

cluster centers in k-means is one way to reduce quantization error, but vector quantization

with k-means is slow and memory intensive, especially when k gets large. In this thesis, we develop new models related to k-means clustering with a compositional parametrization of cluster centers, so that the representational capacity and the effective number of quantization regions increase super-linearly in the number of parameters. This allows one to efficiently quantize data using billions or trillions of centers. We formulate two such models, Orthogonal k-means and Cartesian k-means. They are closely related to one another, to k-means, to methods for binary hash function optimization like ITQ [GL11], and to Product Quantization for vector quantization [JDS11]. With the help of these techniques one can devise fast and memory-efficient

models for Euclidean NNS.
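To see how a compositional parametrization yields a huge effective number of centers, consider a sketch in the spirit of Product Quantization [JDS11] (our own illustration, not the ck-means formulation of Chapter 5): split each vector into m subvectors, quantize each with its own small codebook of h sub-centers, and thereby represent h^m distinct centers while storing only m·h codewords.

    import numpy as np

    rng = np.random.default_rng(0)
    p, m, h = 32, 4, 16          # 4 subspaces with 16 sub-centers each -> 16^4 = 65536 centers
    sub = p // m

    X = rng.standard_normal((2000, p))
    # Stand-in sub-codebooks; in practice each is learned, e.g., by k-means on its subvectors.
    codebooks = [X[rng.choice(len(X), h, replace=False), j * sub:(j + 1) * sub].copy()
                 for j in range(m)]

    def encode(x):
        # m small integer codes, one per subvector.
        return [int(((codebooks[j] - x[j * sub:(j + 1) * sub]) ** 2).sum(-1).argmin())
                for j in range(m)]

    def decode(code):
        # Reconstruct by concatenating the selected sub-centers.
        return np.concatenate([codebooks[j][c] for j, c in enumerate(code)])

    x = X[0]
    code = encode(x)
    print(code, float(np.linalg.norm(x - decode(code))))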

1.8 Thesis Outline

• Chapters 2 and 3 of the thesis focus on learning binary hash (sketch) functions from

training data, as summarized in Section 1.5.

• Chapter 4 presents an algorithm for fast exact NNS on binary codes in Hamming distance,

as summarized in Section 1.6.

• Chapter 5 discusses compositional quantization models useful for hashing and distance

estimation with compressed codes, as summarized in Section 1.7.

• Chapter 6 concludes the thesis, with a discussion of some interesting directions for future

research.

1.9 Relationship to Published Papers

The chapters in this thesis describe work that has been published in the following conference

and journal papers:

Chapter 2: M. Norouzi and D. J. Fleet, Minimal Loss Hashing for Compact Binary Codes,

ICML 2011 [NF11].

Chapter 3: M. Norouzi, D. J. Fleet, and R. Salakhutdinov, Hamming Distance Metric Learn-

ing, NIPS 2012 [NFS12].


Chapter 4: M. Norouzi, A. Punjani, and D. J. Fleet, Fast Search in Hamming Space with

Multi-Index Hashing, CVPR 2012 [NPF12].

M. Norouzi, A. Punjani, and D. J. Fleet, Fast Exact Search in Hamming Space

with Multi-Index Hashing, TPAMI 2014 [NPF14].

Chapter 5: M. Norouzi, D. J. Fleet, Cartesian k-means, CVPR 2013 [NF13].

Chapter 2

Minimal Loss Hashing

A common approach to approximate nearest neighbor search, well suited to high-dimensional

data, uses similarity-preserving hash functions, where similar/dissimilar pairs of inputs are

mapped to nearby/distant hash codes. One can preserve Euclidean distances, e.g., with Locality-

Sensitive Hashing (LSH) [IM98], or one might want to preserve the similarity associated with

object category labels, or real-valued affinities associated with training exemplars.

Using compact binary codes as hash keys is particularly useful for nearest neighbor search

(NNS). If the neighbors of a query fall within a small Hamming ball in Hamming space, then

search can be accomplished in sublinear time, by enumerating over all of the binary hash codes

within the Hamming ball in the vicinity of the query code. Even an exhaustive linear scan

through the database of binary codes enables very fast search. Moreover, compact binary codes

allow one to store large databases in memory.

Finding a suitable mapping of the data onto binary codes has a profound impact on the

quality of a hashing-based NNS system. Random projections are used in LSH [IM98, Cha02] and

related methods [RL09, Bro97]. They are dataset independent, make no prior assumption about

the data distribution, and come with theoretical guarantees that specific metrics (e.g., cosine

similarity) are increasingly well preserved in Hamming space as the code length increases. But

they require large code lengths for good retrieval accuracy, and they are not applicable to

general similarity measures, like human ratings.

To find better, more compact codes, recent research has turned to machine learning techniques that optimize mappings for specific datasets (e.g., [KD09, SH09, SVD03, TFW08, BTF11b]). Most learning methods aim to preserve the Euclidean structure of the input datasets (e.g., [GL11, KD09, WTF08]). However, some papers have also considered more generic measures of similarity. The unsupervised multilayer neural nets of Salakhutdinov and Hinton [SH09] aim to discover semantic similarity using deep autoencoders with stochastic binary code layers. Shakhnarovich, Viola, and Darrell [SVD03] exploit boosting to learn binary hash bits greedily from supervised similarity labels. By contrast, the method that we propose in this chapter is

supervised and not sequential; it optimizes all the code bits simultaneously.

The task at hand is to find a hash function that maps high-dimensional inputs, x ∈ Rp,


onto q-bit binary codes, h ∈ Hq ≡ {−1, 1}q, which preserves some notion of similarity. The

canonical approach assumes centered (mean-subtracted) inputs, linear projection, and binary

quantization. Such hash functions, parameterized by W ∈ Rq×p, are given by

b_lin(x; w) = sign(Wx) ,        (2.1)

where w ≡ vec(W), and the ith bit of the vector sign(Wx) is 1 iff the ith dimension of Wx is positive. In other words, the ith row of W determines the ith bit of the hash function in terms of a hyperplane in the input space; −1 is assigned to points on one side of the hyperplane, and 1 to points on the other side.¹

¹One can add an offset from the origin, but we find the gain is marginal. Nonlinear projections are also possible, but in this chapter we concentrate on linear projections.
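For concreteness, the linear threshold hash function in (2.1) amounts to a matrix-vector product followed by a sign, as in the short NumPy sketch below (our own illustration; the random W merely stands in for learned parameters).

    import numpy as np

    def b_lin(x, W):
        # q-bit code in {-1, +1}^q: project by W, then take the sign per dimension.
        return np.where(W @ x > 0, 1, -1)

    rng = np.random.default_rng(0)
    p, q = 64, 16
    W = rng.standard_normal((q, p))   # in MLH these rows are learned, not random
    x = rng.standard_normal(p)
    print(b_lin(x, W))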

The main difficulty in optimizing similarity-preserving binary hash functions stems from the discontinuity of the projection, resulting in a discontinuous learning objective function. At a high level, there exist at least three general ways to minimize a discontinuous objective. First, coordinate descent, where one iteratively optimizes each parameter dimension separately by exhaustive enumeration over a large collection of possible values (e.g., see [KD09]). Second, continuous relaxations, where one approximates b_lin(x; w) by a smooth function such as the hyperbolic tangent. Third, optimization via a continuous upper bound on the discontinuous objective, which is the approach that we follow in this work.

In this chapter, we formulate the learning of compact binary codes in terms of structured

prediction with latent variables using new classes of loss functions designed for preserving

similarity. We design a loss function specifically for hashing that takes Hamming distance

and binary quantization into account. Our novel formulation adopts the approach of latent

structural SVMs [YJ09] and an effective online learning algorithm. The resulting algorithm is

shown to outperform state-of-the-art methods.

2.1 Formulation

Turning to the formulation, let a training dataset D comprise n pairs of p-dimensional training points (x_i, x′_i) along with their similarity labels s_i ∈ {0, 1}, i.e., D ≡ {(x_i, x′_i, s_i)}_{i=1}^n. The data points x_i and x′_i are similar when the binary similarity label is 1 (s_i = 1), and dissimilar when s_i = 0. To preserve a specific metric (e.g., Euclidean distance) one can use binary similarity

labels obtained by thresholding pairwise distances. Alternatively, for preserving similarity based

on semantic content of examples, one can use a weakly supervised dataset in which each training

point is associated with a set of neighbors (similar exemplars), e.g., with the same class label,

and non-neighbors (dissimilar exemplars), e.g., with different class labels.

The quality of a mapping b_lin(x; w) is determined by a loss function ℓ_pair : H^q × H^q × {0, 1} → R+ that assigns a cost to a pair of binary codes and a similarity label. For binary codes h ∈ H^q, g ∈ H^q, and a label s ∈ {0, 1}, the loss function ℓ_pair(h, g, s) measures how compatible h and


g are with s. For example, when s = 1, the loss assigns a small cost if h and g are nearby

codes, and large cost otherwise. Ultimately, to learn w, we aim to minimize empirical loss over

training pairs:

L(w) = ∑_{i=1}^n ℓ_pair( b(x_i; w), b(x′_i; w), s_i ) .        (2.2)

2.1.1 Pairwise hinge loss

The loss function that we advocate is specific to learning binary hash functions, and bears strong

similarity to hinge loss used in SVMs. It includes a hyper-parameter ρ, which is a threshold

in the Hamming space that differentiates neighbors from non-neighbors. This is important for

learning hash codes, since we want similar training points to map to binary codes that differ

by no more than ρ bits. Non-neighbors should map to codes no closer than ρ bits.

Let ‖h − g‖H denote Hamming distance between binary codes h and g. Our hinge loss function, denoted ℓ_hinge, depends on ‖h − g‖H and not on the individual codes:

ℓ_hinge(h, g, s) = [ ‖h − g‖H − ρ + 1 ]+            for s = 1,
ℓ_hinge(h, g, s) = λ [ ρ − ‖h − g‖H + 1 ]+          for s = 0,        (2.3)

where [α]+ ≡ max(α, 0), and λ is another loss hyper-parameter that controls the ratio of the

slopes of the penalties incurred for similar vs. dissimilar points when they are too far apart

vs. too close. Linear penalties are useful as they are robust to outliers. Note that when similar

points are sufficiently close, or dissimilar points are distant, our loss does not impose any

penalty. The ℓ_hinge(h, g, s) loss is depicted in Fig. 1.2.
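Because the loss in (2.3) depends only on the Hamming distance and the similarity label, it is straightforward to evaluate, as in the short Python sketch below (our own illustration; the values of ρ and λ are arbitrary examples).

    def pairwise_hinge_loss(h, g, s, rho, lam):
        # h, g: codes as sequences of {-1, +1}; s = 1 for similar, 0 for dissimilar pairs.
        d = sum(int(hi != gi) for hi, gi in zip(h, g))   # Hamming distance
        if s == 1:
            return max(d - rho + 1, 0)
        return lam * max(rho - d + 1, 0)

    h = [1, -1, 1, 1, -1, 1, -1, 1]
    g = [1, 1, 1, -1, -1, 1, -1, -1]
    print(pairwise_hinge_loss(h, g, s=1, rho=2, lam=0.5))   # distance 3 -> loss 2
    print(pairwise_hinge_loss(h, g, s=0, rho=2, lam=0.5))   # distance 3 -> loss 0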

2.1.2 Binary Reconstructive Embedding

Our loss-based framework for learning binary hash functions is inspired by Binary Reconstructive Embedding (BRE) introduced by Kulis and Darrell [KD09]. The BRE uses a different pairwise loss function that penalizes the squared difference between normalized Hamming distance in binary codes and a real-valued distance in the input space. Given two q-bit codes h and g for data points x and x′, and a parameter 0 ≤ d ≤ 1 which represents a measure of distance between x and x′, the BRE loss takes the form of

ℓ_bre(h, g, d) = ( ( (1/q) ‖h − g‖H )^2 − d^2 )^2 .        (2.4)

The BRE [KD09] assumes that inputs are unit norm, and uses d = ½ ‖x − x′‖2. This is equivalent to d = 1 − cos(θ(x, x′)) for unnormalized x and x′, which makes the ℓ_bre loss particularly suitable

for preserving cosine similarity, and relates BRE to angular LSH [Cha02]. That said, other

normalized distance measures (between 0 and 1) can be used within the BRE loss too to define

the value of d, for instance d = 1 − s. The BRE method [KD09] addresses the difficulty in optimizing the empirical loss in (2.2) by

using coordinate descent. At each iteration of the optimization, the BRE adjusts one entry of W by exhaustive search. By changing a single entry of W, denoted W[a,b], only one bit in the output binary codes can change (i.e., the ath bit). For each training data point, one can compute the value of W[a,b] that flips the ath bit of the code for that data point. Therefore, a set of n thresholds is computed, one for each training point, and the optimal threshold is selected as the new value of W[a,b] by exhaustive evaluation of the empirical loss over all thresholds. During training, to enable faster parameter updates, the BRE caches q-dimensional real-valued linear projections of the training data points. This incurs a high storage cost for training, making training on large datasets impractical.

In the BRE optimization, coordinate descent is possible because of the restricted form of the

hash functions b_lin(·). For a more general family of hash functions, e.g., those based on thresholded multilayer neural networks, coordinate descent is no longer possible, as changing one entry in the weights from −∞ to +∞ may flip multiple bits in the output codes several times. By contrast, the approach that we propose in this chapter can be applied to both b_lin(·) and a more

general family of hash functions, discussed in Section 3.

2.2 Bound on empirical loss

The empirical loss in (2.2) is discontinuous and typically non-convex, making optimization

difficult. Rather than directly minimizing empirical loss, we instead formulate, and minimize,

a piecewise linear upper bound on empirical loss. Our bound is inspired by a bound used, for

similar reasons, in latent structural SVMs [YJ09].

We first re-express the hash function b_lin(x; w) as a form of structured prediction:

b_lin(x; w) = sign(Wx)                              (2.5a)
            = argmax_{h ∈ H^q} h^T W x               (2.5b)
            = argmax_{h ∈ H^q} w^T ψ(x, h) ,          (2.5c)

where ψ(x, h) ≡ vec(h x^T). Here, w^T ψ(x, h) acts as a scoring function that determines the

relevance of input-output pairs, based on a weighted sum of features in their joint feature

vector ψ(x,h). Note that other forms of ψ(x,h) are possible too, leading to other types of hash

functions. For example, one may consider pairwise weights for interactions between binary bits within h, which would require a binary quadratic optimization for inference. That said, this

chapter focuses on the simplest family of hash functions based on linear threshold functions.
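The equivalence of (2.5a) and (2.5b) is easy to verify numerically for a small code length by brute force over all 2^q codes; the snippet below is our own sanity check, not part of the thesis.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    q, p = 6, 10
    W = rng.standard_normal((q, p))
    x = rng.standard_normal(p)

    # (2.5a): componentwise sign of the projection.
    h_sign = np.where(W @ x > 0, 1, -1)

    # (2.5b): brute-force argmax of h^T W x over all h in {-1, +1}^q.
    codes = [np.array(c) for c in itertools.product([-1, 1], repeat=q)]
    h_argmax = max(codes, key=lambda h: float(h @ (W @ x)))

    print(np.array_equal(h_sign, h_argmax))   # True (ignoring ties at exactly zero)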

To motivate our upper bound on empirical loss, we begin with a short review of the bound

commonly used for structural SVMs [TGK03, THJA04].


2.2.1 Structural SVM

In structural SVMs, given input-output training pairs {(x_i, y∗_i)}_{i=1}^n, one aims to learn a mapping from inputs to discrete outputs in terms of a parameterized scoring function f(x, y; w), such that the model's prediction ŷ,

ŷ = argmax_y f(x, y; w) ,        (2.6)

correlates closely with the ground-truth label y∗. Given a loss function on the output domain, ℓ(·, ·), the structural SVM with margin-rescaling introduces a margin violation (slack) variable for each training pair, and minimizes the sum of slack variables. For a pair (x, y∗), the slack is defined as

max_y [ ℓ(y, y∗) + f(x, y; w) ] − f(x, y∗; w) .        (2.7)

Importantly, the slack variables provide an upper bound on the loss for the model's prediction ŷ:

ℓ(ŷ, y∗) ≤ max_y [ ℓ(y, y∗) + f(x, y; w) ] − f(x, ŷ; w)        (2.8a)
         ≤ max_y [ ℓ(y, y∗) + f(x, y; w) ] − f(x, y∗; w) .       (2.8b)

To see the inequality in (2.8a), note that if the first term on the RHS of (2.8a) is maximized by y = ŷ, then the f terms cancel, and (2.8a) becomes an equality. Otherwise, the optimal value of the max term must be larger than when y = ŷ, which causes the inequality. The second inequality (2.8b) follows straightforwardly from the definition of ŷ in (2.6); i.e., f(x, ŷ; w) ≥ f(x, y; w) for all y, including y∗. The bound in (2.8b) is piecewise linear, convex in w, and easier to optimize than the empirical loss. The structural SVM formulates learning as the minimization of the sum of the bounds in (2.8b) over all training examples (i.e., the sum of slack variables), plus a regularizer on w.

2.2.2 Convex-concave bound for hashing

The difference between learning hash functions and the structural SVM is that the binary codes

for our training data are not known a priori. However, note that the tighter bound in (2.8a)

uses y∗ only in the loss term. This is useful for hash function learning, as suitable loss functions for hashing, such as ℓ_bre and ℓ_hinge, do not require ground-truth labels, but a pair of binary

codes. The bound (2.8a) is piecewise linear, convex-concave (a sum of convex and concave

terms), and is the basis for structural SVMs with latent variables [YJ09]. Below we formulate

a similar bound for learning binary hash functions.

Our upper bound on a generic pairwise loss function ℓ_pair, given a pair of inputs x and x′, a similarity label s, and the parameters of the hash function w, has the form

`pair( b lin(x;w), b lin(x′;w), s)

≤ maxg,g′∈Hq

{`pair(g,g

′, s) + gTWx + g′TWx′

}− max

h∈HqhTWx− max

h′∈Hqh′

TWx′ .

(2.9)

It follows from the definition of b_lin(·) that the second and third terms on the RHS of (2.9) are maximized by h = b_lin(x;w) and h′ = b_lin(x′;w). If the first term were maximized by g = b_lin(x;w) and g′ = b_lin(x′;w), then the inequality in (2.9) would become an equality. For any other values of g and g′ that maximize the first term, the RHS can only increase, hence the inequality. The bound holds for any loss function ℓ_pair, including ℓ_bre and ℓ_hinge.

We formulate the optimization for learning the weights w of the hashing function, in terms

of minimization of the following convex-concave upper bound on empirical loss:

Σ_{i=1}^n ( max_{g_i, g′_i} { ℓ_pair(g_i, g′_i, s_i) + g_i^T W x_i + g′_i^T W x′_i }
            − max_{h_i} h_i^T W x_i − max_{h′_i} h′_i^T W x′_i ) .        (2.10)

2.2.3 Tightness of the bound and regularization

Regarding the tightness of the bound in (2.9), we present a proposition that helps clarify the nature of empirical loss optimization via minimizing the upper bound. Clearly, the loss ℓ_pair(b_lin(x;w), b_lin(x′;w), s) does not change with the norm of w, since b_lin(x;w) = b_lin(x;αw) for any scalar α > 0. However, a change in the norm of w does affect the upper bound in (2.9). We claim that the upper bound gets tighter as the norm of w grows. In other words, the bound for γw, for any γ > 1, is smaller than or equal to the bound for w:

max_{g,g′ ∈ H^q} { ℓ_pair(g, g′, s) + γ g^T W x + γ g′^T W x′ } − max_{h ∈ H^q} γ h^T W x − max_{h′ ∈ H^q} γ h′^T W x′
    ≤ max_{g,g′ ∈ H^q} { ℓ_pair(g, g′, s) + g^T W x + g′^T W x′ } − max_{h ∈ H^q} h^T W x − max_{h′ ∈ H^q} h′^T W x′ .        (2.11)

We provide an algebraic proof of (2.11) in Section 2.A.

Given the proposition (2.11), one undesirable way to minimize the upper bound is to increase the norm of w, which does not affect the loss, only the bound. In particular, when γ goes to +∞, it is easy to see that the upper bound and the actual loss become equivalent, as the score terms dominate the maximization over g and g′ (unless Wx and Wx′ are zero). Hence, when ‖w‖ is very large, the upper bound becomes very tight and almost piecewise constant in w, so using the gradient of the bound for optimization with respect to w is hopeless. On the other hand, when γ goes to zero, the score terms no longer affect the maximization over g and g′, and all of the terms except the loss go to zero, so the upper bound becomes the constant max_{g,g′} { ℓ_pair(g, g′, s) }.


To prevent w from growing very large during optimization, we use a regularizer on ‖w‖₂². According to our experiments, using a regularizer on w leads to a smaller value of empirical loss after convergence. We believe that constraining the norm of w makes the upper bound looser, but also smoother, leading to more progress by the gradient-based optimizer. Because the bound is non-convex, gradient-based optimization is one of the few options available. The regularizer that we choose for the optimization of b_lin is a set of hard constraints on the ℓ2 norm of the rows of W. This way we have control over the norm of each hyperplane separately.

Including a regularizer, here is the surrogate objective that we aim to minimize given a

training dataset of n similar / dissimilar pairs of data points (xi,x′i) and their labels si:

Σ_{i=1}^n ( max_{g_i, g′_i} { ℓ_pair(g_i, g′_i, s_i) + g_i^T W x_i + g′_i^T W x′_i }
            − max_{h_i} { h_i^T W x_i } − max_{h′_i} { h′_i^T W x′_i } )    s.t.  ∀ 1 ≤ j ≤ q :  ‖W[j,·]‖₂² ≤ ν ,        (2.12)

where ν is a hyper-parameter controlling the regularization, and W[j,·] is the jth row of W.

2.3 Optimization

Minimizing (2.12) to find w entails the maximization of three terms for each training pair

(xi,x′i). The second and third terms are trivially maximized directly by the hash function

itself. Maximizing the first term is, however, not trivial. In the structural SVM literature,

optimizing this term is called loss-augmented inference. The next section describes an efficient

algorithm for finding the exact solution of loss-augmented inference for hash function learning

with pairwise losses.

2.3.1 Loss-augmented inference with pairwise hashing loss

To solve loss-augmented inference, one needs to find a pair of binary codes ĝ and ĝ′ given by

(ĝ, ĝ′) = argmax_{(g,g′) ∈ H^q × H^q} { ℓ_pair(g, g′, s) + g^T W x + g′^T W x′ } .        (2.13)

We solve loss-augmented inference exactly and efficiently for loss functions of the form

ℓ_pair(g, g′, s) = ℓ( ‖g − g′‖_H , s ) ,        (2.14)

such as ℓ_hinge and ℓ_bre, which depend on the Hamming distance between g and g′ but not the specific

bit sequences g and g′. Before deriving a general solution, first consider a specific case for

which we restrict the Hamming distance between g and g′ to be m, i.e., ‖g − g′‖H = m. For

q-bit codes, m is an integer between 0 and q. When ‖g− g′‖H = m, the loss in (2.13) depends


on m and s, but not on g or g′. Thus, instead of (2.13), we can now solve

ℓ(m, s) + max_{g,g′} { g^T W x + g′^T W x′ }    s.t.  ‖g − g′‖_H = m .        (2.15)

The key to finding the two codes that maximize (2.15) is to decide which m bits in the two

codes should be different.

Let v[k] denote the kth dimension of a vector v. We can compute the joint contribution of the kth bits of g and g′ to (g^T W x + g′^T W x′) by

cont(k, g[k], g′[k]) = g[k] (Wx)[k] + g′[k] (Wx′)[k] ,        (2.16)

and these contributions can be computed for the four possible states of the kth bits. Let δk

represent how much is gained by setting the bits g[k] and g′[k] to be different rather than the

same, i.e.,

δ_k = max( cont(k, 1, −1), cont(k, −1, 1) ) − max( cont(k, −1, −1), cont(k, 1, 1) ) .        (2.17)

Because g and g′ differ in only m bits, the optimal g and g′ are obtained by setting the m bits with the m largest δ_k's to be different. All other bits in the two codes should be the same. When g[k] and g′[k] must be different, their best values are found by comparing cont(k, 1, −1) and cont(k, −1, 1). Otherwise, they are determined by the larger of cont(k, −1, −1) and cont(k, 1, 1). We then solve (2.15) for all m, noting that δ_k needs to be computed only once per bit.

In sum, to solve the loss-augmented inference it suffices to find the m that provides the

largest value for the objective function in (2.15). We first sort the δk’s once, and for different

values of m, we compare the sum of the m largest δ_k's plus ℓ(m, s), and choose the m that

achieves the highest score. Afterwards, we determine the values of the bits according to their

contributions as described above.

Given the values of Wx and Wx′, this loss-augmented inference algorithm takes time

O(q log q). Aside from sorting the δ_k's, all steps are linear in q, which makes the inference efficient and scalable to large code lengths. The computation of Wx can be done once

per data point and cached, in case the data point is being used in multiple pairs.
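For concreteness, the following Python sketch (our own illustration, not code from the thesis) implements this loss-augmented inference for a generic pairwise loss ℓ(m, s) of the form (2.14), such as ℓ_bre or ℓ_hinge; the function name and the use of NumPy are assumptions of the sketch.

    import numpy as np

    def pairwise_loss_augmented_inference(Wx, Wxp, loss, s):
        # Wx, Wxp: the projections W x and W x' (length-q vectors).
        # loss(m, s): any pairwise loss that depends only on the Hamming distance m.
        q = len(Wx)
        # Best joint contribution of bit k when g[k] == g'[k], and when they differ.
        same = np.maximum(Wx + Wxp, -(Wx + Wxp))      # states (+1,+1) vs (-1,-1)
        diff = np.maximum(Wx - Wxp, -(Wx - Wxp))      # states (+1,-1) vs (-1,+1)
        delta = diff - same                           # gain from making bit k differ
        order = np.argsort(-delta)                    # bits sorted by decreasing gain
        prefix = np.concatenate(([0.0], np.cumsum(delta[order])))
        # Objective (2.15) as a function of m.
        scores = [loss(m, s) + same.sum() + prefix[m] for m in range(q + 1)]
        m_best = int(np.argmax(scores))
        # Recover the codes: the m_best most profitable bits differ, the rest agree.
        g = np.where(Wx + Wxp >= 0, 1, -1)
        gp = g.copy()
        for k in order[:m_best]:
            g[k], gp[k] = (1, -1) if Wx[k] >= Wxp[k] else (-1, 1)
        return g, gp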

2.3.2 Perceptron-like learning with pairwise loss

In Section 2.2.3, we formulated a convex-concave bound in (2.12) on empirical loss, which we

use as a surrogate objective for learning binary hash functions. In Section 2.3.1 we described

how the value of the bound could be computed at a given W for a given (xi,x′i, si). Now

we consider optimizing the objective, i.e., lowering the bound. A standard technique for mini-

mizing such objectives is called difference of convex (DC) programming or the concave-convex

procedure [YR03]. Applying this method to our problem, we should iteratively impute the

missing data (the binary codes b_lin(x_i) and b_lin(x′_i)) and optimize for the convex terms (the loss-augmented terms in (2.12)). However, our preliminary experiments showed that the convex-concave procedure is slow and not particularly effective for our optimization problem.

Figure 2.1: The upper bound and empirical loss as functions of optimization step.

Alternatively, inspired by the structured perceptron [Col02] and McAllester et al. [MHK10], we employ a stochastic gradient method with an iterative perceptron-like update rule. At iteration t of the optimization, let the current weight matrix be W^(t). Then, we randomly sample a training pair (x_t, x′_t) with a similarity label s_t, and compute

ĥ_t = sign( W^(t) x_t )                                            (2.18a)
ĥ′_t = sign( W^(t) x′_t )                                          (2.18b)
(ĝ_t, ĝ′_t) = argmax_{(g,g′) ∈ H^q × H^q} { ℓ_pair(g, g′, s_t) + g^T W^(t) x_t + g′^T W^(t) x′_t } .        (2.18c)

Next, we update the parameters according to the following simple learning rule:

W^(t+1) ← W^(t) − η ( ĝ_t x_t^T + ĝ′_t x′_t^T − ĥ_t x_t^T − ĥ′_t x′_t^T ) ,        (2.19)

where η is the learning rate, and we project the rows of W whose squared ℓ2 norm exceeds ν back into the feasible set:

For j = 1 to q :  if ‖W^(t+1)[j,·]‖₂² > ν,  then  W^(t+1)[j,·] ← √ν · W^(t+1)[j,·] / ‖W^(t+1)[j,·]‖₂ .        (2.20)

The update rule of (2.19) follows a noisy gradient descent direction of our convex-concave objective in (2.12). To see this, note that ∂(h^T W x)/∂W = h x^T. However, also note that the objective in (2.12) is piecewise smooth, due to the max operations, and thus not differentiable at isolated points. Hence, the gradient is not defined at such points, and since the objective is not convex, sub-gradient methods are not applicable. Thus, it is difficult to apply standard convergence proofs to this learning rule. While the theoretical properties of this learning algorithm need further investigation (e.g., see [MHK10]), we empirically verify that the update rule lowers the upper bound, and converges to a local minimum. Fig. 2.1 plots the empirical loss and the bound, computed over 10^5 training pairs, as a function of the iteration number.
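As a minimal sketch of one update of (2.18)–(2.20) — our own illustration, reusing the loss-augmented inference routine sketched in Section 2.3.1 and omitting the mini-batching, momentum, and ε scaling of Section 2.4 — a single step could look as follows:

    import numpy as np

    def mlh_sgd_step(W, x, xp, s, loss, eta, nu):
        # One perceptron-like update for a single training pair (x, x') with label s.
        h = np.where(W @ x >= 0, 1, -1)               # (2.18a)
        hp = np.where(W @ xp >= 0, 1, -1)             # (2.18b)
        g, gp = pairwise_loss_augmented_inference(W @ x, W @ xp, loss, s)   # (2.18c)
        W = W - eta * (np.outer(g, x) + np.outer(gp, xp)
                       - np.outer(h, x) - np.outer(hp, xp))                 # (2.19)
        # Project rows back into the feasible set ||W[j,:]||_2^2 <= nu.     # (2.20)
        norms = np.linalg.norm(W, axis=1)
        scale = np.minimum(1.0, np.sqrt(nu) / np.maximum(norms, 1e-12))
        return W * scale[:, None]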


2.4 Implementation details

To optimize (2.12) as a means of learning hash functions, one needs to select an appropriate ν to constrain the norms of the rows of W. In our experiments, we set ν = 1, and instead introduce another parameter, denoted ε, with an equivalent effect. The parameter ε is multiplied

by the value of the loss to obtain the following objective function,

Σ_{i=1}^n ( max_{g_i, g′_i} { ε·ℓ_pair(g_i, g′_i, s_i) + g_i^T W x_i + g′_i^T W x′_i }
            − max_{h_i} { h_i^T W x_i } − max_{h′_i} { h′_i^T W x′_i } )    s.t.  ∀ 1 ≤ j ≤ q :  ‖W[j,·]‖₂² ≤ 1 .        (2.21)

One can verify that given a pair (W, ν) from (2.12), W′ = W/√ν satisfies the constraints in (2.21), and ε = 1/√ν yields an objective with identical behavior. We select ε by validation on a set of candidate choices. The benefit of using ε, instead of ν, is that for different ε the range of W can stay the same, and similar learning rates can be used.

We initialize W using angular LSH [Cha02]; i.e., the entries of W are sampled i.i.d. from

a standard normal density N (0, 1), and each row is then normalized to have unit length. This

initialization is particularly well suited for preservation of cosine similarity.

The learning rule in (2.19) is used with several minor modifications: 1) In loss-augmented

inference (2.13), the loss is multiplied by ε. 2) We use mini-batches of size 100 to compute the

gradient. 3) We use a momentum term, which adds the gradient of the previous step with a

ratio of 0.9 to the current gradient.

For each experiment, we select 10% of the training set as a validation set. We choose the

loss hyper-parameters ρ and λ by validation on a few candidate choices. We allow the candidate

choices for ρ to linearly increase with the code length. Each epoch includes a random sample of

10^5 data-point pairs, independent of the mini-batch size or the number of training points. For validation, we optimize parameters using 100 epochs, and for training, we use 2000 epochs. For small datasets, a smaller number of epochs was used. Using fewer epochs for validation than for training is not ideal, but to accelerate the experiments we chose to stop validation iterations after fewer epochs. We found that even with fewer epochs, validation gives very good results.

2.5 Experiments

We compare our approach, minimal loss hashing (MLH), with several state-of-the-art methods. Results for binary reconstructive embedding (BRE) [KD09], spectral hashing (SH) [WTF08], shift-invariant kernel hashing (SIKH) [RL09], and multilayer neural nets with supervised fine-tuning (NNCA) [TFW08] were obtained with implementations generously provided by their respective authors. For locality-sensitive hashing (LSH) [Cha02], we used our own implementation. We show results of SIKH for experiments with larger datasets and longer code lengths


only, because it was not competitive otherwise.

Each dataset comprises a training set, a test set, and a set of ground-truth neighbors. For

evaluation, we compute precision and recall for points retrieved within a Hamming distance

R of codes associated with the test queries. Precision as a function of R is H/T , where T

is the total number of points retrieved within a Hamming ball of radius R, and H is the number of true neighbors among them. Recall as a function of R is H/G, where G is the total number of

ground-truth neighbors.
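As a small illustration (our own sketch, not code from the thesis), these per-query precision and recall values can be computed as follows; the function name and the conventions for an empty retrieval set are assumptions of the sketch.

    import numpy as np

    def precision_recall_at_radius(query_code, db_codes, is_neighbor, R):
        # query_code: (q,) in {-1,+1}; db_codes: (n, q); is_neighbor: (n,) boolean.
        dist = np.sum(db_codes != query_code, axis=1)      # Hamming distances
        retrieved = dist <= R
        T = retrieved.sum()                                # points in the Hamming ball
        H = np.logical_and(retrieved, is_neighbor).sum()   # true neighbors among them
        G = is_neighbor.sum()                              # all ground-truth neighbors
        precision = H / T if T > 0 else 1.0
        recall = H / G if G > 0 else 0.0
        return precision, recall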

2.5.1 Six datasets

We first mirror the experiments of Kulis and Darrell [KD09] with five datasets²: Photo-tourism,

a corpus of image patches represented as 128D SIFT features [SSS06]; LabelMe and Peekaboom,

collections of images represented as 512D Gist descriptors [TFW08]; MNIST, 784D greyscale

images of handwritten digits; and Nursery, 8D features. We also use a synthetic dataset

comprising uniformly sampled points from a 10D hypercube [WTF08]. Like Kulis and Darrell, we

used 1000 random points for training, and 3000 points (where possible) for testing; all methods

used identical training and test sets. The neighbors of each data-point are defined with a

dataset-specific threshold. On each training set we find the Euclidean distance at which each

point has, on average, 50 neighbors. This defines ground-truth neighbors and non-neighbors for

training, and for computing precision and recall statistics during testing.

For preprocessing, each dataset is mean-centered. For all but the 10D Uniform data, we

then normalize each datum to have unit length. Because some methods (BRE, SH, SIKH)

improve with dimensionality reduction prior to training and testing, we apply PCA to each

dataset (except 10D Uniform and 8D Nursery) and retain a 40D subspace. MLH often performs

slightly better on the full datasets, but we report results for the 40D subspace, to be consistent

with the other methods.

For all methods with local minima or stochastic optimization (i.e., all but SH) we optimize

10 independent models, at each of several code lengths. Fig. 2.3 plots precision (averaged over

10 models, with standard deviation bars), for points retrieved within a Hamming radius R = 3

using different code lengths. These results are similar to those in [KD09], where BRE yields

higher precision than SH and LSH for different binary code lengths. The plots also show that

MLH consistently yields higher precision than BRE. This behavior persists for a wide range of

retrieval radii as shown in Fig. 2.2 for Hamming radii of R = 1 and R = 5 on LabelMe.

For many retrieval tasks with large datasets, precision is more important than recall. Nev-

ertheless, for other tasks such as recognition, high recall may be desired if one wants to find the

majority of points similar to each query. To assess both recall and precision, Figures 2.4 and

2.5 plot precision-recall curves (averaged over 10 models, with standard deviation bars) for all

of the six benchmarks, and for binary codes of length 15, 30, and 45. These plots are obtained

²Kulis and Darrell treated Caltech-101 differently from the other 5 datasets, with a specific kernel, so experiments were not conducted on that dataset.


by varying the retrieval radius R, from 0 to q. In almost all cases, the performance of MLH is

clearly superior. MLH has high recall at all levels of precision.


Figure 2.2: Precision for near neighbors within Hamming radii of 1 (left) and 5 (right) on LabelMe. (view in color)

2.5.2 Euclidean 22K LabelMe

We also conduct experiments on a larger LabelMe dataset compiled by Torralba et al. [TFW08],

which we call 22K LabelMe. It has 20,019 training images and 2000 test images, each with a

512D Gist descriptor. With 22K LabelMe, we can examine how different methods scale to

both larger datasets and longer binary codes. Data pre-processing was identical to that above

(i.e., mean centering, normalization, 40D PCA). Neighbors were defined by a threshold on Euclidean distance in Gist space, such that each training point has, on average, 100 neighbors.

Fig. 2.6 shows precision-recall curves as a function of code length, from 16 to 256 bits. As

above, it is clear that MLH outperforms all other methods for short and long code lengths. SH

does not scale well to large code lengths. We could not run the BRE implementation on the full

dataset due to its memory needs and run time. Instead we trained it with 1000 to 5000 points

and observed that the results do not change dramatically. The results shown here are with

3000 training points, after which the database was populated with all 20,019 training points.

At 256 bits LSH approaches the performance of BRE, and outperforms SH and SIKH. The

dashed curves (MLH.5) in Fig. 2.6 are MLH precision-recall results but at half the code length

(e.g., the dashed curve on the 64-bit plot is for 32-bit MLH). Note that MLH often outperforms

other methods even with half the code length.

Finally, since the MLH framework admits general loss functions of the form ℓ(‖h − g‖_H, s), it is also interesting to consider the results of our learning framework with the ℓ_bre loss (2.4) optimized on the full training set. The BRE2 curves in Fig. 2.6 show this approach to be on par with BRE. While our optimization technique is more efficient than the coordinate-descent algorithm of Kulis and Darrell [KD09], the difference in performance between MLH and BRE is mainly attributed to the pairwise hinge loss function, ℓ_hinge in (2.3).



Figure 2.3: Precision of near neighbors retrieved using a Hamming radius of 3 bits as a function of code length on six benchmarks. (view in color)


Figure 2.4: Precision-recall curves on 10D Uniform and LabelMe for different methods and different code lengths. Moving down the curves involves increasing Hamming distances for retrieval. (view in color)



Figure 2.5: Precision-recall curves on MNIST, Nursery, Peekaboom, and Photo-tourism for different methods and different code lengths. Moving down the curves involves increasing Hamming distances for retrieval. (view in color)



Figure 2.6: Precision-recall curves for different code lengths from 16 to 256 bits on the Euclidean 22K LabelMe dataset. (view in color)

2.5.3 Semantic 22K LabelMe

22K LabelMe also comes with a semantic pairwise affinity matrix that is based on segmenta-

tions and object labels provided by humans. This affinity matrix provides pairwise similarity

scores based on semantic content. While Gist remains the input for our model, we use this

affinity matrix to define a new set of neighbors for each training and test point. Hash functions

learned using these semantic labels should be more useful for content-based retrieval than hash

functions trained using Euclidean distance in Gist space. Multilayer neural nets trained by

Torralba et al. [TFW08] (NNCA) are considered the superior method for semantic 22K La-

belMe. Their model is fine-tuned using semantic labels and nonlinear neighborhood component

analysis of [SH07].

We trained MLH, using varying code lengths, on raw 512D Gist descriptors using semantic

labels. Fig. 2.7 shows the performance of MLH and NNCA, along with a nearest neighbor (NN) baseline that uses cosine similarity (slightly better than Euclidean distance) in Gist space. Note that NN is an upper bound on the performance of LSH and BRE, as they mimic Euclidean

distance. MLH and NNCA exhibit similar performance for 32-bit codes, but for longer codes

MLH is superior. NNCA is not significantly better than Gist-based NN, but MLH with 128

and 256 bits is better than NN, especially for larger M (number of images retrieved). Finally,

Fig. 2.8 shows some interesting qualitative results on the semantic 22K LabelMe model.



Figure 2.7: (top) Percentage of 50 ground-truth neighbors as a function of the number of images retrieved (0 ≤ M ≤ 1000) for MLH with 64 and 256 bits, and for NNCA with 256 bits. (bottom) Percentage of 50 neighbors retrieved as a function of code length for M = 50 and M = 500. (view in color)

2.6 Hashing for very high-dimensional data

Computing a q-bit binary code for a p-dimensional input x using a linear threshold function b_lin(x) requires O(qp) computation time. Depending on the application, and the dimensionality of inputs and outputs, this computational cost is typically acceptable, and one may even consider more expensive binary hash functions with more expressive power (discussed in the next chapter). On the other hand, for some applications, even a computation of O(qp) is too expensive, and more efficient alternative hash functions are desired.

Some previous work has focused on efficient binary hash functions [GKRL13, YKGC14,

RKKI15] applicable to very high-dimensional data. Here, we review three families of efficient

hash functions, and we point out that all of these hash functions can be optimized using our

proposed minimal loss hashing framework.

Gong et al. [GKRL13] propose to use the Kronecker product to factor the projection matrix W in order to accelerate the inner products. Their bilinear hash function is defined as

b_blin(x; W1, W2) = sgn( (W1 ⊗ W2) x )                           (2.22a)
                  = sgn( vec( W2 X W1^T ) ) ,                    (2.22b)


Figure 2.8: Qualitative results on semantic 22K LabelMe. The first image of each row is a query image. The remaining 13 images on each row were retrieved using 256-bit MLH binary codes, in increasing order of their Hamming distance.

where ⊗ denotes the Kronecker product, W1, W2 ∈ R^{√q×√p}, and X ∈ R^{√p×√p} is a matrix whose entries are identical to those of x, i.e., x = vec(X). Computing a bilinear hash function takes O(q√p + p√q) time, which is much better than O(qp). While bilinear hash functions are optimized using quantization error and a two-sided orthogonal Procrustes problem in [GKRL13], one can use our upper bound approach to optimize b_blin(x; W1, W2) as well.
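A minimal sketch of the bilinear mapping (2.22b) — our own illustration, using NumPy and a column-major vec(·) convention — is:

    import numpy as np

    def bilinear_hash(x, W1, W2):
        # x: length-p vector with p a perfect square; W1, W2: (sqrt(q), sqrt(p)) matrices.
        sqrt_p = int(np.sqrt(x.size))
        X = x.reshape(sqrt_p, sqrt_p, order='F')     # x = vec(X), column-major
        proj = W2 @ X @ W1.T                         # costs O(p*sqrt(q) + q*sqrt(p))
        return np.where(proj.ravel(order='F') >= 0, 1, -1)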

Yu et al. [YKGC14] propose a technique called circulant binary embedding, which constrains

the projection matrix W to be a circulant matrix multiplied by a diagonal matrix. A circulant

matrix is a special square matrix in which each row vector is rotated one element to the right

relative to the preceding row vector, so a p by p circulant matrix has only p free parameters.

A circulant binary hash function is defined as

bcir(x;C,D) = sgn (CDx) , (2.23)

where C is a p by p circulant matrix and D is a p by p diagonal matrix. The main benefit of

this approach is that a circulant matrix-vector product can be calculated with the Fast Fourier Transform (FFT) in O(p log p) time. Assuming that q ≤ p, which is often the case, the first q bits of b_cir(x) can be taken as the binary code. Minimal loss hashing is applicable to the optimization of circulant matrices too.
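As an illustration (our own sketch), the circulant projection can be computed with NumPy's FFT; here the circulant matrix is assumed to be parameterized by its first column c.

    import numpy as np

    def circulant_hash(x, c, d, q):
        # c: first column of the circulant matrix C; d: diagonal of D; keep q <= p bits.
        v = d * x                                                       # D x
        proj = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(v)))     # C (D x) in O(p log p)
        return np.where(proj[:q] >= 0, 1, -1)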

Finally, Rastegari et al. [RKKI15] propose to use a sparse W matrix to facilitate fast inner


product. They use an ℓ1 regularizer on W, based on a different objective function. We believe that the loss-based framework that we proposed for learning hash functions in this chapter can be easily amended with an ℓ1 regularizer to obtain sparse projection matrices.

2.7 Summary

In this chapter, based on the latent structural SVM framework, we formulated minimal loss

hashing (MLH), an approach to learning similarity preserving binary codes under a general

class of pairwise loss functions. We introduced a new loss function, pairwise hinge loss, suitable

for training using Euclidean distance or using sets of similar / dissimilar points. Our learning

algorithm is online, efficient, and scales well to large datasets and large code lengths. We pro-

posed an efficient loss-augmented inference algorithm for optimization of pairwise loss functions.

Empirical results on different datasets suggest that MLH outperforms existing methods.


2.A Proof of the inequality on the tightness of the bound

We present a step-by-step proof of the following inequality on the tightness of the upper bound

on pairwise hashing loss for any scalar γ > 1,

max_{g,g′ ∈ H^q} { ℓ_pair(g, g′, s) + γ g^T W x + γ g′^T W x′ } − max_{h ∈ H^q} γ h^T W x − max_{h′ ∈ H^q} γ h′^T W x′
    ≤ max_{g,g′ ∈ H^q} { ℓ_pair(g, g′, s) + g^T W x + g′^T W x′ } − max_{h ∈ H^q} h^T W x − max_{h′ ∈ H^q} h′^T W x′ .        (2.24)

To prove the inequality (2.24), let

(ĝ, ĝ′) = argmax_{(g,g′) ∈ H^q × H^q} { ℓ_pair(g, g′, s) + g^T W x + g′^T W x′ } ,        (2.25)

(ĝ_γ, ĝ′_γ) = argmax_{(g,g′) ∈ H^q × H^q} { ℓ_pair(g, g′, s) + γ g^T W x + γ g′^T W x′ } .        (2.26)

Then, because of (2.25), we know that

ℓ_pair(ĝ_γ, ĝ′_γ, s) + ĝ_γ^T W x + ĝ′_γ^T W x′  ≤  ℓ_pair(ĝ, ĝ′, s) + ĝ^T W x + ĝ′^T W x′ .        (2.27)

Next, we subtract the same quantity from both sides of (2.27) to obtain

ℓ_pair(ĝ_γ, ĝ′_γ, s) + ĝ_γ^T W x + ĝ′_γ^T W x′ − max_{h ∈ H^q} h^T W x − max_{h′ ∈ H^q} h′^T W x′
    ≤ ℓ_pair(ĝ, ĝ′, s) + ĝ^T W x + ĝ′^T W x′ − max_{h ∈ H^q} h^T W x − max_{h′ ∈ H^q} h′^T W x′ .        (2.28)

Note that the RHS of (2.28) is identical to the RHS of (2.24). Below we show that the LHS of (2.28) is at least as large as the LHS of (2.24), hence the inequality. By definition, we know that

ĝ_γ^T W x ≤ max_{h ∈ H^q} h^T W x ,        (2.29)

ĝ′_γ^T W x′ ≤ max_{h′ ∈ H^q} h′^T W x′ .        (2.30)

Multiplying both inequalities by (γ − 1) > 0 and adding them, we get

(γ − 1) ĝ_γ^T W x + (γ − 1) ĝ′_γ^T W x′  ≤  (γ − 1) max_{h ∈ H^q} h^T W x + (γ − 1) max_{h′ ∈ H^q} h′^T W x′ .        (2.31)

Adding ℓ_pair(ĝ_γ, ĝ′_γ, s) to both sides and reorganizing the terms, we get

ℓ_pair(ĝ_γ, ĝ′_γ, s) + γ ĝ_γ^T W x + γ ĝ′_γ^T W x′ − γ max_{h ∈ H^q} h^T W x − γ max_{h′ ∈ H^q} h′^T W x′
    ≤ ℓ_pair(ĝ_γ, ĝ′_γ, s) + ĝ_γ^T W x + ĝ′_γ^T W x′ − max_{h ∈ H^q} h^T W x − max_{h′ ∈ H^q} h′^T W x′ .        (2.32)

Now by combining (2.32) and (2.28), we have a proof for inequality (2.24).

Chapter 3

Hamming Distance Metric Learning

Many machine learning algorithms presuppose the existence of a pairwise similarity measure on

the input space. Examples include semi-supervised clustering, nearest neighbor classification,

and kernel-based methods. When similarity measures are not given a priori, one could adopt

a generic function such as Euclidean distance, but this often produces unsatisfactory results.

The goal of distance metric learning techniques is to improve matters by incorporating side

information, and optimizing parametric distance functions such as the Mahalanobis distance

[DKJ+07, GRHS04, SSSN04, WBS06, XNJR02].

Motivated by large-scale multimedia applications, this chapter continues to advocate the

use of discrete mappings from input features to binary codes. Compact binary codes are re-

markably storage efficient, allowing one to store massive datasets in memory. The Hamming

distance, a natural similarity measure on binary codes, can be computed with just a few machine

instructions per comparison. Further, it has been shown that one can perform exact nearest

neighbor search in Hamming space significantly faster than linear search, with sublinear run-

times [GPY94, NPF12] (e.g., see Chapter 4). By contrast, retrieval based on Mahalanobis dis-

tance requires approximate nearest neighbor search (NNS), for which state-of-the-art methods

(e.g., see [JDS11, ML09]) do not always perform well, especially with massive, high-dimensional

datasets when storage overheads and distance computations become prohibitive.
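As a brief aside (our own illustration, not from the thesis), computing the Hamming distance between two 64-bit code words amounts to an XOR followed by a population count:

    def hamming_distance(code_a: int, code_b: int) -> int:
        # XOR marks the differing bits; popcount counts them.
        # int.bit_count() requires Python 3.10+; bin(x).count("1") works elsewhere.
        return (code_a ^ code_b).bit_count()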

In this chapter, we introduce a framework for learning a broad class of binary hash functions

based on a triplet ranking loss designed to preserve relative similarity (c.f., [SJ04, FSSM07,

CSSB10]). While certainly useful for preserving metric structure, this loss function is very well

suited to the preservation of semantic similarity. Notably, it can be viewed as a form of local

ranking loss. It is more flexible than the pairwise hinge loss of the previous chapter, and is

shown below to produce superior hash functions.

Our formulation generalizes the minimal loss hashing (MLH) algorithm of Chapter 2. To

optimize hash function parameters we formulate a continuous upper-bound on empirical loss,

with a new form of loss-augmented inference designed for efficient optimization with the pro-

posed triplet loss on the Hamming space. We also present ways to optimize more general

families of hash functions based on non-linear projection of the data by multilayer neural nets.



To our knowledge, this is one of the most general frameworks for learning a broad class of hash

functions. In particular, many previous loss-based techniques (e.g., [KD09]) are not capable of

optimizing mappings that involve non-linear projections.

Our experiments indicate that the framework is capable of preserving semantic structure on

challenging datasets, namely, MNIST [MNI] and CIFAR-10 [Kri09]. We show that k-nearest

neighbor (kNN) search on the resulting binary codes retrieves items that bear remarkable

similarity to a given query item. To show that the binary representation is rich enough to

capture salient semantic structure, as is common in metric learning, we also report classification

performance on the binary codes. Surprisingly, on these datasets, simple kNN classifiers in

Hamming space are competitive with sophisticated discriminative classifiers, including SVMs

and neural networks. An important appeal of our approach is the scalability of kNN search on

binary codes to billions of data points, and of kNN classification to millions of class labels.

3.1 Formulation

We aim to learn a mapping b(x) : Rp → Hq, while preserving some notion of similarity. This

mapping is parameterized by a real-valued weight vector w as

b(x;w) = sign (f(x;w)) , (3.1)

where sign(.) denotes the element-wise sign function, and f(x;w) : Rp → Rq is a real-valued

transformation. Different forms of f give rise to different families of hash functions:

1. A linear transform f(x) = Wx, where W ∈ Rq×p and w ≡ vec(W ), is the simplest and

most well-studied case [Cha02, GL11, NF11, WKC10] discussed in the previous chapter.

Under this mapping, denoted b lin(x), the kth bit is determined by a hyperplane in the

input space whose normal is given by the kth row of W.¹

2. In [WTF08], linear projections are followed by an element-wise cosine transform, i.e., f(x) =

cos(Wx). For such mappings the bits correspond to stripes of 1 and −1 regions, oriented

parallel to the corresponding hyperplanes, in the input space.

3. Kernelized hash functions [KD09, KG09] make use of a kernel function κ(x, z) and a set of m randomly selected data points {x_{π_j}}_{j=1}^m to define the ith dimension of f, denoted f_i, via a parameter matrix W ∈ R^{q×(m+1)} by

f_i(x; W) = W_{i0} + Σ_{j=1}^m W_{ij} κ(x_{π_j}, x) .        (3.2)

4. More complex hash functions are obtained with multilayer neural networks [SH09, TFW08].

¹For presentation clarity, in the linear and nonlinear cases, we omit bias terms. They are incorporated by adding one dimension to the input vectors, and to the hidden layers of neural networks, with a fixed value of one.


For example, a two-layer network with a p′-dimensional hidden layer and weight matrices

W1 ∈ Rp′×p and W2 ∈ Rq×p′ can be expressed as f(x) = tanh(W2 tanh(W1x)), where

tanh(.) is the element-wise hyperbolic tangent function.

Our framework applies to all of the above families of hash functions. The only restriction is

that f must be differentiable with respect to its parameters, so that one is able to compute the

Jacobian of f(x;w) with respect to w.
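For instance, family 4 above, with the two-layer network and the sign quantization of (3.1), can be sketched as follows (our own illustration; bias terms omitted as in the text):

    import numpy as np

    def f_two_layer(x, W1, W2):
        # f(x) = tanh(W2 tanh(W1 x)), a two-layer network with tanh activations.
        return np.tanh(W2 @ np.tanh(W1 @ x))

    def b_hash(x, W1, W2):
        # b(x; w) = sign(f(x; w)), as in (3.1).
        return np.where(f_two_layer(x, W1, W2) >= 0, 1, -1)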

3.1.1 Triplet loss function

The choice of loss function is crucial for learning good similarity measures. Most existing

supervised binary hashing techniques [GL11, LWJ+12, NF11] formulate learning objectives in

terms of pairwise similarity, where pairs of inputs are labelled as either similar or dissimilar.

These methods then aim to ensure that Hamming distances between binary codes for similar (dissimilar) items are small (large). For example, in Chapter 2 we proposed a pairwise hinge loss function in (2.3). This loss incurs zero cost when a pair of similar inputs maps to codes

that differ by less than ρ bits. The loss is zero for dissimilar items whose Hamming distance is

more than ρ bits.

One problem with such a loss function is that finding a suitable threshold ρ with cross-

validation is expensive. Furthermore, for many problems one cares about the relative magni-

tudes of pairwise distances more than their precise numerical values. So, constraining pairwise

Hamming distances over all pairs of codes with a single threshold is overly restrictive. More

importantly, not all datasets are amenable to labeling input pairs as similar or dissimilar. One

way to avoid some of these problems is to define loss in terms of relative similarity. Such a loss function has been used in metric learning [FSSM07, CSSB10], and, as shown below, it is

naturally suited to Hamming distance metric learning.

To define relative similarity, we assume that the training data includes triplets of items

(x,x+,x−), such that the pair (x,x+) is more similar than the pair (x,x−). Our goal is to

learn a hash function b such that b(x) is closer to b(x+) than to b(x−) in Hamming distance.

Accordingly, we propose a ranking loss on the triplet of binary codes (h,h+,h−), obtained from

b applied to (x,x+,x−):

ℓ_rank(h, h+, h−) = [ ‖h − h+‖_H − ‖h − h−‖_H + 1 ]_+ .        (3.3)

This loss is zero when the Hamming distance between the more-similar pair, ‖h−h+‖H , is at

least one bit smaller than the Hamming distance between the less-similar pair, ‖h−h−‖H . This

loss function is more flexible than the pairwise loss function ℓ_hinge, as it can be used to preserve rankings among similar items, for example based on Euclidean distance, or perhaps using a tree-based distance between category labels within a phylogenetic tree.
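As a minimal illustration of (3.3) (our own sketch), for codes in {−1, +1}^q:

    import numpy as np

    def triplet_ranking_loss(h, h_pos, h_neg):
        # Hinge on the Hamming distance difference with a one-bit margin, eq. (3.3).
        d_pos = np.sum(h != h_pos)     # ||h - h+||_H
        d_neg = np.sum(h != h_neg)     # ||h - h-||_H
        return max(d_pos - d_neg + 1, 0)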


3.2 Optimization through an upper bound

Given a training set of triplets, D = {(x_i, x_i^+, x_i^−)}_{i=1}^n, our objective is the empirical loss,

L(w) = Σ_{(x, x+, x−) ∈ D} ℓ_rank( b(x;w), b(x+;w), b(x−;w) ) .        (3.4)

This objective is discontinuous and non-convex. We again construct a continuous upper bound

on the loss inspired by previous work on latent structural SVMs [YJ09]. The key observation

is that,

b(x;w) = sign( f(x;w) )                                        (3.5a)
        = argmax_{h ∈ H^q} h^T f(x;w) .                         (3.5b)

The upper bound on loss that we exploit for learning takes the following form,

ℓ_rank( b(x;w), b(x+;w), b(x−;w) )
    ≤ max_{g, g+, g−} { ℓ_rank(g, g+, g−) + g^T f(x;w) + g+^T f(x+;w) + g−^T f(x−;w) }
      − max_h { h^T f(x;w) } − max_{h+} { h+^T f(x+;w) } − max_{h−} { h−^T f(x−;w) } ,        (3.6)

where g, g+, g−, h, h+, and h− are constrained to be q-dimensional binary vectors. To

prove the inequality in (3.6), note that if the first term on the RHS were maximized² by (g, g+, g−) = (b(x), b(x+), b(x−)), then using (3.5b), it is straightforward to show that (3.6) would become an equality. For all other (g, g+, g−) that maximize the first term, the RHS can only be as large as or larger than when (g, g+, g−) = (b(x), b(x+), b(x−)), hence the

inequality holds.

Summing the upper bound instead of the loss in (3.4) yields an upper bound on the empirical

loss. The resulting bound is continuous and piecewise smooth in w as long as f is continuous

in w. The upper bound of (3.6) is a generalization of a bound introduced in Section 2.2 for

linear f(x) = Wx. In particular, when f is linear in w, the bound on empirical loss becomes

piecewise linear and convex-concave. While the bound in (3.6) is only piecewise smooth, it

allows us to learn hash functions based on non-linear functions f , e.g., neural networks. Note

that the bound in Section 2.2 was defined for pairwise loss functions and pairwise similarity

labels, and the bound here applies to the more flexible class of triplet loss functions.

Regarding the tightness of the bound, one can echo the proposition (2.11) showing that the

upper bound (3.6) becomes tighter and more non-smooth as ‖f(x)‖ grows. Hence, if f(x) is

not constrained, one should consider regularizing the parameters. For neural networks, we use

a typical weight decay regularizer.

²For presentation clarity, we will sometimes drop the dependence of f and b on w, and write b(x) and f(x).


3.2.1 Loss-augmented inference with triplet hashing loss

Loss-augmented inference, i.e., finding the three binary codes given by

(ĝ, ĝ+, ĝ−) = argmax_{(g, g+, g−)} { ℓ_rank(g, g+, g−) + g^T f(x) + g+^T f(x+) + g−^T f(x−) } ,        (3.7)

is hard in general because there are 2^{3q} possible joint settings of the three binary codes over which one has to maximize the RHS.

We can solve this loss-augmented inference problem efficiently for the class of triplet loss

functions that depend only on the value of

dH(g,g+,g−) ≡ ‖g−g+‖H − ‖g−g−‖H .

Importantly, such loss functions do not depend on the specific binary codes, but rather just the

Hamming distance differences. Note that dH(g,g+,g−) can take on only 2q+1 possible values,

since it is an integer between −q and +q. Clearly, the triplet ranking loss depends only on d_H, as

ℓ_rank(g, g+, g−) = ℓ′( d_H(g, g+, g−) ) ,   where  ℓ′(α) = [ α + 1 ]_+ .        (3.8)

For this family of loss functions, given the values of f(x), f(x+), and f(x−) in (3.7), loss-

augmented inference can be performed in time O(q²). To show this, first consider the case d_H(g, g+, g−) = m, where m is an integer between −q and q. In this case we can replace the loss-augmented inference problem with

ℓ′(m) + max_{g, g+, g−} { g^T f(x) + g+^T f(x+) + g−^T f(x−) }    s.t.  d_H(g, g+, g−) = m .        (3.9)

One can solve (3.9) for each possible value of m. It is straightforward to see that the largest of

those 2q + 1 maxima is the solution to (3.7). Then, what remains for us is to solve (3.9).

To solve (3.9), consider the kth bit for each of the three codes, i.e., a=g[k], b=g+[k], and

c=g−[k]. There are 8 ways to select a, b and c, but no matter what values they take on, they

can only change the value of d_H(g, g+, g−) by −1, 0, or +1. Accordingly, let e_k ∈ {−1, 0, +1} denote the effect of the kth bits on d_H(g, g+, g−). For each value of e_k, we can easily compute the maximal contribution of (a, b, c) to (3.9) by

cont(k, e_k) = max_{a,b,c} { a f(x)[k] + b f(x+)[k] + c f(x−)[k] }    s.t.  ‖a − b‖_H − ‖a − c‖_H = e_k ,        (3.10)

for a, b, c ∈ {−1, +1}. Therefore, to solve (3.9), we aim to select values for e_k ∈ {−1, 0, +1}, for all k, such that

d_H(g, g+, g−) = Σ_{k=1}^q e_k = m and Σ_{k=1}^q cont(k, e_k) is maximized. This can be solved for any m using a dynamic programming algorithm similar to knapsack, which runs in O(q²). This dynamic programming algorithm relies on solving a subproblem C(r, s), which seeks the maximum value of Σ_{k=1}^r cont(k, e_k) over the first r bits (0 ≤ r ≤ q) such that Σ_{k=1}^r e_k = s, for any −r ≤ s ≤ r. One can easily update C(r, s) from its previous values by

C(r, s) ← max( C(r−1, s+1) + cont(r, −1),  C(r−1, s) + cont(r, 0),  C(r−1, s−1) + cont(r, +1) ) .        (3.11)

Note that because the value of Σ_{k=1}^r e_k is always between −r and r, we set C(r, s) = −∞ for any s > r or s < −r. The value of C(0, 0) is initialized to 0. Updating all of the values of C(r, s) for 1 ≤ r ≤ q and −r ≤ s ≤ r requires a running time of O(q²).

Finally, we choose the best m according to (3.9) as

m̂ = argmax_{−q ≤ m ≤ q} { ℓ′(m) + C(q, m) } ,        (3.12)

and we set the triplet of q-bit codes, (ĝ, ĝ+, ĝ−), according to the e_k's that yield the maximum of C(q, m̂) and the bit values that maximize the corresponding cont(k, e_k) terms.
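The following Python sketch (our own illustration, not code from the thesis) implements this O(q²) dynamic program; the function name and the explicit backtracking arrays are assumptions of the sketch.

    import numpy as np

    def triplet_loss_augmented_inference(fx, fxp, fxn, loss_prime):
        # fx, fxp, fxn: f(x), f(x+), f(x-) as length-q vectors.
        # loss_prime(m): the loss l'(m) of (3.8), e.g. lambda m: max(m + 1, 0).
        q = len(fx)
        # cont[k, e+1]: best contribution of bits (a, b, c) whose effect on d_H is e.
        cont = np.full((q, 3), -np.inf)
        best_bits = np.zeros((q, 3, 3), dtype=int)
        for k in range(q):
            for a in (-1, 1):
                for b in (-1, 1):
                    for c in (-1, 1):
                        e = (a != b) - (a != c)          # effect e_k in {-1, 0, +1}
                        val = a * fx[k] + b * fxp[k] + c * fxn[k]
                        if val > cont[k, e + 1]:
                            cont[k, e + 1] = val
                            best_bits[k, e + 1] = (a, b, c)
        # Dynamic program C(r, s) of (3.11); the second index is offset by q.
        C = np.full((q + 1, 2 * q + 1), -np.inf)
        C[0, q] = 0.0
        choice = np.zeros((q + 1, 2 * q + 1), dtype=int)
        for r in range(1, q + 1):
            for s in range(-r, r + 1):
                for e in (-1, 0, 1):
                    prev = s - e
                    if abs(prev) > r - 1:
                        continue
                    v = C[r - 1, prev + q] + cont[r - 1, e + 1]
                    if v > C[r, s + q]:
                        C[r, s + q] = v
                        choice[r, s + q] = e
        # Pick the best m as in (3.12), then backtrack to recover the e_k's and bits.
        m_best = max(range(-q, q + 1), key=lambda m: loss_prime(m) + C[q, m + q])
        g, gp, gn = np.zeros(q, int), np.zeros(q, int), np.zeros(q, int)
        s = m_best
        for r in range(q, 0, -1):
            e = choice[r, s + q]
            g[r - 1], gp[r - 1], gn[r - 1] = best_bits[r - 1, e + 1]
            s -= e
        return g, gp, gn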

3.2.2 Perceptron-like learning with triplet loss

Our learning algorithm is a form of stochastic gradient descent, where in the tth iteration we

sample a triplet (x,x+,x−) from the dataset, and then take a step in the direction that decreases

the upper bound on the triplet’s loss in (3.6). To this end, we randomly initialize w(0). Then,

at each iteration t + 1, given w(t), we use the following procedure to update the parameters,

w(t+1):

1. Select a random triplet (x,x+,x−) from dataset D.

2. Compute (ĥ, ĥ+, ĥ−) = ( b(x;w^(t)), b(x+;w^(t)), b(x−;w^(t)) ) using (3.5b).

3. Compute (ĝ, ĝ+, ĝ−), the solution to the loss-augmented inference problem in (3.7).

4. Update the model parameters using

w^(t+1) = w^(t) + η [ ∂f(x)/∂w (ĥ − ĝ) + ∂f(x+)/∂w (ĥ+ − ĝ+) + ∂f(x−)/∂w (ĥ− − ĝ−) − λ w^(t) ] ,

where η is the learning rate, and ∂f(x)/∂w ≡ ∂f(x;w)/∂w |_{w=w^(t)} ∈ R^{|w|×q} is the transpose of the Jacobian matrix, where |w| is the number of parameters.

This update rule can be seen as gradient descent in the (regularized) upper bound of the

empirical loss. Although the upper bound in (3.6) is not differentiable at isolated points (owing

to the max terms), in our experiments we find that this update rule consistently decreases both

the upper bound and the actual empirical loss L(w).
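For illustration only (not the thesis implementation), one update of this procedure can be sketched as follows, where f(x, w) evaluates the mapping, jacobian(x, w) returns the |w|×q transpose-Jacobian (in practice computed with backprop), and infer solves the loss-augmented inference, e.g., via the dynamic program sketched in Section 3.2.1; all names here are our own.

    import numpy as np

    def triplet_sgd_step(w, f, jacobian, triplet, infer, eta, lam):
        x, xp, xn = triplet
        sgn = lambda v: np.where(v >= 0, 1, -1)
        h, hp, hn = sgn(f(x, w)), sgn(f(xp, w)), sgn(f(xn, w))              # step 2
        g, gp, gn = infer(f(x, w), f(xp, w), f(xn, w),
                          lambda m: max(m + 1, 0))                          # step 3
        step = (jacobian(x, w) @ (h - g) + jacobian(xp, w) @ (hp - gp)
                + jacobian(xn, w) @ (hn - gn) - lam * w)                    # step 4
        return w + eta * step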


3.3 Asymmetric Hamming distance

When Hamming distance is used to score and retrieve the nearest neighbors to a given query,

there is a high probability of a tie, where multiple items are equidistant from the query in Ham-

ming space. To break ties and improve the similarity measure, previous work suggests the use

of an asymmetric Hamming (AH) distance [DCL08, GP11]. With an AH distance, one stores

dataset entries as binary codes (for storage efficiency) but the queries are not binarized. An

asymmetric distance function is therefore defined on a real-valued query vector, v ∈ Rq, and a

database binary code, h ∈ Hq. Computing AH distance is slightly less efficient than Hamming

distance, and efficient retrieval algorithms, such as [NPF12], are not directly applicable. Nev-

ertheless, the AH distance can also be used to re-rank items retrieved using Hamming distance,

with a negligible increase in run-time. To improve efficiency further when there are many codes

to be re-ranked, AH distance from the query to binary codes can be pre-computed for each 8

or 16 consecutive bits, and stored in a query-specific lookup table.

In this work, we use the following asymmetric Hamming distance function,

AH(h, v; s) = (1/4) ‖ h − tanh( diag(s) v ) ‖₂² ,        (3.13)

where s ∈ Rq is a vector of scaling parameters that control the slope of hyperbolic tangent

applied to different bits; diag(s) is a diagonal matrix with the elements of s on its diagonal. As

the scaling factors in s approach infinity, AH and Hamming distances become identical. Here

we use the AH distance between a database code b(x) and the real-valued projection for a query

z given by f(z). Based on our validation sets, the AH distance of (3.13) is relatively insensitive

to the values in s. For the experiments, we simply choose s to scale the average absolute value of the elements of {f(x_i)}_{i=1}^n for the database entries to be 0.25.
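A one-line sketch of (3.13) in NumPy (our own illustration) is:

    import numpy as np

    def asymmetric_hamming(h, v, s):
        # h: database code in {-1,+1}^q; v: real-valued query projection f(z);
        # s: per-bit scaling controlling the slope of tanh.  As s -> infinity,
        # this reduces to the Hamming distance between h and sign(v).
        return 0.25 * np.sum((h - np.tanh(s * v)) ** 2)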

3.4 Implementation details

In practice, the basic learning algorithm described in Section 3.2 is implemented with several

modifications. First, instead of using a single training triplet to estimate the gradients, we

use mini-batches comprising 100 triplets and average the gradient. Second, for each triplet

(x,x+,x−), we replace x− with a “hard” example by selecting an item among all negative

examples in the mini-batch that is closest in the current Hamming distance to b(x). By har-

vesting hard negative examples, we ensure that the Hamming constraints for the triplets are

not too easily satisfied. Third, to find good binary codes, we encourage each bit, averaged

over the training data, to be mean-zero before quantization (motivated in [WTF08]). This is

accomplished by adding the following penalty to the objective function:

(1/2) ‖ mean_x( f(x;w) ) ‖₂² ,        (3.14)


where mean(f(x;w)) denotes the mean of f(x;w) across the training data. In our implementa-

tion, for efficiency, the stochastic gradient of Eq. 3.14 is computed per mini-batch. Empirically,

we observe that including this term in the objective improves the quality of binary codes,

especially with the triplet ranking loss.

We use a heuristic to adapt learning rates, known as bold driver [Bat89]. For each mini-batch

we evaluate the learning objective before the parameters are updated. As long as the objective

is decreasing we slowly increase the learning rate η, but when the objective increases, η is

halved. In particular, after every 25 epochs, if the objective, averaged over the last 25 epochs,

decreased, we increase η by 5%; otherwise, we decrease η by 50%. We also use a momentum

term; i.e., the previous gradient update is scaled by 0.9 and then added to the current gradient.
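A sketch of this learning-rate heuristic (our own illustration; the function name and the history convention are assumptions) is:

    def bold_driver(eta, avg_objectives, grow=1.05, shrink=0.5):
        # avg_objectives: objective averaged over each 25-epoch window, most recent last.
        # Increase eta by 5% if the averaged objective decreased, otherwise halve it.
        if len(avg_objectives) < 2:
            return eta
        return eta * grow if avg_objectives[-1] < avg_objectives[-2] else eta * shrink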

All experiments are run on a GPU for 2,000 passes through the datasets. The training time for our current implementation is under 4 hours of GPU time for most of our experiments. The two exceptions involve CIFAR-10 with 6400D inputs and relatively long code lengths of 256 and 512 bits, for which the training times are approximately 8 and 16 hours, respectively.

3.5 Experiments

Our experiments evaluate Hamming distance metric learning using two families of hash func-

tions, namely, linear transforms and multilayer neural networks. For each, we examine two loss

functions, the pairwise hinge loss, ℓ_hinge in (2.3), and the triplet ranking loss, ℓ_rank in (3.3).

Experiments are conducted on two well-known image corpora, MNIST [MNI] and CIFAR-

10 [Kri09]. Ground-truth similarity labels are derived from class labels; items from the same

class are deemed similar. Training triplets are created by taking two items from the same class,

and one item from a different class. This definition of similarity ignores intra-class variations

and the existence of sub-categories, e.g., styles of handwritten fours, or types of airplanes.

Nevertheless, we use these coarse similarity labels to evaluate our framework. To that end,

using items from the test set as queries, we report precision@k, i.e., the fraction of k-nearest

neighbors in Hamming distance that are same-class items. We also show kNN retrieval results

for qualitative inspection. Finally, we report Hamming (H) and asymmetric Hamming (AH)

kNN classification rates on the test sets.

Datasets. The MNIST [MNI] digit dataset contains 60,000 training and 10,000 test images (28×28 pixels) of ten handwritten digits (0 to 9). Of the 60,000 training images, we set aside 5,000 for validation. CIFAR-10 [Kri09] comprises 50,000 training and 10,000 test color images

(32×32 pixels). Each image belongs to one of 10 classes, namely airplane, automobile, bird, cat,

deer, dog, frog, horse, ship, and truck. The large variability in scale, viewpoint, illumination,

and background clutter poses a significant challenge for classification. Instead of using raw

pixel values, we borrow a bag-of-words representation from Coates et al. [CLN11]. Its 6400D

feature vector comprises one 1600-bin histogram per image quadrant, the codewords of which

are learned from 6×6 image patches. Such high-dimensional inputs are challenging for learning



Figure 3.1: MNIST precision@k: (left) four methods with 32-bit codes; (right) three code lengths with triplet loss.

Hash function, Loss                Distance        kNN     32 bits   64 bits   128 bits
Linear, pairwise hinge [NF11]      Hamming         3 NN    4.73      3.11      2.61
Linear, triplet ranking            Hamming         3 NN    4.44      3.13      2.44
Two-layer Net, pairwise hinge      Hamming         30 NN   1.50      1.45      1.44
Two-layer Net, triplet ranking     Hamming         30 NN   1.45      1.38      1.27
Linear, pairwise hinge             Asym. Hamming   3 NN    4.30      2.78      2.51
Linear, triplet ranking            Asym. Hamming   3 NN    3.88      2.90      2.51
Two-layer Net, pairwise hinge      Asym. Hamming   30 NN   1.50      1.36      1.35
Two-layer Net, triplet ranking     Asym. Hamming   30 NN   1.45      1.29      1.20

Baseline                                           Error
Deep neural nets with pre-training [HS06]          1.2
Large margin nearest neighbor [WBS06]              1.3
RBF-kernel SVM [DS02]                              1.4
Neural network [SSP03]                             1.6
Euclidean 3NN                                      2.89

Table 3.1: Classification error rates on the MNIST test set.

similarity-preserving hash functions. Of the 50,000 training images, we set aside 5,000 for

validation.

MNIST: We optimize binary hash functions, mapping raw MNIST images to 32, 64, and

128-bit codes. For each test code we find the k closest training codes using Hamming distance,

and report precision@k in Fig. 3.1. As one might expect, the non-linear mappings with neural

networks significantly outperform linear mappings, also seen in Table 3.1. We make use of a

neural network with two weight layers and a hidden layer of 512 units, which has 784 input

units and q output units. Weights were initialized randomly, and the Jacobian with respect to

the parameters was computed using the backprop algorithm [RHW86]. We find that the triplet

loss ℓ_rank yields better performance than the pairwise loss ℓ_hinge. The sharp drop in precision

at k = 6000 is a consequence of the fact that each digit in MNIST has approximately 6000

same-class neighbors. Fig. 3.1 (right) shows how precision improves as a function of the binary


code length. Notably, kNN retrieval, for k > 10 and all code lengths, yields higher precision

than Euclidean NN on the 784D input space. Further, note that these Euclidean kNN results effectively provide an upper bound on the performance one would expect with existing hashing methods that preserve Euclidean distances (e.g., [GL11, IM98, KD09, WTF08]).

One can also evaluate the fidelity of the Hamming space representation in terms of classifi-

cation performance from the Hamming codes. To focus on the quality of the hash functions,

and the speed of retrieval for large-scale multimedia datasets, we use a kNN classifier; i.e., we

just use the retrieved neighbors to predict class labels for each test code. Table 3.1 reports

classification error rates using kNN based on Hamming and asymmetric Hamming distance.

Non-linear mappings, even with only 32-bit codes, significantly outperform linear mappings

(e.g., with 128 bits). The triplet ranking loss also improves upon the pairwise hinge loss, even

though the former has no hyperparameters. Table 3.1 also indicates that AH distance provides

a modest boost in performance. For each method the parameter k in the kNN classifier is

chosen based on the validation set.

For baseline comparison, Table 3.1 reports state-of-the-art performance on MNIST with

sophisticated discriminative classifiers (excluding those using exemplar deformations and con-

volutional nets). Despite the simplicity of a kNN classifier, our model achieves error rates of

1.29% and 1.20% using 64- and 128-bit codes. This is comparable to the 1.4% of an RBF-SVM [DS02], and the 1.6% that is the best published neural net result for this version of the task [SSP03]. Our

model also outperforms the metric learning approach of [WBS06], and is competitive with the

best known Deep Belief Network [HS06], although that approach used unsupervised pre-training while

we do not.

The above results show that our Hamming distance metric learning framework can preserve

sufficient semantic similarity, to the extent that Hamming kNN classification becomes com-

petitive with state-of-the-art discriminative methods. Nevertheless, our method is not solely a

classifier, and it can be used within many other machine learning algorithms.

CIFAR-10: On CIFAR-10 we optimize hash functions for 64, 128, 256, and 512-bit codes.

Fig. 3.2 depicts precision@k curves, showing the superior quality of hash functions learned with the ranking loss compared to the pairwise hinge loss. Fig. 3.3 depicts the quality of retrieval results for four queries from the CIFAR-10 test set, showing the 16 nearest neighbors using 256-bit codes,

64-bit codes (both learned with the triplet ranking loss), and Euclidean distance in the original

6400D feature space. The number of class-based retrieval errors is much smaller in Hamming

space, and the similarity in visual appearance is also superior.

Table 3.2 reports classification performance (showing accuracy instead of error rates for

consistency with previous papers). Euclidean kNN on the 6400D input features yields under

60% accuracy, while kNN with the binary codes obtains 76–78%. As with MNIST data, this

level of performance is comparable to one-vs-all SVMs applied to the same features [CLN11].

Not surprisingly, training fully-connected neural nets on 6400-dimensional features with only

50,000 training examples is challenging and susceptible to over-fitting, hence the results of



Figure 3.2: Precision@k plots on the CIFAR-10 dataset for Hamming distance on 512-, 256-, 128-, and 64-bit codes trained using (left) the triplet ranking loss and (right) the pairwise hinge loss. Precision is averaged over the test examples.

  Hashing, Loss                        Distance   kNN    64 bits   128 bits   256 bits   512 bits
  Linear, pairwise hinge (Chapter 2)   H          7 NN   72.2      72.8       73.8       74.6
  Linear, pairwise hinge               AH         8 NN   72.3      73.5       74.3       74.9
  Linear, triplet ranking              H          2 NN   75.1      75.9       77.1       77.9
  Linear, triplet ranking              AH         1 NN   75.7      76.8       77.5       78.0

  Baseline                             Accuracy
  One-vs-all linear SVM [CLN11]        77.9
  Euclidean 3NN                        59.3

Table 3.2: Recognition accuracy on the CIFAR-10 test set (H ≡ Hamming, AH ≡ Asym. Hamming).

neural nets on CIFAR-10 were not competitive. Previous work [Kri09] had some success training

convolutional neural nets on this dataset. Note that our framework can easily incorporate

convolutional neural nets, which are intuitively better suited to the intrinsic spatial structure

of natural images.

In comparison, another hashing technique called iterative quantization (ITQ) [GL11] achieves

8.5% error on MNIST and 78% accuracy on CIFAR-10. Our method compares favorably, es-

pecially on MNIST. However, note that ITQ [GL11] binarizes the outcome of a supervised

classifier (Canonical Correlation Analysis with labels), and does not explicitly learn a similarity

measure on the input features based on pairs or triplets.

3.6 Summary

We present a framework for Hamming distance metric learning, which entails learning a discrete

mapping from an input space onto binary codes. This framework accommodates different



Figure 3.3: Retrieval results for four CIFAR-10 test images using Hamming distance on 256-bit and 64-bit codes, and Euclidean distance on bag-of-words features. Red rectangles indicate mistakes.


families of hash functions, including linear threshold functions, and quantized multilayer neural

networks. By using a piecewise smooth upper bound on a triplet ranking loss, we optimize hash

functions that are shown to preserve semantic similarity on complex datasets. In particular,

our experiments show that a simple kNN classifier on the learned binary codes is competitive

with sophisticated discriminative classifiers. While other hashing papers have used CIFAR or

MNIST, none report kNN classification performance, often because it has been thought that the

bar established by state-of-the-art classifiers is too high. On the contrary, our kNN classification

performance suggests that Hamming space can be used to represent complex semantic structures

with high fidelity. One appeal of this approach is the scalability of kNN search on binary codes

to billions of data points, and of kNN classification to millions of class labels.

Chapter 4

Fast Exact Search in Hamming

Space with Multi-Index Hashing

There has been growing interest in representing image data and feature descriptors in terms

of compact binary codes, often to facilitate fast near neighbor search and feature matching in

vision applications (e.g., [AOV12, CLSF10, SVD03, SBBF12, TFW08, KGF12]). Binary codes

are storage efficient and comparisons require just a small number of machine instructions.

Millions of binary codes can be compared to a query in less than a second. But the most

compelling reason for binary codes, and discrete codes in general, is their use as direct indices

(addresses) into a hash table, yielding a dramatic increase in search speed compared to an

exhaustive linear scan (e.g., [WTF08, SH09, NF11]).

Nevertheless, using binary codes as direct hash indices is not necessarily efficient. To find

near neighbors one needs to examine all hash table entries (or buckets) within some Hamming

ball around the query. The problem is that the number of such buckets grows near-exponentially

with the search radius. Even with a small search radius, the number of buckets to examine is

often larger than the number of items in the database, and hence slower than linear scan. Recent

papers on binary codes mention the use of hash tables, but resort to linear scan when codes

are longer than 32 bits (e.g., [TFW08, SH09, KD09, NF11]). Not surprisingly, code lengths

are often significantly longer than 32 bits in order to achieve satisfactory retrieval performance

(e.g., see Fig. 4.5).

This chapter presents a new algorithm for exact k-nearest neighbor (kNN) search on binary

codes that is dramatically faster than exhaustive linear scan. This has been an open problem

since the introduction of hashing techniques with binary codes. Our new multi-index hashing

algorithm exhibits sub-linear search times, is storage efficient, and straightforward to implement.

Empirically, on databases of up to 1B codes we find that multi-index hashing is hundreds of

times faster than linear scan. Extrapolation suggests that the speedup gain grows quickly with

database size beyond 1B codes.



4.0.1 Background: problem and related work

Nearest neighbor search (NNS) on binary codes is used for image search [RL09, TFW08,

WTF08], matching local features [AOV12, CLSF10, JDS08, SBBF12], image classification

[BTF11a], object segmentation [KGF12], and parameter estimation [SVD03]. Sometimes the

binary codes are generated directly as feature descriptors for images or image patches, such as

BRIEF or FREAK [CLSF10, BTF11a, AOV12, TCFL12], and sometimes binary corpora are

generated by discrete similarity-preserving mappings from high-dimensional data. Most such

mappings are designed to preserve Euclidean distance (e.g., [GL11, KD09, RL09, SBBF12,

WTF08]). Others focus on semantic similarity (e.g., [NF11, SVD03, SH09, TFW08, NFS12,

RFF12, LWJ+12]). Our concern in this chapter is not with the algorithm used to generate the codes, but rather with fast search in Hamming space.[1]

We address two related search problems in Hamming space. Given a dataset of binary codes,

H ≡ {h_i}_{i=1}^{n}, the first problem is to find the k codes in H that are closest in Hamming distance

to a given query, i.e., kNN search in Hamming distance. The 1NN problem in Hamming space

was called the Best Match problem by Minsky and Papert [MP69]. They observed that there

are no obvious approaches significantly better than exhaustive search, and asked whether such

approaches might exist.

The second problem is to find all codes in a dataset H that are within a fixed Hamming dis-

tance of a query, sometimes called the Approximate Query problem [GPY94], or Point Location

in Equal Balls (PLEB) [IM98]. A binary code is an r-neighbor of a query code, denoted g, if it

differs from g in r bits or less. We define the r-neighbor search problem as: find all r-neighbors

of a query g from H.

One way to tackle r-neighbor search is to use a hash table populated with the binary codes

hi ∈ H, and examine all hash buckets whose indices are within r bits of a query g (e.g.,

[TFW08]). For binary codes of q bits, the number of distinct hash buckets to examine is

    V(q, r) = \sum_{z=0}^{r} \binom{q}{z} .    (4.1)

As shown in Fig. 4.1 (top), V (q, r) grows very rapidly with r. Thus, this approach is only

practical for small radii or short code lengths. Some vision applications restrict search to exact

matches (i.e., r = 0) or a small search radius (e.g., [HRCB11, WKC10] ), but in most cases of

interest the desired search radius is larger than is currently feasible (e.g., see Fig. 4.1 (bottom)).
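To get a sense of this growth, the quantity V(q, r) in (4.1) is easy to tabulate; the short C++ sketch below (illustrative only, with a hypothetical function name) reproduces the orders of magnitude plotted in Fig. 4.1 (top).

    #include <cmath>
    #include <cstdio>

    // log10 of V(q, r) = sum_{z=0}^{r} C(q, z), the number of hash buckets
    // within a Hamming ball of radius r on q-bit codes.
    double log10_hamming_ball(int q, int r) {
        double total = 1.0, term = 1.0;          // z = 0 term
        for (int z = 1; z <= r; ++z) {
            term *= (double)(q - z + 1) / z;     // C(q, z) from C(q, z-1)
            total += term;
        }
        return std::log10(total);
    }

    int main() {
        // 64-bit codes, radius 7: about 10^8.8 buckets, i.e., close to 1B (cf. Fig. 4.1).
        std::printf("%.1f\n", log10_hamming_ball(64, 7));
        // 128-bit codes, radius 6: roughly 10^9.8 buckets.
        std::printf("%.1f\n", log10_hamming_ball(128, 6));
        return 0;
    }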

Our work is inspired in part by the multi-index hashing results of Greene, Parnas, and

Yao [GPY94]. Building on the classical Turan problem for hypergraphs, they construct a set

of over-lapping binary substrings such that any two codes that differ by at most r bits are

guaranteed to be identical in at least one of the constructed substrings. Accordingly, they

[1] There do exist several other promising approaches to fast approximate NNS on large real-valued image features (e.g., [AMP11, JDS11, NPF12, ML09, BL12]). Nevertheless, we restrict our attention in this chapter to compact binary codes and exact search.


Figure 4.1: (Top) Curves show the (log10) number of distinct hash table indices (buckets) within a Hamming ball of radius r, for different code lengths. With 64-bit codes there are about 1B buckets within a Hamming ball with a 7-bit radius. Hence with fewer than 1B database items, and a search radius of 7 or more, a hash table would be less efficient than linear scan. Using hash tables with 128-bit codes is prohibitive for radii larger than 6. (Bottom) This plot shows the expected search radius required for kNN search as a function of k, based on a dataset of 1B SIFT descriptors. Binary codes with 64 and 128 bits were obtained by random projections (LSH) from the SIFT descriptors [JTDA11]. Standard deviation bars help show that large search radii are often required.

propose an exact method for finding all r-neighbors of a query using multiple hash tables, one

for each substring. At query time, candidate r-neighbors are found by using query substrings as

indices into their corresponding hash tables. As explained below, while run-time efficient, the

main drawback of their approach is the prohibitive storage required for the requisite number

of hash tables. By comparison, the method we propose requires much less storage, and is only

marginally slower in search performance.

While we focus on exact search, there also exist algorithms for finding approximate r-

neighbors (ε-PLEB), or approximate nearest neighbors (ε-NN) in Hamming distance. One

example is Hamming Locality Sensitive Hashing [IM98, GIM99], which aims to solve the (r, ε)-

neighbors decision problem: determine whether there exists a binary code h ∈ H such that

‖h − g‖H ≤ r, or whether all codes in H differ from g in (1 + ε)r bits or more. Approximate

methods are interesting, and the approach below could be made faster by allowing misses.


Nonetheless, this chapter will focus on the exact search problem.

We propose a data structure that applies to both kNN and r-neighbor search in Hamming space. We prove that, for uniformly distributed binary codes of q bits and a search radius of r bits where r/q is small, our query time is sub-linear in the size of the dataset. We also demonstrate impressive performance on real-world datasets. To our knowledge this is the first practical data structure solving exact kNN in Hamming distance.

Section 4.1 describes a multi-index hashing algorithm for r-neighbor search in Hamming

space, followed by run-time and memory analysis in Section 4.2. Section 4.3 describes our algorithm for k-nearest neighbor search, and Section 4.4 reports results on empirical

datasets.

4.1 Multi-Index Hashing

Our approach is a form of multi-index hashing. Binary codes from the database are indexed

m times into m different hash tables, based on m disjoint binary substrings. Given a query

code, entries that fall close to the query in at least one such substring are considered neighbor

candidates. Candidates are then checked for validity using the entire binary code, to remove

any non-r-neighbors. To be practical for large-scale datasets, the substrings must be chosen so

that the set of candidates is small, and storage requirements are reasonable. We also require

that all true neighbors will be found.

The key idea here stems from the fact that, with n binary codes of q bits, the vast majority

of the 2^q possible buckets in a full hash table will be empty, since 2^q ≫ n. It seems expensive

to examine all V (q, r) buckets within r bits of a query, since most of them contain no items.

Instead, we merge many buckets together (most of which are empty) by marginalizing over

different dimensions of the Hamming space. We do this by creating hash tables on substrings

of the binary codes. The distribution of the code substring comprising the first s bits is the

outcome of marginalizing the distribution of binary codes over the last q − s bits. As such, a

given bucket of the substring hash table includes all codes with the same first s bits, but having

any of the 2^{q−s} values for the remaining q − s bits. Unfortunately these larger buckets are

not restricted to the Hamming volume of interest around the query. Hence not all items in the

merged buckets are r-neighbors of the query, so we then need to cull any candidate that is not

a true r-neighbor.

4.1.1 Substring search radii

In more detail, each binary code h, comprising q bits, is partitioned into m disjoint substrings,

h^(1), . . . , h^(m), each of length ⌊q/m⌋ or ⌈q/m⌉ bits. For convenience in what follows, we assume

that q is divisible[2] by m, and that the substrings comprise contiguous bits. The key idea rests

[2] When q is not divisible by m, we use substrings of different lengths with either ⌊q/m⌋ or ⌈q/m⌉ bits, i.e., differing by at most 1 bit.


on the following statement: When two binary codes h and g differ by at most r bits, then, in

at least one of their m substrings they must differ by at most ⌊r/m⌋ bits. This leads to the

first proposition:

Proposition 1: If ‖h−g‖H ≤ r, where ‖h−g‖H denotes the Hamming distance between h and

g, then

    ∃ 1 ≤ z ≤ m   s.t.   ‖h^(z) − g^(z)‖_H ≤ r′ ,    (4.2)

where r′ = ⌊r/m⌋. Proof of Proposition 1 follows straightforwardly from the Pigeonhole Principle. That is, suppose

that the Hamming distance between each of the m substrings is strictly greater than r′. Then,

‖h− g‖H ≥ m (r′+ 1). Clearly, m (r′+ 1) > r, since r = mr′+ a for some a where 0 ≤ a < m,

which contradicts the premise.

The significance of Proposition 1 derives from the fact that the substrings have only q/m

bits, and that the required search radius in each substring is just r′ = ⌊r/m⌋. For example, if

h and g differ by 3 bits or less, and m = 4, at least one of the 4 substrings must be identical.

If they differ by at most 7 bits, then in at least one substring they differ by no more than 1

bit; i.e., we can search a Hamming radius of 7 bits by searching a radius of 1 bit on each of

4 substrings. More generally, instead of examining V (q, r) hash buckets, it suffices to examine

V (q/m, r′) buckets in each of m substring hash tables.

While it suffices to examine all buckets within a radius of r′ in all m hash tables, we next

show that it is not always necessary. Rather, it is often possible to use a radius of just r′ − 1

in some of the m substring hash tables while still guaranteeing that all r-neighbors of g will

be found. In particular, with r = mr′ + a, where 0 ≤ a < m, to find any item within a radius

of r on q-bit codes, it suffices to search a + 1 substring hash tables to a radius of r′, and the

remaining m− (a+ 1) substring hash tables up to a radius of r′− 1. Without loss of generality,

since there is no order to the substring hash tables, we search the first a + 1 hash tables with

radius r′, and all remaining hash tables with radius r′ − 1.

Proposition 2: If ‖h − g‖_H ≤ r = mr′ + a, then

    ∃ 1 ≤ z ≤ a+1   s.t.   ‖h^(z) − g^(z)‖_H ≤ r′ ,        (4.3a)
                    OR
    ∃ a+1 < z ≤ m   s.t.   ‖h^(z) − g^(z)‖_H ≤ r′ − 1 .    (4.3b)

To prove Proposition 2, we show that when (4.3a) is false, (4.3b) must be true. If (4.3a) is false,

then it must be that a < m−1, since otherwise a = m−1, in which case (4.3a) and Proposition

1 are equivalent. If (4.3a) is false, it also follows that h and g differ in each of their first a+ 1

substrings by r′ + 1 or more bits. Thus, the total number of bits that differ in the first a + 1

substrings is at least (a+1)(r′+1). Because ||h−g||H ≤ r, it also follows that the total number


Algorithm 1 Building m substring hash tables (or direct address tables).

  Input: binary code dataset H = {h_i}_{i=1}^{n}
  for j = 1 to m do
    initialize the jth hash table (or direct address table)
    for i = 1 to n do
      insert (key = h_i^(j), id = i) into the jth hash table
    end for
  end for

of bits that differ in the remaining m − (a+1) substrings is at most r − (a+1)(r′+1). Then,

using Proposition 1, the maximum search radius required in each of the remaining m− (a+ 1)

substring hash tables is

    \left\lfloor \frac{r - (a+1)(r'+1)}{m - (a+1)} \right\rfloor
      = \left\lfloor \frac{mr' + a - (a+1)r' - (a+1)}{m - (a+1)} \right\rfloor
      = \left\lfloor r' - \frac{1}{m - (a+1)} \right\rfloor
      = r' - 1 ,    (4.4)

and hence Proposition 2 is true. Because of the near exponential growth in the number of

buckets for large search radii, the smaller substring search radius required by Proposition 2 is

significant.

A special case of Proposition 2 is when r < m, hence r′ = 0 and a = r. In this case, it

suffices to search r+ 1 substring hash tables for a radius of r′ = 0 (i.e., exact matches), and the

remaining m− (r + 1) substring hash tables can be ignored. Clearly, if a code does not match

exactly with a query in any of the selected r+ 1 substrings, then the code must differ from the

query in at least r + 1 bits.
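As a small concrete illustration of Proposition 2 and this special case, the per-table substring search radii can be computed as follows (a trivial C++ sketch; the function name is hypothetical, and a negative radius means the table can be skipped).

    #include <cstdio>
    #include <vector>

    // Substring search radii prescribed by Proposition 2: with r = m*r' + a,
    // the first a+1 tables are searched to radius r', the rest to radius r'-1.
    std::vector<int> substring_radii(int r, int m) {
        int rp = r / m, a = r % m;
        std::vector<int> radii(m);
        for (int j = 0; j < m; ++j)
            radii[j] = (j <= a) ? rp : rp - 1;
        return radii;
    }

    int main() {
        // r = 7, m = 4 gives {1, 1, 1, 1}: radius-1 search in each of 4 tables.
        // r = 2, m = 4 gives {0, 0, 0, -1}: exact match in 3 tables, 1 table skipped.
        for (int rad : substring_radii(7, 4)) std::printf("%d ", rad);
        std::printf("\n");
        return 0;
    }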

4.1.2 Multi-Index Hashing for r-neighbor search

In a pre-processing step, given a dataset of binary codes, one hash table is built for each of the

m substrings, as outlined in Algorithm 1. Even though we use the term hash table, we make

use of direct address tables when the substring length is small and one can allocate 2^{q/m} buckets. If the substring length is large, one has to use a mapping from binary codes (e.g., taking them modulo a large prime number) to create a smaller number of buckets. At query time, given a

query g with substrings {g(j)}mj=1, we search the jth substring hash table for entries that are

within a Hamming distance of ⌊r/m⌋ or ⌊r/m⌋ − 1 of g^(j), as prescribed by (4.3). By doing

so we obtain a set of candidates from the jth substring hash table, denoted Nj(g). According

to the propositions above, the union of the m sets, N (g) =⋃j Nj(g), is necessarily a superset

of the r-neighbors of g. The last step of the algorithm computes the full Hamming distance

between g and each candidate in N (g), retaining only those codes that are true r-neighbors of

g. Algorithm 2 outlines the r-neighbor retrieval procedure for a query g.
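To make the pre-processing step of Algorithm 1 concrete, the following is a minimal C++ sketch for 128-bit codes split into m contiguous substrings; the two-word code layout, the use of std::unordered_map in place of direct address tables, and all identifiers are illustrative assumptions rather than the released implementation (see Section 4.5 for the actual table layout).

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // A 128-bit binary code stored as two 64-bit words.
    struct Code128 { uint64_t w[2]; };

    // Extract the j-th substring of s contiguous bits (assumes s <= 32 and m = 128/s).
    uint32_t substring(const Code128& c, int j, int s) {
        int start = j * s, word = start / 64, off = start % 64;
        uint64_t bits = c.w[word] >> off;
        if (off + s > 64)                         // substring spills into the next word
            bits |= c.w[word + 1] << (64 - off);
        return (uint32_t)(bits & ((1ULL << s) - 1));
    }

    // One hash table per substring: table[j] maps a substring key to the ids of
    // all database codes whose j-th substring equals that key (Algorithm 1).
    typedef std::unordered_map<uint32_t, std::vector<uint32_t>> SubstringTable;

    std::vector<SubstringTable> build_tables(const std::vector<Code128>& db, int m) {
        int s = 128 / m;                          // substring length (q divisible by m)
        std::vector<SubstringTable> tables(m);
        for (uint32_t i = 0; i < db.size(); ++i)
            for (int j = 0; j < m; ++j)
                tables[j][substring(db[i], j, s)].push_back(i);
        return tables;
    }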


Algorithm 2 r-neighbor search for query g using multi-index hashing with m substrings.

  Input: query substrings {g^(j)}_{j=1}^{m}
  initialize mark: for 1 ≤ c ≤ n, set mark[c] ← false
  a ← r − m⌊r/m⌋
  for j = 1 to m do
    if j ≤ a + 1 then
      ρ ← ⌊r/m⌋
    else
      ρ ← ⌊r/m⌋ − 1
    end if
    for t = 0 to ρ do
      from the jth substring hash table, look up buckets with keys differing from g^(j) in t bits
      for each candidate found with id c do
        if not mark[c] then
          mark[c] ← true
          if the code with id c differs from g in at most r bits (full Hamming distance) then
            add c to the set of r-neighbors of g
          end if
        end if
      end for
    end for
  end for
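Continuing the sketch above (Code128, substring, and the tables come from that block), the query side of Algorithm 2 might look as follows; the recursive enumeration of substring keys and the identifiers are again illustrative assumptions, not the released code.

    // 128-bit Hamming distance via two popcounts.
    int hamming128(const Code128& a, const Code128& b) {
        return __builtin_popcountll(a.w[0] ^ b.w[0]) +
               __builtin_popcountll(a.w[1] ^ b.w[1]);
    }

    // Look up all buckets whose s-bit key differs from 'key' in at most 'radius'
    // bits, flipping bit positions >= 'start', and collect the stored ids.
    void probe(const SubstringTable& table, uint32_t key, int s, int start,
               int radius, std::vector<uint32_t>& candidates) {
        auto it = table.find(key);
        if (it != table.end())
            candidates.insert(candidates.end(), it->second.begin(), it->second.end());
        if (radius == 0) return;
        for (int b = start; b < s; ++b)
            probe(table, key ^ (1u << b), s, b + 1, radius - 1, candidates);
    }

    // Exact r-neighbor search with multi-index hashing (Algorithm 2).
    std::vector<uint32_t> r_neighbors(const std::vector<SubstringTable>& tables,
                                      const std::vector<Code128>& db,
                                      const Code128& query, int r) {
        int m = (int)tables.size(), s = 128 / m, rp = r / m, a = r % m;
        std::vector<char> marked(db.size(), 0);
        std::vector<uint32_t> result;
        for (int j = 0; j < m; ++j) {
            int radius = (j <= a) ? rp : rp - 1;         // radii from Proposition 2
            if (radius < 0) continue;                    // this table can be skipped
            std::vector<uint32_t> candidates;
            probe(tables[j], substring(query, j, s), s, 0, radius, candidates);
            for (uint32_t c : candidates)
                if (!marked[c]) {
                    marked[c] = 1;                       // duplicate detection
                    if (hamming128(db[c], query) <= r)   // full-distance candidate check
                        result.push_back(c);
                }
        }
        return result;
    }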

The query search time depends on the number of buckets examined (lookups) and the number of candidates tested by full Hamming distance comparison. Some buckets may be empty, for which we only pay the cost of a lookup but not a candidate check. Some candidates may also

be duplicates, for which we only pay the cost of a duplicate detection and not a full Hamming

comparison. Not surprisingly, there is a natural trade-off between the number of lookups and

the number of candidates, controlled by the number of substrings. With a large number of

lookups one can minimize the number of extraneous candidates. By merging many buckets to

reduce the number of lookups, one obtains a large number of candidates to test. In the extreme

case with m = q, substrings are 1 bit long, so we can expect the candidate set to include almost

the entire database.

Note that the idea of building multiple hash tables is not novel in itself (e.g., see [GPY94,

IM98]). However, previous work relied heavily on exact matches in substrings. Relaxing this

constraint is what leads to a more effective algorithm, especially in terms of the storage require-

ment.

4.2 Performance analysis

We next develop an analytical model of search performance to help address two key questions:

(1) How does search cost depend on substring length, and hence the number of substrings?

(2) How do run-time and storage complexity depend on database size, code length, and search


radius?

To help answer these questions we exploit a well-known bound on the sum of binomial

coefficients [FG06]; i.e., for any 0 < ε ≤ 1/2 and η ≥ 1,

    \sum_{\kappa=0}^{\lfloor \epsilon \eta \rfloor} \binom{\eta}{\kappa} \le 2^{H(\epsilon) \eta} ,    (4.5)

where H(ε) ≡ −ε log2 ε − (1− ε) log2(1− ε) is the entropy of a Bernoulli distribution with

probability ε.

In what follows, n continues to denote the number of q-bit database codes, and r is the

Hamming search radius. Let m denote the number of hash tables, and let s denote the substring

length s = q/m. Hence, the maximum substring search radius becomes r′ = ⌊r/m⌋ = ⌊s r/q⌋. As above, for the sake of model simplicity, we assume q is divisible by m.

We begin by formulating an upper bound on the number of lookups. First, the number of

lookups in Algorithm 2 is bounded above by the product of m, the number of substring hash

tables, and the number of hash buckets within a radius of ⌊s r/q⌋ on substrings of length s bits.

Accordingly, using (4.5), if the search radius is less than half the code length, r ≤ q/2 , then

the total number of lookups is given by

    lookups(s) = m \sum_{z=0}^{\lfloor s r/q \rfloor} \binom{s}{z} \le \frac{q}{s}\, 2^{H(r/q) s} .    (4.6)

Clearly, as we decrease the substring length s, thereby increasing the number of substrings m,

exponentially fewer lookups are needed.

To analyze the expected number of candidates per bucket, we consider the case in which the

n binary codes are uniformly distributed over the Hamming space. In this case, for a substring

of s bits, for which a substring hash table has 2^s buckets, the expected number of items per bucket is n/2^s. The expected size of the candidate set therefore equals the number of lookups times n/2^s.

The total search cost per query is the cost for lookups plus the cost for candidate tests.

While these costs will vary with the code length q and the way the hash tables are implemented,

empirically we find that, to a reasonable approximation, the costs of a lookup and a candidate

test are similar (when q ≤ 256). Accordingly, we model the total search cost per query, for

retrieving all r-neighbors, in units of the time required for a single lookup, as

    cost(s) = \left(1 + \frac{n}{2^s}\right) \frac{q}{s} \sum_{k=0}^{\lfloor s r/q \rfloor} \binom{s}{k} ,    (4.7)

            \le \left(1 + \frac{n}{2^s}\right) \frac{q}{s}\, 2^{H(r/q) s} .    (4.8)

In practice, database codes will generally not be uniformly distributed, nor are uniformly


distributed codes ideal for multi-index hashing. Indeed, the cost of search with uniformly

distributed codes is relatively high since the search radius increases as the density of codes

decreases. Rather, the uniform distribution is primarily a mathematical convenience that fa-

cilitates the analysis of run-time, thereby providing some insight into the effectiveness of the

approach and how one might choose an effective substring length.

4.2.1 Choosing an effective substring length

As noted above in Section 4.1.2, finding a good substring length is central to the efficiency of

multi-index hashing. When the substring length is too large or too small the approach will not

be effective. In practice, an effective substring length for a given dataset can be determined by

cross-validation. Nevertheless this can be expensive.

In the case of uniformly distributed codes, one can instead use the analytic cost model

in (4.7) to find a near optimal substring length. As discussed below, we find that a substring

length of s = log2 n yields a near-optimal search cost. Further, with non-uniformly distributed

codes in benchmark datasets, we confirm empirically that s = log2 n is also a reasonable heuris-

tic for choosing the substring length (e.g., see Table 4.4 below).

In more detail, to find a good substring length using the cost model above, assuming uni-

formly distributed binary codes, we first note that, dividing cost(s) in (4.7) by q has no effect

on the optimal s. Accordingly, one can view the optimal s as a function of two quantities,

namely the number of items, n, and the search ratio r/q.
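As a sanity check on this analysis, the cost model (4.7) is simple to evaluate numerically. The sketch below scans candidate substring lengths by brute force and reports the minimizer, under the assumption of uniformly distributed codes; the function names and the brute-force scan are illustrative choices, not part of the thesis code.

    #include <cmath>
    #include <cstdio>

    // V(s, r') = sum_{k=0}^{r'} C(s, k): buckets within radius r' on s-bit substrings.
    double hamming_ball(int s, int rp) {
        double total = 1.0, term = 1.0;
        for (int k = 1; k <= rp; ++k) {
            term *= (double)(s - k + 1) / k;
            total += term;
        }
        return total;
    }

    // Cost model of Eq. (4.7): (1 + n/2^s) * (q/s) * V(s, floor(s*r/q)),
    // in units of a single lookup, for uniformly distributed codes.
    double cost(double n, int q, int r, int s) {
        int rp = (s * r) / q;                          // floor(s r / q)
        return (1.0 + n / std::pow(2.0, s)) * ((double)q / s) * hamming_ball(s, rp);
    }

    int main() {
        double n = 1e9; int q = 240, r = q / 4;        // r/q = 0.25, as in Fig. 4.2 (top)
        int best_s = 1;
        for (int s = 2; s <= q; ++s)
            if (cost(n, q, r, s) < cost(n, q, r, best_s)) best_s = s;
        // The minimizer lands in the vicinity of log2(n) = 30 bits.
        std::printf("best substring length: %d bits\n", best_s);
        return 0;
    }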

Fig. 4.2 plots search cost as a function of substring length s, for 240-bit codes, different

database sizes n, and different search radii (expressed as a fraction of the code length q).

Dashed curves depict cost(s) in (4.7) while solid curves of the same color depict the upper

bound in (4.8). The tightness of the bound is evident in the plots, as are the quantization

effects of the upper range of the sum in (4.7). The small circles in Fig. 4.2 (top) depict cost

when all quantization effects are included, and hence it is only shown at substring lengths that

are integer divisors of the code length.

Fig. 4.2 (top) shows cost for search radii equal to 5%, 15% and 25% of the code length, with

n=109 in all cases. One striking property of these curves is that the cost is persistently minimal

in the vicinity of s = log2 n, indicated by the vertical line close to 30 bits. This behavior is

consistent over a wide range of database sizes.

Fig. 4.2 (bottom) shows the dependence of cost on s for databases with n = 10^6, 10^9, and 10^12, all with r/q = 0.25 and q = 128 bits. In this case we have laterally displaced each curve

by − log2 n; notice how this aligns the minima close to 0. These curves suggest that, over a

wide range of conditions, cost is minimal for s = log2 n. For this choice of the substring length,

the expected number of items per substring bucket, i.e., n/2^s, reduces to 1. As a consequence,

the number of lookups is equal to the expected number of candidates. Interestingly, this choice

of substring length is similar to that of Greene et al. [GPY94]. A somewhat involved theoretical

analysis based on Stirling’s approximation, omitted here, also suggests that as n goes to infinity,


Figure 4.2: Cost (4.7) and its upper bound (4.8) are shown as functions of substring length (using dashed and solid curves respectively). The code length in all cases is q = 240 bits. (Top) Cost for different search radii, all for a database with n = 10^9 codes. Circles depict a more accurate cost measure, only for substring lengths that are integer divisors of q, and with the more efficient indexing in Algorithm 3. (Bottom) Three database sizes, all with a search radius of r = 0.25 q. The minima are aligned when each curve is displaced horizontally by − log2 n.

the optimal substring length converges asymptotically to log2 n.

4.2.2 Run-time complexity

Choosing s in the vicinity of log2 n also permits a simple characterization of retrieval run-time

complexity, for uniformly distributed binary codes. When s = log2 n, the upper bound on the

number of lookups (4.6) also becomes a bound on the number of candidates. In particular, if we

substitute log2 n for s in (4.8), then we find the following upper bound on the cost, now as a

function of database size, code length, and the search radius:

    cost(s) \le \frac{2q}{\log_2 n}\, n^{H(r/q)} .    (4.9)

Thus, for a uniform distribution over binary codes, if we choose m such that s ≈ log2 n,

the expected query time complexity is O(q n^{H(r/q)} / log2 n). For a small ratio of r/q this is sub-

linear in n. For example, if r/q ≤ .11, then H(.11) < .5, and the run-time complexity becomes


O(√n / log2 n). That is, the search time increases with the square root of the database size when the search radius is approximately 10% of the code length. For r/q ≤ .06, this becomes O(n^{1/3} / log2 n). The time complexity with respect to q is not as important as that with respect

to n since q is not expected to vary significantly in most applications.

4.2.3 Storage complexity

The storage complexity of our multi-index hashing algorithm is asymptotically optimal when

⌊q/log2 n⌋ ≤ m ≤ ⌈q/log2 n⌉. To store the full database of binary codes requires O(nq)

bits. For each of m hash tables, we also need to store n unique identifiers to the database

items. This allows one to identify the retrieved items and fetch their full codes; this requires

an additional O(mn log2 n) bits. In sum, the storage required is O(nq + mn log2 n). When

⌊q/log2 n⌋ ≤ m ≤ ⌈q/log2 n⌉, as is suggested above, this storage cost reduces to O(nq + n log2 n).

Here, the n log2 n term does not cancel as m ≥ 1, but in most interesting cases q > log2 n.

While the storage cost for our multi-index hashing algorithm is linear in nq, the related

multi-index hashing algorithm of Greene et al. [GPY94] entails storage complexity that is super-

linear in n. To find all r-neighbors, for a given search radius r, they construct m = O(r 2^{s r/q})

substrings of length s bits per binary code. Their suggested substring length is also s = log2 n, so

the number of substring hash tables becomes m = O(r n^{r/q}), each of which requires O(n log2 n)

in storage. As a consequence for large values of n, even with small r, this technique requires a

prohibitive amount of memory to store the hash tables.

Our approach is more memory-efficient than that of [GPY94] because we do not enforce

exact equality in substring matching. In essence, instead of creating all of the hash tables

off-line, and then having to store them, we flip bits of each substring at run-time and implicitly

create some of the substring hash tables on-line. This increases run-time slightly, but greatly

reduces storage costs.

4.3 k-Nearest neighbor search

To use the above multi-index hashing in practice, one must specify a Hamming search radius

r. For many tasks, the value of r is chosen such that queries will, on average, retrieve k near

neighbors. Nevertheless, as expected, we find that for many hashing techniques and different

sources of visual data, the distribution of binary codes is such that a single search radius for

all queries will not produce similar numbers of neighbors.

Fig. 4.3 depicts empirical distributions of search radii needed for 10-NN and 1000-NN on

three sets of binary codes obtained from 1B SIFT descriptors [JTDA11, Low04]. In all cases, for

64 and 128-bit codes, and for hash functions based on angular LSH [Cha02] and MLH (Chap-

ter 2), there is a substantial variance in the search radius. This suggests that binary codes are

not uniformly distributed over the Hamming space. As an example, for 1000-NN in 64-bit LSH

codes, more than 10% of the queries require a search radius of 10 bits or larger, while for about


Figure 4.3: Shown are histograms of the search radii that are required to find 10-NN and 1000-NN, for 64- and 128-bit codes from angular LSH [Cha02], and 128-bit codes from MLH [NF11], based on 1B SIFT descriptors [JTDA11]. Clearly shown are the relatively large search radii required for both the 10-NN and the 1000-NN tasks, as well as the increase in the radii required when using 128 bits versus 64 bits.

10% of the queries it can be 5 bits or smaller. Also evident from Fig. 4.3 is the growth in the

required search radius as one moves from 64-bit codes to 128 bits, and from 10-NN to 1000-NN.

A fixed radius for all queries would produce too many neighbors for some queries, and

too few for others. It is therefore more natural for many tasks to fix the number of required

neighbors, i.e., k, and let the search radius depend on the query. Fortunately, our multi-index

hashing algorithm is easily adapted to accommodate query-dependent search radii.

Given a query, one can progressively increase the Hamming search radius per substring,

until a specified number of neighbors is found. For example, if one examines all r′-neighbors of

a query’s substrings, from which more than k candidates are found to be within a Hamming

distance of (r′ + 1)m− 1 bits in full codes, then it is guaranteed that k-nearest neighbors have

been found. Indeed, if all kNNs of a query g differ from g in r bits or less, then Propositions

1 and 2 above provide guarantees that all such neighbors will be found if one searches the

substring hash tables with the prescribed radii.

In our experiments, we follow this progressive increment of the search radius until we can


Algorithm 3 kNN search with query g.

  Input: query substrings {g^(i)}_{i=1}^{m}
  initialize mark: for 1 ≤ c ≤ n, set mark[c] ← false
  initialize sets: for 0 ≤ d ≤ q, set N_d = ∅
  initialize integers: r′ ← 0, a ← 0, r ← 0
  repeat
    assert: full radius of search is r = mr′ + a
    from the (a+1)th substring hash table, look up buckets with keys differing from g^(a+1) in r′ bits
    for each candidate found with id c do
      if not mark[c] then
        mark[c] ← true
        let d equal the full Hamming distance between the code with id c and query g
        add c to N_d
      end if
    end for
    r ← r + 1
    a ← a + 1
    if a ≥ m then
      a ← 0
      r′ ← r′ + 1
    end if
  until Σ_{d=0}^{r−1} |N_d| ≥ k   (i.e., k (r−1)-neighbors have been found)

find kNN in the guaranteed neighborhood of a query. This approach, outlined in Algorithm 3,

is helpful because it uses a query-specific search radius depending on the distribution of codes

in the neighborhood of the query.
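As an illustration, Algorithm 3 can be sketched in C++ as follows, reusing Code128, substring, hamming128, and the substring tables from the sketches in Section 4.1; the exact-radius key enumeration and the identifiers are illustrative assumptions rather than the released implementation.

    // Look up buckets whose key differs from 'key' in exactly 'flips' bits,
    // choosing flipped positions >= 'start' in increasing order.
    void probe_exact(const SubstringTable& table, uint32_t key, int s, int start,
                     int flips, std::vector<uint32_t>& out) {
        if (flips == 0) {
            auto it = table.find(key);
            if (it != table.end())
                out.insert(out.end(), it->second.begin(), it->second.end());
            return;
        }
        for (int b = start; b <= s - flips; ++b)
            probe_exact(table, key ^ (1u << b), s, b + 1, flips - 1, out);
    }

    // Progressive-radius kNN search (Algorithm 3): probe one substring table per
    // round, growing the guaranteed search radius r by one bit per round.
    std::vector<uint32_t> knn_search(const std::vector<SubstringTable>& tables,
                                     const std::vector<Code128>& db,
                                     const Code128& query, int k) {
        const int q = 128;
        int m = (int)tables.size(), s = q / m, rp = 0, a = 0;
        std::vector<char> marked(db.size(), 0);
        std::vector<std::vector<uint32_t>> by_dist(q + 1);   // N_d, grouped by distance
        long found = 0;                                      // running sum of |N_d|, d <= r
        for (int r = 0; ; ++r) {                             // invariant: r = m*rp + a
            std::vector<uint32_t> cand;
            probe_exact(tables[a], substring(query, a, s), s, 0, rp, cand);
            for (uint32_t c : cand)
                if (!marked[c]) {
                    marked[c] = 1;
                    by_dist[hamming128(db[c], query)].push_back(c);
                }
            found += (long)by_dist[r].size();   // all r-neighbors are now guaranteed found
            if (found >= k || r == q) break;
            if (++a >= m) { a = 0; ++rp; }
        }
        std::vector<uint32_t> result;           // report the k nearest, by distance
        for (int d = 0; d <= q && (int)result.size() < k; ++d)
            for (size_t i = 0; i < by_dist[d].size() && (int)result.size() < k; ++i)
                result.push_back(by_dist[d][i]);
        return result;
    }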

4.4 Experiments

Our implementation of multi-index hashing is available at [MIH]. Experiments are run on two

different architectures. The first is a mid- to low-end 2.3GHz dual quad-core AMD Opteron

processor, with 2MB of L2 cache, and 128GB of RAM. The second is a high-end machine with

a 2.9GHz dual quad-core Intel Xeon processor, 20MB of L2 cache, and 128GB of RAM. The

difference in the size of the L2 cache has a major impact on the run-time of linear scan, since

the effectiveness of linear scan depends greatly on L2 cache lines. With roughly ten times the

L2 cache, linear scan on the Intel platform is roughly 50% faster than on the AMD machine. By

comparison, multi-index hashing does not have a serial memory access pattern and so the cache

size does not have such a pronounced effect. Actual run-times for multi-index hashing on the

Intel and AMD platforms are within 20% of one another.

Both linear scan and multi-index hashing were implemented in C++ and compiled with

identical compiler flags. To accommodate the large size of memory footprint required for 1B

codes, we used the libhugetlbfs package and Linux Kernel 3.2.0 to allow the use of 2MB page

sizes. Further details about the implementations are given in Section 4.5. Finally, despite


Figure 4.4: Memory footprint of our multi-index hashing implementation as a function of database size. Note that the memory usage does not grow super-linearly with dataset size. The memory usage is independent of the number of nearest neighbors requested.


Figure 4.5: Recall rates for the BIGANN dataset [JTDA11] (1M and 1B subsets) obtained by kNN on 64- and 128-bit MLH and LSH codes.

the existence of multiple cores, all experiments are run on a single core to simplify run-time

measurements.

The memory requirements for multi-index hashing are described in detail in Section 4.5.

We currently require approximately 27 GB for multi-index hashing with 1B 64-bit codes, and

approximately twice that for 128-bit codes. Fig. 4.4 shows how the memory footprint depends

on the database size for linear scan and multi-index hashing. As explained in Section 4.2.3, and demonstrated in Fig. 4.4, the memory cost of multi-index hashing grows linearly in the

database size, as does linear scan. While we use a single computer in our experiments, one

could implement a distributed version of multi-index hashing on computers with much less

memory by placing each substring hash table on a separate computer.


4.4.1 Datasets

We consider two well-known large-scale vision corpora: 80M Gist descriptors from 80 million

tiny images [TFF08] and 1B SIFT features from the BIGANN dataset [JTDA11]. SIFT vec-

tors [Low04] are 128D descriptors of local image structure in the vicinity of feature points. Gist

features [OT01] extracted from 32× 32 images capture global image structure in 384D vectors.

These two feature types cover a spectrum of NNS problems in vision from feature to image

indexing.

We use two similarity-preserving mappings to create datasets of binary codes, namely, binary

angular Locality Sensitive Hashing (LSH) [Cha02], and Minimal Loss Hashing (MLH) (Chap-

ter 2). LSH is considered a baseline random projection method, closely related to cosine simi-

larity. MLH is a state-of-the-art learning algorithm that, given a set of similarity labels, finds

an optimal mapping by minimizing a loss function over pairs or triplets of binary codes.

Both the 80M Gist and 1B SIFT corpora comprise three disjoint sets, namely, a training

set, a base set for populating the database, and a test query set. Using a random permutation,

Gist descriptors are divided into a training set with 300K items, a base set of 79 million items,

and a query set of size 104. The SIFT corpus comes with 100M for training, 109 in the base

set, and 104 test queries.

For LSH we subtract the mean, and pick a set of coefficients from the standard normal

density for a linear projection, followed by quantization. For MLH the training set is used to

optimize hash function parameters. After learning is complete, we remove the training data

and apply the learned hash function to the base set to create the database of binary codes.

With two image corpora (SIFT and Gist), up to three code lengths (64, 128, and 256 bits), and

two hashing methods (LSH and MLH), we obtain several datasets of binary codes with which

to evaluate our multi-index hashing algorithm. Note that 256-bit codes are only used with LSH

and SIFT vectors.

Fig. 4.5 shows Euclidean nearest neighbor recall rates for kNN search on binary codes

generated from 1M and 1B SIFT descriptors. In particular, we plot the fraction of Euclidean

first nearest neighbors found, by kNN in 64-bit and 128-bit LSH and MLH binary codes. As

expected 128-bit codes are more accurate, and MLH outperforms LSH. Note that the multi-

index hashing algorithm solves exact kNN search in Hamming distance; the approximation

that reduces recall is due to the mapping from the original Euclidean space to the Hamming

space. To preserve the Euclidean structure in the original SIFT descriptors, it seems useful

to use longer codes, and exploit data-dependent hash functions such as MLH. Interestingly,

as described below, the speedup factors of multi-index hashing on MLH codes are better than

those for LSH.

Obviously, Hamming distance computed on q-bit binary codes is an integer between 0 and

q. Thus, the nearest neighbors in Hamming distance can be divided into subsets of elements

that have equal Hamming distance (up to q + 1 subsets). Although Hamming distance does

not provide a means to distinguish between equi-distant elements, often a re-ranking phase


                               speedup factors for kNN vs. linear scan
  dataset    # bits   mapping   1-NN   10-NN   100-NN   1000-NN   linear scan
  SIFT 1B    64       MLH        823     757      587       390   16.51s
             64       LSH        781     698      547       306
             128      MLH       1048     675      353       147   42.64s
             128      LSH        747     426      208        91
             256      LSH        220     111       58        27   62.31s
  Gist 79M   64       MLH        401     265      137        51   1.30s
             64       LSH        322     145       55        18
             128      MLH        124      50       26        13   3.37s
             128      LSH         85      33       18         9

Table 4.1: Summary of results for nine datasets of binary codes on the AMD Opteron processor with 2MB L2 cache. The first five rows correspond to 1 billion binary codes, while the last four rows show the results for 79 million codes. Codes are 64, 128, or 256 bits long, obtained by LSH or MLH. The run-time of linear scan is reported along with the speedup factors for kNN with multi-index hashing.

using Asymmetric Hamming distance [GP11] or other distance measures is helpful in practice.

Nevertheless, this chapter is solely concerned with the exact Hamming kNN problem up to a

selection of equi-distant elements in the top k elements.

4.4.2 Experimental results

Each of our experiments involves 10^4 queries, for which we report the average run-time. Our

implementation of the linear scan baseline searches 60 million 64-bit codes in just under one

second on the AMD machine. On the Intel machine it examines over 80 million 64-bit codes

per second. This is remarkably fast compared to Euclidean NNS with 128D SIFT vectors. The

speed of linear scan is in part due to memory caching, without which it would be much slower.

Run-times for linear scan on all of the datasets, on both architectures, are reported in Tables 4.1

and 4.2.

4.4.3 Multi-Index Hashing vs. Linear Scan

Tables 4.1 and 4.2 show run-time per query for the linear scan baseline, along with speedup

factors of multi-index hashing for different kNN problems and nine different datasets. Despite

the remarkable speed of linear scan, the multi-index hashing implementation is hundreds of

times faster. For example, the multi-index hashing method solves the exact 1000-NN for a

dataset of 1B 64-bit codes in about 50 ms, well over 300 times faster than linear scan (see

Table 4.1). Performance on 1-NN and 10-NN are even more impressive. With 128-bit MLH

codes, multi-index hashing executes the 1NN search task over 1000 times faster than the linear

scan baseline.

The run-time of linear scan does not depend on the number of neighbors (except for partial

sorting of distances to find k-nearest neighbors), nor on the underlying distribution of binary


                               speedup factors for kNN vs. linear scan
  dataset    # bits   mapping   1-NN   10-NN   100-NN   1000-NN   linear scan
  SIFT 1B    64       MLH        573     542      460       291   12.23s
             64       LSH        556     516      411       237
             128      MLH        670     431      166        92   20.71s
             128      LSH        466     277      137        60
             256      LSH        115      67       34        16   38.89s
  Gist 79M   64       MLH        286     242      136        53   0.97s
             64       LSH        256     142       55        18
             128      MLH         77      37       19        10   1.64s
             128      LSH         45      18        9         5

Table 4.2: Summary of results for nine datasets of binary codes on the Intel Xeon processor with 20MB L2 cache. Note that the speedup factors reported in this table for multi-index hashing are smaller than in Table 4.1. This is due to the significant effect of cache size on the run-time of linear scan on the Intel architecture.

codes. The run-time for multi-index hashing, however, depends on both factors. In particular,

as the desired number of near neighbors increases, the Hamming radius of the search also

increases (e.g., see Fig. 4.3). This implies longer run-times for multi-index hashing. Indeed,

notice that going from 1-NN to 1000-NN on each row of the tables shows a decrease in the

speedup factors.

The multi-index hashing run-time also depends on the distribution of binary codes. Indeed,

one can see from Table 4.1 that MLH code databases yield faster run times than the LSH

codes; e.g., for 100-NN in 1B 128-bit codes the speedup for MLH is 353× vs 208× for LSH.

Fig. 4.3 depicts the histograms of search radii needed for 1000-NN with 1B 128-bit MLH and

LSH codes. Interestingly, the mean of the search radii for MLH codes is 19.9 bits, while it is

19.8 for LSH. While the means are similar, the variances are not; the standard deviations of the

search radii for MLH and LSH are 4.0 and 5.0 respectively. The longer tail of the distribution

of search radii for LSH plays an important role in the expected run-time. In fact, queries that

require relatively large search radii tend to dominate the average query cost.

It is also interesting to look at the multi-index hashing run-times as a function of n, the

number of binary codes in the database. To that end, Figures 4.6, 4.7, and 4.8 depict run-times

for linear scan and multi-index kNN search on 64, 128, and 256-bit codes on the AMD machine.

The left two figures in each show different vertical scales (since the behavior of multi-index kNN

and linear scan are hard to see at the same scale). The right-most panels show the same data

on log-log axes. First, it is clear from these plots that multi-index hashing is much faster than

linear scan for a wide range of dataset sizes and k. Just as importantly, it is evident from the

log-log plots that as we increase the database size, the speedup factors improve. The dashed

lines on the log-log plots depict √n (up to a scalar constant). The similar slope of the multi-index hashing curves and the square root curves shows that multi-index hashing exhibits sub-linear

query time, even for the empirical, non-uniform distributions of codes.


Figure 4.6: Run-times per query for multi-index hashing with 1, 10, 100, and 1000 nearest neighbors, and a linear scan baseline, on 1B 64-bit binary codes given by LSH from SIFT. Run on an AMD Opteron processor.


Figure 4.7: Run-times per query for multi-index hashing with 1, 10, 100, and 1000 nearest neighbors, and a linear scan baseline, on 1B 128-bit binary codes given by LSH from SIFT. Run on an AMD Opteron processor.


Figure 4.8: Run-times per query for multi-index hashing with 1, 10, 100, and 1000 nearest neighbors, and a linear scan baseline, on 1B 256-bit binary codes given by LSH from SIFT. Run on an AMD Opteron processor.


4.4.4 Direct lookups with a single hash table

An alternative to linear scan and multi-index hashing is to hash the entire codes into a single

hash table, and then use direct hashing with each query. As suggested in the introduction and

Fig. 4.1, although this approach avoids the need for any candidate checking, it may require a

prohibitive number of lookups. Nevertheless, for sufficiently small code lengths or search radii,

it may be effective in practice.

Given the complexity associated with efficiently implementing collision detection in large

hash tables, we do not directly experiment with the single hash table approach. Instead, we

consider the empirical number of lookups one would need, as compared to the number of items

in the database. If the number of lookups is vastly greater than the size of the dataset one

can readily conclude that linear scan is likely to be as fast or faster than direct indexing into a

single hash table.

Fortunately, the statistics of neighborhood sizes and required search radii for kNN tasks are

available from the linear scan and multi-index hashing experiments reported above. For a given

query, one can use the kth nearest neighbor’s Hamming distance to compute the number of

lookups from a single hash table that are required to find all of the query’s k nearest neighbors.

Summed over the set of queries, this provides an indication of the expected run-time.

Fig. 4.9 shows the total number of lookups required for 1-NN and 1000-NN tasks by the

single hash table (SHT) approach and the multi-index hashing (MIH) on 64- and 128-bit LSH

codes from SIFT. They are plotted as a function of the size of the dataset, from 10^4 to 10^9

items. For comparison, the plots also show the number of database items, and the number of

operations that were needed for linear scan. Note that Fig. 4.9 has logarithmic scales.

It is evident that with a single hash table the number of lookups is almost always several

orders of magnitude larger than the number of items in the dataset. And not surprisingly, this is

also several orders of magnitude more lookups than required for multi-index hashing. Although

the relative speed of a lookup operation compared to a candidate check, as used in linear scan,

depends on the implementation, there are a few important considerations. Linear scan has

an exactly serial memory access pattern and so can make very efficient use of cache, whereas

lookups in a hash table are inherently random. Furthermore, in any plausible implementation of

a single hash table for 64 bit or longer codes, there will be some penalty for collision detection.

As illustrated in Fig. 4.9, the only cases where a single hash table might potentially be more

efficient than linear scan are with very small codes (64 bits or less), with a large dataset (1

billion items or more), and small search distances (e.g., for 1-NN). In all other cases, linear

scan requires orders of magnitude fewer operations. With any code length longer than 64 bits,

a single hash table approach is completely infeasible to run, requiring upwards of 15 orders of

magnitude more operations than linear scan for 128-bit codes.


Figure 4.9: The number of lookup operations required to solve exact nearest neighbor search in Hamming space for LSH codes from SIFT features, using the simple single hash table (SHT) approach and multi-index hashing (MIH). Also shown is the number of candidate check operations required to search using linear scan. Note that the axes have a logarithmic scale. With small codes (64 bits), many items (1 billion), and a small search distance (1-NN), it is conceivable that a single hash table might be faster than linear scan. In all other cases, a single hash table requires many orders of magnitude more operations than linear scan. Note also that MIH will never require more operations than a single hash table; in the limit of very large dataset sizes, MIH will use only one hash table and becomes equivalent.

4.4.5 Substring Optimization

The substring hash tables used above have been formed by simply dividing the full codes into

disjoint and consecutive sequences of bits. For LSH and MLH, this is equivalent to randomly

assigning bits to substrings.

It is natural to ask whether further gains in efficiency are possible by optimizing the assignment

of bits to substrings. In particular, by careful substring optimization one may be able to

maximize the discriminability of the different substrings. In other words, while the substring search radius, and hence the number of lookups, is determined by the desired search radius on the full codes and remains fixed, by optimizing the assignment of bits to substrings one might be able to reduce the number of candidates that need to be validated.

To explore this idea, we considered a simple method in which bits are assigned to substrings

one by one in a greedy fashion, based on the correlation between bits. In particular, of those

bits not yet assigned, we assign a bit to the next substring that minimizes the maximum

correlation between that bit and all other bits already in that substring. Initialization also occurs in a greedy manner: a random bit is assigned to the first substring, and the first bit assigned to substring j is the bit maximally correlated with the first bit of substring j − 1. This

approach significantly decreases the correlation between bits within a single substring, which

should make the distribution of codes within substring buckets more uniform, and hence lower the number of candidates within a given search radius. Arguably an even better approach would

be to maximize the entropy of the entries within each substring hash table, thereby making


            optimized speedup vs. linear scan (consecutive, % improvement)
  # bits    1-NN              10-NN             100-NN            1000-NN
  64        788 (781, 1%)     750 (698, 7%)     570 (547, 4%)     317 (306, 4%)
  128       826 (747, 10%)    472 (426, 11%)    237 (208, 14%)    103 (91, 12%)
  256       284 (220, 29%)    138 (111, 25%)    68 (58, 18%)      31 (27, 18%)

Table 4.3: Empirical run-time improvements from optimizing substrings vs. consecutive substrings, for 1 billion LSH codes from SIFT features (AMD machine). Speedup factors vs. linear scan are shown with optimized and consecutive substrings, and the percent improvement. All experiments used 10M codes to compute the correlation between bits for substring optimization, and all results are averaged over 10000 queries each.


Figure 4.10: Run-times for multi-index hashing using codes from LSH on SIFT features with consecutive (solid) and optimized (dashed) substrings. From left to right: 64-bit, 128-bit, and 256-bit codes, run on the AMD machine.

the distribution of substrings as uniform as possible. This entropic approach is, however, much

more complicated and is left to future work.
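To make the greedy procedure concrete, the following is a minimal C++ sketch of the correlation-based assignment described above. It assumes a precomputed matrix of absolute pairwise bit correlations (e.g., estimated from a sample of codes); all names are illustrative and not taken from our released implementation.

    #include <algorithm>
    #include <vector>

    // corr[a][b]: absolute correlation between bits a and b, estimated offline.
    // Returns subs[j], the bits assigned to substring j (q bits, m substrings).
    std::vector<std::vector<int>> assign_bits(
        const std::vector<std::vector<double>>& corr, int m) {
      const int q = (int)corr.size();
      std::vector<std::vector<int>> subs(m);
      std::vector<bool> used(q, false);

      // Greedy initialization: seed substring 0 with an arbitrary bit, then seed
      // substring j with the unused bit most correlated with substring j-1's seed.
      subs[0].push_back(0);
      used[0] = true;
      for (int j = 1; j < m; ++j) {
        int prev = subs[j - 1][0], best = -1;
        for (int b = 0; b < q; ++b)
          if (!used[b] && (best < 0 || corr[prev][b] > corr[prev][best])) best = b;
        subs[j].push_back(best);
        used[best] = true;
      }

      // Round-robin fill: give each substring the unused bit whose maximum
      // correlation with the bits already in that substring is smallest.
      for (int assigned = m, j = 0; assigned < q; j = (j + 1) % m) {
        int best = -1;
        double best_score = 0.0;
        for (int b = 0; b < q; ++b) {
          if (used[b]) continue;
          double worst = 0.0;
          for (int a : subs[j]) worst = std::max(worst, corr[a][b]);
          if (best < 0 || worst < best_score) { best_score = worst; best = b; }
        }
        subs[j].push_back(best);
        used[best] = true;
        ++assigned;
      }
      return subs;
    }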

The results obtained with the correlation-based greedy algorithm show that optimizing

substrings can provide overall run-time reductions on the order of 20% against consecutive sub-

strings for some cases. Table 4.3 displays the improvements achieved by optimizing substrings

for different codes lengths and different values of k. Fig. 4.10 shows the run-time performance

of optimized substrings.

4.5 Implementation details

Our implementation of multi-index hashing is publicly available at [MIH]. Nevertheless, for the

interested reader we describe some of the important details here.

As explained above, the algorithm hinges on hash tables built on disjoint s-bit substrings

of the binary codes. We use direct address tables for the substring hash tables because the

substrings are usually short (s ≤ 32). Direct address tables explicitly allocate memory for 2^s

buckets and store all data points associated with each substring in its corresponding bucket.

There is a one-to-one mapping between buckets and substrings, so no time is spent on collision


detection.

One could implement direct address tables with an array of 2^s pointers, some of which may

be null (for empty buckets). On a 64-bit machine, pointers are 8 bytes long, so just storing

an empty address table for s = 32 requires 32 GB (as done in [NPF12]). For greater efficiency

here, we use sparse direct address tables by grouping buckets into sets of 32 elements. For each

bucket group, a 32-bit binary vector encodes whether each bucket in the group is empty or

not. Then, a single pointer per group is used to point to a resizable array that stores the data

points associated with that bucket group. Data points within each array are ordered by their

bucket index. To facilitate fast access, for each non-empty bucket we store the index of the

beginning and the end of the corresponding segment of the resizable array. Compared to the

direct address tables in [NPF12], for s = 32, and bucket groups of size 32, an empty address

table requires only 1.5 GB. Also note that accessing elements in any bucket of the sparse address

table is slightly more expensive than a non-sparse address table, but still O(1).
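The following is a minimal C++ sketch of one way to organize such a sparse direct-address table, with 32-bucket groups, an occupancy bitmap per group, and segment start offsets (here the end of each segment is inferred from the next start). Field and type names are illustrative, not those of the released code.

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct BucketGroup {
      uint32_t occupied = 0;          // bit i set iff bucket i of this group is non-empty
      std::vector<uint32_t> ids;      // data-point IDs, ordered by bucket index
      std::vector<uint32_t> starts;   // segment start, one entry per non-empty bucket
    };

    struct SparseTable {
      std::vector<BucketGroup> groups;                       // 2^(s-5) groups
      explicit SparseTable(int s) : groups(size_t(1) << (s - 5)) {}

      // Return [begin, end) over the IDs stored under a given s-bit substring.
      std::pair<const uint32_t*, const uint32_t*> lookup(uint32_t substring) const {
        const BucketGroup& g = groups[substring >> 5];
        const uint32_t bit = substring & 31;
        if (!((g.occupied >> bit) & 1u)) return {nullptr, nullptr};  // empty bucket
        // Rank of this bucket among the group's non-empty buckets.
        const uint32_t rank = __builtin_popcount(g.occupied & ((1u << bit) - 1));
        const uint32_t lo = g.starts[rank];
        const uint32_t hi = (rank + 1 < g.starts.size())
                                ? g.starts[rank + 1]
                                : (uint32_t)g.ids.size();
        return {g.ids.data() + lo, g.ids.data() + hi};
      }
    };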

Memory Requirements: We store one 64-bit pointer for each bucket group, and a 32-bit binary

vector to encode whether buckets in a group are empty; this entails 2^(s−5) · (8 + 4) bytes for an

empty s-bit hash table (s ≥ 5), or 1.5 GB when s = 32. Bookkeeping for each resizable array

entails three 32-bit integers. In our experiments, most bucket groups have at least one non-empty

bucket. Taking this into account, the total storage for an s-bit address table becomes 2^(s−5) · 24

bytes (3 GB for s = 32).

For each non-empty bucket within a bucket group, we store a 32-bit integer to indicate the

index of the beginning of the segment of the resizable array corresponding to that bucket. The

number of non-empty buckets is at most m · min(n, 2^s), where m is the number of hash tables,

and n is the number of codes. Thus we need an extra m · min(n, 2^s) · 4 bytes. For each data point

per hash table we store an ID to reference the full binary code; each ID is 4 bytes since n ≤ 2^32

for our datasets. This entails 4mn bytes. Finally, storing the full binary codes themselves

requires nms/8 bytes, since q = ms.

The total memory cost is m · 2^(s−5) · 24 + m · min(n, 2^s) · 4 + 4mn + nms/8 bytes. For s = log2 n,

this cost is O(nq). For 1B 64-bit codes, and m = 2 hash tables (32 bits each), the cost is 28 GB.

For 128-bit and 256-bit codes our implementation requires 57 GB and 113 GB. Note that the

last two terms in the memory cost for storing IDs and codes are irreducible, but the first terms

can be reduced in a more memory efficient implementation.
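As a sanity check on these numbers, the small program below evaluates the memory-cost formula for the parameter settings used in our experiments; it is an illustrative calculation only.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>

    // Evaluates m*2^(s-5)*24 + m*min(n, 2^s)*4 + 4*m*n + n*m*s/8 bytes, in GB.
    double memory_gb(uint64_t n, int q, int m) {
      const int s = q / m;                                   // substring length in bits
      const double tables = double(m) * double(1ull << (s - 5)) * 24.0;
      const double starts = double(m) * std::min<double>(double(n), double(1ull << s)) * 4.0;
      const double ids    = 4.0 * m * n;
      const double codes  = double(n) * m * s / 8.0;         // q = m*s bits per code
      return (tables + starts + ids + codes) / double(1ull << 30);
    }

    int main() {
      const uint64_t n = 1000000000ull;                      // 1B codes
      std::printf("64-bit,  m=2: %.0f GB\n", memory_gb(n, 64, 2));    // ~28 GB
      std::printf("128-bit, m=4: %.0f GB\n", memory_gb(n, 128, 4));   // ~57 GB
      std::printf("256-bit, m=8: %.0f GB\n", memory_gb(n, 256, 8));   // ~113 GB
    }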

Duplicate Candidates: When retrieving candidates from the m substring hash tables, some

codes will be found multiple times. To detect duplicates, and discard them, we allocate one

bit-string with n bits. When a candidate is found we check the corresponding bit and discard

the candidate if it is marked as a duplicate. Before each query we initialize the bit-string to

zero. In practice this has a negligible effect on run-time. In theory, clearing an n-bit vector requires O(n) time,

but there are more efficient ways to store an n-bit vector without explicit initialization.
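A minimal sketch of this duplicate check, with one bit per database ID, might look as follows; it is illustrative only.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct DuplicateFilter {
      std::vector<uint64_t> seen;                           // one bit per database ID
      explicit DuplicateFilter(uint64_t n) : seen((n + 63) / 64, 0) {}
      void clear() { std::fill(seen.begin(), seen.end(), 0); }   // before each query
      // Returns true the first time an ID is offered, false for duplicates.
      bool mark(uint64_t id) {
        uint64_t& word = seen[id >> 6];
        const uint64_t bit = 1ull << (id & 63);
        if (word & bit) return false;
        word |= bit;
        return true;
      }
    };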

Hamming Distance: To compare a query and a candidate (for multi-index search or linear

scan), we compute the Hamming distance on the full q-bit codes, with one xor operation for


n               10^4   10^5   10^6   2×10^6  5×10^6  10^7   2×10^7  5×10^7  10^8   2×10^8  5×10^8  10^9

q = 64    m     5      4      4      3       3       3      3       2       2      2       2       2
     q/log2 n   4.82   3.85   3.21   3.06    2.88    2.75   2.64    2.50    2.41   2.32    2.21    2.14

q = 128   m     10     8      8      6       6       5      5       5       5      4       4       4
     q/log2 n   9.63   7.71   6.42   6.12    5.75    5.50   5.28    5.00    4.82   4.64    4.43    4.28

q = 256   m     19     15     13     12      11      11     10      10      10     9       9       8
     q/log2 n   19.27  15.41  12.84  12.23   11.50   11.01  10.56   10.01   9.63   9.28    8.86    8.56

Table 4.4: Selected number of substrings used for the experiments, as determined by cross-validation, vs. the suggested number of substrings based on the heuristic q / log2 n.

every 64 bits, followed by a pop count to tally the ones. We used the built-in GCC function

__builtin_popcount for this purpose.
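For concreteness, a sketch of this distance computation is shown below, here using the 64-bit builtin (__builtin_popcountll) and assuming the codes are stored as arrays of 64-bit words.

    #include <cstdint>

    // Hamming distance over q = 64*nwords bits: one xor and one popcount per word.
    inline int hamming_distance(const uint64_t* a, const uint64_t* b, int nwords) {
      int dist = 0;
      for (int w = 0; w < nwords; ++w)
        dist += __builtin_popcountll(a[w] ^ b[w]);   // count differing bits in this word
      return dist;
    }
    // e.g., for 128-bit codes: hamming_distance(query, candidate, 2);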

Number of Substrings: The number of substring hash tables we use is determined with a

hold-out validation set of database entries. From that set we estimate the running time of the

algorithm for different choices of m near q / log2 n, and select the m that yields the minimum

run-time. As shown in Table 4.4 this empirical value for m is usually the closest integer to

q / log2 n.
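As a trivial sketch, the heuristic amounts to the following; in practice we still time a few neighboring values of m on the validation set and keep the fastest.

    #include <cmath>

    // Suggested number of substrings: the integer nearest to q / log2(n).
    int suggested_num_substrings(int q, double n) {
      return (int)std::lround(q / std::log2(n));
    }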

Translation Lookaside Buffer and Huge Pages: Modern processors have an on-chip cache that

holds a lookup table of memory addresses, for mapping virtual addresses to physical addresses

for each running process. Typically, memory is split into 4KB pages, and a process that allocates

memory is given pages by the operating system. The Translation Lookaside Buffer (TLB) keeps

track of these pages. For processes that have large memory footprints (tens of GB), the number

of pages quickly overtakes the size of the TLB (typically about 1500 entries). For processes

using random memory access this means that almost every memory access produces a TLB

miss - the requested address is in a page not cached in the TLB, hence the TLB entry must be

fetched from slow RAM before the requested page can be accessed. This slows down memory

access, and causes volatility in run-times for memory-access intensive processes.

To avoid this problem, we use the libhugetlbfs Linux library. This allows the operating

system to allocate Huge Pages (2MB each) rather than 4KB pages. This reduces the number

of pages; hence it reduces the frequency of TLB misses, improves memory access speed, and

reduces run-time volatility. The increase in speed of the multi-index hashing results reported here compared to those in [NPF12] is attributed to the use of libhugetlbfs.

4.6 Conclusion

This chapter describes a new algorithm for exact nearest neighbor search on large-scale datasets

of binary codes. The algorithm is a form of multi-index hashing that has provably sub-linear

run-time behavior for uniformly distributed codes. It is storage efficient and easy to implement.

We show empirical performance on datasets of binary codes obtained from 1 billion SIFT, and

80 million Gist features. With these datasets we find that, for 64-bit and 128-bit codes, our new


multi-index hashing implementation is often more than two orders of magnitude faster than a

linear scan baseline.

While the basic algorithm is developed in this chapter, there are several interesting avenues for future research. For example, our preliminary research shows that log2 n is a good choice for the substring length, and it should be possible to formulate a sound mathematical basis for this choice. The assignment of bits to substrings was shown to be important above; however, the algorithm used for this assignment is clearly suboptimal. It is also likely that different substring lengths might be useful for different hash tables.

Our theoretical analysis proves sub-linear run-time behavior of the multi-index hashing algorithm on uniformly distributed codes when the search radius is small. Our experiments demonstrate sub-linear run-time behavior of the algorithm on real datasets, while the binary codes in our experiments are clearly not uniformly distributed3. Bridging the gap between theoretical

analysis and empirical findings for the proposed algorithm remains an open problem. In par-

ticular, we are interested in more realistic assumptions on the binary codes, which still allow

for theoretical analysis of the algorithm.

While this chapter concerns exact nearest-neighbor tasks, it would also be interesting

to consider approximate methods based on the same multi-index hashing framework. Indeed,

there are several ways that one could find approximate rather than the exact nearest neighbors

for a given query. For example, one could stop at a given radius of search, even though kNN

items may not have been found. Alternatively, one might search until a fixed number of unique

candidates have been found, even though all substring hash tables have not been inspected to

the necessary radius. Such approximate algorithms have the potential for even greater efficiency,

and would be the most natural methods to compare to existing approximate methods, such as

Hamming distance LSH. That said, such comparisons are more difficult than for exact methods

since one must take into account not only the storage and run-time costs, but also some measure

of the cost of errors (usually in terms of recall and precision).

Finally, recent results have shown that for many datasets in which the binary codes are

the result of some form of vector quantization, an asymmetric Hamming distance is attractive

[GP11, JDS11]. In such methods, rather than converting the query into a binary code, one

directly compares a real-valued query to the database of binary codes. The advantage is that

the quantization noise entailed in converting the query to a binary string is avoided. Thus, one

can use a more accurate distance in the binary code space to approximate the desired distance

in the original feature space. One simple way to approach asymmetric Hamming distance is

to use multi-index hashing with Hamming distance, and then only use an asymmetric distance

when culling candidates. The development of more interesting and effective methods is another

promising avenue for future work.

3In some of our experiments with 1 billion binary codes, tens of thousands of codes fall into the same bucket of 32-bit substring hash tables. This is extremely unlikely with uniformly distributed codes.

Chapter 5

Cartesian k-means

Techniques for vector quantization, like the well-known k-means algorithm, are used widely in

vision and learning. Common applications include codebook learning for object recognition

[SZ03], approximate nearest neighbor search (NNS) for information retrieval [NS06, PCI+07],

and feature compression to handle very large datasets [PLSP10].

In general terms, quantization techniques partition an input vector space into multiple

regions {S_i}_{i=1}^{k}, and map points in each region into region-specific representatives {c_i}_{i=1}^{k},

known as centers. As such, a quantizer q(x) applied to a data point x, can be expressed as

    q(x) = ∑_{i=1}^{k} 1(x ∈ S_i) c_i ,    (5.1)

where 1(·) is the usual indicator function.

The quality of a quantizer is expressed in terms of expected distortion, a common measure

of which is squared error ‖x − q(x)‖^2_2. In this case, given centers {c_i}, the region to which a

point is assigned with minimal distortion is obtained by Euclidean NNS. The k-means algorithm

can be used to learn k centers from data.

To reduce expected distortion, crucial for many applications, one can shrink region volumes

by increasing k, the number of regions. In practice, however, increasing k results in prohibitive

storage and run-time costs. Even if one resorts to approximate k-means with approximate

NNS [PCI+07] or hierarchical k-means [NS06], it is hard to scale to large k (e.g., k = 2^64) as

storing the centers is untenable.

This chapter concerns methods for scalable quantization with tractable storage and run-

time costs. Inspired by Product Quantization (PQ), a state-of-the-art algorithm for approxi-

mate NNS with high-dimensional data (e.g., [JDS11]), compositionality is one of the key ideas.

By expressing data in terms of recurring, reusable parts, the representational capacity of com-

positional models grows exponentially in the number of parameters. Compression techniques

like JPEG accomplish this by encoding images as disjoint rectangular patches. PQ divides the

feature space into disjoint subspaces that are quantized independently. Other examples include



part-based recognition models (e.g., [TMF07]), and tensor-based models for style-content sep-

aration (e.g., [TF00]). Here, with a compositional parametrization of region centers, we find a

family of models that reduce the encoding cost of k centers down from k to between log2 k and √k. A model parameter controls the trade-off between model fidelity and compactness.

We formulate two related algorithms, Orthogonal k-means (ok-means) and Cartesian k-

means (ck-means). They are natural extensions of k-means, and are closely related to other

hashing and quantization methods. The ok-means algorithm is a generalization of the Iterative

Quantization (ITQ) algorithm for finding similarity preserving binary codes [GL11]. The ck-

means model is an extension of ok-means, and can be viewed as a generalization of PQ. A

similar generalization of PQ, called optimized product quantization, was developed concurrently

by Ge, He, Ke and Sun, and also appeared in CVPR 2013 [GHKS13].

We evaluate ok-means and ck-means experimentally on large-scale approximate NNS tasks,

and on codebook learning for recognition. For NNS we use datasets of 1M GIST and 1B

SIFT features, with both symmetric and asymmetric distance measures on the latent codes.

We consider codebook learning for a generic form of recognition on CIFAR-10. In all cases,

ck-means delivers impressive performance.

5.1 k-means

Given a dataset of n p-dimensional points, D ≡ {x_j}_{j=1}^{n}, the k-means algorithm partitions the n points into k clusters, and represents each cluster by a center point. Let C ∈ R^{p×k} be a matrix whose columns comprise the k cluster centers, i.e., C = [c_1, c_2, · · · , c_k]. The k-means

objective is to minimize the within-cluster squared distances:

    ℓ_{k-means}(C) = ∑_{x∈D} min_i ‖x − c_i‖^2_2
                   = ∑_{x∈D} min_{b∈H_{1/k}} ‖x − Cb‖^2_2    (5.2)

where H_{1/k} ≡ {b | b ∈ {0, 1}^k and ‖b‖_2 = 1}, i.e., b is a binary vector comprising a 1-of-

k encoding. Lloyd’s k-means algorithm [Llo82] finds a local minimum of (5.2) by iterative,

alternating optimization with respect to C and the b’s.

The k-means model is simple and intuitive, using NNS to assign points to centers. The

assignment of points to centers can be represented with a log k-bit index per data point. The

cost of storing the centers grows linearly with k.

5.2 Orthogonal k-means with 2^m centers

With a compositional model one can represent cluster centers more efficiently. One such

approach is to reconstruct each input with an additive combination of the columns of C.


To this end, instead of the 1-of-k encoding in (5.2), we let b be a general m-bit vector,

b ∈ H′_m ≡ {0, 1}^m, and C ∈ R^{p×m}. As such, each cluster center is the sum of a subset of the columns of C. There are 2^m possible subsets, and therefore k = 2^m centers. While the

number of parameters in the quantizer is linear in m, the number of centers increases exponen-

tially.

While efficient in representing cluster centers, the approach is problematic, because solving

    b = argmin_{b∈H′_m} ‖x − Cb‖^2_2 ,    (5.3)

is intractable; i.e., the discrete optimization is not submodular. Obviously, for small 2^m one

could generate all possible centers and then perform NNS to find the optimal solution, but this

would not scale well to large values of m.

One key observation is that if the columns of C are orthogonal, then optimization (5.3)

becomes tractable. To explain this, without loss of generality, assume the bits belong to {−1, 1} instead of {0, 1}, i.e., b′ ∈ H_m ≡ {−1, 1}^m. Then,

    ‖x − Cb′‖^2_2 = x^T x + b′^T C^T C b′ − 2 x^T C b′ .    (5.4)

For diagonal C^T C, the middle quadratic term on the RHS becomes trace(C^T C), independent

of b′. As a consequence, when C has orthogonal columns, one can easily see that,

    argmin_{b′∈H_m} ‖x − Cb′‖^2_2 = sign(C^T x) ,    (5.5)

where sign(·) is the element-wise sign function.

To reduce quantization error further we can also introduce an offset, denoted µ ∈ Rp, to

translate x. Taken together with (5.5), this leads to the loss function for the model we call

orthogonal k-means (ok-means):1

    ℓ_{ok-means}(C, µ) = ∑_{x∈D} min_{b′∈H_m} ‖x − µ − Cb′‖^2_2 .    (5.6)

Clearly, with a change of variables, b′ = 2b− 1, we can define new versions of µ and C, with

identical loss, for which the unknowns are binary, but in {0, 1}^m.

The ok-means quantizer encodes each data point as a vertex of a transformed m-dimensional

unit hypercube. The transform, via C and µ, maps the hypercube vertices onto the input feature

space, ideally as close as possible to the data points. The matrix C has orthogonal columns

and can therefore be expressed in terms of rotation and scaling; i.e., C ≡ RD, where R ∈ R^{p×m} has orthonormal columns (R^T R = I_m), and D ∈ R^{m×m} is diagonal and positive definite. The

goal of learning is to find the parameters, R, D, and µ, which minimize quantization error.

Fig. 5.1 depicts a small set of 2D data points (red x’s) and two possible quantizations.

1While closely related to ITQ, we use the term ok-means to emphasize the relationship to k-means.



Figure 5.1: Two quantizations of 2D data (red ×'s) by ok-means. Cluster centers are depicted by dots, and cluster boundaries by dashed lines. (Left) Clusters formed by a 2-cube with no rotation, scaling, or translation; centers = {b′ | b′ ∈ H_2}. (Right) Rotation, scaling and translation are used to reduce distances between data points and cluster centers; centers = {µ + RDb′ | b′ ∈ H_2}.

Fig. 5.1 (left) depicts the vertices of a 2-cube with C = I_2 and zero translation. The cluster

regions are simply the four quadrants of the 2D space. The distances between data points and

cluster centers, i.e., the quantization errors, are relatively large. By comparison, Fig. 5.1 (right)

shows how a transformed 2-cube, the full model, can greatly reduce quantization errors.

5.2.1 Learning ok-means

To derive the learning algorithm for ok-means we first re-write the objective in matrix terms.

Given n data points, let X = [x_1, x_2, · · · , x_n] ∈ R^{p×n}. Let B ∈ {−1, 1}^{m×n} denote the cor-

responding cluster assignment coefficients. Our goal is to find the assignment coefficients B

and the transformation parameters, namely, the rotation R, scaling D, and translation µ, that

minimize

    ℓ_{ok-means}(B, R, D, µ) = ‖X − µ1^T − RDB‖^2_f    (5.7)
                             = ‖X′ − RDB‖^2_f    (5.8)

where ‖·‖_f denotes the Frobenius norm, X′ ≡ X − µ1^T, R is constrained to have orthonormal columns (R^T R = I_m), and D is a positive diagonal matrix.

Like k-means, coordinate descent is effective for optimizing (5.8). We first initialize µ and

R, and then iteratively optimize ℓ_{ok-means} with respect to B, D, R, and µ:

• Optimize B and D: With straightforward algebraic manipulation of (5.8), one can show

that

    ℓ_{ok-means} = ‖R^T X′ − DB‖^2_f + ‖R⊥^T X′‖^2_f ,    (5.9)


where columns of R⊥ span the orthogonal complement of the column-space of R (i.e., the

block matrix [R R⊥] is orthogonal).

It follows that, given X ′ and R, we can optimize the first term in (5.9) to solve for B and

D. Here, DB is the least-squares approximation to R^T X′, where R^T X′ and DB are m×n. Further, the ith row of DB can only contain elements from {−d_i, +d_i}, where d_i = D_ii. Wherever the corresponding element of R^T X′ is positive (negative) we should put a positive (negative) value in DB. The optimal d_i is determined by the mean absolute value of the elements in the ith row of R^T X′:

    B ← sign(R^T X′)    (5.10)
    D ← Diag(mean_row(abs(R^T X′)))    (5.11)

• Optimize R: Using the objective (5.8), find the matrix R that minimizes ‖X′ − RA‖^2_f, subject to R^T R = I_m, and A ≡ DB. This is equivalent to an Orthogonal Procrustes problem [Sch66],

and can be solved exactly using SVD. In particular, by adding p − m rows of zeros to the

bottom of D, DB becomes p × n. Then R is square and orthogonal and can be estimated

with SVD. But since DB is degenerate we are only interested in the first m columns of R.

The remaining columns of R are unique only up to rotation of the null-space.

• Optimize µ: Given R, B and D, the optimal µ is given by the column average of X−RDB:

    µ ← mean_row(X − RDB)    (5.12)
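For concreteness, the following is a minimal sketch of one such coordinate-descent iteration, written here with the Eigen library (an assumption for illustration); it omits initialization and convergence checks and is not the implementation used in our experiments.

    #include <Eigen/Dense>

    // One ok-means coordinate-descent step, following (5.10)-(5.12) and the
    // Procrustes update for R. X is p x n; R is p x m with orthonormal columns;
    // d holds the diagonal of D; mu is the p-dimensional offset.
    void okmeans_step(const Eigen::MatrixXd& X, Eigen::MatrixXd& R,
                      Eigen::VectorXd& d, Eigen::VectorXd& mu) {
      const Eigen::MatrixXd Xc = X.colwise() - mu;        // X' = X - mu 1^T
      const Eigen::MatrixXd P  = R.transpose() * Xc;      // m x n projections

      // (5.10)-(5.11): B = sign(R^T X'), d_i = mean |i-th row of R^T X'|.
      const Eigen::MatrixXd B =
          P.unaryExpr([](double v) { return v >= 0.0 ? 1.0 : -1.0; });
      d = P.cwiseAbs().rowwise().mean();

      // Procrustes step: R = U V^T from the SVD of X' (DB)^T.
      const Eigen::MatrixXd A = d.asDiagonal() * B;        // DB, m x n
      Eigen::JacobiSVD<Eigen::MatrixXd> svd(
          Xc * A.transpose(), Eigen::ComputeThinU | Eigen::ComputeThinV);
      R = svd.matrixU() * svd.matrixV().transpose();

      // (5.12): mu is the column average of X - RDB.
      mu = (X - R * A).rowwise().mean();
    }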

5.2.2 Distance estimation for approximate nearest neighbor search

One application of scalable quantization is distance estimation for approximate NNS. Before

introducing more advanced quantization techniques, we describe some experimental results

with ok-means on Euclidean approximate NNS. The benefits of ok-means for approximate NNS are twofold. First, storage costs for the centers are reduced to O(log k), from O(k) with k-means. Second, substantial speedups are possible by exploiting fast methods for NNS on binary codes

in Hamming space (e.g., Chapter 4).

Generally, in terms of a generic quantizer q(·), there are two natural ways to estimate the

distance between two vectors, z and x [JDS11]. Using the Symmetric quantizer distance (SQD)

‖z−x‖ is approximated by ‖q(z)−q(x)‖. Using the Asymmetric quantizer distance (AQD),

only one of the two vectors is quantized, and ‖z−x‖ is estimated as ‖z−q(x)‖. While SQD

might be slightly faster to compute, AQD incurs lower quantization errors, and hence is more

accurate.

For approximate NNS, in a pre-processing stage, each database entry, x, is encoded into a

binary vector corresponding to the cluster center index to which x is assigned. At test time,

the queries may or may not be encoded into indices, depending on whether one uses SQD or

AQD. The main benefit is that, using just the quantized dataset entries, one can obtain a good


approximation of distances.

In the ok-means model, the quantization of an input x is straightforwardly shown to be

    q_ok(x) = µ + RD sign(R^T(x − µ)) .    (5.13)

The corresponding m-bit cluster index is sign(R^T(x − µ)). Given two indices, b_1, b_2 ∈ H_m,

the symmetric ok-means quantizer distance is,

    SQD_ok(b_1, b_2) = ‖µ + RDb_1 − µ − RDb_2‖^2_2    (5.14)
                     = ‖D(b_1 − b_2)‖^2_2 .    (5.15)

In effect, SQDok is a weighted Hamming distance. It is the sum of the squared diagonal entries

of D corresponding to bits where b1 and b2 differ. Interestingly, in our experiments with ok-

means, Hamming and weighted Hamming distances yield similar results. Thus, in ok-means

experiments we simply report results for Hamming distance, to facilitate comparisons with other

hashing techniques. When the scaling in ok-means is constrained to be isotropic (i.e., D = αI_m for α ∈ R^+), then SQD_ok becomes a constant multiple of the usual Hamming distance. As

discussed in Section 5.4, this isotropic ok-means is almost identical to ITQ [GL11].

The ok-means model defines the AQD between a feature vector z and a cluster index b as

    AQD_ok(z, b) = ‖z − µ − RDb‖^2_2    (5.16)
                 = ‖R^T z′ − Db‖^2_2 + ‖R⊥^T z′‖^2_2 ,    (5.17)

where z′ = z−µ. For approximate NNS, in comparing distances from query z to a dataset of

binary indices, the second term on the RHS of (5.17) is irrelevant, since it is a constant, which

does not depend on b. Without this term, AQDok becomes a form of asymmetric Hamming

(AH) distance between R^T z′ and b. While previous work [GP11] discussed various ad hoc AH distance measures for ITQ and binary hashing in general, here we derive the optimal AH distance for ok-means and ITQ for Euclidean distance estimation.
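A minimal sketch of ok-means encoding and this asymmetric distance, again written with Eigen for illustration and with the constant out-of-subspace term dropped, is given below; names are illustrative.

    #include <Eigen/Dense>

    // Encode: b = sign(R^T (x - mu)), the m-bit ok-means index of x (entries in {-1,+1}).
    Eigen::VectorXd ok_encode(const Eigen::VectorXd& x, const Eigen::MatrixXd& R,
                              const Eigen::VectorXd& mu) {
      return (R.transpose() * (x - mu))
          .unaryExpr([](double v) { return v >= 0.0 ? 1.0 : -1.0; });
    }

    // Asymmetric distance of (5.17) without the constant out-of-subspace term:
    // || R^T (z - mu) - D b ||^2, which preserves the ranking of database items.
    double ok_aqd(const Eigen::VectorXd& z, const Eigen::VectorXd& b,
                  const Eigen::MatrixXd& R, const Eigen::VectorXd& d,
                  const Eigen::VectorXd& mu) {
      return (R.transpose() * (z - mu) - d.cwiseProduct(b)).squaredNorm();
    }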

5.2.3 Experiments with ok-means

Following [JDS11], we report approximate NNS results on 1M SIFT, a corpus of 128D SIFT

descriptors with disjoint sets of 100K training, 1M base, and 10K test vectors. The training

set is used to train models. The base set is the database, and the test points are queries. The

number of bits, m, is typically less than p, but no pre-processing is needed for dimensionality

reduction. To initialize optimization, R is chosen as a random rotation of the first m principal

directions of the training data, and µ is chosen as the mean of the data.

For each query, we find R nearest neighbors, and compute Recall@R, the fraction of queries

for which the ground-truth 1st Euclidean nearest neighbor is found in the R retrieved items.

Fig. 5.2 shows the Recall@R plots for ok-means with Hamming (H) ≈ SQDok and asymmetric



Figure 5.2: Euclidean approximate NNS results for different methods and distance functions on the 1M SIFT dataset.

Hamming (AH) ≡ AQDok distance, vs. ITQ [GL11] and PQ [JDS11]. The PQ method exploits

a more complex asymmetric distance function denoted AD ≡ AQDck (defined in Section 5.5.1).

Note first that ok-means improves upon ITQ, with both Hamming and asymmetric Hamming

distances. This is due to the non-uniform scaling parameters (i.e., D) in ok-means. Thus, if one

is interested in efficient Hamming distance retrieval, ok-means is preferred over ITQ. Clearly,

better results are obtained with the asymmetric distance functions.

Fig. 5.2 also shows that PQ achieves superior recall rates. This stems from its joint encoding

of multiple feature dimensions. In ok-means, each bit represents a partitioning of the feature

space into two clusters, separated by a hyperplane. The intersection of m orthogonal hyper-

planes yields 2m regions. Hence, we obtain just two clusters per dimension, and each dimension

is encoded independently. In PQ, by comparison, multiple dimensions are encoded jointly, with

arbitrary numbers of clusters. PQ thereby captures statistical dependencies among different

dimensions. We next extend ok-means to jointly encode multiple dimensions as well.

5.3 Cartesian k-means

In the Cartesian k-means (ck-means) model, each cluster center is expressed parametrically as

an additive combination of multiple subcenters. Let there be m sets of subcenters, each with h

elements.2 Let C^(i) be a matrix whose columns comprise the elements of the ith subcenter set; C^(i) ∈ R^{p×h}. Finally, assume that each cluster center, c, is the sum of exactly one element from

each subcenter set:

    c = ∑_{i=1}^{m} C^(i) b^(i) ,    (5.18)

where b^(i) ∈ H_{1/h} is a 1-of-h encoding.

2While here we assume a fixed cardinality for all subcenter sets, the model easily allows sets with different cardinalities.



Figure 5.3: Depiction of Cartesian quantization on 4D data, with the first and last two dimensions sub-quantized on the left and right. The Cartesian k-means quantizer, denoted q_ck, combines the sub-quantizations in the two subspaces and produces a 4D reconstruction.

As a concrete example (see Fig. 5.3), suppose we are given 4D inputs, x ∈ R^4, and we split each datum into m = 2 parts:

    u^(1) = [I_2  0] x ,   and   u^(2) = [0  I_2] x .    (5.19)

Then, suppose we quantize each part, u^(1) and u^(2), separately. As depicted in Fig. 5.3 (left and middle), we could use h = 5 subcenters for each one. Placing the corresponding subcenters in the columns of 4×5 matrices C^(1) and C^(2),

    C^(1) = [ d_1 d_2 d_3 d_4 d_5 ; 0_{2×5} ] ,    C^(2) = [ 0_{2×5} ; d′_1 d′_2 d′_3 d′_4 d′_5 ] ,

we obtain a model (5.18) that provides 5^2 = 25 possible centers with which to quantize the data. More generally, the total number of model centers is k = h^m. Each center is a member of the Cartesian product of the subcenter sets, hence the name Cartesian k-means. Importantly, while the number of centers is h^m, the number of subcenters is only mh. The model provides

a super-linear number of centers with a linear number of parameters.

The learning objective for Cartesian k-means is

    ℓ_{ck-means}(C) = ∑_{x∈D} min_{{b^(i)}_{i=1}^{m}} ‖ x − ∑_{i=1}^{m} C^(i) b^(i) ‖^2_2    (5.20)

where b^(i) ∈ H_{1/h}, and C ≡ [C^(1), · · · , C^(m)] ∈ R^{p×mh}. If we let b^T ≡ [b^(1)T, · · · , b^(m)T], then

the second sum in (5.20) can be expressed succinctly as Cb.

The key problem with this formulation is that the minimization of (5.20) with respect to

the b(i)’s is intractable. Nevertheless, motivated by orthogonal k-means (Section 5.2), encoding

can be shown to be both efficient and exact if we impose orthogonality constraints on the sets

of subcenters. To that end, assume that all subcenters in different sets are pairwise orthogonal;


i.e.,

    ∀ i, j | i ≠ j :   C^(i)T C^(j) = 0_{h×h} .    (5.21)

In other words, each subcenter matrix C^(i) spans a linear subspace of R^p, and the linear subspaces for different subcenter sets do not intersect. Hence, (5.21) implies that only the subcenters in C^(i) can explain the projection of x onto the C^(i) subspace.

In the example depicted in Fig. 5.3, the input features are simply partitioned according

to (5.19), and the subspaces clearly satisfy the orthogonality constraints. It is also clear that

C ≡ [C^(1) C^(2)] is block diagonal, with 2×5 blocks, denoted D^(1) and D^(2). The quantization

error therefore becomes

    ‖x − Cb‖^2_2 = ‖ [u^(1); u^(2)] − [D^(1) 0; 0 D^(2)] [b^(1); b^(2)] ‖^2
                 = ‖u^(1) − D^(1) b^(1)‖^2 + ‖u^(2) − D^(2) b^(2)‖^2 .

In words, the squared quantization error is the sum of the squared errors on the subspaces.

One can therefore solve for the binary coefficients of the subcenter sets independently.

In the general case, assuming that pairwise orthogonality constraints in (5.21) are satisfied,

C can be expressed as a product RD, where R has orthonormal columns, and D is block

diagonal; i.e., C = RD where

    R = [R^(1), · · · , R^(m)] ,   and   D = blockdiag(D^(1), D^(2), . . . , D^(m)) ,    (5.22)

and hence C^(i) = R^(i) D^(i). With s_i ≡ rank(C^(i)), it follows that R^(i) ∈ R^{p×s_i} and D^(i) ∈ R^{s_i×h}. Clearly, ∑ s_i ≤ p, because of the orthogonality constraints.

Replacing C^(i) with R^(i) D^(i) in the RHS of (5.20), we find

    ‖x − Cb‖^2_2 = ∑_{i=1}^{m} ‖u^(i) − D^(i) b^(i)‖^2_2 + ‖R⊥^T x‖^2_2 ,    (5.23)

where u^(i) ≡ R^(i)T x, and R⊥ is the orthogonal complement of R. This shows that, with or-

thogonality constraints (5.21), the ck-means encoding problem can be split into m independent

sub-encoding problems, one for each subcenter set.

To find the b^(i) that minimizes ‖u^(i) − D^(i) b^(i)‖^2_2, we perform NNS with u^(i) against the h s_i-dimensional vectors in D^(i). This entails a cost of O(hs_i). Fortunately, all the elements of b can be found very efficiently, in O(∑ hs_i) ⊆ O(hs), where s ≡ ∑ s_i. If we also include the cost of rotating x to obtain each u^(i), the total encoding cost is O(ps + hs) ⊆ O(p^2 + hp). Alternatively, one could perform NNS with the p-dimensional C^(i)'s to find the b^(i)'s, which costs O(mhp). Table 5.1


method      #centers   #bits     cost                       cost (rank s)
ok-means    2^m        m         O(mp)                      O(mp)
ck-means    h^m        m log h   O(p^2 + hp) or O(mhp)      O(ps + hs) or O(ps + mhs)
k-means     k          log k     O(kp)                      O(ps + ks)

Table 5.1: A summary of ok-means, ck-means, and k-means in terms of the number of centers, the number of bits needed for indices (i.e., log #centers), and the storage cost of the representation, which is the same as the encoding cost to convert inputs to indices. The last column shows the costs given an assumption that C has a rank of s ≥ m.

summarizes the quantization models in terms of their number of centers, index size, and cost

of storage and encoding.
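For concreteness, the following is a minimal sketch of this encoding procedure under the orthogonality constraints, again written with Eigen for illustration; the R^(i) blocks and subcenter matrices are passed explicitly, and names are illustrative rather than those of the released ckmeans code.

    #include <Eigen/Dense>
    #include <limits>
    #include <vector>

    // ck-means encoding: project x onto each subspace and pick the nearest
    // subcenter independently (m small NNS problems of size h). Rblock[i] is
    // the p x s_i block R^(i); D[i] is the s_i x h matrix of subcenters.
    std::vector<int> ck_encode(const Eigen::VectorXd& x,
                               const std::vector<Eigen::MatrixXd>& Rblock,
                               const std::vector<Eigen::MatrixXd>& D) {
      const int m = (int)D.size();
      std::vector<int> b(m);
      for (int i = 0; i < m; ++i) {
        const Eigen::VectorXd u = Rblock[i].transpose() * x;   // u^(i) = R^(i)T x
        int best = 0;
        double best_dist = std::numeric_limits<double>::max();
        for (int j = 0; j < D[i].cols(); ++j) {                 // nearest of h subcenters
          const double dist = (D[i].col(j) - u).squaredNorm();
          if (dist < best_dist) { best_dist = dist; best = j; }
        }
        b[i] = best;
      }
      return b;
    }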

5.3.1 Learning ck-means

We can re-write the ck-means objective (5.20) in matrix form with the Frobenius norm; i.e.,

    ℓ_{ck-means}(R, D, B) = ‖X − RDB‖^2_f    (5.24)

where the columns of X and B comprise the data points and the subcenter assignment coef-

ficients. The input to the learning algorithm is the training data X, the number of subcenter

sets m, the cardinality of the subcenter sets h, and an upper bound on the rank of C, i.e., s. In

practice, we also let each rotation matrix R^(i) have the same number of columns, i.e., s_i = s/m. The outputs are the matrices {R^(i)}_{i=1}^{m} and {D^(i)}_{i=1}^{m} that provide a local minimum of (5.24).

Learning begins with the initialization of R and D, followed by iterative coordinate descent

in B, D, and R:

• Optimize B and D: With R fixed, the objective is given by (5.23), where ‖R⊥^T X‖^2_f is constant. Given data projections U^(i) ≡ R^(i)T X, to optimize for B and D we perform one step of k-means for each subcenter set:

  – Assignment: Perform NNS for each subcenter set to find the assignment coefficients, B^(i),

        B^(i) ← argmin_{B^(i)} ‖U^(i) − D^(i) B^(i)‖^2_f

  – Update:

        D^(i) ← argmin_{D^(i)} ‖U^(i) − D^(i) B^(i)‖^2_f

• Optimize R: Placing the D^(i)'s along the diagonal of D, as in (5.22), and concatenating the B^(i)'s as rows of B, i.e., B^T = [B^(1)T, . . . , B^(m)T], the optimization of R reduces to the orthogonal Procrustes problem:

        R ← argmin_R ‖X − RDB‖^2_f .


In experiments below, R ∈ R^{p×p}, and rank(C) ≤ p is unconstrained. For high-dimensional data where rank(X) ≪ p, for efficiency it may be useful to constrain rank(C). One can also retain

a low-dimensional subspace using PCA.

5.4 Relations and related work

As mentioned above, there are close mathematical relationships between ok-means, ck-means,

ITQ for binary hashing [GL11], and PQ for vector quantization [JDS11]. It is instructive to

specify these relationships in some detail.

5.4.1 Iterative Quantization vs. Orthogonal k-means

ITQ [GL11] is a variant of locality-sensitive hashing, mapping data to binary codes for fast

retrieval. To extract m-bit codes, ITQ first zero-centers the data matrix to obtain X ′. PCA is

then used for dimensionality reduction, from p down to m dimensions, after which the subspace

representation is randomly rotated. The composition of PCA and the random rotation can be

expressed as W^T X′ where W ∈ R^{p×m}. ITQ then solves for the m×m rotation matrix, R, that

minimizes

    ℓ_{ITQ}(B, R) = ‖R^T W^T X′ − B‖^2_f ,   s.t.   R^T R = I_{m×m} ,    (5.25)

where B ∈ {−1, +1}^{n×m}.

ITQ rotates the subspace representation of the data to match the binary codes, and then

minimizes quantization error within the subspace. By contrast, ok-means maps the binary

codes into the original input space, and then considers both the quantization error within the

subspace and the out-of-subspace projection error. A key difference is the inclusion of ‖R⊥^T X′‖^2_f in the ok-means objective (5.9). This is important since one can often reduce quantization errors

by projecting out significant portions of the feature space.

Another key difference between ITQ and ok-means is the inclusion of non-uniform scaling

in ok-means. This is important when the data are not isotropic, and contributes to the marked

improvement in recall rates in Fig. 5.2.

5.4.2 Orthogonal k-means vs. Cartesian k-means

We next show that ok-means is a special case of ck-means with h = 2, where each subcenter set

has two elements. To this end, let C^(i) = [c^(i)_1  c^(i)_2], and let b^(i) = [b^(i)_1  b^(i)_2]^T be the 2-dimensional indicator vector that selects the ith subcenter. Since b^(i) is a 1-of-2 encoding ([0 1]^T or [1 0]^T), it

follows that:

    b^(i)_1 c^(i)_1 + b^(i)_2 c^(i)_2 = (c^(i)_1 + c^(i)_2)/2 + b′_i (c^(i)_1 − c^(i)_2)/2 ,    (5.26)


where b′_i ≡ b^(i)_1 − b^(i)_2 ∈ {−1, +1}. With the following setting of the ok-means parameters,

    µ = ∑_{i=1}^{m} (c^(i)_1 + c^(i)_2)/2 ,   and   C = [ (c^(1)_1 − c^(1)_2)/2 , . . . , (c^(m)_1 − c^(m)_2)/2 ] ,

it should be clear that ∑_i C^(i) b^(i) = µ + Cb′, where b′ ∈ {−1, +1}^m, and b′_i is the ith bit of b′,

used in (5.26). Similarly, one can also map ok-means parameters onto corresponding subcenters

for ck-means.

Thus, there is a 1-to-1 mapping between the parametrization of cluster centers in ok-means

and ck-means for h = 2. The benefits of ok-means are its small number of parameters, and its

intuitive formulation. The benefit of the ck-means generalization is its joint encoding of multiple

dimensions with an arbitrary number of centers, allowing it to capture intrinsic dependence

among data dimensions.

5.4.3 Product Quantization vs. Cartesian k-means

PQ first splits the input vectors into m disjoint sub-vectors, and then quantizes each sub-vector

separately using a k-means sub-quantizer. Thus PQ is a special case of ck-means where the

rotation R is not optimized; rather, R is fixed in both learning and retrieval. This is impor-

tant because the sub-quantizers act independently, thereby ignoring intra-subspace statistical

dependence. Thus the selection of subspaces is critical for PQ [JDS11, JDSP10]. Jegou et

al. [JDSP10] suggest using PCA, followed by random rotation, before applying PQ to high-

dimensional vectors. The idea of finding a rotation that minimizes quantization error was

mentioned in [JDSP10], but it was considered too difficult to estimate.

Here we show that one can find a rotation to minimize the quantization error. The ck-means

learning algorithm optimizes sub-quantizers in an inner loop, but they interact in an outer loop

that optimizes the rotation (Section 5.3.1).

5.5 Experiments

5.5.1 Euclidean distance estimation for approximate NNS

Euclidean distance estimation for approximate NNS is a useful task for comparing quantiza-

tion techniques. Given a database of ck-means indices, and a query, we use Asymmetric and

Symmetric ck-means quantizer distance, denoted AQDck and SQDck. The AQDck between a

query z and a binary index b ≡ [b^(1)T, . . . , b^(m)T]^T, derived in (5.23), is

    AQD_ck(z, b) = ∑_{i=1}^{m} ‖u^(i) − D^(i) b^(i)‖^2_2 + ‖R⊥^T z‖^2_2 .    (5.27)

Here, ‖u^(i) − D^(i) b^(i)‖^2_2 is the distance between the ith projection of z, i.e., u^(i), and a subcenter projection from D^(i) selected by b^(i). Given a query, these distance values for each i, and all


h possible values of b^(i) can be pre-computed and stored in a query-specific h × m lookup

table. Once created, the lookup table is used to compare all database points to the query.

So computing AQDck entails m lookups and additions, somewhat more costly than Hamming

distance, but still fairly efficient. The last term on the RHS of (5.27) is irrelevant for NNS.

Since PQ is a special case of ck-means with pre-defined subspaces, the same distance estimates

are used for PQ (c.f., [JDS11]).
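A minimal sketch of this lookup-table evaluation is given below, assuming h = 256 so that each subcenter index fits in one byte; names are illustrative.

    #include <cstdint>
    #include <vector>

    // lut is m x h: row i holds the precomputed ||u^(i) - D^(i) e_j||^2 for all
    // h subcenters of set i, computed once per query. Each database code stores
    // one byte per subcenter index, so scoring it costs m lookups and additions.
    double aqd_from_lut(const std::vector<std::vector<double>>& lut,
                        const uint8_t* code, int m) {
      double dist = 0.0;
      for (int i = 0; i < m; ++i) dist += lut[i][code[i]];
      return dist;
    }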

The SQDck between binary codes b1 and b2 is given by

    SQD_ck(b_1, b_2) = ∑_{i=1}^{m} ‖D^(i) b^(i)_1 − D^(i) b^(i)_2‖^2_2 .    (5.28)

Since b^(i)_1 and b^(i)_2 are 1-of-h encodings, an m × h × h lookup table can be created to

store all pairwise sub-distances. While the cost of computing SQDck is the same as AQDck,

SQDck could also be used to estimate the distance between the indexed database entries, for

diversifying the retrieval results, or to detect near duplicate elements.

Datasets. We use the 1M SIFT dataset, as in Section 5.2.3, along with two others, namely,

1M GIST (960D features) and 1B SIFT, both comprising disjoint sets of training, base and

test vectors. 1M GIST has 500K training, 1M base, and 1K query vectors. 1B SIFT has 100M

training, 1B base, and 10K query points. In each case, recall rates are averaged over queries in

test set for a database populated from the base set. For expedience, we use only the first 1M

training points for the 1B SIFT experiments.

Parameters. In NNS experiments below, for both ck-means and PQ, we use m=8 and h=256.

Hence the number of clusters is k = 256^8 = 2^64, so 64 bits are used as database indices. Using

h = 256 is particularly attractive because the resulting lookup tables are small, encoding is

fast, and each subcenter index fits into one byte. As h increases we expect retrieval results to

improve, but encoding and indexing of a query become slower.

Initialization. To initialize the D^(i)'s for learning, as in k-means, we simply begin with random samples from the set of U^(i)'s (see Section 5.3.1). To initialize R we consider the

different methods that Jegou et al. [JDS11] proposed for splitting the feature dimensions into

m sub-vectors for PQ: (1) natural: sub-vectors comprise consecutive dimensions, (2) structured:

dimensions with the same index modulo 8 are grouped, and (3) random: random permutations

are used.

For PQ in the experiments below, we use the orderings that produced the best results

in [JDS11], namely, the structured ordering for 960D GIST, and the natural ordering for 128D

SIFT. For learning ck-means, R is initialized to the identity with SIFT corpora. For 1M GIST,

where the PQ ordering has a significant impact, we consider all three orderings to initialize R.

Results. Fig. 5.4 shows Recall@R plots for ck-means and PQ [JDS11] with symmetric and

asymmetric distances (SD ≡ SQDck and AD ≡ AQDck) on the 3 datasets. The horizontal axis

represents the number of retrieved items, R, on a log-scale. The vertical axis shows Recall@R.



Figure 5.4: Euclidean nearest neighbor recall@R (number of items retrieved) based on different quantizers and corresponding distance functions on the 1M SIFT, 1M GIST, and 1B SIFT datasets. The dashed curves use symmetric distance. (AH ≡ AQD_ok, SD ≡ SQD_ck, AD ≡ AQD_ck)

The results consistently favor ck-means. On the high-dimensional GIST data, ck-means with

AD significantly outperforms other methods; even ck-means with SD performs on par with PQ

with AD. On 1M SIFT, the Recall@10 numbers for PQ and ck-means, both using AD, are

59.9% and 63.7%. On 1B SIFT, Recall@100 numbers are 56.5% and 64.9%. As expected, with

increasing dataset size, the differences between methods become more significant.

In 1B SIFT, each real-valued feature vector is 128 bytes, hence a total of 119 GB. Using any

method in Fig. 5.4 (including ck-means) to index the database into 64 bits, this storage cost

reduces to only 7.5 GB. This allows one to work with much larger datasets. In the experiments

we use linear scan to find the nearest items according to quantizer distances. For NNS using 10K

SIFT queries on 1B SIFT this takes about 8 hours for AD and AH and 4 hours for Hamming

distance on a 2×4-core computer. Search can be sped up significantly by using a coarse initial quantization and an inverted file structure for AD and AH, as suggested by [JDS11, BL12], and by using the multi-index hashing method of [NPF12] for Hamming distance. In the experiments

we did not implement these efficiencies as we focus primarily on the quality of quantization for



Figure 5.5: PQ and ck-means results using natural (1), structured (2), and random (3) ordering to define the (initial) subspaces.


Figure 5.6: PQ and ck-means results using different numbers of bits for encoding. In all cases asymmetric distance is used.

distance estimation.

Fig. 5.5 compares ck-means to PQ when R in ck-means is initialized using the 3 orderings

of [JDS11]. It shows that ck-means is superior in all cases. Also of interest, it shows that, despite the non-convexity of the optimization objective, ck-means learning tends to find

similarly good encodings under different initial conditions. Finally, Fig. 5.6 compares ck-means

to PQ with different numbers of centers on GIST.

5.5.2 Learning visual codebooks

While the task of distance estimation for NNS requires too many clusters for k-means, it is

interesting to compare k-means and ck-means on a task with a moderate number of clusters.

To this end, we consider codebook learning for bag-of-words models [CDF+04, LSP06]. We use

ck-means with m=2 and h=√k, and hence k centers. The main advantage of ck-means here is

that finding the closest cluster center is done in O(√k) time, much faster than standard NNS


Codebook                        Accuracy
PQ (k = 40^2)                   75.9%
ck-means (k = 40^2)             78.2%
k-means (k = 1600) [CLN11]      77.9%
PQ (k = 64^2)                   78.2%
ck-means (k = 64^2)             79.7%
k-means (k = 4000) [CLN11]      79.6%

Table 5.2: Recognition accuracy on the CIFAR-10 test set using different codebook learning algorithms.

with k-means in O(k). Alternatives for k-means, to improve efficiency, include approximate

k-means [PCI+07], and hierarchical k-means [NS06]. Here, we only compare to exact k-means.

CIFAR-10 [Kri09] comprises 50K training and 10K test images (32 × 32 pixels). Each image

is one of 10 classes (airplane, bird, car, cat, deer, dog, frog, horse, ship, and truck). We use

publicly available code from Coates et al. [CLN11], with changes to the codebook learning

and cluster assignment modules. Codebooks are built on 6 × 6 whitened color image patches.

One histogram is created per image quadrant, and a linear SVM is applied to 4k-dimensional

feature vectors.

Recognition accuracy rates on the test set for different models and k are given in Table 5.2.

Despite having fewer parameters, ck-means performs on par or better than k-means. This is

consistent for different initializations of the algorithms. Although k-means has higher fidelity than ck-means, with fewer parameters ck-means may be less susceptible to overfitting. Table 5.2 also compares with the approach of [WWX12], where PQ without learning a rotation is

used for clustering features into codewords. As expected, learning the rotation has a significant

impact on recognition rates, outperforming different initializations of PQ.

5.6 More recent quantization techniques

Some recent quantization techniques [BL14, MHL14a, BL15] have questioned the necessity of imposing orthogonality constraints on the center matrix in models such as ok-means and ck-means.

Such techniques relax the orthogonality constraints and aim to address the center assignment

problem via approximate inference to solve,

    enc(x) = argmin_{{b^(i)}_{i=1}^{m}} ‖ x − ∑_{i=1}^{m} C^(i) b^(i) ‖^2_2 ,    (5.29)

where b^(i) ∈ H_{1/h} is a 1-of-h encoding. Additive quantization [BL14] suggests using beam search to solve (5.29), while stacked quantization [MHL14a] suggests learning the sub-centers such that the b^(i)'s are amenable to greedy optimization, in which b^(1), . . . , b^(m) are estimated sequentially.

Both of these techniques are interesting and they achieve some improvements in quantization


error over ck-means, but at the cost of more expensive encoding algorithms.

Given a query vector z and the quantization of a dataset point x as b, one can still estimate

Euclidean distance by

    AQD(z, b) = ‖ z − ∑_{i=1}^{m} C^(i) b^(i) ‖^2_2    (5.30)
              = ‖z‖^2_2 − 2 ∑_{i=1}^{m} z^T C^(i) b^(i) + ‖ ∑_{i=1}^{m} C^(i) b^(i) ‖^2_2 .    (5.31)

Even though the last term on the RHS of (5.31) is not easy to compute, as the subcenter matrices are no longer orthogonal, one can replace this term with the original ℓ_2 norm of the dataset point x, i.e., ‖x‖^2_2. Thus, if for each data point x one stores b = enc(x) and ‖x‖^2_2, one can still approximate Euclidean distance efficiently, using query-specific lookup tables to cache z^T C^(i) b^(i) for different i and b^(i).

Furthermore, for many applications one is interested in approximating cosine similarity or the dot product between vectors, in which case one does not even need to store ‖x‖^2_2, as this second-order statistic is irrelevant. For cosine similarity, one can first normalize the vectors to unit length, and then use quantization techniques to estimate z^T x using the terms z^T C^(i) b^(i). Thus, an important application of vector quantization (with orthogonal or non-orthogonal centers) is to approximate dot products [VZ12, SF13], e.g., to enable fast evaluation of classifiers when

there are millions of categories.

5.7 Summary

In this chapter, we present the Cartesian k-means model, a generalization of k-means with

a parametrization of the cluster centers such that the number of centers is super-linear in the

number of parameters. The method is also shown to be a generalization of the ITQ algorithm

and Product Quantization. In experiments on large-scale retrieval and codebook learning for

recognition the results are impressive, outperforming product quantization by a significant

margin. An implementation of the method is available at https://github.com/norouzi/ckmeans.

Chapter 6

Conclusion

This thesis develops several algorithms, in various flavors, for large-scale similarity search. Com-

mon threads among these algorithms include space partitioning, subspace projection, and com-

pact discrete codes for memory efficient representations of large datasets. For building real-

world similarity search systems, one can make use of the techniques developed in Chapters 2

and 3 to learn mappings from data points to binary codes, followed by the use of the multi-index

hashing of Chapter 4 to perform efficient NNS on binary codes. As an alternative, one can

make use of any Euclidean distance metric learning algorithm to learn a semantic Euclidean

embedding of the data, followed by scalable, compositional k-means models of Chapter 5 for

fast Euclidean NNS. We believe that both of these approaches to scalable similarity search are

applicable to big data engineering applications.

We introduce a general framework for learning binary hash functions that preserve similarity

structure of data in a compact way, in Chapters 2 and 3. Our formulation establishes a link

between latent structured prediction and hash function learning. This leads to the development

of a piecewise smooth upper bound on hashing empirical loss. We develop two exact and efficient

loss-augmented inference algorithms for pairwise hinge loss and triplet ranking loss functions.

Strong retrieval results are reported on multiple benchmarks, along with promising classification

results using no more than k-nearest neighbor on the binary codes.

Convolutional neural networks have achieved great success in many vision applications in-

cluding image classification [KSH12]. The Hamming distance metric learning (HDML) frame-

work of Chapter 3 can easily accommodate hash functions based on convolutional neural net-

works. One interesting direction for further research involves applying HDML with convolu-

tional nets to a large-scale classification task with thousands of class labels.

The established link between latent structured prediction and hash function learning can

inspire drawing similar connections between latent structured prediction and other machine

learning problems with latent structures. For example, in a recent work, we showed that there

is a similar link between latent structured prediction and learning decision trees [NCFK15,

NCJ+15]. We showed that by adopting a convex-concave upper bound on empirical loss one can

optimize decision trees in a non-greedy fashion. We believe there exist many other problems in



machine learning that can benefit from the upper bound approach inspired by latent structural

SVMs [YJ09], and advocated in Chapters 2 and 3.

We introduce Multi-Index Hashing (MIH), a method for building multiple hash tables on

binary code substrings to enable exact k-nearest neighbor search in Hamming distance, in Chapter 4. The approach is based on the pigeonhole principle, is simple, and is easy to implement. We

present theoretical analysis on uniformly distributed codes, in addition to promising empirical

results on non-uniformly distributed codes that include dramatic speedups over a linear scan

baseline.

In addition to research directions outlined in Section 4.6, one interesting avenue for future

research involves adopting the multi-index hashing algorithm for more general distance mea-

sures, such as asymmetric Hamming and quantization based distances discussed in Chapter 5.

One can extend the use of the pigeonhole principle, and potentially exploit priority queues to

search within code substrings to enable fast NNS based on more general distance functions. A

very recent paper by Matsui et al. [MYA15] has pursued this research direction and presented

promising results.

We develop new models related to k-means with a compositional parametrization of cluster

centers in Chapter 5. In such models, the number of effective quantization regions increases

super-linearly in the number of parameters. This allows one to efficiently quantize data us-

ing billions or trillions of centers. Two compositional models are presented called Orthogonal

k-means (ok-means) and Cartesian k-means (ck-means), which generalize previous work on

quantization such as Iterative Quantization [GL11] and Product Quantization [JDS11]. The

models are tested on large-scale Euclidean NNS tasks and show great success.

There is a connection between k-means and mixture of Gaussians for density estimation.

An interesting research direction is the development of a density model based on Orthogonal

and Cartesian k-means. In such density models the covariance matrix is factored into a shared

rotation matrix and a mixture of interchangeable block diagonal covariance matrices. We find

that maximum likelihood estimation of such models is challenging, as there is no counterpart of the orthogonal Procrustes problem in the probabilistic setting. That said, gradient descent

methods on orthogonal matrices can be used to optimize such models.

Another interesting direction concerns the formulation of ok-means and ck-means mixture

models. Such models make use of multiple rotation matrices, each of which is assigned to a

subset of data points. At training and test time, one considers all of the rotation matrices

and their associated cluster centers and picks the rotation matrix and the cluster center that

minimize quantization error. The main research question is whether such mixture models will

lead to a sufficient reduction in quantization error to justify the increase in encoding and storage

cost.

Since the development of Cartesian k-means, more recent quantization techniques such

as [BL14, KA14, ZDW14, MHL14b, BL15, ZQTW15] have been proposed that improve quan-

tization error by relaxing the orthogonality constraints imposed on the cluster centers by the


ck-means model. Some of these models are slower at encoding time and require a larger number

of parameters in the representation, but they generally provide more accurate quantization

results.

For all of our large-scale distance estimation and NNS experiments we used 1M SIFT, 1M

GIST, and 1B SIFT datasets created by [JDS11, JTDA11]. While these datasets are large and

useful, we believe that conclusions drawn based only on SIFT and GIST may be limited in their

applicability to a broader range of applications. We think that research in the field of large-scale

similarity search would benefit from new standard large-scale benchmarks based on different

types of feature descriptors such as Fisher descriptors [PD07] and Convnet features [RASC14].

We focus on designing machine learning tools that optimize different forms of space partitioning under different objectives, all of which map high-dimensional data to compact discrete codes. We also develop data structures that facilitate fast nearest neighbor search on discrete codes. We hope that the tools and techniques developed in this thesis will constitute a step toward the use of web-scale datasets in computer vision and machine learning.

Bibliography

[AI06] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approxi-

mate nearest neighbor in high dimensions. In FOCS, 2006.

[AI08] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest

neighbor in high dimensions. Communications of the ACM, 51(1):117–122, 2008.

[AMP11] M. Aly, M. Munich, and P. Perona. Distributed kd-trees for retrieval from very

large image collections. In BMVC, 2011.

[And09] Alexandr Andoni. Nearest neighbor search: the old, the new, and the impossible.

PhD thesis, MIT, 2009.

[AOV12] A. Alahi, R. Ortiz, and P. Vandergheynst. FREAK: Fast retina keypoint. In CVPR, 2012.

[Bat89] R. Battiti. Accelerated backpropagation learning: Two optimization methods.

Complex Systems, 1989.

[Ben75] Jon Louis Bentley. Multidimensional binary search trees used for associative search-

ing. Communications of the ACM, 18(9), 1975.

[BHS13] Aurelien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning

for feature vectors and structured data. arXiv:1306.6709, 2013.

[BL12] A. Babenko and V. Lempitsky. The inverted multi-index. In CVPR, 2012.

[BL14] Artem Babenko and Victor Lempitsky. Additive quantization for extreme vector

compression. In CVPR, 2014.

[BL15] Artem Babenko and Victor Lempitsky. Tree quantization for large-scale similarity

search and classification. In CVPR, 2015.

[Bro97] Andrei Z Broder. On the resemblance and containment of documents. In Compres-

sion and Complexity of Sequences 1997, 1997.

[BTF11a] A. Bergamo, L. Torresani, and A. Fitzgibbon. Picodes: Learning a compact code

for novel-category recognition. In NIPS, 2011.


[BTF11b] Alessandro Bergamo, Lorenzo Torresani, and Andrew Fitzgibbon. Picodes: Learn-

ing a compact code for novel-category recognition. In NIPS, 2011.

[CDF+04] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization

with bags of keypoints. In Workshop on statistical learning in computer vision,

ECCV, 2004.

[Cha02] M.S. Charikar. Similarity estimation techniques from rounding algorithms. In

STOC, 2002.

[CLN11] A. Coates, H. Lee, and A.Y. Ng. An analysis of single-layer networks in unsupervised

feature learning. In AISTATS, 2011.

[CLSF10] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary robust independent elementary features. In ECCV, 2010.

[CLVZ11] Ken Chatfield, Victor S Lempitsky, Andrea Vedaldi, and Andrew Zisserman. The

devil is in the details: an evaluation of recent feature encoding methods. In BMVC,

2011.

[Col02] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.

[CPT04] Antonio Criminisi, Patrick Perez, and Kentaro Toyama. Region filling and object

removal by exemplar-based image inpainting. IEEE Trans. Image Processing, 2004.

[CSSB10] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of

image similarity through ranking. JMLR, 2010.

[DCL08] W. Dong, M. Charikar, and K. Li. Asymmetric distance estimation with sketches

for similarity search in high-dimensional spaces. In SIGIR, 2008.

[DIIM04] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-

sensitive hashing scheme based on p-stable distributions. In SoCG, 2004.

[DKJ+07] J.V. Davis, B. Kulis, P. Jain, S. Sra, and I.S. Dhillon. Information-theoretic metric

learning. In ICML, 2007.

[DS02] D. Decoste and B. Scholkopf. Training invariant support vector machines. Machine

Learning, 2002.

[FG06] J. Flum and M. Grohe. Parameterized Complexity Theory. Springer Press, 2006.

[FJP02] William T Freeman, Thouis R Jones, and Egon C Pasztor. Example-based super-

resolution. IEEE CG&A, 22, 2002.


[FSSM07] A. Frome, Y. Singer, F. Sha, and J. Malik. Learning globally-consistent local

distance functions for shape-based image retrieval and classification. In ICCV,

2007.

[GHKS13] Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization

for approximate nearest neighbor search. In CVPR, 2013.

[GIM99] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via

hashing. In VLDB, 1999.

[GKRL13] Yunchao Gong, Sudhakar Kumar, Henry Rowley, and Svetlana Lazebnik. Learning

binary codes for high-dimensional data using bilinear projections. In CVPR, 2013.

[GL11] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learn-

ing binary codes. In CVPR, 2011.

[GP11] A. Gordo and F. Perronnin. Asymmetric distances for binary embeddings. In

CVPR, 2011.

[GPY94] D. Greene, M. Parnas, and F. Yao. Multi-index hashing for information retrieval.

In FOCS, 1994.

[GRHS04] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood com-

ponents analysis. In NIPS, 2004.

[HE07] James Hays and Alexei A. Efros. Scene completion using millions of photographs.

Proc. SIGGRAPH, 2007.

[HRCB11] J. He, R. Radhakrishnan, S.-F. Chang, and C. Bauer. Compact hashing with joint

optimization of search accuracy and time. In CVPR, 2011.

[HS06] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with

neural networks. Science, 2006.

[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the

curse of dimensionality. In STOC, 1998.

[JDS08] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric

consistency for large scale image search. In ECCV, 2008.

[JDS11] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor

search. IEEE Trans. PAMI, 2011.

[JDSP10] H. Jegou, M. Douze, C. Schmid, and P. Perez. Aggregating local descriptors into a

compact image representation. In CVPR, 2010.


[JTDA11] H. Jegou, R. Tavenard, M. Douze, and L. Amsaleg. Searching in one billion vectors:

re-rank with source coding. In ICASSP, 2011.

[KA14] Yannis Kalantidis and Yannis Avrithis. Locally optimized product quantization for

approximate nearest neighbor search. In CVPR, 2014.

[KD09] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings.

In NIPS, 2009.

[KG09] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image

search. In ICCV, 2009.

[KGF12] D. Kuettel, M. Guillaumin, and V. Ferrari. Segmentation propagation in imagenet.

In ECCV, 2012.

[KPCB14] KPCB. 2014 Internet Trends. http://www.kpcb.com/blog/2014-internet-trends, 2014.

[Kri09] A Krizhevsky. Learning multiple layers of features from tiny images. Master’s

thesis, University of Toronto, 2009.

[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification

with deep convolutional neural networks. In NIPS, 2012.

[Llo82] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 1982.

[Low04] David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV,

2004.

[LSP06] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid

matching for recognizing natural scene categories. In CVPR, 2006.

[LUC] LUC. https://github.com/apache/lucene-solr/.

[LWJ+12] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Supervised

hashing with kernels. In CVPR, 2012.

[MHK10] D. McAllester, T. Hazan, and J. Keshet. Direct Loss Minimization for Structured

Prediction. In ICML, 2010.

[MHL14a] Julieta Martinez, Holger H Hoos, and James J Little. Stacked quantizers for com-

positional vector compression. arXiv:1411.2173, 2014.

[MHL14b] Julieta Martinez, Holger H Hoos, and James J Little. Stacked quantizers for com-

positional vector compression. arXiv:1411.2173, 2014.

[MIH] MIH. https://github.com/norouzi/mih/.


[ML09] M. Muja and D. Lowe. Fast approximate nearest neighbors with automatic algo-

rithm configuration. In International Conference on Computer Vision Theory and

Applications, 2009.

[ML14] Marius Muja and David G. Lowe. Scalable nearest neighbor algorithms for high

dimensional data. IEEE Trans. PAMI, 2014.

[MNI] MNIST. http://yann.lecun.com/exdb/mnist/.

[MP69] M. Minsky and S. Papert. Perceptrons. MIT Press, 1969.

[MYA15] Yusuke Matsui, Toshihiko Yamasaki, and Kiyoharu Aizawa. PQTable: Fast exact asymmetric distance neighbor search for product quantization using hash tables. In ICCV, 2015.

[NCFK15] Mohammad Norouzi, Maxwell D Collins, David J Fleet, and Pushmeet Kohli.

Co2 forest: Improved random forest by continuous optimization of oblique splits.

arXiv:1506.06155, 2015.

[NCJ+15] Mohammad Norouzi, Maxwell D Collins, Matthew Johnson, David J Fleet, and

Pushmeet Kohli. Efficient non-greedy optimization of decision trees. In NIPS,

2015.

[NF11] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In

ICML, 2011.

[NF13] M. Norouzi and D. J. Fleet. Cartesian k-means. In CVPR, 2013.

[NFS12] M. Norouzi, D. J. Fleet, and R. Salakhutdinov. Hamming Distance Metric Learning.

In NIPS, 2012.

[NPF12] M. Norouzi, A. Punjani, and D.J. Fleet. Fast search in hamming space with multi-

index hashing. In CVPR, 2012.

[NPF14] Mohammad Norouzi, Ali Punjani, and David J Fleet. Fast exact search in hamming

space with multi-index hashing. IEEE Trans. PAMI, 2014.

[NS06] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR,

2006.

[OT01] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation

of the spatial envelope. IJCV, 2001.

[PCI+07] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with

large vocabularies and fast spatial matching. In CVPR, 2007.


[PD07] Florent Perronnin and Christopher Dance. Fisher kernels on visual vocabularies for

image categorization. In CVPR, 2007.

[PLSP10] Florent Perronnin, Yan Liu, Jorge Sanchez, and Herve Poirier. Large-scale image

retrieval with compressed fisher vectors. In CVPR, 2010.

[RASC14] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In CVPR Workshops, 2014.

[RFF12] Mohammad Rastegari, Ali Farhadi, and David Forsyth. Attribute discovery via

predictable discriminative binary codes. In ECCV, 2012.

[RHW86] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representa-

tions by error propagation. MIT Press, 1986.

[RKKI15] Mohammad Rastegari, Cem Keskin, Pushmeet Kohli, and Shahram Izadi. Compu-

tationally bounded retrieval. In CVPR, 2015.

[RL09] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant

kernels. In NIPS, 2009.

[Sam06] Hanan Samet. Foundations of multidimensional and metric data structures. Morgan

Kaufmann, 2006.

[SBBF12] C. Strecha, A. Bronstein, M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. IEEE Trans. PAMI, 34, 2012.

[Sch66] P.H. Schonemann. A generalized solution of the Orthogonal Procrustes problem.

Psychometrika, 31, 1966.

[SF13] Mohammad Amin Sadeghi and David Forsyth. Fast template evaluation with vector

quantization. In NIPS, 2013.

[SH07] R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving

class neighbourhood structure. In AISTATS, 2007.

[SH09] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Ap-

proximate Reasoning, 2009.

[SJ04] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative

comparisons. In NIPS, 2004.

[SSP03] P.Y. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, 2003.


[SSS06] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring

photo collections in 3d. In Proc. SIGGRAPH, 2006.

[SSS08] Noah Snavely, Steven M Seitz, and Richard Szeliski. Modeling the world from

internet photo collections. IJCV, 80, 2008.

[SSSN04] S. Shalev-Shwartz, Y. Singer, and A.Y. Ng. Online and batch learning of pseudo-

metrics. In ICML, 2004.

[SVD03] G. Shakhnarovich, P. A. Viola, and T. Darrell. Fast pose estimation with parameter-

sensitive hashing. In ICCV, 2003.

[SZ03] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object

matching in videos. In ICCV, 2003.

[TCFL12] Tomasz Trzcinski, Christos Marios Christoudias, Pascal Fua, and Vincent Lepetit.

Boosting binary keypoint descriptors. In NIPS, 2012.

[TF00] J.B. Tenenbaum and W.T. Freeman. Separating style and content with bilinear

models. Neural Comp., 2000.

[TFF08] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: A large data

set for nonparametric object and scene recognition. IEEE Trans. PAMI, 2008.

[TFW08] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for

recognition. In CVPR, 2008.

[TGK03] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS,

2003.

[THJA04] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine

learning for interdependent and structured output spaces. In ICML, 2004.

[TMF07] A. Torralba, K.P. Murphy, and W.T. Freeman. Sharing visual features for multiclass

and multiview object detection. IEEE Trans. PAMI, 2007.

[VZ12] Andrea Vedaldi and Andrew Zisserman. Sparse kernel approximations for efficient

classification and detection. In CVPR, 2012.

[WBS06] K.Q. Weinberger, J. Blitzer, and L.K. Saul. Distance metric learning for large

margin nearest neighbor classification. In NIPS, 2006.

[WKC10] J. Wang, S. Kumar, and S.F. Chang. Sequential Projection Learning for Hashing

with Compact Codes. In ICML, 2010.

[WTF08] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.


[WWX12] S. Wei, X. Wu, and D. Xu. Partitioned k-means clustering for fast construction of

unbiased visual vocabulary. The Era of Interactive Media, 2012.

[XNJR02] E.P. Xing, A.Y. Ng, M.I. Jordan, and S. Russell. Distance metric learning, with

application to clustering with side-information. In NIPS, 2002.

[YJ09] C. N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In

ICML, 2009.

[YKGC14] Felix X Yu, Sanjiv Kumar, Yunchao Gong, and Shih-Fu Chang. Circulant binary

embedding. In ICML, 2014.

[YR03] A.L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Comp., 15,

2003.

[ZDW14] Ting Zhang, Chao Du, and Jingdong Wang. Composite quantization for approxi-

mate nearest neighbor search. In ICML, 2014.

[ZM06] Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM

computing surveys (CSUR), 2006.

[ZQTW15] Ting Zhang, Guo-Jun Qi, Jinhui Tang, and Jingdong Wang. Sparse composite

quantization. In CVPR, 2015.