7
Deep Semantic-Preserving and Ranking-Based Hashing for Image Retrieval Ting Yao , Fuchen Long , Tao Mei , Yong Rui Microsoft Research, Beijing, China University of Science and Technology of China, Hefei, China {tiyao, tmei, yongrui}@microsoft.com, [email protected] Abstract Hashing techniques have been intensively investi- gated for large scale vision applications. Recent research has shown that leveraging supervised in- formation can lead to high quality hashing. How- ever, most existing supervised hashing method- s only construct similarity-preserving hash codes. Observing that semantic structures carry comple- mentary information, we propose the idea of co- training for hashing, by jointly learning projection- s from image representations to hash codes and classification. Specifically, a novel deep semantic- preserving and ranking-based hashing (DSRH) ar- chitecture is presented, which consists of three components: a deep CNN for learning image repre- sentations, a hash stream of a binary mapping lay- er by evenly dividing the learnt representations in- to multiple bags and encoding each bag into one hash bit, and a classification stream. Meanwhile, our model is learnt under two constraints at the top loss layer of hash stream: a triplet ranking loss and orthogonality constraint. The former aims to pre- serve the relative similarity ordering in the triplets, while the latter makes different hash bit as indepen- dent as possible. We have conducted experiments on CIFAR-10 and NUS-WIDE image benchmarks, demonstrating that our approach can provide supe- rior image search accuracy than other state-of-the- art hashing techniques. 1 Introduction The rapid development of Web 2.0 technologies has led to the surge of research activities in large scale visual search [Mei et al., 2014]. One fundamental research problem is sim- ilarity search, i.e., nearest neighbor search, which attempts to identify similar instances according to a query example. The need to search for millions of visual examples in a high- dimensional feature space, however, makes the task compu- tationally expensive and thus challenging. Hashing techniques, one direction of the most well-known Approximate Nearest Neighbor (ANN) search methods, have received intensive research attention for its great efficien- cy in gigantic data. The basic idea of hashing is to con- sky clouds sunset tree sky clouds sunset tree sky clouds sunset (a) (b) (c) Figure 1: Three exemplary images. Both images in (a) and (b) are associated with four tags: “sky,” “clouds,” “sunset” and “tree.” The image in (c) is labeled with “sky,” “clouds” and “sunset.” struct a series of hash functions to map each example in- to a compact binary code, making the Hamming distances on similar examples minimized and simultaneously max- imized on dissimilar examples. In the literature, there have been several techniques, including traditional hashing models based on hand-crafted features [Weiss et al., 2008; Wang et al., 2012] and deep models [Lai et al., 2015; Liong et al., 2015], being proposed for addressing the prob- lem of hashing. The former seek hashing function on hand- crafted features, which separate the encoding of feature rep- resentations and their quantization to hash codes, resulting in sub-optimal solution. The latter jointly learn feature repre- sentations and projections from them to hash codes in a deep architecture. We are investigating in this paper how to design a deep architecture for hashing to characterize the relative similarity between images, meanwhile making the obtained hash bits as independent as possible. While existing hashing approaches are promising to mea- sure similarity, the relationship between two images is more complex especially when images are with multiple semantic labels and is usually reflected by the number of common la- bels that two images have. Figure 1 shows three exemplary images. The two images in (a) and (b) are both associated with “sky,” “clouds,” “sunset” and “tree,” while the image in (c) is only relevant to “sky,” “clouds” and “sunset.” In this case, the hash codes of image in (a) should be closer in prox- imity to the image in (b) than the image in (c). Therefore, in practice, how to preserve semantic structures of the data in form of class labels is also essential to be further taken into account for hashing. By consolidating the idea of co-training between hashing and preserving semantic structures, this paper presents a nov- el Deep Semantic-Preserving and Ranking-Based Hashing

Deep Semantic-Preserving and Ranking-Based … Semantic-Preserving and Ranking-Based Hashing for Image Retrieval Ting Yao y, Fuchen Long z, Tao Mei , Yong Rui y yMicrosoft Research,

  • Upload
    lamhanh

  • View
    223

  • Download
    4

Embed Size (px)

Citation preview

Page 1: Deep Semantic-Preserving and Ranking-Based … Semantic-Preserving and Ranking-Based Hashing for Image Retrieval Ting Yao y, Fuchen Long z, Tao Mei , Yong Rui y yMicrosoft Research,

Deep Semantic-Preserving and Ranking-Based Hashing for Image Retrieval

Ting Yao †, Fuchen Long ‡, Tao Mei †, Yong Rui †

†Microsoft Research, Beijing, China ‡University of Science and Technology of China, Hefei, China

{tiyao, tmei, yongrui}@microsoft.com, [email protected]

AbstractHashing techniques have been intensively investi-gated for large scale vision applications. Recentresearch has shown that leveraging supervised in-formation can lead to high quality hashing. How-ever, most existing supervised hashing method-s only construct similarity-preserving hash codes.Observing that semantic structures carry comple-mentary information, we propose the idea of co-training for hashing, by jointly learning projection-s from image representations to hash codes andclassification. Specifically, a novel deep semantic-preserving and ranking-based hashing (DSRH) ar-chitecture is presented, which consists of threecomponents: a deep CNN for learning image repre-sentations, a hash stream of a binary mapping lay-er by evenly dividing the learnt representations in-to multiple bags and encoding each bag into onehash bit, and a classification stream. Meanwhile,our model is learnt under two constraints at the toploss layer of hash stream: a triplet ranking loss andorthogonality constraint. The former aims to pre-serve the relative similarity ordering in the triplets,while the latter makes different hash bit as indepen-dent as possible. We have conducted experimentson CIFAR-10 and NUS-WIDE image benchmarks,demonstrating that our approach can provide supe-rior image search accuracy than other state-of-the-art hashing techniques.

1 IntroductionThe rapid development of Web 2.0 technologies has led tothe surge of research activities in large scale visual search[Mei et al., 2014]. One fundamental research problem is sim-ilarity search, i.e., nearest neighbor search, which attemptsto identify similar instances according to a query example.The need to search for millions of visual examples in a high-dimensional feature space, however, makes the task compu-tationally expensive and thus challenging.

Hashing techniques, one direction of the most well-knownApproximate Nearest Neighbor (ANN) search methods, havereceived intensive research attention for its great efficien-cy in gigantic data. The basic idea of hashing is to con-

sky

clouds

sunset

tree

sky

clouds

sunset

tree

sky

clouds

sunset

(a) (b) (c)

Figure 1: Three exemplary images. Both images in (a) and (b) areassociated with four tags: “sky,” “clouds,” “sunset” and “tree.” Theimage in (c) is labeled with “sky,” “clouds” and “sunset.”

struct a series of hash functions to map each example in-to a compact binary code, making the Hamming distanceson similar examples minimized and simultaneously max-imized on dissimilar examples. In the literature, therehave been several techniques, including traditional hashingmodels based on hand-crafted features [Weiss et al., 2008;Wang et al., 2012] and deep models [Lai et al., 2015;Liong et al., 2015], being proposed for addressing the prob-lem of hashing. The former seek hashing function on hand-crafted features, which separate the encoding of feature rep-resentations and their quantization to hash codes, resulting insub-optimal solution. The latter jointly learn feature repre-sentations and projections from them to hash codes in a deeparchitecture. We are investigating in this paper how to designa deep architecture for hashing to characterize the relativesimilarity between images, meanwhile making the obtainedhash bits as independent as possible.

While existing hashing approaches are promising to mea-sure similarity, the relationship between two images is morecomplex especially when images are with multiple semanticlabels and is usually reflected by the number of common la-bels that two images have. Figure 1 shows three exemplaryimages. The two images in (a) and (b) are both associatedwith “sky,” “clouds,” “sunset” and “tree,” while the image in(c) is only relevant to “sky,” “clouds” and “sunset.” In thiscase, the hash codes of image in (a) should be closer in prox-imity to the image in (b) than the image in (c). Therefore, inpractice, how to preserve semantic structures of the data inform of class labels is also essential to be further taken intoaccount for hashing.

By consolidating the idea of co-training between hashingand preserving semantic structures, this paper presents a nov-el Deep Semantic-Preserving and Ranking-Based Hashing

Page 2: Deep Semantic-Preserving and Ranking-Based … Semantic-Preserving and Ranking-Based Hashing for Image Retrieval Ting Yao y, Fuchen Long z, Tao Mei , Yong Rui y yMicrosoft Research,

(DSRH) architecture, as illustrated in Figure 2. The input toour architecture is in the form of triplets, i.e., a query image, asimilar image and a dissimilar image. A shared DCNN is thenexploited to produce image representations, followed by im-porting into a hash stream for hash code encoding and a clas-sification stream for measuring semantic structures. A tripletranking loss is designed with orthogonality constraint to char-acterize relative similarities at the top of hash stream, whilea classification error is formulated in classification stream.By jointly learning hash stream and classification stream, thegenerated hash codes are expected to better present semanticsimilarities between images.

The main contributions of this paper include:

• We propose a novel hashing architecture, which com-bines hash coding and classification for preserving notonly relative similarity between images but also seman-tic structures on images.

• A triplet ranking loss with orthogonality constraint is ex-ploited to optimize our architecture, making each hashbit as independent as possible.

• An extensive set of experiments on two widely useddatasets demonstrate the advantages of our proposedmodel over several state-of-the-art hashing techniques.

2 Related WorkWe briefly group related works into two categories: hand-crafted features based hashing and deep models for hashing.

The research on hand-crafted features based hashing hasproceeded along three dimensions: unsupervised hashing,semi-supervised hashing and supervised hashing. Unsuper-vised hashing [Gionis et al., 1999; Gong and Lazebnik, 2011]refers to the setting when the label information is not avail-able. Locality Sensitive Hashing (LSH) [Gionis et al., 1999]is one of the most well-known representative, which sim-ply uses random linear projections to construct hash func-tions. Another effective method called Iterative Quantization(ITQ) [Gong and Lazebnik, 2011] was suggested for betterquantization rather than random projections. Spectral Hash-ing (SH) in [Weiss et al., 2008] was proposed to design com-pact binary codes by preserving the similarity between sam-ples, which can be viewed as an extension of spectral clus-tering [Zelnik-manor and Perona, 2014]. Semi-supervisedhashing approaches try to improve the quality of hash codesby leveraging the supervised information into learning proce-dure. For example, Wang et al. developed a Semi-SupervisedHashing (SSH) [Wang et al., 2012] method which utilizespairwise information on labeled samples to preserve relativesimilarity. In another work [Kim and Choi, 2011], Semi-Supervised Discriminant Hashing (SSDH) learns hash codesbased on Fisher’s discriminant analysis to maximize separa-bility between labeled data in different classes while the un-labeled data are used for regularization. When all label in-formation is available, we refer to the problem as supervisedhashing. The representative in this category is Kernel-basedSupervised Hashing (KSH) which was proposed by Liu etal. in [Liu et al., 2012]. It maps the data to compact bina-ry codes whose Hamming distances are minimized on similar

pairs and simultaneously maximized on dissimilar pairs. In[Norouzi and Fleet, 2011], Norouzi et al. proposed MinimalLoss Hashing (MLH) method, which aims to learn similarity-preserving binary codes by exploiting pairwise relationship.

Inspired by recent advances in image representation us-ing deep convolutional neural networks, a few deep archi-tecture based hashing methods have been proposed for im-age retrieval. Semantic Hashing [Salakhutdinov and Hinton,2009] is the first work to exploit deep learning techniques forhashing. It applies stacked Restricted Boltzmann Machine(RBM) to learn hash codes for visual search. Xia et al. pro-posed Convolutional Neural Networks Hashing (CNNH) [X-ia et al., 2014] to decompose the hash learning process in-to a stage of learning approximate hash codes followed bya deep-networks-based stage of simultaneously learning im-age features and hash functions. However, separating hashinglearning into two stages may result in a sub-optimal solution.Later in [Lai et al., 2015], Lai et al. proposed Network inNetwork Hashing (NINH) to combine feature learning andhash coding into one stage.

In summary, our approach belongs to deep architecturebased hashing. The aforementioned approaches often focuson similarity-preserving learning for hashing. Our work inthis paper contributes by studying not only preserving rela-tive similarity between images, but also how image semanticsupervision could be further leveraged for boosting hashing.

3 Deep Semantic-preserving andRanking-based Hashing (DSRH)

In this section, we will present the proposed Deep Semantic-Preserving and Ranking-Based Hashing (DSRH) in details.Figure 2 illustrates an overview of our architecture, whichconsists of three components: a shared DCNN for learningimage representations, a hash stream for encoding each im-age into hash codes and a classification stream for leverag-ing semantic supervision. Specifically, the hash stream isdesigned with multiple bags construction plus orthogonalityconstraint trained in a triplet-wise manner, while the classifi-cation stream reinforces the hash learning to preserve seman-tic structures on images. We will discuss the two streams inthe Section 3.2 and 3.3, respectively.

3.1 NotationSuppose that we have n images and each of them can be pre-sented as X . The goal of image hashing is to learn a mappingH : X → {0, 1}k, such that an input image X will be encod-ed into a k-bit binary codeH(X).

3.2 Hash StreamThe hash coding of each image is treated independently inpoint-wise hashing learning methods, regardless of the rela-tionships of similar or dissimilar between images. More im-portantly, the relative similarity relations like “for query im-age X , it should be more similar to image X+ than to imageX−,” are reflected in the image class labels that image X andX+ belong to the same class while image X− comes fromother categories. The utilization of these relative similarityrelations has been proved to be effective in hash coding [Pan

Page 3: Deep Semantic-Preserving and Ranking-Based … Semantic-Preserving and Ranking-Based Hashing for Image Retrieval Ting Yao y, Fuchen Long z, Tao Mei , Yong Rui y yMicrosoft Research,

Query images

Similar images

Dissimilar images

Triplets of images with labels

...

...

plants

road

vehicle

Joint learning

...

...

road

vehicle

...

animal...

.........

......

......

......

...

flowers

Softmax loss or Cross entropy loss

Triplet ranking loss and Orthogonality

constraint

Convolutional neural networks

plants

road

vehicle

road

vehicle

animal

flowers

Has

h st

ream

Cla

ssifi

catio

n st

ream

Figure 2: Deep Semantic-Preserving and Ranking-Based Hashing (DSRH) framework (better viewed in color). The input to DSRH archi-tecture is in the form of image triplets. A shared deep convolutional neural networks is exploited for learning image representations, followedby two streams, i.e., hash stream and classification stream. Hash stream is to encode each image into hash codes by first dividing the learntrepresentations into multiple bags and then convert each bag into one hash bit. Triplet loss with orthogonality constraint is measured in hashstream. Classification stream is to characterize the semantic structures on image and softmax loss or cross entropy loss is computed for singlelabel and multi-label classification, respectively. Both hash stream and classification stream are jointly learnt by minimizing two losses.

et al., 2015; Lai et al., 2015; Li et al., 2014]. Inspired by theidea of preserving relative similarity in deep architecture [Laiet al., 2015], we propose a hash stream with multiple bagsconstruction plus orthogonality constraint learnt in a triplet-wise manner, which aims to preserve relative similarity aswell as make each hash bit as independent as possible.

Triplet Ranking LossSpecifically, we can easily obtain a set of triplets T based onimage labels, where each tuple (X,X+, X−) consists of aquery image X , a similar image X+ and a dissimilar imageX−. To preserve the similarity relations in the triplets, weaim to learn a hash mapping H(·) which makes the compactcode H(X) more similar to H(X+) than to H(X−). Thus,the triplet ranking hinge loss is employed and defined as

ltriplet(H(X),H(X+),H(X−))= max(0, 1−

∥∥H(X)−H(X−)∥∥H

+∥∥H(X)−H(X+)

∥∥H)

s.t. H(X),H(X+),H(X−) ∈ {0, 1}k,

(1)

where || · ||H represents Hamming distance. For ease of op-timization, natural relaxation tricks are utilized on Eq.(1) tochange integer constraint to the range constraint and replaceHamming norm with l2 norm. Then, the loss function is re-formulated as

ltriplet(H(X),H(X+),H(X−))

= max(0, 1−∥∥H(X)−H(X−)

∥∥22+∥∥H(X)−H(X+)

∥∥22)

s.t. H(X),H(X+),H(X−) ∈ [0, 1]k

.

(2)

The gradients in the back-propagation of the triplet rankingloss are computed as

∂ltriplet∂h

=(2h− − 2h+)× I‖h−h+‖2

2−‖h−h−‖2

2+1>0

∂ltriplet∂h+

=(2h+ − 2h

)× I‖h−h+‖2

2−‖h−h−‖2

2+1>0

∂ltriplet∂h−

=(2h − 2h−

)× I‖h−h+‖2

2−‖h−h−‖2

2+1>0

, (3)

where H(X),H(X+),H(X−) are represented as vectorh,h+,h−, respectively. The indicator function Icondition = 1if condition is true; otherwise Icondition = 0.

Multiple Bags Construction plus OrthogonalityConstraintFor hashing representation learning, compactness is a criticalcriterion to guarantee its performance in efficient similaritysearch. Given a certain small length of binary codes, the re-dundancy lies in different bits would badly affect its perfor-mance. By removing the redundancy, we can either incorpo-rate more information in the same length of binary codes, orshorten the binary codes while maintaining the same amountof information. Thus to alleviate the redundancy problem, wedevelop multiple bags construction in our deep architecture.The multiple bags module has a unique construction whichdivides the input features into k bags firstly and then encodeseach bag into one hash bit by a fully-connected layer. It aimsto reduce the bit redundancy and the effectiveness has beenproved in the hashing work [Lai et al., 2015]. Moreover, anorthogonality constraint is further imposed at the top loss lay-er of hash stream to decorrelate different hash bit.

Page 4: Deep Semantic-Preserving and Ranking-Based … Semantic-Preserving and Ranking-Based Hashing for Image Retrieval Ting Yao y, Fuchen Long z, Tao Mei , Yong Rui y yMicrosoft Research,

Letm and k denote the number of triplets in a batch and thenumber of output hash bits, respectively. After we obtain thematrix H ∈ [0, 1]m×k of hash bits, a projection H = 2H− 1

is exploited to transform H to H ∈ [−1, 1]m×k. Thus, theloss function with orthogonality constraint is given by

min(ltriplet(H∗, H+, H−))

s.t. H∗, H+, H− ∈ [−1, 1]m×k

1

mHT H = I H ∈ {H∗, H+, H−}

, (4)

where H∗, H+, H− represents the matrix of the approximatehash bits of query images, similar images and dissimilar im-ages from all the triplets, respectively.

The orthogonality constraint in Eq.(4) makes the optimiza-tion difficult to be solved. To address this problem, the or-thogonal constraint 1

mHT H = I can be relaxed by append-ing the converted soft penalty term to the objective function.Then, the final loss function can be rewritten as

min(ltriplet(H) + λlorthogonal(H)) , (5)

where the hyper-parameter λ is the tradeoff parameter be-tween triplet ranking loss and orthogonality constraint. For-mally, the orthogonality constraint loss is

lorthogonal(H) =1

3

∑∥∥∥∥ 1

mHT H− I

∥∥∥∥2F

, (6)

where || · ||F represents the Frobenius norm.Therefore, the gradient of the orthogonality constraint loss

respect to hash stream is computed by

∂lorthogonal

∂H=

4

3mH(HT H− I

), (7)

where H ∈ {H∗, H+, H−}.

3.3 Classification StreamImage labels not only provide knowledge in classifying butalso are useful supervised information for mining semanticstructures in images. A valid question is how to leveragethe semantic supervision into hashing and make the generat-ed hash codes better reflecting semantic similarities betweenimages. Specifically, we propose a co-training mechanismby combining hash stream and classification stream. In theclassification stream, a classification error is measured andthe whole architecture of the two streams are jointly learnt byminimizing triplet ranking loss in hash stream and classifica-tion loss in classification stream.

Softmax OptimizationFor the single label classification, we use softmax optimiza-tion method. Given an input image xi, the softmax loss isthen formulated as

lsoftmax(θ) = −1

n

[n∑i=1

c∑j=1

I(yi=j)logeθ

Tj x

i∑cl=1 e

θTlxi

], (8)

where θ denotes the parameters of our architecture andyi ∈ {1, 2, ...c} represents image label.

The gradient with respect to θj for optimization is

∂lsoftmax∂θj

= − 1

n

n∑i=1

[xi(I(yi=j|xi) − p(y

i = j|xi; θ))], (9)

where p(yi = j|xi; θ) is the predicted probability:

p(yi = j|xi; θ) = eθTj x

i∑cl=1 e

θTlxi

. (10)

Cross Entropy OptimizationIf an image contains multiple labels, we refer to this problemas multi-label classification. Cross entropy loss is employedin this case. Similar to softmax loss, cross entropy loss iscomputed by

lCrossEntropy(θ) = −1

n

n∑i=1

c∑j=1

[pi(yij = 1)log(pi(y

ij = 1; θ))

+ (1− pi(yij = 1))log(1− pi(yij = 1; θ))]

,

(11)

where pi is the predict probability which is the same asEq.(10). yi ∈ {0, 1}c is a binary label vector, where c is thenumber of labels.

3.4 Image RetrievalAfter the optimization of DSRH, we can employ hash streamin the architecture to generate k-bit hash codes for each inputimage. In this procedure, an image X is first encoded into ak-dimension feature vector h. Then, a quantization operationb = sign(h) is exploited to generate hash codes b, wheresign(h) is a sign function on vector h with sign(hi) = 1 ifhi > 0 and otherwise sign(hi) = 0. Given a query image,the retrieval list of images is produced by sorting the ham-ming distances of hash codes between the query image andimages in search pool.

4 ExperimentsWe conducted extensive evaluations of our proposed methodon two image datasets, i.e., CIFAR-101, a tiny image collec-tion and NUS-WIDE2, a large scale web image dataset.

4.1 DatasetThe CIFAR-10 dataset consists of 60,000 real world tinyimages in 10 classes. Each class has 6,000 images in size32× 32. We randomly select 1,000 images (100 images perclass) as the test query set. For the unsupervised setting, allthe rest images are used as training samples. For the super-vised setting, we additionally sample 500 images from eachclass in the training samples and constitute a subset of 5,000labeled images for training.

The NUS-WIDE dataset contains 269,648 images collect-ed from Flickr. Each of these images is associated with oneor multiple labels in 81 semantic concepts. For a fair com-parison, we follow the settings in [Lai et al., 2015] to usethe subset of images associated with 21 most frequent labels,

1http://www.cs.toronto.edu/ kriz/cifar.html2http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm

Page 5: Deep Semantic-Preserving and Ranking-Based … Semantic-Preserving and Ranking-Based Hashing for Image Retrieval Ting Yao y, Fuchen Long z, Tao Mei , Yong Rui y yMicrosoft Research,

12 24 32 48

The number of bits

0.2

0.4

0.6

0.8

1.0M

AP

DSRHDRH+

DRHNINHKSH

SHITQLSHPCAH

(a)

12 24 32 48

The number of bits0.0

0.2

0.4

0.6

0.8

1.0

Pre

cisi

onw

ithi

nH

amm

ing

radi

us2 DSRH

DRH+

DRHNINHKSH

SHITQLSHPCAH

(b)

0.0 0.2 0.4 0.6 0.8 1.0

Recall0.0

0.2

0.4

0.6

0.8

1.0

Pre

cisi

on

DSRHDRH+

DRHNINHKSH

SHITQLSHPCAH

(c)

Figure 3: Comparisons with state-of-the-art approaches on CIFAR-10 dataset. (a) Mean average precision (MAP) performance. (b) Precisionwithin Hamming radius 2 using hash lookup. (c) Precision-Recall curves on 48 bits. For better viewing, please see original color pdf file.

where each label associates with at least 5,000 images. We re-size each image to 256× 256. Similar to the split in CIFAR-10, we randomly select 2,100 images (100 images per class)as the test query set. For the unsupervised setting, all the restimages are used as the training set. For the supervised setting,we uniformly sample 500 images from each class to constructa subset for training.

4.2 Experimental SettingsOn both datasets, we utilize the 19-layer VGG [Simonyanand Zisserman, 2015] as our basic DCNN architecture. Inbetween, the first 18 layers follow the exactly same architec-tures as VGG network and the number of neurons in the lastfully-connected layer is set to s× k, where s and k is thenumber of bags and hash bits, respectively. We empiricallyset s = 30 in all our experiments. The hyper-parameter λ isdetermined by using a validation set and set to 0.25 finally.

We implement the proposed method based on the open-source Caffe [Jia et al., 2014] framework. In all experiments,our networks are trained by stochastic gradient descent with0.9 momentum. The start learning rate is set to 0.01, and wedecrease it to 10% after 5, 000 iterations on CIFAR-10 andafter 20, 000 iterations on NUS-WIDE. The mini-batch sizeof images is 64. The weight decay parameter is 0.0002.

4.3 Protocols and Baseline MethodsWe follow three evaluation protocols, i.e., mean average pre-cision (MAP), hash lookup and precision-recall curve, whichare widely used in [Gong and Lazebnik, 2011; Liu et al.,2012; Wang et al., 2012]. We compare the performances ofour proposed model DSRH with six hashing methods in-cluding five traditional models, i.e., PCA Hashing (PCAH),Locality Sensitive Hashing (LSH) [Gionis et al., 1999],Spectral Hashing (SH) [Weiss et al., 2008], Iterative Quan-tization (ITQ) [Gong and Lazebnik, 2011] and SupervisedHashing with Kernels (KSH) [Liu et al., 2012], and one deepmodel, i.e., Network In Network Hashing (NINH) [Lai etal., 2015]. Moreover, two slightly different settings of ourDSRH are named as DRH+ and DRH , which only in-cludes individual hash stream with and without orthogonalityconstraint, respectively.

For NINH , DRH , DRH+ and DSRH , the raw pix-el images are set as input. For the other baseline methods,we use the 512-dimensional GIST vector for each image inCIFAR-10 and the output of 1000-way fc8 classification lay-er in Alexnet [Krizhevsky et al., 2012] for NUS-WIDE.

4.4 Results on CIFAR-10 DatasetFigure 3(a) shows the MAP performances of nine runs onCIFAR-10 dataset. Overall, the results across different num-ber of hash bits consistently indicate that our DSRH out-performs others. In particular, the MAP of DSRH makesthe relative improvement over the best traditional competi-tor KSH and deep model NINH by 132.1%∼179.5% and33.8%∼41.6%, respectively, which is so far the highest per-formance reported on CIFAR-10 dataset. There is a signifi-cant performance gap between the traditional and deep mod-els. It is not surprising to see that DRH improves NINHsince DRH exploits a more powerful image representationbrought by a deeper CNN. By additionally incorporating or-thogonality constraint, DRH+ exhibits better performancethan DRH . Our DSRH further improves DRH+ with arelative increase of 1.6%∼3.6%, demonstrating the advan-tage of boosting hashing by preserving semantic structuresthrough classification.

In the evaluation of hash lookup within Hamming radius 2as shown in Figure 3(b), the precisions for most of the tradi-tional methods drop when a longer size of hash codes is used(48 bits in our case). This is because the number of samplesfalling into a bucket decreases exponentially for longer sizesof hash codes. Therefore, for some query images, there areeven no any neighbor in a Hamming ball of radius 2. Evenin this case, the precision of our proposed DSRH still has avery slight improvement from 78.57% of 32 bits to 79.17%of 48 bits, indicating fewer failed queries for DSRH .

We further detail the precision-recall curves in Figure 3(c).The results confirm the trends seen in Figure 3(a) and demon-strate performance improvements using the proposedDSRHapproach over other methods.

4.5 Results on NUS-WIDE DatasetFigure 4 shows the experimental results on NUS-WIDEdataset. MAP performance and precision with Hamming ra-

Page 6: Deep Semantic-Preserving and Ranking-Based … Semantic-Preserving and Ranking-Based Hashing for Image Retrieval Ting Yao y, Fuchen Long z, Tao Mei , Yong Rui y yMicrosoft Research,

12 24 32 48

The number of bits0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

MA

P

DSRHDRH+

DRHNINHKSH

SHITQLSHPCAH

(a)

12 24 32 48

The number of bits0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Pre

cisi

onw

ithi

nH

amm

ing

radi

us2

DSRHDRH+

DRHNINHKSH

SHITQLSHPCAH

(b)

0.0 0.2 0.4 0.6 0.8 1.0

Recall0.0

0.2

0.4

0.6

0.8

1.0

Pre

cisi

on

DSRHDRH+

DRHNINHKSH

SHITQLSHPCAH

(c)

Figure 4: Comparisons with state-of-the-art approaches on NUS-WIDE dataset. (a) Mean average precision (MAP) performance. (b)Precision within Hamming radius 2 using hash lookup. (c) Precision-Recall curves on 48 bits. Better viewed in original color pdf file.

PCAH

LSH

ITQ

SH

KSH

NINH

DSRH

PCAH

LSH

ITQ

SH

KSH

NINH

DSRH

Figure 5: Examples showing the top 10 image retrieval results by different methods in response to two query images on NUS-WIDE dataset(better viewed in color). In each row, the first image with a red bounding box is the query image and the images whose annotations completelycontain all the labels of the query image are regarded as excellent ones, which are enclosed in a blue bounding box.

dius 2 using hash lookup are given in Figure 4(a) and (b), re-spectively. Our DSRH model consistently outperforms oth-ers. In particular, the MAP performance and precision withHamming radius 2 using hash lookup of DSRH can achieve87.61% and 73.59% with 48 hash bits, which make the im-provement over the best competitor NINH by 22.53% and15.85%. Furthermore, DSRH , in comparison, is benefitedfrom utilizing semantic supervision and thus shows a rela-tive increase of 1.0%∼1.8% over DRH+ in terms of MAP.Similar to the observations on CIFAR-10 dataset, the preci-sions of most methods decrease when increasing the size ofhash codes to 48 bits and the drop in precision of our DSRHis much less compared to others. Figure 4(c) shows theprecision-recall curves and the results indicate that DSRHconstantly leads to better performance.

Figure 5 further illustrates the top ten image search resultsby different methods in response to two query images. Wecan see that the proposed DSRH method achieves more sat-isfying results and retrieves eight “excellent images” in thereturned top ten images. It is worth noticing that “excellentimages” here refer to images whose annotations completelycontain all the labels of the query image. As a result, the im-ages retrieved by our DSRH approach are more similar insemantics with the query image.

5 Conclusion and Discussion

In this paper, we have presented deep semantic-preservingand ranking-based hashing architecture which explores bothrelative similarity between images and semantic supervisionon images. Particularly, given triplets of images with label-s, we exploit a shared DCNN to learn image representation,followed by two streams, i.e., hash stream and classificationstream. Hash stream aims to encode each image into hashcodes by characterizing the relative similarities between im-ages in a triplet and meanwhile making the generated hashcodes as compact as possible, while classification stream is topreserve the semantic structures on images. Basically, utiliz-ing only hash stream shows better performance than state-of-the-art hashing techniques on two image datasets. By joint-ly learning hash stream and classification stream to reinforcehashing, further improvements are consistently observed inthe experiments.

Our future works are as follows. First, as our architecture isa co-training process, how the architecture performs on clas-sification task will be further investigated and evaluated. Nex-t, more in-depth studies of how to fuse the two streams in aprinciple way could be explored. Furthermore, how to applyour proposed architecture to video domain seems interesting.

Page 7: Deep Semantic-Preserving and Ranking-Based … Semantic-Preserving and Ranking-Based Hashing for Image Retrieval Ting Yao y, Fuchen Long z, Tao Mei , Yong Rui y yMicrosoft Research,

References[Gionis et al., 1999] Aristides Gionis, Piotr Indyk, and Ra-

jeev Motwani. Similarity search in high dimensions viahashing. In Proceedings of International Conference onVery Large Data Bases (VLDB), 1999.

[Gong and Lazebnik, 2011] Yunchao Gong and SvetlanaLazebnik. Iterative quantization: A procrustean approachto learning binary codes. In Proceedings of Computer Vi-sion and Pattern Recognition (CVPR), 2011.

[Jia et al., 2014] Yangqing Jia, Evan Shelhamer, Jeff Don-ahue, Sergey Karayev, Jonathan Long, Ross B. Girshick,Sergio Guadarrama, and Trevor Darrell. Caffe: Convo-lutional architecture for fast feature embedding. CoRR,2014.

[Kim and Choi, 2011] Saehoon Kim and Seungjin Choi.Semi-supervised discriminant hashing. In Proceedings ofInternational Conference on Data Mining (ICDM), 2011.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever,and Geoffrey E. Hinton. Imagenet classification with deepconvolutional neural networks. In Proceedings of NeuralInformation Processing Systems (NIPS), 2012.

[Lai et al., 2015] Hanjiang Lai, Yan Pan, Ye Liu, andShuicheng Yan. Simultaneous feature learning and hashcoding with deep neural networks. In Proceedings of Com-puter Vision and Pattern Recognition (CVPR), 2015.

[Li et al., 2014] Yinxiao Li, Yan Wang, Michael Case, Shih-Fu Chang, and Peter K. Allen. Real-time pose estima-tion of deformable objects using a volumetric approach.In Proceedings of IEEE/RSJ International Conference onIntelligent Robots and Systems (IROS), 2014.

[Liong et al., 2015] Venice Erin Liong, Jiwen Lu, GangWang, Pierre Moulin, and Jie Zhou. Deep hashing forcompact binary codes learning. In Proceedings of Com-puter Vision and Pattern Recognition (CVPR), 2015.

[Liu et al., 2012] Wei Liu, Jun Wang, Rongrong Ji, Yu-GangJiang, and Shih-Fu Chang. Supervised hashing with k-ernels. In Proceedings of Computer Vision and PatternRecognition (CVPR), 2012.

[Mei et al., 2014] Tao Mei, Yong Rui, Shipeng Li, andQi Tian. Multimedia search reranking: A literature sur-vey. ACM Computing Surveys, 46(3), 2014.

[Norouzi and Fleet, 2011] Mohammad Norouzi and David J.Fleet. Minimal loss hashing for compact binary codes.In Proceedings of International Conference on MachineLearning (ICML), 2011.

[Pan et al., 2015] Yingwei Pan, Ting Yao, Houqiang Li,Chong-Wah Ngo, and Tao Mei. Semi-supervised hashingwith semantic confidence for large scale visual search. InProceedings of ACM conference on Research and Devel-opment in Information Retrieval (SIGIR), 2015.

[Salakhutdinov and Hinton, 2009] R. Salakhutdinov andG.E. Hinton. Semantic hashing. International Journal ofApproximate Reasoning, 50(7):969–978, 2009.

[Simonyan and Zisserman, 2015] Karen Simonyan and An-drew Zisserman. Very deep convolutional networks forlarge-scale image recognition. In Proceedings of Inter-national Conference on Learning Representations (ICLR),2015.

[Wang et al., 2012] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Semi-supervised hashing for large scalesearch. IEEE Trans, Pattern Anal. Mach. Intell (TPAMI),34(12):2393–2406, 2012.

[Weiss et al., 2008] Yair Weiss, Antonio Torralba, and RobFergus. Spectral hashing. In Proceedings of Neural Infor-mation Processing Systems (NIPS), 2008.

[Xia et al., 2014] Rongkai Xia, Yan Pan, Hanjiang Lai, Con-g Liu, and Shuicheng Yan. Supervised hashing for imageretrieval via image representation learning. In Proceed-ings of Association for the Advancement of Artificial Intel-ligence (AAAI), 2014.

[Zelnik-manor and Perona, 2014] Lihi Zelnik-manor andPietro Perona. Self-tuning spectral clustering. In Proceed-ings of Neural Information Processing Systems (NIPS),2014.