
Proceedings of the 2012 International Conference on Machine Learning and Cybernetics, Xian, 15-17 July, 2012

BOOSTING MULTIPLE HASH TABLES TO SEARCH

JIN-CHENG LI

School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China E-MAIL: [email protected]

Abstract: Hashing based approximate nearest neighbor (ANN) search is an important technique for reducing the search time and storage of large scale information retrieval problems. Semi-supervised hashing (SSH) methods capture the semantic similarity of data while avoiding overfitting to the training data, and outperform supervised, unsupervised and random projection based hashing methods in semantic retrieval tasks. However, current semi-supervised and supervised hashing methods that search by hash lookup in a single hash table usually suffer from low recall. To achieve a high recall, an exhaustive search by Hamming ranking is needed, which dramatically decreases the retrieval precision and increases the search time. In this paper, we propose to learn multiple semi-supervised hash tables using a boosting technique to overcome this problem. Multiple hash tables are learned sequentially with boosting to maximize the hashing accuracy of each hash table. The mis-hashed samples in the current hash table are penalized by large weights, and the algorithm then uses the new weight values to learn the next hash table. Given a query, the truly semantically similar samples missed from the active buckets (the buckets within a small Hamming radius of the query) of one hash table are more likely to be found in the active buckets of the next hash table. Experimental results show that our method achieves a high recall while preserving a high precision, outperforming several state-of-the-art hashing methods.

Keywords: Approximate Nearest Neighbor Search (ANN); Semi-supervised hashing; Boosting multiple hash tables; Large scale retrieval

1. Introduction

With the widespread use of the Internet, thousands of images and documents are uploaded every day. Finding similar items (e.g., images or documents) for a given query in a web-scale database by exact search is not scalable due to computational and memory limitations. This problem gets even worse in high-dimensional applications. Fortunately, many applications, such as image search and document search, do not need exact search results; approximate results are sufficient, which allows fast similarity search in huge databases. There are two major approximate similarity search approaches: tree based approaches and hashing approaches. The main idea of tree based approaches is to recursively partition either the data (R-tree) [2] or the space (KD-tree) [1]. Tree based approaches require a large amount of memory to store tree nodes, and their search performance degrades significantly as the dimensionality of the data increases. In contrast, hashing approaches embed high-dimensional data into a Hamming space (binary codes), which significantly reduces the storage requirement; the expected query time is sub-linear or even constant O(1). Here, we focus on hashing approaches. Based on the choice of hash functions, hashing approaches can be divided into four major categories: random projection (RP), unsupervised, supervised and semi-supervised methods.

The random projection hashing methods generate hash functions randomly from a specific distribution. The most well-known RP hashing technique is Locality-Sensitive Hashing (LSH) [3]. A simple LSH family is $h_k(x)=\mathrm{sgn}\left((w_k^{T}x+b_k)/t\right)$, with $w_k \sim p(w)$ and $b_k \sim U[0,t]$, where $w_k$ denotes the $k$-th random vector generated from an s-stable distribution (the Cauchy distribution is 1-stable, the Gaussian distribution is 2-stable) and $b_k$ denotes the $k$-th threshold drawn from a uniform distribution. SIKH [4] is based on random Fourier features; it samples $w_k$ in the same way as LSH but uses a cosine partition function instead of a linear one. RP methods need long codes to achieve high precision, but a long code also reduces the probability of collision exponentially, leading to low recall.
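The following short Python sketch (not from the paper; the function name `make_lsh_family` and the bucket width `t` are illustrative choices) shows one way such a random-projection hash family could be generated and applied.

```python
import numpy as np

def make_lsh_family(dim, num_bits, t=4.0, seed=0):
    """Hedged sketch of the LSH family above: w_k drawn from a 2-stable
    (Gaussian) distribution, b_k from U[0, t]; each bit is the sign of the
    shifted, scaled projection. The value of t is an arbitrary example."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((dim, num_bits))   # w_k ~ N(0, I), k = 1, ..., K
    b = rng.uniform(0.0, t, size=num_bits)     # b_k ~ U[0, t]

    def hash_fn(X):
        # X: (n_samples, dim) row-major data -> codes in {-1, +1}
        return np.where((X @ W + b) / t >= 0, 1, -1)

    return hash_fn

# Example usage: codes = make_lsh_family(dim=784, num_bits=32)(data)
```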

Rather than using random projections to embed data into binary codes, several authors have pursued generating compact codes using machine learning techniques on unsupervised data. These methods are mainly based on graph partitioning theory or Principal Component Analysis (PCA). Spectral Hashing [5], Anchor Graph Hashing [6] and Iterative Quantization [7] are examples of unsupervised hashing methods. Unsupervised hashing methods generate more compact hash codes than RP hashing methods, but their performance may degrade substantially when most of the data variance is concentrated in the top few principal directions.

Unsupervised hashing methods yield compact codes in comparison to random projection based hashing methods and achieve higher performance when measured by Euclidean distance. However, two samples with a small Euclidean distance are not necessarily semantically similar. In multimedia retrieval problems, retrieval accuracy is usually measured by semantic similarity, so generating semantics-preserving hash functions is crucial for multimedia retrieval tasks. To learn semantics-preserving hash functions, the authors of [8] proposed a supervised hashing method (MLH) using structural SVMs with latent variables; MLH minimizes an empirical loss over training pairs with weakly labeled data in the Hamming space. In [9], the authors proposed a semantic hashing method by encoding a learned metric parameterization into a randomized locality-sensitive hash (LSH) function. Supervised hashing methods require large labeled training sets, and their training is much slower compared to unsupervised methods. To overcome these problems, the authors of [10] proposed a semi-supervised hashing (SSH) method that uses only a small number of labeled data while remaining robust to overfitting. SSH can be solved efficiently by eigenvalue decomposition or by a sequential learning algorithm [11].

From the above literature review, semi-supervised and supervised hashing methods preserve semantic similarity better than unsupervised and RP based hashing methods. Semi-supervised methods have the advantage over supervised ones that they need only a small set of labeled data and shorter training times. In large scale applications, obtaining a large number of labeled samples is too expensive to be practical, so semi-supervised hashing methods are a good choice for semantic retrieval. However, current semi-supervised and supervised hashing methods that search by hash lookup (returning the results within Hamming distance d of the query) in a single hash table usually suffer from low recall. To achieve a high recall, an exhaustive search by Hamming ranking (sorting the items in the database according to their Hamming distance to the query) is needed, which dramatically decreases the retrieval precision and increases the search time. In this paper, we propose to learn multiple semi-supervised hash tables using a boosting technique to overcome this problem.

Major contributions of this paper: We propose a boosting algorithm to learn multiple hash tables that balances retrieval recall and precision. Our method maximizes the hashing accuracy of each hash table within a Hamming distance smaller than a small threshold d. Given a query, mis-hashed points that are not found in the current hash table have a larger probability of being found in the next hash table. In the search step, we only need to access the buckets whose Hamming distance to the query is smaller than d in all hash tables until we obtain the required number of relevant results (constant search time complexity). The proposed algorithm greatly increases retrieval recall while preserving a high retrieval precision.

The remainder of this paper is organized as follows. In Section 2, we introduce the works closely related to our proposed method. Section 3 presents the proposed boosting multiple hash tables algorithm. Experimental results on a real dataset are shown in Section 4, and we conclude this work in Section 5.

2. Related Works

Before discussing the works related to this paper, we first introduce some notation. Suppose we are given a dataset matrix $X \in \mathbb{R}^{D \times N}$ containing $N$ training samples $\{x_i\}$, $i=1,\ldots,N$, each of dimension $D$ (each column of $X$ is a sample). A portion of the points, $X_l \in \mathbb{R}^{D \times L}$, have label information. There are two categories of label information: must-link pairs $(x_i, x_j) \in M$ and cannot-link pairs $(x_i, x_j) \in C$. A point pair $(x_i, x_j) \in M$ is called a similar pair, in which $x_i$ and $x_j$ share a common class label; similarly, $(x_i, x_j) \in C$ denotes a dissimilar pair, in which $x_i$ and $x_j$ have different class labels. Each point in $X_l$ belongs to at least one of these two types of label information. Hashing algorithms use the given dataset $X$ to learn $K$ hash functions $h_k(x)=\mathrm{sgn}(w_k^{T}x+b_k) \in \{-1,1\}$, $k=1,\ldots,K$. These $K$ hash functions $H(x)=[h_1(x),\ldots,h_K(x)]$ map points from $\mathbb{R}^D$ to the $K$-dimensional Hamming space $\{-1,1\}^K$. To simplify notation, we assume the data are zero-centered in the following discussion, so that each hash function can be written as $h_k(x)=\mathrm{sgn}(w_k^{T}x)$, where $\mathrm{sgn}(u)=1$ if $u \ge 0$ and $\mathrm{sgn}(u)=-1$ if $u<0$.
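As a minimal illustration of this notation (a sketch, not code from the paper; it assumes the columns of X have already been zero-centered), the K-bit codes can be computed as follows.

```python
import numpy as np

def hash_codes(X, W):
    """Compute H(X) = sgn(W^T X) for zero-centered data.
    X: D x N matrix whose columns are samples; W: D x K projection matrix.
    Returns a K x N matrix of bits in {-1, +1}, with sgn(0) = +1."""
    return np.where(W.T @ X >= 0, 1, -1)
```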

2.1. Semi-supervised Hashing (SSH)

Semi-supervised hashing methods learn $K$ hash functions $H=[h_1(x),\ldots,h_K(x)]$ parameterized by $K$ $D$-dimensional weight vectors $W=[w_1,\ldots,w_K]$. These $K$ hash functions map point pairs in $(x_i,x_j) \in M$ to the same bit value and point pairs in $(x_i,x_j) \in C$ to different bit values. This objective can be formulated as maximizing the following empirical hashing accuracy:

$$\max J(H) = \sum_{k}\left[\sum_{(x_i,x_j)\in M} h_k(x_i)\,h_k(x_j) \;-\; \sum_{(x_i,x_j)\in C} h_k(x_i)\,h_k(x_j)\right] \qquad (1)$$

To simplify notation, suppose $H(X_l) \in \mathbb{R}^{K \times L}$ maps the points in $X_l$ to their $K$-bit hash codes, and define a matrix $S \in \mathbb{R}^{L \times L}$ incorporating the pairwise label information from $X_l$ as:


$$S_{ij} = \begin{cases} 1, & (x_i,x_j)\in M \\ -1, & (x_i,x_j)\in C \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

The objective function J(H) can be rewritten in the following compact form:

$$\max J(H) = \tfrac{1}{2}\,\mathrm{tr}\left\{H(X_l)\,S\,H(X_l)^{T}\right\} = \tfrac{1}{2}\,\mathrm{tr}\left\{\mathrm{sgn}(W^{T}X_l)\,S\,\mathrm{sgn}(W^{T}X_l)^{T}\right\} \qquad (3)$$
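For concreteness, a small sketch of evaluating the compact objective in Eq. (3) could look like the following (illustrative only; it reuses the `hash_codes` convention sketched above).

```python
import numpy as np

def empirical_accuracy(W, X_l, S):
    """Sketch of Eq. (3): J(H) = 1/2 * tr( sgn(W^T X_l) S sgn(W^T X_l)^T ).
    X_l is the D x L matrix of labeled points and S the L x L label matrix."""
    H = np.where(W.T @ X_l >= 0, 1, -1)   # K x L matrix of hash bits
    return 0.5 * np.trace(H @ S @ H.T)
```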

Maximizing the empirical accuracy J(H) alone may easily overfit the training data, especially when the labeled training set is small. To avoid overfitting, the authors of [10] introduce two additional constraints that use all the data X, including unlabeled points, in a manner similar to Spectral Hashing. The final constrained objective is:

$$H^{*} = \arg\max_{H} J(H)$$
$$\text{s.t.}\quad \sum_{i=1}^{N} h_k(x_i) = 0,\quad k=1,\ldots,K \qquad (4)$$
$$\frac{1}{N}\,H(X)\,H(X)^{T} = I$$

The first constraint enforces that each hash function generates a balanced partition of the data. The second constraint requires that the bits generated by different hash functions be independent of each other. As shown in [10], this maximization problem can be converted into an eigenvalue decomposition problem with a closed-form solution after three relaxation steps. First, the sign function is relaxed to its signed magnitude. Then, the balanced partition constraint is relaxed to maximizing the variance of the hash bits (added as a regularization term multiplied by a regularization parameter λ). Lastly, the independence of the hash bits is relaxed by imposing an orthogonality constraint on the projection directions. Instead of using eigenvalue decomposition to obtain the K hash functions in a single shot, the authors of [11] proposed a sequential optimization algorithm (SPLH). The hash functions are learned iteratively: in each iteration, the pairwise label matrix S is updated by imposing higher weights on point pairs violated by the previous hash function. This sequential learning algorithm generates more compact codes than the plain eigenvalue decomposition [10]. In comparison to SPLH, which learns a single hash table, our proposed method learns multiple hash tables sequentially, and the pairwise label matrix S is updated by imposing higher weights on point pairs violated by all the previous hash tables.
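A hedged sketch of the relaxed one-shot solution of [10] is shown below (our reading of the cited method, not code released with this paper; `lam` corresponds to the regularization parameter λ above). The sequential SPLH variant of [11] would instead learn one projection at a time and reweight S between iterations.

```python
import numpy as np

def ssh_eigen_solution(X, X_l, S, num_bits, lam=1.0):
    """Sketch of the relaxed SSH objective: after the three relaxations the
    projections W are taken as the top-K eigenvectors of the adjusted matrix
    M = X_l S X_l^T + lam * X X^T (a reading of [10], stated as an assumption
    rather than a verified implementation)."""
    M = X_l @ S @ X_l.T + lam * (X @ X.T)          # D x D adjusted covariance
    eigvals, eigvecs = np.linalg.eigh(M)           # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:num_bits]     # indices of the K largest
    return eigvecs[:, top]                         # D x K projection matrix W
```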

2.2. Hashing using multiple hash tables

There are three previous works using multiple hash tables, but all of them are in the unsupervised setting. Multi-Probe LSH [12] randomly generates multiple LSH hash tables. Principal Component Hashing [13] generates multiple hash tables independently using different principal directions of the data. Complementary Hashing [14] uses a boosting-like strategy to learn multiple dependent hash tables, but in an unsupervised manner.

3. Proposed methods

In subsection 3.1, we propose a boosting multiple semi-supervised hash tables method (BMHT) based on SSH, which trades off retrieval precision against recall. We then define an efficient search algorithm for BMHT in subsection 3.2.

3.1. Boosting Multiple Hash Tables (BMHT)

The idea of boosting multiple hash tables is quite intuitive. The hash tables are learned sequentially using an AdaBoost-style technique. Semantically similar point pairs should fall into the same bucket or nearby buckets (within a small Hamming distance d), while semantically dissimilar point pairs should fall into buckets separated by a large Hamming distance. If the current hash table mis-hashes some point pairs, the pairwise label matrix S is updated by imposing higher weights on the point pairs violated by the current hash tables, and the new weighted label matrix S is used to learn the next hash table. The mis-hashed points are then expected to be hashed to the correct buckets in the next hash table. This boosting process implicitly creates dependencies between hash tables and progressively maximizes hashing accuracy and recall. Algorithm 1 describes the proposed boosting multiple semi-supervised hash tables algorithm.

We maximize the precision within a Hamming distance smaller than d (a small integer). We consider that two similar points should have a Hamming distance smaller than d; otherwise, the hash functions make a mistake. On the other hand, if semantically dissimilar points have a Hamming distance smaller than d, they cause false positives and decrease the retrieval precision. Therefore, we penalize dissimilar points whose Hamming distance is smaller than d + δ to reduce false positives, where δ is a safety margin. We penalize these two types of mistakes by increasing or decreasing the weight values and learn the next set of hash functions (the next hash table) to correct the mistakes made by the previous hash tables. Given a query, the hash tables obtained by our method are expected to yield high accuracy within Hamming distance d and to achieve high recall because of the boosting (most of the points similar to the query will be found in buckets within Hamming distance d).

For convenience, we let $S = S^{+} + S^{-}$, where $S^{+} \in \mathbb{R}^{L \times L}$ and $S^{-} \in \mathbb{R}^{L \times L}$ incorporate the pairwise label information of semantically similar and semantically dissimilar pairs from $X_l$, respectively:

$$S^{+}_{ij} = \begin{cases} 1, & (x_i,x_j)\in M \\ 0, & \text{otherwise} \end{cases} \qquad S^{-}_{ij} = \begin{cases} -1, & (x_i,x_j)\in C \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$
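A small sketch of building S⁺, S⁻ and S from the must-link and cannot-link pair lists (an illustrative helper, not from the paper) follows.

```python
import numpy as np

def build_label_matrices(num_labeled, must_link, cannot_link):
    """Sketch of Eq. (5): S+ holds +1 for must-link pairs, S- holds -1 for
    cannot-link pairs, and S = S+ + S-. Pair lists contain (i, j) indices
    into the labeled subset X_l; the matrices are kept symmetric."""
    S_pos = np.zeros((num_labeled, num_labeled))
    S_neg = np.zeros((num_labeled, num_labeled))
    for i, j in must_link:
        S_pos[i, j] = S_pos[j, i] = 1.0
    for i, j in cannot_link:
        S_neg[i, j] = S_neg[j, i] = -1.0
    return S_pos, S_neg, S_pos + S_neg
```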

The update rules for $S^{+}$ and $S^{-}$ are similar to perceptron learning, i.e., error correcting:

$$\Delta S_{ij} = \begin{cases} 0, & (x_i,x_j)\in M,\ \tfrac{1}{4}\|H(x_i)-H(x_j)\|^{2} \le d \\ \Delta S^{+}_{ij}, & (x_i,x_j)\in M,\ \tfrac{1}{4}\|H(x_i)-H(x_j)\|^{2} > d \\ \Delta S^{-}_{ij}, & (x_i,x_j)\in C,\ \tfrac{1}{4}\|H(x_i)-H(x_j)\|^{2} \le d+\delta \\ 0, & (x_i,x_j)\in C,\ \tfrac{1}{4}\|H(x_i)-H(x_j)\|^{2} > d+\delta \end{cases} \qquad (6)$$

where $\Delta S^{+}_{ij} = \tfrac{1}{4}\|H(x_i)-H(x_j)\|^{2} - d$ and $\Delta S^{-}_{ij} = \tfrac{1}{4}\|H(x_i)-H(x_j)\|^{2} - d - \delta$

denote the errors caused by the current hash tables (the error is set to zero if a point pair is correctly hashed by the current hash tables, and the weights of these point pairs are not updated any further). For each violation, $S_{ij}$ is updated as $S_{ij} = S_{ij} + \eta\,\Delta S_{ij}$. The step size $\eta$ is chosen such that the update is numerically stable. This weight update strategy does not change the sign of the entries in $S$.
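The update of Eq. (6) can be sketched as follows (a hedged illustration; it uses the ¼‖H(x_i)−H(x_j)‖² form of the Hamming distance and the `hash_codes` convention above, and treats the codes passed in as those produced by the current hash tables).

```python
import numpy as np

def update_label_matrix(S, H_l, must_link, cannot_link, d, delta, eta):
    """Sketch of Eq. (6). H_l is the K x L matrix of hash bits of the labeled
    points under the current hash tables; 1/4 * ||H(x_i) - H(x_j)||^2 equals
    their Hamming distance. Violated pairs have their weight in S pushed
    further from zero by step size eta; correctly hashed pairs are unchanged."""
    S_new = S.copy()
    for i, j in must_link:
        dist = 0.25 * np.sum((H_l[:, i] - H_l[:, j]) ** 2)
        if dist > d:                                  # similar pair hashed too far apart
            S_new[i, j] += eta * (dist - d)
            S_new[j, i] = S_new[i, j]
    for i, j in cannot_link:
        dist = 0.25 * np.sum((H_l[:, i] - H_l[:, j]) ** 2)
        if dist <= d + delta:                         # dissimilar pair hashed too close
            S_new[i, j] += eta * (dist - d - delta)   # negative increment
            S_new[j, i] = S_new[i, j]
    return S_new
```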

3.2. Efficient Search algorithm of BMHT

BMHT uses multiple hash tables, so the standard hash lookup and Hamming ranking search algorithms cannot be applied directly. Here, we define variants of the hash lookup and Hamming ranking algorithms for BMHT. In BMHT, a sample in the database may have different hash codes in different hash tables. Given a query, hash lookup in BMHT is performed by returning all samples whose Hamming distance to the query is smaller than a threshold d in any of the L hash tables (i.e., taking the set union of the results returned by the L hash tables). To compare against other hashing methods that search by Hamming ranking, we also define a Hamming ranking algorithm for BMHT, which returns the items in the database ranked by their minimum Hamming distance to the query over the L hash tables. Algorithm 1. Boosting multiple semi-supervised hash tables

Input: data X, pairwise labeled data X_l, initial pairwise label matrix S, hash code length K of each hash table, learning rate η, number of hash tables L.
For k = 1 to L do
    Learn a hash table: obtain W_k using the SSH algorithm (SPLH);
    Update the label matrix based on the current hash tables: S_{k+1} = S_k + η ΔS_k;
End for
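Putting the pieces together, a hedged end-to-end sketch of Algorithm 1 could look like the following. It reuses the helper sketches above and substitutes the one-shot eigen-solution for the SPLH solver used in the paper; how violations are measured across previously learned tables is our interpretation (only the newest table's codes are used here).

```python
def train_bmht(X, X_l, must_link, cannot_link, num_bits, num_tables,
               d=2, delta=2, eta=4.0, lam=1.0):
    """Sketch of Algorithm 1 (BMHT training). Each round learns one hash table
    with a semi-supervised hashing solver and then re-weights the pairwise
    label matrix S using the update rule of Eq. (6). The default d, delta and
    eta mirror Table 1; lam is the SSH regularization parameter."""
    num_labeled = X_l.shape[1]
    _, _, S_k = build_label_matrices(num_labeled, must_link, cannot_link)
    tables = []
    for _ in range(num_tables):
        W_k = ssh_eigen_solution(X, X_l, S_k, num_bits, lam)  # learn one hash table
        tables.append(W_k)
        H_l = hash_codes(X_l, W_k)                            # codes of labeled points
        S_k = update_label_matrix(S_k, H_l, must_link, cannot_link, d, delta, eta)
    return tables
```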

4. Experimental results

4.1. Experiment setup

In this section, we evaluate the performance of our proposed Boosting Multiple Hash Tables algorithm (BMHT) in comparison to several state-of-the-art hashing methods, i.e., LSH, Spectral Hashing, ITQ and SPLH. We use a large dataset with label information, MNIST (70K images), in our experiments. To evaluate the performance of BMHT, we adopt the two search algorithms commonly used in the hashing literature to rank and return the search results: Hamming ranking and hash lookup (note that the search algorithms of BMHT differ slightly from the standard ones, as described in Section 3.2). Given a query, the Hamming ranking search algorithm computes the Hamming distance between the database items and the query and then sorts the results by Hamming distance in ascending order; a number of the top ranked images or the whole ranked list is returned as the search result. The hash lookup algorithm returns all the points in the buckets that fall within a small Hamming radius d of the query. Hash lookup needs to maintain a lookup table but requires only constant search time, whereas Hamming ranking has time complexity linear in the database size. We use several information retrieval metrics to measure the performance of the different methods. For Hamming ranking, we use the precision-recall curve over the whole database. For hash lookup, retrieval accuracy and recall percentage are computed within a Hamming radius (d = 2) for different hash code lengths.
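A sketch of the two BMHT search variants from Section 3.2 used in these experiments is given below (illustrative only; it reuses the `hash_codes` helper above and assumes the query and database columns are centered consistently).

```python
import numpy as np

def bmht_hash_lookup(query, database, tables, d=2):
    """Sketch of BMHT hash lookup: return the union, over all hash tables, of
    database item indices whose Hamming distance to the query is at most d.
    `database` is D x N; `tables` is a list of D x K projection matrices."""
    hits = set()
    for W in tables:
        q_code = hash_codes(query.reshape(-1, 1), W)[:, 0]
        db_codes = hash_codes(database, W)
        dists = 0.25 * np.sum((db_codes - q_code[:, None]) ** 2, axis=0)
        hits.update(np.flatnonzero(dists <= d).tolist())
    return hits

def bmht_hamming_ranking(query, database, tables):
    """Sketch of BMHT Hamming ranking: rank database items by their minimum
    Hamming distance to the query over the L hash tables."""
    per_table = []
    for W in tables:
        q_code = hash_codes(query.reshape(-1, 1), W)[:, 0]
        db_codes = hash_codes(database, W)
        per_table.append(0.25 * np.sum((db_codes - q_code[:, None]) ** 2, axis=0))
    return np.argsort(np.minimum.reduce(per_table))
```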

4.2. Experimental results on MNIST Digit Data

The MNIST dataset [15] consists of 70K handwritten digit samples. Each sample is a 28×28 pixel image, yielding a 784-dimensional data vector. The samples are distributed over ten classes, i.e., the digits 0 to 9, and a class label is associated with each sample. The whole dataset was partitioned into two parts: a training set with 69K samples and a test set with 1K samples. For the unsupervised hashing methods, i.e., Spectral Hashing and ITQ, we use the 69K samples as training data. For the semi-supervised method SPLH and our proposed method, we sample 1K labeled samples from the training set as supervised data to construct the pairwise label matrix S. The source code of LSH, SH and ITQ was downloaded from the authors' websites; these three methods do not need additional parameter tuning. For SPLH, the regularization parameter is obtained by cross-validation over the range [0.1:0.1:0.9, 1:1:9, 10:10:100]. The BMHT algorithm has two sets of parameters, i.e., the multiple hash table parameters and the semi-supervised hashing parameters of each hash table. The semi-supervised hashing parameters are set to the same values as for SPLH. The multiple hash table parameters (the number of hash tables L and the active hash distance d) are user-defined. The parameters selected for SPLH and BMHT are listed in Table 1, where "NaN" denotes that the hashing method does not have that parameter.

Table 1 Parameters selected by SPLH and BMHT in MNIST dataset

Parameter                        SPLH    BMHT
Regularization parameter λ       0.2     0.2
#Hash tables L                   NaN     3
d                                NaN     2
δ                                NaN     2
Learning rate η                  NaN     4

4.2.1 Search by Hamming ranking

In this subsection, we use Hamming ranking as the search algorithm to rank the whole database. Figure 1 compares the recall-precision curves of all methods for different hash code lengths, using semantic labels as ground truth. Our proposed BMHT outperforms all the other methods for hash code sizes from 16 bits to 64 bits. The unsupervised hashing methods, i.e., LSH, SH and ITQ, which perform well at preserving Euclidean distance, are less effective than the semi-supervised hashing algorithms when semantic labels are used as ground truth. The performance of the SH algorithm even decreases as the code length increases, due to its lack of a theoretical guarantee. Because most of the data variance lies in the top few principal directions, the performance of ITQ, which uses PCA, can hardly be improved when the hash code length exceeds 32 bits. Our proposed BMHT method, using only three hash tables, yields a strongly upward trajectory and achieves the best performance, greatly improving both recall and precision. This shows that the boosting in the proposed method is effective.

4.2.2 Search by hash lookup

In this subsection, we use hash lookup as the search algorithm. Given a query, samples with Hamming distance at most 2 from the query are returned as results, and retrieval accuracy and recall percentage are computed. For BMHT, we use the hash lookup algorithm proposed in Section 3.2, in which all items with Hamming distance no larger than 2 in any of the hash tables are returned. Tables 2 to 4 list the results for hash code lengths from 16 bits to 64 bits. The BMHT method outperforms the other methods with a good balance between precision and recall at all three code sizes. In particular, at a hash code length of 64 bits, BMHT achieves a higher precision and a more than three times higher recall rate than SPLH. LSH and SH achieve the best precision when the code length is large, but their recall is rather low (near zero at a code length of 64 bits); LSH and SH return almost no samples using hash lookup within this Hamming radius, which makes their retrieval accuracy meaningless. This suggests that most semantically similar items do not necessarily have a small Euclidean distance.

[Figure 1. Comparison with state-of-the-art methods on the MNIST dataset using semantic labels as ground truth: (a) recall-precision curve at 16 bits; (b) recall-precision curve at 32 bits; (c) recall-precision curve at 64 bits.]

Table 2 Accuracy and recall% at 16 bits using hash lookup with d<=2

16 bits      BMHT      SPLH      ITQ       SH        LSH
Accuracy     74.69%    81.80%    77.04%    67.19%    51.46%
Recall%      36.39%    18.13%    11.00%    5.86%     3.05%

Table 3 Accuracy and recall% at 32 bits using hash lookup with d<=2

32 bits      BMHT      SPLH      ITQ       SH        LSH
Accuracy     92.83%    95.10%    99.06%    98.19%    96.30%
Recall%      16.14%    7.14%     1.94%     0.25%     0.07%

Table 4 Accuracy and recall% at 64 bits using hash lookup with d<=2

64 bits      BMHT      SPLH      ITQ       SH        LSH
Accuracy     98.82%    98.29%    99.95%    100.00%   100.00%
Recall%      4.92%     1.50%     0.15%     0.00%     0.00%

5. Conclusions

In this paper, we proposed a Boosting Multiple Hash Tables algorithm (BMHT) to overcome the inefficient search of current semi-supervised and supervised hashing algorithms. The BMHT algorithm yields a better balance between retrieval accuracy and recall. In our experiments, BMHT significantly improves retrieval recall while retaining a high retrieval accuracy by boosting with only three hash tables. The proposed BMHT algorithm is adaptable to other current hashing algorithms; its generalization to other semi-supervised or supervised hashing algorithms is straightforward. In future work, we will examine the performance of BMHT with different hash code lengths for the hash tables and with combinations of different hashing algorithms.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 61003171 and 61003172), a Program for New Century Excellent Talents in University (No. NCET-11-0162) and the Fundamental Research Funds for the Central Universities 2011ZM0066.

References

[1] J. L. Bentley, "Multidimensional binary search trees used for associative searching," in Proc. Commun. ACM, Sep. 1975, vol. 18, no. 9, pp. 509-517.

[2] A. Guttman, "R-trees: A dynamic index structure for spatial searching," in Proc. ACM SIGMOD, 1984, pp. 47-57.

[3] A. Andoni, P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions", in Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), Berkeley, California, USA, 21-24 October 2006.

[4] Maxim Raginsky, Svetlana Lazebnik, "Locality-Sensitive Binary Codes from Shift-Invariant Kernels", Advances in Neural Information Processing Systems (NIPS), 2009.

[5] Yair Weiss, Antonio Torralba, Rob Fergus, "Spectral Hashing", Advances in Neural Information Processing Systems (NIPS), 2008.

[6] Wei Liu, Jun Wang, Sanjiv Kumar, Shih-Fu Chang, "Hashing with Graphs", In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.

[7] Yunchao Gong, Svetlana Lazebnik, "Iterative Quantization: A Procrustean Approach to Learning Binary Codes", In Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

[8] Mohammad Norouzi, David J. Fleet, "Minimal Loss Hashing for Compact Binary Codes", In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.

[9] Brian Kulis, Prateek Jain, Kristen Grauman, "Fast Similarity Search for Learned Metrics", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 12, pp. 2143-2157, 2009.

[10] Jun Wang, Sanjiv Kumar, Shih-Fu Chang, "Semi-Supervised Hashing for Scalable Image Retrieval", In Proceedings of the 23rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[11] Jun Wang, Sanjiv Kumar, Shih-Fu Chang, "Sequential Projection Learning for Hashing with Compact Codes", In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

[12] Q. Lv, W. Josephson, Z. Wang, M. Charikar, K. Li, "Multi-probe LSH: Efficient indexing for high-dimensional similarity search", In VLDB, pages 950-961, 2007.

[13] Y. Matsushita and T. Wada, "Principal component hashing: An accelerated approximate nearest neighbor search", In PSIVT, pages 374-385, 2009.

[14] Hao Xu, Jingdong Wang, Zhu Li, Gang Zeng, Shipeng Li, Nenghai Yu, "Complementary Hashing for Approximate Nearest Neighbor Search", In Proceedings of the 17th IEEE International Conference on Computer Vision, 2011.

[15] http://yann.lecun.com/exdb/mnist/
