
Coupled Binary Embedding for Large-Scale Image Retrieval

Liang Zheng, Shengjin Wang, Member, IEEE, and Qi Tian, Senior Member, IEEE

Abstract—Visual matching is a crucial step in image retrieval based on the bag-of-words (BoW) model. In the baseline method, two keypoints are considered as a matching pair if their SIFT descriptors are quantized to the same visual word. However, the SIFT visual word has two limitations. First, it loses most of its discriminative power during quantization. Second, SIFT only describes the local texture feature. Both drawbacks impair the discriminative power of the BoW model and lead to false positive matches. To tackle this problem, this paper proposes to embed multiple binary features at indexing level. To model correlation between features, a multi-IDF scheme is introduced, through which different binary features are coupled into the inverted file. We show that matching verification methods based on binary features, such as Hamming embedding, can be effectively incorporated in our framework. As an extension, we explore the fusion of binary color feature into image retrieval. The joint integration of the SIFT visual word and binary features greatly enhances the precision of visual matching, reducing the impact of false positive matches. Our method is evaluated through extensive experiments on four benchmark datasets (Ukbench, Holidays, DupImage, and MIR Flickr 1M). We show that our method significantly improves the baseline approach. In addition, large-scale experiments indicate that the proposed method requires acceptable memory usage and query time compared with other approaches. Further, when global color feature is integrated, our method yields competitive performance with the state-of-the-arts.

Index Terms—Feature fusion, coupled binary embedding, multi-IDF, image retrieval.

    I. INTRODUCTION

THIS paper focuses on the task of large-scale partial-duplicate image retrieval. Given a query image, our target is to find images containing the same object or scene in a large database in real time. Due to the low descriptive power of texts or tags [2], [3], content-based image retrieval (CBIR) has been a hot topic in the computer vision community.

Manuscript received January 1, 2014; revised April 14, 2014; accepted June 2, 2014. Date of publication June 12, 2014; date of current version July 1, 2014. This work was supported in part by the National High Technology Research and Development Program of China (863 Program) under Grant 2012AA011004 and in part by the National Science and Technology Support Program under Grant 2013BAK02B04. The work of Q. Tian was supported in part by the Army Research Office under Grant W911NF-12-1-0057, in part by the Faculty Research Awards through the NEC Laboratories of America, in part by the 2012 UTSA START-R Research Award, and in part by the National Science Foundation of China under Grant 61128007. The associate editor coordinating the review of this manuscript and approving it for publication was Mr. Pierre-Marc Jodoin. (Corresponding authors: Shengjin Wang and Qi Tian.)

L. Zheng and S. Wang are with the State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Electrical Engineering, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]).

Q. Tian is with the Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249-1604 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2014.2330763

One of the most popular approaches to perform such a task is the Bag-of-Words (BoW) model [4]. The introduction of the SIFT descriptor [5] has enabled accurate partial-duplicate image retrieval based on feature matching. Specifically, the BoW model first constructs a codebook via unsupervised clustering algorithms [6], [7]. Then, an image is represented as a histogram of visual words, produced by feature quantization. Each bin of the histogram is weighted with the tf-idf score [4] or its variants [1], [8], [9]. With the inverted file data structure, images are indexed for efficient retrieval.

Essentially, one key issue of the BoW model involves visual word matching between images. Accurate feature matching leads to high image retrieval performance. However, two drawbacks compromise this procedure. First, in quantization, a 128-D double-precision SIFT feature is quantized to a single integer. Although this enables efficient online retrieval, the discriminative power of the SIFT feature is largely lost. Features that lie far from each other may actually fall into the same cell, thus producing false positive matches. Second, state-of-the-art systems rely on the SIFT descriptor, which only describes the local gradient distribution, with rare description of other characteristics of the local region, such as color. As a result, regions which are similar in texture space but different in color space may also be considered a true match. Both drawbacks lead to false positive matches and impair image retrieval accuracy. Therefore, it is undesirable to take the visual word index as the only ticket to visual matching. Instead, the matching procedure should be further checked by other cues, which should be efficient in terms of both memory and time.

A reasonable choice to address the above problem involves the usage of binary features. Typically, the binary features are extracted along with SIFT and embedded into the inverted file. The reason why binary features can be employed for matching verification is two-fold. First, compared with floating-point vectors of the same length, binary features consume much less memory. For example, a 128-D vector takes 512 bytes as a floating-point descriptor but only 16 bytes as a binary feature. Second, during matching verification, the Hamming distance between two binary features can be efficiently calculated via xor operations, while the Euclidean distance between floating-point vectors is very expensive to compute. Previous work along this line includes Hamming Embedding (HE) [1] and its variants [10], [11], which use binary SIFT features for verification. Meanwhile, binary features can also encode spatial context [12] and heterogeneous cues such as color [13].
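As a quick, hedged illustration of this memory and speed argument (not code from the paper), the sketch below compares a Hamming distance computed with xor and popcount against a Euclidean distance on floating-point descriptors; the signature sizes and toy values are only for illustration.

```python
# Minimal sketch (illustrative only): binary signatures are compact and can
# be compared with xor + popcount, unlike floating-point descriptors.
import numpy as np

def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two binary signatures packed into ints."""
    return bin(a ^ b).count("1")

def euclidean_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Euclidean distance between two floating-point descriptors."""
    return float(np.linalg.norm(x - y))

# A 128-D float32 descriptor costs 128 * 4 = 512 bytes; binarized to
# 128 bits it costs only 16 bytes.
sig_a = 0b1011_0010_1110_0001  # toy 16-bit signatures for brevity
sig_b = 0b1011_1010_0110_0001
print(hamming_distance(sig_a, sig_b))  # -> 2
```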

In light of the effectiveness of binary features, this paper proposes to refine visual matching via the embedding of


multiple binary features. On one hand, binary features provide complementary clues to rebuild the discriminative power of the SIFT visual word. On the other hand, in this feature fusion process, binary features are coupled by links derived from a virtual multi-index structure. In this structure, the SIFT visual word and other binary features are combined at indexing level by taking each feature as one dimension of the virtual multi-index. Therefore, the image retrieval process votes for candidate images that are not only similar in the local texture feature, but also consistent in other feature spaces. With the concept of multi-index, a novel IDF scheme, called multi-IDF, is introduced. We show that binary feature verification methods, such as Hamming Embedding, can be effectively incorporated in our framework. Moreover, we extend the proposed framework by embedding binary color feature.

Fig. 1. Two sample image retrieval results from the Holidays dataset. For each query, the baseline results (the first row) and results with color fusion (the second row) are demonstrated. The results start from the second image in the rank list.

This paper argues that feature fusion by coupled binary feature embedding significantly enhances the discriminative power of the SIFT visual word. First, the SIFT binary feature retains more information from the original feature, providing an effective check for visual word matching. Second, the binary color feature gives complementary clues to the SIFT feature (see Fig. 2 for the effects of color fusion). Both aspects serve to improve feature matching accuracy. Extensive experiments on four image retrieval datasets confirm that the proposed method dramatically improves image retrieval accuracy, while remaining efficient as well. Fig. 1 gives some examples where our method returns challenging image candidates while the conventional SIFT-based model fails.

The rest of the paper is organized as follows. After a brief review of related work in Section II, we introduce the proposed binary feature embedding method based on a virtual multi-index in Section III. Section IV presents the experimental results on four benchmark datasets for image retrieval applications. Finally, conclusions are given in Section V.

II. RELATED WORK

This paper aims at improving BoW-based image retrieval via indexing-level feature fusion, so we briefly review four closely related aspects, i.e., feature quantization, spatial constraint encoding, feature fusion, and indexing strategy.

Fig. 2. An example of image matching using the baseline (left) and the proposed fusion (right) method. For each image pair, the query image is on the left. The first row represents matching between relevant images, while the second row contains irrelevant ones. We also show the ranks of the candidate images. We can see that the fusion of color information improves performance significantly.

A. Feature Quantization

Usually, hundreds or thousands of local features, e.g., SIFT [5] or its variants [14], [15], are extracted from an image. To reduce memory cost and speed up image matching, each SIFT feature is assigned to one or a few nearest centroids in the codebook via approximate nearest neighbor (ANN) algorithms [6], [7]. This process is characterized by a significant information loss from a 128-D double-precision vector to a 1-D integer. To reduce quantization error, multiple assignment [16] or soft quantization [17] is employed, which in turn increases the query time and memory overhead. Another choice is the Fisher Vector (FV) [18]. In FV, a Gaussian Mixture Model (GMM) is used to train a codebook. The quantization process is performed softly by estimating the probability that a given feature falls into each Gaussian mixture. Quantization error can also be tackled using binary features. Hamming Embedding [1] generates binary signatures coupled with the SIFT visual word for matching verification. These binary features provide information to filter out false matches, rebuilding the discriminative power of the SIFT visual word. Quantization artifacts can also be addressed by modeling spatial constraints among local features [12], [19], [20]. Another recent trend is the design of codebook-free methods [21], [22] for efficient feature quantization.

B. Feature Fusion

The combination of multiple features has been demonstrated to obtain superior performance in various tasks, such as topic modeling [23], [24], boundary detection [25], character recognition [26], object classification [27], and detection [28], [29]. Typically, early and late fusion are the two main approaches. Early fusion [27] refers to fusing multiple features at pixel level, while late fusion [30] learns semantic concepts directly from unimodal features. In the field of instance retrieval, feature fusion is no easy task due to the lack of sufficient training data. Douze et al. [31] combine the Fisher vector and attributes in a manner equivalent to early fusion. Zhang et al. [32] perform late fusion by combining rank lists of BoW and global features via graph fusion. In [33], co-indexing is employed to augment the inverted index with globally similar images. In [34], multi-level features including the popular CNN feature [35] are integrated, and state-of-the-art results are reported on benchmarks. Wengert et al. [13] use global and local color features to provide complementary information. Their work is similar to ours in that a binary color feature is also integrated into the inverted file. However, in their work the trade-off between the color and SIFT Hamming embeddings is heuristic and dependent on the dataset. Instead, this paper focuses on indexing-level feature fusion by modeling correlations between features, which generalizes across datasets, providing a different view from previous works.

C. Indexing Strategy

The inverted file structure [4] greatly promotes the efficiency of large-scale image retrieval [36]. In essence, the inverted file stores the IDs of the images in which the corresponding visual word appears. A modified inverted file may also include other cues for further visual match checking, such as binary Hamming codes [1], [37], feature position, scale, and orientation [38], [39], etc. For example, Zhou et al. perform on-the-fly spatial coding with the metadata stored in the inverted file. Zheng et al. [37] employ indexing-level feature fusion with a 2-D inverted file, and greatly improve retrieval efficiency. In [40], a Bayes probabilistic model is proposed to merge multiple inverted indices, while cross-indexing [41] traverses two inverted indices in an iterative manner. The closest inspiring work to ours is [42], which addresses the ANN problem via inverted multi-indices built on product quantization (PQ) [43]. In their work, each dimension of the multi-index corresponds to a segment of the SIFT feature vector, so the multi-index is a product of the decomposition of the SIFT feature. Opposite to [42], in this work, we compose SIFT and color features into the multi-index, implicitly performing feature fusion at indexing level. The multi-index used in this paper serves as an illustration of the coupling mechanism and derives the multi-IDF formula, which is a bridge between features.

III. PROPOSED APPROACH

This section provides a formal description of the proposed framework for binary feature embedding.

A. Binary Feature Verification Revisited

The SIFT visual word is such a weak discriminator that false positive matches occur prevalently: dissimilar SIFT features are assigned to the same visual word, and vice versa. To rebuild its discriminative power, binary features are employed to provide further verification for visual word matching pairs.

The Hamming Embedding (HE) proposed in [1] suggests a way to inject the SIFT binary feature into the retrieval system. This paper, however, exploits the embedding of multiple binary signatures from heterogeneous features.

A binary feature can be generated as

$$\mathbf{f} = (f_1, f_2, \ldots, f_n)^T \xrightarrow{\,q(\cdot)\,} \mathbf{b} = (b_1, b_2, \ldots, b_m), \quad (1)$$

where an n-D feature $\mathbf{f}$ is projected into an m-D binary signature $\mathbf{b}$ by the quantizer $q(\cdot)$, and $\mathbf{b}$ is stored in the inverted file.

During online query, given a query feature $x$, its matching strength with an indexed feature $y$ that is quantized to the same visual word as $x$ can be written as

$$f(x, y) = \begin{cases} \exp\left(-d_b^2 / \sigma^2\right), & \text{if } d_b < \tau, \\ 0, & \text{otherwise,} \end{cases} \quad (2)$$

where $d_b$ denotes the Hamming distance between the binary signatures of $x$ and $y$, $\sigma$ is a weighting parameter, and $\tau$ is a predefined threshold. If $d_b$ exceeds $\tau$, then $x$ and $y$ are viewed as a false match and rejected.
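A minimal sketch of the matching function in Eq. 2 is given below; it is an illustration rather than the authors' implementation, and the example parameter values (sigma = 4, tau = 9) are the CN settings reported later in Section IV-B.

```python
import math

def matching_strength(d_b: int, sigma: float, tau: int) -> float:
    """Matching strength of Eq. 2: a Gaussian weight on the Hamming
    distance d_b, with matches rejected once d_b reaches the threshold."""
    if d_b < tau:
        return math.exp(-(d_b ** 2) / (sigma ** 2))
    return 0.0

print(matching_strength(d_b=3, sigma=4.0, tau=9))   # strong match
print(matching_strength(d_b=12, sigma=4.0, tau=9))  # rejected -> 0.0
```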

In this scenario, given a query image $Q$ and a database image $I$, the similarity function between them can be formulated as

$$\mathrm{sim}(Q, I) = \frac{\sum_{x \in Q,\, y \in I} f(x, y) \cdot idf^2}{\|Q\|_2 \, \|I\|_2}, \quad (3)$$

where $idf$ stands for the inverse document frequency (IDF) of the corresponding visual word. For a conventional inverted file, the IDF value of visual word $w_i$ is formulated as

$$idf(w_i) = \log \frac{N}{n_i}, \quad (4)$$

where $N$ represents the overall number of images in the corpus, and $n_i$ denotes the number of images in which $w_i$ appears. The basic idea of IDF is to assign more weight to rare words, and less weight to frequent words.

In this paper, the binary features used are not limited to those derived from the original SIFT [1]. They also involve other heterogeneous binary features, such as the color feature, or potentially, the recently proposed ORB [44], BRISK [45], FREAK [46], etc. In essence, our task is to perform feature fusion at indexing level, bridged by the introduction of the multi-IDF scheme, which differs significantly from previous works on binary verification.

B. Organization of the Inverted File

The inverted file is prevalently used to index database images in the BoW-based image retrieval pipeline. This data structure not only calculates the inner product between images explicitly but, more importantly, enables an efficient online retrieval process. We assume that an image collection possesses $N$ images denoted as $D = \{I_i\}_{i=1}^{N}$. Each image $I_i$ has a set of keypoints $\{x_j\}_{j=1}^{d_i}$, where $d_i$ is the number of keypoints in $I_i$. Given a codebook $\{w_k\}_{k=1}^{K}$ of size $K$, image $I_i$ is quantized to a vector representation $v_i = [v_{i,1}, v_{i,2}, \ldots, v_{i,K}]^T$, where $v_{i,k}$ stands for the response of visual word $w_k$ in $I_i$. A conventional inverted file can be denoted as $W = \{W_1, W_2, \ldots, W_K\}$. In $W$, each entry $W_i$ contains a list of postings. For an image-level inverted file, each posting stores the image ID and the TF score. For a keypoint-level inverted file, each posting stores the image ID and other metadata [1], [38], [47] associated with the indexed keypoint.
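The following sketch illustrates one plausible layout of such a keypoint-level inverted file; the field names (e.g., sift_signature, color_signature) are hypothetical, since the paper only specifies that each posting stores the image ID plus binary metadata.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Posting:
    """Metadata of one indexed keypoint (field names are hypothetical)."""
    image_id: int
    sift_signature: int   # e.g., a 64-bit HE binary signature
    color_signature: int  # e.g., a 22-bit binarized CN feature

class InvertedFile:
    """Keypoint-level inverted file: one posting list per SIFT visual word."""
    def __init__(self, codebook_size: int):
        self.codebook_size = codebook_size
        self.entries = defaultdict(list)  # visual word id -> list of postings

    def add(self, word_id: int, posting: Posting) -> None:
        self.entries[word_id].append(posting)

    def candidates(self, word_id: int) -> list:
        """Postings sharing the query keypoint's visual word."""
        return self.entries[word_id]
```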


Fig. 3. Structure of the keypoint-level inverted file. Each posting in the list stores the information of an indexed keypoint, e.g., the image ID, binary features, etc.

In this paper, a keypoint-level inverted file is organized as shown in Fig. 3. Similar to Hamming Embedding [1], the binary features of each SIFT visual word are stored in the corresponding entry of the 1-D inverted file. However, our work differs from [1] in two aspects. First, we illustrate the feasibility of multiple feature fusion at indexing level. Second, a multi-IDF scheme (see Section III-D) is employed, which takes advantage of the multi-index structure (see Section III-C) and feature correlation.

C. A Multi-Index Illustration

In this section, we provide an alternative explanation of the proposed method from the perspective of the multi-index structure, which stands as the foundation of the multi-IDF formula (see Section III-D).

With $M$ kinds of features, the dimension of the multi-index is $M$. Each dimension corresponds to a conventional inverted file of feature $F_m$, $m = 1, 2, \ldots, M$. Building the multi-index then proceeds as follows. First, for each keypoint $x_i$ in an image, multiple descriptors $(f_i^0, f_i^1, \ldots, f_i^M)$ are computed. Then, the descriptors associated with a keypoint are quantized into a visual word tuple $(w_i^0, w_i^1, \ldots, w_i^M)$ using the codebooks $\{C_m\}_{m=0}^{M}$ of each feature. Finally, for each tuple, an entry in the multi-index is identified, where the metadata of this keypoint can be stored.

During online retrieval, each image is represented by a bag of word tuples as in the offline phase. In this manner, every keypoint is described by multiple features. Then, for each word tuple, we find and vote for the candidate images from the corresponding entry in the multi-index.
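A minimal sketch of this virtual multi-index, assuming simple integer visual word IDs per feature, is shown below; it only illustrates the word-tuple addressing and voting, not the authors' actual implementation.

```python
from collections import defaultdict

class VirtualMultiIndex:
    """Entries are addressed by a word tuple (w^0, w^1, ..., w^M)."""
    def __init__(self):
        self.entries = defaultdict(list)  # word tuple -> indexed image IDs

    def index_keypoint(self, word_tuple, image_id):
        """Offline: store the keypoint under its visual word tuple."""
        self.entries[tuple(word_tuple)].append(image_id)

    def vote(self, word_tuple, scores=None):
        """Online: vote for images found in the entry of the query tuple."""
        scores = {} if scores is None else scores
        for image_id in self.entries[tuple(word_tuple)]:
            scores[image_id] = scores.get(image_id, 0.0) + 1.0
        return scores

# Toy usage with two features (a SIFT word and a binarized color word):
index = VirtualMultiIndex()
index.index_keypoint((123456, 0b1100000000110000000000), image_id=7)
print(index.vote((123456, 0b1100000000110000000000)))  # {7: 1.0}
```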

1) Embedding Binary Features: This paper embeds binary features into the SIFT visual word framework. Due to its bit-wise nature, each binary feature equals a decimal number, so the binary feature itself can be viewed as a visual word: there is no need to train a codebook explicitly [21]. The reason why we use binary features instead of traditional visual words is that a coarse-to-fine mechanism is implied in binary features. Basically, the Hamming distance between two binary features represents their similarity, while the traditional visual word only allows a hard matching mode. A binary feature of $k$ bits corresponds to a codebook of size $2^k$. Consequently, we can adapt each binary feature to the virtual multi-index easily. Specifically, the SIFT binary features involved in [1] can be coupled with the SIFT visual word in the virtual multi-index as well.

Fig. 4. An example of a binary multi-index fusing SIFT and color features. The codebook sizes are 1M and $2^{22}$, respectively. During online retrieval, the entry corresponding to the word tuple (S4, C4) is located. Then, the neighborhood entries¹ are also checked as an implementation of Multiple Assignment (MA). Color indicates the weight of these entries. A darker color signifies a larger weight.

¹It refers to neighborhood in the feature space, but for better illustration, we place them as neighbors in the inverted file.

To boost the recall of candidate images, Multiple Assignment (MA) [16], [17] is employed. Suppose $r_0, r_1, \ldots, r_M$ nearest neighbors are used in the quantization of the query features; then a weight is computed for each assignment according to its distance to the cluster center. Similar to [17], the weight takes the form of $\exp(-d/\sigma)$, where $d$ is the distance to the center and $\sigma$ is the weighting factor. For the SIFT visual word, we adopt the Euclidean distance, while the Hamming distance is used for binary features. A 2-D multi-index fusing SIFT and color features is illustrated in Fig. 4.
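As a sketch of this soft weighting (the distance values below are illustrative; only the SIFT-side sigma = 5000 is stated later in Section IV-B):

```python
import math

def assignment_weight(distance: float, sigma: float) -> float:
    """Weight of a multiple-assignment neighbor: exp(-d / sigma)."""
    return math.exp(-distance / sigma)

# SIFT words use the Euclidean distance to the cluster center; binary
# features use the Hamming distance instead.
print(assignment_weight(distance=2500.0, sigma=5000.0))  # SIFT neighbor
print(assignment_weight(distance=2.0, sigma=4.0))        # binary neighbor
```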

Note that the memory cost of the multi-index grows rapidly with the number of dimensions, leading to memory inefficiency. As a result, in practice, the multi-index structure is used to calculate the multi-IDF only, and the inverted file in Fig. 3 is what is actually used.

D. Multi-IDF Formula

In this section, we introduce the multi-IDF formula, which bridges various features in the multi-index.

1) Conventional IDF: In essence, as Robertson [48] suggests, the IDF weight can be interpreted in terms of probability, i.e., the probability that a random image $I$ contains visual word $w_i$, estimated as

$$P(w_i) = P(w_i \text{ occurs in } I) \approx \frac{n_i}{N}. \quad (5)$$

In Eq. 5, it is assumed that the occurrences of different visual words across the image database are statistically independent.


Note that the independence assumption is also taken in the whole BoW model. Under this assumption, the IDF value can be estimated as the inverse of the fraction in Eq. 5, followed by a logarithm operator,

$$idf(w_i) = -\log P(w_i) = \log \frac{N}{n_i}. \quad (6)$$

Note that the reason a $\log(\cdot)$ operator is used is associated with the idea of an addition-based scoring function (see Eq. 8 below).

2) Proposed IDF: Without loss of generality, we illustrate the case of a 2-D multi-index. For each entry, its IDF value is determined by the probability that a visual word tuple $(w_i^0, w_j^1)$ occurs in an arbitrary image $I$. For simplicity, assume features $F_0$ and $F_1$ are independent, i.e., at each keypoint, the value distributions of the two features do not affect each other. Therefore, the occurrences of visual words $w_i^0$ and $w_j^1$ are also independent. We can derive

$$P(w_i^0, w_j^1) = P(w_i^0)\,P(w_j^1) \approx \frac{n_i^0}{N} \cdot \frac{n_j^1}{N}, \quad (7)$$

where $n_i^0$ and $n_j^1$ stand for the number of images containing visual word $w_i$ of feature $F_0$ and $w_j$ of feature $F_1$, respectively.² Following Eq. 6, the IDF value of this entry can be calculated as

$$idf(w_i^0, w_j^1)_{indep} = -\log P(w_i^0, w_j^1) = -\bigl(\log P(w_i^0) + \log P(w_j^1)\bigr) = idf(w_i^0) + idf(w_j^1). \quad (8)$$

The advantage of the multi-IDF is two-fold. First, it greatly reduces the computational complexity from $O(K^{(0)} \cdot K^{(1)})$ to $O(K^{(0)} + K^{(1)})$: only the IDF value of each individual feature codebook needs to be calculated. In the case of a higher-order multi-index, the gain in computational efficiency would be more prominent. Second, the multi-IDF in Eq. 8 can be adopted easily when more features are fused under the assumption of independence between features.

Nonetheless, when the independence assumption does not hold, the multi-IDF formula undergoes some small changes. Consider the extreme case where two features are perfectly dependent, e.g., they are the same feature; the IDF formula is then the same as the conventional IDF,

$$idf(w_i^0, w_j^1)_{dep} = idf(w_i^0) = idf(w_j^1). \quad (9)$$

In light of Eq. 8 and Eq. 9, the IDF value for two partially dependent features can be viewed as a weighted sum of each individual IDF,

$$idf(w_i^0, w_j^1) = idf(w_i^0) + t \cdot idf(w_j^1), \quad t \in [0, 1], \quad (10)$$

where the parameter $t$ measures the independence of features $F_0$ and $F_1$. Larger independence leads to a larger $t$. Perfect independence and dependence correspond to $t = 1$ and $t = 0$, reducing Eq. 10 to Eq. 8 and Eq. 9, respectively.

²We note that Eq. 7 calculates the probability that the two words occur in the same image, not at the same spot. However, the two probabilities differ by a constant factor $n$, i.e., the average number of visual words or keypoints in an image, which can be omitted afterwards.

Fig. 5. An example of a visual match. Top: A matched SIFT pair between two images. The 11-D color name descriptors of the matched keypoints in the left (middle) and right (bottom) images are presented below. Also shown are the prototypes of the 11 basic colors (colored discs). In this example, the two matched keypoints differ a lot in color, and are thus considered a false positive match.

Eq. 10 can be generalized to the multiple feature case as

$$idf(w_i^0, w_{j_1}^1, \ldots, w_{j_M}^M) = idf(w_i^0) + \sum_{m=1}^{M} t_m \cdot idf(w_{j_m}^m), \quad (11)$$

where $t_m \in [0, 1]$ is the correlation of $F_0$ and $F_m$. In this paper, we take $F_0$ as the principal feature (SIFT visual word), and $F_m$ as the auxiliary features (binary features). Note that when multiple features are fused in the multi-index, it is a multi-dimensional indexing structure in which each dimension corresponds to a feature.

Specifically, if we set $t_m$ to 0, the proposed framework reduces to the case of binary feature verification. Coupled with the multi-index, Eq. 11 helps generalize binary feature embedding by building links among these features. The effectiveness of the multi-IDF scheme is evaluated in Section IV-C.
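A minimal numerical sketch of Eq. 6 and Eq. 11 follows; the document counts are toy numbers, while t = 0.95 matches the estimate used later for the CN feature.

```python
import math

def idf(n_word: int, n_images: int) -> float:
    """Conventional IDF of a single visual word (Eq. 6)."""
    return math.log(n_images / n_word)

def multi_idf(principal_idf: float, auxiliary_idfs, correlations) -> float:
    """Multi-IDF of a word tuple (Eq. 11): the principal IDF plus the
    correlation-weighted IDFs of the auxiliary binary features."""
    return principal_idf + sum(t * v for t, v in zip(correlations, auxiliary_idfs))

# Toy example: a SIFT word occurring in 50 of 10,000 images and one
# auxiliary binary color word occurring in 400 images, with t = 0.95.
sift_idf = idf(50, 10_000)
cn_idf = idf(400, 10_000)
print(multi_idf(sift_idf, [cn_idf], [0.95]))
```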

E. Fusion of Color Feature

In this paper, we embed the color feature with SIFT at indexing level. Fig. 2 and Fig. 5 provide examples of how the color feature helps filter out false matches.


1) Color Descriptor: This paper employs the Color Names (CN) descriptor [28] for two reasons. First, it is shown in [28] that CN has superior performance compared with several commonly used color descriptors such as the robust hue descriptor [49] and the opponent derivative descriptor [49]. Second, although colored SIFT descriptors such as HSV-SIFT [50] and HueSIFT [51] provide color information, these descriptors typically lose some invariance properties and are high-dimensional [52].

Basically, the CN descriptor assigns to each pixel an 11-D vector, of which each dimension encodes one of the eleven basic colors: black, blue, brown, grey, green, orange, pink, purple, red, white, and yellow. The effectiveness of CN has been validated in image classification and detection applications [27], [28]. We further test it in the scenario of image retrieval.

2) Feature Extraction: At each keypoint, two descriptors are extracted, i.e., a SIFT descriptor and a CN descriptor. In this scenario, SIFT is extracted with the standard algorithm [5]. As for CN, we first compute the CN vectors of the pixels surrounding the keypoint, over an area proportional to the scale of the keypoint. Then, we take the average CN vector as the color feature. The two descriptors of a keypoint are individually quantized, binarized, and fed into our model.
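A hedged sketch of this extraction step is given below. The per-pixel mapping to an 11-D color name vector is the learned lookup of [28], which is not reproduced here; pixel_to_cn is a hypothetical stand-in, and the square patch is one simple way to realize the scale-proportional area.

```python
import numpy as np

def extract_cn_descriptor(image: np.ndarray, x: int, y: int, scale: float,
                          pixel_to_cn) -> np.ndarray:
    """Average the 11-D CN vectors of pixels around a keypoint, over a
    patch whose size is proportional to the keypoint scale."""
    radius = max(1, int(round(scale)))        # patch size ~ keypoint scale
    h, w = image.shape[:2]
    x0, x1 = max(0, x - radius), min(w, x + radius + 1)
    y0, y1 = max(0, y - radius), min(h, y + radius + 1)
    cn_sum, count = np.zeros(11), 0
    for v in range(y0, y1):
        for u in range(x0, x1):
            cn_sum += pixel_to_cn(image[v, u])  # 11-D color name vector
            count += 1
    return cn_sum / max(count, 1)
```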

3) Binarization: Because the CN descriptor has explicit semantic meaning in each dimension, we do not adopt the classical clustering method to perform quantization. Instead, we directly convert a CN vector into a binary feature, which itself can be viewed as a distinct visual word [21]. Specifically, we try two binarization schemes, producing an 11-bit vector $b^{(11)}$ and a 22-bit vector $b^{(22)}$, respectively. Suppose the CN vector is written as $(f_1, f_2, \ldots, f_{11})^T$; the binarization proceeds as follows,

$$b_i^{(11)} = \begin{cases} 1, & \text{if } f_i \geq th, \\ 0, & \text{if } f_i < th, \end{cases} \quad (12)$$

$$\left(b_i^{(22)}, b_{i+11}^{(22)}\right) = \begin{cases} (1, 1), & \text{if } f_i > th_1, \\ (1, 0), & \text{if } th_2 < f_i \leq th_1, \\ (0, 0), & \text{if } f_i \leq th_2, \end{cases} \quad (13)$$

where $b_i$ ($i = 1, 2, \ldots, 11$) is the $i$th entry of the resulting binary feature. The thresholds are $th = g_3$, $th_1 = g_2$, and $th_2 = g_5$, where $(g_1, g_2, \ldots, g_{11})$ is the vector $(f_1, f_2, \ldots, f_{11})$ sorted in descending order. A comparison of the two quantizers (Eq. 12 and Eq. 13) is shown in Section IV-C. Intuitively, the binarization schemes introduced in Eq. 12 and Eq. 13 represent a uniform partition (similar to the lattice partitioning in [53]) of the CN feature space. In fact, a vital difference between the CN vector and the SIFT vector is that each entry of the CN vector has an explicit physical meaning. In Eq. 13, we propose to assign (1, 1) to the two most salient colors of a local region and (0, 0) to the least dominant colors. We speculate that two regions are of a similar color if they have the same dominant colors, and vice versa. Another consideration of Eq. 13 is that the minor colors may be subject to the impact of illumination or viewpoint changes. In this manner, this binarization scheme is more robust to image variations. On the other hand, one may ask why we choose $g_2$ and $g_5$ instead of a more uniform threshold setting. In fact, the region for CN extraction is quite small: in typical cases it covers tens of pixels. In such a small patch, the number of dominant colors is small (see Fig. 5 for an example). After counting the several dominant colors, approximately half of the CN dimensions are close to zero. Therefore, we use the presented thresholds, which also yield satisfying performance in the experiments.

Fig. 6. Sample images used to calculate the correlation coefficients. In this example, images in each row are obtained by the same text query. The text queries used to crawl these images are (from top to bottom): Asia, Atlanta, Barbados, and Brazil.

Therefore, in this paper, we do not employ the binarization method proposed in [1]. Nevertheless, this paper provides a comparison with the standard Locality-Sensitive Hashing (LSH) [54] as well as the state-of-the-art Kernelized Locality-Sensitive Hashing (KLSH) [55] methods in Section IV-C.
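A minimal sketch of the two quantizers in Eq. 12 and Eq. 13 follows; the bit layout of the 22-bit code mirrors the tuple (b_i, b_{i+11}) above, and the sample CN vector is invented for illustration.

```python
import numpy as np

def binarize_cn(cn: np.ndarray):
    """CN binarization sketch: an 11-bit code (Eq. 12) and a 22-bit code
    (Eq. 13), with thresholds taken from the sorted CN values."""
    g = np.sort(cn)[::-1]               # (g1, ..., g11), descending order
    th, th1, th2 = g[2], g[1], g[4]     # th = g3, th1 = g2, th2 = g5
    b11 = [1 if f >= th else 0 for f in cn]
    b_low = [1 if f > th2 else 0 for f in cn]    # b_i,      i = 1..11
    b_high = [1 if f > th1 else 0 for f in cn]   # b_{i+11}, i = 1..11
    return b11, b_low + b_high

# Toy example: a patch dominated by red and white.
cn = np.array([0.02, 0.01, 0.03, 0.02, 0.01, 0.02, 0.01, 0.02, 0.45, 0.38, 0.03])
print(binarize_cn(cn))
```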

4) Estimation of Feature Correlation: To determine $t$ in Eq. 10, we propose a simple scheme to measure the correlation between features. We take as an example calculating the feature correlation between the SIFT visual word and binary CN. To this end, we crawled 200K high-resolution images uploaded by users to Flickr using the names of 60 countries and regions across the world. These images are generally high-resolution, with the most common size being 1024 × 768. The content of the images is very diverse, from scenes to objects, which can be viewed as a good representation of natural images. Some sample images are shown in Fig. 6. From the images, we extract over 2 × 10⁹ (SIFT, CN) feature tuples. Then, we perform quantization on the two features: a classical codebook for SIFT, and the 22-bit quantizer for CN. Further, two histograms are calculated using these data: for feature tuples with the same/different SIFT visual word, we compute the Hamming distance of the CN features, and calculate the normalized distance histogram. Finally, we calculate the correlation coefficient of the two histograms, which serves as the estimation of the feature correlation parameter $t$.

Briefly, the intuition is that, for highly independent features, the two histograms should be very similar: whether or not the SIFT visual words of a keypoint pair are the same, the Hamming distance of the other feature is not affected. In this case, the correlation coefficient of the two histograms is close to 1, which also means a larger $t$. The dependent features case can be analyzed in a similar way. Results of the feature correlation calculation are presented in Fig. 7. Specifically, we present the statistical results obtained on the Flickr200K and Holidays datasets. Note that the reason the Holidays dataset is included is that the correlation coefficient may be subject to dataset bias, or dataset dependency. The correlation coefficients on the two datasets are 0.950 and 0.931, respectively. This suggests that the global correlation between the SIFT visual word and the CN feature may be independent of the dataset.

Fig. 7. Hamming distance histograms of the CN feature given identical or different SIFT visual words, codebook size = 20K. Statistical results on (a) Flickr200K and (b) Holidays datasets are presented. Results on the Holidays dataset are used to analyze the dataset independence of the feature correlation. The correlation coefficients of the two histograms on the two datasets are 0.950 and 0.931, respectively. We take 0.950 as an estimation of $t$ in Eq. 10.
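The sketch below illustrates the histogram-correlation estimate of t under stated assumptions: it uses synthetic Hamming distances in place of the real (SIFT, CN) tuples, and a Pearson correlation computed with numpy.

```python
import numpy as np

def correlation_t(dists_same_word, dists_diff_word, n_bits: int = 22) -> float:
    """Pearson correlation of the normalized CN Hamming-distance histograms
    for keypoint pairs with the same vs. different SIFT visual word."""
    bins = np.arange(n_bits + 2)
    h_same, _ = np.histogram(dists_same_word, bins=bins, density=True)
    h_diff, _ = np.histogram(dists_diff_word, bins=bins, density=True)
    return float(np.corrcoef(h_same, h_diff)[0, 1])

# Toy usage with synthetic distances (the paper obtains t ~ 0.95 on real data):
rng = np.random.default_rng(0)
same = rng.binomial(22, 0.30, size=10_000)
diff = rng.binomial(22, 0.33, size=10_000)
print(correlation_t(same, diff))
```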

IV. EXPERIMENTS

To evaluate the effectiveness of our method, we conducted experiments on four public benchmark datasets: the Ukbench [7], Holidays [1], DupImage [38], and MIR Flickr 1M [56] datasets.

A. Datasets

1) Ukbench: The Ukbench dataset consists of 10,200 images in 2,550 groups. Each group contains four images of the same object or scene, taken from different viewpoints. Each of the 10,200 images is taken as a query image. The performance is measured by the recall of the top-4 candidates, referred to as the N-S score (maximum 4).

2) Holidays: The Holidays dataset is composed of 500 queries from 1,491 annotated personal holiday photos. mAP (mean Average Precision) is used to evaluate the performance.

3) DupImage: The DupImage dataset contains 1,104 images from 33 annotated groups. From this ground truth dataset, 108 representative queries are selected, and mAP is employed to evaluate image retrieval performance.

4) MIR Flickr 1M: This dataset contains 1 million images retrieved from Flickr. We add this dataset to the Holidays, Ukbench, and DupImage datasets to test the scalability of the proposed method.

B. Parameter Analysis

Two parameters are used in CN Hamming Embedding and SIFT Hamming Embedding (Eq. 2), respectively: the Hamming distance threshold $\tau$ and the weighting factor $\sigma$. For SIFT HE, we use the same parameter setting as in [1] and [57]. For CN, we vary the values of $\tau$ and $\sigma$, and compare mAP on the Holidays and DupImage datasets. The results are demonstrated in Fig. 8. From Fig. 8(a), it is shown that as the weighting factor $\sigma$ increases, mAP first increases and then drops after the peak at $\sigma = 4$. In Fig. 8(b), the mAP performance increases with $\tau$, and stays stable when $\tau$ is large. With these results, we set $\sigma = 4$ and $\tau = 9$ for CN in the following experiments.

Fig. 8. Impact of (a) the weighting parameter $\sigma$ and (b) the Hamming distance threshold $\tau$ on retrieval accuracy. Results on the Holidays and DupImage datasets with a 1M codebook are presented. Here, BoW + CN is used.

For MA on the SIFT side, we set the number of nearest neighbors $r$ to 3, and the weighting factor $\sigma$ to 5000. As for SIFT Hamming Embedding, we choose the same parameters as [16].

C. Evaluation

1) Color Binarization: To evaluate the effectiveness of the binarization method in Eq. 12 and Eq. 13, we compare it with the Locality-Sensitive Hashing (LSH) algorithm [54], which is designed for the Euclidean space. We try the LSH method with different numbers of bits used for the hash keys on the three datasets. We also present results obtained by Kernelized Locality-Sensitive Hashing (KLSH) [55] with a Gaussian RBF kernel. The number of bits is set to 22 for KLSH. The results are presented in Table I.

Table I indicates that although LSH with 22 bits performs slightly better on Holidays, when the same number of bits is used, this paper generally yields higher results than both LSH and KLSH. As suggested in Section III-E, the CN descriptor has an explicit attribute in each dimension, and this paper considers dominant colors only, which is less affected by descriptor noise. Moreover, the dimension of the CN vector is not high, which may deteriorate the performance of the hashing algorithms. For these reasons, the proposed method produces better performance with the CN descriptor.

TABLE I
COMPARISON WITH LSH METHODS

TABLE II
COMPARISON OF CONVENTIONAL IDF AND MULTI-IDF

2) Multi-IDF: In our experiment, SIFT codebooks of various sizes are trained to calculate the feature correlation between the SIFT visual word and the binary CN feature, as illustrated in Fig. 7. With the increase of codebook size, the correlation coefficient decreases a little (t = 0.94 when the codebook size is 1M). Because the value of t does not change much with the codebook size, and Table III favors a value between 0.9 and 1, we set t to 0.95 for the binary CN feature in the following experiments.

Table II compares the performance of the conventional SIFT IDF (conv. IDF) with the proposed IDF when CN is fused (BoW + CN in this case). The codebook size is set to 1M. As expected, the multi-IDF improves accuracy consistently on all three datasets. Specifically, improvements over the conventional IDF include an N-S score of +0.036, and mAP of +1.9% and +1.1%, respectively, on these datasets. A similar situation can be observed from Table IV: for codebooks of size 200K or 250K, the multi-IDF method brings benefits as well. In its nature, the multi-IDF is derived from the concept of the multi-index, so it considers the contributions of both the SIFT visual word and the color feature, while the conventional IDF only takes into account the principal feature. Therefore, the multi-IDF serves as a bridge in coupling different binary features. In the following experiments, the multi-IDF is employed in every binary feature embedding scheme.

3) Impact of Codebook Size: On the SIFT side, we construct codebooks of various sizes using the Approximate K-Means (AKM) algorithm [6]. For CN, two quantizers are compared, i.e., two binary codebooks of size $2^{11}$ and $2^{22}$. The results are presented in Fig. 9.

Fig. 9 demonstrates that with the increase of the SIFT codebook size, the retrieval accuracy improves consistently. This is consistent with many previous works [6], [7], [38]: a finer partition of the SIFT space enhances the discriminative power. Then, the introduction of the color feature significantly promotes the retrieval performance for various SIFT codebook sizes, which clearly demonstrates the complementary nature of color to the SIFT visual word. Moreover, the color codebook of size $2^{22}$ is shown to outperform that of size $2^{11}$ (also shown in Table I). In essence, this larger codebook corresponds to a more informative quantization scheme. The quantizer in Eq. 13 gives more weight to the two most salient colors than Eq. 12: two keypoints are more likely to be irrelevant if their dominant colors are different. In addition, Multiple Assignment (MA) of the SIFT visual word moderately improves accuracy.

TABLE III
IMPACT OF t ON THE THREE DATASETS

4) Combination of Hamming Embedding: To test the generalization of our method, we further extend the framework to include Hamming Embedding [1]. In this scenario, two binary features, i.e., the binarized Color Names (CN) and the SIFT Hamming Embedding (HE) signature, are fused with the SIFT visual word baseline. Specifically, we calculate a 64-bit HE signature for each keypoint, and use the first 16 bits to calculate the HE IDF. The feature correlation coefficient between HE and the SIFT visual word is 0.99. The results are summarized in Table IV and Fig. 10.

These results demonstrate the complementary nature of the SIFT visual word, HE, and CN: when applied separately, both HE and CN improve the accuracy of the SIFT visual word baseline; when they are combined in our framework (CN + HE), further improvement can be achieved.

We also note that HE outperforms CN when they are fused with the SIFT visual word separately. Actually, HE is directly derived from the SIFT feature and retains much more discriminative power than the visual word. On the other hand, CN is an auxiliary feature to SIFT, so its effect may be indirect. Also, HE is stored as an 8-byte signature while CN costs less than 3 bytes, so the big disparity in the amount of information they carry may be another reason.

In addition, to further support our contribution, we build two inverted files for the SIFT and color descriptors, respectively. We train a 200K codebook for SIFT, and a 200 codebook for CN. The binary features are injected into the inverted files separately. Then, we merge the scores to rank the retrieved images. On the Holidays dataset, the SIFT and CN inverted files produce mAP = 0.739 and 0.505, respectively. When the scores are merged (a simple addition), we obtain mAP = 0.635, which is significantly lower than the mAP = 0.796 reported in this paper. The reason is that the CN descriptor alone is not a good discriminator compared with SIFT. Due to its low dimension and sensitivity to illumination changes, it may produce many more false matches than SIFT. Therefore, the CN feature is more effective when used as a complement to SIFT, improving matching precision, as done in this paper.


TABLE IV
COMPARISON OF IMAGE RETRIEVAL ACCURACY FOR VARIOUS METHODS ON BENCHMARK DATASETS

Fig. 9. Impact of the codebook size of SIFT and binary CN features on image retrieval accuracy. The N-S score for (a) Ukbench and mean Average Precision (mAP) for (b) Holidays and (c) DupImage datasets are presented.

Fig. 10. Large-scale experiments with the MIR Flickr 1M dataset. The N-S score for (a) Ukbench and mAP for (b) Holidays and (c) DupImage datasets are presented. Different fusion strategies are compared. Note that the multi-IDF is applied in all but the baseline methods.

For a thorough evaluation, we provide additional results on the fusion of more features. Specifically, another two features are employed, i.e., the H-S histogram and the histogram of the rotation-invariant LBP descriptor [58]. The feature dimensions are 1000-D and 10-D, respectively. The two features are binarized using LSH into 64 bits and 32 bits, respectively. Results are shown in Table V. We can see that the integration of the H-S histogram helps further improve performance, but LBP has a marginal effect on performance. Since more bits of the color feature are considered, it is natural that the H-S histogram brings more benefits. On the other hand, as LBP, like SIFT, describes a texture feature, it provides few complementary cues for the fusion task. Due to memory and accuracy considerations, in the following experiments, we stick to the HE + CN fusion method.

5) Effectiveness of Graph Fusion as a Post-Processing Step: Post-processing steps are effective in improving the accuracy of an image retrieval system. Popular choices include spatial verification (SP) based on RANSAC [6], Query Expansion (QE) [59], k-NN reranking [60], etc. In this paper, we adopt the Graph Fusion (GF) technique introduced in [32] and improved in [61] for two reasons. First, query images in the Holidays and Ukbench datasets typically have few ground truth images (1-2 in Holidays, 3 in Ukbench), so QE and k-NN methods may perform poorly. Second, in this paper, we are interested in the integration of color information, and GF is shown to yield good results for global color features.

TABLE V
IMPACT OF INTEGRATION OF MORE FEATURES

TABLE VI
RESULTS BY GRAPH FUSION OF GLOBAL HSV HISTOGRAM

In more detail, we extract a 1000-D HSV histogram for each image. Following [62], the HSV feature vector is first L1-normalized, and then a square root scaling is applied to each dimension. We name this HSV variant rootHSV. Then, during the off-line stage, two image graphs [63] are constructed corresponding to the rootHSV and BoW features, respectively. Given a query image, the rank lists obtained by our method and by rootHSV are merged using GF. Experiments are conducted on the Holidays and Ukbench datasets, with results presented in Table VI. These results indicate that rootHSV significantly outperforms the HSV histogram. The reason is that rootHSV suppresses the impact of the color burstiness [57] phenomenon, thus being more robust to condition changes. Moreover, when combined with the global color feature, the performance of the proposed method is boosted. This demonstrates the complementary nature of local and global color descriptors. In fact, while the local CN descriptor functions to improve precision at feature level, the global HSV feature further improves recall at image level.
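A minimal sketch of the rootHSV transform described above (L1 normalization followed by an element-wise square root); the 1000-D histogram itself is assumed to be given.

```python
import numpy as np

def root_hsv(hist: np.ndarray) -> np.ndarray:
    """rootHSV: L1-normalize the HSV histogram, then take the element-wise
    square root to suppress color burstiness."""
    hist = np.asarray(hist, dtype=np.float64)
    total = hist.sum()
    if total > 0:
        hist = hist / total
    return np.sqrt(hist)

# e.g., hist = 1000-D HSV histogram of an image; root_hsv(hist) is then
# used to build the image graph fused with the BoW rank list.
```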

6) Scalability: To evaluate the scalability of the proposed binary feature embedding method, we populate the Ukbench, Holidays, and DupImage datasets with various fractions of the MIR Flickr 1M dataset. The image retrieval accuracy is plotted against the database size, as shown in Fig. 10.

Fig. 10 shows that for each database size, the fusion of binary features outperforms the baseline. Specifically, on the 1M datasets, the CN + HE + MA approach improves the baseline significantly: the N-S score from 3.05 to 3.49 (+0.44) on Ukbench, mAP from 0.321 to 0.721 (+0.400) on Holidays, and mAP from 0.270 to 0.644 (+0.374) on DupImage. Moreover, as the database is scaled up, the accuracy of our method drops more slowly: more improvement is achieved on larger databases. Consequently, the proposed method is shown to work well in large-scale settings, validating its scalability.

TABLE VII
AVERAGE QUERY TIME (S) ON HOLIDAYS + 1M DATASET

TABLE VIII
MEMORY COST FOR DIFFERENT APPROACHES

7) Time Efficiency: Efficiency is an important issue in various computer vision tasks [37], [68]. In our experiment, results are obtained on a server with a 2.40 GHz CPU and 32 GB memory. The average feature extraction time for SIFT and the color feature is 0.657 s on the 1M dataset, and SIFT quantization takes 0.81 s on average using a 1M codebook. Table VII compares the average query time of different fusion approaches. The time cost of feature extraction and quantization is not included. It takes the baseline approach 0.106 s to perform one query on the 1M dataset. The fusion of the color feature marginally increases the retrieval time to 0.133 s per query, mainly due to the use of the matching weight defined in Eq. 2. SIFT Hamming Embedding is slightly more efficient than CN because it filters out a considerable number of false matches. Moreover, the combination of CN and HE consumes 0.145 s to finish one online query, which is acceptable considering the significant gain in accuracy.

In the last column of Table IX, we compare query time with some state-of-the-art methods. Note that, since query time depends on many factors, such as the machine used, the number of features in the 1M dataset, etc., it is not directly comparable. But it can roughly indicate the time efficiency of our method.

8) Memory Cost: We compare the memory cost of all fusion approaches considered in this paper in Table VIII. Here, we list the reported memory usage for each indexed feature, and re-compute the memory usage on the 1M dataset as in our case. For each indexed keypoint, the baseline uses 4 bytes to store the image ID (one entry per descriptor [1]). In HE, 8 bytes are allocated for the 64-bit binary SIFT signature. In comparison, the fusion of CN costs less than 3 bytes to store the 22-bit binary color feature, yet still brings a competitive performance improvement. On the 1M dataset, the fusion of HE and CN takes 6.1 GB of memory in total. We also compare the memory usage of our method with some state-of-the-art systems in Table VIII. The Bag-of-Colors [13] method, which is most similar to ours, consumes 8 bytes of memory for each feature. We speculate that when more bits are used for color feature embedding, the performance could be further improved. On the other hand, methods such as [47] and [16] embed discriminative clues such as spatial context, orientation, and scale into the inverted file, which can be well complementary to our method.

TABLE IX
COMPARISON WITH STATE-OF-THE-ART METHODS WITHOUT POST-PROCESSING STEPS

TABLE X
COMPARISON WITH STATE-OF-THE-ART METHODS WITH POST-PROCESSING STEPS

9) Comparison With State-of-the-Art Methods: We first provide a comparison with state-of-the-art methods without post-processing steps in Table IX. Our method achieves N-S = 3.60 and mAP = 0.796 on the Ukbench and Holidays datasets, respectively. It is shown that our method compares favorably to the state of the art on Ukbench. On the Holidays dataset, the result reported in [16] is slightly higher, because in their work, weak geometric consistency (WGC) is used, which is absent in this paper. In fact, spatial constraints typically lead to a performance gain in partial-duplicate image retrieval at the cost of higher computation.

Then, Table X presents comparisons with the state of the art under various post-processing techniques, such as RANSAC verification [6], [69], RNN reranking [70], etc. From Table VI, this paper reports N-S = 3.79 and mAP = 0.852 on Ukbench and Holidays, respectively. Both results outperform the state-of-the-art systems.

    V. CONCLUSIONBinary Embedding methods are effective for visual match-

    ing verification. In this paper, we propose a coupled binaryembedding method using a binary multi-index framework

    to fuse SIFT visual word with binary features at indexinglevel. To model the correlation between different features,we introduce a new IDF family, called the multi-IDF, whichcan be viewed as a weighted sum of individual IDF of eachfused feature. Specifically, we explore the integration of thelocal color descriptor in the retrieval framework. Throughextensive experiments on three benchmark datasets, we showthat significant improvement can be achieved when mul-tiple binary features are fused. Moreover, we demonstratethe effectiveness of multi-IDF in coupling different binaryfeatures. Further, when merged with global color feature bygraph fusion, we are capable of outperforming the state-of-the-art methods. In large-scale settings, by storing binaryfeatures in the inverted file, the proposed method consumesacceptable memory usage and query time compared with otherapproaches.

    In future work, more investigation will be devoted to the mechanism by which features complement each other and promote visual matching accuracy. Since our method can easily be extended to include other binary features, such as the recently proposed ORB [44], BRISK [45], and FREAK [46], various feature fusion and selection strategies will also be explored to further improve performance.

    REFERENCES
    [1] H. Jégou, M. Douze, and C. Schmid, "Hamming embedding and weak geometric consistency for large scale image search," in Proc. 10th Eur. Conf. Comput. Vis. (ECCV), 2008, pp. 304–317.
    [2] W. Lu, J. Wang, X.-S. Hua, S. Wang, and S. Li, "Contextual image search," in Proc. 19th ACM Multimedia, 2011, pp. 513–522.
    [3] X. Li, Y.-J. Zhang, B. Shen, and B.-D. Liu, "Image tag completion by low-rank factorization with dual reconstruction structure preserved," in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2014.
    [4] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2003, pp. 1470–1477.
    [5] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
    [6] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," in Proc. Comput. Vis. Pattern Recognit. (CVPR), 2007, pp. 1–8.
    [7] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proc. Comput. Vis. Pattern Recognit. (CVPR), vol. 2, 2006, pp. 2161–2168.
    [8] L. Zheng, S. Wang, Z. Liu, and Q. Tian, "Lp-norm IDF for large scale image search," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 1626–1633.
    [9] L. Zheng, S. Wang, and Q. Tian, "Lp-norm IDF for scalable image retrieval," IEEE Trans. Image Process., to be published.
    [10] D. Qin, C. Wengert, and L. Van Gool, "Query adaptive similarity for large scale object retrieval," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 1610–1617.
    [11] G. Tolias, Y. Avrithis, and H. Jégou, "To aggregate or not to aggregate: Selective match kernels for image search," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2013, pp. 1401–1408.
    [12] Z. Liu, H. Li, W. Zhou, R. Zhao, and Q. Tian, "Contextual hashing for large-scale image search," IEEE Trans. Image Process., vol. 23, no. 4, pp. 1606–1614, Apr. 2014.
    [13] C. Wengert, M. Douze, and H. Jégou, "Bag-of-colors for improved image search," in Proc. 19th ACM Multimedia, 2011, pp. 1437–1440.
    [14] R. Arandjelovic and A. Zisserman, "Three things everyone should know to improve object retrieval," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 2911–2918.
    [15] K. Simonyan, A. Vedaldi, and A. Zisserman, "Descriptor learning using convex optimisation," in Proc. 12th Eur. Conf. Comput. Vis. (ECCV), Oct. 2012, pp. 243–256.
    [16] H. Jégou, M. Douze, and C. Schmid, "Improving bag-of-features for large scale image search," Int. J. Comput. Vis., vol. 87, no. 3, pp. 316–336, 2010.


    [17] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Lost in quantization: Improving particular object retrieval in large scale image databases," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2008, pp. 1–8.
    [18] F. Perronnin, J. Sánchez, and T. Mensink, "Improving the Fisher kernel for large-scale image classification," in Proc. 11th Eur. Conf. Comput. Vis. (ECCV), 2010, pp. 143–156.
    [19] L. Xie, Q. Tian, and B. Zhang, "Spatial pooling of heterogeneous features for image classification," IEEE Trans. Image Process., vol. 23, no. 5, pp. 1994–2008, May 2014.
    [20] W. Zhou, H. Li, Y. Lu, and Q. Tian, "Principal visual word discovery for automatic license plate detection," IEEE Trans. Image Process., vol. 21, no. 9, pp. 4269–4279, Sep. 2012.
    [21] W. Zhou, M. Yang, H. Li, X. Wang, Y. Lin, and Q. Tian, "Towards codebook-free: Scalable cascaded hashing for mobile image search," IEEE Trans. Multimedia, vol. 16, no. 3, pp. 601–611, Apr. 2014.
    [22] X. Boix, G. Roig, and L. Van Gool, "Nested sparse quantization for efficient feature coding," in Proc. 12th Eur. Conf. Comput. Vis. (ECCV), 2012, pp. 744–758.
    [23] Z. Niu, G. Hua, X. Gao, and Q. Tian, "Semi-supervised relational topic model for weakly annotated image recognition in social media," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014.
    [24] Z. Niu, G. Hua, X. Gao, and Q. Tian, "Context aware topic model for scene recognition," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 2743–2750.
    [25] F. He and S. Wang, "Beyond χ² difference: Learning optimal metric for boundary detection," arXiv:1406.0946, 2014.
    [26] B. Su, X. Ding, L. Peng, and C. Liu, "A novel baseline-independent feature set for Arabic handwriting recognition," in Proc. 12th Int. Conf. Document Anal. Recognit. (ICDAR), Aug. 2013, pp. 1250–1254.
    [27] F. S. Khan, J. van de Weijer, and M. Vanrell, "Modulating shape features by color attention for object recognition," Int. J. Comput. Vis., vol. 98, no. 1, pp. 49–64, 2012.
    [28] F. S. Khan, R. M. Anwer, J. van de Weijer, A. D. Bagdanov, M. Vanrell, and A. M. Lopez, "Color attributes for object detection," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 3306–3313.
    [29] L. Dong, L. Yali, H. Fei, and W. Shengjin, "Object detection in image with complex background," in Proc. 3rd Int. Conf. Multimedia Technol., Nov. 2013.
    [30] P. Gehler and S. Nowozin, "On feature combination for multiclass object classification," in Proc. IEEE 12th Int. Conf. Comput. Vis. (ICCV), Oct. 2009, pp. 221–228.
    [31] M. Douze, A. Ramisa, and C. Schmid, "Combining attributes and Fisher vectors for efficient image retrieval," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 745–752.
    [32] S. Zhang, M. Yang, T. Cour, K. Yu, and D. N. Metaxas, "Query specific fusion for image retrieval," in Proc. 12th Eur. Conf. Comput. Vis. (ECCV), Oct. 2012, pp. 660–673.
    [33] S. Zhang, M. Yang, X. Wang, Y. Lin, and Q. Tian, "Semantic-aware co-indexing for near-duplicate image retrieval," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2013.
    [34] L. Zheng, S. Wang, F. He, and Q. Tian, "Seeing the big picture: Deep embedding with contextual evidences," arXiv:1406.0132, 2014.
    [35] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.

    [36] X. Zhang, Z. Li, L. Zhang, W.-Y. Ma, and H.-Y. Shum, "Efficient indexing for large scale visual search," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2009, pp. 1103–1110.
    [37] L. Zheng, S. Wang, Z. Liu, and Q. Tian, "Packing and padding: Coupled multi-index for accurate image retrieval," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014.
    [38] W. Zhou, H. Li, Y. Lu, and Q. Tian, "SIFT match verification by geometric coding for large-scale partial-duplicate web image search," ACM Trans. Multimedia Comput., Commun., Appl., vol. 9, no. 1, p. 4, 2013.
    [39] L. Zheng and S. Wang, "Visual phraselet: Refining spatial constraints for large scale image search," IEEE Signal Process. Lett., vol. 20, no. 4, pp. 391–394, Apr. 2013.
    [40] L. Zheng, S. Wang, W. Zhou, and Q. Tian, "Bayes merging of multiple vocabularies for scalable image retrieval," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014.
    [41] Z. Liu, H. Li, L. Zhang, W. Zhou, and Q. Tian, "Cross-indexing of binary SIFT codes for large-scale image search," IEEE Trans. Image Process., vol. 23, no. 5, pp. 2047–2057, Apr. 2014.
    [42] A. Babenko and V. Lempitsky, "The inverted multi-index," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 3069–3076.
    [43] H. Jégou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011.
    [44] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 2564–2571.
    [45] S. Leutenegger, M. Chli, and R. Y. Siegwart, "BRISK: Binary robust invariant scalable keypoints," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 2548–2555.
    [46] A. Alahi, R. Ortiz, and P. Vandergheynst, "FREAK: Fast retina keypoint," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 510–517.
    [47] X. Wang, M. Yang, T. Cour, S. Zhu, K. Yu, and T. X. Han, "Contextual weighting for vocabulary tree based image retrieval," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 209–216.
    [48] S. Robertson, "Understanding inverse document frequency: On theoretical arguments for IDF," J. Documentation, vol. 60, no. 5, pp. 503–520, 2004.
    [49] J. van de Weijer and C. Schmid, "Coloring local feature extraction," in Computer Vision. Berlin, Germany: Springer-Verlag, 2006, pp. 334–348.
    [50] A. Bosch, A. Zisserman, and X. Muñoz, "Scene classification using a hybrid generative/discriminative approach," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 4, pp. 712–727, Apr. 2008.
    [51] J. van de Weijer, T. Gevers, and A. D. Bagdanov, "Boosting color saliency in image feature detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 1, pp. 150–156, Jan. 2006.
    [52] K. E. van de Sande, T. Gevers, and C. G. Snoek, "Evaluating color descriptors for object and scene recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1582–1596, Sep. 2010.
    [53] T. Tuytelaars and C. Schmid, "Vector quantizing feature space with a regular lattice," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2007, pp. 1–8.
    [54] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proc. Annu. Symp. Comput. Geometry, 2004, pp. 253–262.
    [55] B. Kulis and K. Grauman, "Kernelized locality-sensitive hashing for scalable image search," in Proc. IEEE 12th Int. Conf. Comput. Vis. (ICCV), Oct. 2009, pp. 2130–2137.

    [56] M. J. Huiskes, B. Thomee, and M. S. Lew, "New trends and ideas in visual concept detection: The MIR Flickr retrieval evaluation initiative," in Proc. ACM Multimedia Inform. Retr. (MIR), 2010, pp. 527–536.
    [57] H. Jégou, M. Douze, and C. Schmid, "On the burstiness of visual elements," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 1169–1176.
    [58] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987, Jul. 2002.
    [59] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman, "Total recall: Automatic query expansion with a generative feature model for object retrieval," in Proc. IEEE 11th Int. Conf. Comput. Vis. (ICCV), Oct. 2007, pp. 1–8.
    [60] X. Shen, Z. Lin, J. Brandt, S. Avidan, and Y. Wu, "Object retrieval and localization with spatially-constrained similarity measure and k-NN re-ranking," in Proc. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 3013–3020.
    [61] Z. Liu, S. Wang, L. Zheng, and Q. Tian, "Visual reranking with improved image graph," in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2014, pp. 6909–6913.
    [62] R. Arandjelovic and A. Zisserman, "Three things everyone should know to improve object retrieval," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 2911–2918.
    [63] L. Xie, Q. Tian, W. Zhou, and B. Zhang, "Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb," Comput. Vis. Image Understand., vol. 124, pp. 31–41, Jul. 2014.
    [64] M. Jain, H. Jégou, and P. Gros, "Asymmetric Hamming embedding: Taking the best of our bits for large scale image search," in Proc. ACM Multimedia, 2011, pp. 1441–1444.
    [65] J. Wang, J. Wang, Q. Ke, G. Zeng, and S. Li, "Fast approximate k-means via cluster closures," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 3037–3044.
    [66] A. Torii, J. Sivic, T. Pajdla, and M. Okutomi, "Visual place recognition with repetitive structures," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 883–890.


    [67] A. Mikulík, M. Perdoch, O. Chum, and J. Matas, "Learning a fine vocabulary," in Proc. 11th Eur. Conf. Comput. Vis. (ECCV), 2010, pp. 1–14.
    [68] X. Yang, X. Gao, D. Tao, and X. Li, "Improving level set method for fast auroral oval segmentation," IEEE Trans. Image Process., vol. 23, no. 7, pp. 2854–2865, Jul. 2014.
    [69] X. Zhang, S. Wang, and X. Ding, "Beyond dominant plane assumption: Moving objects detection in severe dynamic scenes with multi-classes RANSAC," in Proc. Int. Conf. Audio, Lang. Image Process. (ICALIP), Jul. 2012, pp. 822–827.
    [70] D. Qin, S. Gammeter, L. Bossard, T. Quack, and L. Van Gool, "Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors," in Proc. IEEE Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 777–784.
    [71] H. Jégou, C. Schmid, H. Harzallah, and J. Verbeek, "Accurate image search using the contextual dissimilarity measure," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 2–11, Jan. 2010.
    [72] C. Deng, R. Ji, W. Liu, D. Tao, and X. Gao, "Visual reranking through weakly supervised multi-graph learning," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2013, pp. 2600–2607.

    Liang Zheng received the B.E. degree in life science from Tsinghua University, Beijing, China, in 2010, where he is currently pursuing the Ph.D. degree in information and communication engineering with the Department of Electronic Engineering. His research interests include multimedia information retrieval and computer vision.

    Shengjin Wang (M'04) received the B.E. degree from Tsinghua University, Beijing, China, and the Ph.D. degree from the Tokyo Institute of Technology, Tokyo, Japan, in 1985 and 1997, respectively. From 1997 to 2003, he was a member of the research staff with the Internet System Research Laboratories, NEC Corporation, Tokyo. Since 2003, he has been a Professor with the Department of Electronic Engineering, Tsinghua University, where he is currently the Director of the Research Institute of Image and Graphics. His current research interests include image processing, computer vision, video surveillance, and pattern recognition.

    Qi Tian (M'96–SM'03) received the B.E. degree in electronic engineering from Tsinghua University, Beijing, China, in 1992, the M.S. degree in electrical and computer engineering from Drexel University, Philadelphia, PA, USA, in 1996, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign, Champaign, IL, USA, in 2002. He is currently a Professor with the Department of Computer Science, University of Texas at San Antonio, San Antonio, TX, USA. He took a one-year faculty leave at Microsoft Research Asia, Beijing, from 2008 to 2009. His research interests include multimedia information retrieval and computer vision.

