arXiv:1405.4054v1 [cs.CV] 16 May 2014


Optimized Cartesian K-Means

Jianfeng Wang, Jingdong Wang, Jingkuan Song, Xin-Shun Xu, Heng Tao Shen, Shipeng Li

Abstract—Product quantization-based approaches are effective for encoding high-dimensional data points for approximate nearest neighbor search. The space is decomposed into a Cartesian product of low-dimensional subspaces, each of which generates a sub codebook. Data points are encoded as compact binary codes using these sub codebooks, and the distance between two data points can be approximated efficiently from their codes using precomputed lookup tables. Traditionally, to encode a subvector of a data point in a subspace, only one sub codeword in the corresponding sub codebook is selected, which may impose strict restrictions on the search accuracy. In this paper, we propose a novel approach, named Optimized Cartesian K-Means (OCKM), to better encode the data points for more accurate approximate nearest neighbor search. In OCKM, multiple sub codewords are used to encode the subvector of a data point in a subspace. Each sub codeword stems from a different sub codebook in each subspace, and these sub codebooks are optimally generated with regard to the minimization of the distortion errors. The high-dimensional data point is then encoded as the concatenation of the indices of multiple sub codewords from all the subspaces. This can provide more flexibility and lower distortion errors than traditional methods. Experimental results on standard real-life datasets demonstrate the superiority over state-of-the-art approaches for approximate nearest neighbor search.

Index Terms—Clustering, Cartesian product, Nearest neighbor search

1 INTRODUCTION

Nearest neighbor (NN) search in large data sets has wide applications in information retrieval, computer vision, machine learning, pattern recognition, recommendation systems, etc. However, exact NN search is often intractable because of the large scale of the database and the curse of dimensionality. Instead, approximate nearest neighbor (ANN) search is more practical and can achieve speed-ups of orders of magnitude over exact NN search with near-optimal accuracy [29].

There has been a lot of research interest in designing effective data structures, such as the k-d tree [4], randomized k-d forest [30], FLANN [22], trinary-projection tree [11], [39], and neighborhood graph search [1], [35], [37], [38].

Hashing algorithms have attracted a large amount of attention recently, as the storage cost is small and the distance computation is efficient. Such approaches map data points to compact binary codes through a hash function, which can be generally expressed as

$$\mathbf{b} = h(\mathbf{x}) \in \{0, 1\}^L,$$

where $\mathbf{x}$ is a $P$-dimensional real-valued point, $h(\cdot)$ is the hash function, and $\mathbf{b}$ is a binary vector with $L$ entries. For convenience of description, we use the terms vector and code interchangeably to refer to $\mathbf{b}$.

• Jianfeng Wang is with University of Science and Technology of China. Email: [email protected].

• Jingdong Wang and Shipeng Li are with Microsoft Research, Beijing, P.R. China. Emails: {jingdw, spli}@microsoft.com.

• Xin-Shun Xu is with Shandong University. Email: [email protected].

• Jingkuan Song and Heng Tao Shen are with School of Information Technology and Electrical Engineering, The University of Queensland, Australia. Email: {jk.song,shenht}@itee.uq.edu.au.

The pioneering hashing work, locality sensitive hashing (LSH) [3], [8], adopts random linear projections, with similarity preservation guaranteed probabilistically. Other approaches based on random functions include kernelized LSH [14], non-metric LSH [21], LSH from shift-invariant kernels [25], and super-bit LSH [10].

To preserve some notion of similarity, numerous efforts have been devoted to finding a good hash function by exploring the distribution of the specific data set. Typical approaches are unsupervised hashing [5], [12], [13], [33], [36], [40], [41], [42] and supervised hashing [16], [23], with kernelized versions [7], [17], and extensions to multi-modality [31], [32], [43], etc. Those algorithms usually use the Hamming distance, which is only able to produce a few distinct distances, resulting in limited ability and flexibility of distance approximation.

Quantization-based algorithms have been shown to achieve superior performance [9], [24]. The representative algorithms include product quantization (PQ) [9] and Cartesian K-means (CKM) [24], which are modified versions of the conventional K-means algorithm [19]. Quantization approaches typically learn a codebook $\{\mathbf{d}_1, \cdots, \mathbf{d}_K\}$, where each codeword $\mathbf{d}_k$ is a $P$-dimensional vector. The data point $\mathbf{x}$ is encoded in the following way,

$$k^* = \arg\min_{k \in \{1, 2, \cdots, K\}} \|\mathbf{x} - \mathbf{d}_k\|_2^2, \tag{1}$$

where $\|\cdot\|_2$ denotes the $l_2$ norm. The index $k^*$ indicates which codeword is the closest to $\mathbf{x}$ and can be represented as a binary code of length $\lceil \log_2(K) \rceil$¹.
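The encoding rule in Eqn. 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the function name `encode` and the toy codebook are hypothetical and not taken from the paper.

```python
import math
import numpy as np

def encode(x, codebook):
    """Return the index k* of the codeword closest to x (Eqn. 1).

    codebook has shape (K, P): one P-dimensional codeword per row.
    """
    dists = np.sum((codebook - x) ** 2, axis=1)  # squared l2 distance to each codeword
    return int(np.argmin(dists))

# Hypothetical toy codebook with K = 4 codewords in P = 2 dimensions.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
k_star = encode(np.array([0.9, 0.2]), codebook)

# Number of bits needed to store the index k*.
code_bits = math.ceil(math.log2(len(codebook)))
```

Only the index `k_star` is stored for each data point, which is why the code length grows with the logarithm of the codebook size.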

The crucial problem for quantization algorithms is how to learn the codebook. In traditional K-means, the codebook is composed of the cluster centers with a minimal squared distortion error. The drawbacks of applying K-means to ANN search are that the size of the codebook is quite limited and that computing the distances between the query and the codewords is expensive. PQ [9] addresses this problem by splitting the $P$-dimensional space into multiple disjoint subspaces and forming the codebook as the Cartesian product of the sub codebooks, each of which is learned on its subspace using the conventional K-means algorithm. The compact code is formed by concatenating the indices of the selected sub codewords within each sub codebook. CKM [24] improves PQ by optimally rotating the $P$-dimensional space to give a lower distortion error.

1. In the following, we omit the ⌈·⌉ operator without affecting the understanding.

In PQ and CKM, only one sub codeword on each subvector is used to quantize the data points, which results in limited capability of reducing the distortion error and thus limited search accuracy. In this paper, we first present a simple algorithm, extended Cartesian K-means (ECKM), which extends CKM by using multiple (e.g., $C$) sub codewords for a data point from the sub codebook in each subspace. Then, we propose the optimized Cartesian K-means (OCKM) algorithm, which learns $C$ sub codebooks in each subspace instead of a single sub codebook as in ECKM, and selects $C$ sub codewords, each chosen from a different sub codebook. We show that both PQ and CKM are constrained versions of our OCKM under the same code length, which suggests that our OCKM can lead to a lower quantization error and thus a higher search accuracy. Experimental results also validate that our OCKM achieves superior performance.

The remainder of this paper is organized as follows. Related work is first reviewed in Sec. 2. The proposed ECKM is introduced in Sec. 3, followed by the OCKM in Sec. 4. Discussions and experimental results are given in Sec. 5 and 6, respectively. Finally, a conclusion is made in Sec. 7.

2 RELATED WORK

Hashing is an emerging technique for representing high-dimensional vectors as binary codes for ANN search, and has achieved a lot of success in multimedia applications, e.g. image search [6], [15], video retrieval [2], [31], event detection [26], and document retrieval [27].

According to the form of the hash function, we roughly categorize the binary encoding approaches into those based on Hamming embedding and those based on quantization. Roughly, the former adopts the Hamming distance as the dissimilarity between the codes, while the latter does not.

Table 1 lists some of the notations and descriptions used in the paper. Generally, we use uppercase unbolded symbols for constants, lowercase unbolded symbols for indices, uppercase bolded symbols for matrices and lowercase bolded symbols for vectors.

2.1 Hamming embedding

Linear mapping is one of the typical hash functions. Each bit is calculated by

$$h_i(\mathbf{x}) = \mathrm{sign}(\mathbf{w}_i^T \mathbf{x} + u_i), \tag{2}$$

TABLE 1
Notations and descriptions.

Symbol             Description
$N$                number of training points
$P$                dimension of training points
$M$                number of subvectors
$S$                number of dimensions on each subvector
$K$                number of (sub) codewords
$m$                index of the subvector
$i$                index of the training point
$\mathbf{R}$       rotation matrix
$\mathbf{D}^m$     codebook on the $m$-th subvector
$\mathbf{b}_i^m$   1-of-$K$ encoding vector on the $m$-th subvector

where $\mathbf{w}_i$ is the projection vector, $u_i$ is the offset, and $\mathrm{sign}(z)$ is a sign function which is $1$ if $z > 0$, and $0$ otherwise.

Such approaches include [3], [5], [12]. The differences mainly reside in how the parameters of the hash function are obtained. For example, LSH [3] adopts random parameters, and the similarity is preserved probabilistically. Iterative quantization hashing [5] constructs hash functions by rotating the axes so that the difference between the binary codes and the projected data is minimized.

Another widely used approach is the kernel-based hash function [7], [13], [14], [17], i.e.

$$h_i(\mathbf{x}) = \mathrm{sign}\Big(\sum_{j} w_{ij}\, \kappa(\mathbf{x}, \mathbf{z}_j)\Big), \tag{3}$$

where $\mathbf{z}_j$ is a vector in the same space as $\mathbf{x}$, and $\kappa(\cdot, \cdot)$ is the kernel function. The cosine function can also be used to generate the binary codes, such as in [40].

2.2 Quantization

In quantization-based encoding methods, different constraints on the codewords lead to different approaches, i.e. K-Means [18], [19], Product Quantization (PQ) [9] and Cartesian K-Means (CKM) [24].

2.2.1 K-Means

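Given $N$ $P$-dimensional points $\mathcal{X} = \{\mathbf{x}_1, \cdots, \mathbf{x}_N\} \subset \mathbb{R}^P$, the K-means algorithm partitions the database into $K$ clusters, each of which is associated with one codeword $\mathbf{d}_i \in \mathbb{R}^P$. Let $\mathbf{D} = [\mathbf{d}_1, \cdots, \mathbf{d}_K] \in \mathbb{R}^{P \times K}$ be the corresponding codebook. Then the codebook is learned by minimizing the within-cluster distortion, i.e.

$$
\begin{aligned}
\min \quad & \sum_{i=1}^{N} \|\mathbf{x}_i - \mathbf{D}\mathbf{b}_i\|_2^2 \\
\text{s.t.} \quad & \mathbf{b}_i \in \{0, 1\}^K \\
& \|\mathbf{b}_i\|_1 = 1, \quad i \in \{1, \cdots, N\}
\end{aligned}
$$

where $\mathbf{b}_i$ is a 1-of-$K$ encoding vector ($K$ dimensions with one $1$ and $K - 1$ $0$s) indicating which codeword is used to quantize $\mathbf{x}_i$, and $\|\cdot\|_1$ is the $l_1$ norm.

The problem can be solved by alternating optimization with respect to $\mathbf{D}$ and $\{\mathbf{b}_i\}_{i=1}^{N}$ [18].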
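The alternating optimization can be illustrated with a short sketch. It assumes NumPy; the function name `kmeans`, the random initialization from the data and the fixed iteration count are illustrative choices rather than details taken from [18]. Points are stored as rows here, so the codebook is the transpose of $\mathbf{D}$ in the text.

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    """Alternate between the assignments b_i and the codebook D.

    X has shape (N, P); the returned codebook D has shape (K, P).
    """
    rng = np.random.default_rng(seed)
    D = X[rng.choice(len(X), size=K, replace=False)].copy()  # initialize from data
    history = []
    for _ in range(iters):
        # Fix D, update the assignments: nearest codeword for each point.
        d2 = ((X[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        history.append(d2[np.arange(len(X)), assign].sum())
        # Fix the assignments, update D: each codeword becomes its cluster mean.
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:
                D[k] = members.mean(axis=0)
    return D, assign, history

# Hypothetical toy data: two well-separated blobs in 2 dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
D, assign, history = kmeans(X, K=2)
```

Each of the two steps can only keep or lower the distortion, so the values recorded in `history` never increase.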


2.2.2 Product Quantization

One issue with K-means is that the size of the codebook is quite limited due to the storage and computational cost. To address the problem, PQ [9] splits each $\mathbf{x}_i$ into $M$ disjoint subvectors. Assume the $m$-th subvector contains $S_m$ dimensions, so that $\sum_{m=1}^{M} S_m = P$. Without loss of generality, $S_m$ is set to $S \triangleq P/M$ and $P$ is assumed to be divisible by $M$. On the $m$-th subvector, K-means is performed to obtain $K$ sub codewords. In this way, PQ generates $K^M$ clusters with only $O(KP)$ storage, while K-means requires $O(K^M P)$ storage for the same number of clusters. Meanwhile, the computational complexity of encoding one data point is reduced from $O(K^M P)$ to $O(KP)$.

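Let $\mathbf{D}^m \in \mathbb{R}^{S \times K}$ be the matrix of the $m$-th sub codebook, where each column is an $S$-dimensional sub codeword. PQ can be viewed as optimizing the following problem with respect to $\{\mathbf{D}^m\}_{m=1}^{M}$ and $\{\mathbf{b}_i^m\}_{i=1,m=1}^{N,M}$:

$$
\begin{aligned}
\min \quad & f_{pq,M,K} = \sum_{i=1}^{N} \left\| \mathbf{x}_i - \begin{bmatrix} \mathbf{D}^1 \mathbf{b}_i^1 \\ \vdots \\ \mathbf{D}^M \mathbf{b}_i^M \end{bmatrix} \right\|_2^2 \\
\text{s.t.} \quad & \mathbf{b}_i^m \in \{0, 1\}^K \\
& \|\mathbf{b}_i^m\|_1 = 1, \quad i \in \{1, \cdots, N\},\ m \in \{1, \cdots, M\}
\end{aligned}
\tag{4}
$$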

where $\mathbf{b}_i^m$ is also a 1-of-$K$ encoding vector on the $m$-th subvector, and the index of the $1$ indicates which sub codeword is used to encode $\mathbf{x}_i$.
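As a rough illustration of how a point is encoded once the sub codebooks in Eqn. 4 are fixed, the following sketch picks the nearest sub codeword independently in each subspace. It assumes NumPy; `pq_encode`, `pq_decode` and the toy sub codebooks are hypothetical names and values.

```python
import numpy as np

def pq_encode(x, sub_codebooks):
    """Encode x with one sub codeword index per subspace.

    sub_codebooks is a list of M arrays of shape (S, K) whose columns are
    sub codewords; len(x) must be divisible by M.
    """
    subvectors = np.split(x, len(sub_codebooks))
    codes = []
    for sub, Dm in zip(subvectors, sub_codebooks):
        d2 = ((Dm - sub[:, None]) ** 2).sum(axis=0)  # distance to each column
        codes.append(int(d2.argmin()))
    return codes

def pq_decode(codes, sub_codebooks):
    """Concatenate the selected sub codewords to reconstruct the point."""
    return np.concatenate([Dm[:, k] for k, Dm in zip(codes, sub_codebooks)])

# Hypothetical toy setting: P = 4, M = 2, S = 2, K = 2.
D1 = np.array([[0.0, 1.0], [0.0, 1.0]])  # columns (0, 0) and (1, 1)
D2 = np.array([[0.0, 2.0], [2.0, 0.0]])  # columns (0, 2) and (2, 0)
x = np.array([0.9, 1.1, 1.8, 0.1])
codes = pq_encode(x, [D1, D2])
x_hat = pq_decode(codes, [D1, D2])
```

Because the objective separates over subspaces, the $M$ nearest-neighbor searches are independent, which is where the $O(KP)$ encoding cost comes from.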

2.2.3 Cartesian K-Means

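CKM [24] optimally rotates the original space and formulates the problem as

$$
\begin{aligned}
\min \quad & f_{ck,M,K} = \sum_{i=1}^{N} \left\| \mathbf{x}_i - \mathbf{R} \begin{bmatrix} \mathbf{D}^1 \mathbf{b}_i^1 \\ \vdots \\ \mathbf{D}^M \mathbf{b}_i^M \end{bmatrix} \right\|_2^2 \\
\text{s.t.} \quad & \mathbf{R}^T \mathbf{R} = \mathbf{I} \\
& \mathbf{b}_i^m \in \{0, 1\}^K \\
& \|\mathbf{b}_i^m\|_1 = 1, \quad i \in \{1, \cdots, N\},\ m \in \{1, \cdots, M\}
\end{aligned}
\tag{5}
$$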

The rotation matrix $\mathbf{R}$ is optimally learned by minimizing the distortion.

If $\mathbf{R}$ is constrained to be the identity matrix $\mathbf{I}$, the problem reduces to Eqn. 4. Thus, under the optimal solutions, we have $f^*_{ck,M,K} \le f^*_{pq,M,K}$, where the asterisk superscript indicates the objective function value with the optimal parameters.

3 EXTENDED CARTESIAN K-MEANS

In both PQ and CKM, only one sub codeword is used to encode the subvector. To make the representation more flexible, we propose the extended Cartesian K-means (ECKM), where multiple sub codewords can be used in each subspace.

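Mathematically, we allow the $l_1$ norm of $\mathbf{b}_i^m$ to be a preset number $C$ ($C \ge 1$), instead of limiting it to be exactly $1$. Meanwhile, any entry of $\mathbf{b}_i^m$ is relaxed to a non-negative integer instead of a binary value. The formulation is

$$
\begin{aligned}
\min \quad & f_{eck,M,K,C} = \sum_{i=1}^{N} \left\| \mathbf{x}_i - \mathbf{R} \begin{bmatrix} \mathbf{D}^1 \mathbf{b}_i^1 \\ \vdots \\ \mathbf{D}^M \mathbf{b}_i^M \end{bmatrix} \right\|_2^2 \\
\text{s.t.} \quad & \mathbf{R}^T \mathbf{R} = \mathbf{I} \\
& \mathbf{b}_i^m \in \mathbb{Z}_+^K \\
& \|\mathbf{b}_i^m\|_1 = C
\end{aligned}
\tag{6}
$$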

where $\mathbb{Z}_+$ denotes the set of non-negative integers. The constraint is applied on all the points $i \in \{1, \cdots, N\}$ and on all the subspaces $m \in \{1, \cdots, M\}$. In the following, we omit the ranges of $i$ and $m$ where this causes no confusion.

For the $m$-th sub codebook $\mathbf{D}^m \in \mathbb{R}^{S \times K}$, traditionally only one sub codeword can be selected and there are only $K$ choices to encode the $m$-th subvector of $\mathbf{R}^T \mathbf{x}_i$. In the extended version, any feasible $\mathbf{b}_i^m$ satisfying $\mathbf{b}_i^m \in \mathbb{Z}_+^K$ and $\|\mathbf{b}_i^m\|_1 = C$ constructs a quantizer, i.e. $\mathbf{D}^m \mathbf{b}_i^m$. Thus, the total number of choices is $\binom{K+C-1}{K-1} \ge K$. For example, with $K = 256$ and $C = 2$, the number of choices is $\binom{K+C-1}{K-1} = 32896 \gg K = 256$. With a more powerful representation, the distortion errors can potentially be reduced.

In theory, $\log_2 \binom{K+C-1}{K-1}$ bits can be used to encode one $\mathbf{b}_i^m$, and the code length is $M \log_2 \binom{K+C-1}{K-1}$. Practically, we use $\log_2(K)$ bits to encode one position of a $1$. The $l_1$ norm of $\mathbf{b}_i^m$ is $C$, which can be interpreted as there being $C$ $1$s in $\mathbf{b}_i^m$. Then $MC \log_2(K)$ bits are allocated to encode one data point.
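The counts above can be checked with a few lines of standard-library Python. The variable names are illustrative only.

```python
from math import comb, log2

K, C = 256, 2

# Number of feasible b_i^m: multisets of size C drawn from K sub codewords.
num_choices = comb(K + C - 1, K - 1)

# Bits per subvector: the theoretical minimum versus the practical scheme
# that spends log2(K) bits on each of the C selected positions.
ideal_bits = log2(num_choices)
practical_bits = C * log2(K)
```

For these values the practical scheme uses 16 bits per subvector, slightly more than the theoretical minimum of about 15 bits, in exchange for a much simpler code layout.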

3.1 Learning

Similar to [24], we present an iterative coordinate descent algorithm to solve the problem in Eqn. 6. There are three kinds of unknown variables, $\mathbf{R}$, $\mathbf{D}^m$, and $\mathbf{b}_i^m$. In each iteration, two of them are fixed, and the other one is optimized.

3.1.1 Solve $\mathbf{R}$ with $\mathbf{b}_i^m$ and $\mathbf{D}^m$ fixed

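With

$$
\mathbf{X} \triangleq \begin{bmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_N \end{bmatrix}, \quad
\mathbf{D} \triangleq \begin{bmatrix} \mathbf{D}^1 & & \\ & \ddots & \\ & & \mathbf{D}^M \end{bmatrix}, \quad
\mathbf{B} \triangleq \begin{bmatrix} \mathbf{b}_1 & \cdots & \mathbf{b}_N \end{bmatrix}, \quad
\mathbf{b}_i \triangleq \begin{bmatrix} {\mathbf{b}_i^1}^T & \cdots & {\mathbf{b}_i^M}^T \end{bmatrix}^T,
$$

we rewrite the objective function of Eqn. 6 in matrix form as

$$\|\mathbf{X} - \mathbf{R}\mathbf{D}\mathbf{B}\|_F^2,$$

where $\|\cdot\|_F$ is the Frobenius norm. The problem of solving $\mathbf{R}$ is the classic Orthogonal Procrustes problem [28], and the solution can be obtained as follows: if the SVD of $\mathbf{X}(\mathbf{D}\mathbf{B})^T$ is $\mathbf{X}(\mathbf{D}\mathbf{B})^T = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$, the optimal $\mathbf{R}$ is $\mathbf{U}\mathbf{V}^T$.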
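The SVD-based rotation update can be sketched as follows, assuming NumPy. The function name `solve_rotation` is hypothetical, and the matrix `Y` stands in for the product $\mathbf{D}\mathbf{B}$; the synthetic data only serves to check that a known rotation is recovered.

```python
import numpy as np

def solve_rotation(X, Y):
    """Minimize ||X - R Y||_F over orthogonal R, where Y plays the role of DB."""
    U, _, Vt = np.linalg.svd(X @ Y.T)
    return U @ Vt

# Synthetic check: build X from a known orthogonal matrix and recover it.
rng = np.random.default_rng(0)
P, N = 4, 50
Y = rng.normal(size=(P, N))
R_true, _ = np.linalg.qr(rng.normal(size=(P, P)))  # random orthogonal matrix
X = R_true @ Y
R = solve_rotation(X, Y)
```

In the learning loop, `Y` would be recomputed from the current sub codebooks and codes before each rotation update.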


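Algorithm 1 Code Generation for ECKM
Input: $\mathbf{z}_i^m$, $\mathbf{D}^m \in \mathbb{R}^{S \times K}$, $C$
Output: $\mathbf{b}_i^m$
1: $\mathbf{b}_i^m$ = zeros($K$, 1)
2: $\mathbf{r} = \mathbf{z}_i^m$
3: for $c = 1 : C$ do
4:   $k^* = \arg\min_k \|\mathbf{r} - \mathbf{d}_k^m\|_2^2$
5:   $\mathbf{r} = \mathbf{r} - \mathbf{d}_{k^*}^m$
6:   $\mathbf{b}_i^m(k^*) = \mathbf{b}_i^m(k^*) + 1$
7: end for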

3.1.2 Solve $\mathbf{D}^m$ with $\mathbf{b}_i^m$ and $\mathbf{R}$ fixed

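Let $\mathbf{z}_i \triangleq \mathbf{R}^T \mathbf{x}_i$ and let the $m$-th subvector of $\mathbf{z}_i$ be $\mathbf{z}_i^m$. The objective function of Eqn. 6 can also be written as

$$\sum_{i=1}^{N} \sum_{m=1}^{M} \|\mathbf{z}_i^m - \mathbf{D}^m \mathbf{b}_i^m\|_2^2 = \sum_{m=1}^{M} \|\mathbf{Z}^m - \mathbf{D}^m \mathbf{B}^m\|_F^2, \tag{7}$$

where

$$\mathbf{Z}^m \triangleq [\mathbf{z}_1^m, \cdots, \mathbf{z}_N^m], \qquad \mathbf{B}^m \triangleq [\mathbf{b}_1^m, \cdots, \mathbf{b}_N^m].$$

Each $\mathbf{D}^m$ can be individually optimized as $(\mathbf{Z}^m {\mathbf{B}^m}^T)(\mathbf{B}^m {\mathbf{B}^m}^T)^+$, where $(\cdot)^+$ denotes the matrix (pseudo)inverse.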
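A minimal sketch of this least-squares codebook update, assuming NumPy. The name `update_codebook` and the synthetic codes (two random selections per point, i.e. $C = 2$) are illustrative assumptions.

```python
import numpy as np

def update_codebook(Zm, Bm):
    """Least-squares update D^m = (Z B^T)(B B^T)^+ with Zm: (S, N), Bm: (K, N)."""
    return (Zm @ Bm.T) @ np.linalg.pinv(Bm @ Bm.T)

# Synthetic check: data generated exactly from a known sub codebook.
rng = np.random.default_rng(0)
S, K, N = 3, 4, 200
D_true = rng.normal(size=(S, K))
Bm = np.zeros((K, N))
for i in range(N):
    for k in rng.integers(0, K, size=2):  # hypothetical ECKM code with C = 2
        Bm[k, i] += 1
Zm = D_true @ Bm
D_hat = update_codebook(Zm, Bm)
```

The pseudo-inverse keeps the update well defined even when some sub codeword is never selected and $\mathbf{B}^m {\mathbf{B}^m}^T$ is singular.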

3.1.3 Solve $\mathbf{b}_i^m$ with $\mathbf{D}^m$ and $\mathbf{R}$ fixed

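From Eqn. 6 and Eqn. 7, $\mathbf{b}_i^m$ can be solved by optimizing

$$
\begin{aligned}
\min \quad & g_{eck}(\mathbf{b}_i^m) = \|\mathbf{z}_i^m - \mathbf{D}^m \mathbf{b}_i^m\|_2^2 \\
\text{s.t.} \quad & \mathbf{b}_i^m \in \mathbb{Z}_+^K \\
& \|\mathbf{b}_i^m\|_1 = C
\end{aligned}
$$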

This is an integer quadratic programming problem and is challenging to solve. Here, we present a simple but practically efficient algorithm, based on matching pursuit [20] and illustrated in Alg. 1. In each iteration, we hold a residual variable $\mathbf{r}$, initialized by $\mathbf{z}_i^m$ (Line 2 in Alg. 1). Let $\mathbf{d}_k^m$ be the $k$-th column of $\mathbf{D}^m$. Each column is scanned to obtain the best one to minimize the distortion error (Line 4), i.e.

$$k^* = \arg\min_k \|\mathbf{r} - \mathbf{d}_k^m\|_2^2.$$

Then $\mathbf{d}_{k^*}^m$ is subtracted from $\mathbf{r}$ (Line 5) for the next iteration, and the $k^*$-th dimension of $\mathbf{b}_i^m$ increases by $1$ (Line 6) to indicate that the $k^*$-th sub codeword is selected. The process stops after $C$ iterations.
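Alg. 1 translates almost line by line into Python. The sketch below assumes NumPy; the function name `gen_code_eckm` and the toy sub codebook are hypothetical.

```python
import numpy as np

def gen_code_eckm(z, D, C):
    """Greedy matching pursuit following Alg. 1.

    D has shape (S, K) with sub codewords as columns. Returns an integer
    count vector b of length K whose entries sum to C.
    """
    K = D.shape[1]
    b = np.zeros(K, dtype=int)
    r = z.astype(float).copy()                       # residual (Line 2)
    for _ in range(C):
        d2 = ((D - r[:, None]) ** 2).sum(axis=0)
        k = int(d2.argmin())                         # best column (Line 4)
        r -= D[:, k]                                 # update residual (Line 5)
        b[k] += 1                                    # record selection (Line 6)
    return b

# Hypothetical toy sub codebook with columns (1, 0), (0, 1) and (3, 3).
D = np.array([[1.0, 0.0, 3.0], [0.0, 1.0, 3.0]])
z = np.array([2.1, 0.1])
b = gen_code_eckm(z, D, C=2)
```

In this toy case the same sub codeword is selected twice, which is exactly the situation the integer-valued $\mathbf{b}_i^m$ allows.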

4 OPTIMIZED CARTESIAN K-MEANS

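Before introducing the proposed OCKM, we first present another equivalent formulation of the ECKM. Since each entry of $\mathbf{b}_i^m$ in Eqn. 6 is a non-negative integer, and the sum of all the entries is $C$, we replace it by

$$\mathbf{b}_i^m = \sum_{c=1}^{C} \mathbf{b}_i^{m,c} \tag{8}$$

with

$$
\begin{aligned}
& \mathbf{b}_i^{m,c} \in \{0, 1\}^K \\
& \|\mathbf{b}_i^{m,c}\|_1 = 1.
\end{aligned}
\tag{9}
$$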

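Given any feasible $\mathbf{b}_i^m$, we can always find at least one group of $\{\mathbf{b}_i^{m,c}\}_{c=1}^{C}$ satisfying Eqn. 9 and Eqn. 8. Any group of $\{\mathbf{b}_i^{m,c}\}_{c=1}^{C}$ satisfying Eqn. 9 can also construct a valid $\mathbf{b}_i^m$ by Eqn. 8 for Eqn. 6. For example, if $\mathbf{b}_i^m = [2\ 0\ 1\ 0]$, we can replace it by the summation of $[1\ 0\ 0\ 0]$, $[1\ 0\ 0\ 0]$ and $[0\ 0\ 1\ 0]$.

Substituting Eqn. 8 into the objective function of Eqn. 6, we have

$$f_{eck,M,K,C} = \sum_{i=1}^{N} \left\| \mathbf{x}_i - \mathbf{R} \begin{bmatrix} \sum_{c} \mathbf{D}^1 \mathbf{b}_i^{1,c} \\ \vdots \\ \sum_{c} \mathbf{D}^M \mathbf{b}_i^{M,c} \end{bmatrix} \right\|_2^2.$$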

On the $m$-th subvector, $\mathbf{b}_i^{m,c}$ represents the selected sub codeword. There are in total $C$ selections from a single sub codebook. To further reduce the distortion errors, we propose to expand one sub codebook to $C$ different sub codebooks $\mathbf{D}^{m,c} \in \mathbb{R}^{S \times K}$, $c \in \{1, \cdots, C\}$, each of which is used for sub codeword selection. In summary, the formulation is as follows.

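$$
\begin{aligned}
\min \quad & f_{ock,M,K,C} = \sum_{i=1}^{N} \left\| \mathbf{x}_i - \mathbf{R} \begin{bmatrix} \sum_{c} \mathbf{D}^{1,c} \mathbf{b}_i^{1,c} \\ \vdots \\ \sum_{c} \mathbf{D}^{M,c} \mathbf{b}_i^{M,c} \end{bmatrix} \right\|_2^2 \\
\text{s.t.} \quad & \mathbf{R}^T \mathbf{R} = \mathbf{I} \\
& \mathbf{b}_i^{m,c} \in \{0, 1\}^K \\
& \|\mathbf{b}_i^{m,c}\|_1 = 1
\end{aligned}
\tag{10}
$$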

which we call Optimized Cartesian K-Means (OCKM). Since any $\mathbf{b}_i^{m,c}$ requires $\log_2(K)$ bits to encode, the code length of representing each point is $MC \log_2(K)$.

4.1 Learning

Similar to ECKM, an iterative coordinate descent algorithm is employed to optimize $\mathbf{R}$, $\mathbf{D}^{m,c}$ and $\mathbf{b}_i^{m,c}$.

4.1.1 Solve $\mathbf{R}$ with $\mathbf{D}^{m,c}$ and $\mathbf{b}_i^{m,c}$ fixed

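The objective function is rewritten in matrix form as

$$\|\mathbf{X} - \mathbf{R}\mathbf{D}\mathbf{B}\|_F^2,$$

where

$$\mathbf{D} \triangleq \begin{bmatrix} \mathbf{D}^1 & & \\ & \ddots & \\ & & \mathbf{D}^M \end{bmatrix} \tag{11}$$

$$\mathbf{D}^m \triangleq \begin{bmatrix} \mathbf{D}^{m,1} & \cdots & \mathbf{D}^{m,C} \end{bmatrix} \tag{12}$$

$$\mathbf{B} \triangleq \begin{bmatrix} {\mathbf{B}^1}^T & \cdots & {\mathbf{B}^M}^T \end{bmatrix}^T \tag{13}$$

$$\mathbf{B}^m \triangleq \begin{bmatrix} \mathbf{b}_1^m & \cdots & \mathbf{b}_N^m \end{bmatrix} \tag{14}$$

$$\mathbf{b}_i^m \triangleq \begin{bmatrix} {\mathbf{b}_i^{m,1}}^T & \cdots & {\mathbf{b}_i^{m,C}}^T \end{bmatrix}^T. \tag{15}$$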

Then optimizing $\mathbf{R}$ is the Orthogonal Procrustes problem [28].


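Algorithm 2 Code generation for OCKM
Input: $\mathbf{z}_i^m$, $\mathbf{D}^m \in \mathbb{R}^{S \times KC}$
Output: $\mathbf{b}_i^m$
1: [$\mathbf{b}_i^m$, error] = GenCodeOck($\mathbf{z}_i^m$, $\mathbf{D}^m$, 1)

Algorithm 3 [$\mathbf{b}$, error] = GenCodeOck($\mathbf{z}_i^m$, $\mathbf{D}^m$, idx)
1: if idx == $C$ then
2:   $k^* = \arg\min_k \|\mathbf{z}_i^m - \mathbf{d}_k^{m,\mathrm{idx}}\|_2^2$
3:   $\mathbf{b}$ = zeros($K$, 1)
4:   $\mathbf{b}(k^*) = 1$
5:   error $= \|\mathbf{z}_i^m - \mathbf{d}_{k^*}^{m,\mathrm{idx}}\|_2^2$
6: else
7:   $[k_1^*, \cdots, k_T^*] = \arg\min_k \|\mathbf{z}_i^m - \mathbf{d}_k^{m,\mathrm{idx}}\|_2^2$
8:   best.error = LARGE
9:   for $i = 1 : T$ do
10:    $k \leftarrow k_i^*$
11:    $\mathbf{z}' = \mathbf{z}_i^m - \mathbf{d}_k^{m,\mathrm{idx}}$
12:    [$\mathbf{b}'$, error$'$] = GenCodeOck($\mathbf{z}'$, $\mathbf{D}^m$, idx + 1)
13:    if error$'$ < best.error then
14:      best.error = error$'$
15:      best.idx = $k$
16:      best.b = $\mathbf{b}'$
17:    end if
18:  end for
19:  $\mathbf{b}_1$ = zeros($K$, 1)
20:  $\mathbf{b}_1$(best.idx) = 1
21:  $\mathbf{b}$ = [$\mathbf{b}_1$; best.b]
22:  error = best.error
23: end if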

4.1.2 Solve $\mathbf{D}^{m,c}$ with $\mathbf{R}$ and $\mathbf{b}_i^{m,c}$ fixed

Similar to Eqn. 7 in ECKM, the objective function of OCKM can be written as

$$\sum_{m=1}^{M} \|\mathbf{Z}^m - \mathbf{D}^m \mathbf{B}^m\|_F^2.$$

Each $\mathbf{D}^m$ can also be individually solved by the matrix (pseudo)inversion.

4.1.3 Solve $\mathbf{b}_i^{m,c}$ with $\mathbf{R}$ and $\mathbf{D}^{m,c}$ fixed

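The sub problem is

$$
\begin{aligned}
\min \quad & g_{ock}(\mathbf{b}_i^{m,c}) = \Big\| \mathbf{z}_i^m - \sum_{c=1}^{C} \mathbf{D}^{m,c} \mathbf{b}_i^{m,c} \Big\|_2^2 \\
\text{s.t.} \quad & \mathbf{b}_i^{m,c} \in \{0, 1\}^K \\
& \|\mathbf{b}_i^{m,c}\|_1 = 1
\end{aligned}
$$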

One straightforward method to solve the sub problem is to greedily find the best sub codeword in each $\mathbf{D}^{m,c}$ one by one, similar to Alg. 1 for ECKM. One drawback is that each succeeding sub codeword can only be combined with the single sub codeword selected before it.

To increase the accuracy with a reasonable time cost, we improve it as multiple best candidates matching pursuit. The algorithm is illustrated in Alg. 2 and Alg. 3. The input is the target vector $\mathbf{z}_i^m$ and the sub codebooks $\mathbf{D}^m$ (defined in Eqn. 12). The output is the binary code represented as $\mathbf{b}_i^m$ (defined in Eqn. 15).

The function [$\mathbf{b}$, error] = GenCodeOck($\mathbf{z}_i^m$, $\mathbf{D}^m$, idx) in Alg. 3 encodes $\mathbf{z}_i^m$ with the last $(C - \mathrm{idx} + 1)$ sub codebooks $\{\mathbf{D}^{m,c}, c \in \{\mathrm{idx}, \cdots, C\}\}$. The encoding vector $\mathbf{b}$ with $(C - \mathrm{idx} + 1)K$ dimensions and the distortion error are returned.

At first, idx $= 1$ and we search the top-$T$ best columns in $\mathbf{D}^{m,\mathrm{idx}}$ (Line 7 in Alg. 3), with $T$ being a pre-defined parameter. Let $\mathbf{d}_k^{m,\mathrm{idx}}$ be the $k$-th column of $\mathbf{D}^{m,\mathrm{idx}}$. The final selected one is taken from among the $T$ best candidates. For each candidate, the corresponding sub codeword is subtracted from the target vector (Line 11), and then the remaining codes $\mathbf{b}'$ are generated by recursively calling the function GenCodeOck with the parameter idx $+ 1$ (Line 12).

Among the $T$ candidates, the one with the smallest distortion error, stored in best.idx, is selected to construct the final binary representation (Lines 19, 20, 21). In Line 8, the error is initialized as a large enough constant LARGE.

Analysis. The parameter $T$ controls the time cost and the accuracy towards the optimality. If the time complexity is $J(C)$, we can derive the recursive relation

$$J(C) = SK + T J(C - 1).$$

As shown in Line 7 of Alg. 3, $T$ sub codewords are selected, and here we simply compare with each sub codeword, resulting in $O(SK)$ complexity. Since $T$ is generally far smaller than $K$, the cost of partially sorting to obtain the $T$ best ones can be ignored. For each of the $T$ best sub codewords, the complexity of finding the binary code in the remaining sub codebooks is $J(C - 1)$ (Line 12). With $J(1) = SK$, we can derive that the complexity is

$$J(C) = SK \frac{T^C - 1}{T - 1}. \tag{16}$$

Since there are $M$ subvectors, the complexity of encoding one full vector is $J(C) M = PK (T^C - 1)/(T - 1) = O(PK T^{C-1})$. The time cost increases with a larger $T$.
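The recursion of Alg. 3 can be sketched in Python as follows. This is a minimal illustration assuming NumPy, not the authors' implementation: the sub codebooks are passed as a list of $C$ arrays instead of the concatenated matrix of Eqn. 12, indices are 0-based, and the function returns the selected indices rather than the stacked one-hot vectors. The names `gen_code_ock` and `brute_force` are hypothetical; the brute-force search is included only to check the result on a tiny problem.

```python
import itertools
import numpy as np

def gen_code_ock(z, books, T, idx=0):
    """Multiple best candidates matching pursuit (after Alg. 3).

    books is a list of C arrays of shape (S, K). Returns (indices, error),
    where indices[c] is the column chosen from books[idx + c].
    """
    D = books[idx]
    d2 = ((D - z[:, None]) ** 2).sum(axis=0)
    if idx == len(books) - 1:                # last sub codebook: plain nearest neighbor
        k = int(d2.argmin())
        return [k], float(d2[k])
    best_codes, best_err = None, float("inf")
    for k in np.argsort(d2)[:T]:             # top-T candidates in this sub codebook
        rest, err = gen_code_ock(z - D[:, k], books, T, idx + 1)
        if err < best_err:
            best_codes, best_err = [int(k)] + rest, err
    return best_codes, best_err

def brute_force(z, books):
    """Exhaustive search over all K^C combinations (tiny problems only)."""
    K = books[0].shape[1]
    def err(ks):
        recon = sum(D[:, k] for D, k in zip(books, ks))
        return float(np.sum((z - recon) ** 2))
    best = min(itertools.product(range(K), repeat=len(books)), key=err)
    return list(best), err(best)

rng = np.random.default_rng(0)
S, K, C = 4, 8, 2
books = [rng.normal(size=(S, K)) for _ in range(C)]
z = rng.normal(size=S)
codes_full, err_full = gen_code_ock(z, books, T=K)   # exhaustive over candidates
codes_opt, err_opt = brute_force(z, books)
codes_1, err_1 = gen_code_ock(z, books, T=1)         # purely greedy
```

With `T=K` every combination is examined, so the error matches the brute-force optimum, while `T=1` reduces to the greedy scheme and can only do as well or worse.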

Generally, Alg. 2 can achieve a better solution with a larger $T$. If the position of the $1$ in $\mathbf{b}_i^{m,c}$ is uniformly distributed and independent of the others, we can calculate the probability of obtaining the optimal solution by Alg. 2. On each subvector, there are $K^C$ different cases for $\mathbf{b}_i^m$. Line 7 of Alg. 3 is executed for each of the first $C - 1$ sub codebooks, and thus $T$ sub codewords are selected in each of them. All the sub codewords in the last sub codebook are tried to find the one with the minimal distortion (Line 2). Then, $T^{C-1} K$ different cases are checked, and the probability of finding the optimal solution is

$$\frac{T^{C-1} K}{K^C} = \left( \frac{T}{K} \right)^{C-1}. \tag{17}$$

If $T = K$, the probability will be $1$. It is certain that the optimal solution can be found, but with a high time cost. The probability increases with a larger $T$. Meanwhile, it decreases exponentially with $C$. Generally, we set $C = 2$ to have a better sub-optimal solution. Fig. 1 illustrates the relationship between the optimized distortion errors and $T$ on the SIFT1M training set, which is described in Sec. 6. In practice, we choose $T = 10$ as a tradeoff.

[Figure 1: plot of distortion (vertical axis, ×10^8) against $T$ (horizontal axis).]
Fig. 1. Distortion errors on the training set of SIFT1M vs different $T$(s) with $M = 8$, $K = 256$ and $C = 2$.
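The cost recursion, its closed form in Eqn. 16 and the probability in Eqn. 17 can be checked numerically with plain Python. The function names are hypothetical, and the value $S = 16$ is an assumed subvector dimension chosen for illustration; it is not stated in this section.

```python
def encode_cost(S, K, T, C):
    """J(C) from the recursion J(C) = S*K + T*J(C-1) with J(1) = S*K."""
    if C == 1:
        return S * K
    return S * K + T * encode_cost(S, K, T, C - 1)

def closed_form(S, K, T, C):
    """Closed form of Eqn. 16; requires T != 1."""
    return S * K * (T ** C - 1) // (T - 1)

def prob_optimal(K, T, C):
    """Probability of Eqn. 17 under the uniform, independent assumption."""
    return (T / K) ** (C - 1)

S, K, T, C = 16, 256, 10, 2  # S is an assumed value for illustration
cost = encode_cost(S, K, T, C)
p = prob_optimal(K, T, C)
```

Under the stated uniformity assumption, $T = 10$ with $K = 256$ and $C = 2$ covers only about 4% of the candidate combinations, so the choice of $T$ is driven by the empirical distortion curve rather than by this worst-case style estimate.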

5 DISCUSSIONS

5.1 Connections

Our approaches are closely related to PQ [9] and CKM [24]. PQ splits the original vector into multiple subvectors to address the scalability issues. CKM rotates the space optimally and thus can achieve better accuracy. In each subspace, both PQ and CKM generate a single sub codebook and choose one sub codeword to quantize the original point. Our ECKM extends the idea by choosing multiple sub codewords from the single sub codebook, while our OCKM generates multiple sub codebooks, each of which contributes one sub codeword.

Next, we theoretically discuss the relations between our OCKM and the others.

Theorem 1. Under optimal solutions, we have:

$$f^*_{ock,M,K,C} \le f^*_{ck,M,K} \tag{18}$$

$$f^*_{ock,M,K,C} \le f^*_{eck,M,K,C}. \tag{19}$$

Proof: If we limit $\mathbf{D}^{m,c_1} = \mathbf{D}^{m,c_2}$, $c_1, c_2 \in \{1, \cdots, C\}$ in Eqn. 10, OCKM is reduced to the ECKM in Eqn. 6 by the relations in Eqn. 8 and Eqn. 9, which proves Eqn. 19.

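Denote $\mathbf{R}_{ck}$, $\{\mathbf{D}^m_{ck}\}_{m=1}^{M}$, $\{\mathbf{b}^m_{i,ck}\}_{i=1,m=1}^{N,M}$ as the optimal solution of CKM in Eqn. 5. A feasible solution of OCKM can be constructed by

$$
\begin{aligned}
\mathbf{R}_{ock} &= \mathbf{R}_{ck} \\
\mathbf{D}^{m,c}_{ock} &= \begin{cases} \mathbf{D}^m_{ck} & c = 1 \\ \mathbf{0} & c \ge 2 \end{cases} \\
\mathbf{b}^{m,c}_{i,ock} &= \mathbf{b}^m_{i,ck}, \quad c \in \{1, \cdots, C\}.
\end{aligned}
$$

With the constructed parameters, the objective function of OCKM remains the same as that of CKM, which proves Eqn. 18.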

This theorem implies that the proposed OCKM can potentially achieve a lower distortion error with the number of partitions $M$ and $K$ fixed.
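The construction used for Eqn. 18 can be checked numerically on a single subspace. The sketch assumes NumPy and uses random stand-ins for the CKM sub codebook and codes; they are not optimal solutions, since the equality of the two objective values holds for any CKM parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
S, K, C, N = 3, 5, 2, 20
D_ck = rng.normal(size=(S, K))        # stand-in CKM sub codebook
idx = rng.integers(0, K, size=N)      # stand-in CKM codes, one index per point
Z = rng.normal(size=(S, N))           # rotated subvectors z_i^m

# CKM objective on this subspace.
f_ck = np.sum((Z - D_ck[:, idx]) ** 2)

# OCKM construction: first sub codebook equals D_ck, the rest are zero,
# and every b^{m,c} reuses the CKM code.
D_ock = [D_ck] + [np.zeros((S, K)) for _ in range(C - 1)]
recon = sum(D[:, idx] for D in D_ock)
f_ock = np.sum((Z - recon) ** 2)
```

Since the zero sub codebooks contribute nothing to the reconstruction, any CKM solution is matched by a feasible OCKM solution, and optimizing the extra sub codebooks can only lower the distortion.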

Theorem 2. Under the optimal solutions, we have

$$f^*_{ock,M',K,C} \le f^*_{ck,M,K} \tag{20}$$

if $M' = M/C$ and $M$ is divisible by $C$.

Proof: The basic idea is that, for the optimal solution of CKM, every $C$ consecutive sub codebooks and the corresponding binary representations are grouped to construct a feasible solution of OCKM with an equal objective function value.

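Specifically, the construction is

$$
\begin{aligned}
\mathbf{R}_{ock} &= \mathbf{R}_{ck} \\
\mathbf{D}^{p,q}_{ock} &= \begin{bmatrix} \mathbf{0}_{(q-1)S \times K} \\ \mathbf{D}^{(p-1)C+q}_{ck} \\ \mathbf{0}_{(C-q)S \times K} \end{bmatrix} \\
\mathbf{b}^{p,q}_{i,ock} &= \mathbf{b}^{(p-1)C+q}_{i,ck},
\end{aligned}
$$

where $\mathbf{0}_{a \times b}$ is a matrix of size $a \times b$ with all entries being $0$, and $p \in \{1, \cdots, M'\}$, $q \in \{1, \cdots, C\}$.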

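Take $C = 2$, $M = 2$ as an example. The formulation of CKM is

$$
\begin{aligned}
\min \quad & f_{ck,2,K} = \sum_{i=1}^{N} \left\| \mathbf{x}_i - \mathbf{R} \begin{bmatrix} \mathbf{D}^1 & \mathbf{0} \\ \mathbf{0} & \mathbf{D}^2 \end{bmatrix} \begin{bmatrix} \mathbf{b}_i^1 \\ \mathbf{b}_i^2 \end{bmatrix} \right\|_2^2 \\
\text{s.t.} \quad & \mathbf{R}^T \mathbf{R} = \mathbf{I} \\
& \mathbf{b}_i^m \in \{0, 1\}^K \\
& \|\mathbf{b}_i^m\|_1 = 1
\end{aligned}
$$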

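Let $\mathbf{R}_{ck}$, $\{\mathbf{D}^m_{ck}\}_{m=1}^{2}$, $\{\mathbf{b}^m_{i,ck}\}_{i=1,m=1}^{N,2}$ be the optimal solutions of CKM. Then

$$
\begin{aligned}
\mathbf{R}_{ock} &= \mathbf{R}_{ck} \\
\mathbf{D}^{1,1}_{ock} &= \begin{bmatrix} \mathbf{D}^1_{ck} \\ \mathbf{0} \end{bmatrix} \\
\mathbf{D}^{1,2}_{ock} &= \begin{bmatrix} \mathbf{0} \\ \mathbf{D}^2_{ck} \end{bmatrix} \\
\mathbf{b}^{1,c}_{ock} &= \mathbf{b}^{c}_{ck}, \quad c \in \{1, 2\}
\end{aligned}
$$

will be feasible for the problem of OCKM, i.e.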

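$$
\begin{aligned}
\min \quad & f_{ock,1,K,2} = \sum_{i=1}^{N} \left\| \mathbf{x}_i - \mathbf{R} \begin{bmatrix} \mathbf{D}^{1,1} & \mathbf{D}^{1,2} \end{bmatrix} \begin{bmatrix} \mathbf{b}_i^{1,1} \\ \mathbf{b}_i^{1,2} \end{bmatrix} \right\|_2^2 \\
\text{s.t.} \quad & \mathbf{R}^T \mathbf{R} = \mathbf{I} \\
& \mathbf{b}_i^{1,c} \in \{0, 1\}^K \\
& \|\mathbf{b}_i^{1,c}\|_1 = 1
\end{aligned}
$$

and they have identical objective function values.

In Theorem 2, the code length of both approaches is $M/C \times C \times \log_2(K) = M \log_2(K)$, which ensures that the distortion error of OCKM is not larger than that of CKM with the same code length.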
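The $C = 2$, $M = 2$ example can be verified numerically. The sketch assumes NumPy and uses random stand-ins for the CKM sub codebooks and codes; the rotation is omitted because both formulations share the same $\mathbf{R}$.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
S, K, N = 2, 4, 10
D1, D2 = rng.normal(size=(S, K)), rng.normal(size=(S, K))         # stand-in CKM sub codebooks
k1, k2 = rng.integers(0, K, size=N), rng.integers(0, K, size=N)   # stand-in CKM codes

# CKM reconstruction with M = 2: stack the two selected sub codewords.
recon_ck = np.vstack([D1[:, k1], D2[:, k2]])

# OCKM with M' = 1 and C = 2: zero-padded sub codebooks over the full space.
D11 = np.vstack([D1, np.zeros((S, K))])
D12 = np.vstack([np.zeros((S, K)), D2])
recon_ock = D11[:, k1] + D12[:, k2]

# Code lengths in bits for both schemes.
bits_ck = 2 * math.log2(K)          # M * log2(K)
bits_ock = 1 * 2 * math.log2(K)     # M' * C * log2(K)
```

The two reconstructions coincide entry by entry, so the CKM solution is a special case of OCKM at the same code length in which each sub codebook is restricted to a zero-padded block.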


Theorem 1 and Theorem 2 guarantee the advantages of our OCKM with multiple sub codebooks over the approach with a single sub codebook.

5.2 Inequality Constraints or Equality Constraints

One may expect to replace the equality constraint $\|\mathbf{b}_i^{m,c}\|_1 = 1$ in Eqn. 10 with the inequality, i.e.

$$\|\mathbf{b}_i^{m,c}\|_1 \le 1. \tag{21}$$

This can potentially give a lower distortion under the same $M$ and $K$. However, under the same code length, this inequality constraint cannot be better than the equality constraint.

For the inequality case, there are $K + 1$ different values for $\mathbf{b}^{m,c}_{i,inequality}$, i.e. $\|\mathbf{b}^{m,c}_{i,inequality}\|_1 = 0$ or $1$. The subscripts equality and inequality are used for the problem with the equality constraint and that with the inequality constraint, respectively. Then, the code length is $MC \log_2(K + 1)$.

With the same code length, the equality case can con-sumeK + 1 sub codewords on each subvector. The sizeof D

m,cequality is S × (K + 1), and the size ofbm,c

i,equality is(K + 1)× 1.

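From any feasible solution of the inequality case, we can derive a feasible solution of the equality case with the same objective function value, i.e.

$$
\begin{aligned}
R_{\mathrm{equality}} &= R_{\mathrm{inequality}}, \\
D^{m,c}_{\mathrm{equality}} &= \begin{bmatrix} D^{m,c}_{\mathrm{inequality}}, & 0_{S \times 1} \end{bmatrix}, \\
b^{m,c}_{\mathrm{equality}} &=
\begin{cases}
\begin{bmatrix} b^{m,c}_{\mathrm{inequality}} \\ 0 \end{bmatrix} & \text{if } \|b^{m,c}_{\mathrm{inequality}}\|_1 = 1, \\[2ex]
\begin{bmatrix} 0_{K \times 1} \\ 1 \end{bmatrix} & \text{if } \|b^{m,c}_{\mathrm{inequality}}\|_1 = 0.
\end{cases}
\end{aligned}
$$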

In the equality case, the last sub codeword is enforced to be $0_{S \times 1}$, and the other sub codewords are copied from the inequality case. If $b^{m,c}_{\mathrm{inequality}}$ is all 0s, the entry of $b^{m,c}_{\mathrm{equality}}$ corresponding to the last sub codeword is set to 1; otherwise, $b^{m,c}_{\mathrm{equality}}$ follows $b^{m,c}_{\mathrm{inequality}}$ and its last entry is 0. This ensures that the product $D^{m,c}_{\mathrm{equality}} b^{m,c}_{\mathrm{equality}}$ equals $D^{m,c}_{\mathrm{inequality}} b^{m,c}_{\mathrm{inequality}}$.

The objective function value remains the same, while with its own optimal solution the equality case may obtain a lower distortion.
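The mapping from an inequality-constrained solution to an equality-constrained one can be illustrated with a few lines of code. This is a minimal sketch with assumed names (`matvec`, `to_equality`) and made-up toy values; it only shows that appending a zero sub codeword preserves the product D b.

```python
def matvec(D, b):
    """Multiply a matrix D (list of rows) by a vector b."""
    return [sum(d * v for d, v in zip(row, b)) for row in D]


def to_equality(D_ineq, b_ineq):
    """Append a zero sub codeword and select it when b_ineq is all zeros.

    D_ineq is S x K and b_ineq has K binary entries with at most one 1.
    The result is S x (K+1) and a (K+1)-vector with exactly one 1.
    """
    D_eq = [row + [0.0] for row in D_ineq]
    b_eq = list(b_ineq) + [1 if sum(b_ineq) == 0 else 0]
    return D_eq, b_eq


# Toy sub codebook (illustrative values): S = 2, K = 3.
D = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

for b in ([0, 1, 0], [0, 0, 0]):
    D_eq, b_eq = to_equality(D, b)
    print(b, "->", b_eq, "same product:", matvec(D_eq, b_eq) == matvec(D, b))
```

Both cases of the construction are exercised: a code that selects a codeword keeps its selection, and an all-zero code is redirected to the appended zero codeword.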

5.3 Implementation

In OCKM and ECKM, there are three kinds of variables to optimize: the rotation matrix $R$, the sub codebooks $D^{m}$ or $D^{m,c}$, and the codes $b_i^{m}$ or $b_i^{m,c}$. In our implementation, $R$ is initialized as the identity matrix $I$. The sub codebooks $D^{m}$ and $D^{m,c}$ are initialized by randomly choosing data points on the corresponding subvector.

The solutions of $R$, $D^{m}$ and $D^{m,c}$ are optimal in the iterative optimization process, but the solutions of $b_i^{m}$ and $b_i^{m,c}$ are suboptimal. To guarantee that the objective function value is non-increasing in the iterative coordinate descent algorithm, we update $b_i^{m}$ or $b_i^{m,c}$ only if the codes from Alg. 1 or Alg. 2 provide a lower distortion error.

Algorithm 4 Optimization of OCKM

Input: $\{x_i\}_{i=1}^{N}$, $M$

Output: $R$, $\{D^{m,c}\}_{m=1,c=1}^{M,C}$, and $\{b_i^{m,c}\}_{i=1,m=1,c=1}^{N,M,C}$

    R = I
    Randomly initialize $\{D^{m,c}\}_{m=1,c=1}^{M,C}$ from the data set.
    Update $\{b_i^{m,c}\}_{i=1,m=1,c=1}^{N,M,C}$ by Alg. 2
    while !converged do
        Update R
        Update $\{D^{m,c}\}_{m=1,c=1}^{M,C}$
        for i = 1 : N do
            for m = 1 : M do
                Get new $b_i^{m,c}$ from Alg. 2
                if $g_{\mathrm{ock}}$(new $b_i^{m}$) < $g_{\mathrm{ock}}$($b_i^{m}$) then
                    $b_i^{m}$ = new $b_i^{m}$
                end if
            end for
        end for
    end while

The whole algorithm of OCKM is shown in Alg. 4, and the one of ECKM can be similarly obtained.
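The guarded code update in the inner loop of Alg. 4 can be sketched for a single subvector as follows. Alg. 2 is not reproduced in this section, so `greedy_proposal` below is a hypothetical stand-in (a greedy residual search) and not the paper's encoder; the other names and the toy values are also assumptions. The point of the sketch is only the acceptance test: a proposed code replaces the current one only if it strictly lowers the distortion, so the objective cannot increase in this step.

```python
def reconstruct(codebooks, idx):
    """Sum the selected sub codeword from each of the C sub codebooks.

    codebooks[c][k] is the k-th sub codeword (length S) of sub codebook c.
    """
    out = [0.0] * len(codebooks[0][0])
    for cb, k in zip(codebooks, idx):
        out = [o + v for o, v in zip(out, cb[k])]
    return out


def distortion(x, codebooks, idx):
    """Squared error between the subvector x and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, reconstruct(codebooks, idx)))


def greedy_proposal(x, codebooks):
    """Hypothetical stand-in for Alg. 2: pick codewords one sub codebook
    at a time, each time choosing the one closest to the current residual."""
    residual, idx = list(x), []
    for cb in codebooks:
        k = min(range(len(cb)),
                key=lambda j: sum((r - v) ** 2 for r, v in zip(residual, cb[j])))
        idx.append(k)
        residual = [r - v for r, v in zip(residual, cb[k])]
    return idx


def guarded_update(x, codebooks, current_idx):
    """Keep the current code unless the proposal strictly lowers the distortion."""
    new_idx = greedy_proposal(x, codebooks)
    if distortion(x, codebooks, new_idx) < distortion(x, codebooks, current_idx):
        return new_idx
    return current_idx


# Toy subvector with C = 2 sub codebooks of K = 2 codewords (illustrative values).
x = [1.0, 1.0]
codebooks = [
    [[0.9, 0.9], [1.0, 0.0]],
    [[0.0, 1.0], [0.5, 0.5]],
]
print(guarded_update(x, codebooks, [1, 0]))  # current code is already exact
print(guarded_update(x, codebooks, [0, 0]))  # proposal improves on this code
```

In the first call the suboptimal proposal is rejected because the current code reconstructs x exactly; in the second it is accepted because it reduces the error.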

The distortion errors of OCKM with different numbers of iterations are shown in Fig. 2 on SIFT1M (see Sec. 6.1.1 for the dataset description), and we use 100 iterations throughout all the experiments. The optimization scheme is fast; for instance, on the training set of SIFT1M, the time cost of each iteration is about 4.2 seconds in our implementation. (All the experiments are conducted on a server with an Intel Xeon 2.9GHz CPU.)

5.4 Distance Approximation for ANN search

In this subsection, we discuss the methods of the Euclidean ANN search by OCKM, and analyze the query time. Since ECKM is a special case of OCKM, we only discuss OCKM.

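Let $q \in \mathbb{R}^{D}$ be the query point. The approximate distance to $x_i$, encoded as $b_i^T \triangleq [b_i^{1T} \cdots b_i^{MT}]$, is

$$
\begin{aligned}
\mathrm{dist}_{AD}(q, b_i) &= \|q - R D b_i\|_2^2 \qquad (22) \\
&= \|q\|_2^2 - 2 \sum_{m=1}^{M} \sum_{c=1}^{C} z^{mT} (D^{m,c} b_i^{m,c}) + \|D b_i\|_2^2 \\
&\propto \frac{1}{2}\|q\|_2^2 - \sum_{m=1}^{M} \sum_{c=1}^{C} z^{mT} (D^{m,c} b_i^{m,c}) + \frac{1}{2}\|D b_i\|_2^2, \qquad (23)
\end{aligned}
$$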

where $z^{m}$ is the m-th subvector of $R^T q$.

The first item $\|q\|_2^2/2$ is constant across all the database points and can be ignored in the comparison. The third item $\|D b_i\|_2^2/2$ is independent of the query point. Thus, it is precomputed once as a lookup table for all the queries. This precomputation cost is not low compared with the linear scan cost for a single query, but is negligible for a large number of queries, which is the case in real applications. Moreover, this term is computed using only the binary code $b_i$, and no access to the original $x_i$ is required. For the second item, we can precompute $\{-z^{mT} d_k^{m,c}\}_{k=1,m=1,c=1}^{K,M,C}$ and store it as the lookup tables. Then there are MC + 1 table lookups and MC + 1 addition operations to calculate the distance. The 1 corresponds to the third item of Eqn. 23.

Fig. 2. Distortion vs. the number of iterations on the training set of SIFT1M with M = 8, K = 256 and C = 2. [Plot: x-axis, number of iterations; y-axis, distortion.]

If the query point is also represented by binary codes, denoted as $b_q$, we can recover $q$ as $q' \triangleq R D b_q$. Then the approximate distance to any database point takes the same form as Eqn. 22, i.e.

$$
\mathrm{dist}_{SD}(b_q, b_i) = \mathrm{dist}_{AD}(q', b_i). \qquad (24)
$$

Eqn. 22 is usually referred to as the asymmetric distance, and Eqn. 24 as the symmetric distance. Since the symmetric distance encodes both the query and the database points, its accuracy is generally lower than that of the asymmetric distance, which encodes only the database points.
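The lookup-table evaluation of Eqn. 23 and the symmetric variant of Eqn. 24 can be sketched as below. The data layout (`codebooks[m][c][k]` holding the k-th sub codeword, `code[m][c]` holding its index), the function names, and the toy values are assumptions made for illustration. The query is taken in rotated coordinates, i.e. the caller is assumed to have already computed $z = R^T q$; since $R$ is orthogonal, $\|q\|_2 = \|z\|_2$.

```python
from math import isclose


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decode(codebooks, code):
    """Return D b: per subspace, the sum of the C selected sub codewords."""
    out = []
    for cbs, idx in zip(codebooks, code):
        sub = [0.0] * len(cbs[0][0])
        for cb, k in zip(cbs, idx):
            sub = [s + v for s, v in zip(sub, cb[k])]
        out.extend(sub)
    return out


def half_sq_norm(codebooks, code):
    """Third item of Eqn. 23; depends only on the code, so it is precomputable."""
    return 0.5 * dot(decode(codebooks, code), decode(codebooks, code))


def build_tables(z, codebooks, S):
    """tables[m][c][k] = -<z^m, d_k^{m,c}>, the second item of Eqn. 23."""
    return [[[-dot(z[m * S:(m + 1) * S], d) for d in cb] for cb in cbs]
            for m, cbs in enumerate(codebooks)]


def adc_score(tables, code, norm_term):
    """MC table lookups for the cross term plus one lookup for the norm term."""
    score = norm_term
    for row, idx in zip(tables, code):
        for table, k in zip(row, idx):
            score += table[k]
    return score


def asymmetric_distance(z, codebooks, code, S):
    """Full squared distance of Eqn. 22, recovered as ||z||^2 + 2 * score."""
    tables = build_tables(z, codebooks, S)
    return dot(z, z) + 2.0 * adc_score(tables, code, half_sq_norm(codebooks, code))


def symmetric_distance(codebooks, code_q, code_i, S):
    """Eqn. 24: decode the query code first, then reuse the asymmetric form."""
    return asymmetric_distance(decode(codebooks, code_q), codebooks, code_i, S)


# Toy setup (illustrative values): M = 2, C = 2, K = 2, S = 2.
codebooks = [
    [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [-0.5, 0.5]]],
    [[[2.0, 1.0], [1.0, 2.0]], [[0.0, 0.0], [1.0, -1.0]]],
]
z = [1.0, 2.0, 3.0, 4.0]      # rotated query, z = R^T q
code = [[1, 0], [0, 1]]       # one index per sub codebook

y = decode(codebooks, code)
direct = sum((a - b) ** 2 for a, b in zip(z, y))
via_tables = asymmetric_distance(z, codebooks, code, S=2)
print(direct, via_tables, isclose(direct, via_tables))
```

In a real scan the tables would be built once per query and the norm terms once per database, so each comparison reduces to the MC + 1 lookups and additions counted above; the sketch rebuilds them per call only to stay short.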

Analysis of query time. We adopt an exhaustive search in which each database point is compared against the query point and the points with the smallest approximate distances are returned. The exhaustive search scheme is fast in practice because each comparison only requires a few table lookups and addition operations.

Table 2 lists the code length and the comparison among PQ, CKM and our OCKM for exhaustive search. Under the same code length, OCKM consumes only one more table lookup and one more addition than the others. Considering the other computations in querying, the differences in time cost are minor in practice.

Take $M_{\mathrm{ck}} = 8$, $K = 256$, $C = 2$, $M_{\mathrm{ock}} = 4$ as an example. The code lengths of OCKM and CKM are both 64 bits. The number of table lookups is 9 for OCKM and 8 for CKM. With these configurations on the SIFT1M data set, the exhaustive querying over 1 million database points costs about 24.3 ms for OCKM and 23.5 ms for CKM in our implementation. Thus, the online query time is comparable with the state-of-the-art approaches, but the proposed approach can potentially provide better accuracy.

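TABLE 2. Comparison in terms of the code length, the number of table lookups and the number of addition operations for exhaustive search.

|                  | OCKM           | CKM [24]      | PQ [9]        |
|------------------|----------------|---------------|---------------|
| Code length      | $MC \log_2(K)$ | $M \log_2(K)$ | $M \log_2(K)$ |
| #(Table lookups) | $MC + 1$       | $M$           | $M$           |
| #(Additions)     | $MC + 1$       | $M$           | $M$           |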
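The entries of Table 2 are simple arithmetic, and the 64-bit example from the text can be reproduced directly. The function names below are assumptions made for this illustration.

```python
from math import log2


def ockm_cost(M, C, K):
    """Code length in bits, table lookups and additions per comparison for OCKM."""
    return {"bits": M * C * log2(K), "lookups": M * C + 1, "additions": M * C + 1}


def ckm_or_pq_cost(M, K):
    """The same quantities for CKM and PQ, which share one row shape in Table 2."""
    return {"bits": M * log2(K), "lookups": M, "additions": M}


# Configuration from the text: M_ck = 8, K = 256, C = 2, M_ock = 4.
print("OCKM:", ockm_cost(M=4, C=2, K=256))
print("CKM: ", ckm_or_pq_cost(M=8, K=256))
```

With these settings both methods use 64 bits per point, and OCKM needs 9 lookups against 8 for CKM, matching the figures quoted above.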

6 EXPERIMENTS

6.1 Settings

6.1.1 Datasets

Experiments are conducted on three widely used high-dimensional datasets: SIFT1M [9], GIST1M [9], and SIFT1B [9]. Each dataset comprises one training set (from which the parameters are learned), one query set, and one database (on which the search is performed). SIFT1M provides $10^5$ training points, $10^4$ query points and $10^6$ database points, with each point being a 128-dimensional SIFT descriptor of local image structures around the feature points. GIST1M provides $5 \times 10^5$ training points, $10^3$ query points and $10^6$ database points, with each point being a 960-dimensional GIST feature. SIFT1B is composed of $10^8$ training points, $10^4$ query points and as many as $10^9$ database points. Following [24], we use the first $10^6$ training points on the SIFT1B dataset. The whole training set is used on SIFT1M and GIST1M.

6.1.2 Criteria

ANN search is conducted to evaluate our proposed approaches, and three indicators are reported.

• Distortion: distortion here refers to the sum of the squared loss after representing each point as the binary codes or the indices of the sub codewords. Generally speaking, the accuracy is better with a lower distortion.

• Recall: recall is the proportion of queries for which the true nearest neighbor falls within the top ranked vectors by the approximate distance.

• Mean overall ratio: mean overall ratio [34] reflects the general quality of all top ranked neighbors. Let $r_i$ be the i-th nearest vector of a query $q$ under the exact Euclidean distance, and $r_i^*$ be the i-th point of the ranking list by the approximate distance. The rank-i ratio, denoted by $R_i(q)$, is

$$
R_i(q) = \frac{\|q - r_i^*\|_2}{\|q - r_i\|_2}.
$$

  The overall ratio is the mean of all $R_i(q)$, i.e.

$$
\frac{1}{k} \sum_{i=1}^{k} R_i(q).
$$

  The mean overall ratio is the mean of the overall ratios of all the queries. When the approximate results are the same as the exact search results, the overall ratio is 1. The performance is better with a lower mean overall ratio.
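The recall and mean overall ratio defined above can be computed as in the following sketch. The function names and the toy inputs are assumptions for illustration; the code also assumes that no exact neighbor coincides with the query, since that would make a rank-i ratio undefined.

```python
from math import dist, isclose


def recall_at(true_nn_ids, ranked_ids, T):
    """Fraction of queries whose true nearest neighbor is in the top T results."""
    hits = sum(1 for nn, ranked in zip(true_nn_ids, ranked_ids) if nn in ranked[:T])
    return hits / len(true_nn_ids)


def overall_ratio(q, exact_ranked, approx_ranked, k):
    """Mean of the rank-i ratios R_i(q) for i = 1..k (lists hold the points themselves)."""
    return sum(dist(q, approx_ranked[i]) / dist(q, exact_ranked[i]) for i in range(k)) / k


def mean_overall_ratio(queries, exact_lists, approx_lists, k):
    """Average of the overall ratio over all queries."""
    ratios = [overall_ratio(q, e, a, k)
              for q, e, a in zip(queries, exact_lists, approx_lists)]
    return sum(ratios) / len(ratios)


# Toy 1-D example: the approximate ranking swaps the 2nd and 3rd neighbors.
q = [0.0]
exact = [[1.0], [2.0], [3.0]]
approx = [[1.0], [3.0], [2.0]]
print(overall_ratio(q, exact, approx, k=3))      # (1 + 3/2 + 2/3) / 3
print(overall_ratio(q, exact, exact, k=3))       # identical rankings give 1

# Recall over two queries, given the id of each true nearest neighbor.
print(recall_at([7, 3], [[5, 7, 1], [2, 9, 3]], T=2))
```

The first query's true neighbor appears within the top 2 while the second's appears only at rank 3, so the recall at T = 2 is one half.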


Fig. 3. Distortion on the training set. (a) SIFT1M; (b) GIST1M. [Plots: distortion vs. M for OCKM, ECKM, CKM and PQ.]

Fig. 4. Distortion on the database set. (a) SIFT1M; (b) GIST1M. [Plots: distortion vs. M for OCKM, ECKM, CKM and PQ.]

6.1.3 Approaches

We compare our Optimized Cartesian K-Means (OCKM) with Product Quantization (PQ) [9] and Cartesian K-Means (CKM) [24]. Besides, the results of our extended Cartesian K-Means (ECKM) are also reported. Following [24], we set K = 256 to make the lookup tables small and to fit each sub index into one byte.

A suffix ‘-A’ or ‘-S’ is appended to the name of each approach to indicate whether the asymmetric distance or the symmetric distance is used in ANN search. For example, OCKM-A means that the database points are encoded by OCKM and the asymmetric distance is used to rank all the database points.

We do not compare with other state-of-the-art hashing algorithms, such as spectral hashing (SH) [40] and iterative quantization (ITQ) hashing [5], because it has been demonstrated that PQ is superior to SH [9] and that CKM is better than ITQ [24].

6.2 Results

6.2.1 Comparison with the number of subvectors fixed

The distortion errors on the training set and database set are illustrated in Fig. 3 and Fig. 4, respectively. In both figures, our OCKM achieves the lowest distortion, followed by ECKM. This is because, under the same M, both CKM and ECKM are special cases of OCKM, as discussed in Theorem 1.

Fig. 5 and Fig. 6 show the recall and the mean overall ratio for ANN search at the 10-th top ranked point, respectively. With the same type of approximate distance, our approach OCKM achieves the best performance: the highest recall and the lowest mean overall ratio. With the lowest distortion errors demonstrated in Fig. 3 and Fig. 4, OCKM is more accurate for encoding the data points.

Fig. 5. Recall for ANN search at the 10-th top ranked point. (a) SIFT1M; (b) GIST1M. [Plots: recall vs. M for the -A and -S variants of OCKM, ECKM, CKM and PQ.]

Fig. 6. Mean overall ratio for ANN search at the 10-th top ranked point. (a) SIFT1M; (b) GIST1M. [Plots: mean overall ratio vs. M for the -A and -S variants of OCKM, ECKM, CKM and PQ.]

6.2.2 Comparison with the code length fixed

We use $M_{\mathrm{ock}}$, $M_{\mathrm{eck}}$, $M_{\mathrm{ck}}$, $M_{\mathrm{pq}}$ to denote the number of subvectors in OCKM, ECKM, CKM, and PQ, respectively. The code length of CKM is $M_{\mathrm{ck}} \log_2(K)$, while the code length of OCKM is $M_{\mathrm{ock}} C \log_2(K)$. Fixing C = 2 following the analysis in Sec. 4.1.3, we set $M_{\mathrm{ock}} = M_{\mathrm{ck}}/2$ with $M_{\mathrm{ck}}$ being 4, 8, and 16 for code lengths 32, 64 and 128, respectively. $M_{\mathrm{pq}}$ is identical to $M_{\mathrm{ck}}$, while $M_{\mathrm{eck}}$ is identical to $M_{\mathrm{ock}}$. In this way, the code length is identical across all the approaches.

The results in terms of recall on SIFT1M, GIST1M, and SIFT1B are shown in Fig. 7. From these results, we can see that:

• Generally, our OCKM outperforms all the others under the same type of approximate distance. For example, for the asymmetric distance with 64 bits, the improvement of OCKM is about 5 percent on SIFT1M in Fig. 7 (b), 4 percent on GIST1M in Fig. 7 (e), and 4 percent on SIFT1B in Fig. 7 (h) at the 10-th top ranked point. The performance of OCKM mainly benefits from the low distortion errors, which is also discussed in Theorem 2. Fig. 8 illustrates the distortion on the database under the same code length for SIFT1M and GIST1M. We can see that, under the same code length, our approach achieves the lowest distortion.

• The improvement is even larger with a smaller code length. To present the observation more clearly, we extract the recall at the 100-th nearest neighbor from Fig. 7 and plot it in Fig. 9. With a larger code length, the recalls of both our OCKM and the second best, CKM, approach 1. With a smaller code length, our OCKM gains a larger improvement.

• ECKM is not quite competitive under the same code length. A possible reason is that its number of sub codebooks is smaller than those of the others. Take the code length of 64 bits as an example. There are 8 subvectors and each has one sub codebook for PQ and CKM, resulting in 8 sub codebooks. OCKM is equipped with 4 subvectors, but each has two sub codebooks, also resulting in 8 sub codebooks. Comparatively, ECKM has 4 subvectors, each of which has one sub codebook, and there are only 4 sub codebooks in total. A smaller number of sub codebooks may degrade the performance of ECKM. In contrast to SIFT1M and SIFT1B, on GIST1M ECKM achieves even better results than PQ, which indicates that GIST1M is more sensitive to the rotation.

Fig. 7. Recall for ANN search. The first row corresponds to SIFT1M; the second to GIST1M; and the third to SIFT1B. The code lengths are 32, 64 and 128 from the left-most column to the right-most. [Panels (a) to (i): recall vs. number of retrieved points.]

Fig. 8. Distortion under the same code length on the database set. (a) SIFT1M; (b) GIST1M. [Plots: distortion vs. code length for OCKM, ECKM, CKM and PQ.]

Fig. 10 illustrates the experimental results in terms of mean overall ratio with different code lengths on SIFT1M and GIST1M. Mean overall ratio captures the overall quality of the returned points, while the recall captures the position of the nearest neighbor and ignores the quality of the other points. Under this criterion, our OCKM achieves the lowest mean overall ratio and outperforms all the others. This implies that the nearest neighbors returned by OCKM are of high quality and close to the query points.

Fig. 9. Recall at the 100-th top ranked point under the same code length. (a) SIFT1M; (b) GIST1M; (c) SIFT1B. [Plots: recall at the 100-th nearest neighbor vs. code length.]

7 CONCLUSION

In this paper, we proposed the Optimized Cartesian K-Means (OCKM) algorithm to encode high-dimensional data points for approximate nearest neighbor search. The key idea of OCKM is that, in each subspace, multiple sub codebooks are generated and each sub codebook contributes one sub codeword for encoding the subvector. The benefit is that it reduces the quantization error with comparable query time under the same code length. The theoretical analysis and experimental results show that OCKM achieves superior performance for ANN search over state-of-the-art approaches.

ACKNOWLEDGMENT

This work was partially supported by the National Basic Research Program of China (973 Program) under Grant 2014CB347600 and ARC Discovery Project DP130103252.

REFERENCES

[1] S. Arya and D. M. Mount. Approximate nearest neighbor queries in fixed dimensions. In SODA, pages 271–280, 1993.

[2] L. Cao, Z. Li, Y. Mu, and S.-F. Chang. Submodular video hashing: a unified framework towards video pooling and indexing. In ACM Multimedia, pages 299–308, 2012.

[3] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, pages 253–262, 2004.

[4] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw., 3(3):209–226, 1977.

[5] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, pages 817–824, 2011.

[6] J. He, J. Feng, X. Liu, T. Cheng, T.-H. Lin, H. Chung, and S.-F. Chang. Mobile product search with bag of hash bits and boundary reranking. In CVPR, pages 3005–3012, 2012.

[7] J. He, W. Liu, and S. Chang. Scalable similarity search with optimized kernel hashing. In KDD, pages 1129–1138, 2010.

[8] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, 1998.

[9] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., pages 117–128, 2011.

[10] J. Ji, J. Li, S. Yan, B. Zhang, and Q. Tian. Super-bit locality-sensitive hashing. In NIPS, pages 108–116, 2012.

[11] Y. Jia, J. Wang, G. Zeng, H. Zha, and X.-S. Hua. Optimizing kd-trees for scalable visual descriptor indexing. In CVPR, pages 3392–3399, 2010.

[12] W. Kong and W.-J. Li. Isotropic hashing. In NIPS, pages 1655–1663, 2012.

[13] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, pages 1042–1050, 2009.

[14] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing. IEEE Trans. Pattern Anal. Mach. Intell., 34(6):1092–1104, 2012.

[15] Y.-H. Kuo, K.-T. Chen, C.-H. Chiang, and W. H. Hsu. Query expansion for hash-based image object retrieval. In ACM Multimedia, pages 65–74, 2009.

[16] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081, 2012.

[17] X. Liu, J. He, D. Liu, and B. Lang. Compact kernel hashing with multiple features. In ACM Multimedia, pages 881–884, 2012.

[18] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–136, 1982.

[19] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, page 14, 1967.

[20] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, pages 3397–3415, 1993.

[21] Y. Mu and S. Yan. Non-metric locality-sensitive hashing. In AAAI, 2010.

[22] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP (1), pages 331–340, 2009.

[23] M. Norouzi and D. Fleet. Minimal loss hashing for compact binary codes. In ICML, pages 353–360, 2011.

[24] M. Norouzi and D. J. Fleet. Cartesian k-means. In CVPR, pages 3017–3024, 2013.

[25] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS, pages 1509–1517, 2009.

[26] J. Revaud, M. Douze, C. Schmid, and H. Jegou. Event retrieval in large video collections with circulant temporal encoding. In CVPR, 2013.

[27] R. Salakhutdinov and G. Hinton. Semantic hashing. Int. J. Approx. Reasoning, 50(7):969–978, 2009.

[28] P. H. Schonemann. A generalized solution of the orthogonal procrustes problem. Psychometrika, 31(1):1–10, 1966.

[29] G. Shakhnarovich, T. Darrell, and P. Indyk. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. The MIT Press, 2006.

[30] C. Silpa-Anan and R. Hartley. Optimised kd-trees for fast image descriptor matching. In CVPR, 2008.

[31] J. Song, Y. Yang, Z. Huang, H. Shen, and R. Hong. Multiple feature hashing for real-time large scale near-duplicate video retrieval. In ACM Multimedia, pages 423–432, 2011.


Fig. 10. Mean overall ratio for ANN search. The results in the first row are on SIFT1M while those in the second row are on GIST1M. The first column corresponds to the code length 32; the second to 64; and the third to 128. [Plots: mean overall ratio vs. number of retrieved points.]

[32] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In SIGMOD, pages 785–796, 2013.

[33] C. Strecha, A. Bronstein, M. Bronstein, and P. Fua. Ldahash: Improved matching with smaller descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 34(1):66–78, 2012.

[34] Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Database Syst., 35(3), 2010.

[35] J. Wang and S. Li. Query-driven iterated neighborhood graph search for large scale indexing. In ACM Multimedia, pages 179–188, 2012.

[36] J. Wang, J. Wang, N. Yu, and S. Li. Order preserving hashing for approximate nearest neighbor search. In ACM Multimedia, pages 133–142, 2013.

[37] J. Wang, J. Wang, G. Zeng, R. Gan, S. Li, and B. Guo. Fast neighborhood graph search using cartesian concatenation. In ICCV, pages 2128–2135, 2013.

[38] J. Wang, J. Wang, G. Zeng, R. Gan, S. Li, and B. Guo. Fast neighborhood graph search using cartesian concatenation. CoRR, abs/1312.3062, 2013.

[39] J. Wang, N. Wang, Y. Jia, J. Li, G. Zeng, H. Zha, and X.-S. Hua. Trinary-projection trees for approximate nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 36(2):388–403, 2014.

[40] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, pages 1753–1760, 2008.

[41] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu. Complementary hashing for approximate nearest neighbor search. In ICCV, pages 1631–1638, 2011.

[42] X. Zhu, Z. Huang, H. Cheng, J. Cui, and H. T. Shen. Sparse hashing for fast multimedia search. ACM Trans. Inf. Syst., 31(2):9, 2013.

[43] X. Zhu, Z. Huang, H. T. Shen, and X. Zhao. Linear cross-modal hashing for effective multimedia search. In ACM Multimedia, 2013.

Jianfeng Wang received his B.Eng. degree from the Department of Electronic Engineering and Information Science in the University of Science and Technology of China (USTC) in 2010. Currently, he is a PhD student in MOE-Microsoft Key Laboratory of Multimedia Computing and Communication, USTC. His research interests include multimedia retrieval, machine learning and its applications.

Jingdong Wang received the BSc and MSc degrees in Automation from Tsinghua University, Beijing, China, in 2001 and 2004, respectively, and the PhD degree in Computer Science from the Hong Kong University of Science and Technology, Hong Kong, in 2007. He is currently a Lead Researcher at the Visual Computing Group, Microsoft Research, Beijing, P.R. China. His areas of interest include computer vision, machine learning, and multimedia search. At present, he is mainly working on the Big Media project, including large-scale indexing and clustering, and Web image search and mining. He is an editorial board member of Multimedia Tools and Applications.


Jingkuan Song is currently a Research Fellow in University of Trento, Italy. He received his Ph.D. degree from The University of Queensland, and BS degree in Software Engineering from University of Electronic Science and Technology of China. His research interests include large-scale multimedia search, computer vision and machine learning.

Xin-Shun Xu received his M.S. and Ph.D. degrees in computer science from Shandong University, China, in 2002, and Toyama University, Japan, in 2005, respectively. He joined the School of Computer Science and Technology at Shandong University as an associate professor in 2005, and joined the LAMDA group of the National Key Laboratory for Novel Software Technology, Nanjing University, China, as a postdoctoral fellow in 2009. Currently, he is a professor of the School of Computer Science and Technology at Shandong University, and the leader of the MIMA (Machine Intelligence and Media Analysis) group of Shandong University. His research interests include machine learning, information retrieval, data mining, bioinformatics, and image/video analysis.

Heng Tao Shen is a Professor of Computer Science in School of Information Technology and Electrical Engineering, The University of Queensland. He obtained his B.Sc. (with 1st class Honours) and Ph.D. from Department of Computer Science, National University of Singapore in 2000 and 2004 respectively. He then joined the University of Queensland as a Lecturer and became a Professor in 2011. His research interests include Multimedia/Mobile/Web Search and Big Data Management. He is the winner of Chris Wallace Award for outstanding Research Contribution in 2010 from CORE Australasia. He is an Associate Editor of IEEE TKDE, and will serve as a PC Co-Chair for ACM Multimedia 2015.

Shipeng Li joined and helped to found Microsoft Research’s Beijing lab in May 1999. He is now a Principal Researcher and Research Area Manager coordinating multimedia research activities in the lab. His research interests include multimedia processing, analysis, coding, streaming, networking and communications. From Oct. 1996 to May 1999, Dr. Li was with Multimedia Technology Laboratory at Sarnoff Corporation as a Member of Technical Staff. Dr. Li has been actively involved in research and development in broad multimedia areas and international standards. He has authored and co-authored 6 books/book chapters and 280+ refereed journal and conference papers. He holds 140+ granted US patents. Dr. Li received his B.S. and M.S. in Electrical Engineering (EE) from the University of Science and Technology of China (USTC), Hefei, China in 1988 and 1991, respectively. He received his Ph.D. in EE from Lehigh University, Bethlehem, PA, USA in 1996. He was a faculty member in Department of Electronic Engineering and Information Science at USTC in 1991-1992. Dr. Li received the Best Paper Award in IEEE Transaction on Circuits and Systems for Video Technology (2009). Dr. Li is a Fellow of IEEE.