Multiview Alignment Hashing for Efficient Image Search

Li Liu, Mengyang Yu, Student Member, IEEE, and Ling Shao, Senior Member, IEEE

Abstract—Hashing is a popular and efficient method for nearest neighbor search in large-scale data spaces, which embeds high-dimensional feature descriptors into a similarity-preserving Hamming space of low dimension. For most hashing methods, the retrieval performance depends heavily on the choice of the high-dimensional feature descriptor. Furthermore, a single type of feature cannot be descriptive enough for different images when it is used for hashing. Thus, how to combine multiple representations for learning effective hashing functions is a pressing task. In this paper, we present a novel unsupervised Multiview Alignment Hashing (MAH) approach based on Regularized Kernel Nonnegative Matrix Factorization (RKNMF), which can find a compact representation that uncovers the hidden semantics and simultaneously respects the joint probability distribution of the data. Specifically, we seek a matrix factorization that effectively fuses the multiple information sources while discarding feature redundancy. Since the resulting problem is nonconvex and discrete, our objective function is optimized via an alternate procedure with relaxation and converges to a locally optimal solution. After finding the low-dimensional representation, the hashing functions are finally obtained through multivariable logistic regression. The proposed method is systematically evaluated on three datasets: Caltech-256, CIFAR-10 and CIFAR-20, and the results show that our method significantly outperforms the state-of-the-art multiview hashing techniques.

Index Terms—Hashing, Multiview, NMF, Alternate optimization, Logistic regression, Image similarity search

I. INTRODUCTION

LEARNING discriminative embeddings has been a critical problem in many fields of information processing and analysis, such as object recognition [1], [2], image/video retrieval [3], [4] and visual detection [5]. Among these, scalable retrieval of similar visual information is particularly attractive, since, with the advances of computer technologies and the development of the World Wide Web, a huge amount of digital data has been generated and applied. The most basic but essential scheme for similarity search is nearest neighbor (NN) search: given a query image, find the image most similar to it within a large database and assign the label of that nearest neighbor to the query image. NN search is a linear search scheme (O(N)), which is not scalable due to the large sample size of practical datasets. Later, to overcome this computational complexity problem,

Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

L. Liu, M. Yu and L. Shao are with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, NE1 8ST, U.K. (e-mail: [email protected]; [email protected]; [email protected]).

some tree-based search schemes were proposed to partition the data space via various tree structures. Among them, the KD-tree and R-tree [6] have been successfully applied to index data for fast query responses. However, these methods cannot operate with high-dimensional data and do not guarantee a faster search compared to a linear scan. In fact, most vision-based tasks suffer from the curse of dimensionality¹, because visual descriptors usually have hundreds or even thousands of dimensions. Thus, hashing schemes have been proposed to effectively embed data from a high-dimensional feature space into a similarity-preserving low-dimensional Hamming space where an approximate nearest neighbor of a given query can be found with sub-linear time complexity.

One of the most well-known hashing techniques that preserve similarity information is Locality-Sensitive Hashing (LSH) [7]. LSH simply employs random linear projections (followed by random thresholding) to map data points that are close in a Euclidean space to similar codes. Spectral Hashing (SpH) [8] is a representative unsupervised hashing method, in which the Laplace-Beltrami eigenfunctions of manifolds are used to determine binary codes. Moreover, principled linear projections such as PCA Hashing (PCAH) [9] have been suggested for better quantization rather than random projection hashing. Besides, another popular hashing approach, Anchor Graphs Hashing (AGH) [10], learns compact binary codes via tractable low-rank adjacency matrices. AGH allows constant-time hashing of a new data point by extrapolating graph Laplacian eigenvectors to eigenfunctions. More relevant hashing methods can be found in [11], [12], [13].

However, previous explorations of hashing have mainly focused on single-view hashing. In these architectures, only one type of feature descriptor is used for learning the hashing functions. In practice, to give a more comprehensive description, objects/images are often represented via several different kinds of features, each with its own characteristics. Thus, it is desirable to incorporate these heterogeneous feature descriptors into learning hashing functions, leading to multi-view hashing approaches. Multi-view learning techniques [14], [15], [16], [17] have been well explored in the past few years and widely applied to visual information fusion [18], [19], [20]. Recently, a number of multiview hashing methods have been proposed for efficient similarity search, such as Multi-View Anchor Graph Hashing (MVAGH) [21], Sequential Update for Multi-View Spectral Hashing (SU-MVSH) [22], Multi-View Hashing (MVH-CS)

¹The effectiveness and efficiency of these methods drop exponentially as the dimensionality increases, which is commonly referred to as the curse of dimensionality.


[23], Composite Hashing with Multiple Information Sources (CHMIS) [24] and Deep Multi-view Hashing (DMVH) [25]. These methods mainly depend on spectral, graph or deep learning techniques to achieve structure-preserving encodings of the data. Nevertheless, hashing based purely on the above schemes is usually sensitive to data noise and suffers from high computational complexity.

The above drawbacks of prior work motivate us to propose a novel unsupervised multiview hashing approach, termed Multiview Alignment Hashing (MAH), which can effectively fuse multiple information sources and exploit a discriminative low-dimensional embedding via Nonnegative Matrix Factorization (NMF) [26]. NMF is a popular method in data mining tasks including clustering, collaborative filtering, outlier detection, etc. Unlike other embedding methods with positive and negative values, NMF seeks to learn a nonnegative parts-based representation that gives a better visual interpretation of the factor matrices for high-dimensional data. Therefore, in many cases, NMF may be more suitable for subspace learning tasks, because it provides a non-global basis set which intuitively contains the localized parts of objects [26]. In addition, since the flexibility of matrix factorization can handle widely varying data distributions, NMF enables more robust subspace learning. More importantly, NMF decomposes an original matrix into a parts-based representation that gives a better interpretation of the factor matrices for nonnegative data. When applying NMF to multiview fusion tasks, a parts-based representation can reduce the corruption between any two views and yield more discriminative codes.

To the best of our knowledge, this is the first work using NMF to combine multiple views for image hashing. It is worthwhile to highlight several contributions of the proposed method:

• MAH can find a compact representation which uncovers the hidden semantics from different view aspects and simultaneously respects the joint probability distribution of the data.

• To solve our nonconvex objective function, a new alternate optimization is proposed to obtain the final solution.

• We utilize multivariable logistic regression to generate the hashing function and achieve the out-of-sample extension.

The rest of this paper is organized as follows. In Section II, we give a brief review of NMF. The details of our method are described in Section III. Section IV reports the experimental results. Finally, we conclude this paper in Section V.

II. A BRIEF REVIEW OF NMF

In this section, we mainly review some related algorithms, focusing on Nonnegative Matrix Factorization (NMF) and its variants. NMF is proposed to learn the nonnegative parts of objects. Given a nonnegative data matrix $X = [\mathbf{x}_1, \cdots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}_{\geq 0}$, where each column of $X$ is a sample, NMF aims to find two nonnegative matrices $U \in \mathbb{R}^{D \times d}_{\geq 0}$ and $V \in \mathbb{R}^{d \times N}_{\geq 0}$ with full rank whose product approximates the original matrix $X$, i.e., $X \approx UV$. In practice, we always have $d < \min(D, N)$. Thus, we minimize the following objective function

$$\mathcal{L}_{NMF} = \|X - UV\|^2, \quad \text{s.t. } U, V \geq 0, \qquad (1)$$

where $\|\cdot\|$ is the Frobenius norm. To optimize the above objective function, an iterative updating procedure was developed in [26] as follows:

$$V_{ij} \leftarrow \frac{(U^T X)_{ij}}{(U^T U V)_{ij}} V_{ij}, \qquad U_{ij} \leftarrow \frac{(X V^T)_{ij}}{(U V V^T)_{ij}} U_{ij}, \qquad (2)$$

with the normalization

$$U_{ij} \leftarrow \frac{U_{ij}}{\sum_i U_{ij}}. \qquad (3)$$

It has been proved that the above updating procedure finds a local minimum of $\mathcal{L}_{NMF}$. The matrix $V$ obtained by NMF is always regarded as the low-dimensional representation, while the matrix $U$ denotes the basis matrix.
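To make the update rules concrete, here is a minimal Python/NumPy sketch of the standard multiplicative procedure in Eqs. (1)-(3); the function name, random initialization and iteration count are illustrative choices, not the authors' implementation.

```python
import numpy as np

def nmf(X, d, n_iter=200, eps=1e-10):
    """Standard NMF via the multiplicative updates of Eqs. (2)-(3).

    X: nonnegative data matrix of shape (D, N); d: target rank.
    Returns basis U (D x d) and representation V (d x N).
    """
    D, N = X.shape
    rng = np.random.default_rng(0)
    U = rng.random((D, d))
    V = rng.random((d, N))
    for _ in range(n_iter):
        # Eq. (2): multiplicative updates for V and U
        V *= (U.T @ X) / (U.T @ U @ V + eps)
        U *= (X @ V.T) / (U @ V @ V.T + eps)
        # Eq. (3): normalize each column of U to sum to one
        U /= U.sum(axis=0, keepdims=True) + eps
    return U, V
```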

Furthermore, several variants of NMF exist. Local NMF (LNMF) [27] imposes a spatially localized constraint on the bases. In [28], sparse NMF was proposed, and later NMF constrained with neighborhood preserving regularization (NPNMF) [29] was developed. Besides, researchers also proposed graph regularized NMF (GNMF) [30], which effectively preserves the locality structure of the data. Beyond these methods, [31] extends the original NMF with the kernel trick to kernelized NMF (KNMF), which can extract more useful features hidden in the original data through kernel-induced nonlinear mappings. Moreover, it can deal with data where only relationships (similarities or dissimilarities) between objects are known. Specifically, in their work, the matrix factorization is addressed by $K \approx UV$, where $K$ is the kernel matrix instead of the data matrix $X$, and $(U, V)$ are analogous to those in standard NMF. More related to our work, a multiple kernel NMF (MKNMF) [32] approach was proposed, where linear programming is applied to determine the combination of different kernels.

In this paper, we present a Regularized Kernel Nonnegative Matrix Factorization (RKNMF) framework for hashing, which can effectively preserve the intrinsic probability distribution of the data and simultaneously reduce the redundancy of low-dimensional representations. Rather than locality-based graph regularization, we measure the joint probability of pairwise data by a Gaussian function, which is defined over all potential neighbors and has been shown to effectively resist data noise [33]. This kind of measurement is capable of capturing the local structure of the high-dimensional data while also revealing global structure such as the presence of clusters at several scales. To the best of our knowledge, this is the first time that NMF with multiview hashing has been successfully applied to feature embedding for large-scale similarity search.

III. MULTIVIEW ALIGNMENT HASHING

In this section, we introduce our new Multiview Alignment Hashing approach, referred to as MAH. Our goal is to learn a hash embedding function which fuses the various aligned representations from multiple sources while preserving the high-dimensional joint distribution and obtaining the orthogonal bases simultaneously during RKNMF. Originally, we need to find a binary solution, which, however, is first relaxed to a real-valued range so that a more suitable solution can be gained. After applying the alternate optimization, we convert the real-valued solutions into binary codes. Fig. 1 shows the outline of the proposed MAH approach.

A. Objective Function

Given the i-th view training data $X^{(i)} = [\mathbf{x}_1^{(i)}, \cdots, \mathbf{x}_N^{(i)}] \in \mathbb{R}^{D_i \times N}$, we construct the corresponding $N \times N$ kernel matrix $K_i$ using the Heat Kernel formulation:

$$K_i(\mathbf{x}_p^{(i)}, \mathbf{x}_q^{(i)}) = \exp\left(-\frac{\|\mathbf{x}_p^{(i)} - \mathbf{x}_q^{(i)}\|^2}{2\tau_i^2}\right), \quad \forall p, q,$$

where $\tau_i$ is the related scaling parameter. In fact, our approach can work with any legitimate kernel; without loss of generality, one of the most popular kernels, the Heat Kernel, is used here in the proposed method, and the discussion in the following sections focuses on the Heat Kernel only.

Then the multiple kernel matrices $\{K_1, \cdots, K_n\}$ from each view are computed, with $K_i \in \mathbb{R}^{N \times N}_{\geq 0}, \forall i$. We further define the fusion matrix

$$K = \sum_{i=1}^{n} \alpha_i K_i, \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i = 1, \ \alpha_i \geq 0, \ \forall i.$$
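For concreteness, the sketch below builds the per-view Heat Kernel matrices and their weighted fusion; the helper names and the choice of τ_i as the median pairwise distance (the setting reported later in Section IV) are assumptions for illustration only, not the authors' code.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def heat_kernel(X_view, tau=None):
    """Build the N x N Heat Kernel matrix K_i for one view.

    X_view: (N, D_i) data for the i-th view, one sample per row.
    tau: bandwidth; if None, use the median pairwise distance.
    """
    dists = squareform(pdist(X_view, metric="euclidean"))
    if tau is None:
        tau = np.median(dists[dists > 0])
    return np.exp(-dists ** 2 / (2.0 * tau ** 2))

def fuse_kernels(kernels, alpha):
    """Weighted fusion K = sum_i alpha_i K_i with alpha on the simplex."""
    alpha = np.asarray(alpha, dtype=float)
    alpha = alpha / alpha.sum()          # enforce sum_i alpha_i = 1
    return sum(a * K for a, K in zip(alpha, kernels))
```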

To obtain a meaningful low-dimensional matrix factorization, we then place a constraint on the binary representation $V = [\mathbf{v}_1, \cdots, \mathbf{v}_N]$ in the form of a similarity probability regularization, which is utilized to preserve the data distribution in the intrinsic objective space. The optimization is expressed as:

$$\arg\min_{V \in \{0,1\}^{d \times N}} \frac{1}{2} \sum_{p,q} W_{pq}^{(i)} \|\mathbf{v}_p - \mathbf{v}_q\|^2, \qquad (4)$$

where $W_{pq}^{(i)} = \Pr(\mathbf{x}_p^{(i)}, \mathbf{x}_q^{(i)})$ is the symmetric joint probability between $\mathbf{x}_p^{(i)}$ and $\mathbf{x}_q^{(i)}$ in the i-th view feature space. In this paper, we use a Gaussian function² [33] to measure it:

$$\Pr(\mathbf{x}_p^{(i)}, \mathbf{x}_q^{(i)}) = \begin{cases} \dfrac{\exp\left(-\|\mathbf{x}_p^{(i)} - \mathbf{x}_q^{(i)}\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq l} \exp\left(-\|\mathbf{x}_k^{(i)} - \mathbf{x}_l^{(i)}\|^2 / 2\sigma_i^2\right)}, & \text{if } p \neq q \\[2ex] 0, & \text{if } p = q \end{cases} \qquad (5)$$

where $\sigma_i$ is the Gaussian smoothing parameter and $\|\mathbf{x}_p^{(i)} - \mathbf{x}_q^{(i)}\|$ is measured by the Euclidean distance. Following the framework in [30], the similarity probability regularization for the i-th view can be reduced to

$$\arg\min_{V \in \{0,1\}^{d \times N}} \operatorname{Tr}(V L_i V^T), \qquad (6)$$

where $L_i = D^{(i)} - W^{(i)}$ is the Laplacian matrix, $W^{(i)} = (W_{pq}^{(i)}) \in \mathbb{R}^{N \times N}$ is the symmetric similarity matrix, and $D^{(i)}$ is a diagonal matrix with entries $D_{pp}^{(i)} = \sum_q W_{pq}^{(i)}$.

²Our approach can work with any legitimate similarity probability measure, though we focus on the Gaussian similarity probability in this paper.

In addition, to reduce the redundancy of subspaces and simultaneously obtain compact bases in NMF, we want the basis matrix $U$ of NMF to be as orthogonal as possible, i.e., $U^T U - I = 0$. Here we minimize $\|U^T U - I\|^2$ and make the basis $U$ near-orthogonal, since this objective can be fused into the computation of the $L_2$-norm loss function of NMF. Finally, we combine the above two constraints for $U$ and $V$ and construct our optimization problem as follows:

$$\arg\min_{U, V, \alpha_i} \ \left\| \sum_{i=1}^{n} \alpha_i K_i - UV \right\|^2 + \gamma \sum_{i=1}^{n} \alpha_i \operatorname{Tr}(V L_i V^T) + \eta \|U^T U - I\|^2, \qquad (7)$$

$$\text{s.t. } V \in \{0,1\}^{d \times N}, \ U, V \geq 0, \ \sum_{i=1}^{n} \alpha_i = 1, \ \alpha_i \geq 0, \ \forall i,$$

where $\gamma$ and $\eta$ are two positive coefficients balancing the tradeoff between the NMF approximation error and the additional constraints.
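As a sanity check on the notation, the following sketch constructs the joint probability matrix of Eq. (5), the Laplacian $L^{(i)} = D^{(i)} - W^{(i)}$, and evaluates the three terms of Eq. (7) for a relaxed real-valued V; the helper names are hypothetical and this is only a sketch of the objective, not the authors' code.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def joint_probability(X_view, sigma):
    """Symmetric joint probability W^(i) of Eq. (5) for one view."""
    d2 = squareform(pdist(X_view, metric="sqeuclidean"))
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)             # Pr = 0 when p == q
    return W / W.sum()                   # normalize over all pairs k != l

def laplacian(W):
    """Graph Laplacian L^(i) = D^(i) - W^(i)."""
    return np.diag(W.sum(axis=1)) - W

def mah_objective(U, V, alpha, kernels, laplacians, gamma, eta):
    """Value of Eq. (7) for the relaxed (real-valued) V."""
    K = sum(a * Ki for a, Ki in zip(alpha, kernels))
    fit = np.linalg.norm(K - U @ V, "fro") ** 2
    reg = sum(a * np.trace(V @ Li @ V.T) for a, Li in zip(alpha, laplacians))
    orth = np.linalg.norm(U.T @ U - np.eye(U.shape[1]), "fro") ** 2
    return fit + gamma * reg + eta * orth
```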

B. Alternate Optimization via Relaxation

In this section, we first relax the discreteness condition to a continuous one, i.e., we relax the binary domain $V \in \{0,1\}^{d \times N}$ in Eq. (7) to the real-valued domain $V \in \mathbb{R}^{d \times N}$, but keep the NMF requirements (nonnegativity conditions). To the best of our knowledge, there is no direct way to obtain the global solution of a linearly constrained nonconvex optimization problem. Thus, in this paper the optimization is implemented as an iterative procedure, divided into two steps that optimize $(U, V)$ and $\alpha = (\alpha_1, \cdots, \alpha_n)$ alternately [34]. In each step, one of $(U, V)$ and $\alpha$ is optimized while the other is fixed, and at the next step we switch them. The iteration stops when the procedure converges.

a) Optimizing (U, V): First fixing $\alpha$, we substitute $K = \sum_{i=1}^{n} \alpha_i K_i$ and $L = \sum_{i=1}^{n} \alpha_i L_i$. A Lagrangian multiplier function is introduced for our problem as:

$$\mathcal{L}_1(U, V, \Phi, \Psi) = \|K - UV\|^2 + \gamma \operatorname{Tr}(V L V^T) + \eta \|U^T U - I\|^2 + \operatorname{Tr}(\Phi U^T) + \operatorname{Tr}(\Psi V^T), \qquad (8)$$

where $\Phi$ and $\Psi$ are two matrices whose entries are the Lagrangian multipliers for the constraints $U, V \geq 0$, respectively. Setting the partial derivatives of $\mathcal{L}_1$ with respect to $U$ and $V$ to zero, i.e., $\nabla_{U,V} \mathcal{L}_1 = 0$, we obtain

$$\nabla_U \mathcal{L}_1 = 2\left(-K V^T + U V V^T + 2\eta U U^T U - 2\eta U\right) + \Phi = 0, \qquad (9)$$

$$\nabla_V \mathcal{L}_1 = 2\left(-U^T K + U^T U V + \gamma V L\right) + \Psi = 0. \qquad (10)$$

Using the Karush-Kuhn-Tucker (KKT) conditions [35], since our objective function satisfies the constraint qualification condition, we have the complementary slackness conditions:

$$\Phi_{ij} U_{ij} = 0 \quad \text{and} \quad \Psi_{ij} V_{ij} = 0, \quad \forall i, j.$$

Multiplying the corresponding entries of Eqs. (9) and (10) by $U_{ij}$ and $V_{ij}$, we obtain the following equations for $U_{ij}$ and $V_{ij}$:

$$\left(-K V^T + U V V^T + 2\eta U U^T U - 2\eta U\right)_{ij} U_{ij} = 0,$$

$$\left(-U^T K + U^T U V + \gamma V L\right)_{ij} V_{ij} = 0.$$


Fig. 1. Illustration of the working flow of MAH. During the training phase, the proposed optimization method is an alternate procedure which optimizes the kernel weights α and the factorization matrices (U, V) alternately. Using multivariable logistic regression, the algorithm then outputs the projection matrix P and the regression matrix Θ to generate the hash function, which are directly used in the test phase.

Hence, similar to the standard procedure of NMF [26], we obtain our updating rules

$$U_{ij} \leftarrow \frac{\left(K V^T + 2\eta U\right)_{ij}}{\left(U V V^T + 2\eta U U^T U\right)_{ij}} U_{ij}, \qquad (11)$$

$$V_{ij} \leftarrow \frac{\left(U^T K + \gamma V W\right)_{ij}}{\left(U^T U V + \gamma V D\right)_{ij}} V_{ij}, \qquad (12)$$

where $D = \sum_{i=1}^{n} \alpha_i D^{(i)}$ and $W = \sum_{i=1}^{n} \alpha_i W^{(i)}$. This arrangement guarantees that all the entries of $U$ and $V$ remain nonnegative. $U$ also needs to be normalized as in Eq. (3). The convergence of $U$ is based on [36] and the convergence of $V$ is based on [30]; similar to [37], it has been proven that after each update of $U$ or $V$, the objective function is monotonically nonincreasing.
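A minimal sketch of one round of these regularized multiplicative updates, assuming the fused matrices K, W and D have already been formed (variable names are illustrative; a small epsilon is added to avoid division by zero):

```python
import numpy as np

def update_UV(U, V, K, W, D, gamma, eta, eps=1e-10):
    """One pass of the regularized multiplicative updates, Eqs. (11)-(12)."""
    # Eq. (11): update for the basis matrix U (shape N x d when K is N x N)
    U *= (K @ V.T + 2.0 * eta * U) / (U @ V @ V.T + 2.0 * eta * U @ U.T @ U + eps)
    # Eq. (12): update for the representation V (shape d x N)
    V *= (U.T @ K + gamma * V @ W) / (U.T @ U @ V + gamma * V @ D + eps)
    # Normalize U as in Eq. (3) so that each of its columns sums to one
    U /= U.sum(axis=0, keepdims=True) + eps
    return U, V
```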

b) Optimizing α: Due to the quadratic norm in our objective function, the procedure for optimizing $\alpha$ differs from general linear programming. For fixed $U$ and $V$, omitting the irrelevant terms, we define the Lagrangian function

$$\mathcal{L}_2(\alpha, \lambda, \beta) = \left\| \sum_{i=1}^{n} \alpha_i K_i - UV \right\|^2 + \gamma \sum_{i=1}^{n} \alpha_i \operatorname{Tr}(V L_i V^T) + \lambda \left( \sum_{i=1}^{n} \alpha_i - 1 \right) + \beta \alpha^T, \qquad (13)$$

where $\lambda$ and $\beta = (\beta_1, \cdots, \beta_n)$ are the Lagrangian multipliers for the constraints and $\alpha = (\alpha_1, \cdots, \alpha_n)$. Considering the partial derivatives of $\mathcal{L}_2$ with respect to $\alpha$, $\lambda$ and $\beta$, we have the equality $\nabla_{\alpha, \lambda} \mathcal{L}_2 = 0$ and the inequality $\nabla_{\beta} \mathcal{L}_2 \geq 0$. Then we need

$$\nabla_{\alpha} \mathcal{L}_2 = \left( 2 \operatorname{Tr}\left( \Big( \sum_{i=1}^{n} \alpha_i K_i - UV \Big) K_j \right) + \gamma \operatorname{Tr}(V L_j V^T) + \lambda + \beta_j \right)_{1 \leq j \leq n} = 0, \qquad (14)$$

$$\nabla_{\lambda} \mathcal{L}_2 = \sum_{i=1}^{n} \alpha_i - 1 = 0, \qquad (15)$$

$$\nabla_{\beta} \mathcal{L}_2 = \alpha \geq 0. \qquad (16)$$

We also have the complementary slackness conditions

$$\beta_j \alpha_j = 0, \quad j = 1, \cdots, n. \qquad (17)$$

Suppose $\alpha_j = 0$ for some $j$, and specifically denote $J = \{j \mid \alpha_j = 0\}$; then the optimal solution would include some zeros. In this case, there is no difference from the optimization procedure for minimizing $\|\sum_{j \notin J} \alpha_j K_j - UV\|^2$. Without loss of generality, we suppose $\alpha_j > 0, \forall j$. Then $\beta = 0$, and from Eq. (14) we have

$$\sum_{i=1}^{n} \alpha_i \operatorname{Tr}(K_i K_j) = \operatorname{Tr}(U V K_j) - \gamma \operatorname{Tr}(V L_j V^T)/2 - \lambda/2, \qquad (18)$$

for $j = 1, \cdots, n$. If we transform the above equations to matrix form and denote $T_j = \operatorname{Tr}(U V K_j) - \gamma \operatorname{Tr}(V L_j V^T)/2$, we obtain


$$\begin{pmatrix} \operatorname{Tr}(K_1^2) & \operatorname{Tr}(K_2 K_1) & \cdots & \operatorname{Tr}(K_n K_1) \\ \operatorname{Tr}(K_1 K_2) & \operatorname{Tr}(K_2^2) & \cdots & \operatorname{Tr}(K_n K_2) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Tr}(K_1 K_n) & \operatorname{Tr}(K_2 K_n) & \cdots & \operatorname{Tr}(K_n^2) \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{pmatrix} = \begin{pmatrix} T_1 - \lambda/2 \\ T_2 - \lambda/2 \\ \vdots \\ T_n - \lambda/2 \end{pmatrix}. \qquad (19)$$

Let us denote Eq. (19) by $\widetilde{A} \alpha^T = \widetilde{B}$. Note that the matrix $\widetilde{A}$ is actually the Gram matrix of the $K_i$ under the Frobenius inner product $\langle K_i, K_j \rangle = \operatorname{Tr}(K_i K_j^T) = \operatorname{Tr}(K_i K_j)$. Let $M = (\operatorname{vec}(K_1), \cdots, \operatorname{vec}(K_n))$, where $\operatorname{vec}(K_i)$ is the vectorization of $K_i$; then $\widetilde{A} = M^T M$ and $\operatorname{rank}(\widetilde{A}) = \operatorname{rank}(M^T M) = \operatorname{rank}(M) = n$ under the assumption that the kernel matrices $K_1, \cdots, K_n$ from the n different views are linearly independent. Combining this with Eq. (16) and eliminating $\lambda$ by subtracting the last row from the other rows of Eq. (19), we obtain the linear system shown in matrix form in Eq. (20):

$$\begin{pmatrix} \operatorname{Tr}(K_1^2) - \operatorname{Tr}(K_1 K_n) & \cdots & \operatorname{Tr}(K_n K_1) - \operatorname{Tr}(K_n^2) \\ \vdots & \ddots & \vdots \\ \operatorname{Tr}(K_1 K_{n-1}) - \operatorname{Tr}(K_1 K_n) & \cdots & \operatorname{Tr}(K_n K_{n-1}) - \operatorname{Tr}(K_n^2) \\ 1 & \cdots & 1 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{pmatrix} = \begin{pmatrix} T_1 - T_n \\ \vdots \\ T_{n-1} - T_n \\ 1 \end{pmatrix}. \qquad (20)$$

We denote Eq. (20) by $A \alpha^T = B$. Owing to the variety of the different views, the row $\mathbf{1} = (1, \cdots, 1)$ and all of the rows of $\widetilde{A}$ are linearly independent, so $\operatorname{rank}(A) = \operatorname{rank}(\widetilde{A}) - 1 + 1 = n$. Thus the inverse of $A$ exists and we have

$$\alpha^T = A^{-1} B.$$
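A sketch of this weight-update step, which builds and solves the reduced system of Eq. (20); it assumes all α_j remain strictly positive, as discussed above, and uses generic NumPy calls rather than the authors' implementation.

```python
import numpy as np

def solve_alpha(kernels, laplacians, U, V, gamma):
    """Solve the n x n system of Eq. (20) for the kernel weights alpha."""
    n = len(kernels)
    # Gram matrix of the kernels under the Frobenius inner product
    G = np.array([[np.trace(Ki @ Kj) for Kj in kernels] for Ki in kernels])
    # Right-hand-side terms T_j = Tr(UVK_j) - gamma * Tr(V L_j V^T) / 2
    T = np.array([np.trace(U @ V @ Kj) - 0.5 * gamma * np.trace(V @ Lj @ V.T)
                  for Kj, Lj in zip(kernels, laplacians)])
    # Eliminate lambda: subtract the last row from the first n-1 rows,
    # and add the simplex constraint sum(alpha) = 1 as the final equation.
    A = np.vstack([G[:-1] - G[-1], np.ones(n)])
    b = np.concatenate([T[:-1] - T[-1], [1.0]])
    return np.linalg.solve(A, b)
```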

c) Global Convergence: Let us denote our original objective function in Eq. (7) by $L(U, V, \alpha)$. Then for the m-th step, the alternate iteration procedure is:

$$(U^{(m)}, V^{(m)}) \leftarrow \arg\min_{U, V} L(U, V, \alpha^{(m-1)})$$

and

$$\alpha^{(m)} \leftarrow \arg\min_{\alpha} L(U^{(m)}, V^{(m)}, \alpha).$$

Thus we have the following chain of inequalities:

$$L(U^{(m-1)}, V^{(m-1)}, \alpha^{(m-1)}) \geq L(U^{(m)}, V^{(m)}, \alpha^{(m-1)}) \geq L(U^{(m)}, V^{(m)}, \alpha^{(m)}) \geq L(U^{(m+1)}, V^{(m+1)}, \alpha^{(m)}) \geq L(U^{(m+1)}, V^{(m+1)}, \alpha^{(m+1)}) \geq \cdots.$$

In other words, $L(U^{(m)}, V^{(m)}, \alpha^{(m)})$ is monotonically nonincreasing as $m \to \infty$. Note that $L(U, V, \alpha) \geq 0$; hence the alternate iteration converges.

Following the above optimization procedure for $(U, V)$ and $\alpha$, we compute them alternately, fixing one while optimizing the other, until the objective function converges. In practice, we terminate the iteration when the difference $|L(U^{(m)}, V^{(m)}, \alpha^{(m)}) - L(U^{(m-1)}, V^{(m-1)}, \alpha^{(m-1)})|$ is less than a small threshold or the number of iterations reaches a maximum value.
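Putting the two steps together, the alternating procedure can be sketched as follows; update_UV, solve_alpha and mah_objective refer to the illustrative helpers in the earlier sketches, and the tolerance and iteration cap are placeholders.

```python
import numpy as np

def alternate_optimize(kernels, laplacians, Ws, Ds, U, V, gamma, eta,
                       tol=1e-4, max_iter=10):
    """Alternate between (U, V) and alpha until Eq. (7) stops decreasing."""
    n = len(kernels)
    alpha = np.ones(n) / n                               # uniform initialization
    prev = np.inf
    for _ in range(max_iter):
        K = sum(a * Ki for a, Ki in zip(alpha, kernels))
        W = sum(a * Wi for a, Wi in zip(alpha, Ws))
        D = sum(a * Di for a, Di in zip(alpha, Ds))
        U, V = update_UV(U, V, K, W, D, gamma, eta)              # Eqs. (11)-(12)
        alpha = solve_alpha(kernels, laplacians, U, V, gamma)    # Eq. (20)
        cur = mah_objective(U, V, alpha, kernels, laplacians, gamma, eta)
        if abs(prev - cur) < tol:                        # stopping criterion
            break
        prev = cur
    return U, V, alpha
```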

C. Hashing Function Generation

From the above section, we can compute the weight vector $\alpha = (\alpha_1, \cdots, \alpha_n)$ and then obtain the fused kernel matrix $K$ and the combined joint probability Laplacian matrix $L$. Thus, from Eq. (11) and Eq. (12) we can obtain the multiview RKNMF bases $U \in \mathbb{R}^{N \times d}$ and the low-dimensional representation $V \in \mathbb{R}^{d \times N}$, where $d \ll D_i$, $i = 1, \cdots, n$. We now convert the low-dimensional real-valued representation $V = [\mathbf{v}_1, \cdots, \mathbf{v}_N]$ into binary codes via thresholding: if the l-th element of $\mathbf{v}_p$ is larger than a specified threshold, the mapped bit $\hat{v}_p^{(l)} = 1$; otherwise $\hat{v}_p^{(l)} = 0$, where $p = 1, \cdots, N$ and $l = 1, \cdots, d$.

According to [38], a well-designed semantic hashing scheme should also be entropy-maximizing to ensure efficiency. Moreover, from information theory [39], the maximal entropy of a source alphabet is attained by a uniform probability distribution. If the entropy of the codes over the bit-string is small, documents are mapped to only a small fraction of the codes (hash bins), rendering the hash table inefficient. In our method, we set the threshold for the l-th element as the median value of $\{v_p^{(l)}\}_{p=1}^{N}$, which meets the entropy-maximizing criterion mentioned above. Therefore, each bit will be 1 for half of the data points and 0 for the other half. This scheme gives each distinct binary code roughly equal probability of occurring in the data collection, hence achieving the best utilization of the hash table. The binary codes can be represented as $\widehat{V} = [\hat{\mathbf{v}}_1, \cdots, \hat{\mathbf{v}}_N]$, where $\hat{\mathbf{v}}_p \in \{0,1\}^d$ and $p = 1, \cdots, N$. A similar scheme has been used in [40], [41].
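A short sketch of this median-thresholding step (illustrative only):

```python
import numpy as np

def binarize_by_median(V):
    """Threshold each row (bit) of V at its median so every bit is balanced.

    V: (d, N) real-valued representation; returns (d, N) codes in {0, 1}.
    """
    thresholds = np.median(V, axis=1, keepdims=True)   # one threshold per bit
    return (V > thresholds).astype(np.uint8)
```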

Up to now, this only tells us how to compute the hash codes of items in the training set; for a new incoming sample, we cannot explicitly find the corresponding hashing function. Inspired by [42], we determine our out-of-sample extension using a regression technique with multiple variables. In this paper, instead of applying linear regression as in [42], the binary coding environment forces us to implement our task via logistic regression [43], which can be treated as a type of probabilistic statistical classification model. The corresponding probabilities describe the possibilities of binary responses. For n distributions $Y_i | X_i \sim \text{Bernoulli}(p_i)$, $i = 1, \cdots, n$, with the function $\Pr(Y_i = 1 | X_i = x) := h_\theta(x)$ parameterized by $\theta$, the likelihood function is

$$\prod_{i=1}^{n} \Pr(Y_i = y_i | X_i = x_i) = \prod_{i=1}^{n} h_\theta(x_i)^{y_i} (1 - h_\theta(x_i))^{1 - y_i}.$$

According to the maximum log-likelihood criterion, we define our logistic regression cost function:

$$J(\Theta) = -\frac{1}{N} \left( \sum_{p=1}^{N} \Big( \langle \hat{\mathbf{v}}_p, \log(h_\Theta(\hat{\mathbf{v}}_p)) \rangle + \langle (\mathbf{1} - \hat{\mathbf{v}}_p), \log(\mathbf{1} - h_\Theta(\hat{\mathbf{v}}_p)) \rangle \Big) + \xi \|\Theta\|^2 \right), \qquad (21)$$

where

$$h_\Theta(\hat{\mathbf{v}}_p) = \left( \frac{1}{1 + e^{-(\Theta^T \hat{\mathbf{v}}_p)_i}} \right)^T_{1 \leq i \leq d}$$

is the regression function applied to each component of $\hat{\mathbf{v}}_p$; $\log(\mathbf{x}) = (\log(x_1), \cdots, \log(x_n))^T$ for $\mathbf{x} = (x_1, \cdots, x_n)^T \in \mathbb{R}^n$; $\langle \cdot, \cdot \rangle$ denotes the inner product; $\Theta$ is the corresponding regression matrix of size $d \times d$; $\mathbf{1}$ denotes the all-ones vector; and $\xi$ is the parameter of the regularization term $\xi\|\Theta\|^2$ in logistic regression, which is included to avoid overfitting.



Further, aiming to minimize $J(\Theta)$, a standard gradient descent algorithm is employed. According to [43], the update equation with learning rate $r$ can be written as:

$$\Theta^{t+1} = \Theta^{t} - r \left( \frac{1}{N} \sum_{p=1}^{N} \left( h_\Theta(\hat{\mathbf{v}}_p) - \hat{\mathbf{v}}_p \right) \hat{\mathbf{v}}_p^T + \frac{\xi}{N} \Theta^{t} \right). \qquad (22)$$

We stop the update iteration when the norm of the difference between $\Theta^{t+1}$ and $\Theta^{t}$, $\|\Theta^{t+1} - \Theta^{t}\|^2$, is less than an empirically small value (i.e., convergence is reached), and then output the regression matrix $\Theta$.
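A compact sketch of this regression step following Eq. (22); the learning rate, regularization weight, tolerance and zero initialization are illustrative placeholders.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def fit_theta(V_hat, r=0.05, xi=0.05, tol=1e-6, max_iter=1000):
    """Gradient descent for the regression matrix Theta, Eq. (22).

    V_hat: (d, N) binary training codes; returns Theta of shape (d, d).
    """
    d, N = V_hat.shape
    Theta = np.zeros((d, d))
    for _ in range(max_iter):
        H = sigmoid(Theta.T @ V_hat)                    # h_Theta(v_p) for all p
        grad = (H - V_hat) @ V_hat.T / N + xi / N * Theta
        Theta_new = Theta - r * grad
        if np.linalg.norm(Theta_new - Theta) ** 2 < tol:  # stopping criterion
            Theta = Theta_new
            break
        Theta = Theta_new
    return Theta
```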

In this way, given a new sample, the related kernel matrices $\{K_1^{new}, \cdots, K_n^{new}\}$ for each view are first computed via the Heat Kernel, where $K_i^{new}$ is an $N \times 1$ matrix, $\forall i$. We then fuse these kernels with the optimized weights $\alpha$:

$$K^{new} = \sum_{i=1}^{n} \alpha_i K_i^{new},$$

and obtain the low-dimensional real-valued representation via the linear projection matrix

$$P = (U^T U)^{-1} U^T,$$

which is the pseudoinverse of $U$ obtained from RKNMF. Since $h_\Theta$ is a sigmoid function, the hash code for the new sample can finally be calculated as:

$$\hat{\mathbf{v}}^{new} = \lfloor h_\Theta(P \cdot K^{new}) \rceil, \qquad (23)$$

where $\lfloor \cdot \rceil$ is the nearest-integer function applied to each entry of $h_\Theta$. Since $h_\Theta \in (0, 1)$, this binarizes $\hat{\mathbf{v}}^{new}$ with a threshold of 0.5: if a bit of the output $h_\Theta(P \cdot K^{new})$ is larger than 0.5, we assign "1" to this bit, otherwise "0". In this way, we obtain the final Multiview Alignment Hashing (MAH) codes for any data point (see Fig. 2).
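A sketch of this out-of-sample encoding; the function and argument names are hypothetical, and P, α and Θ are assumed to come from the training stage described above.

```python
import numpy as np

def hash_new_sample(x_views, train_views, taus, alpha, P, Theta):
    """Encode one new sample into a d-bit MAH code via Eq. (23).

    x_views: list of per-view feature vectors for the new sample.
    train_views: list of (N, D_i) training matrices used for the kernels.
    """
    # Heat-Kernel column K_i^new (length N) for each view, then weighted fusion
    cols = [np.exp(-np.sum((Xi - x) ** 2, axis=1) / (2.0 * tau ** 2))
            for x, Xi, tau in zip(x_views, train_views, taus)]
    K_new = sum(a * c for a, c in zip(alpha, cols))      # fused N-vector
    v_real = P @ K_new                                   # low-dimensional projection
    probs = 1.0 / (1.0 + np.exp(-(Theta.T @ v_real)))    # sigmoid h_Theta
    return (probs > 0.5).astype(np.uint8)                # round to nearest integer
```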

It is noteworthy that our approach can be regarded as an embedding method. In practice, we map all the training and test data in the same way to ensure that they lie in the same subspace, via Eq. (23) with $\{P, \alpha, \Theta\}$ obtained from the multiview RKNMF optimization and logistic regression. This procedure is similar to a handcrafted embedding scheme, without re-training. The corresponding integrated MAH algorithm is depicted in Algorithm 1.

Algorithm 1 Multiview Alignment Hashing (MAH)
Input: A set of training kernel matrices from n different views, $\{K_1, \cdots, K_n\}$, computed via the Heat Kernel; the objective dimension d of the hash code; the learning rate r for logistic regression; and the regularization parameters $\{\gamma, \eta, \xi\}$.
Output: Kernel weights $\alpha = (\alpha_1, \cdots, \alpha_n)$, basis matrix $U$ and regression matrix $\Theta$.
1: Calculate the matrix $W^{(i)}$ for each view through Eq. (5);
2: Initialize $\alpha = (1/n, 1/n, \cdots, 1/n)$;
3: repeat
4:   Compute the basis matrix $U$ and the low-dimensional representation matrix $V$ via Eq. (11) and Eq. (12);
5:   Obtain the kernel weights $\alpha^T = A^{-1}B$ with Eq. (20);
6: until convergence
7: Calculate the regression matrix $\Theta$ by Eq. (22); the final MAH encoding for a sample is given by Eq. (23).

D. Complexity Analysis

The cost of MAH learning mainly consists of two parts. The first part is the construction of the Heat Kernels and the similarity probability regularization terms for the different views, i.e., $K_i$ and $L_i$. From Section III-A, the time complexity of this part is $O(2(\sum_{i=1}^{n} D_i) N^2)$. The second part is the alternate optimization. The time complexity of the matrix factorization in the (U, V)-updating step is $O(N^2 d)$ according to [30], and the updating of $\alpha$ has complexity $O(n^2 N^2)$ in MAH. In total, the time complexity of MAH learning is $O(2(\sum_{i=1}^{n} D_i) N^2 + T \times (N^2 d + n^2 N^2))$, where T is the number of iterations of the alternate optimization. Empirically, T is always less than 10, i.e., MAH converges within 10 rounds.

IV. EXPERIMENTS AND RESULTS

In this section, the MAH algorithm is evaluated for the high-dimensional nearest neighbor search problem. Three different datasets are used in our experiments, i.e., Caltech-256 [44], CIFAR-10 [45] and CIFAR-20 [45]. Caltech-256 consists of 30,607 images associated with 256 object categories. CIFAR-10 and CIFAR-20 are both 60,000-image subsets collected from the 80-million tiny images dataset [46] with 10 class labels and 20 class labels, respectively. Following the experimental setting in [22], [21], for each dataset we randomly select 1000 images as the query set and use the rest of the dataset as the training set. Given an image, we would like to describe it with multiview features. The descriptors are expected to capture the orientation, intensity, texture and color information, which are the main cues of an image. Therefore, 512-dim Gist³ [47], 1152-dim histogram of oriented gradients (HOG)⁴ [48], 256-dim local binary pattern (LBP)⁵ [49] and 192-dim color histogram (ColorHist)⁶ features are employed for image representation.

³Gabor filters are applied to images with 8 different orientations and 4 scales. Each filtered image is then averaged over a 4 × 4 grid, leading to a 512-dimensional vector (8 × 4 × 16 = 512).
⁴36 4 × 4 non-overlapping windows yield a 1152-dimensional vector.
⁵The LBP operator labels the pixels of an image by thresholding a 3 × 3 neighborhood, and the responses are mapped to a 256-dimensional vector.
⁶For each R, G, B channel, a 64-bin histogram is computed; the total length is 3 × 64 = 192.


Fig. 2. The illustration of the embedding via Eq. (23), i.e., the nearest integer function.

In the test phase, a returned point is regarded as a true neighbor if it lies among the top 100, 500 and 500 points closest to a query for Caltech-256, CIFAR-10 and CIFAR-20, respectively. For each query, all the data points in the database are ranked according to their Hamming distances to the query, which is fast enough with short hash codes in practice. We then evaluate the retrieval results by the Mean Average Precision (MAP) and the precision-recall curve. Additionally, we also report the training time and the test time (the average search time per query) for all the methods. All experiments are performed using Matlab 2013a on a server configured with a 12-core processor and 128 GB of RAM running the Linux OS.

A. Compared Methods and Settings

We compare our method against six popular unsupervised multiview hashing algorithms, i.e., Multi-View Anchor Graph Hashing (MVAGH) [21], Sequential Update for Multi-View Spectral Hashing (SU-MVSH) [22], Multi-View Hashing (MVH-CS) [23], Composite Hashing with Multiple Information Sources (CHMIS) [24], Deep Multi-view Hashing (DMVH) [25] with a 4-layer deep net, and a derived version of MVH-CS, termed MVH-CCA, which is the special case of MVH-CS in which the averaged similarity matrix is fixed as the identity matrix [23]. Besides, we also compare our method with two state-of-the-art single-view hashing methods, i.e., Spectral Hashing (SpH) [8] and Anchor Graphs Hashing (AGH) [10]. For the single-view hashing methods, data from the multiple views are concatenated into a single representation for hash learning. It is noteworthy that we normalize all the features to the same scale before concatenating the multiple features into a single representation for the single-view hashing algorithms. All of the above methods are then evaluated at six different code lengths (16, 32, 48, 64, 80, 96). Following the same experimental setting, all the parameters used in the compared methods have been strictly chosen according to their original papers.

For our MAH, we apply the Heat Kernel

$$K_i(\mathbf{x}_p^{(i)}, \mathbf{x}_q^{(i)}) = \exp\left(-\frac{\|\mathbf{x}_p^{(i)} - \mathbf{x}_q^{(i)}\|^2}{2\tau_i^2}\right), \quad \forall p, q,$$

to construct the original kernel matrices, where $\tau_i$ is set as the median of the pairwise distances of the data points. The selection of the smoothing parameter $\sigma_i$ used in Eq. (5) follows the same scheme as $\tau_i$. The optimal learning rate r for each dataset is selected from $\{0.01, 0.02, \cdots, 0.10\}$ with a step of 0.01, as the value yielding the best performance under 10-fold cross-validation on the training data. The choice of the three regularization parameters $\{\gamma, \eta, \xi\}$ is also made via cross-validation on the training set, and we finally fix $\gamma = 0.15$, $\eta = 0.325$ and $\xi = 0.05$ for all three datasets. To further speed up the convergence of the proposed alternate optimization procedure, in our experiments we apply a small trick with the following steps:

1) The first time U and V are calculated in the (U, V)-optimization step of Section III-B, we utilize Eq. (11) and Eq. (12) with random initialization, following the original NMF procedure [26], and then store the obtained U and V.

2) From the second iteration onwards, we optimize (U, V) by using the stored U and V from the previous iteration to initialize the NMF algorithm, instead of using random values.

This small improvement can effectively reduce the convergence time in the training phase and makes the proposed method more efficient for large-scale tasks. Fig. 3 shows the retrieval performance of MAH when the four descriptors (i.e., Gist, HOG, LBP, ColorHist) are used together via different kernel combinations, and when single descriptors are used. Specifically, MAH denotes that the kernels are combined by the proposed method

$$K = \sum_{i=1}^{n} \alpha_i K_i,$$

where the weights $\alpha_i$ are obtained via the alternate optimization; MAH (Average) means that the kernels are combined by the arithmetic mean

$$K = \frac{1}{n} \sum_{i=1}^{n} K_i;$$

and MAH (Product) indicates the combination of the kernels through the geometric mean

$$K = \left( \prod_{i=1}^{n} K_i \right)^{\frac{1}{n}}.$$

The corresponding combinations of the similarity probability regularization terms $L_i$ follow the same schemes. The results on the three datasets demonstrate that integrating multiple features achieves better performance than using single features, and that the proposed weighted combination improves the performance compared with the average and product schemes.

Fig. 3. Performance (MAP) of MAH when four descriptors are used together via different kernel combinations, in comparison to that of single descriptors.

Fig. 4 illustrates the MAP curves of all compared algorithms on the three datasets. Overall, first, the retrieval accuracies on the CIFAR-10 dataset are clearly higher than those on the more complicated CIFAR-20 and Caltech-256 datasets. Second, the multiview methods always achieve better results than the single-view schemes. In particular, in most cases MVAGH achieves higher performance than SU-MVSH, MVH-CS, MVH-CCA, DMVH-4layers and CHMIS. The results of MVH-CS always climb up and then go down as the code length increases; the same tendency also appears with CHMIS. In general, our MAH outperforms all other compared methods (see also Table I). In addition, we have also compared our proposed method with other variant baselines (with/without the probability regularization or the orthogonal constraint) in Table II. It is clearly observed that the proposed method significantly improves the effectiveness of NMF and its variants in terms of accuracy.

Fig. 4. Performance (MAP) comparison with different numbers of bits.

TABLE I
MEAN AVERAGE PRECISION (MAP) OF 32 BITS WITH TRAINING AND TEST TIME ON THREE DATASETS.

Methods        Caltech-256 (MAP 100-NN / Train / Test)   CIFAR-10 (MAP 500-NN / Train / Test)   CIFAR-20 (MAP 500-NN / Train / Test)
SU-MVSH        0.082 / 314.2s / 48.4µs                   0.241 / 387.3s / 63.8µs                0.109 / 402.3s / 72.7µs
MVH-CS         0.063 / 371.1s / 32.4µs                   0.237 / 498.2s / 48.8µs                0.081 / 562.0s / 53.1µs
MVH-CCA        0.052 / 477.0s / 28.3µs                   0.270 / 544.5s / 45.0µs                0.080 / 570.6s / 56.1µs
CHMIS          0.074 / 299.9s / 90.1µs                   0.273 / 333.6s / 104.8µs               0.102 / 363.7s / 113µs
MVAGH          0.092 / 253.9s / 57.0µs                   0.359 / 276.2s / 64.5µs                0.118 / 297.3s / 74.8µs
DMVH-4layers   0.080 / 22104s / 97.4µs                   0.278 / 48829s / 105.1µs               0.111 / 52350s / 117.2µs
MAH            0.103 / 274.7s / 49.1µs                   0.381 / 361.5s / 62.9µs                0.123 / 397.5s / 76.5µs

(None of the reported times includes the cost of feature extraction.)

TABLE II
COMPARISON OF DIFFERENT VARIANTS OF MAH TO PROVE THE EFFECTIVENESS OF THE IMPROVEMENT.

Methods    Caltech-256    CIFAR-10    CIFAR-20
MAH-1      0.034          0.091       0.047
MAH-2      0.057          0.113       0.062
MAH-3      0.090          0.360       0.106
MAH        0.103          0.381       0.123

(MAH-1 is the original MAH with neither the probability regularization nor the orthogonal constraint; MAH-2 is the original MAH without the probability regularization; MAH-3 is the original MAH without the orthogonal constraint.)

Furthermore, we present the precision-recall curves of all the algorithms on the three datasets with a code length of 96 bits in Fig. 5. From this figure, we can further discover that MAH again consistently achieves the best results in terms of the Area Under the Curve (AUC). Some examples of retrieval results on Caltech-256 are also shown in Fig. 6 and Fig. 7.

Fig. 5. The precision-recall curves of all compared algorithms on the three datasets with the code length of 96 bits.

Fig. 6. Some retrieval results on the Caltech-256 dataset. The top-left image in each block is the query and the rest are the 8 nearest images. The incorrect results are marked by red boxes.

Fig. 7. Comparison of some retrieval results via the compared hashing methods on the Caltech-256 dataset. The top-left image in each block is the query and the rest are the 8 nearest images. The incorrect results are marked by red boxes.

The proposed algorithm aims to generate non-linear binary codes, which can better fit the intrinsic data structure than linear ones. The only existing non-linear multiview hashing algorithm is MVAGH. However, the proposed algorithm can successfully find the optimal aggregation of the kernel matrices, whereas MVAGH cannot. Therefore, it is expected that the proposed algorithm performs better than MVAGH and the existing linear multiview hashing algorithms.

Finally, we list the training time and the test time of the different algorithms on the three datasets in Table I. Considering the training time, since DMVH involves learning a 4-layer deep net, it spends the most time training hashing functions, and MVH-CCA requires the second longest training time. Our MAH costs only slightly more time than MVAGH and CHMIS but significantly less than the other methods for training. For the test phase, MVH-CS and MVH-CCA are the most efficient methods, while MAH has a search time competitive with SU-MVSH. DMVH also needs the most time for testing.

V. CONCLUSION

In this paper, we have presented a novel unsupervised hashing method called Multiview Alignment Hashing (MAH), where hashing functions are effectively learnt via kernelized Nonnegative Matrix Factorization while preserving the joint probability distribution of the data. We incorporate multiple visual features from different views together, and an alternate procedure is introduced to optimize the weights of the different views and simultaneously produce the low-dimensional representation. We address this as a nonconvex optimization problem, whose alternate procedure finally converges to a locally optimal solution. For the out-of-sample extension, multivariable logistic regression has been successfully applied to obtain the regression matrix for fast hash encoding. Numerical experiments have been systematically carried out on the Caltech-256, CIFAR-10 and CIFAR-20 datasets. The results show that our MAH significantly outperforms the state-of-the-art multiview hashing techniques in terms of search accuracy.

REFERENCES

[1] L. Shao, L. Liu, and X. Li, "Feature learning for image classification via multiobjective genetic programming," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 7, pp. 1359–1371, 2014.
[2] F. Zhu and L. Shao, "Weakly-supervised cross-domain dictionary learning for visual recognition," International Journal of Computer Vision, vol. 109, no. 1-2, pp. 42–59, 2014.
[3] J. Han, X. Ji, X. Hu, D. Zhu, K. Li, X. Jiang, G. Cui, L. Guo, and T. Liu, "Representing and retrieving video shots in human-centric brain imaging space," IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2723–2736, 2013.
[4] J. Han, K. N. Ngan, M. Li, and H. Zhang, "A memory learning framework for effective image retrieval," IEEE Transactions on Image Processing, vol. 14, no. 4, pp. 511–524, 2005.
[5] J. Han, S. He, X. Qian, D. Wang, L. Guo, and T. Liu, "An object-oriented visual saliency detection framework based on sparse coding representations," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 12, pp. 2009–2021, 2013.
[6] V. Gaede and O. Gunther, "Multidimensional access methods," ACM Computing Surveys (CSUR), vol. 30, no. 2, pp. 170–231, 1998.
[7] A. Gionis, P. Indyk, R. Motwani et al., "Similarity search in high dimensions via hashing," in International Conference on Very Large Data Bases, vol. 99, 1999, pp. 518–529.
[8] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Neural Information Processing Systems, 2008.
[9] J. Wang, S. Kumar, and S. Chang, "Semi-supervised hashing for large scale search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 12, pp. 2393–2406, 2012.
[10] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, "Hashing with graphs," in International Conference on Machine Learning, 2011.
[11] K. Grauman and R. Fergus, "Learning binary hash codes for large-scale image search," Machine Learning for Computer Vision, pp. 49–87, 2013.
[12] J. Cheng, C. Leng, P. Li, M. Wang, and H. Lu, "Semi-supervised multi-graph hashing for scalable similarity search," Computer Vision and Image Understanding, vol. 124, pp. 12–21, 2014.
[13] P. Li, M. Wang, J. Cheng, C. Xu, and H. Lu, "Spectral hashing with semantically consistent graph for image indexing," IEEE Transactions on Multimedia, vol. 15, no. 1, pp. 141–152, 2013.
[14] C. Xu, D. Tao, and C. Xu, "Large-margin multi-view information bottleneck," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1559–1572, 2014.
[15] T. Xia, D. Tao, T. Mei, and Y. Zhang, "Multiview spectral embedding," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 40, no. 6, pp. 1438–1446, 2010.


[16] C. Xu, D. Tao, and C. Xu, "A survey on multi-view learning," CoRR, vol. abs/1304.5634, 2013.
[17] B. Xie, Y. Mu, D. Tao, and K. Huang, "m-SNE: Multiview stochastic neighbor embedding," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 4, pp. 1088–1096, 2011.
[18] M. Wang, X. Hua, R. Hong, J. Tang, G. Qi, and Y. Song, "Unified video annotation via multigraph learning," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 5, pp. 733–746, 2009.
[19] M. Wang, H. Li, D. Tao, K. Lu, and X. Wu, "Multimodal graph-based reranking for web image search," IEEE Transactions on Image Processing, vol. 21, no. 11, pp. 4649–4661, 2012.
[20] J. Yu, M. Wang, and D. Tao, "Semisupervised multiview distance metric learning for cartoon synthesis," IEEE Transactions on Image Processing, vol. 21, no. 11, pp. 4636–4648, 2012.
[21] S. Kim and S. Choi, "Multi-view anchor graph hashing," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2013.
[22] S. Kim, Y. Kang, and S. Choi, "Sequential spectral learning to hash with multiple representations," in European Conference on Computer Vision, 2012.
[23] S. Kumar and R. Udupa, "Learning hash functions for cross-view similarity search," in International Joint Conference on Artificial Intelligence, 2011.
[24] D. Zhang, F. Wang, and L. Si, "Composite hashing with multiple information sources," in ACM SIGIR Conference on Research and Development in Information Retrieval, 2011.
[25] Y. Kang, S. Kim, and S. Choi, "Deep learning to hash with multiple representations," in IEEE International Conference on Data Mining, 2012.
[26] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[27] S. Z. Li, X. Hou, H. Zhang, and Q. Cheng, "Learning spatially localized, parts-based representation," in IEEE Conference on Computer Vision and Pattern Recognition, 2001.
[28] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.
[29] Q. Gu and J. Zhou, "Neighborhood preserving nonnegative matrix factorization," in British Machine Vision Conference, 2009.
[30] D. Cai, X. He, J. Han, and T. S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548–1560, 2011.
[31] D. Zhang, Z.-H. Zhou, and S. Chen, "Non-negative matrix factorization on kernels," in Pacific Rim International Conference on Artificial Intelligence, 2006.

[32] S. An, J.-M. Yun, and S. Choi, "Multiple kernel nonnegative matrix factorization," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2011.
[33] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. 11, 2008.
[34] J. C. Bezdek and R. J. Hathaway, "Some notes on alternating optimization," Advances in Soft Computing-AFSS, pp. 288–300, 2002.
[35] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[36] W. Zheng, Y. Qian, and H. Tang, "Dimensionality reduction with category information fusion and non-negative matrix factorization for text categorization," Artificial Intelligence and Computational Intelligence, pp. 505–512, 2011.
[37] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Neural Information Processing Systems, 2000.
[38] S. Baluja and M. Covell, "Learning to hash: forgiving hash functions and applications," Data Mining and Knowledge Discovery, vol. 17, no. 3, pp. 402–430, 2008.
[39] C. E. Shannon, "A mathematical theory of communication," ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.
[40] D. Zhang, J. Wang, D. Cai, and J. Lu, "Self-taught hashing for fast similarity search," in ACM SIGIR Conference on Research and Development in Information Retrieval, 2010.
[41] Y. Lin, R. Jin, D. Cai, S. Yan, and X. Li, "Compressed hashing," in IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[42] D. Cai, X. He, and J. Han, "Spectral regression for efficient regularized subspace learning," in IEEE International Conference on Computer Vision, 2007, pp. 1–8.
[43] D. W. Hosmer Jr and S. Lemeshow, Applied Logistic Regression. John Wiley & Sons, 2004.
[44] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," 2007.
[45] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Technical Report, Computer Science Department, University of Toronto, pp. 473–482, 2009.
[46] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.
[47] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
[48] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[49] T. Ahonen, A. Hadid, and M. Pietikainen, "Face recognition with local binary patterns," in European Conference on Computer Vision, 2004.

Li Liu received the B.Eng. degree in electronic information engineering from Xi'an Jiaotong University, Xi'an, China, in 2011, and the Ph.D. degree from the Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield, U.K., in 2014. Currently, he is a research fellow in the Department of Computer Science and Digital Technologies at Northumbria University. His research interests include computer vision, machine learning and data mining.

Mengyang Yu (S'14) received the B.S. and M.S. degrees from the School of Mathematical Sciences, Peking University, Beijing, China, in 2010 and 2013, respectively. He is currently working toward the Ph.D. degree in the Department of Computer Science and Digital Technologies at Northumbria University, Newcastle upon Tyne, U.K. His research interests include computer vision, machine learning and data mining.

Ling Shao (M'09–SM'10) is a Professor with the Department of Computer Science and Digital Technologies at Northumbria University, Newcastle upon Tyne, U.K. Previously, he was a Senior Lecturer (2009-2014) with the Department of Electronic and Electrical Engineering at the University of Sheffield and a Senior Scientist (2005-2009) with Philips Research, The Netherlands. His research interests include computer vision, image/video processing and machine learning. He is an associate editor of IEEE Transactions on Image Processing, IEEE Transactions on Cybernetics and several other journals. He is a Fellow of the British Computer Society and the Institution of Engineering and Technology.
