Learning Binary Codes for Collaborative Filtering

Ke Zhou
College of Computing, Georgia Institute of Technology, Atlanta, GA 30032
[email protected]

Hongyuan Zha
College of Computing, Georgia Institute of Technology, Atlanta, GA 30032
[email protected]

ABSTRACT

This paper tackles the efficiency problem of making recommendations in the context of large user and item spaces. In particular, we address the problem of learning binary codes for collaborative filtering, which enables us to efficiently make recommendations with time complexity that is independent of the total number of items. We propose to construct binary codes for users and items such that the preference of users over items can be accurately preserved by the Hamming distance between their respective binary codes. Using two loss functions that measure the degree of divergence between the training and predicted ratings, we formulate the problem of learning binary codes as a discrete optimization problem. Although this optimization problem is intractable in general, we develop effective relaxations that can be solved efficiently by existing methods. Moreover, we investigate two methods to obtain the binary codes from the relaxed solutions. Evaluations are conducted on three public-domain data sets, and the results suggest that our proposed method outperforms several baseline alternatives.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Information filtering; I.2.6 [Artificial Intelligence]: Learning

General Terms

Algorithms, Performance, Experimentation

Keywords

Recommender systems, Collaborative filtering, Learning binary codes, Discrete optimization, Relaxed solutions

1. INTRODUCTION

With the rapid growth of e-commerce, hundreds of thousands of products, ranging from books and MP3s to automobiles, are sold through online marketplaces nowadays. In addition, millions of customers with diverse backgrounds and preferences make purchases online, generating great opportunities as well as challenges for e-commerce companies: how to match products to their potential buyers not only accurately but also efficiently. Since collaborative filtering is an essential component of many existing recommendation systems, it has been actively investigated by a wide range of previous studies aimed at improving its accuracy [1, 19]. On the other hand, due to the nature of their applications, collaborative filtering systems are usually required to learn and predict the preferences between a large number of users and items. Therefore, for a given user, it is important to retrieve products that satisfy her preferences efficiently, leading to fast response times and a better user experience. Naturally, the problem can be viewed as a similarity search problem in which we seek "similar" items for a given user. Recent studies show that binary coding is a promising approach for fast similarity search [9, 13, 14, 17, 21]. The basic idea is to represent data points by binary codes that preserve the original similarities between them. One significant advantage of this approach is that the retrieval of similar data points can be conducted by searching for data points within a small Hamming distance, which can be performed in time that is independent of the total number of data points [17]. However, to the best of our knowledge, no prior studies have focused on constructing binary codes for both users and items in the context of collaborative filtering, a gap we propose to fill in this paper.

One key obstacle that hinders direct application of existing approaches to learning binary codes in the collaborative filtering context is that most of them assume the similarities between pairs of data points are given explicitly, e.g., in the form of kernel functions or similarity graphs [13, 21, 24]. In collaborative filtering, however, the similarities between users and items are not known explicitly. In fact, the main goal of collaborative filtering algorithms is to estimate and predict unobserved similarities between users and items from the training data in order to make recommendations. In this paper, we address the problem of learning binary codes for collaborative filtering. Specifically, we propose to learn compact yet effective binary codes for both users and items from the training rating data. Unlike previous work on learning binary codes, we do not assume that the similarities between users and items are known explicitly. Therefore, the binary codes we construct not only accurately preserve the observed preferences of users but can also be used to predict unobserved preferences, making the proposed method conceptually unique compared with existing methods.

Our approach is based on the idea that the binary codes assigned to users and items should preserve the preferences of users over items. Two loss functions are applied to measure the divergence between the training data and the estimates based on the binary codes. Unfortunately, thus formulated, the resulting discrete optimization problem is difficult to solve in general. By relaxing the binary constraints, it turns out the relaxed optimization problem can be solved effectively by existing solvers. Moreover, we propose two effective methods for rounding the relaxed solutions to obtain binary codes. One key property of the binary codes obtained by the proposed method is that the degree of preference of a user for an item can be measured by the number of common bits between their corresponding binary codes. Hence, the major advantage of representing users and items by binary codes is to enable fast search: in order to provide recommendations for a user, we need only search for items with binary codes within a small Hamming distance of the binary code of the given user. We evaluate the proposed method on three data sets and compare it with several baseline alternatives. The results show that the binary codes obtained by the proposed method can preserve and predict the preferences of users more accurately than the baselines.

The contributions of this paper are essentially threefold:

1) We propose to learn binary codes for collaborative filtering that accurately preserve the preferences of users, which generalizes existing work on learning binary codes to the context of collaborative filtering. 2) We propose a framework for learning binary codes from training rating data, which leads to relaxed optimization problems that can be solved effectively by state-of-the-art optimization techniques. 3) Our experimental evaluations show that the binary codes obtained by the proposed method preserve the preferences of users better than several baseline alternatives.

The rest of the paper is organized as follows: In Section 2, we briefly review existing studies on collaborative filtering and learning binary codes. In Section 3, we first formulate the problem of learning binary codes for collaborative filtering as a discrete optimization problem and introduce the two loss functions used in this work. Then, the proposed learning algorithm is derived in detail, based on transforming and relaxing the discrete optimization problem so that it can be optimized efficiently. Moreover, we discuss two different methods for rounding real-valued solutions to obtain binary codes. The evaluations are described and analyzed in Section 4. We conclude our work and present several future research directions in Section 5.

2. RELATED WORK

2.1 Collaborative Filtering

Many studies on recommender systems have focused on collaborative filtering approaches. These methods can be categorized into memory-based and model-based. The reader is referred to the survey papers [1, 19] for a comprehensive summary of collaborative filtering algorithms.

Recently, matrix factorization has become a popular direction for collaborative filtering [2, 11, 15, 16], and these methods have been shown to be effective in many applications. Specifically, matrix factorization methods seek to associate both users and items with latent profiles, represented by vectors in a low-dimensional Euclidean space, that capture their characteristics. The preference of a user for an item is then measured by some similarity, such as the dot product, between their low-dimensional profiles. These studies are related to our work in the sense that both aim to find representations that preserve the preferences between users and items. However, one key difference is that our work aims to find binary codes in Hamming space for representing users and items, which has the nice property that the retrieval of interesting items for a user can be performed in time that is independent of the total number of items [17].

Another line of collaborative filtering research investigates the use of binary codes to create fingerprints for users [3–5]. The idea is to create binary fingerprints for each user using randomized algorithms so that the similarity between users can be approximated from the fingerprints. However, these studies do not address the problem of simultaneously representing users and items by binary fingerprints. Thus, the preferences of users for items cannot be estimated directly from these fingerprints. In contrast, the binary codes learned by our proposed method can be used directly to measure the preferences of users over items.

2.2 Learning Binary Codes

The problem of learning binary codes for fast similarity search has been investigated in several studies [12, 13, 17, 21]. Locality sensitive hashing tries to construct binary codes that preserve a certain distance (e.g., Lp distance) between different points with high probability, which is usually achieved by random projection [6, 8]. In [18], the problem of learning effective binary codes is solved by utilizing the idea of boosting. The work of [17] proposes to learn binary codes using a Restricted Boltzmann Machine (RBM) for fast similarity search of documents. Recent work focuses on constructing binary codes based on a given similarity function [14, 20, 21]. The basic idea is to apply spectral analysis techniques to the data and embed the data points into a low-dimensional space. For example, the work in [21] investigates the requirements for compact and effective binary codes. Its solution relies on spectral graph partitioning, which can be solved by eigenvalue decomposition of the Laplacian matrix of the graph. It has been shown that these methods achieve significant performance improvements in terms of preserving the similarity between data points. Although several extensions of this method have been studied [7, 24], these methods only consider the problem of obtaining binary codes for one type of entity. In collaborative filtering, however, two types of entities, users and items, are naturally involved and should therefore be considered simultaneously, which makes it difficult to apply these methods directly.

The work in [23] proposes to learn binary codes for both documents and terms by viewing the term-document matrix as a bipartite graph and applying the method proposed in [21] to obtain the binary codes. However, this method cannot deal with the problem of unobserved/missing ratings in collaborative filtering. As shown in our experiments, the binary codes obtained by this method quickly overfit the training data and lead to poor prediction accuracy.

3. LEARNING BINARY CODES

In this section, we describe the proposed method for learning binary codes for collaborative filtering. We first describe the general formulation of this problem through optimization using squared and pairwise loss functions, respectively. Then, the learning method based on solving the relaxed problem is derived in detail. Finally, we discuss two methods to obtain binary codes from the real-valued solutions of the relaxed problems.

3.1 Problem Formulation

The goal of collaborative filtering is to recommend interesting items to users according to their past ratings on the items. Formally, we assume that $r_{ui}$ represents the rating of user $u \in U$ for item $i \in I$, where $U$ and $I$ are the user and item spaces, respectively. Without loss of generality, we further assume that $r_{ui}$ is a real number in the interval $[0, 1]$. Moreover, we assign a binary code $f_u \in \{-1, 1\}^B$ to each user $u$ and $h_i \in \{-1, 1\}^B$ to each item $i$, where $B$ is the length of the binary codes. Our goal is to construct binary codes for users and items that preserve the preferences between them: the degree of preference of user $u$ for item $i$ can be estimated by the similarity between their binary codes $f_u$ and $h_i$. A natural way to define the similarity between user $u$ and item $i$ is the fraction of common bits in their binary codes $f_u$ and $h_i$, leading to the similarity function

$$\mathrm{sim}(f_u, h_i) = \frac{1}{B} \sum_{k=1}^{B} I\big(f_u^{(k)} = h_i^{(k)}\big),$$

where $f_u^{(k)}$ and $h_i^{(k)}$ represent the $k$-th bits of the binary codes $f_u$ and $h_i$, respectively, and $I(\cdot)$ denotes the indicator function that returns 1 if the statement in its argument is true and 0 otherwise.

It is easy to check that the similarity function $\mathrm{sim}(\cdot, \cdot)$ defined above satisfies

$$\mathrm{sim}(f_u, h_i) = 1 - \frac{1}{B} \mathrm{dist}_H(f_u, h_i),$$

where $\mathrm{dist}_H(f_u, h_i)$ is the Hamming distance between the two binary codes $f_u$ and $h_i$. This identity shows that the smaller the Hamming distance, the more similar the binary codes. Therefore, in order to find items with binary codes similar to that of a user represented by $f_u$, it is sufficient to search for items $i$ within a small Hamming distance $\mathrm{dist}_H(f_u, h_i)$. This allows us to find similar items in time that is independent of the total number of items [17].
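To make the relationship concrete, the following minimal sketch (our own illustration in NumPy; the code length, item count, and variable names are assumptions, not taken from the paper) computes the similarity both as the fraction of common bits and via the Hamming distance identity, and then performs a radius-based candidate search:

```python
import numpy as np

B = 16                                   # code length (illustrative choice)
rng = np.random.default_rng(0)
f_u = rng.choice([-1, 1], size=B)        # binary code of one user
h_i = rng.choice([-1, 1], size=B)        # binary code of one item

sim = np.mean(f_u == h_i)                # fraction of common bits
dist_h = np.sum(f_u != h_i)              # Hamming distance
assert np.isclose(sim, 1.0 - dist_h / B) # the identity sim = 1 - dist_H / B

# Recommending for the user then amounts to collecting items whose codes
# lie within a small Hamming ball around f_u.
H = rng.choice([-1, 1], size=(1000, B))  # codes of 1000 hypothetical items
dists = np.sum(H != f_u, axis=1)         # Hamming distance to every item
candidates = np.flatnonzero(dists <= 3)  # items within radius 3
```

The linear scan over `H` is only for exposition; the point of binary codes is that this lookup can instead be served by hashing on the codes themselves, as discussed in Section 4.5.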

In order to make accurate recommendations, we need to find binary codes $f_u$ and $h_i$ for users and items such that the preferences between them are preserved by the similarities between their respective binary codes. In addition, in collaborative filtering we only observe a subset of all possible ratings, $\{r_{ui} \mid (u, i) \in O\}$ where $O \subset U \times I$, and we need to recommend items to users according to their preferences over items whose ratings are unobserved. Therefore, the key to accurate recommendations is to construct binary codes that not only preserve the observed ratings but also accurately predict the preferences of users for unobserved items.

Our approach is to estimate the binary codes from the observed ratings. Specifically, we propose to construct binary codes that minimize the degree of divergence between the observed ratings and the ratings estimated from the binary codes. To this end, we apply two loss functions to measure this divergence:

• Squared Loss. Using this loss function, we seek to minimize the squared error between the observed ratings and the similarity estimates from the binary codes, a commonly used loss function for collaborative filtering:

$$\min_{f_u, h_i \in \{\pm 1\}^B} L_{sq} = \sum_{(u,i) \in O} \big(r_{ui} - \mathrm{sim}(f_u, h_i)\big)^2. \qquad (1)$$

• Pairwise Loss. Since we are more interested in preserving the relative order of items than their absolute ratings, it is natural to consider the pairwise loss function

$$\min_{f_u, h_i \in \{\pm 1\}^B} L_{pair} = \sum_{u} \sum_{i, j \in O_u} \Big( (r_{ui} - r_{uj}) - \big(\mathrm{sim}(f_u, h_i) - \mathrm{sim}(f_u, h_j)\big) \Big)^2, \qquad (2)$$

where $O_u = \{\, i \mid (u, i) \in O \,\}$ is the set of items rated by user $u$. Minimizing this loss function requires the binary codes to preserve the relative difference between each pair of items rated by the user.

Additionally, we require the binary codes to be balanced: each bit of the binary codes should have an equal chance of being 1 or $-1$. The balance constraint is equivalent to maximizing the entropy of each bit, so that each bit carries as much information as possible. Specifically, we enforce the following constraints on the binary codes:

$$\sum_u f_u = 0 \quad \text{and} \quad \sum_i h_i = 0.$$

The above constraints motivate the following regularized objective function for learning binary codes. For the squared loss, we have

$$\min_{f_u, h_i \in \{\pm 1\}^B} \sum_{(u,i) \in O} \big(r_{ui} - \mathrm{sim}(f_u, h_i)\big)^2 + \lambda \Big( \Big\|\sum_u f_u\Big\|^2 + \Big\|\sum_i h_i\Big\|^2 \Big), \qquad (3)$$

where the first term is the loss over the observed ratings and the second term expresses our preference for balanced binary codes. The parameter $\lambda$ controls the trade-off between minimizing the empirical error and enforcing the constraints; $\|\cdot\|$ denotes the Euclidean norm of a vector.

Similarly, we have the following regularized objective function for the pairwise loss:

$$\min_{f_u, h_i \in \{\pm 1\}^B} \sum_{u} \sum_{i, j \in O_u} \Big( (r_{ui} - r_{uj}) - \big(\mathrm{sim}(f_u, h_i) - \mathrm{sim}(f_u, h_j)\big) \Big)^2 + \lambda \Big( \Big\|\sum_u f_u\Big\|^2 + \Big\|\sum_i h_i\Big\|^2 \Big). \qquad (4)$$
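As a concrete reference point, here is a short sketch (our own illustration; the function and variable names are assumptions, not the authors' code) that evaluates the regularized squared-loss objective of Equation (3) for given codes, using the identity $\mathrm{sim}(f, h) = 1/2 + f^T h / (2B)$ derived in Section 3.2:

```python
import numpy as np

def squared_loss_objective(F, H, R, observed, lam):
    """Regularized squared loss of Eq. (3).

    F: (n_users, B) user codes in {-1, +1}
    H: (n_items, B) item codes in {-1, +1}
    R: (n_users, n_items) ratings scaled to [0, 1]
    observed: boolean mask marking the observed ratings (the set O)
    lam: weight of the balance regularizer (lambda)
    """
    B = F.shape[1]
    sim = 0.5 + (F @ H.T) / (2.0 * B)      # sim(f_u, h_i) for all pairs
    loss = np.sum((R[observed] - sim[observed]) ** 2)
    balance = np.sum(F.sum(axis=0) ** 2) + np.sum(H.sum(axis=0) ** 2)
    return loss + lam * balance
```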

3.2 Learning

The objective functions in Equation (3) and Equation (4) are defined over the discrete space $\{\pm 1\}^B$, which makes them difficult to optimize in general. Therefore, we propose to solve them approximately by transforming the objective functions and then relaxing the solution space to $[-1, 1]^B$. For the sake of concreteness, we describe our method for the squared loss in Equation (3) in detail; the pairwise loss in Equation (4) can be optimized in a similar manner.

First, we notice that for binary codes $f, h \in \{\pm 1\}^B$, the following property holds:

$$\begin{aligned}
\mathrm{sim}(f, h) &= \frac{1}{B} \sum_{k=1}^{B} I\big(f^{(k)} = h^{(k)}\big) \\
&= \frac{1}{2B} \Big( \sum_{k=1}^{B} I\big(f^{(k)} = h^{(k)}\big) + B - \sum_{k=1}^{B} I\big(f^{(k)} \neq h^{(k)}\big) \Big) \\
&= \frac{1}{2B} \Big( B + \sum_{k=1}^{B} f^{(k)} h^{(k)} \Big) \\
&= \frac{1}{2} + \frac{1}{2B} f^T h.
\end{aligned}$$

Thus, by substituting the above identity into the regularized objective function in Equation (3), we can express the objective for the squared loss as

$$\min_{f_u, h_i \in \{\pm 1\}^B} L_{reg} = \sum_{(u,i) \in O} \Big( r_{ui} - \frac{1}{2} - \frac{1}{2B} f_u^T h_i \Big)^2 + \lambda \Big( \Big\|\sum_u f_u\Big\|^2 + \Big\|\sum_i h_i\Big\|^2 \Big). \qquad (5)$$

A widely used approach to obtaining approximate solutions to such discrete optimization problems is to relax the solution space to real values, which enables the application of continuous optimization techniques. To this end, we first relax the solution space to real vectors in $[-1, 1]^B$ and later round the real-valued solutions back into $\{\pm 1\}^B$. The details of rounding are discussed in Section 3.3.

It is also interesting to note that the above formulation reveals a nice connection between learning binary codes and the matrix factorization approaches widely applied in collaborative filtering. In particular, the first term in (5) is the objective function that factorizes a linearly transformed matrix of observed ratings to find low-dimensional representations for users and items. The second term differs from the usual $\ell_2$ regularization used in traditional matrix factorization, since here we would like the binary codes to be balanced rather than close to zero.

Given the relaxed problem, the partial derivatives of the objective function $L_{reg}$ with respect to $f_u$ and $h_i$ can be expressed as

$$\frac{\partial L_{reg}}{\partial f_u} = -\frac{1}{B} \sum_{i \in O_u} \Big( r_{ui} - \frac{1}{2} - \frac{1}{2B} f_u^T h_i \Big) h_i + 2\lambda \sum_{u'} f_{u'},$$

$$\frac{\partial L_{reg}}{\partial h_i} = -\frac{1}{B} \sum_{u \in O_i} \Big( r_{ui} - \frac{1}{2} - \frac{1}{2B} f_u^T h_i \Big) f_u + 2\lambda \sum_{i'} h_{i'},$$

where $O_i = \{\, u \mid (u, i) \in O \,\}$ is the set of users who rated item $i$.

The relaxed problem can then be solved by methods such as L-BFGS [25] or stochastic gradient descent.
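The following sketch (our own stand-in for the solvers named above; the projected-gradient step, step size, and iteration count are assumptions for illustration) implements these derivatives with plain gradient descent, projecting back onto $[-1, 1]^B$ after each step:

```python
import numpy as np

def train_relaxed(R, observed, B=10, lam=0.01, lr=0.1, iters=100, seed=0):
    """Projected gradient descent on the relaxed objective of Eq. (5)."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(seed)
    F = rng.uniform(-1, 1, size=(n_users, B))    # relaxed user codes
    H = rng.uniform(-1, 1, size=(n_items, B))    # relaxed item codes
    W = observed.astype(float)                   # 1 on observed ratings, else 0
    for _ in range(iters):
        # Residuals r_ui - 1/2 - f_u^T h_i / (2B), restricted to O.
        err = W * (R - 0.5 - (F @ H.T) / (2.0 * B))
        # Gradients from Section 3.2; the balance term is shared by all rows.
        grad_F = -(err @ H) / B + 2.0 * lam * F.sum(axis=0)
        grad_H = -(err.T @ F) / B + 2.0 * lam * H.sum(axis=0)
        F = np.clip(F - lr * grad_F, -1.0, 1.0)  # project back to [-1, 1]^B
        H = np.clip(H - lr * grad_H, -1.0, 1.0)
    return F, H
```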

3.3 Obtaining Binary Codes

After solving the relaxed optimization problem in Equation (5), we obtain real-valued vectors $\tilde{f}_u, \tilde{h}_i \in [-1, 1]^B$ for each user $u$ and item $i$. In this section, we propose two methods to obtain binary codes $f_u, h_i \in \{\pm 1\}^B$ from these real-valued vectors.

3.3.1 Rounding to Closest Binary Codes

A straightforward method is to find the binary vectors $f_u, h_i \in \{\pm 1\}^B$ that are closest to $\tilde{f}_u$ and $\tilde{h}_i$. Specifically, we obtain $f_u$ for all $u \in U$ by optimizing

$$\min_{f_u \in \{\pm 1\}^B} \sum_u \|f_u - \tilde{f}_u\|^2 \qquad (6)$$

subject to $\sum_u f_u = 0$. Similarly, we obtain $h_i$ for all items $i \in I$ by

$$\min_{h_i \in \{\pm 1\}^B} \sum_i \|h_i - \tilde{h}_i\|^2 \qquad (7)$$

subject to $\sum_i h_i = 0$.

It turns out that the optimization problems in Equation (6) and Equation (7) have the following closed-form solution:

$$f_u^{(k)} = \begin{cases} 1, & \tilde{f}_u^{(k)} > \mathrm{median}\big(\tilde{f}_{u'}^{(k)} : u' \in U\big), \\ -1, & \text{otherwise}, \end{cases}$$

and

$$h_i^{(k)} = \begin{cases} 1, & \tilde{h}_i^{(k)} > \mathrm{median}\big(\tilde{h}_{i'}^{(k)} : i' \in I\big), \\ -1, & \text{otherwise}, \end{cases}$$

where $\mathrm{median}(\cdot)$ denotes the median of a set of real numbers.
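In code, this rounding is a per-bit median threshold; a minimal sketch (function name ours) follows:

```python
import numpy as np

def round_to_closest(codes_relaxed):
    """Balanced rounding of Section 3.3.1: threshold each bit (column)
    at its median over all users (or items), so roughly half of the
    entries in every bit position become +1 and half become -1."""
    med = np.median(codes_relaxed, axis=0)       # per-bit medians
    return np.where(codes_relaxed > med, 1, -1)

# Usage: F = round_to_closest(F_relaxed); H = round_to_closest(H_relaxed)
```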

3.3.2 Improved Rounding by Orthogonal Transformations

Another method for obtaining binary codes from the relaxed solutions $\tilde{f}_u$ and $\tilde{h}_i$ makes use of the structure of the solutions to the relaxed optimization problem. Similar ideas have been investigated for spectral clustering [22], and we extend the idea to the context of learning binary codes. First, we observe that if $\tilde{f}_u$ and $\tilde{h}_i$ are optimal solutions of (5), then $Q\tilde{f}_u$ and $Q\tilde{h}_i$ are also optimal solutions achieving the same objective value, for an arbitrary orthogonal matrix $Q \in \mathbb{R}^{B \times B}$, i.e., $Q^T Q = I$. This can be verified as follows:

$$\begin{aligned}
L_{reg}(Q\tilde{f}_u, Q\tilde{h}_i) &= \sum_{(u,i) \in O} \Big( r_{ui} - \frac{1}{2} - \frac{1}{2B} (Q\tilde{f}_u)^T (Q\tilde{h}_i) \Big)^2 + \lambda \Big( \Big\|\sum_u Q\tilde{f}_u\Big\|^2 + \Big\|\sum_i Q\tilde{h}_i\Big\|^2 \Big) \\
&= \sum_{(u,i) \in O} \Big( r_{ui} - \frac{1}{2} - \frac{1}{2B} \tilde{f}_u^T \tilde{h}_i \Big)^2 + \lambda \Big( \Big\|\sum_u \tilde{f}_u\Big\|^2 + \Big\|\sum_i \tilde{h}_i\Big\|^2 \Big) \\
&= L_{reg}(\tilde{f}_u, \tilde{h}_i),
\end{aligned}$$

where the second equality uses the fact that $Q$ is an orthogonal matrix. This observation shows that applying an orthogonal transformation to an optimal solution of the relaxed problem does not change the value of the objective function, which motivates the following method for obtaining binary codes from the relaxed solution:

$$\min_{Q,\; f_u, h_i \in \{\pm 1\}^B} \sum_u \|f_u - Q\tilde{f}_u\|^2 + \sum_i \|h_i - Q\tilde{h}_i\|^2 \qquad (8)$$

subject to

$$\sum_u f_u = 0, \quad \sum_i h_i = 0, \quad Q^T Q = I.$$

Intuitively, instead of directly finding binary codes that are close to the relaxed solutions, we seek binary codes that are close to some orthogonal transformation of the relaxed solutions. Introducing the orthogonal transformation $Q$ not only preserves the optimality of the relaxed solutions but also provides more flexibility to obtain better binary codes.

The optimization problem in Equation (8) can be solved by alternately minimizing with respect to $f_u$, $h_i$, and $Q$.

Optimization with respect to $f_u$ and $h_i$. We first fix the orthogonal transformation $Q$ and optimize with respect to $f_u$ and $h_i$:

$$\min_{f_u, h_i \in \{\pm 1\}^B} \sum_u \|f_u - Q\tilde{f}_u\|^2 + \sum_i \|h_i - Q\tilde{h}_i\|^2$$

subject to $\sum_u f_u = 0$ and $\sum_i h_i = 0$. The solution can be expressed as follows:

$$f_u^{(k)} = \begin{cases} 1, & (Q\tilde{f}_u)^{(k)} > \mathrm{median}\big((Q\tilde{f}_{u'})^{(k)} : u' \in U\big), \\ -1, & \text{otherwise}, \end{cases}$$

and

$$h_i^{(k)} = \begin{cases} 1, & (Q\tilde{h}_i)^{(k)} > \mathrm{median}\big((Q\tilde{h}_{i'})^{(k)} : i' \in I\big), \\ -1, & \text{otherwise}, \end{cases}$$

where $(Q\tilde{f}_u)^{(k)}$ denotes the $k$-th element of the transformed vector $Q\tilde{f}_u$.

Optimization with respect to $Q$. In this case, we fix $f_u$ and $h_i$ for all $u \in U$ and $i \in I$, and solve the following optimization problem to update the orthogonal transformation $Q$:

$$\min_{Q \in \mathbb{R}^{B \times B}} L(Q) = \sum_u \|f_u - Q\tilde{f}_u\|^2 + \sum_i \|h_i - Q\tilde{h}_i\|^2 = \|F - Q\tilde{F}\|_F^2 + \|H - Q\tilde{H}\|_F^2 \qquad (9)$$

subject to the constraint $Q^T Q = I$, where $F = [f_1, \ldots, f_{|U|}]$, $\tilde{F} = [\tilde{f}_1, \ldots, \tilde{f}_{|U|}]$, $H = [h_1, \ldots, h_{|I|}]$, and $\tilde{H} = [\tilde{h}_1, \ldots, \tilde{h}_{|I|}]$; $\|\cdot\|_F$ denotes the Frobenius norm. The following theorem enables us to solve this optimization problem efficiently by singular value decomposition:

Theorem 1. Let $UDV^T$ be the singular value decomposition of the matrix $F\tilde{F}^T + H\tilde{H}^T$. Then $Q = UV^T$ minimizes the objective function defined in Equation (9).

Proof. First notice that the objective function satisfies $L(Q) = \|F\|_F^2 + \|\tilde{F}\|_F^2 - 2\,\mathrm{trace}(F\tilde{F}^T Q^T) + \|H\|_F^2 + \|\tilde{H}\|_F^2 - 2\,\mathrm{trace}(H\tilde{H}^T Q^T)$. Therefore, the optimization problem is equivalent to maximizing

$$\mathrm{trace}(F\tilde{F}^T Q^T) + \mathrm{trace}(H\tilde{H}^T Q^T) = \mathrm{trace}\big((F\tilde{F}^T + H\tilde{H}^T) Q^T\big)$$

subject to the orthogonality constraint $Q^T Q = I$. Consider the Lagrangian

$$L(Q, \Lambda) = \mathrm{trace}\big((F\tilde{F}^T + H\tilde{H}^T) Q^T\big) - \frac{1}{2} \mathrm{trace}\big(\Lambda (Q^T Q - I)\big),$$

where $\Lambda$ is a symmetric matrix of Lagrange multipliers. Setting the gradient with respect to $Q$ to zero, we have

$$(F\tilde{F}^T + H\tilde{H}^T) - \Lambda Q = 0.$$

Thus $\Lambda = (F\tilde{F}^T + H\tilde{H}^T) Q^T = UDV^T Q^T$, which implies $\Lambda^2 = \Lambda\Lambda^T = UD^2U^T$ and hence $\Lambda = UDU^T$. Substituting this back into the equation above, we obtain $Q = UD^{-1}U^T UDV^T = UV^T$.

In general, we perform the above two steps alternately until the solution converges, yielding the binary codes $f_u$ and $h_i$.
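A compact sketch of this alternating procedure (our own illustration, with codes stored as rows rather than the column layout used above, and a fixed iteration count standing in for a convergence test):

```python
import numpy as np

def round_with_orth_transform(F_relaxed, H_relaxed, iters=20):
    """Alternating rounding of Section 3.3.2."""
    B = F_relaxed.shape[1]
    Q = np.eye(B)                                   # start from the identity
    for _ in range(iters):
        # Step 1: fix Q, rotate the relaxed codes and round bit-wise
        # at the per-bit median (the closed-form solution above).
        Fr, Hr = F_relaxed @ Q.T, H_relaxed @ Q.T
        F = np.where(Fr > np.median(Fr, axis=0), 1, -1)
        H = np.where(Hr > np.median(Hr, axis=0), 1, -1)
        # Step 2: fix the codes and update Q = UV^T from the SVD of
        # F F~^T + H H~^T (Theorem 1); with row-major codes this
        # matrix is F.T @ F_relaxed + H.T @ H_relaxed.
        U, _, Vt = np.linalg.svd(F.T @ F_relaxed + H.T @ H_relaxed)
        Q = U @ Vt
    return F, H
```

With `Q` initialized to the identity, the first pass reduces to the Closest rounding of Section 3.3.1, and subsequent passes refine it.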

4. EXPERIMENTS

In this section, we describe the experiments conducted to evaluate the proposed method for learning binary codes. For simplicity, we denote the proposed methods with the squared loss and the pairwise loss defined in Equation (3) and Equation (4) by CFCodeReg and CFCodePair, respectively.

4.1 Evaluation Metrics

We apply two evaluation metrics to assess the performance of CFCodeReg and CFCodePair. Our goal is to evaluate whether the obtained binary codes can accurately preserve the preferences of users for items. The evaluation metrics are described as follows:

• Discounted Cumulative Gain (DCG). DCG [10] is widely used to evaluate the quality of rankings. To compute DCG, we sort the items according to the Hamming distance between their binary codes and the binary code of the user. The DCG value of a ranking list is calculated as

$$\mathrm{DCG@}n = \sum_{i=1}^{n} \frac{2^{r_i} - 1}{\log(i + 1)},$$

where $r_i$ is the rating assigned by the user to the $i$-th item in the ranking list. DCG mainly evaluates whether the obtained binary codes accurately preserve the relative order of the items rated by each user. We use DCG@5 as an evaluation metric in our experiments and, when computing DCG, consider only the observed ratings in the test set.

In order to evaluate the performance of using binary codes for recommending top-K items to users, we apply the following evaluation metric:

• Precision. For each user $u$, we retrieve the set of items $S_u$ whose binary codes are within Hamming distance 3 of the user's binary code. Precision is defined as the fraction of relevant items in $S_u$:

$$\mathrm{Prec} = \frac{|\{\, i : i \in S_u \text{ and item } i \text{ is relevant to user } u \,\}|}{|S_u|}.$$

In our experiments, all items with ratings greater than or equal to 5 are regarded as relevant to the user.
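Both metrics are straightforward to compute; a small sketch follows (names are ours, and we take the natural logarithm in the DCG discount, since the paper does not state the base):

```python
import numpy as np

def dcg_at_n(ranked_ratings, n=5):
    """DCG@n with gain 2^r - 1 and discount log(i + 1), as above."""
    r = np.asarray(ranked_ratings[:n], dtype=float)
    positions = np.arange(1, len(r) + 1)
    return np.sum((2.0 ** r - 1.0) / np.log(positions + 1))

def precision_within_radius(dists, relevant, radius=3):
    """Fraction of relevant items among those inside the Hamming ball S_u."""
    retrieved = dists <= radius                 # boolean mask defining S_u
    if retrieved.sum() == 0:
        return 0.0
    return (retrieved & relevant).sum() / retrieved.sum()
```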

4.2 Data Sets

We used three data sets to evaluate the performance of CFCodeReg and CFCodePair: MovieLens, EachMovie and Netflix. Their statistics are summarized in Table 1.

• The MovieLens data set (http://www.grouplens.org/node/73) contains 3900 movies, 6040 users and about 1 million ratings. About 4% of the user-movie dyads are observed. The ratings are integers ranging from 1 (bad) to 5 (good).

• The EachMovie data set, collected by HP/Compaq Research, contains about 2.8 million ratings of 1628 movies by 72,916 users. The ratings are integers ranging from 1 to 6.


Table 1: Statistics for Data Sets

Dataset      No. Users   No. Items   No. Ratings
MovieLens        6,040       3,900     1,000,209
EachMovie       72,916       1,628     2,811,983
Netflix        480,189      17,770   104,706,033

• The Netflix data set (http://www.netflixprize.com/) is one of the largest test beds for collaborative filtering. It contains over 100 million ratings and was split into a training and a test subset. The training set consists of 100,480,507 ratings of 17,770 movies by 480,189 users; the test set contains about 2.8 million ratings. All ratings range from 1 to 5.

We split the three data sets into training and test sets as follows. For MovieLens and EachMovie, we randomly sample 80% of each user's ratings as the training set, and the remaining 20% are used as the test set. These two data sets are very sparse, with many unobserved ratings, which may lead to biased precision results. Therefore, we also construct a dense data set from the Netflix data: we first select the 5000 items with the most ratings and then sample 10,000 users with at least 100 ratings each. For this data set, we sample 20% of each user's ratings as the training set, and the remaining ratings are used as the test set. For all three data sets, we generate five independent splits and report the averaged performance. Moreover, we exclude all ratings in the training set and use only the ratings in the test set when computing the evaluation metrics.
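For reference, the per-user split can be implemented as in the following sketch (function and variable names are ours):

```python
import numpy as np

def per_user_split(ratings_by_user, train_frac=0.8, seed=0):
    """Randomly sample a fraction of each user's rated items for training;
    the remainder forms that user's test set, as described above."""
    rng = np.random.default_rng(seed)
    train, test = {}, {}
    for user, items in ratings_by_user.items():
        items = list(items)
        rng.shuffle(items)
        cut = int(train_frac * len(items))
        train[user], test[user] = items[:cut], items[cut:]
    return train, test
```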

4.3 Comparison of Rounding Methods

In Section 3.3, we described two methods for obtaining binary codes from the approximate real-valued solutions. We now compare their performance. We denote the method that rounds to the closest binary codes (Section 3.3.1) by Closest, and the method using orthogonal transformations (Section 3.3.2) by OrthTrans. We apply CFCodeReg and CFCodePair with binary codes of length 10 to all three data sets and compare the binary codes obtained by Closest and OrthTrans. The precision on the three data sets is reported in Figure 1, which shows that the binary codes obtained by OrthTrans outperform the corresponding binary codes obtained by Closest. This suggests that OrthTrans yields better approximations of the binary codes: intuitively, OrthTrans is more flexible than Closest because of the orthogonal transformation it introduces. We use OrthTrans to obtain binary codes in the rest of our evaluations.

[Figure 1: Precision of the two rounding methods (Closest and OrthTrans, combined with CFCodeReg and CFCodePair) on the MovieLens, EachMovie and Netflix data sets.]

4.4 General Performance Results

4.4.1 Baseline Methods

We compare CFCodeReg and CFCodePair to the following baselines:

• Spectral Hashing (SH) [21]: This method has been shown to be effective for learning binary codes. Specifically, it formulates the problem of learning binary codes as an eigenvalue problem on a similarity graph. In particular, the training ratings are viewed as similarities between users and items, and a bipartite graph is constructed between nodes representing users and items [23]. Spectral hashing is then applied to this graph to obtain binary codes for users and items.


• BinMF: In this baseline, we first apply low-rank matrix factorization to fit the training ratings and obtain low-dimensional real-valued latent profiles for users and items. We then binarize these vectors into binary codes using the orthogonal transformations described in Section 3.3.2, since that rounding method achieves better performance.

4.4.2 Performance Analysis

We apply the proposed CFCodeReg and CFCodePair to all three data sets and compare the obtained binary codes to the two baselines described in Section 4.4.1. Specifically, we plot the performance measured by DCG and precision with respect to the length of the binary codes in Figure 2, Figure 3 and Figure 4 for the MovieLens, EachMovie and Netflix data sets, respectively.

We can observe that the DCG and precision of both CFCodeReg and CFCodePair increase in most cases as the length of the binary codes grows; the performance of CFCodeReg and CFCodePair thus improves with the number of bits. We conclude that both methods can use the available bits to preserve the preferences of users more accurately. We can also observe that the binary codes obtained by CFCodeReg and CFCodePair outperform the baselines in terms of DCG, so the proposed methods better preserve the relative order of items according to the preferences of users. Moreover, the improvement over the baselines in terms of precision indicates that the binary codes obtained by CFCodeReg and CFCodePair can be used to recommend interesting items to users accurately.

Comparing the performance of CFCodeReg and CFCodePair, we find that CFCodePair outperforms CFCodeReg in most cases, which suggests that the pairwise loss function is more suitable for learning binary codes. This is because the pairwise loss function emphasizes the order between different items rather than their absolute ratings, which makes it a more reasonable loss function in the ranking scenario of collaborative filtering.

Another interesting observation is that SH does not work very well in our setting. Specifically, it overfits the training data very quickly as the length of the binary codes increases.

[Figure 2: Precision (a) and DCG (b) with respect to the length of binary codes on the MovieLens data set, for CFCodeReg, CFCodePair, BinMF and SH.]

[Figure 3: Precision (a) and DCG (b) with respect to the length of binary codes on the EachMovie data set, for CFCodeReg, CFCodePair, BinMF and SH.]

Examining the results, we find that SH usually fits the training data very well. However, it frequently assigns similar distances to users and items whose ratings are not in the training set, which reduces its performance on the test set. To investigate this point further, we vary the length of the binary codes and plot, in Figure 5, the variance of the Hamming distances on unobserved ratings for the binary codes generated by SH and CFCodeReg. We observe that the variance produced by SH decreases as the length of the binary codes grows. In contrast, the variance produced by CFCodeReg is generally much higher than that of SH, indicating that CFCodeReg generates more diverse codes as the length of the binary codes increases. We believe the reason is that SH fits the observed similarities while failing to predict the unobserved ones. This observation confirms that CFCodeReg not only fits the observed preferences very well but also predicts the unobserved preferences accurately.

4.4.3 Impact of Parameters

We investigate the impact of the regularization parameter λ on the proposed methods. To this end, we report the performance of CFCodeReg and CFCodePair measured by DCG for different values of λ in Figure 6; we show results only on the MovieLens data set due to space limitations. From Figure 6, we observe that the performance measured by DCG first increases and then decreases in most cases, which indicates that a good value of λ can enhance the learning process and thus improve the accuracy of the learned binary codes. In general, the value of λ can be determined by cross validation.

[Figure 6: Performance measured by DCG with respect to the regularization parameter λ on the MovieLens data set.]

In our experiments, the relaxed optimization problem of Equation (5) is solved by L-BFGS, an effective iterative solver for optimization problems. In Figure 7, we present the performance measured by DCG with respect to the number of iterations on the MovieLens data set. We observe that the DCG on both the training and test sets increases as the number of iterations grows, and the training process usually converges in about one hundred iterations.

[Figure 7: Performance measured by DCG with respect to the number of iterations on the MovieLens data set.]

4.4.4 Comparison with Low-rank Matrix Factorization

It is also interesting to compare CFCodeReg with the low-rank matrix factorization (MF) methods widely used for collaborative filtering. Since CFCodeReg is restricted to binary codes in order to facilitate fast search, it can be viewed as an approximation of low-rank factorization, and it is natural to ask how closely CFCodeReg can approach the performance of low-rank matrix factorization. To this end, we vary the length of the binary codes from 10 to 110 and report the performance measured by DCG on the MovieLens data set in Figure 8, together with the performance of low-rank matrix factorization as the rank of the factorization varies. The performance of CFCodeReg generally increases as the length of the binary codes grows and becomes very close to that of low-rank matrix factorization. On the other hand, the performance of low-rank matrix factorization is slightly reduced as the number of latent dimensions increases, which is generally explained by overfitting of the training data.

[Figure 8: Comparison of low-rank matrix factorization and CFCodeReg on the MovieLens data set.]

[Figure 4: Precision (a) and DCG (b) with respect to the length of binary codes on the Netflix data set, for CFCodeReg, CFCodePair, BinMF and SH.]

[Figure 5: Variance of predicted similarity on unknown ratings with respect to the length of binary codes, for CFCodeReg and SH.]

4.5 Recommendation Efficiency

We also compare the efficiency of producing top-K recommendations. For MF, we compute the predicted scores of every item for a given user and then select the top-K items with the highest scores. For CFCodeReg, we retrieve the items whose binary codes are within Hamming distance 3 of the user's binary code. We measure efficiency by the total time required to generate recommendations for all users; we run the recommendation program 10 times and report the average running time. The evaluation is conducted on a server with 16 GB of main memory, using one of its eight 2.5 GHz cores. On the MovieLens data set, CFCodeReg takes 0.586 seconds to process all users, while MF takes 64.9 seconds. This significant efficiency improvement is expected and can be explained by the fact that CFCodeReg examines only a small fraction of the items, while MF computes predictions for all items. This confirms that recommendation efficiency can be significantly improved by using binary codes.
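One standard way to realize this constant-time lookup (a sketch under our own assumptions; the paper does not specify its indexing scheme) is to index items by their packed codes and probe every code within the Hamming ball, which costs $\sum_{r \le 3} \binom{B}{r}$ dictionary lookups regardless of the number of items:

```python
import numpy as np
from itertools import combinations
from collections import defaultdict

def build_code_index(item_codes):
    """Index item ids by their binary code (codes given as {-1,+1} rows)."""
    bits = (np.asarray(item_codes) > 0).astype(np.uint8)
    index = defaultdict(list)
    for item, row in enumerate(bits):
        index[row.tobytes()].append(item)
    return index

def items_within_radius(user_code, index, B, radius=3):
    """Probe all codes within Hamming distance `radius` of the user's code."""
    base = (np.asarray(user_code) > 0).astype(np.uint8)
    hits = []
    for r in range(radius + 1):
        for flips in combinations(range(B), r):   # choose r bits to flip
            probe = base.copy()
            probe[list(flips)] ^= 1
            hits.extend(index.get(probe.tobytes(), []))
    return hits
```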

5. CONCLUDING REMARKS

In this paper, we address the problem of learning binary codes that preserve the preferences of users for items. In particular, we propose a framework that constructs binary codes such that the Hamming distance between a user and her preferred items is small. Using two loss functions, the problem is formulated as a discrete optimization problem over the training rating data. It turns out that the resulting optimization problem can be solved approximately by transforming the objective function and relaxing the variables to real values.


Moreover, we study two methods to obtain the binary codes from the real-valued approximations. Experiments on three data sets show that the proposed methods outperform several baselines and can preserve the preferences of users more accurately.

For future research, we plan to investigate other methods for solving the discrete problem more accurately; in particular, we will study how semidefinite programming can be applied to relax the original problem. Another direction is to learn binary codes incrementally: we would like to construct the binary codes bit by bit in a sequential manner, which has the advantage that new bits can be introduced to improve accuracy without re-training all of the existing bits. Finally, the problem of incorporating features of users and items, such as demographic features of users and descriptions of items, into learning the binary codes is also very interesting.


6. ACKNOWLEDGEMENT

Part of the work is supported by NSF IIS-1116886, NSF IIS-1049694 and NSFC 61129001/F010403.

7. REFERENCES

[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, June 2005.
[2] D. Agarwal, B. Chen, and P. Elango. Fast online learning through offline initialization for time-sensitive recommendation. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 703–712. ACM, 2010.
[3] Y. Bachrach and R. Herbrich. Fingerprinting ratings for collaborative filtering: theoretical and empirical analysis. In String Processing and Information Retrieval, pages 25–36, 2010.
[4] Y. Bachrach, E. Porat, and J. Rosenschein. Sketching techniques for collaborative filtering. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 2016–2021. Morgan Kaufmann Publishers Inc., 2009.
[5] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web (WWW '07), page 271, New York, NY, USA, 2007. ACM Press.
[6] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry (SCG '04), page 253, New York, NY, USA, 2004. ACM Press.
[7] J. He and W. Liu. Scalable similarity search with optimized kernel hashing. In Proceedings of the 16th ACM SIGKDD, pages 1129–1138, 2010.
[8] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing (STOC '98), pages 604–613, New York, NY, USA, 1998. ACM Press.
[9] H. Jegou, T. Furon, and J.-J. Fuchs. Anti-sparse coding for approximate nearest neighbor search. arXiv preprint arXiv:1110.3767, Oct. 2011.
[10] J. Kekalainen. Binary and graded relevance in IR evaluations: comparison of the effects on ranking of IR systems. Information Processing and Management, 41:1019–1033, 2005.
[11] Y. Koren. Factor in the neighbors: scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1):1–24, 2010.
[12] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Proceedings of Advances in Neural Information Processing Systems, 2009.
[13] W. Liu, J. Wang, S. Kumar, and S. Chang. Hashing with graphs. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[14] M. Norouzi and D. Fleet. Minimal loss hashing for compact binary codes. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[15] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, pages 5–8, 2007.
[16] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme. Factorizing personalized Markov chains for next-basket recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW '10), page 811, 2010.
[17] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, July 2009.
[18] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter-sensitive hashing. In Proceedings of the Ninth IEEE International Conference on Computer Vision, pages 750–757, 2003.
[19] X. Su and T. M. Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009:1–20, 2009.
[20] J. Wang and S. Kumar. Sequential projection learning for hashing with compact codes. In International Conference on Machine Learning, 2010.
[21] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Neural Information Processing Systems, pages 1–8, 2008.
[22] S. X. Yu and J. Shi. Multiclass spectral clustering. In Proceedings of the Ninth IEEE International Conference on Computer Vision, pages 313–319, Washington, DC, USA, 2003. IEEE.
[23] D. Zhang, J. Wang, D. Cai, and J. Lu. Laplacian co-hashing of terms and documents. In Advances in Information Retrieval, pages 577–580, 2010.
[24] D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for fast similarity search. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 18–25. ACM, 2010.
[25] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 23(4):550–560, Dec. 1997.
