AutoRec: Autoencoders Meet Collaborative Filtering
Suvash Sedhain (1,2), Aditya Krishna Menon (2,1), Scott Sanner (2,1), Lexing Xie (1,2)
(1) ANU, (2) NICTA

Motivation
Autoencoders have proven very successful in various vision and speech problems. Can we do the same for collaborative filtering?

Rating prediction problem
Given a partially observed user-item rating matrix R ∈ R^{m×n}, fill in the missing entries.
[Figure: a sparse user-item rating matrix with many "?" entries, completed via a learned embedding.]

AutoRec model
For each item, construct a (partially observed) vector of its ratings, and perform autoencoding on the result, where:
- weights are tied across items;
- only observed ratings are used to update the model.

Training objective:
    min_θ Σ_{i=1}^{n} ||r^(i) − h(r^(i); θ)||²_O + (λ/2) · (||W||²_F + ||V||²_F)

Prediction:
    R̂_ui = (h(r^(i); θ̂))_u, where h(r; θ) = f(W · g(Vr + μ) + b).

Comparisons with existing methods

                AutoRec                 RBM-CF
Model type      Discriminative          Generative
Objective       RMSE                    Log-likelihood
Optimisation    Gradient-based (fast)   Contrastive divergence (slow)
Ratings         Real-valued             Discrete

                AutoRec                 Matrix factorization
Embedding       Items only (I-AutoRec)  Users and items (more parameters)
Representation  Non-linear              Linear

Experiments

Data description:
          #users    #items
ML-1M      6,040     3,706
ML-10M    69,878    10,677
Netflix  480,189    17,770

Q: Is user- or item-based modelling better?
A: Item-based is superior (Table 1a).

Q: What are good choices of activations f(·), g(·)?
A: Nonlinearity in the hidden unit is key for performance (Table 1b).

Q: How many hidden units are needed for AutoRec?
A: Good performance with ~400 hidden units (Figure 2).

Q: How does AutoRec perform against all baselines?
A: It systematically outperforms state-of-the-art methods (Table 1c).

Q: Do deep extensions of AutoRec help?
A: Yes; a deep I-AutoRec with three hidden layers reduced RMSE from 0.831 to 0.827 on ML-1M.

Future work
Further exploration of deep autoencoders, and applications to implicit feedback datasets.

Try it now: https://github.com/mesuvash/NNRec
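The model recipe above can be made concrete in a few lines. The sketch below is a hypothetical NumPy illustration with toy sizes (not the authors' NNRec code): it builds h(r; θ) with identity f and sigmoid g, and evaluates the masked training objective. For simplicity it feeds 0 for unobserved inputs; the actual model only backpropagates through observed entries.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 8, 3          # users, items, hidden units (toy sizes, assumed)

# Partially observed rating matrix; 0 marks "unobserved" in this sketch.
R = rng.integers(1, 6, size=(m, n)) * (rng.random((m, n)) < 0.5)
mask = (R > 0).astype(float)

# Parameters theta = {W, V, mu, b}; V is tied across all item copies.
V = 0.01 * rng.standard_normal((k, m))   # encoder weights
W = 0.01 * rng.standard_normal((m, k))   # decoder weights
mu = np.zeros(k)
b = np.zeros(m)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def h(r):
    """Reconstruction h(r; theta) = f(W · g(V r + mu) + b), f = identity."""
    return W @ sigmoid(V @ r + mu) + b

def objective(lam=0.01):
    """Squared error over observed ratings only, plus L2 regularisation."""
    loss = 0.0
    for i in range(n):                   # one copy of the network per item
        r_i, o_i = R[:, i], mask[:, i]
        loss += np.sum(o_i * (r_i - h(r_i)) ** 2)
    return loss + (lam / 2) * (np.sum(W**2) + np.sum(V**2))

print(objective())
```

The mask is the key design point: only observed ratings contribute to the loss, which is what lets the autoencoder train on a sparse matrix at all.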






Table 1(a): RMSE of I/U-AutoRec and RBM models.
            ML-1M   ML-10M
U-RBM       0.881   0.823
I-RBM       0.854   0.825
U-AutoRec   0.874   0.867
I-AutoRec   0.831   0.782

Table 1(b): RMSE of I-AutoRec for choices of linear and nonlinear activation functions, MovieLens 1M.
f(·)      g(·)      RMSE
Identity  Identity  0.872
Sigmoid   Identity  0.852
Identity  Sigmoid   0.831
Sigmoid   Sigmoid   0.836

Table 1(c): Comparison of I-AutoRec with baselines on MovieLens and Netflix.
           ML-1M   ML-10M   Netflix
BiasedMF   0.845   0.803    0.844
I-RBM      0.854   0.825    -
U-RBM      0.881   0.823    0.845
LLORMA     0.833   0.782    0.834
I-AutoRec  0.831   0.782    0.823

We remark that I-RBM did not converge after one week of training on Netflix. LLORMA's performance is taken from [2].

AutoRec is distinct from existing CF approaches. Compared to the RBM-based CF model (RBM-CF) [4], there are several differences. First, RBM-CF proposes a generative, probabilistic model based on restricted Boltzmann machines, while AutoRec is a discriminative model based on autoencoders. Second, RBM-CF estimates parameters by maximising log likelihood, while AutoRec directly minimises RMSE, the canonical performance metric in rating prediction tasks. Third, training RBM-CF requires contrastive divergence, whereas training AutoRec uses the comparatively faster gradient-based backpropagation. Finally, RBM-CF is only applicable to discrete ratings, and estimates a separate set of parameters for each rating value. For r possible ratings, this implies nkr (or mkr) parameters for user-based (item-based) RBM. AutoRec is agnostic to r and hence requires fewer parameters; this gives AutoRec a smaller memory footprint and makes it less prone to overfitting. Compared to matrix factorisation (MF) approaches, which embed both users and items into a shared latent space, the item-based AutoRec model only embeds items into the latent space. Further, while MF learns a linear latent representation, AutoRec can learn a nonlinear latent representation through the activation function g(·).

3. EXPERIMENTAL EVALUATION

In this section, we evaluate and compare AutoRec with RBM-CF [4], Biased Matrix Factorisation [1] (BiasedMF), and Local Low-Rank Matrix Factorisation (LLORMA) [2] on the MovieLens 1M, 10M and Netflix datasets. Following [2], we use a default rating of 3 for test users or items without training observations. We split the data into random 90%–10% train-test sets, and hold out 10% of the training set for hyperparameter tuning. We repeat this splitting procedure 5 times and report average RMSE. 95% confidence intervals on RMSE were ±0.003 or less in each experiment. For all baselines, we tuned the regularisation strength λ ∈ {0.001, 0.01, 0.1, 1, 100, 1000} and the appropriate latent dimension k ∈ {10, 20, 40, 80, 100, 200, 300, 400, 500}.

A challenge in training autoencoders is the non-convexity of the objective. We found resilient propagation (RProp) [3] to give comparable performance to L-BFGS, while being much faster. Thus, we use RProp for all subsequent experiments.

Which is better, item- or user-based autoencoding with RBMs or AutoRec? Table 1a shows that item-based (I-) methods for RBM and AutoRec generally perform better; this is likely because the average number of ratings per item is much higher than the average number per user, and high variance in the number of user ratings leads to less reliable predictions for user-based methods. I-AutoRec outperforms all RBM variants.

How does AutoRec performance vary with linear and nonlinear activation functions f(·), g(·)? Table 1b indicates that nonlinearity in the hidden layer (via g(·)) is critical for good performance of I-AutoRec, indicating its potential advantage over MF methods. Replacing sigmoids with Rectified Linear Units (ReLU) performed worse. All other AutoRec experiments use identity f(·) and sigmoid g(·) functions.

How does performance of AutoRec vary with the number of hidden units? In Figure 2, we evaluate the performance of the AutoRec model as the number of hidden units varies. We note that performance steadily improves with the number of hidden units, but with diminishing returns. All other AutoRec experiments use k = 500.

Figure 2: RMSE of I-AutoRec on MovieLens 1M as the number of hidden units k varies.

How does AutoRec perform against all baselines? Table 1c shows that AutoRec consistently outperforms all baselines, except for comparable results with LLORMA on MovieLens 10M. Competitive performance with LLORMA is of interest, as the latter involves weighting 50 different local matrix factorization models, whereas AutoRec only uses a single latent representation via a neural net autoencoder.

Do deep extensions of AutoRec help? We developed a deep version of I-AutoRec with three hidden layers of (500, 250, 500) units, each with a sigmoid activation. We used greedy pretraining and then fine-tuned by gradient descent. On MovieLens 1M, RMSE reduces from 0.831 to 0.827, indicating potential for further improvement via deep AutoRec.

Acknowledgments: NICTA is funded by the Australian Government as represented by the Dept. of Communications and the ARC through the ICT Centre of Excellence program. This research was supported in part by ARC DP140102185.

References
[1] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42, 2009.
[2] J. Lee, S. Kim, G. Lebanon, and Y. Singer. Local low-rank matrix approximation. In ICML, 2013.
[3] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In IEEE International Conference on Neural Networks, 1993.
[4] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In ICML, 2007.
[5] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW, 2001.





AutoRec: Autoencoders Meet Collaborative Filtering

Suvash Sedhain∗†, Aditya Krishna Menon†∗, Scott Sanner†∗, Lexing Xie∗†
†NICTA, ∗Australian National University
[email protected], {aditya.menon, scott.sanner}@nicta.com.au, lexing.xie@anu.edu.au

ABSTRACT
This paper proposes AutoRec, a novel autoencoder framework for collaborative filtering (CF). Empirically, AutoRec's compact and efficiently trainable model outperforms state-of-the-art CF techniques (biased matrix factorization, RBM-CF and LLORMA) on the MovieLens and Netflix datasets.

Categories and Subject Descriptors: D.2.8 [Information Storage and Retrieval]: Information Filtering

Keywords: Recommender Systems; Collaborative Filtering; Autoencoders

1. INTRODUCTION
Collaborative filtering (CF) models aim to exploit information about users' preferences for items (e.g. star ratings) to provide personalised recommendations. Owing to the Netflix challenge, a panoply of different CF models have been proposed, with popular choices being matrix factorisation [1, 2] and neighbourhood models [5]. This paper proposes AutoRec, a new CF model based on the autoencoder paradigm; our interest in this paradigm stems from the recent successes of (deep) neural network models for vision and speech tasks. We argue that AutoRec has representational and computational advantages over existing neural approaches to CF [4], and demonstrate empirically that it outperforms the current state-of-the-art methods.

2. THE AUTOREC MODEL
In rating-based collaborative filtering, we have m users, n items, and a partially observed user-item rating matrix R ∈ R^{m×n}. Each user u ∈ U = {1, ..., m} can be represented by a partially observed vector r^(u) = (R_{u1}, ..., R_{un}) ∈ R^n. Similarly, each item i ∈ I = {1, ..., n} can be represented by a partially observed vector r^(i) = (R_{1i}, ..., R_{mi}) ∈ R^m. Our aim in this work is to design an item-based (user-based) autoencoder which can take as input each partially observed r^(i) (r^(u)), project it into a low-dimensional latent (hidden) space, and then reconstruct r^(i) (r^(u)) in the output space to predict missing ratings for purposes of recommendation.
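In this notation, r^(i) is simply the i-th column of R and r^(u) the u-th row. A minimal NumPy illustration (toy matrix with assumed values, 0 standing in for unobserved entries):

```python
import numpy as np

# Toy partially observed rating matrix R (m=3 users, n=4 items); 0 = unobserved.
R = np.array([[5, 0, 3, 0],
              [0, 2, 0, 4],
              [1, 0, 0, 5]], dtype=float)

r_u = R[1, :]   # user vector r^(2): that user's ratings over all n items
r_i = R[:, 3]   # item vector r^(4): all m users' ratings of that item

print(r_u)  # [0. 2. 0. 4.]
print(r_i)  # [0. 4. 5.]
```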

Formally, given a set S of vectors in R^d, and some k ∈ N_+, an autoencoder solves

    min_θ Σ_{r∈S} ||r − h(r; θ)||²_2,    (1)

Copyright is held by the author/owner(s).

WWW 2015 Companion, May 18–22, 2015, Florence, Italy.

ACM 978-1-4503-3473-0/15/05.

http://dx.doi.org/10.1145/2740908.2742726

[Figure 1 diagram: input r^(i) = (R_{1i}, R_{2i}, R_{3i}, ..., R_{mi}) is encoded through weights V into the hidden layer (with a +1 bias unit), then decoded through weights W to reconstruct r^(i), replicated in a plate over items i = 1...n.]

Figure 1: Item-based AutoRec model. We use plate notation to indicate that there are n copies of the neural network (one for each item), where W and V are tied across all copies.

where h(r; θ) is the reconstruction of input r ∈ R^d,

    h(r; θ) = f(W · g(Vr + μ) + b)

for activation functions f(·), g(·). Here, θ = {W, V, μ, b} for transformations W ∈ R^{d×k}, V ∈ R^{k×d}, and biases μ ∈ R^k, b ∈ R^d. This objective corresponds to an auto-associative neural network with a single, k-dimensional hidden layer. The parameters θ are learned using backpropagation.

The item-based AutoRec model, shown in Figure 1, applies an autoencoder as per Equation 1 to the set of vectors {r^(i)}_{i=1}^{n}, with two important changes. First, we account for the fact that each r^(i) is partially observed by only updating during backpropagation those weights that are associated with observed inputs, as is common in matrix factorisation and RBM approaches. Second, we regularise the learned parameters so as to prevent overfitting on the observed ratings. Formally, the objective function for the Item-based AutoRec (I-AutoRec) model is, for regularisation strength λ > 0,

    min_θ Σ_{i=1}^{n} ||r^(i) − h(r^(i); θ)||²_O + (λ/2) · (||W||²_F + ||V||²_F),    (2)

where || · ||²_O means that we only consider the contribution of observed ratings. User-based AutoRec (U-AutoRec) is derived by working with {r^(u)}_{u=1}^{m}. In total, I-AutoRec requires the estimation of 2mk + m + k parameters. Given learned parameters θ̂, I-AutoRec's predicted rating for user u and item i is

    R̂_ui = (h(r^(i); θ̂))_u.    (3)

Figure 1 illustrates the model, with shaded nodes corresponding to observed ratings, and solid connections corresponding to weights that are updated for the input r^(i). (In the item-based model, d = m, so the dimensions are: W is m × k, V is k × m, r is m × 1, μ is k × 1, and b is m × 1.)
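To make the prediction rule of Equation 3 and the 2mk + m + k parameter count concrete, the fragment below is a toy NumPy sketch with assumed sizes and randomly drawn "learned" parameters; it is an illustration only, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 5, 3                                  # users, hidden units (toy sizes)

# Stand-ins for learned parameters theta-hat = {W, V, mu, b}, item-based (d = m).
W = rng.standard_normal((m, k))
V = rng.standard_normal((k, m))
mu = rng.standard_normal(k)
b = rng.standard_normal(m)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(r_i, u):
    """R-hat_ui = (h(r^(i); theta-hat))_u with identity f and sigmoid g."""
    return (W @ sigmoid(V @ r_i + mu) + b)[u]

r_i = np.array([4.0, 0.0, 0.0, 5.0, 3.0])    # item i's partially observed ratings
print(predict(r_i, u=1))                     # fill in user 1's missing rating

# I-AutoRec estimates 2mk + m + k parameters in total.
n_params = W.size + V.size + mu.size + b.size
assert n_params == 2 * m * k + m + k
```

Note that prediction for every user of item i comes from a single forward pass on r^(i): the whole output vector h(r^(i); θ̂) is the reconstructed rating column.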

AutoRec: Autoencoders Meet Collaborative Filtering

Suvash Sedhain

†⇤, Aditya Krishna Menon

†⇤, Scott Sanner

†⇤, Lexing Xie

⇤††

NICTA,

⇤Australian National University

[email protected], { aditya.menon, scott.sanner }@nicta.com.au,

[email protected]

ABSTRACTThis paper proposes AutoRec, a novel autoencoder frame-work for collaborative filtering (CF). Empirically, AutoRec’scompact and e�ciently trainable model outperforms state-of-the-art CF techniques (biased matrix factorization, RBM-CF and LLORMA) on the Movielens and Netflix datasets.

Categories and Subject Descriptors
D.2.8 [Information Storage and Retrieval]: Information Filtering

Keywords
Recommender Systems; Collaborative Filtering; Autoencoders

1. INTRODUCTION
Collaborative filtering (CF) models aim to exploit information about users' preferences for items (e.g. star ratings) to provide personalised recommendations. Owing to the Netflix challenge, a panoply of different CF models have been proposed, with popular choices being matrix factorisation [1, 2] and neighbourhood models [5]. This paper proposes AutoRec, a new CF model based on the autoencoder paradigm; our interest in this paradigm stems from the recent successes of (deep) neural network models for vision and speech tasks. We argue that AutoRec has representational and computational advantages over existing neural approaches to CF [4], and demonstrate empirically that it outperforms the current state-of-the-art methods.

2. THE AUTOREC MODEL
In rating-based collaborative filtering, we have m users, n items, and a partially observed user-item rating matrix R ∈ R^{m×n}. Each user u ∈ U = {1, ..., m} can be represented by a partially observed vector r^{(u)} = (R_{u1}, ..., R_{un}) ∈ R^n. Similarly, each item i ∈ I = {1, ..., n} can be represented by a partially observed vector r^{(i)} = (R_{1i}, ..., R_{mi}) ∈ R^m. Our aim in this work is to design an item-based (user-based) autoencoder which can take as input each partially observed r^{(i)} (r^{(u)}), project it into a low-dimensional latent (hidden) space, and then reconstruct r^{(i)} (r^{(u)}) in the output space to predict missing ratings for purposes of recommendation.
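As a small illustration of these definitions (not from the paper), the vectors r^{(u)} and r^{(i)} are simply the rows and columns of R. A NumPy sketch with np.nan marking unobserved entries and made-up rating values:

```python
import numpy as np

# Toy rating matrix R in R^{m x n} (m = 3 users, n = 3 items);
# np.nan marks unobserved entries. Values are made up for illustration.
R = np.array([[4.0,    5.0,    np.nan],
              [np.nan, 2.0,    3.0],
              [1.0,    np.nan, 5.0]])

r_u = R[0, :]              # r^{(0)}: user 0's partially observed vector in R^n
r_i = R[:, 1]              # r^{(1)}: item 1's partially observed vector in R^m
observed = ~np.isnan(r_i)  # mask of observed ratings for item 1
```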

Formally, given a set S of vectors in R^d, and some k ∈ N_+, an autoencoder solves

    min_θ Σ_{r∈S} ||r − h(r; θ)||²_2,    (1)

Copyright is held by the author/owner(s).

WWW 2015 Companion, May 18–22, 2015, Florence, Italy.

ACM 978-1-4503-3473-0/15/05.

http://dx.doi.org/10.1145/2740908.2742726.

[Figure 1 omitted: a two-layer network mapping the input r^{(i)} = (R_{1i}, R_{2i}, R_{3i}, ..., R_{mi}) through encoder weights V (with a +1 bias unit) to the hidden layer, and through decoder weights W (with a +1 bias unit) back to the reconstruction of r^{(i)}; a plate i = 1...n indicates one copy per item.]

Figure 1: Item-based AutoRec model. We use plate notation to indicate that there are n copies of the neural network (one for each item), where W and V are tied across all copies.

where h(r; θ) is the reconstruction of input r ∈ R^d,

    h(r; θ) = f(W · g(Vr + µ) + b)

for activation functions f(·), g(·). Here, θ = {W, V, µ, b} for transformations W ∈ R^{d×k}, V ∈ R^{k×d}, and biases µ ∈ R^k, b ∈ R^d. This objective corresponds to an auto-associative neural network with a single, k-dimensional hidden layer. The parameters θ are learned using backpropagation.

The item-based AutoRec model, shown in Figure 1, applies an autoencoder as per Equation 1 to the set of vectors {r^{(i)}}_{i=1}^{n}, with two important changes. First, we account for the fact that each r^{(i)} is partially observed by only updating during backpropagation those weights that are associated with observed inputs, as is common in matrix factorisation and RBM approaches. Second, we regularise the learned parameters so as to prevent overfitting on the observed ratings. Formally, the objective function for the Item-based AutoRec (I-AutoRec) model is, for regularisation strength λ > 0,

    min_θ Σ_{i=1}^{n} ||r^{(i)} − h(r^{(i)}; θ)||²_O + (λ/2) · (||W||²_F + ||V||²_F),    (2)

where || · ||²_O means that we only consider the contribution of observed ratings. User-based AutoRec (U-AutoRec) is derived by working with {r^{(u)}}_{u=1}^{m}. In total, I-AutoRec requires the estimation of 2mk + m + k parameters. Given learned parameters θ̂, I-AutoRec's predicted rating for user u and item i is

    R̂_{ui} = (h(r^{(i)}; θ̂))_u.    (3)

Figure 1 illustrates the model, with shaded nodes corresponding to observed ratings, and solid connections corresponding to weights that are updated for the input r^{(i)}.
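The forward pass, masked objective, and prediction rule (Equations 1-3) can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' code; the dimensions, random initialisation, and the choice of identity f with sigmoid g are assumptions for the sketch:

```python
import numpy as np

# Illustrative sketch of I-AutoRec; parameter names follow the paper:
# V, mu encode; W, b decode. Backpropagation itself is omitted.
rng = np.random.default_rng(0)
m, k = 6, 3                               # m users (input dim d = m), k hidden units

V = rng.normal(scale=0.1, size=(k, m))    # encoder weights, R^{k x m}
mu = np.zeros(k)                          # encoder bias
W = rng.normal(scale=0.1, size=(m, k))    # decoder weights, R^{m x k}
b = np.zeros(m)                           # decoder bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def h(r, f=lambda x: x, g=sigmoid):
    """Reconstruction h(r; theta) = f(W . g(V r + mu) + b)."""
    return f(W @ g(V @ r + mu) + b)

def i_autorec_objective(r, mask, lam=0.01):
    """Masked squared error ||r - h(r)||^2_O plus L2 regularisation (Eq. 2)."""
    err = (r - h(r)) * mask               # only observed entries contribute
    return np.sum(err ** 2) + 0.5 * lam * (np.sum(W ** 2) + np.sum(V ** 2))

# A partially observed item column r^{(i)}: 0 marks a missing rating here.
r_i = np.array([4.0, 5.0, 0.0, 2.0, 0.0, 3.0])
mask = (r_i > 0).astype(float)

loss = i_autorec_objective(r_i, mask)
pred_for_user_2 = h(r_i)[2]               # Eq. 3: the u-th output of h(r^{(i)})
```

Note that predictions for users who did not rate item i fall out of the same forward pass; the mask only affects training, not prediction.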



[1] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42, 2009.
[2] J. Lee, S. Kim, G. Lebanon, and Y. Singer. Local low-rank matrix approximation. In ICML, 2013.
[3] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In IEEE International Conference on Neural Networks, 1993.
[4] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In ICML, 2007.
[5] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW, 2001.

Future work: further exploration of deep autoencoders, and applications to implicit feedback datasets.


Data      #users    #items
ML-1M     6,040     3,706
ML-10M    69,878    10,677
Netflix   480,189   17,770


Autoencoders have proven very successful in various vision and speech problems. Can we do the same for collaborative filtering?

For each item, construct a (partially observed) vector of ratings, then perform autoencoding on the result, where:
- weights are tied across items
- only observed ratings are used to update the model
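The second point, that only observed ratings update the model, can be checked numerically. In this hypothetical sketch (identity output activation, made-up dimensions and values), the gradient of the masked loss with respect to the decoder rows of users with missing ratings is exactly zero:

```python
import numpy as np

# With out = W g(Vr + mu) + b, the gradient of sum(((out - r) * mask)^2)
# w.r.t. row u of W is 2 * err[u] * z, so masked (unobserved) users
# contribute nothing to the update. Regulariser omitted for clarity.
rng = np.random.default_rng(1)
m, k = 5, 2
V, mu = rng.normal(size=(k, m)), np.zeros(k)
W, b = rng.normal(size=(m, k)), np.zeros(m)

r = np.array([5.0, 0.0, 3.0, 0.0, 4.0])   # 0 = missing rating
mask = (r > 0).astype(float)

z = 1.0 / (1.0 + np.exp(-(V @ r + mu)))   # hidden activations g(Vr + mu)
err = (W @ z + b - r) * mask              # masked reconstruction error
grad_W = 2.0 * np.outer(err, z)           # gradient of the masked loss w.r.t. W

unobserved_rows_zero = np.allclose(grad_W[[1, 3]], 0.0)
```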

A: Item-based is superior.

A: Nonlinearity in the hidden units is key for performance

A: Good performance with ~400 hidden units

A: Systematically outperforms state-of-the-art methods

Given a partially observed user-item rating matrix R ∈ R^{m×n}, the goal is to fill in the missing entries; the training objective is Equation (2) above.



              AutoRec                 RBM-CF
Model type    Discriminative          Generative
Objective     RMSE                    Log-likelihood
Optimisation  Gradient-based (fast)   Contrastive divergence (slow)
Ratings       Real-valued             Discrete

                AutoRec       Matrix Factorization
Embedding       Users only    Users and items (more parameters)
Representation  Non-linear    Linear

Try it now: https://github.com/mesuvash/NNRec

A: Yes. Deep I-AutoRec with three hidden layers (500-200-500) reduced RMSE from 0.831 to 0.827 on the ML-1M dataset.
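The 500-200-500 deep variant can be sketched as a stack of layers. This is a hypothetical, untrained forward pass; the input dimension, initialisation scale, and the choice of sigmoid hidden activations with a linear output are assumptions, included only to show the shape of the architecture:

```python
import numpy as np

# Deep I-AutoRec sketch: three hidden layers (500-200-500) between an
# m-dimensional input and output. Weights are random and untrained.
rng = np.random.default_rng(2)
m = 1000                                  # illustrative input dimension
sizes = [m, 500, 200, 500, m]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

weights = [rng.normal(scale=0.05, size=(sizes[i + 1], sizes[i]))
           for i in range(len(sizes) - 1)]
biases = [np.zeros(sizes[i + 1]) for i in range(len(sizes) - 1)]

def deep_h(r):
    """Forward pass through all layers; linear (identity) output layer."""
    a = r
    for i, (Wl, bl) in enumerate(zip(weights, biases)):
        pre = Wl @ a + bl
        a = pre if i == len(weights) - 1 else sigmoid(pre)
    return a

out = deep_h(rng.normal(size=m))
```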

[Figure omitted: a partially observed user-item rating matrix (rows: users, columns: items) with known entries such as 4, 5, 2, 3 and "?" for missing ratings, mapped through a low-dimensional embedding.]