
Dictionary Learning for Massive Matrix Factorization

Arthur Mensch, Julien Mairal, Gaël Varoquaux, Bertrand Thirion

Inria Parietal, Inria Thoth

October 6, 2016


Introduction

Why am I here?

Inria Parietal: machine learning for neuro-imaging (fMRI data)

Matrix factorization: a major ingredient in fMRI analysis

Very large datasets (2 TB): we designed faster algorithms

These algorithms can be used in collaborative filtering

[Figure: fMRI data matrix $X$ (voxels $\times$ time) factorized as $X = DA$, where $D$ holds $k$ spatial maps and $A$ their time courses]

Work presented at ICML 2016


Outline

1. Matrix factorization for recommender systems: collaborative filtering; matrix factorization formulation; existing methods
2. Subsampled online dictionary learning: dictionary learning and existing methods; handling missing values efficiently; new algorithm
3. Results: setting; benchmarks; parameter setting


Collaborative filtering

Collaborative platform

$n$ users rate a fraction of $p$ items (e.g. movies, restaurants)

Estimate the missing ratings for recommendation

Use the ratings of other users for recommendation


How to predict ratings?

Credit: [Bell and Koren, 2007]

Joe likes We Were Soldiers and Black Hawk Down.

Bob and Alice like the same films, and also like Saving Private Ryan.

Joe should watch Saving Private Ryan, because all of them like war films.

Need to uncover topics in items


Predicting ratings with scalar products

Embeddings to model genres/categories/topics

Representative vectors for users and items: $(\alpha_j)_{1 \le j \le n},\ (d_i)_{1 \le i \le p} \in \mathbb{R}^k$

The $q$-th coefficients of $d_i$ and $\alpha_j$ = affinity with "topic" $q$

Ratings $x_{ij}$ (item $i$, user $j$): $x_{ij} = d_i^\top \alpha_j$ (+ biases) = common affinity for topics

[Figure: rating $x_{ij}$ as the scalar product of item vector $d_i$ and user vector $\alpha_j$ across $k$ topics]

Learning problem: estimate D and A with known ratings

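To make the prediction rule concrete, here is a minimal numpy sketch; the sizes and random embeddings are toy assumptions standing in for fitted factors:

```python
import numpy as np

# Hypothetical toy sizes: p items, n users, k latent topics.
p, n, k = 1000, 500, 30
rng = np.random.default_rng(0)
D = rng.standard_normal((p, k))   # item vectors d_i as rows
A = rng.standard_normal((k, n))   # user vectors alpha_j as columns

def predict(i, j):
    """Predicted rating x_ij = d_i^T alpha_j (biases omitted)."""
    return D[i] @ A[:, j]

X_hat = D @ A                     # all p x n predictions at once
assert np.isclose(predict(3, 7), X_hat[3, 7])
```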

Matrix factorization

[Figure: observed matrix $\mathbf{1}_\Omega \ast X$ ($p \times n$) approximated by the product of $D$ ($p \times k$) and $A$ ($k \times n$)]

$$X \in \mathbb{R}^{p \times n} \approx DA, \qquad D \in \mathbb{R}^{p \times k},\ A \in \mathbb{R}^{k \times n}$$

Constraints / penalties on the factors $D$ and $A$

We only observe $\mathbf{1}_\Omega \ast X$, where $\Omega$ is the set of ratings provided by users

Recommender systems: millions of users, millions of items

How to scale matrix factorization to very large datasets?


Formalism

Finding representations in $\mathbb{R}^k$ for items and users:

$$\min_{D \in \mathbb{R}^{p \times k},\ A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda (\|D\|_F^2 + \|A\|_F^2) = \|\mathbf{1}_\Omega \ast (X - DA)\|_2^2 + \lambda (\|D\|_F^2 + \|A\|_F^2)$$

where $\Omega$ is the set of known ratings.

$\ell_2$ reconstruction loss — $\ell_2$ penalty for generalization

Existing methods

Alternated minimization

Stochastic gradient descent


Existing methods

Alternated minimization

Minimize over $A$ and $D$: alternate between

$$D \leftarrow \operatorname*{argmin}_{D \in \mathbb{R}^{p \times k}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \|D\|_F^2$$

$$A \leftarrow \operatorname*{argmin}_{A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda \|A\|_F^2$$

No hyperparameters

Slow and memory-expensive: uses all ratings at each iteration

a.k.a. coordinate descent (variants differ in the parameter update order)
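Each half of the alternation decouples into independent ridge regressions restricted to the observed entries. A minimal dense sketch of one pass, assuming the same notation as above (this helper is illustrative, not the benchmarked solver):

```python
import numpy as np

def alternating_pass(X, mask, D, A, lam):
    """One alternated-minimization pass: user factors, then item factors."""
    k = D.shape[1]
    for j in range(X.shape[1]):            # each user: ridge on rated items
        rows = mask[:, j]
        Dj = D[rows]                       # dictionary rows for rated items
        A[:, j] = np.linalg.solve(Dj.T @ Dj + lam * np.eye(k),
                                  Dj.T @ X[rows, j])
    for i in range(X.shape[0]):            # each item: ridge on its raters
        cols = mask[i]
        Ai = A[:, cols]                    # codes of the users who rated i
        D[i] = np.linalg.solve(Ai @ Ai.T + lam * np.eye(k),
                               Ai @ X[i, cols])
    return D, A
```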


Existing methods

Stochastic gradient descent

$$\min_{A, D} \sum_{(i,j) \in \Omega} f_{ij}(A, D), \qquad f_{ij}(A, D) \stackrel{\text{def}}{=} (x_{ij} - d_i^\top \alpha_j)^2 + \frac{\lambda}{c_j} \|\alpha_j\|_2^2 + \frac{\lambda}{c_i} \|d_i\|_2^2$$

Gradient step for each rating:

$$(A_t, D_t) \leftarrow (A_{t-1}, D_{t-1}) - \frac{1}{c_t} \nabla_{(A,D)} f_{ij}(A_{t-1}, D_{t-1})$$

Fast and memory-efficient – won the Netflix prize

Very sensitive to the step sizes $(c_t)$ – they need to be cross-validated
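A minimal SGD sketch over rating triplets, under two simplifying assumptions of ours: the per-user/per-item weights $c_i$, $c_j$ are folded into $\lambda$, and the step sizes follow a simple $1/\sqrt{t}$ decay (one of many possible schedules, which is exactly the tuning burden mentioned above):

```python
import numpy as np

def sgd_epoch(ratings, D, A, lam, step0):
    """One SGD pass over (i, j, x_ij) triplets."""
    for t, (i, j, x) in enumerate(ratings, start=1):
        step = step0 / np.sqrt(t)          # decaying step size c_t
        err = x - D[i] @ A[:, j]           # prediction error on this rating
        d_old = D[i].copy()                # keep d_i before its update
        D[i]    += step * (err * A[:, j] - lam * D[i])
        A[:, j] += step * (err * d_old   - lam * A[:, j])
    return D, A
```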


Towards a new algorithm

Best of both worlds?

Fast and memory-efficient algorithm

Not very sensitive to hyperparameter settings

Subsampled online dictionary learning

Builds upon the online dictionary learning algorithm, popular in computer vision and interpretable learning (fMRI)

Adapts it to handle missing values efficiently


Outline

1. Matrix factorization for recommender systems: collaborative filtering; matrix factorization formulation; existing methods
2. Subsampled online dictionary learning: dictionary learning and existing methods; handling missing values efficiently; new algorithm
3. Results: setting; benchmarks; parameter setting


Dictionary learning

Recall: recommender system formalism

Non-masked matrix factorization with an $\ell_2$ penalty:

$$\min_{D \in \mathbb{R}^{p \times k},\ A \in \mathbb{R}^{k \times n}} \sum_{j=1}^{n} \|x_j - D\alpha_j\|_2^2 + \lambda (\|D\|_F^2 + \|A\|_F^2)$$

Penalties can be richer, and made into constraints

Dictionary learning

Learn the left side factor [Olshausen and Field, 1997]

$$\min_{D \in \mathcal{C}} \sum_{j=1}^{n} \|x_j - D\alpha_j\|_2^2 + \lambda \Omega(\alpha_j), \qquad \alpha_j = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_j - D\alpha\|_2^2 + \lambda \Omega(\alpha)$$

Naive approach: alternated minimization


Online dictionary learning [Mairal et al., 2010]

At iteration $t$, select $x_t$ among the samples $(x_j)_j$ (user ratings), improve $D$

Single-iteration complexity ∝ sample dimension: $O(p)$

$(D_t)_t$ converges in a few epochs (a single one for large $n$)

[Figure: columns $x_t$ of $X$ ($p \times n$) streamed one at a time to update $D$ and the codes $\alpha_t$]

Very efficient in computer vision / networks / fMRI / hyperspectral images

Can we use it efficiently for recommender systems?


In short: Handling missing values

[Figure: left, batch → online: stream full columns $x_t$ of $X$ to handle large $n$; right, online → online + partial: stream masked columns $M_t x_t$, ignoring unknown and unaccessed entries, to also handle missing values]

Leverage streaming + partial access to samples


In detail: online dictionary learning

Objective function involves latent codes (right side factor)

$$\min_{D \in \mathcal{C}} \frac{1}{t} \sum_{i=1}^{t} \|x_i - D\,\alpha_i^*(D)\|_2^2, \qquad \alpha_i^*(D) = \operatorname*{argmin}_{\alpha} \frac{1}{2} \|x_i - D\alpha\|_2^2 + \lambda \Omega(\alpha)$$

Replace latent codes by codes computed with old dictionaries

Build an upper-bounding surrogate function

$$\min_{D \in \mathcal{C}} \frac{1}{t} \sum_{i=1}^{t} \|x_i - D\alpha_i\|_2^2, \qquad \alpha_i = \operatorname*{argmin}_{\alpha} \frac{1}{2} \|x_i - D_{i-1}\alpha\|_2^2 + \lambda \Omega(\alpha)$$

Minimize the surrogate, which can be updated online at low cost


In detail: online dictionary learning

Algorithm outline

1. Compute the code for $x_t$ (complexity depends on $p$):

$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda \Omega(\alpha)$$

2. Update the surrogate function (complexity in $O(p)$):

$$g_t(D) = \frac{1}{t} \sum_{i=1}^{t} \|x_i - D\alpha_i\|_2^2 = \operatorname{Tr}\Big(\frac{1}{2} D^\top D A_t - D^\top B_t\Big) + \text{cst}$$

$$A_t = \Big(1 - \frac{1}{t}\Big) A_{t-1} + \frac{1}{t} \alpha_t \alpha_t^\top, \qquad B_t = \Big(1 - \frac{1}{t}\Big) B_{t-1} + \frac{1}{t} x_t \alpha_t^\top$$

3. Minimize the surrogate (complexity in $O(p)$):

$$D_t = \operatorname*{argmin}_{D \in \mathcal{C}} g_t(D), \qquad \nabla g_t = D A_t - B_t$$

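Putting the three steps together, here is a compact numpy sketch of the plain (non-masked) online algorithm. Two choices are simplifying assumptions of ours, not the deck's: an $\ell_2$ code penalty $\Omega$ (so step 1 has a closed form) and dictionary columns constrained to the unit $\ell_2$ ball:

```python
import numpy as np

def online_dict_learning(X, k, lam, n_iter, seed=0):
    """Sketch of online matrix factorization [Mairal et al., 2010]."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)         # start with unit-norm columns
    A_bar = np.zeros((k, k))               # running average of alpha alpha^T
    B_bar = np.zeros((p, k))               # running average of x alpha^T
    for t in range(1, n_iter + 1):
        x = X[:, rng.integers(n)]          # stream one sample
        # 1. code computation (closed form for the l2 penalty)
        alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
        # 2. surrogate update: A_t, B_t
        w = 1.0 / t
        A_bar = (1 - w) * A_bar + w * np.outer(alpha, alpha)
        B_bar = (1 - w) * B_bar + w * np.outer(x, alpha)
        # 3. surrogate minimization: block coordinate descent on columns
        for j in range(k):
            grad_j = D @ A_bar[:, j] - B_bar[:, j]   # j-th column of D A_t - B_t
            D[:, j] -= grad_j / max(A_bar[j, j], 1e-12)
            D[:, j] /= max(np.linalg.norm(D[:, j]), 1.0)  # project onto l2 ball
    return D
```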

Specification for a new algorithm

[Figure: masked sample $M_t x_t$ streamed in; entries outside the mask are ignored]

Constrained: use only the known ratings from $\Omega$

Efficient: single iteration in $O(s)$, where $s$ is the number of ratings provided by user $t$

Principled: follows the online matrix factorization algorithm as closely as possible


Missing values in practice

Data stream: $(x_t)_t$ → masked $(M_t x_t)_t$ = ratings from user $t$

Dimension: $p$ (all items) → $s$ (rated items)

Use only $M_t x_t$ in the computations → complexity in $O(s)$

Adaptations to make: modify all parts of the algorithm to obtain $O(s)$ complexity

1. Code computation
2. Surrogate update
3. Surrogate minimization


Subsampled online dictionary learning (check out the paper!)

Original online MF:

1. Code computation:
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda \Omega(\alpha)$$

2. Surrogate aggregation:
$$A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top, \qquad B_t = B_{t-1} + \frac{1}{t} (x_t \alpha_t^\top - B_{t-1})$$

3. Surrogate minimization, column by column:
$$D^j \leftarrow p^\perp_{\mathcal{C}_j}\Big(D^j - \frac{1}{(A_t)_{j,j}} (D A_t^j - B_t^j)\Big)$$

Our algorithm:

1. Code computation, masked loss:
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^k} \|M_t (x_t - D_{t-1}\alpha)\|_2^2 + \lambda \frac{\operatorname{rk} M_t}{p} \Omega(\alpha)$$

2. Surrogate aggregation:
$$A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top, \qquad B_t = B_{t-1} + \frac{1}{\sum_{i=1}^{t} M_i} (M_t x_t \alpha_t^\top - M_t B_{t-1})$$

3. Surrogate minimization, on the observed rows only:
$$M_t D^j \leftarrow p^\perp_{\mathcal{C}_j}\Big(M_t D^j - \frac{1}{(A_t)_{j,j}} M_t (D A_t^j - B_t^j)\Big)$$
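Step 1 is where the $O(s)$ cost comes from: the dictionary is restricted to the rated rows before solving. A sketch, again assuming an $\ell_2$ penalty for $\Omega$ so the code has a closed form:

```python
import numpy as np

def masked_code(x_obs, obs, D, lam):
    """Code computation on observed entries only (step 1 above).

    x_obs : (s,) ratings provided by user t
    obs   : (s,) integer indices of the rated items (the mask M_t)
    Cost is O(s k^2 + k^3) instead of O(p k^2 + k^3).
    """
    p, k = D.shape
    D_obs = D[obs]                       # M_t D: only the rated rows
    reg = lam * len(obs) / p             # lambda * rk(M_t) / p rescaling
    return np.linalg.solve(D_obs.T @ D_obs + reg * np.eye(k),
                           D_obs.T @ x_obs)
```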


Outline

1. Matrix factorization for recommender systems: collaborative filtering; matrix factorization formulation; existing methods
2. Subsampled online dictionary learning: dictionary learning and existing methods; handling missing values efficiently; new algorithm
3. Results: setting; benchmarks; parameter setting


Experiments

Validation: test RMSE (rating prediction) vs. CPU time

Baseline: coordinate descent solver [Yu et al., 2012] for

$$\min_{D \in \mathbb{R}^{p \times k},\ A \in \mathbb{R}^{k \times n}} \sum_{(i,j) \in \Omega} (x_{ij} - d_i^\top \alpha_j)^2 + \lambda (\|D\|_F^2 + \|A\|_F^2)$$

the fastest solver available apart from SGD — but SGD's hyperparameters must be tuned, whereas our method has a learning rate with little influence

Datasets: MovieLens, Netflix; publicly available (larger ones exist in the industry...)


Results

Scalable algorithm: speed-up improves with size


Performance

Dataset      Test RMSE (CD / SODL)   Convergence time (CD / SODL)   Speed-up
ML 1M        0.872 / 0.866           6 s / 8 s                      ×0.75
ML 10M       0.802 / 0.799           223 s / 60 s                   ×3.7
NF (140M)    0.938 / 0.934           1714 s / 256 s                 ×6.8

(CD: coordinate descent baseline; SODL: subsampled online dictionary learning)

Outperforms coordinate descent beyond 10M ratings

Same prediction performance

6.8× speed-up on Netflix

Simple model: the RMSE is not state-of-the-art


Robustness to learning rate

Theory requires the learning rate $\beta$ to be set in $[0.75, 1]$

In practice: just set it in $[0.8, 1]$

[Figure: test-set RMSE vs. epochs on MovieLens 10M and Netflix, for learning rates $\beta$ ranging from 0.75 to 1.00]


Conclusion

Take-home message

Online matrix factorization can be adapted to handle missing values efficiently, with very good performance in recommender systems

The algorithm is usable in any rich model involving matrix factorization

Python package http://github.com/arthurmensch/modl

Article/slides at http://amensch.fr/publications

Questions?


Appendix: Resting-state fMRI

Online dictionary learning: 235 h run time for 1 full epoch; a 10 h run time covers only 1/24 of an epoch

Proposed method: a 10 h run time covers 1/2 epoch, with reduction r = 12

Qualitatively, usable maps are obtained 10× faster


Bibliography

[Bell and Koren, 2007] Bell, R. M. and Koren, Y. (2007). Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 9(2):75–79.

[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60.

[Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325.

[Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012). Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the IEEE International Conference on Data Mining, pages 765–774.
