
ML+Hadoop at NYC Predictive Analytics


How Spotify uses large scale Machine Learning running on top of Hadoop to power music discovery. From the NYC Predictive Analytics meetup: http://www.meetup.com/NYC-Predictive-Analytics/events/129778152/


Page 1: ML+Hadoop at NYC Predictive Analytics

August 5, 2013

ML ♡ Hadoop @ Spotify

If it’s slow, buy more racks

Page 2: ML+Hadoop at NYC Predictive Analytics

I’m Erik Bernhardsson

Master’s in Physics from KTH in Stockholm
Started at Spotify in 2008, managed the Analytics team for two years
Moved to NYC in 2011, now the Engineering Manager of the Discovery team at Spotify in NYC

2

Page 3: ML+Hadoop at NYC Predictive Analytics

August 5, 2013

What’s Spotify? What are the challenges?

Started in 2006
Currently has 24 million users
6 million paying users
Available in 20 countries
About 300 engineers, of which 70 are in NYC

Page 4: ML+Hadoop at NYC Predictive Analytics

And adding 20K every day...

Big challenge: Spotify has over 20 million tracks

4

Page 5: ML+Hadoop at NYC Predictive Analytics

Good and bad news: we also have 100B streams

Let’s use collaborative filtering!

5

Hey,I like tracks P, Q, R, S!

Well,I like tracks Q, R, S, T!

Then you should check out track P!

Nice! Btw try track T!

Page 6: ML+Hadoop at NYC Predictive Analytics

Hadoop at Spotify

6

Page 7: ML+Hadoop at NYC Predictive Analytics

Back in 2009

Matrix factorization causing cluster to overheat? Don’t worry, put up curtain

7

Page 8: ML+Hadoop at NYC Predictive Analytics


Hadoop today

700 nodes at our data center in London

8

Page 9: ML+Hadoop at NYC Predictive Analytics

The Discover page

9

Page 10: ML+Hadoop at NYC Predictive Analytics

Here’s a secret behind the Discover page

It’s precomputed every night

10

[Diagram: log streams → HADOOP → hdfs2cass → Cassandra → Bartender → music recs]

https://github.com/spotify/luigi

https://github.com/spotify/hdfs2cass

Page 14: ML+Hadoop at NYC Predictive Analytics

OK so how do we come up with recommendations?

Let’s do collaborative filtering!
In particular, implicit collaborative filtering
In particular, matrix factorization (aka latent factor methods)

11

Page 15: ML+Hadoop at NYC Predictive Analytics

Stop!!!

Break it down!!

12

Page 16: ML+Hadoop at NYC Predictive Analytics

Step 1: Collect data

[Diagram: clients send “play track x”, “play track y”, “play track z” events to the access points (AP), which log them into Hadoop at about 5k tracks/s — over 100B streams in total.]

13

Page 17: ML+Hadoop at NYC Predictive Analytics

Step 2: Put everything into a big sparse matrix

14

Using some definition of correlation. E.g. for Pearson:

$$c_{ij} = \frac{\sum_u N_{ui} N_{uj}}{\sqrt{\sum_u N_{ui}^2}\,\sqrt{\sum_u N_{uj}^2}}$$

but it’s very slow, because:

$$N = \begin{pmatrix} 0 & 7 & 21 & 0 \\ 5 & 0 & 0 & 1 \\ 4 & 0 & 13 & 9 \\ 0 & 0 & 0 & 7 \\ 19 & 1 & 0 & 13 \\ 0 & 3 & 0 & 0 \end{pmatrix} \qquad N^T N = \begin{pmatrix} 402 & 19 & 52 & 288 \\ 19 & 59 & 147 & 13 \\ 52 & 147 & 610 & 117 \\ 288 & 13 & 117 & 300 \end{pmatrix}$$

$$O(U \cdot (N/I)^2) \approx 10^7 \cdot (10^{10}/10^7)^2 = 10^{13} \text{ mapper outputs}$$

...where U = number of users, I = number of items, N = number of nonzero entries.

It’s an extremely sparse matrix: almost every entry of M is zero, with scattered play counts (e.g. 53, 12, 7) in a sea of dots.

It’s a very big matrix too:

$$M = \begin{pmatrix} c_{11} & c_{12} & \dots & c_{1n} \\ c_{21} & c_{22} & \dots & c_{2n} \\ \vdots & & & \vdots \\ c_{m1} & c_{m2} & \dots & c_{mn} \end{pmatrix} \qquad 10^7 \text{ users} \times 10^7 \text{ items}$$
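The small example above can be checked directly; here’s a numpy sketch using the toy play-count matrix from the slide (the real job never materializes all pairs, of course — that’s the whole point of the complexity estimate):

```python
import numpy as np

# The toy play-count matrix N from the slide (6 users x 4 items).
N = np.array([
    [ 0, 7, 21,  0],
    [ 5, 0,  0,  1],
    [ 4, 0, 13,  9],
    [ 0, 0,  0,  7],
    [19, 1,  0, 13],
    [ 0, 3,  0,  0],
])

# Item-item co-occurrence: entry (i, j) is sum_u N_ui * N_uj.
G = N.T @ N

# Pearson-style normalization from the slide: divide by the norms of
# the item columns, i.e. the cosine of the play-count vectors.
norms = np.sqrt(np.diag(G))
C = G / np.outer(norms, norms)

print(G[0])  # first row of N^T N: [402  19  52 288]
print(C)     # the c_ij similarities
```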

Page 18: ML+Hadoop at NYC Predictive Analytics

Matrix example

Roughly 25 billion nonzero entries
Total size is roughly 25 billion * 12 bytes = 300 GB (“medium data”)

15


Erik

Never gonna give you up

Erik listened to “Never gonna give you up” 1 time
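The size estimate above is simple arithmetic — each nonzero entry is roughly a (user id, item id, count) triple:

```python
# Back-of-the-envelope for the slide's numbers: ~12 bytes per nonzero
# entry (e.g. two 4-byte ids plus a 4-byte count; the exact layout is
# an assumption, the slide only gives the 12-byte total).
nonzeros = 25 * 10**9
bytes_per_entry = 12
total_gb = nonzeros * bytes_per_entry / 10**9
print(total_gb, "GB")  # 300.0 GB
```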

Page 20: ML+Hadoop at NYC Predictive Analytics

The idea is to find vectors for each user and item. Here’s how it looks algebraically:

Step 3: Matrix factorization

16

Turns out people have been doing this in NLP for a while:

$$M = \begin{pmatrix} c_{11} & c_{12} & \dots & c_{1n} \\ c_{21} & c_{22} & \dots & c_{2n} \\ \vdots & & & \vdots \\ c_{m1} & c_{m2} & \dots & c_{mn} \end{pmatrix} \qquad \text{lots of documents} \times \text{lots of words}$$

Or more generally:

$$P = \begin{pmatrix} p_{11} & p_{12} & \dots & p_{1n} \\ p_{21} & p_{22} & \dots & p_{2n} \\ \vdots & & & \vdots \\ p_{m1} & p_{m2} & \dots & p_{mn} \end{pmatrix}$$

We can look at the play counts as a probability distribution over (user, item) events:

$$\begin{pmatrix} 0 & 0.07 & 0.21 & 0 \\ 0.05 & 0 & 0 & 0.01 \\ 0.04 & 0 & 0.13 & 0.09 \\ 0 & 0 & 0 & 0.07 \\ 0.19 & 0.01 & 0 & 0.13 \\ 0 & 0.03 & 0 & 0 \end{pmatrix}$$

The idea with matrix factorization is to represent this probability distribution like this:

$$p_{ui} = a_u^T b_i, \qquad M' = A^T B$$

i.e. the big matrix of probabilities for the next event is approximated by a tall-and-skinny matrix of user vectors times a short-and-wide matrix of item vectors, both of some small rank f.
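One concrete way to fit such a factorization is plain stochastic gradient descent on the observed entries. This is a minimal sketch on toy counts — the talk’s actual models are probabilistic (PLSA, implicit-feedback factorization), so treat the least-squares objective here as an illustrative stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy play counts (users x items); zeros are unobserved.
N = np.array([
    [ 0, 7, 21,  0],
    [ 5, 0,  0,  1],
    [ 4, 0, 13,  9],
    [ 0, 0,  0,  7],
    [19, 1,  0, 13],
    [ 0, 3,  0,  0],
], dtype=float)

f = 2                                            # latent factors
A = 0.1 * rng.standard_normal((N.shape[0], f))   # user vectors a_u
B = 0.1 * rng.standard_normal((N.shape[1], f))   # item vectors b_i

# SGD on the observed (nonzero) entries:
# minimize sum over observed (u, i) of (N_ui - a_u . b_i)^2.
lr = 0.005
users, items = N.nonzero()
for epoch in range(2000):
    for u, i in zip(users, items):
        err = N[u, i] - A[u] @ B[i]
        A[u], B[i] = A[u] + lr * err * B[i], B[i] + lr * err * A[u]

# Reconstruction error on the observed entries should now be small.
sse = sum((N[u, i] - A[u] @ B[i]) ** 2 for u, i in zip(users, items))
print(sse)
```

At Spotify's scale, of course, the same gradient updates get spread over a Hadoop cluster rather than a double loop.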

Page 21: ML+Hadoop at NYC Predictive Analytics

For instance, for PLSA

Probabilistic Latent Semantic Indexing (Hofmann, 1999)
Originally invented as a method for text classification

17

PLSA factorizes the probability matrix into a matrix of user vectors $P(u|z)$ times a matrix of item vectors $P(i,z)$:

$$P(u, i) = \sum_z P(u|z)\, P(i, z)$$

subject to the normalization constraints

$$\sum_u P(u|z) = 1, \qquad \sum_{i,z} P(i, z) = 1$$

So in general we want to optimize the log-likelihood of the observed play counts:

$$\log \prod_{u,i} P(u,i)^{N_{ui}} = \sum_{u,i} N_{ui} \log P(u,i) = \sum_{u,i} N_{ui} \log \sum_z P(u|z)\, P(i,z)$$

Schematically, with the example counts from before:

$$N \log P = \begin{pmatrix} 0 & 7 & 21 & 0 \\ 5 & 0 & 0 & 1 \\ 4 & 0 & 13 & 9 \\ 0 & 0 & 0 & 7 \\ 19 & 1 & 0 & 13 \\ 0 & 3 & 0 & 0 \end{pmatrix} \log \Big( (\text{user vectors}) \times (\text{item vectors}) \Big)$$
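This objective can be fit with EM. Here is a minimal numpy sketch on the toy counts — single-machine and dense, so only an illustration of the math, not the distributed implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy play counts N (users x items) and Z latent "taste" classes.
N = np.array([
    [ 0, 7, 21,  0],
    [ 5, 0,  0,  1],
    [ 4, 0, 13,  9],
    [ 0, 0,  0,  7],
    [19, 1,  0, 13],
    [ 0, 3,  0,  0],
], dtype=float)
U, I = N.shape
Z = 2

# P(u|z): each column sums to 1.  P(i,z): sums to 1 over all (i, z).
Puz = rng.random((U, Z)); Puz /= Puz.sum(axis=0)
Piz = rng.random((I, Z)); Piz /= Piz.sum()

def loglik():
    P = Puz @ Piz.T          # P(u,i) = sum_z P(u|z) P(i,z)
    mask = N > 0             # terms with N_ui = 0 contribute nothing
    return float((N[mask] * np.log(P[mask])).sum())

ll_before = loglik()
for _ in range(50):
    # E-step: responsibility q(z|u,i) proportional to P(u|z) P(i,z).
    T = Puz[:, None, :] * Piz[None, :, :]        # shape (U, I, Z)
    q = T / T.sum(axis=2, keepdims=True)
    W = N[:, :, None] * q                        # expected counts per z
    # M-step: re-estimate under the two normalization constraints.
    Puz = W.sum(axis=1); Puz /= Puz.sum(axis=0)
    Piz = W.sum(axis=0); Piz /= Piz.sum()
ll_after = loglik()
print(ll_before, "->", ll_after)   # EM increases the log-likelihood
```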

Page 22: ML+Hadoop at NYC Predictive Analytics

Why are vectors nice?

Super small fingerprints of the musical style or the user’s taste
Usually something like 40-200 elements
Hard to illustrate 40 dimensions on a 2-dimensional slide, but here’s an attempt:

18

Track X: (0.87, 1.17, -0.26, 0.56, 2.21, 0.77, -0.03)

[Figure: track x’s vector plotted against latent factor 1 and latent factor 2]

Page 23: ML+Hadoop at NYC Predictive Analytics

Another example of tracks in two dimensions

19

Page 24: ML+Hadoop at NYC Predictive Analytics

Implementing matrix factorization is a little tricky

Iterative algorithms that take many steps to converge
40 parameters for each item and user
So something like 1.2 billion parameters

“Google News Personalization: Scalable Online Collaborative Filtering”

20

Page 25: ML+Hadoop at NYC Predictive Analytics

One iteration, one map/reduce job

21

Map step: all log entries, partitioned by user into K groups (u % K = 0 … K-1).

Reduce step: a K × L grid of shards, one per (u % K, i % L) combination. Shard (x, y) joins its log entries with the user vectors where u % K = x and the item vectors where i % L = y.

[Diagram: K × L grid of reduce shards; each row is fed the matching user-vector partition (u % K = 0 … K-1), each column the matching item-vector partition (i % L = 0 … L-1).]

Page 27: ML+Hadoop at NYC Predictive Analytics

Here’s what happens in one map shard

Input is a bunch of (user, item, count) tuples
user is the same modulo K for all users
item is the same modulo L for all items

22

[Diagram: one map task. Distributed cache: all user vectors where u % K = x, and all item vectors where i % L = y. Map input: tuples (u, i, count) where u % K = x and i % L = y. The mapper emits contributions; the reducer outputs the new vectors.]
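The sharding scheme above can be sketched in a few lines (hypothetical toy data): every (user, item, count) tuple lands in exactly one of the K·L shards, keyed by (user % K, item % L), so each shard only ever needs one user-vector partition and one item-vector partition.

```python
from collections import defaultdict

K, L = 2, 3
# Hypothetical (user, item, count) log entries.
plays = [(0, 0, 4), (1, 2, 7), (2, 3, 1), (3, 5, 9), (4, 4, 2)]

# Partition the log by (u % K, i % L), exactly as in the diagram.
shards = defaultdict(list)
for u, i, count in plays:
    shards[(u % K, i % L)].append((u, i, count))

for key in sorted(shards):
    print(key, shards[key])
```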

Page 28: ML+Hadoop at NYC Predictive Analytics

Might take a while to converge

Start with random vectors around the origin

23

Page 29: ML+Hadoop at NYC Predictive Analytics

Hadoop?

Yeah we could probably do it in Spark 10x or 100x faster. Still, Hadoop is a great way to scale things horizontally.

????

24

Page 30: ML+Hadoop at NYC Predictive Analytics

Nice compact vectors and it’s super fast to compute similarity

25

[Figure: track x and track y plotted on latent factor 1 and latent factor 2; cos(x, y) = HIGH]

VECTORS: $p_{ui} = a_u^T b_i$, and item-item similarity is just a cosine — an $O(f)$ operation:

$$\mathrm{sim}_{ij} = \cos(b_i, b_j) = \frac{b_i^T b_j}{|b_i|\,|b_j|}$$

i                        j                        sim_ij
2pac                     2pac                     1.0
2pac                     Notorious B.I.G.         0.91
2pac                     Dr. Dre                  0.87
2pac                     Florence + the Machine   0.26
Florence + the Machine   Lana Del Rey             0.81

IPMF item-item: a softmax over dot products,

$$P(i \to j) = \exp(b_j^T b_i)/Z_i = \frac{\exp(b_j^T b_i)}{\sum_k \exp(b_k^T b_i)}$$

IPMF item-item MDS: the same with squared Euclidean distances,

$$P(i \to j) = \frac{\exp(-|b_j - b_i|^2)}{\sum_k \exp(-|b_k - b_i|^2)}, \qquad \mathrm{sim}_{ij} = -|b_j - b_i|^2$$

Page 31: ML+Hadoop at NYC Predictive Analytics

Music recommendations are now just dot products

26

[Figure: user u’s vector plotted together with track x and track y on latent factor 1 and latent factor 2]
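Scoring and ranking with dot products is a one-liner; a tiny sketch with hypothetical 2-dimensional vectors (real vectors have 40+ elements):

```python
import numpy as np

# Hypothetical user and track vectors (f = 2 for illustration).
user_u = np.array([0.9, 0.2])
tracks = {
    "track x": np.array([1.0, 0.1]),
    "track y": np.array([-0.3, 1.2]),
    "track z": np.array([0.7, 0.4]),
}

# Score every track with a single dot product a_u . b_i, then rank.
scores = {name: float(user_u @ vec) for name, vec in tracks.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(s, 2))
```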

Page 32: ML+Hadoop at NYC Predictive Analytics

It’s still tricky to search for similar tracks though

We have many millions of tracks, and you don’t want to compute the cosine for all pairs

27

Page 33: ML+Hadoop at NYC Predictive Analytics

Approximate nearest neighbors to the rescue!

Cut the space recursively with random planes. If two points are close, they are more likely to end up on the same side of each plane.

https://github.com/spotify/annoy

28
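The core trick behind annoy can be sketched with a single split in numpy (annoy itself builds many trees of recursive splits; this toy version just shows why one random hyperplane already prunes half the candidates):

```python
import numpy as np

rng = np.random.default_rng(2)

# 1000 random points; one random hyperplane through the origin splits
# them into two buckets.  Nearby points usually share a bucket, so a
# query only needs to scan its own side; repeating the split
# recursively, over many trees, drives the error down further.
points = rng.standard_normal((1000, 40))
normal = rng.standard_normal(40)
side = points @ normal > 0                 # which side of the plane?

query = points[0] + 0.01 * rng.standard_normal(40)   # near point 0
candidates = np.flatnonzero(side == (query @ normal > 0))
print(len(candidates))                     # roughly half the points

# The query's true nearest neighbor is point 0; check whether the
# hyperplane kept it on the query's side.
print(0 in candidates)
```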

Page 34: ML+Hadoop at NYC Predictive Analytics

How do you retrain the model?

It takes a long time to train a full factorization model.
We want to update user vectors much more frequently (at least daily!)
However, item vectors are fairly stable.
Throw away user vectors and recreate them from scratch!

29
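With item vectors held fixed, rebuilding one user's vector from fresh logs is just a small per-user solve. A least-squares sketch on hypothetical data (the actual objective in the talk is probabilistic; plain least squares stands in here for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

f, n_items = 2, 4
B = rng.standard_normal((n_items, f))   # fixed item vectors b_i

# One user's recent plays from the logs: (item, count) pairs.
plays = [(0, 5.0), (2, 13.0), (3, 9.0)]
items = [i for i, _ in plays]
counts = np.array([c for _, c in plays])

# Solve min_a sum_i (count_i - a . b_i)^2 for this user alone --
# cheap enough to redo from scratch every day.
a_u, *_ = np.linalg.lstsq(B[items], counts, rcond=None)
print(a_u, B[items] @ a_u)   # new user vector and its fitted counts
```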

Page 35: ML+Hadoop at NYC Predictive Analytics

The pipeline

“Hack” to recalculate user vectors more frequently.

Is this a little complicated? Yeah probably.

30

[Diagram, over time:

Monthly full retrains: May 2013 logs → matrix factorization → item vectors + user vectors; June 2013 logs (+ more logs) → matrix factorization → item vectors + user vectors; …

Daily refreshes in between: seeding → user vectors (1) → (+ logs) → user vectors (2) → (+ more logs) → user vectors (3) → user vectors (4) → user vectors (5) → …]

Page 36: ML+Hadoop at NYC Predictive Analytics

Ideal case

Put all vectors in Cassandra/Memcached, use Storm to update in real time

31

Page 37: ML+Hadoop at NYC Predictive Analytics

But Hadoop is pretty nice at parallelizing recommendations

24 cores but not a lot of RAM? mmap is your friend

32

[Diagram: one map/reduce job. Distributed cache (DC): user vectors. Each machine runs several map tasks (M) that share one mmap’d ANN index of all item vectors, and together they emit the recs.]
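The mmap trick on the slide is easy to sketch with numpy: keep one big read-only array of vectors in a file and map it from every map task on the box, so many processes share a single copy in the page cache instead of one copy per process in RAM. (File name and shape here are hypothetical.)

```python
import numpy as np
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "item_vectors.f32")
n_items, f = 10000, 40

# Write the vectors once (e.g. shipped via the distributed cache)...
vectors = np.memmap(path, dtype=np.float32, mode="w+",
                    shape=(n_items, f))
vectors[:] = np.random.default_rng(4).standard_normal((n_items, f))
vectors.flush()

# ...then every worker maps the same file read-only: no upfront copy,
# pages are faulted in on demand and shared between processes.
shared = np.memmap(path, dtype=np.float32, mode="r",
                   shape=(n_items, f))
print(shared.shape, shared.dtype)
```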

Page 38: ML+Hadoop at NYC Predictive Analytics

Music recommendations!

Our latest baby, the Discover page. Featuring lots of different types of recommendations. Expect this to change quite a lot in the next few months!

33

Page 39: ML+Hadoop at NYC Predictive Analytics

More music recommendations!

Radio!

34

Page 40: ML+Hadoop at NYC Predictive Analytics

More music recommendations!

Related artists

35

Page 41: ML+Hadoop at NYC Predictive Analytics

Thanks!

Btw, we’re hiring Machine Learning Engineers and Data Engineers!
Email me at [email protected]!