Big data and machine learning @ Spotify


Oscar Carlsson, Data Engineer, lad@spotify.com

Big Data and Machine Learning @ Spotify

Friday 6/3 2015

● D-student starting 2009
● Graduated last year from CSALL

(Student in this class 2013)

● Master thesis at Spotify

● Data Engineer at Spotify in Gothenburg

Me

● What is data at Spotify?

● Big data and processing it

● Using data at Spotify

● Machine Learning

Outline

Supervised learning: data (X), labels (Y)

Unsupervised learning: data (X)

In the Machine Learning class:

What is data at Spotify?

● Songs / track metadata
  ○ Cover art
  ○ Album
  ○ Genres, mood etc.
● User generated
  ○ Listens
  ○ Clicks
  ○ Page views
● Users
  ○ Country, email etc.
● Playlists
  ○ Tracks of playlist
  ○ Add/Removes

30 Million songs

60 Million Monthly Active Users

58 Markets

15 Million subscribers

1.5 Billion Playlists

● What is data at Spotify?

● Big data and processing it

● Using data at Spotify

● Machine Learning

Outline

Big Data and processing it

● 20 TB compressed data / day
  ○ 200 TB generated and stored / day (replication)
● Our business is highly dependent on these logs
  ○ We pay artists depending on plays; plays = logs

Too much to store on a single computer. We need a cluster to process it! This is typically what is called “Big Data”.

Big Data and processing it

● Distributed computing and storage
  ○ Hadoop
    ■ MapReduce
  ○ Cassandra
● Hadoop cluster
  ○ 1100 nodes
  ○ ~8000 jobs/day
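Since plays are counted from logs, the MapReduce model can be illustrated with a small single-machine sketch of counting plays per track. This is only an illustration of the map, shuffle and reduce phases in plain Python, not Hadoop code; the record layout and the names `LOGS`, `map_phase`, `shuffle` and `reduce_phase` are assumptions.

```python
from collections import defaultdict

# Hypothetical log records: (user_id, track_id). Real logs carry far more fields.
LOGS = [
    ("u1", "track_a"),
    ("u2", "track_a"),
    ("u1", "track_b"),
    ("u3", "track_a"),
]

def map_phase(record):
    """Emit one (key, value) pair per play: the track and a count of 1."""
    _user_id, track_id = record
    yield track_id, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts for one track."""
    return key, sum(values)

def count_plays(logs):
    pairs = (pair for record in logs for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

play_counts = count_plays(LOGS)
```

On a cluster, the map and reduce functions run in parallel on many nodes and the framework handles the shuffle; the per-record logic stays this small.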

● What is data at Spotify?

● Big data and processing it

● Using data at Spotify

● Machine Learning

Outline

Using data at Spotify

Every part of the company is interested in our data

● Product
  ○ Are people using X? Should we focus on features such as Y?
● Insights
  ○ What music is trending? Which artists are popular where?
● Performance
  ○ How is latency in country Y? Did this reduce stutter in country X?

Using data at Spotify

● Data-driven decision making
  ○ Like... every decision.
  ○ Analysts / Data scientists
● A/B test everything!
● A/B testing:
  ○ Statistical hypothesis testing
  ○ Simple randomized experiment with >= 2 variants (A, B)

Using data at Spotify: A/B testing

Objective: Decrease time from loading playlist to first play

Hypothesis: The bigger the button, the faster users find it

Test set up:
● A - variant 1
  ○ 2% of monthly active users (MAU) in US and SE
● B - variant 2
  ○ 2% of monthly active users (MAU) in US and SE
● Control - normal
  ○ Rest of users in US and SE
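One common way to get a stable random split like this is to hash the user id into a bucket. The sketch below assumes that approach; the slides do not say how users are actually assigned, and the function name, experiment key and hashing scheme are hypothetical.

```python
import hashlib

def assign_variant(user_id, experiment="shuffle-button", test_share=0.02):
    """Deterministically map a user to 'A', 'B' or 'control'.

    Assumption: hash-based bucketing, so the same user always lands in
    the same group for a given experiment name.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    position = int(digest[:8], 16) / 16**8
    if position < test_share:
        return "A"
    if position < 2 * test_share:
        return "B"
    return "control"
```

Including the experiment name in the hash keeps the groups of different experiments independent of each other.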

“The shuffle button”

Using data at Spotify: A/B testing

[Figure: the Control, A and B variants side by side]

Analytics: A/B testing

Metric: share of users with time to first play > 500 ms

(500ms is made up)

Let's roll out A to all users and throw away B!
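A decision like this rests on a statistical test of the metric. As a minimal sketch, assuming a two-proportion z-test (the slides only say that statistical hypothesis testing is used), a variant can be compared against control as follows. All counts are made up.

```python
import math

def two_proportion_z_test(hits_variant, n_variant, hits_control, n_control):
    """Return (z, two-sided p-value) for H0: both groups share one rate."""
    p_variant = hits_variant / n_variant
    p_control = hits_control / n_control
    pooled = (hits_variant + hits_control) / (n_variant + n_control)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_variant + 1 / n_control))
    z = (p_variant - p_control) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts: users whose first play took longer than the threshold.
z, p_value = two_proportion_z_test(
    hits_variant=4_100, n_variant=10_000,
    hits_control=4_500, n_control=10_000,
)
```

A negative z here means the variant has a smaller share of slow first plays than control, which is the direction the objective asks for.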

● What is data at Spotify?

● Big data and processing it

● Using data at Spotify

● Machine Learning

Outline

● Machine Learning
  ○ User analysis
  ○ Artist disambiguation
  ○ Recommender systems

Outline

“A music session somehow represents a moment for the user. Can we find these moments and describe them?”

● Take a subset of user listening data with new genre data
  ○ Combine listens in sessions
    ■ Consecutive plays, no 15 min pause
  ○ Session = [genres]
● Clustering algorithms to find similar sessions
  ○ K-means / Hierarchical clustering

● Describe the clusters using logistic regression
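The session-building step above can be sketched as follows. This assumes that the pause is measured between the start times of consecutive plays and that a gap longer than 15 minutes starts a new session; the slides do not spell out either detail, and the sample plays are made up.

```python
from datetime import datetime, timedelta

MAX_PAUSE = timedelta(minutes=15)

def split_into_sessions(plays, max_pause=MAX_PAUSE):
    """Group one user's plays into sessions of genres.

    `plays` is a list of (timestamp, genre) tuples. A new session starts
    whenever the gap to the previous play exceeds `max_pause`.
    """
    sessions = []
    previous_time = None
    for timestamp, genre in sorted(plays):
        if previous_time is None or timestamp - previous_time > max_pause:
            sessions.append([])
        sessions[-1].append(genre)
        previous_time = timestamp
    return sessions

# Made-up plays for one user.
PLAYS = [
    (datetime(2015, 3, 6, 9, 0), "rock"),
    (datetime(2015, 3, 6, 9, 4), "rock"),
    (datetime(2015, 3, 6, 9, 30), "jazz"),
    (datetime(2015, 3, 6, 9, 33), "jazz"),
]
sessions = split_into_sessions(PLAYS)
```

Each resulting session is a list of genres, which is the representation the clustering step works on.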

Machine Learning: Cluster user music sessions

[Figure panels: K-means; per-cluster classification]

Machine Learning: Cluster user music sessions

Per cluster logistic regression

w: weight vector

Each w_i can be interpreted as the effect of the variable x_i

x_i = genres

Machine Learning: Cluster user music sessions

Clusters described by logistic regression: each cluster is named after the x_i with the largest w_i
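A minimal sketch of this idea, assuming a cluster-versus-rest logistic regression trained by plain gradient descent on sessions encoded as genre shares. The genre names, data, cluster ids and hyperparameters are all made up, and the slides do not say which training method or encoding was used.

```python
import math

GENRES = ["rock", "jazz", "pop"]  # hypothetical feature names (x_i)

def train_logistic_regression(rows, labels, steps=2000, learning_rate=0.5):
    """Fit weights w and a bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            score = bias + sum(w * xi for w, xi in zip(weights, x))
            error = 1.0 / (1.0 + math.exp(-score)) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias

def describe_cluster(rows, cluster_ids, cluster):
    """Name a cluster after the genre with the largest weight (cluster vs. rest)."""
    labels = [1.0 if c == cluster else 0.0 for c in cluster_ids]
    weights, _bias = train_logistic_regression(rows, labels)
    return GENRES[max(range(len(weights)), key=weights.__getitem__)]

# Made-up sessions as genre shares, with cluster ids as a clustering step might assign them.
SESSIONS = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2], [0.1, 0.9, 0.0], [0.0, 0.8, 0.2]]
CLUSTER_IDS = [0, 0, 1, 1]
```

Calling `describe_cluster(SESSIONS, CLUSTER_IDS, 0)` trains one classifier for cluster 0 against all other sessions and returns the genre whose weight is largest.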

Machine Learning: Cluster user music sessions


Machine Learning

Artist disambiguation

Cleaning up the artist pages

Machine Learning: Artist disambiguation

Let's listen to those tracks!

Is it really the same Fredrik?

Machine Learning: Artist disambiguation

● Rank artists by probability of being ambiguous
● Apply clustering on each “ambiguous” artist's albums/tracks
  ○ Using features such as country, release year, label/licensor etc.
  ○ Distinct clusters could be different artists

● Nicely present this for manual curation
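The clustering step above can be sketched as single-linkage clustering over an artist's releases. The distance function, its threshold and the release records below are all hypothetical; the slides name the features but not the algorithm or how features are weighted.

```python
from itertools import combinations

def release_distance(a, b, year_scale=10):
    """Hypothetical distance: mismatches in country and label plus a scaled year gap."""
    distance = float(a["country"] != b["country"]) + float(a["label"] != b["label"])
    distance += min(abs(a["year"] - b["year"]) / year_scale, 1.0)
    return distance

def cluster_releases(releases, threshold=1.0):
    """Single-linkage clustering: link any two releases closer than `threshold`."""
    parent = list(range(len(releases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(releases)), 2):
        if release_distance(releases[i], releases[j]) < threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for index, release in enumerate(releases):
        clusters.setdefault(find(index), []).append(release["title"])
    return sorted(clusters.values())

# Made-up releases credited to one artist name.
RELEASES = [
    {"title": "Album 1", "country": "SE", "label": "Label X", "year": 1998},
    {"title": "Album 2", "country": "SE", "label": "Label X", "year": 2001},
    {"title": "Album 3", "country": "US", "label": "Label Y", "year": 2013},
    {"title": "Album 4", "country": "US", "label": "Label Y", "year": 2014},
]
candidate_artists = cluster_releases(RELEASES)
```

More than one cluster for a single artist name is only a hint that several artists may share the page, which is why the result goes to manual curation.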

Machine Learning: Recommender system

The discover page

Machine Learning: Recommender system

Collaborative filtering

Machine Learning: Recommender system

Collaborative filtering
● Build a matrix of user plays
● Compute similarity between items
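On a toy scale, the two steps above can be sketched with cosine similarity between the track columns of a play matrix. The matrix and track names are made up, and cosine similarity is one possible choice of item similarity.

```python
import math

# Made-up play counts: rows are users, columns are tracks.
TRACKS = ["track_a", "track_b", "track_c"]
PLAYS = [
    [5, 4, 0],
    [3, 3, 0],
    [0, 0, 4],
    [1, 0, 5],
]

def column(matrix, j):
    return [row[j] for row in matrix]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar_track(matrix, track_index):
    """Return the index of the other track whose play column is most similar."""
    target = column(matrix, track_index)
    others = [j for j in range(len(matrix[0])) if j != track_index]
    return max(others, key=lambda j: cosine_similarity(target, column(matrix, j)))
```

Tracks played by the same users end up with similar columns, so `most_similar_track(PLAYS, 0)` picks the track whose listeners overlap most with those of `track_a`.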

Machine Learning: Recommender system

4 million tracks x 60 million users
→ Pairwise similarity infeasible
Approximate the matrix with NMF (non-negative matrix factorization)

Machine Learning: Recommender system

Matrix factorization (latent factor models)
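To make the factorization concrete, here is a small pure-Python sketch of NMF using the standard multiplicative update rules for squared error. It only illustrates the technique on a tiny matrix; the rank, step count and helper names are assumptions and say nothing about how the production models are trained.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def squared_error(v, w, h):
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))

def nmf(v, rank, steps=200, seed=0, eps=1e-9):
    """Factor v (users x tracks) into non-negative w (users x rank) and h (rank x tracks)."""
    rng = random.Random(seed)
    w = [[rng.random() for _ in range(rank)] for _ in v]
    h = [[rng.random() for _ in v[0]] for _ in range(rank)]
    for _ in range(steps):
        # Multiplicative updates keep every entry non-negative.
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[x * n / (d + eps) for x, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(h, numer, denom)]
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[x * n / (d + eps) for x, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(w, numer, denom)]
    return w, h
```

Each row of `w` is a small latent vector for one user and each column of `h` a small latent vector for one track; their dot product approximates the corresponding entry of the play matrix.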

Machine Learning: Recommender system

Small vectors: cosine similarity and dot product are efficient to compute

Machine Learning: Recommender system

Finding recommendations: approximate nearest neighbour (ANN); code: https://github.com/spotify/annoy

Related artists & Radio: similar to user recommendations, more models and not all CF-based

Multiple models: score candidates from all models, combine and rank!
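One simple way to combine several models is a weighted sum of their candidate scores. The sketch below assumes that approach; the slides do not say how scores are actually combined, and the model names, weights and scores are made up.

```python
def combine_and_rank(model_scores, model_weights):
    """Blend per-model candidate scores with a weighted sum and rank the result."""
    combined = {}
    for model, scores in model_scores.items():
        for candidate, score in scores.items():
            combined[candidate] = combined.get(candidate, 0.0) + model_weights[model] * score
    # Highest combined score first; ties are broken by candidate name.
    return sorted(combined, key=lambda c: (-combined[c], c))

# Made-up model names, weights and scores.
MODEL_SCORES = {
    "collaborative_filtering": {"track_a": 0.9, "track_b": 0.4},
    "content_based": {"track_b": 0.8, "track_c": 0.7},
}
MODEL_WEIGHTS = {"collaborative_filtering": 0.6, "content_based": 0.4}
ranking = combine_and_rank(MODEL_SCORES, MODEL_WEIGHTS)
```

A candidate that several models agree on can outrank one that a single model scores highly, as `track_b` does here.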

● More content-based ML
  ○ Fingerprinting: Echo Nest
  ○ Content-based music recommendation using convolutional neural networks
● Personalize everything
  ○ Emails
  ○ Ads
  ○ User profiling

● ML on other parts of product than Rec Sys

... final words on the future of ML at Spotify

Summary

● Multiple data sources -> multiple angles

● Data drives decisions through A/B testing

● User analysis
  ○ Cluster and describe with classifier
● Artist disambiguation
  ○ Cluster and give to manual curators
● Recommender systems
  ○ Collaborative filtering

● We supervise thesis workers
  ○ Artist disambiguation/deduplication
  ○ Cluster user music sessions
  ○ Context-based recommender systems
  ○ Personalized ads / Personalized emails

● We have internships!

www.spotify.com/jobs

.. and potentially you could help us?
