Big data and machine learning @ Spotify

Oscar Carlsson, Data Engineer, lad@spotify.com

Friday 6/3 2015

● D-student starting 2009

● Graduated last year from CSALL

(Student in this class 2013)

● Master thesis at Spotify

● Data Engineer at Spotify in Gothenburg

Outline

● What is data at Spotify?

● Big data and processing it

● Using data at Spotify

● Machine Learning

In the Machine Learning class:

Supervised learning: data (X), labels (Y)

Unsupervised learning: data (X)

What is data at Spotify?

● Songs
○ Track metadata
○ Cover art
○ Album
○ Genres, mood etc.

● User generated
○ Listens
○ Clicks
○ Page views

● Users
○ Country, email etc.

● Playlists
○ Tracks of playlist
○ Adds/removes

30 Million songs

60 Million Monthly Active Users

58 Markets

15 Million subscribers

1.5 Billion Playlists

Big Data and processing it

● 20 TB compressed data per day
○ 200 TB generated and stored per day (replication)

● Our business is highly dependent on these logs
○ We pay artists depending on plays; plays = logs

Too much to store on a single computer. We need a cluster to process it! This is typically what is called “Big Data”.

Big Data and processing it

● Distributed computing and storage
○ Hadoop
■ MapReduce
○ Cassandra

● Hadoop cluster
○ 1100 nodes
○ ~8000 jobs/day
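Since plays are counted from logs, a MapReduce job over play logs is easy to picture. Here is a minimal single-machine sketch of the map, shuffle and reduce steps in plain Python. The tab-separated log format and the function names are hypothetical; the slides do not show the real logs or jobs.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical log lines: "user_id<TAB>artist<TAB>track".
# The real Spotify log format is not given in the slides.
LOG_LINES = [
    "u1\tArtist A\tTrack 1",
    "u2\tArtist A\tTrack 2",
    "u1\tArtist B\tTrack 3",
    "u3\tArtist A\tTrack 1",
]

def map_play(line):
    """Map step: emit one (artist, 1) pair per play-log line."""
    _user, artist, _track = line.split("\t")
    yield artist, 1

def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_plays(artist, counts):
    """Reduce step: sum the per-play counts for one artist."""
    return artist, sum(counts)

def count_plays(lines):
    mapped = chain.from_iterable(map_play(line) for line in lines)
    return dict(reduce_plays(k, v) for k, v in shuffle(mapped).items())

plays_per_artist = count_plays(LOG_LINES)
print(plays_per_artist)
```

On a cluster the map and reduce functions run in parallel on many nodes and the framework does the shuffle; the logic per record stays this small.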

Using data at Spotify

Every part of the company is interested in our data

● Product
○ Are people using X? Should we focus on features such as Y?

● Insights
○ What music is trending? Which artists are popular where?

● Performance
○ How is latency in country Y? Did this reduce stutter in country X?

Using data at Spotify

● Data-driven decision making
○ Like... every decision.
○ Analysts / data scientists

● A/B test everything!

● A/B testing:
○ Statistical hypothesis testing
○ Simple randomized experiment with >= 2 variants (A, B)

Using data at Spotify: A/B testing

Objective: Decrease the time from loading a playlist to first play

Hypothesis: The bigger the button, the faster users find it

Test setup:
● A - variant 1
○ 2% of US and SE MAU (monthly active users)
● B - variant 2
○ 2% of US and SE MAU
● Control - normal
○ Rest of users in US and SE

“The shuffle button”
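One common way to get a stable 2% / 2% / rest split is to hash the user id into buckets. This is a sketch under that assumption; the slides do not say how users were actually assigned, and the experiment name and user ids below are made up.

```python
import hashlib

def assign_variant(user_id, experiment="shuffle-button"):
    """Deterministically map a user to 'A', 'B' or 'control'.

    Hashing (experiment, user_id) gives a stable, roughly uniform bucket in
    0..99: buckets 0-1 go to A (2%), 2-3 to B (2%), the rest to control.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < 2:
        return "A"
    if bucket < 4:
        return "B"
    return "control"

# Simulate 100,000 hypothetical users and count the group sizes.
counts = {"A": 0, "B": 0, "control": 0}
for i in range(100_000):
    counts[assign_variant(f"user-{i}")] += 1
print(counts)
```

Because the bucket depends only on the experiment name and the user id, a user sees the same variant on every visit, and a different experiment name reshuffles the groups.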

Using data at Spotify: A/B testing

(Figure: the Control, A and B variants)

Analytics: A/B testing

Metric: share of users whose first play takes > 500 ms

(500 ms is made up)

Let's roll out A to all users and throw away B!
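Before rolling anything out, the difference in the metric should pass a statistical test. A two-proportion z-test is one standard choice for a share-of-users metric; whether this exact test was used is not stated in the slides, and the counts below are made up.

```python
import math

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for H0: both groups have the same proportion."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Made-up counts: users whose first play took longer than 500 ms.
z, p_value = two_proportion_z_test(hits_a=400, n_a=1000, hits_b=450, n_b=1000)
significant = p_value < 0.05
print(f"z = {z:.2f}, p = {p_value:.4f}, significant: {significant}")
```

A negative z here means the first group has the smaller share of slow first plays, which is the direction the objective asks for.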

Outline

● Machine Learning
○ User analysis
○ Artist disambiguation
○ Recommender systems

“A music session somehow represents a moment for the user. Can we find these moments and describe them?”

● Take a subset of user listening data with new genre data
○ Combine listens into sessions
■ Consecutive plays with no pause of 15 minutes or longer
○ Session = [genres]

● Clustering algorithms to find similar sessions
○ K-means / hierarchical clustering

● Describe the clusters using logistic regression
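The session rule above (a pause of 15 minutes or more starts a new session) can be sketched as follows. The `(timestamp, genre)` input shape and the example plays are assumptions for illustration, and timestamps are treated as play start times for simplicity.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=15)

def split_sessions(plays, gap=SESSION_GAP):
    """Group (timestamp, genre) plays into sessions, each a list of genres.

    A new session starts when the pause since the previous play is `gap`
    or longer.
    """
    sessions = []
    previous = None
    for timestamp, genre in sorted(plays):
        if previous is None or timestamp - previous >= gap:
            sessions.append([])
        sessions[-1].append(genre)
        previous = timestamp
    return sessions

t0 = datetime(2015, 3, 6, 9, 0)
plays = [
    (t0, "rock"),
    (t0 + timedelta(minutes=4), "rock"),
    (t0 + timedelta(minutes=8), "indie"),
    (t0 + timedelta(minutes=40), "jazz"),
    (t0 + timedelta(minutes=44), "jazz"),
]
sessions = split_sessions(plays)
print(sessions)
```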

Machine Learning: Cluster user music sessions

K-Means Per cluster classification

Per cluster logistic regression

w: weight vector

Each w_i can be interpreted as the effect of the variable x_i

x_i = genres

Clusters described by logistic regression: the name of the x_i with the largest w_i
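Put together, the idea is: cluster the sessions with k-means, then for each cluster fit a one-vs-rest logistic regression and name the cluster after the genre with the largest weight. The sketch below does this in plain Python on a toy data set. The genre vocabulary, the sessions, the naive centroid initialisation (first k points) and the learning-rate settings are all assumptions, not the thesis setup.

```python
import math

GENRES = ["rock", "jazz", "classical"]  # hypothetical genre vocabulary

def to_vector(session):
    """Turn a session (list of genres) into per-genre shares."""
    return [session.count(g) / len(session) for g in GENRES]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, initial_centroids, iterations=10):
    """Plain Lloyd's algorithm; returns the cluster index of each point."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(centroids)),
                      key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def fit_logistic(points, targets, steps=500, rate=0.5):
    """Logistic regression by batch gradient descent; returns the weights w."""
    w, b = [0.0] * len(points[0]), 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(points, targets):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            error = 1 / (1 + math.exp(-z)) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        w = [wi - rate * g / len(points) for wi, g in zip(w, grad_w)]
        b -= rate * grad_b / len(points)
    return w

def describe_clusters(sessions, k=2):
    """Name each cluster after the genre x_i with the largest weight w_i."""
    points = [to_vector(s) for s in sessions]
    labels = k_means(points, points[:k])  # naive init: the first k points
    names = {}
    for cluster in range(k):
        targets = [1 if label == cluster else 0 for label in labels]
        w = fit_logistic(points, targets)
        names[cluster] = GENRES[w.index(max(w))]
    return labels, names

SESSIONS = [  # toy sessions, made up for this example
    ["rock", "rock", "rock"],
    ["jazz", "jazz", "jazz"],
    ["rock", "rock", "classical"],
    ["jazz", "jazz", "classical"],
    ["rock", "rock", "rock", "classical"],
    ["jazz", "classical", "jazz", "jazz"],
]
labels, names = describe_clusters(SESSIONS)
print(labels, names)
```

K-means only says which sessions belong together; the classifier weights are what make a cluster readable to a human.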

Machine Learning

Artist disambiguation

Cleaning up the artist pages

Machine Learning: Artist disambiguation

Let's listen to those tracks!

Is it really the same Fredrik?

Machine Learning: Artist disambiguation

● Rank artists by probability of being ambiguous

● Apply clustering to each “ambiguous” artist's albums/tracks
○ Using features such as country, release year, label/licensor etc.
○ Distinct clusters could be different artists

● Present this nicely for manual curation
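As a rough illustration of the clustering step, the sketch below links albums of one artist name that look alike on country, label and release year, and reports the resulting groups. The distance function, the threshold, the single-linkage choice and the album data are all assumptions made for this example, not the method used at Spotify.

```python
def album_distance(a, b):
    """Hand-made distance over country, label and release year (an assumption)."""
    distance = 0.0
    distance += float(a["country"] != b["country"])
    distance += float(a["label"] != b["label"])
    distance += min(abs(a["year"] - b["year"]) / 10, 1.0)
    return distance

def cluster_albums(albums, threshold=1.0):
    """Single-linkage clustering: link albums closer than `threshold`."""
    parent = list(range(len(albums)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(albums)):
        for j in range(i + 1, len(albums)):
            if album_distance(albums[i], albums[j]) < threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, album in enumerate(albums):
        clusters.setdefault(find(i), []).append(album["title"])
    return sorted(clusters.values())

# Hypothetical albums that all carry the same artist name.
ALBUMS = [
    {"title": "Album 1", "country": "SE", "year": 2005, "label": "Label X"},
    {"title": "Album 2", "country": "SE", "year": 2008, "label": "Label X"},
    {"title": "Album 3", "country": "US", "year": 1975, "label": "Label Y"},
    {"title": "Album 4", "country": "US", "year": 1978, "label": "Label Y"},
]
candidate_artists = cluster_albums(ALBUMS)
print(candidate_artists)
```

Two distinct groups here are only a hint that the page mixes two artists; the final call stays with the manual curators.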

Machine Learning: Recommender system

The discover page

Collaborative filtering

● Build a matrix of user plays
● Compute similarity between items

4 million tracks x 60 million users → pairwise similarity is infeasible

Approximate the matrix with NMF (non-negative matrix factorization)
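To see why factorization helps: a dense 4 million x 60 million matrix has 2.4 x 10^14 cells, while two factor matrices with a few dozen columns each are far smaller. Below is a minimal NMF sketch using the classic multiplicative update rules on a tiny made-up play matrix. It is plain Python for readability; the rank, iteration count and data are assumptions, and a production system would use a distributed implementation.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def squared_error(v, w, h):
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor V (users x tracks) into non-negative W (users x rank)
    and H (rank x tracks) with multiplicative updates."""
    rng = random.Random(seed)
    w = [[rng.random() for _ in range(rank)] for _ in v]
    h = [[rng.random() for _ in v[0]] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(h[0]))]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(len(w))]
    return w, h

# Made-up play counts: 4 users x 4 tracks, two taste groups.
PLAYS = [
    [5, 3, 0, 0],
    [4, 2, 0, 0],
    [0, 0, 4, 5],
    [0, 0, 3, 4],
]
w0, h0 = nmf(PLAYS, rank=2, iterations=0)  # random starting point
w, h = nmf(PLAYS, rank=2)
error_before = squared_error(PLAYS, w0, h0)
error_after = squared_error(PLAYS, w, h)
print(f"squared error: {error_before:.2f} -> {error_after:.2f}")
```

Each row of W is a small user vector and each column of H a small track vector, which is what makes the similarity computations in the next step cheap.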

Matrix factorization (latent factor models)

Small vectors: cosine similarity and dot product are efficient

Finding recommendations: approximate nearest neighbour (ANN)
Code: https://github.com/spotify/annoy
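For reference, this is the exact (brute-force) nearest-neighbour search that an ANN library such as annoy approximates. It scans every track, which is fine for four vectors and too slow for millions. The latent vectors and names below are invented for the example.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_tracks(query, track_vectors, k=2):
    """Exact top-k by cosine similarity; ANN trades exactness for speed."""
    ranked = sorted(
        track_vectors,
        key=lambda name: cosine_similarity(query, track_vectors[name]),
        reverse=True,
    )
    return ranked[:k]

TRACK_VECTORS = {  # hypothetical 3-dimensional latent factors
    "track_a": [0.9, 0.1, 0.0],
    "track_b": [0.8, 0.2, 0.1],
    "track_c": [0.0, 0.1, 0.9],
    "track_d": [0.1, 0.9, 0.2],
}
user_vector = [1.0, 0.2, 0.0]
recommendations = nearest_tracks(user_vector, TRACK_VECTORS)
print(recommendations)
```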

Related artists & Radio: similar to user recommendations, with more models, not all CF-based

Multiple models: score candidates from all models, combine and rank!
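The slides do not say how the scores are combined. A weighted sum is one simple possibility, sketched here with made-up model names, scores and weights.

```python
from collections import defaultdict

def combine_and_rank(model_scores, weights):
    """Weighted sum of per-model scores; a candidate that a model did not
    score contributes 0 for that model."""
    combined = defaultdict(float)
    for model, scores in model_scores.items():
        for candidate, score in scores.items():
            combined[candidate] += weights[model] * score
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical candidate scores from two hypothetical models.
MODEL_SCORES = {
    "collaborative_filtering": {"track_a": 0.9, "track_b": 0.4},
    "content_based": {"track_b": 0.8, "track_c": 0.7},
}
WEIGHTS = {"collaborative_filtering": 0.6, "content_based": 0.4}

ranking = combine_and_rank(MODEL_SCORES, WEIGHTS)
print(ranking)
```

Here track_b wins because both models like it, even though neither ranks it first on its own.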

I just went through this quickly; read more details about the Spotify rec sys here:

● Doing this on MapReduce
● Comparing with Netflix
● Music Rec @ MLConf 2014

Final words on the future of ML at Spotify

● More content-based ML
○ Fingerprinting: Echo Nest
○ Content-based music recommendation using convolutional neural networks

● Personalize everything
○ Emails
○ Ads
○ User profiling

● ML on other parts of the product than the rec sys

Summary

● Multiple data sources -> multiple angles

● Data drives decisions with A/B testing

● User analysis
○ Cluster and describe with classifier

● Artist disambiguation
○ Cluster and give to manual curators

● Recommender systems
○ Collaborative filtering

... and potentially you could help us?

● We supervise thesis workers
○ Artist disambiguation/deduplication
○ Cluster user music sessions
○ Context-based recommender systems
○ Personalized ads / personalized emails

● We have internships!

www.spotify.com/jobs

Oscar Carlsson, lad@spotify.com, LinkedIn

Thank you for listening!
