The Evolution of Big Data at Spotify

Josh Baer (jbx@spotify.com)

Who Am I?‣ Technical Product Owner at

Spotify, responsible for Hadoop

@l_phant

Overview

• Creating Music Charts in three parts:

• Playing Music

• Collecting Data

• Processing Data

• The Future

Building Music Charts

Play Music Collect Data Process

Playing Music

What is Spotify?

• Music Streaming Service

• Browse and Discover Millions of Songs, Artists and Albums

• By the end of 2014

• 60 Million Monthly Users

• 15 Million Paid Subscribers

What is Spotify?

• Data Infrastructure

• 1300 Hadoop Nodes

• 42 PB Storage

• 30 TB data ingested via Kafka/day

• 400 TB generated by Hadoop/day

Powered by Data

• Running App

• Matches music to running tempo

• Personalized running playlists in multiple tempos for millions of active users

http://www.theverge.com/2015/6/1/8696659/spotify-running-is-great-for-discovery

Powered by Data

• Now Page

• Shows, podcasts and playlists based on day-parts

• Personalized layout so you always have the right music for the right moment

Building Music Charts10.123.133.333 - - [Mon, 3 June 2015 11:31:33 GMT] "GET /api/admin/job/aggregator/status HTTP/1.1" 200 1847 "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"

10.123.133.222 - - [Mon, 3 June 2015 11:31:43 GMT] "GET /api/admin/job/aggregator/status HTTP/1.1" 200 1984 "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36”

10.123.133.222 - - [Mon, 3 June 2015 11:33:02 GMT] "GET /dashboard/courses/1291726 HTTP/1.1" 304 - "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"

10.321.145.111 - - [Mon, 3 June 2015 11:33:03 GMT] "GET /api/loggedInUser HTTP/1.1" 304 - "https://my.analytics.app/dashboard/courses/1291726" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"

10.112.322.111 - - [Mon, 3 June 2015 11:33:03 GMT] "POST /api/instrumentation/events/new HTTP/1.1" 200 2 "https://my.analytics.app/dashboard/courses/1291726" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36”

10.123.133.222 - - [Mon, 3 June 2015 11:33:02 GMT] "GET /dashboard/courses/1291726 HTTP/1.1" 304 - "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"

• Raw data is complicated

• Often dirty

• Evolving structure

• Duplication all over

• Getting data to a central processing point is HARD

Collecting Data

“It’s simple, we just throw the data into

Hadoop”A naive data engineer

LogArchiver

• Original method to transport logs from APs to HDFS

• Lasted from 2009 - 2013

• Relies on rsynch/scp to move files around

• Regularly scheduled via cron

LogArchiver Fails

• Worked well with small number of APs

• Issues with scale

• Manual Processes of adding new hosts

• Frequent dying of hosts or network issues caused massive congestion

• Manual process of overrides

Apache Kafka to the rescue!

• A publish-subscribe messaging system open sourced by LinkedIn in 2011

• High level overview:

• Topic: Feeds of messages

• Producer: A message publisher

• Consumer : A subscriber of topics

Apache Kafka to the rescue!

• Log -> HDFS latency reduced from hours to seconds!

• Benefits:

• Community supported

• Division of responsibilities

• Allowed for enhanced streaming use-cases

Processing Data

Workflow Management Fail!

0 * * * * spotify-‐core hadoop jar merge_hourly.jar 15 * * * * spotify-‐core hadoop jar aggregate_song_plays.jar 30 * * * * spotify-‐analytics hadoop jar merge_artist_song.jar * 1 * * * spotify-‐core hadoop jar daily_aggregate.jar * 2 * * * spotify-‐core hadoop jar calculate_royalties.jar */2 22 * * * spotify-‐radio hadoop jar generate_radio.jar

Handles the ‘plumbing’ for Hadoop jobs

https://github.com/spotify/luigi

Luigi - Python Workflow Manager

Easy to get started, no xml like Oozie

Hadoop Availability

• In 2013:

• Hadoop expanded to 200 nodes

• It was business critical

• It was not very reliable :-(

• Created a ‘squad’ with two missions:

• Migrate to a new distribution with Yarn

• Make Hadoop reliable

How did we do?

Q3-2012 Q4-2012 Q1-2013 Q2-2013 Q3-2013 Q4-2013 Q1-2014 Q2-2014 Q3-2014 Q4-2014 Q1-2015 Q2-2015

Hadoop ownerless Dedicated squad launches

Upgrade instability

Continually improving

Going from Python to Crunch

• Most of our jobs were Hadoop (python) streaming

• Lots of failures, slow performance

• Had to find a better way

Moving from Python to Crunch

• Investigated several frameworks*

• Selected Crunch:

• Real types - compile time error detection, better testability

• Higher level API - let the framework optimize for you

• Better performance #JVM_FTW

*thewit.ch/scalding_crunchy_pig

Play Music Collect Data Process

Data driven features that allows for new ways to

play and discover music

Lower latency and enhanced reliability for

passing data from Access Points to HDFS via Kafka

Increased Hadoop reliability, Luigi scheduling

and better performance with Crunch

Improving Charts!

The Future

2012 2013 2014 2015

Hadoop Usage Spotify Users

Growth of Hadoop vs. Spotify Users

Explosive Growth

• Increased Spotify Users

• More users listening to more music -> more data -> longer running jobs

• Increased Use Cases

• Beyond simple analytics into Machine Learning, advanced processing

• Increased Engineers

• In 2014, growth of data and machine learning engineers grew rapidly

Scaling Machines: Easy Scaling People: Hard

User Feedback: Automate it!

Inviso

Developed by Netflix: https://github.com/Netflix/inviso

Hadoop Report Card

• Contains Statistics • Guidelines and Best

Practices • Sent Quarterly

Real Time Use Cases

• Expanding our use of Storm for:

• Targeting Ads based on genres

• Visualizing Data

• Quicker recommendations

• More information:

• https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/

Two takeaways

• Getting data into Hadoop is half the challenge. Think early and often about scale.

• Increasing infrastructure reliability and performance leads to expanded use. This adds challenges but it’s a good problem to have.

Join The Band!

Engineers needed in NYC, Stockholm http://spotify.com/jobs

The Evolution of Big Data at Spotify

Technology

Big Bang - Evolution

Spotify and the Sustainability of Evolution

BIG DATA ANALYTICS + ARTIFICIAL INTELLIGENCE EVOLUTION …

Big Data At Spotify

Topic 5: Ecology and Evolution 5.4: Evolution. Evolution slider The Big Bang The Simpsons

Spotify Running: Lessons learned from building a ‘Lean Startup’ inside a big tech company

Big Data: Evolution? Game Changer? Definitely

Эволюция службы эксплуатации «Spotify» / Лев Попов (Spotify)

Big Data Evolution - SAS · Big data evolution: ... interviews with leading corporate big data thought ... companies with well-defined data-management

How things don’t quite · 2016. 5. 26. · Yip, Big Apple Scrum 2016. Who here has watched the Spotify Engineering Culture videos? Shout out what you remember from it . Spotify

Big Data Evolution

Autonomy and leadership at Spotify - SlatteryIT · Autonomy and leadership at Spotify Anders Ivarsson ... Evolution and experimentation Chapter ... engineering Support the squad on

Big Idea 1: Evolution

Atmosphere Conference 2015: Service Operations Evolution at Spotify

Big data and machine learning @ Spotify

Spotify lanceert Spotify Kids - Spotify-nl (bericht)€¦ · Spotify lanceert Spotify Kids Nieuwe app voor de volgende generatie luisteraars Gezinnen met een Premium Family-account

Spotify Behind the Scenesgkreitz/spotify/kreitz-spotify_kth11.pdf · Background Streaming P2PProtocol Spotify—BehindtheScenes GunnarKreitz Spotify gkreitz@spotify.com KTH,September292011

The Evolution of Big Data Analytics

Architecture The Evolution of Spotify Home - QCon.ai · The Evolution of Spotify Home Architecture. Emily Staff Engineer Anil Data Engineer @emilymsa @anilmuppallar. Our mission is

Big Data at Spotify - LINClinc.ucy.ac.cy/.../slides/company_presentation_spotify.pdfAnders Arpteg, 2015 Stockholm, Spotify Collaborative filtering Approximate 60M users x 4M songs