September 26, 2013 Wouter de Bie Team Lead Data Infrastructure Big Data Infrastructure at Spotify Thursday, September 26, 13

Big Data Infrastructure at Spotify - Meetupfiles.meetup.com/3097452/Netherlands HUG Big Data... · 2013-09-27 · Big Data Infrastructure at Spotify Thursday, September 26, 13


September 26, 2013

Wouter de Bie, Team Lead Data Infrastructure

Big Data Infrastructure at Spotify


Who am I?

According to ZDNet: "The work they have done to improve the Apache Hive data warehouse system also aligns well with our needs, as we use Hive extensively for ad-hoc queries and for the analysis of large datasets," de Brie said in a statement.


Who am I?


Why data?

We play music, right?


Why data?

We were the first company to do free music streaming. But now everybody can do it, so we need to learn quickly to iterate faster!


Agenda

Let’s talk about Data Infrastructure: how we did it, what we learned and how we’ve failed

• Some Context
• Use Cases
• Our Infrastructure
• Lessons learned


Some Context

• Spotify started in 2006 (in Sweden)
• Now 1200+ employees, 450+ engineers
• 26 million monthly active users
• 20+ million tracks available
• 4 data centers across the globe
• 17 data engineers building a platform for easy access to data


Use Cases

We’re a data-driven company, so data is used almost everywhere

• Reporting
• Business Analytics
• Operational Analytics
• Product features


Reporting

• Reporting to labels, licensors, partners and advertisers
• We support our partners


Business Analytics

• Analyzing growth, user behavior, sign-up funnels, etc.
• Company KPIs
• A/B testing
• NPS analysis
• Segmentation analysis
• Ad performance


Listening behavior in Sweden

[Chart: one line per age group in Sweden (SE 0-17, SE 18-22, SE 23-27, SE 28-34, SE 35-44, SE 45-59, SE 60-150). X-axis: day and hour in two-hour steps, Mon 00 to Sun 22. Y-axis: percentage, 0.00% to 1.60%.]


Listening behavior in Spain

[Chart: one line per age group in Spain (ES 0-17, ES 18-22, ES 23-27, ES 28-34, ES 35-44, ES 45-59, ES 60-150). X-axis: day and hour in two-hour steps, Mon 00 to Sun 22. Y-axis: percentage, 0.00% to 1.40%.]


Impact of hurricane Sandy

29 October 2012


Impact of hurricane Sandy

30 October 2012


Operational metrics

• Root cause analysis
• Latency analysis
• Better capacity planning (servers, people, bandwidth)


Product features

• Radio
• Top lists
• Recommendations (better than external parties, because of the amount of data)


Everybody should be able to use data!


So why Data Infrastructure?

[Diagram labels: View, Backend, Data pipe, Analysis]


Our data infrastructure


Some geeky numbers

• 1 TB of compressed data from users per day
• 400 GB of data from services per day
• 61 TB of data generated in Hadoop each day
• 328 node Hadoop cluster
• 6500 jobs/day (192,000/month)
• Soon 690 nodes (11040 cores)
• 10 PB of storage capacity (soon 28 PB)
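
A few derived figures follow from the numbers on this slide by plain arithmetic. The short sketch below works them out, assuming decimal units (1 TB = 1000 GB) and a 30-day month; the variable names are illustrative and not taken from the talk.

```python
# Figures copied from the slide; everything derived below is plain arithmetic.
GB_PER_TB = 1000  # assumption: decimal units

user_data_gb = 1 * GB_PER_TB      # compressed user data per day
service_data_gb = 400             # service data per day
generated_gb = 61 * GB_PER_TB     # data generated in Hadoop per day

jobs_per_day = 6500
planned_nodes = 690
planned_cores = 11040

ingested_gb = user_data_gb + service_data_gb
amplification = generated_gb / ingested_gb      # generated vs. ingested volume
cores_per_node = planned_cores / planned_nodes
jobs_per_30_days = jobs_per_day * 30            # assumption: 30-day month

print(f"ingested per day: {ingested_gb / GB_PER_TB:.1f} TB")
print(f"generated / ingested: {amplification:.1f}x")
print(f"cores per node: {cores_per_node:.0f}")
print(f"jobs in 30 days: {jobs_per_30_days}")
```

Under these assumptions the planned cluster works out to 16 cores per node, and a 30-day month of 6500 jobs per day gives 195,000 jobs, close to the slide's rounded monthly figure.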


Spotify’s data infrastructure

[Architecture diagram. Components shown: Backend services, HDFS, Map/Reduce, Map/Reduce Jobs, Hive, Luigi Scheduler, Operational Databases, Analytical databases, Reporting, Product features, Dashboards]


The three pillars of our Data Infrastructure

• Kafka: Collection
• Hadoop: Processing
• Databases: Analytics/Visualization


Data collection


Data collection

Kafka: High volume pub-sub system

• Started with a store-and-forward system
• Evaluated Apache Flume
• Currently from Backend-to-HDFS, but in the future Backend-to-Backend
• It was almost a good fit, but...


Guaranteed delivery

Kafka doesn’t provide message acknowledgements..

• ... at least, not in 0.7 (stable)
• 0.8 has ACKs, but no end-to-end acknowledgements
• A track streamed =~ monetary transaction


Hacking Kafka

[Diagram labels: Backend service, Kafka broker, Kafka client, HDFS, acknowledgements, S3, Tape]
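
The slides do not show how the acknowledgements were implemented. As a rough illustration of the general idea of end-to-end acknowledgement only, here is a minimal in-memory sketch: the sender keeps every message until the reader confirms it has been stored, and resends anything unconfirmed. The `Producer` and `Consumer` classes, the list used as a channel and the dict standing in for HDFS are all hypothetical; none of this is Kafka's API or Spotify's code.

```python
import uuid


class Producer:
    """Backend-side sender that keeps every message until it is acknowledged."""

    def __init__(self):
        self.unacked = {}  # message id -> payload

    def send(self, payload, channel):
        msg_id = uuid.uuid4().hex
        self.unacked[msg_id] = payload
        channel.append((msg_id, payload))
        return msg_id

    def handle_ack(self, msg_id):
        # Only now is it safe to forget the message.
        self.unacked.pop(msg_id, None)

    def resend_unacked(self, channel):
        for msg_id, payload in self.unacked.items():
            channel.append((msg_id, payload))


class Consumer:
    """Reader that acknowledges a message only after it has been stored."""

    def __init__(self, storage):
        self.storage = storage  # dict standing in for HDFS

    def drain(self, channel, producer, drop=lambda msg_id: False):
        while channel:
            msg_id, payload = channel.pop(0)
            if drop(msg_id):
                continue  # simulate a message lost in transit
            self.storage[msg_id] = payload  # keyed by id, so resends are idempotent
            producer.handle_ack(msg_id)


channel, storage = [], {}
producer, consumer = Producer(), Consumer(storage)
first = producer.send("track-played:1", channel)
second = producer.send("track-played:2", channel)

# First pass: the second message is lost before it reaches storage.
consumer.drain(channel, producer, drop=lambda msg_id: msg_id == second)
pending_after_loss = dict(producer.unacked)

# The producer still holds the lost message, so it can be sent again.
producer.resend_unacked(channel)
consumer.drain(channel, producer)
```

Keying storage by message id means a resend of an already stored message overwrites rather than duplicates, which matters when each streamed track is treated like a monetary transaction.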


Dumping databases

Not only do we have log files, but also production databases

• Using Sqoop for dumping PostgreSQL
• Map/Reduce job (table scans) for dumping Cassandra
• hdfs2cass for uploading SSTables into Cassandra
• For large DBs we parse application logs and only dump deltas


Failures (or the hard stuff)

• We started with a store-and-forward system that didn’t scale
• Kafka has multiple components (client, broker, ZooKeeper)
• Internet weather
• As with many large Java systems: Garbage collection


Hadoop: our trusted elephant


Scheduling

We wrote and open-sourced our own scheduler: Luigi

• Nothing suitable out there.. (unless you really, really like the XML hell of Oozie)
• Written in Python
• Generic scheduler and dependency system that supports Python and Java M/R, Pig, Hive and Sqoop
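
To make "scheduler and dependency system" concrete, here is a toy, standard-library-only sketch of the underlying idea: each task declares what it requires and what it outputs, and a task only runs when its output does not exist yet. It mimics the `requires()` / `output()` / `run()` shape that Luigi tasks have, but the `Task` base class, the `build` function and the two example tasks are hypothetical stand-ins, not Luigi's implementation.

```python
import os
import tempfile


class Task:
    """Hypothetical base class: a task is done when its output file exists."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        return os.path.join(self.workdir, type(self).__name__ + ".txt")

    def complete(self):
        return os.path.exists(self.output())

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run dependencies depth-first, skipping tasks whose output exists."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


class CleanLogs(Task):
    def run(self):
        # Made-up sample data: user<TAB>track per line.
        with open(self.output(), "w") as f:
            f.write("user1\ttrackA\nuser2\ttrackA\nuser1\ttrackB\n")


class TopTracks(Task):
    def requires(self):
        return [CleanLogs(self.workdir)]

    def run(self):
        counts = {}
        with open(self.requires()[0].output()) as f:
            for line in f:
                _, track = line.rstrip("\n").split("\t")
                counts[track] = counts.get(track, 0) + 1
        with open(self.output(), "w") as f:
            for track, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
                f.write(f"{track}\t{n}\n")


workdir = tempfile.mkdtemp()
first_run, second_run = [], []
build(TopTracks(workdir), first_run)   # runs CleanLogs, then TopTracks
build(TopTracks(workdir), second_run)  # nothing to do: outputs already exist
```

Because completeness is defined by the presence of the output, re-running a pipeline after a failure only redoes the missing steps.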


Map/Reduce languages

Python with Hadoop Streaming
• Pros: fast development, many Spotify libraries available
• Cons: slower than Java, no access to Hadoop API

Java
• Pros: fast, access to Hadoop API
• Cons: verbose language, not many Spotify libraries available

Pig
• Pros: very small scripts, faster than streaming
• Cons: yet another language to learn, not many Spotify libs available

Hive
• Pros: SQL-like syntax (easy for non-programmers) and relational data model
• Cons: more moving parts (not well suited for a whole pipeline)
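
For readers who have not seen the Hadoop Streaming style, the sketch below shows the shape of a Python mapper and reducer that count plays per track. It is an illustration under assumptions: the `user<TAB>track` log format and the sample lines are made up, and the map, sort and reduce phases are simulated locally. In a real streaming job the two functions would typically live in separate scripts that read lines from standard input and print to standard output.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'track<TAB>1' for each 'user<TAB>track' log line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:  # skip malformed lines
            yield f"{fields[1]}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.split("\t") for line in sorted_lines)
    for track, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{track}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> shuffle/sort -> reduce on made-up input.
log = ["u1\ttrackA", "u2\ttrackB", "u3\ttrackA", "malformed line"]
result = list(reduce_lines(sorted(map_lines(log))))
print(result)
```

The text-in, text-out contract is what makes development fast, and it is also why this approach has no access to the Hadoop API.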


Scaling Hadoop at Spotify

Our Journey

• Started with a small (scrap metal) cluster of 37 servers
• Moved to Amazon Elastic Map/Reduce (EMR) and S3 to quickly scale
• Built an in-house cluster of 60 nodes because of EMR costs
• Capacity planning every 6 months, grown to 327 nodes today
• 363 more waiting to be provisioned
• Put in place data-retention policy and data archive


Hadoop failures

• “We just need good developers” - No, we need Hadoop experts
• We underestimated the complexity of Hadoop
• You can throw money at the problem of scaling, but at our scale, it pays off to optimize
• Give people easy tools early on


Lessons learned

4+ years of Hadoop taught us

• Hadoop has brought us very far. We would never be able to handle the current volume with a “cheap” RDBMS

• “Commodity hardware” doesn’t mean cheap hardware
• Hadoop isn’t a silver bullet
• Hadoop is a complex system that needs love and care
• You will have to extend Hadoop (and eco-system components) to tailor it to your needs


Databases and visualization


Databases: used for aggregates

• Aggregates from Hadoop are put into PostgreSQL or Cassandra
• PostgreSQL powering dashboards and empowering analysts
• Cassandra for columnar sets (Spotify Analytics for labels)
• Databases are used for low-latency access by systems or analysts
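
As a rough picture of "aggregates from Hadoop are put into PostgreSQL", the sketch below loads a few tab-separated aggregate rows into a SQL table and runs the kind of query a dashboard might issue. It uses Python's built-in `sqlite3` purely as a stand-in for PostgreSQL so the example is self-contained; the table name, columns and sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical job output: day, country, streams (tab-separated, made-up values).
aggregate_lines = [
    "2013-09-25\tSE\t1200",
    "2013-09-25\tES\t900",
    "2013-09-26\tSE\t1300",
]

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
conn.execute(
    "CREATE TABLE daily_streams ("
    " day TEXT, country TEXT, streams INTEGER,"
    " PRIMARY KEY (day, country))"
)

rows = [line.split("\t") for line in aggregate_lines]
# INSERT OR REPLACE (SQLite syntax) makes reloading the same day idempotent.
conn.executemany(
    "INSERT OR REPLACE INTO daily_streams VALUES (?, ?, ?)",
    [(day, country, int(streams)) for day, country, streams in rows],
)
conn.commit()

# A dashboard-style query over the small pre-aggregated table.
per_country = conn.execute(
    "SELECT country, SUM(streams) FROM daily_streams"
    " GROUP BY country ORDER BY country"
).fetchall()
print(per_country)
```

The design point is that the heavy lifting happens in Hadoop, and the database only holds small aggregates that can be queried with low latency.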


Spotify Data Warehouse


Spotify Data Warehouse


Spotify Analytics Dashboards


Spotify Analytics


Spotify Analytics: Daft Punk


Spotify Analytics: Whitney Houston


Open Source at Spotify

• Luigi (http://github.com/spotify/luigi)
• Snakebite (http://github.com/spotify/snakebite)
• hdfs2cass (http://github.com/spotify/hdfs2cass)


Questions?


Lessons learned

• Data is crucial for our business
• Hadoop and other cutting-edge technology work pretty well for us
• Hiring technical analysts was a good idea!
• There is no “one-size-fits-all” data product
• Spotify has the “build it ourselves” mindset. Sometimes it’s better to buy than build


Want to join the band?

Check out http://www.spotify.com/jobs or @Spotifyjobs for more information.

Or mail: [email protected]
Twitter: @xinit
