Netflix: Embracing the Cloud Neil Hunt, CPO / Yury Izrailevsky, VP Engineering

2012 re:Invent Netflix: embracing the cloud final

Page 1: 2012 re:Invent Netflix: embracing the cloud final

Netflix: Embracing the Cloud

Neil Hunt, CPO / Yury Izrailevsky, VP Engineering

Page 2: 2012 re:Invent Netflix: embracing the cloud final

Embracing the Cloud: Confronting the Challenge

Neil Hunt

Page 3: 2012 re:Invent Netflix: embracing the cloud final

Motivation

Netflix – Service Unavailable – Database Crashed

Rest assured that the right people are losing sleep to fix this problem!

We expect to resume service in approximately 72h

12 Aug 2008 03:12am

Page 4: 2012 re:Invent Netflix: embracing the cloud final

A Business in Transition

OLD – DVD delivery

• Value from DVDs at home
• Website load small and predictable
• Traditional DC technology:
  • Linux, Apache, Oracle, Java

NEW – Streaming

• Value via Internet delivery
• Website and APIs high load and rapidly growing
• Need more robustness
• Cloud as opportunity for fresh start

Page 5: 2012 re:Invent Netflix: embracing the cloud final

Mission: Cloud – High Level Goals

Availability: 4 x nines

Scale: Unconstrained horizontal scaling

Performance: Unlimited compute

Page 6: 2012 re:Invent Netflix: embracing the cloud final

Forklift, or Rewrite?

OLD: Monolithic App, Oracle

NEW: Service Assembly, NoSQL

Page 7: 2012 re:Invent Netflix: embracing the cloud final

Old Style – A large 18 wheeler

• Big
• Reliable
• Efficient (when full)

• Expensive
• Inflexible capacity
• Many single points of failure

Page 8: 2012 re:Invent Netflix: embracing the cloud final

New Style – A fleet of leased pickups with drivers

• Scalable to small or large loads
• Reliability through redundancy
• Requires rethinking the whole problem

Page 9: 2012 re:Invent Netflix: embracing the cloud final

SQL or NoSQL?

MySQL/RDB:

• Developer familiarity

• Developers imagine transactional consistency requirements in every scenario

NoSQL

• Availability & Scale

• Avoid overhead and risk of managing SQL

• Experimented with both
• Ended up with NoSQL for almost everything important

Page 10: 2012 re:Invent Netflix: embracing the cloud final

Service Oriented Architecture

• Optimizes for small independent teams with well-defined interfaces

• Better independence from subsystem failures

• Scaling applied to each tier separately

Page 11: 2012 re:Invent Netflix: embracing the cloud final

How to Manage the Migration?

Rebuilding a complex system while in operation

NoSQL

Monolithic App

Oracle

Page 12: 2012 re:Invent Netflix: embracing the cloud final

Transitional Infrastructure: “Roman Riding”

Page 13: 2012 re:Invent Netflix: embracing the cloud final

Transitional Infrastructure: Create a read-only copy

NoSQL

Source of Truth

Display only

Example: Membership records

Monolithic App

Oracle
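A minimal sketch of the read-only-copy pattern, assuming the slide means that Oracle stays the source of truth while the NoSQL store holds a display-only copy (the slide's labels do not survive extraction clearly enough to confirm which side is which). The class names, the method names and the synchronous forwarding are all hypothetical, chosen only to illustrate the idea.

```python
class ReadOnlyCopy:
    """Cloud-side copy: serves reads, accepts changes only from the replication feed."""

    def __init__(self):
        self._rows = {}

    def apply_replicated_change(self, key, value):
        # Called by the replication feed only, never by application code.
        self._rows[key] = value

    def read(self, key):
        return self._rows.get(key)


class SourceOfTruth:
    """Datacenter-side master: every write lands here first."""

    def __init__(self, copy):
        self._rows = {}
        self._copy = copy

    def write(self, key, value):
        self._rows[key] = value
        # Forwarded synchronously to keep the sketch short; a real feed
        # would more likely be asynchronous and batched.
        self._copy.apply_replicated_change(key, value)


cloud_copy = ReadOnlyCopy()
master = SourceOfTruth(cloud_copy)
master.write("member:1001", {"status": "active"})
print(cloud_copy.read("member:1001"))  # {'status': 'active'}
```

Because application code only ever calls `read` on the copy, the cloud side can be rebuilt from the master at any time without losing data.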

Page 14: 2012 re:Invent Netflix: embracing the cloud final

Transitional Infrastructure: Move the master copy

NoSQL

Source of Truth

Display only

Example: AB Test Data (account tags controlling test experience)

Monolithic App

Oracle

Page 15: 2012 re:Invent Netflix: embracing the cloud final

Transitional Infrastructure: Full Multi-Master duplicate

NoSQL

Multi-master

Example: Queue

Monolithic App

Oracle
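The slide does not say how conflicting writes were reconciled in the multi-master phase. The sketch below assumes a simple last-write-wins rule keyed on a timestamp, purely as an illustration; `Store`, `Versioned` and the key format are hypothetical names, not part of the talk.

```python
from dataclasses import dataclass


@dataclass
class Versioned:
    value: object
    timestamp: float


class Store:
    """One master in a two-master setup (for example Oracle or the NoSQL store)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def put(self, key, value, timestamp):
        current = self.rows.get(key)
        # Last write wins. Ties keep the existing value; a real system
        # would need a deterministic tie-break to guarantee convergence.
        if current is None or timestamp > current.timestamp:
            self.rows[key] = Versioned(value, timestamp)


oracle, nosql = Store("oracle"), Store("nosql")

# Two writes to the same queue arrive at different masters...
oracle.put("queue:42", ["Movie A"], timestamp=1.0)
nosql.put("queue:42", ["Movie A", "Movie B"], timestamp=2.0)

# ...and are then replicated to the other side, in opposite orders.
nosql.put("queue:42", ["Movie A"], timestamp=1.0)
oracle.put("queue:42", ["Movie A", "Movie B"], timestamp=2.0)

print(oracle.rows["queue:42"].value == nosql.rows["queue:42"].value)  # True
```

Both stores end up with the later write regardless of delivery order, which is the property a multi-master duplicate needs while traffic is split between the old and new systems.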

Page 16: 2012 re:Invent Netflix: embracing the cloud final

Organizational Challenges

IT Ops: A gradually diminishing role
• Initial extensive role managing legacy DC
• Raised visibility during transition
• New DC vulnerabilities and dependencies to manage

DevOps: A rapidly expanding role
• Components at a higher level abstraction
• More opportunities for automation
• Automated build-push tools
• Autoscaling
• Monitoring and automatic cutouts and failover

Page 17: 2012 re:Invent Netflix: embracing the cloud final

The Journey

Phase | Components | Data & Prerequisites
Trial (2009) | Streaming Player | Content keys (RO); Membership status (RO)
Development (2010-11) | Member product pages and APIs | Content catalog (RW); Personalization data (RW) & recs algorithms; AB Test data (RW)
Follow-through (2011-12) | Account and membership | Membership data (RW)
Final (2013) | Payments | PCI and SOX data

Page 18: 2012 re:Invent Netflix: embracing the cloud final

Lessons Learned…

• Embrace the whole concept: Take the opportunity to build a modern architecture rather than forklifting SQL and monolithic apps

• Plan to discard your first experiments: You’ll learn so much that you’ll be glad to redo it right

• Invest in transitional infrastructure: Migration will take a while, and it’s worth the effort to make it easy

• Expect your team to learn new ways … but some won’t make the transition

Page 19: 2012 re:Invent Netflix: embracing the cloud final

Embracing the Cloud: Delivering the Cloud Solution

Yury Izrailevsky

Page 20: 2012 re:Invent Netflix: embracing the cloud final

Mission: Cloud – High Level Goals

Availability: 4 x nines

Scale: Unconstrained horizontal scaling

Performance: Unlimited compute

Page 21: 2012 re:Invent Netflix: embracing the cloud final

Performance, Scalability, Availability

Page 22: 2012 re:Invent Netflix: embracing the cloud final

Performance, Scalability, Availability

Page 23: 2012 re:Invent Netflix: embracing the cloud final

Scaling Netflix Streaming Service: Weekly Streaming Starts

[Chart: weekly streaming starts over time; x-axis runs from 1/4/2009 to 8/8/2012]

Page 24: 2012 re:Invent Netflix: embracing the cloud final

Netflix Cross-Regional Cloud Architecture

Page 25: 2012 re:Invent Netflix: embracing the cloud final

Goal: Regional Failover

Page 26: 2012 re:Invent Netflix: embracing the cloud final

Building Global Netflix Streaming Product

Page 27: 2012 re:Invent Netflix: embracing the cloud final

Performance, Scalability, Availability

Page 28: 2012 re:Invent Netflix: embracing the cloud final

Weekly Cloud Cost Per Streaming Start (last 12 months)

Page 29: 2012 re:Invent Netflix: embracing the cloud final

Simian Army: Cloud Efficiency Automation

Janitor Monkey

Regularly scrape unused capacity

Clean up instances, ASGs, ELBs, SGs, etc.

Efficiency Monkey

AI-based resource under-usage detection (CPU, memory, etc.)

Automated Deletion of Old Data

TTL for S3 (using ObjectExpiration)
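A minimal sketch of the kind of rule a janitor-style cleanup could apply: flag any resource that has gone unused longer than a threshold. This is not Janitor Monkey's actual rule set; the resource records, the `last_used` field and the 14-day threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def find_stale(resources, now, max_idle_days=14):
    """Return the ids of resources whose last use is older than the cutoff."""
    cutoff = now - timedelta(days=max_idle_days)
    return [r["id"] for r in resources if r["last_used"] < cutoff]


now = datetime(2012, 11, 28)
resources = [
    {"id": "asg-recs-v12", "last_used": datetime(2012, 11, 27)},
    {"id": "asg-recs-v07", "last_used": datetime(2012, 9, 1)},
    {"id": "elb-test-old", "last_used": datetime(2012, 10, 30)},
]
print(find_stale(resources, now))  # ['asg-recs-v07', 'elb-test-old']
```

In practice such a list would more likely feed a notification or a grace period than an immediate delete, so owners can object before capacity is reclaimed.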

Page 30: 2012 re:Invent Netflix: embracing the cloud final

Cyclical Streaming Usage Pattern

Page 31: 2012 re:Invent Netflix: embracing the cloud final

Load-Based Auto Scaling

Move to Load-Based Scaling

Scale up/down by 70%+

50%+ Cost Saving
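One way to picture load-based scaling is a rule that sizes the fleet from the current load rather than from a fixed schedule. The sketch below is an illustration under assumed numbers, not Netflix's policy or the AWS Auto Scaling API; the function name, the requests-per-second metric and every figure are hypothetical.

```python
import math


def desired_instances(load_rps, target_rps_per_instance, min_size, max_size):
    """Size the group so each instance handles roughly its target load."""
    needed = math.ceil(load_rps / target_rps_per_instance)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, needed))


for label, load in [("peak", 10_000), ("trough", 3_000)]:
    print(label, desired_instances(load, 100, min_size=10, max_size=200))
# peak 100
# trough 30
```

With a cyclical daily pattern, a rule like this lets the fleet shrink during the trough instead of staying provisioned for peak, which is where the cost saving comes from.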

Page 32: 2012 re:Invent Netflix: embracing the cloud final

Performance, Scalability, Availability

Page 33: 2012 re:Invent Netflix: embracing the cloud final

A Truly Great Service…

Availability Goal: 99.99% (30 secs/week at peak traffic)

Has To Just Work!

Page 34: 2012 re:Invent Netflix: embracing the cloud final

Historical Streaming Availability (13wkMA)

[Chart: x-axis runs from 7/17/2011 to 11/4/2012; annotated with "June 29th, 2012 AWS / Netflix Outage" and "Other AWS Outages"]

Using Redundancy in AWS Infrastructure to Survive Failures

Page 35: 2012 re:Invent Netflix: embracing the cloud final

Cascading Failures

API

InstantQueue

SimpleDB

Page 36: 2012 re:Invent Netflix: embracing the cloud final

Netflix Cloud Architecture

Page 37: 2012 re:Invent Netflix: embracing the cloud final

Cascading Failures

99% Availability × 99% Availability × … × 99% Availability

(99%)^300 = 4.90%
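The arithmetic on this slide can be checked directly. The sketch below reads 300 as the number of services a request depends on (the slide does not label the exponent) and assumes the services fail independently; the function names are illustrative.

```python
def compound_availability(per_service, n_services):
    """Availability of a request that needs all n independent services to succeed."""
    return per_service ** n_services


def required_per_service(target, n_services):
    """Per-service availability needed to reach an overall target."""
    return target ** (1 / n_services)


print(f"{compound_availability(0.99, 300):.2%}")  # 4.90%
print(required_per_service(0.9999, 300))
```

The second function shows the same relationship in reverse: hitting a four-nines target by making every dependency individually reliable would require each one to be far better than 99%, which motivates the strategies on the following slides.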

Page 38: 2012 re:Invent Netflix: embracing the cloud final

Strategies to Improve Availability

Graceful Degradation Redundancy

Page 39: 2012 re:Invent Netflix: embracing the cloud final

Graceful Degradation
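The slide gives only the title, so the following is one common reading of graceful degradation rather than the mechanism used in the talk: when a dependency fails, serve a simpler default instead of an error. All function names and the fallback content are hypothetical.

```python
from typing import Callable, List


def with_fallback(primary: Callable[[], List[str]],
                  fallback: Callable[[], List[str]]) -> List[str]:
    """Try the primary call; on any failure, return the degraded response."""
    try:
        return primary()
    except Exception:
        return fallback()


def personalized_rows() -> List[str]:
    # Stand-in for a call to a dependency that is currently down.
    raise TimeoutError("personalization service unavailable")


def popular_rows() -> List[str]:
    # Unpersonalized default that needs no downstream call.
    return ["Popular titles"]


print(with_fallback(personalized_rows, popular_rows))  # ['Popular titles']
```

The caller still gets a usable page, so a failure in one subsystem does not cascade into an outage of the whole request.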

Page 40: 2012 re:Invent Netflix: embracing the cloud final

Redundancy

Redundancy Across Availability Zones: Zone A, Zone B, Zone C

Storage Redundancy Across Regions, Vendors: Cassandra (A, B, C), S3 Backup, Secure Cloud Backup

Page 41: 2012 re:Invent Netflix: embracing the cloud final

Testing Fault Tolerance: Simian Army

• Chaos Monkey
• Latency Monkey
• Chaos Gorilla
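Chaos Monkey is publicly described as randomly terminating production instances to verify that services tolerate the loss. The sketch below only illustrates the selection step under that description; it is not the tool's code, and the group names, instance ids and `pick_victim` function are hypothetical.

```python
import random


def pick_victim(instances_by_group, rng):
    """Pick one random instance from one random non-empty group, or None."""
    groups = [g for g, members in instances_by_group.items() if members]
    if not groups:
        return None
    group = rng.choice(groups)
    return group, rng.choice(instances_by_group[group])


fleet = {
    "api": ["i-a1", "i-a2", "i-a3"],
    "queue": ["i-q1", "i-q2"],
    "empty-group": [],
}
rng = random.Random(2012)  # seeded so a run is repeatable
group, instance = pick_victim(fleet, rng)
print(f"would terminate {instance} in group {group}")
```

A real run would then terminate the chosen instance and rely on redundancy and autoscaling to replace it; the sketch stops at the selection so it stays side-effect free.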

Page 42: 2012 re:Invent Netflix: embracing the cloud final

Open Source Portal at http://netflix.github.com

Page 43: 2012 re:Invent Netflix: embracing the cloud final

Superstorm Sandy

AWS Infrastructure Held Up

>2x Netflix Streaming Usage in East Coast Markets

Boston

New York

Philadelphia

Baltimore

D.C.

Page 44: 2012 re:Invent Netflix: embracing the cloud final

Focus on Building a Great Streaming Product

Page 45: 2012 re:Invent Netflix: embracing the cloud final

Netflix at 2012 re:Invent

Date/Time | Presenter | Topic
Wed 8:30-10:00 | Reed Hastings | Keynote with Andy Jassy
Wed 1:00-1:45 | Coburn Watson | Optimizing Costs with AWS
Wed 2:05-2:55 | Kevin McEntee | Netflix’s Transcoding Transformation
Wed 3:25-4:15 | Neil Hunt / Yury I. | Netflix: Embracing the Cloud
Wed 4:30-5:20 | Adrian Cockcroft | High Availability Architecture at Netflix
Thu 10:30-11:20 | Jeremy Edberg | Rainmakers – Operating Clouds
Thu 11:35-12:25 | Kurt Brown | Data Science with Elastic Map Reduce (EMR)
Thu 11:35-12:25 | Jason Chan | Security Panel: Learn from CISOs working with AWS
Thu 3:00-3:50 | Adrian Cockcroft | Compute & Networking Masters Customer Panel
Thu 3:00-3:50 | Ruslan M./Gregg U. | Optimizing Your Cassandra Database on AWS
Thu 4:05-4:55 | Ariel Tseitlin | Intro to Chaos Monkey and the Simian Army

Page 46: 2012 re:Invent Netflix: embracing the cloud final

We are sincerely eager to hear your feedback on this presentation and on re:Invent.

Please fill out an evaluation form when you have a chance.
