2012 re:Invent Netflix: embracing the cloud final


Netflix: Embracing the Cloud

Neil Hunt, CPO / Yury Izrailevsky, VP Engineering

Embracing the Cloud: Confronting the Challenge

Neil Hunt

Motivation

Netflix – Service Unavailable – Database Crashed

Rest assured that the right people are losing sleep to fix this problem!

We expect to resume service in approximately 72h

12 Aug 2008 03:12am

A Business in Transition

OLD – DVD delivery

• Value from DVDs at home
• Website load small and predictable
• Traditional DC technology: Linux, Apache, Oracle, Java

NEW – Streaming

• Value via Internet delivery
• Website and APIs high load and rapidly growing
• Need more robustness
• Cloud as opportunity for fresh start

Mission: Cloud – High Level Goals

• Availability: 4 x nines
• Scale: unconstrained horizontal scaling
• Performance: unlimited compute

Forklift, or Rewrite?

OLD: Monolithic App, Oracle
NEW: Service Assembly, NoSQL

Old Style – A large 18 wheeler

• Big
• Reliable
• Efficient (when full)
• Expensive
• Inflexible capacity
• Many single points of failure

New Style – A fleet of leased pickups with drivers

• Scalable to small or large loads
• Reliability through redundancy
• Requires rethinking the whole problem

SQL or NoSQL?

MySQL/RDB:

• Developer familiarity

• Developers imagine transactional consistency requirements in every scenario

NoSQL

• Availability & Scale

• Avoid overhead and risk of managing SQL

• Experimented with both
• Ended up with NoSQL for almost everything important

Service Oriented Architecture

• Optimizes for small independent teams with well-defined interfaces

• Better independence from subsystem failures

• Scaling applied to each tier separately

How to Manage the Migration?
Rebuilding a complex system while in operation

[Diagram: Monolithic App, Oracle, NoSQL]

Transitional Infrastructure: “Roman Riding”

Transitional Infrastructure: Create a read-only copy

NoSQL

Source of Truth

Display only

Example: Membership records

Monolithic App

Oracle

Transitional Infrastructure: Move the master copy

NoSQL

Source of Truth

Display only

Example: AB Test Data (account tags controlling test experience)

Monolithic App

Oracle

Transitional Infrastructure: Full Multi-Master duplicate

NoSQL

Multi-master

Example: Queue

Monolithic App

Oracle
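The three transitional slides above can be read as three stages of the same arrangement. The following is a minimal sketch of that reading, not the presenters' implementation: two plain dictionaries stand in for the datacenter database and the cloud NoSQL store, and the class, stage, and method names are all hypothetical. Which store is the source of truth at each stage is an assumption inferred from the slide titles.

```python
from enum import Enum


class Stage(Enum):
    READ_ONLY_COPY = 1  # assumed: legacy store is the source of truth, cloud copy is display only
    MASTER_MOVED = 2    # assumed: cloud store is the source of truth, legacy copy is display only
    MULTI_MASTER = 3    # both stores accept writes and replicate to each other


class TransitionalStore:
    """Hypothetical facade over a legacy store and a cloud store (dicts here)."""

    def __init__(self, stage):
        self.stage = stage
        self.legacy = {}  # stands in for the datacenter database
        self.cloud = {}   # stands in for the cloud NoSQL store

    def write(self, key, value, origin="legacy"):
        """Accept a write at its origin if the stage allows it, then replicate."""
        if self.stage is Stage.READ_ONLY_COPY and origin != "legacy":
            raise PermissionError("cloud copy is read-only at this stage")
        if self.stage is Stage.MASTER_MOVED and origin != "cloud":
            raise PermissionError("legacy copy is read-only at this stage")
        # Replication is synchronous here for simplicity; a real system would
        # more likely replicate asynchronously and have to resolve conflicts.
        self.legacy[key] = value
        self.cloud[key] = value

    def read(self, key, origin="cloud"):
        """Read from whichever side the caller runs on."""
        store = self.cloud if origin == "cloud" else self.legacy
        return store.get(key)


store = TransitionalStore(Stage.READ_ONLY_COPY)
store.write("member:1", "active")              # written in the datacenter
status_in_cloud = store.read("member:1")       # visible to cloud services
```

Moving a data set from one stage to the next then becomes a change of `stage` rather than a change to every caller, which is one way to read the later advice to invest in transitional infrastructure.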

Organizational Challenges

IT Ops: a gradually diminishing role
• Initial extensive role managing legacy DC
• Raised visibility during transition
• New DC vulnerabilities and dependencies to manage

DevOps: a rapidly expanding role
• Components at a higher level abstraction
• More opportunities for automation
• Automated build-push tools
• Autoscaling
• Monitoring and automatic cutouts and failover

The Journey

Phase | Components | Data & Prerequisites
Trial (2009) | Streaming Player | Content keys (RO); Membership status (RO)
Development (2010-11) | Member product pages and APIs | Content catalog (RW); Personalization data (RW) & recs algorithms; AB Test data (RW)
Followthrough (2011-12) | Account and membership | Membership data (RW)
Final (2013) | Payments | PCI and SOX data

Lessons Learned…

• Embrace the whole concept: Take the opportunity to build a modern architecture rather than forklifting SQL and monolithic apps

• Plan to discard your first experiments: You’ll learn so much that you’ll be glad to redo it right

• Invest in transitional infrastructure: Migration will take a while, and it’s worth the effort to make it easy

• Expect your team to learn new ways … but some won’t make the transition

Embracing the Cloud: Delivering the Cloud Solution

Yury Izrailevsky

Mission: Cloud – High Level Goals

• Availability: 4 x nines
• Scale: unconstrained horizontal scaling
• Performance: unlimited compute

Performance, Scalability, Availability

[Chart: x-axis of dates from 1/4/2009 to 8/8/2012]

Scaling Netflix Streaming Service: Weekly Streaming Starts

Netflix Cross-Regional Cloud Architecture

Goal: Regional Failover

Building Global Netflix Streaming Product

Performance, Scalability, Availability

Weekly Cloud Cost Per Streaming Start (last 12 months)


Simian Army: Cloud Efficiency Automation

Janitor Monkey

Regularly scrape unused capacity

Clean up instances, ASGs, ELBs, SGs, etc.

Efficiency Monkey

AI-based resource under-usage detection (CPU, memory, etc.)

Automated Deletion of Old Data

TTL for S3 (using ObjectExpiration)
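As an illustration of the S3 TTL item, here is a minimal sketch of an object-expiration lifecycle rule. The bucket name, the `logs/` prefix, the 30-day window, and the `expiration_rule` helper are invented for the example and do not come from the talk. The dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the call itself is left commented out so the snippet runs without AWS credentials.

```python
def expiration_rule(prefix, days, rule_id=None):
    """Build one S3 lifecycle rule that expires objects under `prefix` after `days` days."""
    return {
        "ID": rule_id or f"expire-{prefix.strip('/') or 'all'}-{days}d",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


# Hypothetical policy: delete anything under logs/ after 30 days.
lifecycle = {"Rules": [expiration_rule("logs/", 30)]}

# To apply it (requires boto3 and AWS credentials; bucket name is made up):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=lifecycle
# )
```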


Cyclical Streaming Usage Pattern


Load-Based Auto Scaling

Move to Load-Based Scaling
• Scale up/down by 70%+
• 50%+ cost saving
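A minimal sketch of what a load-based scaling decision can look like, assuming a fleet sized so that each instance handles a fixed target request rate. The function name, the per-instance target, the fleet bounds, and the hourly load figures are all made up for illustration; they are not Netflix's numbers and are not meant to reproduce the percentages on the slide.

```python
import math


def desired_instances(requests_per_sec, target_rps_per_instance,
                      min_instances=2, max_instances=1000):
    """Return the fleet size needed to serve the load, clamped to the allowed range."""
    if target_rps_per_instance <= 0:
        raise ValueError("target_rps_per_instance must be positive")
    needed = math.ceil(requests_per_sec / target_rps_per_instance)
    return max(min_instances, min(max_instances, needed))


# Invented cyclical load (requests per second), from overnight trough to evening peak.
hourly_load = [3000, 2000, 1500, 1200, 1500, 3000, 5000, 7000, 9000, 10000]
fleet = [desired_instances(load, target_rps_per_instance=100) for load in hourly_load]

# Compare against a fleet provisioned for the peak at all times.
fixed_cost = max(fleet) * len(fleet)
scaled_cost = sum(fleet)
saving = 1 - scaled_cost / fixed_cost  # about 0.57 with these made-up numbers
```

The point of the comparison is only that a cyclical load leaves a peak-sized fleet idle for much of the day, so following the load recovers instance-hours.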

Performance, Scalability, Availability

A Truly Great Service…

Availability Goal: 99.99% (30 secs/week at peak traffic)

Has To Just Work!

[Chart: x-axis of dates from 7/17/2011 to 11/4/2012]

June 29th, 2012 AWS / Netflix Outage

Other AWS Outages

Historical Streaming Availability (13wkMA)

Using Redundancy in AWS Infrastructure to Survive Failures

Cascading Failures


API

InstantQueue

SimpleDB

Netflix Cloud Architecture


Cascading Failures

99% Availability × 99% Availability × … × 99% Availability (300 terms): $0.99^{300} \approx 4.90\%$
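The arithmetic on this slide can be checked with a short sketch. It assumes that a request succeeds only if every one of the services succeeds and that their failures are independent; the function name is hypothetical.

```python
def serial_availability(per_service, n_services):
    """Availability of a request that needs all n independent services to succeed."""
    return per_service ** n_services


combined = serial_availability(0.99, 300)
print(f"{combined:.2%}")  # prints 4.90%

# Turned around: the per-service availability that 300 such services would
# each need for the combination to reach 99.99%.
required = 0.9999 ** (1 / 300)
```

Under the same assumptions, `required` comes out above 99.9999% per service, which suggests why the talk turns next to degradation and redundancy rather than to making each service perfect.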

Strategies to Improve Availability

• Graceful Degradation
• Redundancy

Graceful Degradation
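One common way to express graceful degradation in code is a fallback wrapper: if a dependency fails, return a simpler answer instead of propagating the error. This is a minimal sketch of that pattern, not the mechanism used in the talk; `with_fallback`, `personalized_rows`, and `popular_rows` are hypothetical names.

```python
def with_fallback(primary, fallback):
    """Wrap `primary` so that any exception is answered by `fallback` instead."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Degrade: serve a less tailored result rather than an error page.
            return fallback(*args, **kwargs)
    return call


def personalized_rows(user_id):
    """Stand-in for a dependency that is currently down."""
    raise TimeoutError("recommendation service unavailable")


def popular_rows(user_id):
    """Stand-in for a static, non-personalized default."""
    return ["Popular titles"]


get_rows = with_fallback(personalized_rows, popular_rows)
rows = get_rows("user-1")  # the default list, because the primary call failed
```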


Redundancy


Zone A

Zone B

Zone C

Redundancy Across Availability Zones

Storage Redundancy Across Regions, Vendors

S3 Backup

Secure Cloud Backup

A B C

Cassandra

Testing Fault Tolerance: Simian Army

Chaos Monkey, Latency Monkey, Chaos Gorilla

Open Source Portal at http://netflix.github.com
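The idea behind this kind of fault-injection testing can be sketched as a toy simulation: remove a random instance from a group and check that the service still meets its minimum healthy count. This is not the open-source Chaos Monkey code and calls no cloud API; the function names, instance IDs, and health rule are invented for the example.

```python
import random


def terminate_random_instance(instances, rng):
    """Remove one randomly chosen instance from the set and return its ID."""
    victim = rng.choice(sorted(instances))  # sorted so a seeded run is repeatable
    instances.remove(victim)
    return victim


def service_is_healthy(instances, min_healthy=1):
    """Toy health rule: the service is up while enough instances remain."""
    return len(instances) >= min_healthy


rng = random.Random(42)                # seeded only to make the example repeatable
group = {"i-a", "i-b", "i-c"}          # hypothetical instance IDs
victim = terminate_random_instance(group, rng)
still_up = service_is_healthy(group, min_healthy=2)
```

A group that cannot lose one instance and stay above `min_healthy` fails this check in a test rather than during a real outage.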

Superstorm Sandy

AWS Infrastructure Held Up

>2x Netflix Streaming Usage in East Coast Markets

Boston

New York

Philadelphia

Baltimore

D.C.

Focus on Building a Great Streaming Product


Netflix at 2012 re:Invent

Date/Time Presenter Topic

Wed 8:30-10:00 Reed Hastings Keynote with Andy Jassy

Wed 1:00-1:45 Coburn Watson Optimizing Costs with AWS

Wed 2:05-2:55 Kevin McEntee Netflix’s Transcoding Transformation

Wed 3:25-4:15 Neil Hunt / Yury I. Netflix: Embracing the Cloud

Wed 4:30-5:20 Adrian Cockcroft High Availability Architecture at Netflix

Thu 10:30-11:20 Jeremy Edberg Rainmakers – Operating Clouds

Thu 11:35-12:25 Kurt Brown Data Science with Elastic Map Reduce (EMR)

Thu 11:35-12:25 Jason Chan Security Panel: Learn from CISOs working with AWS

Thu 3:00-3:50 Adrian Cockcroft Compute & Networking Masters Customer Panel

Thu 3:00-3:50 Ruslan M./Gregg U. Optimizing Your Cassandra Database on AWS

Thu 4:05-4:55 Ariel Tseitlin Intro to Chaos Monkey and the Simian Army

We are sincerely eager to hear your feedback on this presentation and on re:Invent.

Please fill out an evaluation form when you have a chance.
