(ARC317) Maintaining a Resilient Front Door at Massive Scale | AWS re:Invent 2014

DESCRIPTION

The Netflix service supports more than 50 million subscribers in over 40 countries around the world. These subscribers use more than 1,000 different device types to connect to Netflix, resulting in massive amounts of traffic to the service. In our distributed environment, the gateway service that receives this customer traffic needs to be able to scale in a variety of ways while simultaneously protecting our subscribers from failures elsewhere in the architecture. This talk will detail how the Netflix front door operates, leveraging systems like Hystrix, Zuul, and Scryer to maximize the AWS infrastructure and to create a great streaming experience.

November 12, 2014 | Las Vegas, Nevada

Daniel Jacobson, Netflix

Ben Schmaus, Netflix

Daniel Jacobson

@daniel_jacobson

danieljacobson/linkedin

danieljacobson.com/slideshare

Ben Schmaus

@schmaus

schma.us/in

schma.us/slides

Edge Engineering

What does Edge Engineering do?

• Broker data between services and devices

• Control playback flow

• Ensure resiliency

• Scale our systems

• Enable high velocity product innovation

• Provide detailed, real-time health insights

“The Edge... the only people who really know where it is are the ones who have gone over.” -- Hunter S. Thompson

APP-310: Scheduling using Apache Mesos in the Cloud, 9:00 on Friday

[Architecture diagram: DEVICES, then ROUTING, then ORIGIN (API servers running RxJava, Hystrix, and Scripting), then SERVICES S1 through S13]

[Architecture diagram: DEVICES, then ROUTING, then ORIGIN, then SERVICES S1 through S13. Boxes labeled Playback, Playback, Website, Website, and Logging appear alongside the API (RxJava, Hystrix, Scripting).]

Routing Traffic

“There is no Dana, only Zuul!”

Zuul

Gatekeeper for the Netflix Streaming Application

Zuul

• Multi-Region Resiliency

• Dynamic Routing

• Squeeze Testing

• Insights

• Load Shedding

• Security

• Authentication
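As a rough illustration of the dynamic routing and squeeze testing bullets above, the sketch below picks a target cluster for each request. The cluster names PROD, DEBUG, and SQUEEZE come from the diagrams in this deck. The function name, the device-ID rule, and the percentage knob are assumptions made for illustration and are not Zuul's filter API. It also assumes that squeeze testing means steering an adjustable share of traffic at one cluster.

```python
import zlib


def choose_cluster(device_id, debug_devices=frozenset(), squeeze_percent=0):
    """Pick a target cluster for one request (hypothetical routing rule)."""
    # Dynamic routing: pin specific devices to a debug cluster.
    if device_id in debug_devices:
        return "DEBUG"
    # Squeeze testing (assumed meaning): send a stable slice of traffic
    # to one cluster so its load can be raised in a controlled way.
    # crc32 gives the same bucket for the same device on every call.
    bucket = zlib.crc32(device_id.encode("utf-8")) % 100
    if bucket < squeeze_percent:
        return "SQUEEZE"
    return "PROD"
```

Hashing the device ID keeps a given device on the same cluster between requests, which makes a debug session or a load test easier to reason about than random assignment would.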

[Architecture diagram: DEVICES, then ROUTING, then ORIGIN with a PROD cluster and a DEBUG cluster (each RxJava, Hystrix, Scripting), then SERVICES S1 through S13]

[Architecture diagram: DEVICES, then ROUTING, then ORIGIN with a PROD cluster and a SQUEEZE cluster (each RxJava, Hystrix, Scripting), then SERVICES S1 through S13]

Systems are healthy.

Traffic from the east goes to US-EAST

Traffic from the west goes to US-WEST

Systems failure in US-EAST.

US-EAST Zuul routes traffic to US-WEST Zuul (until DNS gets resolved)

DNS gets resolved.

Requests from east go to US-WEST

Systems recover in US-EAST.

DNS set to return to normal

DNS gets resolved.

Both regions return to normal.
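The failover sequence above can be sketched as a single routing decision made by the gateway in the region where a request lands. This is a minimal sketch under stated assumptions: the function name and the health map are hypothetical, and the real system also has to change DNS, which is not modeled here.

```python
def route_request(arrival_region, region_healthy):
    """Return the region whose systems should serve this request.

    arrival_region: region the request reached through DNS, e.g. "US-EAST".
    region_healthy: dict mapping region name to True (healthy) or False.
    """
    # Normal case: serve the request where it arrived.
    if region_healthy.get(arrival_region, False):
        return arrival_region
    # Local systems are failing: forward to a healthy region
    # until DNS stops sending traffic here.
    for region in sorted(region_healthy):
        if region_healthy[region]:
            return region
    # No region is healthy; stay local rather than bounce traffic around.
    return arrival_region
```

Once DNS has been updated, requests from the east arrive in US-WEST directly and the first branch handles them without any cross-region hop.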

Resiliency in Distributed Systems

Preventing cascading failures

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable” -- Leslie Lamport

Dependency Relationships

Netflix API: 5,000,000,000 Incoming Requests Per Day

Netflix API: 30 Dependent Services

Netflix API: ~600 Dependency Jars

Netflix API: 40,000,000,000 Outbound Calls Per Day to Dependent Services

1 Thing is common across all dependencies…

0 Dependent Services have a 100% SLA

(99.99%)^30 = 99.7%

0.3% of 5B = 15M failures per day

2+ Hours of Downtime Per Month

(99.9%)^30 = 97%

3% of 5B = 150M failures per day

20+ Hours of Downtime Per Month
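The arithmetic on these slides can be checked with a few lines of Python. The sketch assumes that the 30 dependencies fail independently and that a month has 720 hours; the function names are chosen here for illustration.

```python
def compound_availability(per_service, n_services):
    """Availability of a request that needs every one of n services."""
    return per_service ** n_services


def failures_per_day(availability, requests_per_day):
    """Expected number of failed requests in one day."""
    return (1 - availability) * requests_per_day


def downtime_hours_per_month(availability, hours_per_month=720):
    """Equivalent downtime if the failures were one continuous outage."""
    return (1 - availability) * hours_per_month


if __name__ == "__main__":
    for per_service in (0.9999, 0.999):
        total = compound_availability(per_service, 30)
        print(per_service, round(total, 4),
              round(failures_per_day(total, 5_000_000_000)),
              round(downtime_hours_per_month(total), 1))
```

With 30 dependencies at 99.99% each, the combined figure comes out at roughly 99.7%, which matches the slide. At 99.9% each it drops to roughly 97%.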

[Diagram sequence: DEVICES, then ROUTING, then ORIGIN (API servers running RxJava, Hystrix, and Scripting), then SERVICES S1 through S13]

Call Volume and Health / Last 10 Seconds

Call Volume / Last 2 Minutes

Successful Requests

Short-Circuited Requests, Delivering Fallbacks

Timeouts, Delivering Fallbacks

Full Queues, Delivering Fallbacks

Exceptions, Delivering Fallbacks

Error Rate

(Short-Circuited + Timeouts + Full Queues + Exceptions) / (Successful + Short-Circuited + Timeouts + Full Queues + Exceptions) = Error Rate
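To make the error-rate formula concrete, here is a minimal sketch of a counter object and a call wrapper that returns a fallback when the dependency fails or the error rate is too high. It is not Hystrix's implementation or API. The class name, the 50% threshold, and the minimum call count are assumptions, and the rolling time window, thread pools (so the full-queue counter is never incremented here), and recovery after a trip are all left out.

```python
class CommandStats:
    """Outcome counters for one dependency (time window handling omitted)."""

    def __init__(self):
        self.success = 0
        self.short_circuited = 0
        self.timeout = 0
        self.full_queue = 0  # kept for the formula; unused in this sketch
        self.exception = 0

    def failures(self):
        return (self.short_circuited + self.timeout
                + self.full_queue + self.exception)

    def total(self):
        return self.success + self.failures()

    def error_rate(self):
        # Four failure counters over all five counters.
        return self.failures() / self.total() if self.total() else 0.0


def call_with_fallback(stats, primary, fallback,
                       error_threshold=0.5, min_calls=20):
    """Run primary(); return fallback() on a tripped circuit or a failure."""
    # Short-circuit: skip the dependency entirely once it looks unhealthy.
    if stats.total() >= min_calls and stats.error_rate() >= error_threshold:
        stats.short_circuited += 1
        return fallback()
    try:
        result = primary()
    except TimeoutError:
        stats.timeout += 1
        return fallback()
    except Exception:
        stats.exception += 1
        return fallback()
    stats.success += 1
    return result
```

A caller might pass a function that returns cached or default data as `fallback`, so that a failing dependency degrades one part of the response instead of failing the whole request.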

[Diagram sequence: DEVICES, then ROUTING, then ORIGIN (API servers running RxJava, Hystrix, and Scripting), then SERVICES S1 through S13]

Fallback

Demo

May the demo gods be with us…

Scaling Systems

Preventing failures due to capacity issues

“The possibilities are numerous once we decide to act and not react” -- George Bernard Shaw

Reactive Auto Scaling

• Reacts to real-time conditions

• Responds to spikes/dips in metrics

– Load average

– Requests per second

• Excellent for many scaling scenarios

– Much better than static cluster sizing

Reactive Auto Scaling - Challenges

• Policies can be inefficient w

• Outages can trigger scale down events

• Excess capacity at peak and trough
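A reactive policy can be sketched as a function of the metric observed right now. The sketch below sizes a cluster from current requests per second. The function name, the per-instance target, and the bounds are illustrative assumptions rather than an AWS Auto Scaling API.

```python
import math


def reactive_desired_instances(current_rps, target_rps_per_instance,
                               min_instances, max_instances):
    """Size the cluster from the request rate observed right now."""
    needed = math.ceil(current_rps / target_rps_per_instance)
    # Clamp to the configured floor and ceiling.
    return max(min_instances, min(max_instances, needed))
```

This also shows why an outage can trigger a scale-down. If an upstream failure cuts observed traffic from 10,000 to 500 requests per second, the function shrinks the cluster to its minimum just before the traffic comes back.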

Scryer : Predictive Auto Scaling

Not yet…

Typical Traffic Patterns Over Five Days

Predicted RPS Compared to Actual RPS

Scaling Plan for Predicted Workload

What is Scryer Doing?

• Evaluates needs based on historical data

– Week over week, month over month metrics

• Adjusts instance minimums based on algorithms

– Constant feedback loops

– Evaluated routinely through squeeze tests

• Relies on Auto Scaling for unpredicted spikes in traffic
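The bullets above describe a split of responsibilities: a prediction sets the instance minimum ahead of time, and reactive Auto Scaling covers whatever the prediction missed. The sketch below illustrates that split under loud assumptions. A plain average of the same time slot in earlier weeks stands in for Scryer's prediction algorithms, which the slides do not describe, and the function names and headroom factor are hypothetical.

```python
import math


def predict_rps(same_slot_history):
    """Assumed model: average RPS of this time slot in previous weeks."""
    return sum(same_slot_history) / len(same_slot_history)


def planned_minimum(same_slot_history, target_rps_per_instance,
                    headroom=1.25):
    """Instance minimum to set before the predicted load arrives."""
    predicted = predict_rps(same_slot_history) * headroom
    return math.ceil(predicted / target_rps_per_instance)


def effective_instances(planned_min, reactive_needed):
    """Reactive scaling can add capacity but never goes below the plan."""
    return max(planned_min, reactive_needed)
```

Because the planned minimum does not depend on live traffic, a sudden drop in requests during an outage cannot shrink the cluster below what the history says will be needed.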

Results

Results : Load Average

Reactive

Predictive

Results : Response Latencies

Reactive

Predictive

Results : Outage Recovery

Results : AWS Costs

Key Takeaways

https://www.github.com/Netflix

Netflix talks at re:Invent

Talk     Time               Title
PFC-305  Wednesday, 1:15pm  Embracing Failure: Fault Injection and Service Reliability
BDT-403  Wednesday, 2:15pm  Next Generation Big Data Platform at Netflix
PFC-306  Wednesday, 2:15pm  Performance Tuning EC2
DEV-309  Wednesday, 3:30pm  From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services
ARC-317  Wednesday, 4:30pm  Maintaining a Resilient Front-Door at Massive Scale
PFC-304  Wednesday, 4:30pm  Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures
ENT-209  Wednesday, 4:30pm  Cloud Migration, Dev-Ops and Distributed Systems
APP-310  Friday, 9:00am     Scheduling using Apache Mesos in the Cloud

http://bit.ly/awsevals

http://schma.us/in

http://schma.us/slides
