Cloud Native at Netflix What Changed? July 2013 Adrian Cockcroft @adrianco #netflixcloud @NetflixOSS http://www.linkedin.com/in/adriancockcroft

Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013


DESCRIPTION

If we start with the need to make the business more agile and responsive to opportunities and competitive threats, a big component of the time taken is in the development and delivery of web services. Cloud Native architecture delivers speed, scalability and security through automation of continuously delivered single function micro-services with a denormalized NoSQL back end. In the case of Netflix, the streaming service is deployed globally using Cassandra to provide cross zone and cross regional replication. NetflixOSS is a set of open source components that anyone can use to help them adopt Cloud Native architectures, and there is even a prize for the best open source contributions to NetflixOSS at http://netflix.github.com


Page 1: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Cloud Native at Netflix: What Changed?

July 2013

Adrian Cockcroft

@adrianco #netflixcloud @NetflixOSS

http://www.linkedin.com/in/adriancockcroft

Page 2: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Cloud Native

Netflix Architecture

NetflixOSS

Page 3: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Cloud Native

What is it?

Why?

Page 4: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Engineers

Solve hard problems

Build amazing and complex things

Fix things when they break

Page 5: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Strive for perfection

Perfect code

Perfect hardware

Perfectly operated

Page 6: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

But perfection takes too long…

Compromises…

Time to market vs. Quality

Utopia remains out of reach

Page 7: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Where time to market wins big

Making a land-grab

Disrupting competitors (OODA)

Anything delivered as web services

Page 8: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Observe

Orient

Decide

Act

Land grab opportunity

Competitive move

Customer Pain Point

Analysis

Get buy-in

Plan response

Commit resources

Implement

Deliver

Engage customers

Research alternatives

BIG DATA

INNOVATION

CULTURE

CLOUD

Measure customers

Colonel Boyd, USAF

“Get inside your adversaries' OODA loop to disorient them”

Page 9: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

How Soon?

Code features in days instead of months

Get hardware in minutes instead of weeks

Incident response in seconds instead of hours

Page 10: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

A new engineering challenge

Construct a highly agile and highly available service from ephemeral and assumed broken components

Page 11: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Inspiration

Page 12: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

How to get to Cloud Native

Freedom and Responsibility for Developers

Decentralize and Automate Ops Activities

Integrate DevOps into the Business Organization

Re-Org!

Page 13: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Four Transitions

• Management: Integrated Roles in a Single Organization
  – Business, Development, Operations -> BusDevOps

• Developers: Denormalized Data – NoSQL
  – Decentralized, scalable, available, polyglot

• Responsibility from Ops to Dev: Continuous Delivery
  – Decentralized small daily production updates

• Responsibility from Ops to Dev: Agile Infrastructure - Cloud
  – Hardware in minutes, provisioned directly by developers

Page 14: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Netflix BusDevOps Organization

Chief Product Officer

VP Product Management

Directors Product

VP UI Engineering

Directors Development

Developers + DevOps

UI Data Sources

AWS

VP Discovery Engineering

Directors Development

Developers + DevOps

Discovery Data Sources

AWS

VP Platform

Directors Platform

Developers + DevOps

Platform Data Sources

AWS

Denormalized, independently updated and scaled data

Cloud, self service updated & scaled infrastructure

Code, independently updated continuous delivery

Page 15: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Decentralized Deployment

Page 16: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Asgard Developer Portal

http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html

Page 17: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Ephemeral Instances

• Largest services are autoscaled
• Average lifetime of an instance is 36 hours

Push

Autoscale Up

Autoscale Down
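The 36-hour average lifetime implies a steady churn of launches and terminations. A rough sketch of that arithmetic using Little's law (fleet size = launch rate × mean lifetime) follows; the fleet size used here is a hypothetical round number, not a figure from the talk.

```python
# Steady-state instance turnover implied by a mean instance lifetime.
# Little's law: fleet_size = launch_rate * mean_lifetime.

MEAN_LIFETIME_HOURS = 36  # from the slide


def launches_per_hour(fleet_size: int,
                      mean_lifetime_hours: float = MEAN_LIFETIME_HOURS) -> float:
    """Average launches (and terminations) per hour that keep the fleet size steady."""
    return fleet_size / mean_lifetime_hours


if __name__ == "__main__":
    hypothetical_fleet = 3600  # assumption, chosen only for round numbers
    print(launches_per_hour(hypothetical_fleet))
```

Under these assumptions, a fleet of that size would replace about 100 instances every hour, which is why instances have to be treated as disposable.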

Page 18: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Netflix Streaming

A Cloud Native Application based on an open source platform

Page 19: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Netflix Member Web Site Home Page

Personalization Driven – How Does It Work?

Page 20: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

How Netflix Streaming Works

Customer Device (PC, PS3, TV…)

Web Site or Discovery API

User Data

Personalization

Streaming API

DRM

QoS Logging

OpenConnect CDN Boxes

CDN Management and Steering

Content Encoding

Consumer Electronics

AWS Cloud Services

CDN Edge Locations

Page 21: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Streaming Bandwidth

Nov 2012

March 2013

Mean Bandwidth +39% 6mo

Page 22: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Real Web Server Dependencies Flow

(Netflix Home page business transaction as seen by AppDynamics)

Start Here

memcached

Cassandra

Web service

S3 bucket

Personalization movie group choosers (for US, Canada and Latam)

Each icon is three to a few hundred instances across three AWS zones

Page 23: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Three Balanced Availability Zones

Test with Chaos Gorilla

Cassandra and Evcache Replicas

Zone A

Cassandra and Evcache Replicas

Zone B

Cassandra and Evcache Replicas

Zone C

Load Balancers

Chaos Gorilla
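Running three balanced zones only helps if the surviving two can absorb the third zone's traffic. The sketch below works through that headroom arithmetic. It assumes perfectly even load and even redistribution after a zone loss, and the function names are illustrative; the talk does not state Netflix's actual headroom targets.

```python
def load_per_surviving_zone(zones: int, failed: int = 1) -> float:
    """Relative load on each surviving zone after `failed` zones drop out (1.0 = normal)."""
    if not 0 <= failed < zones:
        raise ValueError("failed must be between 0 and zones - 1")
    return zones / (zones - failed)


def max_safe_utilization(zones: int, failed: int = 1) -> float:
    """Highest normal utilization that still leaves room for the failed zones' traffic."""
    if not 0 <= failed < zones:
        raise ValueError("failed must be between 0 and zones - 1")
    return (zones - failed) / zones


if __name__ == "__main__":
    print(load_per_surviving_zone(3))  # load multiplier on each remaining zone
    print(max_safe_utilization(3))     # fraction of capacity usable in normal operation
```

With three zones, each survivor carries 1.5 times its normal load, so each zone would need to run at no more than about two thirds of capacity in normal operation.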

Page 24: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Isolated Regions

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

US-East Load Balancers

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

EU-West Load Balancers

Page 25: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Cross Region Use Cases

• Geographic Isolation
  – US to Europe replication of subscriber data
  – Read intensive, low update rate
  – Production use since late 2011

• Redundancy for regional failover
  – US East to US West replication of everything
  – Includes write intensive data, high update rate
  – Testing now

Page 26: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Benchmarking Global Cassandra

Write intensive test of cross region replication capacity

16 x hi1.4xlarge SSD nodes per zone = 96 total

192 TB of SSD in six locations up and running Cassandra in 20 min
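The node and storage totals can be checked with a few lines of arithmetic. This sketch takes the zone and region counts from this slide's diagram (Zones A, B and C in each of two regions); the per-node SSD figure is derived from the slide's totals rather than stated on it.

```python
# Sanity check of the benchmark cluster size quoted on the slide.
NODES_PER_ZONE = 16    # from the slide
ZONES_PER_REGION = 3   # Zones A, B, C in the diagram
REGIONS = 2            # US-West-2 and US-East-1
TOTAL_SSD_TB = 192     # from the slide

locations = ZONES_PER_REGION * REGIONS        # "six locations"
total_nodes = NODES_PER_ZONE * locations      # "96 total"
ssd_tb_per_node = TOTAL_SSD_TB / total_nodes  # derived, not stated on the slide

if __name__ == "__main__":
    print(locations, total_nodes, ssd_tb_per_node)
```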

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

US-West-2 Region - Oregon

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

US-East-1 Region - Virginia

Test Load

Test Load

Validation Load

Inter-Zone Traffic

1 Million writes
CL.ONE (wait for one replica to ack)

1 Million reads
After 500ms
CL.ONE with no data loss

Inter-Region Traffic
Up to 9Gbits/s, 83ms

18TB backups from S3
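The benchmark writes at CL.ONE (acknowledged once one replica responds) and then reads at CL.ONE from the other side after 500ms. The toy model below illustrates why the delay matters. It is a simulation with made-up classes, not the Cassandra driver API, and the per-replica delays are assumptions: 1ms for a local replica, and the slide's 83ms figure borrowed as an illustrative remote delay.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    lag_ms: int                                  # assumed delivery delay to this replica
    data: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)  # (visible_at_ms, key, value)

    def receive(self, now_ms: int, key: str, value: str) -> None:
        """Queue a write that becomes visible after this replica's delay."""
        self.pending.append((now_ms + self.lag_ms, key, value))

    def advance(self, now_ms: int) -> None:
        """Apply every queued write that has arrived by now_ms."""
        still_pending = []
        for visible_at, key, value in self.pending:
            if visible_at <= now_ms:
                self.data[key] = value
            else:
                still_pending.append((visible_at, key, value))
        self.pending = still_pending


class ToyCluster:
    def __init__(self, replicas):
        self.replicas = replicas

    def write_one(self, now_ms: int, key: str, value: str) -> int:
        """Send to all replicas; return the time of the first ack (CL.ONE)."""
        for replica in self.replicas:
            replica.receive(now_ms, key, value)
        return now_ms + min(r.lag_ms for r in self.replicas)

    def read_one(self, now_ms: int, key: str, replica_name: str):
        """Read from a single named replica (CL.ONE) as of now_ms."""
        for replica in self.replicas:
            replica.advance(now_ms)
        target = next(r for r in self.replicas if r.name == replica_name)
        return target.data.get(key)


if __name__ == "__main__":
    cluster = ToyCluster([Replica("local", 1), Replica("remote", 83)])
    print(cluster.write_one(0, "k", "v"))        # ack time of the fastest replica
    print(cluster.read_one(50, "k", "remote"))   # too early for the remote replica
    print(cluster.read_one(500, "k", "remote"))  # write has arrived by now
```

In this model a remote CL.ONE read issued before the replication delay has elapsed misses the write, while one issued after 500ms sees it, which matches the shape of the validation described on the slide.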

Page 27: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Managing Multi-Region Availability

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

Regional Load Balancers

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

Regional Load Balancers

UltraDNS

DynECT DNS

AWS Route53

Denominator – manage traffic via multiple DNS providers with Java code

2013 Timeline - Concept Jan, Code Feb, OSS March, Production use May

Denominator
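The idea of steering traffic through several DNS providers from one piece of code can be sketched with a small provider-neutral interface. Denominator itself is a Java library; the Python classes, method names, provider labels and hostnames below are hypothetical and do not reflect Denominator's actual API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class DnsProvider(ABC):
    """Hypothetical minimal provider interface (not Denominator's real API)."""

    label: str

    @abstractmethod
    def upsert_cname(self, name: str, target: str) -> None:
        """Create or replace a CNAME record."""

    @abstractmethod
    def lookup(self, name: str) -> Optional[str]:
        """Return the current target for a name, if any."""


class InMemoryProvider(DnsProvider):
    """Stand-in for a real provider client, so the sketch runs offline."""

    def __init__(self, label: str) -> None:
        self.label = label
        self.records: Dict[str, str] = {}

    def upsert_cname(self, name: str, target: str) -> None:
        self.records[name] = target

    def lookup(self, name: str) -> Optional[str]:
        return self.records.get(name)


def steer(providers: List[DnsProvider], name: str, target: str) -> List[str]:
    """Point `name` at `target` on every provider; return the labels updated."""
    updated = []
    for provider in providers:
        provider.upsert_cname(name, target)
        updated.append(provider.label)
    return updated


if __name__ == "__main__":
    providers = [InMemoryProvider("provider-a"), InMemoryProvider("provider-b")]
    print(steer(providers, "api.example.com", "us-west-2.example.com"))
```

Keeping the same records in every provider is what would let a regional failover proceed even if one DNS vendor's control plane is unavailable.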

Page 28: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Incidents – Impact and Mitigation

PR: X Incidents

CS: XX Incidents

Metrics impact – Feature disable: XXX Incidents

No Impact – fast retry or automated failover: XXXX Incidents

Public Relations Media Impact

High Customer Service Calls

Affects AB Test Results

Y incidents mitigated by Active Active, game day practicing

YY incidents mitigated by better tools and practices

YYY incidents mitigated by better data tagging
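The lowest tier of the incident pyramid, "fast retry or automated failover", can be illustrated with a small client-side helper. This is a generic sketch of the pattern under assumed names, not Netflix's implementation: each endpoint is modeled as a zero-argument callable, and the retry count is arbitrary.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(endpoints: Sequence[Callable[[], T]],
                       retries_per_endpoint: int = 1) -> T:
    """Try each endpoint in order, retrying quickly before failing over to the next."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        for _ in range(retries_per_endpoint + 1):
            try:
                return endpoint()
            except Exception as err:  # a real client would catch narrower errors
                last_error = err
    raise RuntimeError("all endpoints failed") from last_error


if __name__ == "__main__":
    def primary() -> str:
        raise ConnectionError("primary region unavailable")

    def secondary() -> str:
        return "served from secondary"

    print(call_with_failover([primary, secondary]))
```

When the retry or failover succeeds, the caller sees a normal response, which is what keeps such incidents in the "No Impact" tier.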

Page 29: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Cloud Security

Automated attack surface monitoring

Crypto key store management (CloudHSM)

Scale to resist DDOS attacks

http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned

Page 30: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

What Changed?

“Impossible” deployments are easy

Jointly building code with vendors in public

Highly available and secure despite scale and speed

Page 31: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

The DIY Question

Why doesn’t Netflix build and run its own cloud?

Page 32: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Fitting Into Public Scale

Public Grey Area Private

1,000 Instances 100,000 Instances

Netflix Facebook Startups

Page 33: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

How big is Public?

AWS upper bound estimate based on the number of public IP Addresses

Every provisioned instance gets a public IP by default (some VPC don’t)

AWS Maximum Possible Instance Count 4.2 Million – May 2013

Growth >10x in Three Years, >2x Per Annum - http://bit.ly/awsiprange
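The estimate treats the count of published public IP addresses as a ceiling on instance count, and the growth claim follows from compounding. The sketch below shows both calculations. The CIDR blocks are documentation-only placeholder ranges, not AWS's published ranges, so the resulting count is illustrative only.

```python
import ipaddress
from typing import Iterable


def max_instances(cidr_blocks: Iterable[str]) -> int:
    """Upper bound on instances: one public address per instance."""
    return sum(ipaddress.ip_network(block).num_addresses for block in cidr_blocks)


def annual_growth(total_factor: float, years: float) -> float:
    """Per-year multiplier that compounds to total_factor over the given years."""
    return total_factor ** (1 / years)


if __name__ == "__main__":
    # Placeholder ranges reserved for documentation, standing in for a provider's list.
    placeholder_blocks = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]
    print(max_instances(placeholder_blocks))
    # 10x over three years compounds to a bit more than 2x per year.
    print(round(annual_growth(10, 3), 2))
```

A tenfold increase over three years works out to roughly 2.15x per year, consistent with the slide's ">2x Per Annum".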

Page 34: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

A Cloud Native Open Source Platform

See netflix.github.com

Page 35: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Establish our solutions as Best Practices / Standards

Hire, Retain and Engage Top Engineers

Build up Netflix Technology Brand

Benefit from a shared ecosystem

Goals

Page 36: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Example Application – RSS Reader

ZUUL

Zuul Traffic Processing and Routing

Page 37: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Ice – Detailed AWS “Chargeback”

http://techblog.netflix.com/2013/06/announcing-ice-cloud-spend-and-usage.html

Page 38: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Boosting the @NetflixOSS Ecosystem

See netflix.github.com

Page 39: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

More Use Cases

More Features

Better portability

Higher availability

Easier to deploy

Contributions from end users

Contributions from vendors

What’s Coming Next?

Page 40: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Vendor Driven Portability

Interest in using NetflixOSS for Enterprise Private Clouds

“It’s done when it runs Asgard”
Functionally complete
Demonstrated March
Released June in V3.3

Offering $10K prize for integration work
Vendor and end user interest
Openstack “Heat” getting there
Paypal C3 Console based on Asgard

Page 41: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Functionality and scale now, portability coming

Moving from parts to a platform in 2013

Netflix is fostering a cloud native ecosystem

Rapid Evolution - Low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen)

Page 42: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

Slideshare.net/Netflix Details

• Meetup S1E3 July – Featuring Contributors Eucalyptus, IBM, Paypal, Riot Games
  – http://techblog.netflix.com/2013/07/netflixoss-meetup-series-1-episode-3.html

• Lightning Talks March S1E2
  – http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-and-roadmap

• Lightning Talks Feb S1E1
  – http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks

• Asgard In Depth Feb S1E1
  – http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house

• Security Architecture
  – http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned/

• Cost Aware Cloud Architectures – with Jinesh Varia of AWS
  – http://www.slideshare.net/AmazonWebServices/building-costaware-architectures-jinesh-varia-aws-and-adrian-cockroft-netflix

Page 43: Cloud Native at Netflix: What Changed? - Gartner Catalyst 2013

What Changed?

Speed wins, Cloud Native helps you get there

NetflixOSS makes it easier for everyone to become Cloud Native

http://netflix.github.com

http://techblog.netflix.com

http://slideshare.net/Netflix

http://www.linkedin.com/in/adriancockcroft

@adrianco #netflixcloud @NetflixOSS