Transcript
Page 1: Devops at Netflix (re:Invent)

Rainmakers

How Netflix Operates Clouds for Maximum Freedom and Agility

Jeremy Edberg, Reliability Architect, Netflix

Page 2: Devops at Netflix (re:Invent)

Tweet @jedberg with feedback!

Do you have...

• A release engineer?

• A QA department?

• Chef or Puppet to manage your systems?

Page 3: Devops at Netflix (re:Invent)

Do you have...

• Upwards of 100 releases a day?

Page 4: Devops at Netflix (re:Invent)

Page 5: Devops at Netflix (re:Invent)

With more than 30 million streaming members in the United States, Canada, Latin America, the United Kingdom, Ireland and the Nordics, Netflix is the world's leading internet subscription service for enjoying movies and TV programs streamed over the internet to PCs, Macs and TVs.

Source: http://ir.netflix.com

Page 6: Devops at Netflix (re:Invent)

The Netflix Way

• Everything is “built for three”

• Fully automated build tools to test and make packages

• Fully automated machine image bakery

• Fully automated image deployment

• Independent teams responsible for both Dev and Ops

Page 7: Devops at Netflix (re:Invent)

Philosophy

Page 8: Devops at Netflix (re:Invent)

Automate all the things!

Page 9: Devops at Netflix (re:Invent)

Automate all the things!

• Application startup

• Configuration

• Code deployment

• System deployment

Page 10: Devops at Netflix (re:Invent)

Automation

• Standard base image

• Tools to manage all the systems

• Automated code deployment

Page 11: Devops at Netflix (re:Invent)

Shared state should be stored in a shared service

Data on an instance should be replicated to other instances

Page 12: Devops at Netflix (re:Invent)

“Build for Three”

We hold a boot camp for new engineers to teach them how to build for a highly distributed environment.

Page 13: Devops at Netflix (re:Invent)

Page 14: Devops at Netflix (re:Invent)

Netflix on AWS

[Timeline figure. Surviving labels: 2012; IPv6; Open Connect]

Page 15: Devops at Netflix (re:Invent)

Highly aligned, loosely coupled

• Services are built by different teams who work together to figure out what each service will provide.

• The service owner publishes an API that anyone can use.

Page 16: Devops at Netflix (re:Invent)

Advantages to a Service Oriented Architecture

• Easier auto-scaling

• Easier capacity planning

• Identify problematic code-paths more easily

• Narrow down the effects of a change

• More efficient local caching

Page 17: Devops at Netflix (re:Invent)

Freedom and Responsibility

• Developers deploy when they want

• They also manage their own capacity and autoscaling

• And fix anything that breaks at 4am!

Page 18: Devops at Netflix (re:Invent)

All systems choices assume some part will fail at some point.

Page 19: Devops at Netflix (re:Invent)

The Monkey Theory

• Simulate things that go wrong

• Find things that are different

Page 20: Devops at Netflix (re:Invent)

Execution

Photo from I, Robot, copyright 20th Century Fox

Page 21: Devops at Netflix (re:Invent)

Netflix built a global PaaS

• Service Oriented Architecture

• HTTP/REST interfaces between services

Page 22: Devops at Netflix (re:Invent)

Netflix PaaS features

• Supports all regions and zones

• Multiple accounts

• Cross region/account replication

• Internationalized, localized and GeoIP routed

• Advanced key management

• Autoscaling with 1000s of instances

• Monitoring and alerting on millions of metrics

Page 23: Devops at Netflix (re:Invent)

What AWS Provides

• Instances

• Machine Images

• Elastic IPs

• Load Balancers

• Security groups / Autoscaling groups

• Availability zones and regions

Page 24: Devops at Netflix (re:Invent)

[Diagram: contents of a Java application instance. Labels:]

• Linux Base AMI (CentOS or Ubuntu)

• Optional Apache

• Java (JDK 6 or 7)

• Tomcat

• Application war file, base servlet, platform, interface jars for dependent services

• Healthcheck, status servlets, JMX interface, Servo autoscale

• AppDynamics App Agent monitoring

• GC and thread dump logging

• Monitoring: AppDynamics Machine Agent

• Log rotation to S3

Page 25: Devops at Netflix (re:Invent)

The Netflix Platform

• Discovery (Eureka)

• Entrypoints (Edda)

• Configuration (Archaius)

• Zookeeper (Exhibitor)

• Logging (Blitz4j & Honu)

• NIWS

• GeoBase

• Circuit Breakers (Hystrix)

• Cassandra (Priam & Astyanax & CassJMeter)

• Cryptex

• AKMS

• EvCache

• Proxies

• i18n

• L10n

[Diagram legend: Open Source]

Page 26: Devops at Netflix (re:Invent)

Page 27: Devops at Netflix (re:Invent)

Open Source at Netflix

[Timeline figure, months Nov through Oct with 2012 marked. Projects shown: Curator, Astyanax, Servo, Priam, CassJMeter, Exhibitor, Archaius, Asgard, Chaos Monkey, Eureka, Governator, Blitz4j, Edda, Hystrix]

Page 28: Devops at Netflix (re:Invent)

Finding things

• Discovery (Eureka)

  • Application to instance mapping

  • Heartbeat to keep track of health

• Entrypoints (Edda)

  • Local database of AWS resources

• NIWS (Netflix Internal Web Service)

  • On instance software load balancer

  • Handles retry logic

• Geo (Geolocation library)

  • Provides IP to Lat/Lon mapping for any service that needs it.
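The application-to-instance mapping with heartbeats can be sketched as a small in-memory registry. This is only an illustration of the idea, not Eureka's API: the `Registry` class, its method names, and the 90-second default timeout are all assumptions made for this example.

```python
import time


class Registry:
    """Toy service registry: maps application names to live instances."""

    def __init__(self, timeout_seconds=90.0, clock=time.monotonic):
        # timeout_seconds and clock are hypothetical knobs for this sketch.
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = {}  # (app, instance_id) -> time of last heartbeat

    def heartbeat(self, app, instance_id):
        """Register the instance or renew its lease."""
        self._last_seen[(app, instance_id)] = self._clock()

    def instances(self, app):
        """Return instance IDs for app whose heartbeat is still fresh."""
        now = self._clock()
        return sorted(
            iid
            for (name, iid), seen in self._last_seen.items()
            if name == app and now - seen <= self._timeout
        )
```

An instance that stops sending heartbeats simply ages out of the result of `instances()`, so callers never need to be told explicitly that it died.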

Page 29: Devops at Netflix (re:Invent)

Entrypoints (Edda)

• REST API

• GET /REST/v2/instance/$id

• Keeps track of all resources

• Autoscaling groups, EIPs, Instances, Applications, Clusters, History

Page 30: Devops at Netflix (re:Invent)

Entrypoints Exploration

Find all active instances

GET /REST/v2/view/instances

Find all instances in a cluster

GET /REST/v2/group/clusters

Show only ASG name, instance ID and health

/v2/aws/autoScalingGroups/edda-v123;_pp:(autoScalingGroupName,instances:(instanceId,lifecycleState))

Which ASG contains a particular instance?

/v2/aws/autoScalingGroups;instances.instanceId=i-96f3ca3a
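The last two queries share a shape: a collection path, an optional resource ID, then semicolon-separated arguments. A minimal sketch of a helper that assembles such paths follows. The function `edda_path` and its parameters are hypothetical; it reproduces only the URL shapes shown on this slide and is not part of Edda.

```python
def edda_path(collection, resource_id=None, filters=None, fields=None):
    """Build an Edda-style path with matrix arguments.

    Mirrors only the URL shapes shown on the slide; the helper itself
    is hypothetical and not part of Edda.
    """
    path = "/v2/" + collection
    if resource_id is not None:
        path += "/" + resource_id
    # Field filters, e.g. ;instances.instanceId=i-96f3ca3a
    for key, value in (filters or {}).items():
        path += ";{}={}".format(key, value)
    # Field selection, e.g. ;_pp:(autoScalingGroupName,...)
    if fields is not None:
        path += ";_pp:(" + fields + ")"
    return path
```

For example, `edda_path("aws/autoScalingGroups", filters={"instances.instanceId": "i-96f3ca3a"})` yields the "which ASG contains this instance" path above.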

Page 31: Devops at Netflix (re:Invent)

Keeping it all Straight

• Configuration (Archaius)

• Global variables (Fast properties)

• Base

• Base system. Prod vs. Test, etc

• Zookeeper (Curator)

• Locks, other similar coordination

• Logging (Blitz4j and Honu)

• Keep track of what happened and store it for post analysis.
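The "fast properties" idea, a global variable that can change while the service is running, can be sketched as a value holder that notifies listeners on change. This is an assumption-laden illustration, not Archaius's API: the `DynamicProperty` class and the property name `api.timeout_ms` are invented for the example.

```python
class DynamicProperty:
    """A named value that can change at runtime and notify listeners."""

    def __init__(self, name, default):
        self.name = name
        self._value = default
        self._callbacks = []

    def get(self):
        """Return the current value; callers read it on every use."""
        return self._value

    def add_callback(self, callback):
        """Register callback(old, new), run whenever the value changes."""
        self._callbacks.append(callback)

    def set(self, new_value):
        """Update the value and notify listeners only on a real change."""
        old, self._value = self._value, new_value
        if old != new_value:
            for callback in self._callbacks:
                callback(old, new_value)


# Hypothetical usage: tune a timeout without redeploying.
timeout = DynamicProperty("api.timeout_ms", 500)
timeout.set(250)
```

Because code calls `get()` each time instead of caching the value, a change takes effect without a restart.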

Page 32: Devops at Netflix (re:Invent)

Keeping it Secure

• Cryptex

• Service for key management

• High, medium and low value keys

• AKMS (Amazon Key Management System)

• Hands out keys to instances (and dev boxes) so they don’t have to store the key on the instance

For more info, see SEC201: Security Panel

Page 33: Devops at Netflix (re:Invent)

Storing it

• Cassandra (Priam, Astyanax)

• Configure and access Cassandra

• Provide OO abstractions; handle connection pooling and discovery of hosts

• EVCache (Eccentric Volatile Cache)

• Wrapper for memcached to handle zone awareness and replication

• Proxies

• Get data out of the datacenter and into the cloud.
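Zone awareness and replication in a cache wrapper can be sketched as "write to every zone, read from the local zone first". The class below is a toy under stated assumptions: plain dictionaries stand in for per-zone memcached clusters, and `ZoneAwareCache`, `drop_zone`, and the zone names are hypothetical, not EVCache's API.

```python
class ZoneAwareCache:
    """Writes go to every zone; reads prefer the local zone's copy."""

    def __init__(self, zones, local_zone):
        if local_zone not in zones:
            raise ValueError("local_zone must be one of zones")
        self._local = local_zone
        # One dict per zone stands in for that zone's memcached cluster.
        self._stores = {zone: {} for zone in zones}

    def set(self, key, value):
        """Replicate the write to all zones."""
        for store in self._stores.values():
            store[key] = value

    def get(self, key):
        """Try the local zone first, then fall back to the other zones."""
        order = [self._local] + [z for z in self._stores if z != self._local]
        for zone in order:
            if key in self._stores[zone]:
                return self._stores[zone][key]
        return None

    def drop_zone(self, zone):
        """Simulate losing all cached data in one zone."""
        self._stores[zone].clear()
```

Reads stay inside the zone in the common case, and losing one zone's cache costs a cross-zone read rather than a miss.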

Page 34: Devops at Netflix (re:Invent)

Data

What do we do with it all?

Page 35: Devops at Netflix (re:Invent)

We store it!

• Cache (memcached)

• Cassandra

• RDS (MySQL)

Page 36: Devops at Netflix (re:Invent)

Cassandra

Page 37: Devops at Netflix (re:Invent)

Why Cassandra?

• Availability over consistency

• Writes over reads

• We know Java

• Open source + support

Page 38: Devops at Netflix (re:Invent)

Using Cassandra at Netflix

• Priam

• Zero touch auto-config

• State management

• Token assignment

• Node replacement

• Backup/restore to/from S3

• Astyanax

• OO abstraction to Cassandra

• Multi-region support
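To make "token assignment" concrete, here is a minimal sketch that spaces tokens evenly around the ring of Cassandra's RandomPartitioner, whose tokens range from 0 up to 2^127. This is the simplest textbook scheme, shown as an assumption; the slide does not describe Priam's actual algorithm, which may differ (for example to account for zones and regions).

```python
RING_SIZE = 2 ** 127  # token range of Cassandra's RandomPartitioner


def even_tokens(node_count):
    """Return node_count tokens spaced evenly around the ring."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [i * RING_SIZE // node_count for i in range(node_count)]
```

With evenly spaced tokens each node owns the same share of the key space, which is why automating the assignment matters when nodes are replaced.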

Page 39: Devops at Netflix (re:Invent)

Page 40: Devops at Netflix (re:Invent)

Page 41: Devops at Netflix (re:Invent)

Cassandra Architecture

Page 42: Devops at Netflix (re:Invent)

Cassandra Architecture

For more info, see DAT202: Optimizing your Cassandra Database on AWS

Page 43: Devops at Netflix (re:Invent)

Tools

• Asgard

• AWS usage

• Atlas

• Chronos

• Build system

• Explorers (Cassandra and SimpleDB)

Page 44: Devops at Netflix (re:Invent)

Page 45: Devops at Netflix (re:Invent)

Auto Scaling Group

Launch Configuration

Security Group

Amazon Machine Image

Instances

Elastic Load Balancer

Page 46: Devops at Netflix (re:Invent)

api-usprod-v007

api-frontend

api-usprod-v008

Page 47: Devops at Netflix (re:Invent)

api-usprod-v007

api-frontend

api-usprod-v008

Page 48: Devops at Netflix (re:Invent)

Page 49: Devops at Netflix (re:Invent)

Page 50: Devops at Netflix (re:Invent)

Page 51: Devops at Netflix (re:Invent)

Netflix has moved the granularity from the instance to the cluster
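The earlier slides show two versioned groups, api-usprod-v007 and api-usprod-v008, behind one api-frontend. Treating the cluster as the unit of deployment suggests each push gets the next version suffix. The sketch below derives that name; the numbering rule is inferred from those two names only, and `next_group_name` is a hypothetical helper, not Asgard code.

```python
import re


def next_group_name(current):
    """Return the name of the next versioned group, e.g. -v007 -> -v008."""
    match = re.fullmatch(r"(.+)-v(\d+)", current)
    if match is None:
        # First push: start the sequence at v000 (an assumption).
        return current + "-v000"
    prefix, number = match.groups()
    # Keep the zero padding as wide as the existing suffix.
    return "{}-v{:0{width}d}".format(prefix, int(number) + 1, width=len(number))
```

A new group can then be launched next to the old one, so a deployment replaces a whole group rather than individual instances.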

Page 52: Devops at Netflix (re:Invent)

Why Bake?

Traditional: Generic AMI → Instance

• launch OS

• install packages

• install app

Netflix: App AMI → Instance

• launch OS+app

Page 53: Devops at Netflix (re:Invent)

Getting Baked

[Build pipeline diagram. Labels: Perforce / Git; libraries; source; Ant targets; Ivy; Groovy all over; snapshot / release libraries / apps; app bundles; Jenkins; sync; resolve; build; compile; report; publish; test; Artifactory]

Page 54: Devops at Netflix (re:Invent)

Base Image Baking

[Bakery diagram. Labels: Yum / Apt; Linux: CentOS, Fedora, Ubuntu; AWS; RPMs: Apache, Java...; ec2 slave instances; S3 / EBS; foundation AMI; base AMI; Bakery; mount; install; snapshot; Ready for app bake]

Page 55: Devops at Netflix (re:Invent)

App Image Baking

[Bakery diagram. Labels: Jenkins / Yum / Artifactory; Linux, Apache, Java, Tomcat; AWS; app bundle; ec2 slave instances; S3 / EBS; base AMI; app AMI; Bakery; mount; install; snapshot; Ready to launch!]

Page 56: Devops at Netflix (re:Invent)

[Diagram: contents of a Java application instance. Labels:]

• Linux Base AMI (CentOS or Ubuntu)

• Optional Apache

• Java (JDK 6 or 7)

• Tomcat

• Application war file, base servlet, platform, interface jars for dependent services

• Healthcheck, status servlets, JMX interface, Servo autoscale

• AppDynamics App Agent monitoring

• GC and thread dump logging

• Monitoring: AppDynamics Machine Agent

• Log rotation to S3

Page 57: Devops at Netflix (re:Invent)

[Diagram: the same instance layout with JBoss in place of Tomcat. Labels:]

• Linux Base AMI (CentOS or Ubuntu)

• Optional Apache

• Java (JDK 6 or 7)

• JBoss

• Application war file, base servlet, platform, interface jars for dependent services

• Healthcheck, status servlets, JMX interface, Servo autoscale

• AppDynamics App Agent monitoring

• GC and thread dump logging

• Monitoring: AppDynamics Machine Agent

• Log rotation to S3

Page 58: Devops at Netflix (re:Invent)

[Diagram: the same instance layout for a Python application. Labels:]

• Linux Base AMI (CentOS or Ubuntu)

• Optional Apache

• Python

• Django

• Application file, base server, platform, interface libs for dependent services

• logging

• Monitoring: AppDynamics Machine Agent monitoring

• Log rotation to S3

Page 59: Devops at Netflix (re:Invent)

The Monkey Theory

• Simulate things that go wrong

• Find things that are different

Page 60: Devops at Netflix (re:Invent)

The Simian Army

• Chaos -- Kills random instances

• Chaos Gorilla -- Kills zones

• Chaos Kong -- Kills regions

• Latency -- Degrades network and injects faults

• Conformity -- Looks for outliers

• Circus -- Kills and launches instances to maintain zone balance

• Doctor -- Fixes unhealthy resources

• Janitor -- Cleans up unused resources

• Howler -- Yells about bad things like Amazon limit violations

• Security -- Finds security issues and expiring certificates

For more info, see ARC301: Intro to Chaos Monkey & the Simian Army
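The first monkey, killing random instances, reduces to a small selection routine. The sketch below only chooses victims and terminates nothing; `pick_victims`, its `probability` parameter, and the group names are assumptions for illustration, not Chaos Monkey's real interface.

```python
import random


def pick_victims(groups, probability=1.0, rng=None):
    """Pick at most one random instance from each group to terminate.

    groups maps a group name to a list of instance IDs. probability is
    the chance that a given group loses an instance on this run.
    """
    rng = rng or random.Random()
    victims = {}
    for name, instances in groups.items():
        # Skip empty groups; otherwise roll the dice for this group.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


# Hypothetical usage with made-up instance IDs.
print(pick_victims({"api-usprod-v008": ["i-1", "i-2", "i-3"]}))
```

Selecting per group rather than across the whole fleet means every service gets exercised, which fits the premise that any part can fail at some point.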

Page 61: Devops at Netflix (re:Invent)

What’s going on?!

Page 62: Devops at Netflix (re:Invent)

Atlas

Page 63: Devops at Netflix (re:Invent)

Example Alert Config

{
  "clusters": [
    "epic_aggregator",
    "epic_aggregator-dev"
  ],
  "alerts": [
    // you can use javascript style comments in the config
    {
      "metricName": "EpicPlugin_NumDropped",
      "applyTo": "cluster",
      "condition": {
        "type": "StaticThreshold",
        "max": 0.0
      },
      "severity": "major",
      "description": "plugin is dropping metrics"
    },
    {
      "metricName": "EpicPlugin_NumDropped_Instance",
      "applyTo": "instance",
      "condition": {
        "type": "NumOccurrences",
        "num": 4,
        "condition": {
          "type": "StaticThreshold",
          "max": 0.0
        }
      },
      "overrides": {
        "service_key_override": "12345",
        "require_instance_status_not_in": ["DOWN", "OUT_OF_SERVICE"],
        "email_override": "[email protected]"
      },
      "severity": "minor"
    },
    {
      "metricName": "EpicPlugin_MetricCount",
      "applyTo": "instance",
      "description": "${instanceId} is reporting too many metrics",
      "condition": {
        "type": "NumOccurrences",
        "num": 4,
        "condition": {
          "type": "StaticThreshold",
          "max": 0.0
        }
      },
      "additionalDetails": {
        "statusUrl": "http://${publicDnsName}:7001/Status",
        "nacClusterUrl": "nac${env}/${region}/cluster/show/${cluster}"
      },
      "overrides": {
        "subject": "${instanceId} is reporting too many metrics",
        "incident_key": "${metricName}:${instanceId}",
        "service_key_override": "12345",
        "email_override": "[email protected]"
      },
      "severity": "minor"
    }
  ]
}
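The nested conditions in this config can be read as a small recursive check. The sketch below is one plausible interpretation and not the alerting system's implementation: it assumes StaticThreshold fires when the latest datapoint exceeds "max", and that NumOccurrences requires the nested condition to hold for the last "num" datapoints in a row. Both meanings, and the function `violates`, are assumptions.

```python
def violates(condition, values):
    """Return True if the series `values` (oldest first) trips `condition`."""
    kind = condition["type"]
    if kind == "StaticThreshold":
        # Assumed meaning: the latest datapoint exceeds "max".
        return bool(values) and values[-1] > condition["max"]
    if kind == "NumOccurrences":
        # Assumed meaning: the nested condition held for the last
        # "num" datapoints in a row.
        num = condition["num"]
        if len(values) < num:
            return False
        return all(
            violates(condition["condition"], values[: len(values) - i])
            for i in range(num)
        )
    raise ValueError("unknown condition type: " + kind)
```

Wrapping a threshold in an occurrence count is a common way to keep a single noisy datapoint from paging someone.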

Page 64: Devops at Netflix (re:Invent)

Alert Tuning

Page 65: Devops at Netflix (re:Invent)

Alert Systems

[Alerting architecture diagram. Labels: alerting; api; CORE Event Gateway; Paging Service; Amazon SES; CORE Agent; Other Team’s Agent; Atlas; AppDynamics]

Page 66: Devops at Netflix (re:Invent)

Page 67: Devops at Netflix (re:Invent)

Chronos

Page 68: Devops at Netflix (re:Invent)

Data Collection Pipeline

Data Processing Pipeline

For more info, see BDT303: Data Science with Elastic MapReduce

Page 69: Devops at Netflix (re:Invent)

Chukwa/Honu messages / min

63 billion messages a day
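For scale, the daily figure can be converted into average rates. The snippet uses only the 63 billion number from the slide; real traffic will peak well above an average, so these are rough orders of magnitude.

```python
MESSAGES_PER_DAY = 63_000_000_000  # figure from the slide

per_minute = MESSAGES_PER_DAY / (24 * 60)
per_second = MESSAGES_PER_DAY / (24 * 60 * 60)

print(f"{per_minute:,.0f} messages per minute on average")
print(f"{per_second:,.0f} messages per second on average")
```

That works out to an average of 43.75 million messages per minute, or roughly 729,000 per second.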

Page 70: Devops at Netflix (re:Invent)

Best Practices

Page 71: Devops at Netflix (re:Invent)

Incident Reviews

Ask the key questions:

• What went wrong?

• How could we have detected it sooner?

• How could we have prevented it?

• How can we prevent this class of problem in the future?

• How can we improve our behavior for next time?

Page 72: Devops at Netflix (re:Invent)

Best Practices for Data

• Have multiple copies of all data

• Keep those copies in multiple AZs

• Avoid keeping state on a single instance

• Take frequent snapshots of EBS disks

• No secret keys on the instance

Page 73: Devops at Netflix (re:Invent)

Netflix autoscaling

[Graph. Labels: Traffic Peak; Deployment; callouts 1 and 2]

Page 74: Devops at Netflix (re:Invent)

AWS Usage

Dollar amounts have been carefully removed

Page 75: Devops at Netflix (re:Invent)

Going multi-zone

Page 76: Devops at Netflix (re:Invent)

Benefits of Amazon’s Zones

• Loosely connected

• Low latency between zones

• 99.95% uptime guarantee per region
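A percentage like 99.95% is easier to reason about as allowed downtime. The helper below does that conversion; the 365-day year and 30-day month are simplifying assumptions, and `allowed_downtime_hours` is a name made up for this example.

```python
def allowed_downtime_hours(availability_percent, period_hours):
    """Hours of downtime that still meet the availability target."""
    return period_hours * (1 - availability_percent / 100)


yearly = allowed_downtime_hours(99.95, 365 * 24)
monthly = allowed_downtime_hours(99.95, 30 * 24)

print(f"{yearly:.2f} hours per year")
print(f"{monthly * 60:.1f} minutes per 30-day month")
```

At 99.95% that is about 4.38 hours per year, or about 21.6 minutes per 30-day month.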

Page 77: Devops at Netflix (re:Invent)

Going Multi-region

Page 78: Devops at Netflix (re:Invent)

Leveraging Multi-region

• 100% uptime is theoretically possible.

• You have to replicate your data

• This will cost money
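A back-of-the-envelope model shows how quickly redundancy across regions pushes availability toward 100%. It assumes that regions fail independently and that the service is up whenever any one region is up. Both are simplifications made for this sketch, and correlated outages would make the real figure lower.

```python
def combined_availability(per_region, regions):
    """Availability when the service is up if any one region is up.

    Assumes regions fail independently, which real outages may violate.
    """
    return 1 - (1 - per_region) ** regions


for n in (1, 2, 3):
    print(n, "region(s):", f"{combined_availability(0.9995, n):.10f}")
```

Two regions at 99.95% each give about 99.999975% under this model, which is only achievable if the data is replicated so that either region can serve any request.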

Page 79: Devops at Netflix (re:Invent)

Circuit Breakers (Hystrix)

Be liberal in what you accept, strict in what you send
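The circuit breaker pattern itself fits in a few lines. The sketch below is a generic illustration in Python and not Hystrix's API (Hystrix is a Java library): the `CircuitBreaker` class, its thresholds, and the fallback convention are all assumptions.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors, then retry after a cool-down."""

    def __init__(self, max_failures=3, reset_seconds=30.0, clock=time.monotonic):
        self._max_failures = max_failures
        self._reset_seconds = reset_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        """Run func(); on failure or open circuit, return fallback()."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_seconds:
                return fallback()  # open: skip the dependency entirely
            # Cool-down over: allow one trial call; one more failure
            # re-opens the circuit immediately.
            self._opened_at = None
            self._failures = self._max_failures - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        return result
```

While the circuit is open the failing dependency is not called at all, which protects both the caller's threads and the struggling service.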

Page 80: Devops at Netflix (re:Invent)

Just a quick reminder...

• (Some of) Netflix is open source:

• https://github.com/netflix

Page 81: Devops at Netflix (re:Invent)

We are sincerely eager to hear your feedback on this presentation and on re:Invent.

Please fill out an evaluation form when you have a chance.

Page 82: Devops at Netflix (re:Invent)

Questions?

Page 83: Devops at Netflix (re:Invent)

Getting in touch

Email: jedberg@{gmail,netflix}.com

Twitter: @jedberg

Web: www.jedberg.net

Facebook: facebook.com/jedberg

LinkedIn: www.linkedin.com/in/jedberg

