Rainmakers
How Netflix Operates Clouds for Maximum Freedom and Agility
Jeremy Edberg
Reliability Architect, Netflix
Tweet @jedberg with feedback!
Do you have...
• A release engineer?
• A QA department?
• Chef or Puppet to manage your systems?
Do you have...
• Upwards of 100 releases a day?
With more than 30 million streaming members in the United States,
Canada, Latin America, the United Kingdom, Ireland and the Nordics,
Netflix is the world's leading internet subscription service for enjoying
movies and TV programs streamed over the internet to PCs, Macs and TV.
Source: http://ir.netflix.com
The Netflix Way
• Everything is “built for three”
• Fully automated build tools to test and make packages
• Fully automated machine image bakery
• Fully automated image deployment
• Independent teams responsible for both Dev and Ops
Philosophy
Automate all the things!
• Application startup
• Configuration
• Code deployment
• System deployment
Automation
• Standard base image
• Tools to manage all the systems
• Automated code deployment
Shared state should be stored in a shared service
Data on an instance should be replicated to other instances
“Build for Three”
We hold a boot camp for new engineers to teach
them how to build for a highly distributed environment.
Netflix on AWS
[Figure: labels shown: 2012, IPv6, Open Connect]
Highly aligned, loosely coupled
• Services are built by different teams who work together to figure out what each service will provide.
• The service owner publishes an API that anyone can use.
Advantages to a Service Oriented Architecture
• Easier auto-scaling
• Easier capacity planning
• Identify problematic code-paths more easily
• Narrow down the effects of a change
• More efficient local caching
Freedom and Responsibility
• Developers deploy when they want
• They also manage their own capacity and autoscaling
• And fix anything that breaks at 4am!
All system choices assume some part will fail at some point.
The Monkey Theory
• Simulate things that go wrong
• Find things that are different
Execution
Photo from I, Robot, copyright 20th Century Fox
Netflix built a global PaaS
• Service Oriented Architecture
• HTTP/REST interfaces between services
Netflix PaaS features
• Supports all regions and zones
• Multiple accounts
• Cross region/account replication
• Internationalized, localized and GeoIP routed
• Advanced key management
• Autoscaling with 1000s of instances
• Monitoring and alerting on millions of metrics
What AWS Provides
• Instances
• Machine Images
• Elastic IPs
• Load Balancers
• Security groups / Autoscaling groups
• Availability zones and regions
[Diagram: instance contents, Tomcat variant]
• Linux Base AMI (CentOS or Ubuntu)
• Java (JDK 6 or 7)
• Tomcat
• Optional Apache
• Monitoring
• Log rotation to S3
• AppDynamics Machine Agent
• AppDynamics App Agent monitoring
• Application war file, base servlet, platform, interface jars for dependent services
• GC and thread dump logging
• Healthcheck, status servlets, JMX interface, Servo autoscale
The Netflix Platform
• Discovery (Eureka)
• Entrypoints (Edda)
• Configuration (Archaius)
• Zookeeper (Exhibitor)
• Logging (Blitz4j & Honu)
• NIWS
• Geo
• Base
• Circuit Breakers (Hystrix)
• Cassandra (Priam & Astyanax & CassJMeter)
• Cryptex
• AKMS
• EVCache
• Proxies
• i18n
• L10n
[Diagram legend: Open Source]
Open Source at Netflix
[Timeline: open source releases by month, Nov through Oct, with 2012 marked. Projects shown: Curator, Astyanax, Servo, Priam, CassJMeter, Exhibitor, Archaius, Asgard, Chaos Monkey, Eureka, Governator, Blitz4j, Edda, Hystrix]
Finding things
• Discovery (Eureka)
  • Application to instance mapping
  • Heartbeat to keep track of health
• Entrypoints (Edda)
  • Local database of AWS resources
• NIWS (Netflix Internal Web Service)
  • On-instance software load balancer
  • Handles retry logic
• Geo (Geolocation library)
  • Provides IP to Lat/Lon mapping for any service that needs it.
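The two Discovery bullets (application to instance mapping, heartbeats to track health) can be pictured with a small in-memory sketch. This is not Eureka's API: the `Registry` class, its method names, and the 90-second lease are assumptions made up for illustration.

```python
import time


class Registry:
    """Toy application-to-instance registry with heartbeat expiry (hypothetical)."""

    def __init__(self, lease_seconds=90.0, clock=time.monotonic):
        self.lease_seconds = lease_seconds  # assumed lease length
        self.clock = clock                  # injectable so tests can control time
        self._last_seen = {}                # (app, instance_id) -> last heartbeat time

    def heartbeat(self, app, instance_id):
        """Register an instance, or renew its lease if it is already known."""
        self._last_seen[(app, instance_id)] = self.clock()

    def instances(self, app):
        """Return the instance ids of `app` whose lease has not expired."""
        now = self.clock()
        return sorted(
            instance_id
            for (name, instance_id), seen in self._last_seen.items()
            if name == app and now - seen <= self.lease_seconds
        )
```

An instance that stops sending heartbeats simply drops out of `instances()` once its lease runs out, so callers never need to be told about the failure explicitly.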
Entrypoints (Edda)
• REST API
• GET /REST/v2/instance/$id
• Keeps track of all resources
• Autoscaling groups, EIPs, Instances, Applications, Clusters, History
Entrypoints Exploration
Find all active instances
GET /REST/v2/view/instances
Find all instances in a cluster
GET /REST/v2/group/clusters
Show only ASG name, instance ID and health
/v2/aws/autoScalingGroups/edda-v123;_pp:(autoScalingGroupName,instances:(instanceId,lifecycleState))
Which ASG contains a particular instance?
/v2/aws/autoScalingGroups;instances.instanceId=i-96f3ca3a
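The last two queries follow a visible pattern: a collection path, optional `;key=value` filters, and an optional `_pp:(...)` field selector. A minimal sketch that rebuilds exactly those two URL shapes is below. The helper names `edda_path` and `_render` are hypothetical, and the syntax is inferred only from the examples on this slide, not from Edda's documentation.

```python
def _render(fields):
    """Render a field list; a (parent, children) tuple becomes parent:(children)."""
    parts = []
    for field in fields:
        if isinstance(field, tuple):
            parent, children = field
            parts.append(parent + ":(" + _render(children) + ")")
        else:
            parts.append(field)
    return ",".join(parts)


def edda_path(collection, name=None, filters=None, fields=None):
    """Build a query path shaped like the slide's examples (hypothetical helper)."""
    path = "/v2/" + collection
    if name is not None:
        path += "/" + name
    for key, value in (filters or {}).items():
        path += ";{}={}".format(key, value)       # filter on a (possibly nested) attribute
    if fields is not None:
        path += ";_pp:(" + _render(fields) + ")"  # restrict the fields in the response
    return path
```

For example, `edda_path("aws/autoScalingGroups", filters={"instances.instanceId": "i-96f3ca3a"})` reproduces the "which ASG contains this instance" query.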
Keeping it all Straight
• Configuration (Archaius)
  • Global variables (Fast properties)
• Base
  • Base system. Prod vs. Test, etc.
• Zookeeper (Curator)
  • Locks, other similar coordination
• Logging (Blitz4j and Honu)
  • Keep track of what happened and store it for post analysis.
Keeping it Secure
• Cryptex
  • Service for key management
  • High, medium and low value keys
• AKMS (Amazon Key Management System)
  • Hands out keys to instances (and dev boxes) so they don’t have to store the key on the instance
For more info, see SEC201: Security Panel
Storing it
• Cassandra (Priam, Astyanax)
  • Configure and access Cassandra
  • Provide OO abstractions, handle connection pooling, discovery of hosts
• EVCache (Ephemeral Volatile Cache)
  • Wrapper for memcached to handle zone awareness and replication
• Proxies
  • Get data out of the datacenter and into the cloud.
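"Zone awareness and replication" for a cache can be sketched as: write every value to a cache in each zone, read from the local zone first. The class below is an illustration under those assumptions, not EVCache's API. Plain dicts stand in for memcached pools, and the fallback to other zones on a local miss is an assumed behavior.

```python
class ZoneAwareCache:
    """Toy zone-aware cache: replicate writes, prefer local reads (hypothetical)."""

    def __init__(self, zones, local_zone):
        if local_zone not in zones:
            raise ValueError("local_zone must be one of zones")
        self.local_zone = local_zone
        self._stores = {zone: {} for zone in zones}  # one dict per zone, standing in for memcached

    def set(self, key, value):
        """Write the value to the cache in every zone."""
        for store in self._stores.values():
            store[key] = value

    def get(self, key):
        """Read from the local zone; on a miss, try other zones and repair the local copy."""
        local = self._stores[self.local_zone]
        if key in local:
            return local[key]
        for zone, store in self._stores.items():
            if zone != self.local_zone and key in store:
                local[key] = store[key]
                return store[key]
        return None

    def drop_zone(self, zone):
        """Simulate losing every cached entry in one zone."""
        self._stores[zone].clear()
```

Because each zone holds a full copy, losing one zone's cache costs a slower read rather than a trip back to the source of truth.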
Data
What do we do with it all?
We store it!
• Cache (memcached)
• Cassandra
• RDS (MySQL)
Cassandra
Why Cassandra?
• Availability over consistency
• Writes over reads
• We know Java
• Open source + support
Using Cassandra at Netflix
• Priam
  • Zero touch auto-config
  • State management
  • Token assignment
  • Node replacement
  • Backup/restore to/from S3
• Astyanax
  • OO abstraction to Cassandra
  • Multi-region support
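To give a feel for what "token assignment" involves, here is the textbook way to space tokens evenly around a ring of size 2^127, the token range of Cassandra's RandomPartitioner. This is a sketch of the general idea only. It is not Priam's algorithm, and the function name `even_tokens` is hypothetical.

```python
RING_SIZE = 2 ** 127  # token range of Cassandra's RandomPartitioner


def even_tokens(node_count):
    """Return one token per node, spaced evenly around the ring."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [i * RING_SIZE // node_count for i in range(node_count)]
```

Doing this by hand for every new or replaced node is error-prone, which is the kind of work the "zero touch" bullets describe automating.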
Cassandra Architecture
For more info, see DAT202: Optimizing your Cassandra Database on AWS
Tools
• Asgard
• AWS usage
• Atlas
• Chronos
• Build system
• Explorers (Cassandra and SimpleDB)
[Diagram: Auto Scaling Group, Launch Configuration, Security Group, Amazon Machine Image, Instances, Elastic Load Balancer]
[Diagram: api-frontend with api-usprod-v007 and api-usprod-v008]
Netflix has moved the granularity from the instance to the cluster
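One way to picture cluster-level granularity: the names api-usprod-v007 and api-usprod-v008 suggest that a cluster's groups differ only in a `-vNNN` suffix, so a new deployment is a new numbered group rather than changes to individual instances. The sketch below computes the next name under that inferred convention. The function `next_group_name` is hypothetical and is not taken from Asgard.

```python
import re

# Assumed convention: <cluster>-v<three digits>, inferred from the slide's labels.
_VERSIONED = re.compile(r"^(?P<cluster>.+)-v(?P<version>\d{3})$")


def next_group_name(existing_names, cluster):
    """Return the name the next group in `cluster` would get (no wraparound handling)."""
    versions = []
    for name in existing_names:
        match = _VERSIONED.match(name)
        if match and match.group("cluster") == cluster:
            versions.append(int(match.group("version")))
    next_version = max(versions) + 1 if versions else 0
    return "{}-v{:03d}".format(cluster, next_version)
```

Keeping the previous group around until the new one is healthy is what makes a rollback a matter of shifting traffic back rather than redeploying.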
Why Bake?
Traditional (Generic AMI to Instance):
• launch OS
• install packages
• install app
Netflix (App AMI to Instance):
• launch OS+app
Getting Baked
[Diagram: build pipeline]
• Perforce / Git: libraries, source
• Jenkins steps shown: sync, resolve, build, compile, report, publish, test
• Ant targets, Ivy, Groovy all over
• Artifactory: snapshot / release libraries / apps, app bundles
Base Image Baking
[Diagram: base image bakery on AWS]
• Inputs: Yum / Apt; Linux: CentOS, Fedora, Ubuntu; RPMs: Apache, Java...
• Bakery: ec2 slave instances, S3 / EBS
• Labels shown: foundation AMI, mount, install, snapshot, base AMI
• Result: Ready for app bake
App Image Baking
[Diagram: app image bakery on AWS]
• Inputs: Jenkins / Yum / Artifactory; Linux, Apache, Java, Tomcat; app bundle
• Bakery: ec2 slave instances, S3 / EBS
• Labels shown: base AMI, mount, install, snapshot, app AMI
• Result: Ready to launch!
[Diagram: instance contents, JBoss variant]
• Linux Base AMI (CentOS or Ubuntu)
• Java (JDK 6 or 7)
• JBoss
• Optional Apache
• Monitoring
• Log rotation to S3
• AppDynamics Machine Agent
• AppDynamics App Agent monitoring
• Application war file, base servlet, platform, interface jars for dependent services
• GC and thread dump logging
• Healthcheck, status servlets, JMX interface, Servo autoscale
[Diagram: instance contents, Python variant]
• Linux Base AMI (CentOS or Ubuntu)
• Python
• Django
• Optional Apache
• Monitoring
• Log rotation to S3
• AppDynamics Machine Agent
• monitoring
• Application file, base server, platform, interface libs for dependent services
• logging
The simian army
• Chaos -- Kills random instances
• Chaos Gorilla -- Kills zones
• Chaos Kong -- Kills regions
• Latency -- Degrades network and injects faults
• Conformity -- Looks for outliers
• Circus -- Kills and launches instances to maintain zone balance
• Doctor -- Fixes unhealthy resources
• Janitor -- Cleans up unused resources
• Howler -- Yells about bad things like Amazon limit violations
• Security -- Finds security issues and expiring certificates
For more info, see ARC301: Intro to Chaos Monkey & the Simian Army
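The first monkey's behavior, killing random instances, can be sketched as a selection step: for each group, maybe pick one instance to terminate. The function below only chooses victims and terminates nothing. Its name, the per-group probability parameter, and the input shape are all assumptions for illustration, not Chaos Monkey's actual configuration.

```python
import random


def pick_victims(groups, probability, rng=None):
    """Choose at most one instance per group to terminate (hypothetical helper).

    groups: dict mapping a group name to a list of instance ids.
    probability: chance, per group, that one of its instances is chosen.
    """
    if rng is None:
        rng = random.Random()
    victims = {}
    for name, instances in sorted(groups.items()):
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims
```

Picking at most one instance per group keeps the experiment small: a service built for three should not notice a single loss.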
What’s going on?!
Atlas
Example Alert Config
{
  "clusters": [ "epic_aggregator", "epic_aggregator-dev" ],
  "alerts": [
    // you can use javascript style comments in the config
    {
      "metricName": "EpicPlugin_NumDropped",
      "applyTo": "cluster",
      "condition": { "type": "StaticThreshold", "max": 0.0 },
      "severity": "major",
      "description": "plugin is dropping metrics"
    },
    {
      "metricName": "EpicPlugin_NumDropped_Instance",
      "applyTo": "instance",
      "condition": {
        "type": "NumOccurrences",
        "num": 4,
        "condition": { "type": "StaticThreshold", "max": 0.0 }
      },
      "overrides": {
        "service_key_override": "12345",
        "require_instance_status_not_in": ["DOWN", "OUT_OF_SERVICE"],
        "email_override": "[email protected]"
      },
      "severity": "minor"
    },
    {
      "metricName": "EpicPlugin_MetricCount",
      "applyTo": "instance",
      "description": "${instanceId} is reporting too many metrics",
      "condition": {
        "type": "NumOccurrences",
        "num": 4,
        "condition": { "type": "StaticThreshold", "max": 0.0 }
      },
      "additionalDetails": {
        "statusUrl": "http://${publicDnsName}:7001/Status",
        "nacClusterUrl": "nac${env}/${region}/cluster/show/${cluster}"
      },
      "overrides": {
        "subject": "${instanceId} is reporting too many metrics",
        "incident_key": "${metricName}:${instanceId}",
        "service_key_override": "12345",
        "email_override": "[email protected]"
      },
      "severity": "minor"
    }
  ]
}
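The nested `condition` objects in the config read naturally as a small recursive evaluator. The sketch below is one possible reading, not the alerting system's implementation. It assumes `StaticThreshold` trips when the latest sample exceeds `max`, and that `NumOccurrences` trips when the inner condition holds for each of the last `num` samples. The function name `violates` is hypothetical.

```python
def violates(condition, samples):
    """Return True if `samples` (oldest first) trip `condition`."""
    kind = condition["type"]
    if kind == "StaticThreshold":
        # Assumed meaning: the most recent sample is above "max".
        return bool(samples) and samples[-1] > condition["max"]
    if kind == "NumOccurrences":
        # Assumed meaning: the inner condition holds at each of the last `num` samples.
        num = condition["num"]
        if len(samples) < num:
            return False
        return all(
            violates(condition["condition"], samples[: len(samples) - back])
            for back in range(num)
        )
    raise ValueError("unknown condition type: " + kind)
```

Under this reading, wrapping a threshold in `NumOccurrences` is a way to ignore a single noisy sample before an alert fires.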
Alert Tuning
Alert Systems
[Diagram components: alerting, api, CORE Event Gateway, Paging Service, Amazon SES, CORE Agent, Other Team’s Agent, Atlas, AppDynamics]
Chronos
Data Collection Pipeline
Data Processing Pipeline
For more info, see BDT303: Data Science with Elastic MapReduce
Chukwa/Honu messages / min
63 billion messages a day
Best Practices
Incident Reviews
Ask the key questions:
• What went wrong?
• How could we have detected it sooner?
• How could we have prevented it?
• How can we prevent this class of problem in the future?
• How can we improve our behavior for next time?
Best Practices for Data
• Have multiple copies of all data
• Keep those copies in multiple AZs
• Avoid keeping state on a single instance
• Take frequent snapshots of EBS disks
• No secret keys on the instance
Netflix autoscaling
[Chart: annotations shown: Traffic Peak, Deployment]
AWS Usage
Dollar amounts have been carefully removed
Going multi-zone
Benefits of Amazon’s Zones
• Loosely connected
• Low latency between zones
• 99.95% uptime guarantee per region
Going Multi-region
Leveraging Multi-region
• 100% uptime is theoretically possible.
• You have to replicate your data
• This will cost money
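A quick back-of-the-envelope calculation shows why a second region matters. It takes the 99.95% per-region figure quoted earlier and assumes regions fail independently, which is a simplifying assumption and not a claim from the slides. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(per_region, regions):
    """Availability when the service is up as long as any one region is up."""
    return 1.0 - (1.0 - per_region) ** regions


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR
```

With one region at 99.95% the expected downtime is about 4.38 hours a year. With two independent regions it falls to roughly 8 seconds, provided the data really is replicated so that either region can serve alone.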
Circuit Breakers (Hystrix)
Be liberal in what you accept, strict in what you send
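For readers new to the pattern, a circuit breaker stops calling a failing dependency for a while and serves a fallback instead. The class below is a minimal sketch of that general pattern, not Hystrix's API. The class name, thresholds, and the single-trial "half-open" behavior are assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: open after repeated failures, retry after a pause."""

    def __init__(self, max_failures=3, reset_seconds=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.clock = clock        # injectable so tests can control time
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def call(self, func, fallback):
        """Run func() unless the circuit is open; use fallback() on failure."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                return fallback()  # open: fail fast without touching the dependency
            # Pause elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        self.failures = 0
        self.opened_at = None      # success closes the circuit
        return result
```

Failing fast protects the caller's threads and gives the struggling dependency room to recover, instead of piling more requests onto it.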
Just a quick reminder...
• (Some of) Netflix is open source:
• https://github.com/netflix
We are sincerely eager to hear your feedback on this
presentation and on re:Invent.
Please fill out an evaluation form when you have a
chance.
Questions?
Getting in touch
Email: jedberg@{gmail,netflix}.com
Twitter: @jedberg
Web: www.jedberg.net
Facebook: facebook.com/jedberg
LinkedIn: www.linkedin.com/in/jedberg