Devops at Netflix (re:Invent)

DESCRIPTION

How Netflix operates for maximum freedom and agility. Video here: https://www.youtube.com/watch?v=s0rCGFetdtM

Text of Devops at Netflix (re:Invent)

  • 1. Rainmakers: How Netflix Operates Clouds for Maximum Freedom and Agility. Jeremy Edberg, Reliability Architect, Netflix

  • 2. Do you have...
      • A release engineer?
      • A QA department?
      • Chef or Puppet to manage your systems?
    Tweet @jedberg with feedback!
  • 3. Do you have...
      • Upwards of 100 releases a day?
  • 5. With more than 30 million streaming members in the United States, Canada, Latin America, the United Kingdom, Ireland and the Nordics, Netflix is the world's leading internet subscription service for enjoying movies and TV programs streamed over the internet to PCs, Macs and TVs. Source: http://ir.netflix.com
  • 6. The Netflix Way
      • Everything is built for three
      • Fully automated build tools to test and make packages
      • Fully automated machine image bakery
      • Fully automated image deployment
      • Independent teams responsible for both Dev and Ops
  • 7. Philosophy
  • 8. Automate all the things!
  • 9. Automate all the things!
      • Application startup
      • Configuration
      • Code deployment
      • System deployment
  • 10. Automation
      • Standard base image
      • Tools to manage all the systems
      • Automated code deployment
  • 11. Shared state
      • Shared state should be stored in a shared service
      • Data on an instance should be replicated to other instances
  • 12. Build for Three. We hold a boot camp for new engineers to teach them how to build for a highly distributed environment.
  • 14. Netflix on AWS (diagram; labels: 2012, IPv6, Open Connect)
  • 15. Highly aligned, loosely coupled
      • Services are built by different teams who work together to figure out what each service will provide.
      • The service owner publishes an API that anyone can use.
  • 16. Advantages to a Service Oriented Architecture
      • Easier auto-scaling
      • Easier capacity planning
      • Identify problematic code paths more easily
      • Narrow in on the effects of a change
      • More efficient local caching
  • 17. Freedom and Responsibility
      • Developers deploy when they want
      • They also manage their own capacity and autoscaling
      • And fix anything that breaks at 4am!
  • 18. All systems choices assume some part will fail at some point.
  • 19. The Monkey Theory
      • Simulate things that go wrong
      • Find things that are different
  • 20. Execution (photo from I, Robot, copyright 20th Century Fox)
  • 21. Netflix built a global PaaS
      • Service Oriented Architecture
      • HTTP/REST interfaces between services
  • 22. Netflix PaaS features
      • Supports all regions and zones
      • Multiple accounts
      • Cross region/account replication
      • Internationalized, localized and GeoIP routed
      • Advanced key management
      • Autoscaling with 1000s of instances
      • Monitoring and alerting on millions of metrics
  • 23. What AWS Provides
      • Instances
      • Machine Images
      • Elastic IPs
      • Load Balancers
      • Security groups / Autoscaling groups
      • Availability zones and regions
  • 24. Linux Base AMI (CentOS or Ubuntu) (diagram labels)
      • Optional Apache
      • Monitoring
      • Log rotation to S3
      • AppDynamics Machine Agent
      • Java (JDK 6 or 7)
      • AppDynamics App Agent monitoring
      • GC and thread dump logging
      • Tomcat
      • Application war file, base servlet, platform, interface jars for dependent services
      • Healthcheck, status servlets, JMX interface, Servo autoscale
  • 25. The Netflix Platform (diagram labels; legend: Open Source)
      • Discovery (Eureka)
      • Entrypoints (Edda)
      • Circuit Breakers (Hystrix)
      • Configuration (Archaius)
      • Cassandra (Priam & Astyanax & CassJMeter)
      • Zookeeper (Exhibitor)
      • Cryptex
      • Logging (Blitz4j & Honu)
      • AKMS
      • EVCache
      • NIWS
      • Proxies
      • i18n
      • Geo
      • L10n
      • Base
  • 27. Open Source at Netflix (2012 release timeline figure; text not recoverable beyond the project names Edda, Blitz4j, Hystrix and Governator)
  • 28. Finding things
      • Discovery (Eureka): application to instance mapping; heartbeat to keep track of health
      • Entrypoints (Edda): local database of AWS resources
      • NIWS (Netflix Internal Web Service): on-instance software load balancer; handles retry logic
      • Geo (Geolocation library): provides IP to lat/lon mapping for any service that needs it
  • 29. Entrypoints (Edda)
      • REST API: GET /REST/v2/instance/$id
      • Keeps track of all resources: Autoscaling groups, EIPs, Instances, Applications, Clusters, History
  • 30. Entrypoints Exploration
      • Find all active instances: GET /REST/v2/view/instances
      • Find all instances in a cluster: GET /REST/v2/group/clusters
      • Show only ASG name, instance ID and health: /v2/aws/autoScalingGroups/edda-v123;_pp:(autoScalingGroupName,instances:(instanceId,lifecycleState))
      • Which ASG contains a particular instance? /v2/aws/autoScalingGroups;instances.instanceId=i-96f3ca3a
  • 31. Keeping it all Straight
      • Configuration (Archaius): global variables (Fast properties)
      • Base: base system. Prod vs. Test, etc.
      • Zookeeper (Curator): locks, other similar coordination
      • Logging (Blitz4j and Honu): keep track of what happened and store it for post analysis
  • 32. Keeping it Secure
      • Cryptex: service for key management; high, medium and low value keys
      • AKMS (Amazon Key Management System): hands out keys to instances (and dev boxes) so they don't have to store the key on the instance
    For more info, see SEC201: Security Panel
  • 33. Storing it
      • Cassandra (Priam, Astyanax): configure and access Cassandra; provide OO abstractions, handle connection pooling and discovery of hosts
      • EVCache (Ephemeral Volatile Cache): wrapper for memcached to handle zone awareness and replication
      • Proxies: get data out of the datacenter and into the cloud
  • 34. Data: What do we do with it all?
  • 35. We store it!
      • Cache (memcached)
      • Cassandra
      • RDS (MySQL)
  • 36. Cassandra
  • 37. Why Cassandra?
      • Availability over consistency
      • Writes over reads
      • We know Java
      • Open source + support
  • 38. Using Cassandra at Netflix
      • Priam: zero touch auto-config; state management; token assignment; node replacement; backup/restore to/from S3
      • Astyanax: OO abstraction to Cassandra; multi-region support
  • 41. Cassandra Architecture
  • 42. Cassandra Architecture. For more info, see DAT202: Optimizing your Cassandra Database on AWS
  • 43. Tools
      • Asgard
      • AWS usage
      • Atlas
      • Chronos
      • Build system
      • Explorers (Cassandra and SimpleDB)
  • 45. (diagram labels) Elastic Load Balancer; Auto Scaling Group; Security Group; Instances; Launch Configuration; Amazon Machine Image
  • 46. (diagram labels) api-frontend; api-usprod-v007; api-usprod-v008
  • 47. (diagram labels) api-frontend; api-usprod-v007; api-usprod-v008
  • 51. Netflix has moved the granularity from the instance to the cluster
  • 52. Why Bake?
      • Traditional: Generic AMI, Instance; steps: launch OS, install packages, install app
      • Netflix: App AMI, Instance; step: launch OS+app
  • 53. Getting Baked (diagram labels)
      • source: Perforce / Git
      • Jenkins: resolve, test, publish, sync, compile, build, report
      • Ivy
      • Artifactory
      • snapshot / release libraries
      • libraries / apps
      • app bundles
      • Ant targets
      • Groovy all over
  • 54. Base Image Baking (diagram labels)
      • S3 / EBS
      • foundation AMI (Linux: CentOS, Fedora, Ubuntu)
      • base AMI (ready for app install)
      • mount, snapshot
      • Yum / Apt
      • RPMs: Apache, Java...
      • Bakery: bake
      • AWS
      • ec2 slave instances
  • 55. App Image Baking (diagram labels)
      • S3 / EBS
      • base AMI (Linux, Apache, Java, Tomcat)
      • app AMI (ready to launch!)
      • mount, snapshot
      • Jenkins / Yum / Artifactory
      • app bundle
      • Bakery: install
      • AWS
      • ec2 slave instances
  • 56. Linux Base AMI (CentOS or Ubuntu) (diagram labels)
      • Optional Apache
      • Monitoring
      • Log rotation to S3
      • AppDynamics Machine Agent
      • Java (JDK 6 or 7)
      • AppDynamics App Agent monitoring
      • GC and thread dump logging
      • Tomcat
      • Application war file, base servlet, platform, interface jars for dependent services
      • Healthcheck, status servlets, JMX interface, Servo autoscale
  • 57. Linux Base AMI (CentOS or Ubuntu) (diagram labels)
      • Optional Apache
      • Monitoring
      • Log rotation to S3
      • AppDynamics Machine Agent
      • Java (JDK 6 or 7)
      • AppDynamics App Agent monitoring
      • GC and thread dump logging
      • JBoss
      • Application war file, base servlet, platform, interface jars for dependent services
      • Healthcheck, status servlets, JMX interface, Servo autoscale
  • 58. Linux Base AMI (CentOS or Ubuntu) (diagram labels)
      • Optional Apache
      • Monitoring
      • Log rotation to S3
      • AppDynamics Machine Agent
      • Python
      • monitoring
      • logging
      • Django
      • Application file, base server, platform, interface libs for dependent services
  • 59. The Monkey Theory
      • Simulate things that go wrong
      • Find things that are different
  • 60. The simian army
      • Chaos -- Kills random instances
      • Chaos Gorilla -- Kills zones
      • Chaos Kong -- Kills regions
      • Latency -- Degrades network and injects faults
      • Conformity -- Looks for outliers
      • Circus -- Kills and launches instances to maintain zone balance
      • Doctor -- Fixes unhealthy resources
      • Janitor -- Cleans up unused resou
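The "Chaos" entry in the simian army list above (kills random instances) can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not Netflix's implementation: it makes no AWS calls, the function names `pick_victim` and `run_chaos` and the instance IDs are hypothetical, and the group names are borrowed from the api-usprod-v007 / api-usprod-v008 slide purely as labels. Each group starts with three instances, echoing the "built for three" idea from the deck.

```python
import random
from typing import Dict, List, Optional


def pick_victim(instances: List[str], rng: random.Random,
                probability: float = 1.0) -> Optional[str]:
    """Return one randomly chosen instance ID, or None if the group is spared.

    `probability` is the chance that a non-empty group loses an instance
    on this run (a hypothetical knob, not taken from the talk).
    """
    if not instances or rng.random() >= probability:
        return None
    return rng.choice(instances)


def run_chaos(groups: Dict[str, List[str]], rng: random.Random,
              probability: float = 1.0) -> Dict[str, str]:
    """Pick at most one victim per group and remove it from that group.

    Removing the ID from the list stands in for terminating the instance;
    a real tool would call the cloud provider's API instead.
    """
    killed: Dict[str, str] = {}
    for name, instances in groups.items():
        victim = pick_victim(instances, rng, probability)
        if victim is not None:
            instances.remove(victim)
            killed[name] = victim
    return killed


if __name__ == "__main__":
    # Hypothetical instance IDs; three per group.
    groups = {
        "api-usprod-v007": ["i-aaa", "i-bbb", "i-ccc"],
        "api-usprod-v008": ["i-ddd", "i-eee", "i-fff"],
    }
    killed = run_chaos(groups, random.Random())
    for name, victim in killed.items():
        print(f"{name}: terminated {victim}, {len(groups[name])} instances remain")
```

With three instances per group, losing one at random still leaves two serving traffic, which is the property such a test is meant to exercise.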