AWS re:Invent - Optimizing Costs with AWS

Optimizing Costs with AWS

Coburn Watson, Manager - Cloud Performance

Netflix Inc.

With more than 30 million streaming members in the United States, Canada, Latin America, the United Kingdom, Ireland and the Nordics, Netflix, Inc. is the world's leading internet subscription service for enjoying movies and TV programs.

Source: http://ir.netflix.com

Agenda

• Rationale and High-level Methodology
• AWS resource-specific optimizations
• Performance Testing
• Results
• Q&A

Rationale and High-level Methodology

Rationale

• Applications operate at massive scale
• Across three regions and multiple zones per region
• Service oriented architecture
• Many moving parts (teams)
• Unconstrained deployment capabilities
• “Freedom and Responsibility” culture

Rationale, cont.

• Improve availability
• Avoid saturation of key resources
• Dynamically adjust capacity to meet workload demands
• Plan for increasing workloads
• Less focus on reducing current demand
• Maximize efficiency
• Balance OLTP and batch demands
• “That which is measured improves”

Deployment Example

• Asgard framework enables turnkey deployment (Netflix open-sourced)

• All engineers have full access

• Real-time reservation capacity

• Unconstrained ASG size limits

Methodology

• Manual
  • Weekly usage review; leverage Netflix “AWS usage” tool
  • Identify unexpected on-demand trends
  • Review reservation use efficiency
  • Trend “cost per key event” (e.g. cost per stream event, etc.)
• As-needed
  • Evaluate utilization and autoscaling efficiency for key services
• Automated
  • Weekly email to service teams with AWS usage trend (EC2, S3, SimpleDB)
  • Available reservations exposed real-time to engineers
  • Janitor Monkey
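The “cost per key event” trend from the manual review can be illustrated with a small calculation. This is a minimal sketch, not the Netflix “AWS usage” tool: the `WeeklyUsage` record, the function names, and all numbers are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WeeklyUsage:
    week: str        # label for the week (hypothetical format)
    cost_usd: float  # AWS cost attributed to the service that week
    key_events: int  # e.g. number of stream events served that week


def cost_per_event(usage: WeeklyUsage) -> float:
    """Cost per key event for one week; raises if no events were recorded."""
    if usage.key_events <= 0:
        raise ValueError(f"no key events recorded for {usage.week}")
    return usage.cost_usd / usage.key_events


def week_over_week_change(history):
    """Relative change in cost per event versus the previous week."""
    changes = []
    for prev, curr in zip(history, history[1:]):
        before, after = cost_per_event(prev), cost_per_event(curr)
        changes.append((curr.week, (after - before) / before))
    return changes


# Made-up numbers, for illustration only.
history = [
    WeeklyUsage("W1", cost_usd=1000.0, key_events=2_000_000),
    WeeklyUsage("W2", cost_usd=1100.0, key_events=2_500_000),
]
for week, change in week_over_week_change(history):
    print(f"{week}: cost per event changed by {change:+.1%}")
```

In this made-up example total cost rises while cost per event falls, which is the kind of distinction a per-event trend is meant to surface.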

AWS usage tool

• Pulls cost and usage information from AWS APIs

• Birds-eye view of usage
• Near real-time data
• Open sourcing plans for tool
• Decomposes by application

AWS usage automated email reports

• Weekly email to teams with 4-week cost trend on EC2, S3, and SimpleDB

Janitor Monkey

• Fully automated
• Seeks to reduce “unintentional” resource usage due to failed cleanup
• Cleans up the following resources
  • EC2 instances
  • EBS volumes
  • EBS snapshots
  • Launch Configurations*
  • Autoscaling Groups*
  • Security Groups*
• Reduces cost and clutter(*)
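A cleanup rule of the kind described here can be sketched as a filter over resource metadata. This is a hypothetical illustration and not Janitor Monkey's actual rule set: the `Volume` record, the 30-day retention period, and the volume IDs are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Volume:
    volume_id: str
    attached: bool
    last_attached: datetime  # when the volume was last attached to an instance


def cleanup_candidates(volumes, now, retention=timedelta(days=30)):
    """Return IDs of detached volumes that have been idle longer than `retention`."""
    return [
        v.volume_id
        for v in volumes
        if not v.attached and now - v.last_attached > retention
    ]


# Hypothetical inventory, for illustration only.
now = datetime(2012, 11, 28)
volumes = [
    Volume("vol-in-use", attached=True, last_attached=now - timedelta(days=90)),
    Volume("vol-stale", attached=False, last_attached=now - timedelta(days=45)),
    Volume("vol-recent", attached=False, last_attached=now - timedelta(days=3)),
]
print(cleanup_candidates(volumes, now))  # ['vol-stale']
```

Keeping a retention window before deletion is one way to avoid removing resources that were detached only briefly.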

AWS resource-specific optimizations:

EC2: Primary optimization goals

• Align services to relatively few instance categories
• Fewer, larger pools to work with
• Common classes (e.g. m2.*)
• Autoscale, autoscale, autoscale
• Identify workload components which can utilize excess reservation capacity

• Increase per-instance utilization (CPU, IO, Memory)

• Minimize duration of ASG “overlap” during code pushes

EC2: Autoscaling - Benefits

• Improved efficiency and availability
• Avoids setting fixed ASG “max” instance count arbitrarily high
• Optimize resource allocation for mixed workloads
• Batch activity can consume unused capacity during OLTP off-peak periods
• Insulate services from unexpected bursts in demand
• “Super Bowl” Effect
• Chained services that “scale together stay together”

EC2: Autoscaling - Challenges

• Effectively consuming the unused reservation capacity provided through autoscaling
• Problem compounded: Large services often scale up or down on the same schedule

[Figure: Unused Reservation Instance Hours* over time, 7/3/12 15:59 to 7/7/12 2:00; y-axis from 0 to 2,000. Annotation on the unused capacity: “Need to use this capacity”.]

* Fictitious volumes

EC2: Autoscaling Methodology

• Prioritize service migration to autoscaling
• Start with large services
• Work downstream to dependent services
• Identify metric to leverage for scaling alarm
• Rate-based (requests per second), or load-based (load average)
• More aggressive scale-up versus scale-down
• Netflix internal metrics published directly to CloudWatch with Servo *

* Netflix OSS library
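The “more aggressive scale-up versus scale-down” idea with a rate-based metric can be sketched as an asymmetric step policy. This is a minimal sketch under stated assumptions: the thresholds, the step percentages, and the function name are hypothetical, and this is not how Asgard or AWS Auto Scaling evaluate policies internally.

```python
import math


def desired_capacity(current, rps_per_instance,
                     scale_up_at=100.0, scale_down_at=50.0,
                     up_percent=20, down_percent=5,
                     min_size=2, max_size=500):
    """Next ASG size for an observed requests/sec per instance (all thresholds assumed)."""
    if rps_per_instance > scale_up_at:
        # Scale up aggressively: add a large share of current capacity, at least 1.
        target = current + max(1, math.ceil(current * up_percent / 100))
    elif rps_per_instance < scale_down_at:
        # Scale down conservatively: remove a small share, at least 1.
        target = current - max(1, math.floor(current * down_percent / 100))
    else:
        target = current
    # Keep the result inside the group's min/max bounds.
    return max(min_size, min(max_size, target))


print(desired_capacity(100, 120.0))  # 120: overloaded, grow by 20%
print(desired_capacity(100, 40.0))   # 95: underused, shrink by 5%
print(desired_capacity(100, 75.0))   # 100: inside the band, no change
```

The gap between the two thresholds acts as a dead band, which is one common way to reduce back-and-forth scaling.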

EC2: Autoscaling Methodology, cont.

• Validate with load tests
• Avoid double-jump or thrashing conditions
• Variable instance startup times can result in double-jump
• Autoscaling batch applications
• Leverage “scheduled actions”
• Maximize consumption of spare reservation capacity
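Choosing when a scheduled batch action should run can be framed as finding the hours with the most unused reserved capacity. The sketch below is illustrative only: the hourly OLTP profile, the reservation count, and the function names are made-up assumptions, not Netflix data or tooling.

```python
def spare_capacity(reserved, oltp_by_hour):
    """Reserved instances left unused by OLTP in each hour (never negative)."""
    return [max(0, reserved - used) for used in oltp_by_hour]


def best_batch_window(reserved, oltp_by_hour, window_hours):
    """Start hour and spare instance-hours of the best contiguous window.

    The window may wrap past the end of the day.
    """
    spare = spare_capacity(reserved, oltp_by_hour)
    hours = len(spare)

    def window_total(start):
        return sum(spare[(start + i) % hours] for i in range(window_hours))

    start = max(range(hours), key=window_total)
    return start, window_total(start)


# Hypothetical 24-hour OLTP instance counts: quiet overnight, peak in the evening.
oltp_by_hour = [200] * 6 + [600] * 12 + [1000] * 6
start, instance_hours = best_batch_window(1000, oltp_by_hour, window_hours=6)
print(start, instance_hours)  # 0 4800
```

The resulting start hour could then feed the start time of a scheduled action for the batch ASG.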

EC2: Simplify Autoscaling Configuration

• Expose Autoscaling capabilities through Asgard
• Scaling policies and scheduled actions

EC2: Autoscaling profile examples

[Figure: three example autoscaling profiles, labeled Healthy, Thrashing, and Double-Jump. Y-axis = number of instances in ASG.]

EC2: Improve system utilization

• Once autoscaling is in place, focus on improved system utilization
• OLTP workloads target 45-60% CPU utilization
• Batch workloads target 80%+ to maximize throughput
• Need to be cautious
• Some services can have network IO, or another non-CPU resource, as the primary limiting factor
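The utilization targets on this slide translate into a rough sizing estimate. The sketch below assumes that average CPU scales linearly with the share of load each instance carries, which is a simplification; as the slide warns, it does not hold when network IO or another resource is the real limit. The function name and example numbers are hypothetical.

```python
import math


def instances_for_target(current_instances, current_cpu_pct, target_cpu_pct):
    """Instances needed to reach the target average CPU, assuming linear scaling."""
    if not 0 < target_cpu_pct <= 100:
        raise ValueError("target_cpu_pct must be in (0, 100]")
    # Total CPU demand expressed in "instance-percent" units.
    total_cpu_demand = current_instances * current_cpu_pct
    return math.ceil(total_cpu_demand / target_cpu_pct)


# OLTP service running at 30% CPU on 100 instances, targeting 50%.
print(instances_for_target(100, 30, 50))  # 60
# Batch service running at 40% CPU on 40 instances, targeting 80%.
print(instances_for_target(40, 40, 80))   # 20
```

An estimate like this is a starting point for a load test, not a replacement for one.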

SQS: Usage and optimization

• Analytics and log processing infrastructure leverage SQS heavily
• Cost is a function of request and data transfer volume
• Messages typically small, primary optimization is through request rate reduction
• Adopted AWS SQS API batch capabilities as they evolved
• SQS batching allows up to 10 messages per batch
• 5B messages a day Q1 2012
• Implemented batch send and delete capabilities mid 2012
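The effect of batching on request count can be shown with simple arithmetic. This sketch only groups messages and counts requests; it makes no AWS calls. The helper names are hypothetical, and the 10x figure is a best case that assumes every batch is full.

```python
MAX_BATCH = 10  # SQS batch calls accept up to 10 messages, per the slide


def chunk(messages, size=MAX_BATCH):
    """Split a list of messages into batches of at most `size`."""
    return [messages[i:i + size] for i in range(0, len(messages), size)]


def requests_needed(message_count, batch_size=1):
    """API requests needed to send (or delete) `message_count` messages."""
    return (message_count + batch_size - 1) // batch_size  # integer ceiling


messages = [f"event-{n}" for n in range(25)]
print([len(batch) for batch in chunk(messages)])  # [10, 10, 5]

# With boto3, each batch would correspond to one send_message_batch call
# (not executed here; queue URL and entries are left out of this sketch).
daily_messages = 5_000_000_000
print(requests_needed(daily_messages))             # 5000000000
print(requests_needed(daily_messages, MAX_BATCH))  # 500000000
```

The same arithmetic applies to deletes, which is why adopting both batch send and batch delete matters for the overall request rate.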

SQS: Request Rate Reduction

[Figure: requests/day over time, annotated with three milestones: “Adopted batch delete”, “Started batch send adoption”, and “Batch capabilities adoption complete”.]

S3: Buckets…

• S3 usage can take off quickly

• Basic management tactics
  • Optimize access: Reduce payload size, reduce number of accesses
  • Age data out with TTL: Deletes are free; scans to find items to delete are not
• Investigate unexpected access patterns and growth trends
  • Misconfigured archive processes
  • S3 accesses failing auth at high rates

• Large files decomposed into multi-part upload; each “part” is an access
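Because each part of a multi-part upload counts as an access, the part size directly affects the request count. The sketch below counts one initiate request, one request per part, and one complete request; the function name and the example sizes are assumptions for illustration.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def multipart_request_count(file_size_bytes, part_size_bytes):
    """Requests for one multipart upload: initiate + one per part + complete."""
    if part_size_bytes <= 0:
        raise ValueError("part_size_bytes must be positive")
    parts = max(1, -(-file_size_bytes // part_size_bytes))  # integer ceiling
    return 1 + parts + 1


# A 1 GiB file uploaded with two different part sizes.
print(multipart_request_count(1 * GIB, 8 * MIB))   # 130
print(multipart_request_count(1 * GIB, 64 * MIB))  # 18
```

Larger parts mean fewer requests per file, at the cost of more data to resend if a single part fails.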

S3: Logs…

• Can quickly become a primary consumer of S3 capacity

• Reduce volume and access rate
• Provide platform libraries with desired behavior
• Push logs at infrequent intervals and set appropriate expiry tags
• For “mined” log data find alternate streamlined repositories
• Netflix streams data through Chukwa and into Hive for reporting purposes
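One way to express log expiry is an S3 lifecycle expiration rule. The slide does not say which mechanism Netflix used for its “expiry tags”, so this is a swapped-in illustration: the prefix, the 30-day period, and the helper name are hypothetical. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts, but the sketch only builds the rule and makes no AWS call.

```python
def log_expiry_rule(prefix, days):
    """Build a lifecycle rule that expires objects under `prefix` after `days` days."""
    if days < 1:
        raise ValueError("days must be at least 1")
    rule_name = prefix.strip("/").replace("/", "-")
    return {
        "ID": f"expire-{rule_name}-{days}d",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


# Hypothetical prefix and retention period.
lifecycle_configuration = {"Rules": [log_expiry_rule("logs/app/", 30)]}
print(lifecycle_configuration["Rules"][0]["ID"])  # expire-logs-app-30d

# With boto3 this could be applied roughly as follows (not executed here):
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",
#       LifecycleConfiguration=lifecycle_configuration,
#   )
```

Letting the storage service expire objects avoids the paid scans needed to find old items, which matches the earlier point that deletes are free but scans are not.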

Performance Testing

• Load tests in test environment
  • Primarily used to evaluate ASG size requirements and characterize service resource profile
  • Up to production scale infrastructure
  • Leverage homegrown Jenkins + JMeter load test framework
• “Squeeze tests”
  • Primarily used to identify per-instance capacity
  • Distribute traffic in production across multiple ASGs
  • Reduce size of one ASG to evaluate impact of increased request rate on both performance and utilization characteristics
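The squeeze test logic can be sketched as follows: traffic to one ASG stays roughly constant while the group shrinks, so requests per instance rise, and the per-instance capacity is the highest rate at which latency is still acceptable. The observations, the 60 ms limit, and the function name below are made-up assumptions, not Netflix measurements.

```python
def per_instance_capacity(observations, latency_limit_ms):
    """Highest requests/sec per instance observed while latency stayed within the limit."""
    acceptable = [rps for rps, latency_ms in observations if latency_ms <= latency_limit_ms]
    if not acceptable:
        raise ValueError("no observation met the latency limit")
    return max(acceptable)


# Hypothetical squeeze: 6,000 requests/sec held constant while the ASG shrinks.
total_rps = 6000
measured = [(60, 40), (50, 42), (40, 55), (30, 120)]  # (instances, latency in ms)
observations = [(total_rps / instances, latency_ms) for instances, latency_ms in measured]

print(per_instance_capacity(observations, latency_limit_ms=60))  # 150.0
```

In this made-up run latency degrades sharply at 30 instances, so the usable capacity is taken from the 40-instance step.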

Results: Efficiency improvements…validated

• 2x the customer traffic, same amount of AWS as 10 months ago

• Optimized EC2: fewer, larger pools of instance types

• Batch activity leverages unused reservation capacity

• Engineering velocity remains unconstrained by capacity management

Netflix Open Source - @NetflixOSS on Github

Legend: Github / Techblog; Apache Contributions; Techblog Post Only; Coming Soon

• Priam: Cassandra as a Service
• Astyanax: Cassandra client for Java
• CassJMeter: Cassandra test suite
• Cassandra Multi-region EC2 datastore support
• Aegisthus: Hadoop ETL for Cassandra
• Explorers
• Governator: Library lifecycle and dependency injection
• Odin: Workflow orchestration
• Blitz4j: Async logging
• Exhibitor: Zookeeper as a Service
• Curator: Zookeeper Patterns
• EVCache: Memcached as a Service
• Eureka / Discovery: Service Directory
• Archaius: Dynamic Properties Service
• Edda: Queryable config history
• Server-side latency/error injection
• REST Client + mid-tier LB
• Configuration REST endpoints
• Servo and Autoscaling Scripts
• Honu: Log4j streaming to Hadoop
• Circuit Breaker - Hystrix: Robust service pattern
• Asgard: AutoScaleGroup based AWS console
• Chaos Monkey: Robustness verification
• Latency Monkey
• Janitor Monkey
• Bakeries and AMI
• Build dynaslaves

Netflix at 2012 re:Invent

Date/Time | Presenter | Topic
Wed 8:30-10:00 | Reed Hastings | Keynote with Andy Jassy
Wed 1:00-1:45 | Coburn Watson | Optimizing Costs with AWS
Wed 2:05-2:55 | Kevin McEntee | Netflix’s Transcoding Transformation
Wed 3:25-4:15 | Neil Hunt / Yury I. | Netflix: Embracing the Cloud
Wed 4:30-5:20 | Adrian Cockcroft | High Availability Architecture at Netflix
Thu 10:30-11:20 | Jeremy Edberg | Rainmakers – Operating Clouds
Thu 11:35-12:25 | Kurt Brown | Data Science with Elastic Map Reduce (EMR)
Thu 11:35-12:25 | Jason Chan | Security Panel: Learn from CISOs working with AWS
Thu 3:00-3:50 | Adrian Cockcroft | Compute & Networking Masters Customer Panel
Thu 3:00-3:50 | Ruslan M./Gregg U. | Optimizing Your Cassandra Database on AWS
Thu 4:05-4:55 | Ariel Tseitlin | Intro to Chaos Monkey and the Simian Army

We are sincerely eager to hear your FEEDBACK on this presentation and on re:Invent.

Please fill out an evaluation form when you have a chance.

Contact: [email protected]
