Optimizing Costs with AWS Coburn Watson, Manager - Cloud Performance

AWS Re:Invent - Optimizing Costs with AWS


DESCRIPTION

AWS Re:Invent 2012 presentation from Netflix which covers how to optimize cost and usage of your AWS resources. Areas of focus are Autoscaling EC2 instances, batch access of SQS, and improved S3 usage.


Page 1: AWS Re:Invent -  Optimizing Costs with AWS

Optimizing Costs with AWS

Coburn Watson, Manager - Cloud Performance

Page 2: AWS Re:Invent -  Optimizing Costs with AWS

Netflix Inc.

With more than 30 million streaming members in the United States, Canada, Latin America, the United Kingdom, Ireland and the Nordics, Netflix, Inc. is the world's leading internet subscription service for enjoying movies and TV programs.

Source: http://ir.netflix.com

Page 3: AWS Re:Invent -  Optimizing Costs with AWS

Agenda

• Rationale and High-level Methodology
• AWS resource-specific optimizations
• Performance Testing
• Results
• Q&A

Page 4: AWS Re:Invent -  Optimizing Costs with AWS

Rationale and High-level Methodology

Page 5: AWS Re:Invent -  Optimizing Costs with AWS

Rationale
• Applications operate at massive scale
• Across three regions and multiple zones per region
• Service oriented architecture
• Many moving parts (teams)
• Unconstrained deployment capabilities
• “Freedom and Responsibility” culture

Page 6: AWS Re:Invent -  Optimizing Costs with AWS

Rationale, cont.
• Improve availability
• Avoid saturation of key resources
• Dynamically adjust capacity to meet workload demands
• Plan for increasing workloads
• Less focus on reducing current demand
• Maximize efficiency
• Balance OLTP and batch demands
• “That which is measured improves”

Page 7: AWS Re:Invent -  Optimizing Costs with AWS

Deployment Example
• Asgard framework enables turnkey deployment (Netflix open-sourced)
• All engineers have full access
• Real-time reservation capacity
• Unconstrained ASG size limits

Page 8: AWS Re:Invent -  Optimizing Costs with AWS

Methodology

• Manual
• Weekly usage review; leverage Netflix “AWS usage” tool
• Identify unexpected on-demand trends
• Review reservation use efficiency
• Trend “cost per key event” (e.g. cost per stream event)
• As-needed
• Evaluate utilization and autoscaling efficiency for key services
• Automated
• Weekly email to service teams with AWS usage trend (EC2, S3, SimpleDB)
• Available reservations exposed real-time to engineers
• Janitor Monkey
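
The “cost per key event” trend can be illustrated with a small calculation. This is only a sketch: the function names and the dollar and event figures are hypothetical and are not taken from the Netflix tooling. The point it shows is that spend can rise while the unit cost falls, which is why the ratio is tracked rather than raw cost.

```python
def cost_per_event(weekly_cost_usd, weekly_events):
    """Return the cost per event for each week (cost divided by event count)."""
    if len(weekly_cost_usd) != len(weekly_events):
        raise ValueError("need one event count per weekly cost")
    return [cost / events for cost, events in zip(weekly_cost_usd, weekly_events)]


def week_over_week_change(series):
    """Return the fractional change between consecutive weeks (0.10 means +10%)."""
    return [(cur - prev) / prev for prev, cur in zip(series, series[1:])]


# Hypothetical figures: spend grows 10% while stream events grow 25%.
costs = [1000.0, 1100.0]
events = [2_000_000, 2_500_000]
unit_costs = cost_per_event(costs, events)
trend = week_over_week_change(unit_costs)  # negative: unit cost went down
```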

Page 9: AWS Re:Invent -  Optimizing Costs with AWS

AWS usage tool

• Pulls cost and usage information from AWS APIs

• Bird's-eye view of usage
• Near real-time data
• Open sourcing plans for tool
• Decomposes by application

Page 10: AWS Re:Invent -  Optimizing Costs with AWS

AWS usage automated email reports

• Weekly email to teams with 4-week cost trend on EC2, S3, and SimpleDB

Page 11: AWS Re:Invent -  Optimizing Costs with AWS

Janitor Monkey

• Fully Automated
• Seeks to reduce “unintentional” resource usage due to failed cleanup
• Cleans up the following resources
• EC2 instances
• EBS volumes
• EBS snapshots
• Launch Configurations*
• Autoscaling Groups*
• Security Groups*
• Reduces cost and clutter(*)
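
As a rough illustration of the kind of rule an automated cleanup can apply, the sketch below flags resources that are not in use and have been idle longer than a retention window. It does not reproduce Janitor Monkey's actual rules: the `Resource` record, the `in_use` flag, and the 14-day default are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Resource:
    """Hypothetical inventory record; field names are illustrative only."""
    resource_id: str
    kind: str            # e.g. "ebs_volume", "ebs_snapshot", "ec2_instance"
    in_use: bool         # attached to, or referenced by, something live
    last_used: datetime


def cleanup_candidates(resources, now, retention=timedelta(days=14)):
    """Return ids of resources that are unused and idle longer than `retention`."""
    return [
        r.resource_id
        for r in resources
        if not r.in_use and now - r.last_used > retention
    ]


now = datetime(2012, 11, 28)
inventory = [
    Resource("vol-a", "ebs_volume", in_use=False, last_used=datetime(2012, 10, 1)),
    Resource("vol-b", "ebs_volume", in_use=True, last_used=datetime(2012, 10, 1)),
    Resource("snap-c", "ebs_snapshot", in_use=False, last_used=datetime(2012, 11, 25)),
]
candidates = cleanup_candidates(inventory, now)
```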

Page 12: AWS Re:Invent -  Optimizing Costs with AWS

AWS resource-specific optimizations:

Page 13: AWS Re:Invent -  Optimizing Costs with AWS

EC2: Primary optimization goals

• Align services to relatively few instance categories
• Fewer, larger pools to work with
• Common classes (e.g. m2.*)
• Autoscale, autoscale, autoscale
• Identify workload components which can utilize excess reservation capacity

• Increase per-instance utilization (CPU, IO, Memory)

• Minimize duration of ASG “overlap” during code pushes

Page 14: AWS Re:Invent -  Optimizing Costs with AWS

EC2: Autoscaling - Benefits

• Improved efficiency and availability
• Avoids setting fixed ASG “max” instance count arbitrarily high
• Optimize resource allocation for mixed workloads
• Batch activity can consume unused capacity during OLTP off-peak periods
• Insulate services from unexpected bursts in demand
• “Super Bowl” Effect
• Chained services that “scale together stay together”

Page 15: AWS Re:Invent -  Optimizing Costs with AWS

EC2: Autoscaling - Challenges

• Effectively consuming the unused reservation capacity provided through autoscaling
• Problem compounded: Large services often scale up or down on the same schedule

[Figure: Unused Reservation Instance Hours* over time (7/3/12 to 7/7/12), y-axis 0 to 2,000. Annotation: “Need to use this capacity”.]

* - fictitious volumes
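
The quantity plotted on this slide can be approximated as the gap between reserved and running instances, summed per hour. The sketch below assumes a single instance pool and hourly samples; the function name and the sample numbers are hypothetical.

```python
def unused_reservation_hours(reserved, running_by_hour):
    """Sum reserved-but-idle instance hours.

    Hours where running instances exceed the reservation contribute zero
    (the excess would run on-demand instead).
    """
    return sum(max(0, reserved - running) for running in running_by_hour)


# Hypothetical pool of 2,000 reserved instances sampled over four hours.
idle_hours = unused_reservation_hours(2000, [2000, 1500, 900, 2200])
```

A result like this is the capacity that batch work could absorb during OLTP off-peak periods.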

Page 16: AWS Re:Invent -  Optimizing Costs with AWS

EC2: Autoscaling Methodology

• Prioritize service migration to autoscaling
• Start with large services
• Work downstream to dependent services
• Identify metric to leverage for scaling alarm
• Rate-based (requests per second), or Load-based (load average)
• More aggressive scale-up versus scale-down
• Netflix internal metrics published directly to CloudWatch with Servo *

* Netflix OSS library
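
“More aggressive scale-up versus scale-down” can be sketched as an asymmetric step policy: add a large percentage of capacity when the metric crosses the upper threshold, and remove only a small percentage when it falls below the lower one. The thresholds, the 20% and 5% steps, and the function name are assumptions for illustration, not the values Netflix used.

```python
def desired_capacity(current, metric, scale_up_at, scale_down_at,
                     up_pct=20, down_pct=5, minimum=1, maximum=100):
    """Return the next ASG size for one evaluation of the scaling alarm."""
    if metric >= scale_up_at:
        # Ceiling division so that small groups still grow by at least one.
        step = -(-current * up_pct // 100)
        target = current + step
    elif metric <= scale_down_at:
        # Shrink slowly, but by at least one instance.
        step = max(1, current * down_pct // 100)
        target = current - step
    else:
        target = current
    return max(minimum, min(maximum, target))


# Hypothetical rate-based metric: requests per second per instance.
grow = desired_capacity(50, metric=900, scale_up_at=800, scale_down_at=400)
shrink = desired_capacity(50, metric=300, scale_up_at=800, scale_down_at=400)
```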

Page 17: AWS Re:Invent -  Optimizing Costs with AWS

EC2: Autoscaling Methodology, cont.

• Validate with load tests
• Avoid double-jump or thrashing conditions
• Variable instance startup times can result in double-jump
• Autoscaling batch applications
• Leverage “scheduled actions”
• Maximize consumption of spare reservation capacity

Page 18: AWS Re:Invent -  Optimizing Costs with AWS

EC2: Simplify Autoscaling Configuration

• Expose Autoscaling capabilities through Asgard
• Scaling policies and scheduled actions:

Page 19: AWS Re:Invent -  Optimizing Costs with AWS

EC2: Autoscaling profile examples

[Figure: three ASG instance-count profiles over time, labeled Healthy, Thrashing, and Double-Jump. Y-axis = number of instances in ASG.]
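
One simple way to tell a thrashing profile from a healthy one is to count how often the instance count switches between growing and shrinking. The sketch below is an assumed heuristic, not a Netflix tool, and it covers thrashing only; it does not detect double-jump. The sample series are made up.

```python
def direction_reversals(instance_counts):
    """Count how often the ASG size flips between growing and shrinking."""
    reversals = 0
    last_direction = 0
    for prev, cur in zip(instance_counts, instance_counts[1:]):
        direction = (cur > prev) - (cur < prev)  # +1 up, -1 down, 0 flat
        if direction == 0:
            continue
        if last_direction and direction != last_direction:
            reversals += 1
        last_direction = direction
    return reversals


healthy = [10, 12, 14, 16, 16, 14, 12, 10]   # one daily peak
thrashing = [10, 14, 10, 14, 10, 14]         # repeated up/down swings
healthy_flips = direction_reversals(healthy)
thrashing_flips = direction_reversals(thrashing)
```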

Page 20: AWS Re:Invent -  Optimizing Costs with AWS

EC2: Improve system utilization

• Once autoscaling in place, focus on improved system utilization
• OLTP workloads target 45-60% CPU utilization
• Batch workloads target 80%+ to maximize throughput
• Need to be cautious
• Some services can have network IO, or other non-CPU as primary limiting factor
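
A back-of-the-envelope estimate of how many instances a CPU target implies is sketched below. It assumes the workload is CPU-bound and that utilization scales linearly with instance count, which the caution above notes is not always true. The function name and numbers are hypothetical.

```python
import math


def instances_for_target(current_instances, current_cpu_pct, target_cpu_pct):
    """Estimate the instance count that would run the same load at the target CPU %."""
    if target_cpu_pct <= 0:
        raise ValueError("target_cpu_pct must be positive")
    total_load = current_instances * current_cpu_pct
    return math.ceil(total_load / target_cpu_pct)


# Hypothetical OLTP service: 100 instances at 30% CPU, targeting 50%.
oltp_estimate = instances_for_target(100, 30, 50)
# Hypothetical batch job: 40 instances at 55% CPU, targeting 80%.
batch_estimate = instances_for_target(40, 55, 80)
```

Any such estimate would still need to be validated with load or squeeze tests before resizing a production ASG.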

Page 21: AWS Re:Invent -  Optimizing Costs with AWS

SQS: Usage and optimization

• Analytics and log processing infrastructure leverage SQS heavily
• Cost is a function of request and data transfer volume
• Messages typically small, primary optimization is through request rate reduction
• Adopted AWS SQS API batch capabilities as they evolved
• SQS batching allows up to 10 messages per batch
• 5B messages a day Q1 2012
• Implemented batch send and delete capabilities mid 2012
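
The effect of batching on request count can be sketched with plain arithmetic. The code below only groups messages into batches of up to 10 and counts the resulting requests; it does not call the SQS API, and the helper names are made up for the example. The best case shown (every batch full) is an upper bound on the savings.

```python
def batches(messages, batch_size=10):
    """Yield consecutive groups of at most `batch_size` messages."""
    for start in range(0, len(messages), batch_size):
        yield messages[start:start + batch_size]


def request_count(message_count, batch_size=10):
    """Requests needed for one operation (e.g. send) if every batch is full."""
    return -(-message_count // batch_size)  # ceiling division


groups = list(batches([f"msg-{i}" for i in range(25)]))
group_sizes = [len(g) for g in groups]

# 5B messages a day: one request each unbatched, versus full batches of 10.
unbatched_sends = request_count(5_000_000_000, batch_size=1)
batched_sends = request_count(5_000_000_000)
```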

Page 22: AWS Re:Invent -  Optimizing Costs with AWS

SQS: Request Rate Reduction

[Figure: SQS requests/day over time, annotated with three milestones: “Adopted batch delete”, “Started batch send adoption”, and “Batch capabilities adoption complete”.]

Page 23: AWS Re:Invent -  Optimizing Costs with AWS

S3: Buckets…

• S3 usage can take off quickly

• Basic management tactics
• Optimize access: Reduce payload size, reduce number of accesses
• Age data out with TTL: Deletes are free; scans to find items to delete are not
• Investigate unexpected access patterns and growth trends
• Misconfigured archive processes
• S3 accesses failing auth at high rates

• Large files decomposed into multi-part upload; each “part” is an access
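
To see how multi-part uploads multiply the access count, the sketch below estimates the requests for one object: one request per part, plus one to initiate and one to complete the upload. The 8 MiB part size is an arbitrary assumption for illustration, and retries or aborted uploads are ignored.

```python
def multipart_requests(object_bytes, part_bytes):
    """Estimate S3 requests for one multi-part upload (initiate + parts + complete)."""
    if part_bytes <= 0:
        raise ValueError("part_bytes must be positive")
    parts = -(-object_bytes // part_bytes)  # ceiling division
    return parts + 2


MIB = 1024 ** 2
GIB = 1024 ** 3

# Hypothetical 1 GiB object uploaded in 8 MiB parts.
requests_small_parts = multipart_requests(1 * GIB, 8 * MIB)
# Same object with 64 MiB parts: far fewer accesses.
requests_large_parts = multipart_requests(1 * GIB, 64 * MIB)
```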

Page 24: AWS Re:Invent -  Optimizing Costs with AWS

S3: Logs…

• Can quickly become a primary consumer of S3 capacity

• Reduce volume and access rate
• Provide platform libraries with desired behavior
• Push logs at infrequent intervals and set appropriate expiry tags
• For “mined” log data find alternate streamlined repositories
• Netflix streams data through Chukwa and into Hive for reporting purposes

Page 25: AWS Re:Invent -  Optimizing Costs with AWS

Performance Testing

• Load tests in test environment
• Primarily used to evaluate ASG size requirements and characterize service resource profile
• Up to production scale infrastructure
• Leverage homegrown Jenkins + JMeter load test framework
• “Squeeze tests”
• Primarily used to identify per-instance capacity
• Distribute traffic in production across multiple ASGs
• Reduce size of one ASG to evaluate impact of increased request rate on both performance and utilization characteristics
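
The arithmetic behind a squeeze test can be sketched as follows: as the squeezed ASG shrinks while its share of traffic stays constant, each instance sees a higher request rate, and the highest rate that still meets a latency limit approximates per-instance capacity. The observation tuples, the use of p99 latency, and the 100 ms limit are all assumptions for the example.

```python
def per_instance_rate(total_rps, instance_count):
    """Requests per second handled by each instance in the squeezed ASG."""
    return total_rps / instance_count


def max_sustainable_rate(observations, latency_limit_ms):
    """Highest per-instance rate whose observed latency stayed within the limit.

    `observations` holds (instance_count, total_rps, p99_latency_ms) tuples.
    Returns None if no observation met the limit.
    """
    acceptable = [
        per_instance_rate(rps, count)
        for count, rps, latency_ms in observations
        if latency_ms <= latency_limit_ms
    ]
    return max(acceptable) if acceptable else None


# Hypothetical squeeze: the ASG keeps 2,000 rps while shrinking from 20 to 12.
squeeze = [(20, 2000, 80), (16, 2000, 95), (12, 2000, 140)]
capacity = max_sustainable_rate(squeeze, latency_limit_ms=100)
```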

Page 26: AWS Re:Invent -  Optimizing Costs with AWS

Results: Efficiency improvements…validated

• 2x the customer traffic, same amount of AWS as 10 months ago

• Optimized EC2: fewer, larger pools of instance types

• Batch activity leverages unused reservation capacity

• Engineering velocity remains unconstrained by capacity management

Page 27: AWS Re:Invent -  Optimizing Costs with AWS

Netflix Open Source - @NetflixOSS on Github

Page 28: AWS Re:Invent -  Optimizing Costs with AWS

Open Source Projects - @NetflixOSS on Github

Legend (project status): Github / Techblog, Apache Contributions, Techblog Post Only, Coming Soon

• Priam - Cassandra as a Service
• Astyanax - Cassandra client for Java
• CassJMeter - Cassandra test suite
• Cassandra Multi-region EC2 datastore support
• Aegisthus - Hadoop ETL for Cassandra
• Explorers
• Governator - Library lifecycle and dependency injection
• Odin - Workflow orchestration
• Blitz4j - Async logging
• Exhibitor - Zookeeper as a Service
• Curator - Zookeeper Patterns
• EVCache - Memcached as a Service
• Eureka / Discovery - Service Directory
• Archaius - Dynamic Properties Service
• Edda - Queryable config history
• Server-side latency/error injection
• REST Client + mid-tier LB
• Configuration REST endpoints
• Servo and Autoscaling Scripts
• Honu - Log4j streaming to Hadoop
• Circuit Breaker - Hystrix - Robust service pattern
• Asgard - AutoScaleGroup based AWS console
• Chaos Monkey - Robustness verification
• Latency Monkey
• Janitor Monkey
• Bakeries and AMI
• Build dynaslaves

Page 29: AWS Re:Invent -  Optimizing Costs with AWS

Netflix at 2012 re:Invent

Date/Time | Presenter | Topic
Wed 8:30-10:00 | Reed Hastings | Keynote with Andy Jassy
Wed 1:00-1:45 | Coburn Watson | Optimizing Costs with AWS
Wed 2:05-2:55 | Kevin McEntee | Netflix’s Transcoding Transformation
Wed 3:25-4:15 | Neil Hunt / Yury I. | Netflix: Embracing the Cloud
Wed 4:30-5:20 | Adrian Cockcroft | High Availability Architecture at Netflix
Thu 10:30-11:20 | Jeremy Edberg | Rainmakers – Operating Clouds
Thu 11:35-12:25 | Kurt Brown | Data Science with Elastic Map Reduce (EMR)
Thu 11:35-12:25 | Jason Chan | Security Panel: Learn from CISOs working with AWS
Thu 3:00-3:50 | Adrian Cockcroft | Compute & Networking Masters Customer Panel
Thu 3:00-3:50 | Ruslan M./Gregg U. | Optimizing Your Cassandra Database on AWS
Thu 4:05-4:55 | Ariel Tseitlin | Intro to Chaos Monkey and the Simian Army

Page 30: AWS Re:Invent -  Optimizing Costs with AWS

We are sincerely eager to hear your FEEDBACK on this presentation and on re:Invent.

Please fill out an evaluation form when you have a chance.

Contact: [email protected]