Lessons learned managing large AWS Environments Ronald Bradford http://ronaldbradford.com @RonaldBradford 2013.06

Lessons Learned Managing Large AWS Environments

DESCRIPTION

How do you optimize the management of 500+ AWS servers? In this presentation I share my experiences using Amazon Web Services, covering techniques for web scale. Learn how to optimize your costs, handle security, automate, and be prepared for failure.

Page 1: Lessons Learned Managing Large AWS Environments

Lessons learned managing large AWS Environments

Ronald Bradford

http://ronaldbradford.com @RonaldBradford

2013.06

Page 2: Lessons Learned Managing Large AWS Environments

EffectiveMySQL.com - Performance, Scalability & Business Continuity

SCOPE

Consulting experiences with AWS

Several different clients

Largest - 500+ servers

Some 40-50+ servers

Some 2-5 servers

LAMP/RoR/RDS/Windows

Page 3: Lessons Learned Managing Large AWS Environments

ABOUT MySELF

Enterprise Data Architecture

24 years with RDBMS - 13 years with MySQL

Using AWS 4+ years

Published author - 4 books

Accomplished presenter - 8 years

Work as an independent MySQL consultant

Ronald BRADFORD

Page 4: Lessons Learned Managing Large AWS Environments

Covering

1. Products

2. Cost

3. Web Scale

4. Security

5. Instrumentation

6. Failure

Page 5: Lessons Learned Managing Large AWS Environments

1. AWS Products & Ecosystem

Page 6: Lessons Learned Managing Large AWS Environments

ABOUT AWS

Many, many products and features

EC2, S3, EBS, ELB, RDS, EMR, VPC, CDN, SWF, SQS, SES, SNS, IAM, ...

Mechanical Turk

Flexible Payments Service (FPS)

30+ AMAZON WEB SERVICES

Page 7: Lessons Learned Managing Large AWS Environments

AWS CONSOLE

(Console screenshots: May 2013 and Aug 2012)

Page 9: Lessons Learned Managing Large AWS Environments

Announcements

Product Announcements

Pricing Changes

New instance types

New features (e.g. IOPS)

New products (e.g. Redshift, OpsWorks)

Examples in presentation

http://aws.amazon.com/about-aws/newsletters/

Page 10: Lessons Learned Managing Large AWS Environments

ECOSYSTEM

AWS Marketplace

https://aws.amazon.com/marketplace/

Over 800

Page 11: Lessons Learned Managing Large AWS Environments

Product growth

When I started

No RDS, In-memory Cache, DynamoDB, Glacier

No Elastic Beanstalk, OpsWorks

No management console

Page 12: Lessons Learned Managing Large AWS Environments

2. AWS Costs

Page 13: Lessons Learned Managing Large AWS Environments

operating cost

Are you monitoring your costs?

Daily

Hourly

Page 15: Lessons Learned Managing Large AWS Environments

Operating Cost

https://github.com/ronaldbradford/aws

$ ec2_cost.sh

$29,000 p.m.
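
A rough sketch of the arithmetic behind a monthly cost figure. This is not the contents of ec2_cost.sh, which the slides do not show. The fleet inventory, the m1.xlarge price, and the 730-hour month are illustrative assumptions.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month (8760 / 12)

# Hypothetical inventory: instance type -> (count, hourly price in USD).
# Only the m1.large price appears in the slides; the rest is made up.
fleet = {
    "m1.large": (20, 0.24),
    "m1.xlarge": (5, 0.48),
}


def hourly_cost(fleet):
    """Return the combined hourly cost in USD of every running instance."""
    return sum(count * price for count, price in fleet.values())


def monthly_cost(fleet, hours=HOURS_PER_MONTH):
    """Return the estimated monthly cost in USD, assuming 24x7 operation."""
    return hourly_cost(fleet) * hours


if __name__ == "__main__":
    print(f"hourly:  ${hourly_cost(fleet):,.2f}")
    print(f"monthly: ${monthly_cost(fleet):,.2f}")
```

Running a calculation like this hourly or daily makes a sudden jump in spend visible long before the invoice arrives.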

Page 16: Lessons Learned Managing Large AWS Environments

Your Money

What is AWS costing you?

Instance types/sizes

Cost options

http://aws.amazon.com/ec2/instance-types

http://aws.amazon.com/ec2/pricing

Page 17: Lessons Learned Managing Large AWS Environments

Instance Types

General-purpose

Compute-optimized

Memory-optimized

Storage-optimized

GPU

Page 22: Lessons Learned Managing Large AWS Environments

Instance Prices

Large Instance (m1.large)

Option      Price/hour   Notes                          Saving
On Demand   $0.24        Per hour investment
Reserved    $0.136 *     + Annual contract (+$0.043)    40%
Spot        $0.03+ *     Can be terminated (budget)     up to 80+%

Was $0.32 until 11/19/2012; was $0.26 until 1/16/2013

Light/Medium/Heavy utilization
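
The savings figures can be checked with a few lines of arithmetic. This is a minimal sketch using the hourly prices from the slide. How the reserved upfront fee (+$0.043) should be amortized is not clear from the slide, so it is deliberately left out here.

```python
import math

ON_DEMAND = 0.24  # USD/hour, from the slide
RESERVED = 0.136  # USD/hour, from the slide (upfront fee not modeled)
SPOT = 0.03       # USD/hour, lowest spot price quoted on the slide


def saving_pct(price, baseline=ON_DEMAND):
    """Percentage saved by paying `price` instead of `baseline`."""
    return (baseline - price) / baseline * 100


def instances_per_budget(budget, price):
    """How many instance-hours a budget buys (rounded to dodge float noise)."""
    return math.floor(round(budget / price, 6))


if __name__ == "__main__":
    print(f"reserved saving: {saving_pct(RESERVED):.1f}%")
    print(f"spot saving:     {saving_pct(SPOT):.1f}%")
    print(f"spot instances for 24 cents: {instances_per_budget(0.24, SPOT)}")
```

At the lowest spot price, one on-demand hour of budget buys eight spot instance-hours. The spot price moves, so the real number varies.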

Page 24: Lessons Learned Managing Large AWS Environments

SPOT EXAMPLE

One hour (24 cents)

1 x Large - Reserved

7.5G, 4 CPUs, 850G

8 x Large - Spot

or

1 x Eight Extra Large - Spot (cc2.8xlarge)

60G, 88 CPUs, 3.4T, 10Gb NIC

price has changed 3 times in 8 months

Page 25: Lessons Learned Managing Large AWS Environments

SPOT HISTORY

$ ec2-describe-spot-price-history -t m1.large -d Linux/UNIX
SPOTINSTANCEPRICE 0.030000 2013-05-28T17:20:41-0500 m1.large Linux/UNIX us-east-1a
SPOTINSTANCEPRICE 0.100000 2013-05-28T17:07:02-0500 m1.large Linux/UNIX us-east-1a
SPOTINSTANCEPRICE 0.030000 2013-05-28T16:37:51-0500 m1.large Linux/UNIX us-east-1a
SPOTINSTANCEPRICE 0.100000 2013-05-28T16:31:03-0500 m1.large Linux/UNIX us-east-1a
SPOTINSTANCEPRICE 0.030000 2013-05-28T16:24:48-0500 m1.large Linux/UNIX us-east-1d
SPOTINSTANCEPRICE 0.030000 2013-05-28T16:24:48-0500 m1.large Linux/UNIX us-east-1a
SPOTINSTANCEPRICE 0.100000 2013-05-28T16:15:03-0500 m1.large Linux/UNIX us-east-1a
SPOTINSTANCEPRICE 0.060000 2013-05-28T16:08:34-0500 m1.large Linux/UNIX us-east-1d
SPOTINSTANCEPRICE 0.030000 2013-05-28T16:01:59-0500 m1.large Linux/UNIX us-east-1b
SPOTINSTANCEPRICE 0.240000 2013-05-28T15:55:12-0500 m1.large Linux/UNIX us-east-1b
SPOTINSTANCEPRICE 0.030000 2013-05-28T15:48:32-0500 m1.large Linux/UNIX us-east-1b
SPOTINSTANCEPRICE 0.030000 2013-05-28T15:42:07-0500 m1.large Linux/UNIX us-east-1a
SPOTINSTANCEPRICE 0.045000 2013-05-28T15:35:47-0500 m1.large Linux/UNIX us-east-1a
SPOTINSTANCEPRICE 0.050000 2013-05-28T15:35:47-0500 m1.large Linux/UNIX us-east-1b
SPOTINSTANCEPRICE 0.400000 2013-05-28T15:29:15-0500 m1.large Linux/UNIX us-east-1b
SPOTINSTANCEPRICE 0.260000 2013-05-28T15:22:47-0500 m1.large Linux/UNIX us-east-1b
SPOTINSTANCEPRICE 0.030000 2013-05-28T15:16:01-0500 m1.large Linux/UNIX us-east-1d
SPOTINSTANCEPRICE 0.030000 2013-05-28T15:16:01-0500 m1.large Linux/UNIX us-east-1a
SPOTINSTANCEPRICE 0.026000 2013-05-28T15:09:30-0500 m1.large Linux/UNIX us-east-1a

2013: 3c to 10c in Zone A; 3c to 40c in Zone B
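
Per-zone price ranges like these can be pulled out of the command output with a short parser. The sketch below assumes the whitespace-separated layout shown on the slide (optional SPOTINSTANCEPRICE tag, price, timestamp, instance type, description, zone). The function name and the sample are illustrative, and the sample is only a subset of the slide's rows.

```python
from collections import defaultdict

# A few rows copied from the slide output above.
SAMPLE = """\
SPOTINSTANCEPRICE 0.100000 2013-05-28T17:07:02-0500 m1.large Linux/UNIX us-east-1a
SPOTINSTANCEPRICE 0.030000 2013-05-28T16:01:59-0500 m1.large Linux/UNIX us-east-1b
SPOTINSTANCEPRICE 0.400000 2013-05-28T15:29:15-0500 m1.large Linux/UNIX us-east-1b
SPOTINSTANCEPRICE 0.026000 2013-05-28T15:09:30-0500 m1.large Linux/UNIX us-east-1a
"""


def price_range_by_zone(output):
    """Map each availability zone to its (min, max) spot price in USD."""
    prices = defaultdict(list)
    for line in output.splitlines():
        fields = line.split()
        if fields and fields[0] == "SPOTINSTANCEPRICE":
            fields = fields[1:]  # the tag is optional in some outputs
        if len(fields) != 5:
            continue  # skip the command line, blanks, and anything unexpected
        price, _timestamp, _instance_type, _description, zone = fields
        try:
            prices[zone].append(float(price))
        except ValueError:
            continue  # first field was not a price
    return {zone: (min(values), max(values)) for zone, values in prices.items()}


if __name__ == "__main__":
    for zone, (low, high) in sorted(price_range_by_zone(SAMPLE).items()):
        print(f"{zone}: {low * 100:.1f}c to {high * 100:.1f}c")
```

The same summary could feed the spot history graphing and email alerts mentioned later in the deck.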

Page 26: Lessons Learned Managing Large AWS Environments

SPOT HISTORY

$ ec2-describe-spot-price-history -t m1.large -d Linux/UNIX
0.0260 2012-09-27T09:45:46-0800 m1.large Linux/UNIX us-east-1b
0.0260 2012-09-27T09:45:46-0800 m1.large Linux/UNIX us-east-1d
0.0290 2012-09-27T09:38:37-0800 m1.large Linux/UNIX us-east-1b
0.0370 2012-09-27T09:38:37-0800 m1.large Linux/UNIX us-east-1d
0.0600 2012-09-27T09:31:29-0800 m1.large Linux/UNIX us-east-1b
0.1700 2012-09-27T09:31:29-0800 m1.large Linux/UNIX us-east-1d
0.1600 2012-09-27T09:24:20-0800 m1.large Linux/UNIX us-east-1d
0.0600 2012-09-27T09:17:11-0800 m1.large Linux/UNIX us-east-1b
0.0900 2012-09-27T09:17:11-0800 m1.large Linux/UNIX us-east-1d
0.0260 2012-09-27T09:09:55-0800 m1.large Linux/UNIX us-east-1c
0.0260 2012-09-27T09:09:55-0800 m1.large Linux/UNIX us-east-1b

2012: 2.6c to 17c (1/2 of 34c); one AZ only

Page 28: Lessons Learned Managing Large AWS Environments

Using SPOTS

Is your volume predictable?

Splitting on-demand/spot instances

Can work be done asynchronously?

i.e. can be queued

Is work restartable?

WARNING: Not for general workloads
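
To make "queued" and "restartable" concrete, here is a minimal in-memory sketch. A job records its own progress, so when a spot worker is terminated the job goes back on the queue and a later worker resumes it. The Terminated exception, the step budget, and the job layout are all hypothetical stand-ins. A real system would keep the queue and checkpoints outside the instance, for example in SQS and S3.

```python
from collections import deque


class Terminated(Exception):
    """Simulates the spot instance being reclaimed in the middle of a job."""


def process(job, budget):
    """Work through a job from its checkpoint until done or terminated."""
    while job["done"] < job["total"]:
        if budget["steps"] == 0:
            raise Terminated
        budget["steps"] -= 1
        job["done"] += 1  # checkpoint after every unit of work


def run_worker(queue, steps):
    """Run one worker that survives for `steps` units of work."""
    budget = {"steps": steps}
    completed = []
    while queue:
        job = queue.popleft()
        try:
            process(job, budget)
        except Terminated:
            queue.appendleft(job)  # requeue; progress stays in job["done"]
            break
        completed.append(job["name"])
    return completed


if __name__ == "__main__":
    jobs = deque([
        {"name": "a", "total": 3, "done": 0},
        {"name": "b", "total": 3, "done": 0},
    ])
    print(run_worker(jobs, steps=4))   # first worker is terminated part-way
    print(run_worker(jobs, steps=10))  # replacement worker finishes the rest
```

Work that cannot be checkpointed or requeued this way is a poor fit for spot instances.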

Page 30: Lessons Learned Managing Large AWS Environments

Instance sizes

Evaluating the right instance size

What is your bottleneck?

Developing a tool to recommend savings

Page 31: Lessons Learned Managing Large AWS Environments

TRUSTED ADVISOR

AWS now offers Trusted Advisor

Recommendations to save money

Improve performance

Close security problems

http://aws.amazon.com/premiumsupport/trustedadvisor/

Page 32: Lessons Learned Managing Large AWS Environments

COST SAVINGS

Other players

http://www.newvem.com/

http://www.cloudyn.com/

Page 33: Lessons Learned Managing Large AWS Environments

OTHER COST SAVINGS

CDN - Cloudfront

Bandwidth

Reduce response size (e.g. 10%)

Storage

old EBS snapshots

Remove unused instances

http://aws.amazon.com/cloudfront/

NEW: Announced 1/9/2013 CloudWatch Alarm Actions

Page 34: Lessons Learned Managing Large AWS Environments

3. Web Scale (hint: no humans)

Page 35: Lessons Learned Managing Large AWS Environments

ABOUT WEB SCALE

GUI = #FAIL

CLI is necessary

Manual CLI use is slow

Automation is crucial

Parallel
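
One way to picture the "parallel" point: instead of visiting hundreds of servers one at a time, fan the same action out across a thread pool and collect per-host results, including failures. This is a generic sketch, not the author's tooling. fake_uptime and the host names are hypothetical stand-ins for a real SSH or API call.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_all(hosts, action, max_workers=20):
    """Apply action(host) to every host concurrently.

    Returns {host: result}, with the raised exception stored as the
    result for any host where the action failed.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {host: pool.submit(action, host) for host in hosts}
        for host, future in futures.items():
            try:
                results[host] = future.result()
            except Exception as exc:  # one bad host must not stop the rest
                results[host] = exc
    return results


def fake_uptime(host):
    """Hypothetical stand-in for a remote command."""
    if host.endswith("-bad"):
        raise ConnectionError(f"cannot reach {host}")
    return f"{host}: ok"


if __name__ == "__main__":
    for host, outcome in run_on_all(["web-1", "web-2", "db-bad"], fake_uptime).items():
        print(host, "->", outcome)
```

Capturing failures per host matters at this scale: with 500+ servers, some are almost always unreachable.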

Page 37: Lessons Learned Managing Large AWS Environments

AWS CLIs

Different for EC2, ELB, RDS etc

Updated frequently (i.e. monthly)

$ git clone https://github.com/ronaldbradford/aws.git
$ cd aws/scripts
$ ./aws_cli_configure.sh

Simple helper

Page 38: Lessons Learned Managing Large AWS Environments

RTFM

http://aws.amazon.com/archives/Amazon-EC2

Page 39: Lessons Learned Managing Large AWS Environments

Identifiers

Access Key ID

Secret Access Key

X.509 Certificates (2 of)

Private (*) & Public

AWS Account ID

Canonical User ID

https://portal.aws.amazon.com/gp/aws/securityCredentials

Page 40: Lessons Learned Managing Large AWS Environments

CLI Examples

Launch Script

Demand/Spot or switch between

Verify SSH

Verify MySQL

Verify replication in sync

Add to ELB
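
The launch sequence above is essentially an ordered checklist where each step must pass before the next runs, and a server only reaches the load balancer once everything before it has succeeded. A minimal sketch of that control flow follows. The author's launch script is not shown in the slides, so run_launch_checks, the stub checks, and the host name are assumptions. port_open uses the standard library and could serve as the SSH (port 22) or MySQL (port 3306) check.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_launch_checks(host, checks):
    """Run (name, check) pairs in order and stop at the first failure.

    Returns (names_that_passed, name_that_failed_or_None).
    """
    passed = []
    for name, check in checks:
        if not check(host):
            return passed, name
        passed.append(name)
    return passed, None


if __name__ == "__main__":
    # Stub checks so the example runs without a real server.
    checks = [
        ("verify ssh", lambda host: True),
        ("verify mysql", lambda host: True),
        ("replication in sync", lambda host: False),
        ("add to ELB", lambda host: True),
    ]
    print(run_launch_checks("db-7.example.com", checks))
```

Because the sequence stops at the first failed check, a replica that is still catching up never receives traffic.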

Page 41: Lessons Learned Managing Large AWS Environments

CLI Examples

Audit Script

Consolidates information

Parallel operations

Unused EC2/EBS etc

Feeds reporting

ELB/EC2 usage

Page 42: Lessons Learned Managing Large AWS Environments

CLI EXAMPLES

Others

Cost Measurement

Cloning (optimizes scale-up)

Move servers between load balancers

Spot History graphing

Spot History email alerts

Page 43: Lessons Learned Managing Large AWS Environments

4. AWS Security

Page 44: Lessons Learned Managing Large AWS Environments

SECURITY

Do not give away the front door keys

Do not open all the windows

Page 45: Lessons Learned Managing Large AWS Environments

SECURITY OPTIONS

Keypairs

Security groups

Virtual Private Cloud (VPC)

Identity and Access Management (IAM)

Multi-factor authentication

Learn the different benefits

http://aws.amazon.com/mfa/

Page 46: Lessons Learned Managing Large AWS Environments

SECURITY TIPS

Restrict open access to port 80/443

Jump box

Restrict IP Access

Additional authentication

Per user SSH authentication

Do not use keypair
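
The first tip lends itself to an automated audit: flag any security group rule that exposes something other than ports 80/443 to the whole internet. The sketch below works on a simplified, hypothetical rule format (plain dictionaries), not on the actual output of any AWS tool. The group names and rules are made up.

```python
PUBLIC = "0.0.0.0/0"
ALLOWED_PUBLIC_PORTS = {80, 443}


def risky_rules(rules):
    """Return the rules that open any port other than 80/443 to everyone."""
    flagged = []
    for rule in rules:
        if rule["cidr"] != PUBLIC:
            continue  # restricted to specific addresses; not checked here
        ports = range(rule["from_port"], rule["to_port"] + 1)
        if any(port not in ALLOWED_PUBLIC_PORTS for port in ports):
            flagged.append(rule)
    return flagged


if __name__ == "__main__":
    # Hypothetical rules for illustration only.
    rules = [
        {"group": "web", "from_port": 80, "to_port": 80, "cidr": PUBLIC},
        {"group": "web", "from_port": 22, "to_port": 22, "cidr": PUBLIC},
        {"group": "db", "from_port": 3306, "to_port": 3306, "cidr": "10.0.0.0/8"},
    ]
    for rule in risky_rules(rules):
        print(f"{rule['group']}: ports {rule['from_port']}-{rule['to_port']} open to the world")
```

In this example only the SSH rule is flagged, which is the kind of access that belongs behind a jump box with restricted source IPs.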

Page 47: Lessons Learned Managing Large AWS Environments

products

Many Others (AWS Summit 2013)

Cloudaware

Enstratius

AlertLogic

Dome9

SafeNet

Page 48: Lessons Learned Managing Large AWS Environments

5. Instrumentation

Page 53: Lessons Learned Managing Large AWS Environments

Instrumentation

What is important to you?

All server stats

Sampling issues

Deceiving averages (frequency)

Page 56: Lessons Learned Managing Large AWS Environments

REQUESTS PER SEC

5 second averages, not 1 minute sample

https://github.com/ronaldbradford/reqstat

-1,500 RPS
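
Why the sampling window matters can be shown with synthetic numbers. In the sketch below, the traffic figures are invented for illustration and are not taken from reqstat or the slides. A five-second burst is obvious in 5-second averages but disappears into an unremarkable 1-minute average.

```python
def window_averages(per_second, window):
    """Average a per-second series over consecutive windows of `window` seconds."""
    averages = []
    for start in range(0, len(per_second), window):
        chunk = per_second[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages


if __name__ == "__main__":
    # Synthetic minute: steady 400 requests/sec, then a 5-second burst of 1,600.
    per_second = [400] * 55 + [1600] * 5

    print("1-minute average:", window_averages(per_second, 60))
    print("peak 5-second average:", max(window_averages(per_second, 5)))
```

The 1-minute view reports 500 requests per second while the servers actually had to absorb 1,600, more than three times as much.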

Page 58: Lessons Learned Managing Large AWS Environments

outliers

I care about these
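
A small illustration of why outliers deserve their own metric: with the made-up response times below, the mean and median both look healthy while the 99th percentile reveals requests that took two seconds. The nearest-rank percentile used here is one common definition among several, chosen for simplicity.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Synthetic response times in milliseconds: 98 fast requests, 2 very slow ones.
    times_ms = [20] * 98 + [2000] * 2

    print("mean:  ", statistics.mean(times_ms))
    print("median:", percentile(times_ms, 50))
    print("p99:   ", percentile(times_ms, 99))
```

Tracking a high percentile or the maximum alongside the average keeps the slow requests visible.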

Page 59: Lessons Learned Managing Large AWS Environments

TESTING

End-to-end testing is critical

Network latency

ELB performance

Page 61: Lessons Learned Managing Large AWS Environments

products

AWS CloudWatch

Many Others (AWS Summit 2013)

Datadog

Boundary

CopperEgg

AppDynamics

What features matter?

Page 62: Lessons Learned Managing Large AWS Environments

6. Failure

Page 66: Lessons Learned Managing Large AWS Environments

FAILURE

Instances fail

Outages occur

AWS scheduled reboots

Be prepared

Chaos Monkey

http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html

Page 72: Lessons Learned Managing Large AWS Environments

CONCLUSION

Cost Management (saving money)

CLI automation

Instrumentation (inc business metrics)

Distribute your application & data

Disaster is inevitable

Page 73: Lessons Learned Managing Large AWS Environments

AWS for FREE

http://aws.amazon.com/free/

Free EC2 t1.micro for a year

Free RDS t1.micro for a year

S3, DynamoDB, SimpleDB, +++

Page 74: Lessons Learned Managing Large AWS Environments

http://effectiveMySQL.com

Ronald Bradford