CopperEgg - Austin CUG - August 23rd, 2011 (presented by Eric Anderson) [email protected] Wednesday, August 24, 11

Austin Cloud Users Group - August 23rd, 2011


Page 1: Austin Cloud Users Group - August 23rd, 2011

CopperEgg
Austin CUG - August 23rd, 2011

(presented by Eric Anderson)
[email protected]

Wednesday, August 24, 11

Page 2: Austin Cloud Users Group - August 23rd, 2011

About Us

CopperEgg
• Founded spring 2010
• Super real-time monitoring and analytics

About me (Eric Anderson)
• SysAdmin - Centaur - 1999-2007
  • 1400 compute nodes, ~50-100 file servers, ~200 misc systems, hundreds of TBs
• Software Engineer - StorSpeed - 2007-2010
  • built distributed file system cache for NAS acceleration product
• Co-Founder/COO - CopperEgg - 2010-Present

Page 3: Austin Cloud Users Group - August 23rd, 2011

Why Cloud?
Important Differences:

• Installs in seconds - copy/paste install
• No configuration required - anyone can do it


Page 4: Austin Cloud Users Group - August 23rd, 2011

Why Cloud?
Important Differences:

All reliable and business-worthy systems need something like this:

Physical
• Physical security
• Redundant power
• Redundant AC
• Redundant & fast network
• Peak hardware
• Spare equipment
• Physical space (storage of spare stuff too)
• People to manage physical infrastructure
• Hardware repairs

Cloud
• Redundant infrastructure
  • Multi-AZ, Regions, storage, etc
• Resilient Applications
  • Designed for failure
• Performance measurement
• Automatic failover/recovery
• Security of your infrastructure
• Monitoring - up/down/status
• Visibility into system as a whole
• Don’t rely on cloud vendor!
  • Delayed, inaccurate

Page 5: Austin Cloud Users Group - August 23rd, 2011

Why Cloud? (for CopperEgg)
Why did we go cloud?

• Needed to get building fast
• We didn’t know what we needed
• Just-in-time scaling
• Keep costs low and still provide awesome service levels
• Easy deployment for developers
  • Test different scenarios, try new setups, etc
• We use it for everything!
  • code repositories, tickets, email, phone, alerting, etc

Page 6: Austin Cloud Users Group - August 23rd, 2011

What we were building
Storage analytics product

• visualize network attached storage in real-time
• massive amounts of data
  • analyzing 10 billion ops/day in beta, in real-time
• super real-time (seconds vs minutes)

Requirements:
• highly available
• super responsive
• gobble large amounts of analytics data in real-time
• historical data for 2 yrs
• great UI
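To put the 10 billion ops/day figure in per-second terms, here is a quick back-of-the-envelope sketch. It assumes the load is spread evenly across the day, which the slides do not state, so real peaks would be higher than this average. The function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_ops_per_second(ops_per_day):
    """Average rate implied by a daily total, assuming an even spread (an assumption)."""
    return ops_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = average_ops_per_second(10_000_000_000)  # 10 billion ops/day, from the slide
    print(f"{rate:,.0f} ops/sec on average")  # prints: 115,741 ops/sec on average
```

So the beta workload works out to roughly 116 thousand operations per second on average.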


Page 7: Austin Cloud Users Group - August 23rd, 2011

Where we started

Bad:
• Outgrew it before we outgrew it
• Slow!

So then what?

+ SimpleDB

Page 8: Austin Cloud Users Group - August 23rd, 2011

Amazon RDS to save the day!

Good:
• Faster than SimpleDB
• Could scale the storage

Bad:
• Realized it still would not handle our dataset
  • Inserts were too slow

So then what?

+ SimpleDB
+ RDS

Page 9: Austin Cloud Users Group - August 23rd, 2011

MySQL on EC2 to save the day!

Good:
• Faster than RDS
• Increased insert performance
  • Using some cheats to get the insert rate up

Bad:
• Still not good enough insert performance..

So then what?

+ SimpleDB
+ RDS
EC2 + MySQL

Page 10: Austin Cloud Users Group - August 23rd, 2011

MySQL on Rackspace Cloud

Good:
• Faster than Amazon (CPU)
• Seemed cheaper

Bad:
• No easy way to scale across different zones or regions
• No way to expand storage per instance (whole instance only - costly!)
• Then we got the bill: they charge for data transfer between instances - OUCH

So then what?

+ SimpleDB
+ RDS
EC2 + MySQL
+ MySQL

Page 11: Austin Cloud Users Group - August 23rd, 2011

Back to Amazon!

Why did we move back?
• Lots of great services: S3, EC2, EBS, Route 53, ELB (we use all of these)
• Even more: SQS, SES, etc
• Multiple regions and availability zones
• Scale-as-you-need: storage, memory, cpu, redundancy
• Documentation

We’re still happy with this.. (9 months and running)

+ SimpleDB
+ RDS
EC2 + MySQL
+ MySQL
EC2, EBS, MongoDB

Page 12: Austin Cloud Users Group - August 23rd, 2011

What’s this NoSQL thing?
Realized maybe MySQL was not the best choice

• How about a NoSQL database?
• So we tested and measured every one we thought was worth looking at:
  • Redis
  • Tokyo Tyrant, Kyoto Cabinet
  • Cassandra
  • MongoDB
  • etc, etc, etc (there are a lot)

Page 13: Austin Cloud Users Group - August 23rd, 2011

MongoDB won
MongoDB won the award - why?

• Redundant
• Scalable
• Persistent data-store
• Handles large amounts of data
• Awesome user community
• Vendor support
• Open source
• Lots of momentum

Page 14: Austin Cloud Users Group - August 23rd, 2011

Where are we now?
Needed a way to monitor our site:

• Requirements:
  • Know right away when problems occur
  • See into the performance of the system
  • See historical trends as we grow the business
  • Super real-time product needs super real-time monitoring
• Not satisfied with existing solutions
  • slow updates (1m or 5m is way too slow - not real-time)
  • not ‘cloud friendly’
  • pain to maintain
  • some are pricey

Page 15: Austin Cloud Users Group - August 23rd, 2011

Not real-time?
Then what *is* real-time?

• Smallest amount of time you can comfortably have poor service before someone notices and changes their behavior.
• Example:
  • Web site can only be slow/unavailable for a few seconds before people leave
  • Email can be slow for tens of seconds before people get grumpy (or less depending on the people!)
  • Twitter - well, we’ll leave that one for you to decide

So, if seconds is the yardstick for measuring poor performance, why do we monitor every 1 or 5 minutes?
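One way to make that question concrete is to ask how likely a periodic check is to land inside a short outage at all. The sketch below is an illustration only: it assumes instantaneous, evenly spaced polls and an outage that starts at a random point in the polling period, and the function name and the 10-second outage are made-up example values, not figures from the talk.

```python
def chance_poll_sees_outage(outage_s, interval_s):
    """Probability that at least one poll falls inside an outage.

    Assumes instantaneous, evenly spaced polls and an outage whose start
    is uniformly random within one polling period (both are assumptions).
    """
    if outage_s < 0 or interval_s <= 0:
        raise ValueError("outage_s must be >= 0 and interval_s must be > 0")
    # An outage at least as long as the interval always contains a poll.
    return min(1.0, outage_s / interval_s)


if __name__ == "__main__":
    outage = 10  # seconds; hypothetical "few seconds" outage
    for label, interval in [("5 min", 300), ("1 min", 60), ("5 s", 5)]:
        chance = chance_poll_sees_outage(outage, interval)
        print(f"poll every {label}: {chance:.1%} chance of seeing it")
```

Under these assumptions a 10-second outage is seen about 3.3% of the time with 5-minute polling, 16.7% with 1-minute polling, and always with 5-second polling.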


Page 16: Austin Cloud Users Group - August 23rd, 2011

CPU Usage: 5min sampling

[Chart: CPU usage (%) from 5:00 PM to 5:05 PM]

Here’s what a 5 minute sample provides
• Doesn’t look like much is happening
• Users should not be complaining right?

Page 17: Austin Cloud Users Group - August 23rd, 2011

CPU Usage: 1min sampling

[Chart: CPU usage (%), 0 to 100, from 5:00 PM to 5:05 PM in 1-minute steps]

Same data - 1 minute sample
• Looks like there was some kind of cpu activity at 5:01pm - 5:02pm
  • Still no issue though - right?

Page 18: Austin Cloud Users Group - August 23rd, 2011

CPU Usage: 5 second sampling

Same data - 5s sampling
• Becomes clear there was something happening:
  • between 5:01:10pm - 5:01:25pm

[Chart: CPU usage (%), 0 to 100, from 5:00 PM to 5:05 PM in 1-minute steps]
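The effect in the three CPU charts can be reproduced with a small simulation. This is a sketch, not the data behind the slides: it assumes each plotted point is the average of per-second readings over its interval, and it invents a 5% baseline and a 100% spike. Only the spike window (5:01:10 to 5:01:25) comes from the slide; the function names are hypothetical.

```python
def cpu_trace(seconds=300, baseline=5.0, spike=100.0, spike_start=70, spike_end=85):
    """One CPU reading per second from 5:00:00 to 5:05:00.

    Offsets are seconds after 5:00:00, so 70-85 is 5:01:10-5:01:25.
    The baseline and spike levels are assumed values.
    """
    return [spike if spike_start <= t < spike_end else baseline
            for t in range(seconds)]


def downsample(trace, window):
    """Average consecutive readings in non-overlapping windows of `window` seconds."""
    averages = []
    for start in range(0, len(trace), window):
        chunk = trace[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages


if __name__ == "__main__":
    trace = cpu_trace()
    for label, window in [("5 min", 300), ("1 min", 60), ("5 s", 5)]:
        peak = max(downsample(trace, window))
        print(f"{label:>5} sampling: peak reading {peak:.2f}%")
```

With these assumed values the highest reading is 9.75% at 5-minute sampling, 28.75% at 1-minute sampling, and 100% at 5-second sampling: the same 15-second event, progressively less diluted by averaging.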


Page 19: Austin Cloud Users Group - August 23rd, 2011

So we rolled our own
RevealCloud

• Turns out a lot of people agreed with us
• Highlights:
  • Built on our super real-time analytics engine
  • Updates in seconds vs minutes
  • Easy to install, no config required
  • Great looking and usable interface
  • Works anywhere (public/private cloud, vm, bare metal)

Page 20: Austin Cloud Users Group - August 23rd, 2011

CopperEgg
Questions

Page 21: Austin Cloud Users Group - August 23rd, 2011

CopperEgg
Demo

Page 22: Austin Cloud Users Group - August 23rd, 2011

Demo Screenshots


Page 23: Austin Cloud Users Group - August 23rd, 2011

Demo Screenshots


Page 24: Austin Cloud Users Group - August 23rd, 2011

Demo Screenshots
