52
Capacity Planning for Web Operations John Allspaw Operations Engineering

Capacity Planning for Web Operations - Web20 Expo 2008

Embed Size (px)

Citation preview

Page 1: Capacity Planning for Web Operations - Web20 Expo 2008

Capacity Planning for Web Operations

John AllspawOperations Engineering

Page 2: Capacity Planning for Web Operations - Web20 Expo 2008

?Do you know how many servers do

you have?

Are you tracking how your servers are

performing?

Do you know how much traffic can

your servers handle?

(without dying)

Are you tracking howyour application is being

used ?

Page 3: Capacity Planning for Web Operations - Web20 Expo 2008

architecture

procurement

capex

deployment

monitoring

forecasting

testing

metrics

product planning

Page 4: Capacity Planning for Web Operations - Web20 Expo 2008

architecture

procurement

capex

deploymentmonitoring

forecasting

testing

metrics

product planning

go see Adam Jacob’s talk!

Page 5: Capacity Planning for Web Operations - Web20 Expo 2008

traditional capacity planning

Page 6: Capacity Planning for Web Operations - Web20 Expo 2008

capacity planning for web

Page 7: Capacity Planning for Web Operations - Web20 Expo 2008

* and network, datacenter space, power, etc.

Why capacity planning is important

(Cloudware costs $$, too)

Having too little is bad (!@#!!) -> ($$$)

Having too much is bad ($$$$!)

Hardware* costs $$

Page 8: Capacity Planning for Web Operations - Web20 Expo 2008

Growth

“Normal”

“Instantaneous”

projectedplannedexpectedhoped for

spikesunexpectedexternal eventsdigg, etc.

(yay!)

(yay?)

(omg! wtf!)

Page 9: Capacity Planning for Web Operations - Web20 Expo 2008

“Normal” growth at Flickrin a year....

4x increase in photo requests/sec

2.2x increase in uploads/day

3x increase in database queries/sec

Page 10: Capacity Planning for Web Operations - Web20 Expo 2008

“Instantaneous”

Yahoo! FrontPage linkXMas lights

Page 11: Capacity Planning for Web Operations - Web20 Expo 2008

“Instantaneous” coping

- Disabling “heavier” features on the site

- Cache aggressively or serve stale data

- Bake dynamic pages into static ones

Page 12: Capacity Planning for Web Operations - Web20 Expo 2008

capacity != performance

Making something fast doesn’t necessarily make it

lastPerformance tuning = good, just

don’t count on it

Accept (for now) the performance you have, not the performance you wished you had, or you think you might have later

Page 13: Capacity Planning for Web Operations - Web20 Expo 2008

Stewart: “Allspaw!!!! OMG!!!”

How many servers willwe need next year?!

(we need to tell finance by 2pm today)

Page 14: Capacity Planning for Web Operations - Web20 Expo 2008

“Ah, just buy twice as much as we need”

2 x (how much we need) = ?

Page 15: Capacity Planning for Web Operations - Web20 Expo 2008
Page 16: Capacity Planning for Web Operations - Web20 Expo 2008

measurement

Page 17: Capacity Planning for Web Operations - Web20 Expo 2008

Good capacity measurement tools can...

Easily measure and record any number that changes over time

Easily make graphs

Easily compare metrics to any other metrics from anywhere else (import/export)

Page 18: Capacity Planning for Web Operations - Web20 Expo 2008

good tools are out there

cacti.net munin.projects.linpro.no

hyperic.com

ganglia.info

Page 19: Capacity Planning for Web Operations - Web20 Expo 2008

good tools are out there

Flickr uses

cacti.net munin.projects.linpro.no

hyperic.com

ganglia.info

Page 20: Capacity Planning for Web Operations - Web20 Expo 2008

application metrics

photouploads

viaemailper

minute

hour

Page 21: Capacity Planning for Web Operations - Web20 Expo 2008

your stuff, not just system stuff

photos uploaded (and processed) per minuteaverage photo processing time per minute

average photo sizedisk space consumed per day

user registrations per dayetc etc etc

Page 22: Capacity Planning for Web Operations - Web20 Expo 2008

your stuff, not just system stuff

photos uploaded (and processed) per minuteaverage photo processing time per minute

average photo sizedisk space consumed per day

user registrations per dayetc etc etc

Page 23: Capacity Planning for Web Operations - Web20 Expo 2008

Tie application metrics to system metrics

Pretty!! But what doesthis mean?

Page 24: Capacity Planning for Web Operations - Web20 Expo 2008

It means we can process

~120 images per minute

...and we can process them in

~3.5 seconds(on average)

It means that with about60% total CPU...

Page 25: Capacity Planning for Web Operations - Web20 Expo 2008

Benchmarking

Great for comparing hardware platforms and configurations

BUTDoesn’t represent real workloads

Page 26: Capacity Planning for Web Operations - Web20 Expo 2008

not exactly like a bike messenger

Page 27: Capacity Planning for Web Operations - Web20 Expo 2008

Finding your ceilings

• Use real data, from production servers (if at all possible)

• No, really

Page 28: Capacity Planning for Web Operations - Web20 Expo 2008

How many webservers can fail before we’re screwed?

How much traffic can each webserver take before it dies?

When should I add more webservers?

Page 29: Capacity Planning for Web Operations - Web20 Expo 2008

people

webservers

(databases, etc.)

LBs

Page 30: Capacity Planning for Web Operations - Web20 Expo 2008

people

webservers

(databases, etc.)

LBs

network

cpumemory

disk usage

disk i/o

Page 31: Capacity Planning for Web Operations - Web20 Expo 2008

people

webservers

(databases, etc.)

LBs

network

cpumemory

disk usage

disk i/o

- comments/min- photos/min- videos/min- kittens/min- etc etc etc/min

Page 32: Capacity Planning for Web Operations - Web20 Expo 2008

people

webservers

(databases, etc.)

LBs

Page 33: Capacity Planning for Web Operations - Web20 Expo 2008

people

webservers

(databases, etc.)

LBs

Page 34: Capacity Planning for Web Operations - Web20 Expo 2008

people

webservers

(databases, etc.)

LBs

Page 35: Capacity Planning for Web Operations - Web20 Expo 2008

Ceiling = upper limit of “work” (and resources)

what happens here?

Page 36: Capacity Planning for Web Operations - Web20 Expo 2008

Trends of peaks

Time

Page 37: Capacity Planning for Web Operations - Web20 Expo 2008

Benchmarking

Might be your only option if you have a single server.

Siege http://www.joedog.org/JoeDog/Siege

httperf/autobench http://www.hpl.hp.com/research/linux/httperf/

http://www.xenoclast.org/autobench

some good benchmarking tools:

sysbench http://sysbench.sf.net

Page 38: Capacity Planning for Web Operations - Web20 Expo 2008

Economics

Page 39: Capacity Planning for Web Operations - Web20 Expo 2008

Time makes everything cheaper

(the Moore’s Law thing)

you don’t have a lot of time to wait around, do you?

BUT

Page 40: Capacity Planning for Web Operations - Web20 Expo 2008

Vertical scaling

Page 41: Capacity Planning for Web Operations - Web20 Expo 2008

Horizontal architectures

Page 42: Capacity Planning for Web Operations - Web20 Expo 2008

Diagonal scaling

Page 43: Capacity Planning for Web Operations - Web20 Expo 2008

Diagonal scaling

Replacing 67 dual-core webservers with 18 dual quads

Page 44: Capacity Planning for Web Operations - Web20 Expo 2008

Diagonal scaling

more traffic from less machines

Page 45: Capacity Planning for Web Operations - Web20 Expo 2008

Diagonal Scaling

servers CPUsper server

RAMper server

drivesper server

total power (W)@60% peak

67 2 4GB 1x80GB 8763.6

18 8 4GB 1x146GB 2332.8

~70% less power49U less rack space

Page 46: Capacity Planning for Web Operations - Web20 Expo 2008
Page 47: Capacity Planning for Web Operations - Web20 Expo 2008

Disclosure: We don’t use clouds at Flickr.

Utility Computing

(but we know folks who do)

Page 48: Capacity Planning for Web Operations - Web20 Expo 2008

clouds

Still have to pay attention

Many people use the same forecastingmethods

Help with deployment timelinesHelp with procurement timelines

BUT

Page 49: Capacity Planning for Web Operations - Web20 Expo 2008

Use Common Sense(tm)

Measure constantly, adapt constantly

Don’t pretend to know the exact future

Pay attention to the right metrics

Complex simulation and modeling is rarely worth it

Don’t expect tuning and tweaking will ever winyou any excess capacity

Page 50: Capacity Planning for Web Operations - Web20 Expo 2008

Some more statsServing 32,000 photos per second at peak

Consuming 6-8TB per day

Consumed >34TB per day during Y!Photos migration

~3M uploads per day, 60 per second at peak

Page 51: Capacity Planning for Web Operations - Web20 Expo 2008

20% off discount: “vel08js”

June 23-24, 2008

Page 52: Capacity Planning for Web Operations - Web20 Expo 2008

We Are Hiring!(DBA, engineers)

http://flickr.com/photos/85013738@N00/542591121/http://flickr.com/photos/formatc1/2301500208/http://flickr.com/photos/mikefats/11546240/http://flickr.com/photos/kanaka/491064256/http://flickr.com/photos/randysonofrobert/1035003071/http://flickr.com/photos/halcyonsnow/446166047/http://flickr.com/photos/wwworks/2313927146/http://flickr.com/photos/sunxez/1392677065/http://flickr.com/photos/spacesuitcatalyst/536389937/http://flickr.com/photos/theklan/1276710183/