
Page 1: Image: xkcd.com

Image: xkcd.com

Dependable Cloud Architecture

Mike Wood (@mikewo)

http://mvwood.com

Page 2:

Questions

@mikewo

Mike Wood

http://mvwood.com

Thanks

Page 3:

“Failure is always an option.”

Image: Discovery Channel, Fair Use

Page 4:

What are we looking for?

Protection From:
• Hardware Failure
• Data Corruption
• Network Failure
• Loss of Facilities

Check out: http://bit.ly/wazbizcont

Images: Office ClipArt & Godzilla Releasing Corp (Fair Use)

Page 5:

Image: FOX, Fair Use

Human Error

Page 6:

What we’re trying to achieve

1. Monitoring
2. Resilient Solutions

Image: Cohdra

Page 7:

Cost vs Risk

99.999%    $1, … ,000.00

To get more 9’s here, add more 0’s here.

Image: Office ClipArt
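To make the cost-vs-risk trade-off concrete, here is a minimal Python sketch that converts an availability target into the downtime it allows per year. It assumes a 365-day year, and the function name `allowed_downtime_minutes` is hypothetical. Each extra 9 cuts the allowed downtime by a factor of ten.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year (525,600 minutes)


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target such as 99.999."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print the downtime budget for each additional 9.
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

With these inputs, 99.999% leaves roughly 5.26 minutes of downtime per year. That is the budget the extra zeros on the price tag are buying.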

Page 8:

Image: NASA

Monitoring

Page 9:

Functional Transparency

• Logging Messages
• Hardware Health
• Dependent Services Health

Image: Office ClipArt
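One way to get this kind of transparency is a health report that combines an instance's own checks with checks on the services it depends on. Below is a minimal Python sketch. It assumes each check is a zero-argument callable that returns True when healthy. `HealthReport`, `run_checks`, and the three status names are hypothetical, not part of any specific monitoring product.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class HealthReport:
    status: str  # hypothetical states: "healthy", "degraded", or "unhealthy"
    details: Dict[str, bool] = field(default_factory=dict)


def run_checks(
    own_checks: Dict[str, Callable[[], bool]],
    dependency_checks: Dict[str, Callable[[], bool]],
) -> HealthReport:
    """Run every check; a check that raises counts as failed instead of crashing the report."""
    details: Dict[str, bool] = {}
    for name, check in {**own_checks, **dependency_checks}.items():
        try:
            details[name] = bool(check())
        except Exception:
            details[name] = False
    own_ok = all(details[name] for name in own_checks)
    deps_ok = all(details[name] for name in dependency_checks)
    if not own_ok:
        status = "unhealthy"  # this instance itself is failing
    elif not deps_ok:
        status = "degraded"  # we are up, but something we depend on is not
    else:
        status = "healthy"
    return HealthReport(status, details)
```

Reporting "degraded" separately from "unhealthy" lets the telemetry distinguish "this box is broken" from "a service this box needs is broken." Those two cases usually call for different responses.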

Page 10:

Telemetry

Page 11:

Image: NASA

Analyze your Data

Page 12:

Resilience

Image: Office ClipArt

Page 13:

Common Points of Failure

Remember: Failure is always an option.

• Machine/application crashes
• Throttling (exceeding capacity)
• Connectivity/network
• External service dependencies

Focus less on the uptime of the hardware and more on how the solution handles it WHEN something fails!

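Page 14:

Try/catch != Resilient

using System;
using System.Diagnostics;
using System.IO;

// Hypothetical wrapper class so the slide's method compiles on its own.
class FileWorker
{
    private void createFile()
    {
        string fileName = @"c:\workingDirectory\someFileName.txt";

        try
        {
            // File.Create returns an open FileStream; dispose it so the handle is released.
            using (File.Create(fileName)) { }
        }
        catch (DirectoryNotFoundException ex)
        {
            // Logging and rethrowing records the failure, but nothing here recovers from it.
            Trace.WriteLine(String.Format("Unable to create {0}. {1}", fileName, ex));
            throw;
        }
    }
}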

Page 15:

Image: Michael Wood

Decompose your system…

Page 16:

Capacity Buffering

• Content Delivery Networks (CDNs)
• Distributed Application Cache
• Local Content Cache

Enables recovery during outages or spikes in load

Image: jepler
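A local content cache helps a site ride out a short outage of its source if it may return a stale copy when a refresh fails. Below is a minimal Python sketch of that idea. `StaleOnErrorCache`, its `ttl_seconds` parameter, and the injectable clock are hypothetical names for illustration, not an API from the talk.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class StaleOnErrorCache:
    """Local cache that refreshes entries after a TTL but falls back to the stale copy on error."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the behaviour can be tested without waiting
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = self.clock()
        cached: Optional[Tuple[float, Any]] = self._entries.get(key)
        if cached is not None and now - cached[0] < self.ttl_seconds:
            return cached[1]  # fresh hit: the source is not called at all
        try:
            value = fetch()
        except Exception:
            if cached is not None:
                return cached[1]  # source is down: serve the stale copy instead of failing
            raise  # nothing cached yet, so there is nothing to fall back to
        self._entries[key] = (now, value)
        return value
```

Stale entries are served but not re-stamped. The next successful fetch therefore refreshes them as soon as the source recovers.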

Page 17:

Always carry a spare

75% Capacity, half of our load | 75% Capacity, half of our load
100% of load, 150% Capacity
• 50% more capacity than needed
• Can absorb temporary spikes
• Time to react if we need to add capacity

0% Capacity, redirect all load
• Over-allocated, but still functioning
• Degrade, but don’t fail

SYSTEM FAILURE!!!

Image: Kevin Rosseel
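The arithmetic behind this slide can be checked directly. Below is a minimal Python sketch. It assumes load spreads evenly over the healthy instances and that each instance is sized at 75% of peak load, as pictured. `per_instance_utilization` is a hypothetical helper.

```python
def per_instance_utilization(load: float, capacity_per_instance: float, healthy_instances: int) -> float:
    """Fraction of each healthy instance's capacity in use when load is spread evenly."""
    if healthy_instances == 0:
        return float("inf")  # no capacity left at all: total outage
    return load / (capacity_per_instance * healthy_instances)


if __name__ == "__main__":
    peak_load = 1.0  # 100% of our load
    per_box = 0.75  # each instance sized at 75% of peak, as on the slide
    for healthy in (2, 1, 0):
        print(healthy, "healthy:", per_instance_utilization(peak_load, per_box, healthy))
```

With two healthy instances, each runs at about 67% of its own capacity. With one, the survivor is asked for about 133%, which is why it has to degrade rather than fail. With none, there is nothing left to absorb the load.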

Page 18:

Request Buffering

• Queues
• Retry Policies
• Async Workloads

Image: Joe Shlabotnik
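A retry policy absorbs transient failures by trying again a few times with growing delays before surfacing the error. Below is a minimal Python sketch of capped exponential backoff with full jitter. `retry_with_backoff` and its default values are hypothetical choices, not a specific library's policy. Which exceptions count as transient is also an assumption you would tune for your own dependencies.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient failures with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return operation()
        except retry_on:
            if attempt >= max_attempts:
                raise  # give up and let the caller decide what to do next
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) but never exceeds max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time in [0, delay] so many clients don't retry in lockstep.
            sleep(random.uniform(0, delay))
            attempt += 1
```

Exceptions not listed in `retry_on` propagate immediately, because retrying a non-transient error only adds load. The `sleep` parameter is injectable so the policy can be exercised in tests without real waiting.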

Page 19:

Dept. of Redundancy Dept.

Have a backup, somewhere else
More than one? Cost-to-benefit ratio?

Ready State
• Hot = full capacity
• Warm = scaled down, but ready to grow
• Cold = mothballed, starts from zero

Image: Mr. White

Page 20:

Redundancy: It’s about probability

95% uptime | 95% uptime | 95% uptime | 95% uptime

1 box: 5% downtime, or 438 hrs per year (that’s 18¼ days!)

2 boxes: 5/100 * 5/100 = 25/10,000 = 0.25% downtime, or 22 hrs per year

4 boxes: 5/100 * 5/100 * 5/100 * 5/100 = 625/100,000,000 = 0.000625% downtime, or 3.285 MINUTES per year
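The arithmetic above multiplies the per-box downtime. That is only valid if the boxes fail independently of one another. Boxes that share power, a network, or a facility can fail together, which is one reason to keep a backup somewhere else. Below is a minimal Python sketch under that independence assumption, using a 365-day year. `combined_downtime_fraction` is a hypothetical name.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a 365-day year


def combined_downtime_fraction(box_uptime: float, boxes: int) -> float:
    """Probability that every box is down at the same time, assuming independent failures."""
    return (1 - box_uptime) ** boxes


if __name__ == "__main__":
    for boxes in (1, 2, 4):
        down = combined_downtime_fraction(0.95, boxes)
        print(f"{boxes} box(es): {down:.6%} downtime, {down * HOURS_PER_YEAR:.3f} hours/year")
```

For four boxes, the result is 0.05475 hours per year, which is the 3.285 minutes on the slide.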

Page 21:

Total Outage duration =
  Time to Detect
+ Time to Diagnose
+ Time to Decide
+ Time to Act

Image: Office ClipArt

Page 22:

Dynamic Addressing & Configuration

Page 23:

What about your data?

Image: barrymieny

Page 24:

Availability via Degradation

Image: Michael Wood
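Availability via degradation means the essential path keeps working while optional features are switched off when something they depend on fails. Below is a minimal Python sketch. The product page, `get_product`, and `get_recommendations` are hypothetical stand-ins for whatever is essential versus optional in your own system.

```python
from typing import Callable, Dict, List


def render_product_page(
    product_id: str,
    get_product: Callable[[str], Dict[str, str]],
    get_recommendations: Callable[[str], List[str]],
) -> Dict[str, object]:
    """Build a page where the product data is required and the recommendations are optional."""
    # Essential: if this fails there is no useful page, so let the error propagate.
    page: Dict[str, object] = {"product": get_product(product_id)}
    try:
        page["recommendations"] = get_recommendations(product_id)
    except Exception:
        # Degrade instead of failing: drop the optional feature and flag it for telemetry.
        page["recommendations"] = []
        page["degraded"] = True
    return page
```

The `degraded` flag matters as much as the fallback itself. Without it, monitoring would show a healthy page while a dependency is quietly down.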

Page 25:

Images: Gizmodo

Virtualization and Automation

Page 26:

Images: Orion Pictures owns Terminator Franchise

Page 27:

The “HI” Point

Check out: http://bit.ly/wazinternals

Images: Office Clip Art

Page 28:

Image: NASA

Page 29:

“Don't be too proud of this technological terror you've constructed…”

ADMIT:
• Your solution WILL fail at some point
• You can learn from others just as well as from yourself

DO:
• Root cause analysis
• Read other people’s root cause analyses
• Plan for failure

DON’T:
• Get cocky
• Stick your head in the sand

Images: LucasFilm, Fair Use

Page 30:

Questions

@mikewo

Mike Wood

http://mvwood.com

http://bit.ly/CloudFailSafe

Thanks