Image: xkcd.com
Dependable Cloud Architecture
@mikewo
Mike Wood
http://mvwood.com
“Failure is always an option.”
Image: Discovery Channel, Fair Use
What are we looking for?
Protection From:
• Hardware Failure
• Data Corruption
• Network Failure
• Loss of Facilities
Check out: http://bit.ly/wazbizcont
Images: Office ClipArt & Godzilla Releasing Corp (Fair Use)
Image: FOX, Fair Use
Human Error
What we’re trying to achieve
1. Monitoring
2. Resilient Solutions
Image: Cohdra
Image: Office ClipArt
Cost vs Risk
99.999%    $1, … ,000.00
To get more 9’s here add more 0’s here.
Image: NASA
Monitoring
Functional Transparency
Image: Office ClipArt
Logging Messages
Hardware Health
Dependent Services Health
Telemetry
Image: NASA
Analyze your Data
Resilience
Image: Office ClipArt
Remember: Failure is always an option.
Common Points of Failure
• Machine/application crashes
• Throttling (exceeding capacity)
• Connectivity/Network
• External service dependencies
Focus less on the uptime of hardware and more on how the solution handles it WHEN something fails!
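Try/catch != Resilient
using System.Diagnostics;
using System.IO;

private void createFile()
{
    string fileName = @"c:\workingDirectory\someFileName.txt";
    try
    {
        // File.Create returns an open FileStream; dispose it to release the handle.
        using (File.Create(fileName)) { }
    }
    catch (DirectoryNotFoundException ex)
    {
        // Logging and rethrowing records the failure but does not recover from it.
        Trace.WriteLine(String.Format("Unable to create {0}. {1}", fileName, ex));
        throw;
    }
}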
Image: Michael Wood
Decompose your system…
Capacity Buffering
• Content Delivery Networks (CDNs)
• Distributed Application Cache
• Local Content Cache
Enables recovery during outages or spikes in load
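One way a local content cache can help during an outage is to keep the last successful result and serve it when the live source fails. The sketch below is only an illustration under stated assumptions: `FallbackCache` and `fetchLive` are hypothetical names, every exception is treated as an outage, and the class is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: serves the last good copy when the live source fails.
public class FallbackCache<TValue>
{
    private readonly Dictionary<string, TValue> lastGood = new Dictionary<string, TValue>();
    private readonly Func<string, TValue> fetchLive;

    public FallbackCache(Func<string, TValue> fetchLive)
    {
        this.fetchLive = fetchLive;
    }

    public TValue Get(string key)
    {
        try
        {
            TValue fresh = fetchLive(key);
            lastGood[key] = fresh;   // remember the latest successful result
            return fresh;
        }
        catch (Exception)
        {
            TValue cached;
            if (lastGood.TryGetValue(key, out cached))
            {
                return cached;       // degrade to stale content instead of failing
            }
            throw;                   // nothing cached: surface the original failure
        }
    }
}
```

A real cache would also need expiry and size limits; this version trades freshness for availability and nothing more.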
Image: jepler
Always carry a spare
75% Capacity, half of our load
75% Capacity, half of our load
50% more capacity than needed
• Can absorb temporary spikes
• Time to react if you need to add capacity
100% of load, 150% Capacity
0% Capacity, redirect all load
Over allocated, but still functioning
• Degrade, but don’t fail
SYSTEM FAILURE!!!
Image: Kevin Rosseel
Request Buffering
Image: Joe Shlabotnik
• Queues
• Retry Policies
• Async Workloads
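As an illustration of a retry policy, here is a minimal sketch of retrying with exponential backoff. The names `Retry`, `Execute`, `maxAttempts`, and `initialDelay` are hypothetical, and the sketch assumes every exception is transient; a real policy would retry only on errors known to be temporary (timeouts, throttling) and would usually cap the delay.

```csharp
using System;
using System.Threading;

public static class Retry
{
    // Runs `operation`, making at most `maxAttempts` attempts in total.
    // The wait doubles after each failed attempt (exponential backoff).
    public static T Execute<T>(Func<T> operation, int maxAttempts, TimeSpan initialDelay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Assumption: any exception is treated as transient.
                // On the final attempt the filter is false and the exception propagates.
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }
}
```

Usage might look like `Retry.Execute(() => CallService(), maxAttempts: 5, initialDelay: TimeSpan.FromMilliseconds(200))`, where `CallService` stands in for whatever dependency is being called.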
Dept. of Redundancy Dept.
Have a backup, somewhere else
More than one? Cost to benefit Ratio?
Ready State
• Hot = full capacity
• Warm = scaled down, but ready to grow
• Cold = mothballed, starts from zero
Image: Mr. White
Redundancy - It’s about probability
Each box: 95% uptime
1 box: 5% downtime, or 438 hrs per year (that’s 18 ¼ days!)
2 boxes: 5/100 * 5/100 = 25/10,000 = 0.25% downtime, or 22 hrs per year
4 boxes: 5/100 * 5/100 * 5/100 * 5/100 = 625/100,000,000 = 0.000625% downtime, or 3.285 MINUTES per year
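The arithmetic on this slide follows one general pattern, sketched here under two assumptions that the slide does not state: the boxes fail independently of each other, and a year has 8,760 hours. With per-box downtime fraction d and n redundant boxes, the system is down only when all n are down at once:

```latex
\[
D_n = d^{\,n}, \qquad T_n = 8760\ \text{h} \times d^{\,n}
\]
\[
d = 0.05:\quad T_1 = 438\ \text{h},\quad T_2 = 21.9\ \text{h},\quad
T_4 = 0.05475\ \text{h} \approx 3.285\ \text{min}
\]
```

Shared dependencies (the same power feed, network, or bad deployment) break the independence assumption, so real-world figures are likely to be worse than these.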
Total Outage duration =
  Time to Detect
+ Time to Diagnose
+ Time to Decide
+ Time to Act
Image: Office ClipArt
Dynamic Addressing & Configuration
What about your data?
Image: barrymieny
Availability via Degradation
Image: Michael Wood
Images: Gizmodo
Virtualization and Automation
Images: Orion Pictures owns Terminator Franchise
The “HI” Point
Check out: http://bit.ly/wazinternals
Images: Office Clip Art
Image: NASA
“Don't be too proud of this technological terror you've constructed…”
ADMIT:
• Your Solution WILL fail at some point
• You can learn from others just as well as from yourself
DO:
• Root cause analysis
• Read others’ root cause analyses
• Plan for failure
DON’T:
• Get cocky
• Stick your head in the sand
Images: LucasFilm, Fair Use
Questions
@mikewo
Mike Wood
http://mvwood.com
http://bit.ly/CloudFailSafe
Thanks