CS 603 Failure Models

CS 603Failure Models

April 12, 2002

Fault Tolerance inDistributed Systems

• Perfect world: No Failures– We don’t live in a perfect world

• Non-distributed systemCrash, you’re dead

• Distributed system: Redundancy– Should result in less down time– But does it?– Distributed systems according to Butler LampsonA distributed system is a system in which I can’t get my

work done because a computer that I’ve never heard of has failed.

Analogy:Single vs. Multi-Engine Airplanes

• If the engine quits on a single-engine airplane, you will land soon

• In a multi-engine airplane, you can still fly

• Fatal accidents per 100,000 hours flown (1997, private aircraft):– Single-engine: 1.47– Multi-engine: 1.92(Airlines: .038)

• Why is multi-engine more dangerous?

Problems with Distributed Systems

• Failure more frequent– Many places to fail– More complex

• More pieces• New challenges: Asynchrony, communication

• Potential problems of failure greater– Single system – everything stops– Distributed – some parts continue

Consistency!

First Step: Goals

• Availability– Can I use it now?

• Reliability– Will it be up as long as I need it?

• Safety– If it fails, what are the consequences?

• Maintainability– How easy is it to fix if it breaks?

Next Step: Failure Models

• Failure: System doesn’t give desired behavior– Component-level failure (can compensate)– System-level failure (incorrect result)

• Fault: Cause of failure (component-level)– Transient: Not repeatable– Intermittent: Repeats, but (apparently) independent

of system operations– Permanent: Exists until component repaired

• Failure Model: How the system behaves when it doesn’t behave properly

Failure Model(Flaviu Cristian, 1991)

• Dependency– Proper operation of Database depends on proper

operation of processor, disk

• Failure Classification– Type of response to failure

• Failure semantics– State of system after given class of failure

• Failure masking– High-level operation succeeds even if they depend on

failed services

Failure Classification

• Correct– In response to inputs, behaves in a manner consistent with the

service specification

• Omission Failure– Doesn’t respond to input

• Crash: After first omission failure, subsequent requests result in omission failure

• Timing failure (early, late)– Correct response, but outside required time window

• Response failure– Value: Wrong output for inputs– State Transition: Server ends in wrong state

Crash Failure types(based on recovery behavior)

• Amnesia– Server recovers to predefined state independent of

operations before crash

• Partial amnesia– Some part of state is as before crash, rest to

predefined state

• Pause– Recovers to state before omission failure

• Halting– Never restarts

Failure Semantics

• Max delay on link: d; Max service time: p– Should get response in 2d+p

• Assume omission failure only– If no response in 2d+p, resend request

• What if performance failure possible?– Must distinguish between response to sr and sr’

sr f(sr)sr’ f(sr’)

Failure Semantics

• Specification for service must include– Failure-free (normal) semantics– Failure semantics (likely failure behaviors)

• Multiple semantics– Combine to give (weaker) semantics– Arbitrary failure semantics: Weakest possible

• Choice of failure semantics– Is class of failure likely?

• Probability of type of failure

– What is the cost of failure• Catastrophic?

Hierarchical Failure Masking

• Hierarchical failure masking– Dependency: Higher level gets (at best)

failure semantics of lower level– Can compensate for lower level failure to

improve this

• Example: TCP fixes communication errors, so some failure semantics not propagated to higher level

Group Failure Masking

• Redundant servers– Failed server masked by others in group– Allows failure semantics of group to be higher than

individuals

• k-fault tolerant– Group can mask k concurrent group member failures

from client

• May “upgrade” failure semantics– Example: Group detects non-responsive server,

other member picks up the slackOmission failure becomes performance failure

Additional Issues

• Tradeoff of failure probabilities / semantics– Which is better?

• P[Catastophic] = 0.01 & P[Performance] = 0.5• P[Catastrophic] = 0.1 & P[Performance] = 0.01

• Techniques for managing failure

• Recovery mechanisms

CS 603 Failure Models

Documents

Campus 603

CS 602, 620, 603, 630, 820, 830, 920, 930 - Wap Nilfisk Alto Shop · 2016. 2. 8. · Title: CS 602, 620, 603, 630, 820, 830, 920, 930 Author: zellergu Subject: 15035 - 140292 Created

FINANCIAL STATEMENTS +603 9779 1700 603 9779 1701 / 1702

inSTruCTiOn manuaL CHain Saw CS-620P - ECHO-USA€¦ · inSTruCTiOn manuaL CHain Saw CS-620P WARNING Read the instructions carefully and follow the rules for safe operation. Failure

EFFECTIVELY SUPPORT RISK-BASED PRIORITIZATIONsites.nationalacademies.org/cs/groups/depssite/... · Component Failure Component Risk Rating Component Failure Knowledge-based Assessment

CS 245Notes 081 CS 245: Database System Principles Notes 08: Failure Recovery Hector Garcia-Molina

Security in Networks Single point of failure Resillence or fault tolerance CS model

L'informador 603

Heart Failure Toolkit - Dartmouth-Hitchcock Failure Team (603) 650-5724. Chapter 1 Managing ... HEART FAILURE TOOLKIT. dartmouth-hitchcock.org 11 Change Eating Habits to Lower Sodium

Database System Principlesfolk.uio.no/inf212/handouts/uke18_1.pdf · 1 CS 245 Notes 08 1 Database System Principles Failure Recovery Hector Garcia-Molina CS 245 Notes 08 2 Integrity

Management of pv cs and ventricular tachycardia in advanced heart failure

Manual UVR1611 A4.03-2 CS...UVR 1611 Verze A4.03 CS Manual verze 1 Hotline Sunpower tel.: 603 516 197 ; e-mail: office@sunpower.cz ; fax: 384 388 167 Volně programovatelná univerzální

CS 277 – Spring 2002Notes 081 CS 277: Database System Implementation Notes 08: Failure Recovery Arthur Keller

Batalla 603

CS 5523 Operating Systems: Fault Tolerancetongpingliu/teaching/cs5523/handouts/lecture-12... · CS 5523 Operating Systems: Fault Tolerance ... Arbitrary/Byzantine failure: ! ... Snapshot

CS 603 Mid-Semester Review

CS II Project - Kingfisher Airlines Failure - Strategy Analysis

EM TS CS ppt Template 16x9 English - … Power Grids . Cost efficiency . ... potential failure and allow the customers to act proactively. ... EM TS CS ppt Template 16x9 English

CS 603: Programming Languages

Треугольник №603