Seminarie Informatica Fault-tolerant Systems: The Software Viewpoint A series of seminars coordinated by Vincenzo De Florio

Seminarie Informatica

Fault-tolerant Systems: The Software Viewpoint

A series of seminars coordinated by Vincenzo De Florio

http://www.pats.ua.ac.be

25 October 2006

Seminarie Informatica - Lecture 1

The matter

• The exam

• The topics

• This lecture: application-level fault tolerance provisions


Introduction to the exam

• Seminarie Informatica
  - 10 seminars on hot topics in computer science
  - Topic of this cycle: software fault-tolerant systems
  - Next 3 seminars: 15 and 22 November; 6 December
  - Next year's seminars: to be announced at http://www.win.ua.ac.be/~vincenz/si/0607.html


Introduction to the exam

• Oral discussion of 2 papers
  - A 5–6 page paper based on one or more of the seminar topics
  - A paper analysing a case study (see later for examples)
• Evaluation criteria:
  - Do the papers contain original ideas?
  - Do they follow the seminar «too strictly»?
  - Does the author understand the subject?
  - Is (s)he able to reason independently about the subject?
• Papers must be submitted by May 15, 2007, by e-mail to [email protected]


The Topics

Dependability = the property of a system such that reliance can justifiably be placed on the service it delivers

Fault tolerance = one of the means of dependability


The Dependability Tree



Fault tolerance (FT)

A fault-tolerant system is a system that continues to function in spite of faults

Examples of faults and their origin:
• Defective IC (hardware)
• Bug in program (software)
• Operation fault (operator)
• Sensor drift (I/O)


Attributes of dependability

• Availability: readiness for usage
  - A(t) = probability that the system conforms to its specification at time t
• Reliability: continuity of service
  - R(t) = probability that the system conforms to its specification throughout [t0, t], given that it does so at t0
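As a worked illustration of A(t) and R(t) (an assumption added here, not a model stated in the slides): if failures occur at a constant rate λ and a repair takes MTTR on average, the usual textbook forms are

```latex
R(t) = e^{-\lambda (t - t_0)}, \qquad
\mathrm{MTTF} = \int_{t_0}^{\infty} R(t)\,dt = \frac{1}{\lambda}, \qquad
A_{\infty} = \lim_{t \to \infty} A(t) = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
```

For example, with the hypothetical values λ = 10^-4 per hour and MTTR = 10 hours, R after 1000 hours is e^(-0.1) ≈ 0.905, MTTF is 10 000 hours, and the steady-state availability is 10000/10010 ≈ 0.999.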



Attributes of dependability (2)

• Safety: non-occurrence of catastrophic consequences on the environment
  - S(t) = probability that the system either conforms to its specification or reaches a safe halt at time t
  - Fail-safe systems



• Maintainability: aptitude to undergo repairs and evolution
  - M(t) = probability that the system is back to its specification at time t, given that it failed at t0



• Confidentiality: non-occurrence of unauthorised disclosure of information
• Integrity: non-occurrence of improper alterations of information


Related attributes

• Testability: ability to test features of a system
  - Related to maintainability


Related attributes

• Security = integrity + availability + confidentiality


References

• Jean-Claude Laprie, “Dependable Computing and Fault Tolerance: Concepts and Terminology”, in Proc. of the 15th Int. Symposium on Fault-Tolerant Computing (FTCS-15), Ann Arbor, Mich., June 1985, pp. 2-11.

• Jean-Claude Laprie, “Dependability - Its Attributes, Impairments and Means”, in Predictably Dependable Computing Systems, ESPRIT Basic Research Series, B. Randell, J.-C. Laprie, H. Kopetz and B. Littlewood (eds.), Springer Verlag, 1995, pp. 3-18.


The lecture

• We now focus on application-level fault tolerance (ALFT)

• Why do we need ALFT? Why do we need software FT in the first place?

• We explain why

• We survey the existing methods and assess their pros and cons against a set of properties

• Surprising conclusion: still an open problem



Software Fault Tolerance

• Human society increasingly expects, and relies on, high-quality complex services supplied by computers



• Consequences of a failure in the 1940s (computers as fast solvers of numerical problems): errors in computations, long downtimes
• Consequences of a failure nowadays (computers controlling nuclear plants, airborne equipment, healthcare…): incalculable penalty (catastrophes)

[Figure label: performance & ease of use]



• Traditional answer: Hardware Fault Tolerance

• This is an important ingredient, but not the only one needed today!

[Figure: system layers HW, OS, MW, APPLICATION; label: SW]

• Complexity is also in the SW layers:
  1. Hierarchies of complex abstract machines



• Complexity is also in the SW layers (cont'd):
  2. Software is often networked and distributed
  3. Relationships among software components are often complex
  4. Object model: easier SW reuse; hidden + explicit complexity



• In conclusion: “No amount of verification, validation and testing can eliminate all faults in an application and give complete confidence in the availability and data consistency of applications”

Fault tolerance in SW is key

SW failures can have consequences as severe as those of HW failures (Ariane 5)!


Problems of SW FT

[Figure: color-shaded layers HW, OS, HL RUN-TIME, APPLICATION]

• The lighter the color, the more general-purpose the (virtual) machine
• The lighter the color, the more complex the problem of expressing fault tolerance


Problems of Application-level Fault Tolerance

• “The only alternative and effective means for increasing software reliability is that of incorporating in the application software provisions for SFT”

• The application software has to manage, at the same time and in the same space:
  - Functional aspects
  - Fault tolerance (FT) aspects


Problems and properties of Application-level Fault Tolerance

• Hazard: code intrusion
  - FT provisions are specified side by side with the service
  - Conflicting design concerns
  - Overall design complexity increases
  - Larger development and maintenance costs and times
  - Larger probability of introducing software bugs



• Separation of design concerns (SDC)
  - In what follows, an “ALFT” is a means to express fault tolerance in the application software
  - One criterion for comparing ALFTs is their degree of SDC



• Hazard: porting code ≠ porting service
  - FT code assumes a fault model that depends on the environment e: fault model = f(e)
  - The QoS may degrade:
    1. If e changes, or
    2. If the code is moved to another environment e’



• Hazard: porting code ≠ porting service

• An interesting case: Ariane 5 flight 501
  - Ariane 4 mission software was re-used in Ariane 5
  - The early part of the trajectory of Ariane 5 differed from that of Ariane 4 and resulted in considerably higher horizontal velocity values
  - …370 million euros down the drain

This could be a case study for the exam
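The public inquiry report on flight 501 traced the failure to an unprotected conversion of a 64-bit floating-point value, related to horizontal velocity, into a 16-bit signed integer. A minimal sketch of a range-checked conversion is given below; the function name and its interface are hypothetical and are not taken from the Ariane software.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the slides or from the flight software):
 * convert a 64-bit float to a 16-bit signed integer, reporting NaN or
 * out-of-range inputs instead of overflowing silently. */
bool to_int16_checked(double value, int16_t *out)
{
    assert(out != NULL);
    if (isnan(value) || value < (double)INT16_MIN || value > (double)INT16_MAX)
        return false;       /* the input violates the assumed fault model */
    *out = (int16_t)value;  /* in range: the cast truncates toward zero */
    return true;
}
```

The check only detects the violation; what to do next (saturate, switch to a backup, halt safely) is a separate design decision that depends on the environment.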




2. Problem: service portability
  - Porting FT does not come for free
  - “Hardwired” fault model = static environment
  - More difficult to adapt / test / maintain
  - More prone to Ariane 5-like effects

“What is the most often overlooked risk in software engineering? That the environment will do something the designer never anticipated” [J. Horning]


Problems and properties of Application-level Fault Tolerance

• Adaptability (AD)
  - Does the ALFT provide means to adapt, dynamically, to new environmental conditions?
  - One criterion for comparing two ALFTs is their degree of AD



3. Problem: adding complexity can decrease dependability
  - The ALFT (the means to express FT) must be based on a simple strategy
  - It must be syntactically adequate to host several mechanisms


Problems and properties of Application-level Fault Tolerance

• Hazard:
  - “Languages shape the way we think…” [Whorf]
  - “If all you have is a hammer, everything looks like a nail” [/usr/share/fortune]
  - …but is it really a nail?
• Syntactical Adequacy (SA)
  - Does the ALFT provide simple means to host many FT solutions?
  - One criterion for comparing two ALFTs is their degree of SA


Summary

• Separation of design concerns (SDC)
• Adaptability (AD)
• Syntactical Adequacy (SA)

A “base” of attributes we can use to compare ALFTs with one another

[Figure: chart with axes SDC, AD, SA]


System structures for SFT

• Single-version FT

• Multiple-version FT

• Object model

• Linda Model

• FT Languages

• Recovery metaprogram

Each of these could be a case study for the exam



Single-version Fault Tolerance

• Single-version SFT = embedding a set of error detection / recovery features in the user application of a simplex system
  - Explicit code intrusion (bad SDC)
  - Increases size and complexity (bad SA)
  - Bad for transparency, maintainability, portability
  - Increases development times and costs
  - No support for dynamic adaptability (bad AD)
• Libraries: SwIFT, HATS, EFTOS…


Multiple-version Fault Tolerance

• Multiple-version SFT: N-version programming (NVP) and recovery blocks (RB)
• Idea: redundancy of software, i.e., independently designed versions of the software
  - Randell (1975): “All fault tolerance must be based on the provision of useful redundancy, both for error detection and error recovery. In software the redundancy required is not simple replication of programs but redundancy of design”
• Assumption: random component failures
  - Correlated failures lead to a sudden exhaustion of the available redundancy
  - Again, Ariane 5 flight 501: two crucial components were operating in parallel with identical hardware and software…



    #include <ftmacros.h>
    ...
    ENSURE(acceptance-test) {
        Alternate 1;
    } ELSEBY {
        Alternate 2;
    }
    ...
    ENSURE;
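The ENSURE / ELSEBY construct above follows the recovery-block scheme: run an alternate, check its result with an acceptance test, and fall back to the next alternate on rejection. The plain-C sketch below shows that control flow under simplifying assumptions; every name in it is hypothetical and it is not the ftmacros.h API. Since the alternates receive their input by value, the caller's copy plays the role of the checkpoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types; not taken from ftmacros.h. */
typedef int  (*alternate_fn)(int input);              /* one design of the service */
typedef bool (*acceptance_fn)(int input, int result); /* the acceptance test */

/* Try each alternate in order and accept the first result that passes the
 * acceptance test. Returns the index of the accepted alternate, or -1 if
 * every alternate was rejected. */
int recovery_block(const alternate_fn *alternates, size_t n,
                   acceptance_fn accept, int input, int *result)
{
    assert(alternates != NULL && accept != NULL && result != NULL);
    for (size_t i = 0; i < n; i++) {
        int candidate = alternates[i](input); /* input is unchanged: implicit checkpoint */
        if (accept(input, candidate)) {
            *result = candidate;
            return (int)i;
        }
    }
    return -1;
}

/* Example: integer square root. The primary is deliberately faulty. */
int isqrt_primary(int x) { return x / 2; }
int isqrt_alternate(int x)
{
    int r = 0;
    while ((r + 1) * (r + 1) <= x)
        r++;
    return r;
}
bool isqrt_accept(int x, int r)
{
    return r >= 0 && r * r <= x && (r + 1) * (r + 1) > x;
}
```

With the two alternates above and input 10, the primary's answer (5) fails the acceptance test and the second alternate's answer (3) is accepted.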




    #include <ftmacros.h>
    ...
    NVP
    VERSION {
        block 1;
        SENDVOTE(v-pointer, v-size);
    }
    VERSION {
        block 2;
        SENDVOTE(v-pointer, v-size);
    }
    …
    ENDVERSION(timeout, v-size);
    if (!agreeon(v-pointer))
        error_handler();
    ENDNVP;
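The agreement step of N-version programming can be sketched as a majority voter over the results sent by the versions. The function below is a hypothetical illustration in plain C, not the implementation behind SENDVOTE or agreeon; it assumes integer results and exact-match voting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical voter: find the value reported by a strict majority of the
 * n versions. Returns false when no such value exists. */
bool majority_vote(const int *votes, size_t n, int *winner)
{
    assert(votes != NULL && winner != NULL);
    if (n == 0)
        return false;

    /* Boyer-Moore majority vote: a first pass selects a candidate ... */
    int candidate = votes[0];
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (count == 0) {
            candidate = votes[i];
            count = 1;
        } else if (votes[i] == candidate) {
            count++;
        } else {
            count--;
        }
    }

    /* ... and a second pass confirms that it has a strict majority. */
    size_t occurrences = 0;
    for (size_t i = 0; i < n; i++)
        if (votes[i] == candidate)
            occurrences++;
    if (2 * occurrences <= n)
        return false;

    *winner = candidate;
    return true;
}
```

A voter like this masks a minority of wrong versions only as long as they fail independently; correlated failures can produce a wrong majority.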




• Multiple-version SFT
  - Implies N-fold design costs, N-fold maintenance costs
  - The risk of correlated failures is not negligible
  - Code intrusion is limited (acceptable SDC)
  - System structure is fixed (bad SA)
  - No support for dynamic adaptability (bad AD)
  - Can be combined with other means


Object-centred Strategies

Strategies based on the object model:

• Metaobject protocols and reflection
  - Open implementation of the run-time executive of an OO language
  - Reflection, reification
• Composition filters
  - Each object has a set of “filters”. Messages sent to an object are trapped by its filters, which may manipulate the message before passing it on to the object.
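The filter idea can be approximated in plain C with a chain of function pointers that every incoming message must pass through before it reaches the object. This is only a sketch: the types and names are hypothetical, and real composition-filter systems attach filters at the language level rather than through an explicit send function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration only. */
typedef struct { int value; } message;

/* A filter may rewrite the message; returning false rejects it. */
typedef bool (*filter_fn)(message *m);

typedef struct {
    const filter_fn *filters; /* applied in order to every incoming message */
    size_t n_filters;
    int state;                /* the object's functional state */
} object;

/* Deliver a message: it reaches the object only if all filters let it pass. */
bool object_send(object *obj, message m)
{
    assert(obj != NULL);
    for (size_t i = 0; i < obj->n_filters; i++)
        if (!obj->filters[i](&m))
            return false;     /* trapped and rejected by a filter */
    obj->state += m.value;    /* functional code, free of FT concerns */
    return true;
}

/* Example filters: reject negative inputs, clamp large ones. */
bool reject_negative(message *m) { return m->value >= 0; }
bool clamp_to_100(message *m)
{
    if (m->value > 100)
        m->value = 100;
    return true;
}
```

The functional part (the update of state) contains no checking code; changing the FT behaviour means changing the filter list, not the object.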




• Active objects
  - Objects that control the synchronisation of incoming requests from other objects. Objects can autonomously decide, e.g., to delay a request until it is acceptable, i.e., until a guard is met
• Examples: FRIENDS, SINA, Correlate
  - Full separation of design concerns (good SDC)
  - No code intrusion
  - Syntactically adequate at least for a subset of FT strategies (acceptable SA)



• Assumption: the application is written in an extended OO language
• Adaptability? (questionable AD)


FT Linda Systems

Generative communication - messages are not “sent”, they are stored in a public, distributed shared memory

A shared relational database for storing and withdrawing “tuples”

Tuples: lists of objects identified by their contents, cardinality and type

A Linda process inserts, reads, and withdraws tuples via blocking or non-blocking primitives

Synchronisation: presence / absence of a matching tuple
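A toy, single-process tuple space makes the primitives above concrete. The sketch assumes fixed-arity integer tuples and only the non-blocking operations (named after Linda's out, rdp and inp); the capacity, the wildcard value and the ts_ prefix are arbitrary choices made for this example. Blocking in / rd and distribution are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TS_CAPACITY 16 /* arbitrary size for the sketch */
#define TUPLE_ARITY 2
#define ANY (-1)       /* wildcard in a template; assumes real fields are >= 0 */

typedef struct { int field[TUPLE_ARITY]; } tuple;

typedef struct {
    tuple slot[TS_CAPACITY];
    bool  used[TS_CAPACITY];
} tuple_space;

/* A tuple matches a template if every non-wildcard field is equal. */
static bool matches(const tuple *t, const tuple *tmpl)
{
    for (size_t i = 0; i < TUPLE_ARITY; i++)
        if (tmpl->field[i] != ANY && tmpl->field[i] != t->field[i])
            return false;
    return true;
}

/* out: store a tuple in the space; fails only if the space is full. */
bool ts_out(tuple_space *ts, tuple t)
{
    assert(ts != NULL);
    for (size_t i = 0; i < TS_CAPACITY; i++)
        if (!ts->used[i]) {
            ts->slot[i] = t;
            ts->used[i] = true;
            return true;
        }
    return false;
}

/* Shared lookup: copy the first matching tuple, optionally withdrawing it. */
static bool ts_lookup(tuple_space *ts, tuple tmpl, tuple *found, bool withdraw)
{
    assert(ts != NULL && found != NULL);
    for (size_t i = 0; i < TS_CAPACITY; i++)
        if (ts->used[i] && matches(&ts->slot[i], &tmpl)) {
            *found = ts->slot[i];
            if (withdraw)
                ts->used[i] = false;
            return true;
        }
    return false;
}

/* rdp: non-blocking read; inp: non-blocking withdraw. */
bool ts_rdp(tuple_space *ts, tuple tmpl, tuple *found) { return ts_lookup(ts, tmpl, found, false); }
bool ts_inp(tuple_space *ts, tuple tmpl, tuple *found) { return ts_lookup(ts, tmpl, found, true); }
```

In a master-worker setting the master would ts_out job tuples and each worker would ts_inp one job at a time; a failed match is the "absence" case that a blocking primitive would wait on.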



Linda

• In master-worker applications:
  - Dynamic load balancing, also in heterogeneous clusters
  - Inherently tolerates crash failures of workers
• Problem: single-op atomicity
• Solutions:
  - Atomic transactions with multiple tuple-space (TS) ops
  - Stable tuple space
  - Tuple space checkpointing, etc.

Possible case study for the exam


Linda

• FT-Linda, Persistent Linda...
  - Full separation of design concerns (good SDC)
  - No code intrusion
  - Syntactically adequate at least for a subset of FT strategies (acceptable SA)
  - Assumption: the application is written in Linda
  - Adaptability? (questionable AD)


FT Languages

• FT languages
  1. Enhanced, pre-existing languages. Examples:
     - FT-SR: fail-stop modules (“abstract unit of encapsulation”), atomic execution, composability
     - x-Linda (x = C, Fortran, C++, …)


FT Languages

• FT languages
  2. Novel languages. Examples:
     - Argus: distributed OO programming language and operating system; “guardians” (objects performing user-definable actions in response to remote requests); atomic transactions
     - FTAG: functional language based on attribute grammars


FT Languages

• FTAG
  - Computation = a collection of pure mathematical functions, the modules
  - Each module has a set of input values, called inherited attributes, and a set of output variables, called synthesized attributes


FTAG (cont.’d)

• Primitive modules can be executed directly
• Non-primitive modules require other modules to be performed first
• FTAG program = decomposing a “root” module into its basic sub-modules and then recursively applying this decomposition to each sub-module (computation tree)


FTAG (cont.’d)

Natural support for redoing (replacing a portion of the computation tree with a new computation)

Natural support for replication (replicated decomposition: a module is decomposed into N identical sub-modules implementing the function to replicate)



FT Languages

• Conclusions for FT languages

  - Adequate separation of design concerns, transparency (good SDC)
  - Special-purpose syntax (potentially good SA)
  - The application must be written in a non-standard language: bad portability
  - Adaptability (AD): unknown


RMP

• Recovery Metaprogram (RMP)
  - Two cooperating processing contexts
  - User-placed breakpoints in the user context trigger the execution of a meta-program
  - When the meta-program ends, control is returned to the user program
  - The meta-program is to be written in CSP
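The control flow described above (user code hits a breakpoint, the meta-program runs, control comes back) can be mimicked in plain C with registered handlers. This is a loose analogy under stated assumptions: all names are hypothetical, both parts run in one context, and the handlers are C functions, whereas the RMP uses two processing contexts and a meta-program written in CSP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical breakpoint identifiers. */
typedef enum { BP_CHECKPOINT, BP_ACCEPTANCE_TEST, BP_COUNT } breakpoint_id;

typedef void (*meta_handler)(void *user_state);

/* The FT logic lives in this table, outside the user code. */
static meta_handler meta_program[BP_COUNT];

void rmp_register(breakpoint_id bp, meta_handler h)
{
    assert(bp < BP_COUNT);
    meta_program[bp] = h;
}

/* The only FT-related statement that appears in the user program. */
void rmp_breakpoint(breakpoint_id bp, void *user_state)
{
    assert(bp < BP_COUNT);
    if (meta_program[bp] != NULL)
        meta_program[bp](user_state); /* control returns to the caller afterwards */
}

/* --- user program --- */
typedef struct { int balance; } account;

int withdraw(account *a, int amount)
{
    rmp_breakpoint(BP_CHECKPOINT, a);
    a->balance -= amount;
    rmp_breakpoint(BP_ACCEPTANCE_TEST, a);
    return a->balance;
}

/* --- meta-program (kept apart from the user program) --- */
static int saved_balance;

void save_state(void *s)
{
    account *a = s;
    saved_balance = a->balance;
}

void check_or_undo(void *s)
{
    account *a = s;
    if (a->balance < 0)          /* acceptance test failed: roll back */
        a->balance = saved_balance;
}
```

After registering save_state and check_or_undo, a withdrawal that would make the balance negative is rolled back by the meta-program, while withdraw itself contains nothing but the two breakpoints.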



RMP

• Adequate, e.g., for recovery blocks: a breakpoint can trigger the execution of
  - CHECKPOINT
  - ALTERNATES
  - ACCEPTANCE TESTS
  - ...


RMP

• RMP summary:
  - Full separation of design concerns, no code intrusion (good SDC)
  - Syntactically adequate at least for a subset of FT strategies (average SA)
  - The meta-program is written in a fixed, pre-existing language (CSP)
  - Inefficient implementation (huge performance overhead for switching execution modes)
  - No adaptability (bad AD)


Summary

[Figure: chart rating each approach (Single-version, Multiple-version, Object model, Linda, Languages, RMP) on SDC, AD and SA]

• No optimal solution exists yet
• Challenging research problem!


Conclusions – in search of optimum

• A dependable service is one that persists even when, for instance, its corresponding program experiences faults – to some agreed upon extent

• An F-dependable service (resp. F-dependable program, system…) is one that persists despite the occurrence of faults as described in F

• F is the fault model




• F is the model of an environment (E)

• An F-dependable service may tolerate faults in E but not those in E’

• What if F matches an environment E’?

• What if E changes into E’?

• What if an F-service is moved?

→ A failure may occur!




• Adapting services

• X-dependable services, where X = f(E)

• X changes when
  - The service is moved
  - The environment mutates

• Changes should occur automa[tg]ically (high AD)

• The expression of adaptability and dependability concerns should not increase complexity “too much” (high SA)


Conclusions

• Ideally, the code should be made of two components, (service, FT) (optimal SDC), and FT should adapt dynamically w.r.t. e’
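One way to picture an FT component that adapts to the environment is a monitor that estimates the current fault rate and a policy that maps the estimate to a degree of redundancy. The sketch below is purely illustrative: the names, the thresholds and the choice of replication as the adapted parameter are assumptions made here, not the architecture announced in the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical environment monitor: counts invocations and observed faults. */
typedef struct { unsigned calls; unsigned faults; } env_monitor;

void env_record(env_monitor *m, bool fault_observed)
{
    assert(m != NULL);
    m->calls++;
    if (fault_observed)
        m->faults++;
}

/* Observed fault rate, in faults per 1000 invocations (0 if nothing seen yet). */
unsigned env_faults_per_1000(const env_monitor *m)
{
    assert(m != NULL);
    if (m->calls == 0)
        return 0;
    return (unsigned)((unsigned long long)m->faults * 1000ULL / m->calls);
}

/* Policy: choose the number of replicas from the observed rate.
 * The thresholds are arbitrary and for illustration only. */
unsigned replicas_for(unsigned faults_per_1000)
{
    if (faults_per_1000 == 0)
        return 1;   /* benign environment: simplex execution */
    if (faults_per_1000 < 10)
        return 3;   /* triple redundancy */
    return 5;
}
```

The service code never sees the monitor or the policy; only the FT side changes when the environment does.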



Conclusions

• Risks: this may call for complexity!
  - But generic architectures can be designed so as to limit complexity
  - Optimizations are possible
• In a future seminar: a compliant architecture that is being designed within PATS


Questions?

All citations are by B. Randell unless another author is specified