Seminarie Informatica
Fault-tolerant Systems: The Software Viewpoint
A series of seminars coordinated by Vincenzo De Florio
http://www.pats.ua.ac.be
25 October 2006
Seminarie Informatica - Lecture 1 2
The matter
• The exam
• The topics
• This lecture: application-level fault tolerance provisions
Introduction to the exam
• Seminarie informatica
  • 10 seminars on hot topics of computer science
  • Topic of this cycle: software fault-tolerant systems
  • Next 3 seminars: 15, 22 November; 6 December
  • Next year's seminars: to be announced on http://www.win.ua.ac.be/~vincenz/si/0607.html
Introduction to the exam
• Oral discussion of 2 papers
  • A 5–6 page paper based on one or more of the topics of the seminars
  • A paper with the analysis of a case study
    • See later for examples
• Evaluation criteria:
  • Do the papers contain original ideas? Do they follow the seminar «too strictly»?
  • Does the author understand the subject? Is (s)he able to reason independently about the subject?
• Papers must be submitted by May 15, 2007
  • E-mail to [email protected]
The Topics
Dependability = the property of a system such that reliance can justifiably be placed on the service it delivers
Fault tolerance = one of the means of dependability
The Dependability Tree
Fault tolerance (FT)
A fault-tolerant system is a system that continues to function in spite of faults
• Hardware: defective IC
• Software: bug in program
• Operator: operation fault
• I/O: sensor drift
Attributes of dependability
• Availability
  • Readiness for usage
  • A(t) = probability that the system conforms to its specification at time t
• Reliability
  • Continuity of service
  • R(t) = probability that the system conforms to its specification during [t0, t], provided that it does so at t0
Attributes of dependability (2)
• Safety
  • Non-occurrence of catastrophic consequences on the environment
  • S(t) = probability that the system either conforms to its specification or reaches a safe halt at time t
  • Fail-safe systems
Attributes of dependability (3)
• Maintainability
  • Aptitude to undergo repairs and evolution
  • M(t) = probability that the system is back to its specification at t if it failed at t0
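The slides define A(t), R(t) and M(t) only as probabilities. As a worked illustration, and purely as an assumption not made in the slides, suppose failures and repairs occur at constant rates λ and μ (exponential model, with MTTF = 1/λ and MTTR = 1/μ). The three attributes then take the following closed forms:

```latex
R(t) = e^{-\lambda (t - t_0)}, \qquad
M(t) = 1 - e^{-\mu (t - t_0)}, \qquad
\lim_{t \to \infty} A(t) = \frac{\mu}{\lambda + \mu}
  = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
```

For example, with the hypothetical values MTTF = 1000 h and MTTR = 10 h, the steady-state availability is 1000/1010 ≈ 0.990, while the reliability over a 100 h mission is e^{-0.1} ≈ 0.905. A system can therefore be highly available and still fairly likely to fail at least once during a long mission.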
Attributes of dependability (4)
• Confidentiality
  • Non-occurrence of unauthorised disclosure of information
• Integrity
  • Non-occurrence of improper alterations of information
Related attributes
• Testability
  • Ability to test features of a system
  • Related to maintainability
Related attributes
• Security
  • Integrity + availability + confidentiality
References
• Jean-Claude Laprie, “Dependable Computing and Fault Tolerance: Concepts and Terminology”, in Proc. of the 15th Int. Symposium on Fault-Tolerant Computing (FTCS-15), Ann Arbor, Mich., June 1985, pp. 2-11.
• Jean-Claude Laprie, “Dependability: Its Attributes, Impairments and Means”, in Predictably Dependable Computing Systems, ESPRIT Basic Research Series, B. Randell, J.-C. Laprie, H. Kopetz and B. Littlewood (eds.), Springer Verlag, 1995, pp. 3-18.
The lecture
• We now focus on application-level fault tolerance (ALFT)
• Why do we need ALFT? Why do we need software FT in the first place?
• We explain why
• We survey the existing methods and assess their pros and cons against a set of properties
• Surprising conclusion: still an open problem
Software Fault Tolerance
• Human society increasingly expects, and relies on, good quality from the complex services supplied by computers
Software Fault Tolerance
• Consequences of a failure in the 1940s (computers as fast solvers of numerical problems): errors in computations, long downtimes
• Consequences of a failure nowadays (computers controlling nuclear plants, airborne equipment, healthcare…): incalculable penalty (catastrophes)
[Figure label: Performance & ease of use]
Software Fault Tolerance
• Traditional answer: Hardware Fault Tolerance
• This is an important ingredient, but not the only one needed today!
[Figure: layered stack of HW, OS, MW and APPLICATION; the layers above HW are marked SW]
• Complexity is also in the SW layers
  1. Hierarchies of complex abstract machines
Software Fault Tolerance
• Complexity is also in the SW layers (cont'd)
  2. Software is often networked and distributed
  3. Relationships among software components are often complex
  4. Object model
    • Easier SW reuse
    • Hidden + explicit complexity
Software Fault Tolerance
• In conclusion: “No amount of verification, validation and testing can eliminate all faults in an application and give complete confidence in the availability and data consistency of applications”
• Fault tolerance in SW is key
• SW failures can have consequences of the same extent as HW failures (see Ariane 5)
Problems of SW FT
[Figure: layers HW, OS, HL RUN-TIME and APPLICATION, drawn in different shades]
• The lighter the color, the more general-purpose the (virtual) machine
• The lighter the color, the more complex the problem of expressing fault tolerance
Problems of Application-level Fault Tolerance
• “The only alternative and effective means for increasing software reliability is that of incorporating in the application software provisions for SFT”
• The application software has to manage, at the same time and in the same space:
  • Functional aspects
  • Fault tolerance (FT) aspects
Problems and properties of Application-level Fault Tolerance
• Hazard: code intrusion
  • FT provisions are specified side by side with the service
  • Conflicting design concerns
  • Overall design complexity increases
  • Larger development and maintenance costs and times
  • Larger probability of introducing software bugs
Problems and properties of Application-level Fault Tolerance
• Separation of design concerns (SDC)
  • In what follows we call an “ALFT” a means to express fault tolerance in the application software
  • One criterion for comparing ALFTs is their degree of SDC
Problems and properties of Application-level Fault Tolerance
• Hazard: porting code ≠ porting service
  • FT code assumes a fault model = f(e), where e is the environment
    1. If e changes, or
    2. If the code is moved to another environment e’
    the QoS may degrade
Problems and properties of Application-level Fault Tolerance
• Hazard: porting code ≠ porting service
[Figure: diagram labelled IRS, IRS, FC C]
• An interesting case: Ariane 5 flight 501
  • Ariane 4 mission software was re-used in Ariane 5
  • The early part of the trajectory of Ariane 5 differed from that of Ariane 4 and resulted in considerably higher horizontal velocity values
  • …370 million euros down the drain
• This could be a case study for the exam
Problems and properties of Application-level Fault Tolerance
2. Problem: service portability
  • Porting FT does not come for free
  • “Hardwired” fault model = static environment
  • More difficult to adapt / test / maintain
  • More prone to Ariane 5 effects
“What is the most often overlooked risk in sw engineering? That the environment will do something the designer never anticipated” [J. Horning]
Problems and properties of Application-level Fault Tolerance
• Adaptability (AD)
  • Does the ALFT provide means to adapt, dynamically, to new environmental conditions?
  • One criterion for comparing two ALFTs is their degree of AD
Problems and properties of Application-level Fault Tolerance
3. Problem: adding complexity can decrease dependability
  • The ALFT (the means to express FT) must be based on a simple strategy
  • It must be syntactically adequate to host several mechanisms
Problems and properties of Application-level Fault Tolerance
• Hazard:
  • “Languages shape the way we think…” [Whorf]
  • “If all you have is a hammer, everything looks like a nail” [/usr/share/fortune]
  • …but is it really a nail?
• Syntactical Adequacy (SA)
  • Does the ALFT provide simple means to host many FT solutions?
  • One criterion for comparing two ALFTs is their degree of SA
Summary
• Separation of design concerns (SDC)
• Adaptability (AD)
• Syntactical Adequacy (SA)
A “base” of attributes we can use to compare ALFTs with one another
[Figure: chart with axes SDC, AD and SA]
System structures for SFT
• Single-version FT
• Multiple-version FT
• Object model
• Linda Model
• FT Languages
• Recovery metaprogram
Each of these could be a case study for the exam
Single-version Fault Tolerance
• Single-version SFT = embedding a set of error detection / recovery features in the user application of a simplex system
  • Explicit code intrusion (bad SDC)
  • Increases size and complexity (bad SA)
  • Bad for transparency, maintainability, portability
  • Increases development times and costs
  • No support for dynamic adaptability (bad AD)
• Libraries: SwIFT, HATS, EFTOS…
Multiple-version Fault Tolerance
• Multiple-version SFT: N-version programming (NVP) and recovery blocks (RB)
• Idea: redundancy of software, i.e. independently designed versions of software
  • Randell (1975): “All fault tolerance must be based on the provision of useful redundancy, both for error detection and error recovery. In software the redundancy required is not simple replication of programs but redundancy of design”
• Assumption: random component failures
  • Correlated failures lead to sudden exhaustion of the available redundancy
  • Again, Ariane 5 flight 501: two crucial components were operating in parallel with identical hardware and software…
Multiple-version Fault Tolerance
#include <ftmacros.h>
...
ENSURE(acceptance-test) {
    Alternate 1;
} ELSEBY {
    Alternate 2;
}
...
ENSURE;
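The ENSURE / ELSEBY construct above expresses a recovery block: alternates are tried in order until one passes the acceptance test. The following is a minimal plain-C sketch of that control flow, not the ftmacros.h implementation. All names (`recovery_block`, `alternate_fn`, `acceptance_fn`, the `isqrt_*` helpers) are hypothetical, and the "checkpoint" is simply a copy of an integer input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration; they are not part of ftmacros.h. */
typedef int  (*alternate_fn)(int input);
typedef bool (*acceptance_fn)(int input, int result);

/* Try each alternate in order. Before every attempt the input is restored
 * from the checkpoint; the first result that passes the acceptance test
 * is returned through *result. Returns false if all alternates fail. */
bool recovery_block(const alternate_fn *alternates, size_t n,
                    acceptance_fn accept, int input, int *result)
{
    const int checkpoint = input;              /* saved state */
    for (size_t i = 0; i < n; i++) {
        int restored  = checkpoint;            /* roll back before each try */
        int candidate = alternates[i](restored);
        if (accept(restored, candidate)) {
            *result = candidate;
            return true;
        }
    }
    return false;                              /* redundancy exhausted */
}

/* Example alternates: integer square root of a non-negative number. */
static int isqrt_faulty(int x) { return x / 2; }   /* deliberately wrong */

static int isqrt_linear(int x)
{
    int r = 0;
    while ((r + 1) * (r + 1) <= x)
        r++;
    return r;
}

/* Acceptance test: r is the integer square root of x. */
static bool isqrt_acceptable(int x, int r)
{
    return r >= 0 && r * r <= x && (r + 1) * (r + 1) > x;
}
```

Calling `recovery_block` with the alternates `{ isqrt_faulty, isqrt_linear }` and input 17 rejects the first result (8) and accepts the second (4). Note that the quality of the whole scheme rests on the acceptance test, which is itself application code.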
Multiple-version Fault Tolerance
#include <ftmacros.h>
...
NVP
VERSION {
    block 1;
    SENDVOTE(v-pointer, v-size);
}
VERSION {
    block 2;
    SENDVOTE(v-pointer, v-size);
}
…
ENDVERSION(timeout, v-size);
if (!agreeon(v-pointer))
    error_handler();
ENDNVP;
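The voting step behind the NVP construct can be sketched in plain C as follows. This is an illustrative sketch only, not the ftmacros.h implementation: the versions run sequentially rather than in parallel, there is no timeout, votes are plain integers, and the names `nvp_vote`, `version_fn` and the `square_*` versions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*version_fn)(int input);          /* one independently designed version */

enum { MAX_VERSIONS = 16 };

/* Run all versions on the same input and look for a strict majority.
 * Returns true and stores the agreed value if more than half of the
 * versions produced the same result; returns false otherwise. */
bool nvp_vote(const version_fn *versions, size_t n, int input, int *agreed)
{
    int votes[MAX_VERSIONS];

    if (n == 0 || n > MAX_VERSIONS)
        return false;
    for (size_t i = 0; i < n; i++)
        votes[i] = versions[i](input);

    for (size_t i = 0; i < n; i++) {
        size_t count = 0;
        for (size_t j = 0; j < n; j++)
            if (votes[j] == votes[i])
                count++;
        if (2 * count > n) {                   /* strict majority */
            *agreed = votes[i];
            return true;
        }
    }
    return false;                              /* no agreement */
}

/* Three "versions" of squaring; the last one is deliberately faulty. */
static int square_mul(int x) { return x * x; }

static int square_add(int x)
{
    int a = x < 0 ? -x : x, sum = 0;
    for (int i = 0; i < a; i++)
        sum += a;
    return sum;
}

static int square_faulty(int x) { return x + x; }
```

With the three versions above and input 5, the votes are 25, 25 and 10, so the majority value 25 masks the faulty version. With only `square_mul` and `square_faulty` there is no majority and the caller has to invoke its error handler, which mirrors the `agreeon` check in the slide.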
Multiple-version Fault Tolerance
• Multiple-version SFT
  • Implies N-fold design costs, N-fold maintenance costs
  • The risk of correlated failures is not negligible
  • Code intrusion is limited (Acceptable SDC)
  • System structure is fixed (Bad SA)
  • No support for dynamic adaptability (bad AD)
• Can be combined with other means
Object-centred Strategies
• Strategies based on the object model
  • Metaobject protocols and reflection
    • Open implementation of the run-time executive of an OO language
    • Reflection, reification
  • Composition filters
    • Each object has a set of “filters”. Messages sent to any object are trapped by its filters. These filters possibly manipulate the message before passing it to the object.
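The composition-filter idea can be approximated in C with function pointers. This is only a sketch under stated assumptions: real composition-filter systems are language extensions, whereas here the `account` object, the `message` type and both filters are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message and object types, for illustration only. */
typedef struct { int amount; } message;

/* A filter may rewrite the message; returning false rejects it. */
typedef bool (*filter_fn)(message *msg);

enum { MAX_FILTERS = 4 };

typedef struct {
    filter_fn filters[MAX_FILTERS];
    size_t    nfilters;
    int       balance;                 /* the object's functional state */
} account;

/* Functional code: it knows nothing about the filters. */
static void account_deposit(account *acc, const message *msg)
{
    acc->balance += msg->amount;
}

/* Every message is trapped by the filters before it reaches the object. */
bool account_send(account *acc, message msg)
{
    for (size_t i = 0; i < acc->nfilters; i++) {
        if (!acc->filters[i](&msg))
            return false;              /* rejected: object left untouched */
    }
    account_deposit(acc, &msg);
    return true;
}

/* Example filters holding the error-detection concerns. */
bool reject_negative(message *msg) { return msg->amount >= 0; }

bool clamp_to_limit(message *msg)
{
    if (msg->amount > 1000)
        msg->amount = 1000;            /* manipulate the message */
    return true;
}
```

The point of the design is that `account_deposit` stays free of checking code: the checks live in the filters and can be added or removed without touching the service, which is what gives this family of approaches its good SDC rating.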
Object-centred Strategies
• Active objects
  • Objects that have control over the synchronisation of incoming requests from other objects. Objects can autonomously decide, e.g., to delay a request until it is acceptable, i.e., until a guard is met
• FRIENDS, SINA, Correlate
  • Full separation of design concerns (Good SDC)
  • No code intrusion
  • Syntactically adequate at least for a subset of FT strategies (Acceptable SA)
Object-centred Strategies
• Assumption: application written in an extended OO language
• Adaptability? (Questionable AD)
FT Linda Systems
Generative communication - messages are not “sent”, they are stored in a public, distributed shared memory
A shared relational database for storing and withdrawing “tuples”
Tuples: lists of objects identified by their contents, cardinality and type
A Linda process inserts, reads, and withdraws tuples via blocking or non-blocking primitives
Synchronisation: presence / absence of a matching tuple
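To make the tuple-space vocabulary concrete, here is a minimal single-process sketch in C. It is an assumption-laden toy, not a Linda implementation: tuples are fixed (tag, integer) pairs, matching is by tag only, the space is a local array rather than distributed shared memory, and only the non-blocking primitives are shown. The operation names follow the conventional Linda primitives `out`, `rdp` and `inp`; the `ts_` prefix and the types are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { TS_CAPACITY = 32, TAG_LEN = 16 };

typedef struct {
    bool used;
    char tag[TAG_LEN];
    int  value;
} tuple;

typedef struct { tuple slots[TS_CAPACITY]; } tuple_space;

void ts_init(tuple_space *ts) { memset(ts, 0, sizeof *ts); }

/* out: insert a tuple into the space. Fails if the space is full
 * or the tag does not fit. */
bool ts_out(tuple_space *ts, const char *tag, int value)
{
    if (strlen(tag) >= TAG_LEN)
        return false;
    for (size_t i = 0; i < TS_CAPACITY; i++) {
        if (!ts->slots[i].used) {
            ts->slots[i].used = true;
            strcpy(ts->slots[i].tag, tag);
            ts->slots[i].value = value;
            return true;
        }
    }
    return false;
}

/* Find the first tuple whose tag matches the template, or NULL. */
static tuple *ts_match(tuple_space *ts, const char *tag)
{
    for (size_t i = 0; i < TS_CAPACITY; i++)
        if (ts->slots[i].used && strcmp(ts->slots[i].tag, tag) == 0)
            return &ts->slots[i];
    return NULL;
}

/* rdp: non-blocking read; the tuple stays in the space. */
bool ts_rdp(tuple_space *ts, const char *tag, int *value)
{
    tuple *t = ts_match(ts, tag);
    if (t == NULL)
        return false;
    *value = t->value;
    return true;
}

/* inp: non-blocking withdraw; the tuple is removed from the space. */
bool ts_inp(tuple_space *ts, const char *tag, int *value)
{
    tuple *t = ts_match(ts, tag);
    if (t == NULL)
        return false;
    *value = t->value;
    t->used = false;
    return true;
}
```

In a master-worker setting the master would `ts_out` one "task" tuple per work item and each worker would loop on `ts_inp`. A worker that crashes after withdrawing a task but before emitting its result loses that task, which illustrates why atomicity limited to single operations is a weakness of the basic model.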
Linda
• In master-worker applications
  • Dynamic load balancing, also in heterogeneous clusters
  • Inherently tolerates crash failures of workers
• Problem: single-op atomicity only
• Solutions:
  • Atomic transactions with multiple tuple space (TS) ops
  • Stable tuple space
  • Tuple space checkpointing, etc.
• Possible case study for the exam
Linda
• FT-Linda, Persistent Linda...
  • Full separation of design concerns (Good SDC)
  • No code intrusion
  • Syntactically adequate at least for a subset of FT strategies (Acceptable SA)
  • Assumption: application written in Linda
  • Adaptability? (Questionable AD)
FT Languages
• FT Languages
  1. Enhanced, pre-existing languages
    • Examples:
      • FT-SR
        • Fail-stop modules: “abstract unit of encapsulation”
        • Atomic execution
        • Composability
      • x-Linda (x = C, Fortran, C++, …)
FT Languages
• FT Languages
  2. Novel languages
    • Examples:
      • Argus: distributed OO programming language and operating system
        • “Guardians”: objects performing user-definable actions in response to remote requests
        • Atomic transactions
      • FTAG: functional language based on attribute grammars
FT Languages
• FTAG
  • Computation = collection of pure mathematical functions, the modules
  • Each module has a set of input values, called inherited attributes, and a set of output variables, called synthesized attributes
FTAG (cont.’d)
• Primitive modules can be executed
• Non-primitive modules require other modules to be performed first
• FTAG program = decomposing a “root” module into its basic sub-modules and then recursively applying this decomposition process to each of the sub-modules (computation tree)
FTAG (cont.’d)
Natural support for redoing (replacing a portion of the computation tree with a new computation)
Natural support for replication (replicated decomposition: a module is decomposed into N identical sub-modules implementing the function to replicate)
FT Languages
• Conclusions for FT languages
  • Adequate separation of design concerns, transparency (good SDC)
  • Special-purpose syntax (potentially good SA)
  • The application must be written in a non-standard language, hence bad portability
  • Adaptability (AD): unknown
RMP
• Recovery Metaprogram
  • Two cooperating processing contexts
  • User-placed breakpoints in the user context trigger the execution of a meta-program
  • When the meta-program ends, control is returned to the user program
  • The meta-program is to be written in CSP
RMP
• Adequate, e.g., for recovery blocks: a breakpoint can trigger the execution of
  • CHECKPOINT
  • ALTERNATES
  • ACCEPTANCE TESTS...
RMP
• RMP summary:
  • Full separation of design concerns, no code intrusion (Good SDC)
  • Syntactically adequate at least for a subset of FT strategies (Average SA)
  • The meta-program is written in a fixed, pre-existing language (CSP)
  • Inefficient implementation (huge performance overhead for switching execution modes)
  • No adaptability (Bad AD)
Summary
[Figure: chart rating Single-version, Multiple-version, Object model, Linda, Languages and RMP on SDC, AD and SA]
• No optimal solution exists yet
• Challenging research problem!
Conclusions – in search of optimum
• A dependable service is one that persists even when, for instance, its corresponding program experiences faults – to some agreed upon extent
• An F-dependable service (resp. F-dependable program, system…) is one that persists despite the occurrence of faults as described in F
• F is the fault model
Conclusions – in search of optimum
• F is the model of an environment (E)
• An F-dependable service may tolerate faults in E and may not for those in E’
• What if F matches an environment E’?
• What if E changes into E’?
• What if an F-service is moved?
→ A failure may occur!
Conclusions – in search of optimum
• Adapting services
• X-dependable services, where X = f(E)
• X changes when
  • The service is moved
  • The environment mutates
• Changes should occur automa[tg]ically (High AD)
• The expression of adaptability and dependability concerns should not increase complexity “too much” (High SA)
Conclusions
• Ideally, the code should be made of two components, (service, FT) (Optimal SDC), and FT should adapt dynamically w.r.t. e’
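As a rough illustration of the (service, FT) split, the sketch below keeps the service as a plain function and places the fault-tolerance policy in a separate wrapper whose redundancy level depends on what has been observed of the environment. It is not the architecture announced for the later seminar. Every name here is hypothetical, the only FT mechanism is retry, and the thresholds (1% and 10%) are arbitrary assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Observed environment: how many service runs failed so far. */
typedef struct { unsigned faults; unsigned runs; } environment;

/* Service component: returns false when it detects an error. */
typedef bool (*service_fn)(int input, int *output);

/* FT component, part 1: pick the redundancy level from the observed
 * fault rate. The thresholds are arbitrary, for illustration only. */
unsigned choose_attempts(const environment *env)
{
    if (env->runs == 0)
        return 1;
    unsigned percent = (100u * env->faults) / env->runs;
    if (percent < 1)
        return 1;
    if (percent < 10)
        return 3;
    return 5;
}

/* FT component, part 2: run the service with that many attempts and
 * keep the environment model up to date. */
bool ft_call(service_fn service, environment *env, int input, int *output)
{
    unsigned attempts = choose_attempts(env);
    for (unsigned i = 0; i < attempts; i++) {
        env->runs++;
        if (service(input, output))
            return true;
        env->faults++;
    }
    return false;
}

/* Example service with injectable transient failures (test scaffolding). */
int flaky_failures_left = 0;

bool flaky_double(int input, int *output)
{
    if (flaky_failures_left > 0) {
        flaky_failures_left--;
        return false;
    }
    *output = 2 * input;
    return true;
}
```

Because `flaky_double` contains no FT code and `ft_call` contains no service logic, moving the service to a harsher environment changes only the data in `environment`, and the number of attempts follows. Whether such a scheme keeps complexity acceptable is exactly the open question the slides raise.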
Conclusions
• Risks: this may call for complexity!
  • But generic architectures can be devised so as to keep complexity limited
  • Optimizations are possible
• In a future seminar: a compliant architecture that is being designed within PATS
Questions?
All citations by B. Randell if no author is specified