
Failures: a basic issue in dependability and security - their definition, modelling & analysis
Brian Randell
Dagstuhl, 11-15 Sept 2006



Dependability and Security

• There are severe terminological and conceptual confusions in the dependability & security field(s). These come into prominence when one takes an adequately general view of dependability & security problems – by avoiding the (naive) assumptions that systems always have well-established boundaries and fully-adequate specifications.

• But the confusion also concerns, for example, the relationship of dependability and security – so let’s dispose of that first:

  • Until recently the IFIP WG 10.4 dependability community’s definitions regarded dependability as subsuming security.

  • The latest version of the dependability concepts and terminology saga has (pragmatically) suggested somewhat of a differentiation, which in effect associates dependability with an emphasis on accidental faults and security with an emphasis on malicious faults. (In fact what is almost always needed is their combination!)

Dependability “versus” Security

[Figure: overlapping attribute sets. Attributes: availability, reliability, safety, confidentiality, integrity, maintainability. Dependability covers availability, reliability, safety, integrity and maintainability; security covers availability, confidentiality and integrity.]

Basic Concepts and Taxonomy of Dependable and Secure Computing, Avizienis, A., Laprie, J.-C., Randell, B. and Landwehr, C., IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, pp. 11-33, 2004.

On Failures

• To me, failures are the central issue, the most basic concept - regardless of the exact relationship that is deemed to exist between security and dependability – a topic I will ignore.

• Particular types of failures (e.g. producing wrong results, ceasing to operate, revealing secret information, causing loss of life, etc.) relate to what can be regarded as different attributes of dependability/security: reliability, availability, confidentiality, safety, etc.

• Complex real systems, made up of and by other systems (e.g. of hardware, software and people) do actually fail from time to time (!), and reducing the frequency and severity of their failures is the major challenge - common to both the dependability and the security communities.

• My preferred definition: a dependable/secure system is one whose (dependability/security) failures are not unacceptably frequent or severe (from some given viewpoint).

• So what is a failure?

Three Basic Concepts

• From Avizienis et al:

  • A system failure occurs when the delivered service deviates from fulfilling the system function, the latter being what the system is aimed at. (I’ll return to this definition.)

  • An error is that part of the system state which is liable to lead to subsequent failure: an error affecting the service is an indication that a failure occurs or has occurred. The adjudged or hypothesised cause of an error is a fault.

• (Note: errors do not necessarily lead to failures – this may be avoided by chance or design; component failures do not necessarily constitute faults to the surrounding system – this depends on how the surrounding system is relying on the component.)

• These three concepts (an event, a state, and a cause) must be distinguished, whatever names you choose to use for them.

• Identifying failures and errors, as well as faults, involves judgement.

The Failure/Fault/Error “Chain”

• A failure occurs when an error “passes through” the system-user interface and affects the service delivered by the system – a system of course being composed of components which are themselves systems. This failure may be significant, and thus constitute a fault, to the enclosing system. Thus the manifestation of failures, faults and errors follows a “fundamental chain”:

  . . . failure → fault → error → failure → fault . . .

  i.e.

  . . . event → cause → state → event → cause . . .

• This chain can flow from one system to:

  • another system that it is interacting with.
  • the system which it is part of.
  • a system which it creates or sustains.

• Typically, a failure will be judged to be due to multiple co-incident faults, e.g. the activity of a hacker exploiting a bug left by a programmer.
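The chain can be caricatured in code. The following is a minimal, illustrative Python sketch (the `System` class and the component names are my own assumptions, not from the talk). It shows the pessimistic case in which the error does go on to cause a failure; as noted on the previous slide, real propagation may be avoided by chance or design, and the component’s failure only counts as a fault if the enclosing system relies on it.

```python
from dataclasses import dataclass, field

@dataclass
class System:
    """A system together with the components it actually relies on."""
    name: str
    relies_on: set = field(default_factory=set)

def fef_chain(failed_component, enclosing):
    """One link of the fundamental chain, for the pessimistic case where
    the resulting error does lead on to a failure of the enclosing system.
    The component's failure event is judged a fault to the enclosing
    system only if that system relies on the component."""
    chain = [("event", f"{failed_component} fails")]
    if failed_component in enclosing.relies_on:
        chain += [
            ("cause", f"fault judged against {enclosing.name}"),
            ("state", f"error in {enclosing.name}'s state"),
            ("event", f"{enclosing.name} fails"),
        ]
    return chain

controller = System("controller", relies_on={"sensor"})
print(fef_chain("sensor", controller))   # event -> cause -> state -> event
print(fef_chain("logger", controller))   # no reliance: the chain stops
```

The alternation event/cause/state/event in the returned list mirrors the “fundamental chain” above.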

System Failures

• Identifying failures (and hence errors and faults), even understanding the concepts, is difficult when:

  • there can be uncertainties about system boundaries.

  • the very complexity of the systems (and of any specifications) is often a major difficulty.

  • the determination of possible causes or consequences of failure can be a very subtle, and iterative, process.

  • any provisions for preventing faults from causing failures may themselves be fallible.

• Attempting to enumerate a system’s possible failures beforehand is normally impracticable.

• Instead, one can appeal to the notion of a “judgemental system”.

Systems Come in Threes!

• The “environment” of a system is the wider system that it affects (by its correct functioning, and by its failures), and is affected by.

• What constitutes correct (failure-free) functioning might be implied by a system specification – assuming that this exists, and is complete, accurate and agreed. (Often the specification is part of the problem!)

• However, in principle a third system, a judgemental system, is involved in determining whether any particular activity (or inactivity) of a system in a given environment constitutes or would constitute – from its viewpoint – a failure.

• The judgemental system and the environmental system might be one and the same, and the judgement might be instant or delayed.

• The judgemental system might itself fail – as judged by some yet higher system – and different judges, or the same judge at different times, might come to different judgements.

Judgemental Systems

• This term is deliberately broad – it ranges from on-line failure detector circuits, via someone equipped with a system specification, to the retrospective activities of a court of enquiry (just as the term “system” is meant to range from simple hardware devices to complex computer-based systems, composed of h/w, s/w & people).

• Thus the judging activity may be clear-cut and automatic, or essentially subjective – though even in the latter case a degree of predictability is essential, otherwise the system designers’ task would be impossible.

• The judgement is an action by a system, and so can in principle fail – either positively or negatively.

• This possibility is allowed for in the legal system, hence the concept of a hierarchy of crown courts, appeal courts, supreme courts, etc.

• As appropriate, judgemental systems should use evidence concerning the alleged failure, any prior contractual agreements and system specifications, certification records, government guidelines, advice from regulators, prior practice, common sense, etc., etc.
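The range of judgemental systems can be caricatured in code: a judge is anything that maps observed behaviour, together with whatever evidence it applies (here reduced to a specification predicate), to a verdict. A hedged illustration only; the predicates and service strings are invented.

```python
def make_judge(spec):
    """Build a judgemental system from a specification predicate
    over the delivered service."""
    def judge(observed_service):
        return "ok" if spec(observed_service) else "failure"
    return judge

# Two judges equipped with different "specifications":
strict = make_judge(lambda s: s == "reply within 1s")
lenient = make_judge(lambda s: s.startswith("reply"))

# The same behaviour may be judged differently by different judges:
behaviour = "reply within 5s"
print(strict(behaviour), lenient(behaviour))   # failure ok
```

That two judges can disagree, and that a judge is itself a fallible system, is exactly why the hierarchy of appeal described above is needed.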

A Role for (Formal) Modelling?

• If one could express even some of these ideas in a formal notation, this might facilitate:

  • the analysis of system failures.
  • the analysis and design of (fault-tolerant) systems themselves.

• The notation I’ve been experimenting with is that of Occurrence Nets (aka Causal Nets, Occurrence Graphs, etc.).

• ONs represent what (allegedly) happened, or might happen, and why, in a system – they model system behaviour, not actual systems.

• Simple nets can be shown pictorially.

• They can be expressed algebraically, and have a formal semantics.

• Tools exist for their analysis and manipulation – and even for synthesizing systems from them (in simple cases).

• My thought experiments have concerned what might be called “Structured Occurrence Nets”.

• Their structure results from notions like the fundamental F-E-F chain.

• This structure provides significant complexity reduction, and so could facilitate (automated) failure analyses, or possibly even system synthesis.

(Structured) Occurrence Net Notation

The (simple and perhaps new) idea is to introduce various types of (formal) relations between Occurrence Nets (which are an old idea), and treat a set of such related ONs as a “Structured ON”.

An Aside – Concept Minimization

• The Avizienis et al dependability and security definitions aim to identify and define a minimum set of basic concepts (nouns), such as “fault”, “error” and “failure” – and then elaborate on these using adjectives, such as “transient”, “internal”, “catastrophic”, etc.

• One major problem in such definitional efforts is to avoid circular definitions, and to minimise the base of pre-existing definitions used.

• (We used to use the word “reliance” as the sole basis of our definition of “dependability” – a regrettable near-circularity.)

• This we achieved using as a basic starting point just three conventional dictionary definitions – for “system”, “judgement” and “state”.

• My recent work on Occurrence Nets was sparked by the belated realisation that the concepts of “system” and “state” were not separate, but just a question of abstraction.

• In fact Occurrence Nets can represent both systems and states using the same symbol – a “place”.

Systems & their Behaviour

The markings in the places in the lower ON are in effect “colourings” which identify the system concerned – here they thus relate states to systems.

System Updating

The relations between the upper and lower nets show which state sequences are associated with the systems before they were modified, and which with the modified systems. (This is off-line system modification.)

On-Line System Evolution

This shows the history of an online modification of some systems, in which the modified systems continue on from the states that had been reached by the original systems.

System Creation & Evolution

This shows some of the earlier history of the two systems, i.e. that system 1 created system 2, and that both then went through some independent further evolution.

System Composition

This shows the behaviour of a system and of its three component systems, and how this behaviour is related to that of its components. It portrays what is in effect a spatial abstraction – one can also have a temporal abstraction, via the “abbreviation” relation.

Abbreviation

“Abbreviating” parts of an ON in effect defines atomic actions, i.e. actions that appear to be instantaneous to their environment. The rules that enable one to make such abbreviations are non-trivial when there are concurrent activities.

The Atomicity Problem

Occurrence Nets represent causality, so must be acyclic.

(a) shows a valid collapsing – i.e. a part of the ON that can be regarded as atomic and hence replaceable by a single event box. (b) shows that if a similar collapsing is also applied to the other part of the ON a cycle is introduced.

Thus it is not possible to treat both parts of this ON as being simultaneously atomic.
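The (a)/(b) situation can be reproduced mechanically: abbreviating a candidate atomic part means taking the quotient of the causal arc relation by that part, and the abbreviation is valid only if the quotient stays acyclic. A toy sketch under invented names; the four-event net below is my own example of the pattern described, not the figure from the slide.

```python
def collapse(arcs, group, name):
    """Quotient the arc relation by merging `group` into one node `name`."""
    quotient = set()
    for src, dst in arcs:
        s = name if src in group else src
        d = name if dst in group else dst
        if s != d:                       # drop arcs internal to the group
            quotient.add((s, d))
    return quotient

def acyclic(arcs):
    """Depth-first cycle check over a set of (src, dst) arcs."""
    succ, state = {}, {}
    for s, d in arcs:
        succ.setdefault(s, set()).add(d)
    def visit(n):
        if state.get(n) == "done":
            return True
        if state.get(n) == "active":
            return False
        state[n] = "active"
        ok = all(visit(m) for m in succ.get(n, ()))
        state[n] = "done"
        return ok
    return all(visit(n) for n in {n for arc in arcs for n in arc})

# Two interleaved parts A = {a1, a2} and B = {b1, b2}, causally entangled:
arcs = {("a1", "a2"), ("b1", "b2"), ("a1", "b2"), ("b1", "a2")}
one = collapse(arcs, {"a1", "a2"}, "A")    # (a) a valid abbreviation
both = collapse(one, {"b1", "b2"}, "B")    # (b) abbreviating the other part too
assert acyclic(one) and not acyclic(both)  # (b) introduces the cycle A <-> B
```

So, as the slide states, the two parts cannot both be treated as atomic at once.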

Hardware and Software

The relationship between hardware and the software processes running on it, and indeed between the electricity source that powers the hardware and those software processes, seems to be the same – each could be characterized as “provides means for”, or “allows to exist/occur”.

System Infrastructure

This shows the relation between the computer and electrical systems and the behaviour of the software. Such “infrastructural” relations tend to be ignored, at least until something goes wrong. For example, the computer designer cannot design any mechanisms that will function in the absence of electricity, or the software designer any for a computer that cannot obey instruction sequences.

State Retention, e.g. for Fault Tolerance

To allow for the possibility of failure a system might for example make use of “recovery points”. Such recovery points can be recorded in places that take no further (direct) part in the system’s ongoing behaviour, as portrayed above.
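A minimal sketch of the idea, using an invented `Recoverable` wrapper (not from the talk): checkpoints are held in “side places”, here simply a list, that play no part in ongoing behaviour until an erroneous state is detected and rolled back.

```python
class Recoverable:
    """A system whose state can be checkpointed into retained
    recovery points and later restored."""

    def __init__(self, state):
        self.state = state
        self.recovery_points = []        # side places holding retained states

    def checkpoint(self):
        self.recovery_points.append(dict(self.state))

    def rollback(self):
        """Discard the current (erroneous) state in favour of the most
        recently retained recovery point."""
        self.state = self.recovery_points.pop()

account = Recoverable({"balance": 100})
account.checkpoint()
account.state["balance"] = -999          # an erroneous state is introduced
account.rollback()                       # error recovery via the retained place
assert account.state["balance"] == 100
```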

(Post Hoc) Judgement

Here a place in the judgemental system’s ON is shown as holding a representation of an ON of the system being judged.

Failure Analysis

• Structured ONs could be used to represent actual or assumed past behaviour, or possible future behaviour, and to record F-E-F chains between systems.

• They could be generated and recorded (semi?)automatically – alternatively they might need to be generated retrospectively, from whatever evidence and testimony is available.

• Analysis of a Structured ON typically involves following (possibly in both directions) causal arrows within ONs, and relations between ONs.

• Such analysis is of course limited by the accuracy and the completeness of the Structured ON – and might be interspersed with efforts at validating and enhancing the Structured ON.

• The envisaged forms of structuring have various potential benefits:

  • They allow fairly direct representation of what happens in various complex situations, such as dynamic system evolution, infrastructure failures, etc.

  • They provide “divide-and-conquer”-style complexity reduction, compared to the use of (one-level) occurrence nets.
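Following causal arrows backwards amounts to computing the backward closure of the recorded arc relation from the identified failure; with multiple co-incident faults, as in the hacker-plus-bug example from the earlier slide, all contributing causes are reached. A toy sketch over invented records:

```python
def backward_closure(arcs, failure):
    """All recorded nodes that causally precede `failure`,
    found by chasing arcs against their direction."""
    preds = {}
    for src, dst in arcs:
        preds.setdefault(dst, set()).add(src)
    seen, stack = set(), [failure]
    while stack:
        node = stack.pop()
        for p in preds.get(node, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

# Invented records of a failure with two co-incident faults:
arcs = [("bug left by programmer", "erroneous state"),
        ("hacker exploit", "erroneous state"),
        ("erroneous state", "service failure"),
        ("unrelated event", "other state")]
causes = backward_closure(arcs, "service failure")
assert "bug left by programmer" in causes and "hacker exploit" in causes
assert "unrelated event" not in causes
```

In a real structured ON the traversal would also cross the inter-ON relations, which is where the structuring pays off; this one-level sketch shows only the basic closure step.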

Concluding Remarks

• Possibilities for taking advantage of the complexity-reduction provided by a SON’s structuring would seem to include:

  • A judgemental system, having identified some system event as a failure, could analyze records forming a Structured ON in an attempt to identify (i) the fault(s) that should be blamed for the failure, and/or (ii) the erroneous states that could and should be corrected or compensated for.

(This could be viewed as a way of describing (semi)formally what is often currently done by expert investigators in the aftermath of a major system failure.)

• Structured ONs might be usable for modelling complex system behaviour prior to system deployment, so as to facilitate the use of some form of automated model-checking in order to verify at least some aspects of the design of the system(s).

• In principle (again largely using existing tools) one could even synthesize a system from such a checked Structured ON.

• Next steps (with/by Maciej Koutny) – completion of formal definitions of these SON concepts, and (I hope) exploitation of his work on model checking of occurrence-net-based specifications.

Some References

• Best, E. and Randell, B.: A Formal Model of Atomicity in Asynchronous Systems. Acta Informatica, Vol. 16, pp. 93-124, Springer-Verlag, 1981. http://www.cs.ncl.ac.uk/research/pubs/articles/papers/397.pdf

• Grahlmann, B. and Best, E.: PEP – More than a Petri Net Tool. Proc. of TACAS'96, LNCS 1055, pp. 397-401, 1996. [PEP]

• Holt, A.W., Shapiro, R.M., Saint, H. and Marshall, S.: Information System Theory Project. Applied Data Research ADR 6606 (US Air Force, Rome Air Development Center RADC-TR-68-305), 1968.

• Khomenko, V. and Koutny, M.: Branching Processes of High-Level Petri Nets. Proc. of TACAS'03, LNCS 2619, pp. 458-472, 2003. [PUNF] http://www.cs.ncl.ac.uk/research/pubs/articles/papers/425.pdf

• Merlin, P.M. and Randell, B.: State Restoration in Distributed Systems. Proc. FTCS-8, Toulouse, France, 21-23 June 1978, pp. 129-134, IEEE Computer Society Press, 1978. http://www.cs.ncl.ac.uk/research/pubs/articles/papers/347.pdf