Monitoring and alarm interpretation in industrial environments


S. Cauvin, IFP, 1-4 Avenue du Bois Preau, F-92852 Rueil-Malmaison cedex

M-O. Cordier, IRISA, Campus de Beaulieu, F-35000 Rennes

C. Dousson, France Telecom – CNET, 2 Avenue Pierre Marzin, F-22307 Lannion cedex

P. Laborie, EDF/DER/ER, 1 Avenue du General de Gaulle, F-92141 Clamart cedex

F. Levy, LIPN, Avenue J.B. Clement, F-93430 Villetaneuse

J. Montmain, CEA Marcoule, BP 171, F-30205 Bagnols s/Ceze

M. Porcheron, EDF/DER/SDM, 6 Quai Watier, F-78400 Chatou

I. Servet, LAAS-CNRS, 7 Avenue du Colonel-Roche, F-31077 Toulouse cedex

L. Trave-Massuyes, LAAS-CNRS, 7 Avenue du Colonel-Roche, F-31077 Toulouse cedex

The “ALARM” research group has been running for two years as a multi-laboratory group within the French National Program on Artificial Intelligence (PRC-IA). The group’s objective was to bring together representatives of the academic and industrial worlds in order to analyze, on real applications, the various problems raised by alarm interpretation and to define the potential benefits of AI techniques in this field. This paper presents the conclusions stemming from the analysis and discussions which took place during this period. It describes the industrial applications which the members of the group dealt with and compares them in a table with respect to several criteria identified as the most significant.

1. Introduction

The “ALARM” research group focused on the study of tools and methods for building systems aimed at interpreting alarms. This theme is undoubtedly significant from a practical point of view, as interpreting alarms is critical to the operation of many industrial sites.

This paper presents the conclusions stemming from the analysis and discussions which took place during this period. It does not compare AI techniques with other possible techniques such as those proposed by the process control, statistics or signal processing communities. Moreover, it is primarily concerned with monitoring and diagnosis; problems related to prediction and decision-making are not addressed.

In section 2, we state the role of alarm interpretation in complex process monitoring. In section 3, we define what “alarm interpretation” means. In section 4, we present the techniques used in the industrial applications which the members of the group dealt with. In section 5, we describe each application in detail, in a common framework in order to highlight common features. The last section compares these applications in a table with respect to several criteria identified as the most significant.

2. Alarms and monitoring

Any physical system evolves with time, either due to its own dynamics or under the impact of external actions or events. Informally, monitoring a dynamic system can be viewed as the task of a high-level module which keeps track of the system continuously, analyses all situations encountered, communicates with human operators and suggests decisions to be taken in case of dysfunctions. Monitoring relies heavily on alarms, which rely in turn on sensor values. Such systems are encountered in telecommunication and power distribution networks, as well as in major industrial plants such as nuclear plants, refining or petrochemical plants, etc. They have certain common features. Numerous alarms, which indicate either dysfunctions or other significant events, are continuously produced. The monitoring system can receive up to several hundred messages per second, making interpretation difficult. The task is further complicated by other features of these systems:

AI Communications 0 () 0, ISSN 0921-7126 / $8.00, IOS Press

2 The ALARM research group / Monitoring and alarm interpretation in industrial environments

– they are dynamic: the behavior of individual components evolves over time and, in some cases, changes are due to the activity of other components;

– alarms are not necessarily received in the order in which they were emitted. Propagation times and, sometimes, transmission paths must be taken into account, both to reconstruct the order of events and to decide when all relevant messages have been received;

– alarms are not independent: some are merely the consequence of others;

– some alarms may be lost or masked: they are emitted but, due to the dysfunction of an intermediate element, they never reach the monitoring system.

Therefore, even the absence of an alarm can provide helpful information about the state of the process.
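The reordering problem mentioned in the list above can be illustrated with a minimal sketch. It assumes that each alarm carries its emission timestamp and that an upper bound on the propagation time is known; the names `Alarm`, `ReorderBuffer` and `max_delay` are hypothetical and do not come from any of the applications discussed in this paper.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Alarm:
    emitted_at: float                    # emission timestamp set by the source
    name: str = field(compare=False)     # not used when ordering alarms


class ReorderBuffer:
    """Release alarms in emission order once no earlier alarm can still arrive."""

    def __init__(self, max_delay: float):
        self.max_delay = max_delay       # assumed upper bound on propagation time
        self._heap = []

    def push(self, alarm: Alarm) -> None:
        heapq.heappush(self._heap, alarm)

    def release(self, now: float) -> list:
        """Return, in emission order, the alarms emitted at or before now - max_delay."""
        ready = []
        while self._heap and self._heap[0].emitted_at <= now - self.max_delay:
            ready.append(heapq.heappop(self._heap))
        return ready


buf = ReorderBuffer(max_delay=2.0)
for alarm in (Alarm(5.0, "B"), Alarm(4.0, "A"), Alarm(9.5, "C")):
    buf.push(alarm)                      # received in the order B, A, C
first_batch = [a.name for a in buf.release(now=8.0)]    # ["A", "B"]
```

The bound `max_delay` is what allows the system to decide when all relevant messages have been received; if no such bound exists, a lost alarm and a late alarm cannot be told apart by this means.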

There is very little consensus as to the architecture of the ideal monitoring system, partly because of the diversity of the processes that must be monitored (continuous or discrete, for instance). In general, it consists of several cooperating modules (see figure 1):

– a detection module which gathers elementary information provided by the sensors and decides whether the evolution of the process is normal;

– a diagnosis module, responsible for one or more of the following tasks:

∗ identifying characteristic situations (especially abnormal situations);

∗ localizing the faulty components responsible for the situation;

∗ determining the primary causes of the abnormalities detected;

– a decision module which determines the actions which can be undertaken to reach the objective or bring the process back to normal conditions; these actions are generally presented to the operators.

These modules rely on a knowledge base which contains:

– a library of more or less detailed behavioural models of the physical system, under normal operating conditions and, when possible, in degraded modes or in the presence of failures,

– the objectives to be met by the physical system, possibly together with plans of actions, in normal or abnormal conditions,

– a description (or several descriptions on various levels) of the present state of the system, summarizing recent observations or expressing these in a language which can be understood by the operator.

[Figure 1: block diagram. Recoverable labels: SYSTEM, SENSORS, ACTUATORS; Signals, Alarms, Faults, Decisions, Actions; DETECTION, FILTERING, DIAGNOSTIC, LOCALIZATION, Decision; GENERATION OF ALARMS, INTERPRETATION OF ALARMS, DECISION-SUPPORT, PROPOSED ACTIONS; SEE, INTERPRET, DECIDE.]

Fig. 1. Typical architecture of a monitoring system

3. Alarm interpretation

3.1. What is an alarm?

The very concept of “alarm” varies from one application to another, which can complicate the understanding and expression of the results. This section gives the definition that has been gradually adopted by the members of the “ALARM” group after analyzing together several industrial applications in various sectors (nuclear power systems, telecommunication networks, petroleum industry, etc.). This definition is based on the notion of “event”, whose definition is given first, followed by that of an alarm:

“An event is a piece of information extracted from continuous or discrete signals emitted by a component (significant variation in a variable, message emitted by the system) or data about the context (repair, actions, observations related to the environment, etc.). It is dated and instantaneous.”

“An alarm is a discrete indicator emitted by the monitoring system on the basis of events; it is intended to trigger a human or automated reaction.”

In a certain number of cases, there are several levels of alarms: safety alarms which trigger automated reactions and process alarms intended to draw attention so that preventive or corrective actions are taken. The triggering of an alarm can be determined:

– either directly by a physical mechanism following simple event detection (such as an alarm indicating a pump failure when no output flow is detected);

– or by calculation; for example when a signal is compared to a reference.


This raises the following questions:

– what should the reference be?

– what information in the signal should be compared to the reference; should the signal be processed; should one or several values be compared?

– what means of comparison should be adopted?

– at what level of the overall data-processing procedure should the triggering of alarms be determined: at the level of the sensors, at that of the central system, or at an intermediate level?
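To make these design choices concrete, the following sketch shows one possible answer among many: the reference is a fixed set point, the signal is smoothed by a moving average over several values, and the comparison is an absolute tolerance band. The class name `ThresholdAlarm` and its parameters are hypothetical and purely illustrative.

```python
from collections import deque


class ThresholdAlarm:
    """Trigger when the moving average of a signal leaves a band around a reference."""

    def __init__(self, reference: float, tolerance: float, window: int = 3):
        self.reference = reference            # what the signal is compared to
        self.tolerance = tolerance            # half-width of the acceptable band
        self.samples = deque(maxlen=window)   # the last `window` values only

    def update(self, value: float) -> bool:
        """Add one sample; return True when the smoothed signal is out of the band."""
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return False                      # not enough values yet to judge
        mean = sum(self.samples) / len(self.samples)
        return abs(mean - self.reference) > self.tolerance


detector = ThresholdAlarm(reference=100.0, tolerance=5.0, window=3)
states = [detector.update(v) for v in (100, 102, 110, 99, 112, 115)]
# states == [False, False, False, False, True, True]
```

In this run the isolated value 110 does not trigger the alarm, whereas the sustained rise at the end does. Comparing a single raw value instead of a processed one would react faster but would also produce more spurious alarms.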

Under real operating conditions, the main problems encountered in interpreting alarms are:

– the absence of emission of an alarm, the loss of an alarm during transmission, or the masking of one or several alarms by others;

– cascade phenomena related to the strong interdependence among system variables;

– the pertinence and degree of discrimination of the sets of alarms obtained (do they really enable detection of the real problems?);

– the importance of the context.

These different problems, which are closely related to the use one makes of alarms, are discussed in the following sections.

3.2. Generation of alarms

Alarms may be used for two different purposes, which do not imply the same time constraints.

When they are intended for the operator, they are processed on-line. The purpose of monitoring is then supervisory control, and the interpretation must therefore be done in real time. The operator has a short-term optimization objective: in general, the goal is to remain as close as possible to ideal operating conditions, allowing for the variability of inputs and the natural evolution of the processes. For instance, the structural shifts in the system (wear on parts, slow modifications in the properties of the components, etc.) are not taken into account as such; they are corrected by adjusting the control parameters.

When they are intended for the maintenance expert, processing may be deferred. Off-line analysis is more detailed, since the objective of maintenance is to foresee incidents and to schedule maintenance operations so as to keep failures and interruptions in service to a minimum. Background on the system, the slow evolution of its characteristics and the recurrence of phenomena corrected by the process must be interpreted in order to recognize those structural changes which need to be dealt with.

Diverse objectives imply diverse processing methods as well. When the operator controls the final decisions, an important issue is to avoid the “cognitive overload” problem, characterized by a flood of data, most of which is redundant. The aim is hence to “intelligently” organize the information provided to the operator. When decisions can be taken with no human intervention and it is important to react quickly, interpretation of alarms may also aim at automatically triggering the reactions appropriate to the evolution observed on the system.

3.3. Prioritizing alarms

In the cases where the operator retains the initiative for undertaking actions, the objective is merely to facilitate the operator's understanding of the data. Alarms are often redundant, either because several sensors serve to detect a single phenomenon, or because the causing phenomenon triggers a cascade of secondary anomalies which are not significant on their own, and whose display merely burdens the operator. The aim is therefore to assist the operator in filtering the alarms so as to display only the most significant ones.

Several methods may be used for filtering:

– one can combine measurements provided by elementary sensors to obtain a more synthesized display; for example, a graph can be analyzed and presented in a symbolic form;

– one can use a causal network expressing failure modes and their interdependency, or more directly the dependency among alarms, and display only those which appear the farthest upstream (or correspond to the failure modes farthest upstream) in the network;

– one can also use simulation to verify whether or not the consequences of a hypothetical fault coincide with the real measurements, and in this case, filter out the redundant alarms.
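The second method in the list above, filtering with a causal network, can be sketched as follows. This is only an illustration under simplifying assumptions: the network is given directly as dependencies among alarms, it is acyclic, and the alarm names are invented for the example.

```python
def upstream_alarms(active: set, causes: dict) -> set:
    """Keep the active alarms that no other active alarm can explain.

    `causes` maps an alarm to the set of alarms that directly cause it.
    The network is assumed to be acyclic.
    """
    def has_active_ancestor(alarm) -> bool:
        stack, seen = list(causes.get(alarm, ())), set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in active:
                return True
            stack.extend(causes.get(node, ()))   # look further upstream
        return False

    return {a for a in active if not has_active_ancestor(a)}


# Hypothetical network: pump_trip -> low_flow -> high_temp, valve_stuck -> low_level
causes = {
    "low_flow": {"pump_trip"},
    "high_temp": {"low_flow"},
    "low_level": {"valve_stuck"},
}
shown = upstream_alarms({"pump_trip", "low_flow", "high_temp", "low_level"}, causes)
# shown == {"pump_trip", "low_level"}
```

Because the search walks through ancestors that are not active, a downstream alarm is still filtered when an intermediate alarm has been lost: with only `pump_trip` and `high_temp` active, the function returns `{"pump_trip"}`.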

Another means of helping the operator is by classifying alarms. They are pre-analyzed and presented differently, according to their importance and degree of emergency. Monitoring systems often distinguish several levels of alarms:

– the highest level indicates a direct threat to safety and triggers shutdown of all or part of the supervised system;

– warnings require later repair measures, but generally not immediate shutdown;

– on the lowest level, alarms are merely process indicators used to maintain the system as close as possible to the optimum behavior.

3.4. Processing alarms

Automated interpretation of alarms may have one of several objectives: either to determine the causes of a dysfunction and provide explanations to the operator, or to predict future behaviour of the system so as to assess the degree of emergency of a situation, or to react automatically by triggering physical mechanisms.

Determination of causes is similar to the overall process of diagnosis, and therefore calls for the various techniques in this field. Among the possible causes, we can distinguish internal faults related to equipment failures and external faults corresponding to inputs that are outside the acceptable range. It may be necessary to provide explanations, and therefore to describe explicitly some underlying implicit causal relations. For this purpose, causal networks are again a useful way to work back to the explanations. It must nonetheless be noted that their use for this purpose is different from their use for filtering alarms: for explanation purposes, neglecting one causal relation can render recognition of the primary cause impossible, whereas for alarm filtering, it merely leads to useless redundancies.

Prediction can use more or less complete, quantitative or qualitative models. Such models are rarely available and tractable for the whole system.

Some alarms or alarm configurations are associated with an immediate reaction. Whenever the most crucial factor is the speed of reaction, this association is precalculated and managed by automated control mechanisms. In other situations, there is less emergency (the reaction can come within minutes or even hours) or less potential seriousness. It is therefore beneficial to have access to cheaper techniques for recognizing significant situations, and to pursue research on the various means for automated pattern recognition.

4. AI techniques and tools for alarm management

4.1. Causal graphs relating failures and manifestations

The idea of exploiting knowledge of causes for the purpose of diagnosis is quite natural. A “dysfunction” can be described in a simple way by relations associating its original causes (component failures, illness, etc.) to observable manifestations, or symptoms. Given a theory which models this kind of relation, the diagnostic problem consists in using the theory to seek satisfactory explanations for the observed symptoms.

The basic inference mechanism of this type of reasoning, which “moves from effects to causes”, is called abductive. It can be summarized as follows:

Given fact “B” and the association (causal relation) “A → B” (“A” causes “B”), infer “A is possible”.

This is a relatively old approach to diagnosis which was extensively used in medical diagnosis in the early days. Abductive reasoning and its application to diagnosis problems were formalized in the mid-eighties [1]. More recently, these methods have received even more attention [2].

Let us look more closely at the formalization of abductive diagnosis as described in [3, 4]. A failure model is considered as a theory constituted by a set of relations between the observations expected when the system is behaving abnormally and the “explanations” one can provide for them. The logical structure of these relations is as follows:

C → E : C causes E

C ∧ C’ → E : C and C’ cause E

C ∧ α → E : C can cause E under certain unspecified conditions, represented by the abstract condition α.

Certain terms of the failure model are distinguished for the purpose of diagnosis:

– the observable terms, called symptoms, whose truth value can be established by observing the system;

– the primary causes, i.e. the terms which have no cause in the theory; these terms will be the constituents of the elementary explanations produced.

Given a failure model M constituted of a number of relations in the above form and ψ, a set of observations, the positive observations confirm the existence of symptoms, while the negative observations enable negating the existence of others. The explanations of ψ are the most specific formulas in the language of the primary causes which imply the positive observations and are consistent with the negative observations. Such an explanation is a conjunction of primary causes and of negations of primary causes in M.

For the purpose of illustration, let us look at this extract from a failure model exploited by the DIAPO system for diagnosing the coolant pump sets in EDF nuclear power plants (see section 5.3). The following example is simplified by not considering the time labels attached to the causal relations.

Shaft vibration ∧ α → Blocking of the pump bearing knee joint

Primary water rise ∧ β → Breakdown of the pump bearing knee joint

Blocking of the pump bearing knee joint → High D1/D2 ratio

Breakdown of the pump bearing knee joint → High D1/D2 ratio

Blocking of seal-1 → Decrease of QFJ1 flow

High D1/D2 ratio and Decrease of QFJ1 flow are symptoms. Blocking of the pump bearing knee joint, Breakdown of the pump bearing knee joint and Blocking of seal-1 are diagnostic hypotheses; α and β are abstract conditions.

Given ψ = {Decrease of QFJ1 flow, High D1/D2 ratio}, the full set of observations, the explanation formula is:

F = (Blocking of seal-1 ∧ Primary water rise ∧ β) ∨ (Blocking of seal-1 ∧ Shaft vibration ∧ α)
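The computation of this explanation formula can be reproduced with a small backward-chaining sketch. It is a deliberate simplification and not the DIAPO algorithm: it handles positive observations only, ignores time labels, assumes an acyclic model, and writes the abstract conditions α and β as the strings `"alpha"` and `"beta"`. The function names are hypothetical.

```python
from itertools import product

# Each relation is (set of causes, effect), transcribed from the extract above.
RULES = [
    ({"Shaft vibration", "alpha"}, "Blocking of the pump bearing knee joint"),
    ({"Primary water rise", "beta"}, "Breakdown of the pump bearing knee joint"),
    ({"Blocking of the pump bearing knee joint"}, "High D1/D2 ratio"),
    ({"Breakdown of the pump bearing knee joint"}, "High D1/D2 ratio"),
    ({"Blocking of seal-1"}, "Decrease of QFJ1 flow"),
]


def explain(term, rules):
    """Explanations of one term, as a set of frozensets of primary causes."""
    bodies = [body for body, head in rules if head == term]
    if not bodies:                        # no cause in the theory: primary cause
        return {frozenset([term])}
    result = set()
    for body in bodies:                   # alternative causes give a disjunction
        # every term of the body must itself be explained (conjunction)
        for combo in product(*(explain(t, rules) for t in sorted(body))):
            result.add(frozenset().union(*combo))
    return result


def explain_all(observations, rules):
    """Explanations of a set of positive observations (disjunction of conjunctions)."""
    per_observation = [explain(o, rules) for o in sorted(observations)]
    return {frozenset().union(*combo) for combo in product(*per_observation)}


F = explain_all({"Decrease of QFJ1 flow", "High D1/D2 ratio"}, RULES)
```

Each frozenset in `F` is one conjunct of the formula: `{Blocking of seal-1, Primary water rise, beta}` and `{Blocking of seal-1, Shaft vibration, alpha}`. Handling negative observations would additionally require discarding or refining the conjuncts that predict a symptom observed to be absent.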

There exist many extensions of this principle, particularly to combine abductive and time-based reasoning. This is especially important in applying these methods to dynamic systems subject to continuous monitoring. The observations are dated, and time spans can be added to the causal relations to model the temporal relations between the occurrence of the causes and that of their effects. The major difficulty consists in exploiting the two types of knowledge, causal and temporal, together. For example, [5] proposes to separate the knowledge describing the manifestations caused by the different states of the system indexed by time (behavior models) from that describing the possible transitions from one state to another (transition graphs). The diagnosis is first produced by an abductive resolution in the behavior models, before being validated by verifying its consistency with the possible sequences of state changes.

4.2. Qualitative models, influence graphs

Qualitative physics aims at representing physical systems, predicting and explaining their behavior on the basis of both the causal, common-sense reasoning used by humans to analyze the environment qualitatively, and the scientific knowledge implicitly used by engineers [6, 7]. The representation of continuous variables must be guided by the following principle: the distinctions made by a quantification must be pertinent to the type of reasoning adopted. Let us add that the models are declarative, in the sense that the representation of the system (the model) and the reasoning supported by the model are independent. Also note that qualitative formalisms are particularly useful in representing imprecise, uncertain or incomplete knowledge.

All these facts explain why qualitative physics has proved well suited to supporting the monitoring of continuous processes: when the objective is to explain the functioning of the process to the operators, pertinence is indeed more important than precision at the moment of making decisions. Unlike quantitative modeling, the qualitative approach enables one to deal with systems in their operational environment and to understand industrial installations as a whole.

In comparison with the purely symbolic models used in logical frameworks, qualitative models have the advantage of supporting arithmetic reasoning. Nevertheless, being generally based on intuitive concepts (orders of magnitude, causality, etc.), they make it possible to derive inference mechanisms analogous to an engineer's reasoning about physical systems. In this way, they are well suited to the needs of high-level tasks like monitoring.

Qualitative models are generally used for detection: they provide the reference behavior to which the measurements are compared. The difference between predicted and measured behavior provides the basis for fault detection. As the explicit nature of these models is well adapted to mechanisms of explanation, they can generally be used directly for isolation as well.

Qualitative models can be divided into two classes, according to whether or not they represent causality explicitly.

4.2.1. Models with no explicit causality

In these models, the constraints linking the different variables of the system are of equation type. The basic techniques are constraint resolution and qualitative calculus. The most commonly used qualitative simulator for implementing this type of model is QSIM [8, 9]. The QSIM algorithm recursively examines active states to generate all possible successor states, then eliminates those which violate the model constraints in a second step. Because a given state can have several successors, QSIM builds a state tree in which each branch represents one possible behavior of the physical system.

QSIM has been used as the core of a monitoring system called MIMIC [10]. Two phases can be distinguished in MIMIC: monitoring and diagnosis, both of which call on QSIM. The concepts of monitoring, diagnosis and simulation are associated in a cycle of hypothesis / model building / simulation / comparison. MIMIC uses a library of qualitative models including one model of normal behavior and failure models.

The monitoring phase consists in updating the “monitoring set” composed of those models which are consistent so far with the observations received from the sensors. If M is such a model, the diagnosis phase first identifies all components responsible for deviations between the observations and the present state of M. Then, in a second step, making the hypothesis of a single failure, it finds the variants of M which correspond to the observations by modifying the operating mode of one component. The new models obtained are finally added to the monitoring set.
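The monitoring-set update described above can be sketched on a toy example. This is not MIMIC: the qualitative simulation is replaced by a hand-written `predict` function, the first diagnosis step (identifying the components responsible for the deviations) is skipped in favour of trying every component, and the two-component system and all names are hypothetical.

```python
# Hypothetical system: a pump feeding a line through a valve.
MODES = {"pump": ["ok", "stopped"], "valve": ["ok", "stuck_closed"]}


def predict(model: dict) -> dict:
    """Toy stand-in for qualitative simulation: expected values for a mode assignment."""
    pump_ok, valve_ok = model["pump"] == "ok", model["valve"] == "ok"
    flow = "normal" if pump_ok and valve_ok else "zero"
    if not pump_ok:
        pressure = "low"
    else:
        pressure = "normal" if valve_ok else "high"
    return {"flow": flow, "pressure": pressure}


def consistent(model: dict, observations: dict) -> bool:
    predicted = predict(model)
    return all(predicted[var] == val for var, val in observations.items())


def single_fault_variants(model: dict):
    """Variants of a model obtained by changing the mode of exactly one component."""
    for component, modes in MODES.items():
        for mode in modes:
            if mode != model[component]:
                yield {**model, component: mode}


def update_monitoring_set(monitoring_set: list, observations: dict) -> list:
    """Keep consistent models; replace refuted ones by their consistent variants."""
    kept = []
    for model in monitoring_set:
        candidates = [model] if consistent(model, observations) else [
            v for v in single_fault_variants(model) if consistent(v, observations)]
        kept.extend(c for c in candidates if c not in kept)
    return kept


nominal = [{"pump": "ok", "valve": "ok"}]
after_fault = update_monitoring_set(nominal, {"flow": "zero", "pressure": "high"})
# after_fault == [{"pump": "ok", "valve": "stuck_closed"}]
```

With the observation of zero flow and high pressure, the nominal model is refuted, the "pump stopped" variant is rejected because it predicts low pressure, and only the "valve stuck closed" variant enters the monitoring set.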

4.2.2. Models with explicit causality or influence graphs

Davis [11] wrote that a large part of the knowledge needed for analyzing degraded operating conditions builds on understanding the mechanisms in terms of causality. A system can often be described by its structural equations, which are generally algebraic interpretations of the physical laws governing the system; causal ordering provides a guideline for identifying existing links between dependent and independent variables. Indeed, a causal structure is a description of the influences that the variables have on one another. The behavior of any system can be at least partially described by an influence graph, i.e. an oriented graph whose nodes represent the variables and whose arcs represent the one-direction relationships among them. The latter provides a conceptual tool for examining the way changes in an installation are propagated. Bandekar [12] showed that an explicit representation of relations of this kind is directly useful for diagnosis: in model-based reasoning, this means that knowledge about causal dependencies can be used in searching for the primary deviation in the graph. An influence graph is above all a structure, which can be enriched depending on the knowledge available and the precision required for the diagnosis: the variables can be associated with alarms, but also with more complex features (model-process errors for example); the arcs may indicate only the signs of the influences, but also complex dynamic functions.

Present alarm processing systems provide operators with an unsequenced list of simultaneous alarms which affect the process; the operators must then interpret them and extract the source alarm, that is to say the one that explains all the other alarms set off. Influence graph methods aim at helping operators in their isolation task. Given a set of alarms, the influence graph can serve to build a tree whose root is the source alarm and whose branches illustrate the propagation path of alarms. The principle is more or less always the same. The goal is to explain abnormal deviations in the evolution of the variables of an installation with a minimum number of faults at the source. A primary fault is a change in the evolution of a variable which is directly due to a failure or a non-measured degradation, while secondary faults result from the propagation of this deviation in time, causing new deviations. The diagnosis consists in seeking the source variable, whose variation is sufficient to explain all the deviations detected in other variables. The result may be either a degraded arc arriving at the source variable, which corresponds to an at least partial lack of one function of the installation (failure of a component), or a degradation which directly affects the source variable. To analyze the propagation path, local tests between one variable and its antecedents are performed, evaluating the consistency between the arcs and the state of the variables; the test is more or less complex, depending on the information carried by the arcs and the definition of the state of a variable.
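The search for the source variable can be sketched on a purely structural influence graph. The sketch below assumes a single source and ignores the signs and dynamics carried by the arcs, so the local consistency tests mentioned above are not performed; the graph and the variable names are invented for the example.

```python
def reachable(graph: dict, start: str) -> set:
    """Variables influenced, directly or indirectly, by `start` (itself included)."""
    seen, stack = {start}, [start]
    while stack:
        for successor in graph.get(stack.pop(), ()):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen


def source_candidates(graph: dict, deviating: set) -> set:
    """Deviating variables whose propagation alone can reach every deviation."""
    return {v for v in deviating if deviating <= reachable(graph, v)}


# Hypothetical influence graph: feed_flow -> level -> {outlet_temp, pressure}
graph = {"feed_flow": ["level"], "level": ["outlet_temp", "pressure"]}
sources = source_candidates(graph, {"level", "outlet_temp", "pressure"})
# sources == {"level"}
```

Here `level` is the only deviating variable from which all other deviations can be reached, so the primary fault is either a degradation acting directly on `level` or a degraded arc arriving at it. On a graph with loops, several candidates may be returned and the local tests on the arcs are needed to discriminate among them.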

4.3. Expert systems

Expert systems, like systems based on pattern recognition (see section 4.4), are based on the direct expression of the links between a set of observable events (or symptoms) and a characteristic situation one wishes to identify (in particular, a system dysfunction). Expert systems were the first tools used for diagnosing static systems. They were later extended to diagnosing dynamic systems. Today, they are still the most widely used in industrial monitoring systems. In the eighties, PICON was designed to reason in real time about process control data. It led to the development of the G2 expert system generator (developed by GENSYM [13]), which is used in many industrial applications and, in particular, by IFP (Institut Francais du Petrole) in the Alexip software [14, 15]. The expert system approach was also adopted by Sollac in the SACHEM project and by France Telecom for monitoring the Transpac network [16], before the GASPAR project started. This choice is justified by long industrial experience with this type of tool and its relative simplicity of implementation.

The advantages and drawbacks of expert systems for alarm processing and monitoring tasks are summarized below.

So-called real-time expert system generators are characterized by the fact that they take temporal aspects into account. Consequently, there are temporal primitives in the language. This enables one to express symbolic and/or numeric temporal constraints on observed events. Another consequence is the objective of guaranteeing a bounded response time without making the efficiency too dependent on the way the rules are written, as this would make the rules difficult to program and read. Moreover, real-time expert system generators offer interruption mechanisms which allow for reaction, as well as focusing mechanisms. Most often, there is also the possibility of compiling the rule base.

A further feature of these systems is their management of a time-stamped fact base, with mechanisms for “forgetting” or storing the facts. This type of fact base enables more natural memorization of information, considerably facilitating contextual recognition of messages. On the other hand, the very existence of this memory is a source of problems due to the continuous stream of information; management of the obsolescence of facts in the base depends heavily on the application. Very often, the mechanisms for forgetting or discarding facts are included in the expert system kernel, but they are not as powerful as needed. Hence the programmer must regularly clean out the fact base to prevent it from becoming overloaded.
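A time-stamped fact base with a very simple forgetting policy can be sketched as follows. It is not modeled on any particular expert system generator: the class `FactBase`, its methods and the age-based policy are assumptions made for illustration, and facts are assumed to be asserted in timestamp order.

```python
from collections import deque


class FactBase:
    """Time-stamped facts with an age-based forgetting policy."""

    def __init__(self, max_age: float):
        self.max_age = max_age
        self._facts = deque()            # (timestamp, fact) pairs, oldest first

    def assert_fact(self, timestamp: float, fact: str) -> None:
        self._facts.append((timestamp, fact))

    def forget(self, now: float) -> int:
        """Drop the facts older than max_age; return how many were removed."""
        removed = 0
        while self._facts and now - self._facts[0][0] > self.max_age:
            self._facts.popleft()
            removed += 1
        return removed

    def recent(self, fact: str, now: float, within: float) -> bool:
        """True if `fact` is stored with a timestamp no older than `within`."""
        return any(f == fact and now - t <= within for t, f in self._facts)


base = FactBase(max_age=60.0)
base.assert_fact(0.0, "link_down")
base.assert_fact(50.0, "cpu_high")
seen_before = base.recent("link_down", now=55.0, within=60.0)   # True
dropped = base.forget(now=70.0)                                 # 1 (link_down)
```

A single global `max_age` illustrates why such built-in mechanisms are often insufficient: in practice the useful lifetime of a fact depends on its type and on the context, which is precisely what makes obsolescence management application-dependent.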

Looking at the formalisms adopted in expert systems, it can be noted that they are well suited to expressing the actions to be performed when an alarm is triggered, to tuning the rules so that they can take losses of information into account, or to taking all sorts of heuristics into account. Interaction between the actions performed on the system or the fact base and the condition portion of the rules, however, is often non-deterministic in theory and difficult to control in practice. Furthermore, many expert systems require complete knowledge, which the expert is unable to provide. Inference engines often rely on the (strong) hypothesis of completeness of the knowledge base. One key issue then remains the acquisition of all the expertise rules.

Finally, most expert system generators allow for easy integration of tools for processing sensor data and for simulation. They also provide user-friendly, flexible interfaces. G2, for example, offers a simulator which includes equation resolution algorithms, a graphic language enabling description of diagrams, and a module allowing for integration of neural networks.

Below, we describe a few examples of French expert monitoring systems. Sollac’s SACHEM project for a high-temperature furnace process control system relies on an expert-system approach using the Kool object-oriented language. Expertise was acquired from high-temperature furnace process experts in Fos-sur-Mer. Alexip [17] was developed at IFP for monitoring refining and petrochemical processes. It was implemented with the G2 software and tested on the Alphabutol process and, for some parts, on pilot plants. In this system, explicative models (causal graphs, influence graphs) are used in cooperation with knowledge-based systems.

Another example is the software that was developed at France Telecom on top of the Chronos software for monitoring the Transpac network. This system is now operational. A new approach is currently being studied for this application in the context of the GASPAR project, essentially because of the difficulty of updating the expertise when the monitored system changes.

In summary, the delicate aspects of this type of approach are acquisition, validation and the degree of coverage of the expertise obtained (is it consistent? does it completely cover all foreseeable situations?). Knowledge is acquired from experts, implying a risk of inaccuracy, inconsistency and incompleteness. As regards validation, the techniques proposed for knowledge bases appear well suited, but they generally focus little on verification of temporal constraints. A further point is the inherent non-generic nature of this type of model. Evolution in the process can render the full set of expertise obsolete.

On the other hand, the following benefits of expert systems are widely recognized: their efficiency, due to the direct association between alarms and situations; the fact that alarm patterns generally contain only those discriminants necessary and sufficient for identifying a situation; their possible use for detection, filtering, interpretation and diagnosis; and finally, the fact that all knowledge thus used can be understood by the operators and therefore can serve directly for providing explanations or justifications.

4.4. Recognition of chronicles

While expert systems base their reasoning on rules,relegating time information to the background, recog-


A 1

A 2 A 3

A 4 A 5

[1’,3’]

nition of chronicles is based on diagrams of evolutionin which time is fundamental.

Chronicles or patterns represent a possible evolution in what is observed. A chronicle is a set of on-off events, interlinked by time constraints, and whose occurrence may be dependent on the context.

In the monitoring framework, these events could be alarms referring to the supervised system, as time information would enable sequencing them and even specifying time spans between two occurrences.

For example, in the example chronicle over the alarms A1 to A5, the alarms are partially sequenced, and the temporal constraint between A2 and A5 means that at least one second, but no more than three seconds, may elapse between their occurrences.
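As an illustration, such a constraint can be checked on a set of dated alarms with a few lines of Python. This is only a minimal sketch: the `Constraint` class, the `satisfies` function and the ordering of A1 before A2 are assumptions introduced for the example; only the bound of one to three seconds between A2 and A5 comes from the text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    first: str      # alarm expected first
    second: str     # alarm expected second
    min_gap: float  # minimum delay in seconds
    max_gap: float  # maximum delay in seconds


# Hypothetical encoding of part of the example chronicle.
CHRONICLE = [
    Constraint("A1", "A2", 0.0, math.inf),  # assumed ordering, illustration only
    Constraint("A2", "A5", 1.0, 3.0),       # from the text: between 1 s and 3 s
]


def satisfies(dates, constraints):
    """Return True if the dated alarms satisfy every constraint.

    dates maps an alarm name to its occurrence time in seconds.
    """
    for c in constraints:
        if c.first not in dates or c.second not in dates:
            return False  # an expected alarm is missing
        gap = dates[c.second] - dates[c.first]
        if not (c.min_gap <= gap <= c.max_gap):
            return False
    return True
```

For instance, `satisfies({"A1": 0, "A2": 2, "A5": 4}, CHRONICLE)` holds, whereas the same alarms with A5 dated at 6 seconds violate the upper bound.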

There are several methods of recognition. To name only those which take a similar approach, [18] presents the problem from the point of view of compatibility of two time-constraint networks, one of which corresponds to the model to be recognized and the other, the session, relates to the constraints between the observed events.

[19] proposes a method based on complete prediction of the possible dates for each event which has not yet occurred; all these values (called temporal windows) are reduced by propagation of the dates of observed events through the graph of time constraints of the chronicle. Recognition is incremental: each event is integrated as soon as it occurs, and recognition is performed over a single reading of the input stream (the system does not manage a record of observed events). This method has a high-performance algorithm, partly due to a phase of compilation of the chronicles.
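The idea of temporal windows can be sketched as follows. This is a simplified illustration and not the algorithm of [19]: the function names, the list-based windows and the restriction to an acyclic, consistent set of constraints are assumptions made here.

```python
import math


def propagate(windows, constraints):
    """Shrink the windows until no constraint can tighten them further.

    windows: dict event -> [earliest, latest] possible date
    constraints: list of (a, b, lo, hi) meaning lo <= date(b) - date(a) <= hi
    Assumes an acyclic, consistent constraint graph (illustration only).
    Returns False if a window becomes empty: the chronicle cannot match.
    """
    changed = True
    while changed:
        changed = False
        for a, b, lo, hi in constraints:
            wa, wb = windows[a], windows[b]
            new_b = [max(wb[0], wa[0] + lo), min(wb[1], wa[1] + hi)]
            new_a = [max(wa[0], new_b[0] - hi), min(wa[1], new_b[1] - lo)]
            if new_a != wa or new_b != wb:
                windows[a], windows[b] = new_a, new_b
                changed = True
            if new_a[0] > new_a[1] or new_b[0] > new_b[1]:
                return False
    return True


def observe(windows, constraints, event, date):
    """Integrate one observed event as soon as it occurs."""
    earliest, latest = windows[event]
    if not (earliest <= date <= latest):
        return False  # the event falls outside its predicted window
    windows[event] = [date, date]
    return propagate(windows, constraints)
```

With the single constraint `("A2", "A5", 1, 3)` and unbounded initial windows, observing A2 at date 10 reduces the window of A5 to [11, 13]; an occurrence of A5 at date 15 is then rejected without any record of past events being consulted.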

The approach described in [20] is based on a finite-state automaton whose transitions are characterized by the occurrence of events. This approach is extremely efficient for recognizing sequences, but its performance is affected by the introduction of quantitative time constraints or by the use of other than sequential structures. As above, the system requires only one reading of the input stream and can later “forget” the events.

Chronicle recognition has been used, for example, by [21] in the AUSTRAL project in order to analyze sequences of alarms emitted by substations in a French medium voltage distribution network. It is also used in the GASPAR project in order to analyze alarms issued by the network equipment in a telecommunication network.
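A minimal sketch of sequence recognition by automaton is given below. It is not the system of [20]: the class name is hypothetical, and the choice of ignoring events that do not enable the current transition is an assumption made to keep the example short.

```python
class SequenceRecognizer:
    """Recognizes one fixed alarm sequence in a single pass over the stream."""

    def __init__(self, pattern):
        self.pattern = list(pattern)
        self.state = 0          # number of pattern events matched so far
        self.recognitions = 0

    def feed(self, event):
        """Consume one event; return True when the sequence is completed."""
        # Assumption: events that do not enable the current transition
        # are simply ignored.
        if event == self.pattern[self.state]:
            self.state += 1
            if self.state == len(self.pattern):
                self.recognitions += 1
                self.state = 0  # only the state is kept, past events are forgotten
                return True
        return False
```

The only memory is the current state, which is why the events themselves can be forgotten once read.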

4.5. Discrete-event models

These are models which describe system behavior in different modes (normal behavior, degraded behaviors, behavior in the case of failure). In general, they enable simulating the system step by step, and thus predicting the values of observable variables. They can therefore be used directly to detect abnormal situations by means of comparison between predictions and observations. Because of the dynamic nature of the supervised systems, a behavior mode is described by a set of states (stable or transient) and transitions between these states.

In discrete-event models, time is rendered discrete, as are the variables. The underlying formalism is the finite-state automaton; the only dates considered are those corresponding to a change of state. Petri nets fall into this category, and have been enriched (time-lag, temporal, stochastic and fuzzy Petri nets) for improved representation of evolutions in a dynamic system. Petri nets are essentially used for simulation, and, in particular, for better handling of event synchronization problems. They are not yet widely in use; we might cite the use of fuzzy Petri nets for an Esso refinery in Canada and the proposal in [22] which draws on them to build a situation graph.

Another possibility is to model system operation directly using automata. This is the approach proposed by (Sampath 94) and used by [23] for monitoring the Transpac network. In both cases, the suggestion is to build the global automaton of the system by combining elementary automata associated with system components and available in a library. This representation is appropriate for simulation and detection. For interpretation and diagnosis, two methods are possible. The first consists in the off-line transformation of the control into a “diagnoser”, as proposed by [24]; the second, followed in [23], consists in simulating this control on the basis of the most common failures so as to build pairs (set of observable events, failures) which can be used in on-line recognition of patterns ([25]). It is often necessary to represent explicitly the time constraints that are verified by the changes of state in the system. One possible formalism is that of temporal automata. This approach is used, for example, to model the Transpac network in [23, 26].
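The second method can be illustrated by a toy simulation that builds (observable events, failures) pairs. The automaton below is entirely hypothetical (state and event names are invented, and it is assumed to be acyclic); it only shows the principle, not the models of [23].

```python
# Hypothetical component model: state -> {event: next state}.
TRANSITIONS = {
    "ok":       {"link_cut": "cut", "overload": "loaded"},
    "cut":      {"alarm_loss": "cut_seen"},
    "cut_seen": {"alarm_reroute": "done"},
    "loaded":   {"alarm_delay": "done"},
    "done":     {},
}
FAILURES = {"link_cut", "overload"}  # failure events, assumed unobservable


def simulate(state, trace=(), failures=frozenset()):
    """Yield (observable trace, failures) for every maximal run from `state`."""
    moves = TRANSITIONS[state]
    if not moves:
        yield trace, failures
    for event, nxt in moves.items():
        if event in FAILURES:
            yield from simulate(nxt, trace, failures | {event})
        else:
            yield from simulate(nxt, trace + (event,), failures)


def build_patterns(initial="ok"):
    """Map each observable trace to the failure sets that can produce it."""
    patterns = {}
    for trace, failures in simulate(initial):
        patterns.setdefault(trace, set()).add(failures)
    return patterns
```

The resulting dictionary plays the role of a pattern base: at run time, an observed trace is looked up to retrieve the candidate failures.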

4.6. Other formalisms

A number of other possible formalisms are not described here, as we have voluntarily limited ourselves to those used in the applications with which the group participants deal directly. We could cite the multi-agent type techniques used, for example, for alarm filtering at the Saclay French Atomic Energy Commission installation ([27]), as well as Bayesian networks, probabilistic temporal networks and neural networks.

Others are based on modeling by means of traditional or linear logic, but their applications are rare.

5. Our applications

In this section, we describe each application the members of the “ALARM” group have been involved in, using a common framework in order to highlight the common features. Tables comparing the applications on these common features are given in section 6.

5.1. IFP: ALEXIP project

Alexip is an architecture for supervising refining and petrochemical processes studied by IFP (Institut Francais du Petrole).

5.1.1. Supervised system

The IFP has some fifty pilot plants at the CEDI (Industrial Research and Development Centre) at Solaize. These pilot plants are small scale refining units in which various tests are carried out. The aim is to obtain very accurate results, for all the useful parameters of a process, under given operating conditions. Each pilot plant is equipped with a PC with local control software. Data are then fed back into a real-time object database to be displayed to the supervisor.

Furthermore, the IFP sells industrial units. The objective is thus to obtain products in quantity to the required specifications.

5.1.2. Aims of the supervision

One person permanently works as the supervisor over the pilot units. They detect any anomaly in any of the units and coordinate the work of the operators forming part of the crew stationed there. For this, they must monitor the changes in the different variables and must deal with any alarm reaching them. Threshold values are set on each magnitude according to the desired operating conditions. When any of these values is exceeded, an alarm is triggered which is fed back to the central station. The operator must analyze the situation and react according to the degree of seriousness of the underlying problem.

The object of the supervision system is to help operators control the processes in the optimum manner. Since control systems are installed on the units, and the supervisor has access to all the digital data available on the processes, the primary objective of an additional computer system is to arrange and filter the alarms presented to the operators so that they are not drowned in a mass of information. The aim of this alarm processing is to organize matters, not to carry out a precise diagnosis of a critical situation. This precise diagnosis requires the use of physico-chemical knowledge of the process and forms the second level of a sophisticated system of supervision. Finally, a further objective is to advise the operator on what to do, i.e. to help with the on-line application of the expert-defined operating procedures.

5.1.3. Inputs

On average a hundred or so variables are monitored for each pilot unit. These consist of pressures, temperatures, flow rates and possibly analyses, i.e. product composition measurements. These data are processed every second locally and every minute centrally. At the supervisor level, an alarm is set off when a threshold is exceeded by a given variable for a certain time. Two threshold levels are defined for generating:

– safety alarms which trigger automatic safety systems;

– control alarms which are used to control the unit within the desired operating ranges.
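The rule "a threshold exceeded for a certain time" can be sketched as below. The function name, the sample format and the re-arming policy are assumptions made for the illustration; the actual supervisor logic is not described in this paper.

```python
def persistent_alarms(samples, threshold, min_duration):
    """Return the times at which an alarm is raised.

    samples: iterable of (time, value) pairs sorted by time.
    An alarm is raised once the value has stayed above `threshold`
    for at least `min_duration`; it is re-armed when the value drops
    back to or below the threshold (assumed policy).
    """
    raised = []
    start = None    # time at which the current excursion began
    active = False  # True once the alarm for this excursion was raised
    for t, v in samples:
        if v > threshold:
            if start is None:
                start = t
            if not active and t - start >= min_duration:
                raised.append(t)
                active = True
        else:
            start, active = None, False
    return raised
```

Two instances of this check, with different thresholds, would correspond to the safety and control levels.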

5.1.4. Models

5.1.4.1. Causal graph for alarm filtering. For filtering the alarms occurring over time and displaying in real time the main problems to be resolved in an orderly and synthetic manner, a causal network combined with a system of priorities is used. The causal network is generated automatically off-line, from the status display of the unit. It is then examined in real time, to determine the problems at the alarm sequence source [17, 15]. The method consists in:

– choosing an entry point on the status display (a feed);

– examining each element and determining the problems which may disrupt its operation;

– examining the consequences these problems have on the elements situated upstream and downstream;

– connecting the problems with each other by carrying out pattern matching on the consequences of one problem and the conditions of occurrence of another. Delays are introduced into the links to take into account the fairly slow response times of the refining processes.
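The real-time examination of such a network can be illustrated by the following sketch, which keeps only the problems that no active cause explains. The data layout and the rule "a cause explains an effect if it appeared at least one link delay earlier" are assumptions made for the example, not the Alexip implementation.

```python
def source_problems(links, onset):
    """Return the active problems that no active cause can explain.

    links: dict effect -> list of (cause, delay) pairs
    onset: dict problem -> time at which the problem became active
    """
    sources = set()
    for problem, t in onset.items():
        explained = any(
            cause in onset and onset[cause] + delay <= t
            for cause, delay in links.get(problem, [])
        )
        if not explained:
            sources.add(problem)
    return sources


# Hypothetical network: a pump stop lowers the flow, a low flow raises a temperature.
LINKS = {
    "low_flow": [("pump_stop", 5)],
    "high_temp": [("low_flow", 60)],
}
```

With onsets `{"pump_stop": 0, "low_flow": 10, "high_temp": 100}` only `pump_stop` is reported; if `high_temp` appears at time 30, before the delay of the link has elapsed, it is reported as a separate source.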

5.1.4.2. Causal graph for the diagnosis of critical subsections. Secondly, it is necessary to go deeper into the diagnostics of the sensitive parts of the process. The problem is to determine precisely which events govern the progress of the procedure. This detection is difficult due to the lengthy response times of the unit to elementary disruptions. The performance of the process is the continuous result of the combination of actions spread over time. The computer system contains the description of each event and all its consequences when it alone disrupts the operation of the unit. The consequences concern the position of the variables with respect to a given reference state and changes in them. In real time, the Dominant/Masked algorithm [28, 29] detects the dominant events for which all the long or medium term characteristic consequences are observed, and the masked events for which certain consequences are not observed due to the presence of dominant events.
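The distinction between dominant and masked events can be illustrated by the simplified classification below. It is not the Dominant/Masked algorithm of [28, 29]: the event names are invented, consequences are reduced to a deviation per variable, and the masking rule (every missing consequence bears on a variable already driven by a dominant event) is an assumption.

```python
def classify(events, observed):
    """Split candidate events into dominant and masked ones.

    events: dict name -> dict variable -> expected deviation ("high"/"low")
    observed: dict variable -> observed deviation from the reference state
    """
    def seen(var, dev):
        return observed.get(var) == dev

    # Dominant: all characteristic consequences are observed.
    dominant = {n for n, cons in events.items()
                if all(seen(v, d) for v, d in cons.items())}
    driven = {v for n in dominant for v in events[n]}

    # Masked (assumed rule): some consequences are observed, and each
    # missing one concerns a variable driven by a dominant event.
    masked = set()
    for n, cons in events.items():
        if n in dominant:
            continue
        missing = [v for v, d in cons.items() if not seen(v, d)]
        if len(missing) < len(cons) and all(v in driven for v in missing):
            masked.add(n)
    return dominant, masked
```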

5.1.4.3. Graphs of situations and operating procedures. When the diagnosis is finished, the system has the information necessary for characterizing the overall situation of the process. Each situation has an associated operating procedure [22]. A mechanism derived from Petri nets can be used to manage the chaining of situations over time. Several transitions are grouped together thanks to the concept of situation classes. The expert-defined operating procedures are described in the form of decision trees using condition or premise boxes, and action boxes. The premise boxes test the changes in certain variables, diagnostic conclusions or specific rules. Actions can be actions on operating parameters (methods of calculating their amplitude are then attached to them) or maintenance actions. These trees are used in real time to select possible action plans and arrange them in order of preference.

5.1.4.4. Causal graphs of short term interactions between variables. An explanatory module of the behaviour of the process [28, 15] uses causal graphs of short term interactions between variables. The variables are interconnected by links bearing the + sign when they are evolving in the same direction and bearing the ≠ sign when they are evolving in the opposite direction. Thresholds are set at which certain phenomena appear. According to the actual progress of the process, the valid causal pathways in the graph are identified to provide an explanation of the behaviour of the process.

5.1.5. Outputs

The alarm management system displays the source problems of an alarm sequence. However, certain consequent problems may be very important and demand immediate action. A system of priorities is used to identify them and display them nevertheless. The combined use of priorities and the causal graph enables the seriousness of possible future alarms to be foreseen and, accordingly, processing priorities to be defined between the separate alarm groups. In the same way, the conclusions of the dominant/masked algorithm and the action plans that are generated are displayed with their explanations according to their priorities.

5.1.6. Implementation

Alexip was developed with the aid of Gensym’s G2 software [13]. The Alexip architecture fully uses the object and graphical language offered by G2. All the algorithms were implemented in the form of G2 procedures. These manipulate the graphics containing the knowledge relating to the procedures studied. The knowledge is therefore well isolated and may be easily updated or modified by the experts. The inference engine is used for triggering the algorithms in real time on suitable data, and for focusing on critical parts of the application. Finally, “bridges”, i.e. external communication modules provided by Gensym, are used to connect up with the supervisors and make the application independent of the latter.

5.1.7. Conclusions

Alarm processing involves all the data and calls upon the instrumentation knowledge of the process, not its chemistry. This module has been successfully tested on several pilot plants using different procedures. The proposed approach truly allows the operator’s load to be contained. It takes into account the whole of the industrial mechanism, which is useful since one finds that many malfunctions occur in the various components of an industrial site and not always in the sensitive subsections forming the subject of mathematical modeling.


5.2. Electricite de France (Network Study Branch): AUSTRAL project

5.2.1. Supervised system

The French medium voltage (MV) power distribution system is a three-phase network mainly operating at 20 kilo-volts. It is fed by the power transport and repartition networks (high voltage HV, more than 60 kilo-volts) and supplies MV/LV transformer substations and some industrial customers (both MV/LV substations and industrial customers are called loads).

At every moment, the system functions with a radial structure from HV/MV (primary) substations to the loads. Nevertheless this structure is meshable thanks to a set of remote-controlled or manually-operated circuit-breakers. This allows a reconfiguration of the system to recover a maximum of loads after a permanent fault occurs.

In a primary substation, the MV produced by HV/MV transformers feeds a busbar via an incoming feeder. The busbar supplies MV to a set of (outgoing) feeders, each of which is protected by a circuit-breaker. Fig. 2 describes a primary substation and a part of the distribution network downstream of one feeder.

[Figure labels: repartition network; HV/MV substation with transformers, incoming feeders, busbar, feeders and shunt; distribution network with loads; numbered items 1 to 5.]

Fig. 2. Substation and distribution network

The objective of the system operation is to eliminate or reduce the impact that the different faults occurring on the system could have on the quality of service.

Except for some mechanical faults in the HV/MV transformers, most of the faults are electrical short circuits (more or less resistive) that may occur in the different components of a substation (transformers, incoming feeders, busbars) and on the lines of the network, for instance the appearance of an electrical arc between one phase and its support after a thunderbolt.

Fault detectors (protections) are positioned in the substations as well as on the network. When a fault is detected by a protection relay, it can trigger some automatic device in order to isolate or eliminate the fault.

The main automatic devices are described below (the item numbers refer to Fig. 2). As the same fault is detected by all the protections upstream, the functioning of the automatic devices is coordinated, through their specified times, so that they react in the following order:

1. some outgoing feeders are protected by a shunt that reacts by short-circuiting the faulty phase for a few hundred milliseconds in order to eliminate transient faults;

2. on some outgoing feeders (in general aerial feeders), an automatic recloser applies one or several circuit opening cycles. If the fault is not eliminated after these cycles, the feeder circuit breaker opens permanently;

3. on the busbar, an automatic device opens the surrounding circuit breakers (incoming and outgoing feeders, switched busbar circuit breaker) in case of an internal fault;

4. the incoming feeder circuit breaker opens when a fault is detected by its associated protection;

5. when a fault occurs in an HV/MV transformer, the surrounding circuit breakers (incoming feeder, HV transformer feeder) are automatically opened.

Each of these automatic devices is triggered by one or several local protections. The fault detectors and automatic devices send remote signals about their behavior (fault detection, opening or closing of a circuit breaker, etc.) to the telecontrol system in the centre. Automatic devices can be affected by outages; for instance, a circuit breaker may not open when asked to do so.

The French distribution system is controlled by about a hundred control centres. Each centre is responsible for several tens of substations.

The configuration varies from one centre to another. Centres can differ because of the network structure or the equipment configuration. As an example, the centre of Lyon controls 2300 km of lines and feeds 6350 loads (among them, 4000 are MV/LV substations that feed 550000 LV customers).

Each day, network operators must handle several tens of faults. Each fault generates several tens of remote signals. Around 5% of the faults are permanent and need network reconfiguration.


5.2.2. Supervision problem

The occurrence of a fault results in the functioning of several automatic devices that generate a flow of time stamped remote signals. The operator should then react with minimum delay. He has to interpret the flow of incoming events and alarms in order to make an accurate diagnosis: what has happened, where is the fault located and which customers are de-energized? On the basis of this diagnosis, action is taken: remote switching orders are sent to isolate the fault and restore power for the maximum number of customers. A team is sent into the field for fault repair. During the early steps of this procedure, time is critical. Unfortunately, the large number of events and alarms coming from the network during an outage may make the diagnosis task rather difficult. Moreover, reconstruction of a coherent explanation from remote events may require a fairly good knowledge of automata to link it to the actual state of network topology. AUSTRAL attempts to assist the operator throughout this procedure.

The alarm processing function of AUSTRAL (ESF) is the first function triggered after a fault occurs. The first objective of the ESF is therefore to reduce the total amount of data presented to the operator. To achieve this, sets of coherent remote signals are combined to form a single synthetic data entity. The second objective of ESF is to provide a fuller analysis of incoming events, in terms of outage diagnosis. Around 20 types of diagnosis have been identified. They correspond to a synthesis of correct behavior or misbehavior of automata and protection devices.

According to the diagnosis produced by ESF, other AUSTRAL functions may be launched. FLF (Fault Location Function) will be triggered when a fault on an outgoing feeder occurs; its objective is to locate the fault on the network by consulting fault detectors.

NRF (Network Restoration Function) will be activated for every permanent fault; this function proposes to the operator a reconfiguration plan (a list of open/close circuit-breaker commands) in order to restore power for the maximum number of customers.

AUSTRAL is connected in real time to the telecontrol system. A full description of the AUSTRAL package can be found in [31].

5.2.3. Inputs

When a fault occurs on the network, sensors trigger some automatic devices and inform the operator (via remote signals). The principle of the sensors is to measure currents or instantaneous powers and to compare these measurements with each other or with thresholds.

The detection of a fault results in the functioning of automatic devices (shunt or recloser cycles, opening or closing of circuit-breakers) that generate remote signals according to their behavior.

The supervision system inputs are the set of remote signals sent by the different items of network equipment. For a network like that of Lyon, around 8000 different remote signals are possible; these signals are gathered into around 20 types.

The information coming from protections and automatic devices in the substations is considered to be reliable and is used by ESF to perform fault diagnosis.

The information coming from fault detectors on the lines is uncertain (unreliability of equipment, possible communication problems). It is used by the FLF to locate faults.

5.2.4. Models

Two kinds of models are used:

– a set of generic chronicles describes the classes of different distribution system behaviors after a fault has occurred. Chronicles are used by the ESF for fault diagnosis. Each chronicle is associated with a synthetic message that will be displayed to the operator when the chronicle is recognized (see section 4.4). The complete knowledge base contains around 100 chronicles. Each chronicle corresponds on average to some tens of events.

– a dynamic database (RTDS for Real Time Data Structure) reflects the state of the whole network at every moment. The RTDS is updated by the telecontrol system and is used by all AUSTRAL functions.

The three main functions of AUSTRAL use these models in different ways.

– The diagnosis function (ESF) performs chronicle recognition and generates a synthetic message when a chronicle is recognized.

– When a permanent fault occurs, the fault location function (FLF) consults fault detectors on the faulty feeder and locates the fault downstream of the last detector that has seen the fault and upstream of the detectors that have not seen it. AUSTRAL makes hypotheses in order to minimize the number of detector outages when the information coming from fault detectors is incoherent.

– The network reconfiguration function (NRF) uses the dynamic database to generate a set of reconfiguration plans. AUSTRAL weights the generated plans thanks to an aggregated quality criterion that takes into account the customers de-energized, the ease of executing the plan and its robustness.
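The fault location principle can be sketched on a single feeder without branches. The function below is a hypothetical illustration (names, data layout and tie-breaking toward the substation are assumptions), not the FLF itself: it chooses the section for which the fewest detectors must be assumed to be out of order.

```python
def locate_fault(detectors, saw_fault):
    """Locate a fault on an unbranched feeder.

    detectors: detector names ordered from the substation to the feeder end
    saw_fault: dict detector -> True if the detector reported the fault
    Returns (last detector upstream of the fault, first detector downstream,
    detectors assumed faulty). Ties are broken toward the substation.
    """
    best = None
    for k in range(len(detectors) + 1):
        # Hypothesis: the fault lies just after the k-th detector, so the
        # first k detectors should have seen it and the others should not.
        suspects = [d for i, d in enumerate(detectors)
                    if saw_fault[d] != (i < k)]
        if best is None or len(suspects) < len(best[1]):
            best = (k, suspects)
    k, suspects = best
    upstream = detectors[k - 1] if k > 0 else None
    downstream = detectors[k] if k < len(detectors) else None
    return upstream, downstream, suspects
```

When the detector information is coherent, the list of suspected outages is empty and the fault is bracketed by the last detector that saw it and the first that did not.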

The network dynamic database used by AUSTRAL is derived from that of the telecontrol system. The connection of AUSTRAL to the telecontrol system via an API (Application Programming Interface) was one of the main difficulties of the programming part of the project.

The chronicle knowledge base for the ESF was collected and tuned by a network expert over about one year, in collaboration with the final users.

Because of the number and complexity of chronicles, the problem of validation and evolution of the knowledge base is a bottleneck. We are developing a tool, called GEMO, to automatically generate and validate the chronicle knowledge base from a deep model of the network equipment using communicating automata [32, 33].

5.2.5. Outputs

There are two kinds of outputs for AUSTRAL.

– Firstly, a display window (Synthetic Summary) informs the operator about what is currently happening on the network. The messages displayed in this window are the ones generated during chronicle recognition by the ESF.

– Secondly, when a permanent fault is diagnosed, a list of reconfiguration plans, sorted according to the multi-criterion quality measure, is proposed to the operator.

5.2.6. Implementation

The different AUSTRAL functions are implemented as independent processes and are executed in parallel, as it would not be realistic to stop the event analysis and diagnosis function because a reconfiguration plan must be computed by the NRF.

AUSTRAL functions can query the RTDS (for example about the network topology at a given moment), send queries to the telecontrol system (for example to interrogate the fault detectors) or receive events from the telecontrol system (remote signals).

AUSTRAL works on Sun and IBM architectures. AUSTRAL functions are written in ILOG Talk (Object-oriented Lisp compilable in C). The RTDS as well as the interface between AUSTRAL and the telecontrol system have been implemented in C++.

5.2.7. Validation, maintenance, evolution

At present, the tool is being tested in the centre of Lyon. Three French centres will be equipped with AUSTRAL in 1998. The potential customers are all the distribution centres in France, as well as some centres abroad.

5.3. Electricite de France (Monitoring, Diagnosis and Maintenance Branch): DIAPO project

DIAPO [34] stands for Diagnosis Support of Reactor Coolant Pump Sets of Pressurized Water Nuclear Power Plants.

5.3.1. Monitored System

The Reactor Coolant Pump Set (RCPS) is a major component of a pressurised water reactor: it ensures circulation of the coolant fluid between the reactor core and the steam generator (figure 3). Its proper functioning is decisive for plant availability (stopping the RCPS results in plant shutdown) and safety (primary fluid also serves to cool the reactor core).

An RCPS is about 10 meters high and weighs 80 tons, with a flow rate of 7 m3/s. Rotation speed is 1500 rpm. Three operating phases are considered: start-up, nominal operating conditions and rundown.

Experts generally distinguish the following main functional assemblies (figure 3): the pump subsection, where an impeller drives primary circuit water; the thermal barrier and the seal system, preventing primary coolant water from rising; the shaft line, consisting of three coupled shafts whose stability is ensured by bearings; the motor, which ensures the shaft line rotation; and a flywheel, which moderates slow-down time in case of electrical power loss.

About 150 RCPS operate in the French nuclear power plants. Their expected lifetime is over 30 years. Three different RCPS technologies are distinguished, corresponding to the three design stages of the nuclear power plants. Some differences can also exist between RCPS of the same type, due to local technological modifications. However, these differences are negligible with respect to the diagnostic problem.

The RCPS is a dynamic system; still, its characteristic variables (flows, vibrations, temperature...) are almost stationary during nominal operating conditions.

RCPS failures have various dynamics: from brief phenomena (e.g. a part loss) to very slow ones (e.g. progressive loosening of a bearing fixation). RCPS behaviour is altered during its lifetime, because of the wear and aging of its components.


Fig. 3. Pressurised Water Reactor Operating Principle and RCPS Exploded View.

5.3.2. Surveillance and diagnostic aims

Final users of DIAPO are engineers of the RCPS maintenance staff, who have to decide whether or not the RCPS must be stopped when an abnormal behaviour is detected. This responsibility is quite important, as the RCPS rundown entails plant unavailability. In addition, diagnosis and surveillance of the RCPS are expected to support optimization of its maintenance.

The expected frequency of use of DIAPO is low (between 1 and 10 times a year, depending on the plant), RCPS being reliable machines. It is worth noting that the RCPS is not monitored in order to optimize its functioning.

5.3.3. Diagnostic Inputs

The RCPS is continuously monitored by an automated system using some 30 sensors to characterise the condition of the machine and generate alarms.

Monitoring data are stored in a database and can be retrieved for real-time or delayed consultation. High-level parameters are computed from “basic” sensor data (e.g. for measured vibrations, amplitude and phase of each harmonic).

Altogether, about 200 observable parameters may be used to describe RCPS states, including: monitoring system data, such as thermo-hydraulic (bearing temperatures, seal flow...) and vibratory measurements (shaft and housing vibrations...); plant state parameters (primary circuit temperature...); locations where significant observations may be performed (under the engine, in bearing lubricant...); and maintenance and control operation logs.

It is worth noting that the diagnostic system is not automatically triggered by the monitoring system.

5.3.4. Models

5.3.4.1. Types. The DIAPO diagnostic method relies strongly on advanced research work on abductive and temporal reasoning, especially [3, 4, 35]. DIAPO uses four fault models, described in figure 4.

– A Prototypical Fault Model which produces hypotheses on the abnormal situation encountered.

– A Localization Fault Model which produces hypotheses on the location of faulty components.

– A Causal Fault Model which produces hypotheses on the primary causes of the RCPS misbehavior.

– An Associative Causal Model which is a simplification of the deep causal model, consisting of direct cause-to-effect relations between a hundred RCPS faults and their manifestations.

Concepts: internal states of the RCPS, events, symptoms, maintenance and control operations, logical and temporal operators.

Relations and their interpretation:

– C[∆] → E: C causes E after ∆

– C1 ∧ . . . ∧ Cn[∆1 . . . ∆n; ∆] → E: C1 . . . and Cn cause E with delays ∆1 . . . ∆n; ∆

– C ∧ A[∆] → E: C causes E after ∆, under contextual condition A

– C ∧ α[∆] → E: C causes E after ∆, under non-specified condition represented by α

Fig. 4. Relations for Diapo Fault Models

In these models, symptoms play a particular part, as their truth value can be established through RCPS observation. They consist of predicate calculus formulae with existentially quantified variables. For instance, the symptom “There exists a variation of type step with amplitude greater than 25 microns of the monitoring parameter DA” is represented by the formula: “∃(x) Type(x, Step) ∧ Parameter(x, DA) ∧ amplitude(x) > 25µ”. In each diagnostic problem instance, variables will be assigned events describing the RCPS behaviour, in order to compute the symptoms satisfied by the actual RCPS observation.
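Evaluating such an existential formula amounts to searching the observation for satisfying events. The sketch below assumes, for illustration only, that events are dictionaries with `type`, `parameter` and `amplitude` keys (amplitude in microns); this layout is not the DIAPO representation.

```python
def satisfying_events(events, parameter="DA", kind="step", min_amplitude=25.0):
    """Return the events x such that Type(x, kind), Parameter(x, parameter)
    and amplitude(x) > min_amplitude all hold."""
    return [
        e for e in events
        if e["type"] == kind
        and e["parameter"] == parameter
        and e["amplitude"] > min_amplitude
    ]


def symptom_holds(events):
    """The symptom is satisfied if at least one assignment exists."""
    return bool(satisfying_events(events))
```

The list returned by `satisfying_events` corresponds to the possible assignments of the variable x.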

These events are the individuals which constitute the RCPS observation for a diagnostic problem instance. They are defined by:

– a parameter P representing the monitoring measurement involved (e.g. vibration measurement, seal flow...)

– a type T representing the property verified by the parameter (e.g. step, abnormally high...)

– a fuzzy temporal extent, on which the property represented by T holds, defined by a beginning interval [b1, b2] and an end interval [e1, e2], where b1, b2, e1, e2 are time points verifying b1 ≤ b2, e1 ≤ e2, b1 ≤ e1 and b2 ≤ e2.

For instance, the observation “A 2 second long step on the vibration measurement was seen on April 2nd at 12:00:00” will be represented by the event:

– parameter: vibration measurement

– type: step

– beginning interval: [April 2nd 12:00:00, April 2nd 12:00:00]

– end interval: [April 2nd 12:00:02, April 2nd 12:00:02]

This representation makes it possible to deal with observations whose beginning and end are not known with precision. For instance, the observation “The seal flow is abnormally high (noticed on April 1st at 12:00)” can be represented by the event:

– parameter: seal flow

– type: high level

– beginning interval: ]−∞, April 1st 12:00[

– end interval: ]April 1st 12:00, +∞[
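A possible data structure for such events is sketched below. It is a hypothetical illustration, not the DIAPO implementation: dates are plain numbers (seconds), open and closed bounds are not distinguished, and well-formedness is taken to mean b1 ≤ b2, e1 ≤ e2, b1 ≤ e1 and b2 ≤ e2.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    parameter: str  # monitoring measurement involved
    type: str       # property verified by the parameter
    begin: tuple    # (b1, b2): interval containing the beginning
    end: tuple      # (e1, e2): interval containing the end

    def is_well_formed(self):
        """Check the ordering conditions on the four time points."""
        (b1, b2), (e1, e2) = self.begin, self.end
        return b1 <= b2 and e1 <= e2 and b1 <= e1 and b2 <= e2


# A 2 second long step seen at date 0 (precise beginning and end).
step = Event("vibration measurement", "step", (0, 0), (2, 2))

# A high level noticed at date T: it began before T and ends after T.
T = 3600
high = Event("seal flow", "high level", (-math.inf, T), (T, math.inf))
```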

Delays attached to relations in the fault models rep-resent general knowledge on RCPS fault dynamics In-formally, the delay ∆ in the cause-to-effect relationC[∆] → E describes the relations between C and Etemporal extents. When E is a symptom, such a delaymay be specified for each existentially quantified vari-able of E. Practically, it consists in constraints, pos-sibly approximate, between respective beginnings andends of C and E. For example, let C = ([b1b2][e1e2])be the temporal extent of C; let ∆ be “E begins be-tween 1 and 6 months after the beginning of C; E andC ends are simultaneous”. The temporal extent for Eis then: ([b1 + (1month), b2 + (6months)][e1e2]).
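The computation in this example can be written as a small function. This is a sketch under simplifying assumptions: time points are numbers expressed in months, and a delay is reduced to a pair of shifts on the beginning interval and a pair of shifts on the end interval; `apply_delay` is a name introduced here.

```python
def apply_delay(extent, begin_shift=(0, 0), end_shift=(0, 0)):
    """Derive the temporal extent of an effect E from that of its cause C.

    extent: ((b1, b2), (e1, e2)), the temporal extent of C
    begin_shift: (min, max) delay between the beginnings of C and E
    end_shift: (min, max) delay between the ends of C and E
    """
    (b1, b2), (e1, e2) = extent
    return (
        (b1 + begin_shift[0], b2 + begin_shift[1]),
        (e1 + end_shift[0], e2 + end_shift[1]),
    )
```

With C beginning between months 10 and 12 and ending between months 40 and 45, the delay of the example gives `apply_delay(((10, 12), (40, 45)), begin_shift=(1, 6))`, i.e. E begins between months 11 and 18 and ends with C.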

In addition to fault models, DIAPO also uses an Ob-servation Model which consists of relations describ-ing possible dependencies between events observed.These relations are based upon physics, mathematics,and signal processing theory laws, applying to the func-tioning an the surveillance of RCPS. Here is an exam-ple a such a law:Any high level state observed on some continuous mon-itoring parameter is necessarily preceded by an in-crease.

These relations can be used to automatically complete the observation set during problem solving. Some of them can also serve to complete the fault models, in order to avoid explaining correlated events independently.
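Completing the observation set with such a law might look like the sketch below. It is an assumption-laden illustration: observations are plain dictionaries, only the “high level is preceded by an increase” law is encoded, and temporal extents are left out.

```python
def complete_observations(observations):
    """Add the increase implied by each observed high level, if missing."""
    completed = list(observations)
    for obs in observations:
        if obs["type"] != "high level":
            continue
        already_seen = any(
            other["parameter"] == obs["parameter"] and other["type"] == "increase"
            for other in completed
        )
        if not already_seen:
            # The inferred event is flagged so it can be told apart
            # from events that were actually observed.
            completed.append({"parameter": obs["parameter"],
                              "type": "increase",
                              "inferred": True})
    return completed


observed = [{"parameter": "seal flow", "type": "high level"}]
completed = complete_observations(observed)
```

Here `completed` contains the original high-level observation plus one inferred increase on the same parameter.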

5.3.4.2. Size. Fault models currently used by DIAPO contain about 1500 relations.

5.3.4.3. Use. An abduction procedure integrating temporal information propagation is applied to each fault model. It computes formulae that explain the observed events and are consistent with the negated event occurrences, i.e. events known not to have occurred.

Such an explanation takes the following form:

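$$
\begin{aligned}
EXP(S_i, \{e_1 \ldots e_n\}) \equiv{} & \big( (C_{i1}, \mathbf{C}_{i1}) \vee \ldots \\
& [(C_{ik}, \mathbf{C}_{ik}) \wedge (C_{ip}, \mathbf{C}_{ip}) \wedge \ldots \\
& \quad Sync(\Delta_{k,n}(\mathbf{C}_{ik}), \Delta_{p,l}(\mathbf{C}_{ip}) \ldots)] \vee \ldots \\
& [(C_{il}, \mathbf{C}_{il}) \wedge (\alpha_{ij}, \boldsymbol{\alpha}_{ij}) \wedge \ldots] \vee \ldots \big) \\
& \wedge \neg (C'_1, \mathbf{C}'_1) \wedge \ldots \wedge \neg (C'_k, \mathbf{C}'_k)
\end{aligned}
$$

Where:

– $S_i$ are symptoms, $\{e_1 \ldots e_n\}$ their satisfying assignments.

– $C_{ij}$ are initial causes; $\mathbf{C}_{ij}$ are their temporal extents.

– The $Sync$ predicate represents the synchronization condition attached to conjunctive terms. $\Delta_{k,n}(\mathbf{C}_{ik})$ represents the application to the temporal extent of $C_{ik}$ of the series of delays held by the relations between $C_{ik}$ and the conjunction.

– $\alpha_{ij}$ are abstract conditions, $\boldsymbol{\alpha}_{ij}$ their temporal extents.

– $C'_i$ are the refuted terms, and $\mathbf{C}'_i$ the temporal extents on which they have been refuted.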

The diagnosis in each model is obtained by computing the best conjunctions of such formulae according to heuristic criteria (maximum number of events explained, best circumscription of explained events, etc.).

The final diagnosis is the conjunction of the diagnoses obtained in each fault model. It is worth noting that a hypothesis assumed in one fault model can be refuted in another fault model, if the latter is more complete than the former.

5.3.4.4. Acquisition. Knowledge acquisition required five experts and two knowledge engineers for about one year. It was performed without any specific knowledge acquisition tool.

5.3.4.5. Limits. The main limit of this diagnostic method is its incompleteness: DIAPO can produce diagnoses only for cases embodied in its model. This limit is inherent to any fault-model-based diagnostic method.

This limitation can be severe when facing multiple faults. Indeed, to limit the complexity and combinatorial character of the models, failures are considered independent. That is, two independent causes are assumed to produce the conjunction of their respective effects, unless an explicit description of their combination is present in the model. But this hypothesis is not always sound. Two independent failures occurring simultaneously can interact, resulting in some unexpected effect not described in the model. In this case, diagnosis will fail.

5.3.5. Diagnostic Outputs

Outputs of DIAPO consist of:

– the diagnosis obtained;
– the causal paths between initial causes and the explained events;
– the justifications of the refutations.

Complementary information attached to terms specified as faults can be displayed: severity level, maintenance advice. These outputs can be browsed through a graphical interface and saved in a printable format in order to produce diagnostic reports. It is worth noting that the diagnostic system has no feedback on the RCPS process.

5.3.6. Implementation

The diagnostic process is decomposed into several tasks (problem solving in one particular model, complementary observation acquisition, etc.) and implemented in a blackboard architecture.

The whole system is implemented in Lisp, in Ilog’s development environment Smeci™. A full-scale prototype has been developed in co-operation between EDF and the RCPS manufacturer, Jeumont Industrie, for a global cost of 10 million francs, including the acquisition and validation of the knowledge bases and the development and testing of the prototype.

5.3.7. Perspectives

From a technical point of view, one important perspective is to provide DIAPO with a consistency-based diagnostic method, using a model of the normal behaviour of the RCPS. The use of heuristic knowledge to focus abductive reasoning is also being studied, as is the introduction of probabilistic information within the fault models.

The industrialization of a simplified product derived from the prototype is currently under development, in cooperation with Jeumont-Industrie, the RCPS manufacturer. This system will be installed in 20 EDF sites and proposed by Jeumont-Industrie to its foreign clients. The direct expected benefits for EDF are evaluated at 2 million francs per year, through the avoidance of power plant shutdowns enabled by DIAPO diagnoses of RCPS faults at incipient stages.

[Figure 5. Node labels: Supervisor, Technical Center, Switches, Station UE, Station USG, Link, Line, Role UE, Role USG. Relation labels: manage, composed with, connected to, related to.]

Fig. 5. Example of supervision network architecture.

5.3.8. Validation, maintenance, evolution

Validation issues concern the Fault Models and the Observation Model. Since formal methods for validating this knowledge are lacking, a hundred tests have been performed in order to evaluate the prototype. About 10% of these tests were real incident reconstructions, the others consisting of theoretical situations built by the experts and the developers. 80% of the diagnoses were considered satisfactory by the experts. Failures were due to incompleteness of the fault model.

The tests have shown that the system can be evolved and maintained easily, thanks to the unified resolution scheme employed: the structure of the fault models allows them to be modified easily.

5.4. France Telecom : GASPAR project

5.4.1. Telecommunications Networks

The French packet-switching data transmission network includes two entities that are independent with respect to operation: the transport network, which ensures connection setup and data transfer (voice, text, images), and the management network, which transmits orders to network equipment and receives alarms which are then routed to the supervisor.

The architecture described in figure 5 is organized as a tree structure between switches (network nodes), the technical centres and the supervisor. About fifty technical centres, which group over three hundred switches, are controlled by the supervisor.

Each piece of equipment in the network can fall out of order, and each can be put back into nominal functioning conditions through reinitialization or switch-over to standby equipment, which means the network is “protected”. This kind of network operates non-stop, hence supervision must also be ensured round the clock.

As faults and returns to normal operation occur, alarms are sent by the equipment to the supervisor through the network. Masking phenomena may occur with this type of architecture. If a component of the tree structure has failed, all the nodes of its sub-tree will be unreachable and in an unknown state, hence they will not be able to notify the supervisor (even if everything works fine at their level). Such loss of alarms must be taken into account when making a diagnosis.

5.4.2. Objectives of Supervision

Continuous supervision of the network makes it possible to be informed at any time about the state of the network, e.g. which device is backed up or out of order. It also makes it possible to know the origin of the failures without having to check the various equipment or to go over all the alarms. The important thing is to screen the alarms issued by the network (about 150 000 per day) and interpret them. In this way, only events of major importance are notified to the operator.

Supervision is particularly useful for short-term network maintenance. As a matter of fact, operators have to evaluate the extent of the problem and initiate corrective actions on failed components. The equipment includes automatic control facilities which take the functional aspect of the network into account.

5.4.3. System inputs

System inputs mainly consist of alarms issued by the network equipment and routed over the tree-structured supervision network. Data may also include responses to queries about the state of a particular piece of equipment, issued by an operator or by the supervision system.

These alarms are symbolic, since no quantitative information is given regarding the traffic density or switch load. They include information about:

– the component state change (e.g. from its normal state to out of order and vice versa),

– connection breakdown between two pieces of equipment (e.g. between a technical centre and a switch),

– reinitialization of a component (e.g. a switch).

The alarms are reliable, but two phenomena may induce losses of alarms:


– saturation of buffers: alarms are buffered in each switch, but switching takes priority over their processing; when buffers are full, further alarms will be lost.

– masking of alarms: in the event of component failure, the alarms sent by the subtree to this component are lost, hence they will not be received by the supervisor (figure 5).
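The two loss mechanisms can be illustrated on a toy tree. The sketch below is hypothetical throughout: the node names, the `PARENT` table and the single shared buffer (the real buffers sit in each switch) are assumptions made to keep the example short.

```python
# Hypothetical management tree: each node points to its parent.
PARENT = {"TC1": "supervisor", "TC2": "supervisor",
          "SW1": "TC1", "SW2": "TC1", "SW3": "TC2"}


def is_masked(node, failed):
    """True if a failed component lies on the path from node to the supervisor."""
    current = PARENT[node]
    while current != "supervisor":
        if current in failed:
            return True
        current = PARENT[current]
    return False


def buffer_alarms(alarms, capacity):
    """Split alarms into those kept and those lost once the buffer is full."""
    return alarms[:capacity], alarms[capacity:]


def received_by_supervisor(emitted, failed, capacity):
    """Apply masking first, then buffer saturation, to the emitted alarms."""
    unmasked = [(node, alarm) for node, alarm in emitted
                if not is_masked(node, failed)]
    kept, _lost = buffer_alarms(unmasked, capacity)
    return kept


emitted = [("SW1", "reinit"), ("SW3", "out of order"),
           ("SW3", "link down"), ("SW2", "reinit")]
received = received_by_supervisor(emitted, failed={"TC1"}, capacity=1)
```

With technical centre TC1 failed, the alarms of SW1 and SW2 are masked, and the one-slot buffer then drops the second alarm of SW3.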

At least one hundred different alarms can be observed in the various parts of the network. The number of alarms has significantly increased in the last few years: switches of the last generation can produce as many as 150 000 alarms per day. Another difficulty is the propagation of alarms. When a technical centre breaks down, it causes the rebooting of all its switches (hierarchical dependency). Thus, a technical centre breakdown will be responsible for the emission of many alarms (as many as there are switches).

It should also be noted that buffers placed along the alarm feedback paths induce variable delays between transmission and reception of messages. Hence the correlation between alarms from different paths may be more difficult.

5.4.4. Implementation

The present knowledge was devised by an expert; it is organized in about two hundred deduction rules. It was so costly that it was decided that network evolutions would not be taken into account.

System outputs are alarms, failure contexts (reinitialization, connection breakdown, etc.) and information on the state of the network (the diagram is updated using change-of-state alarms).

This system has been in operation for several years. However, expert knowledge procedures have been stopped, and the operators did not propose or wait for further evolutions, because of insufficient real-time performance and also because, nowadays, network expansion is the highest-priority task for the operators; hence knowledge acquisition is delayed. Attention is still focused on network supervision, but aids will have to be developed for expert knowledge acquisition and evolution.

Considering these problems, it appeared necessary to go deeper into model-based reasoning techniques and also learning and data mining techniques. The latter are still at a prospective stage, however.

[Figure 6. Phase labels: MODELLING, SIMULATION, DISCRIMINATION, RECOGNITION. Mode labels: ON LINE, OFF LINE. Data labels: Physical description of the network, Library of elementary component models, Faults, Model of the network, Temporal Data, Temporal alarm sequences, Scenarios, Stream of temporal events, Diagnosis.]

Fig. 6. GASPAR architecture.

5.4.5. The Gaspar project : knowledge acquisition by simulating models

The GASPAR¹ project² aims to study the application of artificial intelligence to telecommunications network supervision. Particular emphasis is put on model-based diagnosis and, to a greater extent, on qualitative reasoning.

One of the advantages of model-based diagnosis is genericity, but this method may be costly (in execution time). For this reason, the architecture of the GASPAR system is organized in two parts (see fig. 6): network modeling and failure simulation are performed off-line [36, 23], which enables characteristic sequences (called scenarios or chronicles) to be obtained. Detection of the sequences is then performed on-line, using tools such as IxTeT [19, 37].
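To give an idea of what on-line detection of a chronicle involves, here is a deliberately naive sketch. It is not the IxTeT algorithm: a chronicle is reduced to an ordered list of alarm types with a maximum gap between consecutive alarms, and the alarm names and the `recognize` function are hypothetical.

```python
def recognize(chronicle, stream, start=0, last_time=None):
    """Return the times of the first match of chronicle in stream, or None.

    chronicle: list of (alarm_type, max_gap); max_gap bounds the delay since
               the previously matched alarm (ignored for the first element).
    stream:    list of (time, alarm_type), sorted by time; unrelated alarms
               may be interleaved with those of the chronicle.
    """
    if not chronicle:
        return []
    (alarm_type, max_gap), rest = chronicle[0], chronicle[1:]
    for index in range(start, len(stream)):
        time, kind = stream[index]
        if last_time is not None and time - last_time > max_gap:
            break  # the stream is sorted: later alarms are too late as well
        if kind == alarm_type:
            tail = recognize(rest, stream, index + 1, time)
            if tail is not None:
                return [time] + tail
    return None


# Hypothetical chronicle: a technical centre breakdown followed by two
# switch reboots, each at most 10 time units after the previous alarm.
chronicle = [("TC_down", None), ("SW_reboot", 10.0), ("SW_reboot", 10.0)]
stream = [(0.0, "TC_down"), (3.0, "link_lost"),
          (4.0, "SW_reboot"), (9.0, "SW_reboot")]
match = recognize(chronicle, stream)
```

The backtracking search keeps the sketch short; a tool meant for high alarm rates would maintain partial instances incrementally instead of rescanning the stream.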

5.4.5.1. Modeling. Modeling takes advantage of the formalism of communicating temporal automata adapted to discrete event systems; this method is suited to the construction of a model combining simple elements [38]. Two levels of abstraction can be distinguished in this construction:

– The connectivity network level, which describes the transmission of messages, giving greater importance to the transmitting or non-transmitting state in each node. The model is not much different from that presented in figure 5.

– The description of each component behaviour in automaton form. The transitions between different states correspond to reception or transmission of messages. Some of these messages correspond to alarms sent to the supervisor.

¹ The French acronym GASPAR stands for “Gestion d’Alarmes par Simulation de PAnnes sur Reseau de telecommunication” (alarm management by simulation of failures on a telecommunication network).

² Collaboration between France Telecom/CNET, LIPN/CNRS and IRISA/CNRS.


5.4.5.2. Simulation. Simulation starts from failures, using the behavioral model of a network with “failure events”. The temporal aspect of component functioning is also taken into account, as well as the uncertainty regarding the time of occurrence of some events. Simulation thus reproduces the propagation of alarms to the supervisor. Masking phenomena are also simulated (for instance when a technical centre is in a non-transmitting state during the routing of a message issued by a switch). The sequences of alarms received by the supervisor for a given series of failures can then be worked out.
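A toy version of such a simulation is sketched below, assuming one switch behind one technical centre. The state names, event names and classes are hypothetical, transmission delays and timing uncertainty are ignored, and the alarms of the technical centre itself are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    name: str
    state: str = "nominal"

    def on(self, event):
        """Fire a transition; return the alarm it emits, or None."""
        if self.state == "nominal" and event == "failure":
            self.state = "out_of_order"
            return f"{self.name}:out_of_order"
        if self.state == "out_of_order" and event == "reinit":
            self.state = "nominal"
            return f"{self.name}:back_to_nominal"
        return None


@dataclass
class TechnicalCentre:
    name: str
    state: str = "transmitting"

    def on(self, event):
        if event == "failure":
            self.state = "non_transmitting"
        elif event == "reinit":
            self.state = "transmitting"

    def forward(self, alarm):
        """Relay an alarm upwards only while in the transmitting state."""
        return alarm if self.state == "transmitting" else None


def simulate(failure_events):
    """Alarm sequence seen by the supervisor for time-ordered failure events."""
    centre, switch = TechnicalCentre("TC1"), Switch("SW1")
    received = []
    for time, component, event in failure_events:
        if component == centre.name:
            centre.on(event)
        elif component == switch.name:
            alarm = switch.on(event)
            if alarm is not None and centre.forward(alarm) is not None:
                received.append((time, alarm))
    return received


scenario = simulate([(0, "SW1", "failure"), (5, "TC1", "failure"),
                     (6, "SW1", "reinit"), (9, "TC1", "reinit")])
```

In this run the switch's return to nominal operation at time 6 is masked, because the technical centre is non-transmitting at that moment; only the initial out-of-order alarm reaches the supervisor.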

5.4.5.3. Genericity. The method applied is generic for many network configurations, since the global model is obtained from a number of elementary models. If a component must be modified or added, the component library is updated and the global model is easily reconstructed. In this way, maintenance procedures are facilitated.

In addition, the time aspect is taken into account for component functioning, which enables transmission durations to be represented.

5.5. CEA Marcoule : DIAPASON project

DIAPASON is a system developed by the “Commissariat a l’Energie Atomique”³.

5.5.1. Supervised System

Spent fuel is reprocessed to recover the uranium and plutonium still present in the fuel rods irradiated in nuclear reactors, and to isolate the remaining fission products. The pulsed column facility considered here includes extraction columns and fission product scrubbing columns, and is designed to separate the uranium and plutonium from the fission products (FP) by selective extraction. A pulsed column is a liquid-liquid extraction device. The spent fuel (comprising uranium, plutonium and FP) is dissolved in nitric acid, and the extraction column selectively transfers the uranium and plutonium to an organic phase consisting essentially of tributylphosphate (TBP); most of the fission products thus remain in the aqueous phase. The extraction step requires that the aqueous and organic phases be thoroughly mixed to maximize the contact surface area between the two solvents and thereby optimize the chemical exchange phenomena. The two phases tend to separate by gravity, as nitric acid has a higher density than TBP.

³ CEA Marcoule – DRRV/SSP/LIA – BP 171, 30207 Bagnols-sur-Ceze cedex.

In order to ensure countercurrent flow, the light (TBP) phase is injected at the base of the column and the heavy (acid) phase at the top. The organic phase is the continuous phase; it initially fills the entire volume of the extraction column, and remains the predominant phase after injection of the aqueous phase: acid droplets are dispersed in the TBP. The resulting mixture is subjected to periodic pressure pulses to form an emulsion in order to retard the descent of the heavy phase and to mix it with the light phase. The interphase (the surface physically separating the two phases) is located in the settler at the bottom of the column, and is regulated by drawing off the aqueous phase at a suitable rate. In the application considered here, all liquid transfers are ensured by airlifts.

DIAPASON was developed and tested by means of a process simulator based on numeric codes used by the DCC/DRDD/SEMP: numeric model simulations constitute the “actual” or “measured” process behavior. A total of about seventy failures were modeled. A control interface is provided for process control and for implementing failure modes.

5.5.2. Surveillance Objectives

Nuclear processes are prime candidates for supervisory aids:

– The process media are highly radioactive and it is thus difficult to install sensors that are both sophisticated and durable. The measurement equipment is thus implemented using rudimentary techniques with minimum maintenance requirements. Chemical measurement sensors are particularly difficult to develop; as a result, much of the data concerning chemical phenomena are obtained off-line by sampling.

– The physical and chemical phenomena involved are relatively complex, and process control requires highly experienced operators; moreover, many tests are performed on prototypes to assess the effects of various parameters on process behavior.

– Process control is also highly sensitive since the process units are interconnected to allow continuous product recycling, and the equipment items in each unit are closely coupled.

DIAPASON provides the operator with a synthetic representation, with modeling based on qualitative physics and automatic control [39]. It is designed to predict process evolution to allow anticipative control. In incident situations, it performs detection, localization and diagnosis of malfunctions for preventive maintenance or for process control in degraded mode.

5.5.3. Inputs

All inputs are digital values, with no descriptors of any other type. The digital simulator is seen as a black box whose outputs constitute the sensors used by DIAPASON for its surveillance functions. Noise may be added to each digital output line from the simulator to simulate measurement noise.

The objective is to provide the operator with information strictly related to diagnostics; only pertinent changes in variables (i.e. changes liable to affect overall process operation) are therefore taken into account. Segmentation is used to convert the sampled process input signal (a computerized representation) into a series of significant variations (a more explicit representation for the observer). Changes in measured process variables are described as a series of significant events represented by a segmented affine time function. A change in slope corresponds to a significant change in the behavior of a variable. In order to allow for instantaneous phenomena (at the supervision time scale), these functions may be discontinuous. An event is thus characterized by a triplet (occurrence date t0, slope variation at t0, amplitude variation at t0). The evolution is thus perceived as a succession of chronologically ordered events.
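The event triplets are enough to rebuild the segmented affine function they describe. The sketch below shows one way to do so; the `Event` type, the `evaluate` function and the assumption that the variable is flat at `y0` before the first event are illustrative choices.

```python
from typing import NamedTuple


class Event(NamedTuple):
    t0: float           # occurrence date
    d_slope: float      # slope variation at t0
    d_amplitude: float  # amplitude variation (jump) at t0


def evaluate(events, t, y0=0.0):
    """Value at time t of the segmented affine function defined by events.

    The function is assumed flat at y0 before the first event.
    """
    y = y0
    for event in events:
        if event.t0 <= t:
            # Each past event contributes its jump plus its slope change
            # integrated since its occurrence date.
            y += event.d_amplitude + event.d_slope * (t - event.t0)
    return y


# A ramp of slope 2 between t=10 and t=15, then a drop of 3 at t=20.
events = [Event(10.0, 2.0, 0.0), Event(15.0, -2.0, 0.0), Event(20.0, 0.0, -3.0)]
values = [evaluate(events, t) for t in (5.0, 12.0, 18.0, 20.0)]
```

For these events, `values` is `[0.0, 4.0, 10.0, 7.0]`: flat before the first event, rising during the ramp, constant afterwards, then shifted by the discontinuity.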

5.5.4. Models

The model is an influence graph in which the vertices are the process variables and the arcs are the causal relations among the variables (PROTEE module) [40]. The evolution of a variable is described numerically, and an arc (qualitative transfer function) characterizes the influence of one variable on another using basic automatic control concepts (dynamic allowance for influences). The overall process evolution is characterized by the state of the pertinent variables: a significant variation of an input variable corresponds to the detection of a graph source event; the propagation of the event in the graph indicates the process response to the input.

Each arc on the graph defines the temporal causality relation between two variables using conventional notions to provide a dynamic description of the influences (gain, pure delay, response time, etc.). By analogy with classic automatic control systems, the function supported by the arc was designated the Qualitative Transfer Function (QTF).

The QTF response to an input signal is approximated by a segmented affine function from the response of a conventional transfer function to the same input, i.e. it is an “evolution”. The simulation consists in propagating significant changes affecting the graph sources (i.e. the process inputs) from one variable to another using QTFs. The event-driven nature of the simulator is due to the fact that these changes do not occur at regular intervals.
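One possible reading of this approximation, for a step input, is sketched below. The shape chosen here (nothing during the pure delay, then a linear ramp over the response time up to gain times amplitude) is an assumption made for illustration and is not claimed to be the QTF actually used by PROTEE; the function names are hypothetical.

```python
def qtf_step_response(t0, amplitude, gain, delay, response_time):
    """Events (date, slope variation, amplitude variation) approximating the
    response of an arc to a step of the given amplitude occurring at t0."""
    slope = gain * amplitude / response_time
    return [(t0 + delay, slope, 0.0),                   # the ramp starts
            (t0 + delay + response_time, -slope, 0.0)]  # the ramp ends


def evaluate(events, t, y0=0.0):
    """Value at time t of the segmented affine function defined by events."""
    y = y0
    for date, d_slope, d_amplitude in events:
        if date <= t:
            y += d_amplitude + d_slope * (t - date)
    return y


# Step of amplitude 2 at t=0 through an arc with gain 1.5,
# pure delay 5 and response time 10.
response = qtf_step_response(0.0, 2.0, gain=1.5, delay=5.0, response_time=10.0)
final_value = evaluate(response, 20.0)
```

The downstream variable does not move before t=5, and `final_value` settles at gain times amplitude, i.e. about 3, once the response time has elapsed. The two output events can in turn serve as inputs to the arcs leaving the downstream variable, which is what makes the simulation event-driven.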

The graph for the pulsed column facility comprises about 55 variables and 70 arcs. The event simulator providing the process “reference” behavior is obviously less precise than the digital simulator used to obtain the “measured” behavior, but the development cost is of a different order of magnitude, even if the construction of the causal graph is not a simple task and requires the cooperation of an expert. The model’s declarative nature simplifies revisions.

Detection (MINOS module [41]) involves local processing in which each deviation between process measurements and the data simulated by PROTEE is analyzed over a specified time interval. The simulations better represent process behavior than exact numeric variable values; a simple comparison with a single-threshold error is thus inadequate (inappropriate thresholds, noise sensitivity, etc.). We therefore introduced the notion of qualitative equivalence between two evolutions to provide the module with greater insensitivity to noise (modeling approximations, measurement noise, etc.). Various comparison criteria are considered: curve shapes, deviation, distance between curves.

In view of the multiple sources of inaccuracy, we have attempted to model vague and imprecise phenomena by implementing the theory of fuzzy sets [42]. Each criterion requires a symbolic interpretation leading to a decision.
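As a rough illustration of how fuzzy criteria can be combined into a decision, consider the sketch below. Everything in it is an assumption: only two of the criteria (mean distance and maximum deviation) are used, curve shape is left out, the membership breakpoints are arbitrary, and the criteria are aggregated with a minimum, which is a common fuzzy conjunction but not necessarily the one used in MINOS.

```python
def small(value, full, zero):
    """Degree to which value is 'small': 1 below `full`, 0 above `zero`,
    linear in between."""
    if value <= full:
        return 1.0
    if value >= zero:
        return 0.0
    return (zero - value) / (zero - full)


def qualitatively_equivalent(measured, simulated, threshold=0.5):
    """Return (decision, degree) for two evolutions sampled at the same dates."""
    deviations = [abs(m - s) for m, s in zip(measured, simulated)]
    mean_distance = sum(deviations) / len(deviations)
    max_deviation = max(deviations)
    # Fuzzy conjunction of the two criteria (breakpoints are hypothetical).
    degree = min(small(mean_distance, 0.5, 2.0),
                 small(max_deviation, 1.0, 4.0))
    return degree >= threshold, degree


close = qualitatively_equivalent([0.0, 1.0, 2.0, 3.0], [0.2, 1.1, 1.9, 3.2])
far = qualitatively_equivalent([0.0, 1.0, 2.0, 3.0], [3.0, 4.0, 5.0, 6.0])
```

The first pair differs only by noise-sized deviations and is judged equivalent with degree 1; the second is offset by 3 everywhere and is rejected with degree 0. Intermediate cases receive a graded degree instead of flipping abruptly at a single threshold.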

The detection phase is followed by a localization phase (MINOS module), an overall analysis of all the simulation errors in the graph. It constitutes a solution to the alarm cascades by proposing causal chains linking the deviations affecting all the variables in the graph. Dynamic management of the fault coherence yields the propagation subgraph of a trend and identifies the possible source variable(s).

A source variable is defined as the first variable (in time) for which the influence of a perturbation is observable. A detection variable is a variable for which the deviation between the actual evolution and the expected evolution exceeds the permissible threshold. The possibility of tracing back from a detection variable to a source variable compensates for the often arbitrary nature of the threshold technique.


When faults are detected, the graph is used for dynamic investigation of the errors in analysis time windows. Tracing the qualitative graph from effect to cause produces a propagation subgraph for the error and its possible sources.
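The effect-to-cause tracing can be pictured with a small graph search. The sketch below is a simplification: the variable names and arcs are invented, timing and analysis windows are ignored, and a variable is simply treated as a possible source when none of its upstream neighbours deviates.

```python
# Hypothetical influence graph: arcs go from cause to effect.
ARCS = [("feed_flow", "level"), ("level", "interface"),
        ("pulse_pressure", "interface"), ("interface", "outlet_conc")]


def trace_back(detection_var, deviating, arcs):
    """Return (propagation subgraph, possible sources) for a detection variable.

    deviating: set of variables whose measured evolution departs from the
               simulated one.
    """
    predecessors = {}
    for cause, effect in arcs:
        predecessors.setdefault(effect, []).append(cause)

    subgraph, sources = [], set()
    stack, seen = [detection_var], {detection_var}
    while stack:
        variable = stack.pop()
        upstream = [p for p in predecessors.get(variable, []) if p in deviating]
        if not upstream:
            sources.add(variable)  # no deviating cause: candidate source
        for cause in upstream:
            subgraph.append((cause, variable))
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return subgraph, sources


deviating = {"feed_flow", "level", "interface", "outlet_conc"}
subgraph, sources = trace_back("outlet_conc", deviating, ARCS)
```

Here the deviation detected on `outlet_conc` is traced back through `interface` and `level` to the single source `feed_flow`; `pulse_pressure` is excluded because it does not deviate.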

The coherence test consists in substituting real events for predicted events in the analysis time window for each variable preceding the detection variable. Propagating these new events yields a new simulated evolution for the detection variable, and subsequently a new model/process error on this variable, which is compared with the initial deviation: a qualitative interpretation (i.e. deviation eliminated, unchanged or reversed) of the error variation between the two simulations is based on orders of magnitude, and provides information on the causal links between the upstream deviations and the deviation of the detection variable.

This coherence test can be interpreted as an alarm filter: before reporting the occurrence of a new defect, the causality analysis attempts to relate the new alarm to the previous suspect subgraph. We have also shown that, as in the case of detection, this defect filtering capability could be modeled as a decision-making process applied to the aggregate of criteria described by fuzzy sets.

The failure diagnostics (SPHYNX module) implement structural knowledge of the process to identify the defective component. Fault diagnostics performed by the behavioral model are not generally sufficient to identify the physical component failures associated with these faults, as the processing would require structural and functional knowledge not included in the model. However, a knowledge-based system is poorly suited to time management; adding temporal logic would complicate the rules and raise problems in validating the knowledge base. We therefore decided to have the failure diagnostic expert system manipulate the expertise, and assign time management considerations to the MINOS module, which activates the expert system only after a defect is observed, by supplying an image of the process formed by actual and predicted source variable values at the instant t when the defect is detected, as well as the values of the other variables at coherent prior times (allowing for the delays inherent in the causality relations). This approach simplified the diagnostic system task, thereby enhancing performance.

5.5.5. Outputs

The event-driven simulator yields the behavior of all the causal graph variables in the form of segmented evolutions. A predictive horizon may be specified by the user. The actual and predicted evolutions of variables that can be selected on the graph can be monitored on child windows. Three types of alarms are indicated for each graph variable:

– a predicted alarm: the variable (at the operator-selected prediction time horizon) exceeds an alarm threshold;

– a pre-alarm condition: the variable is currently within the permissible range, but its current value does not correspond to the simulated value (i.e. the detection module reports an alarm condition);

– an alarm: the variable already exceeds an alarm threshold in the conventional sense of the term.
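The three alarm types can be told apart with a few comparisons, as in the sketch below. It is a simplification made for illustration: the function and parameter names are hypothetical, and the pre-alarm test uses a plain tolerance on the deviation where the detection module relies on richer criteria.

```python
def classify(measured, simulated, predicted, low, high, tolerance):
    """Return the alarm types raised for one graph variable.

    measured:  current measured value
    simulated: current value given by the reference simulation
    predicted: simulated value at the operator-selected horizon
    low, high: permissible range (alarm thresholds)
    tolerance: largest accepted gap between measured and simulated values
    """
    def out_of_range(value):
        return value < low or value > high

    alarms = []
    if out_of_range(measured):
        alarms.append("alarm")
    elif abs(measured - simulated) > tolerance:
        # Still within range, but no longer following the reference behavior.
        alarms.append("pre-alarm")
    if out_of_range(predicted):
        alarms.append("predicted alarm")
    return alarms


status = classify(measured=48.0, simulated=42.0, predicted=61.0,
                  low=20.0, high=60.0, tolerance=3.0)
```

In this example the variable is still inside its permissible range, so no conventional alarm is raised, but it has drifted from the simulated value and is predicted to cross the upper threshold: `status` is `["pre-alarm", "predicted alarm"]`.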

The MINOS module supplies the following outputs:

– the detection system provides a set of defective variables;

– the localization system organizes this set into one or more defect subgraphs whose evolution can be monitored graphically as the propagation of disturbances in the facility: the propagation can be followed by the color of the affected arcs and variables over time: green (normal) or red (defect).

SPHYNX assigns failures to the equipment items in the monitored facility. The output is thus a set of failure hypotheses whose degree of plausibility can be followed.

Finally, an interface child window is reserved for the explanation. To explain the behavior of a variable, the contributions of the upstream variables are presented (using the superposition theorem) opposite the history of the specified variable. The following information is thus available for any operator-selected variable: the designated upstream variables acting on the result, the direction of the action of each upstream variable and the comparative amplitude of each contribution. This explanation also provides a solid basis for developing new plans of action to modify the state of the variable.

5.5.6. Implementation

The three system modules were developed in Ada on a Sun workstation, with the exception of the rule compiler written in STARLET, a predicative language implementing affix grammars. The three modules have been integrated into a demonstrator running online with a conventional numeric simulator of a nuclear fuel solvent extraction and scrubbing facility.

The prototype version is the outgrowth of three Ph.D. theses at the LIA, with the collaboration of the INSA at Lyon (Professor L. FRECON) and the Laboratoire d’Automatique de Grenoble (Professor S. GENTIL). The interface (MIMIX) and interprocess communications required 1.5 man-years of work. The following knowledge is necessary to develop a new application with the PROTEE, MINOS and SPHYNX modules:

– constitution of the causal graph of the facility for the PROTEE and MINOS modules;

– interpretation of the AFME for SPHYNX.

Nevertheless, the causal graph representation allows the use of a single, simple model for simulation, explanation and localization purposes; the declarative nature of the model clearly discriminates between knowledge and reasoning.

The separation between defect diagnostics (localization) and failure diagnostics makes it possible to initiate only local studies with the expert system, while reserving time management for the more appropriate tools of the MINOS module.

5.6. Sollac : SACHEM project

5.6.1. The supervision system

The Sachem project (computer-aided blast furnace operation) is being set up at Sollac to help operators drive the blast furnaces (BF) at Fos-sur-Mer and Dunkirk (France). Sollac is a subsidiary of USINOR, the leading steelmaker in Europe and the third in the world. First of all, coal and coke are introduced into the BF in order to make cast iron, which is subsequently turned into steel. The materials are introduced at the top and the middle of the furnace, and the cast iron is collected at its bottom at each casting. The solid, gaseous and liquid phases are all present together in the furnace. The phenomena are therefore complex with regard to thermal, chemical and mechanical energies, and their identification is rather complex too. The IRSID (the research center of the USINOR Group) has already developed a few partial models, for instance one dealing with the energy balance. But there is no global model and, in particular, no dynamic model.

There are many different BFs, but we can consider that they all have the same process. Man has been able to produce cast iron for 3,000 years. A modern BF is therefore the fruit of much experience, patience and technology. A BF works continuously, stopping only for programmed stops every four months and for emergencies, which are unusual. The BF process is slow: adding some coal or coke has an effect on the cast iron temperature between four and eight hours later.

[Figure 7. Labels: interlocutors (BF experts, software and knowledge engineers, operators); functions (computer-assisted blast furnace process control system; computer-assisted blast furnace process analysis system, ENQUET; development and maintenance environment).]

Fig. 7. The functions of Sachem.

5.6.2. Purpose of the supervision

The Sachem project has been set up to help supervision of the BF, in particular to improve its regularity and the homogeneity of the cast iron. This allows a longer service life of the BF, fewer operations in the steelmaking plant and therefore a lower cost price of the final product (the coil). Operators cannot easily reach this aim of regularity because of the delays between an action and its effect: the effect of their decisions and actions is usually seen by the following team. The early detection of malfunctions allows the operators to take corrective measures in advance. This is a key factor for good control of the process. Another aim of Sachem is to reduce the consequences of turnover in the team of operators, for instance when they retire.

Sachem is to monitor the process in real time, point out the phenomena, look for their causes and recommend appropriate actions. Sachem is in operation in the control room. Of course, the operators already work with data given by one thousand sensors, but the data are so numerous that they cannot be used to best effect. There are about ten actuators to drive the BF. One of the most important problems is thermal regulation, which deals with the cast iron temperature. Moreover, a specialized function of Sachem helps the process analysis team.

5.6.3. Input

5.6.3.1. Raw data. Sachem analyses data provided by sensors situated all around the BF. There are roughly 1,000 sensors, which give continuous signals at a rate of one value every minute. The data are acquired by a specialized module (the acquisition module) and are tested for validity. These data comprise: temperatures from the wall of the furnace (staves, heat-resistant walls, etc.); pressures on the wall, in order to calculate the permeability of the BF; gas analyses (temperature, chemical analyses, etc.); and flows (flows of gas at the nozzle, flows of gas out of the BF, cast iron flows, etc.).

These data are then sent to an elaboration module which produces 3,300 variables. The mathematical operations used within this module include, for example, averaging and smoothing. Sachem also uses partial mathematical models of the BFs.
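As an illustration of the kind of elementary operation mentioned above, the following minimal Python sketch computes a moving average over the most recent samples of a signal. It is not the Sachem elaboration module: the function name `moving_average` and the `window` parameter are assumptions made for the example.

```python
from collections import deque


def moving_average(samples, window):
    """Return the running mean of the last `window` samples.

    At the start of the signal, fewer than `window` samples are available
    and the mean is taken over the samples seen so far.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    recent = deque(maxlen=window)  # holds at most `window` samples
    total = 0.0
    smoothed = []
    for value in samples:
        if len(recent) == window:
            total -= recent[0]     # the oldest sample is about to be dropped
        recent.append(value)
        total += value
        smoothed.append(total / len(recent))
    return smoothed
```

With one value per minute, a window of 10 would correspond to a ten-minute average; the appropriate window for a given signal is a tuning choice that the source does not specify.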

5.6.3.2. The Signal Phenomena. They characterize the previous measurements and describe: the level of a signal, its stability, its variations (e.g. of the slope, ...), its long-term trend, the undulation of the signal, ...

These signal phenomena correspond to the signal characteristics that have been selected as significant by the experts. The ENQUET environment, based on the X-ANALYST software and marketed by AIS, a young company that originated from Matra Marconi Space, is used to tune and parameterize the signal analysis algorithms. The technique consists in applying a processing step to a range of data and visualizing the intermediate and final results.

5.6.3.3. The BF Phenomena (BFP). They are obtained from the signal phenomena by means of coded expertise held in a knowledge base. About 150 phenomena have been counted, which give rise to about 450 possible different messages (alarms or warnings) for the operator.

Examples of these alarms are: “the increase of the gas distribution on the wall”, “a low cast iron temperature”, “a high level of slag index”, ... Of course, the detection of some phenomena depends on the operating conditions of the BF. The thresholds used by the signal analysis module are calculated from the information describing the operating context.

Here is an example of an expertise rule: if there is a signal phenomenon of decrease of wind in a nozzle and, within two hours, a signal phenomenon of increase of wind in the same nozzle greater than 80% of the previous decrease, then a BFP of “punctual passage of a block” is detected for this nozzle. Consistency and validity checks are performed on the data and the phenomena. If some data are missing or invalid, substitute data can be used. These substitute data are then used by the knowledge base until the invalid data are declared operational again.
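The expertise rule above can be sketched as a small temporal pattern matcher. The Python fragment below is only an illustration under stated assumptions, not the Kool implementation used in Sachem: the `SignalPhenomenon` record, its fields and the function `detect_block_passage` are hypothetical names, and amplitudes are assumed to be positive magnitudes expressed in the same unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalPhenomenon:
    nozzle: int       # nozzle identifier (hypothetical encoding)
    kind: str         # "decrease" or "increase" of wind
    time_min: float   # detection time, in minutes
    amplitude: float  # magnitude of the variation, assumed positive


def detect_block_passage(phenomena, max_delay_min=120.0, recovery_ratio=0.8):
    """Return (nozzle, t_decrease, t_increase) for each matched pair.

    A decrease is matched with the first later increase on the same nozzle
    that occurs within `max_delay_min` minutes and whose amplitude exceeds
    `recovery_ratio` times the amplitude of the decrease.
    """
    detections = []
    ordered = sorted(phenomena, key=lambda p: p.time_min)
    for i, first in enumerate(ordered):
        if first.kind != "decrease":
            continue
        for second in ordered[i + 1:]:
            if second.time_min - first.time_min > max_delay_min:
                break  # later phenomena are outside the two-hour window
            if (second.nozzle == first.nozzle
                    and second.kind == "increase"
                    and second.amplitude > recovery_ratio * first.amplitude):
                detections.append((first.nozzle, first.time_min, second.time_min))
                break
    return detections
```

The two-hour window and the 80% ratio are the values quoted in the rule; they are exposed as parameters here only to make the structure of the rule visible.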

5.6.4. Models

The identification of the BFP from signal phenomena relies on a knowledge base embedding the domain expertise. It is implemented in Kool 96, a powerful hybrid generator, and structured into 20 rubrics of expertise.

They include: local conditions of nozzles, quality of the cast iron, global permeability, thermal balance, etc. Each rubric mainly contains the following points:

– the issue of the rubric,
– the typical phenomena of the rubric and the detection rules,
– the invalidation rules and the substitute detection rules in case of invalidation.

Some detection rules of BFP are chronicles, composed of characteristic events and temporal constraints that these events have to satisfy.

The knowledge base embeds about 25,000 objects corresponding to 33 goals, 27 tasks, 75 inference structures, 3,200 concepts and 2,000 relations (in the KADS sense). Knowledge acquisition was hard work: it required 6 knowledge engineers and up to 13 experts. Together they produced a common glossary of process control, a General Expert Analysis (which required one year) and a Detailed Expert Analysis (2 years). The expertise exists in three forms: reformulations (in natural and structured language) that can be validated by the experts, a model in OpenKADS for the software engineers and, finally, Kool code. The total cost of the knowledge acquisition is about 30 man-years (14 for the preliminary knowledge acquisition, the rest for the tuning and validation phase). It represents about 100,000 lines of Kool code.

5.6.5. Output

The output of Sachem mainly consists in the presentation of the BFP to the blast furnace operators. The recommendation module is under development. The screen mainly shows:

– a synthesis of the process control with the presentation of the alarms and warnings (the BFP considered as important),
– the possibility for the operator to ask for the justifications of the BFP,
– the possibility to consult the process data,
– the summary for the next shift,
– the prediction of the cast iron temperature.

The next version will include the recommendations of actions.


[Figure 8 labels: level computer; acquiring data; data processing and modelling; signal analysis; detecting the phenomena; managing the context; knowledge bases; numeric database (NDB) and symbolic database (SDB); alarms, warnings, justifications; data visualisation; action recommendation; systematic synchronous assistance; asynchronous assistance on request; monitoring and operating.]

Fig. 8. the architecture of Sachem

5.6.6. Implementation

The architecture of Sachem is presented in figure 8. Data are computed in batch mode every minute. The total cycle duration is 8 s on an SP2 (a computer in the same range as Deep Blue). The total cost of the project is about 200 man-years. The project began in 1991.

5.6.7. Results

The expected gains are 6 francs per ton of iron. The production at Sollac is 10 million tons per year. The Sachem system is in operation on 3 BFs out of 7. The first was delivered in October 1996 and the second in May 1997, both at Fos-sur-Mer, and the third in September 1997 at Dunkirk. The first observations show that the number of process anomalies has been divided by 3. Some problems have even disappeared completely since Sachem went into service.

5.7. Exxon: TIGER project

The TIGER condition monitoring system was devised and implemented within the framework of the TIGER Esprit European project entitled “Real Time Assessment of Dynamic, Hard to Measure Systems”. It was applied to the gas turbine domain [3, 4, 6]. The monitoring functionalities of TIGER are provided by three independent tools which can all work in parallel, each examining different aspects of the turbine:

1. Kheops [43] is an expert system shell which compiles the rule base into a decision tree, thereby guaranteeing an upper limit on response time. Kheops is used as a high-speed limit checking system.

2. IxTeT [25] is a temporal reasoning system which performs on-line recognition of chronicles, that is, sequences of events related by temporal constraints. The chronicles are specified by the user for normal operating conditions and/or known faulty situations.

3. Ca∼En is a model-based detection and diagnosis system. It uses deep models (physical laws of the domain) of the normal behavior of the physical system and implements a consistency-based reasoning scheme. Fault knowledge is not necessary, which makes Ca∼En usable from the very beginning of the life of the physical system.

This section focuses on the Ca∼En tool applied to the Exxon turbine, which is one of the two applications dealt with during the TIGER project (the reader can refer to section 5.8 for a description of the second application).

5.7.1. Monitored physical system

The initial installation of TIGER was at the Fife Ethylene Plant, a 650,000 tonnes a year gas cracking facility located in south-east Scotland and jointly owned by Exxon Chemical and Shell Chemical. The major product of the facility is high-grade ethylene for use in the plastics and butyl rubber industries, both in the UK and on the continent. The feedstock is ethane gas obtained from the Shell/Esso offshore facilities in the North Sea. The process is continuous, with the ethylene product being transported by both ship and pipeline to end users in the UK and on the continent. The gas turbine is a 28 megawatt General Electric Frame Five with two shafts, supplied by John Brown Engineering (see figure 9); it is used to drive the primary compressors of the Fife Ethylene Plant.

The turbine is controlled by a Speedtronic Mark IV controller, which is in charge of applying all the regulations necessary to satisfy the production demand while keeping the efficiency optimal. The Exxon turbine is a two-shaft turbine with a set of nozzles, known as “second stage nozzles”, that balance the energy between the two shafts. This allows the compressor to run at its optimum speed (5100 ± 10 rpm) while providing for variable load on the turbine. The position of the second stage nozzles is controlled by a servo. The decomposition into subsystems is visible in figure 9.

Two subsystems were analyzed in depth: the second stage nozzles and the fuel admission system.


Fig. 9. Exxon turbine displayed by the TIGER interface

5.7.1.1. Operating modes. As the Exxon application turbine runs continuously 24 hours a day, the starting and stopping modes are very rare and the most interesting operating mode is the nominal speed mode. However, within the nominal mode, two possible modes still exist:

– The speed control mode takes over when the exhaust gas temperature is below the acceptable limit. The admission control law is hence a function of the load, which increases or decreases the primary shaft rotation speed;

– The temperature control mode takes over as soon as the exhaust gas temperature reaches this limit. Increasing fuel admission is no longer allowed. If more power is nevertheless needed, the steam from an auxiliary steam turbine can be increased. It is also possible to add steam directly into the combustion chamber manually.

In addition, the turbine can be fed with fuel from two sources: the starting fuel tank and the running fuel tank. This also defines two distinct operating modes. Each source is equipped with two valves: the first valve keeps the input pressure of the second constant, whereas the second valve controls the injected fuel flow.

5.7.2. Monitoring aims

The Exxon application turbine needs to be monitored mainly once it has settled around the nominal speed. The requirements are hence on-line monitoring with real-time constraints. The following three aspects must be present:

1. Assistance for controlling the turbine, which has the following requirements:

– to detect abnormalities and to produce a precise, easy-to-interpret report about the state of the turbine for use by the operators.

– to interpret the alarms coming from the control system ladder logic; these may be ambiguous, as the same alarm may have several possible causes.

2. Maintenance, which has the following requirements:

– to trace detected abnormalities back to their primary causes and to produce a diagnosis (list of possible causes in terms of responsible faulty components).

– to perform trend analysis on data recorded over a one-month or one-year scale; this is intended to detect deterioration resulting from wear.

3. Anticipation, since the faults must be detected as soon as possible in order to allow operators to fix the problems without stopping the turbine.

5.7.3. Inputs

The sensors existing on the turbine provide 74 analog signals and 80 digital signals. Most of the continuous signals directly provide the value of a given physical quantity of the turbine through time, although some of them result from a simple calculation (a percentage, for example). They are generally noisy, and were therefore first filtered (low-pass filter) before being used as inputs for the TIGER system. Digital signals are representative of the state of the turbine. They constitute a set of alarms which must be interpreted by the operator when deciding upon the control actions to be performed. This is a difficult task which often requires consulting the manuals containing the logic circuit drawings; indeed, the alarms are not associated one-to-one with the faults. All these sensors were primarily installed to meet control requirements.

5.7.4. Models

Ca∼En’s representation formalism allows one to combine empirical causal knowledge and first principles of the domain. Time is dealt with explicitly, by means of a logical clock which delivers a version of continuous time sampled at constant frequency. The Ca∼En formalism is based on a multi-model representation scheme including:


– a causal model in the form of an influence graph (see section 4.2) in which the links represent influence relations between pairs of variables, also called the local constraint level;

– an analytical equation model which allows one to represent algebraic-differential equations, also called the global constraint level.

Both models can manage imprecise knowledge. They take part in the prediction algorithm in a cooperative cycle. Since the prediction process is driven by the causal model, all the knowledge about the physical system must be implemented at this level. Consistency of the pieces of knowledge implemented at both levels is guaranteed by the fact that the causal model is generated automatically from the corresponding equational model. This is performed by the Causalito algorithm [44], which implements a causal ordering approach for multi-model systems.

Ca∼En has two processing modules:

1. A simulation module which produces the explicit behaviour of the physical system in terms of the values of the internal variables across time according to the behaviour of the exogenous variables. Imprecision is managed with interval values, which implies that predicted graphs are curve envelopes [45, 46].

2. A diagnosis module which accounts for fault detection and isolation of faulty components. Fault detection is based on models of normal behaviour. The on-line simulation of these models provides a way of implementing a discrepancy detection procedure. This enables monitoring of the behaviour of the system and detection of early deviations from the nominal behaviour. The diagnosis algorithm falls into Reiter’s model-based diagnosis framework and uses the Ca∼En causal graph as the System Description (SD). It relies on the collection of conflict sets, i.e. sets of components such that the observations indicate that at least one of the components in a set must be behaving abnormally, and the use of an incremental hitting set algorithm. The diagnoses are given as sets of faulty components labelled by their corresponding time of failure [47].
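To make the notions of conflict set and hitting set concrete, the sketch below enumerates, in Python, the minimal hitting sets of a small collection of conflicts. It is a brute-force enumeration by increasing cardinality, given for illustration only; it is not the incremental algorithm used in Ca∼En, it ignores failure times, and the function name and component names are invented for the example.

```python
from itertools import combinations


def minimal_hitting_sets(conflicts):
    """Enumerate the minimal hitting sets of a collection of conflict sets.

    A hitting set contains at least one component of every conflict; in
    consistency-based diagnosis each minimal hitting set is a candidate
    diagnosis. Exhaustive search: only suitable for small examples.
    """
    conflicts = [frozenset(c) for c in conflicts]
    if any(not c for c in conflicts):
        return []  # an empty conflict cannot be hit by any set
    components = sorted(set().union(*conflicts)) if conflicts else []
    found = []
    for size in range(len(components) + 1):
        for candidate in combinations(components, size):
            cand = frozenset(candidate)
            if any(prev <= cand for prev in found):
                continue  # a smaller diagnosis is already included in it
            if all(cand & c for c in conflicts):
                found.append(cand)
    return found
```

For the two hypothetical conflicts {valve, servo} and {servo, sensor}, the candidates are the single fault {servo} and the double fault {sensor, valve}.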

The conclusions of the Ca∼En diagnosis system rely on reasoning based on the physics underlying the behaviour of the system, i.e. physical laws and empirically known causal interactions. When a fault is detected, it is viewed as the violation of some of these physical principles, which then guide the isolation of the faulty component(s). As a consequence, there is no need to anticipate the faults, which is highly valuable in most complex engineering domains. Moreover, as variables and parameters take interval values, one can easily adapt the granularity of the models to the requirements of the faults. Hence Ca∼En has a wide coverage of faults, from those radically changing the behaviour of the physical system to those causing smooth deviations.

Fig. 10. Causal model of the nozzles system and Ca∼En predicted envelopes

5.7.4.1. Behavioral models. There are two sources of knowledge:

– the Speedtronic Mark IV controller manual, which provided us with the relations between the control and controlled variables in the form of a set of equations. Some of the parameter values were also available;

– data recorded on the turbine, which allowed us to identify the structure and parameter values of other relations.

5.7.5. Outputs

The outputs are, for every non-exogenous variable, the curve envelope bounding its possible values across time. Moreover, an anomaly detection message is displayed every time a variable is considered to have an abnormal behavior. This message is followed by a diagnosis report listing the diagnosis sets, in which the components are labelled by their failure time. A screen display is given in figure 10.
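The principle of detecting an anomaly against a predicted envelope can be illustrated by a few lines of Python: a measurement is flagged as soon as it leaves the interval predicted for that instant. This is a simplified sketch, not the Ca∼En procedure; the function `detect_discrepancies` and its `tolerance` parameter are assumptions.

```python
def detect_discrepancies(times, measurements, lower, upper, tolerance=0.0):
    """Return the times at which a measurement leaves its envelope.

    `lower` and `upper` are the predicted bounds at each sampling time;
    `tolerance` widens the envelope on both sides (hypothetical parameter).
    """
    if not (len(times) == len(measurements) == len(lower) == len(upper)):
        raise ValueError("all sequences must have the same length")
    alarms = []
    for t, y, lo, hi in zip(times, measurements, lower, upper):
        if y < lo - tolerance or y > hi + tolerance:
            alarms.append(t)
    return alarms
```

A practical detector would probably require several consecutive violations before raising a message, in order to be robust to noise; the source does not describe this level of detail.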


5.7.6. Implementation

The Ca∼En system used in TIGER was implemented on a Sun 4 station and written in LeLisp. A more recent version runs on Solaris and includes a component-connection knowledge acquisition interface.

5.7.7. Validation, maintenance, evolution

The behavioral models were fitted to more than ten recorded data scenarios chosen for the various faulty situations they covered. Validation was performed by running Ca∼En on one hundred scenarios chosen randomly from one year of recorded data. The diagnosis results were found correct for 99 scenarios out of 100.

5.8. Dassault Aviation: TIGER project

The TIGER condition monitoring system, as described in the previous section, was used to develop two applications during the TIGER Esprit project. The previous section presents the application to the Exxon turbine; this section presents the application to the Dassault Auxiliary Power Unit turbine.

5.8.1. Monitored physical system

The APU (Auxiliary Power Unit) is a small turbine used as an auxiliary power supply in aircraft. The one considered here was a 0.4 MW turbine designed by the company Micro Turbo for Dassault Aviation and used in Rafale fighters. Like all turbine systems, it is made of an air supply, a compressor, a combustion chamber, a turbine and an exhaust pipe. It is used on the ground or in flight to produce electric or pneumatic power:

– electric power: the APU is used to drive one or two electrical power generators.

– pneumatic power: this is generally obtained by taking air right after the low pressure compressor to perform the following functions:

∗ starting the jet engines in some faulty situations;
∗ cabin air conditioning, particularly at take-off time when the engines need full power;
∗ defrosting.

The TIGER application focused on the APU fuel system, which feeds and regulates the APU, delivering the fuel from the aeroplane tanks to the injectors at the right pressure and flow, depending on the shaft speed and the aeroplane operating mode. This subsystem is shown in figure 11.

The APU fuel system includes the following functions (see figure 11):

(i) A pressure and flow rising function including: a fuel shut-off valve which opens and closes the fuel system; a check-valve which allows the circuit to be filled with fuel at starting time (once the pump has started, the derived flow immediately closes the check-valve); and a pump which provides the desired flow and pressure. The pump is a volumetric pump with constant capacity, and its speed is proportional to the engine speed.

(ii) A filtering function consisting of three filters located just before the fragile components, e.g. the fuel control valve. Their role is to eliminate the impurities (dust, ice crystals, etc.) that can be present in the fuel.

(iii) A fuel regulation function consisting of: a fuel control valve associated with a current servo, which regulates the fuel flow as a function of the APU operating mode and of the running speed set point; and a differential pressure control valve which maintains a constant pressure difference between the fuel control valve input and output, so that the section of the valve, and hence the fuel flow, is proportional to the servo current. The excess fuel arriving at the input of the fuel control valve is recycled to just before the pump.

(iv) A fuel injection function which includes two injection rings which spray the fuel into the combustion chamber. The first injection ring, with four injectors, acts on its own for low power demands; it is complemented by the second ring, with five injectors, when high power is needed. The second ring is activated as soon as the dividing valve opens, which occurs under certain pressure conditions.

(v) A drainage function which prevents fuel from accumulating in the circuit when the APU is stopped, which would be dangerous at excessively high temperatures. It is composed of a drainage valve which empties the fuel out of the secondary injection ring, a second valve which has the same function for the first injection ring, and a shut-off valve which opens when the APU is stopped.

The APU dynamics are very fast. The sensor sampling period is 0.02 seconds.

5.8.1.1. Operating modes. The functional description of the APU fuel system shows that some components are of continuous type and others, which control the opening and closing of some parts of the circuit, are of “binary” type. The binary components define 8 different operating modes associated with pressure conditions.


Fig. 11. APU fuel system block diagram

5.8.2. Monitoring aims

The function of the APU is to assist the main engines in situations which require full power. When a problem occurs, the starting phase is at least as important as the running phase for detecting and diagnosing the fault. Indeed, a problem often results in the APU not starting at all. Therefore, APU monitoring must be particularly efficient for transients, i.e. during starting time. Even though the problems need to be detected as soon as possible, on-line monitoring is not required; the data can equally well be analyzed a posteriori. Monitoring therefore relates exclusively to the maintenance aspect, as an assistance provided to the operator when the turbine is on its test bench after an anomaly, or for a series of scheduled maintenance tests.

5.8.3. Inputs

Given the type of monitoring required, the available sensors are those existing on the test bench, which are more numerous than those fitted on the turbine during normal operation. There are 12 test bench sensors. They all provide continuous signals, each reporting the value of a physical quantity of the fuel system across time.

5.8.4. Behavioral models

The knowledge was provided on the one hand by the Dassault Aviation experts who specified the APU, and on the other hand by the Micro Turbo experts who designed the APU. The knowledge was mostly in the form of equations coming from the physical domain; there was also numeric knowledge about the parameter values and their tolerances, as well as graphs obtained on the test bench for characterizing the APU efficiency. The fuel system was decomposed into 10 hydraulic components of 5 different classes. A generic model was built for every class. The global fuel system model includes 22 equations for 4 input variables, 18 internal variables and 22 constant parameters. Eight variables were measured and 14 were not.

5.8.5. Validation, maintenance, evolution

The behavioral models as well as the Ca∼En diagnosis system were tested on 25 scenarios recorded on the Micro Turbo hydraulic test bench. The scenarios included several faulty situations obtained by physically introducing a fault into the system. A demonstration of Ca∼En running on-line on the test bench was performed and was fully successful.

5.9. France Telecom: IMOGENE project

IMOGENE (Inversion of a Model thanks to Genetic algorithms)4 is a supervision system whose main function is to determine, in a given network, which streams are responsible for call losses on communication organs (switches and circuit groups).

IMOGENE performs this task by comparing stream traffic values to their nominal values. However, stream traffic values are not measured by the on-line data acquisition system and hence have to be computed. We perform this computation by inverting a stream propagation model by means of evolutionary computation techniques [48].
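The idea of inverting a forward model by evolutionary search can be sketched on a toy problem. The Python code below is only an illustration under strong assumptions: the forward model simply sums the traffic of the streams routed through each element (no blocking or overflow), and a very simple (1+1) evolution strategy stands in for the genetic algorithm used in IMOGENE. All function names and parameters are hypothetical.

```python
import random


def forward_model(stream_traffic, routes, n_elements):
    """Toy propagation model: the load of an element is the sum of the
    traffic of the streams routed through it (no blocking, no overflow)."""
    load = [0.0] * n_elements
    for traffic, route in zip(stream_traffic, routes):
        for element in route:
            load[element] += traffic
    return load


def mismatch(candidate, routes, observed):
    """Squared error between predicted and observed element loads."""
    predicted = forward_model(candidate, routes, len(observed))
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def invert_model(routes, observed, generations=2000, step=1.0, seed=0):
    """(1+1) evolution strategy: mutate the current best candidate and
    keep the child whenever it is at least as good as its parent."""
    rng = random.Random(seed)
    best = [0.0] * len(routes)
    best_error = mismatch(best, routes, observed)
    for _ in range(generations):
        child = [max(0.0, x + rng.gauss(0.0, step)) for x in best]
        error = mismatch(child, routes, observed)
        if error <= best_error:
            best, best_error = child, error
    return best, best_error
```

Because the search is elitist, the returned error can never exceed that of the starting point; how close the estimate gets to the true stream values depends on the step size and on the number of generations, and a toy linear model such as this one could of course be inverted by direct methods.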

5.9.1. The French long distance network

First, the French long distance network can easily be seen as a graph with:

– nodes, representing switching centres (or switches), whose capacity represents the maximum number of calls carried simultaneously,

– arcs, representing the circuit groups between two switches; their capacity (i.e. their number of circuits) is equal to the maximum number of calls that can be carried simultaneously.

The telephone network then carries streams, each stream corresponding to a set of calls going from one switch (called the origin node) to another one, the destination node; a stream is also characterized by its offered traffic value (the average number of call arrivals during a period corresponding to the average call holding time) and its routing table, which provides the possible ways to reach the destination node.

4 Collaboration between France Telecom/CNET and LAAS/CNRS.

The basic node of this network is the subscriber switch, called the routing autonomy switch (RAS), which is connected to phone subscribers (PS) either directly, by point-to-point links, or through a local switch (which might be a satellite), marked LS in figure 12.

The network has a specific structure in which some nodes, the transit switches, are dedicated only to call routing and are not directly connected to the phone subscribers. The French long distance network is composed of two levels: the main transit level (formed of MTS) and the secondary transit level (where the switches are marked STS), as shown in figure 12. Thus, when a subscriber A tries to call a subscriber B who is not connected to the same routing autonomy switch, the communication request of A tries to go through the high-usage switches and circuit groups, as defined in the routing table of the requested stream. If one of these circuit groups is overloaded, the call request of A overflows, as specified in the routing table, towards a final circuit group and follows the route corresponding to this overflow.

[Figure 12 node labels: MTS, STS, RAS, LS, PS]

Fig. 12. Telephone network hierarchy

The routing policy of the French long distance network allows only one overflow: every call that tries to overflow towards an overloaded circuit group is lost.
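The single-overflow policy can be summarized by the following Python sketch. It is a deliberately simplified illustration: a route is reduced to a list of circuit-group names, a group is considered overloaded when it has no idle circuit, and the function `route_call` and the group names are hypothetical.

```python
def route_call(first_choice, overflow, free_circuits):
    """Try the high-usage route, then the single overflow route.

    Routes are lists of circuit-group names. `free_circuits` maps each
    group name to its number of idle circuits and is updated in place
    when the call is carried. Returns the route used, or None if the
    call is lost.
    """
    for route in (first_choice, overflow):
        if all(free_circuits[group] > 0 for group in route):
            for group in route:
                free_circuits[group] -= 1  # seize one circuit per group
            return route
    return None  # a second overflow is not allowed: the call is lost
```

For example, with one idle circuit on the direct group and two on each group of the overflow route, the first call is carried directly, the next two overflow, and the fourth is lost.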

5.9.2. Supervision objectives

Independently of the characteristics of the network switches and circuit groups, the quality of the traffic flow can be affected by the overload of some streams and, because of the high connectivity of the network, this disturbance may propagate within the network in a very short time frame, creating new disturbances.

This traffic modification can be linked to:

– daily or seasonal changes: the traffic offered to some switches increases whereas that of others decreases. This is, for instance, the case of switches connected to tourist sites during holidays.

– a global overload; this may happen on special occasions like Christmas or New Year’s Day.

– an unexpected event such as a television or radio phone-in game, a natural, rail or air disaster, or a political event.

The objective of real-time network control is to avoid network degradation by means of supervision and the implementation of traffic control actions; in the case of an overload, these actions allow most of the call requests to reach their destination, using the maximum number of available resources in the network.

Real-time network control can be decomposed as follows:

1. supervision of the network state and working conditions,
2. collection and analysis of supervision data,
3. detection of abnormal network circumstances,
4. diagnosis of the causes of disturbances,
5. corrective actions on the network or the traffic.

5.9.3. Inputs

IMOGENE inputs come from a data collection performed every 5 minutes on the various network elements (switches and circuit groups). These data mainly concern the number of call attempts, the number of effective calls, and the offered and carried traffic.

IMOGENE knows the network structure, i.e. the number and capacity of the switches, the number and capacity of the circuit groups, and the routing tables of the streams.

5.9.4. Model

The model used by IMOGENE is a one-moment model (i.e. offered traffic is characterized only by its mean) and has been developed under the classical assumptions of this kind of model:

– A1 Call arrival is a Poisson process.
– A2 Call holding time has a negative exponential distribution.
– A3 Blocking probabilities are statistically independent.
– A4 The network is in statistical equilibrium.
– A5 Call arrival (node-originated plus overflow) on any element is a Poisson process.

The inputs of the stream propagation model presented here are the same as those of the CNET5 simulator SuperMac, i.e. the network structure and the offered traffic values of the streams. On the other hand, our model is more informative than SuperMac in its outputs, since it provides not only the overall traffic losses and offered traffic for each organ and for each stream but also the traffic losses and offered traffic for each stream at each network element.

Our stream propagation model is qualitative in that it is only based on [49, 50]:

– the intuitive notion of a blocking organ, which is based on the use of Erlang’s formula; unlike the models described in [51–53], our model uses this formula as an indicator and not as an equation that has to be solved.

– qualitative knowledge about the network, based on its structure.
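As a reminder of what such an indicator may look like, the sketch below computes the Erlang B loss formula with its standard recurrence and derives a simple blocking test from it. It assumes that the formula referred to above is Erlang B, which is the usual choice under assumptions A1 and A2; the function names and the threshold value are hypothetical and are not taken from IMOGENE.

```python
def erlang_b(offered_traffic, circuits):
    """Erlang B blocking probability for `circuits` circuits offered
    `offered_traffic` erlangs, using the recurrence
    B(0) = 1,  B(n) = A * B(n-1) / (n + A * B(n-1))."""
    if offered_traffic < 0 or circuits < 0:
        raise ValueError("traffic and number of circuits must be non-negative")
    blocking = 1.0  # with zero circuits every call is blocked
    for n in range(1, circuits + 1):
        blocking = offered_traffic * blocking / (n + offered_traffic * blocking)
    return blocking


def is_blocking(offered_traffic, circuits, threshold=0.01):
    """Indicator: the organ is considered blocking when the predicted
    loss probability exceeds `threshold` (hypothetical value)."""
    return erlang_b(offered_traffic, circuits) > threshold
```

Used in this way, the formula is only evaluated in the forward direction to flag the organs that are likely to lose calls; no system of equations has to be solved.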

5.9.5. Outputs

IMOGENE outputs are the approximate values of the offered traffic of the network streams; these values are computed by inverting our stream propagation model by means of evolutionary computation techniques.

These values are then compared to the corresponding nominal traffic values in order to determine which streams are responsible for the overflow.
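This final comparison step can be sketched as follows in Python. The relative margin used to decide that a stream is abnormally loaded is an assumption, as are the function name `suspect_streams` and the representation of the traffic values as dictionaries indexed by stream.

```python
def suspect_streams(estimated, nominal, relative_margin=0.2):
    """Return the streams whose estimated offered traffic exceeds the
    nominal value by more than `relative_margin`, most overloaded first.

    `estimated` and `nominal` map a stream (e.g. an origin/destination
    pair) to an offered traffic value in erlangs.
    """
    suspects = []
    for stream, value in estimated.items():
        reference = nominal[stream]
        if value > reference * (1.0 + relative_margin):
            ratio = value / reference if reference > 0 else float("inf")
            suspects.append((stream, ratio))
    suspects.sort(key=lambda item: item[1], reverse=True)
    return suspects
```

Each returned pair gives the stream and the ratio between its estimated and nominal traffic, so that the operator can focus first on the most overloaded streams.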

5.9.6. Implementation

IMOGENE was developed as part of a CNET-CNRS collaboration contract and has been implemented in C++ on a Solaris UNIX station.

5.9.7. Validation, maintenance and evolution

IMOGENE has been tested on a set of networks that present all the particularities of the real one; it gives particularly good results when using a real-valued variant of a genetic algorithm.

Ongoing work deals with the complexity and real-time constraint requirements in the following aspects:

– adding to IMOGENE a qualitative preprocessing stage allowing us to split the network into several independent subnetworks,
– considering parallel evolutionary computation techniques.

5 Centre National d’Etudes des Télécommunications

Moreover, we are working on the following issues:

1. control actions that bring the network back to normal operating conditions,

2. mobile phones and the disturbances they cause on the French long distance network.

6. Synthesis and Conclusion

6.1. Synthetic tables

In the next three tables, we compare the applications the members of the “ALARM” group are involved in, according to criteria which were judged to be significant. The first table (see Figure 13) compares the applications with regard to their context: which kind of observables can be obtained? What is the dynamics of the monitored system? What is the general goal of the application: preventive maintenance or control of the system? The second table (see Figure 14) compares the applications in a more detailed way. Which tasks do they complete: filtering alarms, detecting malfunctions, locating abnormal components, looking for primary causes of malfunctions, proposing (control or repair) actions? On which kind of models do they rely: numerical or qualitative, static or dynamic, expertise-based? Which artificial intelligence techniques do they use: influence graphs, causal graphs, rule-based systems, pattern recognition or anything else? Which kinds of reasoning schemes are implemented: deductive, abductive or another kind?

6.2. Conclusion

This paper provides a summary of the different techniques currently proposed in the field of Artificial Intelligence for alarm monitoring of dynamic systems, together with the description of various applications implementing these techniques. The variety of theoretical approaches used, as well as the large number of different real-world problems tackled with success, demonstrates that AI techniques offer operational solutions to industrial problems involving these issues. The applications and corresponding monitoring systems are analysed with respect to different criteria, which suggest a possible taxonomy of the domain. This work is related to other work in the same direction by Chantler [54] and Console [55].


                                         Alexip  Austral  Diapason  Diapo     Imogène
                                         IFP     EDF      CEA       EDF       FT/CNET
Observables       Continuous Signals     •       -        •         •         •
                  Events                 -       •        -         •         -
                  Sensors Awareness      •       •        -         (*)       -
Temporal Aspects  Time Effects           -       -        -         wear      -
                  System Dynamics        low     high     low       variable  high
                  ≠ Operating Phases     •       -        -         • (3)     •
                  ≠ Configurations       •       •        -         -         •
Supervision for Control/Maintenance      C       C/M      C         M         C

                                         Gaspar   Retrait.  Sachem  Caen-Tiger
                                         FT/CNET  CEA       SOLLAC  EXXON
Observables       Continuous Signals     -        •         •       •
                  Events                 •        -         -       -
                  Sensors Awareness      -        -         -       -
Temporal Aspects  Time Effects           -        abrasion  -       -
                  System Dynamics        low, low, high [one value missing; column alignment not recoverable]
                  ≠ Operating Phases     -        •         •       •
                  ≠ Configurations       • (?)    •         -       -
Supervision for Control/Maintenance      C        C         C       M

(*) Nevertheless, sensor reliability is a real problem for this application

Fig. 13. Applications Characteristics

The results of our study show that the very first feature orienting the choice of technique is the continuous or discrete nature of the system. The GASPAR application, which deals with the supervision of telecommunication networks, is definitely discrete, whereas the TIGER and DIAPASON applications deal with continuous processes.

However, one should note that the discrete/continuous feature is not intrinsic to the process at hand. It also depends on the level of abstraction at which the process is observed. Indeed, the IMOGENE application deals with the supervision of the telephone network by aggregating the calls, which are discrete quantities, in terms of traffic and flows. Conversely, the DIAPO application reasons from an interpretation of the continuous signals in terms of discrete events. The choice of the level of abstraction is very much related to the type of knowledge available and its source. The current trend is towards the increasing use of models from the design stage, as this knowledge is generally accessible in a well-formalized form and can be transformed as desired, often automatically.
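The two changes of abstraction level mentioned above can be illustrated with a minimal sketch. It is not taken from the DIAPO or IMOGENE systems: the function names, the threshold, the window size and the sample data are all hypothetical and chosen only for illustration. The first function abstracts a sampled continuous signal into discrete threshold-crossing events; the second aggregates discrete event timestamps into a per-window count, a crude stand-in for a traffic flow.

```python
from collections import Counter


def signal_to_events(samples, threshold):
    """Abstract a continuous signal into discrete threshold-crossing events.

    samples: list of (time, value) pairs in increasing time order.
    Returns a list of (time, label) events, label being "high" or "normal".
    """
    events = []
    above = None  # state is unknown before the first sample
    for t, value in samples:
        now_above = value > threshold
        # Emit an event only when the signal changes side of the threshold.
        if above is not None and now_above != above:
            events.append((t, "high" if now_above else "normal"))
        above = now_above
    return events


def events_to_flow(event_times, window):
    """Aggregate discrete event timestamps into counts per time window."""
    counts = Counter(int(t // window) for t in event_times)
    n_windows = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) for i in range(n_windows)]


# Hypothetical data: a temperature trace and a list of call arrival times.
temperature = [(0, 20.0), (1, 24.5), (2, 31.0), (3, 33.2), (4, 28.0)]
print(signal_to_events(temperature, threshold=30.0))
# [(2, 'high'), (4, 'normal')]

call_times = [0.2, 0.7, 1.1, 3.4, 3.5, 3.9]
print(events_to_flow(call_times, window=1.0))
# [2, 1, 0, 3]
```

The same process can therefore feed either an event-based or a signal-based monitoring technique, depending on which of these views is computed from the raw observations.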

As a matter of fact, the so-called model-based approach is nowadays gaining importance over the more classical shallow-knowledge approach in the area of dynamic systems monitoring.

7. Acknowledgments

The following persons also contributed to the group’s work that gave rise to this paper: M. Allouche (SIMADE), S. Bibas (LIPN), P. Dague (LIPN), G. Deflandre (France Telecom), M. Dumas (CEA), D. Fontaine (UTC), S. Gentil (ENSIEG-LAG), G. Ramaux (UTC), L. Roze (IRISA), C. Sayettat (SIMADE).

References

[1] Y. Peng and J. A. Reggia, Abductive inference models for diagnostic problem solving. Springer-Verlag, 1990.

[2] W. Hamscher, L. Console, and J. De Kleer, eds., Readings in Model-Based Diagnosis, Morgan Kaufmann, 1991.


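| | | Alexip | Austral | Diapason | Diapo | Imogène | Gaspar | Retrait. | Sachem | Caen-Tiger |
|---|---|---|---|---|---|---|---|---|---|---|
| | | IFP | EDF | CEA | EDF | FT/CNET | FT/CNET | CEA | SOLLAC | EXXON |
| Objectives | Abnormalities Detection | • | • | • | • | • | • | • | • | • |
| Objectives | Alarm Filtering | • | • | - | - | - | • | - | - | - |
| Objectives | Situations Identification | - | • | - | - | - | • | • | • | - |
| Objectives | Fault Localisation | - | • | • | • | • | • | • | - | • |
| Objectives | Primary Cause Determ. | • | • | • | • | - | - | • | - | - |
| Objectives | Action Advice | • | • | - | - | in progress | - | - | future | - |
| Physical System Knowledge | Static Numerical Model | - | - | - | - | - | - | • | - | - |
| Physical System Knowledge | Dynamic Numer. Model | - | - | • | - | - | - | - | partial | • |
| Physical System Knowledge | Discrete Events Model | - | • | - | - | - | • | - | - | - |
| Physical System Knowledge | Qualitative Model | - | - | - | - | • | - | - | - | • |
| Physical System Knowledge | Expertise | • | • | - | • | - | • | - | • | • |
| Tools and Techniques | Influence Graphs | • | - | • | - | - | - | - | - | • |
| Tools and Techniques | Causal Graphs | • | - | - | • | - | - | - | - | - |
| Tools and Techniques | Rule-based Systems | - | - | • | - | - | (old version) | - | • | - |
| Tools and Techniques | Scenario Recognition | - | • | - | - | - | • | AMDE | • | - |
| Models Use | Prediction | - | - | • | • | • | - | - | • | - |
| Models Use | Explanation | • | • | • | • | - | • | • | • | • |

Objectives, Prevent/Correct Action: C, C, C, P, C, C, C, C. The source gives only eight values for the nine applications, so their column assignment is not recoverable.

Tools and Techniques, Others: decision trees; action plans, automata; situations prototypes; genetic algos; automata; numerical equations; mathematical models; numerical constraints. The source gives these entries without recoverable column alignment; they are listed in source order.

Fig. 14. Supervision System Characteristics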


[3] L. Console, D. T. Dupre, and P. Torasso, “Abductive reasoning through direct deduction from completed domain models,” Methodologies for Intelligent Systems, vol. 4, pp. 175–182, 1989.

[4] L. Console, D. T. Dupre, and P. Torasso, “A theory of diagnosis for incomplete causal models,” Proceedings of the IJCAI, 1989. Detroit.

[5] L. Console, L. Portinale, D. Theseider Dupre, and P. Torasso, “Diagnostic reasoning across different time points,” in 10th European Conference on Artificial Intelligence, (Vienna), pp. 369–373, 1992.

[6] MQ&D, “Qualitative reasoning: a survey of techniques and applications,” AICOMS, vol. 8, no. 3-4, pp. 119–192, 1995. Coordinated by Philippe Dague.

[7] L. Trave-Massuyes, P. Dague, and F. Guerrin, Le Raisonnement Qualitatif pour les Sciences de l’Ingenieur. Paris: Edition Hermes, 1997.

[8] B. Kuipers, “Qualitative simulation,” Artificial Intelligence, vol. 29, pp. 289–338, 1986. Also in Readings in Qualitative Reasoning About Physical Systems, Morgan Kaufmann.

[9] B. Kuipers, Qualitative reasoning - Modeling and simulation with incomplete knowledge. Cambridge, MA: MIT Press, 1994.

[10] D. Dvorak and B. Kuipers, “Model-based monitoring of dynamic systems,” in 11th Int. Joint Conf. on Artificial Intelligence, (Detroit, MI), pp. 1238–1243, 1989. Also in Readings in Model-Based Diagnosis, Morgan Kaufmann.

[11] R. Davis, “Diagnosis via causal reasoning: paths of interaction and the locality principle,” in AAAI, pp. 88–94, 1983.

[12] V. Bandekar, “Causal models for diagnostic reasoning,” Artificial Intelligence in Engineering, vol. 4, no. 2, 1989.

[13] P. Berthelot, “G2: a real-time computing tool. Real-time, knowledge-based systems,” tech. rep., Engineering Software n28, Sept. 1992.

[14] S. Cauvin, B. Braunschweig, P. Galtier, and Y. Glaize, “Alexip, expert system coupled with a dynamic simulator for the supervision of the alphabutol process,” Revue of Institut Francais du Petrole, vol. 47, pp. 375–382, May-June 1992.

[15] S. Cauvin, Un environnement generique a base de connaissances pour la supervision de procedes de raffinage et de petrochimie. Informatique, These de Doctorat CNAM, France, Nov. 1995.

[16] C. Ungauer, “Problematique d’utilisation de techniques de supervision a base de connaissances profondes : L’exemple de la supervision du reseau TRANSPAC,” technical report, CNET, Oct. 1993.

[17] S. Cauvin and L. Bes, “Model based alarm filtering on pilot plants,” Preprints IFAC On-Line Fault Detection and Supervision in the Chemical Process Industries, pp. 224–236, June 1995. Newcastle.

[18] D. Fontaine, “Reconnaissance de scenarios temporels,” in Revue d’Intelligence Artificielle (Hermes, ed.), vol. 8:1, AFCET, 1994.

[19] C. Dousson, Suivi d’evolutions et reconnaissance de chroniques. Intelligence artificielle, These de Doctorat de l’Universite Paul Sabatier, Toulouse, France, Sept. 1994.

[20] F. Levy, “Recognising scenarios: a study,” Fifth international workshop of diagnosis, pp. 174–178, 1994.

[21] P. Laborie and J.-P. Krivine, “Automatic generation of chronicles and its application to alarm processing in power distribution systems,” in International Workshop on Principles of Diagnosis (DX’97), pp. 61–68, September 1997. Mont St-Michel, France.

[22] S. Cauvin, “Action plans dynamic application in the alexip knowledge-based system,” Preprints of 2nd IFAC Workshop on Computer Software Structures Integrating AI/KBS Systems in Process Control, Aug. 1994. Lund, Sweden.

[23] S. Bibas, M.-O. Cordier, P. Dague, C. Dousson, F. Levy, and L. Roze, “Alarm driven supervision for telecommunication networks: I- Off-line scenario generation,” Annales des Telecommunications, vol. 9/10, pp. 493–500, Oct. 1996.

[24] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis, “A discrete event system approach to failure diagnosis,” in International Workshop on Principles of Diagnosis (DX’94), pp. 269–277, 1994. New Paltz, USA.

[25] C. Dousson, P. Gaborit, and M. Ghallab, “Situation recognition: representation and algorithms,” Proc. 13th IJCAI, pp. 166–172, Aug. 1993. Chambery, France.

[26] L. Roze, “Supervision of telecommunications network: A diagnoser approach,” in Proceedings of the International Workshop on Principles of Diagnosis (DX’97), (Mont-Saint-Michel), pp. 103–111, 1997.

[27] K. Khoualdi, “Filtrage d’alarmes pour un systeme automatise par une approche multi-agent,” Master’s thesis, Universite Paris 6, 1994.

[28] S. Cauvin, B. Braunschweig, P. Galtier, and Y. Glaize, “Model-based diagnosis for continuous process supervision: the alexip experience,” Engineering Applications of Artificial Intelligence, vol. 6, no. 4, pp. 333–343, 1993.

[29] S. Cauvin and B. Braunschweig, “Graphical knowledge representation in the alexip system for petrochemical process supervision,” Applications of AI in Engineering, applications and techniques, vol. 8, no. 2, pp. 219–234, 1993. Toulouse, France.

[30] C. Ferraz-Simha, Sacre : Systeme d’Aide au Controle de Resultats Experimentaux. Informatique, These de Doctorat de l’Universite Paris XIII, France, Nov. 1993.

[31] J.-P. Krivine and O. Jehl, “The AUSTRAL system for diagnosis and power restoration: an overview,” in Proc. of international conference on Intelligent System Application to Power Systems (ISAP’96), August 1996. Orlando, USA.

[32] P. Laborie and J.-P. Krivine, “GEMO: A model-based approach for an alarm processing function in power distribution networks,” in Proc. of international conference on Intelligent System Application to Power Systems (ISAP’97), July 1997. Seoul, South Korea.

[33] P. Laborie and J.-P. Krivine, “Automatic generation of chronicles and its application to alarm processing in power distribution systems,” in Proc. International Workshop on Principles of Diagnosis (DX’97), Sept. 1997. Mont St Michel.

[34] M. Porcheron and B. Ricard, “An application of abductive diagnostic methods to a real world problem,” in International Workshop on Principles of Diagnosis (DX’97), (Le Mont-St-Michel, France), pp. 87–94, 1997.

[35] L. Console, Dupre, and P. Torasso, “On the co-operation between abductive and temporal reasoning in medical diagnostic,” Artificial Intelligence in Medicine, vol. 3, no. 6, pp. 291–311, 1991.

[36] S. Bibas, P. Dague, F. Levy, M.-O. Cordier, and L. Roze, “Scenario generation for telecommunication network supervision,” Workshop on AI in Distributed Information Networks, Aug. 1995. Montreal, Quebec, Canada.

[37] C. Dousson, “Alarm driven supervision for telecommunication networks: II- On-line chronicle recognition,” Annales des Telecommunications, vol. 9/10, pp. 501–508, Oct. 1996.


[38] M. Sampath, R. Sengupta, and S. L. et al., “A discrete event systems approach to failure diagnosis,” Fifth International Workshop on Principles of Diagnosis, Sept. 1994. New Paltz.

[39] J. Penalva, L. Coudouneau, L. Leyval, and J. Montmain, “Diapason: a supervision support system,” IEEE Expert Intelligent Systems and their Applications, vol. 8, no. 5, 1993.

[40] L. Leyval, Raisonnement causal pour la simulation de procedes industriels continus. PhD thesis, 1991.

[41] J. Montmain, Analyse qualitative de simulations pour le diagnostic en ligne de procedes continus. PhD thesis, 1992.

[42] S. Gentil and J. Montmain, “Operation support for alarm filtering,” IEEE, CES ’96, 1996. Lille.

[43] H. Philippe, Algorithmes pour la compilation de bases de connaissances en logique propositionnelle et du premier ordre: les systemes CLOPS et KHEOPS. Intelligence artificielle, These de Doctorat de l’Universite Paul Sabatier, Toulouse, France, May 1989.

[44] L. Trave-Massuyes and R. Pons, “Causal ordering for multiple mode systems,” 11th Int. Workshop on Qualitative Reasoning about Physical Systems, 1997.

[45] K. Bousson and L. Trave-Massuyes, “Fuzzy causal simulation in process engineering,” Proc. 13th IJCAI, Aug. 1993. Chambery, France.

[46] K. Bousson and L. Trave-Massuyes, “Putting more numbers in the qualitative simulator ca-en,” Intelligent Systems Engineering, pp. 62–69, 1994.

[47] K. Bousson, L. Trave-Massuyes, and L. Zimmer, “Causal model-based diagnosis of dynamic systems,” 5th Int. Workshop on Principles of Diagnosis (DX ’94), 1994.

[48] I. Servet, L. Trave-Massuyes, and D. Stern, “Evolutionary computation techniques for telephone networks traffic supervision based on a qualitative stream propagation model,” in ICANNGA’97, (Norwich, UK), Apr. 1997.

[49] I. Servet, L. Trave-Massuyes, and D. Stern, “Traffic supervision in telephone networks and qualitative modelling,” Annales des telecommunications, vol. 51, Oct. 1996.

[50] I. Servet, L. Trave-Massuyes, and D. Stern, “Traffic supervision based on a one-moment model of telephone networks built from qualitative knowledge,” in CESA’96 IEEE-IMACS Multiconference, (Lille), July 1996.

[51] F. LeGall, J. Bernussou, and J. Garcia, “A one-moment model for telephone networks with dependence on link blocking probabilities,” in Performance (E. Gelenbe, ed.), pp. 449–458, 1984.

[52] D. R. Manfield and T. Downs, “On the one-moment analysis of telephone traffic networks,” IEEE Transactions on Communications, vol. 27, pp. 1169–, Aug. 1979.

[53] J. Guerineau and J. Labetoulle, “End to end blocking in telephone networks: a new algorithm,” in International Teletraffic Congress (M. Akiyama, ed.), (Kyoto), pp. 2.4A–2–1–2.4A–2–7, Elsevier, 1985.

[54] M. Chantler, S. Cermignani, K. W. Mathisen, and O. Saarela, “Selecting model-based diagnostic solutions,” in International Workshop on Principles of Diagnosis (DX’98), (Cape Cod, Massachusetts), pp. 6–15, 1998.

[55] L. Console and D. Theseider-Dupre, “On the dimensions of temporal model-based diagnosis,” in International Workshop on Principles of Diagnosis (DX’98), (Cape Cod, Massachusetts), pp. 16–23, 1998.
