Modeling and inference for troubleshooting with interventions applied to a heavy truck auxiliary braking system

Engineering Applications of Artificial Intelligence 25 (2012) 705–719


Anna Pernestål a,b, Mattias Nyberg a,b, Håkan Warnquist a,c,*

a Scania CV AB, Södertälje, Sweden
b Department of Electrical Engineering, Linköping University, Sweden
c Department of Computer and Information Sciences, Linköping University, Sweden

Article info

Available online 1 April 2011

Keywords: Automobile industry; Decision support systems; Fault diagnosis; Probabilistic models; Bayesian network

doi:10.1016/j.engappai.2011.02.018

* Corresponding author at: Department of Computer and Information Sciences, Linköping University, 581 83 Linköping, Sweden. Tel.: +46 8 55383497; fax: +46 13 139282. E-mail address: [email protected] (H. Warnquist).

Abstract

Computer-assisted troubleshooting with external interventions is considered. The work is motivated by the task of repairing an automotive vehicle at the lowest possible expected cost. The main contribution is a decision-theoretic troubleshooting system that is developed to handle external interventions. In particular, practical issues in modeling for troubleshooting are discussed, the troubleshooting system is described, and a method for efficient probability computations is developed. The troubleshooting system consists of two parts: a planner that relies on AO* search and a diagnoser that utilizes Bayesian networks (BN). The work is based on a case study of an auxiliary braking system of a modern truck. Two main challenges in troubleshooting automotive vehicles are the need to disassemble the vehicle during troubleshooting to access parts to repair, and the difficulty of verifying that the vehicle is fault free. As a consequence, probabilities for faults and for future observations must be computed for a system that has been subject to external interventions that change the dependency structure. The probability computations are further complicated by the mixture of instantaneous and non-instantaneous dependencies. To compute the probabilities, we develop a method based on an algorithm, updateBN, that updates a static BN to account for the external interventions.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

To meet increasing requirements on functionality, safety, and environmental performance, modern automotive vehicles become more and more complex products integrating electronics, mechanics, and software. Due to their intricate architecture and functionality they are often difficult for a workshop mechanic to troubleshoot. In addition, shortened repair times and increased uptime are required. To shorten repair times for the increasingly complex automotive systems, one approach is to provide a computer aided troubleshooting system to the workshop mechanic. The troubleshooting system should suggest a sequence of actions, including for example repairs and observations, that leads to a fault free vehicle at lowest expected cost.

In this work, we design a troubleshooting system that is applicable to automotive vehicles in real-world environments. We consider practical issues in modeling and troubleshooting, and develop methods for efficient troubleshooting and inference. The work is inspired by an application study of an auxiliary braking system for heavy trucks, called the retarder. The retarder is a mechatronic system consisting of electrical, mechanical, and hydraulic parts. In particular, we apply and verify our troubleshooting approach on the retarder. The methods developed are designed for, but not restricted to, the troubleshooting of automotive systems.

In the literature, decision theoretic approaches have proven to be efficient for troubleshooting, see for example Sun and Weld (1993), Heckerman et al. (1995), Langseth and Jensen (2002), and Olive et al. (2003). However, application studies in these previous works mainly consist of electronic systems, such as printers and electronic control units. In comparison with these electronic applications, the solution to the problem of troubleshooting automotive mechatronic systems needs to take two additional important issues into account.

First, in automotive mechatronic systems it is often not as straightforward to determine whether a certain repair has made the system fault free as it is in the electrical systems in the previous works. In the previous works, it is assumed that after each repair it is verified whether the system is fault free or not. Such verification is typically expensive in automotive mechatronic systems, and therefore we do not presume it. The consequence is that we need to compute probabilities in a system subject to external interventions, i.e. after the system has been affected by the repairs. Second, not all parts of the system can be reached without first disassembling other parts of the system. This means that the level of disassembly, and the extra time required for disassembly and assembly activities, need to be considered in the solution.

1 We use the terms "nodes" and "variables" interchangeably.

During troubleshooting the aim is to guide the mechanic by suggesting the next repair or observation such that the expected repair cost is minimized. For small systems, the problem could be solved using influence diagrams (Jensen and Nielsen, 2007; Russell and Norvig, 2003). For larger systems, such as the retarder, influence diagrams become infeasibly large and complex. Instead, we formulate the troubleshooting problem as a probabilistic conditional planning problem.

The troubleshooter designed in this paper consists of a diagnoser and a planner. The planner finds a conditional plan of actions by solving a general state space search problem, where a state describes the current knowledge, i.e. the current belief state, of the system. To achieve this plan, the planner has to consider the costs of actions and the effects they may have on the system. The costs of actions are dependent on the level of disassembly, and each action may change this level. We use the informed search algorithm AO* (Nilsson, 1980) to find an optimal plan, i.e. a plan with minimal expected cost that makes the vehicle fault free. The output from the planner to the mechanic is the first action of this plan. If the mechanic is busy waiting for a response, the search time contributes to the total repair cost. Therefore, the planner can be halted at any time, returning a possibly suboptimal choice of action.

The diagnoser supports the planner by computing probabilities of faults and of future observations. The main challenge in the probability computations is handling the external interventions caused by the troubleshooting activities. These interventions change the structure of dependencies during the troubleshooting. In previous works on troubleshooting, computing probabilities after external interventions on the system is often avoided, for example by assuming a function-verifying observation after each repair (Langseth and Jensen, 2002). In Breese and Heckerman (1996) interventions are handled using so-called persistence nodes, where mapping nodes are used to track dependency changes. However, the dependency changes studied in the current paper have a different source, and the persistence nodes are not applicable. Another approach is to utilize event-driven non-stationary dynamic Bayesian networks (event-driven nsDBN), see for example Pernestål (2009). In the event-driven nsDBN, new time slices are added to a dynamic Bayesian network (DBN) by events caused by external interventions. By allowing different structures in different time slices, the nsDBNs provide a general description of the troubleshooting process, but this generality complicates inference. In the current work, one step further is taken, and a new method of inference for troubleshooting is developed. We note that the probability computations in troubleshooting are of a special kind, and show how these probabilities can be computed by replacing the nsDBN with a static Bayesian network (BN) that is updated as troubleshooting progresses.

To summarize, the main contribution in the current work is the complete troubleshooting system that handles external interventions. We also provide a detailed investigation of practical issues when modeling and building troubleshooting systems for automotive vehicles, as well as a new algorithm for efficient computation of the probabilities needed for troubleshooting. The approach is not specific to troubleshooting of automotive vehicles and can be used for other types of troubleshooting.

First, notation and preliminaries are introduced in Section 2, before presenting the retarder and troubleshooting scenario in Section 3. The planner is described in Section 4, and in Section 5 modeling for troubleshooting is discussed. In particular, practical issues when modeling real systems are highlighted. The diagnoser is discussed in Section 6, and the BN updating algorithm is derived in Section 7. Sections 7.1 and 7.2 are technical, and give the details for the interested reader, but are not necessary for the overall understanding of the method presented. Finally, the troubleshooting system is applied to the retarder in Section 8, before concluding and providing an outlook in Section 9.

2. Preliminaries

Before going into the troubleshooting details, we present the notation used, and give a brief introduction to Bayesian networks (BN) and dynamic Bayesian networks (DBN).

2.1. Notation

All variables considered in this work are discrete. We use capital letters for variables and lower case letters for their values, $X = x$. We use $p(X = x)$ or $p(x)$ to denote the probability that $X = x$, while $p(X)$ denotes the probability distribution of $X$. Bold face letters denote vectors. Subscripts are used to denote variable indices, and superscripts to denote time. For example, $x_i^t$ is the value of the variable $X_i^t$, with number $i$ at time $t$.

2.2. Bayesian networks

A Bayesian network (BN) is a directed acyclic graph representing a factorization of the joint probability distribution over a set $\{X_1, \ldots, X_n\}$ of variables. In the BN, denoted $B$, nodes represent variables¹ and edges between them represent dependency relations and are directed from parents to children. We let $\mathrm{pa}^B(X)$, $\mathrm{ch}^B(X)$, and $\mathrm{de}^B(X)$ denote the sets of parents, children, and descendants of variable $X$ in the BN $B$. Moreover, we use $\mathrm{pa}^B(x)$ to denote an assignment of values to $\mathrm{pa}^B(X)$, and similarly for $\mathrm{ch}^B(x)$ and $\mathrm{de}^B(x)$. Whenever the BN $B$ is clear from the context, we omit the superscript $B$.

With each variable $X_i$ in $B$ there is an associated conditional probability distribution (CPD), defining the probability distribution $p(X_i \mid \mathrm{pa}^B(X_i))$, and $B$ represents the factorization

$$p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid \mathrm{pa}(X_i)). \tag{1}$$

We use the term evidence to denote assignments of values to variables in the BN. The BNs considered in this work are causal BNs, meaning that the directions of the edges represent causal effects. More detailed descriptions of BNs are given, for example, by Jensen and Nielsen (2007) and Russell and Norvig (2003).
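The factorization in (1) can be illustrated with a minimal sketch. The two-node network below (a component C with a single observable symptom O) and all of its probabilities are invented for illustration and are not taken from the retarder model; the sketch only shows how a joint probability is assembled from the CPDs and how adding evidence changes the probability of a fault.

```python
# Hypothetical two-node causal BN: component C -> observable symptom O.
# All probabilities are invented for illustration.
parents = {"C": (), "O": ("C",)}
cpd = {
    "C": {(): {"NF": 0.9, "F": 0.1}},
    "O": {
        ("NF",): {"no": 0.95, "yes": 0.05},
        ("F",): {"no": 0.2, "yes": 0.8},
    },
}


def joint(assignment):
    """Joint probability as the product of p(x_i | pa(x_i)), as in (1)."""
    prob = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        prob *= cpd[var][pa_values][assignment[var]]
    return prob


def posterior_fault(symptom):
    """p(C = F | O = symptom), obtained by summing the joint over C."""
    numerator = joint({"C": "F", "O": symptom})
    evidence = sum(joint({"C": c, "O": symptom}) for c in ("NF", "F"))
    return numerator / evidence


print(joint({"C": "F", "O": "yes"}))
print(posterior_fault("yes"))
```

With these invented numbers, observing the symptom raises the probability of a fault from 0.1 to about 0.64.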

To model dynamic systems and processes, a dynamic Bayesian network (DBN) can be used. The DBN consists of time slices, where each time slice models the system during a certain time interval. Dependencies over time are represented by edges between the time slices, sometimes called temporal edges. For references on DBNs, see for example the works by Jensen and Nielsen (2007), Russell and Norvig (2003), and Murphy (2002).

3. The troubleshooting scenario and system

In this section we present the troubleshooting scenario, and give an overview of the troubleshooting system, but we first present our motivating application: the retarder.

3.1. Motivating application—the retarder

The retarder is an auxiliary hydraulic braking system that allows braking of the truck without applying the conventional brakes. It consists of a mechanical system and a hydraulic system, and is controlled by an electronic control unit (ECU), see Fig. 1. The retarder generates braking torque by letting oil flow through a rotor driven by the propeller axle, causing friction. The kinetic energy is thereby converted into thermal energy in the oil, which is cooled by the cooling system of the truck. At full effect and high rpm, the retarder can generate as much torque as the engine. We have chosen to study the retarder since it is a representative system of heavy duty trucks, and since it is difficult to troubleshoot due to its complexity.

Fig. 1. A heavy truck gearbox with an integrated retarder. The retarder is visible on the bottom right of the gearbox.

Fig. 2. Overview of the troubleshooting system.

3.2. The troubleshooting scenario

Imagine a heavy truck, driving along the highway to deliver products to a company. Suddenly, the driver experiences problems with the braking performance, and decides to take the vehicle to a workshop. On arrival at the workshop, the driver explains the problem to the mechanic, who plugs in his computer and reads out further information from the truck. From this information it is decided that the truck needs to be repaired immediately to avoid serious trouble. The driver must fulfill his transportation assignment, so the repair should be performed as quickly as possible. Therefore the mechanic uses the computer aided troubleshooting system.

The troubleshooting system is connected to the truck, and suggests actions for the mechanic to perform. The mechanic reports the results to the troubleshooting system, and waits for new actions to be computed. This goes on until the troubleshooting system has declared that the truck can leave the workshop.

3.3. The troubleshooting system

A troubleshooting action is defined by its cost, its precondition, and its effect. The cost of an action is typically related to the time it takes to perform it and the resources consumed, such as spare parts. Also, the cost depends on whether certain parts of the vehicle are assembled or not. The level of assembly is described by the assembly state. The precondition defines in which assembly state the action can be performed. For example, to replace the oil pressure sensor the retarder oil needs to be drained and the oil cooler needs to be removed. The effect of an action can be to observe a value, perform a repair, test the operation of the truck, or change the assembly state. When an action is performed we get an action result. An action result is a confirmed effect, i.e. the action and the outcome.
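The three ingredients of an action can be captured in a small data structure. The following is a minimal sketch under stated assumptions, not the representation used in the paper: the assembly state is modeled as the set of disassembly steps already carried out, the field names are hypothetical, and the cost value is invented. Only the oil pressure sensor example is taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    cost: float              # time and resources, in arbitrary units (invented)
    precondition: frozenset  # disassembly steps required before acting
    effect: str              # "observe", "repair", "test", or "assemble"


def applicable(action, assembly_state):
    """An action can be performed once every required step has been done."""
    return action.precondition <= assembly_state


# The precondition follows the example in the text; the cost is hypothetical.
replace_sensor = Action(
    name="replace oil pressure sensor",
    cost=120.0,
    precondition=frozenset({"retarder oil drained", "oil cooler removed"}),
    effect="repair",
)

print(applicable(replace_sensor, frozenset({"retarder oil drained"})))
```

In this sketch the action becomes applicable only when both the oil has been drained and the oil cooler has been removed.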

Fig. 2 shows an overview of the troubleshooting system used in this work. The troubleshooting system communicates with the mechanic through action requests, and the mechanic returns action results. There is no requirement that action results come from a requested action; the mechanic is free to perform activities of his own choice and report them to the troubleshooting system. However, we presume that the mechanic is honest and only reports action results which actually have occurred.

As depicted in Fig. 2, the troubleshooting system consists of two modules, a planner and a diagnoser, that communicate through the probabilities. This architecture divides the troubleshooting system into two parts that have different tasks and can be developed independently.

To determine the next action, the planner creates a conditional plan of actions called a troubleshooting strategy. The action request is the first action of such a plan that has a small expected cost. The troubleshooting strategy is found by searching the belief state space, where a belief state is the probability distribution

$$b^t = p(C^t \mid a^{1:t})$$

over component states, given the action results $a^{1:t} = \langle a^1, \ldots, a^t \rangle$. Even though the mechanic may perform actions freely, when computing the troubleshooting strategies it is assumed that the requested action will be performed. As shown in Fig. 2, the planner utilizes the diagnoser in two ways: to compute the belief state $b^t$ from the previous belief state $b^{t-1}$ and the sequence $a^{1:t}$ of action results, and to determine the probability $p(a^{t+1} \mid c^t, a^{1:t})$ of future action results. The action effects that change the assembly state are deterministic, and therefore the treatment of the assembly state is contained within the planner. In the diagnoser, the probability computations are divided into two subproblems:

• Model updating: for maintaining a model of the current system, taking external interventions into account.
• Probability computation: for belief state updating and prediction of the outcomes of future actions.

Troubleshooting is terminated when the probability that the vehicle is fault-free is above a predefined threshold. Such a state is called a goal state for the planner.
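The termination criterion can be sketched as follows, assuming (for illustration only) that the belief state is stored as a mapping from diagnoses to probabilities and that a diagnosis is a tuple of component states. The threshold value 0.99 is a hypothetical choice; the text only states that the threshold is predefined.

```python
def is_goal_state(belief, threshold=0.99):
    """True when the probability of the all-"NF" diagnosis exceeds the threshold."""
    p_fault_free = sum(
        p for diagnosis, p in belief.items()
        if all(state == "NF" for state in diagnosis)
    )
    return p_fault_free > threshold


# Hypothetical belief states over two components.
before_repair = {("NF", "NF"): 0.50, ("F", "NF"): 0.30, ("NF", "F"): 0.20}
after_repair = {("NF", "NF"): 0.995, ("F", "NF"): 0.003, ("NF", "F"): 0.002}

print(is_goal_state(before_repair), is_goal_state(after_repair))
```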

3.4. Variables

We use a BN to model the system under troubleshooting in the diagnoser. The BN for the retarder is shown in Fig. 3. As seen in the figure, there are three types of nodes: components, observable symptoms, and internal states. In this section we describe their characteristics. All variables are discrete.

3.4.1. Components

We use the term component both for the physical components and for the variables describing the fault state of the component. Components are denoted $C_i$, $i = 1, \ldots, N$. An assignment $C^t = c^t$ to all component variables is called a diagnosis. Each component $C_i$ has the possible fault state "no fault" (NF). In addition, there is at least one state indicating that there is a fault present in the component. To simplify the presentation we only consider one possible fault, "faulty" (F), for each component in the retarder.

Fig. 3. A Bayesian network modeling the retarder.

There are two probability distributions related to the components. The first is the probability of a component being faulty given the operation history $H$, $p(C_i \mid H)$. The history consists of information about how the vehicle has been used. For example, if the vehicle has been operated at extremely high load, its components are more likely to break. At a certain troubleshooting occasion the history is constant. In the current work we aim at describing a troubleshooting occasion, and we therefore avoid clutter in notation by writing $p(C_i \mid H) = p(C_i)$. The second probability distribution for the components is the probability distribution of successful repair,

$$p(C_i \mid \mathrm{repair}(C_i)). \tag{2}$$

We assume that, during troubleshooting, components cannot change state spontaneously, i.e. if a component is faulty, it must be repaired in order to become fault free. The operation time during test drives is assumed to be short enough for no new faults to appear.

3.4.2. Observable symptoms

Observable symptoms are represented by variables $O_j$, $j = 1, \ldots, M$, and represent observations that can be made, for example air leakage at proportional valve and engine warning lamp. Observable symptoms are typically driver's observations, observations made in the workshop, Diagnostic Trouble Codes (DTC) generated in the ECU during driving, or direct observations of components. A direct observation is obtained by inspecting whether a component is faulty or not.

When an observation action is confirmed, evidence is added to the corresponding observable symptom variable.

3.4.3. Internal states

In addition to the components and the observable symptoms, we use a set of hidden variables to represent internal states of the retarder. The internal states are represented by variables $X_k$, $k = 1, \ldots, L$. For example, in the retarder, there is an internal state representing the uncontrollable braking torque. This internal state can be observed by both the mechanic and the driver. In this way we can model the fact that observing the braking torque level may give different results, for example due to the skill of the observer.

3.4.4. Troubleshooting BN

The three different types of variables presented above can be combined into a BN. In this work, we consider troubleshooting BNs defined as follows.

Definition 1 (Troubleshooting BN). A troubleshooting BN consists of component variables, observable symptoms, and internal states, connected by directed edges such that the following rules hold:

• Components can be parents to all kinds of variables, but can only be children of other component variables.
• Observable symptoms can be parents only to other observable symptoms, but can be children of all types of variables.
• Internal states can be parents to observable symptoms only, and children to components only.
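The three rules of Definition 1 amount to a table of allowed (parent type, child type) pairs, which can be checked mechanically. The sketch below is illustrative only: the function name and the type codes are assumptions, the example edges are hypothetical, and acyclicity of the graph is not checked.

```python
# Allowed (parent type, child type) pairs according to Definition 1:
# "C" = component, "X" = internal state, "O" = observable symptom.
ALLOWED_EDGES = {("C", "C"), ("C", "X"), ("C", "O"), ("X", "O"), ("O", "O")}


def is_troubleshooting_bn(node_types, edges):
    """Check the edge rules of Definition 1 (acyclicity is not checked)."""
    return all(
        (node_types[parent], node_types[child]) in ALLOWED_EDGES
        for parent, child in edges
    )


# Hypothetical fragment: one component, one internal state, two symptoms.
node_types = {"C1": "C", "X2": "X", "O15": "O", "O16": "O"}
valid_edges = [("C1", "X2"), ("X2", "O15"), ("X2", "O16")]
invalid_edges = [("O15", "X2")]  # a symptom may not be a parent of an internal state

print(is_troubleshooting_bn(node_types, valid_edges))
print(is_troubleshooting_bn(node_types, invalid_edges))
```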

4. Planner

As described in the previous section, the task of the planner is to generate the next action request $A^{t+1}$. This is done by evaluating different troubleshooting strategies and choosing the first action of the strategy with the smallest expected cost. A troubleshooting strategy is a conditional plan, which means that, depending on the results of previous actions, the subsequent actions to take may be different. Fig. 4 shows an example of a troubleshooting strategy for the retarder. A troubleshooting strategy $\pi$ is defined as a tree where each node is represented by an action $A$ and each outgoing edge from a node represents an action result $a$ of the corresponding action. Branching occurs when an action has multiple possible results, e.g. the action check leakage near proportional valve may have the action results leakage and no leakage. The troubleshooting strategy is associated with a state $s^t$ describing the vehicle at the time $t$ when the strategy begins. This state consists of the assembly state $d^t$, the belief state $b^t$, and the history of action results $a^{1:t}$. A troubleshooting strategy $\pi(s^t)$ is said to be complete if the execution of every action on the path from the root node to any leaf node leads to a goal state, i.e. a fault free vehicle.

Fig. 4. An example of a troubleshooting strategy for the retarder where in the initial state the accumulator and the control valve are suspected to be faulty. The nodes are actions and the edges are action outcomes.


4.1. Optimal expected cost of repair

To evaluate complete troubleshooting strategies, the expected cost of repair (ECR) is computed. The expected cost of repair is the expected cost of reaching any leaf node of the troubleshooting strategy. If the first action $A^{t+1}$ of a troubleshooting strategy $\pi(s^t)$ is performed, there will be a certain action result $a^{t+1}$ with the probability

$$p(a^{t+1} \mid a^{1:t}) = \sum_{c^t} p(a^{t+1} \mid c^t, a^{1:t}) \underbrace{p(c^t \mid a^{1:t})}_{b^t}, \tag{3}$$

where we have marginalized over the component states $c^t$ at time $t$. The first probability in the sum above is computed by the diagnoser, and the second is recognized as the previous belief state.
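The marginalization in (3) is a weighted sum over diagnoses and can be sketched in a few lines. The belief state and the conditional probabilities below are invented for illustration; in the troubleshooting system the latter are supplied by the diagnoser. The function and variable names are assumptions.

```python
def action_result_probability(belief, likelihood, result):
    """Sum over diagnoses c of p(result | c, history) * b(c), as in (3)."""
    return sum(likelihood[c][result] * p for c, p in belief.items())


# Hypothetical belief state over diagnoses of two components.
belief = {("NF", "NF"): 0.5, ("F", "NF"): 0.3, ("NF", "F"): 0.2}

# Hypothetical p(result | diagnosis) for a leakage check on the first component.
likelihood = {
    ("NF", "NF"): {"leakage": 0.0, "no leakage": 1.0},
    ("F", "NF"): {"leakage": 0.9, "no leakage": 0.1},
    ("NF", "F"): {"leakage": 0.0, "no leakage": 1.0},
}

for result in ("leakage", "no leakage"):
    print(result, action_result_probability(belief, likelihood, result))
```

With these numbers the two results have probabilities of about 0.27 and 0.73, which sum to one as required.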

Let the cost of performing an action $A^t$ be $q(d^t, A^t)$ and let the resulting state after action result $a^{t+1}$ be $s_a^{t+1}$. Before performing $A^t$, a sequence of actions with effects that affect the assembly state must first be performed so that the precondition of $A^t$ becomes fulfilled. The cost $q(d^t, A^t)$ includes the costs of all these actions, and in the resulting state the new assembly state is updated from the old one in accordance with the effects of these actions. More details on how the assembly state is updated and how the action cost is computed can be found in Warnquist et al. (2009). The belief state in the resulting state is computed by the diagnoser.

Let $\pi'(s_a^{t+1}) \subset \pi(s^t)$ be the troubleshooting strategy rooted in the node that is connected to the edge corresponding to the action result $a$. Then, the expected cost of repair $\mathrm{ECR}(\pi(s^t))$ is

$$\mathrm{ECR}(\pi(s^t)) = \begin{cases} q(d^t, A) & \text{if } s^t \text{ is a goal state}, \\ q(d^t, A) + \sum_{a^{t+1}} p(a^{t+1} \mid a^{1:t})\, \mathrm{ECR}(\pi'(s_a^{t+1})) & \text{otherwise}. \end{cases}$$

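For a given initial state $s^t$, the optimal troubleshooting strategy $\pi^*(s^t)$ is

$$\pi^*(s^t) = \arg\min_{\pi \in \Pi(s^t)} \mathrm{ECR}(\pi(s^t)),$$

where $\Pi(s^t)$ is the set of all possible complete troubleshooting strategies starting in $s^t$. The optimal expected cost of repair $\mathrm{ECR}^*(s^t)$ is the expected cost of repair for $\pi^*(s^t)$. Let $\Pi_{A^{t+1}}(s^t)$ be the subset of $\Pi(s^t)$ where $A^{t+1}$ is the first action. Then

$$\begin{aligned} \mathrm{ECR}^*(s^t) &= \min_{\pi \in \Pi(s^t)} \mathrm{ECR}(\pi(s^t)) = \min_{A^{t+1}} \min_{\pi \in \Pi_{A^{t+1}}(s^t)} \mathrm{ECR}(\pi(s^t)) \\ &= \min_{A^{t+1}} \begin{cases} q(d^t, A) & \text{if } s^t \text{ is a goal state}, \\ q(d^t, A) + \sum_{a^{t+1}} p(a^{t+1} \mid a^{1:t})\, \mathrm{ECR}^*(s_a^{t+1}) & \text{otherwise}. \end{cases} \end{aligned} \tag{4}$$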

Actions that affect the assembly state do not need to be considered in the minimization step of (4) because the cost of those actions that are necessary for the precondition of $A$ will already be included in $q(d^t, A)$.
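The recursive definition of the expected cost of repair for a given strategy can be sketched as a traversal of the strategy tree. This is a minimal illustration, not the planner's implementation: the class name, the actions' costs, and the branch probabilities are hypothetical, and a node without branches is treated as a goal state that contributes a single cost term, interpreted here (as an assumption) as a final reassembly cost.

```python
from dataclasses import dataclass, field


@dataclass
class StrategyNode:
    action: str
    cost: float                                   # q(d^t, A), invented values
    branches: list = field(default_factory=list)  # [(p(result), StrategyNode)]


def ecr(node):
    """Expected cost of repair of the strategy rooted at `node`.

    A node without branches is treated as a goal state and contributes
    only its own cost; otherwise the action cost is added to the
    probability-weighted ECR of the sub-strategies.
    """
    return node.cost + sum(p * ecr(child) for p, child in node.branches)


# Hypothetical strategy: inspect first, repair only if a leak is found.
done = StrategyNode("reassemble and finish", 10.0)
repair = StrategyNode("replace proportional valve", 200.0, [(1.0, done)])
strategy = StrategyNode(
    "check leakage near proportional valve", 20.0,
    [(0.27, repair), (0.73, done)],
)

print(ecr(strategy))
```

For this invented strategy the expected cost is approximately 84 cost units: 20 for the inspection, plus 0.27 times 210 for the repair branch, plus 0.73 times 10 for the branch that finishes directly.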

4.2. Search graph

To obtain the optimal troubleshooting strategy and the next action request, the minimization (4) must be solved. Fig. 5 illustrates how the problem is decomposed in the form of a tree. Solving the minimization in (4) corresponds to choosing to follow a single outgoing branch from the boxes in Fig. 5. To compute the summation, every outgoing branch from the circles must be evaluated. This kind of decomposition corresponds to an AND/OR graph. In accordance with Nilsson (1980), the AND/OR graph can be defined as a hypergraph with nodes that are states interconnected by hyperedges. A hyperedge connects one state with one or many other successor states. In the AND/OR graph for (4), each non-goal state has one outgoing hyperedge for each action that connects to one other state for each action result of that action. A solution to an AND/OR graph is a subgraph of that graph that contains the start state and, for every non-goal state in the solution graph, exactly one hyperedge and all of its successor states. Every solution corresponds to a complete troubleshooting strategy and the optimal solution is the one that solves (4).
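The distinction between the two kinds of branching can be made concrete with a one-level sketch: the choice among actions is a minimization (the OR part), while the results of one action are combined by a probability-weighted sum (the AND part, one hyperedge). The function name, the data layout, and all numbers are assumptions made for illustration; the successor values stand in for the optimal expected cost of repair of the successor states.

```python
def best_action(options):
    """Pick the action (hyperedge) with the smallest expected value.

    `options` maps an action name to (cost, [(probability, successor_value)]),
    where each successor value is the value of the state reached by one
    action result of that action.
    """
    def value(item):
        cost, outcomes = item[1]
        return cost + sum(p * v for p, v in outcomes)

    name, _ = min(options.items(), key=value)
    return name, value((name, options[name]))


# Hypothetical state with two candidate actions.
options = {
    "check leakage": (20.0, [(0.27, 210.0), (0.73, 10.0)]),
    "replace proportional valve": (200.0, [(1.0, 40.0)]),
}

print(best_action(options))
```

Here the inspection is preferred, since its expected value (about 84) is smaller than that of replacing the valve directly (240).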

4.2.1. Search algorithm

There are many efficient algorithms for finding optimal solutions in AND/OR graphs, and the one used in this work is AO* (Martelli and Montanari, 1978; Nilsson, 1980). AO* is an informed search algorithm that finds the optimal solution to an implicit AND/OR graph $G$ specified by a start state and a successor function. The successor function generates the successors $s_a^{t+1}$ of a state $s^t$ for every action result $a^{t+1}$ as well as the probability of reaching each successor. The algorithm is initialized with an explicit AND/OR graph $G'$ that consists of only the start state. It uses the successor function to expand $G'$ with the successors of one of the leaf states in the optimal solution of $G'$. After each expansion of $G'$, the optimal solution is updated, i.e. the newly expanded state and all of its ancestors are evaluated using a cost function $f$. For the troubleshooting problem this is

$$f(s^t) = \min_{A^{t+1}} \begin{cases} q(d^t, A) & \text{if } s^t \text{ is a goal state in } G, \\ h(s^t) & \text{if } s^t \text{ is a leaf state in } G', \\ q(d^t, A) + \sum_{a^{t+1}} p(a^{t+1} \mid a^{1:t})\, f(s_a^{t+1}) & \text{otherwise}, \end{cases}$$

where $h(s^t)$ is a heuristic cost function that estimates the optimal expected cost of repair such that

$$h(s^t) \le \mathrm{ECR}^*(s^t) \quad \text{for any state } s^t. \tag{5}$$

Fig. 5. The problem decomposition of the calculation of (4).

The algorithm keeps expanding $G'$ until all leaf states in the optimal solution to $G'$ are goal states. If (5) holds, then the optimal solution to $G'$ is also the optimal solution to $G$.

The heuristic that is used in the implementation is derived from a relaxation of the problem, where we assume that we can observe all components for free. Then for all possible diagnoses $c^t$, we have to compute the cost of repairing the faulty components in $c^t$, i.e.

$$h(s^t) = \sum_{c^t} p(c^t \mid a^{1:t}) \sum_{i:\, C_i^t = F} q(d^t, \mathrm{repair}(C_i^t)), \tag{6}$$

where the probability $p(c^t \mid a^{1:t})$ can be taken directly from the belief state $b^t$.
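The heuristic in (6) can be sketched directly from a belief state. The representation below is an assumption made for illustration: a diagnosis is a tuple of component states, and the repair costs for the current assembly state are given as a list indexed by component. All numbers are invented.

```python
def heuristic(belief, repair_cost):
    """h(s^t) from (6): expected cost of repairing every faulty component."""
    return sum(
        p * sum(repair_cost[i] for i, state in enumerate(diagnosis) if state == "F")
        for diagnosis, p in belief.items()
    )


# Hypothetical belief state over two components and their repair costs,
# i.e. q(d^t, repair(C_i)) for the current assembly state.
belief = {
    ("NF", "NF"): 0.50,
    ("F", "NF"): 0.30,
    ("NF", "F"): 0.15,
    ("F", "F"): 0.05,
}
repair_cost = [200.0, 80.0]

print(heuristic(belief, repair_cost))
```

With these numbers the heuristic evaluates to approximately 86: 0.30 times 200, plus 0.15 times 80, plus 0.05 times 280.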

The complexity of finding the optimal solution to conditional planning problems is highly exponential (Rintanen, 2004). This means that the time AO* requires to complete can be very long. If the mechanic is waiting for a response, the computation time contributes to the cost. Therefore, the search can be aborted prematurely, and the first action of the optimal solution to the current explicit graph $G'$ is returned. This solution does not correspond to a complete troubleshooting strategy and the decision is therefore not necessarily optimal. However, as more computation time is allowed, the quality of the solution converges monotonically toward the optimum.

5. Modeling for troubleshooting

In this section we discuss modeling for troubleshooting. In particular, we discuss practical issues in modeling for troubleshooting, and we give an introduction to how event-driven non-stationary DBNs (nsDBNs), developed in Pernestål (2009), can be used to handle external interventions during the troubleshooting process.

5.1. Practical issues when building BN for troubleshooting

Building BNs for troubleshooting, like modeling in general, is an art that requires knowledge about the system to be modeled and/or a lot of training data to learn the model from. Since troubleshooting support is most important when products are new, before experience is collected at the workshops, it is typically the case that the troubleshooting system, including the BN, should be available on the market at the same time as the vehicle is released. At this time, data has not yet been collected, and the model must be learned mainly from expert knowledge.

The BN for the retarder shown in Fig. 3 is based on engineers' expert knowledge, and consists of 20 component variables, denoted $C_1$–$C_{20}$, five internal state variables, denoted $X_1$–$X_5$, and 25 observable symptoms, denoted $O_1$–$O_{25}$.

When building the BN we aim at a model that is simple enough to enable fast computations, but descriptive enough to solve the troubleshooting problem with sufficiently high precision. There are several design choices, and in this section we discuss some of the most important ones.

Components: The parts of the system under troubleshooting can be divided into components in the BN in different ways. The maximum size of a component is a set of parts of the retarder that are always repaired together, also called a minimal repairable unit. Choosing larger components may lead to more parts than necessary being replaced during troubleshooting. Choosing smaller sets of parts of the retarder as components in the BN is possible, but may give worse performance in the troubleshooting algorithm and gives more parameters that need to be determined in the CPDs.

In this work we choose components to be minimal repairable units. Furthermore, we allow several components to be faulty at the same time.

Driver or mechanic: Observations concerning the performance of the vehicle, for example the braking torque, can be obtained by asking the driver or by letting the mechanic perform a test drive. In general, the answer from the mechanic is less uncertain but is often obtained at a higher cost, since it is more expensive to let the mechanic perform a test drive than to interview the driver. The driver's answers can only be obtained at the beginning of troubleshooting. It may be the case that the driver's answers bias the mechanic. For example, if the driver complains about uncontrollable braking torque it is reasonable that the mechanic will be influenced and observe the same symptom with higher probability. This case is modeled as a dependency between the observation nodes, see $O_4$ and $O_5$ in Fig. 3 for an example.

Fig. 6. Dependencies in a subsystem of the retarder at workshop arrival, at rest at the workshop, and after repairing the oil.

Perception: In some observations there may be uncertainties. For example the observation leakage magnet valve ($O_{15}$) can be mistaken for leakage prop. valve ($O_{16}$). We model this by using internal state variables that represent the true situation, in this case $X_2$ and $X_3$, with edges from each such internal variable to both observations.

Effect of external systems: In the troubleshooting of a certain system, there are typically adjacent systems that also may affect the observations. One previously used approach is to assume that surrounding systems are fault free, see e.g. Heckerman et al. (1995). In the current work another approach is taken. There are two cases that are important to consider: when the troubleshooted system causes faults in an adjacent system, and when faults in an adjacent system affect observations in the troubleshooted system.

In the first case, when the troubleshooted system causes a fault in an adjacent system, the adjacent system is modeled as an observation. In Fig. 3 we have for example identified that the states retarder oil level low ($O_{19}$) and oil level low ($O_{20}$), which can also be observed and thus are modeled as observable symptoms, may cause the gearbox to break. This is modeled through the observable symptom gearbox broken ($O_{21}$).

An example of the second case, that faults in adjacent systems also can explain observations in the troubleshooted system, is that leakages outside the retarder may cause the observation DTC: unplausible oil pressure ($O_{11}$). This external fault is handled by increasing the probability of false alarm for this DTC. Note that this also implies that the requirement on the goal state must be changed, i.e. at some point we consider the system as fault free although there may be observations that have alarmed.

Time: There are two aspects of time in troubleshooting. First, "time is money", in the sense that there are costs associated with having the truck at the workshop. To model this, each action has a cost for performing the action. This cost is taken into account in the planner.

Second, time goes on while troubleshooting, and the system may change over time. In particular, the system changes with repairs and test operation. In the current work we consider troubleshooting as a discrete process, where time steps are taken when repair and operation actions are performed. The time interval between two such actions may be of different lengths, and we assume that the system is static during each interval. This assumption is reasonable, since the vehicle is at rest at the workshop, and there are basically no dynamics present.

5.2. Repairs, operations, and interventions

Assume that there is a BN modeling the system under troubleshooting. Performing observations simply means adding evidence to the observed variables in the BN. Performing a repair of component $C_i$ means that the repaired component is fault free with probability given by (2). However, when performing a repair there is also an intervention with the system. To illustrate the effect of a repair, assume for a moment that repairs are always successful. Then, repairing a component $C_i$ means that the component is forced to be fault free by intervention, rather than being observed as fault free. Therefore, it is not sufficient to only add the evidence $C_i = NF$ to the BN (Pearl, 2000). The consequences of an external intervention depend on the characteristics of the causal dependencies in the system. As discussed by Pernestal (2009), there are two different kinds of causal relations in troubleshooting: instant and non-instant. For example, if the oil is replaced in the retarder, this will have an instant effect on the oil color. Non-instant relations, on the other hand, need operation of the system to be present. One example is that if a gasket is replaced in the retarder, the retarder must be operated in order to verify if there is a leakage or not. This small example shows that operation actions also are external interventions with the system, since an operation changes the relations between variables.

The nature of interventions and their causal effects is carefully discussed by Pearl (2000). However, the interventions considered by Pearl (2000) are based on the assumption that all causal dependencies are instant, i.e. changing the value of a variable gives instantaneous effects on its children. In the troubleshooting application there are both instant and non-instant causal dependencies, and thus the rules of causality developed by Pearl (2000) are not directly applicable.

5.3. Event-driven non-stationary DBN

To compute probabilities of faults after external interventions, i.e. after repairs and operations, a model describing both the system under troubleshooting and the troubleshooting process itself is needed. One framework for modeling troubleshooting processes is the one based on event-driven non-stationary DBN (event-driven nsDBN) developed by Pernestal (2009). An nsDBN is a DBN where dependencies are allowed to be different in different time slices, see for example the works by Robinson and Hartemink (2008) and Pernestal (2009). In an event-driven nsDBN, new time slices are generated by external interventions that change the structure of dependencies. Following the nomenclature by Pernestal (2009), such external interventions are called events. An example of an event-driven nsDBN is shown in Fig. 6. A time interval between two events is called an epoch. As discussed in Section 5.1, we assume that the system is static between events, meaning that in the nsDBN, each epoch is modeled by a time slice. In an epoch several observations can be performed. However, we assume that the same observable symptom can only be observed once in each epoch. An nsDBN together with a sequence of action results is called a troubleshooting session.

To get familiar with nsDBNs, study the example in Fig. 6. The figure shows a three-time-slice nsDBN modeling a subsystem of the retarder. In the figure, subscripts correspond to numbers in Fig. 3, and superscripts denote the corresponding time slice (or, equally, epoch). The first time slice models the system at arrival to the workshop. The second time slice is started by the "empty event", i.e. the event where there is no external intervention with the system. The system is at rest at the workshop, and no actions have been performed. As described by Pernestal (2009), the empty event is merely for theoretical purposes where it is used as a reference; in practice, there is no need for starting a new epoch after the empty event, since the system has not changed. The third and final time slice in Fig. 6 is initialized by the event that the oil has been repaired.

Using the nsDBN in Fig. 6, reasoning during troubleshooting can proceed in the following way. In the figure, ignoring the directions of the edges, there is a path between $O_{25}^{1}$ and $C_{19}^{1}$ (via $O_{25}^{0}$, $C_{20}^{0}$, and $C_{19}^{0}$). This means that by observing whether there is oil on the noise shield, conclusions can be drawn about the status of the oil. In the third time slice, after repairing the oil, the path between $O_{25}^{2}$ and $C_{19}^{2}$ is broken, and the observation whether there is oil on the noise shield does not contribute to the reasoning about the state of the oil.

In each time slice in an nsDBN, there are two types of edges: instant edges and non-instant edges. We use the following definition from Pernestal (2009), slightly rewritten to fit into the current framework.

Definition 2 (Instant edge). An edge in a BN that models a system is instant if it does not require operation of the system to be present. An edge that does require operation to be present is non-instant.

In Fig. 6, the edge between the oil and oil color is instant, while the edge between radial gasket and oil on noise shield is non-instant. Also, the nodes in an nsDBN can be classified as one of two types: persistent or non-persistent. Again, we use the definition from Pernestal (2009).

Definition 3 (Persistent variable). A variable in a BN is persistent if its value in one time slice, generated by the empty event, is dependent on its value in the previous epoch.

In Fig. 6, the nodes oil, radial gasket, and oil on noise shield are persistent, while the nodes oil color and obs. oil color are non-persistent. In particular, for a persistent variable, if there are no external interventions affecting it, there is an edge between the two copies of the variable in two consecutive time slices.

In Pernestal (2009) it is shown that an nsDBN modeling a troubleshooting process can be characterized by three pieces of information: (i) an initial BN $B_{ns}^{0}$; (ii) the effects of the empty event; and (iii) for each action, information about the edges added and removed, and the CPDs changed in relation to the effects of the empty event.

The following assumptions related to the nsDBN are used in the current work.

Assumption 1 (Initial BN). The initial BN $B_{ns}^{0}$ is a troubleshooting BN as defined by Definition 1.

Assumption 2 (Persistence). If not affected by external interventions, a persistent variable has the same value in two consecutive epochs.

Assumption 3 (Persistent components). Components are persistent.

Assumption 4 (Empty event). The empty event in epoch $t$ generates a new time slice $B_{ns}^{t+1}$ where all nodes and all instant edges are copied from the previous time slice $B_{ns}^{t}$. Time slice $B_{ns}^{t+1}$ is connected to $B_{ns}^{t}$ by edges from all persistent variables in $B_{ns}^{t}$ to their copies in $B_{ns}^{t+1}$.

Assumption 5 (Locality of repair). The event repair($C_i$) in epoch $t$ generates a new time slice $B_{ns}^{t+1}$ that is equal to the time slice generated by the empty event, except that the edge between $C_i^{t}$ in $B_{ns}^{t}$ and $C_i^{t+1}$ in $B_{ns}^{t+1}$ is removed. In addition, all edges between $C_i$ in $B_{ns}^{t+1}$ and all other components in $B_{ns}^{t+1}$ are removed.

Assumption 6 (Operation). The event operate in epoch $t$ generates a new time slice $B_{ns}^{t+1}$ that is equal to the initial time slice $B_{ns}^{0}$. Time slice $B_{ns}^{t+1}$ is connected to $B_{ns}^{t}$ by edges from each component variable in $B_{ns}^{t}$ to its copy in $B_{ns}^{t+1}$.

One consequence of the assumptions above is that, with only one exception, no faults are introduced during troubleshooting. The exception is that the repair of a component $C_i$ may be unsuccessful and introduce a fault in $C_i$. Moreover, Assumption 5 means that repair of a component $C_i$ does not affect any components other than $C_i$. Assumption 6 means that operation during troubleshooting is long enough for all non-instant dependencies to be established. Furthermore, it means that test operation makes all persistent variables, except components, independent of their previous values given the current component states.
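As a rough illustration of Assumptions 4 to 6, the following Python sketch generates the edge structure of a new time slice from the current one. The `Model` class, the `next_slice` function, and the edge set chosen for the small oil/gasket example (in particular the instant edge from $X_5$ to $O_{24}$) are assumptions made for this sketch only; CPDs are ignored, and the code is not the implementation used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    components: frozenset   # component variables (persistent by Assumption 3)
    persistent: frozenset   # all persistent variables, components included
    instant: frozenset      # instant edges as (parent, child) pairs
    non_instant: frozenset  # non-instant edges as (parent, child) pairs


def next_slice(model, present_edges, event):
    """Return (edges inside the new slice, variables linked to their previous copy).

    present_edges: edges present in the current time slice.
    event: None for the empty event, ("repair", name) or ("operate",).
    """
    if event == ("operate",):
        # Assumption 6: the new slice equals the initial slice, and only
        # component variables keep an edge from their previous copy.
        return model.instant | model.non_instant, model.components
    # Assumption 4: copy instant edges only, link all persistent variables.
    edges = frozenset(e for e in present_edges if e in model.instant)
    temporal = model.persistent
    if event is not None and event[0] == "repair":
        ci = event[1]
        # Assumption 5: drop the temporal edge of the repaired component and
        # all edges between it and other components in the new slice.
        temporal = temporal - {ci}
        edges = frozenset(
            e for e in edges if not (ci in e and set(e) <= model.components)
        )
    return edges, temporal


# Hypothetical encoding of the oil / radial gasket subsystem.
model = Model(
    components=frozenset({"C19", "C20"}),
    persistent=frozenset({"C19", "C20", "O25"}),
    instant=frozenset({("C19", "X5"), ("X5", "O24")}),
    non_instant=frozenset({("C19", "O25"), ("C20", "O25")}),
)
initial_edges = model.instant | model.non_instant

edges_rest, links_rest = next_slice(model, initial_edges, None)
edges_rep, links_rep = next_slice(model, edges_rest, ("repair", "C19"))
edges_op, links_op = next_slice(model, edges_rep, ("operate",))
```

In this sketch the slice after the empty event keeps only the instant edges, the repair removes the temporal link of the repaired component, and the operation restores the full initial structure while linking only the components to their previous copies.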

6. Diagnoser: belief state updating

In this section and in the following, we present the computations performed in the diagnoser. As described in Section 3.3 the computations are divided into two subproblems. The first subproblem, to maintain a model of the troubleshooted system, is considered in Section 7, and in this section we concentrate on the second subproblem: probability computations for belief state updating and for prediction of future observations.

As described in Section 3.3, there are two cases where the planner requests probabilities from the diagnoser. The first case is when an action result $a^t$ is reported to the planner, and the planner requests the diagnoser to compute the belief state, i.e. the probability distribution

$$b^t = b(c^t) = p(c^t \mid a^{1:t}), \qquad (7)$$

for $c^t = (c_1^t, \ldots, c_N^t)$, given a sequence $a^{1:t} = \langle a^1, \ldots, a^t \rangle$ of action results. Recall also that the previous belief state is known, although not explicitly written in the probabilities. The second case is during planning, and concerns the probability distributions of possible future actions, $p(a^{t+1} \mid c^t, a^{1:t})$, i.e. the first probability in the sum in (3). Repair and operation actions are requests to the mechanic to perform an activity, and have only one possible result each, namely "repair performed" or "operation performed". These action results are always obtained with probability one. For observations, on the other hand, there are several different values of the observed variable. Therefore, the diagnoser needs to compute the probability

$$p(o_j^{t+1} \mid c^t, a^{1:t}). \qquad (8)$$

This probability will be computed in Section 7. The remainder of this section is devoted to the computation of the belief state (7) for observation, repair, and operation actions. In the diagnoser, there is no need to consider assemble/disassemble actions since they do not introduce any new faults, and thus do not change the belief state.

6.1. Observation actions

Let $a^t = \mathrm{observe}(O_j = o_j)$. By Assumption 2 we have that

$$p(C^t = c \mid C^{t-1} = c, a^{1:t}) = 1, \qquad (9)$$

i.e. when $a^t$ is an observation, $C^t = C^{t-1}$. By using Bayes' rule, (7) can be written as

$$p(C^t = c \mid a^{1:t}) = p(C^t = c \mid o_j^t, a^{1:t}) = \gamma\, p(o_j^t \mid C^{t-1} = c, a^{1:t})\, p(C^{t-1} = c \mid a^{1:t}) \qquad (10)$$

$$= \gamma\, p(o_j^t \mid C^{t-1} = c, a^{1:t-1})\, \underbrace{p(C^{t-1} = c \mid a^{1:t-1})}_{b^{t-1}}, \qquad (11)$$

where $\gamma$ is a normalization constant. Furthermore, Eq. (9) is used to replace $c^t$ with $c^{t-1}$ in the second equality above, and the knowledge that the observation $a^t$ does not change the system between times $t-1$ and $t$ is used in the last equality. In (10), the previous belief state $b^{t-1}$ is known, so the resulting probability computation to perform is

$$p(o_j^t \mid c^{t-1}, a^{1:t-1}), \qquad (12)$$

which is of the same form as (8), and will be computed in Section 7.
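The update in (10) and (11) amounts to weighting the previous belief state by the observation likelihood (12) and normalizing. The following Python sketch illustrates this under simplifying assumptions: the belief state is stored as a dictionary over component-state tuples, the likelihood is supplied as a function (in the diagnoser it would come from BN inference), and all names and numbers are hypothetical.

```python
def update_belief_observation(prev_belief, likelihood):
    """Bayes update of the belief state after an observation, cf. (10)-(11).

    prev_belief: dict mapping a component-state tuple c to b^{t-1}(c).
    likelihood:  function c -> p(o_j^t | c, a^{1:t-1}), the quantity in (12).
    """
    unnormalized = {c: likelihood(c) * p for c, p in prev_belief.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the belief state")
    # Dividing by the total plays the role of the normalization constant.
    return {c: p / total for c, p in unnormalized.items()}


# Hypothetical two-component belief state; the numbers are made up.
prior = {
    ("NF", "NF"): 0.7,
    ("F", "NF"): 0.2,
    ("NF", "F"): 0.1,
}


def p_symptom(c):
    # Assumed likelihood of the observed symptom given the component states.
    return 0.9 if c[0] == "F" else 0.1


posterior = update_belief_observation(prior, p_symptom)
```

With these made-up numbers, the probability mass shifts toward the state in which the first component is faulty, since that state explains the observed symptom best.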

6.2. Repair actions

Now, let $a^t = \mathrm{repair}(C_i)$. Let $s \in \{NF, F\}$ and $C_{\bar{i}} = (C_1, \ldots, C_{i-1}, C_{i+1}, \ldots, C_N)$. The belief state after repairing $C_i$ at time $t$ is then

$$b(c_1^t, \ldots, c_{i-1}^t, s, c_{i+1}^t, \ldots, c_N^t) = p(c_{\bar{i}}^t, C_i^t = s \mid \mathrm{repair}(C_i), a^{1:t-1})$$

$$= p(c_{\bar{i}}^t \mid \mathrm{repair}(C_i), a^{1:t-1})\, p(C_i^t = s \mid \mathrm{repair}(C_i), a^{1:t-1})$$

$$= p(c_{\bar{i}}^t \mid a^{1:t-1})\, p(C_i^t = s \mid \mathrm{repair}(C_i)), \qquad (13)$$

where we, in the second equality, have used that the repair makes $C_{\bar{i}}^t$ and $C_i^t$ independent. In the last equality of (13) we have used that $C_{\bar{i}}^t$ is independent of the repair of $C_i$, and that, given that it is repaired, $C_i^t$ is independent of previous events. Marginalizing over $C_i^{t-1}$, (13) becomes

$$p(c_{\bar{i}}^{t-1} \mid a^{1:t-1})\, p(C_i^t = s \mid \mathrm{repair}(C_i)) = \left( p(c_{\bar{i}}^{t-1}, C_i^{t-1} = NF \mid a^{1:t-1}) + p(c_{\bar{i}}^{t-1}, C_i^{t-1} = F \mid a^{1:t-1}) \right) p(C_i^t = s \mid \mathrm{repair}(C_i))$$

$$= \left( b(\ldots, c_{i-1}^{t-1}, NF, c_{i+1}^{t-1}, \ldots) + b(\ldots, c_{i-1}^{t-1}, F, c_{i+1}^{t-1}, \ldots) \right) p(C_i^t = s \mid \mathrm{repair}(C_i)), \qquad (14)$$

and belief state updating after repair($C_i$) is given by (14). Given the previous belief state $b^{t-1}$, belief state updating after repair is simply an addition and a multiplication. In particular, under the assumption that repairs are always successful, the updated belief state after a repair action becomes

$$b^t(\ldots, c_{i-1}, s, c_{i+1}, \ldots) = \begin{cases} 0 & \text{if } s = F, \\ b^{t-1}(\ldots, c_{i-1}^{t-1}, NF, c_{i+1}^{t-1}, \ldots) + b^{t-1}(\ldots, c_{i-1}^{t-1}, F, c_{i+1}^{t-1}, \ldots) & \text{if } s = NF. \end{cases} \qquad (15)$$
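The addition and multiplication in (14) can be illustrated with a short Python sketch. It assumes that the belief state is stored as a dictionary over component-state tuples with entries "NF" and "F"; the function name, the parameter `p_repair_ok` (standing in for $p(C_i = NF \mid \mathrm{repair}(C_i))$), and the numbers are hypothetical. Setting `p_repair_ok` to 1 reproduces the special case (15).

```python
def update_belief_repair(prev_belief, i, p_repair_ok=1.0):
    """Belief state update after repair(C_i), cf. (14).

    prev_belief: dict mapping component-state tuples to b^{t-1}.
    i:           position of the repaired component in the tuple.
    p_repair_ok: assumed probability that the repair leaves C_i fault free.
    """
    new_belief = {}
    for c, p in prev_belief.items():
        for s, p_s in (("NF", p_repair_ok), ("F", 1.0 - p_repair_ok)):
            # The old state of C_i is summed out; its new state is s.
            c_new = c[:i] + (s,) + c[i + 1:]
            new_belief[c_new] = new_belief.get(c_new, 0.0) + p * p_s
    return new_belief


# Hypothetical belief state over two components; the numbers are made up.
prior = {
    ("NF", "NF"): 0.7,
    ("F", "NF"): 0.2,
    ("NF", "F"): 0.1,
}

after_repair = update_belief_repair(prior, i=0)
after_uncertain_repair = update_belief_repair(prior, i=0, p_repair_ok=0.95)
```

After a successful repair of the first component, all mass of states that differ only in that component is collected in the corresponding fault-free state, while the distribution over the other component is unchanged.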

6.3. Operation actions

Finally, let $a^t = \mathrm{operate}$. According to Assumption 6 no new faults appear during operation, and the belief state updating becomes

$$b(c^t) = p(c^t \mid a^{1:t-1}, \mathrm{operate}) = p(c^{t-1} \mid a^{1:t-1}) = b(c^{t-1}). \qquad (16)$$

7. Diagnoser: BN updating

In the previous section, it is shown that for repair and operation actions the belief state is simply updated from the previous belief state according to (14) and (16). For observation actions, probabilities of the type (12) are needed to update the belief state. This probability is the same as (8). It cannot be obtained by simple manipulations of the previous belief state only, and needs to be computed in the diagnoser. One straightforward approach to compute the probability (12) is to use an event-driven nsDBN as described in Section 5.3. The event-driven nsDBN is a general model of the troubleshooting process, but, due to its generality, probability computations in an event-driven nsDBN may become time consuming and inefficient. In this section we start from the framework of event-driven nsDBN and provide an algorithm that efficiently updates a static BN instead of unrolling an nsDBN. In the static BN any standard inference method can be used, such as variable elimination (Dechter, 1996) or importance sampling (Shachter and Peot, 1989).

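Begin by dividing the sequence $a^{1:t}$ of actions into two sequences, one comprising the events, $e^{1:t}$, and one comprising the evidence, $v^{1:t}$. For example,

$$a^{1:3} = \langle \mathrm{repair}(C_1), \mathrm{observe}(O_2 = o_2), \mathrm{operate} \rangle$$

gives

$$e^{1:3} = \langle \mathrm{repair}(C_1), 0, \mathrm{operate} \rangle,$$

$$v^{1:3} = \langle 0, \mathrm{observe}(O_2 = o_2), 0 \rangle.$$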

Above, the figure "0" is used to denote that there is no event or evidence respectively. Let $B_{ns}(e^{1:t})$ be the nsDBN generated by the sequence $e^{1:t}$ of events. The probability (12) can then be written as

$$p(o_j^t \mid c^{t-1}, a^{1:t-1}) = p(o_j^t \mid c^{t-1}, v^{1:t-1}, B_{ns}(e^{1:t-1})). \qquad (17)$$

When it is clear from the context which sequence of events has generated the nsDBN we will sometimes write $B_{ns}$ instead of $B_{ns}(e^{1:t-1})$.

During search in the planner, there are many sequences of actions under consideration at the same time, and the planner switches back and forth between these sequences. Each sequence of actions generates an nsDBN. There are two main alternatives for using nsDBNs for the probability computations. In the first alternative, no nsDBN is stored. Each time the planner switches to a new sequence of actions, all time slices of the nsDBN representing this sequence are unrolled and the probability computations are performed from the start to the current time. In stationary DBNs the probability computations can be made efficiently, for example by using algorithms presented by Murphy (2002). For nsDBNs, on the other hand, the structure changes mean that these efficient methods cannot be applied. Instead, basic inference methods such as variable elimination are applied (Jensen and Nielsen, 2007; Pernestal, 2009). This may lead to time consuming computations in the nsDBN. The second alternative is, instead of regenerating the nsDBN each time the planner switches to an action sequence, to store one nsDBN for each action sequence. The nsDBN can be stored as the distribution of the variables in time slice $t-1$ together with the last two time slices. When (if) the planner returns to this particular sequence, a new epoch is added and for example variable elimination can be used to compute the new probabilities. Since the number of considered action sequences may be large, this approach may require an infeasible memory capacity. Furthermore, if $K$ is the number of nodes, inference is made in a BN with $2K$ nodes.
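The division of an action sequence into an event sequence and an evidence sequence, introduced at the start of this section, can be sketched in Python as follows. The tuple encoding of actions and the function name `split_actions` are assumptions made for illustration, and `None` plays the role of the figure "0".

```python
def split_actions(actions):
    """Split an action sequence into aligned event and evidence sequences.

    Each action is a tuple whose first entry is "repair", "operate" or
    "observe". Positions without an event or evidence hold None.
    """
    events, evidence = [], []
    for action in actions:
        kind = action[0]
        if kind in ("repair", "operate"):
            # Repairs and operations are external interventions (events).
            events.append(action)
            evidence.append(None)
        elif kind == "observe":
            # Observations only add evidence and do not change the structure.
            events.append(None)
            evidence.append(action)
        else:
            raise ValueError(f"unknown action type: {kind!r}")
    return events, evidence


# The three-step example sequence from the text.
actions = [("repair", "C1"), ("observe", "O2", "o2"), ("operate",)]
events, evidence = split_actions(actions)
```

Keeping the two sequences aligned by position preserves the time index of each action, which is what the superscripts in $e^{1:t}$ and $v^{1:t}$ express.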

Taking another look at (17), we note that instead of using an nsDBN that can be used to compute arbitrary probabilities, it is sufficient to use a model that gives the conditional probabilities for the observations only. This opens the possibility to use a simpler model that is optimized for computation of the probabilities (17). The strategy here is to use a sequence of static BNs $B^0, B^1, \ldots$ such that

$$p(o_j^{(t)} \mid c^{(t)}, v^{1:t}, B_{ns}(e^{1:t})) = p(O_j = o_j^{(t)} \mid C = c^{(t)}, v^{1:t}, B^t). \qquad (18)$$

The probability in the right hand side of (18) is computed in the static BN $B^t$, and we have introduced the convention that variables in the static BN have no superscript, but are assumed to belong to the BN that the probability is conditioned on. Moreover, recall that a superscript on a variable in an nsDBN denotes the time slice it belongs to. In (18) we have introduced the superscript $(t)$ to denote the time slice after event $e^t$ but before the next non-empty event. For example, let $a^t = \mathrm{repair}(C_i)$, $a^{t+1} = \mathrm{observe}(O_j = o_j)$, and $a^{t+2} = \mathrm{observe}(O_l = o_l)$. Then, since the observations are not events, we have

$$p(a^{t+1} \mid c^{(t)}, v^{1:t}, B_{ns}(e^{1:t})) = p(o_j^{(t)} \mid c^{(t)}, v^{1:t}, B_{ns}(e^{1:t})),$$

$$p(a^{t+2} \mid c^{(t)}, v^{1:t}, B_{ns}(e^{1:t})) = p(o_l^{(t)} \mid c^{(t)}, v^{1:t}, B_{ns}(e^{1:t})).$$


For each sequence of action results under consideration in the planner, the belief state is stored, but no BNs are stored. Instead, when the planner switches to an action sequence $a^{1:t}$, the BN $B^t$ is generated from this sequence, and inference about future observations is performed in this BN. It will be shown that it is sufficient to perform inference in a subpart of $B^t$, typically consisting of a number of nodes that is significantly smaller than $K$.

In the following two subsections an algorithm $B^t = \mathrm{updateBN}(B^{t-1}, a^t)$ is presented that recursively generates the sequence of BNs $B^0, B^1, \ldots$ so that (18) is satisfied. These sections are technical and provide many details, and can be skipped without losing the overall understanding of the method.

7.1. BN updating example

To illustrate the idea of the algorithm updateBN, consider again the example system with the two components oil and radial gasket introduced in Section 5.3. In Section 5.3 the nodes are classified as persistent or not, and the edges within time slices are classified as instant or not. Fig. 7(a) shows an nsDBN modeling a troubleshooting process with two events (external interventions): repair(oil) and operate. In the figure, non-instant edges are marked with dotted arrows while instant edges are solid. Persistent nodes are gray and non-persistent nodes are white.

The leftmost part of Fig. 7(a), the time slice for epoch 0, or simply "time slice (0)", models the system when troubleshooting is initialized. In this time slice, nodes are marked with superscript (0), and it is the initial BN, denoted $B_{ns}^{0} = B_{ns}^{(0)}$, of the nsDBN. Below time slice (0), in Fig. 7(b), the corresponding $B^0$ is shown. Since there have been no external interventions with the system, $B^0$ is identical to $B_{ns}^{0}$.

Fig. 7. An nsDBN modeling the example system subject to a troubleshooting sequence (top), and the corresponding BNs (bottom). Nodes: Oil ($C_{19}$), R. Gasket ($C_{20}$), Oil Color ($X_5$), Obs. Oil Color ($O_{24}$), Oil noise shield ($O_{25}$). The events $e^1 = \mathrm{repair}(\mathrm{Oil})$ and $e^2 = \mathrm{operate}$ separate the time slices for epochs 0, 1, and 2, with corresponding BNs $B^0$, $B^1$, and $B^2$.

7.1.1. Updating example: repair

Let $a^1 = e^1 = \mathrm{repair}(C_{19})$, i.e. the oil is repaired. In the nsDBN in Fig. 7(a) the event initializes epoch 1 and produces a new time slice. The new time slice is constructed by copying all nodes and instant edges from the previous time slice. According to Assumption 5, temporal edges are added between all persistent nodes, except between $C_{19}^{(0)}$ and $C_{19}^{(1)}$, which represent the oil before and after the repair. Since all probability queries will be of the type (12), we study how to compute the probabilities for the observations.

Consider first the probability of $o_{24}^{(1)}$,

$$p(o_{24}^{(1)} \mid c_{19}^{(1)}, c_{20}^{(1)}, \mathrm{repair}(C_{19})) = p(o_{24}^{(1)} \mid c_{19}^{(1)}, B_{ns}(e^{1:1})) = \sum_{x_5^{(1)}} p(o_{24}^{(1)} \mid x_5^{(1)}, B_{ns}(e^{1:1}))\, p(x_5^{(1)} \mid c_{19}^{(1)}, B_{ns}(e^{1:1})). \qquad (19)$$

In the first equality we have used (17) and that $o_{24}^{(1)}$ is independent of $c_{20}^{(1)}$ in $B_{ns}(e^{1:1})$, and in the last equality we have marginalized over the internal variable $X_5^{(1)}$. The sum in (19) contains variables in time slice 1 only. Thus, the computations are independent of the variables in time slice 0.

Consider now the probability of $o_{25}^{(1)}$. Since the variables $O_{25}^{(1)}$ and $C_{20}^{(1)}$ are persistent and connected to their copies in the previous time slice, they have the same values in the two time slices and can be used interchangeably. By noting that $O_{25}^{(1)}$ is independent of $C_{19}^{(1)}$, and by marginalizing over $C_{19}^{(0)}$ we obtain

$$p(o_{25}^{(1)} \mid c_{19}^{(1)}, c_{20}^{(1)}, \mathrm{repair}(C_{19})) = p(o_{25}^{(1)} \mid c_{20}^{(1)}, B_{ns}(e^{1:1})) = p(o_{25}^{(0)} \mid c_{20}^{(0)}, B_{ns}(e^{1:1}))$$

$$= \sum_{c_{19}^{(0)}} p(o_{25}^{(0)} \mid c_{19}^{(0)}, c_{20}^{(0)}, B_{ns}(e^{1:1}))\, p(c_{19}^{(0)} \mid c_{20}^{(0)}, B_{ns}(e^{1:1})). \qquad (20)$$

The last probability in the sum in (20) can be written as

$$p(c_{19}^{(0)} \mid c_{20}^{(0)}, B_{ns}(e^{1:1})) = \frac{p(c_{19}^{(0)}, c_{20}^{(0)} \mid B_{ns}(e^{1:1}))}{p(c_{20}^{(0)} \mid B_{ns}(e^{1:1}))} = \frac{p(c_{19}^{(0)}, c_{20}^{(0)} \mid B_{ns}(e^{1:1}))}{\sum_{c_{19}^{(0)}} p(c_{19}^{(0)}, c_{20}^{(0)} \mid B_{ns}(e^{1:1}))}. \qquad (21)$$

Here, $p(c_{19}^{(0)}, c_{20}^{(0)} \mid B_{ns}(e^{1:1})) = b^0$ is known, and will not change. Therefore, we can update the CPD $p(o_{25}^{(1)} \mid c_{20}^{(1)}, B_{ns}(e^{1:1}))$ by using (20) and (21), and then forget the previous time slice.

The computations above show that, if the CPD for $O_{25}^{(1)}$ is updated, it is possible to compute the probabilities $p(o_{24}^{(1)} \mid c^{(1)}, \mathrm{repair}(C_{19}))$ and $p(o_{25}^{(1)} \mid c^{(1)}, \mathrm{repair}(C_{19}))$ using variables in time slice 1 only. This indicates that, beginning with a BN $B^0$ corresponding to epoch 0, we can apply a sequence of manipulations on nodes and edges and obtain a new BN $B^1$ that corresponds to epoch 1. The two BNs $B^0$ and $B^1$ are shown in Fig. 7(b) and (c) respectively. These manipulations are illustrated in Fig. 8. They begin with an nsDBN consisting of the two epochs 0 and 1. First, we merge the nodes with the same values and remove the superscript, i.e. $O_{25}^{(0)}$ and $O_{25}^{(1)}$ are merged to $O_{25}$, and $C_{20}^{(0)}$ and $C_{20}^{(1)}$ are merged to $C_{20}$ in Fig. 8(b). In (21) it is shown that the probability for $C_{19}^{(0)}$ can be computed from $b^0$. If the variables $X_5^{(0)}$ and $O_{24}^{(0)}$ have evidence this is taken into account in $b^0$, and if they do not have evidence they are barren nodes, see for example the work by Jensen and Nielsen (2007), and will not contribute to the probability computations. Thus, $X_5^{(0)}$ and $O_{24}^{(0)}$ can be removed. This is illustrated by crossing over the nodes in Fig. 8(b). Finally, by updating the CPD for $O_{25}$ according to (21) we can remove $C_{19}^{(0)}$ and obtain the BN $B^1$ in Fig. 8(c).

Fig. 8. Merging two epochs in an nsDBN to one BN. The nsDBN is shown in (a). In (b) nodes with the same value are merged, and the child nodes of $C_{19}^{(0)}$ are crossed over since they will not contribute to the probabilities of $o_{25}^{(1)}$ and $o_{24}^{(1)}$ conditioned on $c^{(0)}$. The BN in (c) is obtained by updating the CPD for $O_{25} = O_{25}^{(2)}$ with the contribution from $C_{19}^{(0)}$ and removing node $c_{19}^{(0)}$.

Finally, we summarize the set of manipulations made on $B^0$ in Fig. 7(b) to obtain $B^1$ in Fig. 7(c).

• Set $B^1 = B^0$.
• Remove all non-instant edges to and from the repaired component $C_{19}$.
• Update the CPD for $O_{25} = O_{25}^{(2)}$ according to (21). In Fig. 7(c) the updated CPD is marked with a "∗".

7.1.2. Updating example: operation

After repairing the oil, the system is operated, i.e. $a^2 = e^2 = \mathrm{operate}$. In the nsDBN in Fig. 7(a) the operation causes an event that initiates epoch 2. According to Assumption 6, all non-instant edges are reinserted and temporal links between persistent variables, except components, are removed. In Fig. 7(a), the only connection between time slices (1) and (2) is through the nodes $c^{(2)} = (c_{19}^{(2)}, c_{20}^{(2)})$, i.e. $c^{(2)}$ d-separates all the other nodes of time slice 2 from the previous time slices. The probabilities (17) of the observations are conditioned on $c^{(2)}$, and are thus independent of the previous time slices. Translating this to one single BN, we obtain $B^2$ in Fig. 7(d).

Summarizing the manipulations on $B^1$ to obtain $B^2$ we have

• Set $B^2 = B^1$.
• Insert a non-instant edge between $C_{19}$ and $O_{25}$.
• Reset the CPD of $O_{24}$ to $p(O_{24} \mid \mathrm{pa}_{B^0}(O_{24}))$.

Fig. 9. A troubleshooting BN in (a), the repair-influenced BN $B(C_i, O_1)$ in (b), and the repair-influenced BN $B(C_i, O_2)$ in (c).

Table 1
Structure classes in family $\mathcal{F}^{*}$, and their updates after repair($C_i$). The columns $B^{t-1}$ and $B^t$ list the nodes of the corresponding graphs.

| No | Property | $B^{t-1}$ | $B^t$ | Comment |
|---|---|---|---|---|
| 1 | Component without children | $C_i$ | $C_i$ | No edges to add or remove |
| 2 | Non-persistent observation with instant edge from parent component | $C_i$, $O_j$ | $C_i$, $O_j$ | $C^t$ d-separates $O^t$ from the previous epoch |
| 3 | Non-persistent observation with non-instant edges from parent component | $C_i$, $O_j$ | $C_i$, $O_j$ | $C^t$ d-separates $O^t$ from the previous epoch |
| 4 | Persistent observation with non-instant edges to parent component | $C_i$, $O_j$ | $C_i$, $O_j$ (CPD) | The CPD for $O$ is updated to take the effect of $C_i^{t-1}$ into account |
| 5 | Dependent non-persistent components with instant edges to parent component. Only two observations are allowed to be directly connected | $C_i$, $O_j$, $O_m$ | $C_i$, $O_j$, $O_m$ | $C_i^t$ d-separates $O_j^t$ and $O_m^t$ from the previous epoch |
| 6 | Dependent non-persistent components with non-instant edges to parent components. Only two observations are allowed to be directly connected | $C_i$, $O_j$, $O_m$ | $C_i$, $O_j$, $O_m$ | $C_i^t$ d-separates $O_j^t$ and $O_m^t$ from the previous epoch |
| 7 | Dependent observations, non-persistent observation with instant edge and persistent observation with non-instant edges. Only two observations are allowed to be directly connected | $C_i$, $O_j$, $O_m$ | $C_i$, $O_j$, $O_m$ (CPD) | The CPD for $O_j$ is updated to take the effects of $C_i^{t-1}$ and $O_m^{t-1}$ into account |
| 8 | Non-persistent internal state, instant edge from the component and to the observations. More than two child observations are allowed | $C_i$, $X_k$, $O_j$, $O_m$ | $C_i$, $X_k$, $O_j$, $O_m$ | $C_i^t$ d-separates $O^t$ from the previous epoch |
| 9 | Non-persistent internal state, instant edge from the component and to the observations. More than two child observations are possible, but each observation is directly connected to at most one other observation | $C_i$, $X_k$, $O_j$, $O_n$, $O_m$ | $C_i$, $X_k$, $O_j$, $O_n$, $O_m$ | $C_i^t$ d-separates $O^t$ from the previous epoch |

7.2. BN updating algorithm

In the example in the previous section we started with a BN $B^0$, and manipulated this by adding and removing edges and updating CPDs as events occurred. We obtained the two BNs $B^1$ and $B^2$ that, by construction, satisfy (18). In this section we generalize the updating rules derived above, and present an algorithm updateBN (see Algorithm 1) that provides the manipulations needed for all kinds of sequences of action results. The algorithm $B^t = \mathrm{updateBN}(B^{t-1}, a^t)$ takes a BN $B^{t-1}$, for which (18) holds, and an action $a^t$ as input, and delivers a BN $B^t$ that satisfies (18). The algorithm consists of three cases depending on whether $a^t$ is an observation, operation, or repair. In this section, recall that the BNs we consider are troubleshooting BNs as described in Section 3.4.

7.2.1. Updating for observation and operation

An observation action is not an event. Thus there are no structure changes, and the algorithm updateBN generates a BN $B^t$ such that $B^t = B^{t-1}$.

By Assumption 6, an operation action basically resets the BN to the initial BN, so for an operation updateBN gives $B^t = B^0$.

7.2.2. Updating for repairs

For repair actions, the situation is more involved. To describe the effects, we will study the effects of repair actions on subparts of the BN, called repair-influenced BNs, and defined as follows.

Definition 4 (Repair-influenced BN). A repair-influenced BN in a troubleshooting BN $B$ for component $C_i$ and observation $O_j \in \mathrm{de}_B(C_i)$ is denoted $B(C_i, O_j)$ and is the subpart BN consisting of the variables $\{O_j, R_B(C, O_j), C_i\}$ and the edges between these variables. The set $R_B(C, O_j)$ consists of the variables in $B$ that are not independent of $O_j$ given $C$.$^2$

$^2$ The variables $R_B(C, O_j)$ are not d-separated from $O_j$ by $C$ (Jensen and Nielsen, 2007).

To exemplify a repair-influenced BN, Fig. 9(a) shows a BN, and Fig. 9(b) and (c) show the repair-influenced BNs $B(C_i, O_1)$ and $B(C_i, O_2)$ respectively.

The repair-influenced BNs $B(C_i, O_j)$ can be classified into structure classes, depending on their structural properties. The elements in a structure class share structural properties, but the number of nodes may be different. For example "persistent observation with an instant edge from its parent component" is one structure class. In this work we define nine structure classes, each of them corresponding to one row in Table 1. A set of structure classes is called a family of structure classes and denoted $\mathcal{F}$, and in particular we let $\mathcal{F}^{*}$ be the family consisting of the structure classes in Table 1.

We say that a troubleshooting BN $B$ belongs to a family $\mathcal{F}$ of structure classes if every repair-influenced BN in $B$ belongs to a structure class in family $\mathcal{F}$. We will from now on only consider troubleshooting BNs such that the BN $B^0$ modeling the system when troubleshooting begins belongs to the family $\mathcal{F}^{*}$. This may seem technical and limiting, but since BNs belonging to $\mathcal{F}^{*}$ capture several kinds of component-observation relations, they are useful in many troubleshooting applications. In particular, it can be realized that the BN for the retarder, shown with all instant/non-instant edges and persistent/non-persistent variables in Fig. 10, belongs to $\mathcal{F}^{*}$.

An important property of the family $\mathcal{F}^*$ is that its structure classes are constructed so that removing edges in a repair-influenced BN that belongs to a class in $\mathcal{F}^*$ gives a new repair-influenced BN that belongs to one of the nine classes in $\mathcal{F}^*$.

In Table 1 the effects of repairing $C_i$ for the nine structure classes in $\mathcal{F}^*$ are shown. In column three of Table 1, for each structure class, a typical repair-influenced BN $B_{t-1}(C_i, O_j)$ is shown. Assume that (18) holds for this BN and let $a_t = \mathrm{repair}(C_i)$. Then, as will be illustrated in the remainder of this section, equality (18) holds also for the corresponding BN $B_t$ in column four of Table 1. In particular, if $B_{ns}^{(t-1:t)}$ is a two-time-slice nsDBN with initial time slice $B_{ns}^{(t-1)} = B_{t-1}$ and a second time slice $B_{ns}^{(t)}$ generated according to the assumptions in Section 5.3, the BNs $B_t$ are such that equality

$$p(o_j^{(t)} \mid c^{(t)}, B_{ns}^{(t-1:t)}) = p(O_j = o_j^{(t)} \mid C = c^{(t)}, B_t) \qquad (22)$$

holds.

Structure class 1: The manipulations on $B_{t-1}$ to obtain $B_t$ are trivial for structure class 1, since there is one single component without children. In this case there are no edges to add or remove.

Structure classes 2, 3, 5, 6, 8, and 9: For structure classes 2, 3, 5, 6, 8, and 9, the common factor is that $C_i$ has non-persistent descendants only. This means that, as in the computation of the probability of $O_{24}^{(1)}$ in (19), the observations made after the repair are independent of the previous actions since $c_i^{t}$ after the repair action is given. To obtain $B_t$ from $B_{t-1}$, we set $B_t := B_{t-1}$. We then remove all non-instant edges in $B_t$.

Fig. 10. A Bayesian network modeling the retarder.

Structure classes 4 and 7: Structure classes 4 and 7 share the property that the children of $C_i$ are observable symptoms, and that at least one of them is persistent. Similarly to the computations for $O_{25}^{(1)}$ in (20) and (21), the BN $B_t$ is obtained from $B_{t-1}$ by removing all non-instant edges, and updating the CPD for the persistent observable symptom variables $O_j$ to take information from the previous time slice into account. To determine the updated CPD, note that (18) holds for $B_{ns}(e^{1:t-1})$ and $B_{t-1}$. Note also that $\mathrm{pa}_{B_t}(O_j) = \emptyset$. We seek an update of the CPD for $O_j$ after the repair such that $p(o_j^{(t)} \mid \mathrm{pa}_{B_t}(o_j), B_t) = p(o_j^{(t)} \mid B_t) = p(O_j^{(t)} = o_j \mid a_{1:t}, B_{ns}(e^{1:t}))$. Consider the last probability in the equality. Marginalizing over $C_i^{(t-1)}$ gives

$$
\begin{aligned}
p(o_j^{(t)} \mid a_{1:t}, B_{ns}(e^{1:t}))
&= \sum_{c_i^{(t-1)}} p(o_j^{(t)} \mid c_i^{(t-1)}, a_{1:t}, B_{ns}(e^{1:t}))\, p(c_i^{(t-1)} \mid a_{1:t}, B_{ns}(e^{1:t})) \\
&= \sum_{c_i^{(t-1)}} \underbrace{p(o_j^{(t)} \mid c_i^{(t-1)}, a_{1:t-1}, B_{ns}(e^{1:t-1}))}_{(a)}\, \underbrace{p(c_i^{(t-1)} \mid a_{1:t-1}, B_{ns}(e^{1:t-1}))}_{(b)}. \qquad (23)
\end{aligned}
$$

To obtain probability (a) in the sum (23), we have used that $O^{(t)}$ is independent of the future repair of $C_i$ given $C_i^{(t-1)}$. For probability (b) in the sum (23) we have used that $a_t$ is an external intervention performed after $C_i^{(t-1)}$, which means that $C_i^{(t-1)}$ is independent of $a_t$. The probability (b) can be recognized as the previous belief state $b_{t-1}$, and is known. For the first probability in the sum we have, by using (17) and then (18), that

$$
\begin{aligned}
p(o_j^{(t)} \mid c_i^{(t-1)}, a_{1:t-1})
&= p(o_j^{(t-1)} \mid c_i^{(t-1)}, a_{1:t-1}) \\
&= p(o_j^{(t-1)} \mid c_i^{(t-1)}, v_{1:t-1}, B_{ns}(e^{1:t-1})) \\
&= p(O_j = o_j^{(t-1)} \mid C_i = c_i^{(t-1)}, v_{1:t-1}, B_{t-1}), \qquad (24)
\end{aligned}
$$

where we have used (18) in the last equality. To summarize, from (23) and (24) we have that the CPD for $O_j$ is computed using its CPD in $B_{t-1}$ and the previous belief state $b_{t-1}$.
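To make the summary of (23) and (24) concrete, the following sketch computes the updated distribution of a persistent observation after the repair as a mixture of its old CPD rows, weighted by the previous belief. It is only an illustration: the function name, the two-state component, and all numbers are assumptions, and `belief_prev` stands for the marginal of the belief state $b_{t-1}$ over $C_i$.

```python
def updated_observation_prior(cpd_prev, belief_prev):
    """Compute p(o) = sum over c of p(o | c) * b(c).

    cpd_prev[c][o] is the old CPD of the observation given component state c;
    belief_prev[c] is the (assumed marginal) belief in state c before the repair.
    """
    outcomes = next(iter(cpd_prev.values())).keys()
    return {
        o: sum(cpd_prev[c][o] * belief_prev[c] for c in belief_prev)
        for o in outcomes
    }

# Hypothetical numbers, not taken from the retarder model.
cpd_prev = {
    "faulty": {"present": 0.9, "absent": 0.1},
    "ok": {"present": 0.05, "absent": 0.95},
}
belief_prev = {"faulty": 0.3, "ok": 0.7}
new_cpd = updated_observation_prior(cpd_prev, belief_prev)
print(new_cpd)
```

With these numbers the symptom remains "present" with probability 0.9 · 0.3 + 0.05 · 0.7 = 0.305 after the repair, since the persistent observation still reflects the component status from before the repair.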

In this section, updating rules for single repair-influenced BNs are considered. However, real systems typically involve larger BNs composed of several repair-influenced BNs. These are handled by the theorem and algorithm in the following section.

7.2.3. Updating algorithm

Pseudo-code for the algorithm updateBN is given in Algorithm 1. For a BN $B_0$ that belongs to $\mathcal{F}^*$, and given a sequence $a_{1:t}$ of action results, updateBN generates a sequence $B_1, \ldots, B_t$ of BNs that each satisfy (18). The algorithm consists of three cases (if-statements): one for observation actions, one for operation actions, and one for repair actions. Within the if-statement for repair actions, the set ConsideredObs is constructed so that the same repair-influenced BN is not considered more than once. The following theorem guarantees the properties of the updating algorithm updateBN defined by Algorithm 1.

Theorem 1 (Algorithm updateBN). Consider a troubleshooting session described by an nsDBN $B_{ns}$ with initial BN $B_{ns}^{0}$ belonging to $\mathcal{F}^*$ and a sequence $a_{1:t}$ of action results such that there is at least one operation action between any two repair actions. Let $B_0 = B_{ns}^{0}$ and let $B_1, \ldots, B_t$ be a sequence of BNs such that $B_k = \mathrm{updateBN}(B_{k-1}, a_k)$, $k = 1, \ldots, t$, where updateBN is defined by Algorithm 1. Then, (18) holds for each $B_k$, $k = 0, \ldots, t$.

The theorem is proved by Pernestal (2009).

Algorithm 1. B = updateBN(B⁻, a).

B := B⁻
if a = observe(O = o) then
    // Nothing to do
else if a = operate then
    B := B_0
else if a = repair(C) then
    ConsideredObs := ∅
    for all O ∈ de(C) do
        if O ∉ ConsideredObs then
            𝒪 := {O′ : O′ ∈ B(C, O)}
            ConsideredObs := ConsideredObs ∪ 𝒪
            Update B(C, O) according to Table 1
        end if
    end for
end if
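The control flow of Algorithm 1 can be sketched in a few lines. The version below is a simplified illustration under stated assumptions, not the authors' implementation: a BN is reduced to a set of `Edge` records, the per-class updates of Table 1 are replaced by the single rule "remove all non-instant edges" (so the CPD update for structure classes 4 and 7 is omitted), and the callback `influenced`, which returns the variables of a repair-influenced BN, is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    parent: str
    child: str
    instant: bool  # False marks a non-instant edge

def descendants(edges, node):
    """All nodes reachable from `node` along directed edges."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for e in edges:
            if e.parent == current and e.child not in found:
                found.add(e.child)
                stack.append(e.child)
    return found

def update_bn(edges_prev, action, edges_initial, influenced):
    """Simplified updateBN: returns the edge set of the new BN."""
    kind = action[0]
    if kind == "observe":
        return set(edges_prev)      # nothing to do
    if kind == "operate":
        return set(edges_initial)   # reset to the initial BN
    if kind == "repair":
        component = action[1]
        edges = set(edges_prev)
        considered = set()
        for obs in sorted(descendants(edges_prev, component)):
            if obs in considered:
                continue
            variables = influenced(edges_prev, component, obs)
            considered |= variables
            # Stand-in for "update according to Table 1": drop non-instant
            # edges whose endpoints both lie in the repair-influenced BN.
            edges = {e for e in edges
                     if e.instant or not {e.parent, e.child} <= variables}
        return edges
    raise ValueError(f"unknown action kind: {kind}")

# Toy model (assumed): one instant and two non-instant edges.
initial = {Edge("C1", "O1", True), Edge("C1", "O2", False), Edge("C2", "O3", False)}
pair_only = lambda edges, c, o: {c, o}  # crude placeholder for B(C, O)
after_repair = update_bn(initial, ("repair", "C1"), initial, pair_only)
print(sorted((e.parent, e.child) for e in after_repair))  # [('C1', 'O1'), ('C2', 'O3')]
```

Repairing C1 in the toy model removes only the non-instant edge C1 → O2; a subsequent operation action restores the initial edge set.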

Fig. 11. Average increase of the expected cost of repair when the planning is stopped prematurely.

Fig. 12. Average increase of the cost of repair when the parameters in the BN are distorted.

8. Modeling application

The troubleshooting system consisting of a planner and a diagnoser as described in Sections 4–7 is implemented and applied to the problem of troubleshooting a heavy truck with a faulty retarder. A BN $B_0$ modeling the retarder on arrival at the workshop is shown in Fig. 10. This model is built from expert knowledge, and by applying the modeling principles developed in Section 5. The retarder BN belongs to $\mathcal{F}^*$, so the algorithm updateBN is applicable. In Fig. 10, instant edges are solid, non-instant edges are dotted, persistent nodes are gray, and non-persistent nodes are white.

In the implementation, the size of the belief state with which the planner is initialized is limited such that only the 21 most probable diagnoses c of component statuses are kept. Also, the diagnoser is set to disregard diagnoses where four or more components are faulty. This is done to keep the size of the belief state manageable, and it is reasonable because the probability of several simultaneous faults in the retarder is typically very small compared to that of fewer faults. This method of keeping down the size of the belief state works for our model of the retarder, but it is not feasible for larger systems. In those cases, methods such as the one presented in Lerner et al. (2000) can be used, where the diagnoser collapses similar diagnoses into one.
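The pruning described above can be sketched as follows. This is a minimal illustration, not the implemented diagnoser: the representation of a diagnosis as a frozenset of faulty components, the function name, and in particular the final renormalization are assumptions (the text does not state whether the kept probabilities are renormalized).

```python
def prune_belief(belief, max_diagnoses=21, max_faults=3):
    """Keep the most probable diagnoses with at most `max_faults` faulty components.

    `belief` maps a frozenset of faulty components to its probability.
    """
    # Disregard diagnoses with four or more faulty components.
    kept = {d: p for d, p in belief.items() if len(d) <= max_faults}
    # Keep only the most probable diagnoses.
    top = sorted(kept.items(), key=lambda item: item[1], reverse=True)[:max_diagnoses]
    # Assumption: renormalize so that the pruned belief sums to one.
    total = sum(p for _, p in top)
    return {d: p / total for d, p in top}

# Hypothetical belief state over four components.
belief = {
    frozenset(): 0.5,
    frozenset({"C1"}): 0.3,
    frozenset({"C1", "C2"}): 0.15,
    frozenset({"C1", "C2", "C3", "C4"}): 0.05,
}
pruned = prune_belief(belief, max_diagnoses=2)
print(pruned)
```

In the example, the four-fault diagnosis is dropped first, and the two most probable remaining diagnoses are rescaled from 0.5 and 0.3 to 0.625 and 0.375.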

When the algorithm updateBN is used, inference needs only to be made in a static BN with 50 nodes instead of the entire nsDBN with $50 \cdot t$ nodes, where $t$ is the number of events. In the implementation, inference is made using the exact method variable elimination from Dechter (1996), which is feasible because the static part of the BN for the retarder is of moderate size. For larger networks, it may be possible to perform the inference by using approximate methods such as importance sampling (Shachter and Peot, 1989).

When many different components are suspected in the inferred belief state, the planner may require much time to optimize (4). However, if planning is aborted prematurely, the expected cost of repair of the suboptimal troubleshooting strategy may be almost as good. To illustrate this, we drew 100 random faults from the prior distribution and measured the ECR of the resulting troubleshooting strategies when the planner is aborted early. A majority of the samples were "easy" cases where the true diagnosis could almost be isolated just by reading out the DTCs. Fig. 11 shows the average increase in the ECR when planning is ended prematurely for the 5% most difficult samples (solid line) and the average over all samples (dashed line). The curve shows an exponential decrease of the increase in ECR, and after 30 s the hardest cases are less than 0.5% from optimum.

To investigate the relevance of accurate probability computations by the diagnoser, we introduced noise in the parameters in the BN. Noise is added using the log-odds normal distribution as described in Kipersztok and Wang (2001). Every parameter $\theta$ in the CPDs in $B_0$ receives a new value $\theta'$ which is

$$\theta' = \frac{1}{1 + (\theta^{-1} - 1) \cdot 10^{-\omega_\sigma}}, \qquad (25)$$

where $\omega_\sigma$ is a random number drawn from a normal distribution with standard deviation $\sigma$. The troubleshooter has access only to this distorted model, while an undistorted model of the retarder is used to represent the physical system. In each test case there is a predefined fault. When actions are performed, the results are drawn randomly in accordance with the undistorted model and the predefined fault. The troubleshooting process is simulated until the fault is repaired and the total cost is measured. To avoid long waiting times, the planner is aborted after 60 s, at which point the quality of the decision is considered sufficiently good. As seen in the previous experiment, this timeout is reasonable. The standard deviation $\sigma$ is varied from 0 to 1, and for each level of $\sigma$, 25 test cases are run. Fig. 12 shows the average discrepancy in the cost for troubleshooting using a BN with parameter errors (represented by a noisy BN) compared to using the nominal BN. Small errors in the parameters do not affect the result significantly, but for larger errors, represented by noise with a standard deviation above 0.25, the discrepancy increases quickly. In practice, the result in Fig. 12 means that, since small parameter errors have an (almost) insignificant impact on the ECR computed, the parameters can be chosen roughly.
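The distortion in (25) amounts to adding a random offset to the base-10 log-odds of each parameter. A minimal sketch is given below; the function names are hypothetical, the normal distribution is assumed to have zero mean (the text specifies only its standard deviation), and parameters are assumed to lie strictly between 0 and 1.

```python
import random

def shift_log_odds(theta, omega):
    """Apply (25) for a given offset omega; requires 0 < theta < 1."""
    return 1.0 / (1.0 + (1.0 / theta - 1.0) * 10.0 ** (-omega))

def distort(theta, sigma, rng=random):
    """Draw omega ~ Normal(0, sigma) (zero mean is an assumption) and apply (25)."""
    omega = rng.gauss(0.0, sigma)
    return shift_log_odds(theta, omega)

# An offset of +1 multiplies the odds by 10: odds 1:1 become 10:1.
print(shift_log_odds(0.5, 1.0))

# Distorting a hypothetical parameter with sigma = 0.25 and a seeded generator.
print(distort(0.2, 0.25, random.Random(0)))
```

Because the noise acts on the log-odds, the distorted value always stays inside the open interval (0, 1), and an offset of zero leaves the parameter unchanged.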

9. Conclusion and future work

Inspired by a case study of the retarder, an auxiliary braking system for heavy trucks, a decision theoretic troubleshooting system has been developed. Focus has been put on issues that are important in real world applications: the need for disassembling the system during troubleshooting, the problem of verifying that the system is fault free during the troubleshooting, and the fact that computations for suggestion of new actions should be performed while the mechanic is waiting. These issues have two main consequences: probabilities must be computed in a system that is subject to external interventions, and the computations should be fast.

The troubleshooting system developed is based on a decision-theoretic approach. It consists of a planner that suggests the next troubleshooting action to the mechanic, and a diagnoser that supports the planner with probabilities for faults. In the planner, an any-time AO* algorithm with heuristics has been used. In the diagnoser, probabilities are computed by an algorithm based on a static BN.

Driven by the application study of the retarder, we have also studied practical issues of modeling for troubleshooting in detail, and provided guidelines for building the BN to be used in the diagnoser. Two different types of dependencies are used in the troubleshooting: instant and non-instant dependencies. To handle this, in combination with the need for handling the external interventions caused by repairs and operations, and the need for time-efficient computations, a new algorithm updateBN has been developed. The algorithm updateBN reduces the external interventions to simple manipulations on a static BN.

Finally, we have applied the troubleshooting system to the retarder. The results confirm the suggested modeling approach and that the decision theoretic troubleshooting approach used is suitable in real-world applications.

The application of decision theoretic troubleshooting to automotive systems is a large research field, and there are still interesting open questions for future work regarding efficient probability computations. One is a comparison of the efficiency of computation methods, including updateBN, event-driven nsDBN, and other methods. Another is the extension of the family $\mathcal{F}^*$ of structure classes.

However, as a first step toward troubleshooting with interventions and both instant and non-instant dependencies, the results presented in this work show that computer aided troubleshooting can be applied to complex mechatronic systems such as the retarder.

References

Breese, J.S., Heckerman, D., 1996. Decision-theoretic troubleshooting: a framework for repair and experiment. In: Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence.

Dechter, R., 1996. Bucket elimination: a unifying framework for several probabilistic inference. In: Proceedings of UAI'96. Morgan Kaufmann.

Heckerman, D., Breese, J.S., Rommelse, K., 1995. Decision-theoretic troubleshooting. Communications of the ACM 38 (3), 49–57.

Jensen, F.V., Nielsen, T.D., 2007. Bayesian Networks and Decision Graphs. Springer.

Kipersztok, O., Wang, H., 2001. Another look at sensitivity of Bayesian networks to imprecise probabilities. In: Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics.

Langseth, H., Jensen, F.V., 2002. Decision theoretic troubleshooting of coherent systems. Reliability Engineering & System Safety 80 (1), 49–62.

Lerner, U., Parr, R., Koller, D., Biswas, G., 2000. Bayesian fault detection and diagnosis in dynamic systems. In: AAAI/IAAI, pp. 531–537.

Martelli, A., Montanari, U., 1978. Optimizing decision trees through heuristically guided search. Communications of the ACM 21 (12), 1025–1039.

Murphy, K., July 2002. Dynamic Bayesian networks: representation, inference and learning. Ph.D. Thesis, UC Berkeley, USA.

Nilsson, N.J., 1980. Principles of Artificial Intelligence. Morgan Kaufmann, San Francisco, CA.

Olive, X., Trave-Massuyes, L., Poulard, H., 2003. AO* variant methods for automatic generation of near-optimal diagnosis trees. In: 14th International Workshop on Principles of Diagnosis (DX'03), pp. 169–174.

Pearl, J., 2000. Causality. Cambridge University Press, Cambridge.

Pernestal, A., December 2009. Probabilistic fault diagnosis with automotive applications. Ph.D. Thesis, Linkoping University, Linkoping, Sweden.

Rintanen, J., 2004. Complexity of planning with partial observability. In: ICAPS 2004. Proceedings of the 14th International Conference on Automated Planning and Scheduling. AAAI Press, pp. 345–354.

Robinson, J.W., Hartemink, A.J., 2008. Non-stationary dynamic Bayesian networks. In: Proceedings of NIPS.

Russell, S., Norvig, P., 2003. Artificial Intelligence. A Modern Approach. Prentice-Hall.

Shachter, R.D., Peot, M.A., 1989. Simulation approaches to general probabilistic inference on belief networks. In: Proceedings of UAI'89, pp. 221–234.

Sun, Y., Weld, D.S., 1993. A framework for model-based repair. In: Proceedings of the AAAI-93, pp. 182–187.

Warnquist, H., Pernestal, A., Nyberg, M., 2009. Anytime near-optimal troubleshooting applied to an auxiliary truck braking system. In: IFAC Safeprocess 2009.
