



Learning Temporal Probabilistic Causal Models from Longitudinal Data

Alberto Riva, Riccardo Bellazzi
Dipartimento di Informatica e Sistemistica, Università di Pavia, Pavia, Italy

Abstract

Medical problems often require the analysis and interpretation of large collections of longitudinal data in terms of a structural model of the underlying physiological behavior. A suitable way to deal with this problem is to identify a temporal causal model that may effectively explain the patterns observed in the data. Here we will concentrate on probabilistic models, which provide a convenient framework to represent and manage underspecified information; in particular, we will consider the class of Causal Probabilistic Networks (CPN).

We propose a method to perform structural learning of CPNs representing time series through model selection. Starting from a set of plausible causal structures and a collection of possibly incomplete longitudinal data, we apply a learning algorithm to extract from the data the conditional probabilities describing each model. The models are then ranked according to their performance in reconstructing the original time series, using several scoring functions based on one-step-ahead predictions.

In this paper we describe the proposed methodology through an example taken from the diabetes monitoring domain. The selection process is applied to a set of input-output models that generalize the class of ARX models, where the inputs are

* Reprinted from: Artificial Intelligence in Medicine Journal, 8:217-234, Elsevier, 1996.

the insulin and meal intakes and the outputs are the blood glucose levels. Although the physiological process underlying this particular application is characterized by strong non-linearities and low data reliability, we show that it is possible to obtain meaningful results, in terms of conditional probability learning and model ranking power.

Keywords: temporal probabilistic causal models, time series analysis, learning, patient monitoring.

1 Introduction

Medical problems often require the analysis and interpretation of large collections of longitudinal data in terms of a structural model of the underlying physiological behavior. A suitable way to deal with this problem is to identify a temporal causal model that may effectively explain the patterns observed in the data. Since data in the medical domain are affected by different sources of uncertainty (e.g. measurement errors, inter- and intra-individual variability), we will concentrate on probabilistic models, which provide a convenient framework to represent and manage underspecified information; in particular, we will consider the class of Causal Probabilistic Networks (CPN).

Studying the representation of dynamic systems by means of graphical models is a current research topic. Several applications aimed at the calculation of predictive distributions have been developed in the field of forecasting. A CPN representation of the Kalman filter was described by Normand [24], Smith [33] and Dempster [16]. Recent studies have been developed by Dagum in the field of time series analysis, with the definition of the Dynamic Network Model framework, which combines dynamic linear models, additive models and CPNs [14]. The computational problems connected with the solution of Dynamic Causal Probabilistic Network models were presented by Kjærulff [18].

Markov processes and Markov decision processes were encoded in CPNs and in their decision-theoretic counterpart, Influence Diagrams (IDs), by Berzuini [8] and Tatman

[34], respectively.

In the medical field, Andreassen [2] proposed the utilization of CPNs, in the definition of a Diabetes Advisory System, to simulate the probability distributions of the blood glucose concentration over a 24-hour period with 1-hour steps. Another related work was presented by Oppel, using sequences of CPNs to represent compartmental models through stochastification [22].

CPNs were also used as the kernel of a system for model-based patient monitoring [9, 3]; in this system the patient's behavior is modeled through a set of parametric equations, and a CPN is used to estimate the parameters on the basis of the patient's data. Moreover, such a system is able to analyze a population of patients, providing at the same time an estimate of intra-individual and inter-individual variability, an issue widely discussed in the field of pharmacokinetics [30, 20]. A similar approach was also applied to forecasting failures in biomedical time series [10]. Finally, CPNs were proposed as a convenient tool for dealing with signal reconstruction and deconvolution problems [4].

The goal of this work is to combine the representational expressiveness of CPNs with their computational power, in order to impose and evaluate different causal and temporal structures on the data. This task can be effectively accomplished by means of a two-step process, involving learning the conditional probability distributions of a set of plausible models, and choosing the most suitable model on the basis of a number of scores based on prediction error.

Several techniques have been proposed in the fields of statistics and artificial intelligence to learn the conditional probabilities and the structure of a given CPN from the available data [31, 32, 11, 17]. Here, we propose a method to accomplish both of the above tasks in a context where multiple time series, coming from the simultaneous monitoring of a population of patients, are analyzed in the light of different plausible causal models.
The observations are filtered by the CPNs that represent the different models, and are used to update the knowledge on both the individual patients and the whole population. The models are then ranked according to their ability to perform one-step-ahead predictions of the original time series, using several different scoring functions.

In this paper we describe the proposed methodology through an example taken from the diabetes monitoring domain. The selection process is applied to a set of input-output models that generalize the class of ARX models, where the inputs are the insulin injection times and characteristics and the outputs are the blood glucose levels.

2 CPNs for representing time series

Causal Probabilistic Networks (CPNs) are a flexible and powerful framework to represent and solve Bayesian inference problems in domains in which structural dependencies among variables are known. They have been used to perform probabilistic reasoning under uncertainty [25], both in diagnostic systems [1] and in decision support systems based on utility theory [26]. CPNs are useful in the analysis of complex systems, due to their capability to decompose problems into manageable subsystems, and this has promoted their use also in the field of dynamic modeling [15, 9, 2].

The use of CPNs in time series analysis provides a means to represent the probabilistic dependencies among the problem variables, as well as the non-linearities involved in the relationships.

Moreover, CPNs that contain an explicit representation of time overcome the limitations of typical probabilistic knowledge-based systems, which deal only with static knowledge, and can therefore be effectively used to represent non-linear stochastic input-output models, as well as to perform Bayesian inference over them.

In this section we will show how a discrete-time, time-invariant dynamic system can be easily encoded within the CPN framework. Consider a discrete-time dynamic system in input-output form:

y_{t+1} = f_t(y_t, y_{t-1}, ..., u_t, u_{t-1}, ..., w_t),   t = 0, 1, ..., N - 1   (1)

where t is the discrete time index, y_t is the output of the system at time t, u_t is the input at time t, w_t is a random parameter, N is the time horizon and f(·) expresses a generic

time-variant function.

If we suppose that the state of the system is discrete, we obtain an intrinsically non-linear finite-state system. Such a system may be conveniently represented in terms of conditional probabilities¹, P_{j|Yp,Up,t} = P(y_{t+1} = j | Y_p, U_p, t). This expresses the probability that at time t the next output will be j, given the past output observations Y_p = y_t, y_{t-1}, ... and the past inputs U_p = u_t, u_{t-1}, ... [7].

If the distribution P_{j|Yp,Up,t} does not depend on the time index t, the system is time-invariant; under this assumption, the system formulation described above can be naturally encoded as a discrete-variable CPN, as shown in Figure 1.

Insert Figure 1 around here

This representation reflects a Non-Linear Auto-Regressive eXogenous inputs (NARX) model structure, and is sufficiently general to cope with interesting real problems, particularly in medicine. These models may be interpreted as the CPN equivalents of the input-output ARX models, which are usually used as black-box models of a collection of input-output data. They are therefore phenomenological models, in the sense that they represent only the phenomena, without any structural knowledge of the problem. On the other hand, a CPN-based time series expresses a causal pattern that may be interpreted in the light of the process knowledge.

CPNs are also suitable for representing more complex systems, such as systems with periodicity properties. Consider, for example, a process that presents a quasi-periodic behavior over a 24-hour period, and suppose we are able to obtain four measurements of the output variable of the process each day. The process can be described by the following conditional probabilities:

P(y1_{t+1} | y1_t, y4_t, U_p)   (2)

¹ Formally, this formulation can be obtained by rewriting the discrete-time system as y_{t+1} = w_t and specifying the probability distribution of the random parameter w_t as P(w_t = j | Y_p, U_p, t) = P_{j|Yp,Up,t}.

P(y2_{t+1} | y2_t, y1_t, U_p)   (3)
P(y3_{t+1} | y3_t, y2_t, U_p)   (4)
P(y4_{t+1} | y4_t, y3_t, U_p)   (5)

where y1, ..., y4 represent the same variable measured at different times of day, the time step from t to t+1 is equal to 24 hours, and the overall structure shows a daily periodic behavior. The CPN model is depicted in Figure 2.

Insert Figure 2 around here

The main issues related to this kind of representation are the learning of the conditional probability distributions and, consequently, the selection of the best model structure from the data. The next two sections will deal with these problems.

3 Learning on longitudinal data

A CPN may be represented as a set of conditional links, each of which describes the probabilistic relationship between the states of one or more parent variables and the states of a child variable. For any possible set of states of the parent variables, we are therefore interested in determining the probability distribution over the states of the child variable. One way to do this is to apply a learning algorithm to a collection of observations taken from past experience, looking for patterns that may fit the conditional model under consideration. For example, consider a very simple CPN in which a binary variable A (with states true and false) directly influences a three-valued variable B (with states 1, 2 and 3), and suppose we have collected a number of observations reporting the simultaneous values of A and B. We restrict ourselves to the set of observations in which A = true, and we note that, among these, the observations with B = 1 are the majority. We can therefore conclude that the conditional probability P(B = 1 | A = true) is higher than the conditional probabilities P(B = 2 | A = true) and P(B = 3 | A = true). The exact values of the conditional probabilities will depend on the probability distribution and

a priori estimates that we assume for the states of B, on the relative frequencies of the different observations, and on the exact nature of the learning algorithm.

Various algorithms for learning conditional probabilities have been proposed in the literature [31, 32, 11, 17], and many of them have been successfully used in different application fields, particularly in medicine [27]. However, none of them has yet coped with the problem of learning from a population of patients with longitudinal data collected over time.

A desirable feature of the learning method we need is therefore the ability to learn individualized conditional probability tables, combining observations from several patients that have some degree of similarity (thus taking into account the accumulating knowledge on the whole population), rather than pooling together all the available data.

Here we propose a Bayesian method, derived from previous work by Leonard [19] and Consonni [12], that is applied for the first time to the problem of learning CPN conditionals from longitudinal data collections gathered from several individuals. In this section we will briefly describe the related theoretical and implementation issues.

3.1 The methodology

To describe the method, let us resort to the simple CPN previously introduced. We assume that we want to learn the conditional probability distribution of the node B given A = true (P_{B|A=T}).

Following [31, 32, 11, 17], we can assume that P_{B|A=T} has a multinomial distribution. Moreover, we can parameterize P_{B|A=T} with the parameter vector θ_B = [θ_1, θ_2, θ_3], where θ_1 = P_{B=1|A=T} and so on. We will try to learn this parameter vector from the available data².

Let us suppose we have a database D in which we can distinguish M different patients, each one having N_j, j = 1, ..., M data observations over time. This means that for the

² Herein we will assume that both the hypotheses of global and local independence hold; see [31, 12] for a more detailed description of these assumptions.

j-th patient we have three counters, N_j(1), N_j(2), N_j(3), each one expressing the number of times in which A = true and B = 1, 2 or 3, respectively, so that N_j = Σ_{k=1..3} N_j(k).

For the moment we assume that the database is complete, so that no missing or incomplete data are present. For each patient a conditional probability distribution P_{B|A=T}(j) can be specified, and parameterized with the parameter vector θ_B(j).

It is now possible to specify the following exchangeable³ prior probability model (see Leonard [19]):

1. Each θ_B(j) possesses the same prior Dirichlet distribution

θ_B(j) ~ D[αμ_1, αμ_2, αμ_3]   (6)

Given α and μ = μ_1, μ_2, μ_3, the parameters θ_B(j), j = 1, ..., M are independent of each other. Moreover, 0 < α < ∞, μ_k < 1 for k = 1, ..., 3, and Σ_k μ_k = 1.

2. α is independent of μ and possesses density P(α), while μ is distributed as a Dirichlet distribution with known parameters βγ_1, βγ_2, βγ_3. Finally, 0 < β < ∞, γ_k < 1 for k = 1, ..., 3, and Σ_k γ_k = 1.

Given this prior model, the following (approximate) posterior model can be derived from the information collected in the database D. The posterior mean of θ_B(j) for state k, with k = 1, 2, 3, is given by

E(θ_B(j) = k | D, α) = (N_j(k) + αμ*(k)) / (N_j + α)   (7)

μ*(k) = E(μ(k) | D, α)   (8)

where E(· | ·) stands for the posterior expected value.

³ With the term exchangeable we denote the property of the database that the final results will be independent of the order in which data are collected. In our context this amounts to an assumption of stationarity of the process generating the data.

Since μ*, which can be viewed as the posterior expected value of a population parameter, is usually very difficult to calculate, we approximate it following the approach proposed by Leonard (for details see [19]):

μ*(k) = (Σ_{j=1..M} π_j N_j(k) + βγ(k)) / (Σ_{j=1..M} π_j N_j + β)   (9)

π_j = (1 + α) / (N_j + α)   (10)

We can therefore distinguish a two-level learning architecture. At the first level, related to the patients, we have a set of parameters that express the individual conditional probability table for each patient, i.e. the θ_B(j); at the second level, related to the population, we have a set of parameters that express the conditional probability distribution of the population, i.e. the vector μ. The most important implication is that the posterior distribution of each patient's conditionals is a compromise between patient-specific information and the posterior distribution of the population, so that each conditional borrows strength from all the experience contained in the database. The degree of compromise can be modulated by the expert, depending on the database at hand, by selecting the parameter α, which measures our belief in the degree of similarity of the patients. If α is small, the parameter updating is more sensitive to each patient's data, while if α is high, the final estimate will be close to the population value. Referring to Equation (6), α is the implicit sample size of the prior distribution of the parameter vector θ_B of each patient, and hence has a double significance: it expresses our belief in the prior probability values μ_1, μ_2, μ_3, and it summarizes the number of (implicit) cases in which all patients are assumed to have the same probability distribution. If we are not able to suggest a value for α based on a subjective judgment of the patients' similarity, we can resort to different techniques to estimate it.
For details, see [6].

It is worth noting that the method presented above is able to perform a sequential updating of the conditional probability tables, a particularly desirable property in a patient monitoring context.
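As an illustration, the two-level updating of Equations (7)-(10) is straightforward to sketch in code. The following Python fragment is our own minimal rendering, not the authors' implementation; the function name, counter layout and toy data are assumptions:

```python
# Illustrative sketch of the two-level posterior means of eqs. (7)-(10).
# counts[j] is the per-patient counter vector [N_j(1), ..., N_j(K)];
# alpha, beta and gamma play the roles described in the text.
def posterior_means(counts, alpha, beta, gamma):
    K = len(gamma)
    # eq. (10): patient weights pi_j = (1 + alpha) / (N_j + alpha)
    pi = [(1 + alpha) / (sum(Nj) + alpha) for Nj in counts]
    # eq. (9): approximate population estimate mu*(k)
    denom = sum(p * sum(Nj) for p, Nj in zip(pi, counts)) + beta
    mu_star = [(sum(p * Nj[k] for p, Nj in zip(pi, counts)) + beta * gamma[k]) / denom
               for k in range(K)]
    # eq. (7): per-patient posterior mean, a compromise between the
    # patient's own counts and the population estimate
    theta = [[(Nj[k] + alpha * mu_star[k]) / (sum(Nj) + alpha) for k in range(K)]
             for Nj in counts]
    return mu_star, theta

# Two toy patients, three child states (counters for A = true):
mu, theta = posterior_means([[2, 1, 0], [0, 3, 1]], alpha=1.0, beta=1.0,
                            gamma=[1/3, 1/3, 1/3])
```

A small alpha lets each row of theta track its own patient's counts, while a large alpha pulls every row toward mu_star, which is exactly the compromise described above.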

A further requirement for the algorithm is the ability to deal with incomplete data (i.e. observations in which the state of one or more parent variables is unknown), a problem for which several techniques have been proposed [32]. In such cases, we apply the fractional updating method, in which the unit evidence of an observation is subdivided among the conditionals associated with the possible values of the unknown variables. This technique does not require changes to the algorithm; the only extension is that the counters N_j are no longer integer numbers.

3.2 The implementation

The algorithm requires a complex data structure to perform simultaneous learning on a set of different patients, described by several different models, each of which is represented as a set of conditional relations. The heart of the data structure is the conditional structure, which is characterized by a child variable and a set of parent variables, as well as by the parameters α, β and γ. Each conditional structure also contains a tree, whose branches are associated with the different values of one of the parent variables. Each leaf of the tree therefore corresponds to a distinct configuration of the parent variables, and holds a matrix of counters. The indexes into this matrix are the state of the child variable and the patient number. Each counter, in conclusion, counts the number of simultaneous occurrences of a certain set of parent states (which identifies the leaf of the tree) and a child state (first index into the array) with regard to a certain patient (second index into the array).

We will now briefly outline the various steps of the learning algorithm. The input to the algorithm is an observation of the k-th state of the output variable, coming from the j-th patient.
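The fractional updating just described can be sketched as follows. This is an illustrative Python fragment of our own: the flat counter dictionary and the state names are assumptions, standing in for the tree of counters used by the actual implementation:

```python
# Sketch of fractional updating: a unit observation whose parent value is
# unknown contributes weight 1/k to each of the k possible parent states.
# (Counter layout and state names are illustrative, not the paper's code.)
PARENT_STATES = ("low", "normal", "high")

def update(counters, parent_state, child_state, patient, weight=1.0):
    """counters: dict parent_state -> dict {(child_state, patient): count}."""
    targets = PARENT_STATES if parent_state is None else (parent_state,)
    w = weight / len(targets)
    for s in targets:
        leaf = counters.setdefault(s, {})
        key = (child_state, patient)
        leaf[key] = leaf.get(key, 0.0) + w

c = {}
update(c, None, 1, 0)   # parent unknown: evidence fanned out as 1/3 each
update(c, "low", 2, 0)  # complete observation: full unit weight
```

Note that after the first call the counters hold fractional values, which is the only change the method requires of the learning algorithm.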
We first need to identify the set of conditional relations, among the ones that make up the model, that are affected by the observation: they will be all the conditionals whose consequent is the output variable of the observation. For each conditional, we then proceed to update its counters, following the branches of the tree indicated by the values of the parent variables. Once the correct leaf has been reached, the state of the child variable

k and the patient number j are used as indexes into the matrix of counters to update the right element. Referring to Equations (7)-(10), this results in the incrementing of N_j(k) and hence of N_j. This in turn updates μ*(k) and, as a consequence, the posterior mean vector θ for all the patients.

In case one of the parent variables is unknown, instead of propagating a unit observation along a single path in the tree, we propagate an evidence of weight 1/k along all the branches associated with the k different states of the unknown variable.

4 Choosing models

In order to compare the predictive performance of a set of models, it is necessary to rank them on the basis of a suitable measure of their capability to explain the data. In other words, we measure the degree of matching between the structure of the model and the temporal patterns identifiable in the data. Several scoring metrics have been proposed in the literature [23, 17, 13, 11]. Following [23], we have considered the following metrics:

- Predictive accuracy metrics

By using these metrics we choose the model M* that, among the available model set M, minimizes one of the following scores:

Log-likelihood score: given a CPN model M1 and an output variable y, the log-likelihood score is defined as

-log P(y | M1)   (11)

The total log-likelihood score (LLS) over the N observations is defined as

LLS = Σ_{i=1..N} -log P_i(y | M1)   (12)

Brier score: the total Brier score (BS) over the N observations is defined as

BS = Σ_{i=1..N} (P_i(y | M1) - 1)²   (13)

- Parametric penalty metric

Given the posterior probability distribution P(y_k | M1) for each observation y_k, the point estimate ŷ_k for y_k can be calculated as:

the expected value: ŷ_k = E(P(y_k | M1));

the maximum a posteriori value (MAP estimate): ŷ_k = arg max_y [P(y_k | M1)].

A possible measure of predictive accuracy is the Sum of Squared Errors (SSE), derived as:

SSE = Σ_{i=1..N} |y_i - ŷ_i|²   (14)

In order to penalize overly complex models, i.e. those with a high number of elements in the conditional probability tables, we have implemented a modification of the Akaike Information Criterion (AIC), in which the best model M* is chosen from the available model set M in such a way that:

M* = arg min_{m∈M} [log(SSE(m) · (1 + 2n/N))]   (15)

where n is the number of conditional probability cells to be estimated.

The predictive accuracy metrics are useful for a sequential comparison of the models during the learning phase and, if needed, for deriving empirical rules to reduce the model space M. As is clear from the equations, the LLS and Brier scores penalize models in which events that actually occur are assigned low probabilities. The parametric penalty metrics, widely used in time series model structure selection [21], are useful to penalize the overfitting due to the large number of parameters used for predictions.
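To make the metrics concrete, a minimal Python rendering of the three scores might look as follows. This is our own sketch: probs holds, for each observation, the predictive probability the model assigned to the value that actually occurred, and the AIC-style score follows our reading of Equation (15):

```python
import math

def log_likelihood_score(probs):
    # eq. (12): sum of -log P_i(y | M1) over the observations
    return sum(-math.log(p) for p in probs)

def brier_score(probs):
    # eq. (13): sum of (P_i(y | M1) - 1)^2 over the observations
    return sum((p - 1.0) ** 2 for p in probs)

def aic_like_score(sse, n_cells, n_obs):
    # eq. (15): log(SSE(m) * (1 + 2n/N)), penalizing large probability tables
    return math.log(sse * (1 + 2 * n_cells / n_obs))
```

All three are loss-style scores: the model with the lowest value is preferred.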

5 An application example

We have applied the proposed methodology to a challenging medical problem, the analysis of blood glucose time series in Insulin-Dependent Diabetes Mellitus (IDDM).

5.1 The medical problem

The study of Diabetes Mellitus (DM) has great scientific and social interest, since DM is one of the major chronic diseases in developed countries. In particular, type I DM, also known as Insulin-Dependent Diabetes Mellitus (IDDM), defines a group of patients that need exogenous insulin in order to prevent ketoacidosis and death. In such patients a pancreatic beta-cell disorder provokes an insufficient secretion of insulin, the most important hormone regulating glucose metabolism, thus inducing hyperglycemia, polyuria, glycosuria and ketonuria. IDDM also causes long-term complications, like large-vessel diseases, microvascular diseases, neuropathies and nephropathies.

Conventional therapy of IDDM out-patients involves subcutaneous administration of exogenous insulin several times a day (two to four), self-monitoring of Blood Glucose Levels (BGL), and insulin dose adjustment on the basis of the actual measurement, following individualized control tables defined by the physician. Diet and physical exercise are carefully evaluated, in order to balance the insulin regimen, meal intakes and muscular glucose metabolism. The results of self-monitoring are usually reported in a patient's diary.

Patients are periodically checked by physicians, in order to assess the insulin schedule and the total amount of daily insulin, on the basis of the metabolic control achieved. Metabolic control is usually evaluated by observing the blood glucose measurements and the hypoglycemic events reported in the diaries, as well as other important measurements, like HbA1c or glycosuria.

Exogenous insulin is available in different forms, depending on the rate of onset of effectiveness and duration of action. These insulins may be classified as regular, intermediate (NPH), and long-acting.
By combining them it is possible to emulate the physiological insulin profile: rapid insulin administration in correspondence with the meals, in order

to increase glucose metabolism and prevent meal-related hyperglycemia, and intermediate (or long-acting) insulin to ensure a sufficient basal insulin level, particularly in the early morning.

A very interesting and challenging problem is the interpretation of the time series of blood glucose measurements coming from home monitoring. There are many sources of difficulty in dealing with such a task, mainly the sparse sampling of blood glucose values: the available two or three samples a day are usually insufficient to reconstruct the daily blood glucose dynamics. Different techniques can be exploited to this aim, and a partially successful time series analysis technique was recently exploited by the authors [5], by means of ARX models. In that experience, the BGLs were taken to be the outputs of a discrete-time input/output process, with an estimate of the insulin activity and the meal intakes as inputs.

Here we exploit the CPN framework to relax the restrictive assumptions underlying ARX models, such as model linearity and the necessity of interpolating the data in order to obtain equally spaced samples [35]. In addition, a CPN model allows filtering the available data on the basis of qualitative structural knowledge about the process, thus imposing a temporal organization upon the database.

5.2 CPN representation

Let us consider a typical time series coming from blood glucose monitoring, as shown in Figure 3.

Insert Figure 3 around here

Several measurements are available every day, approximately at the same times. The figure shows that high and low peaks are likely to occur roughly at the same time during the day. In the course of the following analysis we will implicitly take into account such regularity, by assuming that the underlying dynamical process possesses the cyclo-stationarity property, i.e. the process is periodically time-invariant, with a period of 24

hours.

With this assumption it is possible to represent this process within the CPN framework as stated in Section 2 for systems that show periodic properties. In particular, we subdivide the 24-hour period into a number of time slices, which define the temporal granularity of our interpretation of the data. The time slices may have different lengths, and events that occur in the same time slice are taken to be contemporaneous; all the delays in the temporal models are expressed in terms of the number of time slices. Formally, the blood glucose values of all the time slices in day k are the elements of the output vector y_k, and the causal relationships among the variables in different time slices are embedded in the overall model through the specification of the conditional probabilities.

For example, the model depicted in Figure 2 is periodic over a day, but is composed of 4 structurally identical sub-models (one for each time slice) that may be characterized by different conditional probability distributions (see Equations (2)-(5)).

Data coming from diabetic patients' BGL monitoring can also be structured according to the particular insulin protocol they are following. The insulin administration protocols can be roughly classified according to three main characteristics [29]:

a) Number of injections. Usual values for this parameter range from 2 to 4.

b) Insulin types. Several types of insulin are available, characterized by different times of onset of action, times of peak action, and effective durations. For each injection, the protocol must therefore specify the relative amounts of the various kinds of insulin.

c) Injection times. Insulin injections take place at well-defined times of the day, which usually coincide with (but are not limited to) the main meals (e.g. breakfast, lunch, dinner).
A protocol usually specifies the injection times as a set of "qualitative" time-points, while the correspondence between these and actual times depends on the patient's habits.

Since these three protocol properties influence the dynamic response of the patient's glucose metabolism, it is reasonable to use them to partition the database into clusters of comparable behaviors. For each protocol it is then possible to define a collection of suitable CPN time series models, to learn their conditional probability distributions from the patients' response characteristics, and to rank them on the basis of the metrics proposed in Section

4. The Leonard learning algorithm proposed in Section 3 preserves and estimates inter- and intra-individual variability, while model ranking provides information about the insulin therapeutic activity. Let us consider, for example, a protocol that suggests delivering an NPH insulin injection in time slice 1. Since the action of NPH insulin is delayed in time, this injection may affect the subsequent two or three time slices. If a CPN that shows a link between time slices 1 and 3 ranks higher than one that shows a link between time slices 1 and 2, we can conclude that the peak action of NPH insulin is delayed by an amount of time equal to two time slices.

From a computational point of view, the database partitioning performed according to the protocol type makes it possible to limit the number of eligible CPN models for each partition, thus speeding up the learning and ranking process.

6 Results

We have applied the techniques described above to a subset of the data provided by the AAAI 1994 Spring Symposium organizing committee (namely, patients #32, #33, #34, #35, #36, #37, #48, #59), containing insulin doses and blood glucose measurements for eight diabetic patients, who followed the same treatment protocol over a period of approximately one month. The protocol prescribed four daily injections of regular insulin (at breakfast, lunch, dinner and night) and one or two daily injections of NPH insulin (at breakfast and/or bed time). We used the available data to learn the conditional probabilities of eight different CPN models, and we ranked them in a sequential learning experiment according to the three different scoring functions described in Section 4. In particular, the data set was scanned sequentially according to the observation times reported in the database, and for each observation y the predictive probability P(y) was evaluated before using it to update the learned probabilities. A lower score indicates a better predictive performance.
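The sequential experiment described above, in which each observation is scored before it is used for learning, can be sketched as follows. This is a hypothetical fragment: predict_prob and learn stand in for the model's inference and updating routines, which are not spelled out in the paper:

```python
import math

def sequential_lls(observations, predict_prob, learn):
    """Scan the data in time order; score each observation, then learn from it."""
    score = 0.0
    for obs in observations:
        p = predict_prob(obs)   # predictive probability of the observed value
        score += -math.log(p)   # accumulate the log-likelihood score
        learn(obs)              # only now update the conditional tables
    return score
```

The same loop applies to the Brier score by replacing the accumulated term; the key point is that every prediction is made with the tables learned from strictly earlier data.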
Since no strong evidence of patient similarity was found in the database, we used small values for the α and β parameters, both set equal to 1.

Insert Table I around here

The models we tested share a common overall structure: they consist of a single conditional probability that describes the effect of previous BGL measurements and insulin intakes on the current blood glucose value. Table I shows the discrete values of the variables used in the models. The binning and bounding of the insulin levels was based on a rough descriptive statistical analysis of the available data, while the BGL threshold values were chosen on the basis of their physiological significance. Note the explicit presence of a slice variable that makes the models time-variant: the causal relations between the variables can change from one time slice to the next. The symbolic values of the slice variable correspond to the following day-time intervals: 6am to 10am (breakfast), 11am to 3pm (lunch), 4pm to 8pm (dinner), 9pm to midnight (bedtime), midnight to 5am (night).

Insert Figure 4 around here

In more detail, we expect the current BGL to be influenced by the BGL of the same time slice on the previous day (due to the daily periodicity of the process), a possible regular insulin intake in the previous slice, a possible NPH insulin intake two slices before, and the slice number. We want to investigate the variants of this basic model that are obtained by adding dependencies on a) the blood glucose in the previous slice, b) a regular insulin intake two slices before, and c) an NPH insulin intake in the previous slice. Figure 4 shows the graphical representation of the CPNs; the continuous links represent the basic model, while the dashed ones are the variants we will test. The resulting eight models (whose variables are listed in the table) were used to filter the available data in order to learn their conditional probabilities, and their ability to perform a step-by-step prediction of the original time series of each patient was evaluated using the three ranking functions described in Section 4.

Insert Table II around here

Table II shows the results of the process.
The scores assigned by the three scoring functions to the models on the eight patients are reported, and the models are ordered on the basis of the mean scores achieved on all patients. The three ranking methods agree in selecting model 0 as the best one. This is not surprising, since it is the simplest one in terms of the number of conditionals to be estimated, but also the most suitable from a physiological point of view.

Insert Figure 5 around here

The ordering established by the scoring functions is almost always the same across the patients, and this supports our initial assumption that their behaviors are approximately similar. On the other hand, the different scoring functions lead to different rankings, due to the fact that while the first two (lls and Brier scores) measure the predictive power of the models, the third one (aic score) also takes into account the complexity of the models and thus penalizes those with too many variables. This is confirmed by the graph in figure 5, which shows a monotonic increase in the mean aic scores when plotted as a function of model complexity. Since the aic score is usually applied to less parameterized models and does not take into account the Bayesian framework exploited here, we believe that the penalty assigned to the model complexity turns out to be too high; therefore the rankings provided by the lls and Brier scoring functions are more significant in our case.

7 Discussion and future work

The work presented here describes a method for time series interpretation using causal probabilistic models, based on a two-step learning process. The first step consists of learning the conditional probability distributions of a set of eligible cpns from data; the second step allows the set of cpns to be ranked through the calculation of predictive accuracy metrics.
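These accuracy metrics can be sketched as follows. This is our own minimal formulation, assuming the lls score is a cumulative logarithmic loss, the Brier score a cumulative quadratic loss over the one-step-ahead predictive distributions, and the aic score a likelihood-based fit penalized by the parameter count; the exact normalizations used in the paper may differ.

```python
import math

def lls_score(pred_probs):
    """Cumulative logarithmic loss: sum of -log p(observed value)
    over all one-step-ahead predictions (lower is better)."""
    return sum(-math.log(p) for p in pred_probs)

def brier_score(dists, observed):
    """Cumulative Brier (quadratic) score: for each prediction, the
    squared distance between the predictive distribution and the
    indicator of the value that actually occurred."""
    total = 0.0
    for dist, y in zip(dists, observed):
        total += sum((p - (1.0 if v == y else 0.0)) ** 2
                     for v, p in dist.items())
    return total

def aic_score(log_likelihood, n_params):
    """Akaike-style score: predictive fit penalized by the number of
    free parameters (conditionals) in the model; lower is better."""
    return 2.0 * n_params - 2.0 * log_likelihood
```

Both lls and Brier reward models whose predictive distributions concentrate on the observed values, while the aic penalty grows linearly with the number of conditionals, which explains the different rankings discussed above.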
The algorithm applied here performs a Bayesian learning of individualized conditional probability tables, taking into account the experience incrementally gathered from each patient and from the overall database, without pooling the data together.

The main problem with the proposed methodology is the computational burden involved in managing models characterized by large conditional probability tables. Unfortunately, as in many predictive models, the number of conditionals exploited in the cpn representation of time series is naturally very high, so that the cost of performing learning and ranking is intrinsically elevated. However, in the literature it is possible to find several attempts to reduce the representational and computational complexity of cpns, for example by resorting to additive cpn models [14]. In these models, complex conditionals are represented as linear combinations of smaller ones. Hence, a possible future enhancement of our methodology could be to apply the learning and ranking methods in an additive-models context.

One of the most interesting implications of the present work is the capability of our methodology to interpret the temporal structure of an available database of longitudinal data. This feature seems particularly useful in the medical domain, especially in monitoring patients affected by chronic diseases. The filtering and structuring activity proposed here could also be exploited to design a temporal object-oriented database, able to collect the data and perform inferences at the same time.

Our future attention will also be devoted to improving the management of missing or incomplete information. In order to accomplish this task, we plan to represent the cpn models, propagate probabilities and evaluate the models using a novel methodology called Logic-based Belief Maintenance [28]. This framework provides an explicit representation of the lack of knowledge about a model component based on probability bounds, and makes it possible to perform inferences on incomplete probabilistic models. Using this tool we should therefore be able to propagate probabilities in the cpns even when the conditional distributions derived in the learning phase are not precisely known.
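Returning to the learning step mentioned at the start of this section, the Bayesian updating of individualized conditional probability tables can be sketched as a minimal, single-level Dirichlet scheme. This is our own simplification: the algorithm actually applied in the paper, which combines patient-specific and population-wide experience without pooling the data, is more elaborate.

```python
from collections import defaultdict

class DirichletCPT:
    """A conditional probability table learned by Bayesian updating:
    each parent configuration keeps Dirichlet counts over the child
    values, starting from a uniform prior of strength `alpha`."""

    def __init__(self, child_values, alpha=1.0):
        self.values = list(child_values)
        self.counts = defaultdict(lambda: {v: alpha for v in child_values})

    def update(self, parent_config, observed_value):
        # Incorporate one new (parents, child) observation.
        self.counts[parent_config][observed_value] += 1.0

    def predict(self, parent_config):
        # Posterior-mean one-step-ahead predictive distribution.
        c = self.counts[parent_config]
        total = sum(c.values())
        return {v: c[v] / total for v in self.values}
```

With sequential updating of this kind [32], each new observation refines the table, while a predictive distribution for the child remains available at any time.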
Moreover, the scoring functions used to rank the competing models will need to take into account the amount of ignorance in the models that affects the outcomes of the propagation, in addition to the prediction error itself.
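To make the size of the conditional tables discussed above concrete, the entries each model must estimate can be counted from the cardinalities of Table I. This is our own illustration, and it assumes that the slice variable is also a parent of the current bgl node, as the text suggests.

```python
from functools import reduce

# Cardinalities of the discrete variables, taken from Table I.
CARD = {"bgl": 5, "reg": 3, "nph": 4, "slice": 5}

def cpt_entries(parent_cards, child_card):
    """Entries in a conditional probability table: one distribution
    over the child for every configuration of the parents."""
    n_configs = reduce(lambda a, b: a * b, parent_cards, 1)
    return n_configs * child_card

# Model 0 (parents y_{i-5}, reg_{i-1}, nph_{i-2}, slice; child y_i):
simplest = cpt_entries([CARD["bgl"], CARD["reg"], CARD["nph"], CARD["slice"]],
                       CARD["bgl"])          # 5*3*4*5 configs * 5 = 1500
# Model 7 adds y_{i-1}, reg_{i-2} and nph_{i-1} as further parents:
largest = cpt_entries([CARD["bgl"], CARD["bgl"], CARD["reg"], CARD["reg"],
                       CARD["nph"], CARD["nph"], CARD["slice"]],
                      CARD["bgl"])
print(simplest, largest)
```

Under these assumptions the largest model has on the order of 9 x 10^4 entries, consistent with the complexity axis of figure 5, which makes the cost of learning and ranking evident.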

References

[1] S. Andreassen, M. Woldbye, B. Falck, S.K. Andersen, MUNIN: a Causal Probabilistic Network for Interpretation of Electromyographic Findings. Proc. of the 10th Intern. Joint Conference on Artificial Intelligence, 1987, 366-372.

[2] S. Andreassen, Model-Based Biosignal Interpretation. Methods of Information in Medicine, 33 (1994) 103-110.

[3] R. Bellazzi, Drug Delivery Optimization through Bayesian Networks: An Application to Erythropoietin Therapy in Uremic Anemia. Computers and Biomedical Research, 26 (1993) 274-293.

[4] R. Bellazzi, G. De Nicolao, Smoothing Noisy Signals with Bayesian Networks, in: Bayesian Belief Networks and Probabilistic Reasoning, 1994, A. Gammerman ed., Unicom, Uxbridge, in press.

[5] R. Bellazzi, C. Siviero, M. Stefanelli, G. De Nicolao, Adaptive Controllers for Intelligent Monitoring. To appear in: Artificial Intelligence in Medicine Journal.

[6] R. Bellazzi, A. Riva, Learning Conditional Probabilities with Longitudinal Data. Workshop on Building Probabilistic Networks, IJCAI, Montreal, 1995.

[7] D. Bertsekas, Dynamic Programming, Prentice Hall, Englewood Cliffs, 1990.

[8] C. Berzuini, R. Bellazzi, S. Quaglini, Temporal Reasoning with Probabilities. Proc. of V Workshop on Uncertainty in Artificial Intelligence, 1989, 14-21.

[9] C. Berzuini, R. Bellazzi, S. Quaglini, D.J. Spiegelhalter, Bayesian Networks for Patient Monitoring. Artificial Intelligence in Medicine, 4 (1992) 243-260.

[10] C. Berzuini, C. Larizza, P. Grossi, Bayesian Prediction from Complex Biomedical Data. Proc. IPMU 1994, Paris, France, 4-8 July, 1994.

[11] W.L. Buntine, Operations for Learning with Graphical Models. Journal of Artificial Intelligence Research, 2 (1994) 159-225.

[12] G. Consonni, P. Giudici, Learning in Probabilistic Expert Systems, in: S.I.S. Workshop on Probabilistic Expert Systems, R. Scozzafava ed., 1993, 57-78.

[13] R.G. Cowell, A.P. Dawid, D.J. Spiegelhalter, Sequential Model Criticism in Probabilistic Expert Systems. IEEE Trans. on Pattern Anal. Mach. Intell., 15 (1993) 209-219.

[14] P. Dagum, A. Galper, Additive Belief-Network Models. Knowledge Systems Laboratory Report KSL-93-01, Stanford, CA, July 1993.

[15] P. Dagum, A. Galper, E. Horvitz, A. Seiver, Uncertain Reasoning and Forecasting. Knowledge Systems Laboratory Report KSL-93-47, Stanford, CA, June 1993.

[16] A.P. Dempster, Construction and Local Computation Aspects of Network Belief Functions, in: Influence Diagrams, Belief Nets and Decision Analysis, 1990, R.M. Oliver and J.Q. Smith eds., Wiley.

[17] D. Heckerman, D. Geiger, D.M. Chickering, Learning Bayesian Networks: the Combination of Knowledge and Statistical Data. Microsoft Research Advanced Technology Division Report MSR-TR-94-09, Microsoft Corporation, 1994.

[18] U. Kjærulff, A Computational Scheme for Reasoning in Dynamic Probabilistic Networks. Proc. of VIII Workshop on Uncertainty in Artificial Intelligence, 1992, 14-21.

[19] T. Leonard, Bayesian Simultaneous Estimation for Several Multinomial Experiments. Commun. Statist. - Theor. Meth. (Part A), A6(7) (1977) 619-630.

[20] L. Lenert, L. Sheiner, T. Blaschke, Improving Drug Dosing in Hospitalized Patients: Automated Modeling of Pharmacokinetics for Individualization of Drug Dosage Regimens. Computer Methods and Programs in Biomedicine, 30 (1989) 169-176.

[21] L. Ljung, System Identification - Theory for the User, Prentice Hall, Englewood Cliffs, N.J., 1987.

[22] U.G. Oppel, A. Hierle, L. Janke, W. Moser, Transformation of Compartmental Models into Sequences of Causal Probabilistic Networks, in: Artificial Intelligence in Medicine, 1993, S. Andreassen, R. Engelbrecht and J. Wyatt eds., IOS Press, Amsterdam, 319-330.

[23] G.M. Provan, Tradeoffs in Knowledge-Based Construction of Probabilistic Models. IEEE Trans. on Sys. Man Cybern., 24 (1994) 1580-1592.

[24] S.L. Normand, D. Tritchler, Kalman Filtering in a Bayesian Network. Tech. Report, Dept. of Preventive Medicine & Biostatistics, University of Toronto, Canada, 1989.

[25] J. Pearl, Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, Palo Alto, CA, 1988.

[26] S. Quaglini, R. Bellazzi, C. Berzuini, M. Stefanelli, G. Barosi, Hybrid Knowledge-Based Systems for Therapy Planning. Artificial Intelligence in Medicine, 4 (1992) 207-226.

[27] S. Quaglini, R. Bellazzi, F. Locatelli, M. Stefanelli, C. Salvaneschi, An Influence Diagram for Assessing GVHD Prophylaxis after Bone Marrow Transplantation in Children. Medical Decision Making, 14 (1994) 223-235.

[28] M. Ramoni, A. Riva, Belief Maintenance with Probabilistic Logic. Proceedings of the AAAI Fall Symposium on Automated Deduction in Non-Standard Logics, Raleigh, NC, 1993.

[29] A. Riva, R. Bellazzi, High Level Control Strategies for Diabetes Therapy. AIME-95 Conference, Pavia, June 1995.

[30] L. Sheiner, S.L. Beal, Bayesian Individualization of Pharmacokinetics: Simple Implementation and Comparison with Non-Bayesian Methods. J. Pharm. Sci., 71 (1982) 1344-1348.

[31] D. Spiegelhalter, A. Dawid, S. Lauritzen, R. Cowell, Bayesian Analysis in Expert Systems. Statistical Science, 8 (1993) 219-283.

[32] D. Spiegelhalter, S. Lauritzen, Sequential Updating of Conditional Probabilities on Directed Graphical Models. Networks, 20 (1990) 579-605.

[33] J.Q. Smith, Statistical Principles on Graphs (with discussion), in: Influence Diagrams, Belief Nets and Decision Analysis, 1990, R.M. Oliver and J.Q. Smith eds., Wiley.

[34] J.A. Tatman, R.D. Shachter, Dynamic Programming and Influence Diagrams. IEEE Trans. Sys. Man Cybern., 20 (1990) 365-379.

[35] D. Worthington, The Use of Models in the Self-Management of Insulin-Dependent Diabetes Mellitus. Comp. Meth. and Prog. in Biomed., 32 (1990) 233-239.


Figure Captions

1. cpn representation of a time-series. In this example the output y at each time instant depends on the past two measurements and on the past two input values. The chance nodes are represented with circles, while known inputs are depicted as square nodes.

2. cpn representation of a time-series with a periodic behavior.

3. A typical bgl time series over one week.

4. The temporal structure of the models used in the application example: yi represents the bgl in slice i, regi and nphi represent the doses of regular and nph insulin, respectively.

5. The AIC score as a function of the model complexity.

Table Captions

Table I. The variables used in the models, with their respective discrete values.

Table II. The results of the example.

[Figure 1: cpn representation of a time-series; output nodes y at slices t-1, t, t+1, t+2, each with a known input node u.]

[Figure 2: cpn representation of a time-series with periodic behavior; outputs y1-y4 and inputs u1-u4 repeated over consecutive slices t and t+1.]

[Figure 3: a typical bgl time series over one week; axes: Time (hours) vs. Blood Glucose Levels (mg/dl), days 1-7.]

[Figure 4: temporal structure of the models; y_{i-5}, y_{i-1}, reg_{i-1}, reg_{i-2}, nph_{i-1} and nph_{i-2} as parents of y_i, across day t-1 and day t.]

Model   Input variables
0       y_{i-5}, reg_{i-1}, nph_{i-2}
1       y_{i-5}, y_{i-1}, reg_{i-1}, nph_{i-2}
2       y_{i-5}, reg_{i-1}, reg_{i-2}, nph_{i-2}
3       y_{i-5}, y_{i-1}, reg_{i-1}, reg_{i-2}, nph_{i-2}
4       y_{i-5}, reg_{i-1}, nph_{i-1}, nph_{i-2}
5       y_{i-5}, y_{i-1}, reg_{i-1}, nph_{i-1}, nph_{i-2}
6       y_{i-5}, reg_{i-1}, reg_{i-2}, nph_{i-1}, nph_{i-2}
7       y_{i-5}, y_{i-1}, reg_{i-1}, reg_{i-2}, nph_{i-1}, nph_{i-2}

[Figure 5: mean AIC score (y-axis, 6 to 9.5) as a function of model complexity (x-axis, 0 to 9 x 10^4), with one point for each of Model 0 through Model 7.]

Table I

Variable        Discrete values
blood-glucose   <70, 70-120, 120-180, 180-300, >300
regular         0-3, 4-7, >7
nph             0-4, 5-10, 11-15, >15
slice           Breakfast, Lunch, Dinner, Bedtime, Night

Table II

                                     Patients
Model        1        2        3        4        5        6        7        8

LLS score
0       59.163  136.873  133.724  135.147  118.558  142.884   59.248   97.269
1       60.894  139.068  132.488  131.121  123.549  141.308   67.013  100.476
2       62.027  142.283  139.808  139.704  122.289  150.069   61.663  102.011
3       64.621  147.974  141.876  138.835  130.269  151.048   69.018  110.982
4       70.487  158.219  153.523  153.314  134.696  164.012   71.118  114.230
5       71.948  164.659  157.410  155.451  140.388  168.602   77.218  118.898
6       72.328  163.807  158.991  158.159  138.125  170.257   73.190  120.717
7       73.942  169.650  164.076  161.590  143.978  174.103   78.824  126.867

Brier score
0       22.963   54.368   53.092   52.853   46.807   56.340   23.510   37.984
1       24.020   55.717   52.796   51.680   49.454   56.287   26.752   39.791
2       24.406   56.834   55.616   54.937   48.743   59.715   24.399   40.096
3       25.806   59.535   56.600   55.279   52.236   60.500   27.493   44.341
4       28.031   63.693   61.791   61.050   53.999   65.650   28.607   45.575
6       28.861   65.884   63.950   63.256   55.422   68.226   29.386   48.483
5       28.866   66.331   63.364   62.391   56.323   67.884   31.028   47.611
7       29.664   68.171   65.976   65.026   57.675   69.947   31.653   50.934

AIC score
0        5.587    6.220    6.209    6.349    6.148    6.327    5.431    6.349
2        6.204    6.576    6.546    6.703    6.632    6.857    6.131    6.850
4        5.915    6.597    6.803    7.145    6.836    6.648    6.759    7.141
1        6.399    7.253    7.216    7.469    7.282    7.471    6.873    7.650
6        6.924    7.812    7.628    7.956    7.825    7.794    7.415    7.881
3        7.509    8.134    7.784    7.859    8.030    8.110    7.516    7.714
5        7.627    8.290    8.333    8.325    8.222    8.024    8.203    8.585
7        8.530    9.319    9.470    9.450    9.341    9.497    8.922    9.766