
UNIVERSITY OF ALMERÍA

Department of Statistics and Applied Mathematics

PhD thesis

Hybrid Bayesian networks in high dimensionality frameworks

PhD student
Antonio Fernández Álvarez

Advisors
Antonio Salmerón Cerdán
Rafael Rumí Rodríguez

Almería, March 2011


To all who fight for their dreams.


Acknowledgements

In all honesty, I am pleasantly surprised by all the support I have received

throughout these years both academically and personally. I am privileged that

most of this time I have felt a barrage of positive experiences and feel indebted

to all those who have supported me.

First and foremost, I wish to thank my mentor Antonio Salmerón for believing in me from the very beginning. His encouragement has helped me through some of my difficult moments, and he has taught me to view difficulties as challenges. Without his help and trust in me, this work would never have been possible. I am also grateful for the great support received from my co-mentor Rafael Rumí. His advice and his friendly attitude have made my work much easier.

I would particularly like to highlight Jens D. Nielsen for his help and the good moments shared over two years of working together. I am indebted to him for everything I learnt. I would also like to thank Ildiko Flesch for her friendliness and the work done during her short but fruitful stay in Almería. Other officemates whom I thank for their support and understanding are Sandra Rodríguez and Carlos Rodríguez.

I shall cherish great memories of the members of the Machine Intelligence Group of the Computer Science Department, Aalborg University, for their warm welcome during my research stay there, especially Thomas D. Nielsen, for supervising my work and for the later collaborations we have had. Thanks to Aderson C. Pifer and Nicolaj Søndberg for their wonderful hospitality and for making my stay more pleasant.

I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almería.

Other people who deserve my deepest thanks are my colleagues of the Data Analysis Group and the Department of Statistics and Applied Mathematics, for making daily work fun; in particular Fernando Reche, Inma López, María Morales, José Cáceres, Carmelo Rodríguez and Irene Martínez.

To Pedro Aguilera and Rosa Fernández for the research collaborations that we have started in the environmental field.


On a personal level, I owe everything to Charo for her longstanding support and unselfish generosity. I hope that someday I will be able to repay her.

My friends have been a very important daily source of support. Thanks to my football team mates for the good matches: Antonio Mendoza, Manuel Yuste, Fernando Pérez, Ignacio Fernández, Antonio García, and everyone else in the team. Also, to my paddle mate Juan Antonio Chaichio. To my friends and former flat mates Ángel, Carlos and Víctor for making life an enriching experience. In short, I offer my gratitude to all my friends for always being available.

I would especially like to thank my friend Luis García, whom I have known since childhood and who has taught me many of the principles and values I cherish today.

I give my deepest thanks to my parents Paco and Emilia for their support

and education, and to my brother Paco and his wife Carmen. To them, I owe

everything.

Finally, I dedicate this thesis to the memory of my grandmother Pura for her

humility and love.

Antonio Fernández Álvarez
Almería, February 18, 2011

This dissertation has been supported by the Spanish Ministry of Science and

Innovation through projects TIN2007-67418-C03-02 and TIN2010-20900-C04-02,

and with the FPI scholarship BES-2008-004014.


Contents

Contents
List of Algorithms

I Introduction

1 Introduction
1.1 Motivation
1.2 Organisation of the dissertation

2 Preliminaries
2.1 Bayesian networks
2.2 Hybrid Bayesian networks
2.2.1 Discretisation
2.2.2 Conditional Gaussian (CG) distributions
2.2.3 Mixtures of Truncated Exponentials
2.2.4 Mixtures of Polynomials
2.3 State-of-the-art in hybrid Bayesian networks

II Theoretical contributions

3 Learning models for regression from complete data
3.1 Introduction
3.2 Bayesian networks for classification
3.3 Bayesian networks for regression
3.4 Regression based on the MTE model
3.5 Filtering the independent variables
3.6 The naïve Bayes model for regression
3.7 The tree augmented naïve Bayes regression model
3.8 The forest augmented naïve Bayes regression model
3.9 Regression model based on kDB structure
3.10 Experimental evaluation
3.11 Conclusions

4 Learning models for regression from incomplete data
4.1 Introduction
4.2 Regression model from incomplete data
4.3 The algorithm for learning a regression model from incomplete data
4.4 Improving the final estimations by reducing the bias
4.5 Experimental evaluation
4.6 Results discussion
4.7 Conclusions

5 Parametric learning in MTE networks using incomplete data
5.1 Introduction
5.2 Translating standard distributions into MTE distributions
5.2.1 Multinomial
5.2.2 Conditional linear Gaussian
5.2.3 Logistic
5.3 The EM Algorithm
5.4 The M-step. Updating rules for the parameter estimates
5.4.1 Multinomial
5.4.2 Conditional linear Gaussian
5.4.3 Logistic
5.5 The E-step
5.6 Experimental results
5.7 Conclusions

6 Approximate inference in MTE networks using importance sampling
6.1 Introduction
6.2 Problem formulation
6.3 Approximate propagation using importance sampling
6.3.1 Obtaining a sampling distribution
6.3.2 Computing multiple probabilities simultaneously
6.4 The algorithm
6.5 Experimental evaluation
6.5.1 Experiment 1
6.5.2 Experiment 2
6.5.3 Experiment 3
6.6 Conclusions

III Applications

7 Species distribution modelling
7.1 Introduction
7.2 Methodology
7.2.1 Variables and data set description
7.2.2 Selection of variables
7.2.3 Bayesian classifiers and calibration of models
7.2.4 Inference in Bayesian classifiers
7.2.5 Validation of the models
7.3 Results and discussion
7.3.1 NB model
7.3.2 TAN model
7.3.3 Validation
7.3.4 Spatial application of the models
7.4 Conclusions

8 Relevance analysis of performance indicators in higher education
8.1 Introduction
8.2 Relevance analysis using Bayesian networks
8.3 Application to the analysis of performance indicators at the University of Almería
8.3.1 Relevance analysis for compulsory courses
8.3.2 Relevance analysis for optional courses
8.4 Software for relevance analysis
8.5 Using the software to construct composite indicators
8.5.1 Generating the rank of descriptions
8.5.2 Generating the composite index from the database
8.6 Conclusions

IV Concluding remarks

9 Conclusions and future works

Bibliography

A Notation and mathematical derivations

B Publications


List of Algorithms

1 Median of a density function
2 MTE-NB regression model
3 Selective MTE-NB regression model
4 Maximum Spanning Tree (based on Kruskal's algorithm)
5 MTE-TAN regression model
6 Selective MTE-TAN regression model
7 Maximum Spanning Forest (based on Kruskal's algorithm)
8 MTE-FAN regression model
9 Selective MTE-FAN regression model
10 MTE-kDB regression model
11 Bayesian network regression model from missing data
12 Selective Bayesian network regression model from missing data
13 Computing a vector of bias to refine the predictions
14 An EM algorithm for learning MTE networks from incomplete data
15 PruneMTEPotential (T, α)
16 SamplingDistributions (B, e)
17 ApproximateProbabilityPropagation (B, e, P)
18 Naïve Bayes classifier with continuous features
19 TAN classifier with continuous features


Part I

Introduction


Chapter 1

Introduction

1.1 Motivation

In the last decades, the complexity and uncertainty of information systems and the huge amount of data available have rendered traditional analysis obsolete in many cases, and more sophisticated techniques have become necessary. Within the field of Artificial Intelligence, Bayesian networks have proven to be a powerful tool for handling such problems [89].

In recent years, much research on Bayesian networks has focused on efficient methods for learning and inference in high dimensionality frameworks, from several points of view.

First, in situations in which it is already difficult to model the problem due to its complexity, it is desirable to avoid, as far as possible, any further approximation in the calculations. In this sense, much effort in the field of Bayesian networks has been aimed at avoiding the discretisation of continuous variables and dealing with them directly, representing their probability distributions as accurately as possible. Mixtures of Truncated Exponentials (MTEs) have been considered an appropriate tool for this purpose, as they impose no restrictions on the model, and they are the main focus of this dissertation.

It is also very common to face problems with a high number of variables, in which a selection of them is needed both to reduce the complexity of the model and, in some cases, to increase the accuracy of the results. In this way, the learning and inference processes are considerably simplified.

Along the same line, if the MTE network is too complex, exact probability propagation is computationally hard, and therefore new approximate methods for inference must be investigated.

Advances in hybrid Bayesian networks create opportunities to revisit problems already solved using other techniques. For example, a regression problem by nature involves both continuous and discrete variables. Bayesian networks offer some advantages over classical techniques, such as scalability, or the ability to give a prediction without a full observation of the independent variables. In this dissertation, several MTE network structures are developed and applied to solve regression problems.

Also, there are many situations in which missing data are frequent; for example, in large databases it is more likely to find incomplete records. Missing data also appear frequently in scenarios where data acquisition is difficult or even impossible. Therefore, new methods for learning MTE models from missing data will be developed.

Finally, there are also many other disciplines to which much of the theory about hybrid Bayesian networks has not yet been applied. Hence, we present two applied works, one in the environmental field and one in higher education management.

1.2 Organisation of the dissertation

The document is divided into four parts. The first and last are the Introduction and the Concluding remarks, containing Chapters 1 and 2, and Chapter 9, respectively. Part II, called Theoretical contributions, includes Chapters 3, 4, 5 and 6, and describes the new theoretical and methodological advances developed in this dissertation. Chapters 7 and 8 form Part III, called Applications, where two applications of MTEs are presented.

The nine chapters are distributed as follows:

Chapter 1 explains the motivation of the dissertation and how the contents

throughout the document are organised.

Chapter 2 introduces the topic of the dissertation, beginning with the most general concepts and gradually focusing on those most related to the remaining chapters. First the concept of Bayesian network is explained, and then a specific type in which discrete and continuous variables coexist simultaneously, the so-called hybrid Bayesian networks. Different approaches for their treatment are explained: discretisation, conditional Gaussian (CG) distributions and Mixtures of Truncated Exponentials (MTEs), with special emphasis on the latter, the main theme of the dissertation. Finally, we review the state of the art in hybrid Bayesian networks.

In Chapter 3 we explore the extension of various kinds of MTE-based Bayesian

network classifiers to regression problems where some of the independent variables

are continuous and some others are discrete.

Chapter 4 addresses the same problem as Chapter 3, but for the case of incomplete data.

In Chapter 5 we describe an EM-based algorithm for learning the maximum

likelihood parameters of an MTE network when confronted with incomplete data.

In Chapter 6, a new approximate propagation algorithm for MTE networks

based on importance sampling is presented.

The aim of Chapter 7 is to characterise the habitat of an endangered species (we focus on the spur-thighed tortoise), using several continuous environmental variables. Two MTE models are presented for this purpose.

Chapter 8 presents a methodology for relevance analysis of performance indicators in higher education based on the use of Bayesian networks. The MTE model is applied for constructing composite indicators by using a Bayesian regression model implemented in a web application.

Finally, Chapter 9 concludes the dissertation, summarising the main contributions of the work and outlining future research lines.


Chapter 2

Preliminaries

2.1 Bayesian networks

Bayesian networks [117, 77] are considered one of the most powerful tools for representing complex systems in which the relationships among the variables are subject to uncertainty. Their main purpose is to provide a framework for efficiently reasoning about the system they represent, in the sense of updating the information about the unobserved variables when some new information is incorporated into the system [76, 143].

We will use uppercase letters to denote random variables, and boldfaced uppercase letters to denote random vectors, e.g. X = (X1, . . . , Xn), whose domain will be written as ΩX. By lowercase letters x (or x) we denote some element of ΩX (or ΩX).

A Bayesian network is a multivariate statistical model for a set of variables X, defined in terms of two components:

• A qualitative component, defined by means of a directed acyclic graph (DAG) in which each vertex represents one of the variables in the model, and the presence of an edge linking two variables indicates the existence of statistical dependence between them.

• A quantitative component, specified through a conditional distribution p(xi | pa(xi)) for each variable Xi, i = 1, . . . , n, given its parents in the graph, denoted as pa(Xi).


Figure 2.1: An example of Bayesian network with five variables, X1, . . . , X5, with arcs X1 → X2, X1 → X3, X2 → X4, X3 → X4 and X3 → X5.

For example, the graph depicted in Figure 2.1 could be the qualitative component of a Bayesian network for variables X1, . . . , X5. According to the graph structure, it would be necessary to specify a conditional distribution for each variable given its parents. In this case, the distributions are p(x1), p(x2 | x1), p(x3 | x1), p(x4 | x2, x3) and p(x5 | x3).

In what follows, we will describe how the qualitative component encodes the

dependencies among the variables in the model, and how the strength of these

dependencies is determined by the quantitative component, i.e., the conditional

distributions.

Qualitative component of a Bayesian network

One of the most important advantages of Bayesian networks is that the structure

of the associated DAG determines the dependence and independence relationships

among the variables, so that it is possible to find out, with no need of carrying

out any numerical calculations, which variables are relevant or irrelevant for some

other variable of interest.

Figure 2.2 shows, through an example, the three types of connections among

variables. They can be interpreted as follows:

• Serial connections: Information may be transmitted from X1 to X3, unless the state of the variable X2 is known.

• Diverging connections: Information may be transmitted from X1 to X3, unless the state of the variable X2 is known.


Figure 2.2: Kinds of connections in a DAG: (a) serial connection, (b) diverging connection, (c) converging connection.

• Converging connections: Information may only be transmitted through

a converging connection if either information about the state of the variable

X2 or one of its descendants is available.

More formally, the rules for interpreting information flow given the structure

of a Bayesian network are based on the d-separation concept [77]:

Definition 1 (d-separation). Two variables X and Y in a Bayesian network are d-separated if, for every path between them, there is an intermediate variable Z such that:

• there is a serial or diverging connection and Z is instantiated, or

• there is a converging connection and neither Z nor any descendant of Z has received evidence.

We will use a toy example, taken from [75], to explain the transmission of

information in a Bayesian network.


Example 1 (Burglary or earthquake). Mr. Holmes is working in his office when

he receives a phone call from his neighbor Dr. Watson, who tells him that Holmes’

burglar alarm has gone off. Convinced that a burglar has broken into his house,

Holmes rushes to his car and heads for home. On his way, he listens to the radio,

and in the news it is reported that there has been a small earthquake in the area.

Knowing that earthquakes have a tendency to turn burglar alarms on, he returns

to his work.

Figure 2.3: The Bayesian network for the burglary or earthquake example, with arcs Burglary → Alarm, Earthquake → Alarm, Alarm → WatsonCalls and Earthquake → RadioNews.

The scenario described in Example 1 can be represented by the Bayesian

network in Figure 2.3. The semantic interpretation is given over this example:

• Serial connections:

– "Burglary" has a causal influence on "Alarm", which in turn has a causal influence on "Watson calls". Therefore, information flows from "Burglary" to "Watson calls" and vice versa, since knowledge about one of the variables provides information about the other.

– However, if we observe “Alarm”, any information about the state of

“Burglary” is irrelevant to our belief about “Watson calls” and vice

versa, since once we have certainty about the fact that the alarm has

gone off, the information provided by Watson does not change our

state of belief.


• Diverging connections:

– "Earthquake" has a causal influence on both "Alarm" and "Radio news". Therefore, information flows from "Alarm" to "Radio news" and vice versa, since knowledge about one of the variables provides information about the other. For instance, if our only knowledge is that the radio news reported a small earthquake, our belief about the alarm going off would increase.

– On the other hand, if we observe "Earthquake", i.e. we have certainty about it, any information about the state of "Alarm" is irrelevant to our belief about an earthquake report in the "Radio news", and vice versa.

• Converging connections:

– "Alarm" is causally influenced by both "Burglary" and "Earthquake". However, in this case the last two variables are irrelevant to each other: if we do not have any information about the alarm, there is no relationship between the other two variables.

– However, if we observe "Alarm" and "Burglary", then this will affect our belief about "Earthquake": the burglary explains the alarm, reducing our belief that the earthquake is the triggering factor, and vice versa.
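Definition 1 can also be checked mechanically. The sketch below is a minimal, self-contained Python illustration (the function names are ours, not from any Bayesian network library) that tests d-separation through the standard moral-graph criterion and reproduces the three readings of the burglary example.

def ancestors(dag, nodes):
    """Return the given nodes together with all of their ancestors in the DAG."""
    result, stack = set(nodes), list(nodes)
    while stack:
        node = stack.pop()
        for parent in dag.get(node, []):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result

def d_separated(dag, x, y, z):
    """dag maps every node to the list of its parents; z is the set of observed nodes."""
    keep = ancestors(dag, {x, y} | set(z))
    undirected = {node: set() for node in keep}
    for child in keep:                       # moralise: link parents to the child and to each other
        parents = [p for p in dag.get(child, []) if p in keep]
        for p in parents:
            undirected[child].add(p)
            undirected[p].add(child)
        for i, p in enumerate(parents):
            for q in parents[i + 1:]:
                undirected[p].add(q)
                undirected[q].add(p)
    frontier, visited = [x], {x} | set(z)    # drop observed nodes, then search for a path from x to y
    while frontier:
        node = frontier.pop()
        if node == y:
            return False
        for neighbour in undirected[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(neighbour)
    return True

burglary_net = {                             # parent lists for the network of Figure 2.3
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "WatsonCalls": ["Alarm"], "RadioNews": ["Earthquake"],
}
print(d_separated(burglary_net, "Burglary", "Earthquake", set()))       # True: no evidence
print(d_separated(burglary_net, "Burglary", "Earthquake", {"Alarm"}))   # False: explaining away
print(d_separated(burglary_net, "Burglary", "WatsonCalls", {"Alarm"}))  # True: serial connection blocked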

Quantitative component of a Bayesian network

Once the structure is defined, it is necessary to quantify how strong the relations among the variables are. This is achieved through the quantitative component of the Bayesian network, namely the joint probability distribution.

Using the chain rule, the joint probability distribution over a set of variables X1, . . . , Xn can be expressed as follows:

p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}),   (2.1)

where X1, . . . , Xn is an ordering of the variables consistent with the graph, i.e. such that pa(Xi) ⊆ {X1, . . . , Xi−1}.


Taking into account the independences encoded by the network structure, each conditional probability in Equation (2.1) may be simplified. Thus, the joint distribution over all the variables is equal to the product of the conditional distributions attached to each node, so that

p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid pa(x_i)).   (2.2)

Note that the induced factorisation allows complex distributions to be represented by a set of simpler ones, and therefore the number of parameters needed to specify a model is, in general, lower. For instance, the network in Figure 2.1 is factorised as

p(x_1, x_2, x_3, x_4, x_5) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_5 \mid x_3)\, p(x_4 \mid x_2, x_3).   (2.3)

Thus, if binary variables are assumed in Equation (2.3), 32 parameters are needed to specify the joint distribution, whilst if the induced factorisation is applied, the number of parameters is reduced to 22. The more complex the network (in number of arcs, variables and states), the greater the reduction achieved by the factorisation. This is really important for reducing the memory needed to store the model.
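The figures quoted above can be verified mechanically. The following minimal Python sketch counts the table entries needed for the full joint distribution of Figure 2.1 against those needed by the factorisation in Equation (2.3), assuming binary variables as in the text.

from math import prod

states = {f"X{i}": 2 for i in range(1, 6)}            # number of states per variable
parents = {"X1": [], "X2": ["X1"], "X3": ["X1"],
           "X4": ["X2", "X3"], "X5": ["X3"]}

joint_size = prod(states.values())                     # 2**5 = 32 entries for the full joint table
factorised_size = sum(states[v] * prod(states[p] for p in parents[v])
                      for v in states)                 # 2 + 4 + 4 + 8 + 4 = 22 entries
print(joint_size, factorised_size)                     # prints: 32 22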

Another advantage of factorising is related to inference. Assume that Xi is

a variable in which we are interested, and XE is a set of variables whose values

are known. Then, the prediction for the value of Xi given XE can be obtained

by computing the distribution p(xi | xE). This distribution could be obtained

from the joint distribution in Equation (2.1). The key point is that there is no

need to compute the joint distribution, since there are efficient algorithms that

allow us to compute p(xi | xE) taking advantage of the factorisation of the joint

distribution imposed by the network structure [99, 143].


2.2 Hybrid Bayesian networks

Bayesian networks were originally proposed for handling discrete variables, and nowadays a broad and consolidated theory about them can be found in the literature. However, in real problems it is very common for continuous and discrete domains to be present simultaneously.

Definition 2. A Bayesian network is called hybrid when continuous and discrete

random variables coexist simultaneously in the model.

In a hybrid framework, one solution is to discretise the continuous data and treat them as if they were discrete, so that the existing methods for discrete variables can be applied. However, discretisation is just an approximation, and other alternatives have been successfully studied.

In this section, several approaches for dealing with hybrid Bayesian networks are explored. First, discretisation is presented as the most extreme solution. Afterwards, we study frameworks in which continuous and discrete variables can be handled simultaneously without discretisation: the Conditional Gaussian (CG) model, the Mixtures of Truncated Exponentials (MTE) model, and the Mixtures of Polynomials (MOP) model.

2.2.1 Discretisation

Most existing algorithms in the literature for learning and inference in Bayesian

networks are only valid for discrete variables. One popular approach is just to

discretise the domain of the continuous variables [59, 81] which is a simple (but

sometimes inaccurate) solution.

Definition 3. Let X be a continuous random variable with support ΩX and density function f(x). Let A = {A1, . . . , An} be a partition of ΩX in which the Ai, i = 1, . . . , n, are intervals. A discretisation of X is the process of building a discrete random variable X′ with support ΩX′ = {1, . . . , n} such that

P(X' = i) = \int_{A_i} f(x)\, dx, \qquad i = 1, \ldots, n.


After the discretisation, a hybrid Bayesian network is treated as a discrete one when performing inference and learning. The higher the number of intervals, the more accurate the approximation provided by the discretisation. This can be seen in Figure 2.4, where the standard Gaussian distribution has been approximated by a discrete probability function using 3, 6, 12 and 24 intervals, respectively.

Figure 2.4: Discretising a normal density with different numbers of intervals: (a) 3 intervals, (b) 6 intervals, (c) 12 intervals, (d) 24 intervals.

However, using many intervals is computationally hard, since when inference is carried out the size of the potentials increases exponentially and memory for storing them becomes a limitation.
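As an illustration of Definition 3, the following sketch reproduces the kind of discretisation shown in Figure 2.4 for the standard Gaussian, assuming (our choice, not stated in the text) that the support is truncated to [−3, 3] and split into equal-width intervals.

import numpy as np
from scipy.stats import norm

def discretise_normal(n_intervals, low=-3.0, high=3.0):
    edges = np.linspace(low, high, n_intervals + 1)
    # P(X' = i) is the probability mass of the i-th interval, as in Definition 3.
    probs = norm.cdf(edges[1:]) - norm.cdf(edges[:-1])
    return edges, probs

for n in (3, 6, 12, 24):                    # the four panels of Figure 2.4
    _, probs = discretise_normal(n)
    print(n, np.round(probs.sum(), 4))      # close to 1 (the mass outside [-3, 3] is lost)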

2.2.2 Conditional Gaussian (CG) distributions

Although discretisation is a technique that can always be applied, there are some types of variables whose distributions allow hybrid Bayesian networks to be handled in an exact way. These distributions are part of the conditional Gaussian model [90, 37, 91, 31], explained next.


Definition 4. Let X be a continuous variable in a hybrid Bayesian network, let Z = (Z1, . . . , Zd)^T be its discrete parents, and let Y = (Y1, . . . , Yc)^T be its continuous parents. Conditional linear Gaussian (CLG) potentials in hybrid Bayesian networks have the form

\phi(X \mid \mathbf{z}, \mathbf{y}) \sim \mathcal{N}\!\left(\mu = \mathbf{l}_{\mathbf{z}}^{T}\mathbf{y} + b_{\mathbf{z}},\ \sigma^{2}_{\mathbf{z}}\right),   (2.4)

where z and y are an assignment of states of the discrete and continuous parents of X. For a concrete assignment z, l_z^T is the transpose of the vector of coefficients of a linear regression model with c entries (one for each continuous parent), b_z is the intercept of that regression, and σ_z^2 > 0 is the variance for variable X.

The conditional mean of a CLG potential depends linearly on the continuous parent variables, while the variance does not. For each configuration of the discrete parents of X, a linear function of the continuous parents is specified as the mean of the conditional distribution of X given its parents, and a positive real number is specified for the variance of the distribution of X given its parents.
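A minimal sketch of how such a CLG potential of Equation (2.4) can be stored and evaluated; the parameter values below are invented purely for illustration.

import numpy as np
from scipy.stats import norm

clg = {  # discrete configuration z -> (l_z, b_z, sigma2_z), here with two continuous parents
    0: (np.array([0.5, -1.0]), 2.0, 1.0),
    1: (np.array([1.5,  0.2]), 0.0, 0.5),
}

def clg_density(x, z, y):
    l_z, b_z, sigma2_z = clg[z]
    mean = l_z @ np.asarray(y) + b_z            # conditional mean is linear in the continuous parents
    return norm.pdf(x, loc=mean, scale=np.sqrt(sigma2_z))

print(clg_density(x=1.3, z=0, y=[2.0, 0.7]))    # density of X at 1.3 given z = 0 and y = (2.0, 0.7)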

The scheme originally developed by Lauritzen [90] allows exact computation of means and variances in CLG networks; however, this algorithm did not always compute the exact marginal densities of the continuous variables. A new computational scheme for CLG models was later developed by Lauritzen and Jensen [91]. This scheme allows the calculation of full local marginals and also permits conditionally deterministic linear variables, i.e. distributions where σ_z^2 = 0 in Equation (2.4).

The CLG model has the property that, for any assignment of values to the discrete variables, the distribution of the continuous variables is multivariate Gaussian. This is because, given an assignment of the discrete variables, the conditional probability distributions of the continuous variables are linear Gaussian models, and when these linear Gaussian models are combined they produce a multivariate Gaussian. The joint distribution of all continuous variables in the network is therefore a mixture of Gaussians. CLG models cannot accommodate continuous random variables whose conditional distribution is not Gaussian. Moreover, CLG models are not valid in frameworks where a discrete variable has continuous parents, since in that case its domain must be discretised in some way. This is solved by the MTE model presented next.

2.2.3 Mixtures of Truncated Exponentials

The CG model is useful in situations in which it is known that the joint distribution of the continuous variables for each configuration of the discrete ones follows a multivariate Gaussian. However, in practical applications it is possible to find scenarios where this hypothesis is violated, in which case another model, like discretisation, should be used. Since discretisation is equivalent to approximating a target density by a mixture of uniforms, the accuracy of the final model could be increased if, instead of uniforms, other functions were used. Exponential functions are a good choice, since they have high fitting power and are closed under restriction, marginalisation and combination. This is the idea behind the so-called Mixtures of Truncated Exponentials (MTE) model [106].

During the probability inference process, where the posterior distributions of the variables are obtained given some evidence, the intermediate functions are not necessarily density functions; therefore, a more general function, called an MTE potential, needs to be defined as follows:

Definition 5. (MTE potential) Let X be a mixed n-dimensional random vector. Let Z = (Z1, . . . , Zd)^T and Y = (Y1, . . . , Yc)^T be the discrete and continuous parts of X, respectively, with c + d = n. We say that a function f : ΩX → R_0^+ is a Mixture of Truncated Exponentials potential (MTE potential) if one of the following conditions holds:

i. Z = ∅ and f can be written as

f(\mathbf{x}) = f(\mathbf{y}) = a_0 + \sum_{i=1}^{m} a_i \exp\left\{\mathbf{b}_i^{T}\mathbf{y}\right\}   (2.5)

for all y ∈ ΩY, where a_i ∈ R and b_i ∈ R^c, i = 1, . . . , m.

ii. Z = ∅ and there is a partition D1, . . . , Dk of ΩY into hypercubes such that f is defined as

f(\mathbf{x}) = f(\mathbf{y}) = f_i(\mathbf{y}) \quad \text{if } \mathbf{y} \in D_i,

where each f_i, i = 1, . . . , k, can be written in the form of Equation (2.5).

iii. Z ≠ ∅ and for each fixed value z ∈ ΩZ, f_z(y) = f(z, y) can be defined as in ii.

Example 2. The function f defined as

f(y_1, y_2) = \begin{cases}
2 + \exp\{3y_1+y_2\} + \exp\{y_1+y_2\} & \text{if } 0 < y_1 \le 1,\ 0 < y_2 < 2,\\
1 + \exp\{y_1+y_2\} & \text{if } 0 < y_1 \le 1,\ 2 \le y_2 < 3,\\
\frac{1}{4} + \exp\{2y_1+y_2\} & \text{if } 1 < y_1 < 2,\ 0 < y_2 < 2,\\
\frac{1}{2} + 5\exp\{y_1+2y_2\} & \text{if } 1 < y_1 < 2,\ 2 \le y_2 < 3
\end{cases}

is an MTE potential since all of its parts are MTE potentials.
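For concreteness, the sketch below evaluates the potential of Example 2 exactly as written and integrates it numerically over its domain (0, 2) × (0, 3); the quadrature is only a sanity check, since in practice MTE potentials are integrated in closed form piece by piece.

import numpy as np
from scipy.integrate import dblquad

def f(y1, y2):
    if 0 < y1 <= 1 and 0 < y2 < 2:
        return 2 + np.exp(3 * y1 + y2) + np.exp(y1 + y2)
    if 0 < y1 <= 1 and 2 <= y2 < 3:
        return 1 + np.exp(y1 + y2)
    if 1 < y1 < 2 and 0 < y2 < 2:
        return 0.25 + np.exp(2 * y1 + y2)
    if 1 < y1 < 2 and 2 <= y2 < 3:
        return 0.5 + 5 * np.exp(y1 + 2 * y2)
    return 0.0

# Total mass of the (unnormalised) potential over (0,2) x (0,3); dividing f by this
# constant would turn it into a density in the sense of Definition 6 below.
mass, _ = dblquad(lambda y2, y1: f(y1, y2), 0, 2, lambda y1: 0, lambda y1: 3)
print(mass)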

Definition 6. (MTE density) An MTE potential f is an MTE density if

\sum_{\mathbf{z} \in \Omega_{\mathbf{Z}}} \int_{\Omega_{\mathbf{Y}}} f(\mathbf{z}, \mathbf{y})\, d\mathbf{y} = 1.

A conditional MTE density can be specified by dividing the domain of the

conditioning variables and specifying an MTE density for the conditioned variable

for each configuration of splits of the conditioning variables.

Example 3. Consider two continuous variables X and Y. A possible conditional MTE density for Y given X is the following:

f(y \mid x) = \begin{cases}
1.26 - 1.15\,\exp\{0.006y\} & \text{if } 0.4 \le x < 5,\ 0 \le y < 13,\\
1.18 - 1.16\,\exp\{0.0002y\} & \text{if } 0.4 \le x < 5,\ 13 \le y < 43,\\
0.07 - 0.03\,\exp\{-0.4y\} + 0.0001\,\exp\{0.0004y\} & \text{if } 5 \le x < 19,\ 0 \le y < 5,\\
-0.99 + 1.03\,\exp\{0.001y\} & \text{if } 5 \le x < 19,\ 5 \le y < 43.
\end{cases}   (2.6)

Since MTEs are defined on hypercubes, they admit a tree-structured representation in a natural way. Moral et al. [106] proposed a data structure to represent MTE potentials which is especially appropriate for this kind of conditional densities: the so-called mixed probability trees, or mixed trees for short. The formal definition is as follows:

Definition 7. (Mixed tree) We say that a tree T is a mixed tree if it meets the

following conditions:

i. Every internal node represents a random variable (discrete or continuous).

ii. Every arc outgoing from a continuous variable Y is labeled with an interval of values of Y, so that the domain of Y is the union of the intervals corresponding to the arcs emanating from Y.

iii. Every discrete variable has a number of outgoing arcs equal to its number

of states.

iv. Each leaf node contains an MTE potential defined on variables in the path

from the root to that leaf.

Mixed trees can represent MTE potentials defined by parts. Each entire

branch in the tree determines one hypercube where the potential is defined, and

the function stored in the leaf of a branch is the definition of the potential on it.

Example 4. Consider the following MTE potential, defined for a discrete variable (Z1) and two continuous variables (Y1 and Y2):

\phi(z_1, y_1, y_2) = \begin{cases}
2 + \exp\{3y_1+y_2\} & \text{if } z_1 = 0,\ 0 < y_1 \le 1,\ 0 < y_2 < 2,\\
1 + \exp\{y_1+y_2\} & \text{if } z_1 = 0,\ 0 < y_1 \le 1,\ 2 \le y_2 < 3,\\
\frac{1}{4} + \exp\{2y_1+y_2\} & \text{if } z_1 = 0,\ 1 < y_1 < 2,\ 0 < y_2 < 2,\\
\frac{1}{2} + 5\exp\{y_1+2y_2\} & \text{if } z_1 = 0,\ 1 < y_1 < 2,\ 2 \le y_2 < 3,\\
1 + 2\exp\{2y_1+y_2\} & \text{if } z_1 = 1,\ 0 < y_1 \le 1,\ 0 < y_2 < 2,\\
1 + 2\exp\{y_1+y_2\} & \text{if } z_1 = 1,\ 0 < y_1 \le 1,\ 2 \le y_2 < 3,\\
\frac{1}{3} + \exp\{y_1+y_2\} & \text{if } z_1 = 1,\ 1 < y_1 < 2,\ 0 < y_2 < 2,\\
\frac{1}{2} + \exp\{y_1-y_2\} & \text{if } z_1 = 1,\ 1 < y_1 < 2,\ 2 \le y_2 < 3.
\end{cases}


A representation of this potential by means of a mixed probability tree is displayed in Figure 2.5.

Figure 2.5: An example of mixed probability tree, representing the potential of Example 4: the root node is Z1, the internal nodes Y1 and Y2 have outgoing arcs labelled with intervals, and each leaf stores the MTE expression of the corresponding hypercube.
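To make Definition 7 concrete, here is one possible (purely illustrative) way to hold a mixed tree like that of Figure 2.5 in memory and evaluate the potential of Example 4; the nested-tuple layout, the half-open interval convention and the helper names are ours and not tied to any existing MTE implementation. Only the z1 = 0 subtree is written out, for brevity.

from math import exp

def leaf(g):                       # a leaf holds the MTE expression of its hypercube
    return ("leaf", g)

def node(var, branches):           # branches: list of (branch label, subtree) pairs
    return ("node", var, branches)

def in_branch(label, value):
    if isinstance(label, tuple):   # continuous branch: interval treated as (a, b]
        a, b = label
        return a < value <= b
    return value == label          # discrete branch: exact state match

def evaluate(tree, assignment):
    while tree[0] == "node":
        _, var, branches = tree
        tree = next(sub for lab, sub in branches if in_branch(lab, assignment[var]))
    return tree[1](assignment)

phi = node("Z1", [
    (0, node("Y1", [
        ((0, 1), node("Y2", [((0, 2), leaf(lambda a: 2 + exp(3 * a["Y1"] + a["Y2"]))),
                             ((2, 3), leaf(lambda a: 1 + exp(a["Y1"] + a["Y2"])))])),
        ((1, 2), node("Y2", [((0, 2), leaf(lambda a: 0.25 + exp(2 * a["Y1"] + a["Y2"]))),
                             ((2, 3), leaf(lambda a: 0.5 + 5 * exp(a["Y1"] + 2 * a["Y2"])))])),
    ])),
])

print(evaluate(phi, {"Z1": 0, "Y1": 0.5, "Y2": 2.5}))   # follows the branch 0 < y1 <= 1, 2 <= y2 < 3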

In the same way as in discretisation, the more intervals used to divide the domain of the continuous variables, the more accurate the MTE model, but also the more complex. Furthermore, in the case of MTEs, using more exponential terms within each interval substantially improves the fit to the real model, as we will see in Chapter 5.

2.2.4 Mixtures of Polynomials

A recent research line connected with hybrid Bayesian networks is the Mixtures of

Polynomials (MOPs) proposed in [145]. The idea is to replace the basis function

of the MTE (exponential) by a polynomial.

A one-dimensional function f : R → R is said to be a mixture of polynomials

(MOP) function if it is a piecewise function of the form:

f(x) = \begin{cases}
a_{0i} + a_{1i}x + a_{2i}x^{2} + \cdots + a_{ni}x^{n} & \text{for } x \in A_i,\ i = 1, \ldots, k,\\
0 & \text{otherwise},
\end{cases}   (2.7)

where A1, . . . , Ak are disjoint intervals in R that do not depend on x, and a_{0i}, . . . , a_{ni} are constants for all i. We will say that f is a k-piece, n-degree MOP function (assuming a_{ni} ≠ 0 for some i).

The main motivation for defining MOP functions is that such functions are

easy to integrate in closed form, and that they are closed under multiplication,

integration, and addition, the main operations in making inferences in hybrid

Bayesian networks. The requirement that each piece is defined on an interval Ai

is also designed to ease the burden of integrating MOP functions.
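As a small illustration of why closure under integration matters, the following sketch builds a 2-piece, 2-degree MOP in the sense of Equation (2.7), with invented coefficients, and integrates it exactly, piece by piece, through polynomial antiderivatives.

from numpy.polynomial import Polynomial

pieces = [  # (interval A_i, polynomial a_0i + a_1i x + a_2i x^2)
    ((0.0, 1.0), Polynomial([0.1, 0.8, -0.2])),
    ((1.0, 2.0), Polynomial([1.3, -0.9, 0.1])),
]

def mop(x):
    for (a, b), p in pieces:
        if a <= x < b:
            return p(x)
    return 0.0

# Exact integral over the whole support: sum of antiderivative differences per piece.
total = sum(p.integ()(b) - p.integ()(a) for (a, b), p in pieces)
print(mop(0.5), total)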

2.3 State-of-the-art in hybrid Bayesian networks

The purpose of this section is to review the main advances in the literature

regarding hybrid Bayesian networks, mainly focusing on MTEs. So far, there are

three different alternatives to discretisation in the state-of-the-art for working

with hybrid Bayesian networks. The order in which they were proposed is:

• Conditional Linear Gaussian (CLG) model,

• Mixtures of Truncated Exponentials (MTEs), and

• Mixture of Polynomials (MOPs).

One of the earliest algorithms for dealing with hybrid Bayesian networks was

proposed in [90], and later revised in [91]. This algorithm is applied to Bayesian

networks where continuous variables are modeled by conditional linear Gaussian

(CLG) distributions. Its main weakness is that the network does not allow dis-

crete variables to have continuous parents, a dependency that arises in many

domains.

An approximate way to avoid this problem was suggested in [112, 94] with

the so-called augmented CLG networks, which are hybrid Bayesian networks with

conditional linear Gaussian distributions for continuous variables, and which al-

low discrete variables with continuous parents. The idea is to approximate the

product of a Gaussian and a logistic function (discrete variables with continuous

parents) by a variational approximation [112] or by a mixture of Gaussians using

numerical integration [94].


Another solution to the limitations of the CLG model was proposed in [141], where a method for exact inference in hybrid networks is presented. It consists of approximating general hybrid Bayesian networks by a mixture of Gaussian Bayesian networks [153] and then applying the exact algorithm proposed in [91]. The approximation is based on an arc reversal technique [116, 140] to avoid discrete nodes with continuous parents, and also on approximating non-Gaussian distributions by Gaussian distributions.

On the other hand, the MTE framework was first stated in 2001 as an alternative for representing the distributions of hybrid Bayesian networks. The MTE model does not impose any restriction on the interactions among variables

(discrete nodes with continuous parents are allowed), and also exact probability

propagation can be carried out by means of local computation algorithms. This

model was formally proposed in [106] where its basic operations were defined and

also a Markov Chain Monte Carlo propagation algorithm was described to deal

with complex networks.

Later on, in [107, 135] an iterative algorithm to estimate MTE distributions

from data is proposed based on least squares approximation. In 2003, a method

to estimate conditional MTE densities using mixed trees is proposed [108] with a

criterion for selecting variables during the construction of the tree. Afterwards, once these issues about learning MTEs from data had been addressed, structural learning from data with MTEs by means of a hill-climbing algorithm was proposed in [127, 128].

Later in [30], a comparison among different approaches that can somehow deal

with hybrid networks was presented. In particular the behaviour of discretisation,

CLG models, mixed trees and linear deterministic models was compared.

Although exact probability propagation in MTE networks can be accomplished by using standard algorithms, in complex networks the size of the potentials involved grows so much that the propagation becomes infeasible. To overcome this problem, the Penniless propagation algorithm (already proposed for the discrete case) is adapted to MTE networks in [133]. In [134] the study goes further, also considering how to use the Markov Chain Monte Carlo method in approximate propagation. A comparison between both methods reported that the MCMC method is not competitive with respect to Penniless propagation.


MTEs have also been applied to hybrid Bayesian networks with linear [29, 32] and nonlinear deterministic variables [30]. In these works, the operations required for performing inference are developed. Later, in [144], an architecture for solving large general hybrid Bayesian networks with deterministic variables is developed. The problem of arc reversal in hybrid Bayesian networks with deterministic variables is treated in [24].

The work in [61] represents the first approach for learning MTE networks

from missing data. A naïve Bayes model for unsupervised data clustering with a hidden class variable is proposed. The proposal is compared with the conditional

Gaussian model implemented in the WEKA data mining suite [158].

In [31] MTE potentials that approximate an arbitrary normal density function

with any mean and positive variance are presented. In addition, the work in [33]

proposes a general solution to the approximation problem above, and shows that

the most common density functions can be approximated by an MTE potential,

which can always be marginalised in closed form. This advance is very useful,

since MTE potentials can be used for inference in hybrid Bayesian networks

because they do not fit the restrictive assumptions of the CLG model (discrete

nodes with continuous parents and Gaussian distribution).

In [47] an incremental method for building MTE classifiers in domains with

very large amounts of data or for data streams is proposed. This incremental

approach is specially interesting for the case of the MTE distributions, where it is

not possible to keep the sufficient statistics necessary to estimate the parameters,

and therefore, the only possibility to update an MTE potential so far was to

re-learn from scratch.

In [27] two frameworks for handling hybrid Bayesian networks based on the

CG and MTE distributions are reviewed. In both cases it is studied how infer-

ence and learning from data can be carried out, concluding that the CG model

relies on a very solid theoretical development and it allows efficient inference, but

with the restriction that discrete nodes cannot have continuous parents. On the

other hand, the MTE model fits more naturally to local computation schemes for

inference, since it is closed for the basic operations used in inference regardless of

the structure of the network.


Hybrid Bayesian networks have been applied to regression or prediction prob-

lems under the assumption, in a first stage, that the joint distribution of the

feature variables and the class is multivariate Gaussian [62]. If the normality

assumption is not fulfilled, the problem of regression was addressed using kernel

densities to model the conditional distributions [57], with poor results. A com-

mon restriction of Gaussian models and kernel-based models is that they are only

applied to scenarios in which all the variables are continuous.

Later, in [110, 111], a naïve Bayes regression model based on MTEs and a variable selection scheme was proposed, reporting competitive results with respect to the state-of-the-art methods. In 2007, a tree augmented naïve Bayes model for regression was proposed [51], solving the problem of estimating the conditional mutual information, which cannot be analytically obtained for MTEs. The performance of this model was tested in a real-life context related to higher education management, where mixed variables are common. Afterwards, in 2008, previous ideas about regression with MTEs were collected and a study of the extension of several Bayesian classifier structures to regression problems was addressed [55].

Also, a variable selection scheme was considered for the proposed models.

Having successfully applied MTE networks to regression problems, the next step was to study the problem of learning regression models

from missing data. A first approach was adopted in [52], where an iterative algo-

rithm for learning a naıve Bayes model from incomplete data was proposed. Later

on, this idea was extended to TAN models, also considering variable selection,

and a deeper comparison with the state-of-the-art techniques was developed [53].

MTE and Gaussian networks have also been used as approximate models to make inference feasible in models where exact inference is not. For example,

in [23, 25] methods for approximating PERT Bayesian networks by Gaussian and

MTE networks, respectively, were proposed.

So far, the most prevalent MTE learning methods have estimated the parameters based on least squares estimation [135, 128]. The drawback of this approach is that, by

not directly attempting to find the parameter estimates that maximise the likeli-

hood, there is no principled way of performing subsequent model selection using

those parameter estimates. In [85, 88] an estimation method that directly aims

at learning the parameters of an MTE potential following a maximum likelihood


approach is presented. Empirical results on this topic demonstrate that the pro-

posed method yields significantly better likelihood results than existing methods.

The approach above focuses on the univariate case, but does not address the

conditional MTE specification. A preliminary work in this line [87] has investi-

gated two alternatives for the definition of conditional MTE densities, showing

that only the most restrictive one is compatible with standard efficient algorithms

for inference in Bayesian networks.

One of the current research lines on MTE-based networks is focused on learning parameters from incomplete data. In [49, 48] an EM-

based algorithm for learning maximum likelihood parameters of a general hybrid

network is described. The proposed learning procedure is not limited to any

distributional family, which is an important advance on this topic.

Moreover, recent work on inference with MTEs is aimed at approximate probability propagation. In [54] a propagation algorithm based on importance sampling was proposed, with remarkable accuracy with respect to the

approximate methods existing in the literature.

Current research directions for hybrid Bayesian networks are also focused on

Mixtures of Polynomials (MOPs) [145, 146, 142]. A MOP potential is defined as

for MTEs, just replacing the basis function of the MTE (exponential) by a polyno-

mial. MOP functions can be easily integrated, and are closed under combination

and marginalisation. This allows MOP potentials to be propagated in the extended

Shenoy-Shafer architecture. MOP approximations have several advantages over

MTE approximations, since they are easier to find, even for multi-dimensional

conditional PDFs.

Other minor alternatives (apart from CLG model, MTEs and MOPs) that

have been applied to hybrid Bayesian networks are related to dynamic discreti-

sation of the continuous variables during inference [81], or the use of sampling methods

to compute approximate marginals [64, 80, 20, 65].

While theoretical developments on hybrid Bayesian networks have emerged

in the literature in recent years, a significant number of works in the field of

application are appearing, mostly due to the need for simultaneous treatment

of continuous and discrete variables in real problems. In what follows we show

some applications of hybrid BNs by focusing on the MTE model, since it is the


core of this thesis. Nevertheless, many applications of the CLG model can be found in the

literature.

In [28] two applications of MTE networks to finance problems are presented.

First, naïve Bayes and TAN models are used to provide a distribution of the stock return and, second, a Bayesian network is used to determine a return distribution

for a portfolio of stocks. In [86] some of the last decade’s research on inference

in hybrid Bayesian networks is summarised and the discussions are linked to an

example in which a model is developed for explaining and predicting humans’

ability to perform specific tasks in a given environment. Hybrid Bayesian net-

works have also been applied to higher education management in [50], where a

methodology for relevance analysis of performance indicators in the management

of the University of Almerıa is developed. The MTEs are applied for constructing

composite indicators by using a Bayesian network regression model.

MTEs have also been applied in environmental sciences, in the study of predictive species distribution modelling. In [2], the habitat of

the spur-thighed tortoise is successfully characterised using MTE models.

A recent applied work is [26], where the authors introduce a graphical method for valuing options on real asset investments that allow the investor to switch between different operating modes at a single point in time. The technique uses MTE functions to approximate both the probability density function for the project value and the expressions for the option value of each alternative.


Part II

Theoretical contributions


Chapter 3

Learning models for regression

from complete data

Abstract

In this chapter we explore the extension of various kinds of Bayesian network classi-

fiers to regression problems where some of the independent variables are continuous

and some others are discrete. The goal is to compute the posterior distribution of

the dependent variable given the independent ones, and then use that distribution

to predict a value for the dependent variable given the observations. The involved

distributions are represented as Mixtures of Truncated Exponentials (MTEs). The

construction of some of these classifiers requires the use of the conditional mutual

information, which cannot be analytically obtained for MTEs. In order to solve this

problem, we introduce an unbiased estimator of the conditional mutual information,

based on Monte Carlo estimation. We test the performance of the proposed models on

different datasets commonly used as benchmarks, showing a competitive performance

with respect to the state-of-the-art methods.


3.1 Introduction

In real life applications, it is common to find problems in which the goal is to

predict the value of a variable of interest depending on the values of some other

observable variables. If the variable of interest is discrete, we are faced with a

classification problem, whilst if it is continuous, it is usually called a regression


problem. In classification problems, the variable of interest is called class and

the observable variables are called features, while in regression frameworks, the

variable of interest is called dependent variable and the observable ones are called

independent variables.

Bayesian networks [75, 117] have been used both for classification and regres-

sion purposes. Their main advantage with respect to other regression models is that

it is not necessary to have a full observation of the independent variables to give

a prediction for the dependent variable. Also, the model is usually richer from a

semantic point of view.

Naıve Bayes models have been applied to regression problems under the as-

sumption that the joint distribution of the independent variables and the depen-

dent variable is multivariate Gaussian [62]. If the normality assumption is not

fulfilled, the problem of regression with naıve Bayes models has been approached

using kernel densities to model the conditional distribution in the Bayesian net-

work [57], but the obtained results are poor. Furthermore, the use of kernels

introduces a high complexity in the model, which can be problematic, especially

because standard algorithms for carrying out the computations in Bayesian net-

works are not valid for kernels. A restriction of Gaussian models is that they only

apply to scenarios in which all variables are continuous.

Seeking a more general solution, we are interested in regression problems where

the independent variables can be either continuous or discrete. Therefore, the

joint distribution is not multivariate Gaussian in any case, due to the presence of

discrete variables. To solve this problem, a naıve Bayes regression model based

on the approximation of the joint distribution by an MTE was proposed [111].

In the same line, the aim of this chapter is to investigate the behaviour of

different Bayesian network classifiers when applied to regression problems. The

fact that models such as the naïve Bayes are appropriate for classification as well as

regression is not surprising, as the nature of both problems is similar: Predict the

value of a dependent variable given an observation over the independent variables.

In all the cases we will consider problems where some of the independent variables

are continuous while some others are discrete, and therefore we will concentrate on

the use of MTEs. More precisely, the starting point is the naıve Bayes model for

regression proposed in [111], and the other models will be obtained by increasing


the structural complexity of the underlying Bayesian network. In order not to use

a misleading terminology, from now on we will refer to the observable variables

as features, even if we are in a regression context.

The rest of the chapter is organised as follows. Section 3.2 is devoted to

presenting several Bayesian network classifier structures as the basis of the proposed

regression models. The use of Bayesian networks for regression is explained in

Section 3.3. The solution of the regression problem using MTEs is described

in Section 3.4. In Section 3.5 we explain how the selection of features in the

proposed selective models can be carried out. The particular regression models

based on Bayesian networks are introduced in Sections 3.6 to 3.9. Section 3.10 is

devoted to the experimental evaluation. The chapter ends with some concluding

remarks in Section 3.11.

3.2 Bayesian networks for classification

A Bayesian network can be used for classification purposes if it contains a class

variable Y , and a set of feature variables X1, . . . , Xn, where an object with ob-

served features x1, . . . , xn will be classified as belonging to class y∗ obtained as

y^* = \arg\max_{y \in \Omega_Y} f(y \mid x_1, \ldots, x_n), \qquad (3.1)

where ΩY denotes the set of possible values of Y .

Note that f(y | x1, . . . , xn) is proportional to f(y) × f(x1, . . . , xn | y), and therefore, solving the classification problem would require a distribution to be

specified over the n feature variables for each value of the class. The associated

computational cost can be very high. However, using the factorisation determined

by the network, the cost is reduced. Although the ideal would be to build a

network without restrictions on the structure, usually this is not possible due to

the limited data available. Therefore, networks with fixed and simple structures

and specifically designed for classification are used.

The extreme case is the so-called naıve Bayes (NB) structure [44, 58]. It

consists of a Bayesian network with a single root node and a set of attributes

having only one parent (the root node). The NB model structure is shown in


Figure 3.1.

Figure 3.1: Structure of a naïve Bayes model.

Its name comes from the naive assumption that the feature variables X1, . . . , Xn

are considered independent given Y . This strong independence assumption is

somehow compensated by the reduction of the number of parameters to be esti-

mated from data, since in this case, it holds that

f(y \mid x_1, \ldots, x_n) \propto f(y) \prod_{i=1}^{n} f(x_i \mid y), \qquad (3.2)

which means that, instead of one n-dimensional conditional distribution, n one-

dimensional conditional distributions are estimated. Despite this extreme inde-

pendence assumption, the results are surprisingly good in many cases, and therefore it has become the most widely used Bayesian classifier in the literature.
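To make the factorisation in Equation (3.2) concrete, the following sketch (in Python; it is an illustration only and not part of the software developed in this thesis) classifies an observation by evaluating f(y) ∏ f(xi | y) in log space for every class value. The one-dimensional densities are hypothetical callables that the caller must supply.

```python
import math

def nb_classify(prior, cond_densities, x):
    """Classify an observation x = (x_1, ..., x_n) with a naive Bayes model.

    prior          : dict mapping each class value y to f(y)
    cond_densities : dict mapping each class value y to a list [f_1, ..., f_n],
                     where f_i(x_i) returns the density f(x_i | y)
    Returns the class value maximising f(y) * prod_i f(x_i | y), as in
    Equation (3.1), with the product computed in log space for stability.
    """
    best_y, best_score = None, -math.inf
    for y, f_y in prior.items():
        score = math.log(f_y)
        for f_i, x_i in zip(cond_densities[y], x):
            score += math.log(f_i(x_i))
        if score > best_score:
            best_y, best_score = y, score
    return best_y

if __name__ == "__main__":
    # Toy example with two Gaussian-shaped features (purely illustrative).
    gauss = lambda mu, s: (lambda v: math.exp(-0.5 * ((v - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi)))
    prior = {0: 0.6, 1: 0.4}
    cond = {0: [gauss(0.0, 1.0), gauss(1.0, 1.0)],
            1: [gauss(2.0, 1.0), gauss(3.0, 1.0)]}
    print(nb_classify(prior, cond, x=(1.8, 2.5)))   # prints 1
```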

However, if some variables are highly correlated, the accuracy of classification

would improve if any dependence between them could be included in the network

(i.e. links among the features). The impact of relaxing the independence as-

sumption has been studied for classification oriented Bayesian networks. In what

follows, several structures are presented expanding the naıve Bayes structure by

permitting each feature to have one more parent besides Y.

A structure in which some dependencies are allowed among the features is

the so-called tree augmented naıve Bayes (TAN), which also has been used for

classification [58]. The TAN structure is obtained according to this restriction:

The features must form a directed tree. Figure 3.2 shows an example of a TAN structure with 4 features. Note that all of them (except the root of the directed tree) have two parents: the variable Y and one other feature. The model is richer, since it allows arcs among features, but a higher complexity


has to be assumed in exchange, both in the learning of the graph structure and of the associated probabilities.

Figure 3.2: A TAN structure where X2 is the root of the directed tree among the features.

A problem of the TAN model is that some of the introduced links between fea-

tures may not be necessary, as every feature is forced to be connected to another

one. This fact was pointed out by Lucas [96] within the context of classification

problems. He proposed to discard unnecessary links, obtaining a structure as the

one displayed in Figure 3.3, where, instead of a tree, the features form a forest

of directed trees. The resulting classifier is called a forest augmented naıve Bayes

(FAN).

Figure 3.3: A forest augmented naïve Bayes structure with 2 trees.

Somewhat more complex is the kDB classifier [136] (see Figure 3.4). This model allows each feature to have at most k parents besides the class.

The detailed construction of previous classifiers when they are applied to

regression is described in Sections 3.6 to 3.9.


Figure 3.4: A sample 2-DB regression model.

3.3 Bayesian networks for regression

Assume we have a set of variables Y,X1, . . . , Xn, where Y is continuous and the

rest are either discrete or continuous. Regression analysis consists of finding a

model g that explains the response variable Y in terms of the features X1, . . . , Xn,

so that given an assignment of the features, x1, . . . , xn, a prediction about Y can

be obtained as ŷ = g(x1, . . . , xn).

A Bayesian network can be used as a regression model for prediction purposes

following the same ideas as for classification, since both problems are solved in

a similar way. Therefore, the classifier structures presented in Section 3.2 will

be applied for regression. Thus, in order to predict the value for Y for observed

features x1, . . . , xn, the conditional density

f(y | x1, . . . , xn), (3.3)

is computed and a numerical prediction for Y is given using the corresponding

mean (or the median) as follows:

\hat{y} = g(x_1, \ldots, x_n) = E[Y \mid x_1, \ldots, x_n] = \int_{\Omega_Y} y\, f(y \mid x_1, \ldots, x_n)\, dy, \qquad (3.4)

where ΩY represents the domain of Y .


In any case, regardless of the structure employed, it is necessary that the joint

distribution for Y,X1, . . . , Xn follows a model for which the computation of the

density in Equation (3.3) can be carried out efficiently. As we are interested in

models able to simultaneously handle discrete and continuous variables, we think

that the approach that best meets these requirements is the MTE model.

3.4 Regression based on the MTE model

Once we know how Bayesian networks can be applied to solve regression problems

and also that MTEs are an appropriate tool, from now on we concentrate on two tasks:

1. Determining the structure of the network (except for NB).

2. Estimating the MTE densities corresponding to the obtained structure.

Example 5. Consider the next regression model with naıve Bayes structure,

where the dependent variable is X, and the independent ones are Y and Z, where

Y is continuous and Z is discrete:

[Graph: X is the root node, with arcs X → Y and X → Z.]

One example of conditional densities for this regression model is given by the

following expressions:

f(x) =
\begin{cases}
1.16 - 1.12 e^{-0.02x} & \text{if } 0.4 \le x < 4, \\
0.9 e^{-0.35x} & \text{if } 4 \le x < 19.
\end{cases} \qquad (3.5)


f(y \mid x) =
\begin{cases}
1.26 - 1.15 e^{0.006y} & \text{if } 0.4 \le x < 5,\; 0 \le y < 13, \\
1.18 - 1.16 e^{0.0002y} & \text{if } 0.4 \le x < 5,\; 13 \le y < 43, \\
0.07 - 0.03 e^{-0.4y} + 0.0001 e^{0.0004y} & \text{if } 5 \le x < 19,\; 0 \le y < 5, \\
-0.99 + 1.03 e^{0.001y} & \text{if } 5 \le x < 19,\; 5 \le y < 43.
\end{cases} \qquad (3.6)

f(z \mid x) =
\begin{cases}
0.3 & \text{if } z = 0,\; 0.4 \le x < 5, \\
0.7 & \text{if } z = 1,\; 0.4 \le x < 5, \\
0.6 & \text{if } z = 0,\; 5 \le x < 19, \\
0.4 & \text{if } z = 1,\; 5 \le x < 19.
\end{cases} \qquad (3.7)

In this chapter we follow the approach in [111], where a 5-parameter MTE is

fitted for each split of the support of the variable, which means that in each split

there will be 5 parameters to be estimated from data:

f(x) = a_0 + a_1 e^{a_2 x} + a_3 e^{a_4 x}, \qquad \alpha < x < \beta. \qquad (3.8)

The reason to use the 5-parameter MTE is that it has shown its ability to fit

the most common distributions accurately, while the model complexity and the

number of parameters to estimate is low [33]. The estimation procedure is based

on least squares and is described in [128, 135].
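As a rough illustration of this estimation scheme, the sketch below fits the five parameters of Equation (3.8) by least squares against a normalised histogram of the data falling in one split, using scipy. The actual procedure of [128, 135] differs in its details (how the empirical target values are built and how the exponents are initialised), so this should be read as an approximation under those assumptions, not as the thesis implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

def mte5(x, a0, a1, a2, a3, a4):
    # 5-parameter MTE of Equation (3.8): f(x) = a0 + a1*exp(a2*x) + a3*exp(a4*x)
    return a0 + a1 * np.exp(a2 * x) + a3 * np.exp(a4 * x)

def fit_mte5(sample, alpha, beta, bins=20):
    """Least-squares fit of a 5-parameter MTE on the split (alpha, beta).

    The empirical density is summarised by a normalised histogram, and
    curve_fit minimises the squared error between mte5 and the bin heights.
    """
    sample = sample[(sample >= alpha) & (sample < beta)]
    heights, edges = np.histogram(sample, bins=bins, range=(alpha, beta), density=True)
    midpoints = 0.5 * (edges[:-1] + edges[1:])
    p0 = [0.0, 0.5, -0.5, 0.1, -1.0]              # crude starting point for the optimiser
    params, _ = curve_fit(mte5, midpoints, heights, p0=p0, maxfev=20000)
    return params

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.exponential(scale=2.0, size=5000)   # toy data for one split
    print(fit_mte5(data, alpha=0.0, beta=6.0))
```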

The general procedure for obtaining a regression model is, therefore, to fix

one of the structures mentioned in Section 3.2 and to estimate the correspond-

ing conditional distributions using 5-parameter MTEs. Once the model is con-

structed, it can be used to predict the value of the dependent variable given that

the features are observed. The forecasting is carried out by computing the pos-

terior distribution of the dependent variable given the observed values for the

features. A numerical prediction for the class value will be obtained from the

posterior distribution, through its mean or its median. The choice between them

is problem-dependent. A situation in which the median can be more robust

is when the training data contains outliers, and therefore the mean can be very


biased towards the outliers. The posterior distribution will be computed using

the Variable Elimination algorithm for MTEs.

The expected value of a random variable X with a density defined as in

Equation (3.8) is computed as

E[X] = \int_{-\infty}^{\infty} x f(x)\, dx = \int_{\alpha}^{\beta} x \left(a_0 + a_1 e^{a_2 x} + a_3 e^{a_4 x}\right) dx

= a_0 \frac{\beta^2 - \alpha^2}{2} + \frac{a_1}{a_2^2}\left((a_2\beta - 1)e^{a_2\beta} - (a_2\alpha - 1)e^{a_2\alpha}\right) + \frac{a_3}{a_4^2}\left((a_4\beta - 1)e^{a_4\beta} - (a_4\alpha - 1)e^{a_4\alpha}\right).

If the density is defined by parts, the expected value is the sum of the expression above over each one of the parts.
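The closed-form expression above translates directly into code. The sketch below evaluates the contribution of one split (α, β) and sums over the parts of a piecewise density; it assumes a2 ≠ 0 and a4 ≠ 0, and checks the formula numerically on the first piece of Equation (3.5).

```python
import math
from scipy.integrate import quad

def piece_mean_contribution(alpha, beta, a0, a1, a2, a3, a4):
    """Contribution of one split (alpha, beta) to E[X] for a 5-parameter MTE.

    Assumes a2 != 0 and a4 != 0 (otherwise the corresponding term is linear
    in x and must be integrated separately).
    """
    term0 = a0 * (beta ** 2 - alpha ** 2) / 2.0
    term1 = (a1 / a2 ** 2) * ((a2 * beta - 1) * math.exp(a2 * beta)
                              - (a2 * alpha - 1) * math.exp(a2 * alpha))
    term2 = (a3 / a4 ** 2) * ((a4 * beta - 1) * math.exp(a4 * beta)
                              - (a4 * alpha - 1) * math.exp(a4 * alpha))
    return term0 + term1 + term2

def mte_expectation(pieces):
    """pieces: iterable of tuples (alpha, beta, a0, a1, a2, a3, a4)."""
    return sum(piece_mean_contribution(*p) for p in pieces)

if __name__ == "__main__":
    # First piece of Equation (3.5): 1.16 - 1.12*exp(-0.02*x) on [0.4, 4), with a3 = 0.
    piece = (0.4, 4.0, 1.16, -1.12, -0.02, 0.0, 1.0)
    closed_form = mte_expectation([piece])
    numeric, _ = quad(lambda x: x * (1.16 - 1.12 * math.exp(-0.02 * x)), 0.4, 4.0)
    print(closed_form, numeric)                    # both values are about 0.78
```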

The expression of the median, however, cannot be obtained in closed form,

since the corresponding distribution function cannot be inverted. Instead, it is computed numerically using the search procedure described in [111] and shown in Algorithm 1, which approximates the median with an error lower than 10^{-3} in terms of probability. The input for the algorithm is the n-part density function, i.e.,

f(x) = f_i(x), \qquad \alpha_i < x < \beta_i, \quad i = 1, \ldots, n,

where each f_i is defined as in Equation (3.8) and the intervals (\alpha_i, \beta_i) form a partition of the domain of X such that \alpha_{i+1} = \beta_i.

3.5 Filtering the independent variables

It is a well known fact in classification and regression that, in general, it is not true

that including more variables increases the accuracy of the model. It can happen

that some variables are not informative for the prediction task and therefore

including them in the model introduces noise into the predictor. Also, unnecessary variables cause an increase in the number of parameters that need to be determined

from data.


Algorithm 1: Median of a density function

Input: A density f over the interval (α, β).
Output: An estimation of the median of a random variable with density f, with error lower than 10^{-3} in terms of probability.

 1  found := false.
 2  accum := 0.0.
 3  i := 0.
 4  while (found == false and i ≤ n) do
 5      m := ∫_{α_i}^{β_i} f_i(x) dx.
 6      if (accum + m) ≥ 0.5 then
 7          found := true.
 8      else
 9          i := i + 1.
10          accum := accum + m.
11      end
12  end
13  max := β_i.
14  min := α_i.
15  found := false.
16  while (found == false) do
17      mid := (max + min)/2.
18      p := accum + ∫_{min}^{mid} f_i(x) dx.
19      if ⌊0.5 · 1000⌋ == ⌊p · 1000⌋ then
20          found := true.
21      else
22          if p > 0.5 then
23              max := mid.
24          else
25              min := mid.
26          end
27      end
28  end
29  return mid.
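A possible transcription of Algorithm 1 into Python is sketched below. It assumes the density is supplied as a list of (α_i, β_i, f_i) triples, with the integrals computed numerically by scipy; for simplicity, the cumulative probability in the bisection phase is recomputed from the lower end of the selected piece at every step.

```python
from scipy.integrate import quad

def mte_median(pieces, tol_digits=3):
    """Approximate the median of a piecewise density, in the spirit of Algorithm 1.

    pieces : list of (alpha_i, beta_i, f_i), with f_i a callable density on (alpha_i, beta_i).
    The returned value has an error below 10**(-tol_digits) in terms of probability.
    """
    scale = 10 ** tol_digits
    accum = 0.0
    # Phase 1: locate the piece in which the cumulative probability reaches 0.5.
    for alpha, beta, f in pieces:
        mass, _ = quad(f, alpha, beta)
        if accum + mass >= 0.5:
            break
        accum += mass
    # Phase 2: bisection search inside that piece.
    lo, hi = alpha, beta
    for _ in range(200):                        # far more iterations than needed for the tolerance
        mid = 0.5 * (lo + hi)
        p = accum + quad(f, alpha, mid)[0]      # cumulative probability up to mid
        if int(0.5 * scale) == int(p * scale):
            return mid
        if p > 0.5:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    pieces = [(0.0, 1.0, lambda x: 0.25), (1.0, 2.0, lambda x: 0.75)]
    print(mte_median(pieces))                   # approximately 1.334
```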


There are different approaches to the problem of selecting variables in regres-

sion and classification problems:

• The filter approach, which in its simplest formulation consists of estab-

lishing a ranking of the variables according to some measure of relevance

with respect to the class variable, usually called a filter measure. Then, a threshold

for the ranking is selected and variables below that threshold are discarded.

• The wrapper approach proceeds by constructing several models with different sets of feature variables, and finally the model with the highest accuracy is selected.

• The filter-wrapper approach [130] is a mixture of the former ones. First

of all, the variables are ordered using a filter measure and then they are

incrementally included or excluded from the model according to that order,

so that a variable is included whenever it increases the accuracy of the

model.

The selection of the independent variables to be included in the model was

addressed in [111] following a filter-wrapper approach. First, the independent

variables are ordered according to the mutual information with respect to the

class and then they are included in the model one by one according to the initial

ranking, whenever the inclusion of a new variable increases the accuracy of the

preceding model. The accuracy of the model is measured by the root mean

squared error between the actual values of the dependent variable and those ones

predicted by the model for the records in a test database. If we call ŷ1, . . . , ŷn the corresponding estimates provided by the model for the actual values y1, . . . , yn, the root mean squared error is obtained as [158]

rmse = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}. \qquad (3.9)

The mutual information between two random variables X and Y is defined as

I(X, Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{XY}(x, y) \log_2 \frac{f_{XY}(x, y)}{f_X(x) f_Y(y)}\, dx\, dy, \qquad (3.10)


where fXY is the joint density for X and Y , fX is the marginal density for X and

fY is the marginal for Y . The mutual information has been successfully applied

as filter measure in classification problems with continuous features [119].

In the case of using MTEs, the computation of the integral in Equation (3.10)

cannot be obtained in closed form. We will therefore use the estimation procedure

proposed in [111], which is based on the estimator

\hat{I}(X, Y) = \frac{1}{m}\sum_{i=1}^{m}\left( \log_2 f_{X\mid Y}(X_i \mid Y_i) - \log_2 f_X(X_i) \right), \qquad (3.11)

for a sample of size m, (X1, Y1), . . . , (Xm, Ym), drawn from fXY .
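For illustration, the estimator in Equation (3.11) can be coded as follows. The sampler and density functions are placeholders for whatever model provides them (in this chapter, the learnt MTE densities); the toy check uses a bivariate Gaussian, for which the true mutual information is known in closed form.

```python
import numpy as np

def mi_estimate(sampler, cond_x_given_y, marg_x, m=4000, rng=None):
    """Monte Carlo estimator of I(X, Y) as in Equation (3.11).

    sampler(m, rng)       -> arrays (xs, ys) with m draws from f_XY
    cond_x_given_y(x, y)  -> f(x | y)
    marg_x(x)             -> f(x)
    """
    rng = rng or np.random.default_rng()
    xs, ys = sampler(m, rng)
    return float(np.mean(np.log2(cond_x_given_y(xs, ys)) - np.log2(marg_x(xs))))

if __name__ == "__main__":
    rho = 0.8                                   # correlation of a toy bivariate Gaussian
    def sampler(m, rng):
        xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=m)
        return xy[:, 0], xy[:, 1]
    norm_pdf = lambda v, mu, s: np.exp(-0.5 * ((v - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    cond = lambda x, y: norm_pdf(x, rho * y, np.sqrt(1 - rho ** 2))   # X | Y = y
    marg = lambda x: norm_pdf(x, 0.0, 1.0)
    print(mi_estimate(sampler, cond, marg))     # close to -0.5*log2(1-rho^2) ≈ 0.737
```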

3.6 The naıve Bayes model for regression

This regression model was proposed in [111]. We describe it here in detail, as

it is the basis of our proposals. The task of estimating this model from data is

simplified in the sense that we do not have to care about the structure of the

underlying network, since it is fixed beforehand, as in Figure 3.1. The detailed

steps for its construction can be found in Algorithm 2.

Algorithm 2: MTE-NB regression model

Input: A database D with variables X1, . . . , Xn, Y.
Output: A NB model with root variable Y and features X1, . . . , Xn, with joint distribution of class MTE.

1  Construct a new network G with nodes Y, X1, . . . , Xn.
2  Insert the links Y → Xi, i = 1, . . . , n in G.
3  Estimate an MTE density for Y, and a conditional MTE density for each Xi, i = 1, . . . , n given Y.
4  Let P be the set of estimated densities.
5  Let NB be a Bayesian network with structure G and distributions P.
6  return NB.

This procedure includes all the available independent variables in the model.

The version in which the independent variables are filtered and selected is called

selective. The variable selection procedure described in Section 3.5 is illustrated

in Figure 3.5 for the NB model. The steps for its construction are presented in


Algorithm 3. The main idea is to start with a model containing the class variable

and one feature variable, namely the node with the highest mutual information with

the dependent variable (i.e. Y and X(1)). Afterwards, the rest of the variables are

included in the model in sequence, according to their mutual information with

Y . In each step, if the included variable increases the accuracy of the model, it

is kept. Otherwise, it is discarded.

Figure 3.5: Example of selecting the independent variables in a naïve Bayes regression model. The features, ordered by I(Xi, Y) as X2, X3, X1, X4, are tried in turn starting from the model with Y and X2 (rmse = 0.15): X3 is kept because it lowers the rmse to 0.14, whereas X1 (rmse = 0.145) and X4 (rmse = 0.143) are discarded, yielding the final model with features X2 and X3 (rmse = 0.14).

3.7 The tree augmented naïve Bayes regression model

For the construction of the TAN model we must take into account the dependence

structure among the features. The goal is to find a tree structure containing

them [58], so that the links of the tree connect the variables with the highest degree

of dependence. This task can be solved using a variation of the method proposed

in [22]. The idea is to start with a fully connected graph, labelling the links with

the conditional mutual information between the connected features given the in-


Algorithm 3: Selective MTE-NB regression model

Input: Variables X1, . . . , Xn, Y and a database D for variables X1, . . . , Xn, Y.
Output: Selective NB predictor for the variable Y.

 1  for i := 1 to n do
 2      Compute I(Xi, Y).
 3  end
 4  Let X(1), . . . , X(n) be a decreasing order of the feature variables according to I(X(i), Y).
 5  Divide the database D into two sets, one for learning the model (Dl) and the other for testing its accuracy (Dt).
 6  Using Algorithm 2, construct a NB model M with variables Y and X(1) from database Dl.
 7  Let rmse(M) be the estimated accuracy of model M using Dt.
 8  for i := 2 to n do
 9      Let M1 be the NB predictor obtained from Algorithm 2 for the variables of M and X(i).
10      Let rmse(M1) be the estimated accuracy of model M1 using Dt.
11      if rmse(M1) ≤ rmse(M) then
12          M := M1.
13      end
14  end
15  return M.

dependent variable (Y ). Afterwards, the tree structure is obtained by computing

the maximum spanning tree of the initial labelled graph (see Algorithm 4). The

weights used in Step 1 of this algorithm correspond to the conditional mutual

information between the linked variables given the dependent variable.

The conditional mutual information between two continuous features Xi and

Xj given Y is

I(X_i, X_j \mid Y) = \iiint f(x_i, x_j, y) \log \frac{f(x_i, x_j \mid y)}{f(x_i \mid y) f(x_j \mid y)}\, dx_i\, dx_j\, dy. \qquad (3.12)

The computation of the conditional mutual information defined in Equa-

tion (3.12) has been addressed for the Conditional Gaussian model [119], but

only in classification contexts, i.e., with variable Y being discrete. For MTEs,


the integral in Equation (3.12) cannot be computed in closed form. Therefore,

we propose to estimate it in a similar way as in [111] for the marginal mutual

information. Our proposal is based on the estimator given in the next proposition.

Proposition 1. Let X_i, X_j and Y be continuous random variables with joint MTE density f(x_i, x_j, y). Let (X_i^{(1)}, X_j^{(1)}, Y^{(1)}), \ldots, (X_i^{(m)}, X_j^{(m)}, Y^{(m)}) be a sample of size m drawn from f(x_i, x_j, y). Then,

\hat{I}(X_i, X_j \mid Y) = \frac{1}{m}\sum_{k=1}^{m}\left( \log f(X_i^{(k)} \mid X_j^{(k)}, Y^{(k)}) - \log f(X_i^{(k)} \mid Y^{(k)}) \right) \qquad (3.13)

is an unbiased estimator of I(X_i, X_j \mid Y).

Proof.

E[\hat{I}(X_i, X_j \mid Y)] = E\left[ \frac{1}{m}\sum_{k=1}^{m}\left( \log f(X_i^{(k)} \mid X_j^{(k)}, Y^{(k)}) - \log f(X_i^{(k)} \mid Y^{(k)}) \right) \right]

= E[\log f(X_i \mid X_j, Y)] - E[\log f(X_i \mid Y)]

= E[\log f(X_i \mid X_j, Y) - \log f(X_i \mid Y)] = E\left[ \log \frac{f(X_i \mid X_j, Y)}{f(X_i \mid Y)} \right]

= \iiint f(x_i, x_j, y) \log \frac{f(x_i \mid x_j, y)}{f(x_i \mid y)}\, dx_i\, dx_j\, dy

= \iiint f(x_i, x_j, y) \log \frac{f(x_i \mid x_j, y) f(x_j \mid y)}{f(x_i \mid y) f(x_j \mid y)}\, dx_i\, dx_j\, dy

= \iiint f(x_i, x_j, y) \log \frac{f(x_i, x_j \mid y)}{f(x_i \mid y) f(x_j \mid y)}\, dx_i\, dx_j\, dy

= I(X_i, X_j \mid Y).

Proposition 1 can be extended to the case in which Xi or Xj is discrete by replacing the corresponding integral by a summation.

Therefore, the procedure for estimating the conditional mutual information

consists of getting a sample from f(xi, xj, y) and evaluating Equation (3.13).

Sampling from an MTE density is described in [134].
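The conditional estimator of Equation (3.13) admits the same treatment. In the sketch below, the two conditional densities are placeholder callables that would be obtained from the learnt MTE model and evaluated on a sample drawn from f(x_i, x_j, y); the second function is a hypothetical helper showing how such estimates would label every pair of features before the maximum spanning tree of Algorithm 5 is computed.

```python
import numpy as np

def cmi_estimate(sample, cond_full, cond_y):
    """Monte Carlo estimate of I(X_i, X_j | Y), following Equation (3.13).

    sample    : array of shape (m, 3) whose rows are draws (x_i, x_j, y) from f(x_i, x_j, y)
    cond_full : callable, cond_full(x_i, x_j, y) -> f(x_i | x_j, y)
    cond_y    : callable, cond_y(x_i, y)         -> f(x_i | y)
    """
    xi, xj, y = sample[:, 0], sample[:, 1], sample[:, 2]
    return float(np.mean(np.log(cond_full(xi, xj, y)) - np.log(cond_y(xi, y))))

def cmi_weight_matrix(n_features, data_for_pair):
    """Hypothetical helper: pairwise weights for the complete graph used in Algorithm 5.

    data_for_pair(i, j) must return the (sample, cond_full, cond_y) triple for features i and j.
    """
    w = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(i + 1, n_features):
            w[i, j] = w[j, i] = cmi_estimate(*data_for_pair(i, j))
    return w
```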

As we are not using the exact value of the mutual information, but an esti-

mation, it is interesting to get some clue about the sample size m that should be


used to obtain a given accuracy. Assume, for instance, that we want to estimate

I with an error lower than ε > 0 with probability not lower than δ, that is,

P\{|\hat{I} - I| < \epsilon\} \ge \delta.

Using this expression, we can find a bound for m using Tchebyshev’s inequality:

P\{|\hat{I} - I| < \epsilon\} \ge 1 - \frac{\mathrm{Var}(\hat{I})}{\epsilon^2} \ge \delta. \qquad (3.14)

We only need to relate m with Var(\hat{I}). It holds that

\mathrm{Var}(\hat{I}) = \frac{1}{m}\mathrm{Var}\left( \log f(X_i \mid X_j, Y) - \log f(X_i \mid Y) \right). \qquad (3.15)

Therefore, Var(\hat{I}) depends on the discrepancy between f(X_i \mid X_j, Y) and f(X_i \mid Y). As these distributions are unknown beforehand, we will compute a bound for m using two fixed distributions with very different shapes, in order to simulate a case of extreme discrepancy. If we choose f(x_i \mid x_j, y) = a e^{-x_i} and f(x_i \mid y) = b e^{x_i}, with \alpha < x_i < \beta and a, b normalisation constants, we find that

\mathrm{Var}(\hat{I}) = \frac{1}{m}\mathrm{Var}\left( \log a e^{-X_i} - \log b e^{X_i} \right) = \frac{1}{m}\mathrm{Var}\left( (\log a) - X_i - (\log b) - X_i \right) = \frac{1}{m}\mathrm{Var}(-2X_i) = \frac{4}{m}\mathrm{Var}(X_i).

Plugging this into Equation (3.14), we obtain

1 - \frac{\mathrm{Var}(\hat{I})}{\epsilon^2} \ge \delta \;\Rightarrow\; 1 - \frac{4\mathrm{Var}(X_i)}{m\epsilon^2} \ge \delta \;\Rightarrow\; m \ge \frac{4\mathrm{Var}(X_i)}{(1-\delta)\epsilon^2}. \qquad (3.16)

If we assume, for instance, that Xi follows a normal distribution with mean 0

and standard deviation 1, and fix δ = 0.9 and ε = 0.1, we obtain

m \ge \frac{4}{0.1 \times 0.01} = 4000. \qquad (3.17)


Thus, in all the experiments described in this chapter we have used a sample of

size m = 4000.

The steps to construct a TAN model [51] are described in Algorithm 5. All

the independent variables are included in this model. Here we propose to improve

it by introducing a variable selection scheme analogous to the one used for the

selective naıve Bayes in Section 3.6. The selective TAN model is computed by

Algorithm 6.

Algorithm 4: Maximum Spanning Tree (based on Kruskal's algorithm)

Input: A graph G = (V, E), in which V is the set of vertices and E is the set of links.
Output: Maximum spanning tree T.

1  Order the links of E decreasingly by their weight.
2  Let A be a set of links, initially empty.
3  T := (V, A).
4  for i := 0 to n − 2 do
5      Add the i-th link (u, v) ∈ E to the set A iff it does not cause a cycle in T.
6  end
7  return T.
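Algorithm 4 is essentially Kruskal's algorithm with the links sorted in decreasing order of weight. A compact sketch using a union-find structure for the cycle test is given below; representing links as (weight, u, v) triples is an assumption of this illustration. A variant that stops once k links have been accepted yields a maximum spanning forest in the spirit of Algorithm 7.

```python
def maximum_spanning_tree(n_vertices, links):
    """Kruskal-style maximum spanning tree.

    links : list of (weight, u, v) with 0 <= u, v < n_vertices.
    Returns the selected (u, v) links (n_vertices - 1 of them if the graph is connected).
    """
    parent = list(range(n_vertices))

    def find(a):                                  # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for w, u, v in sorted(links, reverse=True):   # decreasing weight, as in Step 1
        ru, rv = find(u), find(v)
        if ru != rv:                              # the link does not create a cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree

if __name__ == "__main__":
    links = [(0.9, 0, 1), (0.2, 0, 2), (0.7, 1, 2), (0.4, 2, 3)]
    print(maximum_spanning_tree(4, links))        # [(0, 1), (1, 2), (2, 3)]
```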

3.8 The forest augmented naïve Bayes regression model

We will consider in this section how to construct a regression model following the

FAN methodology [96] (see Algorithm 8). The first step is to create a maximum

spanning forest following Algorithm 7, with an input parameter k that represents

the number of links included among the features. The construction of each tree

inside the forest is carried out in a similar way to the TAN construction, i.e.,

selecting a random root variable and directing the links visiting all the nodes. An

example can be found in Figure 3.3.

An important detail in the construction of a FAN model is the optimal selection of k. Usually, a low value of k means good efficiency in the parameter learning but worse accuracy in the model, while a high value of k works the other way


Algorithm 5: MTE-TAN regression model

Input: A database D with variables X1, . . . , Xn, Y.
Output: A TAN model with root variable Y and features X1, . . . , Xn, with joint distribution of class MTE.

 1  Construct a complete graph C with nodes X1, . . . , Xn.
 2  Label each link (Xi, Xj) with the conditional mutual information between Xi and Xj given Y, i.e., I(Xi, Xj | Y).
 3  Let T be the maximum spanning tree obtained from C using Algorithm 4.
 4  Direct the links in T in such a way that no node has more than one parent.
 5  Construct a new network G with nodes Y, X1, . . . , Xn and the same links as T.
 6  Insert the links Y → Xi, i = 1, . . . , n in G.
 7  Estimate an MTE density for Y, and a conditional MTE density for each Xi, i = 1, . . . , n given its parents in G.
 8  Let P be the set of estimated densities.
 9  Let TAN be a Bayesian network with structure G and distributions P.
10  return TAN.

round. In the experiments the value of k has been set to k = ⌊n/2⌋, where n is

the number of features.

As in the construction of the TAN, the links are labeled with the conditional

mutual information when computing the maximum spanning forest.

The selective version of the FAN regression model corresponds to Algorithm 9.

The procedure is totally analogous to the selective TAN.

3.9 Regression model based on kDB structure

Our last proposal consists of extending the kDB structure, already known in clas-

sification contexts [136], to regression problems. The kDB structure is obtained

by forcing the features to form a directed acyclic graph where each variable has at most k parents besides the class. An example can be found in Figure 3.4. The

method proposed by Sahami [136] to obtain such a structure ranks the features

according to their mutual information with respect to the dependent variable.

Then, the variables are inserted in the directed acyclic graph following that rank.


Algorithm 6: Selective MTE-TAN regression model

Input: Variables X1, . . . , Xn, Y and a database D for variables X1, . . . , Xn, Y.
Output: Selective TAN regression model for variable Y.

 1  for i := 1 to n do
 2      Compute I(Xi, Y).
 3  end
 4  Let X(1), . . . , X(n) be a decreasing order of the independent variables according to I(X(i), Y).
 5  Divide D into two sets, one for learning the model (Dl) and the other for testing its accuracy (Dt).
 6  Using Algorithm 5, construct a TAN model M with variables Y and X(1) from database Dl.
 7  Let rmse(M) be the estimated accuracy of model M using Dt.
 8  for i := 2 to n do
 9      Let M1 be the TAN predictor obtained from Algorithm 5 for the variables in M and X(i).
10      Let rmse(M1) be the estimated accuracy of model M1 using Dt.
11      if rmse(M1) ≤ rmse(M) then
12          M := M1.
13      end
14  end
15  return M.

The parents of a new variable are selected among those variables already included

in the graph, so that the k variables with the highest conditional mutual information

with the new variable given the dependent one are chosen.

The regression model we propose here is constructed in an analogous way,

but estimating the mutual information and the conditional mutual information

as described in Section 3.7. The details can be found in Algorithm 10. Note that

the complexity of constructing a kDB model is much higher than the complexity

for NB, TAN and FAN. The complexity is in the selection of the parent of a

variable and also in the estimation of the parameters, since their number is much

higher than in the other cases. For this reason, we do not propose a selective

version of this regression model, since the selection scheme used for the other

models is too costly from a computational point of view.


Algorithm 7: Maximum Spanning Forest (based on Kruskal's algorithm)

Input: A graph G = (V, E), in which V is the set of vertices and E is the set of links. An integer value k that represents the number of links that the maximum spanning forest will contain.
Output: Maximum spanning forest F.

1  Order the links of E decreasingly by their weight.
2  Let A be a set of links, initially empty.
3  F := (V, A).
4  for i := 0 to k − 1 do
5      Add the i-th link (u, v) ∈ E to the set A if it does not cause a cycle in F.
6  end
7  return F.

Algorithm 8: MTE-FAN regression model

Input: A database D with variables X1, . . . , Xn, Y and an integer value k ∈ [1, n − 2] that represents the number of links that the maximum spanning forest will contain.
Output: A FAN model with root variable Y and features X1, . . . , Xn, with joint distribution of class MTE.

 1  Construct a complete graph C with nodes X1, . . . , Xn.
 2  Label each link (Xi, Xj) with the estimated conditional mutual information between Xi and Xj given Y, i.e., I(Xi, Xj | Y).
 3  Let F be the maximum spanning forest obtained from C using Algorithm 7 and with exactly k links.
 4  For each connected component Fi in the forest, select a random root and direct its links, constructing a tree.
 5  Construct a new network G with nodes Y, X1, . . . , Xn and the links computed in each connected component Fi.
 6  Insert the links Y → Xi, i = 1, . . . , n in G.
 7  Estimate an MTE density for Y, and a conditional MTE density for each Xi, i = 1, . . . , n given its parents in G.
 8  Let P be the set of estimated densities.
 9  Let FAN be a Bayesian network with structure G and distributions P.
10  return FAN.


Algorithm 9: Selective MTE-FAN regression model

Input: Variables X1, . . . , Xn, Y and a database D for variables X1, . . . , Xn, Y.
Output: Selective FAN predictor for the variable Y.

 1  for i := 1 to n do
 2      Compute I(Xi, Y).
 3  end
 4  Let X(1), . . . , X(n) be a decreasing order of the independent variables according to I(X(i), Y).
 5  Divide D into two sets, one for learning the model (Dl) and the other for testing its accuracy (Dt).
 6  Using Algorithm 8, construct a FAN model M with variables Y and X(1) from database Dl.
 7  Let rmse(M) be the estimated accuracy of model M using Dt.
 8  for i := 2 to n do
 9      Let M1 be the FAN predictor obtained from Algorithm 8 for the variables in M and X(i).
10      Let rmse(M1) be the estimated accuracy of model M1 using Dt.
11      if rmse(M1) ≤ rmse(M) then
12          M := M1.
13      end
14  end
15  return M.


Algorithm 10: MTE-kDB regression model

Input: Variables X1, . . . , Xn, Y and a database D for them. An integer value k, which is the maximum number of parents allowed.
Output: kDB regression model for variable Y, with a joint distribution of class MTE.

 1  Let G be a graph with nodes Y, X1, . . . , Xn and an empty set of arcs.
 2  Let X(1), . . . , X(n) be a decreasing order of the independent variables according to I(X(i), Y).
 3  S := {X(1)}.
 4  for i := 2 to n do
 5      S := S ∪ {X(i)}.
 6      for j := 1 to k do
 7          Select the variable Z ∈ S \ pa(X(i)) with the highest I(X(i), Z | Y).
 8          Add the link Z → X(i) to G.
 9      end
10  end
11  Insert the links Y → Xi, i = 1, . . . , n in G.
12  Estimate an MTE density for Y, and a conditional MTE density for each Xi, i = 1, . . . , n given its parents in G.
13  Let P be the set of estimated densities.
14  Let kDB be a Bayesian network with structure G and distributions P.
15  return kDB.
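The structural part of Algorithm 10 (ranking the features and choosing, for each newly inserted feature, its strongest parents among the features already in the graph) can be sketched as follows. The mi and cmi arguments stand for the Monte Carlo estimators of Section 3.7 and must be supplied by the caller; selecting the k best candidates at once amounts to the inner loop of the algorithm.

```python
def kdb_structure(features, k, mi, cmi):
    """Return a feature-to-parents map of a kDB structure (class parents are implicit).

    features : list of feature identifiers
    k        : maximum number of feature parents per feature (besides the class)
    mi(x)    : estimate of I(X, Y)
    cmi(x,z) : estimate of I(X, Z | Y)
    """
    order = sorted(features, key=mi, reverse=True)        # rank by mutual information with Y
    parents = {order[0]: []}
    inserted = [order[0]]
    for x in order[1:]:
        candidates = sorted(inserted, key=lambda z: cmi(x, z), reverse=True)
        parents[x] = candidates[:k]                       # up to k strongest parents
        inserted.append(x)
    return parents

if __name__ == "__main__":
    # Toy scores standing in for the Monte Carlo estimates (purely illustrative).
    mi_scores = {"X1": 0.9, "X2": 0.6, "X3": 0.4}
    cmi_scores = {frozenset({"X1", "X2"}): 0.5, frozenset({"X1", "X3"}): 0.1,
                  frozenset({"X2", "X3"}): 0.3}
    print(kdb_structure(["X1", "X2", "X3"], k=1,
                        mi=lambda x: mi_scores[x],
                        cmi=lambda x, z: cmi_scores[frozenset({x, z})]))
    # {'X1': [], 'X2': ['X1'], 'X3': ['X2']}
```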


3.10 Experimental evaluation

We have implemented all the models proposed in this chapter in the Elvira plat-

form [34]¹. For testing the models we have chosen a set of benchmark databases

borrowed from the UCI [11] and StatLib [149] repositories. A description of the

used databases can be found in Table 3.1. In all the cases, we have considered the

options of predicting with the mean and the median of the posterior distribution

of the dependent variable. Regarding the kDB model, we have restricted the

experiments to k = 2.

It was shown in [111] that the selective naıve Bayes regression model was

competitive with what was so far considered the state of the art in regression problems

using graphical models, namely the so-called M5’ Algorithm [155]. The M5’ algo-

rithm [155] is an improved version of the model tree introduced by Quinlan [125].

The model tree is basically a decision tree where the leaves contain a regression

model rather than a single value, and the splitting criterion uses the variance of

the values in the database corresponding to each node rather than the information

gain. We chose the M5’ algorithm because it was the state-of-the-art in graphical

models for regression [57], before the introduction of MTEs for regression [111].

Therefore, we have compared the new models with the NB and the M5’ methods.

For the M5’ we have used the implementation in the Weka software [158].

The results of the comparison are shown in Tables 3.2 and 3.3, where the values

displayed correspond to the root mean squared error for each one of the tested

models in the different databases, computed through 10-fold cross validation [150].

The values marked with an asterisk (*) represent the best result obtained among all the models for each database, while those marked with a dagger (†) are the worst.

We have used Friedman’s test [42] to compare the experimental results, finding that there are no significant differences among the analysed algorithms (p-value of 0.9961). However, a more detailed analysis of the results in Tables 3.2 and 3.3 shows how, in general, the simpler models (NB, TAN, FAN) perform better than the more complex one (2DB). We believe that this is due to the increase in the number of parameters to estimate from data. Also,

the selective versions are usually more accurate than the models including all

¹ Available at http://leo.ugr.es/elvira


Database         # records   # continuous vars.   # discrete vars.
abalone          4176        8                    1
bodyfat          251         15                   0
boston housing   452         11                   2
cloud            107         6                    2
disclosure       661         4                    0
halloffame       1340        12                   2
mte50            50          3                    1
pollution        59          16                   0
strikes          624         6                    1

Table 3.1: A description of the databases used in the experiments.

Model           abalone   bodyfat    boston housing   cloud
NB(mean)        2.8188    6.5564     6.2449           0.5559
NB(median)      2.6184    6.5880     6.1728           0.5776
SNB(mean)       2.5307    5.1977     4.6903           0.5144
SNB(median)     2.4396    5.2420     4.7857           0.5503
TAN(mean)       2.5165    5.7095     6.9826           0.5838
TAN(median)     2.4382    5.8259     6.8512           0.6199
STAN(mean)      2.4197    4.5885     4.5601           0.5382
STAN(median)    2.3666    4.5820*    4.3853           0.5656
FAN(mean)       2.6069    6.0681     6.5530           0.5939
FAN(median)     2.4908    6.1915     6.5455           0.5957
SFAN(mean)      2.4836    4.9123     4.2955           0.5253
SFAN(median)    2.4037    4.9646     4.3476           0.5623
2DB(mean)       3.1348†   8.2358     8.4179†          1.0448†
2DB(median)     3.0993    8.2459     8.3241           1.0315
M5’             2.1296*   23.3525†   4.1475*          0.3764*

Table 3.2: Results I of the experiments with the proposed regression models in terms of rmse.

the variables. The behaviour of the SNB model is remarkable, as it obtains the best results in two experiments and is never the worst. M5’ also shows a very good performance, obtaining the best results in 5 databases, though in one it is the worst model.

3.11 Conclusions

In this chapter we have analysed the performance of well-known Bayesian network classifiers when applied to regression problems. The experimental analysis


Model           disclosure     halloffame   mte50     pollution   strikes
NB(mean)        196121.4232    186.3826     1.8695    43.0257     503.5635
NB(median)      792717.1428    187.6512     2.0224    44.0839     561.4105
SNB(mean)       93068.5218     160.2282     1.5798    31.1527*    438.0894*
SNB(median)     797448.1824    170.3880     1.6564    31.2055     582.9391
TAN(mean)       250788.2272    165.3717     2.6251    48.4293     571.9346
TAN(median)     796471.2998    166.5050     2.6718    48.8619     584.9271
STAN(mean)      108822.7386    147.2882     1.5635*   38.1018     447.2259
STAN(median)    794381.9068    153.1241     1.6458    36.8456     596.6292
FAN(mean)       228881.7789    179.8712     2.0990    43.9893     525.9861
FAN(median)     798458.2283†   181.8235     2.2037    44.2489     560.7986
SFAN(mean)      97936.0322     150.0201     1.5742    36.3944     449.8657
SFAN(median)    794825.1882    157.6618     1.6707    35.6820     597.7038
2DB(mean)       23981.0983     293.8374     2.7460    58.5873     516.2400
2DB(median)     793706.7221    301.4728†    2.7598†   58.9876†    715.0236†
M5’             23728.8983*    35.3697*     2.4718    46.8086     509.1756

Table 3.3: Results II of the experiments with the proposed regression models in terms of rmse.

shows that all the models considered are comparable in terms of accuracy, even

with the very robust M5’ method. However, in general we would prefer to use a

Bayesian network for regression rather than the M5’, at least from a modelling

point of view. In this way, the regression model could be included in a more gen-

eral model as long as it is a Bayesian network, and the global model can be used

for purposes other than regression. In contrast, the M5’ provides a regression tree, which cannot be used for reasoning about the model it represents.

We think that the methods studied in this chapter can be improved by con-

sidering more elaborate variable selection schemes.


Chapter 4

Learning models for regression

from incomplete data

Abstract

In Chapter 3 we addressed the problem of inducing Bayesian network models for

regression from full databases. In this chapter we face the same problem but for the

case of incomplete data. We also use MTEs to represent the joint distribution in the

induced networks. Only two particular Bayesian network structures are considered,

the so-called naıve Bayes (NB) and tree augmented naıve Bayes (TAN), which were

successfully applied in Chapter 3 as regression models when learning from complete

data. We propose an iterative procedure for inducing the models, based on a variation

of the data augmentation method in which the missing values of the explanatory

variables are filled by simulating from their posterior distributions, while the missing

values of the response variable are generated using the conditional expectation of

the response given the explanatory variables. We also consider the refinement of

both regression models by using variable selection and bias reduction. We illustrate

through a set of experiments with various databases the performance of the proposed

algorithms.


4.1 Introduction

In Chapter 3, MTEs were successfully applied to regression problems consid-

ering different underlying network structures [51, 55, 111] obtained from complete

databases. Motivated by the common presence of missing values in databases,


we face the problem of building Bayesian networks for regression from incomplete

data. We propose an iterative algorithm based on a variation of the data augmen-

tation method [151] in which the missing values of the explanatory variables are

filled by simulating from their posterior distributions, while the missing values

of the response variable are generated from its conditional expectation given the

explanatory variables. In this chapter we will focus only on two Bayesian net-

work structures, the so-called naıve Bayes (NB) and tree augmented naıve Bayes

(TAN) [58]. Also, the algorithm is extended to incorporate variable selection

in a similar way as in Sections 3.6 and 3.7. Finally, we introduce a method for reducing the bias in the predictions that can be used in all the models, regardless of whether they have been induced from complete or incomplete databases.

The rest of the chapter is organised as follows. Section 4.2 presents the theory

underlying the learning procedure. The new algorithm that operates over missing

values is formally proposed in Section 4.3. Section 4.4 introduces a method for

reducing the bias in the predictions. The behaviour of the algorithm is tested

through two experiments in Section 4.5 and the results are discussed in Sec-

tion 4.6. The chapter ends with the concluding remarks in Section 4.7.

4.2 Regression model from incomplete data

The aim of a regression problem is to find a model g that explains the response

variable Y in terms of the features X1, . . . , Xn, so that given an assignment of the

features, x1, . . . , xn, a prediction about Y can be obtained as ŷ = g(x1, . . . , xn). In order to calculate ŷ, the conditional expectation of the response variable given

the observed explanatory variables is used. Therefore, our regression model will

be

\hat{y} = g(x_1, \ldots, x_n) = E[Y \mid x_1, \ldots, x_n] = \int_{\Omega_Y} y\, f(y \mid x_1, \ldots, x_n)\, dy,

where f(y | x1, . . . , xn) is the conditional density of Y given x1, . . . , xn, which we

assume to be of class MTE.

A conditional distribution of class MTE can be represented as in Equa-


tion (3.6), where actually a marginal density is given for each element of the

partition of the support of the variables involved. It means that, in each of the

four regions depicted in Equation (3.6), the distribution of the response variable

Y is independent of the explanatory variables within each region.

Therefore, from the point of view of regression, the distribution for the re-

sponse variable Y given an element in a partition of the domain of the explanatory

variables X1, . . . , Xn, can be regarded as an approximation of the true distribution

of the actual values of Y for each possible assignment of the explanatory variables

in that region of the partition. This fact justifies the selection of E[Y | x1, . . . , xn] as the predicted value for the regression problem, because that value is the one that best represents all the possible values of Y for that region, in the sense that it minimises the mean squared error between the actual value of Y and its prediction ŷ, namely

mse = \int_{\Omega_Y} (y - \hat{y})^2 f(y \mid x_1, \ldots, x_n)\, dy, \qquad (4.1)

which is known to be minimised for ŷ = E[Y | x1, . . . , xn]. Thus, the key point

to find a regression model of this kind is to obtain a good estimation of the

distribution of Y for each region of values of the explanatory variables. The NB

and TAN models [51, 111] proposed in Chapter 3 estimate that distribution by

fitting a kernel density [132] to the sample and then obtaining an MTE density

from the kernel using least squares [108, 135]. Obtaining such an estimation is

more difficult in the presence of missing values. The first approach to estimating

MTE distributions from incomplete data was developed in the more restricted

setting of unsupervised data clustering [61]. In that case, the only missing values

are on the class variable, which is hidden, while the data about the features are

complete.

Here we are interested in problems where the missing values can appear in the

response variable as well as in the explanatory variables. A first approach to solve

this problem could be to apply the EM algorithm [41], which is a commonly used

tool in semi-supervised learning [115]. However, the application of the method-

ology is problematic because the likelihood function for the MTE model cannot

be optimised in an exact way [85, 135].


Another way of approaching problems with missing values is the so-called

data augmentation (DA) algorithm [151]. The advantage with respect to the EM

algorithm is that DA does not require a direct optimisation of the likelihood

function. Instead, it is based on imputing the missing values by simulating from

the posterior distribution of the missing variables, which is iteratively improved

from an initial estimation based on a random imputation. The DA algorithm

leads to an approximation of the maximum likelihood estimates of the parameters

of the model, as long as the parameters are estimated by maximum likelihood

from the complete database in each iteration. As maximum likelihood estimates

cannot be found in an exact way, we have chosen to use least squares estimation,

as in the NB and TAN regression models proposed in Chapter 3.

Furthermore, as our main goal is to obtain an accurate model for predicting

the response variable Y, we propose to modify the DA algorithm in connection with the imputation of missing values of Y. The next proposition is the key to how to proceed in this direction.

Proposition 2. Let Y and Y_S be two continuous, independent and identically distributed random variables. Then,

E[(Y - Y_S)^2] \ge E[(Y - E[Y])^2]. \qquad (4.2)

Proof.

E[(Y - Y_S)^2] = E[Y^2 + Y_S^2 - 2 Y Y_S]
= E[Y^2] + E[Y_S^2] - 2 E[Y Y_S]
= E[Y^2] + E[Y_S^2] - 2 E[Y] E[Y_S]
= 2 E[Y^2] - 2 E[Y]^2
= 2 (E[Y^2] - E[Y]^2)
= 2 \mathrm{Var}(Y)
\ge \mathrm{Var}(Y) = E[(Y - E[Y])^2].

In the proof we have relied on the fact that both variables are independent


and identically distributed, and therefore the expectation of the product is the

product of the expectations, and the expected value of both variables is the same.

Proposition 2 motivates our proposal for modifying the data augmentation

algorithm, since it proves that using the conditional expectation of Y to impute

the missing values instead of simulating values for Y (denoted as YS in the propo-

sition), reduces the mse of the estimated regression model. Notice that this is true

even if we are able to simulate from the exact distribution of Y conditional on

any configuration on a region of the values of the explanatory variables.
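A quick simulation illustrates the inequality of Proposition 2: for an arbitrary distribution (an exponential is used below purely as an example), imputing with an independent simulated copy roughly doubles the expected squared error with respect to imputing with the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200_000
y = rng.exponential(scale=2.0, size=m)       # "true" values of the response
y_sim = rng.exponential(scale=2.0, size=m)   # imputation by simulating an i.i.d. copy
print(np.mean((y - y_sim) ** 2))             # about 2 * Var(Y) = 8
print(np.mean((y - y.mean()) ** 2))          # about Var(Y) = 4
```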

4.3 The algorithm for learning a regression model from incomplete data

Our proposal consists of an algorithm which iteratively learns a regression model

(which can be a NB or a TAN) by imputing the missing values in each iteration

according to the following criterion:

• If the missing value corresponds to the response variable, it is imputed

with the conditional expectation of Y given the values of the explanatory

variables in the same record of the database, computed from the current

regression model.

• Otherwise, the missing cell is imputed by simulating the corresponding vari-

able from its conditional distribution given the values of the other variables

in the same record, computed from the current regression model.

As the imputation requires the existence of a model, for the construction of

the initial model we propose to impute the missing values by simulating from

the marginal distribution of each variable computed from the observed values.

In this way we have reached better results than using pure random initialisation,

which is the standard way of proceeding in data augmentation [151]. Another

way of proceeding could be to simulate from the conditional distribution of each

explanatory variable given the response, but we rejected this option because the

estimation of the conditional distributions requires more data than the estimation

of the marginals, which can be problematic if the number of missing values is high.

Page 72: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

60 4.3. The algorithm for learning a regression model from incomplete data

Algorithm 11: Bayesian network regression model from missing data

Input: An incomplete database D for variables Y,X1, . . . , Xn. A testdatabase Dt.

Output: A Bayesian network regression model for response variable Yand explanatory variables X1, . . . , Xn.

for each variable X ∈ Y,X1, . . . , Xn do1

Learn a univariate distribution fX(x) from its observed values in D.2

Create a new database D′ from D by imputing the missing values for each3

variable X ∈ Y,X1, . . . , Xn by simulating from fX(x).Learn a Bayesian network regression model M ′ from D′.4

Let srmse′ be the sample root mean squared error of M ′ computed using5

Dt according to Equation (4.3).srmse := ∞.6

while srmse′ < srmse do7

M :=M ′.8

srmse := srmse′.9

Create a new database D′ from D by imputing the missing values as10

follows:for each variable X ∈ X1, . . . , Xn do11

for each record z in D with missing value for X do12

Obtain fX(x | z) by probability propagation in model M .13

Impute the missing value for X by simulating from fX(x | z).14

for each record z in D with missing value for Y do15

Obtain fY (x | z) by probability propagation in model M .16

Impute the missing value for Y with EfY [Y | z].17

Re-estimate model M ′ from D′.18

Let srmse′ be the sample root mean squared error of M ′ computed19

using Dt.

return M .20

Page 73: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter

4.Learningmodels

forregressio

nfro

mincomplete

data

61

D t

2XX 1

X 1 2X X 1 2X

X 1 2X

D tUsing

i (M)srmseX 1 2X

Y if (x|case ) fYE [Y|z]

if (x|case )X

2XX 1

Xf (x)

, , Y

D

?

?

?

3

4

4

1

3

1

?2 2

Y

2

?

4

?

4

?

2

3

1

Y

D’M

5

3

4

4

1

3

1

2 22

1

3

Y

(M)srmsei−1<

No

Yes

class variable has no valueDiscarding records where

Y: Obtain

Learn a regressionmodel M from D’

X =

Fill old missing cells

propagating in M and impute missing value as

: Obtain propagating in M and impute missing value simulating from it

Y

Y

Explanatory variables

Response variable

RETURN M

ITERATIVE ALGORITHM

For each variable learn an univariate

distribution and fill missingcells in D sampling from it

Figu

re4.1:

Algorith

mfor

learningaregression

model

frommissin

gdata.

Page 74: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

62 4.3. The algorithm for learning a regression model from incomplete data

Algorithm 12: Selective Bayesian network regression model from missingdataInput: An incomplete database D for variables Y,X1, . . . , Xn. A test

database Dt.Output: A Bayesian network regression model made up of the response

variable Y and a subset of explanatory variablesS ⊆ X1, . . . , Xn.

for i := 1 to n do1

Compute I(Xi, Y ).2

Let X(1), . . . , X(n) be a decreasing order of the feature variables according3

to I(X(i), Y ).Using the Algorithm 11, construct a regression model M with variables Y4

and X(1) from database D.Let rmse(M) be the estimated accuracy of model M using Dt.5

for i := 2 to n do6

Let M1 be the model obtained by the Algorithm 11 with the variables7

of M plus X(i).Let rmse(M1) be the estimated accuracy of model M1 using Dt.8

if rmse(M1) ≤ rmse(M) then9

M :=M1.10

return M .11

The algorithm (see Algorithm 11) proceeds by imputing the initial database,

learning an initial model and re-imputing the missing cells. Then, a new model

is constructed and, if the mean squared error is reduced, the current model is re-

placed and the process repeated until convergence. As the mse in Equation (4.1)

requires the knowledge of the exact distribution of Y conditional on each con-

figuration of the explanatory variables, we use as error measure the sample root

mean squared error, computed as

srmse =

1

m

m∑

i=1

(yi − yi)2, (4.3)

where m is the sample size, yi is the observed value of Y for record i and yi is its

corresponding prediction through the regression model.

Page 75: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 4. Learning models for regression from incomplete data 63

The details are given in Algorithm 11 and graphically represented trough

a single example in Figure 4.1. Notice that, in Steps 4 and 18 the regression

model is learnt from a complete database, and therefore the existing estimation

methods for MTEs can be used [135, 111]. Also, notice that the algorithm is

valid for any Bayesian network structure, and therefore it is valid for our purpose,

which is to learn a NB or a TAN, just by calling to the appropriate procedure in

Steps 4 and 18. For learning the NB regression model [111], we use Algorithm 2

and for learning the TAN [51], Algorithm 5.

In the same way as in Chapter 3, we have also incorporated variable selection

in the construction of the regression models [55, 111] as described in Algorithm 12.

4.4 Improving the final estimations by reducing

the bias

In existing approaches to using MTEs for regression, the prediction that is used

is a corrected version computed by subtracting an estimated expected bias from

the prediction provided by the model [111]. That is, if Y is the response variable

and Y ∗ is the response variable actually identified by the model, i.e., the one that

corresponds to the estimations provided by the model, then the expected bias is

E[b(Y, Y ∗)] = E[Y − Y ∗], which is estimated as [111]

b =1

m

m∑

i=1

(yi − y∗i ), (4.4)

where yi and y∗i are the exact values of the response variable and their estimates

in a test database of m records. Finally, the estimates were corrected by giving

y∗i − b as the final estimation for item number i.

We have improved the estimation of the expected bias by detecting homoge-

neous regions in the set of possible values of Y and then estimating a different

expected bias in each region. The domain of the response variable is split using

the k-means clustering algorithm, determining k by exploring the dendrogram.

In this work we have considered a maximum value of k = 4, as we did not reach

any improvement by increasing its value in the experiments carried out.

Page 76: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

64 4.5. Experimental evaluation

Therefore, instead of a single estimation of the expected bias, b now we com-

pute a vector of estimations of the expected bias, bj , j = 1, . . . , k, and the final

estimation given is y∗i − bj(i), where j(i) denoted the cluster where y∗i lies in. The

procedure for estimating the bias is detailed in Algorithm 13.

Algorithm 13: Computing a vector of bias to refine the predictions

Input: A full database D for variables Y,X1, . . . , Xn.A regression model M .Output: vBias, a vector of biases.Run a hierarchical clustering to obtain a dendrogram for the values of Y .1

Determine the number of clusters, numBias, using the dendrogram.2

Partition D into numBias partitions D1, . . . , DnumBias by clustering Y3

using the k-means algorithm.for i := 1 to numBias do4

Compute vBias[i] by means of Equation (4.4) using Di and M .5

return vBias, a vector of estimated expected biases.6

This new bias estimation heuristic is not really costly, and provides important

increases in accuracy. Therefore, we have used it in the experiments reported in

Section 4.5.

4.5 Experimental evaluation

In order to test the performance of the proposed regression models, we have

carried out a series of experiments over 16 databases, four of which are artificial

(mte50, extended mte50, tan and extended tan).

The mte50 dataset [111] consists of a random sample of 50 records drawn

from a Bayesian network with NB structure and MTE distributions. The aim of

this network is to represent a situation which is handled in a natural way by the

MTE model. In order to obtain this network, we first simulated a database with

500 records for variables X , Y , Z and W , where X follows a χ2 distribution with

5 degrees of freedom, Y follows a negative exponential distribution with mean

1/X , Z = ⌊X/2⌋, where ⌊·⌋ stands for the integer part function, and W is a

random variable with Beta distribution with parameters p = 1/X and q = 1/X .

Out of that database, a naıve Bayes regression model was constructed using X

Page 77: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 4. Learning models for regression from incomplete data 65

Database Size # Continuous variables # Discrete variablesabalone 4176 8 1auto-mpg 392 8 0bodyfat 251 15 0cloud 107 6 2

concrete 1030 9 0forestfires 517 11 2

housing 506 14 0machine 209 8 1pollution 59 16 0servo 166 1 4strikes 624 6 1veteran 137 4 4mte50 50 3 1

extended mte50 50 4 2tan 500 3 2

extended tan 500 4 3

Table 4.1: A description of the databases used in the experiments, indicatingtheir size, number of continuous variables and number of discrete variables.

as response variable, and a sample of size 50 drawn from it using the Elvira

software [34]. Database extended mte50 was obtained from mte50 by adding

two columns independently of the others. One of the columns was drawn by

sampling uniformly from the set 0, 1, 2, 3 and the other by sampling from a

N(4, 3) distribution.

Database tan was constructed in a similar way. We generated a sample of size

1000 for variables X0, . . . , X4, where X0 is a N(3, 2), X1 is a negative exponential

with mean 2 × |X0|, X2 is uniform in the interval (X0, X0 +X1), X3 is sampled

from the set 0, 1, 2, 3 with probability proportional to X0 and X4 has a Poisson

distribution with mean λ = log(|X0 − X1 − X3| + 1). Out of that database, a

TAN regression model [51] was generated, and a sample of size 500 drawn from

it using the Elvira software [34]. Finally, the dataset extended tan was obtained

from tan by adding two independent columns, one of them drawn by sampling

uniformly from the set 0, 1, 2, 3 and the other by sampling from a N(10, 5)

distribution.

Page 78: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

66 4.5. Experimental evaluation

The aim of using the two extended databases for mte50 and tan is to test the

performance of the variable selection scheme in two databases where we know for

sure that some of the explanatory variables do not influence the response variable.

The other databases are available in the UCI [11] and StatLib [149] reposito-

ries. A description of the used databases can be found in Table 4.1.

In each database, we randomly inserted missing cells, ranging from a percent-

age of 10% to 50%. The missing cells have been created in an incremental way,

i.e., a database D with 20% of missing cells is constructed from the same database

with a 10% of missing values and so on. That is, these two data sets have the

same missing cells in a 10% of their positions. Over the resulting databases, we

have run 5 algorithms: NB, TAN, SNB and STAN, where the last two correspond

to the selective versions of NB and TAN. We have also included the M5’ algo-

rithm [155] in the comparison. Regarding the implementation of our regression

models, we have included it in the Elvira software [34].

We have used 10-fold cross validation [150] to estimate the srmse. The missing

cells in the databases were selected before running the cross validation, therefore,

in this case both the training and test databases contain missing cells in each

iteration of the cross validation. We discarded from the test set the records for

which the value of Y was missing. If the missing cells in the test set correspond to

explanatory variables, algorithmM5’ imputes them as column average for numeric

variables and column mode for qualitative variables [158]. The regression models

do not require the imputation of the missing explanatory variables in the test set,

as the posterior distribution for Y is computed by probability propagation and

therefore, the variables which are not observed are marginalised out. The results

of the experimental comparison are displayed in Figures 4.2, 4.3 and 4.4. The

values represented correspond to the average srmse computed by 10-fold cross

validation.

We used Friedman’s test [42] to compare the algorithms, reporting statisti-

cally significant difference among them, with a p-value of 2.2× 10−16. Therefore,

we continued the analysis by carrying out a pairwise comparison, following the

procedure discussed by Garcıa and Herrera [63], based on Nemenyi’s, Holm’s,

Shaffer’s and Bergmann’s tests. The ranking of the algorithms analysed, accord-

ing to Friedman’s statistic, is shown in Table 4.2. Notice that a higher rank

Page 79: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 4. Learning models for regression from incomplete data 67

indicates that the algorithm is more accurate, as we are using the rmse as target.

The result of the pairwise comparison is shown in Table 4.3. It can be seen that

SNB and STAN outperform their versions without variable selection. Also, M5’

is outperformed by SNB and STAN. Finally there are no statistically significant

difference between the two most accurate methods: SNB and STAN. The con-

clusions are rather similar regardless of the test used. The only difference is that

Holm’s and Bergmann’s tests also report significant differences between NB and

TAN and between TAN and M5’.

Algorithm RankingNB 2.4687500000000004TAN 1.7916666666666676SNB 4.302083333333335STAN 3.9895833333333313M5’ 2.447916666666668

Table 4.2: Average rankings of the algorithms tested in the experiments usingFriedman’s test.

Hypothesis Nemenyi Holm Shaffer BergmannTAN vs. SNB 3.8173E-27 3.8173E-27 3.8173E-27 3.8173E-27TAN vs. STAN 5.9273E-21 5.3346E-21 3.5564E-21 3.5564E-21SNB vs. M5’ 4.4902E-15 3.5922E-15 2.6942E-15 2.6942E-15NB vs. SNB 9.4913E-15 6.6439E-15 5.6948E-15 3.7965E-15STAN vs. M5’ 1.4259E-10 8.5557E-11 8.5557E-11 4.2778E-11NB vs. STAN 2.6655E-10 1.3328E-10 1.0662E-10 5.3310E-11NB vs. TAN 0.0301 0.0120 0.0120 0.0120TAN vs. M5’ 0.0403 0.0121 0.0121 0.0121SNB vs. STAN 1 0.3418 0.3418 0.3418NB vs. M5’ 1 0.9273 0.9273 0.9273

Table 4.3: Adjusted p-values for the pairwise comparisons using Nemenyi’s,Holm’s, Shaffer’s and Bergmann’s statistical tests.

Page 80: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

68 4.5. Experimental evaluation

NBTANSNBSTANM5

0 10 20 30 40 50

2.2

2.4

2.6

2.8

abalone

% of missing values

rmse

0 10 20 30 40 50

3.0

3.5

4.0

4.5

auto−mpg

% of missing values

rmse

0 10 20 30 40 50

510

1520

25

bodyfat

% of missing values

rmse

0 10 20 30 40 50

0.4

0.5

0.6

0.7

0.8

0.9

cloud

% of missing values

rmse

0 10 20 30 40 50

810

1214

concrete

% of missing values

rmse

Figure 4.2: Comparison of the different models for the data sets.

Page 81: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 4. Learning models for regression from incomplete data 69

0 10 20 30 40 50

4045

5055

6065

7075

forestfires

% of missing values

rmse

0 10 20 30 40 50

4.0

5.0

6.0

7.0

housing

% of missing valuesrm

se

0 10 20 30 40 50

3040

5060

7080

90

machine

% of missing values

rmse

0 10 20 30 40 50

3040

5060

7080

pollution

% of missing values

rmse

0 10 20 30 40 50

0.6

0.8

1.0

1.2

servo

% of missing values

rmse

0 10 20 30 40 50

400

450

500

550

strikes

% of missing values

rmse

Figure 4.3: Comparison of the different models for the data sets. The legends arethe same as in Figure 4.2.

Page 82: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

70 4.5. Experimental evaluation

0 10 20 30 40 50

120

140

160

180

veteran

% of missing values

rmse

0 10 20 30 40 50

1.6

1.8

2.0

2.2

2.4

2.6

2.8

mte50

% of missing values

rmse

0 10 20 30 40 50

2.0

2.5

3.0

3.5

4.0

extended_mte50

% of missing values

rmse

0 10 20 30 40 50

1.5

1.6

1.7

1.8

1.9

tan

% of missing values

rmse

0 10 20 30 40 50

1.6

1.7

1.8

1.9

2.0

2.1

extended_tan

% of missing values

rmse

Figure 4.4: Comparison of the different models for the data sets. The legends arethe same as in Figure 4.2.

Page 83: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 4. Learning models for regression from incomplete data 71

4.6 Results discussion

The experimental evaluation shows a satisfactory behaviour of the proposed re-

gression methods. The selective versions outperform the sophisticated M5’ algo-

rithm. Notice that the M5’ algorithm also incorporates variable selection, through

tree-pruning. The difference between the models based on Bayesian networks and

model trees becomes sharper as the rate of missing values grows. Also, the use

of variable selection always increases the accuracy. The fact that there are no

significant differences between SNB and STAN make the first one preferable, as

it is simpler (contains fewer parameters).

Finally, consider the line corresponding to M5’ in the graph for database

bodyfat in Figure 4.2. In that case, the error decreases abruptly for 40% and

50% of missing values, which is counterintuitive. We have found out that this is

due to the presence of outliers in the database, which are removed when the rate

of missing values is high. It suggests that M5’ is more sensitive to outliers than

the models based on Bayesian networks.

4.7 Conclusions

In this chapter we have studied the induction of Bayesian network models for

regression from incomplete data sets, based on the use of MTE distributions. We

have considered two well known network structures in classification and regres-

sion: The NB and TAN.

The proposal for handling missing values relies on the data augmentation

algorithm, which iteratively re-estimates a model and imputes the missing values

using it. We have shown that this algorithm can be adapted for the regression

problem by distinguishing the imputation of the response variable, in such a way

that the prediction error is minimised.

We have also studied the problem of variable selection, following the same

ideas as in Chapter 3. The final contribution of this chapter is the method for

improving the accuracy by reducing the bias, which can be incorporated regardless

of whether the model is obtained from complete or incomplete data.

The experiments conducted have shown that the selective versions of the pro-

Page 84: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

72 4.7. Conclusions

posed algorithms outperform the robust M5’ scheme, which is not surprising, as

M5’ is mainly designed for continuous explanatory variables, while MTEs are

naturally developed for hybrid domains.

Page 85: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 5

Parametric learning in MTE

networks using incomplete data

Estimating an MTE from data has turned out to be a difficult task. Current methods

suffer from a considerable computational burden as well as the inability to handle

missing values in the training data. In this chapter we describe an EM-based algorithm

for learning the maximum likelihood parameters of an MTE network when confronted

with incomplete data. In order to overcome the computational difficulties we make

certain distributional assumptions about the domain being modeled, thus focusing

on a subclass of the general class of MTE networks. Preliminary empirical results

indicate that the proposed method offers results that are inline with intuition.

Abstract

5.1 Introduction

One of the major challenges when using probabilistic graphical models for mod-

eling hybrid domains is to find a representation of the joint distribution that

support 1) efficient algorithms for exact inference based on local computations

and 2) algorithms for learning the representation from data.

In this chapter we will focus on the learning problem considering MTEs [106]

as a candidate framework. Algorithms for learning marginal and conditional

MTE distributions from complete data have previously been proposed [135, 128,

88, 87]. When facing with incomplete data, in Chapter 4 we considered a data

Page 86: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

74 5.1. Introduction

augmentation technique for learning (tree augmented) naıve MTE networks for

regression [53], but so far no attempt has been made at learning the parameters

of a general MTE network.

The task of learning MTEs from data was initially approached using least

squares estimation [135, 128]. However, this technique does not combine well

with more general model selection problems, as many standard score functions for

model selection, including the Bayesian information criterion (BIC) [138], assume

Maximum likelihood (ML) parameter estimates to be available. ML learning of

univariate distributions was introduced in [88], and a fist attempt on learning

conditional distributions was made in [87].

In this chapter we propose an EM-based algorithm [41] for learning parameters

in MTE networks from incomplete data. The general problem of learning MTE

networks (also with complete data) is computationally hard [87]: Firstly, the suf-

ficient statistics of a dataset is the dataset itself, and secondly, there are no known

closed-form equations for finding the maximum likelihood (ML) parameters. In

order to circumvent these problems, we focus on domains, where the probability

distributions mirror standard parametric families for which ML parameter esti-

mators are known to exist. This implies that instead of trying to directly learn

ML estimates for the MTE distributions, we may consider the ML estimators

for the corresponding parametric families. Hence, we define a generalised EM

algorithm that incorporates the following two observations (corresponding to the

M-step and the E-step, respectively):

i) Using the results of [33, 88] the domain-assumed parametric distributions

can be transformed into MTE distributions.

ii) Using the MTE representation of the domain we can evaluate the expected

sufficient statistics needed for the ML estimators.

For ease of presentation we shall only consider domains with multinomial,

Gaussian, and logistic functions, but, in principle, the proposed learning proce-

dure is not limited to these distributional families. Note that for these types of

domains exact inference is not possible using the assumed distributional families.

Page 87: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 5. Parametric learning in MTE networks using incomplete data 75

The remainder of the chapter is organised as follows. In Section 5.2 we give

rules for transforming selected parametric distributions into MTEs. From Sec-

tion 5.3 to 5.5 we describe the proposed algorithm. In Section 5.6 we present

some preliminary experimental results. Finally, the conclusion and some ideas

for future research are given in Section 5.7.

For ease of exposition, some irrelevant calculations in Sections 5.4 and 5.5

have been moved to the Appendix. The Appendix also includes help about the

vector notation used throughout the chapter.

5.2 Translating standard distributions into MTE

distributions

In this section we will consider transformations from selected parametric distri-

butions to MTE distributions.

5.2.1 Multinomial

The conversion from a multinomial into an MTE potential is straightforward,

since a multinomial distribution can be seen as a specific case of an MTE po-

tential [106]. For example, consider two discrete variables X and Z with states

xi, i = 1, . . . , n and zj , j = 1, . . . , d. The multinomial potential P (x | z) definedas a probability table,

Z = z1 . . . Z = zd

X = x1 p11 . . . p1d...

......

...

X = xn pn1 . . . pnd

or as a probability tree,

Page 88: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

76 5.2. Translating standard distributions into MTE distributions

Z

X

1

p11

1

. . . pn1

n

. . . X

d

p1d

1

. . . pnd

n

can be translated to a MTE potential using a mixed tree structure (see Subsec-

tion 2.2.3), in which only discrete variables will take part. The MTE densities

located in the leaves will have only the independent term with the corresponding

value pij .

5.2.2 Conditional linear Gaussian

In [33, 88] methods for obtaining an MTE approximation of a (marginal) Gaussian

distribution are described. Common for both approaches is that the split points

used in the approximations depend on the mean value of the distribution being

modeled. Consider now a variable X with continuous parents Y = (Y1, . . . , Yc)T

and assume that X given Y follows a conditional linear Gaussian distribution:1

X | Y = y ∼ N(µ = b+ lTy, σ2).

In the conditional linear Gaussian distribution, the mean value is a weighted linear

combination of the continuous parents. This implies that we cannot directly

obtain an MTE representation of the distribution by following the procedures

of [33, 88]; each part of an MTE potential has to be defined on a hypercube

(see Definition 5 in Subsection 2.2.3), and the split points can therefore not

depend on any of the variables in the potential. Instead we define an MTE

approximation by splitting the domain of the variables Y, ΩY, into hypercubes

D1, . . . , Dk, and specifying an MTE density for X for each of the hypercubes.

For hypercube Dp, p = 1, . . . , k, the mean of the distribution is assumed to be

1For ease of exposition we will disregard any discrete parent variables in the remainder ofthis section, since they will only serve to index the parameters of the function.

Page 89: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 5. Parametric learning in MTE networks using incomplete data 77

Y1

Y2

[ymin1 , ya1 ]

f(x;µ1, σx)

µ1 = b + l1mid11 + l2mid1

2

[ymin2 , ya2 ]

f(x;µ2, σx)

µ2 = b + l1mid21 + l2mid2

2

[ya2 , ymax2 ]

Y2

[ya1 , ymax1 ]

f(x;µ3, σx)

µ3 = b + l1mid31 + l2mid3

2

[ymin2 , yb2]

f(x;µ4, σx)

µ4 = b + l1mid41 + l2mid4

2

[yb2, ymax2 ]

Y1

Y2

[1, 8]

f(x;µ1, σx)

µ1 = 0.5 + 9 · 4.5 + 5 · 3.5

[2, 5]

f(x;µ2, σx)

µ2 = 0.5 + 9 · 4.5 + 5 · 6

(5, 7]

Y2

(8, 12]

f(x;µ3, σx)

µ3 = 0.5 + 9 · 4.5 + 5 · 3.5

[2, 3]

f(x;µ4, σx)

µ4 = 0.5 + 9 · 4.5 + 5 · 3.5

(3, 7]

Figure 5.1: Mixed tree for a Gaussian variable X with two continuous (Gaussian)parents Y1, Y2. The leaves are represented by the MTE potential in Equation (5.1)with values µ = µp and σ = σX . The mean values for variables X, Y1 and Y2 inthe example are: b = 0.5, l1 = 9 and l2 = 5, respectively.

constant, i.e., µp = b + l1midp1 + · · · + lcmidpc , where midpi denotes the midpoint

of Yi in Dp, i ∈ 1, . . . , c (by defining fixed upper and lower bounds on the

ranges of the continuous variables, the midpoints are always well-defined). Thus,

finding an MTE representation of the conditional linear Gaussian distribution

has been reduced to defining a partitioning D1, . . . , Dk of ΩY and specifying

an MTE representation for a (marginal) Gaussian distribution (with mean µp

and variance σ2) for each of the hypercubes Dp in the partitioning1. Figure 5.1

shows an example of an MTE representation of a conditional linear Gaussian

distribution for a variable.

In the current implementation we define the partitioning of ΩY based on

1Note that σ2 does not depend on the continuous parents

Page 90: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

78 5.2. Translating standard distributions into MTE distributions

equal-frequency binning, and we use BIC-score [138] to choose the number of

bins. To obtain an MTE representation of the (marginal) Gaussian distribution

for each partition in ΩY we follow the procedure of [88]; six MTE candidates for

the domain [−2.5, 2.5] are shown in Figure 5.2 (no split points are being used,

except to define the boundary). The 6-term MTE density has been selected to

take part of the approximation because it shows a good compromise between fit

and complexity.

Notice that this approximation is only positive within the interval [µ−2.5σ, µ+

2.5σ] (confer 5.2), and it actually integrates up to 0.9876 in that region, which

means that there is a probability of 0.0124 of finding points outside this inter-

val. In order to avoid problems with 0 probabilities, we add tails covering the

remaining probability mass of 0.0124. More precisely, we define the normalisation

constant

c =0.0124

2(

1−∫ 2.5σ

0exp−xdx

) ,

and include the tail

t(x) = c · exp −(x− µ) ,

for the interval above x = µ+ 2.5σ in the MTE specification. Similarly, a tail is

also included for the interval below x = µ − 2.5σ. The transformation rule from

Gaussian to MTE therefore becomes

f(x) =

c · exp x− µ if x < µ− 2.5σ,

σ−1[

a0 +∑6

j=1 aj exp

bjx−µσ

]

if µ− 2.5σ ≤ x ≤ µ+ 2.5σ,

c · exp −(x− µ) if x > µ− 2.5σ,

(5.1)

where,

Page 91: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 5. Parametric learning in MTE networks using incomplete data 79

−3 −2 −1 0 1 2 3−0.2

−0.1

0

0.1

0.2

0.3

0.4

0.5

(a) 2 terms

−3 −2 −1 0 1 2 30

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

(b) 4 terms

−3 −2 −1 0 1 2 30

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

(c) 6 terms

−3 −2 −1 0 1 2 30

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

(d) 8 terms

−3 −2 −1 0 1 2 30

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

(e) 10 terms

−3 −2 −1 0 1 2 30

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

(f) 12 terms

Figure 5.2: MTE approximations with 2, 4, 6, 8, 10, 12 exponential terms, respec-tively, for the truncated standard Gaussian distribution with support [−2.5, 2.5].It is difficult to visually distinguish the MTE and the Gaussian for the four lattermodels.

Page 92: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

80 5.2. Translating standard distributions into MTE distributions

a0 = 49.8248 a1 = −34.958 b1 = −0.33333

a2 = −34.958 b2 = 0.33333

a3 = 11.7704 b3 = −0.66667

a4 = 11.7704 b4 = 0.66667

a5 = −1.5269 b5 = −1

a6 = −1.5269 b6 = 1

Figure 5.3 shows the MTE approximation to a standard Gaussian density

using the 3 pieces specified in Equation (5.1).

−4 −2 0 2 4

0.0

0.1

0.2

0.3

0.4

x

f(x)

Figure 5.3: 3-piece MTE approximation of a standard Gaussian density. Thedashed red line represents the Gaussian density and the blue one the MTE ap-proximation.

5.2.3 Logistic

The sigmoid function for a discrete variable X with a single continuous parent Y

is given by

Page 93: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 5. Parametric learning in MTE networks using incomplete data 81

P (X = 1 | Y ) = 1

1 + expb+ wy .

In [31] a 4-piece 1-term MTE representation for this function is proposed:

P (X = 1 | Y = y)

=

0 if y < 5−bw,

−0.021704 + 0.521704c exp−0.635w(y − b(w + 1)) if 5−bw

≤ y ≤ −bw,

−1.021704− 0.521704c−1 exp0.635w(y − b(w + 1)) if −bw< y ≤ −5−b

w,

1 if y > −5−bw,

(5.2)

where c = 0.529936b(w2+w+1). Note that the MTE representation is 0 or 1 if

y < (5− b)/w or y > (−5− b)/w, respectively. The representation can therefore

be inconsistent with the data (i.e., we may have data cases with probability 0),

and we therefore replace the 0 and 1 with ǫ and 1− ǫ, where ǫ is a small positive

number. (ǫ = 0.0001 was used in the experiments reported in Section 5.6.).

Figure 5.5 shows both the logistic and the MTE approximation for the poten-

tial P (X = 1 | Y = y) in Equation (5.2).

In the general case, where X has continuous parents Y = (Y1, . . . , Yc)T and

discrete parents Z = (Z1, . . . , Zd)T, then for each configuration z of Z, the condi-

tional distribution of X given Y is given by

P (X = 1 | Y = y,Z = z) =1

1 + expbz +∑c

i=1wi,zyi. (5.3)

With more than one continuous variable as argument, the logistic function cannot

easily be represented by an MTE having the same structure as in Equation (5.2).

The problem is that the split points would then be (linear) functions of at least

one of the continuous variables, which is not consistent with the MTE framework

(see Definition 5 in Subsection 2.2.3). Instead we follow the same procedure as for

the conditional linear Gaussian distribution: For each of the continuous variables

in Y′ = Y2, . . . Yc, split the variable Yi into a finite set of intervals and use the

Page 94: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

82 5.2. Translating standard distributions into MTE distributions

−6 −4 −2 0 2 4 6

0.0

0.2

0.4

0.6

0.8

1.0

x

f(x)

Figure 5.4: 4-piece MTE approximation of a logistic function with b = 0 andw = −1. The dashed red line represents the logistic function and the blue onethe MTE approximation.

midpoint of the pth interval to represent Yi in that interval. The intervals for the

variables in Y′ define a partitioning D1, . . . , Dk of ΩY′ into hypercubes, and for

each of these partitions we apply Equation (5.2). That is, for partition Dp we get

P (X = 1 | y, z) = 1

1 + expb′ + w1y1, (5.4)

where b′ = b +∑c

k=2midpkwpk. In the current implementation Y1 is chosen arbi-

trarily from Y, and the partitioning of the state space of the parent variables is

performed as for the conditional linear Gaussian distribution.

Figure 5.5 shows an example of the MTE representation of a logistic vari-

able X with 3 continuous (Gaussian) parents. The information about the logistic

parameters of Y1 and Y2 is contained in parameter b′. Thus, each leave repre-

sents an MTE potential P (X | y3) which is an approximation of the potential

Page 95: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 5. Parametric learning in MTE networks using incomplete data 83

P (X | y1, y2, y3) in the same way as in Equation (5.2), where only the logistic

variable and one conditioning Gaussian variable were allowed. Note that for this

approximation we consider y1 = midp1 and y2 = midp2.

Y1

Y2

[

ymin1 , ya1

]

f(y3; b′, w3)

b′ = b3 + w1mid11 + w2mid1

2

[

ymin2 , ya2

]

f(y3; b′, w3)

b′ = b3 + w1mid21 + w2mid2

2

[

ya2 , ymin2

]

Y2

[

ya1 , ymin1

]

f(y3; b′, w3)

b′ = b3 + w1mid31 + w2mid3

2

[

ymin2 , yb2

]

f(y3; b′, w3)

b′ = b3 + w1mid41 + w2mid4

2

[

yb2, ymin2

]

Y1

Y2

[1, 8]

f(y3; b′, w3)

b′ = 0.5 + 0.1 · 4.5 + 0.8 · 3.5

[2, 5]

f(y3; b′, w3)

b′ = 0.5 + 0.1 · 4.5 + 0.8 · 6

(5, 7]

Y2

(8, 12]

f(y3; b′, w3)

b′ = 0.5 + 0.1 · 4.5 + 0.8 · 3.5

[2, 3]

f(y3; b′, w3)

b′ = 0.5 + 0.1 · 4.5 + 0.8 · 3.5

(3, 7]

Figure 5.5: Mixed tree for a logistic variable X with three continuous (Gaussian)parents Y1, Y2, Y3. The leaves are represented by the MTE potential in Equa-tion (5.2) with y = y3, b = b′ and w = w3. The values for the example areb3 = 0.5, w1 = 0.1 and w2 = 0.8.

5.3 The EM Algorithm

As previously mentioned, deriving an EM algorithm for general MTE networks is

computationally hard because the sufficient statistics of the dataset is the dataset

itself and there is no closed-form solution for estimating the maximum likelihood

parameters. To overcome these computational difficulties we will instead focus on

a subclass of MTE networks, where the conditional probability distributions in

Page 96: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

84 5.3. The EM Algorithm

the network mirror selected distributional families. By considering this subclass

of MTE networks we can derive a generalised EM algorithm, where the updating

rules can be specified in closed form.

To be more specific, assume that we have an MTE network for a certain

domain, where the conditional probability distributions in the domain mirror

traditional parametric families with known ML-based updating rules. Based on

the MTE network we can calculate the expected sufficient statistics required by

these rules (the E-step) and by using the transformations described in Section 5.2

we can in turn update the distributions in the MTE network.

The overall learning algorithm is detailed in Algorithm 14, where the domain

in question is represented by the model B. Note that in order to exemplify the

procedure we only consider the multinomial distribution, the Gaussian distribu-

tion, and the logistic distribution. The algorithm is, however, easily extended to

other distribution classes. The algorithm finishes when the following convergence

criterion is satisfied:

| log L(B′t | D)− log L(B′

t−1 | D) | < ǫ,

where L(B′t | D) is the likelihood of the MTE network B′ given the database D

in step t.

The steps of the algorithm are graphically shown through an example in Fig-

ure 5.6.

The transformation rules for the conditional linear Gaussian distribution, the

multinomial distribution, and the logistic distribution are alredy given in Sec-

tion 5.2. In order to complete the specification of the algorithm, we therefore

only need to define the E-step and the M-step for the three types of distributions

being considered.

Page 97: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 5. Parametric learning in MTE networks using incomplete data 85

Algorithm 14: An EM algorithm for learning MTE networks from incom-plete data.

Input: A parameterised model B over X1, . . . , Xn, and an incompletedatabase D of cases over X1, . . . , Xn.

Output: An MTE network B′.Initialise the parameter estimates θB randomly.1

repeat2

Using the current parameter estimates θB, represent B as an MTE3

network B′ (Section 5.2).(E-step) Calculate the expected sufficient statistics required by the4

M-step using B′ (Section 5.4).(M-step) Use the result of the E-step to calculate new ML parameter5

estimates θB for B (Section 5.5).θB := θB.6

until convergence ;7

return B′.8

5.4 The M-step. Updating rules for the param-

eter estimates

This section is devoted to calculate the updating rules for the parameter estimates

of the considered distributions. Given a database of cases D = d1 . . .dN for

variables X1, . . . , Xn, where di = (x(i)1 , . . . , x

(i)n ), the updating rules are derived

based on the expected data-complete log-likelihood function Q:

Q =N∑

i=1

E[log f(X1, . . . , Xn) | di] =N∑

i=1

n∑

j=1

E[log f(Xj | pa(Xj)) | di]. (5.5)

5.4.1 Multinomial

Let Xj be a discrete variable with only discrete parents Z as follows,

Z

Xj

Page 98: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

86 5.4. The M-step. Updating rules for the parameter estimates

Random parameter initialisation

Translation

rules

General framework

Mult. Mult. CLG CLG

Mult. Log. CLG

MTE framework

MTE MTE MTE MTE

MTE MTE MTE

E-stepM-step

Figure 5.6: EM algorithm for learning MTE networks from incomplete data.Continuous nodes are represented with double lines and the discrete ones withsingle line.

where the conditioning distribution for variable Xj, f(xj | z), is a discrete poten-

tia P (xj | z). Thus, the updating rule for the multinomial parameters is:

θj,k,z :=

N∑

i=1

P (Xj = k,Z = z | di)

|sp(Xj)|∑

k=1

N∑

i=1

P (Xj = k,Z = z | di). (5.6)

For the particular case in which the variable Xj has no parents the formula

above is simplified as:

θj,k :=

N∑

i=1

P (Xj = k | di)

|sp(Xj)|∑

k=1

N∑

i=1

P (Xj = k | di). (5.7)

5.4.2 Conditional linear Gaussian

Let Xj be a continuous (Gaussian) variable with discrete parents Z and contin-

uous (Gaussian) parents Y as follows,

Page 99: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 5. Parametric learning in MTE networks using incomplete data 87

Z Y

Xj

with density function f(xj | z,y) where

Xj ∼ N(

µ = lTz,jy + bz,j , σ2z,j

)

.

To ease notation, we shall use lz,j = [lTz,j, bz,j ]T and y = [yT, 1]T so,

lTz,jy + bz,j = lTz,jy. (5.8)

Therefore, the density function for Xj can be written as:

f(xj | z,y) =1

σz,j√2π

exp

−1

2

(

xj − lTz,jy

σz,j

)2

∝ exp

−1

2

(

xj − lTz,jy

σz,j

)2

.

(5.9)

So, we need to calculate the updating rule for the unknown parameters lz,j and

σz,j . The factor 1σz,j

√2π

can be considered a constant in the calculation of the

updating rule of lz,j, but not for for the updating rule of σz,j .

Since the parameters of the distribution need to be maximised, the updating

rules for each one will be calculated bellow by taking the derivative of the function

Q with respect to the parameter and then calculating the roots of the equation

to get the maximum value (intermediate calculations are reported in Appendix

at the end of the chapter).

For the simplest case in which the variable Xj has no parents, the density

function is:

f(xj) ∝ exp

−1

2

(

xj − µjσj

)2

. (5.10)

In this case the parameters to be estimated are µj and σj .

Page 100: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

88 5.4. The M-step. Updating rules for the parameter estimates

Updating rule for µj

µj :=1

N

[

N∑

i=1

E [Xj | di]]

(5.11)

Updating rule for σj

σj :=

[

1

N√2π

(

N∑

i=1

E(X2j | di) +Nµ2

j − 2µj

N∑

i=1

E(Xj | di))]1/2

(5.12)

Updating rule for lz,j

ˆlz,j :=

[

N∑

i=1

f(z | di)E(YYT | di, z)]−1 [ N

i=1

f(z | di)E(XjY | di, z)]

(5.13)

Updating rule for σz,j

σz,j :=

[

1∑N

i=1 f(z | di)

N∑

i=1

f(z | di)E[

(Xj − lTz,jY)2 | di, z]

]1/2

(5.14)

5.4.3 Logistic

Let Xj be a binary variable with discrete parents Z and continuous (Gaussian)

parents Y as follows,

Z Y

Xj

where

P (xj | z,y) = σz,j(y)xj(1− σz,j(y))

(1−xj) , xj ∈ 0, 1 (5.15)

and

σz,j(y) =1

1 + expwT

z,jy + bz,j, (5.16)

Page 101: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 5. Parametric learning in MTE networks using incomplete data 89

where wT

z,j is a set of coefficients, one for each continuous parent of Xj . To ease

notation, we shall use wz,j = [wT

z,j , bz,j]T and y = [yT, 1]T so

wT

z,jy + bz,j = wT

z,jy. (5.17)

The calculation for the logistic parameters are explained bellow following the

same ideas as in Subsection 5.4.2.

Updating rule for wz,j

∂Q

∂wz,j=

N∑

i=1

f(z | di)∂

∂wz,jE [logP (Xj | z,Y) | di, z]

=N∑

i=1

P (z | di)∂

∂wz,j

E[Xj log σz,j(Y) + (1−Xj) log(1− σz,j(Y)) | di, z]

=

N∑

i=1

P (z | di)[

∂wz,jE[Xj log σz,j(Y) | di, z] +

∂wz,jE[(1 −Xj) log(1− σz,j(Y)) | di, z]

]

.

(5.18)

Now, for the first part of the Equation (5.18) we get

∂wz,jE[Xj log σz,j(Y) | di, z] =

∂wz,j

(∫

y

P (xj = 1,y | di, z)1 logσz,j(y)dy +

y

P (xj = 0,y | di, z)0 log σz,j(y)dy)

=∂

∂wz,j

(∫

y

P (xj = 1,y | di, z) log σz,j(y)dy)

=

y

P (xj = 1,y | di, z)∂

∂wz,j

log σz,j(y)dy.

(5.19)

The derivative can be further expanded by noting that

Page 102: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

90 5.4. The M-step. Updating rules for the parameter estimates

∂wz,jlog σz,j(y)dy =

1

σz,j(y)

∂σz,j(y)

∂wz,j

=1

σz,j(y)σz,j(y)(1− σz,j(y))y = (1− σz,j(y))y,

(5.20)

and we therefore get

∂wz,j

E(Xj log σz,j(Y) | di, z) =∫

y

P (xj = 1,y | di, z)(1− σz,j(y))ydy. (5.21)

In a similar way, for the second part of the Equation (5.18) we get

∂wz,jE[(1−Xj) log(1− σz,j(Y)) | di, z]

=∂

∂wz,j

(∫

y

P (xj = 1,y | di, z)0 log(1− σz,j(y))dy +

y

P (xj = 0,y | di, z)1 log(1− σz,j(y))dy

)

=∂

∂wz,j

(∫

y

P (xj = 0,y | di, z) log(1− σz,j(y))dy

)

=

y

P (xj = 0,y | di, z)∂

∂wz,jlog(1− σz,j(y))dy. (5.22)

The derivative can be further expanded by noting that

∂wz,jlog(1− σz,j(y))dy =

1

1− σz,j(y)

∂(1 − σz,j(y))

∂wz,j

= − 1

1− σz,j(y)σz,j(y)(1− σz,j(y))y = −σz,j(y)y.

(5.23)

and we therefore get

Page 103: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 5. Parametric learning in MTE networks using incomplete data 91

∂wz,jE((1−Xj) log(1− σz,j(Y)) | di, z) = −

y

P (xj = 0,y | di, z)σz,j(y)ydy.(5.24)

By inserting the expressions back into Equation (5.18) we ended up with

∂Q

∂wz,j

=N∑

i=1

P (z | di)[∫

y

P (xj = 1,y | di, z)σz,xj=1(y)ydy−∫

y

P (xj = 0,y | di, z)σz,xj=0(y)ydy

]

. (5.25)

In order to find this partial derivative we need to evaluate two integrals.

However, the combination of the MTE potential P (xj,y | di, z) and the logistic

function σz,xj(y) makes these integrals difficult to evaluate. In order to avoid

this problem we use the MTE representation of the logistic function specified in

Subsection 5.2.3, which allows the integrals to be calculated in closed form.

Furthermore, even considering the previous approximation for the integrals,

the roots of the resulting equation can not be found to get the updating rule for

the weights vector. Instead one typically resorts to numerical optimisation such

as gradient ascent for maximising Q. Thus, the gradient ascent updating rule can

be expressed as

ˆwz,j := wz,j + γ∂Q

∂wz,j,

where γ > 0 is a small number.

The calculation of ∂Q∂wz,j

is evaluated for each configuration of the discrete

parents z and returns a vector of values (v1, . . . , vc), where c is the number of

continuous parents. Let see in more detail the calculation of Equation (5.25).

The following part of the expression,

y

P (xj = 1,y | di, z)σz,xj=1(y)ydy

Page 104: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

92 5.5. The E-step

for one specific parent yi would be calculated as follows:

yi

yi

(∫

y\yiP (xj = 1,y | di, z) σz,xj=1(y) dy \ yi

)

dyi,

that is, for each parent yi it is necessary to make c − 1 integrals and finally

compute the expectation like this:

yi

yi g(yi) dyi ,

where g(yi) is an MTE density.

5.5 The E-step

In this section the expected sufficient statistics are calculated to perform the

updating rules in the M-step. More specifically, we need to compute the following

four expectations in the MTE framework:

1) E(Xj | di, z)

2) E(XjY | di, z)

3) E(YYT | di, z)

4) E[

(Xj − lTz,jY)2 | di, z]

For the calculation of all the expectations above we will simplify it saying that

for each configuration of the discrete parents z we calculate:

1) E(Xj | di)

2) E(XjY | di)

3) E(YYT | di)

4) E[

(Xj − lTz,jY)2 | di]

Page 105: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 5. Parametric learning in MTE networks using incomplete data 93

All the expectations shown bellow can be obtained analytically (the interme-

diate calculations and notation are reported in the Appendix). For the first one,

we have that:

E(Xj | di) =a02

(

xb2 − xa

2)

+

m∑

j=1

aj

bj2 (expbjxb(bjxb − 1)− expbjxa(bjxa − 1))

(5.26)

For the second one, we need to calculate a vector of expectations, where the

j-element is E(XjYj | di). For simplicity we will denote Xj as X and Yj as Y .

The ranges of the variables will be [xa, xb] and [ya, yb] respectively.

E(XY | di) =a04(y2b − y2a)(x

2b − x2a) +

m∑

j=1

aj

cj2bj2

(−expbjya+ bjyaexpbjya+ expbjyb − bjybexpbjyb)(−expcjxa+ cjxaexpcjxa+ expcjxb − cjxbexpcjxb)

(5.27)

A new version of E(XY | di) is shown in Equation (5.28) for the case in which

the exponent has only one variable, that is, bj = 0:

E(XY | di) =a04(y2b − y2a)(x

2b − x2a) +

m∑

j=1

aj(y2b − y2a)

2cj2

(expcjxa − cjxaexpcjxa − expcjxb+ cjxbexpcjxb)(5.28)

For the third one, we need to calculate a matrix of expectations, where the

jk-element will be E(YjYk | di). If j 6= k the calculation can be carried out in the

same way as in Equation (5.27). When j = k, the expectation is E(Y 2j | di) and

Page 106: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

94 5.5. The E-step

will be calculated later in Equation (5.30).

For the fourth one, we have that:

E[

(Xj − lTz,jY)2 | di]

= E(X2j | di)− 2lTz,jE(XjY | di) + E((lTz,jY)2 | di),

(5.29)

where

E(X2j | di) =

a03

(

xb3 − xa

3)

+m∑

j=1

aj

bj3 ( expbjxb(bjxb(bjxb − 2) + 2)−

− [expbjxa(bjxa(bjxa − 2) + 2)] ) . (5.30)

The expectation E(XjY | di) of the second part in Equation (5.29) have been

calculated previously in Equation (5.27). For the last term in Equation (5.29),

we have that:

E[

(lTz,jY)2 | di]

= E[

(lTz,jY)(lTz,jY)T | di]

= E[

lTz,jYYTlz,j | di)]

= lTz,jE[

YYT | di)]

lz,j, (5.31)

and the calculation of E[

YYT | di)]

has been previously considered.

For calculating the expectations in all this section we will consider the most

general case in which all the variables involved are unobserved. When some of

them are observed, we will follow the next basic rules:

• if X and Y are both observed, then E(XY | di) is substituted by xy.

• if only X is unobserved, then E(XY | di) is substituted by y ∗ E(X), and

the other way around.

For the logistic we do not have any expectation to compute in the E-step,

since its updating rule in the M-step only has a product of MTE functions and

no expectations. Anyway, a simple integral need to be solved in Equation (5.25),

since we have an MTE times y.

Page 107: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

Chapter 5. Parametric learning in MTE networks using incomplete data 95

The calculation of the expectation for the multinomial is straightforward and

implicitly is made inside the M-step in Section 5.4.1.

5.6 Experimental results

In order to evaluate the proposed learning method we have generated data from

the Crops network [112]. We sampled six complete datasets containing 50, 100,

500, 1000, 5000, and 10000 cases, respectively, and for each of the datasets we

generated three other datasets with 5%, 10%, and 15% missing data (the data is

missing completely at random [95]), giving a total of 24 training datasets. The

actual data generation was performed using WinBUGS [97].

Subsidize Crop

Price

Buy

Figure 5.7: The Crops network.

For comparison, we have also learned baseline models using WinBUGS. How-

ever, since WinBUGS does not support learning of multinomial distributions from

incomplete data we have removed all cases where Subsidize is missing from the

datasets.

The learning results are shown in Table 5.1, which lists the average (per

observation) log-likelihood of the model with respect to a test-dataset consisting

of 15000 cases (and defined separately from the training datasets). From the

table we see the expected behaviour: As the size of the training data increases,

the models tend to get better; as the fraction of the data that is missing increases,

the learned models tend to get worse.

Page 108: UNIVERSITY OF ALMER´IA · I am also very grateful to Helge Langseth for the research collaborations we have had and his friendly attitude during his stay in Almer´ıa. Other people

96 5.6. Experimental results

The results also show how WinBUGS in general outperforms the algorithm

we propose in this chapter. We believe that one of the reasons is the way we

approximate the tails of the Gaussian distribution in Equation (5.1). As the tails

are thicker than the actual Gaussian tails, the likelihood is lower in the central

parts of the distribution, where most of the samples potentially concentrate.

Another possible reason is the way in which we approximate the CLG distribution.

Recall that when splitting the domain of the parent variable, we take the average

data point in each split to represent the parent, instead of using the actual value.

This approximation tends to give an increase in the estimate of the conditional

variance, as the approximated distribution needs to cover all the training samples.

Obviously, this will later harm the average predictive log likelihood. Two possible

solution to this problem are i) to increase the number of splits, or ii) to use

dynamic discretisation to determine the optimal way to split the parent’s domain.

However, both solutions come with a cost in terms of increased computational

complexity, and we consider the tradeoff between accuracy and computational

cost as an interesting topic for future research.

                ELVIRA                                      WINBUGS
No. Cases       Percentage of missing data                  Percentage of missing data
                0%       5%       10%      15%              0%       5%       10%      15%
50              -3.8112  -3.7723  -3.8982  -3.8553          -3.7800  -3.7982  -3.7431  -3.6861
100             -3.7569  -3.7228  -3.9502  -3.9180          -3.7048  -3.7091  -3.7485  -3.7529
500             -3.6452  -3.6987  -3.7972  -3.8719          -3.6272  -3.6258  -3.6380  -3.6295
1 000           -3.6325  -3.7271  -3.8146  -3.8491          -3.6174  -3.6181  -3.6169  -3.6179
5 000           -3.6240  -3.6414  -3.8056  -3.9254          -3.6136  -3.6141  -3.6132  -3.6144
10 000          -3.6316  -3.6541  -3.7910  -3.8841          -3.6130  -3.6131  -3.6131  -3.6135

Table 5.1: The average log-likelihood for the learned models, calculated per observation on a separate test set.

The algorithm has been implemented in Elvira [34] and the software, the

datasets used in the experiments, and the WinBUGS specifications are all avail-

able from http://elvira.ual.es/MTE-EM.html.


5.7 Conclusions

In this chapter we have proposed an EM-based algorithm for learning MTE net-

works from incomplete data. In order to overcome the computational difficulties

of learning MTE distributions, we focus on a subclass of the MTE networks,

where the distributions are assumed to mirror known parametric families. This

subclass supports a computationally efficient EM algorithm. Preliminary empir-

ical results indicate that the method learns as expected, although not as well as

WinBUGS. In particular, our method seems to struggle when the portion of the

data that is missing increases. We have proposed some remedial actions to this

problem that we will investigate further.


Chapter 6

Approximate inference in MTE

networks using importance

sampling

Abstract

In this chapter we propose an algorithm for approximate inference in hybrid Bayesian networks where the underlying probability distribution is of class MTE. The algorithm is based on importance sampling simulation. We show how it is able to compute multiple posterior probabilities simultaneously. The behaviour of the new algorithm is experimentally tested and compared with previous methods existing in the literature.

6.1 Introduction

Even though Bayesian networks allow efficient inference algorithms to operate

over them, it is known that exact probabilistic inference is an NP-hard prob-

lem [35]. Furthermore, approximate probabilistic inference is also an NP-hard

problem if a given precision is required [39]. For that reason, approximate algo-

rithms that tradeoff complexity for accuracy have been developed both for discrete

Bayesian networks [21, 17, 137, 18, 109] and for hybrid Bayesian networks with

MTEs [134].

In this chapter we propose an approximate algorithm for inference in hybrid


Bayesian networks with MTEs. The algorithm is based on importance sampling,

and therefore it is an anytime algorithm [126] in the sense that the accuracy

of its results is proportional to the time it is allowed to use for computing the

propagation. We show how our proposal outperforms the previous state-of-the-art

method for approximate inference with MTEs, introduced in [134].

The rest of the chapter is organised as follows. The problem for which we

propose a solution is formally posed in Section 6.2. The core of the methodolog-

ical contributions is in Section 6.3, and the details of the algorithm can be found

in Section 6.4. The experimental analysis carried out to test the performance of

the algorithm is reported in Section 6.5. The concluding remarks are given in

Section 6.6.

6.2 Problem formulation

We are interested in hybrid Bayesian networks, which are defined for a set of

variables X that contains discrete and continuous variables. Throughout this

chapter we will assume that X = Y ∪ Z, where Y and Z are sets containing only

discrete and only continuous variables, respectively.

Inference consists of computing a probability value for a target variable W ∈ X

given that the values of some variables E ⊂ X are known. Thus, if we write

X = (W, Y^T, Z^T, E^T)^T, where Y = (Y_1, . . . , Y_d)^T represents the non-observed

discrete variables, Z = (Z_1, . . . , Z_c)^T represents the non-observed continuous

variables and E = (E_1, . . . , E_k)^T, then we are interested in calculating

P(a < W < b | E = e) = P(a < W < b, E = e) / φ(e)     (6.1)

if W is a continuous variable. The function φ in the denominator of Equation (6.1)

is the marginal over variables E of the joint distribution in the network. Let φX

denote the conditional distribution of any variable X in the network. Then, the

joint distribution is defined as


φ(w, y, z, e) = φ_W(w | pa(w)) ∏_{i=1}^{d} φ_{Y_i}(y_i | pa(y_i)) ∏_{j=1}^{c} φ_{Z_j}(z_j | pa(z_j)) ∏_{l=1}^{k} φ_{E_l}(e_l | pa(e_l)).     (6.2)

Since our goal is to compute a probability given a fixed value e of variables

E, we will rather be interested in the restriction of the joint distribution to the

knowledge that E = e. We will replace any symbol φ in Equation (6.2) by ψ,

where the new symbol denotes the former function restricted to e. With this

notation, the joint distribution restricted to e can be written as

ψ(w, y, z) = ψ_W(w | pa(w)) ∏_{i=1}^{d} ψ_{Y_i}(y_i | pa(y_i)) ∏_{j=1}^{c} ψ_{Z_j}(z_j | pa(z_j)) ∏_{l=1}^{k} ψ_{E_l}(e_l | pa(e_l)).     (6.3)

So, the numerator in Equation (6.1) can be obtained as

P(a < W < b, E = e) = ∫_a^b ( ∑_{y ∈ Ω_Y} ∫_{Ω_Z} ψ(w, y, z) dz ) dw = ∫_a^b h(w) dw,     (6.4)

where

h(w) = ∑_{y ∈ Ω_Y} ∫_{Ω_Z} ψ(w, y, z) dz.     (6.5)

To finally compute the probability expressed in Equation (6.1), we still have

to compute φ(e). This is obtained as


φ(e) = ∫_{Ω_W} ( ∑_{y ∈ Ω_Y} ∫_{Ω_Z} ψ(w, y, z) dz ) dw = ∫_{Ω_W} h(w) dw.     (6.6)

On the other hand, if W is discrete, the probability is formulated as

P(W = w | E = e) = P(W = w, E = e) / φ(e),     (6.7)

where w is any possible value of W .

The numerator of Equation (6.7) can be expressed as

P(W = w, E = e) = ∑_{y ∈ Ω_Y} ∫_{Ω_Z} ψ(w, y, z) dz = h(w).     (6.8)

A similar procedure is carried out to compute the denominator of Equa-

tion (6.7), which is obtained as

φ(e) = ∑_{w ∈ Ω_W} ∑_{y ∈ Ω_Y} ∫_{Ω_Z} ψ(w, y, z) dz = ∑_{w ∈ Ω_W} h(w).     (6.9)

Hence, calculating the probabilities formulated in Equations (6.1) and (6.7)

requires the computation of the expressions in Equations (6.4), (6.6), (6.8) and

(6.9). The problem is that in all the cases, the calculations are carried out over

the joint distribution, whose size is exponential in the number of variables in

the network. Therefore, if the number of variables is high, it can be difficult or

even impossible to represent such a joint distribution in a computer, especially

if memory resources are limited. In the next section we propose a solution for

approximating the probabilities required, keeping the complexity bounded. The

solution is based on the use of the importance sampling technique [129].


6.3 Approximate propagation using importance

sampling

We will start off by considering the case in which the target variable, W , is

continuous. We can write the numerator of Equation (6.1) as follows.

P(a < W < b, E = e) = ∫_a^b h(w) dw = ∫_a^b ( h(w) / f*(w) ) f*(w) dw = E_{f*}[ h(W*) / f*(W*) ],     (6.10)

where f* is a probability density function on (a, b) called the sampling distribution,

and W* is a random variable with density f*. Let W*_1, . . . , W*_m be a sample drawn

from f*. Then it is easy to prove that

θ_1 = (1/m) ∑_{i=1}^{m} h(W*_i) / f*(W*_i)     (6.11)

is an unbiased estimator of P (a < W < b,E = e). This estimation procedure is

called importance sampling.
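A minimal sketch of this estimator, assuming h can be evaluated pointwise and that we can both sample from and evaluate the density f*, is given below; the uniform sampling density in the toy usage is our own choice, not the one derived later in this chapter.

    import numpy as np

    def importance_sampling_estimate(h, f_star_pdf, f_star_sample, m, seed=0):
        # Estimate the integral of h over (a, b) via Equation (6.11),
        # where 'f_star_sample' draws from a density f* supported on (a, b).
        rng = np.random.default_rng(seed)
        w = f_star_sample(rng, m)                 # W*_1, ..., W*_m
        return float(np.mean(h(w) / f_star_pdf(w)))

    # Toy usage: estimate the integral of h(w) = exp(-w) over (0.2, 0.5)
    # using a uniform sampling density on that interval.
    a, b = 0.2, 0.5
    theta_1 = importance_sampling_estimate(
        h=lambda w: np.exp(-w),
        f_star_pdf=lambda w: np.full_like(w, 1.0 / (b - a)),
        f_star_sample=lambda rng, m: rng.uniform(a, b, m),
        m=5000)                                   # close to exp(-0.2) - exp(-0.5)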

As θ1 is unbiased, the error of the estimation is determined by its variance,

which is

Var(θ_1) = Var( (1/m) ∑_{i=1}^{m} h(W*_i) / f*(W*_i) ) = (1/m) Var( h(W*) / f*(W*) ).     (6.12)

In order to minimise the variance in the expression above, f ∗ must be selected

in such a way that the ratio between h and f* is as constant as possible within

interval (a, b). Actually, the minimum variance is reached when f ∗ is proportional

to h in that interval, but that is of no practical value, as we are assuming that

h, which is equivalent to the joint distribution, is difficult to handle. Later on

we will show in detail a way to obtain an approximation to h, but keeping the

complexity bounded. Let h∗ be such an approximation. Then it holds that


f*(w) = h*(w) / ∫_a^b h*(w) dw   if a < w < b,   and   f*(w) = 0   otherwise,     (6.13)

is a probability density function within interval (a, b). Therefore, in order to

apply importance sampling to answer our target query, we have to find an ap-

proximation, h*, of h and then obtain a sampling distribution from it, according

to Equation (6.13). Finally, we can obtain an estimation of P(a < W < b, E = e)

using Equation (6.11).

On the other hand, φ(e) can be estimated using importance sampling as well.

In principle, a new sample should be generated, since the integral range in this

case is the entire domain of W , and not only interval (a, b). To avoid generating

two different samples, we can consider the following density:

f*_2(w) = h*(w) / ∫_{Ω_W} h*(w) dw,     (6.14)

which is a density on Ω_W. From this, we can generate a sample W*_1, . . . , W*_m.

Then, it holds that

δ = (1/m) ∑_{i=1}^{m} h(W*_i) / f*_2(W*_i)     (6.15)

is an unbiased estimator of φ(e).

Now, if we write W*_(1), . . . , W*_(k) for the elements from sample W*_1, . . . , W*_m that

fall inside interval (a, b), then it can be shown that

θ_2 = (1/k) ∑_{i=1}^{k} h(W*_(i)) / f*_2(W*_(i))     (6.16)

is an unbiased estimator of P (a < W < b,E = e).

The next proposition establishes the impact of using the same sample on the

accuracy of the estimation.


Proposition 3. Let m, k, θ_2 and δ be as in Equations (6.15) and (6.16). Then,

Var(θ_2) ≤ (m/k) Var(δ) + 1/(2k).     (6.17)

Proof. Let functions h and f*_2 be as in Equations (6.15) and (6.16). We define

ξ, ξ_1 and ξ_2 as

ξ(w) = h(w) / f*_2(w),   w ∈ R,

ξ_1(w) = h(w) I_(a,b)(w) / f*_2(w),   w ∈ R,

ξ_2(w) = h(w) I_{R\(a,b)}(w) / f*_2(w),   w ∈ R,

where a, b ∈ R and

I_(a,b)(w) = 1 if w ∈ (a, b), and 0 otherwise,

I_{R\(a,b)}(w) = 0 if w ∈ (a, b), and 1 otherwise.

It is clear that ξ = ξ_1 + ξ_2 and ξ_1 × ξ_2 = 0. Then,

Var(ξ) = Var(ξ_1 + ξ_2) = Var(ξ_1) + Var(ξ_2) + 2 Cov(ξ_1, ξ_2)

       = Var(ξ_1) + Var(ξ_2) + 2 (E[ξ_1 ξ_2] − E[ξ_1] E[ξ_2])

       = Var(ξ_1) + Var(ξ_2) − 2 P(a < W < b, E = e) P(W ∉ (a, b), E = e)

       = Var(ξ_1) + Var(ξ_2) − 2 P(a < W < b, E = e) (1 − P(a < W < b, E = e))

       = Var(ξ_1) + Var(ξ_2) − 2 (P(a < W < b, E = e) − P²(a < W < b, E = e)).


Hence,

Var(ξ_1) = Var(ξ) − Var(ξ_2) + 2 (P(a < W < b, E = e) − P²(a < W < b, E = e)) ≤ Var(ξ) + 1/2,

since Var(ξ_2) ≥ 0 and P(a < W < b, E = e) − P²(a < W < b, E = e) ≤ 1/4. Thus,

(1/m) Var(ξ_1) ≤ (1/m) Var(ξ) + 1/(2m)

⇒ (k/m) (1/k) Var(ξ_1) ≤ (1/m) Var(ξ) + 1/(2m)

⇒ (k/m) Var(θ_2) ≤ Var(δ) + 1/(2m)

⇒ Var(θ_2) ≤ (m/k) Var(δ) + 1/(2k).

Proposition 3 establishes that the variance of θ2 is related to the variance

of δ by the inverse of the proportion of elements in the sample that fall within

interval (a, b). It means that using a single sample does not increase the error of

the estimation dramatically. Actually, if all the elements in the sample are inside

the target interval, then the variance of both estimators is almost the same, as

the term 1/(2k) tends to 0 as k increases.
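The following sketch illustrates how a single sample drawn from f*_2 on the whole domain yields both quantities of Equations (6.15) and (6.16); the toy uniform density is only a stand-in for the actual MTE sampling distribution.

    import numpy as np

    def single_sample_estimates(h, f2_pdf, f2_sample, a, b, m, seed=0):
        # Compute delta (Equation (6.15)) and theta_2 (Equation (6.16))
        # from one sample W*_1, ..., W*_m drawn from f*_2 on the whole domain.
        rng = np.random.default_rng(seed)
        w = f2_sample(rng, m)
        ratios = h(w) / f2_pdf(w)
        inside = (w > a) & (w < b)
        delta = float(ratios.mean())              # estimate of phi(e)
        theta_2 = float(ratios[inside].mean())    # uses only the k points in (a, b)
        return theta_2, delta, int(inside.sum())

    # Toy usage on the domain (0, 1) with a uniform sampling density
    theta_2, delta, k = single_sample_estimates(
        h=lambda w: np.exp(-w), f2_pdf=lambda w: np.ones_like(w),
        f2_sample=lambda rng, m: rng.uniform(0.0, 1.0, m),
        a=0.2, b=0.5, m=5000)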

If the target variable is discrete, the procedure is analogous. More precisely,

if W is discrete then, from Equation (6.8) it follows that

P(W = w, E = e) = ∑_{w' ∈ Ω_W} h(w') I_w(w') = ∑_{w' ∈ Ω_W} ( h(w') I_w(w') / p*(w') ) p*(w') = E_{p*}[ h(W*) I_w(W*) / p*(W*) ],

where p* is any probability mass function defined on Ω_W, W* is a discrete random


variable with distribution p∗, and

I_w(x) = 1 if w = x, and 0 otherwise.

The rest of the procedure is identical to the continuous case.

6.3.1 Obtaining a sampling distribution

The error in the estimation procedure described above depends on the variance

of the ratio h/f*. Therefore the best behaviour would be obtained if the sampling

distribution is close to h, as we mentioned before. Salmerón et al. [137] developed

a method for computing an accurate sampling distribution for discrete Bayesian

networks. It is based on computing the sampling distribution for a given vari-

able through a process of eliminating the other variables from the set of all the

conditional distributions in the network, H = {p(x_i | x_pa(i)), i = 1, . . . , n}. The

procedure can be adapted to the case of a hybrid Bayesian network as follows.

Let X1, . . . , Xl be the set of all the variables in the network, except the target

W and the observations E. An elimination order σ is considered and variables

are deleted according to that order: X_σ(1), . . . , X_σ(l).

The deletion of a variable Xσ(i) consists of marginalising it out from the com-

bination of all the functions in H which are defined for that variable. More

precisely, the steps are as follows:

• Let dom(f) denote the set of variables for which function f is defined.

• Let H_σ(i) = {f ∈ H | X_σ(i) ∈ dom(f)}.

• Calculate

  f_σ(i) = ∏_{f ∈ H_σ(i)} f     (6.18)

  and f'_σ(i), defined on dom(f_σ(i)) \ {X_σ(i)}, by

  f'_σ(i)(y) = ∫_{Ω_{X_σ(i)}} f_σ(i)(y, x_σ(i)) dx_σ(i)   ∀ y ∈ Ω_{dom(f_σ(i)) \ {X_σ(i)}}.     (6.19)


• Transform H into (H \ H_σ(i)) ∪ {f'_σ(i)}.

Note that the integral in Equation (6.19) becomes a summation if the variable

being removed, X_σ(i), is discrete. After deleting all the variables X_σ(1), . . . , X_σ(l)

from the set of distributions H = {p(x_i | x_pa(i)), i = 1, . . . , n}, the remaining

functions will depend only on W. If all the computations are exact, it was proved in [72] that the remaining

function is actually the optimal sampling distribution.

However, the result of the products (see Equation (6.18)) in the process of

obtaining the sampling distribution may require a large amount of space to be

stored, and therefore the algorithm in [137] approximates the result of the com-

binations by pruning the probability trees (in our case, mixed trees) used to

represent the potentials. The price to pay is that the sampling distribution is not

the optimal one and the accuracy of the estimations will depend on the quality

of the approximations. Here we propose a strategy for approximating the MTE

potentials resulting from the products in Equation (6.18). We will explain the

idea by considering an MTE potential defined for a set of continuous variables

Z = (Z1, . . . , Zt)T as

φ(z) = a_0 + ∑_{i=1}^{t} a_i e^{b_i^T z}.

The goal is to detect those exponential terms in φ(z) that are almost constant

and remove them. The rationale behind this strategy is that, from the point of

view of simulation, a flat or constant term does not provide any useful information

to the entire density, as there is already a constant term, namely a0.

Thus, we consider a threshold α ∈ (0, 1) and then, for each term g_j(z) = a_j e^{b_j^T z}, j = 1, . . . , t, in the mixture, if the following condition is satisfied

min(g_j(z)) / max(g_j(z)) > α,

then g_j(z) is replaced by a constant

k_j = ( min(g_j(z)) + max(g_j(z)) ) / 2.     (6.20)

The closer α is to 1, the more accurate the approximation.


Note that the previous statements rely on the fact that each exponential term is

strictly increasing or decreasing over its whole domain, and therefore its

maximum and minimum are always located at the

borders of the domain. In this way, the shape of the function can be controlled.

Summing up, if the j-th term of the mixture is replaced by the constant k_j, the

resulting potential would be

φ(z) = a_0 + k_j + ∑_{i ∈ N, i ≠ j} a_i e^{b_i^T z},

where N = {1, . . . , t}. But in fact, MTE potentials are defined over hypercubes.

Therefore, rather than approximating a single potential, after each product the

whole mixed tree representing the resulting potential should be approximated

following this strategy. The detailed procedure can be found in Algorithm 15.

Algorithm 15: PruneMTEPotential(T, α)

Input: A mixed tree T and a threshold α for pruning terms.
Output: Tree T with terms pruned according to α.
1   Let Z be the set of continuous variables of tree T.
2   foreach leaf in T do
3       Let φ(z) = k + ∑_{i=1}^{t} a_i e^{b_i^T z} be the MTE stored in the current leaf.
4       for j := 1 to t do
5           Let a_j e^{b_j^T z} be the j-th term of φ(z).
6           if min(a_j e^{b_j^T z}) / max(a_j e^{b_j^T z}) > α then
7               k_j := ( min(a_j e^{b_j^T z}) + max(a_j e^{b_j^T z}) ) / 2.
8               Remove a_j e^{b_j^T z} from φ(z).
9               Update the independent term k of φ(z) to k + k_j.
10  return T.
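A small Python sketch of the pruning rule for a single leaf is given below; it assumes each exponential term is stored as a coefficient a_j and an exponent vector b_j, and that the hypercube of the leaf is given by lower and upper corner vectors (all names are illustrative, and the sketch is independent of the mixed-tree representation used in Elvira).

    import numpy as np

    def term_extrema(a, b, low, high):
        # Extrema of g(z) = a * exp(b'z) over the hypercube [low, high]:
        # exp(b'z) is monotone in each coordinate, so the extrema of b'z
        # are attained at corners selected by the sign of each entry of b.
        lo = np.sum(np.where(b > 0, b * low, b * high))
        hi = np.sum(np.where(b > 0, b * high, b * low))
        values = (a * np.exp(lo), a * np.exp(hi))
        return min(values), max(values)

    def prune_leaf(k, terms, low, high, alpha):
        # Pruning rule of Algorithm 15 for one leaf: 'terms' is a list of
        # (a_j, b_j) pairs; near-constant terms are folded into the
        # independent term k using the constant of Equation (6.20).
        kept = []
        for a_j, b_j in terms:
            g_min, g_max = term_extrema(a_j, np.asarray(b_j, dtype=float), low, high)
            if g_max != 0 and g_min / g_max > alpha:
                k += (g_min + g_max) / 2.0
            else:
                kept.append((a_j, b_j))
        return k, kept

    # Toy usage on a two-dimensional leaf over [0, 0.1] x [0, 0.1]
    new_k, new_terms = prune_leaf(
        k=0.5,
        terms=[(1.2, [0.05, -0.02]), (0.8, [15.0, 10.0])],
        low=np.array([0.0, 0.0]), high=np.array([0.1, 0.1]),
        alpha=0.95)   # the first term is pruned, the second one is kept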


6.3.2 Computing multiple probabilities simultaneously

The procedure described so far is designed to calculate probabilities concerning a

single variable at a time. We will show in this section that it can be extended to

allow the possibility of calculating multiple probabilities about different variables

at the same time. The idea is based on the elimination procedure described in

Section 6.3.1.

It is possible to carry out a simulation in an order contrary to the one in which

variables are deleted. To obtain a value for Xσ(i), the function fσ(i) obtained in

the deletion of this variable is used. This function is defined for the values of

variable Xσ(i) and other variables already sampled. Function fσ(i) is restricted to

the already obtained values of variables in dom(f_σ(i)) \ {X_σ(i)}, giving rise to a

density function which depends only on Xσ(i). Finally, a value for this variable is

drawn from this density. If all the computations are exact, it was proved in [72]

that the simulation is actually carried out using the optimal density, and we

obtain a sample from the joint distribution of Xσ(1), . . . , Xσ(l).

The details of this procedure are given in Algorithm 16, which computes a

sampling distribution for each unobserved variable in a hybrid Bayesian network.

Later on we will study how to determine the order of the variables in Step 4.

Now let us denote by W1, . . . ,Wn the unobserved variables in the network,

and by E1, . . . , Ek the observed ones. Note that after applying Algorithm 16, if

we set α = 1 in Step 7, then it holds that the true joint probability function is

f(w_1, . . . , w_n, e_1, . . . , e_k) = ∏_{i=1}^{l} f*_{X_i}.     (6.21)

That is, if we simulate each variable X_i using f*_{X_i}, we would actually be obtaining

a sample of random vectors (w_1, . . . , w_n, e_1, . . . , e_k) from the true distribution.

Our goal in this section is to calculate a set of probabilities about the unob-

served variables expressed as P (Wi = wi,E = e) or P (ai < Wi < bi,E = e),

i = 1, . . . , n, if Wi is discrete or continuous, respectively. It can be shown that we

can use the joint sample to estimate the different probabilities separately, since

each individual sample is itself a sufficient statistic for the probability of each

specific variable.


Algorithm 16: SamplingDistributions(B, e)

Input: A hybrid BN, B, and an observation e.
Output: A sampling distribution for each variable in the network.
1   Let H := {ψ_{X_1}, . . . , ψ_{X_l}} be all the potentials in B restricted to the evidence e, represented as mixed trees.
2   S := ∅.
3   for i := 1 to l do
4       Select the next variable to remove, X_i.
5       H_{X_i} := {ψ ∈ H | X_i ∈ dom(ψ)}.
6       f_{X_i} := ∏_{ψ ∈ H_{X_i}} ψ.
7       f*_{X_i} := PruneMTEPotential(f_{X_i}, α).
8       S := S ∪ {f*_{X_i}}.
9       H := H \ H_{X_i}.
10      if X_i is continuous then
11          H := H ∪ {∫_{X_i} f*_{X_i} dx_i}.
12      else
13          H := H ∪ {∑_{X_i} f*_{X_i}}.
14  return S.

Let W_1^(j), . . . , W_n^(j), j = 1, . . . , m be a sample of size m drawn from the sampling distributions in the set S returned by Algorithm 16. Then

δ_2 = (1/m) ∑_{j=1}^{m} ψ(W_1^(j), . . . , W_n^(j)) / ∏_{i=1}^{n} f*_{W_i}(W_i^(j))     (6.22)

is an unbiased estimator of φ(e).

Let W_1^(j)*, . . . , W_n^(j)*, j = 1, . . . , r be the elements from the sample above for which W_i^(j) falls into interval (a_i, b_i) (or for which W_i^(j) = w_i in the discrete case), i = 1, . . . , n. Then

θ_{X_i} = (1/r) ∑_{j=1}^{r} ψ(W_1^(j)*, . . . , W_n^(j)*) / ∏_{i=1}^{n} f*_{W_i}(W_i^(j)*)     (6.23)

is an unbiased estimator of P (ai < Wi < bi,E = e), i = 1, . . . , n. A similar

result can be derived immediately in the case that Wi is discrete, and therefore

the quantity to estimate is P (Wi = wi,E = e).


In Equations (6.22) and (6.23), function ψ in the numerator is defined in a

similar way as in Equation (6.3), i.e. the product of conditionals, restricted to

the observations.

6.4 The algorithm

In this section we give the details of the algorithm that implements our proposal

for computing multiple probabilities in hybrid Bayesian networks with MTEs

using importance sampling. First of all it should be emphasised that Algorithm 16

makes a decision about which variable to remove in each iteration (see Step 4).

The decision there influences the complexity of the product in Step 6, since it

determines the set of potentials that will be multiplied. We propose to use a

one-step look-ahead heuristic based on selecting the variable that results in a

potential of lowest size after the product in Step 4. The concept of size is given

in the next definition.

Definition 8 (Size of an MTE potential). The size of an MTE potential is defined

as its number of exponential terms, including the independent term.

Example 6. The potential represented in Figure 2.5 has size equal to 16, since

it has 8 leaves, and in each one there is an independent term and one exponential

term, that is, 8× (1 + 1) = 16.

Though it is not possible to know beforehand the exact size of a potential

resulting from a product, an upper bound is given in the next proposition. This

is the bound actually used for deciding the elimination order in Algorithm 16.

Proposition 4 (proposed in [134]). Let T_1, . . . , T_h be mixed probability trees,

Y_i, Z_i the discrete and continuous variables of each one of them, and n_i the

highest number of intervals into which the domain of the continuous variables of

T_i is split. Let Ω_{Y_i} be the set of possible values of the discrete variable Y_i. The

size of the tree T = T_1 × T_2 × . . . × T_h is lower than

( ∏_{Y_i ∈ ∪_{i=1}^{h} Y_i} |Ω_{Y_i}| ) × ( ∏_{j=1}^{h} n_j^{k_j} ) × ( ∏_{j=1}^{h} t_j ),

where t_j is the maximum number of exponential terms in each leaf of T_j, and k_j

is the number of continuous variables in T_j.
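The bound can be evaluated directly from summary information about each factor; the following sketch assumes each tree is described by the state spaces of its discrete variables, its number of continuous variables k_j, the maximum number of split intervals n_j and the maximum number of exponential terms per leaf t_j (the dictionary layout is illustrative).

    from math import prod

    def size_upper_bound(factors):
        # Upper bound of Proposition 4 on the size of a product of mixed trees.
        # Each factor is a dict with keys 'discrete' (variable -> number of
        # states), 'n' (max intervals per continuous variable), 'k' (number
        # of continuous variables) and 't' (max exponential terms per leaf).
        discrete = {}
        for f in factors:
            discrete.update(f['discrete'])        # union of the discrete variables
        return (prod(discrete.values())
                * prod(f['n'] ** f['k'] for f in factors)
                * prod(f['t'] for f in factors))

    # Toy usage with two hypothetical factors:
    # (2 * 3) * (2**1 * 4**2) * (3 * 2) = 1152
    bound = size_upper_bound([
        {'discrete': {'Y1': 2}, 'n': 2, 'k': 1, 't': 3},
        {'discrete': {'Y1': 2, 'Y2': 3}, 'n': 4, 'k': 2, 't': 2},
    ])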

At this point, we have all the tools necessary for establishing our proposal for

computing multiple probabilities, which is described in Algorithm 17.

6.5 Experimental evaluation

Three experiments have been carried out to analyse the performance of the pro-

posed methodology. We have used five hybrid Bayesian networks. The first three,

denoted as artificial1, artificial2, artificial3 are artificial networks with

41, 77 and 97 variables, respectively, whose structure and parameters were gen-

erated at random, in the same way as the networks used in [134].

The two remaining networks have been created taking the structure from

the alarm [8] and barley [82] networks, which are originally fully discrete, and

making some assumptions about the types of the variables. Out of the 37 and

48 variables in the networks, respectively, 10 of them were considered as discrete

with two states, and the remaining were considered continuous with support in the

interval [0, 1]. The domain of each continuous variable was split into two pieces.

The MTE densities associated with each split were defined using 2 exponential

terms, with parameters generated at random as in [134].

For each network, 20% of the variables were observed at random, considering

as goal variables the remaining 80%. For each network, we considered 10 differ-

ent observations. The probabilities were also selected at random, with uniform

probability for each value of the discrete variables, and considering an interval

whose width is 10% of its support for each continuous target variable.

6.5.1 Experiment 1

In this experiment we compared the performance of the Importance Sampling

(IS) algorithm versus the other two approximate propagation methods existing

in the literature for MTE networks: Markov Chain Monte Carlo (MCMC) and

Penniless Propagation (PP) [134].


Algorithm 17: ApproximateProbabilityPropagation(B, e, P)

Input: A hybrid Bayesian network B with variables X. An observation e about a set of variables E. A list of probabilities P to be calculated, of the form P(a_i < W_i < b_i | e) if W_i is continuous and P(W_i = w_i | e) otherwise.
Output: Estimations of P(a_i < W_i < b_i | e) or P(W_i = w_i | e).
1   Let W_1, . . . , W_n be the variables in X \ E.
2   S := SamplingDistributions(B, e).
3   Initialise r_i := 0 and P_i := 0, i = 1, . . . , n, and φ(e) := 0.
4   for j := 1 to m do
5       Generate a sample w*_1, . . . , w*_n for variables W_1^(j), . . . , W_n^(j) by simulating in reverse order to the one used in Algorithm 16, using the sampling distributions in S (a procedure for sampling from an MTE density is given in [134]).
6       for i := 1 to n do
7           if W_i is continuous then
8               if w*_i ∈ (a_i, b_i) then
9                   P_i := P_i + ψ(w*_1, . . . , w*_n) / ∏_{k=1}^{n} f*_{W_k}(w*_k).
10                  r_i := r_i + 1.
11          else
12              if w*_i = w_i then
13                  P_i := P_i + ψ(w*_1, . . . , w*_n) / ∏_{k=1}^{n} f*_{W_k}(w*_k).
14                  r_i := r_i + 1.
15      φ(e) := φ(e) + ψ(w*_1, . . . , w*_n) / ∏_{k=1}^{n} f*_{W_k}(w*_k).
16  φ(e) := φ(e) / m.
17  P_i := P_i / (r_i × φ(e)), i = 1, . . . , n.
18  return P_1, . . . , P_n.


For each set of observations, the execution time and the error in the estima-

tions were computed. The error was calculated using the χ2 divergence, which is

defined as

χ² = (1/n) ∑_{i=1}^{n} (p̂_i − p_i)² / p_i,

where p_i, i = 1, . . . , n are the true probabilities for each query, and p̂_i, i = 1, . . . , n

are their estimations.
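For reference, a direct sketch of this error measure (assuming strictly positive true probabilities) is:

    import numpy as np

    def chi2_divergence(true_p, estimated_p):
        # Average chi-squared divergence between the true probabilities p_i
        # and their estimates, as used in the experimental comparison.
        true_p = np.asarray(true_p, dtype=float)
        estimated_p = np.asarray(estimated_p, dtype=float)
        return float(np.mean((estimated_p - true_p) ** 2 / true_p))

    # Example with three queries
    error = chi2_divergence([0.2, 0.5, 0.3], [0.25, 0.45, 0.30])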

Figures 6.1 and 6.2 show the results of the experiment for the five networks.

The three box plots correspond to the χ2 error, execution time and the rate

error × time obtained for a set of 10 observations. Notice that the outliers have

not been represented in these charts. Each execution of the simulation algorithms

(IS and MCMC) was repeated 10 times, using in both cases a sample of size 500.

The results shown correspond to the average over the ten executions.

In order to simplify the potentials during the propagation, we have set a

threshold α = 0.95 for the mixed trees in the IS algorithm (see Section 6.3.1) and

for algorithm PP we chose the following parameters, taken from [134]: ε_Join =

0.05, ε_Disc = 0.05. We refer the reader to the original reference for a detailed

explanation of the meaning of those parameters. Finally, we limited the maximum

number of exponential terms in the PP algorithm to 2.

The experimental results show how the IS algorithm clearly outperforms

the other two in terms of accuracy, speed and rate error × time for network

artificial3. For networks alarm and barley, the error is again lower for IS,

but in exchange the running time is the worst. This is due to the higher com-

plexity of the potentials involved in these networks, which makes the algorithm

invest much time on obtaining the sampling distributions. However, the time

invested is worth it, as can be seen looking at the plots corresponding to the

rate error× time, which is better for IS. Therefore, we conclude that this exper-

iment suggests that IS offers the best way for dealing with the tradeoff between

complexity and accuracy when computing multiple probabilities.


Figure 6.1: Box plots of the χ² error, execution time and the rate error × time for the probabilities in networks alarm and barley (methods IS, MCMC and PP).


Figure 6.2: Box plots of the χ² error, execution time and the rate error × time for the probabilities in networks artificial1, artificial2 and artificial3 (methods IS, MCMC and PP).


6.5.2 Experiment 2

The second experiment is devoted to analysing the impact of the sample size and

the execution time on the behaviour of the simulation algorithms, that is, IS

and MCMC. Figures 6.3 and 6.4 show the χ2 divergence as a function of sample

size and time, for the five networks considered. It can be seen that IS converges

more quickly than MCMC, and also converges to a more accurate solution.

Figure 6.3: χ² error for methods IS and MCMC as a function of the sample size and execution time. Results for the networks alarm and barley.


Figure 6.4: χ² error for methods IS and MCMC as a function of the sample size and execution time. Results for networks artificial1, artificial2 and artificial3.


6.5.3 Experiment 3

Finally, we conducted an experiment aimed at testing the impact of using the

pruning method proposed in Section 6.3.1. More precisely, we performed two

tests. In the first one, we ran the algorithm with different α thresholds and

measured the χ2 error of the predictions. As in previous experiments, for each

of the 10 observations, the algorithm was run 10 times. The results displayed

in Figure 6.5 show the average of the errors obtained. As expected, the error

decreases as we increase the threshold, which means that we are being more

strict with the pruning criterion.

In the second test, we have plotted an MTE density for different α thresholds.

The density had originally 10 terms (all of them positive). The idea is to see the

impact of removing terms on the shape of the density function. The results are

displayed in Figure 6.6 and show how the shape of the density becomes smoother

as exponential terms are removed.

6.6 Conclusions

We have introduced a method for computing multiple posterior probabilities in

hybrid Bayesian networks with MTEs. The method is based on importance sam-

pling, which makes it an anytime algorithm. The algorithm is able to compute all

the required probabilities using a single sample. We have shown that the variance

remains bounded if the same sample is also used to compute the numerator and

denominator in each conditional probability.

The experiments conducted illustrate the behaviour of the proposed algo-

rithm, and they support the idea that the IS algorithm outperforms the two algo-

rithms previously used for carrying out probabilistic reasoning in hybrid Bayesian

networks with MTEs. Therefore, the methodology introduced in this chapter ex-

pands the class of problems that can be handled using hybrid Bayesian networks,

and more precisely, it provides versatility to the MTE model, by increasing the

efficiency in solving probabilistic inference tasks.

We expect to continue this research line by developing methods for carrying out more complex reasoning tasks. For instance, finding the most probable explanation to an observed fact in terms of a set of target variables, which is called abductive inference [60].


Figure 6.5: χ² error for different levels of pruning. The higher the α threshold, the less pruning is actually carried out. Results for networks alarm, barley, artificial1, artificial2 and artificial3.


Figure 6.6: Several approximations to the same MTE density using different levels for pruning exponential terms (thresholds 1.0, 0.8, 0.7, 0.6, 0.5 and 0.3 keep 10, 7, 3, 2, 1 and no exponential terms, respectively).



Part III

Applications


Chapter 7

Species distribution modelling

Abstract

Bayesian networks have been widely used to solve problems in environmental sciences [1] by discretising the continuous domains in order to apply the techniques developed so far for learning and inference. However, there are few studies dealing directly with continuous data, even in other areas. In this chapter the naïve Bayes (NB) and tree augmented naïve Bayes (TAN) classification models based on MTEs are applied. The aim is to characterise the habitat of the spur-thighed tortoise (Testudo graeca graeca), using several continuous environmental variables and one discrete (binary) variable representing the presence or absence of the tortoise. These models are compared with the fully discrete models, and the results show a better classification rate for the continuous ones. Therefore, the application of continuous models instead of discrete ones avoids the loss of statistical information due to discretisation. Moreover, the results of the TAN continuous model show a more spatially accurate distribution of the tortoise. The species is located in the Doñana Natural Park and in semiarid habitats. The proposed continuous models based on MTEs are valid for the study of species predictive distribution modelling.

7.1 Introduction

Over the last decade, advances in species predictive distribution modelling have

been paralleled by the evolution and the development of geographical informa-

tion systems (GIS), remote sensing, statistical modelling and database manage-

ment [71, 7, 92, 139]. Statistical models relate observations of species, communi-


ties or diversity [101, 66, 15, 70, 157] to environmental predictors, and project the

fitted relationships into geographical space to produce distribution maps [100].

The modelling of species distribution is a useful tool [70] that is widely used

in spatial ecology, biogeography and conservation biology. The models have

contributed significantly to test biogeographical, ecological and evolutionary hy-

potheses [4, 67], for assessing species invasion and proliferation [120], for rare

species distribution [68], for supporting conservation planning and reserve selec-

tion [56, 5], and for the study of the impacts of global change [121, 102, 152, 6].

Many statistical techniques have been applied to modelling [71, 157, 16, 118, 46],

including classical statistical models such as generalised linear regressions [69],

generalised additive models [98], generalised regression analysis and spatial pre-

diction (GRASP) [93] or logistic regressions [101]. Recently, machine learning

methods such as classification and regression trees [103, 45] and neural net-

works [105, 40] have also been applied.

Bayesian networks [77] have been applied in solving environmental prob-

lems [1] such as eutrophication in an estuary [13], credal classification in agri-

culture [160], management of endangered species [12, 123], water resources plan-

ning [14] and conservation of dunnarts [147]. Several advantages are gained from

this methodology [154]: suitability for incomplete data sets, possibility of struc-

tural learning, combination of different sources of knowledge, explicit treatment

of uncertainty and support for decision analysis, and fast response. However,

most environmental variables are continuous whilst Bayesian networks usually

build the model over discrete domains, so that continuous variables need to be

first discretised [154]. Discretisation implies capturing only rough characteristics

of the original distribution [59] and loss of statistical information. Thus, there is

a need to develop Bayesian networks that can work with continuous values.

The problem of using continuous and discrete variables simultaneously in-

volves more complex mathematical models. As we discussed in Section 2.2, there

are several techniques in the literature to cope with hybrid variables. In this work

we will concentrate on the use of MTEs.

As described in Chapter 3, Bayesian networks can be used to solve classifi-

cation problems. The most frequently-used structures for this purpose are the

naïve Bayes (NB) and the tree augmented naïve Bayes (TAN) [58]. These models


have usually been applied only to discrete variables. Bayesian classifiers bring

significant benefits over traditional statistical techniques. Mainly, accurate in-

formation about a target variable can be obtained without requiring complete

observation of all the remaining variables.

The aim of this chapter is to develop NB and TAN classifier structures based

on the MTE model that allows the simultaneous use of continuous and discrete

variables in the same network, without any pre-processing of either the variables

or the data. Continuous environmental variables and presence/absence records of

the spur-thighed tortoise (Testudo graeca graeca) were used to develop the models.

The results are compared with discrete NB and TAN structures proposed in [3]

and used to characterise the habitat of the tortoise.

7.2 Methodology

7.2.1 Variables and data set description

The study area selected (Figure 7.1) is located in the region of Andalusia (south-

ern Spain). A set of thematic maps of vegetation and land use, lithology and soils,

were selected and incorporated into an automatic spatial representation system,

ArcGis 9.2. A 10x10 kilometre grid was superimposed over the thematic maps

to calculate the percentage cover of each variable in each cell. Mean, maximum

and minimum values of height, slope, temperature and rainfall were also considered. In

this way, a matrix of 988 observations and 176 environmental variables was ob-

tained. The data relating to the presence/absence of the spur-thighed tortoise

(Testudo graeca graeca) for each cell were derived from the Atlas of Amphibians

and Reptiles of Spain [122]. This tortoise is an endangered species [74].

7.2.2 Selection of variables

Since the number of variables described in Section 7.2.1 is excessive, a selection of

the most representative ones is needed [3]. This selection can be done in different

ways within the framework of Bayesian networks [9, 10, 73, 104]. In this case, filter

measures, based on information functions applied to discrete variables (qualitative or quantitative), were used. The selected measure was Kullback-Leibler [84, 83].


Figure 7.1: Location of the study area. A 10x10 kilometre grid was superimposed to calculate the values of the variables.


Since this method is only defined for discrete variables, the continuous variables

were discretised using the k-means clustering algorithm. Three groups represent-

ing low, medium and high values for each variable were considered. This process

was developed using Elvira GUI software [34].

The discretisation of the continuous variables was taken into account only to

select the final set of variables in our study. Once obtained, they were treated as

continuous variables in order to implement the NB and TAN models.

Once the discretisation and Kullback-Leibler filter measure were applied, 10

variables were selected after consulting with an expert. In decreasing order they

are: areas with low vegetation (vegetation cover < 20%), sparse shrubland with

pasture, aridisols soil type, mean rainfall, mean temperature, irrigated woody

crops, marsh with vegetation, sand and dunes, dry woody crops and dry herba-

ceous crops.


7.2.3 Bayesian classifiers and calibration of models

Although the ideal would be to build a network without restrictions on the struc-

ture, this is not possible due to the limited data available. Therefore, networks

with fixed and simple structures, specifically designed for classification, have

been used. The extreme case is the NB model [44]. In Algorithm 18 the steps

for constructing a NB classifier with continuous features are shown. In essence,

they consist of building a Bayesian network with a NB structure and estimating

the marginal distribution for the class variable and conditional MTE densities for

the features [107].

Algorithm 18: Naïve Bayes classifier with continuous features

Input: A database D with variables X_1, . . . , X_n, Y.
Output: An NB model with root variable Y and features X_1, . . . , X_n, with joint distribution of class MTE.
1  Construct a new network G with nodes Y, X_1, . . . , X_n.
2  Insert the links Y → X_i, i = 1, . . . , n, in G.
3  Estimate a discrete distribution for Y, and a conditional MTE density for each X_i, i = 1, . . . , n, given its parents in G [135, 131, 128].
4  Let P be the set of estimated distributions.
5  Let NB be a Bayesian network with structure G and distributions P.
6  return NB.

The steps to build the TAN classifier with continuous features are shown in

Algorithm 19.

Elvira API software [34] was used both for the learning of the models and

their validation. It is worth noting that this is the only software in the literature

dealing with hybrid Bayesian networks using MTEs.

7.2.4 Inference in Bayesian classifiers

The goal in this section is to determine all the probabilistic information, both a

priori and a posteriori of the model. Thus, the model can be used to give a true

reflection of the initial data set (model a priori) or to predict the impact (in terms

of probability) of introducing evidence for certain variables (model a posteriori).

For example, if the evidence (the observation that the tortoise is present), P(tortoise = presence) = 1, is set in the class variable, the density functions of the remaining environmental variables will be modified. In this way, an approximation to the most probable configuration for the presence of the tortoise can be obtained.


Algorithm 19: TAN classifier with continuous features

Input: A database D with variables X_1, . . . , X_n, Y.
Output: A TAN model with root variable Y and features X_1, . . . , X_n, with joint distribution of class MTE.
1  Construct a complete graph C with nodes X_1, . . . , X_n.
2  Label each link (X_i, X_j) with the conditional mutual information between X_i and X_j given Y [51], i.e.,

   I(X_i, X_j | Y) = ∫_{Ω_{X_i}} ∫_{Ω_{X_j}} ∑_{y ∈ Ω_Y} f(x_i, x_j, y) log [ f(x_i, x_j | y) / ( f(x_i | y) f(x_j | y) ) ] dx_i dx_j.

3  Let T be the maximum spanning tree obtained from C using Algorithm 4 in Chapter 3.
4  Direct the links in T in such a way that no node has more than one parent.
5  Construct a new network G with nodes Y, X_1, . . . , X_n and the same links as T.
6  Insert the links Y → X_i, i = 1, . . . , n, in G.
7  Estimate a discrete distribution for Y, and a conditional MTE density for each X_i, i = 1, . . . , n, given its parents in G [135, 131, 128].
8  Let P be the set of estimated distributions.
9  Let TAN be a Bayesian network with structure G and distributions P.
10 return TAN.


7.2.5 Validation of the models

The models were tested using k-fold cross validation [150]. This technique is

applied to the initial data set and used to evaluate the quality of a classification

model.

A lazy choice would be to use holdout validation (k = 1). It is not consid-

ered cross validation as such, since the data never cross. The initial data set is

randomly divided into two subsets: The first one (Dl) is devoted to the training

phase of the model and the second one (Dt), to validating it. Usually, less than


a third of the initial data set is used for Dt.

For a k-value greater than 1, the data set is split into k subsets. In each step,

one subset is assigned to Dt and the remaining k − 1 to Dl. Cross validation is

repeated k times, each time taking a different subset for Dt. This is the approach

followed to test the classifiers presented in this chapter.
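A generic sketch of this validation scheme, with the fold bookkeeping made explicit (the classifier interface assumed here is illustrative), is:

    import numpy as np

    def k_fold_accuracy(build_classifier, X, y, k=10, seed=0):
        # k-fold cross validation: every case is used exactly once for
        # testing; returns the average classification rate over the folds.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        rates = []
        for test_idx in np.array_split(idx, k):
            train_idx = np.setdiff1d(idx, test_idx)
            model = build_classifier(X[train_idx], y[train_idx])
            rates.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))
        return float(np.mean(rates))

    # Leave-One-Out corresponds to k = len(y)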

A particular and extreme case of k-fold cross validation is when k coincides with

the number of cases n of the data set. Each model is trained with n − 1 cases and

tested with the remaining unused case. This validation method is called Leave-

One-Out [79]. It returns more accurate results, but its disadvantage is the need

to train as many models as there are cases in the initial data set and therefore it

is inefficient from a computational point of view.

The output model is constructed by including the entire database in Dl.

7.3 Results and discussion

7.3.1 NB model

The resulting NB model is shown in Figure 7.2. The introduction of the evidence

"presence of tortoise" changes the probability distribution of the features due to

the d-separation criterion (see Definition 1 in Chapter 2).

Figure 7.2: Modelling using a NB structure. MV: marsh with vegetation; DHC: dry herbaceous crops; IWC: irrigated woody crops; SSP: sparse shrubland with pasture; SD: sand and dunes; DC: dry woody crops; AWV: areas with low vegetation; MR: mean rainfall; AS: aridisols; MT: mean temperature.


                                 a priori   a posteriori   % change
Marsh with vegetation              3.70        15.90          329
Dry herbaceous crops              22.54        10.06          -56
Irrigated woody crops              6.83         3.39          -50
Sparse shrubland with pasture     17.56        40.00          127
Sand and dunes                     2.47        15.11          510
Dry woody crops                   38.02        17.18          -55
Areas with low vegetation         16.33        42.03          157
Aridisols                          6.85        23.11          237
Mean rainfall                    608.39       586.71        -3.43
Mean temperature                  16.10        16.92            5

Table 7.1: Expected values of the probability distributions a priori and a posteriori for the NB model. The values represent percentage cover except for mean rainfall (mm) and mean temperature (°C).

Figures 7.3 and 7.4 show the density function for each variable, both without

evidence of the tortoise being present (a priori) and with evidence (a posteriori).

Table 7.1 shows the expected values of the marginal density function for each

environmental variable both a priori and a posteriori.

In order to compare the mean results of the variables a priori and a posteriori,

a threshold value was calculated above or below which (depending on whether the

mean decreases or increases) the differences between the a priori density function

and the a posteriori density function are maximised.
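One plausible reading of this criterion, under the simplifying assumption that the prior and posterior distributions are summarised by samples drawn from them rather than by the fitted MTE density functions used in the chapter, is sketched below (all names are illustrative).

    import numpy as np

    def exceedance_threshold(prior_samples, posterior_samples, grid_size=200):
        # Threshold t that maximises the difference between the prior and
        # posterior probabilities of exceeding t, estimated from samples.
        lo = min(prior_samples.min(), posterior_samples.min())
        hi = max(prior_samples.max(), posterior_samples.max())
        grid = np.linspace(lo, hi, grid_size)
        prior_exc = np.array([(prior_samples > t).mean() for t in grid])
        post_exc = np.array([(posterior_samples > t).mean() for t in grid])
        best = int(np.abs(post_exc - prior_exc).argmax())
        return grid[best], prior_exc[best], post_exc[best]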

The mean value of variable marsh with vegetation is increased by 329%, from

a cover of 3.70% to 15.90%. The function shows that the prior probability of

its value exceeding the 2% threshold is 0.20, whereas the posterior probability

becomes 0.84.

The variable dry herbaceous crops decreases its mean value by 56%. The

function shows that the prior probability of its value exceeding the 5% threshold

is 0.74, whereas a posteriori it becomes 0.23, i.e., it is less likely that the variable

takes a high value (greater than 5).

For the variable irrigated woody crops, the mean value decreases by 50%,

from 6.83% to 3.39%. Both functions show that the mean value of the posterior

distribution should increase; however, the right tail of the posterior distribution is very long, and with lower probability, which explains this feature.


Figure 7.3: Prior and posterior marginal probability distributions in the NB model for the variables marsh with vegetation, dry herbaceous crops, irrigated woody crops, sparse shrubland with pasture, sand and dunes and dry woody crops.


Figure 7.4: Prior and posterior marginal probability distributions in the NB model for the variables areas with low vegetation, mean rainfall, aridisols and mean temperature.



The mean value for the variable sparse shrubland with pasture increases by

127%. The function suggests that the prior probability of its value exceeding the

19% threshold is 0.30, whereas a posteriori it becomes 0.74. In other words, it is

more likely that the variable takes a high value.

The change in the sand and dunes variable is remarkable. The mean value

increases by 510%. The function shows that the prior probability of its value

exceeding the 2% threshold is 0.12, whereas a posteriori it becomes 0.86.

For the variable dry woody crops, the mean value decreases by 55%. The

prior probability that its value is lower than the 35% threshold is 0.54, whereas a posteriori it increases to 0.86, i.e., low values of this variable are more likely when the presence of the tortoise is recorded.

The mean value of variable areas with low vegetation increases by 157%. The

function shows that the prior probability of its value exceeding the 16% threshold

is 0.29, whereas a posteriori it becomes 0.79.

The variable aridisols increases its mean value by 237%. The prior probability

that its value exceeds the 5% threshold is 0.22, whereas a posteriori it becomes

0.80.

The decrease in the variable mean rainfall is smaller (3.43%). The function shows that the prior probability of its value being lower than the 437 mm threshold is 0.16, whereas if the presence of the tortoise is known, this probability is 0.40, i.e., lower precipitation favours the presence of the tortoise.

For mean temperature, the mean value increases slightly, by 5%. The probability that the temperature exceeds the threshold value of 17°C is 0.36 a priori, whereas a posteriori it becomes 0.63. Therefore, slightly higher mean temperatures favour the presence of the tortoise.

Summing up, the marginal probability distributions of the NB model show

that it is likely to find the tortoise in areas with sand and dunes, marsh with

vegetation, aridisols, areas with low vegetation and in sparse shrubland with

pasture. The remaining variables vary by less than 100% between the cases of evidence and no evidence. The model also suggests that it is likely to find the tortoise where mean rainfall is lower and mean temperature is slightly higher.


7.3.2 TAN model

Figure 7.5 shows the constructed TAN model. The main difference with respect to

the NB is the relationship between the features. This increases the number of arcs

in the structure and its complexity, but improves the accuracy and expressivity.

(a) Tree showing the relations among the features. (b) TAN model constructed by adding to the previous tree the class variable and links from it to each feature.

Figure 7.5: Sequence followed to obtain the TAN model. MV: marsh with vegetation; DHC: dry herbaceous crops; IWC: irrigated woody crops; SSP: sparse shrubland with pasture; SD: sand and dunes; DC: dry woody crops; AWV: areas with low vegetation; MR: mean rainfall; AS: aridisols; MT: mean temperature.

Figure 7.5 shows the structure of the corresponding TAN model built from a

tree. The procedure consists of adding the tortoise variable and drawing an arc

from it to each environmental variable.
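The models in this chapter are learnt with the MTE machinery of the preceding chapters; the following sketch only illustrates the structural step just described, assuming the pairwise conditional mutual informations between features given the class have already been estimated and are passed in as a hypothetical cmi dictionary:

    import networkx as nx

    def build_tan_structure(features, cmi, class_var="TORTOISE"):
        """Build a TAN DAG: a maximum spanning tree over the features (weighted
        by conditional mutual information given the class), oriented away from
        an arbitrary root, plus an arc from the class variable to every feature."""
        g = nx.Graph()
        for i, f1 in enumerate(features):
            for f2 in features[i + 1:]:
                g.add_edge(f1, f2, weight=cmi[frozenset((f1, f2))])
        tree = nx.maximum_spanning_tree(g)
        dag = nx.bfs_tree(tree, features[0])   # orient the tree edges away from the root
        for f in features:
            dag.add_edge(class_var, f)         # class variable as parent of each feature
        return dag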

Figures 7.6 and 7.7 show the density functions for each variable in the case of

there being no evidence of the tortoise (a priori) and with evidence (a posteriori).


Table 7.2 shows the expected values of the prior and posterior density function

for each environmental variable.

                                a priori   a posteriori   % change
Marsh with vegetation               3.70          15.90        329
Dry herbaceous crops               22.54          10.06        -56
Irrigated woody crops               6.16           4.08        -34
Sparse shrubland with pasture      17.56          40.00        127
Sand and dunes                      2.47          15.11        510
Dry woody crops                    38.02          17.18        -55
Areas with low vegetation          20.70          47.12        228
Aridisols                          15.50          26.62         72
Mean rainfall                     600.81         392.56        -35
Mean temperature                   15.85          16.92          7

Table 7.2: Expected values of the probability distributions a priori and a posteriori for the TAN model. The values represent percentage cover except for mean rainfall (mm) and mean temperature (°C).

In the same way as for the NB, a threshold value has been calculated to see

where the greatest differences between the prior and posterior density functions

lie.

Mean percentage cover for irrigated woody crops decreases by 34%. The

function shows that the prior probability of its value exceeding the 1% threshold

is 0.47, whereas a posteriori it becomes 0.78. The probability that its value is lower than the 10% threshold (the same threshold as in the NB model) goes from 0.72 to 0.91.

For areas with low vegetation, the mean value increases its cover by 228%.

The function suggests that the probability of its value exceeding the threshold of

20% is 0.33 a priori, whereas a posteriori this probability becomes 0.84.

The variable aridisols increases its mean value by 72%. The prior probability

that its value is greater than the threshold 3% is 0.35, whereas a posteriori, the

probability becomes 0.89.

Mean rainfall decreases by 35%. The function shows that the prior probability

of its value being less than the threshold 407 mm is 0.14, whereas a posteriori it

is 0.71. So, it is more likely to find low precipitation associated with the presence

of the tortoise.


[Six density panels, each comparing the A Priori and A Posteriori curves.]

Figure 7.6: Prior and posterior marginal probability distributions in the TAN model for marsh with vegetation, dry herbaceous crops, irrigated woody crops, sparse shrubland with pasture, sand and dunes and dry woody crops.


[Four density panels, each comparing the A Priori and A Posteriori curves.]

Figure 7.7: Prior and posterior marginal probability distributions in the TAN model for areas with low vegetation, mean rainfall, aridisols and mean temperature.


Mean temperature increases slightly by 7%. The function shows that the prior

probability that the temperature exceeds the threshold of 17°C is 0.37, whereas

the posterior probability becomes 0.63.

Marsh with vegetation, dry herbaceous crops, sparse shrubland with pasture,

sand and dunes, and dry woody crops show prior and posterior marginal proba-

bility functions similar to the NB model.

Thus, NB and TAN models show similar distributions both a priori and a

posteriori, but the quantification varies (Tables 7.1 and 7.2). The models differ in the definition of relationships between the features in the TAN model, so that each

variable is influenced, not only by the class variable tortoise, but also by the

variables directly connected with it in the network. Five probability distributions

differ between TAN and NB, so the habitat description is slightly different: Mean

precipitation decreases by 33% (from 586.71 mm in NB to 392.56 mm in TAN),

cover of irrigated woody crops increases by 20% (from 3.39% in NB to 4.08% in

TAN), areas with low vegetation increases by 12% (from 42.03% in NB to 47.12%

in TAN) and aridisols increases by 15.2% (from 23.11% in NB to 26.62% in TAN).

Table 7.2 identifies the representative variables related to the presence of the

tortoise. In descending order, they are: sand and dunes, marsh with vegetation,

areas with low vegetation and sparse shrubland with pasture. For these variables,

evidence of the tortoise being present implies an increase of more than 100% in

their mean cover.

Aridisols increases by only 72%. Mean rainfall and mean temperature are

important climatic variables in the habitat characterisation, and indicate that

the tortoise’s habitat has a lower mean rainfall and a higher mean temperature.

7.3.3 Validation

Table 7.3 shows the classification rate for the discrete [3] and the continuous

models, using 10-fold cross validation. Classification rates grouped into NB, TAN, continuous and discrete models, as well as the standard deviation of each value, are also shown.
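The validation itself was run with the Elvira-based implementation of the classifiers; a generic sketch of the protocol, assuming a classifier object with scikit-learn style fit and predict methods and data stored in NumPy arrays, is:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    def cv_classification_rate(model, X, y, folds=10, seed=0):
        """Mean and standard deviation of the classification rate (accuracy)
        estimated by stratified k-fold cross validation."""
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
        rates = []
        for train_idx, test_idx in skf.split(X, y):
            model.fit(X[train_idx], y[train_idx])
            rates.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))
        return np.mean(rates), np.std(rates)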

Figure 7.8 shows two box plots. The first one represents the values of the classification rates for the continuous and discrete models, and the second one shows the same values for the NB and TAN models.


                     Classification rate   Standard deviation
Discrete NB                       0.9362               0.0165
Continuous NB                     0.9707               0.0074
Discrete TAN                      0.9493               0.0519
Continuous TAN                    0.9707               0.0074
NB models                         0.9535               0.0165
TAN models                        0.9600               0.0377
Continuous models                 0.9707               0.0072
Discrete models                   0.9428               0.0381

Table 7.3: 10-fold cross validation for the discrete and continuous version of NB and TAN.


[Two box plots of classification rates on a scale from 0.88 to 1.00: continuous vs. discrete models, and NB vs. TAN models.]

Figure 7.8: Box plots comparing the classification rate for continuous against discrete models and NB against TAN models.

After applying Lilliefors' test to check the normality of the data, a t-test was applied to compare the experimental results (see Table 7.4). There are significant differences between continuous and discrete models (p-value of 0.0021, p < 0.05). This difference is due to the loss of statistical information in the discretisation process. On the other hand, there are no significant differences between TAN and NB models (p-value of 0.2531, p > 0.05). In general, TAN models have been shown to be better for classification than NB models, but


with scarce data (as in our case) the MTE learning process may lead to a worse classification rate, which can slightly alter the results of the TAN. In any case, Figure 7.8 suggests that TAN outperforms NB, although the difference is not statistically significant.

                        p-value
Continuous - Discrete    0.0021
TAN - NB                 0.2531

Table 7.4: Statistical differences between the models.
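The thesis does not detail which variant of the t-test was applied; a sketch of the comparison, assuming an unpaired test over the per-fold classification rates (a paired test could equally be used), is:

    from scipy import stats
    from statsmodels.stats.diagnostic import lilliefors

    def compare_models(rates_a, rates_b, alpha=0.05):
        """Check normality of two samples of per-fold classification rates with
        Lilliefors' test and compare their means with a t-test."""
        for name, rates in (("A", rates_a), ("B", rates_b)):
            _, p_norm = lilliefors(rates, dist="norm")
            print(f"sample {name}: Lilliefors p-value = {p_norm:.4f}")
        _, p_value = stats.ttest_ind(rates_a, rates_b)
        verdict = "significant" if p_value < alpha else "not significant"
        print(f"t-test p-value = {p_value:.4f} ({verdict} at level {alpha})")
        return p_value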

7.3.4 Spatial application of the models

Figures 7.9(a) and 7.9(b) show the probability of the tortoise being present in

Andalusia according to the discrete models developed in [3]. The same is shown in

Figures 7.10(a) and 7.10(b) for the continuous models developed in this chapter.

Figures 7.9(a), 7.9(b), 7.10(a), 7.10(b) clearly indicate the existence of two

populations of tortoise in Andalusia: one located in the southwest and another

in the southeast. The discrete models NB and TAN recognise this pattern, but

show a more dispersed distribution in the region, locating the presence of tor-

toises in less likely inland habitats. The continuous NB model shows a better

characterisation of the habitat, however it includes an area close to the Strait of

Gibraltar, determined by higher precipitation (mean value of 586.71 mm) with

respect to the continuous TAN.

The continuous TAN model corresponds exactly to the presence of the tor-

toise in Andalusia. The probability distributions determined by this model characterise both habitats. In the southwest, the tortoises occur in areas of sandy substrate alternating with vegetation near marshes. These environmental variables correspond spatially to the Doñana National Park. In the southeast the

habitat is semiarid, with sparse shrubland with pasture, areas with low vegeta-

tion and an abundance of aridisol soil types. The model shows that the most

important factors in the distribution of tortoises in the southeast are climate and

vegetation type.

The results obtained in the characterisation of tortoise habitat indicate that

NB and TAN continuous models based on Mixtures of Truncated Exponentials


(a) Discrete NB.

(b) Discrete TAN.

Figure 7.9: Probability of presence of the tortoise in the region of Andalusiaaccording to the discrete models [3].


(a) Continuous NB.

(b) Continuous TAN.

Figure 7.10: Probability of presence of the tortoise in the region of Andalusiaaccording to the continuous models.


(MTEs) can be applied to species distribution modelling, allowing the simul-

taneous use of both discrete and continuous variables in the development of the

models.

7.4 Conclusions

We have applied two classification models based on MTEs (NB and TAN) to char-

acterise the habitat of the spur-thighed tortoise (Testudo graeca graeca), using

several continuous environmental variables. The application of the hybrid models

instead of the fully discrete ones has yielded a better classification rate and also a more spatially accurate distribution of the tortoise. The study also shows that the results of the continuous TAN model correspond exactly to the presence of the tortoise in the region of Andalusia, according to an expert's opinion.


Chapter 8

Relevance analysis of

performance indicators in higher

education

Abstract

In this chapter we describe a methodology for relevance analysis of performance indicators in higher education based on the use of Bayesian networks. We analyse the behaviour of the described methodology in a practical case, showing that it is a useful tool to help decision making when elaborating policies based on performance indicators. The methodology has been implemented in a software package that interacts with the Elvira package for graphical models, and that is available to the administration board at the University of Almería (Spain) through a web interface. The software also implements a new method for constructing composite indicators by using a Bayesian network regression model based on MTEs.

8.1 Introduction

Over the last decades, the way in which the financial support provided by the administration to public universities is determined has gradually moved to a system where an increasing part of the funds is obtained depending on the goals achieved by each institution. The usual way to determine to what extent an institution has achieved the agreed goals is through the so-called


performance indicators [38]. Sometimes, the term performance is understood in a

wide sense, assuming that a performance indicator is any institutional goal that

can be objectively measured [43].

In order to design efficient policies aimed at increasing the amount of public funds, the administration boards of universities should determine which variables under their control actually have an impact on the value of the performance indicators that are later used to compute the funds. This task requires taking into account a large number of variables of different natures (qualitative and quantitative), which may have a complex dependence structure. In recent years, there has been increasing interest, within the fields of Statistics and Artificial Intelligence, in handling scenarios involving a large number of variables. One of the most satisfactory solutions is based on the use

of probabilistic graphical models and, more precisely, Bayesian networks [19, 77].

Examples of applications of Bayesian networks in enterprise information systems

can be found in the literature [156].

The main advantage of Bayesian networks is that they have rich semantics and can be easily interpreted by the user without needing a strong background in Statistics. From an operational point of view, Bayesian networks provide a natural framework for relevance analysis and can also be used for prediction tasks [111].

In this chapter we propose a methodology for relevance analysis of perfor-

mance indicators in higher education based on the use of Bayesian network mo-

dels. We illustrate the appropriateness of the proposed methodology for the par-

ticular case of the University of Almería (Spain). We also describe the decision

support system designed to implement this methodology. The system interacts

with the Elvira platform [34], and provides a web interface that guides the user

through the process of determining the variables that are relevant to a given

performance indicator. Furthermore, the software implements a novel procedure

for constructing composite indicators, based on rankings provided by experts.

Composite indicators [113] are indicators that sum up the information provided

by various indicators of different nature, with the aim of describing, with a sin-

gle number, the performance of an institution. Our proposal is a supervised

algorithm, consisting of creating a database using the rankings provided by the


experts and the corresponding values of the individual indicators. The composite

indicator is then induced from the above-mentioned database through the

Bayesian network regression model [111] explained in Section 3.6.

The rest of the chapter is organised as follows. In Section 8.2 we explain

the fundamentals of the methodology for relevance analysis using Bayesian net-

works. Section 8.3 is devoted to show the behaviour of the proposed technique

in a real-world problem. The software developed to implement the methodology

is described in Section 8.4. We describe the procedure to construct composite

indicators in Section 8.5. The chapter ends with conclusions in Section 8.6.

8.2 Relevance analysis using Bayesian networks

One of the most important advantages of Bayesian networks is that the structure

of the associated DAG determines the dependence and independence relationships

among the variables (see d-separation criterion in Section 2.1), so that it is pos-

sible to find out, without carrying out any numerical calculations, which

variables are relevant or irrelevant for some other variable of interest (for instance,

a performance indicator). More precisely, we will illustrate how the relevance

analysis is performed in Bayesian networks through the concept of transmission

of information, so that two variables are irrelevant to each other if no informa-

tion can be transmitted between them. Thus, for instance, we could determine

which are the variables over which the administration board of a university has

to operate in order to change the value of a performance indicator.

Though relevance analysis can be carried out simply taking into account the

structure of the network, once the relevant variables for a given performance

indicator have been located, it is necessary to know to which extent the changes

in those variables determine the value of the performance indicator. This is

achieved by using the distributions of the Bayesian network.

Assume Xi is the performance indicator in which we are interested, and E is

a set of variables that can be controlled by the administration board. Then, the

prediction for the value of Xi given E = e can be obtained by computing the

distribution p(xi | E = e), which would give us the likelihood of each possible

value of Xi given each possible configuration of E. This distribution can be


obtained from the joint distribution in Equation (2.2).
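Elvira computes such conditional distributions with efficient propagation algorithms; conceptually, for a discrete network the quantity p(xi | E = e) can be obtained by brute-force enumeration over the joint distribution, as in the following sketch (the dictionary-based representation of the network is an assumption made for illustration):

    import itertools

    def query(target, evidence, domains, parents, cpt):
        """p(target | evidence) by enumeration over a discrete Bayesian network.
        domains : {variable: list of values}
        parents : {variable: list of parent variables}
        cpt     : {variable: function(value, parent_value_dict) -> probability}"""
        variables = list(domains)
        scores = {value: 0.0 for value in domains[target]}
        for assignment in itertools.product(*(domains[v] for v in variables)):
            config = dict(zip(variables, assignment))
            if any(config[v] != val for v, val in evidence.items()):
                continue                      # skip configurations inconsistent with e
            p = 1.0
            for v in variables:
                p *= cpt[v](config[v], {u: config[u] for u in parents[v]})
            scores[config[target]] += p
        total = sum(scores.values())
        return {value: s / total for value, s in scores.items()}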

Once we know how to use a Bayesian network model for relevance analysis,

we must consider how to obtain it. Nowadays, university administration is fully

assisted by computers, so that a large amount of statistical data is available.

More precisely, it is in general possible to obtain databases composed of records

describing items of information that contain the value of some performance indi-

cators together with other variables that can be controlled. For instance, we can

have a record with information about a course (number of students, number of

lecturers, etc.) together with some performance indicator regarding that course

(the success rate, for instance).

There are several algorithms that allow the construction of Bayesian networks

from databases. We will mention two of them that are commonly used: The so-

called K2 [36] and PC [148] algorithms. The K2 algorithm searches within the

space of all Bayesian networks that contain the variables in the database, and

tries to find an optimal network in terms of the likelihood of the database for each

candidate network. On the other hand, the PC algorithm tries to determine the

structure of the network by means of statistical tests of independence. Neither of the two methods is absolutely superior to the other, so in practical applications

it is common to construct two networks, one with each algorithm, and then use

the network for which the likelihood of the database is higher. A common feature

of both algorithms is that they operate with qualitative variables, therefore, the

continuous variables must be discretised beforehand. A review on discretisation

methods can be found in [78].
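As an illustration of how K2 proceeds (this is not the Elvira implementation), the greedy parent-selection step for a single variable can be sketched as follows, where score(node, parents) is assumed to be a decomposable network score, such as the Cooper-Herskovits marginal likelihood computed on the database:

    def k2_parents(node_index, ordering, score, max_parents=3):
        """Greedy K2 search for the parent set of ordering[node_index].
        Candidate parents are restricted to variables preceding the node in the
        fixed ordering; parents are added while the score keeps improving."""
        node = ordering[node_index]
        parents = []
        best = score(node, parents)
        improved = True
        while improved and len(parents) < max_parents:
            improved = False
            candidates = [v for v in ordering[:node_index] if v not in parents]
            if not candidates:
                break
            best_candidate = max(candidates, key=lambda c: score(node, parents + [c]))
            new_score = score(node, parents + [best_candidate])
            if new_score > best:
                best, parents, improved = new_score, parents + [best_candidate], True
        return parents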

There are free software packages that allow the construction of Bayesian net-

works from databases. In this work we have used the Elvira system [34].

8.3 Application to the analysis of performance

indicators at the University of Almería

In this section we describe a practical application of the methodology introduced

in Section 8.2, consisting of the analysis of some performance indicators that

are used to compute the amount of public funds received by the University of


Almería.

The starting point is a database with 1345 records and 17 variables regarding

all the courses taught at the University of Almería in the different degree programs

during the academic year 2003-2004. A description of the considered variables can

be found in Table 8.1. The first five variables correspond to academic performance

indicators.

Performance Rate: Ratio between the number of students that succeed in a course and the number of students that go to the exam.
Relative Mark: Average of the marks obtained by each student in the course in relation to the other students' marks.
RepStudents: Percentage of students that repeat a course.
Used Exam Diets: Number of times a student goes to the course exam before passing.
Rate of Diets Used: Number of times a student goes to the course exam divided by the maximum number of trials allowed.
#StudTheLect: Number of students per classroom in theoretical lectures.
#StudPrtLect: Number of students per classroom in practical lectures.
Type Of Course: Whether the course is compulsory or optional.
Semester: The semester, within the degree schedule, in which the course is taught.
#Lecturers: Number of lecturers in the same course.
Lecturer Evaluation: Mark obtained by the lecturer in the students' opinion polls.
Type of Lecturer: Position of the lecturer.
PerctPhD: Percentage of lecturers in the course with PhD degree.
Degree Program: The degree program in which the course is taught.
FullTimeStud: Percentage of full time students.
AvgAccessMark: Average marks of the students in the degree obtained in the high school.
P80AccessMark: 80th percentile of the marks of the students in the degree obtained in the high school.

Table 8.1: Description of the variables considered in the case study.

The database has been pre-processed by discretising the continuous variables

using the k-means clustering algorithm, which is one of the most popular clus-

tering algorithms in data mining [159], establishing a number of 5 categories for

each discretised variable. We have used the PC and the K2 algorithms, obtaining

the best model, in terms of likelihood of the data, with the K2. The resulting

network can be seen in Figure 8.1.
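The discretisation step is described only at this level of detail; one possible realisation of the k-means discretisation of a single continuous column into 5 ordered categories is sketched below:

    import numpy as np
    from sklearn.cluster import KMeans

    def discretise_kmeans(column, k=5, seed=0):
        """Discretise a continuous column into k ordered categories via k-means."""
        values = np.asarray(column, dtype=float).reshape(-1, 1)
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(values)
        order = np.argsort(km.cluster_centers_.ravel())        # sort centres ascending
        relabel = {old: new for new, old in enumerate(order)}  # lowest centre -> category 0
        return np.array([relabel[label] for label in km.labels_])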

Attending to the structure of the network in that figure, it can be seen that there

are two important variables, Type of Course and Degree Program, which play an

important role in the network, since information can flow from them to all the

performance indicators. We can evaluate the importance of these variables using


[Directed acyclic graph over the 17 variables of Table 8.1.]

Figure 8.1: Bayesian network obtained for the case study.

the quantitative part of the Bayesian network. For instance, if we concentrate

on the variable Type of Course, its influence on two important performance indicators, Performance Rate and Relative Mark, is clear from the conditional probabilities displayed in Tables 8.2 and 8.3 respectively. The differences in the

distribution of the values of both performance indicators are significant depend-

ing on whether the course is compulsory or optional. This fact suggests that a

separate study of compulsory and optional courses is appropriate.

8.3.1 Relevance analysis for compulsory courses

The Bayesian network obtained using the records in the database concerning

compulsory courses is displayed in Figure 8.2. We can draw the following conclu-

sions:


Performance rate   Prior probability   Compulsory   Optional
[0, 0.545)                      0.03         0.04       0.03
[0.545, 0.735)                  0.10         0.15       0.04
[0.735, 0.855)                  0.17         0.25       0.09
[0.855, 0.955)                  0.20         0.26       0.13
[0.955, 1]                      0.49         0.31       0.71

Table 8.2: Conditional probabilities of Performance Rate given the Type of Course.

Relative mark      Prior probability   Compulsory   Optional
[0, 0.195)                      0.23         0.28       0.15
[0.195, 0.315)                  0.27         0.29       0.24
[0.315, 0.465)                  0.24         0.21       0.28
[0.465, 0.775)                  0.18         0.14       0.23
[0.775, 1]                      0.08         0.07       0.09

Table 8.3: Conditional probabilities of Relative Mark given the Type of Course.

• The structure of the Lecturer board is irrelevant to the rest of the network,

since variables Number of Lecturers, Type of Lecturer and Percentage of

PhDs are disconnected from the rest.

• The evaluation obtained by a lecturer in the opinion polls is fully determined

by the number of students in classrooms in theoretical lectures. This is an important conclusion, since it is common to find poor evaluation results in large classrooms, which suggests that it is the size of the classroom, rather than the lecturer's profile, that determines the result of the evaluation.

• Any possible information flow towards the performance indicators goes

through variable Degree Program. It is true that the administration board

                            # of students per classroom in theoretical lectures
Performance rate   Prior   < 25.5   [25.5, 49.5)   [49.5, 79.25)   [79.25, 114.75)   ≥ 114.75
[0, 0.355)          0.20     0.18           0.19            0.20              0.20       0.24
[0.355, 0.535)      0.28     0.26           0.27            0.28              0.28       0.29
[0.535, 0.695)      0.23     0.23           0.23            0.23              0.23       0.22
[0.695, 0.845)      0.18     0.20           0.19            0.18              0.18       0.15
[0.845, 1]          0.11     0.13           0.12            0.11              0.11       0.09

Table 8.4: Performance rate vs. # students in theoretical lectures for compulsory subjects.


Figure 8.2: Bayesian network for compulsory courses.

cannot control the value of the degree program where a subject is included,

but they actually can control some characteristics of the degree program, such as

the access mark and number of students per classroom in theoretical and

practical lectures. The effect of these variables on the performance rate is

illustrated in Tables 8.4, 8.5 and 8.6.

• Attending to the results in Table 8.4, we can conclude that a good policy in

order to increase the performance rate is to establish a maximum number of students per classroom not greater than 49.

                            # of students per classroom in practical lectures
Performance rate   Prior   < 23.9   [23.9, 41.65)   [41.65, 68.5)   [68.5, 114.16)   ≥ 114.16
[0, 0.355)          0.20     0.19            0.20            0.21             0.20       0.24
[0.355, 0.535)      0.28     0.27            0.28            0.28             0.28       0.29
[0.535, 0.695)      0.23     0.23            0.23            0.23             0.23       0.24
[0.695, 0.845)      0.18     0.19            0.18            0.18             0.18       0.15
[0.845, 1]          0.11     0.12            0.11            0.11             0.11       0.10

Table 8.5: Performance rate vs. # students in practical lectures for compulsory subjects.


Figure 8.3: Bayesian network for optional courses.


• The influence of the number of students in practical lectures is not so im-

portant, as can be seen in Table 8.5 (the columns are rather similar).

• Finally, the access marks have little impact on the performance rate. Only by increasing the 80th percentile of the access mark above 7.3 points out of 10 can a slight improvement in the performance rate be noticed.

8.3.2 Relevance analysis for optional courses

The Bayesian network obtained using the records in the database concerning op-

tional courses is displayed in Figure 8.3. Analysing the structure of this network,

we can deduce that the lecturers’ profile, including the result of the evaluation,

is irrelevant to the course performance indicators.

The access marks are connected to the performance rate through the degree

program. Their impact on this indicator is quantified in Table 8.7. The proba-


                            Average Access Mark
Performance rate   Prior   [5.32, 5.92)   [5.92, 6.17)   [6.17, 6.46)   [6.46, 6.83)   [6.83, 7.71]
[0, 0.355)          0.20           0.19           0.21           0.21           0.19           0.19
[0.355, 0.535)      0.28           0.28           0.28           0.28           0.27           0.27
[0.535, 0.695)      0.23           0.24           0.23           0.23           0.23           0.23
[0.695, 0.845)      0.18           0.18           0.17           0.17           0.18           0.19
[0.845, 1]          0.11           0.11           0.11           0.11           0.12           0.13

                            80th Percentile of the Access Mark
Performance rate   Prior   [5.46, 6.30)   [6.30, 6.65)   [6.65, 7.30)   [7.30, 7.61)   [7.61, 8.5]
[0, 0.355)          0.20           0.19           0.21           0.21           0.20           0.19
[0.355, 0.535)      0.28           0.28           0.28           0.28           0.27           0.27
[0.535, 0.695)      0.23           0.23           0.23           0.23           0.23           0.23
[0.695, 0.845)      0.18           0.18           0.18           0.18           0.18           0.19
[0.845, 1]          0.11           0.11           0.11           0.11           0.12           0.12

Table 8.6: Performance rate vs. student profile for compulsory subjects.

                            Average Access Mark
Performance rate   Prior   [5.32, 6.00)   [6.00, 6.20)   [6.20, 6.44)   [6.44, 6.69)   [6.69, 8.03]
[0, 0.59)           0.20           0.18           0.19           0.22           0.22           0.17
[0.59, 0.71)        0.19           0.17           0.20           0.29           0.19           0.18
[0.71, 0.82)        0.20           0.21           0.21           0.21           0.21           0.18
[0.82, 0.92)        0.20           0.23           0.21           0.19           0.19           0.19
[0.92, 1]           0.21           0.20           0.19           0.19           0.20           0.28

                            80th Percentile of the Access Mark
Performance rate   Prior   [5.37, 6.28)   [6.28, 6.61)   [6.61, 6.92)   [6.92, 7.22)   [7.22, 8.57]
[0, 0.59)           0.20           0.19           0.19           0.21           0.21           0.18
[0.59, 0.71)        0.19           0.18           0.19           0.19           0.19           0.18
[0.71, 0.82)        0.20           0.21           0.21           0.21           0.20           0.18
[0.82, 0.92)        0.20           0.22           0.21           0.20           0.19           0.19
[0.92, 1]           0.21           0.20           0.20           0.20           0.21           0.26

Table 8.7: Performance rate given access marks for optional courses.

bilities in that table indicate that the best performances are attained for average

access marks above 6.69 points and 80th percentiles above 7.22 points.

The number of students in theoretical lectures is more relevant here than

in the case of compulsory subjects, since it is connected to the relative mark,

the number of diets used and the percentage of repeating students. Also, it is

indirectly connected to the performance rate.

Table 8.8 summarises the probabilities of indicator Performance Rate given

the number of students in theoretical lectures. It can be observed that the per-

formance rate is strongly influenced by this variable, so that low performances

appear when the number of students increases. Therefore, any policy oriented to decreasing the number of students per classroom brings a significant improvement in the course performance.


                            # students per classroom in theoretical lectures
Performance rate   Prior   < 10   [10, 18)   [18, 28)   [28, 51)   [51, 137]
[0, 0.59)           0.20   0.18       0.20       0.19       0.21        0.20
[0.59, 0.71)        0.19   0.15       0.19       0.19       0.20        0.19
[0.71, 0.82)        0.20   0.17       0.20       0.21       0.21        0.22
[0.82, 0.92)        0.20   0.19       0.19       0.19       0.21        0.23
[0.92, 1]           0.21   0.31       0.22       0.21       0.17        0.16

Table 8.8: Results for optional subjects given the size of the classrooms in theoretical lectures.

                            # students per classroom in practical lectures
Performance rate   Prior   < 9    [9, 17)   [17, 25)   [25, 38)   ≥ 38
[0, 0.59)           0.20   0.18      0.20       0.20       0.20   0.20
[0.59, 0.71)        0.19   0.16      0.18       0.19       0.19   0.19
[0.71, 0.82)        0.20   0.17      0.20       0.21       0.21   0.21
[0.82, 0.92)        0.20   0.19      0.19       0.20       0.21   0.22
[0.92, 1]           0.21   0.30      0.23       0.20       0.18   0.17

Table 8.9: Results for optional subjects given the size of the classrooms in practical lectures.


Finally, it can be concluded from the probabilities in Table 8.9 that the

influence of the number of students in practical lectures is not as important as in

the case of theoretical lectures.

8.4 Software for relevance analysis

We have implemented the methodology described above in a software package, called academic advisor, which provides an intuitive web-based interface appropriate for academic staff not familiar with Bayesian network models.

The functionality of the academic advisor is based on a client/server Web architecture. On the client side, the users interact with the system using a Web browser to access an interface with data forms. The server side contains most of the functionality of the application. The system uses the Apache Tomcat 5.5 Servlet/JSP Container as Web server, which allows servlets to be run and JSPs (Java Server Pages) to be generated automatically. In addition, the Java classes of the Elvira program [34] and the Bayesian networks described in Section 8.3 are stored on the server.

Servlets and JSPs are two methods for creating dynamic Web pages in the


context of a server and using the Java language. More precisely, the JSPs are

HTML pages with special labels and Java code embedded using scripts, which

makes it possible to generate dynamic content. On the other hand, a servlet is a

Java program that receives requests, processes them and generates a Web page

from them. The structure of the academic advisor is shown in Figure 8.4.

[Clients (web browsers) send requests over the Internet to the Tomcat web server, which hosts the JSPs, the Java servlet classes and the Elvira Java classes, together with the Bayesian networks for compulsory courses, optional courses and degree programmes.]

Figure 8.4: Structure of the decision support system for relevance analysis.

We can justify the use of these technologies from different points of view.

First, this problem requires constant interaction between the application and the user. In addition, the client/server approach seems appropriate, since it makes possible the remote use of the application by different users at the same time. On the other hand, the use of the Java language allows direct interaction with the Elvira system, which is implemented in that language.

The interaction process works as follows: the users introduce the input data using a Web form designed in HTML, or in JSP if some kind of processing in Java is needed. This request is sent to the server to activate the corresponding


servlet, which manages the search by interacting with the underlying Elvira Java classes and Bayesian network files. Once the information has been processed, another servlet is in charge of generating an HTML/JSP page that shows the results to the user (response).

From the point of view of the users, the system can be used for four tasks:

Relevance analysis, probability propagation, profile extraction and construction

of composite indicators.

In the relevance analysis module, which is depicted in Figure 8.5, the user

can choose a target variable and the system returns the list of variables that are

directly related to it, according to the Markov blanket [117]. The process can be

repeated in order to detect the relevant variables at a second level in the Bayesian

network, and so on. Finally, the posterior distribution of the target variable given

the relevant selected ones can be obtained.
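The module itself relies on the Elvira Java classes; conceptually, the set of directly related variables follows from the graph alone, as in the following sketch, where the DAG is assumed to be given as a dictionary mapping each node to its list of parents:

    def markov_blanket(target, parents):
        """Markov blanket of `target` in a DAG given as {node: list of parents}:
        the node's parents, its children and the other parents of its children."""
        children = [n for n, ps in parents.items() if target in ps]
        spouses = {p for child in children for p in parents[child] if p != target}
        return set(parents.get(target, [])) | set(children) | spouses

    # Toy example: A -> C <- B, C -> D
    # markov_blanket("C", {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]})
    # returns {"A", "B", "D"}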

Figure 8.5: The relevance analysis screen of the academic advisor.

In the probability propagation module, the user can obtain the posterior prob-

ability distribution of any target variable given an assignment of values to some

other variables (see Figure 8.6).

The profile extraction module can be seen in Figure 8.7. This module allows the computation of a set of explanations for a given fact. For instance, we can compute


Figure 8.6: The probability propagation screen of the academic advisor.

the best explanation for a given value of the success rate in terms of the number of students per classroom and of the student/teacher ratio. This tool is very useful for descriptive purposes, as it allows determining typical situations under given restrictions. The problem of finding a set of explanations, also known as the MAP problem, is solved using the implementation in Elvira, which corresponds

to the method proposed in [114].

The module for constructing composite indicators is shown in Figure 8.8, and

described in Section 8.5.

8.5 Using the software to construct composite

indicators

Composite indicators [113] are indicators that sum up the information provided

by various indicators of different nature, with the aim of describing, with a single

number, the performance of an institution. The module for constructing com-


Figure 8.7: The profile extraction screen of the academic advisor.

posite indicators implements a novel methodology of supervised nature. It is

supervised in the sense that there must be an expert that ranks different descrip-

tions of the institution in terms of performance. Out of that description, the

software creates a composite indicator that is computed from the values of the

individual indicators for each description. We give the details of this two main

tasks in the next sub-sections.

8.5.1 Generating the rank of descriptions

The first step to construct a composite indicator is to choose a set of individual

indicators X1, . . . , Xk. Then, an expert gives a set of descriptions of the insti-

tution in terms of some observable variables. For each description, the software

computes a list of profiles that explain the given description, and shows them on

the screen. Afterwards, the expert assigns a number between 0 and 1 to each

profile, according to the performance of the institution corresponding to each de-

scription, where a value close to 1 indicates high performance, and a value close


Figure 8.8: Screen for constructing composite indicators.

to 0 means low performance. A screenshot of this procedure can be seen in Fig-

ure 8.8. After this process, the software has stored a database D with variables

X1, . . . , Xk, Y , where Y is the ranking assigned to each description by the expert.

8.5.2 Generating the composite index from the database

Using the database D, described above, we construct the composite indicator

by using a naïve Bayes regression model [111] based on the MTE model (see

Section 3.3 for more details). Variable Y will be the composite indicator (response

variable) and X1, . . . , Xk are the individual indicators (explanatory variables).

Note that Y is constructed from the rankings given by the expert, and therefore

we use the same variable name. As is shown in Equation (3.4), the posterior

distribution of Y given X1, . . . , Xk (more precisely, its expectation) will be used

to obtain a prediction y for Y .
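The model actually used is the MTE-based naïve Bayes regressor of Chapter 3; the sketch below only illustrates the prediction rule it relies on (the posterior expectation of Y under a naive Bayes factorisation), with the densities supplied as hypothetical callables and Y handled numerically on a grid:

    import numpy as np

    def nb_regression_predict(y_grid, prior_pdf, conditionals, x_obs):
        """Predict Y as its posterior mean given the individual indicators.
        y_grid       : grid of candidate values for Y.
        prior_pdf    : prior density of Y evaluated on y_grid (array).
        conditionals : list of callables f_j(x_j, y_grid) giving the density of
                       indicator X_j given each y (naive Bayes assumption).
        x_obs        : observed values of the individual indicators."""
        posterior = np.asarray(prior_pdf, dtype=float).copy()
        for f_j, x_j in zip(conditionals, x_obs):
            posterior *= f_j(x_j, y_grid)
        posterior /= np.trapz(posterior, y_grid)        # normalise numerically
        return np.trapz(y_grid * posterior, y_grid)     # posterior expectation of Y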

After constructing the composite indicator, it is included in the system and

can be computed for every possible combination of values of the individual

indicators.


8.6 Conclusions

In this chapter we have introduced a methodology for relevance analysis of per-

formance indicators in higher education. We have shown through a case study

that this methodology can help the process of decision making when designing

policies oriented to increase the amount of public funds when they are assigned

according to some performance indicators.

The graphical nature of the model used allows drawing conclusions with no

need to interpret any numerical data, since relevance analysis can be carried out

just taking into account the structure of the Bayesian network. If the user is also

interested in quantifying the strength of the dependencies among the variables,

it can be achieved using the conditional probability distributions provided by the

Bayesian network as well.

We have also introduced the academic advisor system, which implements the

proposed methodology making use of the Elvira system, but providing a user-friendly interface appropriate for academic staff not familiar with Bayesian network models. An important novel feature of this software is that it allows composite indicators to be constructed in an easy way, using an expert's opinion.

The fact that the software interacts with the Elvira system makes it easy to update its knowledge base with, for instance, indicator databases corresponding to forthcoming academic years, or any other database. This is especially interesting from the point of view of data privacy, as there is no need for external intervention when updating the system. The importance of preserving data privacy during external

intervention in data mining tasks is analysed in [124].

In the near future, we plan to apply Bayesian technology to construct a recommendation system for students, so that they can choose the appropriate courses in order to maximise their chances of success.

We also plan to improve the module for constructing composite indicators,

by following a semi-supervised approach, in which there is no need to specify the

value for the composite indicator in all the records of the training database.


Part IV

Concluding remarks


Chapter 9

Conclusions and future works

This dissertation is a contribution to the state of the art of simultaneous pro-

cessing of discrete and continuous variables in hybrid Bayesian networks based

on the Mixtures of Truncated Exponentials (MTE) model proposed in [106]. Most of the work has focused on the learning problem, but a contribution on inference has also been proposed in Chapter 6.

Regarding learning, Chapter 3 has addressed the problem of regression using

hybrid Bayesian networks. The main advantage of using Bayesian networks to

solve a regression problem with respect to classical techniques is that it is not

necessary to have evidence for the entire set of independent variables to give a

prediction, since inference can be carried out in this case. Another advantage of

applying Bayesian networks in regression is scalability, i.e. a Bayesian network

can be included within another system acting as input or output for other models

whose aim is not regression.

Several existing BN classifier structures have been applied to solve the regres-

sion problem. Instead of selecting the value of the class variable that maximises

the posterior probability of the class given the observations (such as in classifica-

tion), we have used the posterior distribution of the dependent variable (which

is continuous) given the independent ones to give a prediction. This prediction

has been computed through the mean or the median of the distribution. The

construction of some of these predictors requires the use of the conditional mu-

tual information, which can not be obtained analytically for MTEs. In order to

solve this problem, we have introduced an unbiased estimator of the conditional


mutual information, based on Monte Carlo estimation. The models have been

refined using a variable selection scheme and the results of the experiments have

shown a good behaviour in terms of accuracy in comparison to the very robust

M5’ method.

After the successful implementation of Bayesian networks for regression in Chapter 3, Chapter 4 addresses the same problem for the case of incomplete data. An iterative

procedure for inducing the models is proposed, based on a variation of the data

augmentation method in which the missing values of the explanatory variables

are filled by simulating from their posterior distributions, while the missing val-

ues of the response variable are generated using the conditional expectation of

the response given the explanatory variables. With this way of calculating the

prediction, it has been proved that the error is minimised. Another contribution

of this chapter is the method for improving the accuracy by reducing the bias

in the prediction, which can be incorporated regardless of whether the model is

obtained from complete or incomplete data. The experiments conducted have

shown that the selective versions of the proposed algorithms outperform the ro-

bust M5’ scheme, which is not surprising, as M5’ is mainly designed for continuous

explanatory variables, while MTEs are naturally developed for hybrid domains.

In the same line of parameter learning from missing data, in Chapter 5 we have

proposed an EM-based algorithm for learning the maximum likelihood parameters

of an MTE network when confronted with incomplete data. In this work any

network structure and any underlying probability distribution is permitted, since

transformation rules for approximating distributions by MTEs are proposed to make inference feasible during the E-step of the algorithm. On the other hand, the updating rules for maximising the likelihood in the M-step are performed

over the original distributions of the variables. The results of the experiments

show the expected behaviour for different levels of missing data, although the

method is still not competitive in terms of likelihood.

Regarding inference, in Chapter 6 we have proposed an approximate proba-

bility propagation algorithm in hybrid networks based on the MTE model. The

algorithm is based on the importance sampling technique, which has already been successfully applied to discrete networks. The obtained results represent a consid-

erable advance in terms of error in the calculation of posterior probabilities with


respect to the approximate methods in the hybrid Bayesian networks literature.

Finally, the dissertation ends with two applications of MTE networks to real

problems. On the one hand, we have used Bayesian networks to characterise the habitat of the spur-thighed tortoise, using several continuous environmental variables and one discrete variable representing the presence or absence of the tortoise. This work represents an advance in the field of application, since there are few studies dealing directly with continuous data, even in other areas. Bayesian networks have been widely applied to problems in environmental sciences, but only after discretising the continuous domains in order to apply the learning and inference techniques developed so far. The results of the models show a spatially accurate distribution of the tortoise, and the conclusion is that the proposed continuous models based on MTEs are valid for predictive species distribution modelling.

On the other hand, we have applied Bayesian networks to the relevance analysis of performance indicators in higher education, showing that they are a useful tool to help decision making when elaborating policies based on performance indicators. The methodology has been implemented in a software package that interacts with the Elvira package for graphical models and is available to the administration board at the University of Almería (Spain) through a web interface. The main contribution of the software is that it implements a new method for constructing composite indicators by using a Bayesian network regression model like the ones in Chapter 3.

In what follows, we give some clues about possible future research lines derived

from this dissertation. In Chapters 3 and 4, where Bayesian networks were applied to regression, we could consider more elaborate variable selection schemes to improve the accuracy of the estimations. Other classifier structures, for instance a semi-naïve Bayes model, could also be considered in the study.

Regarding parametric learning in Chapter 5, we plan to investigate further

the behaviour of the proposed algorithm, refining the implementation so that it is able to deal with more complex networks.

In Chapter 6, we expect to continue this research line by developing methods for carrying out more complex reasoning tasks, for instance, finding the most probable explanation of an observed fact in terms of a set of target variables, which is called abductive inference.

With respect to applications, we are still collaborating with researchers in environmental sciences, applying Bayesian networks, and in particular MTEs, to environmental modelling: climate change, water resources, species distribution, etc. Along the same line as the academic advisor in Chapter 8, we also plan to apply Bayesian technology to construct a recommendation system for students, so that they can choose the appropriate courses in order to maximise their chances of success. We also plan to improve the module for constructing composite indicators by following a semi-supervised approach, in which there is no need to specify the value of the composite indicator in all the records of the training database.


Bibliography

[1] P. A. Aguilera, A. Fernandez, R. Fernandez, R. Rumı, and

A. Salmeron. Bayesian networks in environmental modelling. Environ-

mental Modelling & Software (submitted), 2011. 125, 126

[2] P. A. Aguilera, A. Fernandez, F. Reche, and R. Rumı. Hybrid

Bayesian network classifiers: Application to species distribution models.

Environmental Modelling & Software, 25[12]:1630–1639, 2010. 25

[3] P. A. Aguilera, F. Reche, E. Lopez, B. A. Willaarts, A. Castro,

and M. F. Schmitz. Aplicacion de las redes bayesianas a la caracteri-

zacion del habitat de la tortuga mora (testudo graeca graeca) en Andalucıa.

In Proceedings of the I Congreso Nacional de Biodiversidad, 2007. 127, 140,

142, 143

[4] R. P. Anderson, A. T. Peterson, and M. Gomez-Laverde. Using

niche-based GIS modelling to test geographic predictions of competitive

exclusion and competitive release in South American pocket mice. Oikos,

98:3–16, 2002. 126

[5] M. B. Araujo, M. Cabeza, W. Thuiller, L. Hannah, and P. H.

Williams. Would climate change drive species out of reserves?. An assess-

ment of existing reserve-selection models. Global Change Biology, 10:1618–

1626, 2004. 126

[6] M. B. Araujo, W. Thuiller, and R. G. Pearson. Climate warming

and the decline of amphibians and reptiles in Europe. Journal of Biogeog-

raphy, 33:1712–1728, 2006. 126


[7] M. P. Austin. Spatial prediction of species distribution: an interface

between ecological theory and statistical modelling. Ecological Modelling,

157:101–118, 2002. 125

[8] I. A. Beinlich, H. J. Suermondt, R. M. Chavez, and G. F.

Cooper. The ALARM monitoring system: A case study with two proba-

bilistic inference techniques for Belief networks. In Second European Con-

ference on Artificial Intelligence in Medicine, 38, pages 247–256, 1989.

113

[9] D. A. Bell and H. Wang. A formalism for relevance and its application

in feature subset selection. Machine Learning, 41:175–195, 2000. 127

[10] M. Ben-Bassat. Use of distance measures, information measures and

error bounds in features evaluation. HandBook of Statistics, 2:773–791,

1982. 127

[11] C. L. Blake and C. J. Merz. UCI repository of machine learn-

ing databases. http://www.ics.uci.edu/~mlearn/MLRepository.html,

1998. University of California, Irvine, Dept. of Information and Computer

Sciences. 51, 66

[12] M. E. Borsuk, P. Reichert, A. Peter, E. Schager, and

P. Burkhardt-Holm. Assessing the decline of brown trout (Salmo

trutta) in swiss rivers using Bayesian probability network. Ecological Mod-

elling, 192:224–244, 2006. 126

[13] M. E. Borsuk, C. A. Stow, and K. H. Reckhow. A Bayesian network

of eutrophication models for synthesis, prediction, and uncertainty analysis.

Ecological Modelling, 173:219–239, 2004. 126

[14] J. Bromley, N. A. Jackson, O. J. Clymer, A. M. Giacomello,

and F. V. Jensen. The use of HuginR© to develop Bayesian networks

as aid to integrated water resource planning. Environmental Modelling &

Software, 20:231–242, 2005. 126


[15] L. Brotons, W. Thuiller, M. B. Araujo, and A. H. Hirzel.

Presence-absence versus presence only modelling methods for predicting

bird habitat suitability. Ecography, 27:437–448, 2004. 126

[16] M. Burgmann, D. B. Lindenmayer, and J. Elith. Managing land-

scapes for conservation under uncertainty. Ecology, 86:2007–2017, 2005.

126

[17] A. Cano, S. Moral, and A. Salmeron. Penniless propagation in join

trees. International Journal of Intelligent Systems, 15:1027–1059, 2000. 99

[18] A. Cano, S. Moral, and A. Salmeron. Lazy evaluation in Penniless

propagation over join trees. Networks, 39:175–185, 2002. 99

[19] E. Castillo, J. M. Gutierrez, and A. S. Hadi. Expert systems and

probabilistic network models. Springer-Verlag, New York, 1997. 148

[20] K. C. Chang and Z. Tian. Efficient inference for mixed Bayesian net-

works. In Proceedings of the 5th ISIF/IEEE International Conference on

Information Fusion, pages 512–519, 2002. 24

[21] J. Cheng and M. J. Druzdzel. AIS-BN: An adaptive importance sam-

pling algorithm for evidential reasoning in large Bayesian networks. Journal

of Artificial Intelligence Research, 13:155–188, 2000. 99

[22] C. K. Chow and C. N. Liu. Approximating discrete probability distri-

butions with dependence trees. IEEE Transactions on Information Theory,

14:462–467, 1968. 41

[23] E. N. Cinicioglu and P. P. Shenoy. Solving stochastic PERT networks

exactly using hybrid Bayesian networks. In Proceedings of the Seventh

Workshop on Uncertainty Processing (WUPES-06), pages 183–197, 2006.

23

[24] E. N. Cinicioglu and P. P. Shenoy. Arc reversals in hybrid Bayesian

networks with deterministic variables. International Journal of Approxi-

mate Reasoning, 50:763–777, 2009. 22


[25] E. N. Cinicioglu and P. P. Shenoy. Using mixtures of truncated

exponentials for solving stochastic PERT networks. In Proceedings of the

Eighth Workshop on Uncertainty Processing (WUPES-09), pages 269–283,

2009. 23

[26] B. R. Cobb and J. M. Charnes. A graphical method for valuing switch-

ing options. Journal of the Operational Research Society, 61:1596–1606,

2010. 25

[27] B. R. Cobb, R. Rumı, and A. Salmeron. Advances in probabilistic

graphical models, chapter Bayesian networks models with discrete and con-

tinuous variables, pages 81–102. Studies in Fuzziness and Soft Computing.

Springer, 2007. 22

[28] B. R. Cobb, R. Rumı, and A. Salmeron. Predicting stock and portfolio

returns using mixtures of truncated exponentials. ECSQARU’09. Lecture

Notes in Computer Science, 5590:781–792, 2009. 25

[29] B. R. Cobb and P. P. Shenoy. Hybrid Bayesian networks with linear

deterministic variables. In Proceedings of the Proceedings of the Twenty-

First Conference Annual Conference on Uncertainty in Artificial Intelli-

gence (UAI-05), pages 136–144. AUAI Press, 2005. 22

[30] B. R. Cobb and P. P. Shenoy. Nonlinear deterministic relationships in

Bayesian networks. ECSQARU’05. Lecture Notes in Artificial Intelligence,

3571:27–38, 2005. 21, 22

[31] B. R. Cobb and P. P. Shenoy. Inference in hybrid Bayesian networks

with mixtures of truncated exponentials. International Journal of Approx-

imate Reasoning, 41:257–286, 2006. 14, 22, 81

[32] B. R. Cobb and P. P. Shenoy. Operations for inference in continuous

Bayesian networks with linear deterministic variables. International Journal

of Approximate Reasoning, 42:21–36, 2006. 22


[33] B. R. Cobb, P. P. Shenoy, and R. Rumı. Approximating probability

density functions with mixtures of truncated exponentials. Statistics and

Computing, 16:293–308, 2006. 22, 36, 74, 76

[34] Elvira Consortium. Elvira: An Environment for Creating and Using

Probabilistic Graphical Models. In Proceedings of the First European Work-

shop on Probabilistic Graphical Models, pages 222–230, 2002. 51, 65, 66,

96, 128, 129, 148, 150, 157

[35] G. F. Cooper. The computational complexity of probabilistic inference

using Bayesian belief networks. Artificial Intelligence, 42:393–405, 1990. 99

[36] G. F. Cooper and E. Herskovits. A Bayesian method for the induction

of probabilistic networks from data. Machine Learning, 9:309–347, 1992.

150

[37] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, 1999. 14

[38] S. Cuenin. The use of performance indicators in universities: An interna-

tional survey. International Journal of Institutional Management in Higher

Education, 2:117–139, 1987. 148

[39] P. Dagum and M. Luby. Approximating probabilistic inference in

Bayesian belief networks is NP-hard. Artificial Intelligence, 60:141–153,

1993. 99

[40] A. P. Dedecker, P. L. M. Goethals, W. Gabriels, and N. De

Pauw. Optimization of Artificial Neural Network (ANN) model design

for prediction of macroinvertebrates communities in the Zwalm river basin

(Flanders, Belgium). Ecological Modelling, 174:161–173, 2004. 126

[41] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood

from incomplete data via the EM algorithm. Journal of the Royal Statistical

Society B, 39:1–38, 1977. 57, 74

[42] J. Demsar. Statistical comparisons of classifiers over multiple data sets.

Journal of Machine Learning Research, 7:1–30, 2006. 51, 66


[43] F. Dochy, M. Segers, and W. Wijnen. Selecting performance in-

dicators. A proposal as a result of research. In F. Goedegebuure,

F. Maasen, and D. Westerheijden, editors, Peer review and perfor-

mance indicators, pages 135–153. Lemma B.V., 1990. 148

[44] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification.

Wiley Interscience, 2001. 31, 129

[45] S. Dzeroski and D. Drumm. Using regression trees to identify the habi-

tat preference of the sea cucumber (Holothuria leucospilota) on Rarotonga,

Cook Islands. Ecological Modelling, 170:219–226, 2003. 126

[46] J. Elith, C. H. Graham, R. P. Anderson, M. Dudik, S. Fer-

rier, A. Guisan, R. J. Hijmans, F. Huettmann, J. R. Leathwick,

J. Li, L. G. Lohmann, B. A. Loiselle, G. Manion, C. Moritz,

M. Nakamura, Y. Nakazawa, J. McM. Overton, A. T. Peterson,

S. J. Phillips, K. S. Richardson, S. Scachetti-Pereria, R. E.

Schapire, J. Soberon, S. Williams, M. S. Wisz, and N. E. Zim-

mermann. Novel methods to improve prediction of species’ distribution

from occurrence data. Ecography, 29:129–151, 2006. 126

[47] A. Fernandez, I. Flesch, and A. Salmeron. Incremental super-

vised classification for the MTE distribution: a preliminary study. In Actas

del Congreso Nacional de Informatica (CEDI’07), Simposio de Inteligencia

Computacional (SICO’07), pages 217–224, 2007. 22

[48] A. Fernandez, H. Langseth, T. D. Nielsen, and A. Salmeron.

MTE-based parameter learning using incomplete data. Technical report,

Department of Statistics and Applied Mathematics, University of Almerıa,

Spain, 2010. 24

[49] A. Fernandez, H. Langseth, T. D. Nielsen, and A. Salmeron. Pa-

rameter learning in MTE networks using incomplete data. In Proceedings of

the Fifth European Workshop on Probabilistic Graphical Models (PGM’10),

pages 137–145, 2010. 24


[50] A. Fernandez, M. Morales, C. Rodrıguez, and A. Salmeron. A

system for relevance analysis of performance indicators in higher education

using Bayesian networks. Knowledge and Information Systems, In press,

2011. 25

[51] A. Fernandez, M. Morales, and A. Salmeron. Tree augmented

naıve Bayes for regression using mixtures of truncated exponentials: Appli-

cations to higher education management. IDA’07. Lecture Notes in Com-

puter Science, 4723:59–69, 2007. 23, 45, 55, 57, 63, 65, 130

[52] A. Fernandez, J. D. Nielsen, and A. Salmeron. Learning naıve

Bayes regression models with missing data using mixtures of truncated

exponentials. In Proceedings of the Fourth European Workshop on Proba-

bilistic Graphical Models, pages 105–112, 2008. 23

[53] A. Fernandez, J. D. Nielsen, and A. Salmeron. Learning Bayesian

networks for regression from incomplete databases. International Journal

of Uncertainty, Fuzziness and Knowledge Based Systems, 18:69–86, 2010.

23, 74

[54] A. Fernandez, R. Rumı, and A. Salmeron. Answering queries in

hybrid bayesian networks using importance sampling. Decision Support

Systems (submitted), 2011. 24

[55] A. Fernandez and A. Salmeron. Extension of Bayesian network clas-

sifiers to regression problems. IBERAMIA’08. Lecture Notes in Artificial

Intelligence, 5290:83–92, 2008. 23, 55, 63

[56] S. Ferrier. Mapping spatial pattern in biodiversity for regional conserva-

tion planning: where to from here? Systematic Biology, 51:331–363, 2002.

126

[57] E. Frank, L. Trigg, G. Holmes, and I. H. Witten. Technical note:

Naive Bayes for regression. Machine Learning, 41:5–25, 2000. 23, 30, 51

[58] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network

classifiers. Machine Learning, 29:131–163, 1997. 31, 32, 41, 56, 126


[59] N. Friedman and M. Goldszmidt. Discretizing continuous attributes

while learning Bayesian networks. In Proceedings of the 13th International

Conference on Machine Learning (ICML), pages 157–165. Morgan Kauf-

mann Publishers, 1996. 13, 126

[60] J. A. Gamez. Abductive inference in Bayesian networks: A review. In

J. A. Gamez, S. Moral, and A. Salmeron, editors, Advances in

Bayesian Networks, pages 101–120. Springer Verlag, 2004. 122

[61] J. A. Gamez, R. Rumı, and A. Salmeron. Unsupervised naıve Bayes

for data clustering with mixtures of truncated exponentials. In Proceedings

of the 3rd European Workshop on Probabilistic Graphical Models (PGM’06),

pages 123–132, 2006. 22, 57

[62] J. A. Gamez and A. Salmeron. Prediccion del valor genetico en ovejas

de raza manchega usando tecnicas de aprendizaje automatico. In Actas de

las VI Jornadas de Transferencia de Tecnologıa en Inteligencia Artificial,

pages 71–80. Paraninfo, 2005. 23, 30

[63] S. Garcıa and F. Herrera. An extension on “Statistical comparisons

of classifiers over multiple data sets” for all pairwise comparisons. Journal

of Machine Learning Research, 9:2677–2694, 2008. 66

[64] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov

chain Monte Carlo in practice. Chapman and Hall, London, UK, 1996. 24

[65] V. Gogate and R. Dechter. Approximate inference algorithms for

hybrid Bayesian networks with discrete constraints. In Proceedings of the

21st Conference on Uncertainty in Artificial Intelligence (UAI-05), pages

209–216, 2005. 24

[66] C. H. Graham, S. Ferrier, F. Huettman, C. Moritz, and A. T.

Peterson. New developments in museum-based informatics and applica-

tions in biodiversity analysis. Trends in Ecology and Evolution, 19:497–503,

2004. 126


[67] C. H. Graham, S. R. Ron, J. C. Santos, C. J. Schneider, and

C. Moritz. Integrating phylogenetics and environmental niche models to

explore speciation mechanisms in dendrobatid frogs. Evolution, 58:1781–

1793, 2004. 126

[68] A. Guisan, O. Broennimann, R. Engler, M. Vust, N. G. Toccoz,

A. Lehmann, and N. E. Zimmermann. Using niche-based models to

improve the sampling of rare species. Conservation Biology, 20:501–511,

2006. 126

[69] A. Guisan, S. B. Gueiss, and A. D. Gueiss. GLM versus CCA spatial

modeling of plant species distribution. Plant Ecology, 143:107–122, 1999.

126

[70] A. Guisan and W. Thuiller. Predicting species distribution: offering

more than simple habitats models. Ecology Letters, 8:993–1009, 2005. 126

[71] A. Guisan and N. E. Zimmermann. Predictive habitat distribution

models in ecology. Ecological Modelling, 135:147–186, 2000. 125, 126

[72] L. D. Hernandez, S. Moral, and A. Salmeron. A Monte Carlo

algorithm for probabilistic propagation in belief networks based on impor-

tance sampling and stratified simulation techniques. International Journal

of Approximate Reasoning, 18:53–91, 1998. 108, 110

[73] I. Inza, P. Larranaga, R. Etxeberria, and B. Sierra. Feature sub-

selection by Bayesian networks based optimization. Artificial Intelligence,

123:157–184, 2000. 127

[74] IUCN. Red list of threatened species (version 2009.1). 2009. 127

[75] F. V. Jensen. Bayesian networks and decision graphs. Springer, 2001. 9,

30

[76] F. V. Jensen, S. L. Lauritzen, and K. G. Olesen. Bayesian updat-

ing in causal probabilistic networks by local computation. Computational

Statistics Quarterly, 4:269–282, 1990. 7


[77] F. V. Jensen and T. D. Nielsen. Bayesian Networks and Decision

Graphs. Springer, 2007. 7, 9, 126, 148

[78] R. Jin, Y. Breitbart, and C. Muoh. Data discretization unification.

Knowledge and Information Systems, 19:1–29, 2009. 150

[79] R. Kohavi. A study of cross-validation and bootstrap for accuracy estima-

tion and model selection. In Proceedings of Fourteenth International Joint

Conference on Artificial Intelligence, pages 1137–1143. Morgan Kaufmann,

1995. 131

[80] D. Koller, U. Lerner, and D. Anguelov. A general algorithm for

approximate inference and its application to hybrid Bayes nets. In Proceed-

ings of the 15th Conference on Uncertainty in Artificial Intelligence, pages

324–333, 1999. 24

[81] D. Kozlov and D. Koller. Nonuniform dynamic discretization in hy-

brid networks. In Proceedings of the 13th Conference on Uncertainty in

Artificial Intelligence, pages 302–313, 1997. 13, 24

[82] K. Kristensen and I. A. Rasmussen. The use of a Bayesian network in

the design of a decision support system for growing malting barley without

use of pesticides. Computers and Electronics in Agriculture, 33:197–217,

2002. 113

[83] S. Kullback. Information theory and statistics. John Wiley & Son, 1959.

128

[84] S. Kullback and R. A. Leibler. On information and sufficiency. Annals

of Mathematical Statistics, 22:79–86, 1951. 128

[85] H. Langseth, T. D. Nielsen, R. Rumı, and A. Salmeron. Parameter

estimation in mixtures of truncated exponentials. In Proceedings of the

Fourth European Workshop on Probabilistic Graphical Models (PGM’08),

pages 169–176, 2008. 23, 57


[86] H. Langseth, T. D. Nielsen, R. Rumı, and A. Salmeron. Inference

in hybrid Bayesian networks. Reliability Engineering and Systems Safety,

94:1499–1509, 2009. 25

[87] H. Langseth, T. D. Nielsen, R. Rumı, and A. Salmeron. Maxi-

mum likelihood learning of conditional MTE distributions. ECSQARU’09.

Lecture Notes in Artificial Intelligence, 5590:240–251, 2009. 24, 73, 74

[88] H. Langseth, T. D. Nielsen, R. Rumı, and A. Salmeron. Param-

eter estimation and model selection in mixtures of truncated exponentials.

International Journal of Approximate Reasoning, 51:485–498, 2010. 23, 73,

74, 76, 78

[89] P. Larranaga and S. Moral. Probabilistic graphical models in artificial

intelligence. Applied Soft Computing, 11:1511–1528, 2011. 3

[90] S. L. Lauritzen. Propagation of probabilities, means and variances in

mixed graphical association models. Journal of the American Statistical

Association, 87:1098–1108, 1992. 14, 15, 20

[91] S. L. Lauritzen and F. Jensen. Stable local computation with condi-

tional gaussian distributions. Statistics and Computing, 11:191–203, 2001.

14, 15, 20, 21

[92] A. Lehmann, J. McC. Overton, and M. P. Austin. Regression

models for spatial prediction: their role for biodiversity and conservation.

Biodiversity and Conservation, 11:2085–2092, 2002. 125

[93] A. Lehmann, J. McM. Overton, and J. R. Leathwick. Grasp:

generalized regression analysis and spatial prediction. Ecological Modelling,

160:165–183, 2002. 126

[94] U. Lerner, E. Segal, and D. Koller. Exact inference in networks

with discrete children of continuous parents. In Proceedings of the 17th

Conference on Uncertainty in Artificial Intelligence (UAI-01), pages 319–

32, 2001. 20


[95] R. J. A. Little and D. B. Rubin. Statistical analysis with missing data.

John Wiley & Sons, New York, 1987. 95

[96] P. Lucas. Restricted Bayesian network structure learning. In Proceedings

of the 1st European Workshop on Probabilistic Graphical Models (PGM’02),

pages 217–232, 2002. 33, 45

[97] D. Lunn, A. Thomas, N. Best, and D. J. Spiegelhalter. WinBUGS

- A Bayesian modelling framework: Concepts, structure, and extensibility.

Statistics and Computing, 10:325–337, 2000. 95

[98] M. Luoto, J. Poyri, R. K. Heikkinen, and K. Saarinen. Uncer-

tainty of bioclimate envelope models based on the geographical distribution

of species. Global Ecology and Biogeography, 14:575–584, 2005. 126

[99] A. L. Madsen and F. V. Jensen. Lazy propagation: a junction tree in-

ference algorithm based on lazy evaluation. Artificial Intelligence, 113:203–

245, 1999. 12

[100] R. Maggini, A. Lehmann, E. Zimmermann, and A. Guisan. Im-

proving generalized regression analysis for the spatial prediction of forest

communities. Journal of Biogeography, 33:1729–1749, 2009. 126

[101] S. Manel, H. Ceri Williams, and S. J. Ormerod. Evaluating

presence-absence models in ecology: the need to account for prevalence.

Journal of Applied Ecology, 38:921–931, 2001. 126

[102] G. F. Midgley, L. Hannah, D. Millar, W. Thuiller, and

A. Booth. Developing regional and species-level assessments of climate

change impacts on biodiversity in the Cape floristic region. Biological Con-

servation, 112:87–97, 2003. 126

[103] J. Miller and J. Franklin. Modelling distribution of four vegetation

alliances using generalized linear models and classification trees with spatial

dependence. Ecological Modelling, 157:227–247, 2002. 126


[104] D. Mladenic. Feature selection for dimensionality reduction. In Subspace,

Latent Structure and Feature Selection, 3940 of Lecture Notes in Computer

Science, pages 84–102. Springer, 2006. 127

[105] G. G. Moisen and T. S. Frescino. Comparing five modeling tech-

niques for predicting forest characteristics. Ecological Modelling, 157:209–

225, 2002. 126

[106] S. Moral, R. Rumı, and A. Salmeron. Mixtures of truncated ex-

ponentials in hybrid Bayesian networks. ECSQARU’01. Lecture Notes in

Artificial Intelligence, 2143:135–143, 2001. 16, 17, 21, 73, 75, 167

[107] S. Moral, R. Rumı, and A. Salmeron. Estimating mixtures of trun-

cated exponentials from data. In Proceedings of the First European Work-

shop on Probabilistic Graphical Models, pages 156–167, 2002. 21, 129

[108] S. Moral, R. Rumı, and A. Salmeron. Approximating conditional

MTE distributions by means of mixed trees. ECSQARU’03. Lecture Notes

in Artificial Intelligence, 2711:173–183, 2003. 21, 57

[109] S. Moral and A. Salmeron. Dynamic importance sampling in Bayesian

networks based on probability trees. International Journal of Approximate

Reasoning, 38:245–261, 2005. 99

[110] M. Morales, C. Rodrıguez, and A. Salmeron. Selective naıve

Bayes predictor using mixtures of truncated exponentials. In Proceedings

of the International Conference on Mathematical and Statistical Modelling

(ICMSM’06), 2006. 23

[111] M. Morales, C. Rodrıguez, and A. Salmeron. Selective naıve Bayes

for regression using mixtures of truncated exponentials. International Jour-

nal of Uncertainty, Fuzziness and Knowledge Based Systems, 15:697–716,

2007. 23, 30, 36, 37, 39, 40, 43, 51, 55, 57, 63, 64, 148, 149, 162

[112] K. P. Murphy. A variational approximation for Bayesian networks with

discrete and continuous latent variables. In Proceedings of the First Con-


ference on Uncertainty in Artificial Intelligence, pages 467–475, 1999. 20,

95

[113] M. Nardo, M. Saisana, A. Saltelli, and S. Tarantola. Handbook

on constructing composite indicators: Methodology and user guide. OECD,

European Commission, Joint Research Centre, 2008. 148, 160

[114] D. Nilsson. An efficient algorithm for finding the M most probable config-

urations in Bayesian networks. Statistics and Computing, 9:159–173, 1998.

160

[115] O. Chapelle, B. Schölkopf, and A. Zien. Semi-supervised learning. MIT Press, 2006. 57

[116] S. M. Olmsted. On representing and solving decision problems. PhD

thesis, Stanford University, 1983. 21

[117] J. Pearl. Probabilistic reasoning in intelligent systems. Morgan-

Kaufmann (San Mateo), 1988. 7, 30, 159

[118] R. G. Pearson, W. Thuiller, M. B. Araujo, E. Martınez-Meyer,

L. Brotons, C. McClean, L. Miles, P. Segurado, T. P. Dawson,

and D. C. Lees. Model–based uncertainty in species range prediction.

Journal of Biogeography, 33:1704–1711, 2006. 126

[119] A. Perez, P. Larranaga, and I. Inza. Supervised classification with

conditional Gaussian networks: Increasing the structure complexity from

naıve Bayes. International Journal of Approximate Reasoning, 43:1–25,

2006. 40, 42

[120] A. T. Peterson. Predicting the geography of species’ invasions via ecolog-

ical niche modelling. The Quarterly Review of Biology, 78:419–433, 2003.

126

[121] A. T. Peterson, M. A. Ortega-Huerta, J. J. Bartley,

V. Sanchez-Cordero, J. Soberon, R. H. Buddmeier, and D. R. B.

Stockwell. Future projections for Mexican fauna under global climate

change scenarios. Nature, 416:626–629, 2002. 126


[122] J. M. Pleguezuelos, R. Marquez, and M. Lizana, editors. Atlas

y libro rojo de los anfibios y reptiles de Espana. Direccion General de la

Conservacion de la Naturaleza–Asociacion Herpetologica Espanola, Madrid,

second edition, 2002. In Spanish. 127

[123] C. A. Pollino, A. K. White, and B. T. Hart. Examination of

conflicts and improved strategies for the management of an endangered

eucalyp species using Bayesian networks. Ecological Modelling, 201:37–59,

2007. 126

[124] L. Qiu, Y. Li, and X. Wu. Protecting business intelligence and customer

privacy while outsourcing data mining tasks. Knowledge and Information

Systems, 17:99–120, 2008. 163

[125] J. R. Quinlan. Learning with continuous classes. In Proceedings of the

5th Australian Joint Conference on Artificial Intelligence, pages 343–348,

Singapore, 1992. 51

[126] F. T. Ramos and F. G. Cozman. Anytime anyspace probabilistic in-

ference. International Journal of Approximate Reasoning, 38:53–80, 2005.

100

[127] V. Romero, R. Rumı, and A. Salmeron. Structural learning of

Bayesian networks with mixtures of truncated exponentials. In Proceed-

ings of the 2nd European Workshop on Probabilistic Graphical Models

(PGM’04), pages 177–184, Leiden, The Netherlands, 2004. 21

[128] V. Romero, R. Rumı, and A. Salmeron. Learning hybrid Bayesian

networks using mixtures of truncated exponentials. International Journal

of Approximate Reasoning, 42:54–68, 2006. 21, 23, 36, 73, 74, 129, 130

[129] R. Y. Rubinstein. Simulation and the Monte Carlo Method. Wiley (New

York), 1981. 102

[130] R. Ruiz, J. Riquelme, and J. S. Aguilar-Ruiz. Incremental wrapper-

based gene selection from microarray data for cancer classification. Pattern

Recognition, 39:2383–2392, 2006. 39


[131] R. Rumı. Modelos de redes bayesianas con variables discretas y continuas.

PhD thesis, Universidad de Almerıa, 2003. 129, 130

[132] R. Rumı. Kernel methods in Bayesian networks. In Proceedings of the

1st International Mediterranean Congress of Mathematics, pages 135–149,

2005. 57

[133] R. Rumı and A. Salmeron. Penniless propagation with mixtures of

truncated exponentials. ECSQARU’05. Lecture Notes in Computer Science,

3571:39–50, 2005. 21

[134] R. Rumı and A. Salmeron. Approximate probability propagation with

mixtures of truncated exponentials. International Journal of Approximate

Reasoning, 45:191–210, 2007. 21, 43, 99, 100, 112, 113, 114, 115

[135] R. Rumı, A. Salmeron, and S. Moral. Estimating mixtures of trun-

cated exponentials in hybrid Bayesian networks. Test, 15:397–421, 2006.

21, 23, 36, 57, 63, 73, 74, 129, 130

[136] M. Sahami. Learning limited dependence Bayesian classifiers. In Second

International Conference on Knowledge Discovery in Databases, pages 335–

338, 1996. 33, 46

[137] A. Salmeron, A. Cano, and S. Moral. Importance sampling in

Bayesian networks using probability trees. Computational Statistics and

Data Analysis, 34:387–413, 2000. 99, 107, 108

[138] G. Schwarz. Estimating the dimension of a model. The Annals of Statis-

tics, 6:461–464, 1978. 74, 78

[139] P. Segurado and M. B. Araujo. An evaluation of methods for mod-

elling species distribution. Journal of Biogeography, 31:1555–1568, 2004.

125

[140] R. D. Shachter. Evaluating influence diagrams. Operations Research,

34:871–882, 1986. 21


[141] P. P. Shenoy. Inference in hybrid Bayesian networks using mixtures

of Gaussians. In Proceedings of the 22nd Conference on Uncertainty in

Artificial Intelligence (UAI-06), pages 428–436, 2006. 21

[142] P. P. Shenoy. Some issues in using mixtures of polynomials for inference

in hybrid Bayesian networks. Working Paper, No. 323, School of Business,

University of Kansas, October 2010. 24

[143] P. P. Shenoy and G. Shafer. Axioms for probability and belief function

propagation. In Uncertainty in Artificial Intelligence 4, pages 169–198,

1990. 7, 12

[144] P. P. Shenoy and J. West. Inference in hybrid Bayesian networks with

deterministic variables. ECSQARU’09. Lecture Notes in Computer Science,

5590:46–58, 2009. 22

[145] P. P. Shenoy and J. C. West. Mixtures of polynomials in hybrid

Bayesian networks with deterministic variables. In Proceedings of the 8th

Workshop on Uncertainty Processing (WUPES’09), pages 202–212, 2009.

19, 24

[146] P. P. Shenoy and J. C. West. Inference in hybrid Bayesian networks

using mixtures of polynomials. International Journal of Approximate Rea-

soning, In Press, 2010. 24

[147] C. S. Smith, A. L. Howes, B. Price, and C. A. McAlpine. Using

Bayesian belief network to predict suitable habitat of an endangered mam-

mal – the Julia Creek dunnart (Sminthopsis douglasi). Biological Conser-

vation, 139:333–347, 2007. 126

[148] P. Spirtes, C. Glymour, and R. Scheines. Causation, prediction and

search, 81 of Lecture Notes in Statistics. Springer Verlag, 1993. 150

[149] StatLib. http://www.statlib.org, 1999. Department of Statistics.

Carnegie Mellon University. 51, 66


[150] M. Stone. Cross-validatory choice and assessment of statistical predic-

tions. Journal of the Royal Statistical Society. Series B (Methodological),

36:111–147, 1974. 51, 66, 130

[151] M. A. Tanner and W. H Wong. The calculation of posterior distri-

butions by data augmentation (with discussion). Journal of the American

Statistical Association, 82:528–550, 1987. 56, 58, 59

[152] W. Thuiller. Patterns and uncertainties of species’ range shifts under

climate change. Global Change Biology, 10:2020–2027, 2004. 126

[153] D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical

Analysis of Finite Mixture Distributions. John Wiley, New York, 1985. 21

[154] L. Uusitalo. Advantages and challenges of Bayesian networks in environ-

mental modelling. Ecological Modelling, 203:312–318, 2007. 126

[155] Y. Wang and I. H. Witten. Induction of model trees for predicting

continuous cases. In Proceedings of the Poster Papers of the European

Conference on Machine Learning, pages 128–137, 1997. 51, 66

[156] Z. Wang, Q. Wang, and D. Wang. Bayesian network based business

information retrieval model. Knowledge and Information Systems, 20:63–

79, 2009. 148

[157] B. A. Wintle, J. Elith, and J. M. Potts. Fauna habitat modelling

and mapping: a review and case study in the Lower Hunter Central Coast

region of NSW. Austral Ecology, 30:719–738, 2005. 126

[158] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning

Tools and Techniques (Second Edition). Morgan Kaufmann, 2005. 22, 39,

51, 66

[159] X. Wu, V. Kumar, J. R. Quinlan, J. Gosh, Q. Yang, H. Motoda,

G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z. Zhou, M. Steinbach,

D. J. Hand, and D. Steinberg. Top 10 algorithms in data mining.

Knowledge and Information Systems, 14:1–37, 2008. 151


[160] M. Zaffalon. Credible classification for environmental problems. Envi-

ronmental Modelling & Software, 20:1003–1012, 2005. 126


Appendix A

Notation and mathematical derivations

Vector notation for the Gaussian

$$
\mathbf{Z} = \begin{bmatrix} Z_1 \\ \vdots \\ Z_d \end{bmatrix}
\Rightarrow \mathbf{Z}^{\mathsf{T}} = [Z_1, \ldots, Z_d]
\qquad
\mathbf{z} = \begin{bmatrix} z_1 \\ \vdots \\ z_d \end{bmatrix}
\Rightarrow \mathbf{z}^{\mathsf{T}} = [z_1, \ldots, z_d]
$$

$$
\mathbf{Y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_c \end{bmatrix}
\Rightarrow \mathbf{Y}^{\mathsf{T}} = [Y_1, \ldots, Y_c]
\qquad
\mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_c \end{bmatrix}
\Rightarrow \mathbf{y}^{\mathsf{T}} = [y_1, \ldots, y_c]
$$

$$
\mathbf{l}_{z,j} = \begin{bmatrix} l^{(1)}_{z,j} \\ \vdots \\ l^{(c)}_{z,j} \end{bmatrix}
\Rightarrow \mathbf{l}^{\mathsf{T}}_{z,j} = \bigl[ l^{(1)}_{z,j}, \ldots, l^{(c)}_{z,j} \bigr]
$$

$$
f(x_j \mid z, \mathbf{y}) \sim \mathcal{N}\bigl(\mu = \mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{y} + \eta_{z,j},\ \sigma^2_{z,j}\bigr)
= \mathcal{N}\left(\mu = \bigl[ l^{(1)}_{z,j}, \ldots, l^{(c)}_{z,j} \bigr]
\begin{bmatrix} y_1 \\ \vdots \\ y_c \end{bmatrix} + \eta_{z,j},\ \sigma^2_{z,j}\right)
$$

To ease notation,

$$
\mathbf{l}_{z,j} = \bigl[\mathbf{l}^{\mathsf{T}}_{z,j}, \eta_{z,j}\bigr]^{\mathsf{T}}
= \bigl[ l^{(1)}_{z,j}, \ldots, l^{(c)}_{z,j}, \eta_{z,j} \bigr]^{\mathsf{T}}
\qquad
\mathbf{y} = [\mathbf{y}^{\mathsf{T}}, 1]^{\mathsf{T}} = [y_1, \ldots, y_c, 1]^{\mathsf{T}}
$$

$$
\mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{y} + \eta_{z,j} = \mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{y}
\;\Rightarrow\;
\bigl[ l^{(1)}_{z,j}, \ldots, l^{(c)}_{z,j} \bigr]
\begin{bmatrix} y_1 \\ \vdots \\ y_c \end{bmatrix} + \eta_{z,j}
= \bigl[ l^{(1)}_{z,j}, \ldots, l^{(c)}_{z,j}, \eta_{z,j} \bigr]
\begin{bmatrix} y_1 \\ \vdots \\ y_c \\ 1 \end{bmatrix}
$$

So,

$$
f(x_j \mid z, \mathbf{y}) \sim \mathcal{N}\bigl(\mu = \mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{y},\ \sigma^2_{z,j}\bigr).
$$
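As a quick numerical illustration of the augmented notation above (the values are arbitrary), appending the constant 1 to $\mathbf{y}$ and the intercept $\eta_{z,j}$ to $\mathbf{l}_{z,j}$ reproduces $\mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{y} + \eta_{z,j}$ as a single inner product:

    import numpy as np

    l = np.array([0.5, -1.2, 2.0])   # l_{z,j} without the intercept
    eta = 0.7                        # eta_{z,j}
    y = np.array([1.0, 3.0, -0.5])

    l_aug = np.append(l, eta)        # [l^(1), ..., l^(c), eta_{z,j}]
    y_aug = np.append(y, 1.0)        # [y_1, ..., y_c, 1]

    assert np.isclose(l @ y + eta, l_aug @ y_aug)   # both give the Gaussian mean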


Vector notation for the logistic

$$
\mathbf{w}_{z,j} = \begin{bmatrix} w^{(1)}_{z,j} \\ \vdots \\ w^{(c)}_{z,j} \end{bmatrix}
\Rightarrow \mathbf{w}^{\mathsf{T}}_{z,j} = \bigl[ w^{(1)}_{z,j}, \ldots, w^{(c)}_{z,j} \bigr]
$$

$$
\sigma_{z,j}(\mathbf{y}) = \frac{1}{1 + \exp\bigl\{\mathbf{w}^{\mathsf{T}}_{z,j}\mathbf{y} + b_{z,j}\bigr\}}
= \frac{1}{1 + \exp\left\{ \bigl[ w^{(1)}_{z,j}, \ldots, w^{(c)}_{z,j} \bigr]
\begin{bmatrix} y_1 \\ \vdots \\ y_c \end{bmatrix} + b_{z,j} \right\}}
$$

$$
\mathbf{w}_{z,j} = \bigl[\mathbf{w}^{\mathsf{T}}_{z,j}, b_{z,j}\bigr]^{\mathsf{T}}
= \bigl[ w^{(1)}_{z,j}, \ldots, w^{(c)}_{z,j}, b_{z,j} \bigr]^{\mathsf{T}}
\Rightarrow
\mathbf{w}^{\mathsf{T}}_{z,j} = \bigl[ w^{(1)}_{z,j}, \ldots, w^{(c)}_{z,j}, b_{z,j} \bigr]
$$

$$
\mathbf{w}^{\mathsf{T}}_{z,j}\mathbf{y} + b_{z,j} = \mathbf{w}^{\mathsf{T}}_{z,j}\mathbf{y}
\;\Rightarrow\;
\bigl[ w^{(1)}_{z,j}, \ldots, w^{(c)}_{z,j} \bigr]
\begin{bmatrix} y_1 \\ \vdots \\ y_c \end{bmatrix} + b_{z,j}
= \bigl[ w^{(1)}_{z,j}, \ldots, w^{(c)}_{z,j}, b_{z,j} \bigr]
\begin{bmatrix} y_1 \\ \vdots \\ y_c \\ 1 \end{bmatrix}
$$

$$
\sigma_{z,j}(\mathbf{y}) = \frac{1}{1 + \exp\bigl\{\mathbf{w}^{\mathsf{T}}_{z,j}\mathbf{y}\bigr\}}
$$


Vector notation for the expectations

$$
E(X_j\mathbf{Y} \mid d_i)
= E\left( X_j \begin{bmatrix} Y_1 \\ \vdots \\ Y_c \\ 1 \end{bmatrix} \;\middle|\; d_i \right)
= \begin{bmatrix} E(X_jY_1 \mid d_i) \\ \vdots \\ E(X_jY_c \mid d_i) \\ E(X_j \mid d_i) \end{bmatrix}
$$

$$
E(\mathbf{Y}\mathbf{Y}^{\mathsf{T}} \mid d_i)
= E\left( \begin{bmatrix} Y_1 \\ \vdots \\ Y_c \\ 1 \end{bmatrix} [Y_1, \ldots, Y_c, 1] \;\middle|\; d_i \right)
= \begin{bmatrix}
E(Y_1^2 \mid d_i) & \cdots & E(Y_1Y_c \mid d_i) & E(Y_1 \mid d_i) \\
\vdots & \ddots & \vdots & \vdots \\
E(Y_cY_1 \mid d_i) & \cdots & E(Y_c^2 \mid d_i) & E(Y_c \mid d_i) \\
E(Y_1 \mid d_i) & \cdots & E(Y_c \mid d_i) & 1
\end{bmatrix}
$$

$$
2\,\mathbf{l}^{\mathsf{T}}_{z,j} E(X_j\mathbf{Y} \mid d_i)
= 2\,\bigl[ l^{(1)}_{z,j}, \ldots, l^{(c)}_{z,j}, \eta_{z,j} \bigr]
\begin{bmatrix} E(X_jY_1 \mid d_i) \\ \vdots \\ E(X_jY_c \mid d_i) \\ E(X_j \mid d_i) \end{bmatrix}
$$

$$
\mathbf{l}^{\mathsf{T}}_{z,j}\, E(\mathbf{Y}\mathbf{Y}^{\mathsf{T}} \mid d_i)\, \mathbf{l}_{z,j}
= \bigl[ l^{(1)}_{z,j}, \ldots, l^{(c)}_{z,j}, \eta_{z,j} \bigr]
\begin{bmatrix}
E(Y_1^2 \mid d_i) & \cdots & E(Y_1Y_c \mid d_i) & E(Y_1 \mid d_i) \\
\vdots & \ddots & \vdots & \vdots \\
E(Y_cY_1 \mid d_i) & \cdots & E(Y_c^2 \mid d_i) & E(Y_c \mid d_i) \\
E(Y_1 \mid d_i) & \cdots & E(Y_c \mid d_i) & 1
\end{bmatrix}
\begin{bmatrix} l^{(1)}_{z,j} \\ \vdots \\ l^{(c)}_{z,j} \\ \eta_{z,j} \end{bmatrix}
$$
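These expectations are obtained by inference in the network given each data case $d_i$. As a purely illustrative stand-in, the following Python sketch estimates both quantities by Monte Carlo, assuming that samples of $(X_j, \mathbf{Y})$ given $d_i$ can be drawn; the sampler below is a hypothetical toy model with $c = 2$ continuous parents, not part of the thesis:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_given_case(n):
        # Hypothetical stand-in for drawing (x_j, y) given the i-th case d_i.
        y = rng.normal(size=(n, 2))
        x = 1.0 + y @ np.array([0.5, -0.3]) + rng.normal(scale=0.1, size=n)
        return x, y

    n = 50000
    x, y = sample_given_case(n)
    y_aug = np.hstack([y, np.ones((n, 1))])                         # append the constant 1

    E_xY = (x[:, None] * y_aug).mean(axis=0)                        # estimates E(X_j Y | d_i)
    E_YYt = (y_aug[:, :, None] * y_aug[:, None, :]).mean(axis=0)    # estimates E(Y Y^T | d_i)

    print(E_xY)    # last entry approximates E(X_j | d_i)
    print(E_YYt)   # bottom-right entry equals 1 by construction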


Mathematical derivations

All the relevant mathematical derivations involved in the calculation of the updating rules of the M-step, as well as in obtaining the sufficient statistics of the E-step, are shown next:

$$
\begin{aligned}
\frac{\partial Q}{\partial \mu_j}
&= \sum_{i=1}^N \frac{\partial}{\partial \mu_j} E\bigl[\log f(X_j) \mid d_i\bigr]
= \sum_{i=1}^N E\left[ \frac{\partial}{\partial \mu_j} \log \exp\left\{ -\frac{1}{2}\left(\frac{X_j - \mu_j}{\sigma_j}\right)^2 \right\} \;\middle|\; d_i \right] \\
&= \sum_{i=1}^N E\left[ \frac{1}{2\sigma_j^2}\, 2\,(X_j - \mu_j) \;\middle|\; d_i \right]
= \frac{1}{\sigma_j^2}\left[ \sum_{i=1}^N E\bigl[(X_j - \mu_j) \mid d_i\bigr] \right] \\
&= \frac{1}{\sigma_j^2}\left[ \sum_{i=1}^N \bigl( E[X_j \mid d_i] - \mu_j \bigr) \right]
= \frac{1}{\sigma_j^2}\left( \left[\sum_{i=1}^N E[X_j \mid d_i]\right] - N\mu_j \right)
\end{aligned}
\tag{A.1}
$$
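For reference, setting (A.1) equal to zero gives the stationary point

$$
\hat{\mu}_j = \frac{1}{N}\sum_{i=1}^N E[X_j \mid d_i].
$$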


$$
\begin{aligned}
\frac{\partial Q}{\partial \mathbf{l}_{z,j}}
&= \sum_{i=1}^N \frac{\partial}{\partial \mathbf{l}_{z,j}} E\bigl[\log f(X_j \mid Z, \mathbf{Y}) \mid d_i\bigr] \\
&= \sum_{i=1}^N \frac{\partial}{\partial \mathbf{l}_{z,j}} \sum_{z} \int_{\mathbf{y}} \int_{x_j}
P(z, \mathbf{y}, x_j \mid d_i)\, \log f(x_j \mid z, \mathbf{y})\, d\mathbf{y}\, dx_j \\
&= \sum_{i=1}^N \frac{\partial}{\partial \mathbf{l}_{z,j}} \int_{\mathbf{y}} \int_{x_j}
f(z, \mathbf{y}, x_j \mid d_i)\, \log f(x_j \mid z, \mathbf{y})\, d\mathbf{y}\, dx_j \\
&= \sum_{i=1}^N \frac{\partial}{\partial \mathbf{l}_{z,j}} \int_{\mathbf{y}} \int_{x_j}
f(\mathbf{y}, x_j \mid d_i, z)\, f(z \mid d_i)\, \log f(x_j \mid z, \mathbf{y})\, d\mathbf{y}\, dx_j \\
&= \sum_{i=1}^N f(z \mid d_i)\, \frac{\partial}{\partial \mathbf{l}_{z,j}}
E\bigl[\log f(X_j \mid z, \mathbf{Y}) \mid d_i, z\bigr] \\
&= \sum_{i=1}^N f(z \mid d_i)\, E\left[ \frac{\partial}{\partial \mathbf{l}_{z,j}}
\log \exp\left\{ -\frac{1}{2}\left( \frac{X_j - \mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{Y}}{\sigma_{z,j}} \right)^2 \right\}
\;\middle|\; d_i, z \right] \\
&= \sum_{i=1}^N f(z \mid d_i)\, E\left[ -\frac{1}{2} \frac{\partial}{\partial \mathbf{l}_{z,j}}
\left( \frac{X_j - \mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{Y}}{\sigma_{z,j}} \right)^2 \;\middle|\; d_i, z \right] \\
&= \sum_{i=1}^N f(z \mid d_i)\, E\left[ \frac{1}{2\sigma_{z,j}^2}\,
2\bigl(X_j - \mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{Y}\bigr)\mathbf{Y}^{\mathsf{T}} \;\middle|\; d_i, z \right] \\
&= \frac{1}{\sigma_{z,j}^2}\left[ \sum_{i=1}^N f(z \mid d_i)\,
E\bigl[\bigl(X_j\mathbf{Y}^{\mathsf{T}} - \mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{Y}\mathbf{Y}^{\mathsf{T}}\bigr) \mid d_i, z\bigr] \right] \\
&= \frac{1}{\sigma_{z,j}^2}\left[ \sum_{i=1}^N f(z \mid d_i)\, E(X_j\mathbf{Y}^{\mathsf{T}} \mid d_i, z)
- \mathbf{l}^{\mathsf{T}}_{z,j} \sum_{i=1}^N f(z \mid d_i)\, E(\mathbf{Y}\mathbf{Y}^{\mathsf{T}} \mid d_i, z) \right]
\end{aligned}
\tag{A.2}
$$
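Setting (A.2) equal to zero and transposing, the matrix $\sum_{i=1}^N f(z \mid d_i)\, E(\mathbf{Y}\mathbf{Y}^{\mathsf{T}} \mid d_i, z)$ being symmetric, gives the linear system whose solution is the stationary point

$$
\left( \sum_{i=1}^N f(z \mid d_i)\, E(\mathbf{Y}\mathbf{Y}^{\mathsf{T}} \mid d_i, z) \right) \hat{\mathbf{l}}_{z,j}
= \sum_{i=1}^N f(z \mid d_i)\, E(X_j\mathbf{Y} \mid d_i, z).
$$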


$$
\begin{aligned}
\frac{\partial Q}{\partial \sigma_j}
&= \sum_{i=1}^N \frac{\partial}{\partial \sigma_j} E\bigl[\log f(X_j) \mid d_i\bigr]
= \sum_{i=1}^N E\left[ \frac{\partial}{\partial \sigma_j} \log\left( \frac{1}{\sigma_j\sqrt{2\pi}}
\exp\left\{ -\frac{1}{2}\left(\frac{X_j - \mu_j}{\sigma_j}\right)^2 \right\} \right) \;\middle|\; d_i \right] \\
&= \sum_{i=1}^N E\left[ \left( \frac{\partial}{\partial \sigma_j} \log\frac{1}{\sigma_j\sqrt{2\pi}}
- \frac{1}{2}\frac{\partial}{\partial \sigma_j}\left(\frac{X_j - \mu_j}{\sigma_j}\right)^2 \right) \;\middle|\; d_i \right] \\
&= \sum_{i=1}^N E\left[ \left( -\frac{\partial}{\partial \sigma_j}\log\sigma_j
- \frac{1}{2}(X_j - \mu_j)^2 \frac{\partial}{\partial \sigma_j}\frac{1}{\sigma_j^2} \right) \;\middle|\; d_i \right]
= \sum_{i=1}^N E\left[ \left( -\frac{1}{\sigma_j} - \frac{1}{2}(X_j - \mu_j)^2 \frac{-2}{\sigma_j^3} \right) \;\middle|\; d_i \right] \\
&= \sum_{i=1}^N E\left[ \left( \frac{(X_j - \mu_j)^2}{\sigma_j^3} - \frac{1}{\sigma_j} \right) \;\middle|\; d_i \right]
= \sum_{i=1}^N \left( E\left[ \frac{(X_j - \mu_j)^2}{\sigma_j^3} \;\middle|\; d_i \right] - \frac{1}{\sigma_j} \right) \\
&= \frac{1}{\sigma_j^3} \sum_{i=1}^N E\bigl[(X_j - \mu_j)^2 \mid d_i\bigr] - \frac{N}{\sigma_j}
= \frac{1}{\sigma_j^3}\left( \sum_{i=1}^N E[X_j^2 \mid d_i] + N\mu_j^2 - 2\mu_j \sum_{i=1}^N E[X_j \mid d_i] \right) - \frac{N}{\sigma_j}
\end{aligned}
\tag{A.3}
$$
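Setting (A.3) equal to zero yields the stationary point

$$
\hat{\sigma}_j^2 = \frac{1}{N}\left( \sum_{i=1}^N E[X_j^2 \mid d_i] + N\mu_j^2 - 2\mu_j \sum_{i=1}^N E[X_j \mid d_i] \right).
$$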


$$
\begin{aligned}
\frac{\partial Q}{\partial \sigma_{z,j}}
&= \sum_{i=1}^N f(z \mid d_i)\, \frac{\partial}{\partial \sigma_{z,j}}
E\bigl[\log f(X_j \mid z, \mathbf{Y}) \mid d_i, z\bigr] \\
&= \sum_{i=1}^N f(z \mid d_i)\, \frac{\partial}{\partial \sigma_{z,j}}
E\left[ \log\left( \frac{1}{\sigma_{z,j}\sqrt{2\pi}}
\exp\left\{ -\frac{1}{2}\left( \frac{X_j - \mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{Y}}{\sigma_{z,j}} \right)^2 \right\} \right) \;\middle|\; d_i, z \right] \\
&= \sum_{i=1}^N f(z \mid d_i)\, E\left[ -\frac{1}{2}\frac{\partial}{\partial \sigma_{z,j}}
\left( \frac{X_j - \mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{Y}}{\sigma_{z,j}} \right)^2
- \frac{\partial}{\partial \sigma_{z,j}} \log\bigl(\sigma_{z,j}\sqrt{2\pi}\bigr) \;\middle|\; d_i, z \right] \\
&= \sum_{i=1}^N f(z \mid d_i)\, E\left[ \frac{(X_j - \mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{Y})^2}{\sigma_{z,j}^3}
- \frac{1}{\sigma_{z,j}} \;\middle|\; d_i, z \right]
= \sum_{i=1}^N f(z \mid d_i)\, E\left[ \frac{(X_j - \mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{Y})^2 - \sigma_{z,j}^2}{\sigma_{z,j}^3} \;\middle|\; d_i, z \right] \\
&= \frac{1}{\sigma_{z,j}^3} \sum_{i=1}^N f(z \mid d_i)\,
E\bigl[ (X_j - \mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{Y})^2 - \sigma_{z,j}^2 \mid d_i, z \bigr] \\
&= \frac{1}{\sigma_{z,j}^3}\left[ \sum_{i=1}^N f(z \mid d_i)\,
E\bigl[ (X_j - \mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{Y})^2 \mid d_i, z \bigr]
- \sigma_{z,j}^2 \sum_{i=1}^N f(z \mid d_i) \right]
\end{aligned}
\tag{A.4}
$$
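Setting (A.4) equal to zero yields the stationary point

$$
\hat{\sigma}_{z,j}^2 = \frac{\sum_{i=1}^N f(z \mid d_i)\, E\bigl[(X_j - \mathbf{l}^{\mathsf{T}}_{z,j}\mathbf{Y})^2 \mid d_i, z\bigr]}{\sum_{i=1}^N f(z \mid d_i)}.
$$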

$$
\begin{aligned}
E(X_j \mid d_i)
&= \int_{x_a}^{x_b} x_j\, f(x_j \mid d_i)\, dx_j
= \int_{x_a}^{x_b} x_j \left( a_0 + \sum_{j=1}^m a_j \exp\{b_j x_j\} \right) dx_j \\
&= a_0 \int_{x_a}^{x_b} x_j\, dx_j + \sum_{j=1}^m a_j \int_{x_a}^{x_b} x_j \exp\{b_j x_j\}\, dx_j \\
&= \frac{a_0}{2}\bigl(x_b^2 - x_a^2\bigr)
+ \sum_{j=1}^m \frac{a_j}{b_j^2} \Bigl( e^{b_j x_b}(b_j x_b - 1) - e^{b_j x_a}(b_j x_a - 1) \Bigr)
\end{aligned}
\tag{A.5}
$$


$$
\begin{aligned}
E(X_j^2 \mid d_i)
&= \int_{x_a}^{x_b} x_j^2\, f(x_j \mid d_i)\, dx_j
= \int_{x_a}^{x_b} x_j^2 \left( a_0 + \sum_{j=1}^m a_j \exp\{b_j x_j\} \right) dx_j \\
&= a_0 \int_{x_a}^{x_b} x_j^2\, dx_j + \sum_{j=1}^m a_j \int_{x_a}^{x_b} x_j^2 \exp\{b_j x_j\}\, dx_j \\
&= \frac{a_0}{3}\bigl(x_b^3 - x_a^3\bigr)
+ \sum_{j=1}^m \frac{a_j}{b_j^3} \Bigl( e^{b_j x_b}\bigl(b_j x_b (b_j x_b - 2) + 2\bigr)
- e^{b_j x_a}\bigl(b_j x_a (b_j x_a - 2) + 2\bigr) \Bigr)
\end{aligned}
\tag{A.6}
$$
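The closed forms (A.5) and (A.6) can be checked numerically against one-dimensional quadrature for an arbitrary, not necessarily normalised, MTE potential; the coefficients in the following Python sketch are illustrative only:

    import numpy as np
    from scipy.integrate import quad

    xa, xb = 0.0, 2.0
    a0, a, b = 0.3, [1.5, -0.4], [0.8, -1.1]   # illustrative MTE coefficients

    def f(x):
        return a0 + sum(aj * np.exp(bj * x) for aj, bj in zip(a, b))

    # Closed form (A.5)
    m1 = a0 / 2 * (xb**2 - xa**2) + sum(
        aj / bj**2 * (np.exp(bj * xb) * (bj * xb - 1) - np.exp(bj * xa) * (bj * xa - 1))
        for aj, bj in zip(a, b))

    # Closed form (A.6)
    m2 = a0 / 3 * (xb**3 - xa**3) + sum(
        aj / bj**3 * (np.exp(bj * xb) * (bj * xb * (bj * xb - 2) + 2)
                      - np.exp(bj * xa) * (bj * xa * (bj * xa - 2) + 2))
        for aj, bj in zip(a, b))

    assert np.isclose(m1, quad(lambda x: x * f(x), xa, xb)[0])
    assert np.isclose(m2, quad(lambda x: x * x * f(x), xa, xb)[0])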

$$
\begin{aligned}
E(XY \mid d_i)
&= \int_{x_a}^{x_b} \int_{y_a}^{y_b} xy\, f(x, y \mid d_i)\, dy\, dx
= \int_{x_a}^{x_b} \int_{y_a}^{y_b} xy \left( a_0 + \sum_{j=1}^m a_j \exp\{b_j y + c_j x\} \right) dy\, dx \\
&= a_0 \int_{x_a}^{x_b} \int_{y_a}^{y_b} xy\, dy\, dx
+ \sum_{j=1}^m a_j \int_{x_a}^{x_b} x \exp\{c_j x\}\, dx \int_{y_a}^{y_b} y \exp\{b_j y\}\, dy \\
&= \frac{a_0}{4}\bigl(y_b^2 - y_a^2\bigr)\bigl(x_b^2 - x_a^2\bigr)
+ \sum_{j=1}^m \frac{a_j}{c_j^2 b_j^2}
\Bigl( -e^{b_j y_a} + b_j y_a e^{b_j y_a} + e^{b_j y_b} - b_j y_b e^{b_j y_b} \Bigr)
\Bigl( -e^{c_j x_a} + c_j x_a e^{c_j x_a} + e^{c_j x_b} - c_j x_b e^{c_j x_b} \Bigr)
\end{aligned}
\tag{A.7}
$$

The corresponding expression for $E(XY \mid d_i)$ in the case where the exponent involves only one variable, i.e., $b_j = 0$, is shown next:


$$
\begin{aligned}
E(XY \mid d_i)
&= \int_{x_a}^{x_b} \int_{y_a}^{y_b} xy\, f(x, y \mid d_i)\, dy\, dx
= \int_{x_a}^{x_b} \int_{y_a}^{y_b} xy \left( a_0 + \sum_{j=1}^m a_j \exp\{c_j x\} \right) dy\, dx \\
&= a_0 \int_{x_a}^{x_b} \int_{y_a}^{y_b} xy\, dy\, dx
+ \sum_{j=1}^m a_j \int_{x_a}^{x_b} x \exp\{c_j x\}\, dx \int_{y_a}^{y_b} y\, dy \\
&= \frac{a_0}{4}\bigl(y_b^2 - y_a^2\bigr)\bigl(x_b^2 - x_a^2\bigr)
+ \sum_{j=1}^m \frac{a_j\bigl(y_b^2 - y_a^2\bigr)}{2 c_j^2}
\Bigl( e^{c_j x_a} - c_j x_a e^{c_j x_a} - e^{c_j x_b} + c_j x_b e^{c_j x_b} \Bigr)
\end{aligned}
\tag{A.8}
$$
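Expression (A.7) can be verified in the same way against two-dimensional quadrature, again with purely illustrative coefficients:

    import numpy as np
    from scipy.integrate import dblquad

    xa, xb, ya, yb = 0.0, 1.0, -1.0, 2.0
    a0 = 0.2
    a, b, c = [0.7], [0.5], [-0.9]   # a single illustrative exponential term

    def f(x, y):
        return a0 + sum(aj * np.exp(bj * y + cj * x) for aj, bj, cj in zip(a, b, c))

    # Closed form (A.7)
    closed = a0 / 4 * (yb**2 - ya**2) * (xb**2 - xa**2) + sum(
        aj / (cj**2 * bj**2)
        * (-np.exp(bj * ya) + bj * ya * np.exp(bj * ya) + np.exp(bj * yb) - bj * yb * np.exp(bj * yb))
        * (-np.exp(cj * xa) + cj * xa * np.exp(cj * xa) + np.exp(cj * xb) - cj * xb * np.exp(cj * xb))
        for aj, bj, cj in zip(a, b, c))

    # dblquad integrates the inner variable first, so the integrand is given as func(y, x)
    numeric = dblquad(lambda y, x: x * y * f(x, y), xa, xb, lambda x: ya, lambda x: yb)[0]
    assert np.isclose(closed, numeric)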


Appendix B

Publications

The contents presented in this dissertation are the results of the following publi-

cations:

1. A. Fernandez, R. Rumı and A. Salmeron (2011). Answering queries

in hybrid Bayesian networks using importance sampling. Decision Support

Systems (submitted).

2. A. Fernandez, M. Morales, C. Rodrıguez, and A. Salmeron

(2011) A system for relevance analysis of performance indicators in higher

education using Bayesian networks. Knowledge and Information Systems

(In press) .

3. A. Fernandez, H. Langseth, T. D. Nielsen, and A. Salmeron

(2010) Parameter learning in MTE networks using incomplete data. Pro-

ceedings of the Fifth European Workshop on Probabilistic Graphical Models

(PGM’10), pages 137–145.

4. P. A. Aguilera, A. Fernandez, F. Reche, and R. Rumı (2010)

Hybrid Bayesian Network Classifiers: Application to species distribution

models. Environmental Modelling & Software, 25:1630–1639.

5. A. Fernandez, J. D. Nielsen, and A. Salmeron (2010) Learning

Bayesian networks for regression from incomplete databases. International

Journal of Uncertainty, Fuzziness and Knowledge Based Systems 18:69–86.


6. A. Fernandez and A. Salmeron (2008) Extension of Bayesian net-

work classifiers to regression problems. IBERAMIA’08. Lecture Notes in

Artificial Intelligence 5290:83–92.

7. A. Fernandez, J. D. Nielsen, and A. Salmeron (2008) Learning

naıve Bayes regression models with missing data using mixtures of truncated

exponentials. Proceedings of the Fourth European Workshop on Probabilis-

tic Graphical Models (PGM’08), pages 105–112.

8. A. Fernandez, M. Morales, and A. Salmeron (2007) Tree Aug-

mented Naıve Bayes for Regression Using Mixtures of Truncated Expo-

nentials: Application to Higher Education Management. IDA’07. Lecture

Notes in Computer Science 4723:59–69.

Other publications whose contents are not included in this dissertation are:

9. P. A. Aguilera, A. Fernandez, R. Fernandez, R. Rumı, and A.

Salmeron (2010). Bayesian networks in Environmental Modelling. Envi-

ronmental Modelling & Software (submitted).

10. A. Fernandez and A. Salmeron (2008) BayesChess: A computer chess

program based on Bayesian networks. Pattern Recognition Letters 29:1154–

1159.

11. A. Fernandez, I. Flesch, and A. Salmeron (2007) Incremental su-

pervised classification for the MTE distribution: a preliminary study. Pro-

ceedings of the CEDI’07-SICO’07, pages 217–224.

12. A. Fernandez and A. Salmeron (2006) BayesChess: programa de aje-

drez adaptativo basado en redes bayesianas. Proceedings of the CMPI’06,

pages 613–624.