
A prototype Bayesian belief network for the diagnosis of acidification in Welsh rivers

D.J. Trigg¹, W.J. Walley¹ & S.J. Ormerod²

¹School of Computing, Staffordshire University, United Kingdom.
²Catchment Research Group, Cardiff School of Biosciences, Cardiff University, United Kingdom.

Abstract

The paper describes the development of a prototype Bayesian Belief Network (BBN) that models acidification in Welsh rivers. It was based on data from the 102 river sites surveyed in the 1995 Welsh Acid Waters Survey. The objectives of the study were: a) to test the suitability of BBNs for the diagnosis of river health; and b) to test the viability of a data-based approach to the development of a prototype. The results of performance tests on the model, based upon its ability to predict pH, aluminium concentration and trout density using macro-invertebrate data, are compared to those achieved by three multiple linear regression (MLR) models. The results of like-for-like tests on the MLR models and three simple BBNs are also presented. Following a discussion of the results, it is concluded that: a) BBNs are well suited to the task of diagnosis, because they model the causal relationships of the system in a non-linear way and offer flexibility of use; b) their potential can only be fully realised if sufficient data are available for the reliable evaluation of their conditional probabilities; and c) the data-based approach to prototype development was efficient and effective.

1 Introduction

1.1 General

The biological approach to river quality monitoring has grown considerably in importance over the past few decades. Its basic principle is that the composition of the aquatic community supported by a river is a reflection of its state of health. However, the relationship between community composition and quality is confounded by several environmental and seasonal factors, but when these are accounted for, the aquatic community provides invaluable information.

Development & Application of Computer Techniques to Environmental Studies VII, C.A. Brebbia, P. Zannetti & G. Ibarra-Berastegi (Editors) © 2000 WIT Press, www.witpress.com, ISBN 1-85312-819-8

There are several biological monitoring systems in use world-wide. De Pauw and Hawkes (1993) presented a comprehensive review of the systems used in Europe. All, however, are fairly simplistic in their mathematical formulation and merely serve to classify river quality into one of several bands, normally about six. In short, they only utilise a fraction of the information contained in the composition of the living community supported by the river.

More advanced mathematical techniques, based on Artificial Intelligence (AI), were first applied to the interpretation of biomonitoring data by researchers at Aston University (Walley et al., 1992; Ruck et al., 1993). This work followed two parallel lines of research, one based on pattern recognition using neural networks, and the other based on Bayesian methods of reasoning under uncertainty. More recently, a study for the UK Environment Agency (Walley et al., 1998; Walley & Fontama, 2000) demonstrated some possible applications of artificial intelligence in river quality surveys. This resulted in a further study (Environment Agency R&D Project E1-056), which is currently developing pattern recognition and plausible reasoning systems for the classification and diagnosis of river health using biological and environmental data. Diagnosis (i.e. the identification of possible pollutants) requires the extraction of a far greater proportion of the information in the composition of the aquatic community than does classification into quality classes. This paper describes a preliminary study into the use of Bayesian Belief Networks (BBNs) for diagnosis. Its basic aim was to develop a prototype BBN for the diagnosis of acidification problems in Welsh rivers, in order to test the methodology prior to commencing the development of a BBN for the diagnosis of general pollution problems in England and Wales.

1.2 Bayesian belief networks

Bayesian belief networks are probabilistic expert systems in which the knowledge-base has two components: a network of causal relationships between variables; and a set of conditional probability matrices that relate each variable to its causal variables. BBNs were designed specifically to provide a means of reasoning under conditions of uncertainty in a mathematically sound and consistent way. Initial attempts to reason probabilistically were confounded by problems of computational explosion, but an algorithm developed by Lauritzen and Spiegelhalter (1988) eventually overcame this problem. A detailed description of the BBN method is beyond the scope of this paper, and the interested reader is directed to Neapolitan (1990) and Jensen (1996).

Basically, the algorithm combines graph theory and Bayesian probability theory to transform the initial problem into one of a series of local computations on 'cliques' of variables. This enables the effect of evidence about the state of any variable in the network to be propagated to all the other variables, thus updating the likelihoods of their possible states given the new knowledge gained from the evidence. The algorithm replicates some key features of human reasoning, namely the abilities to: a) retract current beliefs when new evidence 'explains away' earlier evidence; b) change the dependencies between variables as evidence is acquired; and c) reason predictively (i.e. cause to effect) or diagnostically (effect to cause) between variables as and when required. BBNs are highly flexible in their use, since evidence can be added anywhere, to any number of variables, and consequently they can be used for diagnosis or prognosis as the user requires.
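The two directions of reasoning can be illustrated on the smallest possible network, a single cause with a single effect. The sketch below is not the Lauritzen and Spiegelhalter algorithm, which is needed for larger networks; it simply applies Bayes' rule directly. The variable names and all probability values are invented for illustration and are not taken from the study.

```python
# Hypothetical two-node network: Acidity -> Taxon.
# All probabilities below are illustrative assumptions, not survey results.
prior = {"acid": 0.3, "neutral": 0.7}              # P(Acidity)
cpt = {                                            # P(Taxon | Acidity)
    "acid":    {"present": 0.1, "absent": 0.9},
    "neutral": {"present": 0.8, "absent": 0.2},
}


def predict(prior, cpt):
    """Predictive reasoning (cause to effect): marginal P(Taxon)."""
    effect_states = next(iter(cpt.values())).keys()
    return {t: sum(prior[a] * cpt[a][t] for a in prior) for t in effect_states}


def diagnose(prior, cpt, observed):
    """Diagnostic reasoning (effect to cause): P(Acidity | Taxon = observed)."""
    joint = {a: prior[a] * cpt[a][observed] for a in prior}
    total = sum(joint.values())
    return {a: p / total for a, p in joint.items()}


if __name__ == "__main__":
    print(predict(prior, cpt))             # belief about the taxon before any evidence
    print(diagnose(prior, cpt, "absent"))  # belief about acidity once the taxon is found absent
```

In this toy case, finding the taxon absent raises the belief in the acid state above its prior value, which is the kind of update that the full algorithm propagates through every variable in a larger network.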

2 Development and testing of the models

2.1 The data

The data were derived from the 1995 Welsh Acid Waters Survey (Stevens et al., 1997). The survey covered many biological, chemical and physical variables, not all of which were used in this study. Those used in the study included: a) spring samples of aquatic macroinvertebrates from 102 stream sites; b) salmonid fish populations at 85 of the 102 stream sites; c) winter mean values of 17 water chemistry variables for 102 stream sites sampled monthly; and d) 31 physical characteristics of each site and its drainage basin. The biological samples cover 115 taxa, mainly recorded to species level, but only 28 occurred at more than a quarter of the sites. For the purpose of this study some of the infrequently occurring taxa were grouped into their genera or families, but others were eliminated from the study. When combining taxa, a balance had to be drawn between the likely loss of information caused by the decline in specificity and that caused by eliminating taxa from the study. The biological data recorded the number of individuals of each species found. For the purpose of the study these were converted to six 'abundance' categories (0 = none found, 1 = 1 to 3 individuals found, 2 = 4 to 10, 3 = 11 to 30, 4 = 31 to 100 and 5 = 100+). Table 1 lists the taxonomic groups that were used in the models and their frequencies of occurrence. Tables 2 and 3 list the chemical and environmental variables that were found to be of value (i.e. for the diagnosis of acidification problems). The River Habitat variable listed in Table 3 was added because it was felt that this was an important factor governing trout density. The measure used was based on a principal component (PC2) derived by Stevens et al. (1997) from River Habitat Survey data. The only remaining variable used in the study, not listed in Tables 1, 2 or 3, was trout density (TrtDen, number/100 m²).
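The conversion of raw counts to abundance categories can be sketched as a simple lookup. The function name is hypothetical, and because the stated bands overlap at exactly 100 individuals, the sketch assumes that a count of 100 falls in category 4.

```python
def abundance_category(count: int) -> int:
    """Map a raw count of individuals to one of the six abundance categories.

    Assumption: a count of exactly 100 is placed in category 4, since the
    text gives both '31 to 100' and '100+' as band limits.
    """
    if count < 0:
        raise ValueError("count cannot be negative")
    if count == 0:
        return 0
    if count <= 3:
        return 1
    if count <= 10:
        return 2
    if count <= 30:
        return 3
    if count <= 100:
        return 4
    return 5


if __name__ == "__main__":
    for n in (0, 2, 7, 25, 64, 350):
        print(n, abundance_category(n))
```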

2.2 The Models

2.2.1 Prototype diagnostic BBN

Several models were developed and tested, but the main one was the prototype BBN for the diagnosis of acidification problems. Figure 1 shows the causal relationships between the biological, chemical and environmental variables that formed the basis of the model. Normally, this structure would have been developed following a series of knowledge elicitation exercises with experts in the river ecology field, but in this study it was based upon data analysis and common knowledge. This was done to try to improve the efficiency of the knowledge elicitation process prior to commencing work on a more ambitious BBN project. It was felt that it is wasteful of experts' time to involve them in the development of the first prototype, especially if it can be achieved by non-experts aided by data analysis. Once a prototype has been developed it can be used as the basis for knowledge elicitation sessions with experts.

[Figure 1: network diagram not recoverable from the source; the node labels are those listed in Tables 1 to 3.]

Figure 1. The causal belief network of the prototype diagnostic BBN

Table 1. List of the 26 taxonomic groups used in the models

| Composition of taxonomic group | Number of occurrences | Label in Figure 1 |
|---|---|---|
| Phagocata vitta | 39 | P.vitta |
| Oligochaeta | 84 | Olig |
| Baetis muticus, Baetis rhodani | 39 | Baetis |
| Rhithrogena semicolorata | 38 | R.semi |
| Heptagenia lateralis, Heptagenia sulphurea, Ecdyonurus dispar, Ecdyonurus torrentis, Ecdyonurus venosus, Ecdyonurus sp. | 38 | O-Hepta |
| Brachyptera risi | 58 | B.risi |
| Protonemura meyeri | 50 | P.meyeri |
| Amphinemura sulcicollis | 96 | A.sulc |
| Nemoura avicularis, Nemoura cambrica, Nemoura cinerea, Nemoura erratica | 34 | Nemoura |
| Leuctra hippopus | 57 | L.hippo |
| Leuctra inermis | 93 | L.inerm |
| Leuctra nigra | 38 | L.nigra |
| Isoperla grammatica | 89 | I.gram |
| Chloroperla torrentium | 83 | C.torren |
| Chloroperla tripunctata | 45 | C.tripun |
| Elmis aenea | 53 | E.aenea |
| Limnius volckmari | 59 | L.volck |
| Rhyacophila dorsalis | 65 | R.dors |
| Plectrocnemia conspersa, Plectrocnemia geniculata | 56 | Plectro |
| Hydropsyche siltalai | 40 | H.siltal |
| Hydropsyche pellucidula, Hydropsyche instabilis, Diplectrona felix | 36 | O-Hydro |
| Drusus annulatus, Ecclisopteryx guttulata, Halesus digitatus, Halesus radiatus, Halesus sp., Micropterna lateralis, Potamophylax cingulatus, Potamophylax latipennis, Chaetopteryx villosa, Limnephilus centralis, Limnephilidae | 60 | Limne |
| Limoniinae | 65 | Limon |
| Simuliidae | 86 | Simul |
| Chironomidae | 95 | Chiron |
| Clinocerinae | 48 | Clino |



The common knowledge that we imposed on the system was that: a) the chemical quality and certain physical characteristics of the river were principal causes of the composition of its biological community; and b) the physical characteristics of the catchment were principal causes of the river's chemistry. Thus, the biological variables appear in the bottom half of Figure 1, the chemical variables in the central portion and the environmental variables (site and catchment characteristics) at the top.

Table 2. List of chemical variables used in the BBN study

| Variable | Label in Figure 1 |
|---|---|
| Sodium (mg/l) | Na |
| Potassium (mg/l) | K |
| Calcium (mg/l) | Ca |
| Magnesium (mg/l) | Mg |
| Aluminium (mg/l) | Al |
| Zinc (µg/l) | Zn |
| Iron (µg/l) | Fe |
| Manganese (µg/l) | Mn |
| Chloride (mg/l) | Cl |
| Sulphate (mg SO4/l) | SO4 |
| Total Oxidised Nitrogen (mg/l) | TON |
| Total Organic Nitrogen (mg/l) | TORN |
| Dissolved Organic Carbon (mg/l) | DOC |
| pH | pH |
| Alkalinity (mg/l CaCO3) | Alk |
| Conductivity (µS/cm) | Cond |

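Table 3. List of environmental variables used in the study.

| Group | Variable | Label in Figure 1 |
|---|---|---|
| Land (% by area) | Improved land (i.e. limed) | ImpLnd |
| Woodland (% by area) | All woodland | Allwood |
| Woodland (% by area) | Coniferous woodland | Conf |
| Woodland (% by area) | Restocked > 3 yrs ago | Rstk3 |
| Site characteristics | Altitude (m) | Alt |
| Site characteristics | Flow category | Flow |
| Site characteristics | Stream gradient (m/km) | Grad |
| Site characteristics | Stream length (km) | SLen |
| Soils (% by area) | Peat | Peat |
| Soils (% by area) | Brown Earth | BrEth |
| Soils (% by area) | Brown Podzol | BrPdz |
| Soils (% by area) | Podzol | Pdz |
| Soils (% by area) | Gley | Gley |
| Soils (% by area) | Organic 1 | Org1 |
| Soils (% by area) | Organic 2 | Org2 |
| Soils (% by area) | Organic 3 | Org3 |
| River Habitat | RHS principal comp. 2 | Habitat |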

The links between variables were decided after analysing the results of stepwise multiple linear regression analyses between cause and effect variables and vice versa. For example, each taxonomic group was used as the dependent variable with the chemical and environmental variables as independent variables. This was repeated using the chemical variables as the dependent variables and the biological and environmental variables as the independent variables. Finally, this was repeated for the environmental variables using the chemical and biological variables as the independent variables. Tables were then drawn up showing the most important predictors of each variable.
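As a rough illustration of how candidate predictors might be ranked, the sketch below performs a simplified forward selection: at each step it adds the predictor that most reduces the residual sum of squares. This is an assumption-laden stand-in for the stepwise procedure used in the study, which is not described in detail; in particular it has no entry or removal significance tests, and the function names and the `max_terms` parameter are hypothetical.

```python
import numpy as np


def residual_ss(X, y):
    """Residual sum of squares of a least-squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def forward_select(X, y, names, max_terms=2):
    """Greedily pick up to max_terms predictors, best first.

    Simplified forward selection only: no F-tests and no removal step.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    chosen = []
    while len(chosen) < max_terms:
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        if not remaining:
            break
        # Add the column that gives the smallest residual sum of squares.
        best = min(remaining, key=lambda j: residual_ss(X[:, chosen + [j]], y))
        chosen.append(best)
    return [names[j] for j in chosen]
```

Called with a matrix of chemical and environmental variables as `X` and one taxon's abundance as `y`, the returned list gives that taxon's leading predictors in order of selection.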



Before deciding which variables should be linked by causal relationships there were practical problems to consider concerning the evaluation of the conditional probability matrices. These contain a probability value for each possible combination of states of the variable and its parents (i.e. causal variables). For example, if a taxon having six possible states has three parents, each having five possible states, then the total number of probabilities to be evaluated is 750 (i.e. 6 × 5 × 5 × 5). Clearly, it would not be possible to determine reliable estimates of these from just 102 cases. Thus, the limited size of the available database meant that restrictions had to be placed on the number of states and number of parents that each variable could have. It was decided to limit the number of states to three and the number of parents to two, so that the maximum number of conditional probabilities to be determined for any variable was 27. This meant that the number of abundance levels for the taxa had to be reduced from six to three, and that the continuous distributions of the other variables, some of which had long tails, had to be represented by just three discrete states. The boundaries between the states were selected by balancing the need for sufficient cases in each band to permit the evaluation of the probabilities against the need to represent the tails of the distribution effectively.
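The arithmetic behind these limits, and the reduction of a continuous variable to three states, can be sketched as follows. The function names are hypothetical, and the boundary values passed to `discretise` are placeholders, since the boundaries actually used are not listed in the paper.

```python
from math import prod


def cpt_size(n_states, parent_states):
    """Number of entries in a conditional probability matrix:
    the variable's own state count times each parent's state count."""
    return n_states * prod(parent_states)


def discretise(value, lower, upper):
    """Map a continuous value to state 0, 1 or 2 using two boundaries.

    The boundaries are supplied by the caller; the study's own boundaries
    are not given, so any values used here are assumptions.
    """
    if lower >= upper:
        raise ValueError("lower boundary must be below upper boundary")
    if value < lower:
        return 0
    if value < upper:
        return 1
    return 2


if __name__ == "__main__":
    print(cpt_size(6, [5, 5, 5]))  # 750: six-state taxon with three five-state parents
    print(cpt_size(3, [3, 3]))     # 27: the limits adopted in the study
```

With 102 cases spread over 27 cells, some cells will still hold very few observations, which is why the choice of state boundaries had to balance band occupancy against representation of the tails.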

The formation of the causal links between variables thus reduced to the selection of no more than two parents for each variable. The parents selected for each variable were generally its most important causal factors, as identified by the regression analyses. However, if this had been applied throughout, some of the chemical variables would not have had a causal link with the taxa. Since we wanted to ensure that each had at least one link, a compromise had to be found in some cases. This also applied to the selection of the parents of other variables. It should be noted that although the number of parents was limited to two, there was no limit on the number of children (effect variables) a variable could have.

2.2.2 Multiple linear regression

In order to compare the performance of the prototype BBN with that of other models, multiple regression models were developed for the prediction of pH, aluminium concentration and trout density. In each case the independent variables used were the 26 taxonomic groups listed in Table 1. Although this did not provide a comprehensive test of the full range of predictive and diagnostic capabilities of the BBN, it did provide a test of its relative performance on three of the key variables relating to acidification.
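A minimal sketch of such a regression model is given below, assuming an ordinary least-squares fit with an intercept; the paper does not state which software or fitting options were used, and the function names are hypothetical. Each row of `X` would hold the abundance values of the taxonomic groups at one site, and `y` the measured pH, aluminium concentration or trout density.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares fit of y on the columns of X, with an intercept.

    Returns the coefficient vector [intercept, b1, ..., bk].
    """
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coef


def predict_mlr(coef, X):
    """Predicted values for new rows of X."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]
```

Because the fitted surface is an unbounded plane, nothing in this formulation prevents a prediction from falling outside the physically meaningful range of the target variable.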

2.2.3 Variable-specific BBN predictors

Since the MLR models did not provide a truly like-for-like comparison with the BBN, three simple BBN models were developed for the prediction of pH, aluminium concentration and trout density. These consisted of a single parent (the variable to be predicted) with 26 children (the taxonomic groups), and thus provided an exact like-for-like comparison with the MLR models. In these models the taxonomic variables had four possible states and the predicted variables had five states. This was possible because of the reduced complexity of the models.
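A network with one parent and many children has the same structure as a naive Bayes model, so inference reduces to multiplying the prior by one likelihood term per observed child. The sketch below illustrates this under several assumptions that are not taken from the paper: probabilities are estimated from counts with add-one (Laplace) smoothing controlled by `alpha`, a numeric prediction is formed as the posterior-weighted mean of representative state values, and all function names are hypothetical.

```python
import numpy as np


def fit_single_parent_bbn(parent, children, n_parent, n_child, alpha=1.0):
    """Estimate P(parent) and P(child_j | parent) from discretised cases.

    parent:   (n,) integer states of the variable to be predicted
    children: (n, m) integer states of the m child variables (taxa)
    alpha:    smoothing count added to every cell (an assumption)
    """
    parent = np.asarray(parent, dtype=int)
    children = np.asarray(children, dtype=int)
    prior = np.bincount(parent, minlength=n_parent) + float(alpha)
    prior = prior / prior.sum()
    m = children.shape[1]
    cpt = np.full((m, n_parent, n_child), float(alpha))
    for j in range(m):
        # Count co-occurrences of (parent state, child state) for child j.
        np.add.at(cpt[j], (parent, children[:, j]), 1.0)
    cpt /= cpt.sum(axis=2, keepdims=True)
    return prior, cpt


def posterior(prior, cpt, evidence):
    """P(parent | evidence); use None for a child that was not observed."""
    log_post = np.log(prior)
    for j, state in enumerate(evidence):
        if state is not None:
            log_post = log_post + np.log(cpt[j, :, state])
    post = np.exp(log_post - log_post.max())
    return post / post.sum()


def expected_value(post, state_values):
    """Posterior-weighted mean of representative values for each parent state."""
    return float(np.dot(post, state_values))
```

For the models described here, `n_parent` would be 5 and `n_child` would be 4, with 26 columns in `children`. Working in log space avoids numerical underflow when many likelihood terms are multiplied together.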



3 Performance tests

The prototype BBN and MLR models were tested using two-way cross-validation. That is, the data set was randomly divided into two equal subsets (F1 and F2) and separate models were developed using each subset (N.B. the BBNs' causal networks remained the same; only their probability matrices differed). Each model was then tested independently on the other subset. That is, the model based on F1 was tested on F2 (Test F1/F2) and vice versa (Test F2/F1). Tests were also carried out using dependent data, by testing each model on the subset used for its development (Tests F1/F1 and F2/F2). The original data set was then randomly divided again into two equal subsets (S1 and S2) and a further set of tests, S1/S2, S2/S1, S1/S1 and S2/S2, carried out. The results of all eight tests are given in Table 4.
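The testing scheme can be sketched in a model-independent way as shown below. The `fit` and `predict` arguments are hypothetical callables standing in for either model type. The standard error of estimate is computed here as the root mean squared residual, with an optional `dof_lost` adjustment; the paper does not state which divisor it used, so this is an assumption.

```python
import numpy as np


def two_way_split(n_cases, seed=0):
    """Randomly divide case indices into two subsets of equal size (for even n)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_cases)
    half = n_cases // 2
    return order[:half], order[half:]


def correlation(observed, predicted):
    """Pearson correlation coefficient R between observed and predicted values."""
    return float(np.corrcoef(observed, predicted)[0, 1])


def standard_error(observed, predicted, dof_lost=0):
    """Root mean squared residual; dof_lost reduces the divisor if required."""
    resid = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.sum(resid ** 2) / (resid.size - dof_lost)))


def two_way_cross_validate(fit, predict, X, y, seed=0):
    """Fit on each half and test on the other; returns {test name: (R, Se)}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    first, second = two_way_split(len(y), seed)
    results = {}
    for name, train, test in (("1/2", first, second), ("2/1", second, first)):
        model = fit(X[train], y[train])
        pred = predict(model, X[test])
        results[name] = (correlation(y[test], pred), standard_error(y[test], pred))
    return results
```

Dependent tests follow the same pattern with the training indices reused as the test indices, and a second random division is obtained by changing `seed`.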

The MLR and simple BBN models used in the like-for-like tests were developed using the full data set and tested on the same (dependent) set, since the purpose of the tests was only to compare the relative performances of the models on a particular task. The results of these tests are given in Table 5.

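Table 4. Results of dependent and independent performance tests on the prototype BBN and MLR models. R = correlation coefficient; Se = standard error of estimate (values for pH and Aluminium are ×10³)

| Data | Test | pH BBN R | pH BBN Se | pH MLR R | pH MLR Se | Al BBN R | Al BBN Se | Al MLR R | Al MLR Se | Trout BBN R | Trout BBN Se | Trout MLR R | Trout MLR Se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dependent | F1/F1 | 0.859 | 254 | 0.877 | 253 | 0.847 | 26 | 0.911 | 23 | 0.398 | 11.8 | 0.855 | 21.7 |
| Dependent | F2/F2 | 0.871 | 276 | 0.903 | 242 | 0.701 | 38 | 0.821 | 45 | 0.621 | 13.9 | 0.766 | 24.2 |
| Dependent | S1/S1 | 0.848 | 273 | 0.877 | 228 | 0.698 | 39 | 0.855 | 39 | 0.612 | 17.4 | 0.790 | 24.7 |
| Dependent | S2/S2 | 0.858 | 274 | 0.934 | 221 | 0.721 | 34 | 0.836 | 34 | 0.488 | 9.8 | 0.882 | 19.6 |
| Dependent | Avg. | 0.859 | 269 | 0.898 | 236 | 0.742 | 34 | 0.856 | 35 | 0.530 | 13.2 | 0.823 | 22.6 |
| Independent | F1/F2 | 0.796 | 315 | 0.769 | 363 | 0.583 | 42 | 0.558 | 55 | 0.581 | 11.7 | 0.419 | 41.7 |
| Independent | F2/F1 | 0.765 | 323 | 0.769 | 386 | 0.654 | 33 | 0.618 | 59 | 0.392 | 13.8 | 0.189 | 37.9 |
| Independent | S1/S2 | 0.726 | 313 | 0.769 | 303 | 0.496 | 48 | 0.569 | 62 | 0.393 | 17.9 | 0.170 | 37.6 |
| Independent | S2/S1 | 0.786 | 300 | 0.669 | 498 | 0.564 | 33 | 0.419 | 69 | 0.476 | 9.8 | 0.335 | 29.5 |
| Independent | Avg. | 0.768 | 312 | 0.744 | 387 | 0.574 | 39 | 0.541 | 61 | 0.461 | 13.3 | 0.278 | 36.7 |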

Table 5. Results of like-for-like tests on the MLR and variable-specific BBN models. R = correlation coefficient; Se = standard error of estimate (values for pH and Aluminium are ×10³)

| Variable | BBN R | BBN Se | MLR R | MLR Se |
|---|---|---|---|---|
| pH | 0.912 | 236 | 0.871 | 317 |
| Aluminium | 0.801 | 45 | 0.783 | 54 |
| Trout Density | 0.837 | 22.1 | 0.644 | 39.9 |



4 Discussion

The data-based approach to the construction of BBNs is currently the subject of much research world-wide, the principal objective of which is to develop BBNs that learn the causal relationships and conditional probabilities from data (e.g. Cheng et al., 1997). However, a recently released learning system that was tested on the acid streams data did not produce a meaningful causal structure, and the authors have doubts about the real value of methods based solely on learning from data. Our pragmatic approach to the development of prototypes, based upon data analysis supplemented by common-sense knowledge, although less elegant, was more effective.

The results given in Table 4 show that although the MLR models predicted pH and aluminium concentrations better than the full BBN when tested on dependent data, the reverse was true when tested on independent data. The latter is significant because it is the model's performance on independent data that really matters. The prediction of trout density highlighted some interesting characteristics of the two methods. MLR achieved better R values on dependent data than did the BBN, simply because it fitted three outliers better. However, its Se values were far worse than those produced by the BBN. In addition, MLR predicted some negative trout densities, whereas the BBN only predicted positive values and tended to disregard the outliers. When tested on independent data, the R and Se values achieved by MLR were very poor compared to those of the BBN, because it had fitted the outliers in its training data. It should be noted that these were not like-for-like tests, and that the MLR models had an advantage over the BBN because they were highly specific models whereas the BBN was a general model.

Table 5 gives the results of the like-for-like tests on the MLR and simple BBN models. The models were developed using all of the data, so, unlike the earlier cross-validated models, they could only be tested on dependent data. The results show that the BBNs outperformed the MLR models on all tests, and especially with respect to the prediction of trout density. Thus, in this like-for-like comparison the non-linear mapping capability of BBNs showed through.

Finally, it should be realised that the performance of the BBN models was severely limited by the relatively small size of the database, since this restricted the number of causal links that could be made between variables and the number of possible discrete states that each variable could take on. Thus, the real-valued MLR models were at an advantage over the BBNs in this respect. It follows that the potential of BBNs can only be fully realised if sufficiently large databases are available for the reliable estimation of their conditional probabilities.

5 Conclusion

A prototype BBN for the diagnosis of acidification in Welsh streams has been developed using a data-based approach supplemented by common knowledge. The results of tests carried out on the model, in conjunction with tests on MLR and simple BBN models, showed that the BBN and MLR models performed to a similar standard overall when tested on dependent data. However, when tested on independent data the full BBN performed better than the MLR models, although neither provided satisfactory predictions of trout density. It is concluded that BBNs are well suited to the task of diagnosing river health, because they: a) model the causal relationships of the ecological system; b) are non-linear; and c) offer great flexibility of use. However, their potential can only be fully realised if sufficient data are available for the reliable evaluation of their conditional probabilities. Finally, the data-based approach used in this study provided an efficient and effective means of developing a prototype BBN, prior to its fine-tuning by experts.

References

De Pauw N. and Hawkes H. A. (1993) Biological monitoring of river water quality. In Proc. Freshwater Europe Symposium on River Water Quality Monitoring and Control, pages 87-111. Aston University, Birmingham.

Cheng J., Bell D. A. and Liu W. (1997) An algorithm for Bayesian belief network construction from data. Sixth International Workshop on AI and Statistics, eds. F. Golshani & K. Makki, 325-331, Fort Lauderdale, Florida.

Jensen F. V. (1996) An Introduction to Bayesian Networks. UCL Press, London.

Lauritzen S. L. and Spiegelhalter D. J. (1988) Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50(2), 157-224.

Neapolitan R. E. (1990) Probabilistic Reasoning in Expert Systems. Wiley, New York, N.Y.

Ruck B. M., Walley W. J. and Hawkes H. A. (1993) Biological Classification of River Water Quality using Neural Networks. In Applications of Artificial Intelligence in Engineering, Vol. 2: Applications and Techniques, eds. Rzevski G., Pastor J. and Adey R. A., Elsevier/CMP, 361-372.

Stevens P. A., Ormerod S. J. and Reynolds B. (1997) Final Report on the Acid Waters Survey for Wales: Volume 1 Main Text. Natural Environment Research Council, July 1997.

Walley W. J. and Fontama V. N. (2000) New approaches to river quality classification based upon artificial intelligence. In Assessing the biological quality of fresh waters: RIVPACS and other techniques, eds. Wright J. F., Sutcliffe D. W. and Furse M. T., Freshwater Biological Association, Ambleside, 263-279. (In press; due April 2000.)

Walley W. J., Fontama V. N. and Martin R. W. (1998) Applications of artificial intelligence in river quality surveys. R&D Technical Report E52. Environment Agency, Bristol.

Walley W. J., Boyd M. and Hawkes H. A. (1992) An Expert System for the Biological Monitoring of River Pollution. In Proc. 4th Int. Conf. on the Development and Application of Computer Techniques to Environmental Studies, ed. Zannetti P., Elsevier/CMP, September 1992, 721-736. ISBN 1-85166-792-X.

