

APPLICATION OF BAYESIAN METHODS FOR CYANOBACTERIA AND

CRYPTOSPORIDIUM PREDICTION AND HEALTH RISK ASSESSMENT

by

Yirao Zhang

B.Eng, Beijing Normal University, 2020

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF APPLIED SCIENCE

in

THE COLLEGE OF GRADUATE STUDIES

(Civil Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA

(Okanagan)

July 2022

© Yirao Zhang, 2022


The following individuals certify that they have read, and recommend to the College of

Graduate Studies for acceptance, a thesis/dissertation entitled:

APPLICATION OF BAYESIAN METHODS FOR CYANOBACTERIA AND

CRYPTOSPORIDIUM PREDICTION AND HEALTH RISK ASSESSMENT

submitted by Yirao Zhang in partial fulfillment of the requirements of

the degree of Master of Applied Science.

Dr. Nicolás Peleato, School of Engineering

Supervisor

Dr. Solomon Tesfamariam, School of Engineering

Supervisory Committee Member

Dr. Alyse Hawley, School of Engineering

Supervisory Committee Member

Dr. Sepideh Pakpour, School of Engineering

University Examiner


Abstract

This research investigated the use of Bayesian methods in predictive modelling of cyanobacteria concentration and Cryptosporidium presence/absence. Cyanobacteria blooms and Cryptosporidium spp. in source waters are ubiquitous concerns for water treatment and management. However, identifying and enumerating cyanobacteria and Cryptosporidium has always been challenging. Previous research has shown Bayesian methods to be a promising approach to predicting water quality variables with uncertainty. However, the inherently high imbalance in cyanobacteria abundance and Cryptosporidium classification data reduces the performance of traditional predictive models. Moreover, few studies have focused on probabilistic estimation of the Cryptosporidium disease burden after prediction.

In this study, a Bayesian approach was proposed to fit the cyanobacteria abundance data with mixture models that handle zero-inflated data. Predictor variables included weather and water quality measures that can easily be obtained day-to-day. Several models were compared based on fit to training data. The optimal model (zero-inflated negative binomial) was then used to predict cyanobacteria alert levels on a separate test set. The ability to predict narrow alert levels was limited. However, high accuracy was achieved in predicting cyanobacteria counts above or below 1,000 cells/mL.

For Cryptosporidium, a probabilistic quantitative microbial risk assessment (QMRA) approach was proposed to predict Cryptosporidium presence/absence and estimate the disease burden, expressed in disability-adjusted life years (DALYs). For severely imbalanced Cryptosporidium data, the model achieved high precision and recall. The probabilistic QMRA, based on Monte Carlo and Markov chain Monte Carlo methods, was used to estimate disease burden under different scenarios and to infer backwards the necessary level of treatment and critical control points for sewer overflow. The modelling approach can be applied to assess risk under different scenarios and to advise water management.


Lay Summary

Cyanobacterial blooms and Cryptosporidium spp. in source waters are persistent concerns for water management and treatment. However, detecting and counting these microorganisms is usually labour-intensive and time-consuming. A probabilistic estimate of the Cryptosporidium disease burden with uncertainty is not available. In this thesis, several statistical models were applied to predict cyanobacteria abundance and Cryptosporidium presence in source waters. Furthermore, a risk assessment model to estimate the disease burden of Cryptosporidium was developed using simulation and sampling methods. The model can also determine the removal efficiency and sewer overflow control required to satisfy certain health outcome goals. This study proposed new approaches to predict cyanobacteria abundance and Cryptosporidium presence in source waters more accurately and easily, and highlighted the potential of probabilistic risk assessment in quantifying the disease burden of pathogens.


Preface

The thesis is based on the research work completed in the School of Engineering at The University of British Columbia, Okanagan, under the supervision and guidance of Dr. Nicolás Peleato. All published works are included in this thesis.

Chapter 3 is based on the following paper submitted to Heliyon.

Zhang, Yirao, and Nicolas M Peleato. 2022. Predicting Cyanobacteria Abundance

with Bayesian Zero-Inflated Negative Binomial Models. Available at

SSRN: https://ssrn.com/abstract=3939421 or http://dx.doi.org/10.2139/ssrn.3939421

Chapter 4 is based on the following paper submitted to Water Research and also accepted to the Machine Learning in Public Health (MLPH) Workshop at the 2021 Thirty-sixth Conference on Neural Information Processing Systems (NIPS).

Zhang, Yirao, and Nicolas M Peleato. 2022. A probabilistic approach to evaluating

Cryptosporidium health risk in drinking water.


Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Figures
List of Tables
Acknowledgements
Chapter 1: Introduction
1.1 Background
1.2 Objectives
Chapter 2: Literature Review
2.1 Source water quality and waterborne bacteria and pathogens of concern
2.1.1 Cyanobacteria and algae blooms
2.1.2 Identification and treatment techniques for cyanobacteria
2.1.3 Cryptosporidium parvum
2.1.4 Identification and treatment techniques for Cryptosporidium
2.2 Predictive modelling for cyanobacteria and Cryptosporidium
2.2.1 Cyanobacteria prediction
2.2.2 Cryptosporidium prediction
2.3 Bayesian modelling in source water quality
2.4 Risk assessment for waterborne bacteria and pathogen
Chapter 3: Predicting Cyanobacteria Abundance with Bayesian Zero-inflated Negative Binomial Models
3.1 Introduction
3.2 Methodology
3.2.1 Study site and data source
3.2.2 Mixture Models for Zero-inflated Count Data
3.2.2.1 Zero-Inflated Negative Binomial Model
3.2.2.2 Hurdle Negative Binomial Model
3.2.3 Bayesian approach
3.2.4 Model development, selection and validation
3.2.4.1 Projection predictive inference
3.2.4.2 Leave-one-out cross-validation
3.2.4.3 Posterior Predictive Checks
3.3 Results and discussion
3.3.1 Variable selection
3.3.2 Model selection
3.3.3 Model checking
3.3.4 Cyanobacteria alert level prediction
3.3.5 Influence of weather and water quality factors on cyanobacteria counts
3.4 Summary
Chapter 4: A probabilistic approach to evaluating Cryptosporidium health risk in drinking water
4.1 Introduction
4.2 Methodology
4.2.1 Data sources
4.2.2 Predicting Cryptosporidium presence in source water
4.2.2.1 Gaussian process classification
4.2.2.2 Model performance evaluation and threshold determination
4.2.3 Modelling the Cryptosporidium exposure
4.2.3.1 Removal through water treatment
4.2.3.2 Drinking water consumption
4.2.3.3 Sewer overflow rate
4.2.4 Modelling the risk of illness
4.2.5 Probabilistic QMRA
4.3 Results and discussion
4.3.1 Cryptosporidium prediction
4.3.2 Scenario assessments with probabilistic QMRA
4.3.2.1 Climate change
4.3.2.2 Treatment technique improvement
4.3.2.3 Sewer overflow control
4.4 Summary
Chapter 5: Conclusion
5.1 Summary of Contributions
5.2 Limitations and suggestions for future research
Bibliography
Appendix


List of Figures

Figure 3-1 (a) Bar plots of sampling frequency in each year from 2002 to 2015; (b) Histograms of cyanobacteria abundance.

Figure 3-2 (a) Visualization of the time series for cyanobacteria abundance in Cheney reservoir; (b) Autocorrelation function (ACF) of cyanobacteria abundance based on monthly time lag.

Figure 3-3 Correlation heatmap of water quality and weather parameters.

Figure 4-1 (a) Bar plots of sampling times in each year from 2015 to 2021; (b) Histograms of Cryptosporidium presence and absence.

Figure 4-2 Linear correlation (r) heatmap of water quality, weather parameters, and Cryptosporidium presence/absence (Class).

Figure 4-3 Density plot of overflow rate.

Figure 4-4 Schematic model for the process of DALYs estimation. The arrows represent the relationship between two variables.

Figure 4-5 (a) The precision-recall curve when varying the threshold of predicting “presence”; (b) The F-score to threshold curve.

Figure 4-6 Quantile-quantile plots (Q-Q plots) (a) and box plots (b) of disability-adjusted life years (DALYs) before climate change, after climate change under controlled emission, and after climate change under uncontrolled emission.

Figure 4-7 DALYs for Cryptosporidium under temperature from 15 to 65°C.

Figure 4-8 (a) Q-Q plots of DALYs before treatment improvement, and for improved treatment at means of 2.5 ± 0.5 log, 3.0 ± 0.5 log and 3.5 ± 0.5 log; (b) Box plots of DALYs in the four groups.

Figure 4-9 The dark density plot represents the estimated probability density of samples drawn from the target distribution, while the light density plot represents the estimated probability density of samples drawn from treatment before improvement (uniform distribution, lower = 1.5, upper = 2.5). (a) Density plot of samples drawn from the target distribution before climate change; (b) Density plot of samples drawn from the target distribution after climate change under emission control.

Figure 4-10 (a) Q-Q plots of DALYs at the current sewer overflow rate (0.022) and at three reduced sewer overflow rates of 0.01, 0.005 and 0.001; (b) Box plots of DALYs in the four groups.

Figure 4-11 The dark density plot represents the estimated probability density of samples drawn from the target sewer overflow rate distribution, while the light density plot represents the estimated probability density of samples drawn from the sewer overflow rate before control (normal distribution, mean = 0.022, sd = 0.12). (a) Density plot of samples drawn from the target distribution before climate change; (b) Density plot of samples drawn from the target distribution after climate change under emission control.


List of Tables

Table 3-1 Selected variables used to build initial models.

Table 3-2 Alert levels for management of toxic cyanobacteria (WQRA).

Table 3-3 LOO-CV results to compare strength of model fits. Differences in elpd and standard error (SE) were calculated relative to the highest performing model (ZINB).

Table 3-4 (a) Confusion matrix for all WQRA levels along with figure depicting probability of each class for a given prediction; (b) Reduced confusion matrix for binary decisions > or < 1,000 cells/mL.

Table 4-1 Confusion matrix for binary classification of presence and absence of Cryptosporidium.


Acknowledgements

I would first like to thank my supervisor, Professor Nicolás Peleato, whose expertise has

helped me formulate the research questions and methodology. I could not have completed

the thesis without his insightful feedback and guidance. Thanks for his patient support along

the way.

Thanks also to committee members Dr. Solomon Tesfamariam and Dr. Alyse Hawley, and university examiner Dr. Sepideh Pakpour, for their crucial role in the completion of my thesis and for their meaningful comments, which helped me improve it.

I would like to acknowledge my lab colleagues, Ziyu Li, Faezeh Ketabchi and Atefeh Ashrafi, for sharing their experiences and data with me. I would also like to thank Professor Michael Noonan of the Department of Biology for his terrific course on biostatistics. I also thank Dr. Andrew Gelman, Dr. Aki Vehtari and Dr. Ben Lambert for their wonderful books on Bayesian statistics, which opened the door to the Bayesian world for me and gave me opportunities to further my research.

Finally, sincere thanks to my family and friends for their understanding, encouragement and support over the two years.


Chapter 1: Introduction

1.1 Background

Source water refers to natural sources of water, such as lakes, rivers, reservoirs and groundwater, that supply public water systems. Source water can be easily contaminated by human activities, including sewage and agricultural pollution. Polluted waters can contain greater numbers of viruses, bacteria, parasites and other microorganisms, such as cyanobacteria and Cryptosporidium. Under certain conditions, such as warm water temperature and abundant nutrients, cyanobacteria that are naturally found in aquatic environments can reproduce rapidly to form blooms. Cyanotoxins released by some types of cyanobacteria can spread through contaminated or inadequately treated drinking water and cause illness in humans and animals. For example, during 2009-2010, 11 disease outbreaks associated with algal blooms were reported in the United States, representing 79% of the 14 outbreaks associated with freshwater algal blooms that have been reported to the CDC since 1978 (Hilborn et al., 2014).

Cryptosporidium, which can be present in human or animal feces, can cause gastrointestinal illness when fecal-contaminated water is consumed. According to World Health Organization (WHO) guidelines (World Health Organization, 2005), about 1.8 million people die from diarrheal diseases globally every year, and many of these deaths have been linked to the consumption of contaminated water. The outbreak of cryptosporidiosis in Milwaukee, Wisconsin, United States, was caused by an ineffective filtration process and resulted in a total cost of $96.2 million in medical costs and productivity losses (Corso et al., 2003). Since these waterborne microorganisms can result in high maintenance costs and pose serious threats to public health, there is an urgent need to develop technology for source water evaluation and risk assessment to reduce risks and control treatment costs. However, accurate and rapid detection and enumeration of cyanobacteria and Cryptosporidium remain a practical challenge. Current laboratory-based methods that rely on microscopy are costly and time-consuming. Given the need for methods that can rapidly enumerate or estimate microorganism levels, indirect data-driven approaches have been explored to make day-to-day predictions based on easy-to-measure meteorological and water quality parameters.

Bayesian statistics is a promising analytical approach that has been utilized in drinking water research. Bayesian methods propagate estimation uncertainty, allow probabilistic interpretation of both the outcomes and the parameters of interest, and can incorporate prior expert knowledge (Gelman et al., 2013). In the context of risk assessment and decision making, Bayesian methods are well suited to inform the management of complex systems with high uncertainty, such as the prediction of cyanobacteria blooms and Cryptosporidium presence in source waters and the related public health risks.

A common challenge in modelling cyanobacteria and Cryptosporidium is the innate imbalance in monitoring datasets. A large excess of zero counts is typical and may result from either a failure to detect or an actual absence of microorganisms. Hence, in this study, predictive models suited to imbalanced data, such as zero-inflated models and Gaussian process classification with threshold moving, were explored and applied to cyanobacteria and Cryptosporidium estimation and quantitative microbial risk assessment (QMRA). Furthermore, the predictions produced by the Bayesian approach used in this thesis are probabilistic, providing interpretable results with uncertainty for decision making.
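
As a rough illustration of why excess zeros matter, the following Python sketch compares the probability of a zero count under a plain negative binomial distribution with that under a zero-inflated negative binomial (ZINB) mixture. The mean/dispersion parameterization, the function names and the numerical values (`mu`, `phi`, `pi_zero`) are illustrative assumptions, not the fitted models or parameters from this thesis.

```python
import math


def nb_pmf(k, mu, phi):
    """Negative binomial pmf with mean mu and dispersion phi
    (variance = mu + mu**2 / phi)."""
    log_p = (
        math.lgamma(k + phi) - math.lgamma(phi) - math.lgamma(k + 1)
        + phi * math.log(phi / (phi + mu))
        + k * math.log(mu / (phi + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, mu, phi, pi_zero):
    """Zero-inflated negative binomial pmf.

    With probability pi_zero the observation is a structural zero;
    otherwise the count follows the negative binomial above.
    """
    base = (1.0 - pi_zero) * nb_pmf(k, mu, phi)
    return pi_zero + base if k == 0 else base


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    mu, phi, pi_zero = 5.0, 2.0, 0.3
    print("P(0) under NB:  ", nb_pmf(0, mu, phi))
    print("P(0) under ZINB:", zinb_pmf(0, mu, phi, pi_zero))
```

For the same mean and dispersion, the mixture always assigns at least as much probability to zero as the plain negative binomial, which is how it accommodates zeros arising from both non-detection and true absence.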

In this thesis, the research focuses on the application of Bayesian methods in source water quality evaluation and risk assessment, i.e., imbalanced regression for cyanobacteria abundance prediction (Chapter 3), imbalanced classification for Cryptosporidium presence/absence prediction (Chapter 4), and probabilistic QMRA (Chapter 4). In the following section, a detailed description of specific objectives and a brief overview of major methods are provided.

1.2 Objectives

The primary goal of this work is to develop Bayesian predictive models for cyanobacteria

abundance and the presence of Cryptosporidium in source water, address the understudied

data imbalance problem in water quality modelling, and investigate the probabilistic

approach to evaluate Cryptosporidium health risk in drinking water. The specific hypotheses

and objectives of this thesis are listed below:

1) Various environmental factors are directly related to cyanobacteria abundance. Using zero-inflated models that account for data imbalance is hypothesized to improve the fit. The aim of this study is to identify the key factors necessary for cyanobacteria growth and to develop a predictive model for imbalanced cyanobacteria data through Bayesian statistical methods and zero-inflated models.

Cyanobacterial blooms are a persistent concern for water management and treatment, potentially releasing toxins and degrading water quality. Significant efforts have been made to model cyanobacteria growth in surface waters to identify key factors driving growth and anticipate bloom events. However, previous models have not considered the zero-inflation of cyanobacteria count data, which refers to the excess of zero counts in cyanobacteria abundance. Commonly used Poisson and negative binomial models for count data underestimate the probability of zeros, making these models less reliable. As such, issues related to regression on imbalanced data must be addressed. Several models for zero-inflated data, including zero-inflated and hurdle Poisson models and zero-inflated and hurdle negative binomial models, will be investigated and compared in this thesis to improve the overall accuracy.

Furthermore, there is limited discussion of the interpretability and uncertainty of predictive models. Point estimates from the frequentist approach do not allow for informative, transparent prediction distributions. In this work, a Bayesian approach to zero-inflated regression models will be presented. The established model will also be used to assess the importance and impact of meteorological and environmental variables on the probability of cyanobacteria blooms. The optimal model with the important predictor variables will be used to predict cyanobacteria alert levels on a separate test set of data.

2) Day-to-day Cryptosporidium health risk can be evaluated through the QMRA

approach based on water quality data. Nonparametric methods are hypothesized

to be effective in addressing the imbalanced classification problem. The aim is to

apply Bayesian nonparametric methods and develop a probabilistic QMRA

approach to evaluate Cryptosporidium health risk in drinking water in different

scenarios and determine the critical control point for water management.

Cryptosporidium is an important pathogen that commonly drives public health risks associated with drinking water treatment, due to its persistence in aquatic environments (Swaffer et al., 2018) and high probability of infection at low doses (Lal et al., 2015). However, information on Cryptosporidium concentration is inadequate due to significant sampling and measurement challenges. Because Cryptosporidium is commonly present in low numbers in samples, manual concentration, filtration and detection are usually slow and labor-intensive. Data-driven models to predict Cryptosporidium presence based on historical meteorological and environmental data have not been intensively studied. Furthermore, the consequent health risks in drinking water related to Cryptosporidium concentration in source water should also be evaluated.

In this study, a Bayesian nonparametric method, logistic Gaussian process regression, will be applied to predict Cryptosporidium presence in source waters using easy-to-measure parameters, and a probabilistic QMRA connected to the predictive model will be presented to evaluate Cryptosporidium health risks. With the probabilistic QMRA, the effects of climate change (including changes in temperature and precipitation patterns), treatment technique improvements and sewage overflow control will be investigated. The utility of the novel QMRA in backward inference of the removal efficiency and sewage overflow control required to meet specific disability-adjusted life years (DALYs) goals will also be demonstrated.
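
The forward (Monte Carlo) part of a probabilistic QMRA can be sketched as follows. This is a minimal illustration under stated assumptions rather than the model developed in Chapter 4: the exponential dose-response form, the lognormal distributions for source concentration and water consumption, the improved-treatment range, and every numerical constant except the 1.5 to 2.5 log removal range for current treatment (taken from the caption of Figure 4-9) are hypothetical placeholders. The MCMC-based backward inference is not shown.

```python
import math
import random

# Hypothetical placeholder constants (NOT values from this thesis).
DOSE_RESPONSE_R = 0.1        # exponential dose-response parameter
P_ILL_GIVEN_INFECTION = 0.7  # probability of illness given infection
DALY_PER_CASE = 1.5e-3       # DALYs lost per case of illness


def simulate_annual_dalys(lrv_low, lrv_high, n_draws=10_000, seed=0):
    """Monte Carlo draws of annual DALYs per person for one treatment
    scenario, with log removal drawn uniformly from [lrv_low, lrv_high]."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        conc = rng.lognormvariate(math.log(0.1), 1.0)  # oocysts/L (assumed)
        volume = rng.lognormvariate(0.0, 0.5)          # L/day (assumed)
        lrv = rng.uniform(lrv_low, lrv_high)           # log removal value
        dose = conc * 10.0 ** (-lrv) * volume          # oocysts/day ingested
        p_inf = -math.expm1(-DOSE_RESPONSE_R * dose)   # 1 - exp(-r * dose)
        p_ill_daily = p_inf * P_ILL_GIVEN_INFECTION
        p_ill_annual = 1.0 - (1.0 - p_ill_daily) ** 365
        draws.append(p_ill_annual * DALY_PER_CASE)
    return draws


def summarize(draws):
    """Return the median and an approximate 95th percentile of the draws."""
    s = sorted(draws)
    return {"median": s[len(s) // 2], "p95": s[int(0.95 * (len(s) - 1))]}


if __name__ == "__main__":
    current = simulate_annual_dalys(1.5, 2.5)   # range from Figure 4-9 caption
    improved = simulate_annual_dalys(2.5, 3.5)  # hypothetical improved scenario
    print("current: ", summarize(current))
    print("improved:", summarize(improved))
```

Because each scenario returns a full set of draws rather than a point estimate, the scenarios can be compared by their distributions (for example with box plots or Q-Q plots) instead of by a single mean.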


Chapter 2: Literature Review

2.1 Source water quality and waterborne bacteria and pathogens of concern

2.1.1 Cyanobacteria and algae blooms

Cyanobacteria, also called blue-green algae, are photosynthetic prokaryotes found naturally in all types of illuminated environments (Whitton & Potts, 2012). They are single-celled and synthesize various forms of chlorophyll to absorb energy from sunlight. Cyanobacteria flourish in fresh, brackish and marine water, and can be found in environments where no other microalgae can exist. Although most types of cyanobacteria live in warm and nutrient-rich waters, many species are capable of living in soil and other terrestrial habitats (Mur et al., 1999).

Even though cyanobacteria can live across a diverse range of environments, they prefer warmer climates, and the temperature optimum for most cyanobacteria is at least several degrees higher than for most eukaryotic algae (Whitton & Potts, 2012). Robarts & Zohary (1987) found from literature and field data that the growth rate is temperature-dependent, with an optimum at 25°C or greater. However, direct temperature effects are secondary to indirect temperature effects mixing with nutrients. Temperature is hypothesized to act synergistically with other factors (Robarts & Zohary, 1987). Cyanobacteria can tolerate desiccation, water stress and high levels of ultraviolet irradiation (Whitton & Potts, 2012), and are diverse and abundant at higher pH values. The growth rate of most cyanobacteria is optimal at pH values between 7.5 and 10.0. Cyanobacteria have not been found in acid lakes and are uncommon even in waters with pH between 5 and 6 (Fang et al., 2018). Several cyanobacteria can fix atmospheric nitrogen, and their growth in many ecosystems is limited by the availability of nutrients, including phosphorus and nitrogen. When freshwater bodies become enriched in nutrients, especially phosphorus, there is often a shift in the phytoplankton community towards dominance by cyanobacteria (O’Neil et al., 2012). The nitrogen to phosphorus (N:P) ratio is another factor that regulates the dominance of planktonic communities by blue-green or green microorganisms. A decrease in the ratio through the addition of phosphorus usually leads to cyanobacterial blooming (Levich, 1996). Other factors, such as vertical stratification and increased atmospheric carbon dioxide, also contribute to the increasing dominance of cyanobacteria in aquatic ecosystems (Paerl & Paul, 2012). In water bodies with the favorable conditions described above, cyanobacteria can multiply quickly, forming harmful algal blooms that spread across the surface.

Cyanobacteria blooms can form in warm, slow-moving waters that are rich in nutrients from sources such as fertilizer runoff or septic tank overflows. Blooms with the potential to harm human health or aquatic ecosystems are referred to as harmful algal blooms (HABs). Dense blooms are usually toxic and can cause major water quality problems (Huisman et al., 2018). Cyanobacteria blooms block the sunlight that other organisms need to live, and can deplete dissolved oxygen, causing the death of fish and benthic invertebrates (Rabalais et al., 2010). They can also form several compounds that give unpleasant tastes and odours, which interfere with the use of water for recreation and drinking. Toxins produced by cyanobacteria, cyanotoxins, constitute the major source of natural product toxins found in surface supplies of freshwater (Carmichael, 1997). Cyanotoxins have significant health effects, including human illness and mortality from direct consumption of the toxins or from indirect exposure through organisms that accumulate them (Sarma, 2012).

Cyanotoxins can be generally classified into three major classes based on their

primary toxicological effects: hepatotoxic peptides, neurotoxins and dermatotxin (Ferrão-

Filho & Kozlowsky-Suzuki, 2011). Most human and animal poisoning by cyanobacteria

involve acute hepatotoxicosis, caused by microcystins (MCs) and nodularins (NODs)

(Sarma, 2012). MCs are a class of toxins produced by a variety of cyanobacteria, including

Microcystis, Planktothrix, Anabaena, and Oscillatoria genera (Harke et al., 2016). MCs are

the most commonly found cyanobacterial toxins; they pose a major risk to safe drinking

water and a serious threat to public health (Rastogi et al., 2014). Acute illnesses

caused by short-term exposure to MCs include headache, sore throat, vomiting and nausea,

diarrhea, and pneumonia, while prolonged exposure to MCs at the reference level can cause

severe liver injury and may increase the risk of developing nonalcoholic fatty liver

disease (NAFLD) (Zhao et al., 2020).

Exposure to MCs occurs either through direct contact or through intake of untreated

contaminated water and food (Papadimitriou et al., 2012). Detectable levels of MCs were

found in 80% of raw and treated water samples from 45 drinking water supplies in Canada and the

US in a 1996-1998 survey (Carmichael et al., 2001). However, only 4% of the samples

exceeded the WHO drinking water guideline of 1.0 μg/L. Between 2000 and 2012, an

extensive survey of 81 lakes in New York and the lower Great Lakes found detectable levels of

MCs in nearly 60% of the 2500 samples collected during cyanobacteria bloom season,

15% of which exceeded the WHO advisory limit (Boyer, 2008). NODs are potent hepatotoxins produced

by the cyanobacterium Nodularia spumigena (Sivonen et al., 1989). NODs are often associated

with gastroenteritis, allergic irritation reactions and liver diseases. Detrimental effects from

NODs have been frequently reported in many countries over the past 30 years (Chorus &

Welker, 2021). The WHO guideline for NOD concentration is 1.5 μg/L. Although there are

few records of human illness related to cyanobacterial blooms, Health Canada in 2002

classified MCs as possibly carcinogenic to humans and placed them in Group IIIB, denoting

inadequate data in humans and limited evidence in experimental animals (Health Canada,

2002).

2.1.2 Identification and treatment techniques for cyanobacteria

Cyanobacterial cells are usually enumerated using optical microscopy (Marchall, 1982;

Tortora et al., 2007), which is usually labor-intensive and depends on the subjective judgment of the

observers (Alversion et al., 2003; Correa). This method is further complicated by the

variable morphology of individual cells, high species diversity, and complexity of cell

aggregates or units (Jin et al., 2018). To replace the conventional enumeration methods,

recent studies have focused on image-driven techniques. Baek et al. (2020)

used deep learning techniques, namely a regional convolutional neural network (R-CNN) and a

convolutional neural network (CNN), to quantify five cyanobacteria species. After reducing

the noise in the cell features, the model achieved average precision values above 0.89

for all five species. Jin et al. (2018) developed a novel imaging-driven

technique with an integrated fluorescence signature to enable automated enumeration of

cyanobacterial cells. The technique was reported to achieve higher accuracy than

standard manual microscopic enumeration techniques in less time.

Both conventional and advanced treatment technologies have been used for

cyanobacterial cell removal. A number of studies have examined the effectiveness of

conventional treatment technologies on cyanobacterial cells and cyanotoxin removal.

Coagulation effectively removes cyanobacterial cells but cannot remove toxins, and the cells

remain intact under typical operating conditions, despite the high velocity gradients that are

produced during the rapid mixing stage (Ghernaout et al., 2010; Chow et al., 1999).

Filtration usually follows coagulation and sedimentation and involves the passage

of the water by gravity through a filter of granular material (typically sand, gravel or

anthracite) to remove the remaining particulates in the water.

Cyanobacterial cells and cell-bound cyanotoxins have been found to be effectively removed by

bank filtration, slow sand filtration and sedimentation (Grutzmacher et al., 2002; Hrudey et

al., 1999; Lahti et al., 2001). Although the large size of cyanobacterial cells allows

effective removal during filtration, little or no removal of cyanotoxins occurs during filtration

(He et al., 2016).

Adsorption is also a widely applied and effective treatment technology that removes trace

contaminants. Commonly used adsorbents include activated carbon and iron-based

adsorbents. Activated carbon, however, is more often utilized in practice than other

materials. Most of the studies relating to the activated carbon adsorption of cyanotoxins

have been conducted on the microcystins, in particular microcystin-LR, and have

concluded that granular activated carbon is effective in removing cyanotoxins from drinking

water (Falconer et al., 1989; Himberg et al., 1989; Lambert et al., 1996). The capacity of

powdered activated carbon to adsorb cyanotoxins has also been reported and is directly

related to the pore volume in the mesoporous region (Maatouk et al., 2002; Antoniou et al.,

2005). In addition to the pore size of the adsorbent, the surface chemistry of the adsorbent,

the pH of the solution, and the presence of competing compounds such as NOM, also

influence the adsorption process (Huang et al., 2007).

Disinfection is a highly effective method of bacteria removal and toxin inactivation.

Common oxidants include free chlorine, chlorine dioxide, chloramines and

permanganate. Free chlorine can effectively destroy microcystins and cylindrospermopsin

under optimized treatment conditions but is less effective in destroying saxitoxin and

anatoxin-a (Acero et al., 2005; Westrick et al., 2010). Chlorine dioxide (ClO2) is more

selective and in some cases comparable to or even more effective than chlorine in the

inactivation of microorganisms. However, studies have found that chlorine dioxide is not

effective for the destruction of microcystins,

cylindrospermopsin and anatoxin (Westrick et al., 2010). Chloramine is the least effective

oxidant for inactivating certain cyanobacteria species, including Microcystis aeruginosa,

Oscillatoria sp. and Lyngbya sp., due to its relatively low reactivity with common water

constituents (Wert et al., 2013).

Membrane processes have been shown to effectively remove toxic

cyanobacterial cells. A composite nanofiltration membrane has been found to remove

microcystins almost completely (Teixeira & Rosa, 2005). Gijsbertsen-Abrahamse et al. (2006)

found that 98% of cell-bound microcystin was removed using an ultrafiltration membrane.

Ultrafiltration coupled with powdered activated carbon (PAC-UF) is an effective process for the

removal of cyanotoxins from drinking water. Lee & Walker (2006) found that PAC-UF

removed more than 90% of microcystin-LR from water because adsorption of the toxins onto PAC

increases their effective size and facilitates removal by ultrafiltration.

2.1.3 Cryptosporidium parvum

Cryptosporidium parvum is a protozoan parasite that can cause the diarrheal disease

cryptosporidiosis. Cryptosporidium is protected by an outer shell that allows it to survive

outside the body for long periods of time and makes it tolerant to chlorine disinfection during

treatment (Korich et al., 1990). Once the oocyst is consumed, the parasite can emerge from

the shell and infect the lining of the intestine, causing cryptosporidiosis (Templeton et al.,

2004).

The most common species in humans is Cryptosporidium parvum. However,

Cryptosporidium felis, Cryptosporidium muris and Cryptosporidium meleagridis have also

been identified in immunocompromised persons, especially those with the acquired

immunodeficiency syndrome (AIDS) (Chen et al., 2002). Symptoms of cryptosporidiosis

include watery diarrhea, stomach cramps or pain, dehydration, nausea and vomiting, which

begin 2 to 10 days after becoming infected with the parasite and usually last about 1 to 2

weeks in people with healthy immune systems (Leitch & He, 2011). People of all ages

can be infected, although infections are more common and symptoms are more severe in

children. To date, nitazoxanide, paromomycin, and azithromycin have shown activity against

Cryptosporidium (Sparks et al., 2015).

The major sources of Cryptosporidium are domestic and wild animals. Beef calves

are regarded as posing the greatest risk because of their large numbers, distribution and

high levels of oocyst excretion. Cryptosporidium completes its life cycle within the

epithelial cells in the intestine of a single host, undergoing both sexual and asexual

development and producing oocysts that are excreted in the feces (Rose, 1997). Cryptosporidium is

widely distributed in the environment, and fecal-oral transmission of the oocyst stage has

resulted in outbreaks through contamination of drinking water, food, and recreational water

(Fayer et al., 2000). Fecal contamination of water linked to the discharge of untreated and

treated sewage and to manure run-off has been well documented (Razzolini et al., 2020).

Lisle & Rose (1995) reported that between 5.6% and 87.1% of source water samples, including

surface, spring and groundwater samples not impacted by domestic and/or

agricultural waste, contain 0.003 to 4.74 oocysts L⁻¹. Monthly average oocyst concentrations in the river

network were predicted to range from 10⁻⁶ to 10² oocysts L⁻¹ in most places. Densely

populated areas such as India, China, and Mexico are ‘hotspot’ regions with high oocyst

concentrations (Vermeulen et al., 2019). The infectivity of oocysts is high and the

infectious dose is low: ingestion of 10-30 oocysts can cause

infection in healthy populations (Yoder & Beach, 2010).

Cryptosporidium is extremely resistant to chemical disinfection and has long survival

times in the aquatic environment. Cryptosporidium oocysts have demonstrated longevity in

all types of water investigated and under stresses including freezing (exposure to temperatures as low as

-22℃), desiccation and a series of water treatment processes (Robertson & Smith, 1992).

Oocysts survive well in waters at 20℃ with a salinity of 10 ppm for over 20 weeks, but for less than

4 weeks in seawater at a salinity of 30 ppm. Under natural conditions, the die-off rate of

oocysts in water is 0.005-0.037 log units per day (Fayer et al., 1998). The structure and

composition of the oocyst wall are primary factors determining the survival and hydrologic

transport of Cryptosporidium parvum oocysts outside the host. The Interim Enhanced Surface

Water Treatment Rule (IESWTR) promulgated by the United States Environmental

Protection Agency (USEPA) established a Maximum Contaminant Level Goal (MCLG)

of zero for Cryptosporidium (USEPA, 1997). However, most conventional water systems in

the US achieved 2-2.5 log10 removal, which did not ensure that the filtered water was free of

Cryptosporidium (LeChevallier et al., 1991). High disinfection levels or more efficient

disinfection procedures are required, which will be discussed in subsection 2.2.
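
As a rough illustration of what the die-off rates quoted above imply, the following Python sketch converts a constant die-off rate (in log10 units per day) into the time needed for a given log reduction. It assumes simple log-linear decay, which is an assumption made here for illustration and is not stated in the cited study; the function names are hypothetical.

```python
def days_for_log_reduction(target_log: float, rate_log_per_day: float) -> float:
    """Days needed to reach `target_log` log10 reduction at a constant die-off rate."""
    if rate_log_per_day <= 0:
        raise ValueError("die-off rate must be positive")
    return target_log / rate_log_per_day


def surviving_fraction(days: float, rate_log_per_day: float) -> float:
    """Fraction of oocysts still viable after `days`, assuming log-linear decay."""
    return 10.0 ** (-rate_log_per_day * days)


# Range of die-off rates quoted in the text (log10 units per day).
SLOW_RATE, FAST_RATE = 0.005, 0.037

slow_days = days_for_log_reduction(1.0, SLOW_RATE)  # time for a 90% reduction
fast_days = days_for_log_reduction(1.0, FAST_RATE)
print(f"1-log reduction takes about {fast_days:.0f} to {slow_days:.0f} days")
```

Under this assumption, a 1-log (90%) reduction would take roughly 27 days at the fastest quoted rate and about 200 days at the slowest, which is consistent with the long survival times described above.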

The largest reported Cryptosporidium outbreak occurred in Milwaukee in the US in 1993,

with more than 400,000 people estimated to have been affected (Mac Kenzie et al.,

1994). In Canada, there have been several notable outbreaks. In the summer of 1996, a

Cryptosporidium outbreak affected approximately 2,000 people in Cranbrook, British

Columbia, and a separate incident a few weeks later in Kelowna, British Columbia, caused illness in

10,000 to 15,000 people (Ong et al., 1999). In April 2001, an outbreak

in the city of North Battleford, Saskatchewan, caused diarrheal illness in 5800-7100 people,

with 1907 confirmed cases of cryptosporidiosis (Ong et al., 1999). In

developing countries, the growing health burdens of growth faltering, malnutrition, and diarrheal

mortality related to Cryptosporidium remain underappreciated because diagnostic tools are not

available. The incidence of Cryptosporidium infection is also growing in developed countries,

largely due to outbreaks in recreational water facilities. Without effective diagnostics,

treatment for immunocompromised patients and promising vaccines, the ability to reduce

the disease burden of Cryptosporidium infection remains limited (Shirley et al., 2012).

2.1.4 Identification and treatment techniques for Cryptosporidium

To evaluate the health risk related to Cryptosporidium oocysts in water and implement

appropriate treatment techniques, oocyst concentrations must be known. Quantification of

Cryptosporidium requires several steps including concentration or filtration, and manual

detection, which is challenging considering the commonly low numbers of Cryptosporidium

in samples. Other enumeration methods for Cryptosporidium include flow cytometry, solid

phase cytometry, electric resistance particle characterization, hemacytometry, chamber

slides and epifluorescent well slide (Lindquist et al., 1999).

Cryptosporidium spp. oocysts occur in low numbers in the aquatic environment

(Smith et al., 2003), and in vitro methods that augment pathogen numbers prior to identification

are not available for Cryptosporidium in source water. Thus, large volumes of water

must be collected for detection, and concentrating small numbers of oocysts effectively is

important. Common methods for Cryptosporidium identification and enumeration include

concentrating and staining, microscopic enumeration, immunoassay techniques and

molecular techniques (Ahmed & Karanis, 2018). In drinking water, concentration through

methods such as continuous flow centrifugation and membrane filtration is most commonly

practiced. Microscopy-based identification of Cryptosporidium is the most widely used due

to its relatively low cost (Ahmed & Karanis, 2018). Molecular methods such as PCR tests can also

detect Cryptosporidium in drinking water specimens. However, despite their high sensitivity and accuracy, the

false positive rate is usually high due to the detection of non-viable microorganisms and

laboratory contamination (Checkley et al., 2015). USEPA has developed a grab sample

method for Cryptosporidium in raw water samples. The method involves filtration,

immunomagnetic separation, staining with an immunofluorescence assay and 4′,6-diamidino-2-phenylindole

(DAPI), detection by epifluorescence microscopy, and determination

of internal morphology using Nomarski differential interference contrast (DIC) microscopy

prior to determining oocyst concentration (Clancy et al., 1999).

The USEPA Interim Enhanced Surface Water Treatment Rule (IESWTR), which was

promulgated in 1998, sets a Maximum Contaminant Level Goal (MCLG) of zero for

Cryptosporidium and requires treatment to achieve

2-log (99%) removal when using filtration only (USEPA, 1998). Under

the USEPA Long-term 2 Enhanced Surface Water Treatment Rule (LT2ESWTR), water

plants using conventional treatment are required to monitor for Cryptosporidium, E. coli and

turbidity for a period of 24 months (USEPA, 2006). However, LeChevallier et al. (1991)

examined 66 conventional water systems in the US and reported that most of the utilities

achieved 2-2.5 log10 cyst and oocyst removal by clarification and filtration, and that compliance

with criteria outlined by the SWTR did not ensure that filtered water was free of waterborne

parasites. High disinfection levels or more efficient disinfection procedures were ultimately

recommended.
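
The log-removal figures quoted in this subsection can be related to percent removal and residual concentration as sketched below. This is a minimal illustration only: the function names are hypothetical, and the influent value of 4.74 oocysts per litre is borrowed from the upper end of the source-water range reported earlier purely as an example.

```python
import math


def log_removal(influent: float, effluent: float) -> float:
    """Log10 removal value computed from influent and effluent concentrations."""
    return math.log10(influent / effluent)


def percent_removed(log_removal_value: float) -> float:
    """Percent of organisms removed for a given log10 removal value."""
    return 100.0 * (1.0 - 10.0 ** (-log_removal_value))


def expected_effluent(influent: float, log_removal_value: float) -> float:
    """Residual concentration after applying a log10 removal to the influent."""
    return influent * 10.0 ** (-log_removal_value)


# Example influent concentration (oocysts per litre), used for illustration only.
residual = expected_effluent(4.74, 2.0)
print(f"2-log removal = {percent_removed(2.0):.1f}% removed, residual {residual:.4f} oocysts/L")
```

A 2-log removal corresponds to 99% removal, so a source water at this example concentration would still carry a small but nonzero oocyst concentration after treatment, in line with the observation that compliance with a log-removal criterion does not guarantee parasite-free water.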

Previous studies have investigated the treatment efficiency of various treatment

processes. Nieminski and Ongerth (1995) conducted a 2-year evaluation of Cryptosporidium

removal at a full-scale treatment plant and a pilot plant operating under conventional treatment and reported an

average of 2.9 log removal for Cryptosporidium. States et al. (1997) observed

Cryptosporidium removal of 1.49 log in a full-scale conventional treatment plant. Since

Cryptosporidium oocysts are resistant to removal and inactivation by conventional water

treatment, extensive research has focused on the optimization of treatment processes

and the application of new technologies. Enhanced coagulation through the use of higher doses of

coagulants can significantly improve the removal efficiency to 5.8 log units (Betancourt &

Rose, 2004). Edzwald et al. (2000) evaluated removals of Cryptosporidium by clarification

combined with dual media filtration under challenging conditions of high cyst and oocyst

levels. Combined clarification and filtration achieved average removals greater than

5 log, which were comparable to those achieved by sedimentation and filtration.

Diatomaceous earth filtration has been reported to be more effective than other conventional

filtration methods in removing Cryptosporidium oocysts, with up to 6-log removal (Ongerth and Hutton,

1997; Ongerth and Hutton, 2001). Although no inactivation of Cryptosporidium was

observed after 18 h of contact time with chlorine at high levels or with chloramines, chlorine

dioxide can inactivate about 90% of oocysts (Betancourt & Rose, 2004). Low doses of UV (1-9

mJ/cm²) have been observed to inactivate 2-4 log10 units of Cryptosporidium parvum

oocysts (Linden et al., 2002). Membranes such as microfiltration (MF) and

ultrafiltration (UF) membranes can provide complete removal of all protozoan oocysts of

concern. Jacangelo et al. (1995) demonstrated that various MF and UF membranes provide log

removals of Cryptosporidium parvum oocysts ranging from >4 log to 6 log units.

2.2 Predictive modelling for cyanobacteria and Cryptosporidium

2.2.1 Cyanobacteria prediction

As the occurrence of algal blooms results in water quality degradation and possible public

health risks, previous studies have investigated the water quality and meteorological factors

that could affect cyanobacterial proliferation, including water temperature, precipitation, flow, and nutrient concentration,

and have developed several frameworks to predict and forecast

cyanobacteria abundance and blooms with the aid of historical and existing data. However,

ecosystems are complex systems consisting of interlinked subsystems (Parrott & Kok,

2000). The complex processes involved in cyanobacterial blooms, such as nutrient loading,

transport and diffusion, and compounding effects from

weather events, can be challenging to model. The mathematical modelling approaches for microbes can be divided into

two major classes: physical models that simulate the dynamics of underlying processes, and

data-driven models that are constructed from empirical data and use extensive

monitoring data to make predictions and support decisions for future scenarios. For water quality

modelling, data-driven or statistical models provide a fast and low-cost approach, since the

recent growth of improved monitoring techniques such as wireless sensors has

significantly improved data availability.

Several data-driven models have been implemented to predict cyanobacteria occurrence or

abundance in source waters. Kim et al. (2020) presented a model to predict cyanobacteria

occurrence or absence in rivers using water temperature, velocity and phosphorus

concentration, which are readily available through direct measurements. Weighted function

models, including sigmoidal, linear, and exponential, were developed to predict

cyanobacteria occurrence conditional on the preceding state of cyanobacteria abundance.

This model was shown to achieve more than 75% accuracy through cross-validation. Zhao

et al. (2019) proposed a new cyanobacterial bloom occurrence prediction method to analyze

the probability and driving factors of the blooms. The dominant species were determined

through a dominant species identification model and the principal driving factors were

analyzed using canonical correspondence analysis (CCA). The probability of bloom was

calculated using the model and the critical control point of the probability of cyanobacterial

bloom was 0.75. Harris & Graham (2017) compared 12 linear and nonlinear regression

modeling techniques to predict cyanobacterial abundance and cyanobacterial toxins using a

14-year water quality data set. Support vector machine, random forest, boosted tree and

Cubist modeling techniques were reported as the most predictive approaches, and Cubist

modeling was the only approach that could predict maximal concentrations of cyanobacteria

abundance and geosmin.

Bayesian methods have also been used for cyanobacteria abundance prediction and

assessment of the relative importance of environmental factors on cyanobacteria growth.

Cha et al. (2017) developed a Bayesian hierarchical model to compare the relative

importance of predictors, including biological, environmental, meteorological and

hydrological factors, obtained from 16 sites in four major rivers in South Korea. Results

suggested that temperature and residence time, rather than nutrient levels, are the important

variables for cyanobacteria growth in summer across the sites (Cha et al., 2017).

Considering the demand for alerts of cyanobacterial blooms 10 to 30 days

ahead, Recknagel et al. (2017) developed a novel early warning scheme for cyanobacteria

abundance and cyanotoxins in drinking water reservoirs. The scheme uses ensembles of inferential

models developed by the hybrid evolutionary algorithm (HEA) solely using in-situ data. The

resulting models for cyanobacteria have been reported to be capable of forecasts up to 30

days.

Furthermore, Pyo et al. (2021) applied deep learning techniques and image processing methods

to remote sensing images of cyanobacteria. They

developed a convolutional neural network (CNN) model with a convolutional module

(CNNan) to predict cyanobacteria abundance using field monitoring data, hyperspectral

image sensing and simulated water quality from a hydrodynamic model. The prediction

performance of the CNNan model was better than that of the unmodified CNN model and the

environmental fluid dynamics code (EFDC) simulation. The results demonstrated that deep

learning models are feasible for predicting the presence of harmful algae in the water. Wang

et al. (2010) developed a hybrid model consisting of a back-propagation neural network and

a rough decision to predict blooms in Dianchi Lake, China. A predictive accuracy of 0.8 was

achieved in binary classification.

A common challenge with data-driven models is class imbalance, where the number

of instances in one or several groups (called the majority classes) severely exceeds the

number of instances in other groups (called the minority classes). Standard machine

learning classification algorithms are designed to maximize overall accuracy and therefore tend to

misclassify the minority class, which is often associated with serious consequences. In

cyanobacteria prediction, algorithms that do not account for imbalance will accurately

predict “absence” (majority class) at the expense of the predictability of “presence” (minority

class). Shi et al. (2021) explored oversampling

algorithms and ensemble classifiers to predict cyanobacteria events. The model was

developed using a variety of physicochemical and hydrometeorological factors as

predictors. Cyanobacteria abundance data were collected from 2013 to 2019 in major rivers

in South Korea and classified into binary classes. They showed that the imbalance ratio

adversely affected model performance and that resampling was effective

for addressing the class imbalance. The AdaBoost ensemble classifier yielded the most stable

performance, and temperature was the primary influencing factor of cyanobacteria

blooms. Kim et al. (2022) also used classification algorithms and oversampling

methods to resolve the problem of having an imbalanced dataset of cyanobacteria. Mixture

models such as the hurdle model have also been developed to handle imbalanced data and

predict cyanobacteria abundance. Cha et al. (2014) developed a hurdle Poisson model

to predict cyanobacteria abundance, allowing zero counts (absence) and nonzero counts to

be modelled using different models and environmental factors. The results suggest that low

temperature and low suspended solids (SS) can promote low cyanobacteria abundance. As

the model is fitted under a Bayesian framework, the alert levels were predicted along with

probabilities, which can inform management. Apart from the Poisson distribution

used in the study by Cha et al. (2014), the negative binomial distribution is another

commonly used model for count data. The negative binomial distribution uses an extra

parameter to accommodate overdispersion, which leads to broader applicability. To

understand the response of cyanobacteria to environmental changes such as climate

warming and nutrient enrichment, Richardson et al. (2019) used a zero-inflated model

along with linear mixed models to fit cyanobacteria biovolume data. The first process

(binomial distribution) in a zero-inflated model was used to model the effect of treatment

(temperature, nutrient treatments and extreme rainfall treatments) on the probability of

occurrence of cyanobacteria. The impact of treatment on the biovolume of different

cyanobacteria genera (non-zero data) was tested using linear mixed models. Commonly

used models for data with excess zeros include the zero-inflated model and the hurdle model. Zero-inflated

models assume zeros are generated in both processes, while in hurdle models, all zeros

arise from the first process. Both models should be validated and compared to provide a

reasonable explanation of the zero generation mechanism.
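
The difference between the two zero-generation mechanisms described above can be made concrete with a short Python sketch of the two probability mass functions, using a Poisson count process. This is an illustrative sketch rather than the formulation used in any of the cited studies; the parameter values (a zero probability of 0.4 and a Poisson mean of 3) are arbitrary assumptions.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Standard Poisson probability of observing count k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k: int, pi_zero: float, lam: float) -> float:
    """Zero-inflated Poisson: zeros come from BOTH the binary and the count process."""
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * poisson_pmf(0, lam)
    return (1.0 - pi_zero) * poisson_pmf(k, lam)


def hurdle_pmf(k: int, p_zero: float, lam: float) -> float:
    """Hurdle Poisson: ALL zeros come from the binary process; positive
    counts follow a zero-truncated Poisson."""
    if k == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(k, lam) / (1.0 - poisson_pmf(0, lam))


# Arbitrary illustrative parameters (not taken from any cited study).
PI, LAM = 0.4, 3.0
print("P(0) zero-inflated:", zip_pmf(0, PI, LAM))
print("P(0) hurdle:       ", hurdle_pmf(0, PI, LAM))
```

With the same binary-process parameter, the zero-inflated model assigns a larger probability to zero than the hurdle model, because the count process contributes additional zeros. Comparing the fitted versions of both models is one way to examine which mechanism better describes the data.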

2.2.2 Cryptosporidium prediction

As the identification of Cryptosporidium in source water is time- and labour-intensive, data-driven

models based on historical data have been used to predict Cryptosporidium oocyst

concentrations. Due to their prevalence and ease of enumeration, a suite of fecal indicator

bacteria or organisms has been used as indicators for the presence of Cryptosporidium

oocysts (Coffey et al., 2007). In many regions, E. coli is accepted as the best and most

affordable surrogate of contamination by Cryptosporidium (WHO, 2006). Under the USEPA

Long Term 2 Enhanced Surface Water Treatment Rule (LT2), Cryptosporidium monitoring is not required if

E. coli densities are low (<10 CFU 100 mL⁻¹ in lakes or reservoirs, or <50 CFU 100 mL⁻¹ in flowing streams).

As such, the majority of previous studies on

Cryptosporidium prediction have utilized indicator organisms. Agulló-Barceló et al. (2013)

reported that when E. coli and somatic coliphage data were used together, discriminant

analyses showed high accuracy in predicting infectious Cryptosporidium oocysts. However,

more studies have concluded that indicator bacteria alone cannot provide information (the

presence and/or concentrations) on the most important pathogens in surface waters (Pachepsky

et al., 2016; Francy et al., 2013; Costán-Longares et al., 2008). Also, Lalancette et al.

(2014) reported that the use of E. coli concentrations as a surrogate for Cryptosporidium

concentrations can result in an inaccurate estimate of Cryptosporidium risk for agriculture-impacted

drinking water intakes or for intakes with distant wastewater sources. More

recently, studies have focused on machine learning applications in Cryptosporidium

risk assessment. Ligda et al. (2020) investigated interactions between environmental

factors and Cryptosporidium oocyst concentrations, and applied machine learning models

and linear discriminant function analysis to predict the contamination intensity of

Cryptosporidium. An overall accuracy of approximately 75% was achieved for the

classification of four levels of Cryptosporidium concentrations.
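
The E. coli screening thresholds described earlier in this subsection can be written as a simple decision rule. The sketch below encodes only the two thresholds as summarised in the text; the names are hypothetical, and details such as the averaging period and which systems the screening applies to are not modelled here and should be taken from the rule itself.

```python
# E. coli trigger levels (CFU per 100 mL) as summarised in the text above.
ECOLI_TRIGGER_CFU_PER_100ML = {
    "lake_or_reservoir": 10.0,
    "flowing_stream": 50.0,
}


def crypto_monitoring_required(source_type: str, ecoli_cfu_per_100ml: float) -> bool:
    """Return True when the E. coli density reaches the trigger for the source type.

    Densities below the trigger are treated as "low", in which case
    Cryptosporidium monitoring is not required under this simplified rule.
    """
    trigger = ECOLI_TRIGGER_CFU_PER_100ML[source_type]
    return ecoli_cfu_per_100ml >= trigger


# The same E. coli density can lead to different outcomes by source type.
print(crypto_monitoring_required("lake_or_reservoir", 20.0))
print(crypto_monitoring_required("flowing_stream", 20.0))
```

Because the rule depends only on E. coli density, it inherits the limitations of indicator organisms discussed above: a low E. coli result does not by itself demonstrate that oocysts are absent.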

Although there is a lack of studies that focus on the direct prediction of

Cryptosporidium, previous studies have elucidated factors that drive the occurrence of

Cryptosporidium in water bodies. Most of these studies have used the Soil and Water

Assessment Tool (SWAT). Coffey et al. (2010) reported that SWAT can be used to

predict Cryptosporidium oocyst concentrations in source water. The results suggested that

the mean monthly prediction ranged from 0.004 oocysts L⁻¹ to 4.8 oocysts L⁻¹ in the west of

Ireland. Further model development using observed oocyst levels is required to

quantitatively assess model accuracy. Liu et al. (2019) studied the fate and transport

dynamics of Cryptosporidium using SWAT and predicted that the average annual concentration

of Cryptosporidium oocysts in the Daning River in China was 0.95 oocysts L⁻¹, but with high

spatial variability. The combined impact of rainfall and regional fertilization on the level of

Cryptosporidium was emphasized. Frey et al. (2013) used Classification and

Regression Tree Analysis (CART) to predict pathogen presence/absence for an agricultural

watershed using simulated streamflow, total suspended solids (TSS), total N and total P,

and fecal indicator bacteria loads. The model identified air temperature, precipitation,

streamflow, and total P as the most important variables for classifying pathogen

presence/absence, and indicated a close relationship between cattle pollution and pathogen

occurrence in the studied watershed.

2.3 Bayesian modelling in source water quality

Bayesian methods have become increasingly popular in recent years with the demand for

quantification of uncertainty. Bayesian inference has the advantage of combining prior

information on parameters with observations to provide improved model parameter

estimates and outputs with uncertainty. Bayesian methods have been applied in a range of

water quality models (Freni & Mannina, 2010). Two objectives are usually associated with

Bayesian modelling: (1) Present a predictive model of water quality using environmental

variables and interpret the results with uncertainty analysis. (2) Investigate the relationship

between water quality and these variables and identify the key factors through sensitivity

analysis and/or comparison between prior and posterior distributions of parameters.
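
The idea of combining prior information with observations can be illustrated with the simplest conjugate case, a Beta prior on the probability that a water sample is positive for a pathogen, updated with presence/absence counts. This is a minimal sketch under assumed values: the uniform Beta(1, 1) prior and the counts (3 positives out of 20 samples) are hypothetical and are not taken from any cited study.

```python
def beta_binomial_update(alpha_prior: float, beta_prior: float,
                         positives: int, negatives: int) -> tuple:
    """Posterior Beta parameters after observing presence/absence counts."""
    return alpha_prior + positives, beta_prior + negatives


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_variance(alpha: float, beta: float) -> float:
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1.0))


# Hypothetical data: 3 positive samples out of 20, with a uniform prior.
a_post, b_post = beta_binomial_update(1.0, 1.0, positives=3, negatives=17)
print("posterior mean:", round(beta_mean(a_post, b_post), 3))
print("prior variance:", round(beta_variance(1.0, 1.0), 4),
      "posterior variance:", round(beta_variance(a_post, b_post), 4))
```

Comparing the prior and posterior in this way shows how much the observations have narrowed the uncertainty in the parameter, which mirrors the prior-versus-posterior comparison mentioned under the second objective.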

Dilks et al. (1992) applied a Bayesian Monte Carlo technique to predict Grand River

dissolved oxygen with nine uncertain model parameters. As little prior information was

available, a uniform distribution was employed to initially describe each parameter. Results

indicate every parameter was significantly correlated to the ability of the model, and the

model predicted minimum dissolved oxygen concentration by 72% from 0.69 to 2.5 mg/L.

Malve et al. (2007) fitted 8 years of in situ observations of cyanobacteria with adaptive

Markov chain Monte Carlo (MCMC) methods to estimate model parameters. The model

indicated that, to satisfy the cyanobacteria criterion (concentration does

not exceed 0.86 mg/L) with 0.95 probability, the total phosphorus concentration should be between 45

μgC/L and 16 μg/L. Zooplankton grazing also has a major effect on cyanobacteria.

Gronewold et al. (2009) applied ordinary least squares (OLS) linear regression and

MCMC to calibrate a first-order bacterial decay model and an empirical bacterial die-off model.

Both models were validated by leave-one-out cross-validation and assessed by Bayesian

posterior predictive p-values. Results suggest that models without a bacterial kinetics

parameter related to the decay rate more appropriately reflected fecal indicator bacteria (FIB) fate and transport

processes. Zhao et al. (2014) developed a multi-pollution source water quality model

integrated with Bayesian statistics to support water quality management in the Songhua River

system in northeastern China. The model estimated the distribution of the decay rate (k)

which was considered a key factor. The distribution curves enabled assessing the influence

of each loading and designing water quality management strategies seasonally. Bayesian

hierarchical modelling is also widely utilized in water quality modelling. Liu et al. (2021) used

a hierarchical Bayesian model averaging framework to explore the relationship between

event-based water quality and environmental variables, including sediments, nutrients and

salinity to predict the water quality at multiple sites and identify key environmental drivers.

The study found that rainfall and runoff affected in-stream particulate constituents, while

wetness and vegetation cover impacted dissolved nutrient concentration and salinity.

MCMC methods can also be combined with other simulation-based methods. Yang et al.

(2016) integrated a genetic algorithm (GA) into a Bayesian approach to improve sampling performance during parameter estimation. The eutrophication model based on MCMC coupled with GA was applied to data from an urban lake in north China. Water quality assessment was conducted for eutrophication management. Results suggest that the MCMC-GA method achieved better convergence efficiency during sampling and narrower 95% credible intervals than the classic MCMC method. Rainfall runoff nutrient loading

was a key factor in eutrophication and should be controlled for lake restoration.

2.4 Risk assessment for waterborne bacteria and pathogen

In the field of health risk assessment, quantitative microbial risk assessment (QMRA) and

nonlinear intelligent models are common tools. Because it can account for uncertainty through Monte Carlo simulation, QMRA has been widely used for drinking water safety

management and risk assessment. QMRA is a mathematical approach used to quantify the

health risks from microorganisms in source water and can be used to support water safety

management decisions. The QMRA approach follows four steps: hazard identification, exposure

assessment, dose-response assessment, and risk characterization. Using the model, the

health risk can be quantified and compared with the risk that is agreed to be acceptable.

QMRA also allows the comparison of different scenarios and informs the required design of

the treatment to obtain a certain treatment level. QMRA usually assesses the most 'risky' exposure route, which is assumed to be oral ingestion, since exposure through inhalation or skin contact is unlikely and the data needed to estimate exposure through these routes are often unavailable.

Although the assessment was initially set up to evaluate the health risk of specific microbes,

it can also be used for chemical contaminants (WHO, 2021).

Probabilistic approaches are emerging as a practical complementary approach to

conducting QMRA. Compared to deterministic QMRA, which uses point estimates such as

arithmetic mean values for the input variables (Health Canada, 2018), probabilistic QMRA

models are superior since they can account for variability and uncertainty of the input

variables and parameters. Three basic approaches are currently being adopted in research

into probabilistic QMRA: Monte Carlo simulations, Bayesian networks, and Markov chain

Monte Carlo method.

The Monte Carlo method relies on repeated random sampling to generate

simulations. It uses randomness to solve deterministic problems (Metropolis & Ulam, 1949).

In the Monte Carlo approach to QMRA, all input variables (such as microbe concentration and drinking water intake) are described by appropriate probability distributions that quantify their uncertainty and are then passed to the exposure assessment model. The resulting exposure distribution is passed to the dose-response model to quantify the probability of infection and illness using a dose-response relationship and a morbidity rate, and the final output of risk characterization is a probability distribution of disability-adjusted life years (DALYs). By repeatedly sampling from the input distributions, the risk of illness in DALYs can be expressed as a distribution. Amha et al. (2015)

developed a probabilistic QMRA to determine the risk of Salmonella infections resulting from

the consumption of crops irrigated with treated wastewater. The probabilistic exposure

models for raw consumption of three vegetables (lettuce, cabbage and cucumber) irrigated

with treated wastewater were constructed, and the disease burden of Salmonella was

estimated using the Monte Carlo method. The results suggested a median disease burden above the acceptable level of 10⁻⁶ DALYs per person per year set by the World Health Organization (WHO). Consumption of lettuce irrigated with treated wastewater posed the highest risk of infection, while cucumber posed the lowest. Mok et al. (2014) constructed a probabilistic QMRA model to determine the health risks of norovirus infection from consumption of vegetables irrigated with wastewater in Shepparton, Australia. Annual norovirus disease burden was estimated for the

consumption of lettuce, broccoli, cabbage, cucumber and Asian vegetables through the

Monte Carlo simulation. The results indicate that wastewater treatment did not have sufficient removal efficiency to meet the WHO threshold of 10⁻⁶ DALYs per person per year, while additional disinfection treatments provided acceptable results. Barker et al. (2014) proposed a probabilistic QMRA using the Monte Carlo method to assess the risk of gastroenteritis caused by rotavirus, norovirus, and Ascaris lumbricoides associated with the consumption of street food salads. The results indicate that both the rotavirus-dominated and norovirus-dominated annual disease burdens exceeded 10⁻⁴ DALYs, and that significant interventions are needed to maintain the health and safety of street food in Kumasi. Apart from microbial risk estimation, probabilistic QMRA can also be used to estimate the health risk of systems and management solutions using reference pathogens as proxies. Bivins et al. (2017) proposed a probabilistic QMRA using Monte Carlo techniques to estimate the risk of waterborne infection when a population is exposed to intermittent water supply. Reference

pathogens including Campylobacter, Cryptosporidium, and rotavirus were used as

conservative risk proxies for infection via bacteria, protozoa, and viruses. Results suggested

that diarrheal disease burden associated with intermittent water supply likely exceeds the

WHO guideline for drinking water of 10⁻⁶ DALYs per person per year. Ishaq et al. (2022) estimated the disease burden of diarrhea from Campylobacter, Giardia, Cryptosporidium and norovirus with an integrated "Regression-QMRA method" that examines the relationship between pathogen concentration and environmental variables. The probability distribution of pathogen concentration was estimated by linear regression on water source, LID type, pathogen type and season, and 1000 simulated data points were generated for each pathogen. The results show that, after applying the methodology to a planned LID train, the predicted disease burden of diarrhea was highest for Campylobacter, followed by Giardia, Cryptosporidium, and norovirus.
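The Monte Carlo QMRA chain described at the start of this paragraph (input distributions, exposure, dose-response, risk characterization) can be illustrated with a minimal sketch. It is not taken from any of the cited studies: the exponential dose-response form, the lognormal input distributions, the assumption of an identical dose on every day of the year, and every numerical value (`r`, `p_ill_given_inf`, `daly_per_case`) are hypothetical placeholders chosen only to make the example run.

```python
import math
import random

def simulate_annual_daly(n_iter=10_000, seed=1):
    """Monte Carlo sketch of a QMRA chain; every parameter value is hypothetical."""
    rng = random.Random(seed)
    r = 0.05                # assumed exponential dose-response parameter
    p_ill_given_inf = 0.5   # assumed morbidity rate (illness given infection)
    daly_per_case = 1.5e-3  # assumed disease burden per case of illness
    results = []
    for _ in range(n_iter):
        # Exposure assessment: sample uncertain inputs from assumed distributions.
        conc = rng.lognormvariate(math.log(0.01), 1.0)   # organisms per litre
        intake = rng.lognormvariate(math.log(1.0), 0.3)  # litres per day
        dose = conc * intake                             # organisms per day
        # Dose-response: exponential model for the daily infection probability.
        p_inf_day = 1.0 - math.exp(-r * dose)
        # Risk characterization: annual infection risk (same dose assumed
        # every day), then illness and DALYs per person per year.
        p_inf_year = 1.0 - (1.0 - p_inf_day) ** 365
        results.append(p_inf_year * p_ill_given_inf * daly_per_case)
    return sorted(results)

def quantile(sorted_values, q):
    """Empirical quantile by nearest rank on a pre-sorted list."""
    idx = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[idx]

dalys = simulate_annual_daly()
median_daly = quantile(dalys, 0.5)
upper_daly = quantile(dalys, 0.95)
exceeds_benchmark = median_daly > 1e-6  # compare with the 10^-6 DALY benchmark
```

The output is a full distribution of DALYs per person per year, so the benchmark can be compared against the median, an upper percentile, or the proportion of iterations that exceed it, rather than against a single point estimate.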

Bayesian networks (BNs) are probabilistic graphical models (directed acyclic graphs)

representing complex relationships of multiple variables. Each node corresponds to a

random variable and each edge represents a conditional dependence between the corresponding

random variables. Bayesian networks are appropriate for nonlinear problems (Yang et al.,

2019). Since Bayesian networks can infer missing values, incorporate expert knowledge

and multi-source data, and address uncertainties with prediction, such models are widely

applied to quantitative risk assessment (Beaudequin et al., 2015; Jiang et al., 2021). Greiner et al. (2013) concluded that an entire QMRA model can be formulated as a Bayesian network, using the same equations as the Monte Carlo approach but implemented in a network that includes the joint distribution of all variables in the model. Previous studies have used BN-based QMRA to quantify risk. Jiang et al. (2021) investigated the

relationships between cyanobacterial blooms and multi-dimensional influencing variables.

An extended BN and an integrated framework were proposed to assess the risk of

cyanobacterial blooms. The model was used to evaluate the effect of global warming on this risk and reported an increase of 38.5% under global warming. Beaudequin et al. (2016)

presented a QMRA expressed as BN in the wastewater reuse context and evaluated the

risk of norovirus infection associated with wastewater-irrigated lettuce in a range of

exposure and risk mitigation scenarios. Orak et al. (2020) developed a hybrid BN

model for health risk assessment of arsenic contamination, and the results show that low

inorganic arsenic concentration increases the risk of low birth weight even for low

gestational age scenarios. Donald et al. (2009) presented a conceptual BN model to

illustrate the risks of gastroenteritis posed by the use of recycled water. The nodes were

quantified using an expert's opinion, and the model allows users to make various predictions

as to the risks posed under various scenarios.
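To make the idea of nodes, edges and evidence propagation concrete, the following is a minimal sketch of a discrete Bayesian network with exact inference by enumeration. The three-node chain (`HeavyRain -> Contaminated -> Illness`) and all conditional probabilities are hypothetical and are not drawn from any of the studies cited above.

```python
from itertools import product

# Hypothetical conditional probability tables for a three-node chain:
# HeavyRain -> Contaminated -> Illness. All numbers are illustrative only.
P_RAIN = {True: 0.2, False: 0.8}
P_CONTAM_GIVEN_RAIN = {True: 0.6, False: 0.1}   # P(Contaminated=True | rain)
P_ILL_GIVEN_CONTAM = {True: 0.3, False: 0.02}   # P(Illness=True | contaminated)

def joint(rain, contam, ill):
    """Joint probability from the chain-rule factorization of the DAG."""
    p = P_RAIN[rain]
    p_c = P_CONTAM_GIVEN_RAIN[rain]
    p *= p_c if contam else 1.0 - p_c
    p_i = P_ILL_GIVEN_CONTAM[contam]
    p *= p_i if ill else 1.0 - p_i
    return p

def posterior(query, evidence):
    """P(query=True | evidence) by enumerating all states of the network."""
    names = ("rain", "contam", "ill")
    num = den = 0.0
    for state in product((True, False), repeat=3):
        assign = dict(zip(names, state))
        if any(assign[k] != v for k, v in evidence.items()):
            continue
        p = joint(**assign)
        den += p
        if assign[query]:
            num += p
    return num / den

prior_rain = posterior("rain", {})
ill_given_rain = posterior("ill", {"rain": True})    # forward (predictive) query
rain_given_ill = posterior("rain", {"ill": True})    # backward (diagnostic) query
```

The same function answers both predictive and diagnostic queries, which illustrates how evidence entered at any node of a BN updates beliefs at every other node without simulation.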

Although both Monte Carlo simulation and BNs present predictions as probability

distributions, these methods work from fixed estimates for means, variances and other

parameters (Donald et al., 2011). The uncertainty from parameter estimation can be incorporated into the risk assessment by adopting a Bayesian approach with Markov chain Monte Carlo methods. Donald et al. (2011) studied the incorporation of parameter uncertainties into the probabilistic QMRA model. The study illustrated that simultaneous estimation of parameters is a better methodology than 'plugging in' point estimates of parameters in a Monte Carlo simulation. Parsons et al. (2005) compared BN and Markov chain Monte Carlo (MCMC) approaches to QMRA modelling of Salmonella spp.

Although the BN model requires variables to be discrete, which may introduce error, it

responds immediately to changes under scenario analysis since it does not use simulation

and can propagate information from any point in the network to all others by Bayesian

inference. The MCMC approach sacrifices the ability to propagate evidence but does not require

discrete variables and offers greater flexibility.

Chapter 3: Predicting Cyanobacteria Abundance with Bayesian Zero-inflated Negative

Binomial Models

3.1 Introduction

Cyanobacteria are photosynthetic microorganisms that can result in degraded freshwater

quality and threaten human health. Cyanobacterial blooms can significantly increase

turbidity, result in dissolved oxygen depletion due to biological degradation of cyanobacteria

biomass, and produce unpleasant taste and odour compounds (Huisman et al., 2018).

Furthermore, some species can release toxins such as microcystins, nodularins,

cylindrospermopsin, anatoxins, and saxitoxins (Catherine et al., 2013). It has previously

been observed that there is a significant positive relation between non-alcoholic liver

disease and large-scale blooms associated with toxin release (harmful algal blooms or

CyanoHABs) (Zhang et al., 2015). Furthermore, associations between drinking surface

water from cyanobacteria contaminated water bodies and a higher incidence of colorectal

cancer have been noted (Lee et al., 2017). CyanoHABs also pose severe problems for

ecological systems. Even low concentrations of microcystin-LR (5 μg/L) and microcystins (50 μg/L) have been found to impact fish growth and survival rates. At high concentrations

of microcystins (> 10 mg/L), morphological effects on fish have been observed (Oberemm et

al., 1999). In addition, the accumulation of microcystins and cyanotoxins through the food

web is a threat to human health (Bownik, 2016). Based on analysis of lake sediments over

the last 200 years, data shows that cyanobacteria have increased significantly, with the

most rapid growth in blooms occurring from 1945 until the present (Taranu et al., 2015).

CyanoHABs are caused or promoted by a combination of environmental factors, with

strong associations with several anthropogenic and natural processes. Agricultural activities

can increase nitrogen and phosphorus input into the water system, promoting cyanobacteria

growth (O'Neil et al., 2012). Climate change impacts are also likely to increase the
occurrence of blooms in the future (Chapra et al., 2017). Higher water temperatures

stimulate the growth of cyanobacteria, since their optimal growth rate is often reached at

temperatures above 25°C (Thomas & Litchman, 2016). Cyanobacteria are carbon-fixing

bacteria that rely on a CO2 concentrating mechanism, and therefore rising concentrations of

CO2 in the atmosphere and water bodies may also promote blooms (Verspagen et al.,

2014). Elevated pH is also known to reduce the energy cost of the CO2 concentrating

process, with higher efficiencies observed in acidic environments (Mangan et al., 2016).

Cyanobacteria bloom density is usually counted with a mechanical or electronic

counter using an inverted microscope following sedimentation in a chamber or filtration

(Chorus & Welker, 2021). Cell counting is a labour intensive, time consuming, and

expensive method that limits the extent and frequency of monitoring campaigns. As such,

there is a need for methods that can enumerate or estimate cyanobacterial levels rapidly

and preferably without the need for sampling. Several studies have developed models to fit

count data and make predictions of day-to-day counts based on easy to measure

parameters. Dzialowski et al. (2009) attempted to build a linear regression model for

predicting the cyanobacteria abundance and toxins in five reservoirs in Kansas, USA.

However, their results suggest simple linear models could not accurately predict

cyanobacteria counts (Dzialowski et al., 2009). Pyo et al. (2020) applied a convolutional neural network to the output of a spatial fluid dynamics model of cyanobacteria abundance, which achieved good short-term prediction of Microcystis. Zhao et al. (2019)

put forward a species identification model and analyzed the dominant species using

canonical correspondence analysis (CCA). The model was used to identify major driving

factors, including water temperature, pH, total phosphorus, ammonia nitrogen, chemical

oxygen demand and dissolved oxygen, and predict the risk of algal blooms. Harris & Graham (2017) developed 12 linear and non-linear models to predict cyanobacteria

abundance, microcystin and geosmin in a reservoir. Support vector machines, random

forests, boosted trees, and cubist modelling approaches were observed to have the best

performance. However, all models underestimated cyanobacteria abundance, and none of

the models predicted peak bloom events or the highest counts.

A common challenge with modelling cyanobacteria abundance is the innate

imbalance in monitoring datasets. A large excess of zero counts is typical and may result from either failure to detect cyanobacteria or an actual absence of

cyanobacteria. Poisson and negative binomial distributions are commonly used for

modelling count data, but they cannot account for the information contained in the excess

proportion of zeros. Two types of mixture model have been proposed to better account for high numbers of zero counts: zero-inflated models and hurdle models. Zero-inflated models assume that zeros arise from two sources: a Bernoulli process that produces a zero with probability 𝑃, and a negative binomial (or Poisson) count distribution, selected with probability 1 − 𝑃, which can itself produce zeros (Lambert, 1992). In hurdle models, the zeros are generated by a Bernoulli distribution and the non-zero values by a zero-truncated

negative binomial (or Poisson) distribution (Min & Agresti, 2005). Both hurdle and zero-

inflated models have been used in environmental and ecological fields of study. Wenger &

Freeman (2008) showed improved fit of zero-inflated models to duck species abundance

and stream fish abundance. Cha et al. (2014) developed a Bayesian hurdle Poisson model

for predicting cyanobacteria abundance in Lake Paldang, Korea. Richardson et al. (2019) used a zero-inflated model along with linear mixed models to fit cyanobacteria biovolume data. Hegg et al. (2022) also used a zero-inflated generalized linear mixed model of cyanobacteria abundance to investigate the effects of toxin-producing cyanobacteria on the fitness of water fleas (Daphnia) in eutrophic lakes. Marion et al. (2017) constructed a multivariate zero-inflated beta regression model to assess the relationships between the proportion of county area experiencing a cyanobacteria bloom, county land cover types, and nutrient loading. Salmaso et al. (2015) used a zero-inflated negative binomial model to analyze the relationships between environmental variables and cyanobacteria abundance and tested it against a zero-inflated Poisson model based on the Akaike information criterion (AIC) and a likelihood ratio test. The hurdle model and the zero-inflated model provide two plausible explanations for the zero counts in cyanobacteria abundance data. However, the goodness of fit of the two models to cyanobacteria abundance data has not been compared, with studies adopting either the zero-inflated model (Salmaso et al., 2015; Marion et al., 2017; Richardson et al., 2019; Hegg et al., 2022) or the hurdle model (Cha et al., 2014). Furthermore, although previous studies have analyzed cyanobacteria abundance data within the Bayesian framework (Cha et al., 2014), Bayesian variable selection methods such as projection predictive inference have not been widely used. Previous studies relied on traditional variable selection methods, such as principal component analysis and stepwise regression (Salmaso et al., 2015; Cha et al., 2014). While the general factors that cause cyanobacteria blooms have been well investigated (Salmaso et al., 2015), methods for selecting the site-specific factors that influence cyanobacteria abundance at an individual site have not been developed, which can make accurate prediction challenging.

This study presents a Bayesian approach to fit cyanobacteria data with a negative

binomial model, zero-inflated negative binomial model, and hurdle negative binomial model

to address challenges with inflated zero counts. It is hypothesized that through the novel

use of zero-inflated models for this application the elevated zero counts inherent in the

majority of cyanobacteria abundance data can be accounted for and model fit will be

improved. Additionally, a Bayesian framework was used to present abundance predictions

as distributions rather than point estimates, allowing for a more direct interpretation of

uncertainty. Through these two key aspects of the presented models, the aim is to improve

the integrability of models in water management by accounting for expected data

distributions and emphasizing the need for knowledge of uncertainty in predictions of

environmental systems. The fit of each model is compared to select an optimal model.

Predictions from the optimal model are then classified according to the Australian Management Strategies for Cyanobacteria (Newcombe et al., 2009) to assess the capability of the presented approach to identify cyanobacteria levels used in water management. Integrating the selected model into a Bayesian framework and using the predictive distribution of each prediction, obtained from MCMC sampling, to assign the prediction to predefined categories is novel and can achieve higher accuracy than the regression method. The established model was also used to assess the importance and impact of environmental variables on the probability of cyanobacteria blooms. State-of-the-art projection predictive variable selection for generalized linear models, which has shown superior performance to competing variable selection methods (Catalina et al., 2020), is employed in this study, and the selected model is validated through posterior predictive checks, which are useful tools for inspecting discrepancies between real and predicted/simulated data. This overall workflow has not previously been used to address water quality issues, and the developed framework is applicable to a wide range of problems involving classification of imbalanced data in environmental and ecological fields.

3.2 Methodology

3.2.1 Study site and data source

Data used in this study was collected from a eutrophic reservoir, Cheney Reservoir (37°45′35″N, 97°50′06″W), the main water supply for Wichita, Kansas, USA (Christensen et al., 2006).

The data was obtained from the United States Geological Survey (USGS) (US Geological

Survey, 2015). The reservoir has experienced frequent cyanobacterial blooms, presence of

microcystin, and taste-and-odor problems. In part, this could be due to the shallow depth

(average depth=6.1 m) and persistent winds that cause maximal turbulence and a resulting

turbid environment. Among the 185 samples in the dataset, 34 samples indicate zero counts

of cyanobacteria (18.4% of the data). The site was sampled over 14 years, from 2002 to 2015, with annual sampling frequency ranging from twice a year to 24 times a year. The

number of samples in each year is shown in Figure 3-1(a). Although samples with cell counts < 1,000 cells/mL make up the majority class of the data, the highest value, recorded during a cyanobacterial bloom, is 129,836 cells/mL. The frequency distribution of cyanobacteria abundance is shown in Figure 3-1(b). Imbalanced datasets like this, with a wide data range, are challenging to model and often result in poor model performance.

Figure 3-1 (a) Bar plots of sampling frequency in each year from 2002 to 2015; (b)

Histograms of cyanobacteria abundance.

In order to identify trends or patterns of cyanobacteria abundance over time, the time series of cyanobacteria abundance from 2002 to 2015 is presented in Figure 3-2(a). Although repetitive fluctuations over the years could be leveraged in predicting future blooms, the cyanobacteria abundance lacked any clear recurring pattern. It is worth noting that bloom magnitudes in Cheney Reservoir displayed an increasing trend from 2004 to 2006 and a slightly decreasing trend from 2006 to 2013. This finding is consistent with the harmful algal bloom at Cheney Reservoir Dam in September 2006 reported by the city of Wichita. In 2006, the city of Wichita upgraded to ozone treatment to control the effects of such events (Oneby et al., 2006). To gain more in-depth insight into historical fluctuations, the autocorrelation function (ACF) was used to quantify the similarity between observations in the time series as a function of time lag (Box et al., 2015). The resulting ACF is illustrated in Figure 3-2(b). As can be observed from the graph, a relatively strong autocorrelation (ACF = 0.5) was observed for cyanobacteria abundance at a lag of one month, suggesting that present values of abundance are related to values in the previous month. Although Li et al. (2010) indicated an apparent seasonal variation of cyanobacteria, no considerable seasonal pattern was observed for cyanobacteria abundance in this study (ACF < 0.5). However, cyanobacteria abundance usually peaks in the fall, as presented in Figure S4 in the appendix. A possible explanation is that in-lake interventions such as ozone treatment to manage harmful cyanobacterial blooms after 2006 have diminished the seasonal pattern of cyanobacteria abundance.
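For reference, the sample ACF at lag k is the lag-k autocovariance divided by the lag-0 autocovariance. The sketch below shows this calculation on a synthetic, regularly spaced monthly series with a 12-month cycle. The series is hypothetical and is not the Cheney Reservoir data, which were sampled at irregular intervals and would first need to be aggregated to a regular monthly step.

```python
import math

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (standard biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squared deviations
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]

# Hypothetical monthly series with a 12-month cycle, for illustration only.
monthly = [100 + 50 * math.sin(2 * math.pi * m / 12) for m in range(48)]
rho = acf(monthly, 12)
```

For a strongly seasonal series such as this one, the ACF is strongly negative at a lag of 6 months and strongly positive at a lag of 12 months. The absence of such peaks is what would indicate a weak seasonal pattern.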

Figure 3-2 (a) Time series of cyanobacteria abundance in Cheney Reservoir; (b) Autocorrelation function (ACF) of cyanobacteria abundance based on monthly time lag.

The dataset that included water quality variables in Cheney reservoir, Kansas was

also obtained from the United States Geological Survey (USGS) (US Geological Survey,

2015). Precipitation, solar radiation and wind speed at Cheney Reservoir were obtained from the NASA POWER project. Meteorological data and environmental data along with cyanobacteria

abundance data were merged according to the same sampling dates. The original data set

included 9 variables: temperature, pH, total phosphorous, total nitrogen, Chlorophyll a (Chl

a), all sky insolation incident on a horizontal surface (i.e. solar radiation), wind, turbidity, and

precipitation (Table 3-1). All measurements were daily average values for the same

sampling day as the cyanobacteria counts. Among the potential drivers of harmful cyanobacterial blooms, eutrophication, which refers to the enrichment of a water body with minerals and nutrients, particularly nitrogen and phosphorus, can significantly stimulate the occurrence of harmful algal blooms by causing a shift in the phytoplankton community towards cyanobacteria dominance (O'Neil et al., 2012). Increased temperature can stimulate cyanobacteria growth both directly and indirectly. Cyanobacteria favor high temperatures, with a higher optimum growth temperature than other groups of algae (Lürling et al., 2013).

The indirect effects of temperature include the intensified thermal stratification due to

increased temperature. Cyanobacteria can take advantage of the stratification by regulating

their buoyancy by forming gas vesicles and accumulating dense blooms at the surface

(Paerl & Huisman, 2009). Wind has also been identified as a contributor to cyanobacteria growth. Below a critical wind speed of 2-3 m/s, wind-generated turbulence is hypothesized to be incapable of mixing floating cells away from the surface into deeper layers, which leads to the accumulation of cyanobacteria at the surface (Wang et al., 2016). Extreme precipitation may wash sediments and nutrients from the land surface into the reservoir (Woods et al., 2017) and also leads to increased water column mixing and weakened vertical stratification,

which have been identified as contributors to cyanobacteria blooms (Reichwaldt &

Ghadouani, 2012). Although Chl a and turbidity are not the cause of cyanobacterial blooms,

they are parameters that are expected to be correlated with the presence of cyanobacteria,

and therefore can be used as predictors. To identify the underlying linear relationships between variables, pairwise correlations were calculated, and a correlation heatmap is presented in Figure 3-3.

Table 3-1 Selected variables used to build initial models.

Variable                                              Abbreviation      Units
Total Phosphorus                                      TP                mg/L as P
pH                                                    pH                NA
Temperature                                           Temp              °C
Chlorophyll a                                         Chl a             μg/L
All Sky Insolation Incident on a Horizontal Surface   Solar radiation   Wh/m²
Wind                                                  Wind              m/s
Total Nitrogen                                        TN                mg/L as N
Precipitation                                         Precipitation     mm
Turbidity                                             Turb              FNU

Figure 3-3 Correlation heatmap of water quality and weather parameters.

As the heatmap indicates, the linear dependency between each parameter and cyanobacteria abundance is weak (R < 0.3). It therefore appears that a linear relationship cannot capture the complicated relationship between cyanobacteria abundance, weather and water quality parameters. Chl a, temperature, precipitation and pH were positively correlated with cyanobacteria levels. The correlations between solar radiation and temperature (0.67) and between total phosphorus and turbidity (0.66) are stronger than those of other pairs, which is consistent with previous studies (Villa et al., 2019). However, as the correlation

coefficients among these predictors are below 0.7, multicollinearity is not a significant

problem (Dormann et al., 2013). Prior to modelling, variables were further selected by

projection predictive inference, a Bayesian approach for model selection and decision

making.

3.2.2 Mixture Models for Zero-inflated Count Data

3.2.2.1 Zero-Inflated Negative Binomial Model

The Zero-Inflated Negative Binomial (ZINB) model (Lambert, 1992) is a mixture model

consisting of a Bernoulli distribution and an untruncated negative binomial distribution. In a ZINB model, zero counts of cyanobacteria are generated by two processes: the first, a Bernoulli process, accounts for the absence or presence of cyanobacteria, and the second, a negative binomial distribution, generates counts of cyanobacteria abundance, which can include zeros. By combining these two processes, the ZINB model accounts for both true zero counts of cyanobacteria and zeros arising from measurement error. For cyanobacteria abundance $Y_{\mathrm{cyano}}$, the ZINB model can be written as:

$$
p(Y_{\mathrm{cyano}} = n) =
\begin{cases}
\pi + (1 - \pi)\, f(0), & \text{if } n = 0 \\
(1 - \pi)\, f(n), & \text{if } n > 0
\end{cases}
$$

where $\pi$ is the parameter denoting the probability of a zero from the binomial (Bernoulli) process, and $f(n)$ is the probability mass function of the negative binomial distribution.
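As a numerical check of the ZINB definition, the sketch below evaluates the probability mass function directly. The mean/dispersion parameterization of the negative binomial (`mu`, `phi`) is an assumption made for this example, since the text does not state which parameterization was used, and the parameter values are hypothetical.

```python
import math

def nb_pmf(n, mu, phi):
    """Negative binomial pmf f(n) with mean mu and dispersion phi (assumed form)."""
    log_p = (math.lgamma(n + phi) - math.lgamma(phi) - math.lgamma(n + 1)
             + phi * math.log(phi / (mu + phi))
             + n * math.log(mu / (mu + phi)))
    return math.exp(log_p)

def zinb_pmf(n, pi, mu, phi):
    """ZINB pmf: zero from the Bernoulli part with probability pi, else an NB count."""
    if n == 0:
        return pi + (1.0 - pi) * nb_pmf(0, mu, phi)
    return (1.0 - pi) * nb_pmf(n, mu, phi)

# Hypothetical parameter values, chosen only for illustration.
pi, mu, phi = 0.15, 8.0, 0.5
p_zero = zinb_pmf(0, pi, mu, phi)
total = sum(zinb_pmf(n, pi, mu, phi) for n in range(2000))  # should be close to 1
```

Note that the probability of observing a zero is larger than `pi`, because the negative binomial component contributes zeros of its own.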

3.2.2.2 Hurdle Negative Binomial Model

In a hurdle model, two parts control the generation of cyanobacteria counts (Welsh et al., 1996). The first part determines the presence or absence of cyanobacteria, which is typically accomplished through logistic modelling. The second part, a zero-truncated negative binomial model, models the non-zero counts of cyanobacteria abundance. The hurdle NB model can be written as:

$$
p(Y_{\mathrm{cyano}} = n) =
\begin{cases}
\pi, & \text{if } n = 0 \\
(1 - \pi)\, \dfrac{f(n)}{1 - f(0)}, & \text{if } n > 0
\end{cases}
$$
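The hurdle definition can be checked numerically in the same way. In this sketch the negative binomial is again written in an assumed mean/dispersion form (`mu`, `phi`), and the parameter values are hypothetical; `pi` is set to 0.184 only because that is the share of zero counts reported for the dataset, not because it is a fitted value.

```python
import math

def nb_pmf(n, mu, phi):
    """Negative binomial pmf f(n) with mean mu and dispersion phi (assumed form)."""
    log_p = (math.lgamma(n + phi) - math.lgamma(phi) - math.lgamma(n + 1)
             + phi * math.log(phi / (mu + phi))
             + n * math.log(mu / (mu + phi)))
    return math.exp(log_p)

def hurdle_nb_pmf(n, pi, mu, phi):
    """Hurdle NB pmf: zero with probability pi, else a zero-truncated NB count."""
    if n == 0:
        return pi
    return (1.0 - pi) * nb_pmf(n, mu, phi) / (1.0 - nb_pmf(0, mu, phi))

# Hypothetical parameter values, chosen only for illustration.
pi, mu, phi = 0.184, 8.0, 0.5
p_zero = hurdle_nb_pmf(0, pi, mu, phi)
total = sum(hurdle_nb_pmf(n, pi, mu, phi) for n in range(2000))  # close to 1
```

In contrast to the zero-inflated form, the probability of a zero here equals `pi` exactly, because the count component is truncated at zero and cannot generate zeros.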

3.2.3 Bayesian approach

The Bayesian framework is an approach to modelling data and estimating parameters based on Bayes' theorem:

$$
P(\theta \mid X) = \frac{P(X \mid \theta)\, P(\theta)}{P(X)}
$$

In a Bayesian approach, the parameter estimation workflow consists of three steps. First, a prior distribution 𝑃(𝜃) is set for the parameters, including the coefficients of pH, Chl a, TN, TP, turbidity, precipitation, solar radiation and temperature in the linear

regression. The prior distribution is determined based on available experience and

knowledge. Second, the likelihood 𝑃(𝑋|𝜃) of observed cyanobacteria abundance data is

calculated using the parameters 𝜃. Finally, the likelihood and prior are combined to

determine the posterior distribution 𝑃(𝜃|𝑋) of the parameters in the linear regression,

reflecting an updated representation of knowledge (van de Schoot et al., 2021).

Once the posterior distribution of the parameters is determined, samples can be drawn from it. However, the posterior distribution is high-dimensional and usually does not take a standard form, making exact inference intractable (Bishop, 2006). Markov chain Monte Carlo (MCMC) methods, such as the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), are therefore used to generate random samples from the target distribution.
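The following is a minimal sketch of a random-walk Metropolis sampler, the simplest member of the Metropolis-Hastings family. It is an illustration only: the one-parameter model (a normal mean with known standard deviation and a wide normal prior), the data values, the step size and the burn-in fraction are all assumptions, chosen so that the result can be checked against the known analytic posterior mean.

```python
import math
import random

def log_posterior(theta, data, prior_sd=10.0, sigma=1.0):
    """Unnormalized log posterior: Normal(0, prior_sd) prior, Normal(theta, sigma) likelihood."""
    log_prior = -0.5 * (theta / prior_sd) ** 2
    log_lik = sum(-0.5 * ((x - theta) / sigma) ** 2 for x in data)
    return log_prior + log_lik

def metropolis(data, n_samples=20_000, step=0.5, seed=42):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, data)
    draws = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio); P(X) cancels out.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        draws.append(theta)
    return draws[n_samples // 4:]  # discard the first quarter as burn-in

# Hypothetical observations, for illustration only.
data = [2.1, 1.7, 2.6, 1.9, 2.4, 2.0, 1.5, 2.8, 2.2, 1.8]
draws = metropolis(data)
posterior_mean = sum(draws) / len(draws)
# Conjugate result for this toy model: sum(x) / (n + sigma^2 / prior_sd^2).
analytic_mean = sum(data) / (len(data) + 1.0 / 10.0 ** 2)
```

Because the acceptance step uses only the ratio of posterior densities, the normalizing constant 𝑃(𝑋) never has to be computed, which is what makes MCMC practical for posteriors without a closed form.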

Stan is a probabilistic programming language for Bayesian statistical inference

written in C++. It provides the No-U-Turn Sampler (NUTS) to draw samples from the user-specified posterior distribution (Carpenter et al., 2017). In this study, the R package rstan, which provides an R interface to Stan, was used. Through rstan, we implemented mixture models, namely zero-inflated and hurdle models, for discrete distributions.

Convergence of MCMC chains can be diagnosed with trace plots and the Gelman-Rubin diagnostic $\hat{R}$ (Brooks & Gelman, 1998). Trace plots are helpful for identifying the burn-in period and the convergence of Markov chains. The Gelman-Rubin statistic compares the within-chain and between-chain variation to assess the difference between multiple Markov chains. $\hat{R} = 1$ indicates good convergence. In practice, a 0.975 quantile of $\hat{R}$ at or below 1.2 is taken to denote convergence.
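A minimal sketch of the basic Gelman-Rubin calculation is given below, using the textbook point estimate built from the within-chain variance W and the between-chain variance B. It is not the Stan implementation, and it does not compute the 0.975 quantile version mentioned above. The chains are simulated, hypothetical draws.

```python
import random

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for equal-length chains."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * w + b / n  # pooled estimate of posterior variance
    return (var_hat / w) ** 0.5

rng = random.Random(0)
# Four hypothetical chains sampling the same distribution (well mixed).
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Same chains, but with the last one offset to mimic a chain stuck elsewhere.
stuck = mixed[:3] + [[x + 5.0 for x in mixed[3]]]
rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
```

When all chains explore the same distribution the statistic is close to 1, whereas a single chain centred elsewhere inflates the between-chain variance and pushes the statistic well above the 1.2 threshold.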

3.2.4 Model development, selection and validation

A summary of the fitting and testing process is presented in Figure 3-4. Initially, the

cyanobacteria abundance data was split into training and test sets, where the test set was

only used to assess predictive performance. A 5-fold cross-validation with stratified random

sampling was used to prevent an imbalance between training and test data and to reduce

the randomness in results. In our data, the common attribute is zero or non-zero

cyanobacteria count (18.4% of data was zero counts). As such, the data were stratified into

two subgroups: zero and non-zero. In each subgroup, the data were randomly split into five

equal folds, and then one fold from each subgroup was combined to form training sets with an

equal proportion of zero and non-zero cyanobacteria counts. The test set contained 44

samples, and the training set contained 141 samples.
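
A minimal sketch of the stratified split described above is given below. The synthetic counts (185 samples, 34 of them zero) and the function name are assumptions for illustration only; the resulting fold sizes follow from an even five-way split and are not intended to reproduce the exact training and test set sizes reported in this study.

```python
import random

def stratified_folds(counts, k=5, seed=0):
    """Return k lists of indices. Zero and non-zero counts are shuffled
    separately and dealt round-robin, so each fold keeps a similar zero share."""
    rng = random.Random(seed)
    zero_idx = [i for i, c in enumerate(counts) if c == 0]
    nonzero_idx = [i for i, c in enumerate(counts) if c != 0]
    folds = [[] for _ in range(k)]
    for group in (zero_idx, nonzero_idx):
        rng.shuffle(group)
        for position, index in enumerate(group):
            folds[position % k].append(index)
    return folds

# Hypothetical data: 185 samples, 34 of them zero (about 18.4%).
counts = [0] * 34 + [100 + i for i in range(151)]
folds = stratified_folds(counts)
for held_out in folds:
    train = [i for f in folds if f is not held_out for i in f]
    zero_share = sum(counts[i] == 0 for i in held_out) / len(held_out)
    print(len(train), len(held_out), round(zero_share, 2))
```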

Figure 3-4 Flowchart of modelling and application on cyanobacteria abundance prediction.

ZINB: zero-inflated negative binomial model; Hurdle NB: hurdle negative binomial model; NB: negative

binomial model. LOO-CV: leave-one-out cross-validation; PPC: posterior predictive check.

After splitting the data into training/test sets, the most representative variables were

selected by projection predictive inference. The selected variables were then used to build a

Bayesian negative binomial model, a Bayesian zero-inflated negative binomial model, and a Bayesian hurdle

negative binomial model. Model comparison and selection were performed using leave-one-out

cross-validation (LOO-CV), and the best model was validated with posterior predictive

checks.

When using generalized linear models to solve classification problems (e.g., binary and

multinomial logistic regression), a threshold is commonly chosen as the decision rule. For

example, in binary logistic regression, it is general practice to choose 0.5 as the threshold,

but different thresholds can be manually selected for specific situations. If high

discriminative accuracy is required for positive cases, a larger threshold can be chosen (Kuk

et al., 2014). Traditional multinomial logistic regression is subject to large bias when dealing

with imbalanced data and does not take the distribution of the data into account. Thus, in order

to make the result more indicative, the probability distribution of the prediction points was

approximated by the density distribution obtained by MCMC sampling, and we assigned the

predictions to alert levels according to the management strategies for cyanobacteria by

Water Quality Research Australia (WQRA) (Table 3-2). This framework is based on the

standards that outline when potential toxin release may occur. At the low alert level, health

authorities may decide to issue a health warning or notice for water consumption. Higher alert

levels represent situations where cyanobacteria may cause adverse

health effects if treatment is unavailable or ineffective (Newcombe et al., 2009). The

categorization process is analogous to assigning the predicted class according to the posterior

distribution and a probability threshold set in advance. Finally, the fitted model

was applied to our test set to generate predictions and classify the results.

Table 3-2 Alert levels for management of toxic cyanobacteria (WQRA)

Alert Level   Definition (cells/mL)     Description
Safe          < 500                     Safe for drinking water
Low           ≥ 500 and < 2,000         Detected at low levels
Medium        ≥ 2,000 and < 6,500       Potential toxin at 1/3 to 1/2 of guideline concentration
High          ≥ 6,500 and < 65,000      Potential toxin greater than guideline concentration
Very High     ≥ 65,000                  Potential toxin 10 × greater than guideline concentration
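
The categorization step can be sketched as below: each posterior predictive draw is binned into a WQRA alert level using the thresholds of Table 3-2, and the predicted level is the one holding the largest share of draws. The ten draws and all function names are hypothetical and serve only to show the mechanics.

```python
from collections import Counter

# Lower bounds (cells/mL) of the WQRA alert levels in Table 3-2.
ALERT_LEVELS = [
    (65000, "Very High"),
    (6500, "High"),
    (2000, "Medium"),
    (500, "Low"),
    (0, "Safe"),
]

def alert_level(cells_per_ml):
    """Map a cell count to its alert level."""
    for lower_bound, name in ALERT_LEVELS:
        if cells_per_ml >= lower_bound:
            return name
    raise ValueError("cell count must be non-negative")

def class_probabilities(draws):
    """Share of posterior predictive draws falling in each alert level."""
    tally = Counter(alert_level(d) for d in draws)
    return {name: tally[name] / len(draws) for _, name in reversed(ALERT_LEVELS)}

def predicted_level(draws):
    """Most probable alert level (mode of the binned draws)."""
    probs = class_probabilities(draws)
    return max(probs, key=probs.get)

# Hypothetical posterior predictive draws for one sample (cells/mL).
draws = [0, 120, 450, 800, 2500, 3000, 3100, 4200, 7000, 70000]
probs = class_probabilities(draws)
print(probs)
print(predicted_level(draws))
```

Keeping the full set of class probabilities, rather than only the most probable level, preserves the uncertainty information that a single point estimate would discard.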

3.2.4.1 Projection predictive inference

Projection predictive inference (Piironen et al., 2020) is a Bayesian variable selection

method. Variable selection was carried out using the projpred package in R. Initially, a

model with all nine environmental predictor variables for cyanobacteria abundance was

fitted and considered as the reference model. Submodels were then fitted, initially with one

variable, and then with sequentially more variables. A model with the smallest subset

of variables with an approximately similar fit to a full model was selected. In the forward

search process, where variables are sequentially added, each step determines the variable

that would result in the largest decrease in discrepancy between the reference model and

the sub-model. The submodels were compared with the reference model by cross-validation

prediction accuracy using leave-one-out cross-validation (LOO-CV) to prevent overfitting.

3.2.4.2 Leave-one-out cross-validation

Several measures have been developed to compare the fit of different models, such as

Akaike information criterion (AIC), Bayesian information criterion (BIC), deviance information

criterion (DIC), Watanabe–Akaike information criterion (WAIC), and LOO-CV. To measure

the wider applicability of a statistical model, out-of-sample data are commonly used to

evaluate its predictive power. However, additional data are usually not available. As such, we

estimated the predictive capability of the model using the expected log pointwise predictive

density (𝑒𝑙𝑝𝑝𝑑) with a penalty term (Gelman et al., 2013).

$$\mathrm{elppd} = \sum_{i=1}^{n} E_f\left(\log\left[p(y_{new} \mid y)\right]\right)$$

Measures including AIC, BIC, DIC, and WAIC methods utilize all data to determine fit

and therefore can be biased in assessment. Therefore, a LOO-CV (loo package in R)

approach was used in order to determine model fit based on out-of-sample data. This can

be extremely computationally expensive, especially if the data set is large. However, with

fewer than 200 cyanobacteria abundance samples, computation time is negligible. In

LOO-CV, a single sample from the data set is removed to test the model and the remaining

samples are used to train the model. The process is repeated 𝑛 times (where 𝑛 is the size of

the cyanobacteria data set) so that each sample is considered.

From each iteration, the log predictive density (𝑙𝑝𝑑) is evaluated by:

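$$\mathrm{lpd} = \log\left[p(y_i \mid y_{-i})\right]$$

where $y_i$ denotes the $i$th data point, and $y_{-i}$ denotes the remaining data. After 𝑛 iterations, the

elppd can be estimated by:

$$\widehat{\mathrm{elppd}} = \sum_{i=1}^{n} \log\left[p(y_i \mid y_{-i})\right]$$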
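
The leave-one-out sum can be illustrated with a model simple enough that each held-out predictive density is available in closed form. The sketch below uses a Poisson likelihood with a Gamma prior as a stand-in (not the models fitted in this study), for which the posterior predictive is negative binomial. The prior values, the data, and the function names are assumptions.

```python
import math

def log_post_pred(y_new, y_rest, shape=1.0, rate=0.1):
    """Exact log p(y_new | y_rest) for a Poisson likelihood with a
    Gamma(shape, rate) prior on the mean (a negative binomial predictive)."""
    a = shape + sum(y_rest)
    b = rate + len(y_rest)
    return (math.lgamma(y_new + a) - math.lgamma(a) - math.lgamma(y_new + 1)
            + a * math.log(b / (b + 1.0)) - y_new * math.log(b + 1.0))

def loo_elpd(y):
    """Sum of leave-one-out log predictive densities and its standard error."""
    n = len(y)
    # Hold out each point in turn and score it under the remaining data.
    pointwise = [log_post_pred(y[i], y[:i] + y[i + 1:]) for i in range(n)]
    total = sum(pointwise)
    mean = total / n
    variance = sum((lp - mean) ** 2 for lp in pointwise) / (n - 1)
    return total, math.sqrt(n * variance), pointwise

counts = [3, 0, 5, 2, 4, 1, 7, 3, 2, 6]  # hypothetical count data
elpd, se, pointwise = loo_elpd(counts)
print(round(elpd, 2), round(se, 2))
```

For models without a closed-form predictive density, each term would instead be approximated from posterior draws, which is the role of the loo package in this study.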

3.2.4.3 Posterior Predictive Checks

A posterior predictive check is a classical approach that compares a test statistic 𝑇(𝑦)

(an arbitrary function of the data) of the actual observed data with that of data generated from the

model with parameters sampled from the posterior distribution (Berkhof et al.,

2000).

The posterior predictive distribution can be written as:

$$\Pr(y_{rep} \mid y_{obs}) = \int P(y_{rep} \mid \theta)\, P(\theta \mid y_{obs})\, d\theta$$

where $y_{rep}$ denotes the replicated cyanobacteria data, and $y_{obs}$ denotes the

observed cyanobacteria data.

The principle behind posterior predictive checks (PPCs) is that if a model provides a

good fit to the data, the generated data would have a similar pattern (test statistics) to the

observed data. The Bayesian p-value (posterior predictive p-value) is a quantitative measure of the

goodness of fit. This p-value-like measure represents the probability that the test statistic

(such as mean, maximum, minimum, zero proportion) in the replicated (or predicted new

observations) data set exceeds that in the original data (or new observations).

$$\Pr\left(T(y_{rep}) \ge T(y_{obs}) \mid y_{obs}\right) = \int I\left(T(y_{rep}) \ge T(y_{obs})\right) p(y_{rep} \mid y_{obs})\, dy_{rep}$$

If the model provides a good fit, the Bayesian p-value should be around 0.5. A value close

to 0 or 1 indicates that the model is a poor fit (Meng, 1994).
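
In practice the integral is approximated by the share of replicated data sets whose test statistic is at least as large as the observed one. A minimal sketch with zero proportion as the test statistic follows; the observed vector and the four replicated data sets are hypothetical.

```python
def zero_proportion(values):
    """Test statistic T(y): share of zero counts in a data set."""
    return sum(v == 0 for v in values) / len(values)

def bayesian_p_value(y_obs, y_rep_draws, statistic):
    """Share of replicated data sets with T(y_rep) >= T(y_obs)."""
    t_obs = statistic(y_obs)
    exceed = sum(statistic(y_rep) >= t_obs for y_rep in y_rep_draws)
    return exceed / len(y_rep_draws)

# Hypothetical observed counts (three zeros out of ten).
y_obs = [0, 0, 3, 5, 1, 0, 8, 2, 4, 6]
# Hypothetical replicated data sets, one per posterior draw.
y_rep_draws = [
    [0, 2, 3, 5, 1, 1, 8, 2, 4, 6],  # one zero
    [0, 0, 3, 5, 1, 0, 7, 2, 4, 6],  # three zeros
    [0, 0, 0, 5, 1, 0, 8, 2, 4, 6],  # four zeros
    [0, 2, 3, 0, 1, 1, 8, 2, 4, 6],  # two zeros
]
p_value = bayesian_p_value(y_obs, y_rep_draws, zero_proportion)
print(p_value)
```

Any other statistic, such as the mean or maximum, can be passed in place of `zero_proportion` without changing the rest of the calculation.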

For each simulation (𝑠 = 1, … , 2000) of parameters from the posterior distribution, a

185-dimensional vector of predicted outcomes of cyanobacteria abundance is obtained.

Thus, the result is a 2000 × 185 matrix of predicted outcomes from all simulations.

For the posterior predictive checks, both the training data and the test data were passed to

the model, yielding replicated cyanobacteria abundance data 𝑦𝑟𝑒𝑝 for the training set and predicted

cyanobacteria abundance data 𝑦𝑝𝑟𝑒 for the test set. The test statistics of the replicated

data 𝑦𝑟𝑒𝑝 and of the predicted data 𝑦𝑝𝑟𝑒 were then compared with the test statistics of

the corresponding actual observed values, 𝑦𝑜𝑏𝑠 and 𝑦𝑛𝑒𝑤, respectively.

The bayesplot package in R was used for plotting posterior predictive checks.

3.3 Results and discussion

3.3.1 Variable selection

Initially, all nine variables were used to develop a generalized linear model to serve as a

reference model. In each step, one additional variable was included (starting with an

intercept-only model), and the elpd and RMSE of each model were calculated. The order in which

variables were added is based on maximizing fit and is therefore indicative of variable

importance. The selection order decided by the algorithm was Chl a, temperature, turbidity,

total phosphorus, solar radiation, wind, pH, total nitrogen, and precipitation (Figure 3-5).

Figure 3-5 Model elpd and RMSE from LOO-CV plotted as a function of stepwise addition of

variables.

Figure 3-5 shows that the first five variables were sufficient predictors as they result

in a similar elpd and RMSE to the reference model (the final point includes all variables).

The selected variables are consistent with prior knowledge of factors that can be used to

determine cyanobacteria counts. Chl a is produced by cyanobacteria and is, therefore, a

strong indicator of abundance. Temperature promotes cyanobacterial growth (Thomas &

Litchman, 2016) and is expected to be a significant driver of blooms. Nutrients (nitrogen and

phosphorus) stimulate the growth of cyanobacteria (O’Neil et al., 2012). However, it is

worth noting that only total phosphorus was identified as a variable of importance and total

nitrogen had no impact on model fit. Previous studies indicate that the optimal mass-based

ratio of total nitrogen to total phosphorus is 16:1 (Davidson et al., 2012). The variable

selection indicates the reservoir was phosphorus-dominated, and the nitrogen concentration

was either sufficient or very stable. However, the average mass-based ratio of total nitrogen

to total phosphorus was 11:1 in the reservoir, suggesting that nitrogen should be limiting.

Precipitation and wind were not considered significant, and it is hypothesized that changes

in turbidity better represent the effects of precipitation and wind.

3.3.2 Model selection

Weakly informative priors were used for parameters, and four Markov chains were run for

each model for 1,000 iterations, discarding the first 500 iterations as a burn-in process.

Figure S1-3 in supplementary materials presents trace plots for parameters in NB, ZINB and

hurdle NB models. The overlapping of different chains indicates convergence. Furthermore,

parameters from all three models have 𝑅̂ < 1.003, further suggesting convergence of each

chain (Brooks & Gelman, 1998).

After confirming the convergence of all MCMC chains, LOO-CV was applied to

assess the strength of each modelling approach. Assessment of model strength was based

on both elpd and standard error (SE) (Table 3-3). The difference in estimated elpd relative to the

model with the largest estimated elpd (i.e. the ZINB model) can be used to consider the magnitude of

difference between models. The significance of observed differences in elpd was

determined by calculating z-scores and corresponding p-values of paired comparisons

(Lambert, 2018). Results indicate zero-augmented models (zero-inflated and hurdle models)

were better than a negative binomial model (p = 0.002); however, the performance of ZINB

and hurdle NB were comparable (p = 0.14).

Table 3-3 LOO-CV results to compare strength of model fits. Differences in elpd and

standard error (SE) were calculated using the highest performing model (ZINB).

Model        elpd difference    SE difference
ZINB         0.0                0.0
Hurdle NB    -1.2               2.7
NB           -25.3              8.9
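
One common way to convert an elpd difference and its standard error into a z-score and p-value is sketched below. It assumes the difference is approximately normally distributed and uses a one-sided test; both are assumptions, and the exact procedure used for the paired comparisons in this study may differ. The function name is hypothetical.

```python
import math

def elpd_difference_test(elpd_diff, se_diff):
    """z-score and one-sided p-value for an elpd difference, assuming the
    difference is approximately normal."""
    z = abs(elpd_diff) / se_diff
    # Upper-tail probability of a standard normal at z.
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_one_sided

# NB versus ZINB, using the values in Table 3-3.
z_nb, p_nb = elpd_difference_test(-25.3, 8.9)
print(round(z_nb, 2), round(p_nb, 3))
```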

While the fits of the ZINB and hurdle NB models were comparable, it should be considered

that the mechanism of zero generation is different between them. In a zero-inflated model,

zero counts may come from two sources: (a) the cell number is too low to be detected by the

enumeration method used, (b) the cell number was truly zero. Zero counts are assumed

only to be caused by cell numbers below the detection limit in a hurdle model.

Cyanobacteria are likely not present in the reservoir at some times, and therefore, the ZINB

was chosen as the best model based on goodness of fit and the ability to consider true zero

counts.
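
The difference between the two zero-generating mechanisms is visible in the probability mass functions. The sketch below uses the mean and dispersion parameterization of the negative binomial; the parameter values and function names are hypothetical and are not the fitted values from this study.

```python
import math

def nb_pmf(k, mu, phi):
    """Negative binomial pmf with mean mu and dispersion phi."""
    log_p = (math.lgamma(k + phi) - math.lgamma(phi) - math.lgamma(k + 1)
             + phi * math.log(phi / (mu + phi)) + k * math.log(mu / (mu + phi)))
    return math.exp(log_p)

def zinb_pmf(k, pi, mu, phi):
    """Zero-inflated NB: a zero is either an extra zero (probability pi)
    or a zero produced by the NB count process itself."""
    base = (1.0 - pi) * nb_pmf(k, mu, phi)
    return pi + base if k == 0 else base

def hurdle_nb_pmf(k, pi, mu, phi):
    """Hurdle NB: all zeros come from a single process (probability pi);
    positive counts follow a zero-truncated NB."""
    if k == 0:
        return pi
    return (1.0 - pi) * nb_pmf(k, mu, phi) / (1.0 - nb_pmf(0, mu, phi))

pi, mu, phi = 0.15, 4.0, 0.8  # hypothetical parameter values
print(zinb_pmf(0, pi, mu, phi), hurdle_nb_pmf(0, pi, mu, phi))
```

With the same parameter values, the zero-inflated model assigns more probability to zero than the hurdle model because zeros can arise from both components of the mixture.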

3.3.3 Model checking

Posterior predictive checks (PPC) are used to evaluate if the model fit is reasonable and

identify potential differences between observed data and the fitted model. PPCs were

initially run using the training set of data. Zero proportion was chosen as a test statistic for 𝑦

and 𝑦𝑟𝑒𝑝, representing the proportion of zero values in the real observed data and the

replicated data (predicted data for the same data set), and the Bayesian p-value was calculated.

Bayesian p-values in this context indicate the probability that replicated data are not more

extreme than the observed distribution (Gelman, 2005). A Bayesian p-value close to 0.5

indicates a good fit, values approaching 0 indicate lack of fit, and values close to 1 indicate

overfitting (Korner-Nievergelt et al., 2015).

The top left and right panels of Figure 3-6 show the density plot of the original

training data (dark blue) and the density plot of the replicated data (light blue). The

overlap between the distribution of observations and the distributions of replications indicates

a good model fit. However, Figure 3-6 (top right) shows that the model tends to

underestimate the zero proportion. The computed Bayesian p-value is 0.2, indicating that

the model tends to underestimate the zero proportions. It is possible that the zero-inflated

generalized linear models still cannot account for all zeros in the data due to not capturing

non-linear relationships between cyanobacteria and predictor variables.

A 5-fold PPC cross-validation was applied to evaluate the model using held-out

samples. Posterior predictive checks were repeated for each validation set, and

the test statistics for 𝑦𝑛𝑒𝑤 and 𝑦𝑝𝑟𝑒 were compared. The Bayesian p-values of the five validation sets were

0.43, 0.56, 0.32, 0.42, 0.53 with an average of 0.45. One validation set is shown as an

example in Figure 3-6 (Bayesian p-value = 0.32). The predicted 𝑦𝑝𝑟𝑒 and the actual new

observations 𝑦𝑛𝑒𝑤 overlap (Figure 3-6, bottom left), although there is a slight

underestimation of zero proportion (Figure 3-6, bottom right). The difference in estimated

zero proportions and p-values is likely due to the varying proportion of zero counts in each

of the five validation sets. The Bayesian p-values below 0.5 for the replicated data (0.2) and,

on average, for the new data (0.45) suggest that the linear model may be inadequate for the cyanobacteria growth

model. Considering non-linear models, such as the dynamic phytoplankton model proposed

by Malve et al. (2007), would add complexity but may also improve model fit.

Figure 3-6 Top left: Kernel density estimate of observations in the training set 𝑦 (dark line)

and replications 𝑦𝑟𝑒𝑝 (light line). Top right: Zero proportion as test statistics 𝑇(𝑦). Dark line is

the zero proportion of observations in the training set. Light lines are the distribution of zero

proportions of replicated data. Bottom left: Kernel density estimate of new observations 𝑦𝑛𝑒𝑤

(dark line) and predictions 𝑦𝑝𝑟𝑒 (light lines). Bottom right: Zero proportion as test statistics

𝑇(𝑦). Dark line is the zero proportion of new observations. Light lines are the distribution of

zero proportions of predicted data.

3.3.4 Cyanobacteria alert level prediction

Predictions are produced by first sampling regression parameters from their respective

distributions, followed by calculating cyanobacteria counts. Since 4 MCMC chains of 1,000

iterations were generated, and the first 500 iterations of each chain (burn-in) were

discarded, the number of replicates for each prediction was 2,000. The advantage of

Bayesian models is that instead of predicting a single value, the model presents a predictive

probability distribution based on MCMC iterations. For example, the predictive distributions

based on MCMC results of two data points in the test set are shown in Figure 3-7.

Figure 3-7 Predictive density plot of two new observations. The red line indicates the true

observed values, and the density is determined based on 2,000 MCMC replicates.

From Figure 3-7, it can be observed that even when the peak of the predictive density

does not fall precisely on the true observed value, the maximum predictive density lies

close to the true value and the overall predictive density shifts toward it. It is also of

note that despite density being highest immediately adjacent to the true observations, there is

a non-zero probability of elevated cyanobacteria abundance. The Bayesian modelling

approach allows for direct interpretation of this uncertainty and allows the uncertain nature of

factors influencing cyanobacterial population dynamics to carry through to predictions.

Predictive density was used to categorize predictions according to WQRA alert

levels. By taking probability density in bins rather than point estimates, the high levels of

uncertainty were accounted for in both the impacts of influencing factors and how to

interpret risk from cyanobacteria abundance. Not all species will release toxins (Lee et al.,

2015) and environmental conditions such as temperature impact toxin release (Walls et al.,

2018). As such, management of surface waters is often in response to categorized levels of

cell counts or other water quality parameters (Ibelings et al., 2014).

The predicted class was determined by the mode or most common predicted class

based on probability density. The accuracies in each fold were 0.50, 0.32, 0.36, 0.32, 0.45.

The overall confusion matrix for multiclass prediction (all WQRA alert levels) is shown in

Table 3-4a. The average accuracy was found to be 0.40, generally indicating poor

performance. In particular, it was noted that the model performed poorly in predicting low or

medium alert levels and predictions of safe levels dominated. The dominant safe level

probabilities are evident from the figure inset on Table 3-4a.

Based on poor performance with narrow alert level bands, and generally better

separation of ‘high’ vs ‘safe’ levels, classification was reduced to a binary

decision of potential toxin presence or not. The threshold was set to 1,000 cells/mL,

corresponding to the middle of the low alert level in WQRA and associated with a level

where toxin release may be possible. For this binary decision, the precision and recall were

found to be 0.62 and 0.99, respectively. Cohen’s kappa, a statistic that measures

interrater reliability of binary classifiers (McHugh, 2012), is 0.8, suggesting an almost perfect

agreement. As such, at a coarser level the model performance improved, and the model has

potential for distinguishing conditions that could result in toxin presence (Table 3-4b). In

particular, the binary decision approach rarely under-predicted alerts (one false negative), and

performance was high for correctly predicting counts greater than 1,000 cells/mL.

Table 3-4 a) Confusion matrix for all WQRA levels along with figure depicting probability of

each class for a given prediction, and b) reduced confusion matrix for binary decisions > or

< 1,000 cells/mL.

a) All WQRA levels (rows: actual; columns: predicted)

Actual       Safe   Low   Medium   High   Very high
Safe         21     1     24       14     3
Low          9      0     9        14     0
Medium       5      1     12       20     2
High         1      0     4        40     1
Very high    0      0     0        4      0

b) Binary decision (rows: actual; columns: predicted)

Actual                                          Safe (< 1,000 cells/mL)   Potential toxin presence (>= 1,000 cells/mL)
Safe (< 1,000 cells/mL)                         14                        65
Potential toxin presence (>= 1,000 cells/mL)    1                         105
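
The summary metrics can be recomputed from the tabulated counts as sketched below, with rows as actual classes and columns as predicted classes. Because the counts are pooled over folds, the pooled accuracy may differ slightly from an average of per-fold accuracies. The function names are hypothetical.

```python
# Rows are actual classes and columns are predicted classes (Table 3-4).
multiclass = [
    [21, 1, 24, 14, 3],
    [9, 0, 9, 14, 0],
    [5, 1, 12, 20, 2],
    [1, 0, 4, 40, 1],
    [0, 0, 0, 4, 0],
]
binary = [
    [14, 65],
    [1, 105],
]

def accuracy(matrix):
    """Share of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def precision_recall(matrix, positive=1):
    """Precision and recall for the class at index `positive`."""
    true_pos = matrix[positive][positive]
    predicted_pos = sum(row[positive] for row in matrix)
    actual_pos = sum(matrix[positive])
    return true_pos / predicted_pos, true_pos / actual_pos

pooled_accuracy = accuracy(multiclass)
precision, recall = precision_recall(binary)
print(round(pooled_accuracy, 2), round(precision, 2), round(recall, 2))
```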

3.3.5 Influence of weather and water quality factors on cyanobacteria counts

The influence of various factors can also be observed from the kernel density estimates of the

posterior distributions of the variable-specific coefficients (Figure 3-8). Chl a was found to

have the largest positive coefficient, indicating a strong positive relationship with

cyanobacteria counts. This was expected since cyanobacteria will produce Chl a, and this

measure is often used as a surrogate for cell counts (Chaffin et al., 2018). The temperature

coefficient is distributed above zero, implying a positive impact on the probability of a bloom.

A positive relationship with temperature was anticipated based on a significant amount

of literature highlighting increased growth with increasing temperature (Thomas & Litchman,

2016; Rousso et al., 2020).

The coefficient of solar radiation is mainly distributed below zero, indicating a

negative correlation with cyanobacteria levels. A negative relationship between radiation

and cell counts could be explained by photobleaching of pigments in cyanobacteria, such as

phycobiliproteins (Sinha et al., 2005) or by relative competitive advantages of cyanobacteria

compared to other algal taxa under limited light conditions (LeBlanc Renaud et al., 2011).

Long-term exposure to increasing light intensity and UV-B light in particular has resulted in

decreased Chl a content and decreased cyanobacteria population (Cirés et al., 2011; Xue et

al., 2005). At high radiation levels (340 μE m−2 s−1), the cyanobacteria growth rate was

previously found to be 30% lower than at moderate radiation (60 μE m−2 s−1) or low radiation

levels (Cirés et al., 2011). However, it should be noted that radiation intensity and

temperature are strongly correlated, and increasing solar radiation was expected to result in

increased cyanobacteria levels due to a corresponding increase in temperature (Jöhnk et

al., 2008).

The turbidity coefficient was distributed on both sides of zero, indicating the

possibility of either positive or negative correlations with cyanobacteria abundance. Turbidity

is a general measure and does not distinguish types of matter, including no distinction

between cyanobacteria and non-algal matter that would contribute to turbidity. Previously,

cyanobacteria abundance of Kansas reservoirs was reported to be negatively correlated to

non-algal turbidity (Dzialowski et al., 2011). As the non-algal turbidity increases, light

penetration is reduced, and less cyanobacteria biomass is expected. Alternatively,

cyanobacteria presence would lead to a measured increase in turbidity (Klemer and

Konopka, 1989). As such, the role of turbidity cannot be easily identified, and the parameter

distribution appears to represent the uncertain relationship between turbidity and

cyanobacteria counts accurately.

Figure 3-8 Kernel density estimate of posterior distributions for parameters based on MCMC

sampling with median and 80% intervals.

The coefficient distribution for total phosphorus was primarily distributed below zero,

implying a negative correlation with cyanobacteria abundance. This result is counter to the

expectation of phosphorus levels being positively associated with cell counts, given the

substantial amount of evidence that nutrient reduction strategies reduce blooms (Hamilton

et al., 2016). It should be considered that there were relatively elevated levels of

phosphorus in the reservoir (mean value of 0.1 mg/L), and nutrients may generally not

have been a limiting factor for growth in this system. The recommended limit of total

phosphorus in lakes is 0.05 mg/L (Litke, 1999), and 92% of the recorded phosphorus

levels in this dataset would imply the reservoir being studied is eutrophic or hypertrophic

(Carlson and Simpson, 1996). Relatively flat biomass responses with increasing

phosphorus above a limiting threshold have also been previously reported (Dolman et al.,

2012).

3.4 Summary

Bayesian mixture models were applied to model cyanobacteria abundance in a reservoir,

with particular consideration for the tendency for cyanobacteria abundance to be highly

imbalanced with a high proportion of zero values. Two models that can account for the high

proportion of zero measurements, a ZINB and a hurdle NB, were compared. An

NB model was also applied to act as a baseline approach that does not account for excess

zero counts.

Based on fit determined from leave-one-out cross-validation, it was found that the

ZINB and hurdle NB models performed significantly better than the NB model. The observed

improvement of fit when using models that account for excessive zero counts supports the

hypothesis that inflated zero counts are important to consider when modelling cyanobacteria

abundance. Furthermore, a slight increase of fit was observed when using ZINB compared

to the hurdle NB approach. ZINB models can account for zero measurements being present

either from the cell number being below detection limits, or from the true absence of

cyanobacteria. As such, the improvement of fit using ZINB illustrates that both mechanisms

of zero generation should be considered when modelling cyanobacteria.

The ZINB model was then applied to predict cyanobacteria levels using a separate

test set. Although the performance was poor when predicting narrow alert level bands,

precision and recall were high (0.62 and 0.99, respectively) for binary prediction of elevated

vs. low risk levels of cyanobacteria. The established model utilizes a limited number of easy-

to-measure parameters including Chl a, total phosphorus, turbidity, temperature, and solar

radiation to generate these predictions. Furthermore, the predictions produced from the

Bayesian approach utilized in this paper are probabilistic. The uncertainty from the data and

interactions in the system are carried through the modelling process to produce an

estimated cell count with associated level of uncertainty. The high uncertainty levels in

parameter estimates demonstrate that cyanobacteria count prediction is difficult, and the

impact of influencing factors is complex. As such, the presented modelling process is

believed well suited to inform the management of complex systems with high uncertainty.

Chapter 4: A probabilistic approach to evaluating Cryptosporidium health risk in

drinking water

4.1 Introduction

The protozoan Cryptosporidium is an important chlorine-resistant pathogen that commonly

drives public health risk associated with drinking water treatment and delivery (Efstratiou et

al., 2017). Outbreaks of Cryptosporidium can impact a large proportion of the population in a

short time frame due to its persistence in aquatic environments and high probability of

infection at low doses (Swaffer et al., 2018; Desai et al., 2012). The reported incidence of

Cryptosporidium has increased since 2004 in the United States, with most cases occurring

during the summer and among children (Yoder & Beach, 2010). Cryptosporidiosis has been

increasingly identified as an important cause of morbidity and mortality in the world

(Checkley et al., 2015), particularly for the immunocompromised. In two of the documented

waterborne outbreaks, Milwaukee and Las Vegas, mortality rates in the

immunocompromised ranged from 52% to 68% (Rose, 1997).

The reservoirs of Cryptosporidium include humans, cattle and other mammalian

species (Thomson et al., 2017). Cryptosporidium can be found in soil, water, food, or

surfaces that have been contaminated with feces from the hosts. Cryptosporidium can

enter source waters such as lakes and rivers through sewage overflow, storm water

runoff, agricultural runoff and wildlife.

The routes of exposure include ingestion of contaminated recreational or drinking

water, ingestion of contaminated food, exposure to infected animals, and close contact with

others with cryptosporidiosis (Yoder & Beach, 2010). Humans can be infected with

Cryptosporidium through various transmission routes, including the faecal-oral route (person-

to-person transmission and zoonotic transmission) (Ng et al., 2012), and ingestion of

contaminated foods (foodborne transmission) (Ryan et al., 2018) and water (waterborne

transmission) (Xiao & Feng, 2017). After the individual ingests the protozoan oocysts, the

infection begins by releasing sporozoites that invade the mucosa to establish endogenous

autoinfection (Gerace et al., 2019).

Continued monitoring and improvement of drinking water treatment and untreated

water contact control have a pivotal role in cryptosporidiosis prevention and control. As

Cryptosporidium oocysts occur in low numbers in water, and in vitro culture techniques that

augment parasites numbers for identification are not available, it is necessary to concentrate

oocysts to identify them effectively and accurately (Smith et al., 2010). Besides

morphological identification of oocysts by microscopy, the most common methods for detection

and enumeration include concentrating and staining of fecal smears, immunological-based

methods, and molecular techniques (Ahmed & Karanis, 2018). In drinking water,

concentration through methods such as continuous flow centrifugation and membrane

filtration is most commonly practiced. Molecular methods such as PCR tests can be applied to both clinical

and environmental specimens. Although PCR tests are rapid, highly sensitive and accurate,

false-positive rates can be high due to detection of non-viable microorganisms and

laboratory contamination (Checkley et al., 2015).

Due to the expense and labour intensity of detection methods, information on source

water concentrations is severely limited and routine monitoring of Cryptosporidium is not

practiced (Efstratiou et al., 2017). As evidenced by the reduction of sporadic cases of

cryptosporidiosis when water treatment is improved, levels of Cryptosporidium in source

waters and infection rates are likely underestimated due to monitoring issues. As such,

there is a need for a cheaper and easier method to predict the presence of

Cryptosporidium on a day-to-day basis. In a recent study (Ligda et al., 2020), a machine

learning type risk assessment model was developed to predict Cryptosporidium with

meteorological and physicochemical predictors. The model achieved an overall accuracy of

75% in four-level classification of Cryptosporidium concentrations.

Although there are limited studies regarding Cryptosporidium prediction, previous

studies have attempted to identify factors that drive occurrence of Cryptosporidium in water

bodies. A common factor in many historical outbreaks is a preceding extreme rainfall event

(Hrudey & Hrudey, 2004; Sylvestre et al., 2021). Liu et al. (2019) investigated the fate and

transport dynamics of Cryptosporidium in the Daning River, China using the soil and water

assessment tool (SWAT) and reported a combined impact of rainfall and regional

fertilization on the level of Cryptosporidium. Furthermore, Coffey et al. (2010) found that

fertilizer usage significantly impacts the occurrence of Cryptosporidium in a watershed

in Ireland. Xiao et al. (2013) developed a quantitative microbial risk assessment model for

Cryptosporidium and reported a strong relationship between positive samples for

Cryptosporidium and flooding frequency.

Quantitative microbial risk assessment (QMRA) is a mathematical modeling

approach to estimating the health risk related to environmental exposure of microorganisms

(Haas et al., 2014) and has increasingly become a standard for assessing pathogen risk.

QMRA provides a detailed and flexible method for estimating risk and disease burden that

can support risk-based management decisions (Hunter et al., 2011). Previous work has

utilized QMRA to quantify Cryptosporidium risk based on estimated concentrations rather

than direct observation of oocyst concentrations. Hunter et al. (2011) applied QMRA using

Cryptosporidium concentrations estimated from a regression model based on E. coli

concentrations. The analysis indicated a major risk of Cryptosporidium infection among

English and French populations that consume tap water from very small drinking water

supplies.

Probabilistic models are useful tools that take into account the impact of random

events or actions in predicting potential outcomes. While deterministic models give point estimates, probabilistic models give a probability distribution as the estimate. They are highly

applicable to modelling environmental systems since outputs are easily generated with


incomplete data, and predictions are probabilistic and therefore provide a measure of uncertainty

(Aguilera et al., 2011; Bertone et al., 2016). Probabilistic QMRA is emerging as a valuable

technique in microbial risk assessment. Amha et al. (2015) have developed a probabilistic

QMRA to determine the risk of Salmonella infections resulting from consumption of crops

irrigated with treated wastewater. Mok et al. (2014) have constructed a probabilistic QMRA model to determine the health risks of norovirus infection from consumption of vegetables irrigated with wastewater. Barker et al. (2014) have proposed a probabilistic QMRA using the Monte Carlo method to assess the risk of gastroenteritis illness caused by rotavirus,

norovirus, and Ascaris lumbricoides associated with the consumption of street food salads.

Probabilistic QMRA using Monte Carlo simulation can also be used to estimate the health

risk of systems and management strategies. Bivins et al. (2017) have proposed a

probabilistic QMRA to estimate the risk of waterborne infection when the population is exposed to an intermittent water supply, using reference pathogens including Campylobacter, Cryptosporidium, and rotavirus. Bayesian networks are probabilistic graphical models that represent complex relationships among multiple variables and have also been

widely used in probabilistic QMRA. Beaudequin et al. (2015) evaluated the capabilities and

challenges of adopting BN models to QMRA, highlighting the opportunity to use BNs for

scenario assessment and identifying nodes or factors with the most influence on the risk

outcomes. Beaudequin et al. (2016) later presented a QMRA in a wastewater reuse context and examined the risk of norovirus infection associated with wastewater-irrigated lettuce. Zhiteneva et al. (2021) constructed a probabilistic QMRA utilizing a BN to

examine a non-membrane based indirect potable reuse (IPR) system. The critical control

points of norovirus, Campylobacter and Cryptosporidium were determined through

sensitivity analysis, scenario analysis, and backwards inferencing. Although BNs have great

potential for broad use in QMRA, the drawbacks of BNs include the difficulties with eliciting

conditional probabilities and information loss caused when categorizing the input variables


and risk outcomes. The uncertainty from parameter estimation can be incorporated into the risk assessment using the Markov chain Monte Carlo (MCMC) method. Donald et al. (2011) incorporated parameter uncertainties into a probabilistic QMRA model for Salmonella. The MCMC method can also be used to estimate pathogen

concentrations (Bouwknegt et al., 2014; Masciopinto et al., 2020) and unknown quantities

such as the parameters of distributions and the model (Rigaux Ancelet et al., 2013; Donald

et al., 2011).

In this research, two connected models of dose (estimating Cryptosporidium

presence/absence) and response (public health outcomes) were developed. The dose

model was intended to address challenges associated with limited knowledge of day-to-day

Cryptosporidium levels in source waters. A Gaussian process classifier (GPC) was used to

predict presence/absence based on known factors correlated with the presence of protozoa

such as turbidity, fecal coliforms, and weather data (precipitation, temperature). Predictions

were then connected to a Bayesian linear regression model to estimate public health risk

(response model). Factors affecting public health risk such as water treatment efficiency,

drinking water consumption, herd immunity rates, and sewer overflow rates were considered

in the response model. Parameterization of the response model was based on distributions reported in previous literature, such as those for drinking water consumption and herd immunity, and on previous records of annual sewer overflows.

The modelling approach developed has significant value to environmental and water

system management. The established probabilistic QMRA model can be used for predictive inference, providing site-specific risk assessments with uncertainty at drinking water treatment facilities under climate change, emergencies, and policy controls. By incorporating weather

and other environmental variables into a dose model, estimation of impacts of climate

change on risk can be explored. Furthermore, the control points for sewer overflow and


target treatment efficiency can be determined with backwards reasoning. As the input

variables remain continuous, the model avoids the weakness of BNs which are commonly

discrete or linear Gaussian distributed. The simulation, parameter estimation and prediction rely on the Monte Carlo and Markov chain Monte Carlo methods, demonstrating a novel use of BNs and a probabilistic version of QMRA. The model, with its highly flexible nature, is a

powerful tool that can be continually extended with more variables and can increase

understanding of the public health impacts of diverse risk factors.

4.2 Methodology

4.2.1 Data sources

Source water quality data for the Kensico Reservoir were obtained from the City of New York (NYC)

Open Data, provided by the Department of Environmental Protection (DEP). Monitored

parameters at DEL18DT station (representing Kensico water) included Cryptosporidium

concentration, turbidity and fecal coliforms. Weather data were observed at the nearest

weather station to the Kensico reservoir: Westchester County Airport at White Plains, NY

(station ID: US1NYWC0003), including temperature and precipitation on the

Cryptosporidium sampling day. The two data sets were merged according to the monitoring date. The merged data set includes 368 samples from May 2015 to September 2021. The number of samples in each year is presented in Figure 4-1 (a). In the dataset, the most frequently reported Cryptosporidium oocyst concentration was ‘zero’ or below the detection limit (92% of all data), and there is a significant imbalance between presence and absence, as shown in Figure 4-1 (b).

The merged dataset includes five variables: Cryptosporidium oocysts concentration

(number of Cryptosporidium oocysts observed in a 50-liter sample), turbidity (NTU, average

turbidity of the six 4-hour grab samples), fecal coliforms (total number of colonies counted

per 100mL sample volume filtered), temperature (°C, average monitored temperatures

through the day), and precipitation on the sampling day (mm). The correlation heatmap in Figure 4-2 depicts the relationships among water quality, weather, and Cryptosporidium oocysts in source water. A weak negative correlation (r = -0.01) was observed between turbidity and Cryptosporidium presence/absence in source water. This outcome is contrary to that of Gómez-Couso et al. (2009), who found that the infectivity and survival of Cryptosporidium oocysts decreased significantly when exposed to intensive radiation, and that higher turbidity (> 200 NTU) reduced the penetration of ultraviolet (UV) light below the surface and was therefore beneficial to oocyst survival. This result may be explained by the fact that turbidity in the reservoir ranges from 0.53 to 1.43 NTU, which does not have a significant effect on light penetration. Previous findings have shown positive correlations between Cryptosporidium oocysts and E. coli (Reinoso et al., 2008) and between Cryptosporidium oocysts and rainfall (Schets et al., 2008). The positive correlations in Figure 4-2 confirm that precipitation and E. coli concentration are positively associated with Cryptosporidium oocyst concentration. Cryptosporidium oocysts are adapted to a wide range of temperatures. The oocysts are only inactivated when exposed to temperatures above 50-60 °C or below −20 °C (Hassan et al., 2021), although a slight positive linear relation was observed between temperature and Cryptosporidium presence/absence.

Figure 4-1 (a) Bar plots of sampling times in each year from 2015 to 2021; (b) Histograms of Cryptosporidium presence and absence.


Figure 4-2: Linear correlation (r) heatmap of water quality, weather parameters, and Cryptosporidium presence/absence (Class).

4.2.2 Predicting Cryptosporidium presence in source water

4.2.2.1 Gaussian process classification

Gaussian processes (GPs) are fully probabilistic methods for regression and classification problems. They allow a Bayesian use of kernels and can be interpreted as a Bayesian probabilistic analogue of the kernel SVM classifier. GPs provide fully probabilistic predictive distributions with uncertainty estimates (Quinonero-Candela et al., 2007).

Consider a training set $\mathcal{D} = \{(\mathbf{x}_i, y_{\mathrm{Crypto},i})\}$, $i = 1, \dots, n$, of $n$ pairs, where the input vector $\mathbf{x}_i$ contains the environmental variables (precipitation, turbidity, temperature, and E. coli) and $y_{\mathrm{Crypto},i}$ is the binary Cryptosporidium classification (presence/absence). Gaussian process regression assumes a Gaussian process prior over functions $f$, where $\mathbf{f} = [f_1, f_2, \dots, f_n]^T$ is a vector of latent function values from which the predicted label is obtained. The process is fully specified by its mean and covariance functions.

𝑓(𝐱)~𝐺𝑃(𝑚(𝐱), 𝐾)

Where mean function:

𝑚(𝐱) = 𝐸[𝑓(𝐱)]

Usually, the prior mean is assumed to be constant and zero, and the covariance function is taken to be the commonly used squared exponential:

$$K_{i,j} = \sigma^2 \exp\left(-\frac{(x_i - x_j)^2}{2\lambda^2}\right)$$

Here the output variance $\sigma^2$ controls the prior variance, and the length scale $\lambda$ controls the rate of decay of the covariance.
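As a minimal sketch of the covariance function above, the following builds a squared exponential kernel matrix in plain Python. The input points, the variance, and the length scale are hypothetical values chosen only for illustration, and the squared difference is generalized to a squared Euclidean distance for vector inputs.

```python
import math

def squared_exponential(x_i, x_j, sigma2=1.0, length_scale=1.0):
    """K_ij = sigma^2 * exp(-||x_i - x_j||^2 / (2 * lambda^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return sigma2 * math.exp(-sq_dist / (2.0 * length_scale ** 2))

def kernel_matrix(X, sigma2=1.0, length_scale=1.0):
    """Covariance matrix over all pairs of input vectors in X."""
    return [[squared_exponential(a, b, sigma2, length_scale) for b in X] for a in X]

# Hypothetical (standardized) environmental inputs, not data from the study.
X = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
K = kernel_matrix(X, sigma2=2.0, length_scale=1.0)
```

The diagonal of `K` equals the output variance, and entries shrink as two inputs move further apart relative to the length scale.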

Logistic Gaussian process regression is a generalization of linear logistic regression for binary classification problems. In logistic GP regression, the observed Cryptosporidium presence/absence $y_{\mathrm{Crypto},i} \in \{0, 1\}$, $i = 1, \dots, n$, is modeled using a Gaussian process, with the latent function values entering through the logistic link:

$$y_{\mathrm{Crypto},i} \sim \mathrm{Bernoulli}\left(\mathrm{logit}^{-1}(f_i)\right)$$

Integrating over the distribution of the latent function value $f_*$ at future environmental data $\mathbf{x}_*$, the predictive probability for future Cryptosporidium presence/absence $y_{\mathrm{Crypto},*}$ can be described as:

$$p(y_{\mathrm{Crypto},*} = 1 \mid y_{\mathrm{Crypto}}, \mathbf{x}, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid y_{\mathrm{Crypto}}, \mathbf{x}, \mathbf{x}_*)\, df_*$$

where $\sigma(\cdot)$ is the logistic function and 1 denotes the presence of Cryptosporidium.

Compared to parametric methods, nonparametric methods do not assume a linear or non-

linear relationship between input variables and output and can be useful for dealing with

unexpected, outlying observations that might be problematic with parametric methods

(Whitley & Ball, 2002).
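The predictive integral above has no closed form, but it can be approximated by averaging the logistic function over draws of the latent value. The sketch below assumes, purely for illustration, that the posterior of the latent value at a new input is Gaussian with a known mean and variance; the numbers are hypothetical and the function names are not from the thesis.

```python
import math
import random

def sigmoid(f):
    """Logistic function, the inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-f))

def predictive_probability(mu, var, n_draws=20000, seed=0):
    """Monte Carlo estimate of E[sigmoid(f*)] for f* ~ N(mu, var)."""
    rng = random.Random(seed)
    sd = math.sqrt(var)
    total = sum(sigmoid(rng.gauss(mu, sd)) for _ in range(n_draws))
    return total / n_draws

# Hypothetical latent posterior at a new input: mean -2.0, variance 1.0.
p_presence = predictive_probability(-2.0, 1.0)
```

Because the average is taken over the whole latent distribution, the predicted probability reflects latent uncertainty and is generally not equal to `sigmoid(mu)`.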


4.2.2.2 Model performance evaluation and threshold determination

Model performance was evaluated using precision and recall, in addition to accuracy.

Precision is the fraction of predicted positives that are truly positive, while recall describes the model’s sensitivity, that is, the fraction of actual positives that it detects.

Class imbalance is a major problem in classification, and using the default threshold of 0.5 for binary decisions on severely imbalanced data usually results in poor performance. A straightforward approach to improving the performance of a binary classifier is to tune the threshold used to map probabilities to class labels (Collell et al., 2018). The optimal threshold for Cryptosporidium presence/absence classification was chosen from the precision-recall (PR) curve as the value that gives the best balance between precision and recall:

$$\mathrm{precision} = \frac{TP}{TP + FP}$$

$$\mathrm{recall} = \frac{TP}{TP + FN}$$

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively.

After fitting the Gaussian process classifier described in section 4.2.2.1, we computed the probability of Cryptosporidium presence, $\pi(\mathbf{x}_i) = \mathrm{logit}^{-1}(f_{*,i})$, for the $i$th set of input environmental data and varied the threshold on $\pi(\mathbf{x}_i)$ to inspect the resulting changes in precision and recall. The threshold giving the best balance of precision and recall was taken to be the one that maximizes the F-score, the harmonic mean of the two measures:

$$F\text{-score} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
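The threshold search described above can be sketched as a simple grid search over candidate thresholds. The labels and predicted probabilities below are made up to mimic an imbalanced data set and are not the study data; the function names are illustrative only.

```python
def precision_recall_f(y_true, probs, threshold):
    """Precision, recall and F-score when 'presence' is predicted for probs >= threshold."""
    pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 1)
    fp = sum(1 for y, q in zip(y_true, pred) if y == 0 and q == 1)
    fn = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

def best_threshold(y_true, probs, grid):
    """Return the first grid value that maximizes the F-score."""
    return max(grid, key=lambda t: precision_recall_f(y_true, probs, t)[2])

# Hypothetical imbalanced example: 2 positives among 10 samples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.02, 0.03, 0.05, 0.04, 0.08, 0.15, 0.06, 0.01, 0.20, 0.12]
grid = [i / 100 for i in range(1, 100)]
t_best = best_threshold(y_true, probs, grid)
```

In this toy example no sample reaches a probability of 0.5, so the default threshold predicts no positives at all, while a much lower threshold recovers both.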

4.2.3 Modelling the Cryptosporidium exposure

4.2.3.1 Removal through water treatment

In this study, it was assumed that the exposure route of Cryptosporidium is direct ingestion

through drinking water consumption. A range of water treatment techniques can improve the


safety of potable water with regard to pathogenic contamination. Log removal value (LRV)

is widely used to measure the treatment efficacy:

𝐿𝑅𝑉 = log10(𝐶in/𝐶out) (𝑒𝑞. 5.1)

where 𝐶in is the pathogen concentration in influent and 𝐶𝑜𝑢𝑡 is the pathogen concentration in

effluent.
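As a brief illustration of the LRV definition, the sketch below converts between influent and effluent concentrations. The concentrations are hypothetical and the helper names are not from the thesis.

```python
import math

def log_removal_value(c_in, c_out):
    """LRV = log10(C_in / C_out)."""
    return math.log10(c_in / c_out)

def effluent_concentration(c_in, lrv):
    """Invert the LRV definition: C_out = C_in * 10^(-LRV)."""
    return c_in * 10.0 ** (-lrv)

# Hypothetical example: 100 oocysts/L reduced to 1 oocyst/L is a 2-log removal.
lrv = log_removal_value(100.0, 1.0)
c_out = effluent_concentration(1.0, 3.0)
```

Each additional unit of LRV reduces the effluent concentration by a further factor of ten.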

The achieved LRV for a given system is site-specific and dependent on the unit

processes used as well as operational parameters. Generally, regulations require an

overall LRV of 3 (USEPA, 2006), although advanced treatment methods can achieve much

higher LRVs. For instance, membrane filtration technologies are capable of LRVs >7 (Hirata

& Hashimoto, 1998). However, usually after the conventional drinking water treatment

process, including coagulation, flocculation, sedimentation, and filtration, a general LRV of 2

± 0.5 is achieved for Cryptosporidium oocysts (Chaudhry et al., 2017). Other reviews have

also indicated that most conventional water systems in the US achieved 2-2.5 log removal

and do not monitor the filtered water for Cryptosporidium (LeChevallier et al., 1991). The

regulation standard efficiency of filtration (𝜂) is assumed to follow a uniform distribution:

𝜂 ∼ 𝑈(1.5,2.5) (𝑒𝑞. 5.2)

4.2.3.2 Drinking water consumption

Daily water intake varies between countries, age groups, and gender. About 20% of daily

water intake comes from food, whereas the rest comes from beverages and drinking water. The recommended daily water intake for the vast majority of persons is 3.7 L for adult men and 2.7 L for adult women (Sawka, 2005). Säve-Söderbergh et al. (2018) characterized drinking water consumption patterns among adults using self-reported estimates. The daily drinking water consumption (𝐷, in glasses of 200 mL) was best fitted by a gamma distribution (shape = 3.938; rate = 0.791):

𝐷 ∼ 𝛤(3.938,0.791) (𝑒𝑞. 5.3)
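Sampling from the consumption distribution can be sketched as follows. Note that Python's `random.gammavariate` takes a scale parameter, so the rate from the text is inverted; the sample size and seed are arbitrary choices for illustration.

```python
import random

SHAPE, RATE = 3.938, 0.791   # gamma parameters quoted in the text (glasses per day)
GLASS_L = 0.2                # litres per glass (200 mL)

rng = random.Random(1)
# gammavariate(alpha, beta) uses beta as the scale, i.e. 1 / rate.
glasses = [rng.gammavariate(SHAPE, 1.0 / RATE) for _ in range(20000)]
litres = [g * GLASS_L for g in glasses]

mean_glasses = sum(glasses) / len(glasses)
mean_litres = sum(litres) / len(litres)
```

The theoretical mean is shape/rate, about 5 glasses or roughly 1 L per day, which the sample mean should approach.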


4.2.3.3 Sewer overflow rate

Under normal circumstances, wastewater is transported to the wastewater treatment plant

through sewers and is treated prior to discharge into drinking water sources. However,

during extreme weather events or other emergencies, such as heavy rainfall or snowmelt and blocked or cracked pipes, excess untreated sewage or wastewater can be discharged

directly to water bodies and pose a substantial health and environmental challenge.

Cryptosporidium oocysts concentrations are considerably higher in sewage than in

surface water. Lalancette et al. (2012) reported an average of 18 oocysts/L in urban sewage

received by two wastewater treatment plants in the Greater Montreal Area in Canada.

Concentration as high as 103 oocysts/L has been recorded associated with a spring runoff

(Gammie et al., 2000). The values are high compared to oocysts concentration in surface

water. Typically, Cryptosporidium concentrations in Canadian surface waters range from

0.01 to 1 oocysts/L (Health Canada, 2019). Data collected in the United States showed that median Cryptosporidium concentrations ranged from 0.005 to 0.5 oocysts/L in natural surface waters (Ongerth,

2013).

The probability distribution of the annual sewer overflow rate was determined from the combined sewer overflow discharge volumes (Statistics Canada, Table 38-10-0100-01) and the potable water use by sector and average daily use (Statistics Canada, Table 38-10-0271-01). Specifically, the overflow discharge volumes of each Canadian province in 2013, 2015, and 2017 were combined with the overall potable water use by all sectors in the corresponding provinces and years. The estimated sewer overflow rate (θ) was best fitted by a zero-truncated normal distribution (Figure 4-3):

𝜃 ∼ 𝒩(0.022,0.12), when 𝜃 > 0 (𝑒𝑞. 5.4)
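One simple way to draw from a zero-truncated normal distribution is rejection sampling, sketched below. The second parameter of the distribution is read here as a standard deviation of 0.12, which is an assumption about the notation rather than something the text states; the sample size and seed are arbitrary.

```python
import random

def sample_zero_truncated_normal(mean, sd, rng):
    """Rejection sampling: redraw from N(mean, sd) until the value is positive."""
    while True:
        x = rng.gauss(mean, sd)
        if x > 0.0:
            return x

rng = random.Random(0)
# Assumed reading of the parameters: mean 0.022, standard deviation 0.12.
theta = [sample_zero_truncated_normal(0.022, 0.12, rng) for _ in range(5000)]
mean_theta = sum(theta) / len(theta)
```

Because negative draws are discarded, the mean of the truncated samples is larger than the location parameter 0.022.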


Figure 4-3 Density plot of overflow rate

4.2.4 Modelling the risk of illness

The probability of ingesting an exact discrete dose of $j$ organisms, given the average dose of pathogen consumed per day from drinking water ($\mathrm{Dose\ Ingested}_{day}$), is modelled as a Poisson distribution:

$$P(j \mid \mathrm{Dose\ Ingested}_{day}) = \frac{(\mathrm{Dose\ Ingested}_{day})^{j}}{j!}\, e^{-\mathrm{Dose\ Ingested}_{day}} \quad (eq.\ 5.5)$$

The probability of infection is given as an exponential model:

𝑃(𝐼𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛) = 1 − (1 − 𝑟)𝑗 (𝑒𝑞. 5.6)

For Cryptosporidium, the 𝑟 value is 0.018 (Messner et al., 2001).
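The Poisson dose model and the per-organism infection probability combine into the familiar exponential dose-response form: averaging 1 − (1 − r)^j over a Poisson-distributed j gives 1 − exp(−r × mean dose). The sketch below checks this numerically; the mean dose of 0.5 organisms is a hypothetical value.

```python
import math

R = 0.018  # per-organism infection probability quoted in the text

def poisson_pmf(j, mean_dose):
    """Probability of ingesting exactly j organisms (eq. 5.5)."""
    return math.exp(-mean_dose) * mean_dose ** j / math.factorial(j)

def p_infection_given_count(j, r=R):
    """Probability of infection after ingesting exactly j organisms (eq. 5.6)."""
    return 1.0 - (1.0 - r) ** j

def p_infection_given_mean_dose(mean_dose, r=R, j_max=100):
    """Average eq. 5.6 over the Poisson distribution of the ingested count."""
    return sum(poisson_pmf(j, mean_dose) * p_infection_given_count(j, r)
               for j in range(j_max + 1))

dose = 0.5  # hypothetical mean daily dose (organisms)
p_summed = p_infection_given_mean_dose(dose)
p_closed_form = 1.0 - math.exp(-R * dose)
```

The two values agree to numerical precision, so the closed form can replace the explicit sum in a simulation.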

Herd immunity is the medical term that describes a population’s resistance to a pathogen, obtained from the immunity developed through previous infection of a portion of the population (Okhuysen et al., 1998). Typically, the dose-response function describes infection rather than disease. To calculate the disease burden for the pathogen, it is necessary to

calculate the probability of illness. The dose independent morbidity ratio for Cryptosporidium

is approximately 0.5 (Haas, 2014). A normal distribution is assumed with a mean of 0.5 and

a standard deviation of 0.07 (Casmen et al., 2000).


α ∼ 𝒩(0.5 , 0.07) (𝑒𝑞. 5.7)

The daily risk of illness is calculated based on the herd immunity (morbidity ratio) and the probability of infection:

$$P(\mathrm{illness}, day) = P(\mathrm{infection}, day) \times \alpha \quad (eq.\ 5.8)$$

The burden of disease in QMRA model is quantified by disability-adjusted life years

(DALYs), which are used in the risk assessment model as a metric to compare illnesses

with different health endpoints (Murray, 1997).

4.2.5 Probabilistic QMRA

A Bayesian multiple linear regression model was constructed to illustrate the inference of daily risk of illness and DALYs (Figure 4-4). Turbidity and the fecal indicator (E. coli) reflect the water condition and are used as indicators of Cryptosporidium presence/absence in the source water. The four variables (precipitation, temperature, turbidity, and E. coli) are the predictors in the logistic Gaussian process regression, and the outcome is the Cryptosporidium level (presence/absence). The predicted class is then input to the lower part of the Bayesian multiple linear regression model, which estimates daily exposure.

Figure 4-4 Schematic model for the process of DALYs estimation. The arrows represent the

relationship between two variables.


In the response model, the relationship between any two nodes is a mathematical expression (refer to section 2.4). However, since the parameters daily water intake (𝐷), sewer overflow rate (𝜃), treatment efficiency (𝜂), and morbidity (α) follow known distributions, the daily risk of illness and DALYs are presented as distributions instead of point estimates.

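Ingested dose per day can be calculated as:

$$\mathrm{Dose\ Ingested}_{day} = \left(C_{source} \times (1 - \theta) + C_{sewer} \times \theta\right) \times D \times 10^{-\eta} \quad (eq.\ 5.9)$$

where $10^{-\eta}$ is the fraction of oocysts remaining after treatment with a log removal of $\eta$ (eq. 5.1).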

The annual risk of illness was determined by randomly sampling 365 times from the calculated daily risks:

$$P(\mathrm{illness}, yr) = 1 - \prod_{i=1}^{365}\left(1 - P(\mathrm{illness}, i)\right) \quad (eq.\ 5.10)$$
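The exposure and annual-risk steps above can be chained as in the following sketch. It rests on several assumptions that are not stated in this form in the text: the treatment term is applied as the surviving fraction 10^(−LRV), consumption in glasses is converted to litres at 0.2 L per glass, and the infection probability uses the closed form 1 − exp(−r × dose) obtained by averaging eq. 5.6 over the Poisson dose of eq. 5.5. All input values are hypothetical.

```python
import math

R = 0.018       # dose-response parameter quoted in the text
GLASS_L = 0.2   # litres per glass (assumed conversion)

def ingested_dose(c_source, c_sewer, theta, glasses, lrv):
    """Mean daily dose: blended raw-water concentration x volume x surviving fraction."""
    c_raw = c_source * (1.0 - theta) + c_sewer * theta
    return c_raw * glasses * GLASS_L * 10.0 ** (-lrv)

def daily_illness_risk(dose, alpha, r=R):
    """P(illness, day) = P(infection, day) x morbidity ratio."""
    p_infection = 1.0 - math.exp(-r * dose)
    return p_infection * alpha

def annual_risk(daily_risks):
    """P(illness, yr) = 1 - prod(1 - P(illness, i)) over the sampled days."""
    p_no_illness = 1.0
    for p in daily_risks:
        p_no_illness *= 1.0 - p
    return 1.0 - p_no_illness

# Hypothetical point values; in the full model each input is drawn from its distribution.
dose = ingested_dose(c_source=0.1, c_sewer=18.0, theta=0.02, glasses=5.0, lrv=2.0)
p_day = daily_illness_risk(dose, alpha=0.5)
p_year = annual_risk([p_day] * 365)
```

In a Monte Carlo setting the list passed to `annual_risk` would hold 365 independently sampled daily risks rather than one repeated value.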

4.3 Results and discussion

4.3.1 Cryptosporidium prediction

In this study, the efficacy of Gaussian process classification combined with threshold moving was investigated. After fitting a logistic Gaussian process regression model to the Cryptosporidium data, a range of classification thresholds was applied to the estimated parameter of the Bernoulli distribution, and the corresponding precision and recall were examined. A grid search was used to locate the optimal threshold; the precision, recall, and corresponding F-scores at varying thresholds are shown in Figure 4-5. The plots demonstrate the importance of choosing an appropriate threshold and of using the F-score to balance precision and recall. The threshold (0.12) achieving the highest F-score (0.70) was chosen and is highlighted by red points in both figures.

For predictions of Cryptosporidium presence vs absence, overall accuracy of

93.77% was observed when reapplying the selected threshold to the data. The precision

and recall were found to be 0.58 and 0.83, respectively. As such, the model has potential for


distinguishing conditions that could result in Cryptosporidium presence in source water. In

particular, the approach did not under-predict risk (false negative rate/significance level =

0.02), which is comparatively more dangerous than over-prediction. Cohen’s kappa, which is

used to measure the agreement of two raters rating on categorical scales (McHugh, 2012),

is 0.65 (p-value < 0.05), suggesting a fair to good strength of agreement.

The effect of threshold moving was determined by comparing the outcomes with those of the logistic Gaussian process classifier without threshold moving. With the threshold set at 0.5, the overall accuracy is still high (0.92), but the classifier cannot distinguish the minority class “presence” from the majority class “absence”; the minority samples are therefore ignored, resulting in zero precision and recall. The threshold moving method uses the original training data set to train the model and then moves the decision threshold so that the minority class is more easily predicted. Compared with other methods such as data augmentation and resampling, threshold moving does not introduce external biases (He & Ma et al., 2013) and is simple and straightforward to implement.


Figure 4-5 (a) The precision-recall curve when varying the threshold of predicting

“presence”. (b) The F-score to threshold curve.


Table 4-1 Confusion matrix for binary classification of presence and absence of Cryptosporidium

                    Predicted absence    Predicted presence
Actual absence      320                  18
Actual presence     5                    25
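The summary metrics can be recomputed directly from the counts in Table 4-1, as in the sketch below. This is only an arithmetic illustration; values obtained this way need not match the reported figures exactly, since the reported values may reflect rounding or a different evaluation setup.

```python
# Counts from Table 4-1 (rows: actual, columns: predicted).
tn, fp = 320, 18   # actual absence
fn, tp = 5, 25     # actual presence
n = tn + fp + fn + tp

accuracy = (tp + tn) / n
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2.0 * precision * recall / (precision + recall)

# Cohen's kappa: observed agreement corrected for chance agreement.
p_observed = accuracy
p_expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n ** 2
kappa = (p_observed - p_expected) / (1.0 - p_expected)
```

The same counts also show why accuracy alone is uninformative here: predicting "absence" for every sample would already be correct for 338 of the 368 samples.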

4.3.2 Scenario assessments with probabilistic QMRA

4.3.2.1 Climate change

According to the findings of U.S. Global Climate Change Science Special Report (Wuebbles

et al., 2017), the global annual averaged surface air temperature has increased by 1.0 ℃

over the past 115 years (1901-2016). If annual carbon dioxide emissions continue to increase rapidly, as they have since the beginning of the 21st century, global temperature is predicted to rise by 2.78 - 5.56 ℃ above baseline by the end of this century. If emissions increase more slowly or begin to decline significantly by the mid-21st century, the predicted temperature rise would still be in the range of 1.33 - 3.28 ℃ (Wuebbles et al., 2017). In addition to overall warming, extreme weather and climate events

such as extreme precipitation, heatwaves, floods, droughts and major hurricanes are

becoming more frequent in many regions (Myhre et al., 2019; Shukla et al., 2019). Extreme weather events such as increased precipitation and temperature have been shown to be associated with water quality impacts and an increase in waterborne diseases (Khan et al., 2015). Compared to normal conditions, the odds of identifying Cryptosporidium oocysts and Giardia cysts in surface water are 2 to 3 times higher after extreme weather events (Young et al., 2015).

According to the U.S. Global Change Research Program, extreme precipitation is

defined as days with precipitation in the top 1 percent of all days with precipitation. Recent


analyses from observed data suggest that in New York and New England, the intensity of

extreme rainfall events has increased since the 1950s (DeGaetano et al., 2011). The

extreme precipitation in 24 h over the past 10 years in Westchester County was obtained

from the interactive web tool: Extreme Precipitation in New York & New England

(DeGaetano et al., 2011) for extreme precipitation analysis. The estimate is 13.03 cm/day,

with a lower confidence limit of 11.84 cm and an upper confidence limit of 14.30 cm. Myhre et al. (2019) concluded that the observed intensity of daily heavy precipitation events increases with surface temperature at a rate of 6-7% K⁻¹.

Temperature and precipitation are important meteorological factors that can affect

water quality and human health. Based on the previous records of extreme rainfall and the predictions of future global warming, the temperature increase under emissions control, Δ𝑇𝑐𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑑 (unit: ℃), and the temperature increase without emissions control, Δ𝑇𝑢𝑛𝑐𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑑 (unit: ℃), are assumed to follow Gaussian distributions:

Δ𝑇𝑐𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑑~𝒩(2.30,0.49) (𝑒𝑞. 5.11)

Δ𝑇𝑢𝑛𝑐𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑑~𝒩(4.17,0.70) (𝑒𝑞. 5.12)

The standard deviations were chosen so that the 95% confidence interval for Δ𝑇𝑐𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑑 is (1.33, 3.28) and the 95% confidence interval for Δ𝑇𝑢𝑛𝑐𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑑 is (2.78, 5.56), consistent with the estimated ranges of temperature increase.

The extreme precipitation event 𝑃𝑟𝑒𝑐𝑒𝑥𝑡𝑟𝑒𝑚𝑒 (unit: cm/day) is assumed to follow the Gaussian distribution:

𝑃𝑟𝑒𝑐𝑒𝑥𝑡𝑟𝑒𝑚𝑒~𝒩(13.03, 0.62) (𝑒𝑞. 5.13)

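The mean is thus consistent with the estimated 13.03 cm, and the 95% confidence interval is approximately (11.84, 14.30). The increase in extreme precipitation intensity is tied to the temperature increase through the 6-7% K⁻¹ scaling, taken as 6.5% per degree: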

Δ𝑃𝑟𝑒𝑐𝑒𝑥𝑡𝑟𝑒𝑚𝑒,𝑐𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑑 = 0.065 × Δ𝑇𝑐𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑑 (𝑒𝑞. 5.14)

Δ𝑃𝑟𝑒𝑐𝑒𝑥𝑡𝑟𝑒𝑚𝑒,𝑢𝑛𝑐𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑑 = 0.065 × Δ𝑇𝑢𝑛𝑐𝑜𝑛𝑡𝑟𝑜𝑙𝑙𝑒𝑑 (𝑒𝑞. 5.15)


The increased intensity Δ𝑃𝑟𝑒𝑐𝑒𝑥𝑡𝑟𝑒𝑚𝑒 after climate change is dependent on the temperature

changes. The datasets with increased temperature and increased extreme precipitation were generated by Monte Carlo simulation, randomly sampling values from each of the distributions above.
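The sampling scheme described above can be sketched in a few lines. Two assumptions are made for illustration: the standard deviation of each Gaussian is recovered from the reported 95% interval as (upper − lower) / (2 × 1.96), and the 0.065 factor is treated as a fractional increase per degree applied multiplicatively to the sampled extreme precipitation. The function names, sample size, and seed are hypothetical.

```python
import random

def sigma_from_interval(lower, upper, z=1.96):
    """Standard deviation of a Gaussian whose central 95% interval is (lower, upper)."""
    return (upper - lower) / (2.0 * z)

def simulate_extreme_precip(n, dt_mean, dt_sd, rng, scaling=0.065):
    """Sample extreme precipitation (cm/day) after a sampled temperature increase."""
    samples = []
    for _ in range(n):
        delta_t = rng.gauss(dt_mean, dt_sd)         # temperature increase (deg C)
        prec = rng.gauss(13.03, 0.62)               # present-day extreme precipitation
        samples.append(prec * (1.0 + scaling * delta_t))
    return samples

rng = random.Random(42)
controlled = simulate_extreme_precip(5000, 2.30, 0.49, rng)
uncontrolled = simulate_extreme_precip(5000, 4.17, 0.70, rng)
mean_controlled = sum(controlled) / len(controlled)
mean_uncontrolled = sum(uncontrolled) / len(uncontrolled)
```

Under these assumptions the interval (1.33, 3.28) gives a standard deviation of about 0.50, close to the 0.49 used for the controlled-emission scenario.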

It is of interest to determine whether the climate change-induced increases in temperature and extreme precipitation intensity influence DALYs. The DALYs of Cryptosporidium under climate change were calculated repeatedly 2,000 times. Each time, 365 samples were randomly drawn from the predicted Cryptosporidium oocysts under climate change for both controlled and uncontrolled emissions, and the results were compared with the DALYs of Cryptosporidium before climate change, calculated by the same method. Box plots of DALYs before and after climate change are shown in Figure 4-6 (b).


Figure 4-6 Quantile-quantile plots (Q-Q plots) (a) and box plots (b) of disability-adjusted life

years (DALYs) before climate change, after climate change under controlled emission, and

after climate change under uncontrolled emission.

Furthermore, a potential limitation of the model is its sensitivity to temperatures above the optimum. To examine the model’s response to elevated temperatures, the temperature was increased from 15℃ to 65℃. The DALYs at different temperatures are presented in Figure 4-7. Although 55℃ is reported as a temperature that leads to rapid Cryptosporidium oocyst inactivation (Hassan et al., 2021; King et al., 2005), the DALYs increase along with temperature.

Figure 4-7 DALYs for Cryptosporidium at temperatures from 15 to 65°C.

The mean DALYs before climate change, after climate change under controlled emission, and after climate change under uncontrolled emission are 1.364 × 10⁻⁴, 1.374 × 10⁻⁴, and 1.377 × 10⁻⁴, respectively. Before testing for statistically significant differences between the means of the three groups using one-way ANOVA, a Shapiro-Wilk test was conducted to determine whether the data were drawn from a normally distributed DALYs population. The Shapiro-Wilk test is widely used for testing normality (Shapiro & Wilk, 1965) and is based on the correlation between the data and the corresponding normal scores. The p-values from the normality tests are 0.53, 0.11, and 0.08. Because all p-values are > 0.05, the data distributions are not significantly different from a normal distribution. The DALYs were found to be significantly different based on the ANOVA result (p-value < 10⁻⁶), and according to the Tukey HSD test all pairwise differences were significant: before climate change versus after climate change under controlled emission (p-value < 10⁻⁶), before climate change versus after climate change under uncontrolled emission (p-value < 10⁻⁶), and controlled versus uncontrolled emissions (p-value = 0.0005). Results suggest that extreme precipitation and temperature

increases have significant effects on illness caused by Cryptosporidium and are consistent

with previous studies that report strong associations between extreme precipitation,

temperature and increased concentrations of protozoa and other microorganisms (Atherholt

et al., 1998; Curriero et al., 2001). Rainfall can increase particulate matter in water following

surface runoff and cause resuspension of sediments in river and lake bottom (Atherholt et

al., 1998). Davies et al. (2004) simulated artificial rainfall events and concluded that the

oocyst load was significantly affected by rainfall intensity and duration. Temperature is

another important factor that affects the survival of Cryptosporidium oocysts and their

inactivation and infectivity during transport. A 4-log reduction in infectivity has been observed above 20°C (King et al., 2005), and an increase in surface temperature could reduce the persistence of pathogens deposited on the land surface (Sterk et al., 2016). However, the oocysts can remain infective for at least 3 months when stored between 4°C and 15°C (King et al., 2005; Fayer et al., 1998). The annual average temperature at the sampling location is 7.93°C; after the temperature increase it is 10.23°C under controlled emission and 12.1°C under uncontrolled emission, all of which are below the 15°C upper limit of this survival range. The simulations suggest that the increases in temperature and precipitation intensity that accompany climate change will significantly increase source concentrations of Cryptosporidium.

However, DALYs for Cryptosporidium oocysts under elevated temperatures from 15 to 65°C were observed to increase, as shown in Figure 4-7. This finding is contrary to previous studies, which have suggested that the oocysts are rapidly inactivated following exposure to temperatures above approximately 50–60 °C. A possible explanation is the lack of adequate temperature data to train the model to classify Cryptosporidium oocyst presence/absence at temperatures above 25 °C, which is the maximum recorded temperature.

4.3.2.2 Treatment technique improvement

Given the possible increase of Cryptosporidium concentrations, it is of interest to

understand the level of treatment needed to ensure the overall health burden is minimized.

For the filtration technique improvement scenario, three treatment levels (2.5 log, 3 log, and 3.5 log) were assumed, represented by three continuous uniform distributions with means of 2.5, 3.0, and 3.5:

𝜂1 ∼ 𝑈(2.0 ,3.0)

𝜂2 ∼ 𝑈(2.5, 3.5)

𝜂3 ∼ 𝑈(3.0, 4.0)

The mean DALYs before treatment improvement and with treatment improved to 2.5 ± 0.5 log, 3.0 ± 0.5 log, and 3.5 ± 0.5 log are 1.4 × 10⁻⁴, 4.7 × 10⁻⁵, 1.5 × 10⁻⁵, and 4.98 × 10⁻⁶, respectively. The box plots of DALYs are shown in Figure 4-8. The p-values of the normality test for the four groups are 0.16, 0.97, 0.77, and 0.87, implying that the data distributions are not significantly different from a normal distribution, so normality of the four groups can be assumed. As expected, improving the removal efficiency resulted in a significant decrease in disease burden based on ANOVA (p-value < 10⁻⁶) and the Tukey HSD test (all p-values < 10⁻⁶).


Figure 4-8 (a) Q-Q plots of DALYs before treatment improvement, and for improved treatment at means of 2.5 ± 0.5 log, 3.0 ± 0.5 log, 3.5 ± 0.5 log. (b) Box plots of DALYs in the four groups.

Backward reasoning refers to the process of working backward from the goal. It differs from forward inference, which starts from the known evidence (Fung et al., 1994). The focus of backward reasoning is to investigate the evidence (i.e., removal efficacy, sewer overflow rate) that leads to a given outcome. In addition to quantifying the influence of improved treatment techniques on the final health burden, the developed probabilistic QMRA model can also be used to determine the log removal required for a given health burden by simulation through MCMC. To conservatively achieve the health burden goal of 10⁻⁶ DALYs per person per year, the upper limit of the 95% confidence interval was set to 10⁻⁶. With the same standard deviation as the current DALYs, the target DALYs distribution is:

$$\mathrm{DALYs}_{\mathrm{target}} \sim \mathcal{N}\left(6.29 \times 10^{-7},\ 2.26 \times 10^{-7}\right) \qquad \text{(eq. 5.19)}$$
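The mean of the target distribution can be checked with a short calculation. The sketch below assumes that the "upper level" corresponds to the one-sided 95% quantile of a normal distribution (z ≈ 1.645); this reading is an assumption, but it is consistent with the stated numbers. The function name is hypothetical.

```python
from statistics import NormalDist


def target_mean(upper_bound, sd, level=0.95):
    """Mean of a normal distribution whose `level` quantile equals upper_bound."""
    z = NormalDist().inv_cdf(level)  # about 1.645 for level = 0.95
    return upper_bound - z * sd


if __name__ == "__main__":
    mean = target_mean(upper_bound=1e-6, sd=2.26e-7)
    print(f"target mean = {mean:.3e}")
```

With an upper bound of 10^-6 and a standard deviation of 2.26 × 10^-7, this gives a mean of about 6.28 × 10^-7, which agrees with the 6.29 × 10^-7 above up to rounding of z or of the standard deviation.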

A total of 2,000 samples (the first 500 samples drawn from each chain were discarded as warm-up) drawn from the posterior distribution of the treatment parameter are presented in Figure 4-9. The sampling results suggest that, for the upper level of the 95% UI to fall below $10^{-6}$, the treatment techniques have to be improved to approximately 4 log removal both before and after climate change (before climate change: mean = 4.07; after climate change: mean = 4.10). Conventional treatment (e.g., coagulation, sedimentation, filtration) has been observed to result in generally lower LRVs and is largely dependent on the effectiveness of coagulation (Dugan et al., 2001). Although 3-log removal is believed to be achieved if the treatment process has met the filtration requirements (USEPA, 2006), considering the possible occurrence of transient elevated concentrations of Cryptosporidium in source waters (Assavasilavasukul et al., 2008), the results suggest that more stringent and efficient techniques should be considered, such as membrane processes including microfiltration (4.0-7.0 log removal) and ultrafiltration (4.4-6.0 log removal), and UV disinfection with advanced oxidation (~6 log removal) (Hamilton et al., 2018).
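The backward-reasoning step can be illustrated with a toy random-walk Metropolis sampler. The sketch below is not the QMRA model of this thesis: the forward model (DALYs falling tenfold per additional log of removal from a reference point), the log-scale spread, the uniform prior bounds, the step size, and the choice of four chains of 1,000 iterations are all assumptions made for illustration. Only the warm-up of 500 draws per chain and the total of 2,000 retained draws follow the text, and the output is therefore not expected to reproduce the posterior means reported above.

```python
import math
import random
import statistics

# Hypothetical forward model: DALYs fall tenfold per additional log of removal.
REF_LOG_REMOVAL = 2.0      # assumed reference treatment level (log)
REF_DALYS = 1.4e-4         # mean DALYs before improvement (from the text)
TARGET_MEAN = 6.29e-7      # mean of the target DALYs distribution (from the text)
TARGET_LOG10_SD = 0.15     # assumed spread on the log10 scale (illustrative)
PRIOR_BOUNDS = (1.5, 8.0)  # assumed uniform prior on log removal


def log10_dalys(eta):
    """Toy forward model: log10 of DALYs at log removal eta."""
    return math.log10(REF_DALYS) - (eta - REF_LOG_REMOVAL)


def log_posterior(eta):
    """Unnormalized log posterior of eta under the toy model."""
    lower, upper = PRIOR_BOUNDS
    if not lower <= eta <= upper:
        return -math.inf  # outside the uniform prior
    z = (log10_dalys(eta) - math.log10(TARGET_MEAN)) / TARGET_LOG10_SD
    return -0.5 * z * z


def metropolis_chain(n_iter, warmup, start, step, rng):
    """Random-walk Metropolis; returns the draws kept after warm-up."""
    current = start
    current_lp = log_posterior(current)
    draws = []
    for i in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        if i >= warmup:
            draws.append(current)
    return draws


def sample_posterior(n_chains=4, n_iter=1000, warmup=500, seed=0):
    """Pool post-warm-up draws from several independently started chains."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_chains):
        start = rng.uniform(*PRIOR_BOUNDS)
        pooled.extend(metropolis_chain(n_iter, warmup, start, 0.3, rng))
    return pooled


if __name__ == "__main__":
    draws = sample_posterior()
    print(f"{len(draws)} draws, posterior mean log removal = "
          f"{statistics.fmean(draws):.2f}")
```

Discarding the warm-up draws removes the portion of each chain that still depends on its starting value, so that the pooled draws approximate the posterior of the required log removal under the assumed model.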

Figure 4-9 Density plots of the treatment parameter (log removal). The dark density represents the estimated probability density of samples drawn from the target distribution, and the light density represents the estimated probability density of samples drawn from the treatment distribution before improvement (uniform distribution, lower = 1.5, upper = 2.5). (a) Samples from the target distribution before climate change. (b) Samples from the target distribution after climate change under emission control.

4.3.2.3 Sewer overflow control

By controlling the sewer overflow rate, the overall DALYs will decrease because less raw sewage or partially treated wastewater from treatment plants overflows into water bodies. Three different sewer overflow rates were modelled, represented by three normal distributions with means of 0.01, 0.005, and 0.001 and the same standard deviation as the current sewer overflow rate (0.12). The box plots of DALYs are shown in Figure 4-10.

Figure 4-10 (a) Quantile-quantile (Q-Q) plots of DALYs at the current sewer overflow rate (0.022) and at three reduced sewer overflow rates of 0.01, 0.005, and 0.001. (b) Box plots of DALYs in the four groups.

The p-values of the normality tests for the four groups are 0.9, 0.93, 0.73, and 0.41, suggesting that the DALYs distributions in the four groups are normal. According to the ANOVA result, controlling the sewer overflow rate significantly decreases DALYs (p-value < $10^{-6}$). The Tukey HSD test results also suggest significant differences in DALYs in all pairwise comparisons (all p-values below $10^{-6}$).

Figure 4-11 Density plots of the sewer overflow rate. The dark density represents the estimated probability density of samples drawn from the target sewer overflow rate distribution, and the light density represents the estimated probability density of samples drawn from the sewer overflow rate before control (normal distribution, mean = 0.022, sd = 0.12). (a) Samples from the target distribution before climate change. (b) Samples from the target distribution after climate change under emission control.

As the sampling results in Figure 4-11 show, to keep the upper level of the 95% UI below $1 \times 10^{-6}$ DALYs, a sewer overflow rate of around 0.0005 is required both before and after climate change (before climate change: mean = 0.00045; after climate change: mean = 0.0005). According to the UN, 80% of the world’s sewage is currently discharged without treatment (WWAP, 2017). In 2016 and 2017, over 4% of all wastewater collected and discharged in Canada was untreated (Statistics Canada, 2017). Although the magnitude of sewage in drinking water has not been estimated in previous studies, our model suggests the value should be controlled at 0.05% if the treatment efficiency remains at 2 ± 0.5 log without improvement. Besides technology-based controls for communities to address sewer overflow problems, a screening-level assessment of the impact of future climate change on sewer overflow has been provided for the New England and Great Lakes regions (USEPA, 2008). Furthermore, smart data infrastructure for wet weather control and real-time decision-making has been considered in recent years (USEPA, 2018). With the help of advances in weather and climate prediction, sewer overflow is expected to be minimized to the target value.

4.4 Summary

A probabilistic quantitative microbial risk assessment model was developed to estimate the health risk under different scenarios and determine the control points. Cryptosporidium monitoring data are inherently highly imbalanced, with a high proportion of zero values. Gaussian process regression and a threshold moving method based on the precision-recall curve were applied to address the imbalanced data problem. The model performed well in binary classification, with a precision of 0.58 and a recall of 0.83.
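Threshold moving can be sketched as a scan over candidate decision thresholds along the precision-recall curve. In the example below, the selection criterion (maximum F1 score) is an assumption, since the text specifies only that the method is based on the precision-recall curve; the labels and predicted probabilities are made-up placeholders, and all function names are hypothetical.

```python
def precision_recall(y_true, y_prob, threshold):
    """Precision and recall when predicting positive for y_prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, y_prob):
    """Try each predicted probability as a threshold; keep the best F1."""
    best = (0.5, -1.0)
    for t in sorted(set(y_prob)):
        p, r = precision_recall(y_true, y_prob, t)
        score = f1_score(p, r)
        if score > best[1]:
            best = (t, score)
    return best


if __name__ == "__main__":
    # Placeholder data: mostly absences (0) with two presences (1).
    y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
    y_prob = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30, 0.40]
    print("default 0.5:", precision_recall(y_true, y_prob, 0.5))
    print("moved threshold and F1:", best_threshold(y_true, y_prob))
```

On imbalanced data such as these placeholders, the default threshold of 0.5 flags no positives at all, whereas a lower threshold recovers the rare positive class at the cost of some false alarms.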

A stochastic model was used to probabilistically estimate the health risk by incorporating factors that impact public health, such as water treatment efficiency, drinking water consumption, morbidity, and sewage overflow. Prediction of health risk under different scenarios was based on the Monte Carlo method, while the backward reasoning regarding the target goals of treatment improvement and sewage overflow control was conducted with the Markov chain Monte Carlo method. The model provides reasonable estimates of disease burden (in DALYs) under different levels of treatment efficiency improvement and sewage overflow control. Based on the dataset used, to conservatively achieve the goal of $10^{-6}$ DALYs, the model suggests that treatment efficiency should be at least 4 log and the sewage overflow rate should be controlled below 0.05%.

Due to the high cost of Cryptosporidium monitoring, fecal indicator bacteria are usually used as predictors. However, they do not always predict high concentrations of oocysts, potentially underestimating Cryptosporidium risk. The developed model can be more sensitive to potential Cryptosporidium risk and is well suited to assessing risks under different scenarios, such as forecasting Cryptosporidium alerts to inform management and developing strategies including the level of treatment required and sewage overflow control.

Chapter 5: Conclusion

5.1 Summary of Contributions

In this thesis, Bayesian methods were applied to model cyanobacteria abundance and Cryptosporidium oocysts in source waters. Imbalanced data methods including zero-inflated modelling and threshold moving were explored. Several variable selection, model selection, and model checking methods under the Bayesian framework were applied to extract relevant features and identify the model that best fits the data. Markov chain Monte Carlo (MCMC) methods were applied to sample from posterior distributions to estimate parameters. In the probabilistic quantitative microbial risk assessment (QMRA) of Cryptosporidium, the Monte Carlo method was used to simulate and estimate the probability distribution of disease burden under different scenarios. The contributions of this thesis can be summarized as follows:

1. Bayesian zero-inflated models were successful in improving the fit of the model to imbalanced cyanobacteria data. Chl a, temperature, turbidity, solar radiation, and total phosphorus were identified as the key factors for cyanobacterial growth.

Based on the comparison results, zero-inflated models were significantly better than the

negative binomial model (p=0.002). While the fit between zero-inflated negative binomial

and hurdle negative binomial was comparable, considering the mechanisms of zero

generation, the was chosen as the model. The average accuracy for cyanobacteria

classification was 0.4, indicating poor performance. The model performed poorly in predicting low or medium alert levels, and predictions of safe levels dominated. However, the model separated high alert and safe levels better, and for binary decisions, the precision and recall were 0.62 and 0.99, respectively.
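For reference, the difference in how the two candidate models generate zeros can be written out explicitly. The block below gives the standard textbook forms, with π denoting the zero-inflation (or hurdle) probability and f_NB(y; μ, φ) the negative binomial probability mass function; these symbols are chosen here for illustration and may differ from the notation used in earlier chapters.

```latex
% Zero-inflated negative binomial: zeros arise from two sources
P(Y = y) =
\begin{cases}
\pi + (1 - \pi)\, f_{\mathrm{NB}}(0;\ \mu, \phi), & y = 0 \\
(1 - \pi)\, f_{\mathrm{NB}}(y;\ \mu, \phi), & y = 1, 2, \dots
\end{cases}

% Hurdle negative binomial: all zeros come from the binary part
P(Y = y) =
\begin{cases}
\pi, & y = 0 \\
(1 - \pi)\, \dfrac{f_{\mathrm{NB}}(y;\ \mu, \phi)}{1 - f_{\mathrm{NB}}(0;\ \mu, \phi)}, & y = 1, 2, \dots
\end{cases}
```

In the zero-inflated form, a zero can be either a structural zero (with probability π) or a sampling zero from the count process, whereas the hurdle form attributes every zero to a single binary process and models positive counts with a zero-truncated distribution.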

According to the projection predictive inference results, Chl a, temperature, turbidity, solar radiation, and total phosphorus were identified as the key factors for cyanobacterial growth. As the kernel density estimates of the coefficients of Chl a and temperature were distributed above zero, they are assumed to be positively correlated with cyanobacteria abundance. Solar radiation and total phosphorus, with coefficients distributed primarily below zero, are believed to be negatively correlated with cyanobacteria abundance. The effect of turbidity on cyanobacteria abundance is either positive or negative, depending on the non-algal matter concentration. As such, the models have great potential for identifying potential cyanobacterial blooms and the key environmental factors necessary for cyanobacterial growth in natural waters.

The Bayesian approach also contrasts with previous models that generate point estimates. The approach used in this thesis is expected to enable the modelling process to produce estimated cell counts of microorganisms with an associated level of uncertainty. With only a few environmental factors required, Bayesian zero-inflated models can be applied to model the abundance of other microorganisms and to understand the relative importance of the factors.

2. The probabilistic QMRA model developed based on Monte Carlo methods and MCMC can be used to quantify the day-to-day Cryptosporidium health risk in DALYs and determine the control points in different scenarios. Gaussian process classification achieved high accuracy in Cryptosporidium classification.

The integrated probabilistic model in this thesis includes two parts: a Gaussian process regression model to predict the presence/absence of Cryptosporidium in source water, and a QMRA model incorporating information such as drinking water treatment efficiency, daily water intake, and sewer overflow rate to generate health risk estimates.

Gaussian processes are probabilistic methods for regression and classification that provide predictive distributions with uncertainty estimates. For binary classification of Cryptosporidium, Gaussian process classification achieved a high overall accuracy of 93.7%, with a precision of 0.58 and a recall of 0.83. QMRA models are widely used to quantify the health risks from microorganisms in source waters and can support water management decisions. Compared with previous studies that focus on the utility of deterministic QMRA, the probabilistic QMRA in this thesis investigated the use of Monte Carlo methods as well as MCMC simulation, which allow each variable to be continuous instead of discrete and minimize potential information loss.

The model suggests that increased temperature and precipitation under climate change would significantly increase the disease burden of Cryptosporidium. To conservatively achieve the goal of a 95% UI upper level below $10^{-6}$ DALYs, the treatment techniques need to be improved to 4 log removal if the other interventions and information remain unchanged. Thus, more stringent and efficient treatment techniques should be considered, such as microfiltration, reverse osmosis, and ultrafiltration. If the other conditions remain unchanged, a sewer overflow rate of $5 \times 10^{-4}$ is necessary to conservatively achieve the same goal. It is expected that, as new predictor data become available, this model would enable predicting the presence of other microorganisms in source water, quantifying the health risks, and determining the necessary control points in different scenarios such as climate change, treatment technique improvement, and sewer overflow control.

5.2 Limitations and suggestions for future research

The predictors used to predict Cryptosporidium, which include turbidity, temperature, precipitation, and fecal indicator concentration (E. coli), are limited and may have decreased the model’s accuracy in Cryptosporidium classification. More water quality variables and the interactions between variables could be considered to improve predictability. In addition, the temperature data used in predicting Cryptosporidium are observations from the weather station nearest to the reservoir rather than water temperature. Although Bayesian models combined with imbalanced data methods perform well in predicting the presence or absence of microorganisms in water, the prediction of non-binary count data remains a challenge and should be further investigated.

Another limitation was treating each observation as independent and ignoring temporal autocorrelation, i.e., the similarity between observations as a function of the time lag. Observations of cyanobacteria and Cryptosporidium presence/absence that are close together in time are likely to be similar. Daily and seasonal time series should therefore be considered, and the seasonal dynamics of bacteria and pathogens in natural waters should be investigated.
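As an illustration of how temporal autocorrelation could be checked before fitting a time-series model, the sketch below computes the sample autocorrelation of a series at a given lag. The presence/absence series in the usage example is a made-up placeholder, and the function name is hypothetical.

```python
import statistics


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a given positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = statistics.fmean(series)
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Sum of products of deviations for pairs of points `lag` steps apart.
    num = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return num / denom


if __name__ == "__main__":
    # Placeholder series: detections (1) clustered in runs of three.
    presence = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
    for lag in (1, 3):
        print(f"lag {lag}: {autocorrelation(presence, lag):.2f}")
```

A clearly positive value at short lags would indicate that neighbouring observations are not independent, which is the situation described above.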

Further improvement of the QMRA model could focus on other scenarios, such as exposure through recreational waters and increased herd immunity due to vaccination. Additional data, such as precise sewer overflow data, should be collected to obtain better prior distributions for the parameters. Research on accurate prediction of Cryptosporidium oocyst concentrations is important for improving disease burden estimation. Other machine learning methods, such as ensemble learning, AdaBoost, SMOTEBoost, and AUCBoost, could be considered and compared to find the model that best fits the data.

Bibliography

Acero, J. L., Rodriguez, E., & Meriluoto, J. (2005). Kinetics of reactions between chlorine

and the cyanobacterial toxins microcystins. Water research, 39(8), 1628-1638.

Aguilera, P. A., Fernández, A., Fernández, R., Rumí, R., & Salmerón, A. (2011). Bayesian

networks in environmental modelling. Environmental Modelling & Software, 26(12),

1376-1388.

Agulló-Barceló, M., Oliva, F., & Lucena, F. (2013). Alternative indicators for monitoring

Cryptosporidium oocysts in reclaimed water. Environmental Science and Pollution

Research, 20(7), 4448-4454.

Ahmed, S. A., & Karanis, P. (2018). Comparison of current methods used to detect

Cryptosporidium oocysts in stools. International journal of hygiene and

environmental health, 221(5), 743-763.

Amha, Y. M., Kumaraswamy, R., & Ahmad, F. (2015). A probabilistic QMRA of Salmonella

in direct agricultural reuse of treated municipal wastewater. Water Science and

Technology, 71(8), 1203-1211.

Antoniou, M. G., De La Cruz, A. A., & Dionysiou, D. D. (2005). Cyanotoxins: New generation

of water contaminants. Journal of environmental engineering, 131(9), 1239-1243.

Assavasilavasukul, P., Lau, B. L., Harrington, G. W., Hoffman, R. M., & Borchardt, M. A.

(2008). Effect of pathogen concentrations on removal of Cryptosporidium and

Giardia by conventional drinking water treatment. Water Research, 42(10-11), 2678-

2690.

Atherholt, T. B., LeChevallier, M. W., Norton, W. D., & Rosen, J. S. (1998). Effect of rainfall

on Giardia and Crypto. Journal‐American Water Works Association, 90(9), 66-80.

Baek, S. S., Pyo, J., Pachepsky, Y., Park, Y., Ligaray, M., Ahn, C. Y., ... & Cho, K. H.

(2020). Identification and enumeration of cyanobacteria species using a deep neural

network. Ecological Indicators, 115, 106395.

Barker, S. F., Amoah, P., & Drechsel, P. (2014). A probabilistic model of gastroenteritis risks

associated with consumption of street food salads in Kumasi, Ghana: Evaluation of

methods to estimate pathogen dose from water, produce or food quality. Science of

the Total Environment, 487, 130-142.

Beaudequin, D., Harden, F., Roiko, A., & Mengersen, K. (2016). Utility of Bayesian networks

in QMRA-based evaluation of risk reduction options for recycled water. Science of

the Total Environment, 541, 1393-1409.

Beaudequin, D., Harden, F., Roiko, A., Stratton, H., Lemckert, C., & Mengersen, K. (2015).

Beyond QMRA: Modelling microbial health risk as a complex system using Bayesian

networks. Environment international, 80, 8-18.

Berkhof, J., Van Mechelen, I., & Hoijtink, H. (2000). Posterior predictive checks: Principles

and discussion. Computational Statistics, 15(3), 337-354.

Bertone, E., Sahin, O., Richards, R., & Roiko, A. (2016). Extreme events, water quality and

health: A participatory Bayesian risk assessment tool for managers of reservoirs.

Journal of Cleaner Production, 135, 657-667.

Betancourt, W. Q., & Rose, J. B. (2004). Drinking water treatment processes for removal of

Cryptosporidium and Giardia. Veterinary parasitology, 126(1-2), 219-234.

Bishop, C. M., & Nasrabadi, N. M. (2006). Pattern recognition and machine learning (Vol. 4,

No. 4, p. 738). New York: springer.

Bivins, A. W., Sumner, T., Kumpel, E., Howard, G., Cumming, O., Ross, I., ... & Brown, J.

(2017). Estimating infection risks and the global burden of diarrheal disease

attributable to intermittent water supply using QMRA. Environmental science &

technology, 51(13), 7542-7551.

Bouwknegt, M., Knol, A. B., van der Sluijs, J. P., & Evers, E. G. (2014). Uncertainty of

population risk estimates for pathogens based on QMRA or epidemiology: a case

study of Campylobacter in the Netherlands. Risk analysis, 34(5), 847-864.

Bownik, A. (2016). Harmful algae: Effects of cyanobacterial cyclic peptides on aquatic

invertebrates-a short review. Toxicon, 124, 26-35.

Box, G. E., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time series analysis:

forecasting and control. John Wiley & Sons.

Boyer, G. L. (2008). Cyanobacterial toxins in New York and the lower Great Lakes

ecosystems. In Cyanobacterial harmful algal blooms: state of the science and

research needs (pp. 153-165). Springer, New York, NY.

Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of

iterative simulations. Journal of computational and graphical statistics, 7(4), 434-455.

Carlson, R. E., & Simpson, J. (1996). A coordinator’s guide to volunteer lake monitoring

methods. North American Lake Management Society, 96, 305.

Carmichael, W. W. (1997). The cyanotoxins. In Advances in botanical research (Vol. 27, pp.

211-256). Academic Press.

Carmichael, W. W. (2001). Assessment of blue-green algal toxins in raw and finished

drinking water. American Water Works Association.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., ... &

Riddell, A. (2017). Stan: A probabilistic programming language. Journal of statistical

software, 76(1), 1-32.

Catalina, A., Bürkner, P. C., & Vehtari, A. (2020). Projection Predictive Inference for

Generalized Linear and Additive Multilevel Models. arXiv preprint arXiv:2010.06994.

Catherine, Q., Susanna, W., Isidora, E. S., Mark, H., Aurelie, V., & Jean-François, H.

(2013). A review of current knowledge on toxic benthic freshwater cyanobacteria–

ecology, toxin production and risk management. Water research, 47(15), 5464-5479.

Cha, Y., Park, S. S., Kim, K., Byeon, M., & Stow, C. A. (2014). Probabilistic prediction of

cyanobacteria abundance in a Korean reservoir using a Bayesian Poisson model.

Water Resources Research, 50(3), 2518-2532.

Cha, Y., Cho, K. H., Lee, H., Kang, T., & Kim, J. H. (2017). The relative importance of water

temperature and residence time in predicting cyanobacteria abundance in regulated

rivers. Water research, 124, 11-19.

Chaffin, J. D., Kane, D. D., Stanislawczyk, K., & Parker, E. M. (2018). Accuracy of data

buoys for measurement of cyanobacteria, chlorophyll, and turbidity in a large lake

(Lake Erie, North America): implications for estimation of cyanobacterial bloom

parameters from water quality sonde measurements. Environmental science and

pollution research, 25(25), 25175-25189.

Chapra, S. C., et al. (2017). Climate change impacts on harmful algal blooms in U.S. freshwaters: A screening-level assessment. Environmental Science & Technology, 51, 8933-8943.

Chaudhry, R. M., Hamilton, K. A., Haas, C. N., & Nelson, K. L. (2017). Drivers of microbial

risk for direct potable reuse and de facto reuse treatment schemes: The impacts of

source water quality and blending. International journal of environmental research

and public health, 14(6), 635.

Checkley, W., White Jr, A. C., Jaganath, D., Arrowood, M. J., Chalmers, R. M., Chen, X. M.,

... & Houpt, E. R. (2015). A review of the global burden, novel diagnostics,

therapeutics, and vaccine targets for Cryptosporidium. The Lancet Infectious

Diseases, 15(1), 85-94.

Chen, X. M., Keithly, J. S., Paya, C. V., & LaRusso, N. F. (2002). Cryptosporidiosis. New

England Journal of Medicine, 346(22), 1723-1731.

Chorus, I., & Welker, M. (2021). Toxic cyanobacteria in water: a guide to their public health

consequences, monitoring and management (p. 858). Taylor & Francis.

Chow, C. W., Drikas, M., House, J., Burch, M. D., & Velzeboer, R. M. (1999). The impact of

conventional water treatment processes on cells of the cyanobacterium Microcystis

aeruginosa. Water Research, 33(15), 3253-3262.

Christensen, V. G., Graham, J. L., Milligan, C. R., Pope, L. M., & Ziegler, A. C. (2006). Water

quality and relation to taste-and-odor compounds in North Fork Ninnescah River and

Cheney Reservoir, south-central Kansas, 1997-2003 (No. 2006-5095). US

Geological Survey.

Cirés, S., Wörmer, L., Timón, J., Wiedner, C., & Quesada, A. (2011). Cylindrospermopsin

production and release by the potentially invasive cyanobacterium Aphanizomenon

ovalisporum under temperature and light gradients. Harmful Algae, 10(6), 668-675.

Clancy, J. L., Bukhari, Z., McCuin, R. M., Matheson, Z., & Fricker, C. R. (1999). USEPA

method 1622. Journal‐American Water Works Association, 91(9), 60-68.

Coffey, R., Cummins, E., Cormican, M., Flaherty, V. O., & Kelly, S. (2007). Microbial

exposure assessment of waterborne pathogens. Human and Ecological Risk

Assessment, 13(6), 1313-1351.

Coffey, R., Cummins, E., Bhreathnach, N., Flaherty, V. O., & Cormican, M. (2010).

Development of a pathogen transport model for Irish catchments using SWAT.

Agricultural Water Management, 97(1), 101-111.

Coffey, R., Cummins, E., O’Flaherty, V., & Cormican, M. (2010). Analysis of the soil and

water assessment tool (SWAT) to model Cryptosporidium in surface water sources.

Biosystems Engineering, 106(3), 303-314.

Collell, G., Prelec, D., & Patil, K. R. (2018). A simple plug-in bagging ensemble based on

threshold-moving for classifying binary and multiclass imbalanced data.

Neurocomputing, 275, 330-340.

Corso, P. S., Kramer, M. H., Blair, K. A., Addiss, D. G., Davis, J. P., & Haddix, A. C. (2003).

Costs of illness in the 1993 waterborne Cryptosporidium outbreak, Milwaukee,

Wisconsin. Emerging infectious diseases, 9(4), 426.

Costán-Longares, A., Montemayor, M., Payan, A., Mendez, J., Jofre, J., Mujeriego, R., &

Lucena, F. (2008). Microbial indicators and pathogens: removal, relationships and

predictive capabilities in water reclamation facilities. Water research, 42(17), 4439-

4448.

Curriero, F. C., Patz, J. A., Rose, J. B., & Lele, S. (2001). The association between extreme

precipitation and waterborne disease outbreaks in the United States, 1948–1994.

American journal of public health, 91(8), 1194-1199.

Davidson, K., Gowen, R. J., Tett, P., Bresnan, E., Harrison, P. J., McKinney, A., ... &

Crooks, A. M. (2012). Harmful algal blooms: how strong is the evidence that nutrient

ratios and forms influence their occurrence?. Estuarine, Coastal and Shelf Science,

115, 399-413.

Davies, C. M., Ferguson, C. M., Kaucner, C., Krogh, M., Altavilla, N., Deere, D. A., &

Ashbolt, N. J. (2004). Dispersion and transport of Cryptosporidium oocysts from fecal

pats under simulated rainfall events. Applied and environmental microbiology, 70(2),

1151-1159.

DeGaetano, A., Zarrow, D., & Center, N. R. C. (2011). Extreme Precipitation in New York &

New England. Technical Documentation and User Manual, Northeast Regional

Climate Center, Cornell University, Ithaca, NY.

Desai, N. T., Sarkar, R., & Kang, G. (2012). Cryptosporidiosis: an under-recognized public

health problem. Tropical parasitology, 2(2), 91.

Dilks, D. W., Canale, R. P., & Meier, P. G. (1992). Development of Bayesian Monte Carlo

techniques for water quality model uncertainty. Ecological Modelling, 62(1-3), 149-

162.

Dolman, A. M., Rücker, J., Pick, F. R., Fastner, J., Rohrlack, T., Mischke, U., & Wiedner, C.

(2012). Cyanobacteria and cyanotoxins: the influence of nitrogen versus

phosphorus. PloS one, 7(6), e38757.

Donald, M., Cook, A., & Mengersen, K. (2009). Bayesian network for risk of diarrhea

associated with the use of recycled water. Risk Analysis: An International Journal,

29(12), 1672-1685.

Donald, M., Mengersen, K., Toze, S., Sidhu, J. P., & Cook, A. (2011). Incorporating

parameter uncertainty into quantitative microbial risk assessment (QMRA). Journal

of water and health, 9(1), 10-26.

Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., ... & Lautenbach, S.

(2013). Collinearity: a review of methods to deal with it and a simulation study

evaluating their performance. Ecography, 36(1), 27-46.

Dugan, N. R., Fox, K. R., Owens, J. H., & Miltner, R. J. (2001). Controlling Cryptosporidium

oocysts using conventional treatment. Journal‐American Water Works Association,

93(12), 64-76.

Dzialowski, A. R., Smith, V. H., Huggins, D. G., Denoyelles, F., Lim, N. C., Baker, D. S. &

Beury, J. H. (2009). Development of predictive models for geosmin-related taste and

odor in Kansas, USA, drinking water reservoirs. water research, 43(11), 2829-2840.

Dzialowski, A. R., Smith, V. H., Wang, S. H., Martin, M. C., & Jr, F. D. (2011). Effects of non-

algal turbidity on cyanobacterial biomass in seven turbid Kansas reservoirs. Lake

and Reservoir Management, 27(1), 6-14.

Edzwald, J. K., Tobiason, J. E., Parento, L. M., Kelley, M. B., Kaminski, G. S., Dunn, H. J., &

Galant, P. B. (2000). Giardia and Cryptosporidium removals by clarification and

filtration under challenge conditions. Journal‐American Water Works Association,

92(12), 70-84.

Efstratiou, A., Ongerth, J. E., & Karanis, P. (2017). Waterborne transmission of protozoan

parasites: review of worldwide outbreaks-an update 2011–2016. Water research,

114, 14-22.

Efstratiou, A., Ongerth, J., & Karanis, P. (2017). Evolution of monitoring for Giardia and

Cryptosporidium in water. Water Research, 123, 96-112.

EPA, U. (1998). National Primary Drinking Water Regulations: Interim Enhanced Surface

Water Treatment. Federal Register: Rules and Regulations, 63, 69478-69521.

Falconer, I. R., Runnegar, M. T., Buckley, T., Huyn, V. L., & Bradshaw, P. (1989). Using

activated carbon to remove toxicity from drinking water containing cyanobacterial

blooms. Journal‐American Water Works Association, 81(2), 102-105.

Fang, F., Gao, Y., Gan, L., He, X., & Yang, L. (2018). Effects of different initial pH and

irradiance levels on cyanobacterial colonies from Lake Taihu, China. Journal of

Applied Phycology, 30(3), 1777-1793.

Fayer, R. J. M. T., Trout, J. M., & Jenkins, M. C. (1998). Infectivity of Cryptosporidium

parvum oocysts stored in water at environmental temperatures. The Journal of

parasitology,

Fayer, R., Graczyk, T. K., Lewis, E. J., Trout, J. M., & Farley, C. A. (1998). Survival of

infectious Cryptosporidium parvum oocysts in seawater and eastern oysters

(Crassostrea virginica) in the Chesapeake Bay. Applied and Environmental

Microbiology, 64(3), 1070-1074.

Fayer, R., Morgan, U., & Upton, S. J. (2000). Epidemiology of Cryptosporidium:

transmission, detection and identification. International journal for parasitology,

30(12-13), 1305-1322.

Ferrão-Filho, A. D. S., & Kozlowsky-Suzuki, B. (2011). Cyanotoxins: bioaccumulation and

effects on aquatic animals. Marine drugs, 9(12), 2729-2772.

Francy, D. S., Stelzer, E. A., Duris, J. W., Brady, A. M., Harrison, J. H., Johnson, H. E., &

Ware, M. W. (2013). Predictive models for Escherichia coli concentrations at inland

lake beaches and relationship of model variables to pathogen detection. Applied and

environmental microbiology, 79(5), 1676-1688.

Freni, G., & Mannina, G. (2010). Bayesian approach for uncertainty quantification in water

quality modelling: The influence of prior distribution. Journal of Hydrology, 392(1-2),

31-39.

Frey, S. K., Topp, E., Edge, T., Fall, C., Gannon, V., Jokinen, C., ... & Lapen, D. R. (2013).

Using SWAT, Bacteroidales microbial source tracking markers, and fecal indicator

bacteria to predict waterborne pathogen occurrence in an agricultural watershed.

Water research, 47(16), 6326-6337.

Fung, R., & Del Favero, B. (1994). Backward simulation in Bayesian networks. In

Uncertainty Proceedings 1994 (pp. 227-234). Morgan Kaufmann.

Gammie, L., Goatcher, L. and Fok, N. (2000). A Giardia/Cryptosporidium near miss? In:

Proceedings of the 8th National Conference on Drinking Water, Quebec City,

Quebec, October 28–30, 1998. Canadian Water and Wastewater Association,

Ottawa, Ontario.

Gelman, A. (2005). Comment: Fuzzy and Bayesian p-values and u-values. Statistical Science, 20.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013).

Bayesian data analysis. CRC press.

Gerace, E., Presti, V. D. M. L., & Biondo, C. (2019). Cryptosporidium infection:

epidemiology, pathogenesis, and differential diagnosis. European Journal of

Microbiology and Immunology, 9(4), 119-123.

Ghernaout, B., Ghernaout, D., & Saiba, A. (2010). Algae and cyanotoxins removal by

coagulation/flocculation: A review. Desalination and Water Treatment, 20(1-3), 133-

143.

Gijsbertsen-Abrahamse, A. J., Schmidt, W., Chorus, I., & Heijman, S. G. J. (2006). Removal

of cyanotoxins by ultrafiltration and nanofiltration. Journal of Membrane Science,

276(1-2), 252-259.

Gómez-Couso, H., Fontán-Sainz, M., McGuigan, K. G., & Ares-Mazás, E. (2009). Effect of

the radiation intensity, water turbidity and exposure time on the survival of

Cryptosporidium during simulated solar disinfection of drinking water. Acta tropica,

112(1), 43-48.

Gronewold, A. D., Qian, S. S., Wolpert, R. L., & Reckhow, K. H. (2009). Calibrating and

validating bacterial water quality models: A Bayesian approach. Water Research,

43(10), 2688-2698.

Haas, C. N., Rose, J. B., & Gerba, C. P. (2014). Quantitative microbial risk assessment.

John Wiley & Sons.

Hall, D. B. (2000). Zero‐inflated Poisson and binomial regression with random effects: a

case study. Biometrics, 56(4), 1030-1039.

Hamilton, D. P., Salmaso, N., & Paerl, H. W. (2016). Mitigating harmful cyanobacterial

blooms: strategies for control of nitrogen and phosphorus loads. Aquatic Ecology,

50(3), 351-366.

Hamilton, K. A., Waso, M., Reyneke, B., Saeidi, N., Levine, A., Lalancette, C., ... & Ahmed,

W. (2018). Cryptosporidium and Giardia in wastewater and surface water

environments. Journal of environmental quality, 47(5), 1006-1023.

Harke, M. J., Steffen, M. M., Gobler, C. J., Otten, T. G., Wilhelm, S. W., Wood, S. A., &

Paerl, H. W. (2016). A review of the global ecology, genomics, and biogeography of

the toxic cyanobacterium, Microcystis spp. Harmful algae, 54, 4-20.

Harris, T. D., & Graham, J. L. (2017). Predicting cyanobacterial abundance, microcystin,

and geosmin in a eutrophic drinking-water reservoir using a 14-year dataset. Lake

and reservoir management, 33(1), 32-48.

Hassan, E. M., Örmeci, B., DeRosa, M. C., Dixon, B. R., Sattar, S. A., & Iqbal, A. (2021). A

review of Cryptosporidium spp. and their detection in water. Water Science and

Technology, 83(1), 1-25.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their

applications. Biometrika, 57(1), 97-109.

He, X., Liu, Y. L., Conklin, A., Westrick, J., Weavers, L. K., Dionysiou, D. D., ... & Walker, H.

W. (2016). Toxic cyanobacteria and drinking water: Impacts, detection, and

treatment. Harmful algae, 54, 174-193.

Health Canada. (2002). Guidelines for Canadian Drinking Water Quality: Supporting

Documentation, Cyanobacterial Toxins – Microcystin-LR. Available online at

http://www.hcsc.gc.ca/ewhsemt/alt_formats/hecssesc/pdf/pubs/watereau/cyanobacterial_toxins/cyanobacterial_toxins-eng.pdf.

Health Canada. (2018). Guidance on the Use of Quantitative Microbial Risk Assessment in

Drinking Water.

Health Canada. (2019). Guidelines for Canadian drinking water quality: Enteric Protozoa:

Giardia and Cryptosporidium.

Hegg, A., Radersma, R., & Uller, T. (2022). Seasonal variation in the response to a toxin‐

producing cyanobacteria in Daphnia. Freshwater Biology.

Hilborn, E. D., Roberts, V. A., Backer, L., DeConno, E., Egan, J. S., Hyde, J. B., ... &

Hlavsa, M. C. (2014). Algal bloom–associated disease outbreaks among users of

freshwater lakes—United States, 2009–2010. MMWR. Morbidity and mortality

weekly report, 63(1), 11.

Himberg, K., Keijola, A. M., Hiisvirta, L., Pyysalo, H., & Sivonen, K. (1989). The effect of

water treatment processes on the removal of hepatotoxins from Microcystis

and Oscillatoria cyanobacteria: A laboratory study. Water Research, 23(8), 979-984.

Hirata, T., & Hashimoto, A. (1998). Experimental assessment of the efficacy of

microfiltration and ultrafiltration for Cryptosporidium removal. Water Science and

Technology, 38(12), 103-107.

Hrudey, M. B., Drikas, M., & Gregory, R. (1999). REMEDIAL MEASURES.

Hrudey, S. E., & Hrudey, E. J. (2004). Safe drinking water. IWA publishing.

Huang, W. J., Cheng, B. L., & Cheng, Y. L. (2007). Adsorption of microcystin-LR by three

types of activated carbon. Journal of Hazardous Materials, 141(1), 115-122.

Huisman, J., Codd, G. A., Paerl, H. W., Ibelings, B. W., Verspagen, J. M., & Visser, P. M.

(2018). Cyanobacterial blooms. Nature Reviews Microbiology, 16(8), 471-483.

Hunter, P. R., De Sylor, M. A., Risebro, H. L., Nichols, G. L., Kay, D., & Hartemann, P.

(2011). Quantitative microbial risk assessment of cryptosporidiosis and giardiasis

from very small private water supplies. Risk Analysis: An International Journal, 31(2),

228-236.

Ibelings, B. W., Backer, L. C., Kardinaal, W. E. A., & Chorus, I. (2014). Current approaches

to cyanotoxin risk assessment and risk management around the globe. Harmful

algae, 40, 63-74.

Ishaq, S., Sadiq, R., Chhipi-Shrestha, G., Farooq, S., & Hewage, K. (2022). Developing an

Integrated “Regression-QMRA method” to Predict Public Health Risks of Low Impact

Developments (LIDs) for Improved Planning. Environmental Management, 1-17.

Jiang, P., Liu, X., Zhang, J., Te, S. H., Gin, K. Y. H., Van Fan, Y., ... & Shoemaker, C. A.

(2021). Cyanobacterial risk prevention under global warming using an extended

Bayesian network. Journal of Cleaner Production, 312, 127729.

Jin, C., Mesquita, M. M., Deglint, J. L., Emelko, M. B., & Wong, A. (2018). Quantification of

cyanobacterial cells via a novel imaging-driven technique with an integrated

fluorescence signature. Scientific reports, 8(1), 1-12.

Jöhnk, K. D., Huisman, J. E. F., Sharples, J., Sommeijer, B. E. N., Visser, P. M., & Stroom,

J. M. (2008). Summer heatwaves promote blooms of harmful cyanobacteria. Global

change biology, 14(3), 495-512.

Khan, S. J., Deere, D., Leusch, F. D., Humpage, A., Jenkins, M., & Cunliffe, D. (2015).

Extreme weather events: Should drinking water quality management systems adapt

to changing risk profiles?. Water research, 85, 124-136.

Kim, J., Jonoski, A., & Solomatine, D. P. (2022). A Classification-Based Machine Learning

Approach to the Prediction of Cyanobacterial Blooms in Chilgok Weir, South Korea.

Water, 14(4), 542.

Kim, S., Kim, S., Mehrotra, R., & Sharma, A. (2020). Predicting cyanobacteria occurrence

using climatological and environmental controls. Water Research, 175, 115639.

King, B. J., Keegan, A. R., Monis, P. T., & Saint, C. P. (2005). Environmental temperature

controls Cryptosporidium oocyst metabolic rate and associated retention of

infectivity. Applied and environmental microbiology, 71(7), 3848-3857.

Klemer, A. R., & Konopka, A. E. (1989). Causes and consequences of blue-green algal

(cyanobacterial) blooms. Lake and Reservoir Management, 5(1), 9-19.

Korich, D. G., Mead, J. R., Madore, M. S., Sinclair, N. A., & Sterling, C. (1990). Effects of

ozone, chlorine dioxide, chlorine, and monochloramine on Cryptosporidium parvum

oocyst viability. Applied and environmental microbiology, 56(5), 1423-1428.

Korner-Nievergelt, F., et al. (2015). Posterior predictive model checking and proportion of

explained variance. In Bayesian Data Analysis in Ecology Using Linear Models with R, BUGS,

and STAN (pp. 161-174). Elsevier. doi:10.1016/B978-0-12-801370-0.00010-1.

Kuk, A. Y., Li, J., & John Rush, A. (2014). Variable and threshold selection to control

predictive accuracy in logistic regression. Journal of the Royal Statistical Society:

Series C (Applied Statistics), 63(4), 657-672.

Lahti, K., Rapala, J., Kivimäki, A. L., Kukkonen, J., Niemelä, M., & Sivonen, K. (2001).

Occurrence of microcystins in raw water sources and treated drinking water of

Finnish waterworks. Water science and technology, 43(12), 225-228.

Lal, A., Fearnley, E., & Kirk, M. (2015). The risk of reported cryptosporidiosis in children

aged <5 years in Australia is highest in very remote regions. International journal of

environmental research and public health, 12(9), 11815-11828.

Lalancette, C., Papineau, I., Payment, P., Dorner, S., Servais, P., Barbeau, B., ... & Prévost,

M. (2014). Changes in Escherichia coli to Cryptosporidium ratios for various fecal

pollution sources and drinking water intakes. Water research, 55, 150-161.

Lambert, B. (2018). A student’s guide to Bayesian statistics. Sage.

Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in

manufacturing. Technometrics, 34(1), 1-14.

Lambert, T. W., Holmes, C. F., & Hrudey, S. E. (1996). Adsorption of microcystin-LR by

activated carbon and removal in full scale water treatment. Water Research, 30(6),

1411-1422.

LeBlanc Renaud, S., Pick, F. R., & Fortin, N. (2011). Effect of light intensity on the relative

dominance of toxigenic and nontoxigenic strains of Microcystis aeruginosa. Applied

and environmental microbiology, 77(19), 7016-7022.

LeChevallier, M. W., Norton, W. D., & Lee, R. G. (1991). Occurrence of Giardia and

Cryptosporidium spp. in surface water supplies. Applied and Environmental

Microbiology, 57(9), 2610-2616.

Lee, J., & Walker, H. W. (2006). Effect of process variables and natural organic matter on

removal of microcystin-LR by PAC-UF. Environmental science & technology,

40(23), 7336-7342.

Lee, J., Lee, S., & Jiang, X. (2017). Cyanobacterial toxins in freshwater and food: important

sources of exposure to humans. Annual review of food science and technology, 8,

281-304.

Lee, T. A., Rollwagen-Bollens, G., Bollens, S. M., & Faber-Hammond, J. J. (2015).

Environmental influence on cyanobacteria abundance and microcystin toxin

production in a shallow temperate lake. Ecotoxicology and environmental safety,

114, 318-325.

Leitch, G. J., & He, Q. (2011). Cryptosporidiosis-an overview. Journal of biomedical

research, 25(1), 1-16.

Levich, A. P. (1996). The role of nitrogen-phosphorus ratio in selecting for dominance of

phytoplankton by cyanobacteria or green algae and its application to reservoir

management. Journal of Aquatic Ecosystem Health, 5(1), 55-61.

Li, Z., Guo, J. S., Fang, F., Gao, X., Sheng, J. P., Zhou, H., & Long, M. (2010). Seasonal

variation of cyanobacteria and its potential relationship with key environmental

factors in Xiaojiang backwater area, Three Gorges Reservoir. Huan Jing ke Xue=

Huanjing Kexue, 31(2), 301-309.

Ligda, P., Claerebout, E., Kostopoulou, D., Zdragas, A., Casaert, S., Robertson, L. J., &

Sotiraki, S. (2020). Cryptosporidium and Giardia in surface water and drinking water:

animal sources and towards the use of a machine-learning approach as a tool for

predicting contamination. Environmental Pollution, 264, 114766.

Linden, K. G., Shin, G. A., Faubert, G., Cairns, W., & Sobsey, M. D. (2002). UV disinfection

of Giardia lamblia cysts in water. Environmental science & technology, 36(11), 2519-

2522.

Lindquist, H. D. A., Bennett, J. W., Broomall, K., Glover, G., & Schaefer, F. W., III. (1999).

Counting Cryptosporidium, an analysis of the utility of various cytometric techniques.

Presented at the Annual Meeting of the American Society of Parasitologists,

Monterey, CA, July 5-10, 1999.

Lisle, J. T., & Rose, J. B. (1995). Gene exchange in drinking water and biofilms by natural

transformation. Water Science and Technology, 31(5-6), 41-46.

Litke, D. W. (1999). Review of phosphorus control measures in the United States and their

effects on water quality (Vol. 99, No. 4007). US Department of the Interior, US

Geological Survey.

Liu, W., An, W., Jeppesen, E., Ma, J., Yang, M., & Trolle, D. (2019). Modelling the fate and

transport of Cryptosporidium, a zoonotic and waterborne pathogen, in the Daning

River watershed of the Three Gorges Reservoir Region, China. Journal of

environmental management, 232, 462-474.

Lürling, M., Eshetu, F., Faassen, E. J., Kosten, S., & Huszar, V. L. (2013). Comparison of

cyanobacterial and green algal growth rates at different temperatures. Freshwater

Biology, 58(3), 552-559.

Maatouk, I., Bouaïcha, N., Fontan, D., & Levi, Y. (2002). Seasonal variation of microcystin

concentrations in the Saint-Caprais reservoir (France) and their removal in a small

full-scale treatment plant. Water Research, 36(11), 2891-2897.

Mac Kenzie, W. R., Hoxie, N. J., Proctor, M. E., Gradus, M. S., Blair, K. A., Peterson, D. E.,

... & Davis, J. P. (1994). A massive outbreak in Milwaukee of Cryptosporidium

infection transmitted through the public water supply. New England journal of

medicine, 331(3), 161-167.

Malve, O., Laine, M., Haario, H., Kirkkala, T., & Sarvala, J. (2007). Bayesian modelling of

algal mass occurrences—using adaptive MCMC methods with a lake water quality

model. Environmental Modelling & Software, 22(7), 966-977.

Mangan, N. M., Flamholz, A., Hood, R. D., Milo, R., & Savage, D. F. (2016). pH determines

the energetic efficiency of the cyanobacterial CO2 concentrating mechanism.

Proceedings of the National Academy of Sciences, 113(36), E5354-E5362.

Marion, J. W., Zhang, F., Cutting, D., & Lee, J. (2017). Associations between county-level

land cover classes and cyanobacteria blooms in the United States. Ecological

Engineering, 108, 556-563.

Masciopinto, C., Vurro, M., Lorusso, N., Santoro, D., & Haas, C. N. (2020). Application of

QMRA to MAR operations for safe agricultural water reuses in coastal areas. Water

Research X, 8, 100062.

McHugh, M. L. (2012). Interrater reliability: the kappa statistic. Biochemia medica, 22(3),

276-282.

Meng, X. L. (1994). Posterior predictive p-values. The annals of statistics, 22(3), 1142-

1160.

Messner, M. J., Chappell, C. L., & Okhuysen, P. C. (2001). Risk assessment for

Cryptosporidium: a hierarchical Bayesian analysis of human dose response data.

Water Research, 35(16), 3934-3940.

Metropolis, N., & Ulam, S. (1949). The monte carlo method. Journal of the American

statistical association, 44(247), 335-341.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953).

Equation of state calculations by fast computing machines. The Journal of Chemical

Physics, 21(6), 1087–1092.

Min, Y., & Agresti, A. (2005). Random effect models for repeated measures of zero-inflated

count data. Statistical modelling, 5(1), 1-19.

Mok, H. F., Barker, S. F., & Hamilton, A. J. (2014). A probabilistic quantitative microbial risk

assessment model of norovirus disease burden from wastewater irrigation of

vegetables in Shepparton, Australia. Water research, 54, 347-362.

Mur, R., Skulberg, O. M., & Utkilen, H. (1999). CYANOBACTERIA IN THE ENVIRONMENT.

Murray, C. J., & Acharya, A. K. (1997). Understanding DALYs. Journal of health economics,

16(6), 703-730.

Myhre, G., Alterskjær, K., Stjern, C. W., Hodnebrog, Ø., Marelle, L., Samset, B. H., ... &

Stohl, A. (2019). Frequency of extreme precipitation increases extensively with event

rareness under global warming. Scientific reports, 9(1), 1-10.

Newcombe, G., House, J., Ho, L., Baker, P., & Burch, M. (2009). Management strategies for

cyanobacteria (blue-green algae): A guide for water utilities. Water Quality Research

Australia (WQRA), Research Report, 74, 60-76.

Ng, J. S. Y., Eastwood, K., Walker, B., Durrheim, D. N., Massey, P. D., Porigneaux, P., ... &

Ryan, U. (2012). Evidence of Cryptosporidium transmission between cattle and

humans in northern New South Wales. Experimental parasitology, 130(4), 437-441.

Nieminski, E. C., & Ongerth, J. E. (1995). Removing Giardia and Cryptosporidium by

conventional treatment and direct filtration. Journal‐American Water Works

Association, 87(9), 96-106.

O’Neil, J. M., Davis, T. W., Burford, M. A., & Gobler, C. J. (2012). The rise of harmful

cyanobacteria blooms: the potential roles of eutrophication and climate change.

Harmful algae, 14, 313-334.

Oberemm, A., Becker, J., Codd, G. A., & Steinberg, C. (1999). Effects of cyanobacterial

toxins and aqueous crude extracts of cyanobacteria on the development of fish and

amphibians. Environmental Toxicology: An International Journal, 14(1), 77-88.

Okhuysen, P. C., Chappell, C. L., Sterling, C. R., Jakubowski, W., & DuPont, H. L. (1998).

Susceptibility and serologic response of healthy adults to reinfection with

Cryptosporidium parvum. Infection and Immunity, 66(2), 441-443.

Oneby, M., Deremiah, R., & Bollyky, L. J. (2006). Pipeline Contactor for the City of Wichita,

Kansas High Pressure Ozone Facility. In Pipelines 2006: Service to the Owner (pp.

1-10).

Ong, C. S., Eisler, D. L., Goh, S. H., Tomblin, J., Awad-El-Kariem, F. M., Beard, C. B., ... &

Isaac-Renton, J. L. (1999). Molecular epidemiology of cryptosporidiosis outbreaks

and transmission in British Columbia, Canada. The American journal of tropical

medicine and hygiene, 61(1), 63-69.

Ongerth, J. E. (2013). ICR SS protozoan data site-by-site: a picture of Cryptosporidium and

Giardia in US surface water. Environmental science & technology, 47(18), 10145-

10154.

Ongerth, J. E. (2013). LT2 Cryptosporidium data: What do they tell us about

Cryptosporidium in surface water in the United States?. Environmental science &

technology, 47(9), 4029-4038.

Ongerth, J. E., & Hutton, A. P. (1997). DE filtration to remove Cryptosporidium. Journal‐

American Water Works Association, 89(12), 39-46.

Ongerth, J. E., & Hutton, P. E. (2001). Testing of diatomaceous earth filtration for removal of

Cryptosporidium oocysts. Journal‐American Water Works Association, 93(12), 54-

63.

Orak, N. H. (2020). A Hybrid Bayesian network framework for risk assessment of arsenic

exposure and adverse reproductive outcomes. Ecotoxicology and Environmental

Safety, 192, 110270.

Pachepsky, Y., Shelton, D., Dorner, S., & Whelan, G. (2016). Can E. coli or thermotolerant

coliform concentrations predict pathogen presence or prevalence in irrigation

waters?. Critical reviews in microbiology, 42(3), 384-393.

Paerl, H. W., & Huisman, J. (2009). Climate change: a catalyst for global expansion of

harmful cyanobacterial blooms. Environmental microbiology reports, 1(1), 27-37.

Paerl, H. W., & Paul, V. J. (2012). Climate change: links to global expansion of harmful

cyanobacteria. Water research, 46(5), 1349-1363.

Papadimitriou, T., Kagalou, I., Stalikas, C., Pilidis, G., & Leonardos, I. D. (2012).

Assessment of microcystin distribution and biomagnification in tissues of aquatic

food web compartments from a shallow lake and evaluation of potential risks to

public health. Ecotoxicology, 21(4), 1155-1166.

Parrott, L., & Kok, R. (2000). Incorporating complexity in ecosystem modelling. Complexity

international, 7, 1-19.

Parsons, D. J., Orton, T. G., D'Souza, J., Moore, A., Jones, R., & Dodd, C. E. R. (2005). A

comparison of three modelling approaches for quantitative risk assessment using the

case study of Salmonella spp. in poultry meat. International journal of food

microbiology, 98(1), 35-51.

Piironen, J., Paasiniemi, M., & Vehtari, A. (2020). Projective inference in high-dimensional

problems: Prediction and feature selection. Electronic Journal of Statistics, 14(1),

2155-2197.

Pyo, J., Cho, K. H., Kim, K., Baek, S. S., Nam, G., & Park, S. (2021). Cyanobacteria cell

prediction using interpretable deep learning model with observed, numerical, and

sensing data assemblage. Water Research, 203, 117483.

Pyo, J., Park, L. J., Pachepsky, Y., Baek, S. S., Kim, K., & Cho, K. H. (2020). Using

convolutional neural network for predicting cyanobacteria concentrations in river

water. Water Research, 186, 116349.

Quinonero-Candela, J., Rasmussen, C. E., & Williams, C. K. (2007). Approximation

methods for Gaussian process regression. In Large-scale kernel machines (pp. 203-

223). MIT Press.

Rabalais, N. N., Diaz, R. J., Levin, L. A., Turner, R. E., Gilbert, D., & Zhang, J. (2010).

Dynamics and distribution of natural and human-caused hypoxia. Biogeosciences,

7(2), 585-619.

Rastogi, R. P., Sinha, R. P., & Incharoensakdi, A. (2014). The cyanotoxin-microcystins:

current overview. Reviews in Environmental Science and Bio/Technology, 13(2),

215-249.

Razzolini, M. T. P., Breternitz, B. S., Kuchkarian, B., & Bastos, V. K. (2020).

Cryptosporidium and Giardia in urban wastewater: A challenge to overcome.

Environmental Pollution, 257, 113545.

Recknagel, F., Orr, P. T., Bartkow, M., Swanepoel, A., & Cao, H. (2017). Early warning of

limit-exceeding concentrations of cyanobacteria and cyanotoxins in drinking water

reservoirs by inferential modelling. Harmful algae, 69, 18-27.

Reichwaldt, E. S., & Ghadouani, A. (2012). Effects of rainfall patterns on toxic

cyanobacterial blooms in a changing climate: between simplistic scenarios and

complex dynamics. Water research, 46(5), 1372-1393.

Reinoso, R., Torres, L. A., & Bécares, E. (2008). Efficiency of natural systems for removal of

bacteria and pathogenic parasites from wastewater. Science of the total

environment, 395(2-3), 80-86.

Richardson, J., Feuchtmayr, H., Miller, C., Hunter, P. D., Maberly, S. C., & Carvalho, L.

(2019). Response of cyanobacteria and phytoplankton abundance to warming,

extreme rainfall events and nutrient enrichment. Global Change Biology, 25(10),

3365-3380.

Rigaux Ancelet, C. S., Carlin, F., Nguyen‐thé, C., & Albert, I. (2013). Inferring an augmented

Bayesian network to confront a complex quantitative microbial risk assessment

model with durability studies: application to Bacillus cereus on a courgette purée

production chain. Risk analysis, 33(5), 877-892.

Robarts, R. D., & Zohary, T. (1987). Temperature effects on photosynthetic capacity,

respiration, and growth rates of bloom‐forming cyanobacteria. New Zealand Journal

of Marine and Freshwater Research, 21(3), 391-399.

Robertson, L. J., Campbell, A. T., & Smith, H. V. (1992). Survival of Cryptosporidium

parvum oocysts under various environmental pressures. Applied and environmental

microbiology, 58(11), 3494-3500.

Rose, J. B. (1997). Environmental ecology of Cryptosporidium and public health

implications. Annual review of public health, 18(1), 135-161.

Rousso, B. Z., Bertone, E., Stewart, R., & Hamilton, D. P. (2020). A systematic literature

review of forecasting and predictive models for cyanobacteria blooms in freshwater

lakes. Water Research, 182, 115959.

Rue, H., Martino, S., & Chopin, N. (2009). Approximate Bayesian inference for latent

Gaussian models by using integrated nested Laplace approximations. Journal of the

royal statistical society: Series b (statistical methodology), 71(2), 319-392.

Ryan, U., Hijjawi, N., & Xiao, L. (2018). Foodborne cryptosporidiosis. International journal

for parasitology, 48(1), 1-12.

Salmaso, N., Capelli, C., Shams, S., & Cerasino, L. (2015). Expansion of bloom-forming

Dolichospermum lemmermannii (Nostocales, Cyanobacteria) to the deep lakes

south of the Alps: colonization patterns, driving forces and implications for water use.

Harmful Algae, 50, 76-87.

Sarma, T. A. (2012). Handbook of cyanobacteria. CRC Press.

Säve-Söderbergh, M., Toljander, J., Mattisson, I., Åkesson, A., & Simonsson, M. (2018).

Drinking water consumption patterns among adults—SMS as a novel tool for

collection of repeated self-reported water consumption. Journal of exposure science

& environmental epidemiology, 28(2), 131-139.

Sawka, M. N., Cheuvront, S. N., & Carter, R. (2005). Human water needs. Nutrition reviews,

63(suppl_1), S30-S39.

Schets, F. M., Van Wijnen, J. H., Schijven, J. F., Schoon, H., & de Roda Husman, A. M.

(2008). Monitoring of waterborne pathogens in surface waters in Amsterdam, The

Netherlands, and the potential health risk associated with exposure to

Cryptosporidium and Giardia in these waters. Applied and environmental

microbiology, 74(7), 2069-2078.

Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete

samples). Biometrika, 52(3/4), 591-611.

Shirley, D. A. T., Moonah, S. N., & Kotloff, K. L. (2012). Burden of disease from

cryptosporidiosis. Current opinion in infectious diseases, 25(5), 555.

Shukla, P. R., Skea, J., Buendia, E. C., Masson-Delmotte, V., Pörtner, H. O., Roberts, D. C.,

... & Malley, J. (2019). Climate Change and Land: an IPCC special report on climate

change, desertification, land degradation, sustainable land management, food

security, and greenhouse gas fluxes in terrestrial ecosystems.

Sinha, R. P., Kumar, A., Tyagi, M. B., & Hader, D. (2005). Ultraviolet-B-induced destruction

of phycobiliproteins in cyanobacteria. Physiology and Molecular Biology of Plants,

11(2), 313.

Sivonen, K., Kononen, K., Carmichael, W. W., Dahlem, A. M., Rinehart, K. L., Kiviranta, J.,

& Niemela, S. I. (1989). Occurrence of the hepatotoxic cyanobacterium Nodularia

spumigena in the Baltic Sea and structure of the toxin. Applied and Environmental

microbiology, 55(8), 1990-1995.

Smith, H. V., & Grimason, A. M. (2003). Giardia and Cryptosporidium in water and

wastewater. In Handbook of water and wastewater microbiology (pp. 695-756).

Academic Press.

Smith, H. V., & Nichols, R. A. (2010). Cryptosporidium: detection in water and food.

Experimental parasitology, 124(1), 61-79.

Sparks, H., Nair, G., Castellanos-Gonzalez, A., & White, A. C. (2015). Treatment of

Cryptosporidium: what we know, gaps, and the way forward. Current tropical

medicine reports, 2(3), 181-187.

States, S., Stadterman, K., Ammon, L., Vogel, P., Baldizar, J., Wright, D., ... & Sykora, J.

(1997). Protozoa in river water: sources, occurrence, and treatment. Journal‐

American Water Works Association, 89(9), 74-83.

Statistics Canada. Table 38-10-0100-01 Combined sewer overflow discharge volumes (x

1,000,000)

Statistics Canada. Table 38-10-0271-01 Potable water use by sector and average daily use.

DOI: https://doi.org/10.25318/3810027101-eng

Sterk, A., Schijven, J., de Roda Husman, A. M., & de Nijs, T. (2016). Effect of climate

change on runoff of Campylobacter and Cryptosporidium from land to surface water.

Water research, 95, 90-102.

Swaffer, B., Abbott, H., King, B., van der Linden, L., & Monis, P. (2018). Understanding

human infectious Cryptosporidium risk in drinking water supply catchments. Water

research, 138, 282-292.

Sylvestre, É., Burnet, J. B., Dorner, S., Smeets, P., Medema, G., Villion, M., ... & Prévost,

M. (2021). Impact of Hydrometeorological events for the selection of parametric

models for Protozoan Pathogens in drinking‐water sources. Risk Analysis, 41(8),

1413-1426.

Taranu, Z. E., Gregory‐Eaves, I., Leavitt, P. R., Bunting, L., Buchaca, T., Catalan, J., ... &

Vinebrooke, R. D. (2015). Acceleration of cyanobacterial dominance in north

temperate‐subarctic lakes during the Anthropocene. Ecology letters, 18(4), 375-384.

Teixeira, M. R., & Rosa, M. J. (2005). Microcystins removal by nanofiltration membranes.

Separation and Purification Technology, 46(3), 192-201.

Templeton, T. J., Lancto, C. A., Vigdorovich, V., Liu, C., London, N. R., Hadsall, K. Z., &

Abrahamsen, M. S. (2004). The Cryptosporidium oocyst wall protein is a member of

a multigene family and has a homolog in Toxoplasma. Infection and Immunity, 72(2),

980-987.

Thomas, M. K., & Litchman, E. (2016). Effects of temperature and nitrogen availability on

the growth of invasive and native cyanobacteria. Hydrobiologia, 763(1), 357-369.

Thomson, S., Hamilton, C. A., Hope, J. C., Katzer, F., Mabbott, N. A., Morrison, L. J., &

Innes, E. A. (2017). Bovine cryptosporidiosis: impact, host-parasite interaction and

control strategies. Veterinary Research, 48(1), 1-16.

US Environmental Protection Agency (USEPA). (1998). Interim enhanced surface water

treatment rule. Fed Reg, 63, 69478.

LeChevallier, M. W., Norton, W. D., & Lee, R. G. (1991). Giardia and Cryptosporidium spp. in

filtered drinking water supplies. Applied and environmental Microbiology, 57(9), 2617-2621.

USEPA (US Environmental Protection Agency). (2006). Long Term 2 Enhanced Surface

Water Treatment Rule. EPA 815-R-0e16, US EPA.

USEPA. (2018). Smart Data Infrastructure for wet Weather Control and Decision Support.

US Geological Survey, 2015, USGS National Water Information System, accessed ,

http://dx.doi.org/10.5066/F7P55KJN

van de Schoot, R., Depaoli, S., King, R., Kramer, B., Märtens, K., Tadesse, M. G., ... & Yau,

C. (2021). Bayesian statistics and modelling. Nature Reviews Methods Primers, 1(1),

1-26.

Vermeulen, L. C., van Hengel, M., Kroeze, C., Medema, G., Spanier, J. E., van Vliet, M. T.,

& Hofstra, N. (2019). Cryptosporidium concentrations in rivers worldwide. Water

Research, 149, 202-214.

Verspagen, J. M., Van de Waal, D. B., Finke, J. F., Visser, P. M., Van Donk, E., & Huisman,

J. (2014). Rising CO2 levels will intensify phytoplankton blooms in eutrophic and

hypertrophic lakes. PloS one, 9(8), e104325.

Villa, A., Fölster, J., & Kyllmar, K. (2019). Determining suspended solids and total

phosphorus from turbidity: comparison of high-frequency sampling with conventional

monitoring methods. Environmental monitoring and assessment, 191(10), 1-16.

Walls, J. T., Wyatt, K. H., Doll, J. C., Rubenstein, E. M., & Rober, A. R. (2018). Hot and

toxic: Temperature regulates microcystin release from cyanobacteria. Science of the

Total Environment, 610, 786-795.

Wang, H., Zhang, Z., Liang, D., Pang, Y., Hu, K., & Wang, J. (2016). Separation of wind's

influence on harmful cyanobacterial blooms. Water Research, 98, 280-292.

Wang, Z., Huang, K., Zhou, P., & Guo, H. (2010). A hybrid neural network model for

cyanobacteria bloom in Dianchi Lake. Procedia Environmental Sciences, 2, 67-75.

Welsh, A. H., Cunningham, R. B., Donnelly, C. F., & Lindenmayer, D. B. (1996). Modelling

the abundance of rare species: statistical models for counts with extra zeros.

Ecological Modelling, 88(1-3), 297-308.

Wenger, S. J., & Freeman, M. C. (2008). Estimating species occurrence, abundance, and

detection probability using zero‐inflated distributions. Ecology, 89(10), 2953-2959.

Westrick, J. A., Szlag, D. C., Southwell, B. J., & Sinclair, J. (2010). A review of

cyanobacteria and cyanotoxins removal/inactivation in drinking water treatment.

Analytical and bioanalytical chemistry, 397(5), 1705-1714.

Whitley, E., & Ball, J. (2002). Statistics review 6: Nonparametric methods. Critical care, 6(6),

1-5.

Whitton, B. A., & Potts, M. (2012). Introduction to the cyanobacteria. In Ecology of

Cyanobacteria II (pp. 1-13). Springer, Dordrecht.

Wood, S. A., Borges, H., Puddick, J., Biessy, L., Atalah, J., Hawes, I., ... & Hamilton, D. P.

(2017). Contrasting cyanobacterial communities and microcystin concentrations in

summers with extreme weather events: insights into potential effects of climate

change. Hydrobiologia, 785(1), 71-89.

World Health Organization. (2005). Guidelines for laboratory and field testing of mosquito

larvicides (No. WHO/CDS/WHOPES/GCDPP/2005.13). World Health Organization.

World Health Organization. (2016). Quantitative microbial risk assessment: application for

water safety management.

World Health Organization. (2021). WHO human health risk assessment toolkit: chemical

hazards. World Health Organization.

Wuebbles, D.J., D.W. Fahey, K.A. Hibbard, B. DeAngelo, S. Doherty, K. Hayhoe, R. Horton,

J.P. Kossin, P.C. Taylor, A.M. Waple, and C.P. Weaver, 2017: Executive summary.

In: Climate Science Special Report: Fourth National Climate Assessment, Volume I

[Wuebbles, D.J., D.W. Fahey, K.A. Hibbard, D.J. Dokken, B.C. Stewart, and T.K.

Maycock (eds.)]. U.S. Global Change Research Program, Washington, DC, USA, pp.

12-34, doi: 10.7930/J0DJ5CTG.

WWAP, U. (2017). WWAP (United Nations World Water Assessment Programme).

Xiao, G., Qiu, Z., Qi, J., Chen, J. A., Liu, F., Liu, W., ... & Shu, W. (2013). Occurrence and

potential health risk of Cryptosporidium and Giardia in the Three Gorges Reservoir,

China. Water research, 47(7), 2431-2445.

Xiao, L., & Feng, Y. (2017). Molecular epidemiologic tools for waterborne pathogens

Cryptosporidium spp. and Giardia duodenalis. Food and Waterborne Parasitology, 8,

14-32.

Xue, L., Zhang, Y., Zhang, T., An, L., & Wang, X. (2005). Effects of enhanced ultraviolet-B

radiation on algae and cyanobacteria. Critical reviews in microbiology, 31(2), 79-89.

Yang, L., Zhao, X., Peng, S., & Li, X. (2016). Water quality assessment analysis by using

combination of Bayesian and genetic algorithm approach in an urban lake, China.

Ecological modelling, 339, 77-88.

Yang, X. S. (2019). Introduction to algorithms for data mining and machine learning.

Academic press.

Yoder, J. S., & Beach, M. J. (2010). Cryptosporidium surveillance and risk factors in the

United States. Experimental parasitology, 124(1), 31-39.

Young, I., Smith, B. A., & Fazil, A. (2015). A systematic review and meta-analysis of the

effects of extreme weather events and other weather-related variables on

Cryptosporidium and Giardia in fresh surface waters. Journal of water and health,

13(1), 1-17.

Zhang, F., Lee, J., Liang, S., & Shum, C. K. (2015). Cyanobacteria blooms and non-

alcoholic liver disease: evidence from a county level ecological study in the United

States. Environmental Health, 14(1), 1-11.

Zhao, C. S., Shao, N. F., Yang, S. T., Ren, H., Ge, Y. R., Feng, P., ... & Zhao, Y. (2019).

Predicting cyanobacteria bloom occurrence in lakes and reservoirs before blooms

occur. Science of the total environment, 670, 837-848.

Zhao, Y., Sharma, A., Sivakumar, B., Marshall, L., Wang, P., & Jiang, J. (2014). A Bayesian

method for multi-pollution source water quality model and seasonal water quality

management in river segments. Environmental Modelling & Software, 57, 216-226.

Zhao, Y., Yan, Y., Xie, L., Wang, L., He, Y., Wan, X., & Xue, Q. (2020). Long-term

environmental exposure to microcystins increases the risk of nonalcoholic fatty liver

disease in humans: A combined fisher-based investigation and murine model study.

Environment International, 138, 105648.

Zhiteneva, V., Carvajal, G., Shehata, O., Hübner, U., & Drewes, J. E. (2021). Quantitative

microbial risk assessment of a non-membrane based indirect potable water reuse

system using Bayesian networks. Science of the Total Environment, 780, 146462.

Appendix

Chapter 3 Supplementary Materials

Figure S1: MCMC trace plots of the negative binomial model based on four chains at 1000

iterations. intercept and beta_mu[1-5] are the coefficients in the negative binomial model; phi is

the inverse overdispersion parameter.
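
As a reading aid for phi, the sketch below shows how an inverse overdispersion parameter typically enters a negative binomial model. It assumes the mean-and-dispersion parameterization used by Stan's neg_binomial_2 distribution; the caption does not state which parameterization was used, so this is an assumption rather than the thesis's stated formulation. Here y is a count, mu its mean, and phi the inverse overdispersion.

```latex
\mathrm{E}[y] = \mu, \qquad
\mathrm{Var}[y] = \mu + \frac{\mu^{2}}{\phi}
```

Under this assumption, small values of phi indicate strong overdispersion, and the variance approaches the Poisson variance mu as phi grows large.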

(a)

(b)

Figure S2: MCMC trace plots of the zero-inflated negative binomial model based on four chains at

1000 iterations. intercept2 and beta_mu[1-5] are the coefficients in the negative binomial model;

phi is the inverse overdispersion parameter. Intercept2 and beta_theta[1-5] are the coefficients in the

binomial model.

(a)

(b)

Figure S3: MCMC trace plots of the hurdle negative binomial model based on four chains at 1000

iterations. intercept2 and beta_mu[1-5] are the coefficients in the negative binomial model; phi

is the inverse overdispersion parameter. Intercept2 and beta_theta[1-5] are the coefficients in the

binomial model.

Figure S4: Bar plots of average cyanobacteria abundance in each season. The seasons are

defined as spring (March, April, May), summer (June, July, August), autumn (September,

October, November) and winter (December, January, February).
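
The seasonal grouping described in the caption can be sketched as follows. This is a minimal illustration, not the code used to produce the figure: the names SEASON_BY_MONTH and seasonal_means are hypothetical, and the input is assumed to be a sequence of (month, abundance) pairs with months numbered 1 to 12.

```python
from collections import defaultdict

# Month-to-season mapping as defined in the Figure S4 caption.
SEASON_BY_MONTH = {
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
    12: "winter", 1: "winter", 2: "winter",
}


def seasonal_means(records):
    """Average abundance per season.

    records: iterable of (month, abundance) pairs, month in 1..12.
    Returns a dict mapping season name to mean abundance; seasons
    with no observations are omitted.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for month, abundance in records:
        season = SEASON_BY_MONTH[month]
        totals[season] += abundance
        counts[season] += 1
    return {season: totals[season] / counts[season] for season in totals}
```

Note that December is grouped with the following January and February regardless of calendar year, so a multi-year record pools all winters together under this sketch.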