APPLICATION OF BAYESIAN METHODS FOR CYANOBACTERIA AND
CRYPTOSPORIDIUM PREDICTION AND HEALTH RISK ASSESSMENT
by
Yirao Zhang
B.Eng, Beijing Normal University, 2020
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF APPLIED SCIENCE
in
THE COLLEGE OF GRADUATE STUDIES
(Civil Engineering)
THE UNIVERSITY OF BRITISH COLUMBIA
(Okanagan)
July 2022
© Yirao Zhang, 2022
The following individuals certify that they have read, and recommend to the College of
Graduate Studies for acceptance, a thesis/dissertation entitled:
APPLICATION OF BAYESIAN METHODS FOR CYANOBACTERIA AND
CRYPTOSPORIDIUM PREDICTION AND HEALTH RISK ASSESSMENT
submitted by Yirao Zhang in partial fulfillment of the requirements of
the degree of Master of Applied Science .
Dr. Nicolás Peleato, School of Engineering
Supervisor
Dr. Solomon Tesfamariam, School of Engineering
Supervisory Committee Member
Dr. Alyse Hawley, School of Engineering
Supervisory Committee Member
Dr. Sepideh Pakpour, School of Engineering
University Examiner
Abstract
This research investigated the use of Bayesian methods in predictive modelling of
cyanobacteria concentration and Cryptosporidium presence/absence. Cyanobacteria
blooms and Cryptosporidium spp. in source waters are ubiquitous concerns for water
treatment and management. However, identifying and enumerating cyanobacteria and
Cryptosporidium has always been challenging. Previous research has shown Bayesian
methods to be a promising approach to predicting water quality variables with uncertainty.
However, the inherently severe imbalance in cyanobacteria abundance and
Cryptosporidium classification data reduces the performance of traditional predictive
models. Moreover, few studies have focused on the probabilistic disease burden estimation
of Cryptosporidium after prediction.
In this study, a Bayesian approach was proposed to fit the cyanobacteria abundance
data with mixture models that handle zero-inflated data. Predictor variables considered
included weather and water quality measures that can easily be obtained day-to-day.
Several models were compared based on fit to training data. Furthermore, the optimal
model (zero-inflated negative binomial) was used to predict cyanobacteria alert levels on a
separate test set of data. The ability to predict narrow alert levels was limited. However,
high accuracy was achieved in predicting cyanobacteria counts above or below 1,000
cells/mL.
For Cryptosporidium, a probabilistic quantitative microbial risk assessment approach
was proposed to predict Cryptosporidium presence/absence and estimate the disease
burden presented in disability-adjusted life years (DALYs). For severely imbalanced
Cryptosporidium data, the model achieved high precision and recall. The probabilistic
QMRA, based on Monte Carlo and Markov chain Monte Carlo methods, was used to estimate
disease burden under different scenarios and to infer backwards the necessary level of
treatment and critical control points for sewer overflow. The modelling approach can be
applied to assess risk under different scenarios and to advise water management.
Lay Summary
Cyanobacterial blooms and Cryptosporidium spp. in source waters are persistent concerns
to water management and treatment. However, detection and counting of these
microorganisms are usually labour-intensive and time-consuming. A probabilistic disease
burden estimation of Cryptosporidium with uncertainty is not available. In this thesis, several
statistical models were applied to predict cyanobacteria abundance and Cryptosporidium
presence in source waters. Furthermore, a risk assessment model to estimate the disease
burden of Cryptosporidium was developed using simulation and sampling methods. The
model can also determine the removal efficiency and sewer overflow control required to
satisfy certain health outcome goals. This study proposed new approaches to predict
cyanobacteria abundance and Cryptosporidium presence in source waters more accurately
and easily, and highlighted the potential of using probabilistic risk assessment in
quantifying the disease burden of pathogens.
Preface
The thesis is based on the research work completed in the School of Engineering at The
University of British Columbia, Okanagan, under the supervision and guidance of Dr.
Nicolás Peleato. All published works are included in this thesis.
Chapter 3 is based on the following paper submitted to Heliyon.
Zhang, Yirao, and Nicolas M Peleato. 2022. Predicting Cyanobacteria Abundance
with Bayesian Zero-Inflated Negative Binomial Models. Available at
SSRN: https://ssrn.com/abstract=3939421 or http://dx.doi.org/10.2139/ssrn.3939421
Chapter 4 is based on the following paper submitted to Water Research, and also accepted
to the 2021 Thirty-sixth Conference on Neural Information Processing Systems (NIPS), Machine
Learning in Public Health (MLPH) Workshop.
Zhang, Yirao, and Nicolas M Peleato. 2022. A probabilistic approach to evaluating
Cryptosporidium health risk in drinking water.
Table of Contents
Abstract
Lay Summary
Preface
Table of Contents
List of Figures
List of Tables
Acknowledgements
Chapter 1: Introduction
1.1 Background
1.2 Objectives
Chapter 2: Literature Review
2.1 Source water quality and waterborne bacteria and pathogens of concern
2.1.1 Cyanobacteria and algae blooms
2.1.2 Identification and treatment techniques for cyanobacteria
2.1.3 Cryptosporidium parvum
2.1.4 Identification and treatment techniques for Cryptosporidium
2.2 Predictive modelling for cyanobacteria and Cryptosporidium
2.2.1 Cyanobacteria prediction
2.2.2 Cryptosporidium prediction
2.3 Bayesian modelling in source water quality
2.4 Risk assessment for waterborne bacteria and pathogen
Chapter 3: Predicting Cyanobacteria Abundance with Bayesian Zero-inflated Negative Binomial Models
3.1 Introduction
3.2 Methodology
3.2.1 Study site and data source
3.2.2 Mixture Models for Zero-inflated Count Data
3.2.2.1 Zero-Inflated Negative Binomial Model
3.2.2.2 Hurdle Negative Binomial Model
3.2.3 Bayesian approach
3.2.4 Model development, selection and validation
3.2.4.1 Projection predictive inference
3.2.4.2 Leave-one-out cross-validation
3.2.4.3 Posterior Predictive Checks
3.3 Results and discussion
3.3.1 Variable selection
3.3.2 Model selection
3.3.3 Model checking
3.3.4 Cyanobacteria alert level prediction
3.3.5 Influence of weather and water quality factors on cyanobacteria counts
3.4 Summary
Chapter 4: A probabilistic approach to evaluating Cryptosporidium health risk in drinking water
4.1 Introduction
4.2 Methodology
4.2.1 Data sources
4.2.2 Predicting Cryptosporidium presence in source water
4.2.2.1 Gaussian process classification
4.2.2.2 Model performance evaluation and threshold determination
4.2.3 Modelling the Cryptosporidium exposure
4.2.3.1 Removal through water treatment
4.2.3.2 Drinking water consumption
4.2.3.3 Sewer overflow rate
4.2.4 Modelling the risk of illness
4.2.5 Probabilistic QMRA
4.3 Results and discussion
4.3.1 Cryptosporidium prediction
4.3.2 Scenario assessments with probabilistic QMRA
4.3.2.1 Climate change
4.3.2.2 Treatment technique improvement
4.3.2.3 Sewer overflow control
4.4 Summary
Chapter 5: Conclusion
5.1 Summary of Contributions
5.2 Limitations and suggestions for future research
Bibliography
Appendix
List of Figures
Figure 3-1 (a) Bar plots of sampling frequency in each year from 2002 to 2015; (b)
Histograms of cyanobacteria abundance.
Figure 3-2 (a) Visualization of the time series for cyanobacteria abundance in Cheney
Reservoir; (b) Autocorrelation function (ACF) of cyanobacteria abundance based on monthly
time lag.
Figure 3-3 Correlation heatmap of water quality and weather parameters.
Figure 4-1 (a) Bar plots of sampling times in each year from 2015 to 2021; (b) Histograms
of Cryptosporidium presence and absence.
Figure 4-2 Linear correlation (r) heatmap of water quality, weather parameters, and
Cryptosporidium presence/absence (Class).
Figure 4-3 Density plot of overflow rate.
Figure 4-4 Schematic model for the process of DALYs estimation. The arrows represent the
relationship between two variables.
Figure 4-5 (a) The precision-recall curve when varying the threshold for predicting
“presence”. (b) The F-score to threshold curve.
Figure 4-6 Quantile-quantile plots (Q-Q plots) (a) and box plots (b) of disability-adjusted life
years (DALYs) before climate change, after climate change under controlled emission, and
after climate change under uncontrolled emission.
Figure 4-7 DALYs for Cryptosporidium under temperatures from 15 to 65°C.
Figure 4-8 (a) Q-Q plots of DALYs before treatment improvement, and for improved
treatment at means of 2.5 ± 0.5 log, 3.0 ± 0.5 log, 3.5 ± 0.5 log. (b) Box plots of DALYs in the
four groups.
Figure 4-9 The dark density plot represents the estimated probability density of
samples drawn from the target distribution, while the light density plot represents the estimated
probability density of samples drawn from treatment before improvement (uniform
distribution, lower = 1.5, upper = 2.5). (a) The density plot of samples drawn from the target
distribution before climate change. (b) The density plot of samples drawn from the target
distribution after climate change under emission control.
Figure 4-10 (a) Quantile-quantile plots (Q-Q plots) of DALYs at the current sewer overflow rate
(0.022) and at three lower sewer overflow rates of 0.01, 0.005 and 0.001. (b) Box plots of
DALYs in the four groups.
Figure 4-11 The dark density plot represents the estimated probability density of
samples drawn from the target sewer overflow rate distribution, while the light density plot
represents the estimated probability density of samples drawn from the sewer overflow rate before
control (normal distribution, mean = 0.022, sd = 0.12). (a) The density plot of samples drawn
from the target distribution before climate change. (b) The density plot of samples drawn from
the target distribution after climate change under emission control.
List of Tables
Table 3-1 Selected variables used to build initial models.
Table 3-2 Alert levels for management of toxic cyanobacteria (WQRA).
Table 3-3 LOO-CV results to compare strength of model fits. Differences in elpd and
standard error (SE) were calculated using the highest performing model (ZINB).
Table 3-4 a) Confusion matrix for all WQRA levels along with figure depicting probability of
each class for a given prediction, and b) reduced confusion matrix for binary decisions > or
< 1,000 cells/mL.
Table 4-1 Confusion matrix for binary classification of presence and absence of
Cryptosporidium.
Acknowledgements
I would first like to thank my supervisor, Professor Nicolás Peleato, whose expertise has
helped me formulate the research questions and methodology. I could not have completed
the thesis without his insightful feedback and guidance. Thanks for his patient support along
the way.
Thanks also to committee members Dr. Solomon Tesfamariam and Dr. Alyse
Hawley, and university examiner Dr. Sepideh Pakpour, for their crucial role in completing my
thesis and their meaningful comments that helped me improve the thesis.
I would like to acknowledge my lab colleagues, Ziyu Li, Faezeh Ketabchi and Atefeh
Ashrafi for sharing their experiences and data with me. I would also like to thank Professor
Michael Noonan in the Department of Biology for his terrific course on biostatistics. Also, I
need to thank Dr. Andrew Gelman, Dr. Aki Vehtari and Dr. Ben Lambert for their wonderful
books on Bayesian statistics, which opened the door to the Bayesian world for me and gave
me opportunities to further my research.
Finally, sincere thanks to my family and friends for their understanding,
encouragement and support over the two years.
Chapter 1: Introduction
1.1 Background
Source water refers to the natural sources of water, such as lakes, rivers, reservoirs and
groundwater, which provide water to public water supplies. Source water can be easily
contaminated by human activities, including sewage and agricultural pollution. Polluted
waters can contain a greater number of viruses, bacteria, parasites and other
microorganisms, such as cyanobacteria and Cryptosporidium. When certain conditions
exist, such as warm water temperature and abundance of nutrients, cyanobacteria that are
naturally found in aquatic environments can rapidly reproduce to form blooms. Cyanotoxins
released by some types of cyanobacteria can spread through contaminated or inadequately
treated drinking water to cause illness in humans and animals. For example, during 2009-2010,
11 disease outbreaks associated with freshwater algal blooms were reported in the United
States, which represented 79% of the 14 such outbreaks
that have been reported to CDC since 1978 (Hilborn et al., 2014).
Cryptosporidium, which can be present in human or animal feces, can cause
gastrointestinal illness when fecally contaminated water is consumed. According to the World
Health Organization (WHO) guidelines (World Health Organization, 2005), about 1.8 million
people die from diarrheal diseases globally every year, and many of these deaths have been
linked to the consumption of contaminated waters. The 1993 outbreak of
cryptosporidiosis in Milwaukee, Wisconsin, United States, was caused by an ineffective
filtration process and resulted in a total of $96.2 million in medical costs and productivity
losses (Corso et al., 2003). Since these waterborne microorganisms can result in high
maintenance costs and pose serious threats to public health, there is an urgent need to
develop technology for source water evaluation and risk assessment to reduce risks and
control treatment costs. However, accurate and rapid detection methods for the estimation
of cyanobacteria and Cryptosporidium remain a practical challenge. Current laboratory-based
methods relying on microscopy are costly and time-consuming. Given the need for
methods that can rapidly enumerate or estimate microorganism levels,
indirect data-driven approaches have been explored to make day-to-day predictions based
on easy-to-measure meteorological and water quality parameters.
Bayesian statistics is a promising analytical approach that has been utilized in
drinking water research. Bayesian methods propagate estimation uncertainty, allow
probabilistic interpretation of both the outcomes and parameters of interest, and can
incorporate prior expert knowledge (Gelman et al., 2013). In the context of risk
assessment and decision making, Bayesian methods are well suited to inform the
management of complex systems with high uncertainty, such as cyanobacteria bloom
prediction and Cryptosporidium presence in source waters and the related public health
risks.
A common challenge with modelling cyanobacteria and Cryptosporidium is the
innate imbalance in monitoring datasets. A large excess of zero counts is
typical and may result from either failure to detect or an actual absence of
microorganisms. Hence, in this study, predictive models suited to imbalanced data,
such as zero-inflated models and Gaussian process classification with threshold moving, were
explored and applied to cyanobacteria and Cryptosporidium estimation and quantitative
microbial risk assessment (QMRA). Furthermore, the predictions produced from the
Bayesian approach utilized in this thesis are probabilistic, providing interpretable results with
uncertainty for decision making.
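As a minimal illustration of the excess-zero problem, the Python sketch below compares the probability of a zero count under a plain negative binomial distribution with that under a zero-inflated negative binomial. It is not the model fitted in this thesis: the function names and the values of the mean (mu), dispersion (r) and zero-inflation probability (pi) are hypothetical and chosen only for demonstration.

```python
import math


def nb_pmf(k, mu, r):
    """Negative binomial pmf parameterised by mean mu and dispersion r."""
    log_p = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, mu, r, pi):
    """Zero-inflated negative binomial pmf.

    With probability pi the count is a structural zero; otherwise it is
    drawn from the negative binomial distribution above.
    """
    base = (1.0 - pi) * nb_pmf(k, mu, r)
    return pi + base if k == 0 else base


# Hypothetical parameter values, for illustration only.
mu, r, pi = 20.0, 0.5, 0.3

p0_nb = nb_pmf(0, mu, r)          # zero probability without inflation
p0_zinb = zinb_pmf(0, mu, r, pi)  # zero probability with inflation

print(f"P(count = 0), negative binomial: {p0_nb:.3f}")
print(f"P(count = 0), zero-inflated NB:  {p0_zinb:.3f}")
```

Under these assumed values the zero-inflated form always assigns more mass to zero than the plain negative binomial with the same mean and dispersion, which is the behaviour needed when many samples record no organisms at all.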
In this thesis, the research focuses on the application of Bayesian methods in
source water quality evaluation and risk assessment, i.e., imbalanced regression for
cyanobacteria abundance prediction (Chapter 3), imbalanced classification for
Cryptosporidium presence/absence prediction (Chapter 4), and probabilistic QMRA
(Chapter 4). In the following section, a detailed description of specific objectives and a brief
overview of major methods are provided.
1.2 Objectives
The primary goal of this work is to develop Bayesian predictive models for cyanobacteria
abundance and the presence of Cryptosporidium in source water, address the understudied
data imbalance problem in water quality modelling, and investigate the probabilistic
approach to evaluate Cryptosporidium health risk in drinking water. The specific hypotheses
and objectives of this thesis are listed below:
1) Various environmental factors are directly related to cyanobacteria abundance.
Using zero-inflated models that account for data imbalance is hypothesized to
improve the fit. The aim of this study is to identify the key factors necessary for
cyanobacteria growth, and develop a predictive model for imbalanced
cyanobacteria data through Bayesian statistical methods and zero-inflated
models.
Cyanobacterial blooms are a persistent concern to water management and
treatment, with blooms potentially causing the release of toxins and degrading
water quality. Significant efforts have been made to model cyanobacteria growth
in surface waters to identify key factors driving growth and anticipate bloom
events. However, previous models have not considered the zero-inflation of
cyanobacteria count data, which refers to the excess of zero counts in cyanobacteria
abundance. Commonly used Poisson and negative binomial models for count
data underestimate the probability of zeros, making these models less reliable.
As such, issues related to regression on imbalanced data must be addressed.
Several zero-inflated models including zero-inflated and hurdle Poisson models,
zero-inflated and hurdle negative binomial models will be investigated and
compared in this thesis to improve the overall accuracy.
Furthermore, there is limited discussion focused on the interpretability
and uncertainty of the predictive model. Point estimates from the frequentist
approach do not allow for informative, transparent prediction distributions. In this
work, a Bayesian approach to zero-inflated regression models will be presented.
The established model will also be used to assess the importance and impact of
meteorological and environmental variables on the probability of cyanobacteria
blooms. The optimal model with predictor variables of importance will be used to
predict cyanobacteria alert levels on a separate test set of data.
2) Day-to-day Cryptosporidium health risk can be evaluated through the QMRA
approach based on water quality data. Nonparametric methods are hypothesized
to be effective in addressing the imbalanced classification problem. The aim is to
apply Bayesian nonparametric methods and develop a probabilistic QMRA
approach to evaluate Cryptosporidium health risk in drinking water in different
scenarios and determine the critical control point for water management.
Cryptosporidium is an important pathogen that commonly drives public health
risks associated with drinking water treatment, due to its persistence in aquatic
environments (Swaffer et al., 2018) and high probability of infection at low doses
(Lal et al., 2015). However, information on Cryptosporidium concentration is
inadequate due to significant sampling and measurement challenges.
Considering the commonly low amount of Cryptosporidium in samples, manual
concentration, filtration and detection are usually slow and labor-intensive. Data-
driven models to predict Cryptosporidium presence based on historical
meteorological and environmental data have not been intensively studied.
Furthermore, the consequential health risks in drinking water related to
Cryptosporidium concentration in source water should also be evaluated.
In this study, a Bayesian nonparametric method, logistic Gaussian
process regression, will be applied to predict Cryptosporidium presence in source waters using
easy-to-measure parameters, and a probabilistic QMRA connected to the
predictive model will be presented to evaluate Cryptosporidium health risks. With the probabilistic
QMRA, the effects of climate change (including temperature and precipitation
pattern change), treatment techniques improvements and sewage overflow
control will be investigated. The utility of the novel QMRA in backward inference
of the required removal efficiency and sewage overflow control to meet specific
disability-adjusted life years (DALYs) goals will also be demonstrated.
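The forward direction of a probabilistic QMRA can be sketched with plain Monte Carlo sampling, as in the Python example below. This is only an illustrative sketch under stated assumptions, not the model developed in Chapter 4: the log-removal bounds of 1.5 and 2.5 follow the uniform distribution quoted in the caption of Figure 4-9, while the source concentration distribution, the water intake distribution, the exponential dose-response form and its parameter, the probability of illness given infection, and the DALYs per case are all hypothetical placeholders. The Gaussian process classifier and the Markov chain Monte Carlo backward inference described above are not shown.

```python
import math
import random

# All constants below are hypothetical placeholders for illustration only.
R_DOSE_RESPONSE = 0.2    # exponential dose-response parameter (assumed)
P_ILL_GIVEN_INF = 0.7    # probability of illness given infection (assumed)
DALY_PER_CASE = 1.5e-3   # DALYs lost per case of illness (assumed)
DAYS_PER_YEAR = 365


def simulate_annual_dalys(n_draws=10_000, seed=1):
    """Draw Monte Carlo samples of DALYs per person per year."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_draws):
        # Source water concentration in oocysts/L (assumed lognormal).
        conc = rng.lognormvariate(math.log(0.1), 1.0)
        # Treatment log removal, uniform between 1.5 and 2.5.
        log_removal = rng.uniform(1.5, 2.5)
        # Daily drinking water intake in L (assumed lognormal).
        intake = rng.lognormvariate(math.log(1.0), 0.5)

        dose = conc * 10.0 ** (-log_removal) * intake
        p_inf_daily = 1.0 - math.exp(-R_DOSE_RESPONSE * dose)
        p_inf_annual = 1.0 - (1.0 - p_inf_daily) ** DAYS_PER_YEAR
        samples.append(p_inf_annual * P_ILL_GIVEN_INF * DALY_PER_CASE)
    return samples


def quantile(sorted_values, q):
    """Empirical quantile by rank on an already sorted list."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]


dalys = sorted(simulate_annual_dalys())
median_daly = quantile(dalys, 0.5)
upper_daly = quantile(dalys, 0.95)
print(f"median DALYs per person-year: {median_daly:.2e}")
print(f"95th percentile:              {upper_daly:.2e}")
```

Because every input is sampled rather than fixed, the output is a distribution of DALYs instead of a single point estimate, so a scenario such as improved treatment could be explored by changing the log-removal bounds and comparing the resulting quantiles.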
Chapter 2: Literature Review
2.1 Source water quality and waterborne bacteria and pathogens of concern
2.1.1 Cyanobacteria and algae blooms
Cyanobacteria, also called blue-green algae, are photosynthetic prokaryotes found naturally
in all types of illuminated environments (Whitton & Potts, 2012). They are single-celled and
synthesize various forms of chlorophyll to absorb energy in sunlight. Cyanobacteria flourish
in fresh, brackish and marine water, and can be found in environments where no other
microalgae can exist. Although most types of cyanobacteria live in warm and nutrient-rich
waters, many species are capable of living in the soil and other terrestrial habitats (Mur et
al., 1999).
Even though cyanobacteria can live across a diverse range of environments, they
prefer warmer climates, and the temperature optimum for most cyanobacteria is at least
several degrees higher than for most eukaryotic algae (Whitton & Potts, 2012). Robarts & Zohary
(1987) found from literature and field data that the growth rate is temperature-dependent, with
an optimum at 25°C or greater. However, direct temperature effects are secondary to
indirect temperature effects mixing with nutrients. Temperature is hypothesized to act
synergistically with other factors (Robarts & Zohary, 1987). Cyanobacteria can tolerate
desiccation, water stress and high levels of ultra-violet irradiation (Whitton & Potts, 2012),
and are diverse and abundant at higher pH values. The growth rate of most cyanobacteria is
optimal at pH values between 7.5 and 10.0. Cyanobacteria have not been found
in acid lakes and are uncommon even in waters with pH between 5 and 6 (Fang et al.,
2018). Several cyanobacteria can fix atmospheric nitrogen and their growth in many
ecosystems is limited by the availability of nutrients including phosphorus and nitrogen.
When freshwater bodies become enriched in nutrients, especially phosphorus, there is
often a shift in the phytoplankton community towards dominance by cyanobacteria (O’Neil et
al., 2012). The nitrogen to phosphorus (N:P) ratio is also a factor that regulates the
dominance of planktonic communities by blue-green or green microorganisms. A decrease
of the ratio through the addition of phosphorus usually leads to cyanobacterial blooming
(Levich, 1996). Other factors such as vertical stratification and increased atmospheric
carbon dioxide are also contributors to cyanobacteria’s increasing dominance in aquatic
ecosystems (Paerl & Paul, 2012). In water bodies with the favorable conditions described
above, cyanobacteria can quickly multiply, forming harmful algal blooms that spread across
the surface.
Cyanobacteria blooms can form in warm, slow-moving waters that are rich in
nutrients such as fertilizer runoff or septic tank overflows. Blooms with the potential to harm
human health or aquatic ecosystems are referred to as harmful algal blooms or HABs. The
dense blooms are usually toxic and can cause major problems for
water quality (Huisman et al., 2018). Cyanobacteria blooms block the sunlight that other
organisms need to live, and can deplete dissolved oxygen causing the death of fish and
benthic invertebrates (Rabalais et al., 2010). They can also form several compounds that
give unpleasant tastes and odours that interfere with the use of water for recreation and drinking.
Toxins produced by cyanobacteria, cyanotoxins, constitute the major source of natural
product toxins found in the surface supplies of freshwater (Carmichael, 1997). Cyanotoxins
have significant health effects, including human illness and mortality from direct
consumption of the toxins or indirect exposure to the toxins or to organisms that accumulate
them (Sarma, 2012).
Cyanotoxins can be generally classified into three major classes based on their
primary toxicological effects: hepatotoxic peptides, neurotoxins and dermatotoxins
(Ferrão-Filho & Kozlowsky-Suzuki, 2011). Most human and animal poisonings by cyanobacteria
involve acute hepatotoxicosis caused by microcystins (MCs) and nodularins (NODs)
(Sarma, 2012). MCs are a class of toxins produced by a variety of cyanobacteria, including
the Microcystis, Planktothrix, Anabaena, and Oscillatoria genera (Harke et al., 2016). MCs are
the most commonly found cyanobacterial toxins; they present a major risk to safe drinking
water and pose a serious threat to public health (Rastogi et al., 2014). Acute illnesses
caused by short-term exposure to MCs include headache, sore throat, vomiting and nausea,
diarrhea, and pneumonia, while prolonged exposure to the reference level of MCs can cause
severe liver injuries and may carry a high
risk of developing nonalcoholic fatty liver disease (NAFLD) (Zhao et al., 2020). The
exposure to MCs is either through direct contact or by means of intake of untreated
contaminated water and food (Papadimitriou et al., 2012). Detectable levels of MCs have
been found in 80% of raw and treated water in 45 drinking water supplies in Canada and the
US in a 1996-1998 survey (Carmichael et al., 2001). However, only 4% of the samples
exceeded the WHO drinking water guideline of 1.0 μg/L. Between 2000 and 2012, an
extensive survey of 81 lakes in New York and the lower Great Lakes found detectable levels of
MCs in nearly 60% of the 2500 samples collected during cyanobacteria bloom season,
15% of which exceeded the WHO advisory limit (Boyer, 2008). NODs are potent hepatotoxins
produced by the cyanobacterium Nodularia spumigena (Sivonen et al., 1989). NODs are often
associated with gastroenteritis, allergic irritation reactions and liver diseases. Detrimental effects from
NODs have been frequently reported in many countries over the past 30 years (Chorus &
Welker, 2021). The WHO guideline for NODs concentration is 1.5 μg/L. Although there are
few records of human illness related to cyanobacterial blooms, Health Canada in 2002
classified MCs as possibly carcinogenic to humans and placed them in Group IIIB denoting
inadequate data in humans, limited evidence in experimental animals (Health Canada,
2002).
2.1.2 Identification and treatment techniques for cyanobacteria
Cyanobacterial cells are usually enumerated using optical microscopy (Marchall, 1982;
Tortora et al., 2007), which is labor-intensive and depends subjectively on the
observer (Alversion et al., 2003; Correa). This method is further complicated by the
variable morphology of individual cells, high species diversity, and complexity of cell
aggregates or units (Jin et al., 2018). To replace the conventional enumeration methods,
recent studies have focused on image-driven techniques. Baek et al. (2020)
used deep learning techniques, namely a regional convolutional neural network (R-CNN) and
a convolutional neural network (CNN), to quantify five cyanobacteria species. After reducing
the noise in the cell features, the model achieved average precision values above 0.89
for all five species. Jin et al. (2018) developed a novel imaging-driven
technique with an integrated fluorescence signature to enable automated enumeration of
cyanobacterial cells. The model was reported to achieve higher accuracy than
standard manual microscopic enumeration techniques, and in less time.
Both conventional and advanced treatment technologies have been used for
cyanobacterial cell removal. A number of studies have examined the effectiveness of
conventional treatment technologies on cyanobacterial cells and cyanotoxin removal.
Coagulation effectively removes cyanobacterial cells but cannot remove toxins, and the cells
remain intact during typical operating conditions, despite the high velocity gradients that are
produced during the rapid mixing stage (Ghernaout et al., 2010; Chow et al., 1999).
Filtration usually follows coagulation and sedimentation and involves the passage
of the water by gravity through a filter of granular material (typically sand, gravel or
anthracite) to remove the remaining particulates in the water.
Cyanobacterial cells and cell-bound cyanotoxins have been found to be effectively removed by
bank filtration, slow sand filtration and sedimentation (Grutzmacher et al., 2002; Hrudey et
al., 1999; Lahti et al., 2001). Although the large size of cyanobacterial cells allows
effective removal during filtration, little or no removal of dissolved cyanotoxins occurs
during filtration (He et al., 2016).
Adsorption is also a widely applied and effective treatment technology that removes trace
contaminants. Commonly used adsorbents include activated carbon and iron-based
adsorbents, with activated carbon more often used in practice than other
materials. Most studies of activated carbon adsorption of cyanotoxins
have been conducted on the microcystins, in particular microcystin-LR, and have
concluded that granular activated carbon is effective in removing cyanotoxins from drinking
water (Falconer et al., 1989; Himberg et al., 1989; Lambert et al., 1996). The capacity of
powdered activated carbon in adsorbing cyanotoxins has also been reported and is directly
related to the pore volume in the mesoporous region (Maatouk et al., 2002; Antoniou et al.,
2005). In addition to the pore size of the adsorbent, the surface chemistry of the adsorbent,
the pH of the solution, and the presence of competing compounds such as NOM, also
influence the adsorption process (Huang et al., 2007).
Disinfection is a highly effective method of bacteria removal and toxin inactivation.
The common oxidants include free chlorine, chlorine dioxide, chloramines and
permanganate. Free chlorine can effectively destroy microcystins and cylindrospermopsin
under optimized treatment conditions but is less effective in destroying saxitoxin and
anatoxin-a (Acero et al., 2005; Westrick et al., 2010). Chlorine dioxide (ClO2) is more
selective and in some cases comparable to or even more effective than chlorine in the
inactivation of microorganisms. However, studies have found that chlorine dioxide is not
effective for the destruction of microcystins, cylindrospermopsin and anatoxin-a
(Westrick et al., 2010). Chloramine is the least effective
oxidant for inactivating certain cyanobacteria species, including Microcystis aeruginosa,
Oscillatoria sp. and Lyngbya sp. due to its relatively low reactivity with common water
constituents (Wert et al., 2013).
Membrane processes have been shown to effectively remove toxic
cyanobacterial cells. A composite nanofiltration membrane has been found to remove
microcystins almost completely (Teixeira & Rosa, 2005). Gijsbertsen-Abrahamse et al. (2006)
found that 98% of cell-bound microcystin was removed using an ultrafiltration membrane.
Ultrafiltration coupled to powdered activated carbon (PAC-UF) is an effective process for the
removal of cyanotoxins from drinking water. Lee & Walker (2006) found that PAC-UF
removed more than 90% of microcystin-LR from water because adsorption of the toxins
increased their effective size and facilitated removal by ultrafiltration.
2.1.3 Cryptosporidium parvum
Cryptosporidium parvum is a protozoan parasite that can cause the diarrheal disease
cryptosporidiosis. Cryptosporidium is protected by an outer shell that allows it to survive
outside the body for long periods of time and makes it tolerant to chlorine disinfection during
treatment (Korich et al., 1990). Once the oocyst is consumed, the parasite can emerge from
the shell and infect the lining of the intestine, causing cryptosporidiosis (Templeton et al.,
2004).
The most common species in humans is Cryptosporidium parvum. However,
Cryptosporidium felis, Cryptosporidium muris and Cryptosporidium meleagridis have also
been identified in immunocompromised persons, especially those with the acquired
immunodeficiency syndrome (AIDS) (Chen et al., 2002). Symptoms of cryptosporidiosis
include watery diarrhea, stomach cramps or pain, dehydration, nausea and vomiting, which
begin 2 to 10 days after becoming infected with the parasite and usually last about 1 to 2
weeks in people with healthy immune systems (Leitch & He, 2011). People of all ages
can be infected, although infections are more common and symptoms are more severe in
children. To date, nitazoxanide, paromomycin, and azithromycin have shown activity against
Cryptosporidium (Sparks et al., 2015).
The major sources of Cryptosporidium are domestic and wild animals. Beef calves
are regarded as posing the greatest risk because of their large numbers, distribution and
high levels of oocyst excretion. Cryptosporidium completes its life cycle within the
epithelial cells in the intestine of a single host, undergoing both sexual and asexual
development and producing oocysts that are excreted in the feces (Rose, 1997). Cryptosporidium is
widely distributed in the environment, and fecal-oral transmission of the oocyst stage has
resulted in outbreaks through contamination of drinking water, food, and recreational water
(Fayer et al., 2000). Fecal contamination of water linked to the discharge of untreated and
treated sewage and the run-off of manure has been well documented (Razzolini et al., 2020).
Lisle & Rose (1995) reported that between 5.6% and 87.1% of source water samples, including
surface, spring and groundwater samples not impacted by domestic and/or
agricultural waste, contained 0.003 to 4.74 oocysts L⁻¹. Monthly average oocyst concentrations
in the river network were predicted to range from 10⁻⁶ to 10² oocysts L⁻¹ in most places. Densely
populated areas such as India, China, and Mexico are ‘hotspot’ regions with high oocyst
concentrations (Vermeulen et al., 2019). The infectivity of oocysts is high: the
infectious dose is low, and ingestion of as few as 10-30 oocysts can cause
infection in healthy populations (Yoder & Beach, 2010).
Cryptosporidium is extremely resistant to chemical disinfection and has long survival
times in the aquatic environment. Cryptosporidium oocysts have demonstrated longevity in
all types of water investigated and have survived freezing (exposure to temperatures as low as
-22℃), desiccation and a series of water treatment processes (Robertson & Smith, 1992).
Oocysts survive well in waters at 20℃ with a salinity of 10 ppm for over 20 weeks, but for less than
4 weeks in seawater at a salinity of 30 ppm. Under natural conditions, the die-off rate of
oocysts in water is 0.005 – 0.037 log units per day (Fayer et al., 1998). The structure and
composition of the oocyst wall are primary factors determining the survival and hydrologic
transport of Cryptosporidium parvum oocysts outside the host. The “Interim Enhanced Surface
Water Treatment Rule” (IESWTR) promulgated by the United States Environmental
Protection Agency (USEPA) established a Maximum Contaminant Level Goal (MCLG)
of zero for Cryptosporidium (USEPA, 1997). However, most conventional water systems in
the US achieved 2-2.5 log10 removal, which did not ensure that the filtered water was free of
Cryptosporidium (LeChevallier et al., 1991). High disinfection levels or more efficient
disinfection procedures are required; these are discussed in subsection 2.1.4.
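To put the die-off rates quoted above in perspective, the short sketch below converts a rate in log units per day into the time needed for a given log reduction. It assumes a constant log-linear (first-order) die-off, which is a simplification, and the function names are illustrative rather than taken from the cited studies.

```python
def days_for_log_reduction(target_log, rate_log_per_day):
    """Days needed to reach `target_log` log10 reduction at a constant die-off rate."""
    if rate_log_per_day <= 0:
        raise ValueError("die-off rate must be positive")
    return target_log / rate_log_per_day


def surviving_fraction(days, rate_log_per_day):
    """Fraction of oocysts remaining after `days` (assumes log-linear die-off)."""
    return 10.0 ** (-rate_log_per_day * days)


# Range of natural die-off rates quoted in the text (log10 units per day).
slow_rate, fast_rate = 0.005, 0.037

for rate in (slow_rate, fast_rate):
    days = days_for_log_reduction(1.0, rate)
    print(f"rate={rate} log/day: {days:.0f} days for a 1-log (90%) reduction")
```

Under this assumption a 1-log reduction takes roughly 27 to 200 days, which is consistent with the long survival times described above.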
The largest reported Cryptosporidium outbreak occurred in Milwaukee in the US in 1993
and affected an estimated 400,000 or more people (Mac Kenzie et al.,
1994). In Canada, there have been two notable outbreaks: in the summer of 1996, a
Cryptosporidium outbreak affected approximately 2,000 people in Cranbrook, British
Columbia, and a few weeks later a separate incident in Kelowna, British Columbia, caused
illness in 10,000 to 15,000 people (Ong et al., 1999); in April 2001, an outbreak
in the city of North Battleford, Saskatchewan, caused diarrheal illness in 5800-7100 people
and 1907 confirmed cases of cryptosporidiosis (Ong et al., 1999). In
developing countries, the health burdens of growth faltering, malnutrition, and diarrheal
mortality related to Cryptosporidium remain underappreciated as diagnostic tools are not
available. The incidence of Cryptosporidium infection is also growing in developed countries
largely due to outbreaks in recreational water facilities. Without effective diagnostics,
treatment for immunocompromised patients and promising vaccines, the ability to reduce
the disease burden of Cryptosporidium infection remains limited (Shirley et al., 2012).
2.1.4 Identification and treatment techniques for Cryptosporidium
To evaluate the health risk related to Cryptosporidium oocysts in water and implement
appropriate treatment techniques, oocyst concentrations must be known. Quantification of
Cryptosporidium requires several steps including concentration or filtration, and manual
detection, which is challenging considering the commonly low numbers of Cryptosporidium
in samples. Other enumeration methods for Cryptosporidium include flow cytometry, solid
phase cytometry, electric resistance particle characterization, hemacytometry, chamber
slides and epifluorescent well slide (Lindquist et al., 1999).
As Cryptosporidium spp. oocysts occur in low numbers in the aquatic environment
(Smith et al., 2003) and in vitro methods that augment pathogen numbers prior to identification
are not available for Cryptosporidium in source water, large volumes of water
must be collected for detection, and concentrating small numbers of oocysts effectively is
important. Common methods for Cryptosporidium identification and enumeration include
concentrating and staining, microscopic enumeration, immune assay techniques and
molecular techniques (Ahmed & Karanis, 2018). In drinking water, concentration through
methods such as continuous flow centrifugation and membrane filtration is most commonly
practiced. Microscopy-based identification of Cryptosporidium is the most widely used due
to its relatively low cost (Ahmed & Karanis, 2018). Molecular methods such as PCR tests can also
detect Cryptosporidium in drinking water specimens. However, despite their high sensitivity and
accuracy, the false positive rate is usually high due to the detection of non-viable microorganisms and
laboratory contamination (Checkley et al., 2015). USEPA has developed a grab sample
method for Cryptosporidium in raw water samples. The method involves filtration,
immunomagnetic separation, staining with an immunofluorescence assay and
4′,6-diamidino-2-phenylindole (DAPI), detection by epifluorescence microscopy, and determination
of internal morphology using Nomarski differential interference contrast (DIC) microscopy
prior to determining oocyst concentration (Clancy et al., 1999).
The USEPA Interim Enhanced Surface Water Treatment Rule (IESWTR), which was
promulgated in 1998, sets a Maximum Contaminant Level Goal (MCLG) of zero for
Cryptosporidium and requires treatment to achieve 2-log (99%) removal of
Cryptosporidium when using filtration only (USEPA, 1998). Under
the USEPA Long Term 2 Enhanced Surface Water Treatment Rule (LT2ESWTR), water
plants using conventional treatment are required to monitor Cryptosporidium, E. coli and
turbidity for a period of 24 months (USEPA, 2006). However, LeChevallier et al. (1991)
examined 66 conventional water systems in the US and reported that most of the utilities
achieved 2-2.5 log10 cyst and oocyst removal by clarification and filtration and that compliance
with the criteria outlined by the Surface Water Treatment Rule (SWTR) did not ensure that filtered water was free of waterborne
parasites. High disinfection levels or more efficient disinfection procedures were ultimately
recommended.
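Because treatment targets in this section are expressed both as log removals and as percentages (for example, 2-log corresponds to 99%), the conversion can be sketched as follows. The helper names are illustrative only.

```python
import math


def log_removal_to_percent(log_removal):
    """Percent of organisms removed for a given log10 removal credit."""
    return 100.0 * (1.0 - 10.0 ** (-log_removal))


def percent_to_log_removal(percent_removed):
    """Log10 removal equivalent to a percent removal (must be below 100%)."""
    if not 0.0 <= percent_removed < 100.0:
        raise ValueError("percent_removed must be in [0, 100)")
    return -math.log10(1.0 - percent_removed / 100.0)


for lr in (2.0, 2.5, 3.0):
    print(f"{lr}-log removal = {log_removal_to_percent(lr):.2f}% removed")
```

Each additional log unit removes 90% of what remains, so the 2-2.5 log10 range reported for conventional systems corresponds to roughly 99% to 99.7% removal.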
Previous studies have investigated the treatment efficiency of various treatment
processes. Nieminski and Ongerth (1995) conducted a 2-year evaluation of Cryptosporidium
at a full-scale treatment plant and a pilot plant operating under conventional treatment and reported an
average of 2.9 log removal for Cryptosporidium. States et al. (1997) observed
Cryptosporidium removal of 1.49 log in a full-scale conventional treatment plant. Since
Cryptosporidium oocysts are resistant to removal and inactivation by conventional water
treatment, extensive research has focused on the optimization of treatment processes
and the application of new technologies. Enhanced coagulation through the use of higher doses of
coagulants can significantly improve the removal efficiency to 5.8 log units (Betancourt &
Rose, 2004). Edzwald et al. (2000) evaluated removals of Cryptosporidium by clarification
combined with dual media filtration under challenging conditions of high cyst and oocyst
levels. Combined clarification and filtration achieved an average removal greater than
5 log, which was comparable to that achieved by sedimentation and filtration.
Diatomaceous earth filtration has been reported to be more effective than other conventional
filtration methods in removing Cryptosporidium oocysts, with up to 6-log removal (Ongerth and Hutton,
1997; Ongerth and Hutton, 2001). Although no inactivation of Cryptosporidium was
observed after 18 h of contact time with chlorine at high levels and with chloramines, chlorine
dioxide can inactivate about 90% of oocysts (Betancourt & Rose, 2004). Low doses of UV (1-9
mJ/cm²) have been observed to inactivate 2-4 log10 units of Cryptosporidium parvum
oocysts (Linden et al., 2002). Microfiltration (MF) and
ultrafiltration (UF) membranes can provide complete removal of all protozoan oocysts of
concern. Jacangelo et al. (1995) demonstrated that various MF and UF membranes provide log
removals of Cryptosporidium parvum oocysts ranging from >4 log to 6 log units.
2.2 Predictive modelling for cyanobacteria and Cryptosporidium
2.2.1 Cyanobacteria prediction
As the occurrence of algal blooms results in water quality degradation and possible public
health risks, previous studies have investigated the water quality and meteorological factors
including water temperature, precipitation, flow, and nutrient concentration that could affect
cyanobacterial proliferation and developed a few frameworks to predict and forecast future
cyanobacteria abundance/blooms with the aid of historical and existing data. However,
ecosystems are complex systems consisting of interlinked subsystems (Parrott & Kok,
2000). The complex processes involved in cyanobacterial blooms, such as nutrient loading,
transport and diffusion, and compounding effects from weather events, can be challenging to
model. The mathematical modelling approaches for microbes can be divided into
two major classes: physical models that simulate the dynamics of underlying processes and
data-driven models that are constructed from empirical data and use extensive
monitoring data to predict future scenarios and support decisions. For water quality
modelling, data-driven or statistical models provide a fast and low-cost approach, since
the recent growth of improved monitoring techniques such as wireless sensors has
significantly improved data availability.
Several data-driven models have been implemented to predict cyanobacteria occurrence or
abundance in source waters. Kim et al. (2020) presented a model to predict cyanobacteria
occurrence or absence in rivers using water temperature, velocity and phosphorus
concentration, which are readily available through direct measurements. Weighted function
models, including sigmoidal, linear, and exponential, were developed to predict
cyanobacteria occurrence conditional on the preceding state of cyanobacteria abundance.
This model was shown to achieve more than 75% accuracy through cross-validation. Zhao
et al. (2019) proposed a new cyanobacterial bloom occurrence prediction method to analyze
the probability and driving factors of the blooms. The dominant species were determined
through a dominant species identification model and the principal driving factors were
analyzed using canonical correspondence analysis (CCA). The probability of a bloom was
calculated using the model, and the critical control point of the probability of cyanobacterial
bloom was 0.75. Harris & Graham (2017) compared 12 linear and nonlinear regression
modeling techniques to predict cyanobacterial abundance and cyanobacterial toxins using
a 14-year water quality data set. Support vector machine, random forest, boosted tree and
Cubist modeling techniques were reported as the most predictive approaches, and Cubist
modeling was the only approach that could predict maximal concentrations of cyanobacteria
abundance and geosmin.
Bayesian methods have also been used for cyanobacteria abundance prediction and
assessment of the relative importance of environmental factors on cyanobacteria growth.
Cha et al. (2017) developed a Bayesian hierarchical model to compare the relative
importance of predictors, including biological, environmental, meteorological and
hydrological factors, obtained from 16 sites in four major rivers in South Korea. Results
suggested that temperature and residence time, rather than nutrient levels, are the important
variables for cyanobacteria growth in summer across the sites (Cha et al., 2017).
Considering the demand for 10- to 30-day-ahead forecasts of cyanobacterial bloom alerts,
Recknagel et al. (2017) developed a novel early warning scheme for cyanobacteria
abundance and cyanotoxins in drinking water reservoirs. The scheme uses an ensemble of inferential
models developed by the hybrid evolutionary algorithm (HEA) solely using in-situ data. The
resulting models for cyanobacteria have been reported to be capable of forecasts up to 30
days ahead.
Furthermore, Pyo et al. (2021) applied deep learning techniques and image processing
methods to remote sensing images of cyanobacteria. They
developed a convolutional neural network (CNN) model with a convolutional module
(CNNan) to predict cyanobacteria abundance using field monitoring data, hyperspectral
image sensing and simulated water quality from a hydrodynamic model. The prediction
performance of the CNNan model was better than that of the unmodified CNN model and
the environmental fluid dynamics code (EFDC) simulation. The results demonstrated that deep
learning models are feasible for predicting the presence of harmful algae in the water. Wang
et al. (2010) developed a hybrid model consisting of a back-propagation neural network and
a rough decision to predict blooms in Dianchi Lake, China. A predictive accuracy of 0.8 was
achieved in binary classification.
A common challenge with data-driven models is class imbalance, where the number
of instances in one or several groups (called the majority classes) severely exceeds the
number of instances in other groups (called the minority classes). Standard machine
learning classification algorithms are designed to maximize overall accuracy and therefore tend to
misclassify the minority class, which is often associated with serious consequences. In
cyanobacteria prediction, algorithms that do not account for imbalance will accurately
predict “absence” (majority class) at the expense of the predictability of “presence” (minority
class). Shi et al. (2021) explored oversampling
algorithms and ensemble classifiers to predict cyanobacteria events. The model was
developed using a variety of physicochemical and hydrometeorological factors as
predictors. Cyanobacteria abundance data were collected from 2013 to 2019 in major rivers
in South Korea and classified into binary classes. They showed that the imbalance ratio
adversely affected model performance and that resampling was effective
in addressing the class imbalance. The AdaBoost ensemble classifier yielded the most stable
performance, and temperature was the primary influencing factor of cyanobacteria
blooms. Kim et al. (2022) also used classification algorithms and oversampling
methods to resolve the problem of an imbalanced cyanobacteria dataset. Mixture
models such as the hurdle model have also been developed to handle imbalanced data and
predict cyanobacteria abundance. Cha et al. (2014) developed a hurdle Poisson model
to predict cyanobacteria abundance, allowing zero counts (absence) and nonzero counts to
be modelled using different models and environmental factors. The results suggest that low
temperature and low suspended solids (SS) can lead to low cyanobacteria abundance. As
the model was fitted under a Bayesian framework, the alert levels were predicted along with
probabilities, which can inform management. Apart from the Poisson distribution
used by Cha et al. (2014), the negative binomial distribution is another
commonly used model for count data. The negative binomial distribution uses an extra
parameter to accommodate overdispersion, which leads to broader applicability. To
understand the response of cyanobacteria to environmental changes such as climate
warming and nutrient enrichment, Richardson et al. (2019) used a zero-inflated model
along with linear mixed models to fit cyanobacteria biovolume data. The first process
(binomial distribution) in the zero-inflated model was used to model the effect of treatment
(temperature, nutrient treatments and extreme rainfall treatments) on the probability of
occurrence of cyanobacteria. The impact of treatment on the biovolume of different
cyanobacteria genera (non-zero data) was tested using linear mixed models. Commonly
used models for data with excess zeros include the zero-inflated model and the hurdle model.
Zero-inflated models assume zeros are generated in both processes, while in hurdle models, all zeros
arise from the first process. Both models should be validated and compared to provide a
reasonable explanation of the zero generation mechanism.
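The difference between the two zero-generation mechanisms can be made concrete with a minimal sketch of the probability mass functions. It assumes a Poisson count process for both models, and the parameter values (`pi`, `lam`) are hypothetical and chosen only for illustration.

```python
import math


def poisson_pmf(y, lam):
    """Poisson probability of observing count y with mean lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def zip_pmf(y, pi, lam):
    """Zero-inflated Poisson: zeros come from BOTH the structural-zero
    process (probability pi) and the Poisson count process."""
    if y == 0:
        return pi + (1.0 - pi) * math.exp(-lam)
    return (1.0 - pi) * poisson_pmf(y, lam)


def hurdle_pmf(y, pi, lam):
    """Hurdle Poisson: ALL zeros come from the first (binary) process;
    positive counts follow a zero-truncated Poisson."""
    if y == 0:
        return pi
    return (1.0 - pi) * poisson_pmf(y, lam) / (1.0 - math.exp(-lam))


pi, lam = 0.6, 2.0  # hypothetical values
print("P(y=0) zero-inflated:", round(zip_pmf(0, pi, lam), 4))
print("P(y=0) hurdle:       ", round(hurdle_pmf(0, pi, lam), 4))
```

With the same `pi`, the zero-inflated model assigns a higher probability to zero than the hurdle model, because the count process contributes additional zeros. This is why the two models can fit the same data differently and need to be compared.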
2.2.2 Cryptosporidium prediction
As the identification of Cryptosporidium in source water is time- and labour-intensive,
data-driven models based on historical data have been used to predict Cryptosporidium oocyst
concentrations. Due to their prevalence and ease of enumeration, a suite of fecal indicator
bacteria or organisms has been used as indicators for the presence of Cryptosporidium
oocysts (Coffey et al., 2007). In many regions, E. coli is accepted as the best and most
affordable surrogate of contamination by Cryptosporidium (WHO, 2006). Under the USEPA
Long Term 2 Enhanced Surface Water Treatment Rule (LT2), Cryptosporidium monitoring is
not required if E. coli densities are low in lakes or reservoirs (<10 CFU 100 mL⁻¹) or
in flowing streams (<50 CFU 100 mL⁻¹). As such, the majority of previous studies on
Cryptosporidium prediction have utilized indicator organisms. Agulló-Barceló et al. (2013)
reported that when E. coli and somatic coliphage data were used together, discriminant
analyses showed high accuracy in predicting infectious Cryptosporidium oocysts. However,
more studies have concluded that indicator bacteria alone cannot provide information on the
presence and/or concentrations of the most important pathogens in surface waters (Pachepsky
et al., 2016; Francy et al., 2013; Costán-Longares et al., 2008). Also, Lalancette et al.
(2014) reported that the use of E. coli concentrations as a surrogate for Cryptosporidium
concentrations can result in an inaccurate estimate of Cryptosporidium risk for
agriculture-impacted drinking water intakes or for intakes with distant wastewater sources. More
recently, studies have focused on machine learning applications in Cryptosporidium
risk assessment. Ligda et al. (2020) investigated interactions between environmental
factors and Cryptosporidium oocyst concentrations, and applied machine learning models
and linear discriminant function analysis to predict the contamination intensity of
Cryptosporidium. An overall accuracy of approximately 75% was achieved for the
classification of four levels of Cryptosporidium concentrations.
Although there is a lack of studies that focus on the direct prediction of
Cryptosporidium, previous studies have elucidated factors that drive the occurrence of
Cryptosporidium in water bodies. Most of these studies have used the Soil and Water
Assessment Tool (SWAT). Coffey et al. (2010) reported that SWAT can be used to
predict Cryptosporidium oocyst concentrations in source water. The results suggested that
the mean monthly prediction ranged from 0.004 oocysts L⁻¹ to 4.8 oocysts L⁻¹ in the west of
Ireland. Further model development using observed oocyst levels is required to
quantitatively assess model accuracy. Liu et al. (2019) studied the fate and transport
dynamics of Cryptosporidium using SWAT and predicted that the average annual concentration
of Cryptosporidium oocysts in the Daning River in China was 0.95 oocysts L⁻¹, but with high
spatial variability. The combined impact of rainfall and regional fertilization on the level of
Cryptosporidium was emphasized. Frey et al. (2013) used Classification and
Regression Tree Analysis (CART) to predict pathogen presence/absence for an agricultural
watershed using simulated streamflow, total suspended solids (TSS), total N and total P,
and fecal indicator bacteria loads. The model identified air temperature, precipitation,
streamflow, and total P as the most important variables for classifying pathogen
presence/absence, and indicated a close relationship between cattle pollution and pathogen
occurrence in the studied watershed.
2.3 Bayesian modelling in source water quality
Bayesian methods have become increasingly popular in recent years with the demand for
quantification of uncertainty. Bayesian inference has the advantage of combining prior
information on parameters with observations to provide an improved model parameter
estimation and output with uncertainty. Bayesian methods have been applied in a range of
water quality models (Freni & Mannina, 2010). Two objectives are usually associated with
Bayesian modelling: (1) Present a predictive model of water quality using environmental
variables and interpret the results with uncertainty analysis. (2) Investigate the relationship
between water quality and these variables and identify the key factors through sensitivity
analysis and/or comparison between prior and posterior distributions of parameters.
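As a minimal illustration of how prior information is combined with observations, the sketch below uses a conjugate Beta-binomial update for the probability that a water sample tests positive. This is a textbook example rather than a model from the cited studies; the prior, the counts (3 detections in 20 samples) and the function names are all hypothetical.

```python
import random


def beta_binomial_update(a, b, detections, n_samples):
    """Posterior Beta(a', b') after observing `detections` positives in `n_samples`."""
    return a + detections, b + (n_samples - detections)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def credible_interval(a, b, level=0.95, n_draws=20000, seed=0):
    """Approximate equal-tailed credible interval from random Beta draws."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a, b) for _ in range(n_draws))
    lower = draws[int((1.0 - level) / 2.0 * n_draws)]
    upper = draws[int((1.0 + level) / 2.0 * n_draws) - 1]
    return lower, upper


prior = (1.0, 1.0)  # uniform prior: little prior information
posterior = beta_binomial_update(*prior, detections=3, n_samples=20)
low, high = credible_interval(*posterior)
print("prior mean:    ", beta_mean(*prior))
print("posterior mean:", round(beta_mean(*posterior), 3))
print("95% credible interval:", round(low, 3), "to", round(high, 3))
```

Comparing the prior and posterior summaries shows how the data shift and narrow the estimate, and the credible interval expresses the remaining uncertainty.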
Dilks et al. (1992) applied a Bayesian Monte Carlo technique to predict Grand River
dissolved oxygen with nine uncertain model parameters. As little prior information was
available, uniform distribution was employed to initially describe each parameter. Results
indicate every parameter was significantly correlated to the ability of the model, and the
model reduced the uncertainty in the predicted minimum dissolved oxygen concentration by 72%, from 2.5 to 0.69 mg/L.
Malve et al. (2007) fitted 8 years of in situ observations of cyanobacteria with adaptive
Markov chain Monte Carlo (MCMC) methods to estimate model parameters. The model
indicated that, to satisfy the cyanobacteria criterion (concentration not exceeding
0.86 mg/L) with 0.95 probability, the total phosphorus concentration should be between 45
μgC/L to 16 μg/L. Zooplankton grazing had a major effect on cyanobacteria.
Gronewold et al. (2009) applied ordinary least squares (OLS) linear regression and
MCMC to calibrate a first-order bacterial decay model and an empirical bacterial die-off model.
Both models were validated by leave-one-out cross-validation and assessed by Bayesian
posterior predictive p-values. Results suggest that models without a bacterial kinetics
parameter related to the decay rate more appropriately reflected fecal indicator bacteria (FIB)
fate and transport processes. Zhao et al. (2014) developed a multi-pollution source water quality model
integrated with Bayesian statistics to support water quality management in the Songhua River
system in northeastern China. The model estimated the distribution of the decay rate (k)
which was considered a key factor. The distribution curves enabled assessing the influence
of each loading and designing water quality management strategies seasonally. Bayesian
hierarchical modelling is also widely utilized in water quality modelling. Liu et al. (2021) used
a hierarchical Bayesian model averaging framework to explore the relationship between
event-based water quality and environmental variables, including sediments, nutrients and
salinity to predict the water quality at multiple sites and identify key environmental drivers.
The study found that rainfall and runoff affected in-stream particulate constituents, while
wetness and vegetation cover impacted dissolved nutrient concentration and salinity.
MCMC methods can also be combined with other simulation-based methods. Yang et al.
(2016) integrated a genetic algorithm (GA) into a Bayesian approach to improve
sampling performance during parameter estimation. The eutrophication model based on
MCMC coupled with GA was applied to data from an urban lake in north China. Water
quality assessment was conducted for eutrophication management. Results suggest that
the MCMC-GA method achieved better convergence efficiency during sampling and
narrower 95% credible intervals than the classic MCMC method. Rainfall runoff nutrient loading
was a key factor in eutrophication and should be controlled for lake restoration.
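Several of the studies above estimate a decay rate with MCMC. A minimal random-walk Metropolis sketch of that idea is given below for a first-order decay model, ln C(t) = ln C0 - k t. It is not the implementation used in any cited study: the data are synthetic and noise-free, the measurement error `sigma` is assumed known, the prior on k is flat for k > 0, and all names and tuning values are illustrative.

```python
import math
import random


def log_posterior(k, times, log_conc, log_c0, sigma):
    """Log posterior of decay rate k (flat prior on k > 0, Gaussian errors on ln C)."""
    if k <= 0.0:
        return -math.inf
    sse = sum((y - (log_c0 - k * t)) ** 2 for t, y in zip(times, log_conc))
    return -0.5 * sse / sigma ** 2


def metropolis(times, log_conc, log_c0, sigma,
               n_iter=5000, step=0.02, k_init=0.5, seed=0):
    """Random-walk Metropolis sampler for the decay rate k."""
    rng = random.Random(seed)
    k = k_init
    lp = log_posterior(k, times, log_conc, log_c0, sigma)
    samples = []
    for _ in range(n_iter):
        proposal = k + rng.gauss(0.0, step)
        lp_new = log_posterior(proposal, times, log_conc, log_c0, sigma)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            k, lp = proposal, lp_new
        samples.append(k)
    return samples


# Synthetic observations generated with a known decay rate (per day).
TRUE_K = 0.3
LOG_C0 = math.log(100.0)
times = [0, 1, 2, 3, 4, 5]
log_conc = [LOG_C0 - TRUE_K * t for t in times]

samples = metropolis(times, log_conc, LOG_C0, sigma=0.1)
posterior = samples[1000:]  # discard burn-in
post_mean = sum(posterior) / len(posterior)
print("posterior mean of k:", round(post_mean, 3))
```

The retained samples approximate the posterior distribution of k, so credible intervals follow directly from their percentiles.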
2.4 Risk assessment for waterborne bacteria and pathogens
In the field of health risk assessment, quantitative microbial risk assessment (QMRA) and
nonlinear intelligent models are common tools. With the advantage of allowing uncertainty
by running Monte Carlo simulations, QMRA has been widely used for drinking water safety
management and risk assessment. QMRA is a mathematical approach used to quantify the
health risks from microorganisms in source water and can be used to support water safety
management decisions. The QMRA approach follows four steps: hazard identification, exposure
assessment, dose-response assessment, and risk characterization. Using the model, the
health risk can be quantified and compared with the risk that is agreed to be acceptable.
QMRA also allows the comparison of different scenarios and informs the required design of
the treatment to obtain a certain treatment level. QMRA usually assesses the most ‘risky’
exposure, which is assumed to be oral ingestion, since the exposure through inhalation or
skin is unlikely, and the data to estimate exposure through these routes is often unavailable.
Although the assessment was initially set up to evaluate the health risk of specific microbes,
it can also be used for chemical contaminants (WHO, 2021).
Probabilistic approaches are emerging as a practical complementary approach to
conducting QMRA. Compared to deterministic QMRA, which uses point estimates such as
arithmetic mean values for the input variables (Health Canada, 2018), probabilistic QMRA
models are superior since they can account for variability and uncertainty of the input
variables and parameters. Three basic approaches are currently being adopted in research
into probabilistic QMRA: Monte Carlo simulation, Bayesian networks, and the Markov chain
Monte Carlo method.
The Monte Carlo method relies on repeated random sampling to generate
simulations. It uses randomness to solve deterministic problems (Metropolis & Ulam, 1949).
In the Monte Carlo approach to QMRA, all input variables (such as the concentration of microbes
and drinking water intake) are described by appropriate probability distributions that
quantify uncertainty and are then introduced to the exposure assessment model. The
result, which is the exposure distribution, is passed to the dose-response model to
quantify the probability of infection and illness using the dose-response relationship and
morbidity rate, and the final output of risk characterization is a probability distribution of
DALYs. By repeatedly sampling from the distributions of the input variables, the
risk of illness in DALYs can be expressed as a distribution. Amha et al. (2015)
developed a probabilistic QMRA to determine the risk of Salmonella infections resulting from
the consumption of crops irrigated with treated wastewater. Probabilistic exposure
models for raw consumption of three vegetables (lettuce, cabbage and cucumber) irrigated
with treated wastewater were constructed, and the disease burden of Salmonella was
estimated using the Monte Carlo method. The results suggested a median disease
burden above the acceptable disease burden set by the World Health Organization
of 10⁻⁶ DALYs per person per year. Consumption of lettuce irrigated with treated wastewater
posed the highest risk of infection, while cucumber showed the lowest risk in the
study. Mok et al. (2014) constructed a probabilistic QMRA model to determine the
health risks of norovirus infection from consumption of vegetables irrigated with wastewater
in Shepparton, Australia. Annual norovirus disease burden was estimated for the
consumption of lettuce, broccoli, cabbage, cucumber and Asian vegetables through
Monte Carlo simulation. The results indicate that wastewater treatment did not have
sufficient removal efficiency to meet the WHO threshold of 10⁻⁶, while extra disinfection
treatments provided acceptable results. Barker et al. (2014) proposed a probabilistic QMRA
using the Monte Carlo method to assess the risk of gastroenteritis illness caused by rotavirus,
norovirus, and Ascaris lumbricoides associated with the consumption of street food salads.
The results indicate that both rotavirus-dominated and norovirus-dominated annual disease
burdens exceeded 10⁻⁴ DALYs, and significant interventions are needed to
maintain the health and safety of street food in Kumasi. Apart from microbial risk estimation,
probabilistic QMRA can also be used to estimate the health risk of systems and
management solutions using reference pathogens as proxies. Bivins et al. (2017) proposed
a probabilistic QMRA using Monte Carlo techniques to estimate the risks of infection from
waterborne illness when a population is exposed to intermittent water supply. Reference
pathogens including Campylobacter, Cryptosporidium, and rotavirus were used as
conservative risk proxies for infection via bacteria, protozoa, and viruses. Results suggested
that diarrheal disease burden associated with intermittent water supply likely exceeds the
WHO guideline for drinking water of 10⁻⁶ DALYs per person per year. Ishaq et al. (2022)
estimated the disease burden of diarrhea from Campylobacter, Giardia,
Cryptosporidium and norovirus with an integrated “Regression-QMRA method” by
examining the relationship between pathogen concentration and environmental variables.
The probability distribution of pathogen concentration was calculated by linear regression on
water source, LID type, pathogen type and season, and 1000 simulated data points were
generated for each pathogen. The results show that after applying the
methodology to a planned LID train, the predicted disease burden of diarrhea from
Campylobacter was highest, followed by Giardia, Cryptosporidium, and norovirus.
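The Monte Carlo workflow described above (input distributions, exposure assessment, dose-response assessment, and risk characterization in DALYs) can be illustrated with a minimal Python sketch. It assumes an exponential dose-response model and lognormal input distributions. Every numeric value, including the dose-response parameter, the illness-given-infection ratio, the DALYs per case and the distribution parameters, is a hypothetical placeholder and is not taken from any of the cited studies.

```python
import math
import random

# All constants below are hypothetical placeholders chosen for illustration.
R_DOSE_RESPONSE = 0.2    # exponential dose-response parameter (assumed)
P_ILL_GIVEN_INF = 0.7    # probability of illness given infection (assumed)
DALY_PER_CASE = 1.5e-3   # disease burden per case of illness (assumed)
EXPOSURE_DAYS = 365      # exposure events per person per year


def simulate_annual_dalys(n_draws=10_000, seed=1):
    """Return n_draws Monte Carlo samples of DALYs per person per year."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_draws):
        # Exposure assessment: sample concentration (organisms/L) and intake (L/day).
        concentration = rng.lognormvariate(math.log(1e-4), 1.0)
        intake = rng.lognormvariate(math.log(1.0), 0.3)
        dose = concentration * intake
        # Dose-response assessment: exponential model for one exposure event.
        p_inf_daily = 1.0 - math.exp(-R_DOSE_RESPONSE * dose)
        # Simplification: the same daily dose is assumed for every day of the year.
        p_inf_annual = 1.0 - (1.0 - p_inf_daily) ** EXPOSURE_DAYS
        # Risk characterization: convert infection risk to disease burden.
        samples.append(p_inf_annual * P_ILL_GIVEN_INF * DALY_PER_CASE)
    return samples


def quantile(values, q):
    """Simple index-based empirical quantile of a sample."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


if __name__ == "__main__":
    dalys = simulate_annual_dalys()
    print("median DALYs per person per year:", quantile(dalys, 0.5))
    print("95th percentile:", quantile(dalys, 0.95))
    print("share of draws above 1e-6:", sum(d > 1e-6 for d in dalys) / len(dalys))
```

The output is a distribution of disease burden rather than a single value, so summaries such as the median or the share of draws above the 10⁻⁶ DALY benchmark can be reported directly.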
Bayesian networks (BNs) are probabilistic graphical models (directed acyclic graphs)
representing complex relationships among multiple variables. Each node corresponds to a
random variable and each edge represents a conditional dependency between the corresponding
random variables. Bayesian networks are appropriate for nonlinear problems (Yang et al.,
2019). Since Bayesian networks can infer missing values, incorporate expert knowledge
and multi-source data, and address uncertainties in prediction, such models are widely
applied to quantitative risk assessment (Beaudequin et al., 2015; Jiang et al., 2021). Greiner
et al. (2013) concluded that the entire QMRA model can be formulated as a Bayesian
network, using the same equations as the Monte Carlo approach but implemented in a
network that includes the joint distribution of all variables in the model. Previous studies
have utilized QMRA to quantify risk based on BNs. Jiang et al. (2021) investigated the
relationships between cyanobacterial blooms and multi-dimensional influencing variables.
An extended BN and an integrated framework were proposed to assess the risk of
cyanobacterial blooms. The model was used to evaluate the effects of global warming on the
risk and reported an increase of 38.5% under global warming. Beaudequin et al. (2016)
presented a QMRA expressed as a BN in the wastewater reuse context and evaluated the
risk of norovirus infection associated with wastewater-irrigated lettuce in a range of
exposure and risk mitigation scenarios. Orak et al. (2020) developed a hybrid BN
model for health risk assessment of arsenic contamination, and the results show that low
inorganic arsenic concentration increases the risk of low birth weight even for low
gestational age scenarios. Donald et al. (2009) presented a conceptual BN model to
illustrate the risks of gastroenteritis posed by the use of recycled water. The nodes were
quantified using an expert’s opinion, and the model allows users to make various predictions
as to the risks posed under various scenarios.
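To make the idea of a discrete BN concrete, the following minimal Python sketch builds a three-node network and answers queries by exhaustive enumeration of the joint distribution. The network structure (contamination and treatment failure as parents of infection), the node names and all conditional probabilities are hypothetical and serve only to show how evidence on one node updates beliefs about another.

```python
from itertools import product

# Hypothetical prior probabilities for the two parent nodes.
P_CONTAM = {True: 0.10, False: 0.90}
P_TREAT_FAIL = {True: 0.05, False: 0.95}
# Hypothetical conditional probability table: P(infect = True | contam, treat_fail).
P_INFECT = {
    (True, True): 0.30,
    (True, False): 0.02,
    (False, True): 0.001,
    (False, False): 0.001,
}
NODES = ("contam", "treat_fail", "infect")


def joint(contam, treat_fail, infect):
    """Joint probability of one complete assignment, factorized along the graph."""
    p_inf = P_INFECT[(contam, treat_fail)]
    return P_CONTAM[contam] * P_TREAT_FAIL[treat_fail] * (p_inf if infect else 1.0 - p_inf)


def query(target, evidence=None):
    """P(target = True | evidence), computed by enumerating all assignments."""
    evidence = evidence or {}
    numerator = 0.0
    denominator = 0.0
    for values in product([True, False], repeat=len(NODES)):
        world = dict(zip(NODES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # assignment is inconsistent with the observed evidence
        p = joint(*values)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    print("P(infect):", query("infect"))
    print("P(contam | infect):", query("contam", {"infect": True}))
```

Because the variables are discrete, a query conditioned on any node is answered without simulation, which mirrors the immediate response to scenario changes noted for BN models.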
Although both Monte Carlo simulation and BNs present predictions as probability
distributions, these methods work from fixed estimates for means, variances and other
parameters (Donald et al., 2011). The uncertainty from parameter estimation can be
incorporated into the risk assessment by adopting a Bayesian approach with the Markov chain
Monte Carlo method. Donald et al. (2011) studied incorporating parameter uncertainties
into the probabilistic QMRA model. The study illustrated that simultaneous estimation of
parameters is a better methodology than the ‘plug-in’ approach, in which point estimates of
parameters are used in Monte Carlo simulation. Parsons et al. (2005) compared BN and
Markov chain Monte Carlo (MCMC) approaches to QMRA modelling of Salmonella spp.
Although the BN model requires variables to be discrete, which may introduce error, it
responds immediately to changes under scenario analysis since it does not use simulation
and can propagate information from any point in the network to all others by Bayesian
inference. The MCMC approach sacrifices the ability to propagate evidence but does not require
discrete variables and offers greater flexibility.
Chapter 3: Predicting Cyanobacteria Abundance with Bayesian Zero-inflated Negative
Binomial Models
3.1 Introduction
Cyanobacteria are photosynthetic microorganisms that can result in degraded freshwater
quality and threaten human health. Cyanobacterial blooms can significantly increase
turbidity, result in dissolved oxygen depletion due to biological degradation of cyanobacteria
biomass, and produce unpleasant taste and odour compounds (Huisman et al., 2018).
Furthermore, some species can release toxins such as microcystins, nodularins,
cylindrospermopsin, anatoxins, and saxitoxins (Catherine et al., 2013). It has previously
been observed that there is a significant positive relation between non-alcoholic liver
disease and large-scale blooms associated with toxin release (harmful algal blooms or
CyanoHABs) (Zhang et al., 2015). Furthermore, associations between drinking surface
water from cyanobacteria contaminated water bodies and a higher incidence of colorectal
cancer have been noted (Lee et al., 2017). CyanoHABs also pose severe problems for
ecological systems. Even low concentrations of microcystin-LR (5 µg/L) and microcystins
(50 µg/L) have been found to impact fish growth and survival rates. At high concentrations
of microcystins (> 10 mg/L), morphological effects on fish have been observed (Oberemm et
al., 1999). In addition, the accumulation of microcystins and cyanotoxins through the food
web is a threat to human health (Bownik, 2016). Based on analysis of lake sediments over
the last 200 years, data shows that cyanobacteria have increased significantly, with the
most rapid growth in blooms occurring from 1945 until the present (Taranu et al., 2015).
CyanoHABs are caused or promoted by a combination of environmental factors, with
strong associations with several anthropogenic and natural processes. Agricultural activities
can increase nitrogen and phosphorus input into the water system, promoting cyanobacteria
growth (O’Neil et al., 2012). Climate change impacts are also likely to increase the
occurrence of blooms in the future (Chapra et al., 2017). Higher water temperatures
stimulate the growth of cyanobacteria, since their optimal growth rate is often reached at
temperatures above 25°C (Thomas & Litchman, 2016). Cyanobacteria are carbon-fixing
bacteria that rely on a CO2 concentrating mechanism, and therefore rising concentrations of
CO2 in the atmosphere and water bodies may also promote blooms (Verspagen et al.,
2014). Elevated pH is also known to reduce the energy cost of the CO2 concentrating
process, with higher efficiencies observed in alkaline environments (Mangan et al., 2016).
Cyanobacteria bloom density is usually counted with a mechanical or electronic
counter using an inverted microscope following sedimentation in a chamber or filtration
(Chorus & Welker, 2021). Cell counting is a labour-intensive, time-consuming, and
expensive method that limits the extent and frequency of monitoring campaigns. As such,
there is a need for methods that can enumerate or estimate cyanobacterial levels rapidly
and preferably without the need for sampling. Several studies have developed models to fit
count data and make predictions of day-to-day counts based on easy-to-measure
parameters. Dzialowski et al. (2009) attempted to build a linear regression model for
predicting the cyanobacteria abundance and toxins in five reservoirs in Kansas, USA.
However, their results suggest simple linear models could not accurately predict
cyanobacteria counts (Dzialowski et al., 2009). Pyo et al. (2020) utilized a convolutional
neural network applied to the output of a spatial fluid dynamics model of cyanobacteria
abundance, which achieved good short-term prediction of Microcystis. Zhao et al. (2019)
put forward a species identification model and analyzed the dominant species using
canonical correspondence analysis (CCA). The model was used to identify major driving
factors, including water temperature, pH, total phosphorus, ammonia nitrogen, chemical
oxygen demand and dissolved oxygen, and to predict the risk of algal blooms. Harris &
Graham (2017) developed 12 linear and non-linear models to predict cyanobacteria
abundance, microcystin and geosmin in a reservoir. Support vector machines, random
forests, boosted trees, and cubist modelling approaches were observed to have the best
performance. However, all models underestimated cyanobacteria abundance, and none of
the models predicted peak bloom events or the highest counts.
A common challenge with modelling cyanobacteria abundance is the innate
imbalance in monitoring datasets. A significant excess number of zero counts is typical and
may have resulted from either failure to detect cyanobacteria or an actual absence of
cyanobacteria. Poisson and negative binomial distributions are commonly used for
modelling count data, but they cannot account for the information contained in the excess
proportion of zeros. Several mixture models have been proposed to better account for high
numbers of zero counts: zero-inflated models and hurdle models. Zero-inflated models
assume that zeros are generated either by a Bernoulli process with probability 𝑃 or by a negative
binomial (or Poisson) distribution with probability 1 − 𝑃 (Lambert, 1992). In hurdle models,
the zeros and non-zero values are generated separately by a Bernoulli distribution and
a zero-truncated negative binomial (or Poisson) distribution (Min & Agresti, 2005). Both hurdle and zero-inflated
models have been used in environmental and ecological fields of study. Wenger &
Freeman (2008) showed improved fit of zero-inflated models to duck species abundance
and stream fish abundance. Cha et al. (2014) developed a Bayesian hurdle Poisson model
for predicting cyanobacteria abundance in Lake Paldang, Korea. Richardson et al. (2019)
used a zero-inflated model along with linear mixed models to fit cyanobacteria biovolume
data. Hegg et al. (2022) also used a zero-inflated generalized linear mixed model to
model cyanobacteria abundance when investigating the effects of toxin-producing cyanobacteria on
water flea (Daphnia) fitness in eutrophic lakes. Marion et al. (2017) constructed a
multivariate zero-inflated beta regression model to assess the relationships between the
proportion of county area experiencing a cyanobacteria bloom, county land cover types, and
nutrient loading. Salmaso et al. (2015) used a zero-inflated negative binomial model to
analyze the relationships between environmental variables and cyanobacteria abundance
and tested it against a zero-inflated Poisson model based on AIC and a likelihood ratio test. The
two mixture models, the hurdle model and the zero-inflated model, are considered to provide
two plausible explanations for the zero counts in cyanobacteria abundance data. However,
the goodness of fit of the two models to cyanobacteria abundance data has not been
compared, with studies adopting either the zero-inflated model (Salmaso et al.,
2015; Marion et al., 2017; Richardson et al., 2019; Hegg et al., 2022) or the hurdle model (Cha
et al., 2014). Furthermore, although previous studies have analyzed cyanobacteria
abundance data within the Bayesian framework (Cha et al., 2014), Bayesian variable
selection methods such as projection predictive inference have not been widely used.
Previous studies relied on traditional variable selection methods, such as principal
component analysis and stepwise regression (Salmaso et al., 2015; Cha et al., 2014). While
the general factors that cause cyanobacteria blooms have been well investigated (Salmaso
et al., 2015), methods to select the site-specific factors that influence idiosyncratic
cyanobacteria abundance have not been developed, which can make accurate prediction
challenging.
This study presents a Bayesian approach to fit cyanobacteria data with a negative
binomial model, zero-inflated negative binomial model, and hurdle negative binomial model
to address challenges with inflated zero counts. It is hypothesized that through the novel
use of zero-inflated models for this application the elevated zero counts inherent in the
majority of cyanobacteria abundance data can be accounted for and model fit will be
improved. Additionally, a Bayesian framework was used to present abundance predictions
as distributions rather than point estimates, allowing for a more direct interpretation of
uncertainty. Through these two key aspects of the presented models, the aim is to improve
the integrability of models in water management by accounting for expected data
distributions and emphasizing the need for knowledge of uncertainty in predictions of
environmental systems. The fit of each model is compared to select an optimal model.
Predictions from the optimal model are then classified according to the Australian management
strategies for cyanobacteria (Newcombe et al., 2009) to assess the capability of the
presented approach to identify cyanobacteria levels used in water management. Integrating
the selected model into a Bayesian framework and using the
predictive distribution of each prediction obtained from MCMC sampling to assign the
prediction to predefined categories is novel and can achieve higher accuracy than the
regression method. The established model was also used to assess the importance and
impact of environmental variables on the probability of cyanobacteria blooms. The state-of-the-art
projection predictive variable selection for generalized linear models, which has shown
superior performance to competing variable selection methods (Catalina et al., 2020), is
employed in this study, and the selected model is validated through posterior predictive
checks, which are useful tools to inspect the discrepancies between real and
predicted/simulated data. The whole process has not previously been used to resolve water quality
issues, and the developed framework is appropriate for a wide range of problems involving
the classification of imbalanced data in environmental and ecological fields.
3.2 Methodology
3.2.1 Study site and data source
Data used in this study was collected from a eutrophic reservoir, Cheney Reservoir
(37°45′35″N, 97°50′06″W), the main water supply for Wichita, Kansas, USA (Christensen et al.,
2006).
The data was obtained from the United States Geological Survey (USGS) (US Geological
Survey, 2015). The reservoir has experienced frequent cyanobacterial blooms, presence of
microcystin, and taste-and-odor problems. In part, this could be due to the shallow depth
(average depth=6.1 m) and persistent winds that cause maximal turbulence and a resulting
turbid environment. Among the 185 samples in the dataset, 34 samples indicate zero counts
of cyanobacteria (18.4% of the data). The site was sampled over 14 years from 2002 to 2015,
with annual sampling frequency ranging from twice a year to 24 times a year. The
number of samples in each year is shown in Figure 3-1 (a). Although samples with cell
counts < 1,000 / mL make up the majority class of the data, the highest value is 129,836 cells /
mL during a cyanobacterial bloom. The frequency distribution of cyanobacteria abundance is shown in
Figure 3-1(b). Imbalanced datasets like this with a wide data range are challenging to model
and often result in poor model performance.
Figure 3-1 (a) Bar plots of sampling frequency in each year from 2002 to 2015; (b)
Histograms of cyanobacteria abundance.
In order to identify trends or patterns of cyanobacteria abundance over time, the time
series of cyanobacteria abundance from 2002 to 2015 is presented in Figure 3-2 (a).
Although repetitive fluctuation over years could be leveraged in predicting future blooms,
the cyanobacteria abundance lacks any meaningful pattern. It is worth noting that algal
bloom magnitudes for Cheney Reservoir displayed an increasing trend from 2004 to 2006,
and a slightly decreasing trend from 2006 to 2013. This finding is consistent with the harmful
algal bloom at Cheney Reservoir Dam in September 2006 reported by the city of Wichita. In
2006, the city of Wichita upgraded to ozone treatment to control event effects (Oneby et al.,
2006). To gain more in-depth insight into historical fluctuations, the autocorrelation
function (ACF) was used, which describes the similarity
between observations in a time series as a function of lagged time (Box et al., 2015). The
resulting ACF analysis is illustrated in Figure 3-2 (b). As can be observed from the graph, a
strong ACF (=0.5) was observed for cyanobacteria abundance at a lag of one month, suggesting
that the present values of abundance are related to values in the previous month. Although Li et al.
(2010) indicated an apparent seasonal variation of cyanobacteria, in this study no
considerable seasonal pattern was observed for cyanobacteria abundance (ACF < 0.5).
However, the cyanobacteria abundance usually peaks in the fall season, as presented in Figure
S4 in the appendix. A possible explanation might be that in-lake interventions such as ozone
treatment to manage harmful cyanobacterial blooms after 2006 have diminished the seasonal
pattern of cyanobacteria abundance.
Figure 3-2 (a) Visualization of the time series for cyanobacteria abundance in Cheney
Reservoir; (b) Autocorrelation function (ACF) of cyanobacteria abundance based on monthly
time lag.
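The sample autocorrelation used above can be sketched in a few lines of Python. The function below computes the standard sample ACF for a regularly spaced series. The monthly values are hypothetical, and the sketch assumes equally spaced observations, whereas the Cheney Reservoir record was sampled at irregular intervals and would first need to be aggregated to a monthly step.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelation r_k for lags 0..max_lag of an equally spaced series."""
    n = len(series)
    mean = sum(series) / n
    # Lag-0 sum of squared deviations normalizes every coefficient.
    denominator = sum((x - mean) ** 2 for x in series)
    coefficients = []
    for lag in range(max_lag + 1):
        numerator = sum(
            (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
        )
        coefficients.append(numerator / denominator)
    return coefficients


if __name__ == "__main__":
    # Hypothetical monthly mean abundances (cells/mL), for illustration only.
    monthly = [200, 350, 900, 4000, 12000, 30000, 18000, 5000, 1200, 400, 250, 150]
    for lag, value in enumerate(sample_acf(monthly, 6)):
        print(f"lag {lag} months: ACF = {value:.2f}")
```

A coefficient near 0.5 at lag 1 would correspond to the month-to-month persistence described above, while a peak near lag 12 would indicate a seasonal cycle.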
The dataset that included water quality variables in Cheney Reservoir, Kansas, was
also obtained from the United States Geological Survey (USGS) (US Geological Survey,
2015). Precipitation, solar radiation and wind speed at Cheney Reservoir were obtained from
the NASA POWER project. Meteorological data and environmental data along with cyanobacteria
abundance data were merged by sampling date. The original data set
included 9 variables: temperature, pH, total phosphorous, total nitrogen, Chlorophyll a (Chl
a), all sky insolation incident on a horizontal surface (i.e. solar radiation), wind, turbidity, and
precipitation (Table 3-1). All measurements were daily average values for the same
sampling day as the cyanobacteria counts. Among all of the potential drivers behind harmful
cyanobacterial blooms, eutrophication, which refers to the water body enriched with
minerals and nutrients, particularly nitrogen and phosphorus, can significantly stimulate the
occurrence of harmful algal blooms by causing a shift in the phytoplankton community
towards cyanobacteria dominance (O’Neil et al., 2012). Increased temperature can
stimulate cyanobacteria growth both directly and indirectly. Cyanobacteria favor high
temperatures, with a higher optimum temperature than other groups of algae (Lürling et al., 2013).
The indirect effects of temperature include the intensified thermal stratification due to
increased temperature. Cyanobacteria can take advantage of the stratification by regulating
their buoyancy by forming gas vesicles and accumulating dense blooms at the surface
(Paerl & Huisman, 2009). Wind is also identified as a contributor to cyanobacteria growth.
Below a critical wind speed of 2-3 m/s, wind-generated turbulence is hypothesized to be
incapable of mixing floating cells away from the surface into deeper layers, which leads to the
accumulation of cyanobacteria at the surface (Wang et al., 2016). Extreme precipitation
across the land surface may mobilize sediments and nutrients into the reservoir (Woods et al.,
2017), and also leads to increased water column mixing and weakened vertical stratification,
which have been identified as contributors to cyanobacteria blooms (Reichwaldt &
Ghadouani, 2012). Although Chl a and turbidity are not the cause of cyanobacterial blooms,
they are parameters that are expected to be correlated with the presence of cyanobacteria,
and therefore can be used as predictors. In order to identify the underlying linear
relationships between variables, the correlations were determined and a correlation
heatmap is presented in Figure 3-3.
Table 3-1 Selected variables used to build initial models.
Variable                                              Abbreviation      Units
Total Phosphorous                                     TP                mg/L as P
pH                                                    pH                NA
Temperature                                           Temp              °C
Chlorophyll a                                         Chl a             µg/L
All Sky Insolation Incident on a Horizontal Surface   Solar radiation   Wh/m²
Wind                                                  Wind              m/s
Total Nitrogen                                        TN                mg/L as N
Precipitation                                         Precipitation     mm
Turbidity                                             Turb              FNU
Figure 3-3 Correlation heatmap of water quality and weather parameters.
As the heatmap indicates, the linear dependency between each parameter and
cyanobacteria abundance is weak (R < 0.3). Therefore, it seems that a linear model cannot capture the
complicated relationship between cyanobacteria abundance, weather and water quality
parameters. Chl a, temperature, precipitation and pH were observed to have positive
correlations with cyanobacteria levels. The correlations between solar radiation and
temperature (0.67) and between total phosphorus and turbidity (0.66) are stronger than those of
other pairs, which is consistent with previous studies (Villa et al., 2019). However, as the correlation
coefficients among these predictors are below 0.7, multicollinearity is not a significant
problem (Dormann et al., 2013). Prior to modelling, variables were further selected by
projection predictive inference, a Bayesian approach for model selection and decision
making.
3.2.2 Mixture Models for Zero-inflated Count Data
3.2.2.1 Zero-Inflated Negative Binomial Model
The Zero-Inflated Negative Binomial (ZINB) model (Lambert, 1992) is a mixture model
consisting of a Bernoulli distribution and an untruncated negative binomial distribution. In a
ZINB model, zero counts of cyanobacteria are generated by two processes: the first, a Bernoulli
process, accounts for the absence or presence of cyanobacteria, and the second, a negative
binomial distribution, generates counts of cyanobacteria abundance, in which zeros are
included. By combining these two processes, the ZINB model accounts for both the real zero
counts of cyanobacteria and measurement error. For cyanobacteria abundance $Y_{cyano}$, the
ZINB model can be written as:
$$
p(Y_{cyano} = n) =
\begin{cases}
\pi + (1 - \pi) f(0) & \text{if } n = 0 \\
(1 - \pi) f(n) & \text{if } n > 0
\end{cases}
$$
where $\pi$ is the parameter denoting the probability of a zero from the Bernoulli process and
$f(n)$ is the probability mass function of the negative binomial distribution.
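The ZINB probability mass function above can be evaluated directly, as in the following minimal Python sketch. The negative binomial is written in the mean and dispersion form (`mu`, `phi`), which matches Stan's `neg_binomial_2` parameterization. That choice and the parameter values in the example are assumptions made for illustration, since the parameterization used in the thesis is not stated in this section.

```python
import math


def nb_pmf(n, mu, phi):
    """Negative binomial pmf f(n) with mean mu and dispersion phi (assumed form)."""
    log_p = (
        math.lgamma(n + phi) - math.lgamma(n + 1) - math.lgamma(phi)
        + phi * math.log(phi / (mu + phi))
        + n * math.log(mu / (mu + phi))
    )
    return math.exp(log_p)


def zinb_pmf(n, pi, mu, phi):
    """Zero-inflated negative binomial pmf; pi is the Bernoulli zero probability."""
    if n == 0:
        # A zero arises either from the Bernoulli process or from the NB itself.
        return pi + (1.0 - pi) * nb_pmf(0, mu, phi)
    return (1.0 - pi) * nb_pmf(n, mu, phi)


if __name__ == "__main__":
    # Hypothetical parameter values, not estimates from the thesis.
    pi, mu, phi = 0.2, 5.0, 2.0
    print("P(Y = 0):", zinb_pmf(0, pi, mu, phi))
    print("P(Y = 3):", zinb_pmf(3, pi, mu, phi))
    print("sum over 0..499:", sum(zinb_pmf(n, pi, mu, phi) for n in range(500)))
```

The probability of a zero always exceeds `pi`, because the negative binomial component contributes additional zeros.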
3.2.2.2 Hurdle Negative Binomial Model
In a hurdle model, there are two parts in control of cyanobacteria count generation (Welsh et
al., 1996). The first part decides the presence of cyanobacteria, which is typically
accomplished through logistic modelling. The second part, a truncated negative binomial
model, models the count of cyanobacteria abundance (non-zero value). The hurdle NB
model can be written as:
$$
p(Y_{cyano} = 0) = \pi
$$
$$
p(Y_{cyano} = n) = \frac{(1 - \pi) f(n)}{1 - f(0)}, \quad n \neq 0
$$
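For comparison with the zero-inflated form, the hurdle negative binomial probability mass function above can be sketched in Python as follows. As before, the mean and dispersion parameterization (`mu`, `phi`) of the negative binomial and the numerical values are assumptions made for illustration.

```python
import math


def nb_pmf(n, mu, phi):
    """Negative binomial pmf f(n) with mean mu and dispersion phi (assumed form)."""
    log_p = (
        math.lgamma(n + phi) - math.lgamma(n + 1) - math.lgamma(phi)
        + phi * math.log(phi / (mu + phi))
        + n * math.log(mu / (mu + phi))
    )
    return math.exp(log_p)


def hurdle_nb_pmf(n, pi, mu, phi):
    """Hurdle negative binomial pmf; pi is the probability of a zero count."""
    if n == 0:
        # All zeros come from the binary part of the model.
        return pi
    # Positive counts follow a negative binomial truncated at zero.
    return (1.0 - pi) * nb_pmf(n, mu, phi) / (1.0 - nb_pmf(0, mu, phi))


if __name__ == "__main__":
    # Hypothetical parameter values, not estimates from the thesis.
    pi, mu, phi = 0.2, 5.0, 2.0
    print("P(Y = 0):", hurdle_nb_pmf(0, pi, mu, phi))
    print("P(Y = 3):", hurdle_nb_pmf(3, pi, mu, phi))
    print("sum over 0..499:", sum(hurdle_nb_pmf(n, pi, mu, phi) for n in range(500)))
```

In contrast to the ZINB model, the probability of a zero here equals `pi` exactly, because the count component cannot produce zeros.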
3.2.3 Bayesian approach
The Bayesian framework is an approach to model data and estimate parameters based on
Bayes' theorem:
$$
P(\theta \mid X) = \frac{P(X \mid \theta) P(\theta)}{P(X)}
$$
In a Bayesian approach, the parameter estimation workflow consists of three
steps. First, a prior distribution $P(\theta)$ is set for the parameters, including the coefficients of
pH, Chl a, TN, TP, turbidity, precipitation, solar radiation and temperature in the linear
regression. The prior distribution is determined based on available experience and
knowledge. Second, the likelihood $P(X \mid \theta)$ of the observed cyanobacteria abundance data is
calculated using the parameters $\theta$. Finally, the likelihood and prior are combined to
determine the posterior distribution $P(\theta \mid X)$ of the parameters in the linear regression,
reflecting an updated representation of knowledge (van de Schoot et al., 2021).
Once the posterior distribution of the parameters is determined, samples
can be drawn from it. However, the posterior distribution is high-dimensional and usually not a
standard probability distribution, making exact inference intractable (Bishop,
2006). Markov Chain Monte Carlo (MCMC) methods, such as the Metropolis-Hastings
algorithm (Metropolis et al., 1953; Hastings, 1970), are used to generate random samples
from the target distribution.
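The prior, likelihood and posterior steps and the role of MCMC can be illustrated with a small random-walk Metropolis sampler, the symmetric-proposal special case of Metropolis-Hastings. The Python sketch below is deliberately simplified: it estimates a single mean with a normal likelihood and a normal prior, so that the sampled posterior can be compared with the known analytic answer. The data, prior and tuning values are hypothetical, and this is not the NUTS sampler used through Stan in this study.

```python
import math
import random


def log_posterior(theta, data, sigma=1.0, prior_mean=0.0, prior_sd=10.0):
    """Unnormalized log posterior: log prior P(theta) plus log likelihood P(X | theta)."""
    log_prior = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
    log_likelihood = sum(-0.5 * ((x - theta) / sigma) ** 2 for x in data)
    return log_prior + log_likelihood


def metropolis(data, n_iter=20_000, step=0.8, start=0.0, seed=42):
    """Random-walk Metropolis sampler returning the full chain of theta values."""
    rng = random.Random(seed)
    theta = start
    current = log_posterior(theta, data)
    chain = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        chain.append(theta)
    return chain


if __name__ == "__main__":
    observations = [4.8, 5.1, 5.6, 4.9, 5.3]  # hypothetical data
    draws = metropolis(observations)[2000:]    # discard burn-in
    print("posterior mean (MCMC):", sum(draws) / len(draws))
    # Analytic posterior mean for the conjugate normal model with sigma = 1.
    print("posterior mean (exact):", sum(observations) / (len(observations) + 1 / 100))
```

The normalizing constant $P(X)$ never has to be computed, because only ratios of posterior densities enter the acceptance step.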
Stan is a probabilistic programming language for Bayesian statistical inference
written in C++. It provides a No-U-Turn sampler (NUTS) to obtain simulations from the
user-specified posterior distribution (Carpenter et al., 2017). In this study, the R package rstan
was used, which provides an interface to Stan from R. Through rstan, we
implemented mixture models such as zero-inflated and hurdle models for discrete
distributions.
Convergence of MCMC chains can be diagnosed with trace plots and the Gelman–Rubin
diagnostic $\hat{R}$ (Brooks & Gelman, 1998). Trace plots are helpful when identifying the
burn-in period and the convergence of Markov chains. The Gelman-Rubin statistic compares
the within-chain and between-chain variation to analyze the difference between multiple
Markov chains. $\hat{R} = 1$ indicates good convergence. Practically, a 0.975 quantile of $\hat{R} \le 1.2$
denotes convergence.
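The comparison of within-chain and between-chain variation can be made concrete with a short Python sketch of the basic Gelman-Rubin point estimate. It assumes chains of equal length and computes only the point estimate of $\hat{R}$, not the 0.975 quantile mentioned above. The example chains are hypothetical.

```python
import math
from statistics import mean, variance


def gelman_rubin(chains):
    """Basic Gelman-Rubin R-hat for a list of equal-length chains of one parameter."""
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    # W: average of the within-chain sample variances.
    within = mean(variance(chain) for chain in chains)
    # B: between-chain variance, scaled by the chain length.
    between = n * variance(chain_means)
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


if __name__ == "__main__":
    # Two hypothetical chains exploring the same region, and two that disagree.
    print("overlapping chains:", gelman_rubin([[0, 1, 2, 3], [3, 2, 1, 0]]))
    print("separated chains  :", gelman_rubin([[0, 1, 2, 3], [10, 11, 12, 13]]))
```

Chains that have settled on the same distribution give values close to 1, while chains stuck in different regions give values well above the 1.2 threshold.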
3.2.4 Model development, selection and validation
A summary of the fitting and testing process is presented in Figure 3-4. Initially, the
cyanobacteria abundance data was split into training and test sets, where the test set was
only used to assess predictive performance. A 5-fold cross-validation with stratified random
sampling was used to prevent an imbalance between training and test data and to reduce
the randomness in results. In our data, the stratifying attribute is zero or non-zero
cyanobacteria count (18.4% of data was zero counts). As such, the data were stratified into
two subgroups: zero and non-zero. In each subgroup, the data were randomly split into five
equal folds, and folds from each subgroup were then combined to form training and test sets with
the same proportion of zero and non-zero cyanobacteria counts. The test set contained 44
samples, and the training set contained 141 samples.
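The stratification step can be sketched in Python as follows. The function names and the example counts are hypothetical. The sketch only illustrates how splitting the zero and non-zero subgroups separately keeps their proportions similar across folds, and it does not attempt to reproduce the exact 44/141 split reported above.

```python
import random


def stratified_folds(counts, k=5, seed=0):
    """Assign sample indices to k folds, stratified by zero vs non-zero count."""
    rng = random.Random(seed)
    zero_idx = [i for i, c in enumerate(counts) if c == 0]
    nonzero_idx = [i for i, c in enumerate(counts) if c != 0]
    folds = [[] for _ in range(k)]
    for group in (zero_idx, nonzero_idx):
        rng.shuffle(group)
        # Deal the shuffled indices of each subgroup across the folds in turn.
        for position, index in enumerate(group):
            folds[position % k].append(index)
    return folds


def train_test_split(folds, test_fold):
    """Use one fold as the test set and the remaining folds as the training set."""
    test = sorted(folds[test_fold])
    train = sorted(i for j, fold in enumerate(folds) if j != test_fold for i in fold)
    return train, test


if __name__ == "__main__":
    # Hypothetical data with the same class sizes as the study: 34 zeros, 151 non-zeros.
    counts = [0] * 34 + [1200] * 151
    folds = stratified_folds(counts)
    for j, fold in enumerate(folds):
        zeros = sum(counts[i] == 0 for i in fold)
        print(f"fold {j}: {len(fold)} samples, {zeros} zero counts")
```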
Figure 3-4 Flowchart of modelling and application on cyanobacteria abundance prediction.
ZINB: zero-inflated negative binomial model; Hurdle NB: hurdle negative binomial model; NB: negative
binomial model; LOO-CV: leave-one-out cross-validation; PPC: posterior predictive check.
After splitting the data into training/test sets, the most representative variables were
selected by projection predictive inference. The selected variables were then used to build a
Bayesian negative binomial model, a Bayesian zero-inflated negative binomial model and a
Bayesian hurdle negative binomial model. Model comparison and selection were carried out by
leave-one-out cross-validation (LOO-CV), and the best model was validated with posterior
predictive checks.
When using generalized linear models to solve classification problems (e.g., binary and
multinomial logistic regression), a threshold is commonly chosen as the decision rule. For
example, in binary logistic regression, it is general practice to choose 0.5 as the threshold,
but different thresholds can be manually selected for specific situations. If high
discriminative accuracy is required for positive cases, a larger threshold can be chosen (Kuk
et al., 2014). Traditional multinomial logistic regression is subject to large bias when dealing
with imbalanced data and does not take the distribution of the data into account. Thus, in order
to make the result more indicative, the probability distribution of each prediction was
approximated by the density obtained by MCMC sampling, and the
predictions were assigned to alert levels according to the management strategies for cyanobacteria by
Water Quality Research Australia (WQRA) (Table 3-2). This framework is based on the
standards that outline when potential toxin release may occur. At the low alert level, health
authorities may decide to issue a health warning or notice for water consumption. Higher alert
levels represent situations where the potential risk of cyanobacteria may cause adverse
health effects if treatment is unavailable or ineffective (Newcombe et al., 2009). The
categorization process is analogous to assigning the predicted class according to the posterior
distribution and a probability threshold that was set in advance. Finally, the fitted model
was applied to the test set to generate predictions, and the results were classified.
Table 3-2 Alert levels for management of toxic cyanobacteria (WQRA)
Alert Level   Definition                       Description
Safe          < 500 cells/mL                   Safe for drinking water
Low           ≥ 500 and < 2,000 cells/mL       Detected at low levels
Medium        ≥ 2,000 and < 6,500 cells/mL     Potential toxin at 1/3 to 1/2 of guideline concentration
High          ≥ 6,500 and < 65,000 cells/mL    Potential toxin greater than guideline concentration
Very High     ≥ 65,000 cells/mL                Potential toxin 10 × greater than guideline concentration
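As an illustration of how predictive draws can be mapped onto the alert levels in Table 3-2, the sketch below bins a set of draws for one prediction and reports the share of draws in each level. The function names and the example draws are hypothetical, and taking the most frequent level as the predicted class is one possible decision rule assumed here for the example.

```python
from collections import Counter

# WQRA thresholds from Table 3-2 (cells/mL); upper bounds are exclusive.
ALERT_LEVELS = [
    ("Safe", 500),
    ("Low", 2_000),
    ("Medium", 6_500),
    ("High", 65_000),
]

def alert_level(cells_per_ml):
    """Map a single cell count to its alert level."""
    for name, upper in ALERT_LEVELS:
        if cells_per_ml < upper:
            return name
    return "Very High"

def classify_draws(draws):
    """Return (most frequent level, share of draws per level) for one prediction."""
    tally = Counter(alert_level(d) for d in draws)
    probs = {level: n / len(draws) for level, n in tally.items()}
    modal = max(probs, key=probs.get)
    return modal, probs

# Hypothetical predictive draws (cells/mL) for a single sample.
draws = [0, 120, 450, 800, 2500, 3000, 3100, 7000, 90000, 4000]
modal, probs = classify_draws(draws)
print(modal, probs)
```

Keeping the full dictionary of level probabilities, rather than only the modal level, preserves the uncertainty that a point estimate would discard.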
3.2.4.1 Projection predictive inference
Projection predictive inference (Piironen et al., 2020) is a Bayesian variable selection
method. Variable selection was carried out using the projpred package in R. Initially, a
model with all nine environmental predictor variables for cyanobacteria abundance was
fitted and used as the reference model. Sub-models were then fitted, starting with one
variable and sequentially adding more. The model with the smallest subset
of variables that gave a fit approximately similar to the full model was selected. In the forward
search process, where variables are added sequentially, each step selects the variable
that gives the largest decrease in discrepancy between the reference model and
the sub-model. The sub-models were compared with the reference model by predictive
accuracy estimated with leave-one-out cross-validation (LOO-CV) to prevent overfitting.
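The forward search logic can be illustrated with a generic greedy loop. This is only a sketch of the search order: projpred projects the reference model onto each sub-model, which is not attempted here. The `discrepancy` callable, the variable names, and the toy gains are all hypothetical stand-ins.

```python
def forward_search(candidates, discrepancy):
    """Greedy forward selection.

    discrepancy(subset) returns how far a sub-model built on `subset`
    is from the reference model (smaller is better).
    """
    selected, path = [], []
    remaining = list(candidates)
    while remaining:
        # Pick the variable whose addition leaves the smallest discrepancy.
        best = min(remaining, key=lambda v: discrepancy(selected + [v]))
        selected.append(best)
        remaining.remove(best)
        path.append((best, discrepancy(selected)))
    return path

# Toy stand-in: each variable removes a fixed amount of discrepancy.
gain = {"chl_a": 5.0, "temperature": 3.0, "turbidity": 1.0, "precipitation": 0.1}
total = sum(gain.values())

def toy_discrepancy(subset):
    return total - sum(gain[v] for v in subset)

path = forward_search(gain, toy_discrepancy)
for name, remaining_discrepancy in path:
    print(name, round(remaining_discrepancy, 2))
```

The resulting path gives both the selection order and the point at which adding further variables stops reducing the discrepancy appreciably.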
3.2.4.2 Leave-one-out cross-validation
Several measures have been developed to compare the fit of different models, such as the
Akaike information criterion (AIC), Bayesian information criterion (BIC), deviance information
criterion (DIC), Watanabe–Akaike information criterion (WAIC), and LOO-CV. To measure
the wider applicability of a statistical model, out-of-sample data are commonly used to
evaluate its predictive power. However, extra data are usually not available. As such, we
estimated the predictive capability of the model using the expected log pointwise predictive
density (elppd) with a penalty term (Gelman et al., 2013).
$$\mathrm{elppd} = \sum_{i=1}^{n} E_f\left(\log\left[p(y_{new} \mid y)\right]\right)$$
Measures including AIC, BIC, DIC, and WAIC use all data to determine fit
and can therefore be biased in assessment. A LOO-CV approach (loo package in R)
was therefore used to determine model fit based on out-of-sample data. This can
be extremely computationally expensive, especially if the data set is large. However, with
fewer than 200 cyanobacteria abundance samples, computation time was negligible. In
LOO-CV, a single sample is removed from the data set to test the model and the remaining
samples are used to train the model. The process is repeated $n$ times (where $n$ is the size of
the cyanobacteria data set) so that each sample is considered.
From each iteration, the log predictive density (lpd) is evaluated by:
$$\mathrm{lpd} = \log\left[p(y_i \mid y_{-i})\right]$$
where $y_i$ denotes the $i$th data point, and $y_{-i}$ denotes the remaining data. After $n$
iterations, the elppd can be estimated by:
$$\widehat{\mathrm{elppd}} = \sum_{i=1}^{n} \log\left[p(y_i \mid y_{-i})\right]$$
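With MCMC output, each term $\log[p(y_i \mid y_{-i})]$ can be approximated by averaging the likelihood of the held-out point over posterior draws obtained without that point. The sketch below assumes such per-fold refits are available as lists of pointwise log-likelihood values. The function names and numbers are hypothetical, and the loo package in R instead uses an importance-sampling approximation that avoids refitting.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values) / len(values))

def elpd_loo(pointwise_log_lik):
    """Sum the leave-one-out log predictive densities.

    pointwise_log_lik[i] holds log p(y_i | theta_s) for draws theta_s
    taken from the posterior fitted WITHOUT observation i.
    """
    lpd = [log_mean_exp(draws) for draws in pointwise_log_lik]
    return sum(lpd), lpd

# Hypothetical log-likelihood draws for two held-out observations.
pointwise = [
    [math.log(0.2), math.log(0.4)],
    [math.log(0.5), math.log(0.5)],
]
total, lpd = elpd_loo(pointwise)
print(total, lpd)
```

Working on the log scale with `log_mean_exp` avoids underflow when individual likelihood values are very small.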
3.2.4.3 Posterior Predictive Checks
A posterior predictive check is a classical approach that compares a test statistic $T(y)$
(an arbitrary function of the data) between the actual observed data and data generated from
the model with parameters sampled from the posterior distribution (Berkhof et al.,
2000).
The posterior predictive distribution can be written as:
$$\Pr(y_{rep} \mid y_{obs}) = \int P(y_{rep} \mid \theta)\, P(\theta \mid y_{obs})\, d\theta$$
where $y_{rep}$ denotes the replicated cyanobacteria data, and $y_{obs}$ denotes the
observed cyanobacteria data.
The principle behind posterior predictive checks (PPCs) is that if a model provides a
good fit to the data, the generated data will show a pattern (test statistics) similar to that of
the observed data. The Bayesian p-value (posterior p-value) is a quantitative measure of
goodness of fit. This p-value-like measure represents the probability that the test statistic
(such as the mean, maximum, minimum, or zero proportion) in the replicated (or predicted)
data set equals or exceeds that in the original data (or new observations).
$$\Pr\left(T(y_{rep}) \ge T(y_{obs})\right) = \int I\left(T(y_{rep}) \ge T(y_{obs}) \mid y\right) \cdot p(y_{rep} \mid y_{obs})\, dy_{rep}$$
If the model provides a good fit, the Bayesian p-value should be around 0.5. A value close
to 0 or 1 indicates that the model is a poor fit (Meng, 1994).
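In practice the integral is approximated by the share of replicated data sets whose test statistic is at least as large as the observed one. The following sketch uses zero proportion as the statistic. The function names and the small data sets are hypothetical and serve only to make the calculation concrete.

```python
def zero_proportion(y):
    """Test statistic T(y): share of zero counts in a data set."""
    return sum(1 for v in y if v == 0) / len(y)

def bayesian_p_value(y_obs, y_rep_draws, statistic=zero_proportion):
    """Share of replicated data sets with T(y_rep) >= T(y_obs)."""
    t_obs = statistic(y_obs)
    exceed = sum(1 for y_rep in y_rep_draws if statistic(y_rep) >= t_obs)
    return exceed / len(y_rep_draws)

# Hypothetical observed counts and four replicated data sets.
y_obs = [0, 0, 3, 10, 250]
y_rep_draws = [
    [0, 1, 4, 12, 300],
    [0, 0, 2, 9, 180],
    [0, 0, 0, 15, 220],
    [1, 2, 5, 20, 400],
]
p = bayesian_p_value(y_obs, y_rep_draws)
print(p)
```

A value well below 0.5 for this statistic would mean that most replicated data sets contain fewer zeros than were observed.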
For each simulation ($s = 1, \ldots, 2000$) of parameters from the posterior distribution, a
vector of 185 predicted outcomes of cyanobacteria abundance is obtained.
Thus, the result is a 2000 × 185 matrix of predicted outcomes from all simulations.
In the posterior predictive checks, the training data and test data were used with the
fitted model to obtain the replicated cyanobacteria abundance data and the predicted
cyanobacteria abundance data, respectively. The test statistics of the replicated data
$y_{rep}$ (from the training data) and of the predicted data $y_{pre}$ (from the test data)
were compared with those of the actual observed values $y_{obs}$ and $y_{new}$.
The bayesplot package in R was used for plotting posterior predictive checks.
3.3 Results and discussion
3.3.1 Variable selection
Initially, all nine variables were used to develop a generalized linear model to serve as a
reference model. Starting from an intercept-only model, one additional variable was included
in each step, and the elpd and RMSE of each model were calculated. The order in which
variables were added is based on maximizing fit and is therefore indicative of variable
importance. The selection order decided by the algorithm was Chl a, temperature, turbidity,
total phosphorus, solar radiation, wind, pH, total nitrogen, and precipitation (Figure 3-5).
Figure 3-5 Model elpd and RMSE from LOO-CV plotted as a function of stepwise addition of
variables.
Figure 3-5 shows that the first five variables were sufficient predictors, as they result
in an elpd and RMSE similar to those of the reference model (the final point includes all
variables). The selected variables are consistent with prior knowledge of factors that can be
used to determine cyanobacteria counts. Chl a is produced by cyanobacteria and is therefore a
strong indicator of abundance. Temperature promotes cyanobacterial growth (Thomas &
Litchman, 2016) and is expected to be a significant driver of blooms. Nutrients (nitrogen and
phosphorus) stimulate the growth of cyanobacteria (O’Neil et al., 2012). However, it is
worth noting that only total phosphorus was identified as a variable of importance and total
nitrogen had no impact on model fit. Previous studies indicate that the optimal mass-based
ratio of total nitrogen to total phosphorus is 16:1 (Davidson et al., 2012). The variable
selection indicates the reservoir was phosphorus-dominated, and the nitrogen concentration
was either sufficient or very stable. However, the average mass-based ratio of total nitrogen
to total phosphorus was 11:1 in the reservoir, suggesting that nitrogen should be limiting.
Precipitation and wind were not considered significant, and it is hypothesized that changes
in turbidity better represent the effects of precipitation and wind.
3.3.2 Model selection
Weakly informative priors were used for parameters, and four Markov chains were run for
each model for 1,000 iterations, discarding the first 500 iterations as burn-in.
Figures S1-3 in the supplementary materials present trace plots for parameters in the NB, ZINB
and hurdle NB models. The overlapping of different chains indicates convergence. Furthermore,
parameters from all three models have $\hat{R} < 1.003$, further suggesting convergence of each
chain (Brooks & Gelman, 1998).
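For reference, the basic form of the potential scale reduction factor compares between-chain and within-chain variance for one parameter. The sketch below is a simplified version for equal-length chains. The function name and toy chains are hypothetical, and MCMC software typically reports refined variants (for example, with split chains) rather than this exact quantity.

```python
import math
from statistics import mean, variance

def r_hat(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    within = mean([variance(c) for c in chains])          # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])     # B: between-chain variance
    pooled = (n - 1) / n * within + between / n           # pooled variance estimate
    return math.sqrt(pooled / within)

# Two toy chains exploring the same region, and two that disagree.
agreeing = [[1, 2, 3, 4], [1, 2, 3, 4]]
disagreeing = [[0, 1, 2, 3], [10, 11, 12, 13]]
print(r_hat(agreeing), r_hat(disagreeing))
```

Values close to 1 indicate that the chains agree, whereas values well above 1 indicate that the chains have not mixed.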
After confirming the convergence of all MCMC chains, LOO-CV was applied to
assess the strength of each modelling approach. Assessment of model strength was based
on both elpd and standard error (SE) (Table 3-3). The difference in $\widehat{\mathrm{elpd}}$ relative to the
model with the largest $\widehat{\mathrm{elpd}}$ (i.e., the ZINB model) indicates the magnitude of the
difference between models. The significance of observed differences in elpd was
determined by calculating z-scores and corresponding p-values of paired comparisons
(Lambert, 2018). Results indicate that the models accounting for excess zeros (zero-inflated
and hurdle models) were better than the negative binomial model (p = 0.002); however, the
performance of the ZINB and hurdle NB models was comparable (p = 0.14).
Table 3-3 LOO-CV results to compare strength of model fits. Differences in elpd and
standard error (SE) were calculated relative to the highest performing model (ZINB).
Model        elpd difference    SE difference
ZINB         0.0                0.0
Hurdle NB    -1.2               2.7
NB           -25.3              8.9
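One way to turn an elpd difference and its standard error into a z-score and p-value is sketched below. It assumes a normal approximation and a one-sided test, neither of which is stated explicitly in the text, and the function name is hypothetical.

```python
import math

def elpd_diff_test(elpd_diff, se_diff):
    """Return (z, one-sided p) for an elpd difference under a normal approximation."""
    z = elpd_diff / se_diff
    # Standard normal CDF at z, i.e. P(Z <= z).
    p_one_sided = 0.5 * math.erfc(-z / math.sqrt(2))
    return z, p_one_sided

# NB row of Table 3-3: difference relative to ZINB and its standard error.
z_nb, p_nb = elpd_diff_test(-25.3, 8.9)
print(round(z_nb, 2), round(p_nb, 4))
```

For the NB row this gives a z-score of about -2.84 and a one-sided p-value of about 0.002. Because the exact test is not specified, other comparisons may not reproduce the reported p-values under these assumptions.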
While the fits of the ZINB and hurdle NB models were comparable, it should be considered
that the mechanism of zero generation differs between them. In a zero-inflated model,
zero counts may come from two sources: (a) the cell number is too low to be detected by the
enumeration method used, or (b) the cell number was truly zero. In a hurdle model, zero
counts are assumed to be caused only by cell numbers below the detection limit.
Cyanobacteria are likely not present in the reservoir at some times, and therefore the ZINB
model was chosen as the best model based on goodness of fit and the ability to consider true
zero counts.
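The difference between the two zero-generating mechanisms can be made explicit by writing out the standard probability mass functions. The notation is an assumption introduced here for illustration: $\pi$ is the probability attached to the zero component, and $f_{NB}(y; \mu, \phi)$ is the negative binomial mass function with mean $\mu$ and dispersion $\phi$.

$$
\text{ZINB:}\quad
\Pr(Y = y) =
\begin{cases}
\pi + (1 - \pi)\, f_{NB}(0; \mu, \phi), & y = 0 \\
(1 - \pi)\, f_{NB}(y; \mu, \phi), & y = 1, 2, \ldots
\end{cases}
$$

$$
\text{Hurdle NB:}\quad
\Pr(Y = y) =
\begin{cases}
\pi, & y = 0 \\
(1 - \pi)\, \dfrac{f_{NB}(y; \mu, \phi)}{1 - f_{NB}(0; \mu, \phi)}, & y = 1, 2, \ldots
\end{cases}
$$

In the zero-inflated form, zeros arise both from the extra zero component and from the count distribution itself, whereas in the hurdle form all zeros arise from a single process and positive counts follow a zero-truncated negative binomial distribution.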
3.3.3 Model checking
Posterior predictive checks (PPCs) are used to evaluate whether the model fit is reasonable and
to identify potential differences between observed data and the fitted model. PPCs were
initially run using the training set. Zero proportion was chosen as the test statistic for $y$
and $y_{rep}$, representing the proportion of zero values in the observed data and in the
replicated data (predicted data for the same data set), and the Bayesian p-value was calculated.
Bayesian p-values in this context indicate the probability that the test statistic of the
replicated data is at least as large as that of the observed data (Gelman, 2005). A Bayesian
p-value close to 0.5 indicates a good fit, values approaching 0 indicate lack of fit, and values
close to 1 indicate overfitting (Korner-Nievergelt et al., 2015).
The top left panel of Figure 3-6 shows the density plot of the original
training data (dark blue) and the density plots of the replicated data (light blue). The
overlap between the distributions of observations and replications indicates a good model
fit. However, Figure 3-6 (top right) shows that the model tends to
underestimate the zero proportion, which is reflected in the computed Bayesian p-value of
0.2. It is possible that the zero-inflated generalized linear models still cannot account for all
zeros in the data because they do not capture
non-linear relationships between cyanobacteria and predictor variables.
A 5-fold PPC cross-validation was applied to evaluate the model using held-out
samples. Posterior predictive checks were repeated for each validation set, comparing
the test statistics for $y_{new}$ and $y_{pre}$. The Bayesian p-values of the five validation sets were
0.43, 0.56, 0.32, 0.42, and 0.53, with an average of 0.45. One validation set is shown as an
example in Figure 3-6 (Bayesian p-value = 0.32). The predicted $y_{pre}$ and the actual new
observations $y_{new}$ overlap (Figure 3-6, bottom left), although there is a slight
underestimation of zero proportion (Figure 3-6, bottom right). The difference in estimated
zero proportions and p-values is likely due to the varying proportion of zero counts in each
of the five validation sets. The Bayesian p-values below 0.5 for both replicated and new data
suggest that the linear model may be inadequate for the cyanobacteria growth
model. Considering non-linear models, such as the dynamic phytoplankton model proposed
by Malve et al. (2007), would add complexity but may also improve model fit.
Figure 3-6 Top left: Kernel density estimate of observations in the training set $y$ (dark line)
and replications $y_{rep}$ (light lines). Top right: Zero proportion as test statistic $T(y)$. Dark line is
the zero proportion of observations in the training set. Light lines are the distribution of zero
proportions of replicated data. Bottom left: Kernel density estimate of new observations $y_{new}$
(dark line) and predictions $y_{pre}$ (light lines). Bottom right: Zero proportion as test statistic
$T(y)$. Dark line is the zero proportion of new observations. Light lines are the distribution of
zero proportions of predicted data.
3.3.4 Cyanobacteria alert level prediction
Predictions are produced by first sampling regression parameters from their respective
distributions, followed by calculating cyanobacteria counts. Since 4 MCMC chains of 1,000
iterations were generated, and the first 500 iterations of each chain (burn-in) were
discarded, the number of replicates for each prediction was 2,000. The advantage of
Bayesian models is that instead of predicting a single value, the model presents a predictive
probability distribution based on MCMC iterations. For example, the predictive distribution
based on MCMC results of two data points in the test set are shown in Figure 3-7.
Figure 3-7 Predictive density plot of two new observations. The red line indicates the true
observed values, and the density is determined based on 2,000 MCMC replicates.
From Figure 3-7, it can be observed that even if the peak of the predictive density
does not fall precisely on the true observed value, the maximum predictive density lies
close to the true value and the overall predictive density shifts accordingly. It is also of
note that despite density being highest immediately adjacent to the true observations, there is
a non-zero probability of elevated cyanobacteria abundance. The Bayesian modelling
approach allows for direct interpretation of this uncertainty and allows the uncertain nature of
factors influencing cyanobacterial population dynamics to carry through to predictions.
Predictive density was used to categorize predictions according to WQRA alert
levels. By taking probability density in bins rather than point estimates, the high levels of
uncertainty were accounted for in both the impacts of influencing factors and the
interpretation of risk from cyanobacteria abundance. Not all species release toxins (Lee et al.,
2015), and environmental conditions such as temperature affect toxin release (Walls et al.,
2018). As such, management of surface waters is often a response to categorized levels of
cell counts or other water quality parameters (Ibelings et al., 2014).
The predicted class was determined by the mode, i.e., the most common predicted class
based on probability density. The accuracies in each fold were 0.50, 0.32, 0.36, 0.32, and 0.45.
The overall confusion matrix for multiclass prediction (all WQRA alert levels) is shown in
Table 3-4a. The average accuracy was found to be 0.40, generally indicating poor
performance. In particular, it was noted that the model performed poorly in predicting low or
medium alert levels and predictions of safe levels dominated. The dominant safe level
probabilities are evident from the figure inset in Table 3-4a.
Given the poor performance with narrow alert level bands, and the generally better
separation of ‘high’ vs ‘safe’ levels, classification was reduced to a binary
decision of potential toxin presence or absence. The threshold was set to 1,000 cells/mL,
corresponding to the middle of the low alert level in WQRA and associated with a level
where toxin release may be possible. For this binary decision, the precision and recall were
found to be 0.62 and 0.99, respectively. Cohen’s kappa, a statistic that measures
interrater reliability of binary classifiers (McHugh, 2012), is 0.8, suggesting almost perfect
agreement. As such, at a coarser level the model performance improved and has
potential for distinguishing conditions that could result in toxin presence (Table 3-4b). In
particular, the binary decision approach rarely under-predicted alerts (false negatives), and
performance was high for correctly predicting counts greater than 1,000 cells/mL.
Table 3-4 a) Confusion matrix for all WQRA levels along with figure depicting probability of
each class for a given prediction, and b) reduced confusion matrix for binary decisions > or
< 1,000 cells/mL.
a) All WQRA levels (rows: actual; columns: predicted)
Actual       Safe   Low   Medium   High   Very high
Safe         21     1     24       14     3
Low          9      0     9        14     0
Medium       5      1     12       20     2
High         1      0     4        40     1
Very high    0      0     0        4      0
b) Binary decision (rows: actual; columns: predicted)
Actual                                          Safe (< 1,000 cells/mL)   Potential toxin presence (>= 1,000 cells/mL)
Safe (< 1,000 cells/mL)                         14                        65
Potential toxin presence (>= 1,000 cells/mL)    1                         105
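The binary metrics can be recomputed directly from the counts in Table 3-4b. The sketch below uses the textbook definitions of precision, recall, and Cohen's kappa. The function name is hypothetical, and "potential toxin presence" is treated as the positive class, which is an assumption.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall, accuracy and Cohen's kappa from a 2 x 2 confusion matrix."""
    n = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row (actual) and column (predicted) margins.
    expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n**2
    kappa = (accuracy - expected) / (1 - expected)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "kappa": kappa}

# Counts from Table 3-4b: actual safe -> (14, 65); actual toxin presence -> (1, 105).
m = binary_metrics(tn=14, fp=65, fn=1, tp=105)
for name, value in m.items():
    print(name, round(value, 2))
```

Applied to the table as printed, this reproduces the reported precision (0.62) and recall (0.99). Kappa from the same counts evaluates to roughly 0.19 rather than 0.8, so the reported kappa may have been computed from a different tabulation.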
3.3.5 Influence of weather and water quality factors on cyanobacteria counts
The influence of various factors can also be observed from the kernel density estimates
of the posterior distributions of the variable-specific coefficients (Figure 3-8). Chl a was
found to have the largest positive coefficient, indicating a strong positive relationship with
cyanobacteria counts. This was expected since cyanobacteria produce Chl a, and this
measure is often used as a surrogate for cell counts (Chaffin et al., 2018). The temperature
coefficient is distributed above zero, implying a positive impact on the probability of a bloom.
A positive relationship with temperature was anticipated based on a significant amount
of literature highlighting increased growth with increasing temperature (Thomas & Litchman,
2016; Rousso et al., 2020).
The coefficient of solar radiation is mainly distributed below zero, indicating a
negative correlation with cyanobacteria levels. A negative relationship between radiation
and cell counts could be explained by photobleaching of pigments in cyanobacteria, such as
phycobiliproteins (Sinha et al., 2005), or by relative competitive advantages of cyanobacteria
compared to other algal taxa under limited light conditions (LeBlanc Renaud et al., 2011).
Long-term exposure to increasing light intensity and UV-B light in particular has resulted in
decreased Chl a content and decreased cyanobacteria population (Cirés et al., 2011; Xue et
al., 2005). At high radiation levels (340 μE m−2 s−1), the cyanobacteria growth rate was
previously found to be 30% lower than at moderate radiation (60 μE m−2 s−1) or low radiation
levels (Cirés et al., 2011). However, it should be noted that radiation intensity and
temperature are strongly correlated, and increasing solar radiation was expected to result in
increased cyanobacteria levels due to a corresponding increase in temperature (Jöhnk et
al., 2008).
The turbidity coefficient was distributed on both sides of zero, indicating the
possibility of either positive or negative correlations with cyanobacteria abundance. Turbidity
is a general measure and does not distinguish types of matter, including no distinction
between cyanobacteria and non-algal matter that would contribute to turbidity. Previously,
cyanobacteria abundance of Kansas reservoirs was reported to be negatively correlated to
non-algal turbidity (Dzialowski et al., 2011). As the non-algal turbidity increases, light
penetration is reduced, and less cyanobacteria biomass is expected. Alternatively,
cyanobacteria presence would lead to a measured increase in turbidity (Klemer and
Konopka 1989). As such, the role of turbidity cannot be easily identified, and the parameter
distribution appears to represent the uncertain relationship between turbidity and
cyanobacteria counts accurately.
Figure 3-8 Kernel density estimate of posterior distributions for parameters based on MCMC
sampling with median and 80% intervals.
The coefficient distribution for total phosphorus was primarily distributed below zero,
implying a negative correlation with cyanobacteria abundance. This result is counter to the
expectation of phosphorus levels being positively associated with cell counts, given the
substantial amount of evidence that nutrient reduction strategies reduce blooms (Hamilton
et al., 2016). It should be considered that there were relatively elevated levels of
phosphorus in the reservoir (mean value of 0.1 mg/L), and nutrients may generally not
have been a limiting factor for growth in this system. The recommended limit of total
phosphorus in lakes is 0.05 mg/L (Litke, 1999), and 92% of the recorded phosphorus
levels in this dataset would imply the reservoir being studied is eutrophic or hypertrophic
(Carlson and Simpson, 1996). Relatively flat biomass responses with increasing
phosphorus above a limiting threshold have also been previously reported (Dolman et al.,
2012).
3.4 Summary
Bayesian mixture models were applied to model cyanobacteria abundance in a reservoir,
with particular consideration for the tendency for cyanobacteria abundance to be highly
imbalanced with a high proportion of zero values. Two models that can account for the high
proportion of zero measurements, a ZINB and a hurdle NB model, were compared. An
NB model was also applied to act as a baseline approach that does not account for excess
zero counts.
Based on fit determined from leave-one-out cross-validation, it was found that the
ZINB and hurdle NB models performed significantly better than the NB model. The observed
improvement of fit when using models that account for excessive zero counts supports the
hypothesis that inflated zero counts are important to consider when modelling cyanobacteria
abundance. Furthermore, a slight increase of fit was observed when using ZINB compared
to the hurdle NB approach. ZINB models can account for zero measurements being present
either from the cell number being below detection limits, or from the true absence of
cyanobacteria. As such, the improvement of fit using ZINB illustrates that both mechanisms
of zero generation should be considered when modelling cyanobacteria.
The ZINB model was then applied to predict cyanobacteria levels using a separate
test set. Although the performance was poor when predicting narrow alert level bands,
precision and recall were high (0.62 and 0.99, respectively) for binary prediction of elevated
vs. low risk levels of cyanobacteria. The established model utilizes a limited number of
easy-to-measure parameters including Chl a, total phosphorus, turbidity, temperature, and solar
radiation to generate these predictions. Furthermore, the predictions produced from the
Bayesian approach utilized in this paper are probabilistic. The uncertainty from the data and
interactions in the system are carried through the modelling process to produce an
estimated cell count with an associated level of uncertainty. The high uncertainty levels in
parameter estimates demonstrate that cyanobacteria count prediction is difficult, and the
impact of influencing factors is complex. As such, the presented modelling process is
believed well suited to inform the management of complex systems with high uncertainty.
Chapter 4: A probabilistic approach to evaluating Cryptosporidium health risk in
drinking water
4.1 Introduction
The protozoan Cryptosporidium is an important chlorine-resistant pathogen that commonly
drives public health risk associated with drinking water treatment and delivery (Efstratiou et
al., 2017). Outbreaks of Cryptosporidium can impact a large proportion of the population in a
short time frame due to its persistence in aquatic environments and high probability of
infection at low doses (Swaffer et al., 2018; Desai et al., 2012). The reported incidence of
Cryptosporidium has increased since 2004 in the United States, with most cases occurring
during the summer and among children (Yoder & Beach, 2010). Cryptosporidiosis has been
increasingly identified as an important cause of morbidity and mortality in the world
(Checkley et al., 2015), particularly for the immunocompromised. In two of the documented
waterborne outbreaks, Milwaukee and Las Vegas, mortality rates in the
immunocompromised ranged from 52% to 68% (Rose, 1997).
The reservoirs of Cryptosporidium include humans, cattle and other mammalian
species (Thomson et al., 2017). Cryptosporidium can be found in soil, water, and food or
surfaces that have been contaminated with feces from the hosts. Cryptosporidium can
enter source waters such as lakes and rivers through sewage overflow, storm water
runoff, agricultural runoff and wildlife.
The routes of exposure include ingestion of contaminated recreational or drinking
water, ingestion of contaminated food, exposure to infected animals, and close contact with
others with cryptosporidiosis (Yoder & Beach, 2010). Humans can be infected with
Cryptosporidium through various transmission routes, including the faecal-oral route
(person-to-person transmission and zoonotic transmission) (Ng et al., 2012), ingestion of
contaminated foods (foodborne transmission) (Ryan et al., 2018), and water (waterborne
transmission) (Xiao & Feng, 2017). After the individual ingests the protozoan oocysts, the
infection begins by releasing sporozoites that invade the mucosa to establish endogenous
autoinfection (Gerace et al., 2019).
Continued monitoring and improvement of drinking water treatment and untreated
water contact control have a pivotal role in cryptosporidiosis prevention and control. As
Cryptosporidium oocysts occur in low numbers in water, and in vitro culture techniques that
augment parasite numbers for identification are not available, it is necessary to concentrate
oocysts to identify them effectively and accurately (Smith et al., 2010). Besides
morphological identification of oocysts by microscopy, the most common methods for detection
and enumeration include concentrating and staining of fecal smears, immunological-based
methods, and molecular techniques (Ahmed & Karanis, 2018). In drinking water,
concentration through methods such as continuous flow centrifugation and membrane
filtration is most commonly practiced. Molecular methods such as PCR tests can be applied to
both clinical and environmental specimens. Although PCR tests are rapid, highly sensitive and
accurate, the false-positive rate can be high due to detection of non-viable microorganisms and
laboratory contamination (Checkley et al., 2015).
Due to the expense and labour intensity of detection methods, information on source
water concentrations is severely limited and routine monitoring of Cryptosporidium is not
practiced (Efstratiou et al., 2017). As evidenced by the reduction of sporadic cases of
cryptosporidiosis when water treatment is improved, levels of Cryptosporidium in source
waters and infection rates are likely underestimated due to monitoring issues. As such,
there is a need for a cheaper and easier method to predict the presence of
Cryptosporidium on a day-to-day basis. In a recent study (Ligda et al., 2020), a machine
learning type risk assessment model was developed to predict Cryptosporidium with
meteorological and physicochemical predictors. The model achieved an overall accuracy of
75% in four-level classification of Cryptosporidium concentrations.
Although there are limited studies regarding Cryptosporidium prediction, previous
studies have attempted to identify factors that drive the occurrence of Cryptosporidium in
water bodies. A common factor in many historical outbreaks is a preceding extreme rainfall
event (Hrudey & Hrudey, 2004; Sylvestre et al., 2021). Liu et al. (2019) investigated the fate and
transport dynamics of Cryptosporidium in the Daning River, China using the soil and water
assessment tool (SWAT) and reported a combined impact of rainfall and regional
fertilization on the level of Cryptosporidium. Furthermore, Coffey et al. (2010) found that
fertilization has a significant impact on the occurrence of Cryptosporidium in a watershed
in Ireland. Xiao et al. (2013) developed a quantitative microbial risk assessment model for
Cryptosporidium and reported a strong relationship between positive samples for
Cryptosporidium and flooding frequency.
Quantitative microbial risk assessment (QMRA) is a mathematical modeling
approach to estimating the health risk related to environmental exposure of microorganisms
(Haas et al., 2014) and has increasingly become a standard for assessing pathogen risk.
QMRA provides a detailed and flexible method for estimating risk and disease burden that
can support risk-based management decisions (Hunter et al., 2011). Previous work has
utilized QMRA to quantify Cryptosporidium risk based on estimated concentrations rather
than direct observation of oocyst concentrations. Hunter et al. (2011) applied QMRA using
Cryptosporidium concentrations estimated from a regression model based on E. coli
concentrations. The analysis indicated a major risk of Cryptosporidium infection among
English and French populations that consume tap water from very small drinking water
supplies.
Probabilistic models are useful tools that take into account the impact of random
events or actions in predicting potential outcomes. While deterministic models give point
estimates, probabilistic models give a probability distribution as the estimate. They are highly
applicable to modelling environmental systems since outputs are easily generated with
incomplete data, and predictions are probabilistic and therefore provide a measure of uncertainty
(Aguilera et al., 2011; Bertone et al., 2016). Probabilistic QMRA is emerging as a valuable
technique in microbial risk assessment. Amha et al. (2015) developed a probabilistic
QMRA to determine the risk of Salmonella infections resulting from consumption of crops
irrigated with treated wastewater. Mok et al. (2014) constructed a probabilistic QMRA
model to determine the health risks of norovirus infection from consumption of vegetables
irrigated with wastewater. Barker et al. (2014) proposed a probabilistic QMRA using the
Monte Carlo method to assess the risk of gastroenteritis illness caused by rotavirus,
norovirus, and Ascaris lumbricoides associated with the consumption of street food salads.
Probabilistic QMRA using Monte Carlo simulation can also be used to estimate the health
risk of systems and management strategies. Bivins et al. (2017) proposed a
probabilistic QMRA to estimate the risks of infection of waterborne illness when the
population is exposed to intermittent water supply, using reference pathogens including
Campylobacter, Cryptosporidium, and rotavirus. Bayesian networks (BNs) are probabilistic
graphical models representing complex relationships among multiple variables and have also
been widely used in probabilistic QMRA. Beaudequin et al. (2015) evaluated the capabilities and
challenges of adopting BN models to QMRA, highlighting the opportunity to use BNs for
scenario assessment and identifying nodes or factors with the most influence on the risk
outcomes. Beaudequin et al. (2016) later presented a QMRA in a wastewater
reuse context and examined the risk of norovirus infection associated with
wastewater-irrigated lettuce. Zhiteneva et al. (2021) constructed a probabilistic QMRA utilizing a BN to
examine a non-membrane based indirect potable reuse (IPR) system. The critical control
points of norovirus, Campylobacter and Cryptosporidium were determined through
sensitivity analysis, scenario analysis, and backwards inferencing. Although BNs have great
potential for broad use in QMRA, the drawbacks of BNs include the difficulties with eliciting
conditional probabilities and information loss caused when categorizing the input variables
61
and risk outcomes. The uncertainty from parameter estimations can be incorporated with
the risk assessment with Markov Chain Monte Carlo method. Donald et al. (2011) have
conducted a study of incorporating the parameter uncertainties into the probabilistic QMRA
model to estimate Salmonella. MCMC method can also be used to estimate pathogen
concentrations (Bouwknegt et al., 2014; Masciopinto et al., 2020) and unknown quantities
such as the parameters of distributions and the model (Rigaux Ancelet et al., 2013; Donald
et al., 2011).
In this research, two connected models of dose (estimating Cryptosporidium
presence/absence) and response (public health outcomes) were developed. The dose
model was intended to address challenges associated with limited knowledge of day-to-day
Cryptosporidium levels in source waters. A Gaussian process classifier (GPC) was used to
predict presence/absence based on known factors correlated with the presence of protozoa
such as turbidity, fecal coliforms, and weather data (precipitation, temperature). Predictions
were then connected to a Bayesian linear regression model to estimate public health risk
(response model). Factors affecting public health risk such as water treatment efficiency,
drinking water consumption, herd immunity rates, and sewer overflow rates were considered
in the response model. Parameterization of the response model was based on
distributions reported in the literature for drinking water consumption and herd immunity,
and on previous records of annual sewer overflows.
The modelling approach developed has significant value for environmental and water
system management. The established probabilistic QMRA model can be used for predictive
inference, giving the resulting health risk with uncertainty under climate change,
emergencies and policy controls, and it supports site-specific risk assessment at drinking water
treatment facilities. By incorporating weather
and other environmental variables into a dose model, the impacts of climate
change on risk can be explored. Furthermore, the control points for sewer overflow and
target treatment efficiency can be determined with backward reasoning. As the input
variables remain continuous, the model avoids a weakness of BNs, which are commonly
restricted to discrete or linear Gaussian variables. The simulation, parameter estimation and prediction
rely on the Monte Carlo method and Markov chain Monte Carlo, demonstrating a novel use of
BNs and a probabilistic version of QMRA. The model, with its highly flexible nature, is a
powerful tool that can be continually extended with more variables and can increase
understanding of the public health impacts of diverse risk factors.
4.2 Methodology
4.2.1 Data sources
Source water quality data for the Kensico Reservoir were obtained from the City of New York (NYC)
Open Data, provided by the Department of Environmental Protection (DEP). Parameters
monitored at station DEL18DT (representing Kensico water) included Cryptosporidium
concentration, turbidity and fecal coliforms. Weather data, including temperature and
precipitation on the Cryptosporidium sampling day, were taken from the weather station
nearest to the Kensico Reservoir: Westchester County Airport at White Plains, NY
(station ID: US1NYWC0003). The two data sets were merged by monitoring date.
The merged data set includes 368 samples from May 2015 to September 2021. The
number of samples in each year is presented in Figure 4-1(a). In the data set, the most
frequently reported Cryptosporidium oocyst concentration was ‘zero’ or below the detection limit (92%
of all data), so there is a substantial imbalance between presence and absence, as
shown in Figure 4-1(b).
The merged data set includes five variables: Cryptosporidium oocyst concentration
(number of Cryptosporidium oocysts observed in a 50-liter sample), turbidity (NTU, average
turbidity of the six 4-hour grab samples), fecal coliforms (total number of colonies counted
per 100 mL of sample filtered), temperature (°C, average of the temperatures monitored
through the day) and precipitation on the sampling day (mm). The correlation heatmap in Figure
4-2 depicts the relationships between water quality, weather and Cryptosporidium oocysts in
source water. A very weak negative correlation (r = -0.01) was observed between turbidity
and Cryptosporidium presence/absence in source water. This differs from the findings of
Gómez-Couso et al. (2009), who found that the infectivity and survival of Cryptosporidium
oocysts decreased significantly under intense radiation, and that higher turbidity (>
200 NTU) reduced the penetration of ultraviolet (UV) light below the surface and was
therefore beneficial to oocyst survival. The difference may be explained by the fact that
turbidity in the reservoir ranges only from 0.53 to 1.43 NTU, which does not have a significant
effect on light penetration. Previous studies have shown positive correlations between
Cryptosporidium oocysts and E. coli (Reinoso et al., 2008) and between Cryptosporidium oocysts and
rainfall (Schets et al., 2008). The positive correlations in Figure 4-2 confirm that
precipitation and E. coli concentration are positively associated with Cryptosporidium oocyst
concentration. Cryptosporidium oocysts tolerate a wide range of temperatures and are
only inactivated when exposed to temperatures above 50-60 °C or below
−20 °C (Hassan et al., 2021), although a slight positive linear relationship was observed
between temperature and Cryptosporidium presence/absence.
Figure 4-1 (a) Bar plot of the number of samples in each year from 2015 to 2021; (b) Histogram
of Cryptosporidium presence and absence.
Figure 4-2: Linear correlation (r) heatmap of water quality, weather parameters, and
Cryptosporidium presence/absence (Class).
4.2.2 Predicting Cryptosporidium presence in source water
4.2.2.1 Gaussian process classification
Gaussian processes (GPs) are fully probabilistic methods for regression and classification
problems. They allow a Bayesian use of kernels and can be interpreted as a Bayesian
probabilistic analogue of the kernel SVM classifier. GPs provide fully probabilistic predictive
distributions with uncertainty estimates (Quinonero-Candela et al., 2007).
Consider a training set $\mathcal{D} = \{(\mathbf{x}_i, y_{Crypto,i})\}$, $i = 1, \dots, n$, of $n$ pairs
consisting of an input vector of environmental variables $\mathbf{x}_i$ (precipitation, turbidity,
temperature and E. coli) and the binary Cryptosporidium class (presence/absence) $y_{Crypto,i}$.
Gaussian process regression assumes a Gaussian process prior over functions $f$, where
$\mathbf{f} = [f_1, f_2, \dots, f_n]^T$ is the vector of latent function values that underlie the
predicted labels. The process is fully specified by its mean and covariance functions:
$$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), K)$$
where the mean function is
$$m(\mathbf{x}) = E[f(\mathbf{x})]$$
Usually, the prior mean is assumed to be constant and zero, and the covariance function is
taken to be the commonly used squared exponential:
$$K_{i,j} = \sigma^2 \exp\left(-\frac{(x_i - x_j)^2}{2\lambda^2}\right)$$
Here the output variance $\sigma^2$ controls the prior variance, and the length scale $\lambda$ controls the
rate of decay of the covariance.
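The squared exponential covariance above can be illustrated with a short sketch. It uses scalar inputs and made-up turbidity values; the function names `squared_exponential` and `covariance_matrix` are hypothetical helpers introduced here, not part of the modelling code used in this study.

```python
import math

def squared_exponential(x_i, x_j, sigma=1.0, length_scale=1.0):
    """Squared exponential covariance between two scalar inputs."""
    return sigma ** 2 * math.exp(-((x_i - x_j) ** 2) / (2.0 * length_scale ** 2))

def covariance_matrix(xs, sigma=1.0, length_scale=1.0):
    """Build the n x n matrix K with entries K[i][j] = k(x_i, x_j)."""
    return [[squared_exponential(a, b, sigma, length_scale) for b in xs] for a in xs]

# Hypothetical turbidity readings (NTU), used only for illustration.
turbidity = [0.6, 0.7, 1.4]
K = covariance_matrix(turbidity, sigma=2.0, length_scale=0.5)
```

The diagonal of `K` equals the output variance, and entries shrink as two inputs move further apart relative to the length scale.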
Logistic Gaussian process regression is a generalization of linear regression to binary
classification problems. In logistic GP regression, the observed Cryptosporidium
presence/absence $y_{Crypto,i} \in \{0, 1\}$, $i = 1, \dots, n$, is modelled through the latent function
values of a Gaussian process and the logistic link:
$$y_{Crypto,i} \sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(f_i))$$
Integrating over the distribution of the latent function value $f_*$ at future environmental data $\mathbf{x}_*$, the
probabilistic predictive distribution for future Cryptosporidium presence/absence $y_{Crypto,*}$ can
be described as:
$$p(y_{Crypto,*} = +1 \mid y_{Crypto}, \mathbf{x}, \mathbf{x}_*) = \int \sigma(f_*)\, p(f_* \mid y_{Crypto})\, df_*$$
where $\sigma(\cdot)$ is the logistic function and $+1$ denotes the presence of Cryptosporidium.
Unlike parametric methods, nonparametric methods do not assume a particular linear or
non-linear form for the relationship between input variables and output, and can be useful for dealing with
unexpected, outlying observations that might be problematic with parametric methods
(Whitley & Ball, 2002).
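The predictive integral above generally has no closed form. As a minimal sketch, assume the posterior of the latent value $f_*$ has already been approximated by a Gaussian with a known mean and standard deviation (an assumption made here for illustration; the approximation used in this study is not specified). The integral can then be estimated by averaging the logistic function over random draws. The names and numbers below are hypothetical.

```python
import math
import random

def logistic(f):
    """Logistic (inverse logit) function."""
    return 1.0 / (1.0 + math.exp(-f))

def predictive_probability(latent_mean, latent_sd, n_draws=20000, seed=0):
    """Monte Carlo estimate of the integral of logistic(f) over a Gaussian f."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        total += logistic(rng.gauss(latent_mean, latent_sd))
    return total / n_draws

# Hypothetical latent posterior: same mean, with and without uncertainty.
p_point = predictive_probability(-2.0, 0.0)
p_uncertain = predictive_probability(-2.0, 2.0)
```

With zero latent uncertainty the estimate reduces to the logistic of the mean; adding uncertainty pulls the predicted probability toward 0.5.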
4.2.2.2 Model performance evaluation and threshold determination
Model performance was evaluated using precision and recall, in addition to accuracy.
Precision is the proportion of predicted positives that are truly positive, while recall
describes the model’s sensitivity in detecting the positive class.
Class imbalance is a major problem in classification, and using the default threshold
of 0.5 for binary decisions on severely imbalanced data will usually result in poor
performance. A straightforward approach to improving the performance of a binary classifier
is to tune the threshold used to map probabilities to class labels (Collell et al., 2018). The
optimal threshold for Cryptosporidium presence/absence classification was chosen from
the precision-recall (PR) curve as the value giving the best balance between precision and
recall.
$$\mathrm{precision} = \frac{TP}{TP + FP}$$
$$\mathrm{recall} = \frac{TP}{TP + FN}$$
where $TP$, $FP$ and $FN$ are the numbers of true positives, false positives and false negatives.
After fitting the Gaussian process classifier described in section 4.2.2.1, we computed the
probability of Cryptosporidium presence, $\pi(\mathbf{x}_i) = \mathrm{logit}^{-1}(f_{*,i})$, for the $i$th set of input environmental
data, and varied the threshold applied to $\pi(\mathbf{x}_i)$ to inspect the resulting changes in precision and recall.
The threshold giving the best balance of precision and recall was taken to be the one that
maximizes the F-score, the harmonic mean of the two measures:
$$F\text{-score} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
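The threshold search described in this section can be sketched as a simple grid search over candidate thresholds. The labels and predicted probabilities below are invented for illustration, and the helper names are hypothetical.

```python
def precision_recall_f(y_true, probs, threshold):
    """Precision, recall and F-score when 'presence' is predicted for probs >= threshold."""
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, probs) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score

def best_threshold(y_true, probs, grid):
    """Return the first threshold on the grid with the highest F-score."""
    return max(grid, key=lambda t: precision_recall_f(y_true, probs, t)[2])

# Hypothetical imbalanced data: 1 = presence, 0 = absence.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.02, 0.03, 0.05, 0.04, 0.10, 0.06, 0.08, 0.20, 0.15, 0.30]
grid = [i / 100 for i in range(1, 100)]
t_best = best_threshold(y_true, probs, grid)
```

In this toy example the default threshold of 0.5 predicts no positives at all, whereas the tuned threshold recovers both positive samples at the cost of one false positive.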
4.2.3 Modelling the Cryptosporidium exposure
4.2.3.1 Removal through water treatment
In this study, it was assumed that the exposure route for Cryptosporidium is direct ingestion
through drinking water consumption. A range of water treatment techniques can improve the
safety of potable water with regard to pathogenic contamination. The log removal value (LRV)
is widely used to measure treatment efficacy:
$$LRV = \log_{10}(C_{in}/C_{out})$$ (eq. 5.1)
where $C_{in}$ is the pathogen concentration in the influent and $C_{out}$ is the pathogen concentration in
the effluent.
The LRV achieved by a given system is site-specific and depends on the unit
processes used as well as operational parameters. Generally, regulations require an
overall LRV of 3 (USEPA, 2006), although advanced treatment methods can achieve much
higher LRVs. For instance, membrane filtration technologies are capable of LRVs >7 (Hirata
& Hashimoto, 1998). However, conventional drinking water treatment, comprising
coagulation, flocculation, sedimentation, and filtration, usually achieves an LRV of 2
± 0.5 for Cryptosporidium oocysts (Chaudhry et al., 2017). Other reviews have
also indicated that most conventional water systems in the US achieve 2-2.5 log removal
and do not monitor the filtered water for Cryptosporidium (LeChevallier et al., 1991). The
efficiency of standard filtration ($\eta$), expressed as an LRV, is assumed to follow a uniform distribution:
$$\eta \sim U(1.5, 2.5)$$ (eq. 5.2)
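As a brief illustration of eq. 5.1 and eq. 5.2, the sketch below computes an LRV from influent and effluent concentrations, converts an LRV back to the fraction of organisms remaining, and draws treatment efficiencies from the assumed uniform distribution. The function names and the seed are hypothetical.

```python
import math
import random

def log_removal_value(c_in, c_out):
    """LRV = log10(C_in / C_out), as in eq. 5.1."""
    return math.log10(c_in / c_out)

def surviving_fraction(lrv):
    """Fraction of organisms remaining after treatment with the given LRV."""
    return 10.0 ** (-lrv)

# Draw treatment efficiencies from U(1.5, 2.5), as in eq. 5.2.
rng = random.Random(1)
eta_samples = [rng.uniform(1.5, 2.5) for _ in range(1000)]
```

For example, an LRV of 2 corresponds to 1% of the influent oocysts remaining in the effluent.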
4.2.3.2 Drinking water consumption
Daily water intake varies between countries, age groups, and genders. About 20% of daily
water intake comes from food, and the rest from beverages and drinking water. The
recommended daily water intake for the vast majority of persons is 3.7 L for adult men and
2.7 L for adult women (Sawka, 2005). Säve-Söderbergh et al. (2018) characterized drinking
water consumption patterns among adults from self-reported estimates. The
daily drinking water consumption ($D$, in glasses of 200 ml) was best fitted by a gamma
distribution (shape = 3.938; rate = 0.791):
$$D \sim \Gamma(3.938, 0.791)$$ (eq. 5.3)
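Sampling from eq. 5.3 needs some care with parameterization: Python's `random.gammavariate` takes a shape and a scale, so the rate reported above is inverted. The sketch below also converts glasses to litres using the stated 200 ml per glass; the function name and seed are hypothetical.

```python
import random

SHAPE = 3.938          # gamma shape, from eq. 5.3
RATE = 0.791           # gamma rate (per glass), from eq. 5.3
GLASS_LITRES = 0.2     # one glass equals 200 ml

def sample_daily_intake_litres(rng):
    """Draw one daily drinking water intake in litres."""
    glasses = rng.gammavariate(SHAPE, 1.0 / RATE)  # scale = 1 / rate
    return glasses * GLASS_LITRES

rng = random.Random(42)
intake = [sample_daily_intake_litres(rng) for _ in range(20000)]
mean_intake = sum(intake) / len(intake)
```

The theoretical mean is shape/rate, about 4.98 glasses, or roughly 1 L per day.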
4.2.3.3 Sewer overflow rate
Under normal circumstances, wastewater is transported to the wastewater treatment plant
through sewers and is treated prior to discharge into drinking water sources. However,
during extreme weather events or other emergencies, such as blocked or cracked pipes and
heavy rainfall or snowmelt, excess untreated sewage or wastewater can be discharged
directly to water bodies and pose a substantial health and environmental challenge.
Cryptosporidium oocysts concentrations are considerably higher in sewage than in
surface water. Lalancette et al. (2012) reported an average of 18 oocysts/L in urban sewage
received by two wastewater treatment plants in the Greater Montreal Area in Canada.
Concentrations as high as 103 oocysts/L have been recorded in association with spring runoff
(Gammie et al., 2000). These values are high compared to oocyst concentrations in surface
water. Typically, Cryptosporidium concentrations in Canadian surface waters range from
0.01 to 1 oocysts/L (Health Canada, 2019). Data collected in the United States showed
median Cryptosporidium concentrations ranging from 0.005/L to 0.5/L in natural surface waters (Ongerth,
2013).
The probability distribution of the annual sewer overflow rate was
determined from combined sewer overflow discharge volumes (Statistics Canada,
Table 38-10-0100-01) and potable water use by sector and average daily use
(Statistics Canada, Table 38-10-0271-01): the overflow discharge volumes of each
Canadian province in 2013, 2015 and 2017, and the overall potable water use by all
sectors for the corresponding provinces and years. The estimated sewer overflow rate ($\theta$)
was best fitted by a zero-truncated normal distribution (Figure 4-3):
$$\theta \sim \mathcal{N}(0.022, 0.12), \quad \theta > 0$$ (eq. 5.4)
Figure 4-3 Density plot of overflow rate
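One simple way to draw from the zero-truncated normal in eq. 5.4 is rejection sampling: draw from the untruncated normal and keep only positive values. This is a sketch of that idea, taking 0.12 as the standard deviation (as stated later for the sewer overflow scenarios); it is not necessarily the sampler used in this study, and the function name is hypothetical.

```python
import random

def sample_overflow_rate(rng, mean=0.022, sd=0.12):
    """Draw theta from N(mean, sd) truncated to theta > 0 by rejection."""
    while True:
        theta = rng.gauss(mean, sd)
        if theta > 0.0:
            return theta

rng = random.Random(7)
theta_samples = [sample_overflow_rate(rng) for _ in range(20000)]
mean_theta = sum(theta_samples) / len(theta_samples)
```

Because almost half of the untruncated mass lies below zero, the mean of the truncated draws is noticeably larger than the location parameter 0.022.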
4.2.4 Modelling the risk of illness
The probability of ingesting an exact discrete number of organisms ($j$), given the average
dose of pathogen ingested per day from drinking water ($\mathrm{Dose\ Ingested}_{day}$), is
modelled with a Poisson distribution:
$$P(j \mid \mathrm{Dose\ Ingested}_{day}) = \frac{(\mathrm{Dose\ Ingested}_{day})^{j}}{j!}\, e^{-\mathrm{Dose\ Ingested}_{day}}$$ (eq. 5.5)
The probability of infection is given by an exponential model:
$$P(\mathrm{Infection}) = 1 - (1 - r)^{j}$$ (eq. 5.6)
For Cryptosporidium, the $r$ value is 0.018 (Messner et al., 2001).
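Eq. 5.5 and eq. 5.6 can be combined by averaging the infection probability over the Poisson-distributed number of ingested organisms. For a mean dose $\mu$ this sum has the closed form $1 - e^{-r\mu}$, which the sketch below checks numerically. The mean dose of 2 organisms is a made-up value, and the function names are hypothetical.

```python
import math

R_CRYPTO = 0.018  # per-organism infection probability (Messner et al., 2001)

def poisson_pmf(j, mean_dose):
    """Probability of ingesting exactly j organisms (eq. 5.5)."""
    return mean_dose ** j * math.exp(-mean_dose) / math.factorial(j)

def infection_given_count(j, r=R_CRYPTO):
    """Probability of infection after ingesting exactly j organisms (eq. 5.6)."""
    return 1.0 - (1.0 - r) ** j

def infection_given_mean_dose(mean_dose, r=R_CRYPTO, j_max=100):
    """Average eq. 5.6 over the Poisson distribution of j, truncated at j_max."""
    return sum(poisson_pmf(j, mean_dose) * infection_given_count(j, r)
               for j in range(j_max + 1))

def infection_closed_form(mean_dose, r=R_CRYPTO):
    """Closed form of the same average: 1 - exp(-r * mean_dose)."""
    return 1.0 - math.exp(-r * mean_dose)

# Hypothetical mean daily dose of 2 organisms.
p_sum = infection_given_mean_dose(2.0)
p_closed = infection_closed_form(2.0)
```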
Herd immunity is the medical term that describes a population’s resistance to a
pathogen, obtained from the immunity developed through previous infection of a portion of the
population (Okhuysen et al., 1998). Typically, the dose-response function describes infection
rather than disease. To calculate the disease burden for the pathogen, it is necessary to
calculate the probability of illness. The dose-independent morbidity ratio for Cryptosporidium
is approximately 0.5 (Haas, 2014). A normal distribution is assumed with a mean of 0.5 and
a standard deviation of 0.07 (Casmen et al., 2000):
$$\alpha \sim \mathcal{N}(0.5, 0.07)$$ (eq. 5.7)
The daily risk of illness is calculated from the herd immunity (morbidity ratio) and the
probability of infection:
$$P(\mathrm{illness}, \mathrm{day}) = P(\mathrm{infection}, \mathrm{day}) \times \alpha$$ (eq. 5.8)
The burden of disease in the QMRA model is quantified by disability-adjusted life years
(DALYs), which are used in the risk assessment model as a metric to compare illnesses
with different health endpoints (Murray, 1997).
4.2.5 Probabilistic QMRA
A Bayesian multiple linear regression model was constructed to illustrate the inference of
daily risk of illness and DALYs (Figure 4-4). Turbidity and the fecal indicator (E. coli) reflect the
water condition and are used as indicators of Cryptosporidium presence/absence in the
source water. Four variables (precipitation, temperature, turbidity and E. coli) are the
predictors in the logistic Gaussian process regression, and its outcome is the Cryptosporidium
class (presence/absence). The predicted class is the input to the lower part of the Bayesian multiple linear
regression model, which estimates daily exposure.
Figure 4-4 Schematic model for the process of DALYs estimation. The arrows represent the
relationship between two variables.
In the response model, the relationships between nodes are mathematical
expressions (refer to section 2.4). However, since the parameters daily water intake ($D$),
sewer overflow rate ($\theta$), treatment efficiency ($\eta$), and morbidity ($\alpha$) follow known
distributions, daily illness and DALYs are presented as distributions instead of point
estimates.
The ingested dose per day is calculated as:
$$\mathrm{Dose\ Ingested}_{day} = (C_{source} \times (1 - \theta) + C_{sewer} \times \theta) \times D \times \eta$$ (eq. 5.9)
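The structure of eq. 5.9 can be sketched as a mixing step followed by treatment and consumption. Two assumptions are made here that are not stated in the equation itself: water intake is expressed in litres, and the treatment term is applied as the surviving fraction $10^{-\eta}$ implied by the LRV definition in eq. 5.1, rather than as a direct factor $\eta$. The input values are illustrative only (the sewage concentration reuses the 18 oocysts/L average cited above), and the function names are hypothetical.

```python
def mixed_concentration(c_source, c_sewer, theta):
    """Oocyst concentration after mixing source water with overflow sewage."""
    return c_source * (1.0 - theta) + c_sewer * theta

def ingested_dose_per_day(c_source, c_sewer, theta, intake_litres, lrv):
    """Daily ingested dose; treatment is applied as 10**(-lrv) (an assumption)."""
    raw = mixed_concentration(c_source, c_sewer, theta)
    return raw * intake_litres * 10.0 ** (-lrv)

# Illustrative inputs: 0.1 oocysts/L in source water, 18 oocysts/L in sewage,
# overflow rate 0.022, 1 L consumed per day, 2 log removal.
dose = ingested_dose_per_day(0.1, 18.0, 0.022, 1.0, 2.0)
```

Even a small overflow fraction dominates the mixed concentration here, because sewage carries far more oocysts than the source water.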
The annual risk of illness was determined by randomly sampling 365 times
from the calculated daily risks:
$$P(\mathrm{illness}, \mathrm{yr}) = 1 - \prod_{i=1}^{365} \left(1 - P(\mathrm{illness}, i)\right)$$ (eq. 5.10)
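Eq. 5.8 and eq. 5.10 translate directly into code. The sketch below uses a constant, made-up daily infection probability for clarity, whereas the study samples 365 daily risks from their distribution; the function names are hypothetical.

```python
def daily_illness_risk(p_infection, morbidity):
    """Daily probability of illness (eq. 5.8)."""
    return p_infection * morbidity

def annual_illness_risk(daily_risks):
    """Probability of at least one illness over the supplied days (eq. 5.10)."""
    p_no_illness = 1.0
    for p in daily_risks:
        p_no_illness *= (1.0 - p)
    return 1.0 - p_no_illness

# Hypothetical constant daily infection probability and morbidity ratio.
daily = [daily_illness_risk(2e-4, 0.5) for _ in range(365)]
annual = annual_illness_risk(daily)
```

For small daily risks the annual risk is close to, but slightly below, the sum of the daily risks.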
4.3 Results and discussion
4.3.1 Cryptosporidium prediction
In this study, the efficacy of Gaussian process classification combined with threshold
moving was investigated. After fitting a logistic Gaussian process regression
model to the Cryptosporidium data, a range of classification thresholds was applied to the
estimated parameter of the Bernoulli distribution and the corresponding precision and recall
were examined. A grid search was used to locate the optimal
threshold; the precision, recall and corresponding F-scores at varying thresholds
are shown in Figure 4-5. The plots demonstrate the importance of choosing
an appropriate threshold and the advantage of using the F-score, which balances precision and recall. The
threshold (0.12) achieving the highest F-score (0.70) was chosen and is highlighted by red
points in both panels.
For predictions of Cryptosporidium presence versus absence, an overall accuracy of
93.77% was observed when the selected threshold was reapplied to the data. The precision
and recall were 0.58 and 0.83, respectively. As such, the model has potential for
distinguishing conditions that could result in Cryptosporidium presence in source water. In
particular, the approach rarely under-predicted risk (false negative rate/significance level =
0.02), which is more dangerous than over-prediction. Cohen’s kappa, which
measures the agreement of two raters on categorical scales (McHugh, 2012),
is 0.65 (p-value < 0.05), suggesting a fair to good strength of agreement.
The role of threshold moving was assessed by comparing these outcomes
with those of the logistic Gaussian process classifier without threshold moving. With the threshold set
at 0.5, the overall accuracy is still high (0.92), but the classifier cannot distinguish the minority class
“presence” from the majority class “absence”; the minority samples
are ignored, resulting in zero precision and recall. The threshold moving
method uses the original training data set to train the model and then moves the decision
threshold so that the minority class is more easily predicted. Compared with other methods such
as data augmentation and resampling, threshold moving does not introduce external
biases (He & Ma et al., 2013) and is simple and straightforward to implement.
Figure 4-5 (a) The precision-recall curve when varying the threshold for predicting
“presence”. (b) The F-score versus threshold curve.
Table 4-1 Confusion matrix for binary classification of presence and absence of
Cryptosporidium
                   Predicted absence   Predicted presence
Actual absence     320                 18
Actual presence    5                   25
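The summary statistics discussed above can be recomputed from the counts in Table 4-1. The sketch below uses the standard definitions of accuracy, precision, recall and Cohen's kappa; the function name is hypothetical, and small differences from the reported values may reflect rounding.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and Cohen's kappa from a 2 x 2 confusion matrix."""
    n = tn + fp + fn + tp
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Agreement expected by chance, from the row (actual) and column (predicted) totals.
    p_chance = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n ** 2
    kappa = (accuracy - p_chance) / (1.0 - p_chance)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "kappa": kappa}

# Counts from Table 4-1 (rows: actual class, columns: predicted class).
metrics = classification_metrics(tn=320, fp=18, fn=5, tp=25)
```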
4.3.2 Scenario assessments with probabilistic QMRA
4.3.2.1 Climate change
According to the findings of the U.S. Global Climate Change Science Special Report (Wuebbles
et al., 2017), the global annual average surface air temperature increased by 1.0 ℃
over the past 115 years (1901-2016). If annual carbon dioxide emissions continue to
increase rapidly, as they have since the beginning of the 21st century, global temperature is
predicted to rise by 2.78 - 5.56 ℃ above baseline by the end of this century. If
emissions increase more slowly or begin to decline significantly by the mid-21st
century, the predicted warming would still be in the range of 1.33 - 3.28 ℃
(Wuebbles et al., 2017). In addition to overall warming, extreme weather and climate events
such as extreme precipitation, heatwaves, floods, droughts and major hurricanes are
becoming more frequent in many regions (Myhre et al., 2019; Shukla et al., 2019). Extreme
weather events such as increased precipitation and temperature have been
associated with water quality impacts and an increase in waterborne diseases (Khan et al.,
2015). Compared to normal conditions, the odds of identifying Cryptosporidium oocysts and
Giardia cysts in surface water increase by a factor of 2 to 3 after extreme weather
events (Young et al., 2015).
According to the U.S. Global Change Research Program, extreme precipitation is
defined as days with precipitation in the top 1 percent of all days with precipitation. Recent
analyses of observed data suggest that in New York and New England, the intensity of
extreme rainfall events has increased since the 1950s (DeGaetano et al., 2011). The
extreme 24 h precipitation over the past 10 years in Westchester County was obtained
from the interactive web tool Extreme Precipitation in New York & New England
(DeGaetano et al., 2011). The estimate is 13.03 cm/day,
with a lower confidence limit of 11.84 cm and an upper confidence limit of 14.30 cm. Myhre
et al. (2019) concluded that the observed intensity of daily heavy precipitation events increases
with surface temperature at a rate of 6-7% per K.
Temperature and precipitation are important meteorological factors that can affect
water quality and human health. Based on previous records of extreme rainfall and
predictions of future global warming, the temperature increase under emissions control,
$\Delta T_{controlled}$ (unit: ℃), and the temperature increase without emissions control,
$\Delta T_{uncontrolled}$ (unit: ℃), are assumed to follow Gaussian distributions, where the second
parameter is the standard deviation:
$$\Delta T_{controlled} \sim \mathcal{N}(2.30, 0.49)$$ (eq. 5.11)
$$\Delta T_{uncontrolled} \sim \mathcal{N}(4.17, 0.70)$$ (eq. 5.12)
so that the 95% confidence interval for $\Delta T_{controlled}$ is (1.33, 3.28) and the 95% confidence
interval for $\Delta T_{uncontrolled}$ is (2.78, 5.56), consistent with the estimated ranges of temperature
increase.
The extreme precipitation event $Prec_{extreme}$ (unit: cm/day) follows the Gaussian
distribution:
$$Prec_{extreme} \sim \mathcal{N}(13.03, 0.62)$$ (eq. 5.13)
so that the mean is consistent with the estimate of 13.03 cm and the 95% confidence interval
is (11.84, 14.30).
$$\Delta Prec_{extreme,controlled} = 0.065 \times \Delta T_{controlled}$$ (eq. 5.14)
$$\Delta Prec_{extreme,uncontrolled} = 0.065 \times \Delta T_{uncontrolled}$$ (eq. 5.15)
The increase in intensity $\Delta Prec_{extreme}$ after climate change depends on the temperature
change. The data sets with increased temperature and increased extreme precipitation
were generated by Monte Carlo simulation, sampling values at random from each of the
distributions above.
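The scenario distributions above can be checked and sampled with the standard library. The sketch first recovers the central 95% intervals implied by each mean and standard deviation, then runs a small Monte Carlo simulation of extreme precipitation under warming. It assumes that the quantity in eq. 5.14 and eq. 5.15 is a fractional increase applied multiplicatively to the baseline intensity; that reading, the function names and the seed are assumptions made for illustration.

```python
import random
from statistics import NormalDist

def central_interval(mu, sd, level=0.95):
    """Central interval of a normal distribution with the given mean and sd."""
    dist = NormalDist(mu, sd)
    tail = (1.0 - level) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

def simulate_extreme_precip(delta_t_mu, delta_t_sd, n=2000, seed=0):
    """Monte Carlo draws of extreme precipitation (cm/day) after warming."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n):
        delta_t = rng.gauss(delta_t_mu, delta_t_sd)   # eq. 5.11 or 5.12
        baseline = rng.gauss(13.03, 0.62)             # eq. 5.13
        fractional_increase = 0.065 * delta_t         # eq. 5.14 or 5.15
        # Assumption: the increase scales the baseline intensity.
        sims.append(baseline * (1.0 + fractional_increase))
    return sims

controlled_interval = central_interval(2.30, 0.49)
uncontrolled_interval = central_interval(4.17, 0.70)
precip_interval = central_interval(13.03, 0.62)
precip_controlled = simulate_extreme_precip(2.30, 0.49)
precip_uncontrolled = simulate_extreme_precip(4.17, 0.70)
```

The computed intervals agree with the ranges quoted in the text to within rounding.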
It is of interest to determine whether the climate change-induced increases in temperature
and extreme precipitation intensity influence DALYs. The
DALYs for Cryptosporidium under climate change were calculated 2,000 times.
Each time, 365 samples were drawn at random from the predicted Cryptosporidium oocysts under climate change,
for both controlled and uncontrolled emissions, and the results were
compared with the DALYs for Cryptosporidium before climate change calculated by the same
method. Box plots of DALYs before and after climate change are shown in
Figure 4-6 (b).
Figure 4-6 Quantile-quantile plots (Q-Q plots) (a) and box plots (b) of disability-adjusted life
years (DALYs) before climate change, after climate change under controlled emission, and
after climate change under uncontrolled emission.
Furthermore, a potential limitation of the model is its sensitivity to temperatures
above the optimum. To examine the model’s response to elevated temperatures, the
temperature was increased from 15 ℃ to 65 ℃. The DALYs at different
temperatures are presented in Figure 4-7. Although 55 ℃ is reported as a temperature that
leads to rapid inactivation of Cryptosporidium oocysts (Hassan et al., 2021; King et al., 2005),
the DALYs increase with temperature.
Figure 4-7 DALYs for Cryptosporidium under temperatures from 15 to 65 °C.
The mean DALYs before climate change, after climate change under controlled
emission and after climate change under uncontrolled emission are $1.364 \times 10^{-4}$,
$1.374 \times 10^{-4}$ and $1.377 \times 10^{-4}$, respectively. Before testing for statistically significant
differences between the means of the three groups using one-way ANOVA, a Shapiro-Wilk test
was conducted to determine whether the data were drawn from a normally distributed DALYs
population. The Shapiro-Wilk test is widely used to test normality (Shapiro & Wilk, 1965) and
is based on the correlation between the data and the corresponding normal scores. The p-values
of the normality tests are 0.53, 0.11 and 0.08; p-values > 0.05 imply that the data distribution is not
significantly different from a normal distribution. The DALYs were found to differ significantly
based on the ANOVA result (p-value < $10^{-6}$), and according to the Tukey HSD test the
differences were significant between DALYs before climate change and after climate change
under controlled emission (p-value < $10^{-6}$), before
climate change and after climate change under uncontrolled emission (p-value < $10^{-6}$), and
under controlled and uncontrolled emissions (p-value = 0.0005).
The results suggest that extreme precipitation and temperature
increases have significant effects on illness caused by Cryptosporidium, and they are consistent
with previous studies that report strong associations between extreme precipitation,
temperature and increased concentrations of protozoa and other microorganisms (Atherholt
et al., 1998; Curriero et al., 2001). Rainfall can increase particulate matter in water following
surface runoff and cause resuspension of sediments from river and lake bottoms (Atherholt et
al., 1998). Davies et al. (2004) simulated artificial rainfall events and concluded that the
oocyst load was significantly affected by rainfall intensity and duration. Temperature is
another important factor that affects the survival of Cryptosporidium oocysts and their
inactivation and infectivity during transport. A 4 log reduction in infectivity has been
observed above 20°C (King et al., 2005), and an increase in surface temperature could
reduce the persistence of pathogens deposited on the land surface (Sterk et al., 2016).
However, the oocysts can remain infective for at least 3 months when stored between 4°C and
15°C (King et al., 2005; Fayer et al., 1998). The annual average temperature at the
sampling location is 7.93°C; after the temperature increase it is
10.23°C under controlled emission and 12.1°C under uncontrolled emission, all of which are below the
upper survival temperature of 15°C. The simulations suggest that the increases in temperature and
precipitation intensity accompanying climate change will significantly increase source concentrations of Cryptosporidium.
However, DALYs for Cryptosporidium oocysts under elevated temperatures from 15 to 65°C
were observed to increase, as shown in Figure 4-7. This finding is contrary to previous
studies, which have suggested that the oocysts are rapidly inactivated following exposure
to temperatures above approximately 50–60 °C. A possible explanation is
the lack of adequate temperature data to train the model to classify Cryptosporidium
oocyst presence/absence at temperatures above 25 °C, which is the maximum recorded
temperature.
4.3.2.2 Treatment technique improvement
Given the possible increase in Cryptosporidium concentrations, it is of interest to
understand the level of treatment needed to minimize the overall health burden.
For the filtration improvement scenario, three treatment levels (2.5 log, 3 log and 3.5
log) were assumed, represented by three continuous uniform distributions
with means of 2.5, 3.0 and 3.5:
$$\eta_1 \sim U(2.0, 3.0)$$
$$\eta_2 \sim U(2.5, 3.5)$$
$$\eta_3 \sim U(3.0, 4.0)$$
The mean DALYs before treatment improvement and with treatment improved to 2.5 ± 0.5 log,
3.0 ± 0.5 log and 3.5 ± 0.5 log are $1.4 \times 10^{-4}$, $4.7 \times 10^{-5}$,
$1.5 \times 10^{-5}$ and $4.98 \times 10^{-6}$, respectively. The box plots of DALYs are shown in Figure 4-8. The p-values of the
normality tests for the four groups are 0.16, 0.97, 0.77 and 0.87, implying that the data
distributions are not significantly different from normal distributions, so normality of
the four groups can be assumed. As expected, improving the removal efficiency resulted in
a significant decrease in disease burden based on ANOVA (p-value < $10^{-6}$) and the Tukey HSD
test (all p-values < $10^{-6}$).
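The one-way ANOVA used in these group comparisons rests on the ratio of between-group to within-group variance. The sketch below computes that F statistic from scratch on a small made-up data set; obtaining the p-value additionally requires the F distribution, for which a library routine such as `scipy.stats.f_oneway` could be used. The function name and data are hypothetical.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Made-up groups, standing in for DALYs samples from different scenarios.
f_stat = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
```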
Figure 4-8 (a) Q-Q plots of DALYs before treatment improvement and for improved
treatment at means of 2.5 ± 0.5 log, 3.0 ± 0.5 log and 3.5 ± 0.5 log. (b) Box plots of DALYs in the
four groups.
Backward reasoning refers to the process of working backward from the goal. It
differs from forward inference, which starts from the known evidence (Fung et al., 1994).
The focus of backward reasoning is to investigate the evidence (i.e., removal efficacy, sewer
overflow rate) that leads to a given outcome. In addition to showing the
influence of improved treatment on the final health burden, the developed
probabilistic QMRA model can also be used to determine the log removal required for a given
health burden by simulation through MCMC. To conservatively achieve the health
burden goal of $10^{-6}$ DALYs per person per year, the upper level of the 95% confidence interval
was set to $10^{-6}$; the target DALYs distribution, with the same standard deviation as the current
DALYs, would be:
$$DALYs_{target} \sim \mathcal{N}(6.29 \times 10^{-7}, 2.26 \times 10^{-7})$$ (eq. 5.19)
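The parameters of the target distribution can be checked numerically. Reading the second parameter as a standard deviation, the sketch below computes upper percentiles of eq. 5.19; under that reading the stated goal of $10^{-6}$ coincides with the one-sided 95th percentile, whereas the upper limit of a two-sided 95% interval would lie slightly above it. The variable names are hypothetical.

```python
from statistics import NormalDist

# Target DALYs distribution from eq. 5.19 (mean, standard deviation).
target = NormalDist(6.29e-7, 2.26e-7)

upper_one_sided_95 = target.inv_cdf(0.95)    # 95th percentile
upper_two_sided_95 = target.inv_cdf(0.975)   # upper limit of the central 95% interval
```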
A total of 2,000 samples drawn from the posterior distribution of the treatment parameter
(the first 500 samples drawn from each chain were discarded as warm-up) are
presented in Figure 4-9. The sampling results suggest that, to keep the upper level of the 95%
UI below $10^{-6}$, treatment has to be improved to approximately
4 log removal both before and after climate change (before climate change: mean = 4.07; after
climate change: mean = 4.10). Conventional treatment (e.g. coagulation, sedimentation,
filtration) has been observed to result in generally lower LRVs and is largely dependent on
the effectiveness of coagulation (Dugan et al., 2001). Although 3-log removal is considered
to be achieved if the treatment process meets the filtration requirements (USEPA, 2006),
given the possible occurrence of transient elevated concentrations of Cryptosporidium
in source waters (Assavasilavasukul et al., 2008), the results suggest that more stringent and
efficient techniques should be considered, such as membrane processes including
microfiltration (4.0-7.0 log removal) and ultrafiltration (4.4-6.0 log removal), and UV disinfection
with advanced oxidation (~6 log removal) (Hamilton et al., 2018).
Figure 4-9 The dark density plot represents the estimated probability density of
samples drawn from the target distribution, while the light density plot represents the estimated
probability density of samples drawn from the treatment distribution before improvement (uniform
distribution, lower = 1.5, upper = 2.5). (a) Samples from the target distribution
before climate change. (b) Samples from the target distribution after
climate change under emission control.
4.3.2.3 Sewer overflow control
Controlling the sewer overflow rate decreases the overall DALYs because less
raw sewage or partially treated wastewater overflows from treatment plants
into water bodies. Three different sewer overflow rates were modelled,
represented by three normal distributions with means of 0.01, 0.005 and 0.001 and the same
standard deviation as the current sewer overflow rate (0.12). The box plots of DALYs are
shown in Figure 4-10.
Figure 4-10. (a) Quantile-quantile (Q-Q) plots of DALYs at the current sewer overflow rate
(0.022) and at three reduced sewer overflow rates (0.01, 0.005, and 0.001). (b) Box plots of
DALYs in the four groups.
The p-values of the normality tests for the four groups are 0.9, 0.93, 0.73, and 0.41,
so normality of the DALYs distributions is not rejected in any of the four groups. The ANOVA
result shows that controlling the sewer overflow rate significantly decreases DALYs
(p-value < 10⁻⁶). The Tukey HSD test also indicates a significant difference in DALYs in
all pairwise comparisons (all p-values are below 10⁻⁶).
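A one-way ANOVA compares the variability between group means with the variability within groups. The sketch below computes the F statistic from first principles for a few small made-up groups. The numbers are illustrative only and are not the DALY samples from this study, and the function name `one_way_anova_f` is hypothetical. Turning F into a p-value additionally requires the F distribution from a statistics library, which is omitted here.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    # Between-group sum of squares: spread of the group means.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Illustrative data only (NOT results from the thesis).
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f_stat, df_b, df_w)
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom indicates that at least one group mean differs; a post hoc test such as Tukey HSD then identifies which pairs differ.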
Figure 4-11. The dark density plot shows the estimated probability density of samples
drawn from the target sewer overflow rate distribution; the light density plot shows the
estimated probability density of samples drawn from the sewer overflow rate distribution
before control (normal distribution, mean = 0.022, sd = 0.12). (a) Samples from the target
distribution before climate change. (b) Samples from the target distribution after climate
change under emission control.
As the sampling results in Figure 4-11 show, to keep the upper level of the
95% UI below 1×10⁻⁶ DALYs, a sewer overflow rate of around 0.0005 is
required before and after climate change (before climate change: mean = 0.00045; after
climate change: mean = 0.0005). According to the UN, 80% of the world’s sewage is currently
discharged without treatment (WWAP, 2017). In 2016 and 2017, over 4% of all wastewater
collected and discharged in Canada was untreated (Statistics Canada, 2017). Although the
magnitude of sewage in drinking water has not been estimated in previous studies, our
model suggests the value should be controlled at 0.05% if the treatment efficiency remains
2 ± 0.5 log without improvement. Beyond the technology-based controls that communities use
to address sewer overflow problems, a screening-level assessment of the impact of future
climate change on sewer overflow has been provided for the New England and Great Lakes
regions (USEPA, 2008). Furthermore, smart data infrastructure for real-time wet weather
control and decision-making has been considered in recent years (USEPA,
2018). With the help of advances in weather and climate prediction, sewer overflow is
expected to be minimized to the target value.
4.4 Summary
A probabilistic quantitative microbial risk assessment model was developed to estimate the
health risk under different scenarios and determine the control points. Cryptosporidium
monitoring data are inherently highly imbalanced, with a high proportion of zero values.
Gaussian process regression and a threshold-moving method based on the precision-recall
curve were applied to address the imbalanced data problem. The model performed well in
binary classification, with a precision of 0.58 and a recall of 0.83.
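Threshold moving can be illustrated with a small sketch: instead of classifying a sample as positive when the predicted probability exceeds 0.5, each candidate threshold is scored on the precision-recall trade-off and the best one is kept. The selection criterion used below (maximum F1) is an assumption made for illustration, since the text states only that the threshold was based on the precision-recall curve. The labels and probabilities are made up, and all function names are hypothetical.

```python
def precision_recall(y_true, y_prob, threshold):
    """Precision and recall when predicting positive if prob >= threshold."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for y, p in pairs if p >= threshold and y == 1)
    fp = sum(1 for y, p in pairs if p >= threshold and y == 0)
    fn = sum(1 for y, p in pairs if p < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, y_prob):
    """Try every predicted probability as a threshold; keep the best F1.

    ASSUMPTION: F1 is the selection criterion; the source does not say.
    """
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: f1(*precision_recall(y_true, y_prob, t)))


# Made-up imbalanced example: 9 absences, 3 presences.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_prob = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.22, 0.33, 0.28, 0.36, 0.55]

default_pr = precision_recall(y_true, y_prob, 0.5)   # conventional cut-off
t_star = best_threshold(y_true, y_prob)              # moved threshold
moved_pr = precision_recall(y_true, y_prob, t_star)

# F1 implied by the precision and recall reported in the text.
reported_f1 = f1(0.58, 0.83)

print(default_pr, t_star, moved_pr, round(reported_f1, 3))
```

In this toy example, lowering the threshold trades some precision for a large gain in recall, which is usually the preferred direction when missing a pathogen occurrence is more costly than a false alarm.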
A stochastic model was used to estimate the health risk probabilistically by
incorporating factors that affect public health, such as water treatment efficiency, drinking
water consumption, morbidity, and sewage overflow. Prediction of health risk under different
scenarios was based on the Monte Carlo method, while the backward reasoning about the
targets for treatment improvement and sewage overflow control was conducted with the
Markov chain Monte Carlo method. The model provides reasonable estimates of disease
burden (in DALYs) under different levels of treatment efficiency improvement and sewage
overflow control. Based on the dataset used, to conservatively achieve the goal of
10⁻⁶ DALYs, the model suggests that treatment efficiency should be at least 4 log and the
sewage overflow rate should be controlled below 0.05%.
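The forward (Monte Carlo) part of such a model can be sketched as repeated sampling of uncertain inputs followed by propagation through an exposure and dose-response chain. The example below is a toy illustration only and is not the thesis model: the exponential dose-response form, the annualisation step, every numerical input except the uniform(1.5, 2.5) treatment distribution quoted in the Figure 4-9 caption, and all function names are hypothetical placeholders. Sewer overflow is left out for brevity.

```python
import math
import random


def simulate_dalys(n_draws, lrv_low, lrv_high, seed=0):
    """Monte Carlo draws of annual DALYs per person under a toy QMRA chain."""
    rng = random.Random(seed)
    # HYPOTHETICAL placeholder inputs, chosen only for illustration.
    r = 0.09                # exponential dose-response parameter
    p_ill_given_inf = 0.7   # probability of illness given infection
    daly_per_case = 1.7e-3  # DALYs per case of illness
    draws = []
    for _ in range(n_draws):
        conc = rng.lognormvariate(math.log(0.01), 1.0)  # oocysts/L, source water
        intake = rng.uniform(0.5, 2.0)                  # L/day consumed
        lrv = rng.uniform(lrv_low, lrv_high)            # treatment log removal
        dose = conc * 10.0 ** (-lrv) * intake           # oocysts ingested per day
        p_inf_daily = -math.expm1(-r * dose)            # 1 - exp(-r * dose)
        # ASSUMPTION: 365 independent daily exposures per year.
        p_inf_annual = -math.expm1(365 * math.log1p(-p_inf_daily))
        draws.append(p_inf_annual * p_ill_given_inf * daly_per_case)
    return draws


def quantile(values, q):
    """Simple empirical quantile (no interpolation)."""
    s = sorted(values)
    return s[min(len(s) - 1, int(q * len(s)))]


baseline = simulate_dalys(2000, 1.5, 2.5)   # treatment before improvement
improved = simulate_dalys(2000, 3.5, 4.5)   # hypothetical improved treatment

for name, d in (("baseline", baseline), ("improved", improved)):
    print(name, f"median={quantile(d, 0.5):.2e}", f"95th={quantile(d, 0.95):.2e}")
```

Because both runs share a seed, each improved draw differs from its baseline counterpart only through the 2-log shift in treatment, so the comparison isolates the effect of treatment. The backward step described in the text, inferring the treatment level that meets a target DALY distribution, would replace this forward sampling with MCMC and is not shown.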
Because Cryptosporidium monitoring is expensive, fecal indicator bacteria are
usually used as predictors. However, they do not always predict high concentrations
of oocysts, potentially underestimating Cryptosporidium risk. The developed model can be
more sensitive to potential Cryptosporidium risk and is well suited to assessing risks under
different scenarios, such as forecasting Cryptosporidium alerts to inform management and to
develop strategies, including the level of treatment required and sewage overflow control.
Chapter 5: Conclusion
5.1 Summary of Contributions
In this thesis, Bayesian methods were applied to model cyanobacteria abundance and
Cryptosporidium oocysts in source waters. Imbalanced data methods, including zero-inflated
modelling and threshold moving, were explored. Several variable selection, model selection,
and model checking methods under the Bayesian framework were applied to extract relevant
features and identify the model that best fits the data. Markov chain Monte Carlo (MCMC)
methods were applied to sample from posterior distributions to estimate parameters. In the
probabilistic quantitative microbial risk assessment (QMRA) of Cryptosporidium, the Monte
Carlo method was used to simulate and estimate the probability distribution of disease
burden under different scenarios. The contributions of this thesis can be summarized as follows:
1. Bayesian zero-inflated models were successful in improving the fit of the model to
imbalanced cyanobacteria abundance data. Chl a, temperature, turbidity, solar radiation, and
total phosphorus were identified as the key factors for cyanobacterial growth.
Based on the comparison results, zero-inflated models fit significantly better than the
negative binomial model (p = 0.002). While the fits of the zero-inflated negative binomial
and hurdle negative binomial models were comparable, one of the two was chosen as the final
model after considering the mechanisms of zero generation. The average accuracy for
cyanobacteria classification was 0.4, indicating poor performance. The model performed
poorly in predicting low or medium alert levels, and predictions of safe levels dominated.
However, the model separated high alert from safe levels better, and for binary decisions
the precision and recall were 0.62 and 0.99, respectively.
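The difference between the two zero-generating mechanisms can be written out explicitly. In the sketch below, $f_{NB}(y \mid \mu, \phi)$ denotes the negative binomial probability mass function and $\pi$ the probability attached to the zero-generating component. This notation is assumed here for illustration and may differ from the notation used in the modelling chapter.

```latex
% Zero-inflated negative binomial: zeros arise from both components
\[
P_{\mathrm{ZINB}}(Y = y) =
\begin{cases}
\pi + (1 - \pi)\, f_{NB}(0 \mid \mu, \phi), & y = 0, \\[4pt]
(1 - \pi)\, f_{NB}(y \mid \mu, \phi), & y > 0.
\end{cases}
\]

% Hurdle negative binomial: all zeros arise from the hurdle component,
% and positive counts follow a zero-truncated negative binomial
\[
P_{\mathrm{hurdle}}(Y = y) =
\begin{cases}
\pi, & y = 0, \\[4pt]
(1 - \pi)\, \dfrac{f_{NB}(y \mid \mu, \phi)}{1 - f_{NB}(0 \mid \mu, \phi)}, & y > 0.
\end{cases}
\]
```

In the zero-inflated form, a zero count can arise either from a separate "always zero" process or from the count process itself, whereas the hurdle form attributes every zero to a single process and models positive counts separately.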
According to the projection predictive inference results, Chl a, temperature, turbidity,
solar radiation, and total phosphorus were identified as the key factors for cyanobacterial
growth. Because the kernel density estimates of the coefficients of Chl a and temperature
were distributed above zero, these variables are assumed to be positively correlated with
cyanobacteria abundance. Solar radiation and total phosphorus, with coefficients distributed
primarily below zero, are believed to be negatively correlated with cyanobacteria abundance.
The effect of turbidity on cyanobacteria abundance is either positive or negative, depending
on the non-algal matter concentration. As such, the models have great potential for
identifying potential cyanobacterial blooms and the key environmental factors necessary for
cyanobacterial growth in natural waters.
The Bayesian approach also contrasts with previous models that generate point
estimates. The approach used in this thesis is expected to enable the modelling
process to produce estimated cell counts of microorganisms with an associated level
of uncertainty. With only a few environmental factors required, Bayesian zero-inflated
models can be applied to model the abundance of other microorganisms and to understand the
relative importance of the factors.
2. The probabilistic QMRA model developed using Monte Carlo methods and MCMC
can be used to quantify the day-to-day Cryptosporidium health risk in DALYs and determine
the control points in different scenarios. Gaussian process classification achieved high
accuracy in Cryptosporidium classification.
The integrated probabilistic model in this thesis includes two parts: a Gaussian process
regression model to predict the presence/absence of Cryptosporidium in source water, and a
QMRA model that incorporates information such as drinking water treatment efficiency, daily
water intake, and sewer overflow rate to generate health risk estimates.
Gaussian processes are probabilistic methods for regression and classification that
provide predictive distributions with uncertainty estimates. For binary
classification of Cryptosporidium, Gaussian process classification achieved a high overall
accuracy of 93.7%, with a precision of 0.58 and a recall of 0.83. QMRA models are widely
used to quantify the health risks from microorganisms in source waters and can support
water management decisions. Compared with previous studies that focus on
deterministic QMRA, the probabilistic QMRA in this thesis used Monte
Carlo methods as well as MCMC simulation, which allow each variable to be continuous
instead of discrete and minimize the potential information loss.
The model suggests that increased temperature and precipitation under climate change
would significantly increase the disease burden of Cryptosporidium. To conservatively
keep the upper level of the 95% UI below 10⁻⁶ DALYs, treatment needs to be
improved to 4 log removal if the other interventions and inputs remain unchanged.
Thus, more stringent and efficient treatment techniques should be considered, such as
microfiltration, reverse osmosis, and ultrafiltration. If the other conditions remain
unchanged, a sewer overflow rate of 5×10⁻⁴ is necessary to conservatively keep the upper
level of the 95% UI below 10⁻⁶ DALYs. It is expected that, when new predictor data become
available, this model could be used to predict the presence of other microorganisms in
source water, quantify the health risks, and determine the necessary control points in
different scenarios such as climate change, treatment technique improvement, and sewer
overflow control.
5.2 Limitations and suggestions for future research
The predictors used for Cryptosporidium (turbidity, temperature, precipitation,
and fecal indicator concentration (E. coli)) are limited, which may have decreased the model’s
accuracy in Cryptosporidium classification. More water quality variables and the interactions
between variables could be considered to improve predictability. In addition, the
temperature data used to predict Cryptosporidium are observations from the weather station
nearest to the reservoir rather than water temperature. Although Bayesian models combined
with imbalanced data methods perform well in predicting the presence or absence of
microorganisms in water, the prediction of non-binary count data remains a challenge
and should be investigated further.
Another limitation was treating each observation as independent and
ignoring temporal autocorrelation, i.e. the similarity between observations as a function
of the time lag. Observations of cyanobacteria and Cryptosporidium presence/absence that
are close together in time are likely to be similar. Time series at daily and seasonal
scales should be considered, and the seasonal dynamics of bacteria and pathogens in natural
waters should be investigated.
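Temporal autocorrelation can be checked directly from a monitoring record before deciding whether a time series model is needed. The sketch below computes the sample autocorrelation at several lags for a made-up daily presence/absence series; the data are illustrative only and the function name `autocorrelation` is hypothetical.

```python
from statistics import fmean


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a given non-negative lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = fmean(series)
    # Total variation around the mean (lag-0 autocovariance times n).
    denom = sum((x - mean) ** 2 for x in series)
    # Co-variation between each value and the value `lag` steps later.
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


# Made-up daily presence/absence record with clustered detections.
detections = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
acf = {lag: autocorrelation(detections, lag) for lag in range(4)}

for lag, value in acf.items():
    print(f"lag {lag}: {value:+.3f}")
```

A clearly positive value at lag 1, as in this toy series, indicates that consecutive observations are not independent, which is the situation described above.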
Further improvement of the QMRA model could focus on other scenarios, such as
exposure through recreational waters or increased herd immunity due to vaccination. More
data, such as precise sewer overflow records, should be collected to obtain better prior
distributions for the parameters. Accurate prediction of Cryptosporidium oocyst
concentration is important for improving the disease burden estimate. Other
machine learning methods, such as ensemble learning, AdaBoost, SMOTEBoost, and
AUCBoost, could be considered and compared to find the model that best fits the data.
Bibliography
Acero, J. L., Rodriguez, E., & Meriluoto, J. (2005). Kinetics of reactions between chlorine
and the cyanobacterial toxins microcystins. Water research, 39(8), 1628-1638.
Aguilera, P. A., Fernández, A., Fernández, R., Rumí, R., & Salmerón, A. (2011). Bayesian
networks in environmental modelling. Environmental Modelling & Software, 26(12),
1376-1388.
Agulló-Barceló, M., Oliva, F., & Lucena, F. (2013). Alternative indicators for monitoring
Cryptosporidium oocysts in reclaimed water. Environmental Science and Pollution
Research, 20(7), 4448-4454.
Ahmed, S. A., & Karanis, P. (2018). Comparison of current methods used to detect
Cryptosporidium oocysts in stools. International journal of hygiene and
environmental health, 221(5), 743-763.
Amha, Y. M., Kumaraswamy, R., & Ahmad, F. (2015). A probabilistic QMRA of Salmonella
in direct agricultural reuse of treated municipal wastewater. Water Science and
Technology, 71(8), 1203-1211.
Antoniou, M. G., De La Cruz, A. A., & Dionysiou, D. D. (2005). Cyanotoxins: New generation
of water contaminants. Journal of environmental engineering, 131(9), 1239-1243.
Assavasilavasukul, P., Lau, B. L., Harrington, G. W., Hoffman, R. M., & Borchardt, M. A.
(2008). Effect of pathogen concentrations on removal of Cryptosporidium and
Giardia by conventional drinking water treatment. Water Research, 42(10-11), 2678-
2690.
Atherholt, T. B., LeChevallier, M. W., Norton, W. D., & Rosen, J. S. (1998). Effect of rainfall
on Giardia and Crypto. Journal‐American Water Works Association, 90(9), 66-80.
Baek, S. S., Pyo, J., Pachepsky, Y., Park, Y., Ligaray, M., Ahn, C. Y., ... & Cho, K. H.
(2020). Identification and enumeration of cyanobacteria species using a deep neural
network. Ecological Indicators, 115, 106395.
Barker, S. F., Amoah, P., & Drechsel, P. (2014). A probabilistic model of gastroenteritis risks
associated with consumption of street food salads in Kumasi, Ghana: Evaluation of
methods to estimate pathogen dose from water, produce or food quality. Science of
the Total Environment, 487, 130-142.
Beaudequin, D., Harden, F., Roiko, A., & Mengersen, K. (2016). Utility of Bayesian networks
in QMRA-based evaluation of risk reduction options for recycled water. Science of
the Total Environment, 541, 1393-1409.
Beaudequin, D., Harden, F., Roiko, A., Stratton, H., Lemckert, C., & Mengersen, K. (2015).
Beyond QMRA: Modelling microbial health risk as a complex system using Bayesian
networks. Environment international, 80, 8-18.
Berkhof, J., Van Mechelen, I., & Hoijtink, H. (2000). Posterior predictive checks: Principles
and discussion. Computational Statistics, 15(3), 337-354.
Bertone, E., Sahin, O., Richards, R., & Roiko, A. (2016). Extreme events, water quality and
health: A participatory Bayesian risk assessment tool for managers of reservoirs.
Journal of Cleaner Production, 135, 657-667.
Betancourt, W. Q., & Rose, J. B. (2004). Drinking water treatment processes for removal of
Cryptosporidium and Giardia. Veterinary parasitology, 126(1-2), 219-234.
Bishop, C. M., & Nasrabadi, N. M. (2006). Pattern recognition and machine learning (Vol. 4,
No. 4, p. 738). New York: springer.
Bivins, A. W., Sumner, T., Kumpel, E., Howard, G., Cumming, O., Ross, I., ... & Brown, J.
(2017). Estimating infection risks and the global burden of diarrheal disease
attributable to intermittent water supply using QMRA. Environmental science &
technology, 51(13), 7542-7551.
Bouwknegt, M., Knol, A. B., van der Sluijs, J. P., & Evers, E. G. (2014). Uncertainty of
population risk estimates for pathogens based on QMRA or epidemiology: a case
study of Campylobacter in the Netherlands. Risk analysis, 34(5), 847-864.
Bownik, A. (2016). Harmful algae: Effects of cyanobacterial cyclic peptides on aquatic
invertebrates-a short review. Toxicon, 124, 26-35.
Box, G. E., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time series analysis:
forecasting and control. John Wiley & Sons.
Boyer, G. L. (2008). Cyanobacterial toxins in New York and the lower Great Lakes
ecosystems. In Cyanobacterial harmful algal blooms: state of the science and
research needs (pp. 153-165). Springer, New York, NY.
Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of
iterative simulations. Journal of computational and graphical statistics, 7(4), 434-455.
Carlson, R. E., & Simpson, J. (1996). A coordinator’s guide to volunteer lake monitoring
methods. North American Lake Management Society, 96, 305.
Carmichael, W. W. (1997). The cyanotoxins. In Advances in botanical research (Vol. 27, pp.
211-256). Academic Press.
Carmichael, W. W. (2001). Assessment of blue-green algal toxins in raw and finished
drinking water. American Water Works Association.
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., ... &
Riddell, A. (2017). Stan: A probabilistic programming language. Journal of statistical
software, 76(1), 1-32.
Catalina, A., Bürkner, P. C., & Vehtari, A. (2020). Projection Predictive Inference for
Generalized Linear and Additive Multilevel Models. arXiv preprint arXiv:2010.06994.
Catherine, Q., Susanna, W., Isidora, E. S., Mark, H., Aurelie, V., & Jean-François, H.
(2013). A review of current knowledge on toxic benthic freshwater cyanobacteria–
ecology, toxin production and risk management. Water research, 47(15), 5464-5479.
Cha, Y., Park, S. S., Kim, K., Byeon, M., & Stow, C. A. (2014). Probabilistic prediction of
cyanobacteria abundance in a Korean reservoir using a Bayesian Poisson model.
Water Resources Research, 50(3), 2518-2532.
Cha, Y., Cho, K. H., Lee, H., Kang, T., & Kim, J. H. (2017). The relative importance of water
temperature and residence time in predicting cyanobacteria abundance in regulated
rivers. Water research, 124, 11-19.
Chaffin, J. D., Kane, D. D., Stanislawczyk, K., & Parker, E. M. (2018). Accuracy of data
buoys for measurement of cyanobacteria, chlorophyll, and turbidity in a large lake
(Lake Erie, North America): implications for estimation of cyanobacterial bloom
parameters from water quality sonde measurements. Environmental science and
pollution research, 25(25), 25175-25189.
Chapra, S. C., et al. (2017). Climate change impacts on harmful algal blooms in U.S.
freshwaters: A screening-level assessment. Environmental Science & Technology, 51,
8933-8943.
Chaudhry, R. M., Hamilton, K. A., Haas, C. N., & Nelson, K. L. (2017). Drivers of microbial
risk for direct potable reuse and de facto reuse treatment schemes: The impacts of
source water quality and blending. International journal of environmental research
and public health, 14(6), 635.
Checkley, W., White Jr, A. C., Jaganath, D., Arrowood, M. J., Chalmers, R. M., Chen, X. M.,
... & Houpt, E. R. (2015). A review of the global burden, novel diagnostics,
therapeutics, and vaccine targets for Cryptosporidium. The Lancet Infectious
Diseases, 15(1), 85-94.
Chen, X. M., Keithly, J. S., Paya, C. V., & LaRusso, N. F. (2002). Cryptosporidiosis. New
England Journal of Medicine, 346(22), 1723-1731.
Chorus, I., & Welker, M. (2021). Toxic cyanobacteria in water: a guide to their public health
consequences, monitoring and management (p. 858). Taylor & Francis.
Chow, C. W., Drikas, M., House, J., Burch, M. D., & Velzeboer, R. M. (1999). The impact of
conventional water treatment processes on cells of the cyanobacterium Microcystis
aeruginosa. Water Research, 33(15), 3253-3262.
Christensen, V. G., Graham, J. L., Milligan, C. R., Pope, L. M., & Ziegler, A. C. (2006). Water
quality and relation to taste-and-odor compounds in North Fork Ninnescah River and
Cheney Reservoir, south-central Kansas, 1997-2003 (No. 2006-5095). US
Geological Survey.
Cirés, S., Wörmer, L., Timón, J., Wiedner, C., & Quesada, A. (2011). Cylindrospermopsin
production and release by the potentially invasive cyanobacterium Aphanizomenon
ovalisporum under temperature and light gradients. Harmful Algae, 10(6), 668-675.
Clancy, J. L., Bukhari, Z., McCuin, R. M., Matheson, Z., & Fricker, C. R. (1999). USEPA
method 1622. Journal‐American Water Works Association, 91(9), 60-68.
Coffey, R., Cummins, E., Cormican, M., Flaherty, V. O., & Kelly, S. (2007). Microbial
exposure assessment of waterborne pathogens. Human and Ecological Risk
Assessment, 13(6), 1313-1351.
Coffey, R., Cummins, E., Bhreathnach, N., Flaherty, V. O., & Cormican, M. (2010).
Development of a pathogen transport model for Irish catchments using SWAT.
Agricultural Water Management, 97(1), 101-111.
Coffey, R., Cummins, E., O’Flaherty, V., & Cormican, M. (2010). Analysis of the soil and
water assessment tool (SWAT) to model Cryptosporidium in surface water sources.
Biosystems Engineering, 106(3), 303-314.
Collell, G., Prelec, D., & Patil, K. R. (2018). A simple plug-in bagging ensemble based on
threshold-moving for classifying binary and multiclass imbalanced data.
Neurocomputing, 275, 330-340.
Corso, P. S., Kramer, M. H., Blair, K. A., Addiss, D. G., Davis, J. P., & Haddix, A. C. (2003).
Costs of illness in the 1993 waterborne Cryptosporidium outbreak, Milwaukee,
Wisconsin. Emerging infectious diseases, 9(4), 426.
Costán-Longares, A., Montemayor, M., Payan, A., Mendez, J., Jofre, J., Mujeriego, R., &
Lucena, F. (2008). Microbial indicators and pathogens: removal, relationships and
predictive capabilities in water reclamation facilities. Water research, 42(17), 4439-
4448.
Curriero, F. C., Patz, J. A., Rose, J. B., & Lele, S. (2001). The association between extreme
precipitation and waterborne disease outbreaks in the United States, 1948–1994.
American journal of public health, 91(8), 1194-1199.
Davidson, K., Gowen, R. J., Tett, P., Bresnan, E., Harrison, P. J., McKinney, A., ... &
Crooks, A. M. (2012). Harmful algal blooms: how strong is the evidence that nutrient
ratios and forms influence their occurrence?. Estuarine, Coastal and Shelf Science,
115, 399-413.
Davies, C. M., Ferguson, C. M., Kaucner, C., Krogh, M., Altavilla, N., Deere, D. A., &
Ashbolt, N. J. (2004). Dispersion and transport of Cryptosporidium oocysts from fecal
pats under simulated rainfall events. Applied and environmental microbiology, 70(2),
1151-1159.
DeGaetano, A., Zarrow, D., & Center, N. R. C. (2011). Extreme Precipitation in New York &
New England. Technical Documentation and User Manual, Northeast Regional
Climate Center, Cornell University, Ithaca, NY.
Desai, N. T., Sarkar, R., & Kang, G. (2012). Cryptosporidiosis: an under-recognized public
health problem. Tropical parasitology, 2(2), 91.
Dilks, D. W., Canale, R. P., & Meier, P. G. (1992). Development of Bayesian Monte Carlo
techniques for water quality model uncertainty. Ecological Modelling, 62(1-3), 149-
162.
Dolman, A. M., Rücker, J., Pick, F. R., Fastner, J., Rohrlack, T., Mischke, U., & Wiedner, C.
(2012). Cyanobacteria and cyanotoxins: the influence of nitrogen versus
phosphorus. PloS one, 7(6), e38757.
Donald, M., Cook, A., & Mengersen, K. (2009). Bayesian network for risk of diarrhea
associated with the use of recycled water. Risk Analysis: An International Journal,
29(12), 1672-1685.
Donald, M., Mengersen, K., Toze, S., Sidhu, J. P., & Cook, A. (2011). Incorporating
parameter uncertainty into quantitative microbial risk assessment (QMRA). Journal
of water and health, 9(1), 10-26.
Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., ... & Lautenbach, S.
(2013). Collinearity: a review of methods to deal with it and a simulation study
evaluating their performance. Ecography, 36(1), 27-46.
Dugan, N. R., Fox, K. R., Owens, J. H., & Miltner, R. J. (2001). Controlling Cryptosporidium
oocysts using conventional treatment. Journal‐American Water Works Association,
93(12), 64-76.
Dzialowski, A. R., Smith, V. H., Huggins, D. G., Denoyelles, F., Lim, N. C., Baker, D. S. &
Beury, J. H. (2009). Development of predictive models for geosmin-related taste and
odor in Kansas, USA, drinking water reservoirs. water research, 43(11), 2829-2840.
Dzialowski, A. R., Smith, V. H., Wang, S. H., Martin, M. C., & Jr, F. D. (2011). Effects of non-
algal turbidity on cyanobacterial biomass in seven turbid Kansas reservoirs. Lake
and Reservoir Management, 27(1), 6-14.
Edzwald, J. K., Tobiason, J. E., Parento, L. M., Kelley, M. B., Kaminski, G. S., Dunn, H. J., &
Galant, P. B. (2000). Giardia and Cryptosporidium removals by clarification and
filtration under challenge conditions. Journal‐American Water Works Association,
92(12), 70-84.
Efstratiou, A., Ongerth, J. E., & Karanis, P. (2017). Waterborne transmission of protozoan
parasites: review of worldwide outbreaks-an update 2011–2016. Water research,
114, 14-22.
Efstratiou, A., Ongerth, J., & Karanis, P. (2017). Evolution of monitoring for Giardia and
Cryptosporidium in water. Water Research, 123, 96-112.
EPA, U. (1998). National Primary Drinking Water Regulations: Interim Enhanced Surface
Water Treatment. Federal Register: Rules and Regulations, 63, 69478-69521.
Falconer, I. R., Runnegar, M. T., Buckley, T., Huyn, V. L., & Bradshaw, P. (1989). Using
activated carbon to remove toxicity from drinking water containing cyanobacterial
blooms. Journal‐American Water Works Association, 81(2), 102-105.
Fang, F., Gao, Y., Gan, L., He, X., & Yang, L. (2018). Effects of different initial pH and
irradiance levels on cyanobacterial colonies from Lake Taihu, China. Journal of
Applied Phycology, 30(3), 1777-1793.
Fayer, R. J. M. T., Trout, J. M., & Jenkins, M. C. (1998). Infectivity of Cryptosporidium
parvum oocysts stored in water at environmental temperatures. The Journal of
parasitology,
Fayer, R., Graczyk, T. K., Lewis, E. J., Trout, J. M., & Farley, C. A. (1998). Survival of
infectious Cryptosporidium parvum oocysts in seawater and eastern oysters
(Crassostrea virginica) in the Chesapeake Bay. Applied and Environmental
Microbiology, 64(3), 1070-1074.
Fayer, R., Morgan, U., & Upton, S. J. (2000). Epidemiology of Cryptosporidium:
transmission, detection and identification. International journal for parasitology,
30(12-13), 1305-1322.
Ferrão-Filho, A. D. S., & Kozlowsky-Suzuki, B. (2011). Cyanotoxins: bioaccumulation and
effects on aquatic animals. Marine drugs, 9(12), 2729-2772.
Francy, D. S., Stelzer, E. A., Duris, J. W., Brady, A. M., Harrison, J. H., Johnson, H. E., &
Ware, M. W. (2013). Predictive models for Escherichia coli concentrations at inland
lake beaches and relationship of model variables to pathogen detection. Applied and
environmental microbiology, 79(5), 1676-1688.
Freni, G., & Mannina, G. (2010). Bayesian approach for uncertainty quantification in water
quality modelling: The influence of prior distribution. Journal of Hydrology, 392(1-2),
31-39.
Frey, S. K., Topp, E., Edge, T., Fall, C., Gannon, V., Jokinen, C., ... & Lapen, D. R. (2013).
Using SWAT, Bacteroidales microbial source tracking markers, and fecal indicator
bacteria to predict waterborne pathogen occurrence in an agricultural watershed.
Water research, 47(16), 6326-6337.
Fung, R., & Del Favero, B. (1994). Backward simulation in Bayesian networks. In
Uncertainty Proceedings 1994 (pp. 227-234). Morgan Kaufmann.
Gammie, L., Goatcher, L. and Fok, N. (2000). A Giardia/Cryptosporidium near miss? In:
Proceedings of the 8th National Conference on Drinking Water, Quebec City,
Quebec, October 28–30, 1998. Canadian Water and Wastewater Association,
Ottawa, Ontario.
Gelman, A. (2005). Comment: Fuzzy and Bayesian p-values and u-values. Statistical Science, 20.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013).
Bayesian data analysis. CRC press.
Gerace, E., Presti, V. D. M. L., & Biondo, C. (2019). Cryptosporidium infection:
epidemiology, pathogenesis, and differential diagnosis. European Journal of
Microbiology and Immunology, 9(4), 119-123.
Ghernaout, B., Ghernaout, D., & Saiba, A. (2010). Algae and cyanotoxins removal by
coagulation/flocculation: A review. Desalination and Water Treatment, 20(1-3), 133-
143.
Gijsbertsen-Abrahamse, A. J., Schmidt, W., Chorus, I., & Heijman, S. G. J. (2006). Removal
of cyanotoxins by ultrafiltration and nanofiltration. Journal of Membrane Science,
276(1-2), 252-259.
Gómez-Couso, H., Fontán-Sainz, M., McGuigan, K. G., & Ares-Mazás, E. (2009). Effect of
the radiation intensity, water turbidity and exposure time on the survival of
Cryptosporidium during simulated solar disinfection of drinking water. Acta tropica,
112(1), 43-48.
Gronewold, A. D., Qian, S. S., Wolpert, R. L., & Reckhow, K. H. (2009). Calibrating and
validating bacterial water quality models: A Bayesian approach. Water Research,
43(10), 2688-2698.
Haas, C. N., Rose, J. B., & Gerba, C. P. (2014). Quantitative microbial risk assessment.
John Wiley & Sons.
Hall, D. B. (2000). Zero‐inflated Poisson and binomial regression with random effects: a
case study. Biometrics, 56(4), 1030-1039.
Hamilton, D. P., Salmaso, N., & Paerl, H. W. (2016). Mitigating harmful cyanobacterial
blooms: strategies for control of nitrogen and phosphorus loads. Aquatic Ecology,
50(3), 351-366.
Hamilton, K. A., Waso, M., Reyneke, B., Saeidi, N., Levine, A., Lalancette, C., ... & Ahmed,
W. (2018). Cryptosporidium and Giardia in wastewater and surface water
environments. Journal of environmental quality, 47(5), 1006-1023.
Harke, M. J., Steffen, M. M., Gobler, C. J., Otten, T. G., Wilhelm, S. W., Wood, S. A., &
Paerl, H. W. (2016). A review of the global ecology, genomics, and biogeography of
the toxic cyanobacterium, Microcystis spp. Harmful algae, 54, 4-20.
Harris, T. D., & Graham, J. L. (2017). Predicting cyanobacterial abundance, microcystin,
and geosmin in a eutrophic drinking-water reservoir using a 14-year dataset. Lake
and reservoir management, 33(1), 32-48.
Hassan, E. M., Örmeci, B., DeRosa, M. C., Dixon, B. R., Sattar, S. A., & Iqbal, A. (2021). A
review of Cryptosporidium spp. and their detection in water. Water Science and
Technology, 83(1), 1-25.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their
applications.
He, X., Liu, Y. L., Conklin, A., Westrick, J., Weavers, L. K., Dionysiou, D. D., ... & Walker, H.
W. (2016). Toxic cyanobacteria and drinking water: Impacts, detection, and
treatment. Harmful algae, 54, 174-193.
Health Canada, 2002. Guidelines for Canadian Drinking Water Quality: Supporting
Documentation, Cyanobacterial Toxins –Microcystin-LR. Available online at
http://www.hcsc.gc.ca/ewhsemt/alt_formats/hecssesc/pdf/pubs/watereau/cyanobacterial_toxins/cyanobacterial_toxins-eng.pdf.
Health Canada. (2018). Guidance on the Use of Quantitative Microbial Risk Assessment in
Drinking Water.
Health Canada. (2019). Guidelines for Canadian drinking water quality: Enteric Protozoa:
Giardia and Cryptosporidium.
Hegg, A., Radersma, R., & Uller, T. (2022). Seasonal variation in the response to a toxin‐
producing cyanobacteria in Daphnia. Freshwater Biology.
Hilborn, E. D., Roberts, V. A., Backer, L., DeConno, E., Egan, J. S., Hyde, J. B., ... &
Hlavsa, M. C. (2014). Algal bloom–associated disease outbreaks among users of
freshwater lakes—United States, 2009–2010. MMWR. Morbidity and mortality
weekly report, 63(1), 11.
Himberg, K., Keijola, A. M., Hiisvirta, L., Pyysalo, H., & Sivonen, K. (1989). The effect of
water treatment processes on the removal of hepatotoxins from Microcystis
and Oscillatoria cyanobacteria: A laboratory study. Water Research, 23(8), 979-984.
Hirata, T., & Hashimoto, A. (1998). Experimental assessment of the efficacy of
microfiltration and ultrafiltration for Cryptosporidium removal. Water Science and
Technology, 38(12), 103-107.
Hrudey, M. B., Drikas, M., & Gregory, R. (1999). REMEDIAL MEASURES.
Hrudey, S. E., & Hrudey, E. J. (2004). Safe drinking water. IWA publishing.
Huang, W. J., Cheng, B. L., & Cheng, Y. L. (2007). Adsorption of microcystin-LR by three
types of activated carbon. Journal of Hazardous Materials, 141(1), 115-122.
Huisman, J., Codd, G. A., Paerl, H. W., Ibelings, B. W., Verspagen, J. M., & Visser, P. M.
(2018). Cyanobacterial blooms. Nature Reviews Microbiology, 16(8), 471-483.
Hunter, P. R., De Sylor, M. A., Risebro, H. L., Nichols, G. L., Kay, D., & Hartemann, P.
(2011). Quantitative microbial risk assessment of cryptosporidiosis and giardiasis
from very small private water supplies. Risk Analysis: An International Journal, 31(2),
228-236.
Ibelings, B. W., Backer, L. C., Kardinaal, W. E. A., & Chorus, I. (2014). Current approaches
to cyanotoxin risk assessment and risk management around the globe. Harmful
algae, 40, 63-74.
Ishaq, S., Sadiq, R., Chhipi-Shrestha, G., Farooq, S., & Hewage, K. (2022). Developing an
Integrated “Regression-QMRA method” to Predict Public Health Risks of Low Impact
Developments (LIDs) for Improved Planning. Environmental Management, 1-17.
Jiang, P., Liu, X., Zhang, J., Te, S. H., Gin, K. Y. H., Van Fan, Y., ... & Shoemaker, C. A.
(2021). Cyanobacterial risk prevention under global warming using an extended
Bayesian network. Journal of Cleaner Production, 312, 127729.
Jin, C., Mesquita, M. M., Deglint, J. L., Emelko, M. B., & Wong, A. (2018). Quantification of
cyanobacterial cells via a novel imaging-driven technique with an integrated
fluorescence signature. Scientific reports, 8(1), 1-12.
Jöhnk, K. D., Huisman, J. E. F., Sharples, J., Sommeijer, B. E. N., Visser, P. M., & Stroom,
J. M. (2008). Summer heatwaves promote blooms of harmful cyanobacteria. Global
change biology, 14(3), 495-512.
Khan, S. J., Deere, D., Leusch, F. D., Humpage, A., Jenkins, M., & Cunliffe, D. (2015).
Extreme weather events: Should drinking water quality management systems adapt
to changing risk profiles?. Water research, 85, 124-136.
Kim, J., Jonoski, A., & Solomatine, D. P. (2022). A Classification-Based Machine Learning
Approach to the Prediction of Cyanobacterial Blooms in Chilgok Weir, South Korea.
Water, 14(4), 542.
Kim, S., Kim, S., Mehrotra, R., & Sharma, A. (2020). Predicting cyanobacteria occurrence
using climatological and environmental controls. Water Research, 175, 115639.
King, B. J., Keegan, A. R., Monis, P. T., & Saint, C. P. (2005). Environmental temperature
controls Cryptosporidium oocyst metabolic rate and associated retention of
infectivity. Applied and environmental microbiology, 71(7), 3848-3857.
Klemer, A. R., & Konopka, A. E. (1989). Causes and consequences of blue-green algal
(cyanobacterial) blooms. Lake and Reservoir Management, 5(1), 9-19.
Korich, D. G., Mead, J. R., Madore, M. S., Sinclair, N. A., & Sterling, C. (1990). Effects of
ozone, chlorine dioxide, chlorine, and monochloramine on Cryptosporidium parvum
oocyst viability. Applied and environmental microbiology, 56(5), 1423-1428.
Korner-Nievergelt, F. et al. (2015). Posterior predictive model checking and proportion of
explained variance. In Bayesian Data Analysis in Ecology Using Linear Models with R,
BUGS, and STAN (pp. 161-174). Elsevier. doi:10.1016/B978-0-12-801370-0.00010-1.
Kuk, A. Y., Li, J., & John Rush, A. (2014). Variable and threshold selection to control
predictive accuracy in logistic regression. Journal of the Royal Statistical Society:
Series C (Applied Statistics), 63(4), 657-672.
Lahti, K., Rapala, J., Kivimäki, A. L., Kukkonen, J., Niemelä, M., & Sivonen, K. (2001).
Occurrence of microcystins in raw water sources and treated drinking water of
Finnish waterworks. Water science and technology, 43(12), 225-228.
Lal, A., Fearnley, E., & Kirk, M. (2015). The risk of reported cryptosporidiosis in children
aged <5 years in Australia is highest in very remote regions. International journal of
environmental research and public health, 12(9), 11815-11828.
Lalancette, C., Papineau, I., Payment, P., Dorner, S., Servais, P., Barbeau, B., ... & Prévost,
M. (2014). Changes in Escherichia coli to Cryptosporidium ratios for various fecal
pollution sources and drinking water intakes. Water research, 55, 150-161.
Lambert, B. (2018). A student’s guide to Bayesian statistics. Sage.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in
manufacturing. Technometrics, 34(1), 1-14.
Lambert, T. W., Holmes, C. F., & Hrudey, S. E. (1996). Adsorption of microcystin-LR by
activated carbon and removal in full scale water treatment. Water Research, 30(6),
1411-1422.
LeBlanc Renaud, S., Pick, F. R., & Fortin, N. (2011). Effect of light intensity on the relative
dominance of toxigenic and nontoxigenic strains of Microcystis aeruginosa. Applied
and environmental microbiology, 77(19), 7016-7022.
LeChevallier, M. W., Norton, W. D., & Lee, R. G. (1991). Occurrence of Giardia and
Cryptosporidium spp. in surface water supplies. Applied and Environmental
Microbiology, 57(9), 2610-2616.
Lee, J., & Walker, H. W. (2006). Effect of process variables and natural organic matter on
removal of microcystin-LR by PAC-UF. Environmental science & technology,
40(23), 7336-7342.
Lee, J., Lee, S., & Jiang, X. (2017). Cyanobacterial toxins in freshwater and food: important
sources of exposure to humans. Annual review of food science and technology, 8,
281-304.
Lee, T. A., Rollwagen-Bollens, G., Bollens, S. M., & Faber-Hammond, J. J. (2015).
Environmental influence on cyanobacteria abundance and microcystin toxin
production in a shallow temperate lake. Ecotoxicology and environmental safety,
114, 318-325.
Leitch, G. J., & He, Q. (2011). Cryptosporidiosis-an overview. Journal of biomedical
research, 25(1), 1-16.
Levich, A. P. (1996). The role of nitrogen-phosphorus ratio in selecting for dominance of
phytoplankton by cyanobacteria or green algae and its application to reservoir
management. Journal of Aquatic Ecosystem Health, 5(1), 55-61.
Li, Z., Guo, J. S., Fang, F., Gao, X., Sheng, J. P., Zhou, H., & Long, M. (2010). Seasonal
variation of cyanobacteria and its potential relationship with key environmental
factors in Xiaojiang backwater area, Three Gorges Reservoir. Huan Jing ke Xue=
Huanjing Kexue, 31(2), 301-309.
Ligda, P., Claerebout, E., Kostopoulou, D., Zdragas, A., Casaert, S., Robertson, L. J., &
Sotiraki, S. (2020). Cryptosporidium and Giardia in surface water and drinking water:
animal sources and towards the use of a machine-learning approach as a tool for
predicting contamination. Environmental Pollution, 264, 114766.
Linden, K. G., Shin, G. A., Faubert, G., Cairns, W., & Sobsey, M. D. (2002). UV disinfection
of Giardia lamblia cysts in water. Environmental science & technology, 36(11), 2519-
2522.
Lindquist, H. D. A., Bennett, J. W., Broomall, K., Glover, G., & Schaefer III, F. W. (1999).
Counting Cryptosporidium, an analysis of the utility of various cytometric
techniques. Presented at the Annual Meeting of the American Society of
Parasitologists, Monterey, CA, July 5-10, 1999.
Lisle, J. T., & Rose, J. B. (1995). Gene exchange in drinking water and biofilms by natural
transformation. Water Science and Technology, 31(5-6), 41-46.
Litke, D. W. (1999). Review of phosphorus control measures in the United States and their
effects on water quality (Vol. 99, No. 4007). US Department of the Interior, US
Geological Survey.
Liu, W., An, W., Jeppesen, E., Ma, J., Yang, M., & Trolle, D. (2019). Modelling the fate and
transport of Cryptosporidium, a zoonotic and waterborne pathogen, in the Daning
River watershed of the Three Gorges Reservoir Region, China. Journal of
environmental management, 232, 462-474.
Lürling, M., Eshetu, F., Faassen, E. J., Kosten, S., & Huszar, V. L. (2013). Comparison of
cyanobacterial and green algal growth rates at different temperatures. Freshwater
Biology, 58(3), 552-559.
Maatouk, I., Bouaïcha, N., Fontan, D., & Levi, Y. (2002). Seasonal variation of microcystin
concentrations in the Saint-Caprais reservoir (France) and their removal in a small
full-scale treatment plant. Water Research, 36(11), 2891-2897.
Mac Kenzie, W. R., Hoxie, N. J., Proctor, M. E., Gradus, M. S., Blair, K. A., Peterson, D. E.,
... & Davis, J. P. (1994). A massive outbreak in Milwaukee of Cryptosporidium
infection transmitted through the public water supply. New England journal of
medicine, 331(3), 161-167.
Malve, O., Laine, M., Haario, H., Kirkkala, T., & Sarvala, J. (2007). Bayesian modelling of
algal mass occurrences—using adaptive MCMC methods with a lake water quality
model. Environmental Modelling & Software, 22(7), 966-977.
Mangan, N. M., Flamholz, A., Hood, R. D., Milo, R., & Savage, D. F. (2016). pH determines
the energetic efficiency of the cyanobacterial CO2 concentrating mechanism.
Proceedings of the National Academy of Sciences, 113(36), E5354-E5362.
Marion, J. W., Zhang, F., Cutting, D., & Lee, J. (2017). Associations between county-level
land cover classes and cyanobacteria blooms in the United States. Ecological
Engineering, 108, 556-563.
Masciopinto, C., Vurro, M., Lorusso, N., Santoro, D., & Haas, C. N. (2020). Application of
QMRA to MAR operations for safe agricultural water reuses in coastal areas. Water
Research X, 8, 100062.
McHugh, M. L. (2012). Interrater reliability: the kappa statistic. Biochemia medica, 22(3),
276-282.
Meng, X. L. (1994). Posterior predictive p-values. The annals of statistics, 22(3), 1142-
1160.
Messner, M. J., Chappell, C. L., & Okhuysen, P. C. (2001). Risk assessment for
Cryptosporidium: a hierarchical Bayesian analysis of human dose response data.
Water Research, 35(16), 3934-3940.
Metropolis, N., & Ulam, S. (1949). The Monte Carlo method. Journal of the American
statistical association, 44(247), 335-341.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953).
Equation of state calculations by fast computing machines. The Journal of Chemical
Physics, 21(6), 1087-1092.
Min, Y., & Agresti, A. (2005). Random effect models for repeated measures of zero-inflated
count data. Statistical modelling, 5(1), 1-19.
Mok, H. F., Barker, S. F., & Hamilton, A. J. (2014). A probabilistic quantitative microbial risk
assessment model of norovirus disease burden from wastewater irrigation of
vegetables in Shepparton, Australia. Water research, 54, 347-362.
Mur, R., Skulberg, O. M., & Utkilen, H. (1999). CYANOBACTERIA IN THE ENVIRONMENT.
Murray, C. J., & Acharya, A. K. (1997). Understanding DALYs. Journal of health economics,
16(6), 703-730.
Myhre, G., Alterskjær, K., Stjern, C. W., Hodnebrog, Ø., Marelle, L., Samset, B. H., ... &
Stohl, A. (2019). Frequency of extreme precipitation increases extensively with event
rareness under global warming. Scientific reports, 9(1), 1-10.
Newcombe, G., House, J., Ho, L., Baker, P., & Burch, M. (2009). Management strategies for
cyanobacteria (blue-green algae): A guide for water utilities. Water Quality Research
Australia (WQRA), Research Report, 74, 60-76.
Ng, J. S. Y., Eastwood, K., Walker, B., Durrheim, D. N., Massey, P. D., Porigneaux, P., ... &
Ryan, U. (2012). Evidence of Cryptosporidium transmission between cattle and
humans in northern New South Wales. Experimental parasitology, 130(4), 437-441.
Nieminski, E. C., & Ongerth, J. E. (1995). Removing Giardia and Cryptosporidium by
conventional treatment and direct filtration. Journal‐American Water Works
Association, 87(9), 96-106.
O’Neil, J. M., Davis, T. W., Burford, M. A., & Gobler, C. J. (2012). The rise of harmful
cyanobacteria blooms: the potential roles of eutrophication and climate change.
Harmful algae, 14, 313-334.
Oberemm, A., Becker, J., Codd, G. A., & Steinberg, C. (1999). Effects of cyanobacterial
toxins and aqueous crude extracts of cyanobacteria on the development of fish and
amphibians. Environmental Toxicology: An International Journal, 14(1), 77-88.
Okhuysen, P. C., Chappell, C. L., Sterling, C. R., Jakubowski, W., & DuPont, H. L. (1998).
Susceptibility and serologic response of healthy adults to reinfection with
Cryptosporidium parvum. Infection and Immunity, 66(2), 441-443.
Oneby, M., Deremiah, R., & Bollyky, L. J. (2006). Pipeline Contactor for the City of Wichita,
Kansas High Pressure Ozone Facility. In Pipelines 2006: Service to the Owner (pp.
1-10).
Ong, C. S., Eisler, D. L., Goh, S. H., Tomblin, J., Awad-El-Kariem, F. M., Beard, C. B., ... &
Isaac-Renton, J. L. (1999). Molecular epidemiology of cryptosporidiosis outbreaks
and transmission in British Columbia, Canada. The American journal of tropical
medicine and hygiene, 61(1), 63-69.
Ongerth, J. E. (2013). ICR SS protozoan data site-by-site: a picture of Cryptosporidium and
Giardia in US surface water. Environmental science & technology, 47(18), 10145-
10154.
Ongerth, J. E. (2013). LT2 Cryptosporidium data: What do they tell us about
Cryptosporidium in surface water in the United States?. Environmental science &
technology, 47(9), 4029-4038.
Ongerth, J. E., & Hutton, A. P. (1997). DE filtration to remove Cryptosporidium.
Journal‐American Water Works Association, 89(12), 39-46.
Ongerth, J. E., & Hutton, P. E. (2001). Testing of diatomaceous earth filtration for removal of
Cryptosporidium oocysts. Journal‐American Water Works Association, 93(12), 54-
63.
Orak, N. H. (2020). A Hybrid Bayesian network framework for risk assessment of arsenic
exposure and adverse reproductive outcomes. Ecotoxicology and Environmental
Safety, 192, 110270.
Pachepsky, Y., Shelton, D., Dorner, S., & Whelan, G. (2016). Can E. coli or thermotolerant
coliform concentrations predict pathogen presence or prevalence in irrigation
waters?. Critical reviews in microbiology, 42(3), 384-393.
Paerl, H. W., & Huisman, J. (2009). Climate change: a catalyst for global expansion of
harmful cyanobacterial blooms. Environmental microbiology reports, 1(1), 27-37.
Paerl, H. W., & Paul, V. J. (2012). Climate change: links to global expansion of harmful
cyanobacteria. Water research, 46(5), 1349-1363.
Papadimitriou, T., Kagalou, I., Stalikas, C., Pilidis, G., & Leonardos, I. D. (2012).
Assessment of microcystin distribution and biomagnification in tissues of aquatic
food web compartments from a shallow lake and evaluation of potential risks to
public health. Ecotoxicology, 21(4), 1155-1166.
Parrott, L., & Kok, R. (2000). Incorporating complexity in ecosystem modelling. Complexity
international, 7, 1-19.
Parsons, D. J., Orton, T. G., D'Souza, J., Moore, A., Jones, R., & Dodd, C. E. R. (2005). A
comparison of three modelling approaches for quantitative risk assessment using the
case study of Salmonella spp. in poultry meat. International journal of food
microbiology, 98(1), 35-51.
Piironen, J., Paasiniemi, M., & Vehtari, A. (2020). Projective inference in high-dimensional
problems: Prediction and feature selection. Electronic Journal of Statistics, 14(1),
2155-2197.
Pyo, J., Cho, K. H., Kim, K., Baek, S. S., Nam, G., & Park, S. (2021). Cyanobacteria cell
prediction using interpretable deep learning model with observed, numerical, and
sensing data assemblage. Water Research, 203, 117483.
Pyo, J., Park, L. J., Pachepsky, Y., Baek, S. S., Kim, K., & Cho, K. H. (2020). Using
convolutional neural network for predicting cyanobacteria concentrations in river
water. Water Research, 186, 116349.
Quinonero-Candela, J., Rasmussen, C. E., & Williams, C. K. (2007). Approximation
methods for Gaussian process regression. In Large-scale kernel machines (pp. 203-
223). MIT Press.
Rabalais, N. N., Diaz, R. J., Levin, L. A., Turner, R. E., Gilbert, D., & Zhang, J. (2010).
Dynamics and distribution of natural and human-caused hypoxia. Biogeosciences,
7(2), 585-619.
Rastogi, R. P., Sinha, R. P., & Incharoensakdi, A. (2014). The cyanotoxin-microcystins:
current overview. Reviews in Environmental Science and Bio/Technology, 13(2),
215-249.
Razzolini, M. T. P., Breternitz, B. S., Kuchkarian, B., & Bastos, V. K. (2020).
Cryptosporidium and Giardia in urban wastewater: A challenge to overcome.
Environmental Pollution, 257, 113545.
Recknagel, F., Orr, P. T., Bartkow, M., Swanepoel, A., & Cao, H. (2017). Early warning of
limit-exceeding concentrations of cyanobacteria and cyanotoxins in drinking water
reservoirs by inferential modelling. Harmful algae, 69, 18-27.
Reichwaldt, E. S., & Ghadouani, A. (2012). Effects of rainfall patterns on toxic
cyanobacterial blooms in a changing climate: between simplistic scenarios and
complex dynamics. Water research, 46(5), 1372-1393.
Reinoso, R., Torres, L. A., & Bécares, E. (2008). Efficiency of natural systems for removal of
bacteria and pathogenic parasites from wastewater. Science of the total
environment, 395(2-3), 80-86.
Richardson, J., Feuchtmayr, H., Miller, C., Hunter, P. D., Maberly, S. C., & Carvalho, L.
(2019). Response of cyanobacteria and phytoplankton abundance to warming,
extreme rainfall events and nutrient enrichment. Global Change Biology, 25(10),
3365-3380.
Rigaux Ancelet, C. S., Carlin, F., Nguyen‐thé, C., & Albert, I. (2013). Inferring an augmented
Bayesian network to confront a complex quantitative microbial risk assessment
model with durability studies: application to Bacillus cereus on a courgette purée
production chain. Risk analysis, 33(5), 877-892.
Robarts, R. D., & Zohary, T. (1987). Temperature effects on photosynthetic capacity,
respiration, and growth rates of bloom‐forming cyanobacteria. New Zealand Journal
of Marine and Freshwater Research, 21(3), 391-399.
Robertson, L. J., Campbell, A. T., & Smith, H. V. (1992). Survival of Cryptosporidium
parvum oocysts under various environmental pressures. Applied and environmental
microbiology, 58(11), 3494-3500.
Rose, J. B. (1997). Environmental ecology of Cryptosporidium and public health
implications. Annual review of public health, 18(1), 135-161.
Rousso, B. Z., Bertone, E., Stewart, R., & Hamilton, D. P. (2020). A systematic literature
review of forecasting and predictive models for cyanobacteria blooms in freshwater
lakes. Water Research, 182, 115959.
Rue, H., Martino, S., & Chopin, N. (2009). Approximate Bayesian inference for latent
Gaussian models by using integrated nested Laplace approximations. Journal of the
royal statistical society: Series b (statistical methodology), 71(2), 319-392.
Ryan, U., Hijjawi, N., & Xiao, L. (2018). Foodborne cryptosporidiosis. International journal
for parasitology, 48(1), 1-12.
Salmaso, N., Capelli, C., Shams, S., & Cerasino, L. (2015). Expansion of bloom-forming
Dolichospermum lemmermannii (Nostocales, Cyanobacteria) to the deep lakes
south of the Alps: colonization patterns, driving forces and implications for water use.
Harmful Algae, 50, 76-87.
Sarma, T. A. (2012). Handbook of cyanobacteria. CRC Press.
Säve-Söderbergh, M., Toljander, J., Mattisson, I., Åkesson, A., & Simonsson, M. (2018).
Drinking water consumption patterns among adults—SMS as a novel tool for
collection of repeated self-reported water consumption. Journal of exposure science
& environmental epidemiology, 28(2), 131-139.
Sawka, M. N., Cheuvront, S. N., & Carter, R. (2005). Human water needs. Nutrition reviews,
63(suppl_1), S30-S39.
Schets, F. M., Van Wijnen, J. H., Schijven, J. F., Schoon, H., & de Roda Husman, A. M.
(2008). Monitoring of waterborne pathogens in surface waters in Amsterdam, The
Netherlands, and the potential health risk associated with exposure to
Cryptosporidium and Giardia in these waters. Applied and environmental
microbiology, 74(7), 2069-2078.
Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete
samples). Biometrika, 52(3/4), 591-611.
Shirley, D. A. T., Moonah, S. N., & Kotloff, K. L. (2012). Burden of disease from
cryptosporidiosis. Current opinion in infectious diseases, 25(5), 555.
Shukla, P. R., Skea, J., Buendia, E. C., Masson-Delmotte, V., Pörtner, H. O., Roberts, D. C.,
... & Malley, J. (2019). Climate Change and Land: an IPCC special report on climate
change, desertification, land degradation, sustainable land management, food
security, and greenhouse gas fluxes in terrestrial ecosystems.
Sinha, R. P., Kumar, A., Tyagi, M. B., & Hader, D. (2005). Ultraviolet-B-induced destruction
of phycobiliproteins in cyanobacteria. Physiology and Molecular Biology of Plants,
11(2), 313.
Sivonen, K., Kononen, K., Carmichael, W. W., Dahlem, A. M., Rinehart, K. L., Kiviranta, J.,
& Niemela, S. I. (1989). Occurrence of the hepatotoxic cyanobacterium Nodularia
spumigena in the Baltic Sea and structure of the toxin. Applied and Environmental
microbiology, 55(8), 1990-1995.
Smith, H. V., & Grimason, A. M. (2003). Giardia and Cryptosporidium in water and
wastewater. In Handbook of water and wastewater microbiology (pp. 695-756).
Academic Press.
Smith, H. V., & Nichols, R. A. (2010). Cryptosporidium: detection in water and food.
Experimental parasitology, 124(1), 61-79.
Sparks, H., Nair, G., Castellanos-Gonzalez, A., & White, A. C. (2015). Treatment of
Cryptosporidium: what we know, gaps, and the way forward. Current tropical
medicine reports, 2(3), 181-187.
States, S., Stadterman, K., Ammon, L., Vogel, P., Baldizar, J., Wright, D., ... & Sykora, J.
(1997). Protozoa in river water: sources, occurrence, and treatment.
Journal‐American Water Works Association, 89(9), 74-83.
Statistics Canada. Table 38-10-0100-01 Combined sewer overflow discharge volumes (x
1,000,000)
Statistics Canada. Table 38-10-0271-01 Potable water use by sector and average daily use.
DOI: https://doi.org/10.25318/3810027101-eng
Sterk, A., Schijven, J., de Roda Husman, A. M., & de Nijs, T. (2016). Effect of climate
change on runoff of Campylobacter and Cryptosporidium from land to surface water.
Water research, 95, 90-102.
Swaffer, B., Abbott, H., King, B., van der Linden, L., & Monis, P. (2018). Understanding
human infectious Cryptosporidium risk in drinking water supply catchments. Water
research, 138, 282-292.
Sylvestre, É., Burnet, J. B., Dorner, S., Smeets, P., Medema, G., Villion, M., ... & Prévost,
M. (2021). Impact of Hydrometeorological events for the selection of parametric
models for Protozoan Pathogens in drinking‐water sources. Risk Analysis, 41(8),
1413-1426.
Taranu, Z. E., Gregory‐Eaves, I., Leavitt, P. R., Bunting, L., Buchaca, T., Catalan, J., ... &
Vinebrooke, R. D. (2015). Acceleration of cyanobacterial dominance in north
temperate‐subarctic lakes during the Anthropocene. Ecology letters, 18(4), 375-384.
Teixeira, M. R., & Rosa, M. J. (2005). Microcystins removal by nanofiltration membranes.
Separation and Purification Technology, 46(3), 192-201.
Templeton, T. J., Lancto, C. A., Vigdorovich, V., Liu, C., London, N. R., Hadsall, K. Z., &
Abrahamsen, M. S. (2004). The Cryptosporidium oocyst wall protein is a member of
a multigene family and has a homolog in Toxoplasma. Infection and Immunity, 72(2),
980-987.
Thomas, M. K., & Litchman, E. (2016). Effects of temperature and nitrogen availability on
the growth of invasive and native cyanobacteria. Hydrobiologia, 763(1), 357-369.
Thomson, S., Hamilton, C. A., Hope, J. C., Katzer, F., Mabbott, N. A., Morrison, L. J., &
Innes, E. A. (2017). Bovine cryptosporidiosis: impact, host-parasite interaction and
control strategies. Veterinary Research, 48(1), 1-16.
US Environmental Protection Agency (USEPA). (1998). Interim enhanced surface water
treatment rule. Fed Reg, 63, 69478.
LeChevallier, M. W., Norton, W. D., & Lee, R. G. (1991). Giardia and Cryptosporidium spp.
in filtered drinking water supplies. Applied and environmental Microbiology, 57(9),
2617-2621.
USEPA (US Environmental Protection Agency). (2006). Long Term 2 Enhanced Surface
Water Treatment Rule. EPA 815-R-0e16, US EPA.
USEPA. (2018). Smart Data Infrastructure for Wet Weather Control and Decision Support.
US Geological Survey, 2015, USGS National Water Information System, accessed ,
http://dx.doi.org/10.5066/F7P55KJN
van de Schoot, R., Depaoli, S., King, R., Kramer, B., Märtens, K., Tadesse, M. G., ... & Yau,
C. (2021). Bayesian statistics and modelling. Nature Reviews Methods Primers, 1(1),
1-26.
Vermeulen, L. C., van Hengel, M., Kroeze, C., Medema, G., Spanier, J. E., van Vliet, M. T.,
& Hofstra, N. (2019). Cryptosporidium concentrations in rivers worldwide. Water
Research, 149, 202-214.
Verspagen, J. M., Van de Waal, D. B., Finke, J. F., Visser, P. M., Van Donk, E., & Huisman,
J. (2014). Rising CO2 levels will intensify phytoplankton blooms in eutrophic and
hypertrophic lakes. PloS one, 9(8), e104325.
Villa, A., Fölster, J., & Kyllmar, K. (2019). Determining suspended solids and total
phosphorus from turbidity: comparison of high-frequency sampling with conventional
monitoring methods. Environmental monitoring and assessment, 191(10), 1-16.
Walls, J. T., Wyatt, K. H., Doll, J. C., Rubenstein, E. M., & Rober, A. R. (2018). Hot and
toxic: Temperature regulates microcystin release from cyanobacteria. Science of the
Total Environment, 610, 786-795.
Wang, H., Zhang, Z., Liang, D., Pang, Y., Hu, K., & Wang, J. (2016). Separation of wind's
influence on harmful cyanobacterial blooms. Water Research, 98, 280-292.
Wang, Z., Huang, K., Zhou, P., & Guo, H. (2010). A hybrid neural network model for
cyanobacteria bloom in Dianchi Lake. Procedia Environmental Sciences, 2, 67-75.
Welsh, A. H., Cunningham, R. B., Donnelly, C. F., & Lindenmayer, D. B. (1996). Modelling
the abundance of rare species: statistical models for counts with extra zeros.
Ecological Modelling, 88(1-3), 297-308.
Wenger, S. J., & Freeman, M. C. (2008). Estimating species occurrence, abundance, and
detection probability using zero‐inflated distributions. Ecology, 89(10), 2953-2959.
Westrick, J. A., Szlag, D. C., Southwell, B. J., & Sinclair, J. (2010). A review of
cyanobacteria and cyanotoxins removal/inactivation in drinking water treatment.
Analytical and bioanalytical chemistry, 397(5), 1705-1714.
Whitley, E., & Ball, J. (2002). Statistics review 6: Nonparametric methods. Critical care, 6(6),
1-5.
Whitton, B. A., & Potts, M. (2012). Introduction to the cyanobacteria. In Ecology of
Cyanobacteria II (pp. 1-13). Springer, Dordrecht.
Woods, S. A., Borges, H., Puddick, J., Biessy, L., Atalah, J., Hawes, I., ... & Hamilton, D. P.
(2017). Contrasting cyanobacterial communities and microcystin concentrations in
summers with extreme weather events: insights into potential effects of climate
change. Hydrobiologia, 785(1), 71-89.
World Health Organization. (2005). Guidelines for laboratory and field testing of mosquito
larvicides (No. WHO/CDS/WHOPES/GCDPP/2005.13). World Health Organization.
World Health Organization. (2016). Quantitative microbial risk assessment: application for
water safety management.
World Health Organization. (2021). WHO human health risk assessment toolkit: chemical
hazards. World Health Organization.
Wuebbles, D.J., D.W. Fahey, K.A. Hibbard, B. DeAngelo, S. Doherty, K. Hayhoe, R. Horton,
J.P. Kossin, P.C. Taylor, A.M. Waple, and C.P. Weaver, 2017: Executive summary.
In: Climate Science Special Report: Fourth National Climate Assessment, Volume I
[Wuebbles, D.J., D.W. Fahey, K.A. Hibbard, D.J. Dokken, B.C. Stewart, and T.K.
Maycock (eds.)]. U.S. Global Change Research Program, Washington, DC, USA, pp.
12-34, doi: 10.7930/J0DJ5CTG.
WWAP, U. (2017). WWAP (United Nations World Water Assessment Programme).
Xiao, G., Qiu, Z., Qi, J., Chen, J. A., Liu, F., Liu, W., ... & Shu, W. (2013). Occurrence and
potential health risk of Cryptosporidium and Giardia in the Three Gorges Reservoir,
China. Water research, 47(7), 2431-2445.
Xiao, L., & Feng, Y. (2017). Molecular epidemiologic tools for waterborne pathogens
Cryptosporidium spp. and Giardia duodenalis. Food and Waterborne Parasitology, 8,
14-32.
Xue, L., Zhang, Y., Zhang, T., An, L., & Wang, X. (2005). Effects of enhanced ultraviolet-B
radiation on algae and cyanobacteria. Critical reviews in microbiology, 31(2), 79-89.
Yang, L., Zhao, X., Peng, S., & Li, X. (2016). Water quality assessment analysis by using
combination of Bayesian and genetic algorithm approach in an urban lake, China.
Ecological modelling, 339, 77-88.
Yang, X. S. (2019). Introduction to algorithms for data mining and machine learning.
Academic press.
Yoder, J. S., & Beach, M. J. (2010). Cryptosporidium surveillance and risk factors in the
United States. Experimental parasitology, 124(1), 31-39.
Young, I., Smith, B. A., & Fazil, A. (2015). A systematic review and meta-analysis of the
effects of extreme weather events and other weather-related variables on
Cryptosporidium and Giardia in fresh surface waters. Journal of water and health,
13(1), 1-17.
Zhang, F., Lee, J., Liang, S., & Shum, C. K. (2015). Cyanobacteria blooms and non-
alcoholic liver disease: evidence from a county level ecological study in the United
States. Environmental Health, 14(1), 1-11.
Zhao, C. S., Shao, N. F., Yang, S. T., Ren, H., Ge, Y. R., Feng, P., ... & Zhao, Y. (2019).
Predicting cyanobacteria bloom occurrence in lakes and reservoirs before blooms
occur. Science of the total environment, 670, 837-848.
Zhao, Y., Sharma, A., Sivakumar, B., Marshall, L., Wang, P., & Jiang, J. (2014). A Bayesian
method for multi-pollution source water quality model and seasonal water quality
management in river segments. Environmental Modelling & Software, 57, 216-226.
Zhao, Y., Yan, Y., Xie, L., Wang, L., He, Y., Wan, X., & Xue, Q. (2020). Long-term
environmental exposure to microcystins increases the risk of nonalcoholic fatty liver
disease in humans: A combined fisher-based investigation and murine model study.
Environment International, 138, 105648.
Zhiteneva, V., Carvajal, G., Shehata, O., Hübner, U., & Drewes, J. E. (2021). Quantitative
microbial risk assessment of a non-membrane based indirect potable water reuse
system using Bayesian networks. Science of the Total Environment, 780, 146462.
Appendix
Chapter 3 Supplementary Materials
Figure S1: MCMC trace plots of the negative binomial model based on four chains at 1000
iterations. Parameters intercept and beta_mu[1-5] are the coefficients in the negative
binomial model; phi is the inverse overdispersion parameter.
Figure S2: MCMC trace plots of the zero-inflated negative binomial model based on four
chains at 1000 iterations. Parameters intercept2 and beta_mu[1-5] are the coefficients in
the negative binomial model; phi is the inverse overdispersion parameter. Intercept2 and
beta_theta[1-5] are the coefficients in the binomial model.
Figure S3: MCMC trace plots of the hurdle negative binomial model based on four chains at
1000 iterations. Parameters intercept2 and beta_mu[1-5] are the coefficients in the
negative binomial model; phi is the inverse overdispersion parameter. Intercept2 and
beta_theta[1-5] are the coefficients in the binomial model.