Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
GIS-BASED, DATA-DRIVEN TECHNIQUES FOR SPATIAL ANALYSIS OF INFECTIOUS DISEASES AT THE REGIONAL, STATE, AND NATIONAL LEVELS
By
ABOLFAZL MOLLALO
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2019
© 2019 Abolfazl Mollalo
To my family
4
ACKNOWLEDGMENTS
I want to thank my parents for their unconditional love and immeasurable
supports in my life. Words cannot express how grateful I am for all you have done for
me. I am extending my heartfelt thanks to my brother and lovely sister for their love,
encouragement, and help in my life. You are the most important people in my life, and I
dedicate this dissertation to you.
This work wouldn’t have been possible without the support and advice of a
number of individuals. First and foremost, I would like to express my deepest and
sincere gratitude to my mentor professor Gregory Glass for believing in me, giving me
the freedom to bridge my interests in data science and medical geography and
providing feedback over the years. I also would like to express my appreciation to my
other committee members, Dr. Liang Mao, Dr. Jason Blackburn, and Dr. Parisa Rashidi
for their insight, collaboration, feedback, and ideas throughout the entire PhD program.
I would like to give a special thank you to my friends especially those who live in
Gainesville for being there for me and are and will be like a family to me.
Finally, many thanks go to the UF Geography department including my teachers,
staff and all other people who helped me, directly or indirectly. Thanks for working hard
to create a positive learning environment. It was a great pleasure and honor for me to
be a part of this program.
5
TABLE OF CONTENTS page
ACKNOWLEDGMENTS .................................................................................................. 4
LIST OF TABLES ............................................................................................................ 7
LIST OF FIGURES .......................................................................................................... 8
LIST OF ABBREVIATIONS ........................................................................................... 10
ABSTRACT ................................................................................................................... 12
CHAPTER
1 BACKGROUND ...................................................................................................... 14
Application of New Technologies and Tools in Public Health ................................. 16 Disease Mapping .................................................................................................... 19 Global Clustering Techniques ................................................................................. 21 Local Cluster Detection Techniques ....................................................................... 23 Space-Time Clustering Techniques ........................................................................ 26 Environment and Infectious Diseases ..................................................................... 27 Disease Modeling ................................................................................................... 30
Knowledge-Driven Models ................................................................................ 30 Data-Driven Models .......................................................................................... 31
Research Questions and Hypotheses..................................................................... 34
2 A 24-YEAR EXPLORATORY SPATIAL DATA ANALYSIS OF LYME DISEASE INCIDENCE RATE IN CONNECTICUT .................................................................. 41
Materials and Methods............................................................................................ 43 Data Collection and Preparation ....................................................................... 43 Global Clustering .............................................................................................. 44 Local Clustering ................................................................................................ 45
Spatial smoothing ...................................................................................... 45 Local moran’s ............................................................................................. 45 Spatial scan statistics ................................................................................. 46
Accuracy Assessments .................................................................................... 47 Results .................................................................................................................... 48 Discussion .............................................................................................................. 50
3 MACHINE LEARNING APPROACHES IN GIS-BASED ECOLOGICAL MODELING OF THE SAND FLY PHLEBOTOMUS PAPATASI, A VECTOR OF ZOONOTIC CUTANEOUS LEISHMANIASIS ......................................................... 60
Material and Methods ............................................................................................. 62
6
Study Area ........................................................................................................ 62 Data Collection and Preparation ....................................................................... 63
Sand fly collection ...................................................................................... 63 Exploratory data ......................................................................................... 64
Classifiers Used in This Study .......................................................................... 66 Pre-processing of the Data ............................................................................... 67 Overview of Training Models ............................................................................ 68 Models Validation ............................................................................................. 69
Results .................................................................................................................... 70 Discussion .............................................................................................................. 72
4 A GIS-BASED ARTIFICIAL NEURAL NETWORK MODEL FOR SPATIAL DISTRIBUTION OF TUBERCULOSIS ACROSS THE CONTINENTAL UNITED STATE .................................................................................................................... 82
Material and Methods ............................................................................................. 86 Tuberculosis Data ............................................................................................ 86 Explanatory Data .............................................................................................. 86 Global and Local Clustering ............................................................................. 87 Artificial Neural Networks ................................................................................. 89 Model Pre-Processing ...................................................................................... 90 Model Evaluation .............................................................................................. 92
Results .................................................................................................................... 93 Discussion .............................................................................................................. 96
5 CONCLUSION ...................................................................................................... 107
Summary .............................................................................................................. 107 Limitations ............................................................................................................. 108 Outlook and Future Challenges ............................................................................ 110
APPENDIX
A STATISTICAL COMPARISON OF CLASSIFIERS ............................................... 112
B HABITAT SUITABILITY MAPS ............................................................................. 113
LIST OF REFERENCES ............................................................................................. 115
BIOGRAPHICAL SKETCH .......................................................................................... 138
7
LIST OF TABLES
Table page 1-1 Application of GIS in the study of infectious diseases ........................................ 37
1-2 Application of RS in the study of infectious diseases .......................................... 38
1-3 Application of GPS in the study of infectious diseases ....................................... 38
1-4 Application of GIS, RS, Spatial statistics and machine learning in the study of infectious diseases (Lyme disease, leishmaniasis, tuberculosis) ....................... 39
2-1 Comparison of LISA and SaTScan clusters by confusion matrix and its derivatives statistics ............................................................................................ 55
2-2 Results of the global Moran statistic of LD incidence rate, Connecticut, 1991-2014 ................................................................................................................... 55
2-3 Characteristics of LD spatial clusters detected by SSS with 5% of the population at risk throughout Connecticut, USA for the period 1991-2014 ......... 55
3-1 Bioclimate variables used in this study ............................................................... 77
3-2 Pearson correlation coefficients between selected variables ............................. 77
4-1 Top 10 states with the largest number of hotspot counties (p < 0.10) of smoothed tuberculosis (TB) incidence rate (STIR) in the continental US, 2006–2010........................................................................................................ 101
4-2 Pearson correlation analysis between selected variables for modelling STIR, continental US. ................................................................................................. 101
4-3 Results of linear regression (LR) model for modeling log (STIR), continental US. ................................................................................................................... 101
4-4 Effects of environment and socio-economic factors on the log (STIR) using LR model. ......................................................................................................... 102
4-5 Comparison of multi-layer perceptron (MLP; one and two hidden layers), and LR model’ performance for predicting log (STIR) in the continental US. .......... 102
A-1 Statistical test for comparing AUCs of SVM, LR and RF classifiers: ................. 112
8
LIST OF FIGURES
Figure page 1-1 Contingency table of Knox method ..................................................................... 40
1-2 Principle of linearly separable SVM using maximum margin .............................. 40
2-1 Geographic location of the Connecticut, its towns and approximate populations. The names on the map show the location of the towns which are mentioned in this paper. ..................................................................................... 56
2-2 Temporal trend of LD incidence throughout Connecticut from 1991 to 2014. Black dots show the incidence rate for each year and the blue line represents a scatter with smooth lines. ................................................................................ 57
2-3 Locations of spatial clusters of LD incidence in Connecticut, USA based on the true-cluster definition of with LISA and SSS methods targeting 5% of the population at risk. ............................................................................................... 58
2-4 Comparison of LISA and spatial scan statistics with 5% and 50% of the population at risk of LD in Connecticut, USA for each period from 1991 to 2014. Sensitivity, specificity and overall accuracy expressed as per cent (%).... 59
3-1 Geographic location of the study area and its counties, NE Iran. The map inset (B) illustrates the location of presence/absence sampling of Ph. papatasi. ............................................................................................................. 78
3-2 Methodology flowchart used in this study ........................................................... 79
3-3 The ROC curves and AUCs for SVM, RF and LR classifiers .............................. 79
3-4 Comparison of the SVM, LR and RF classifiers. Overall accuracy, AUC, Kappa index, sensitivity, specificity, PPV and NPV expressed as %. ................. 80
3-5 Habitat suitability map of Ph. papatasi in Golestan province, northeast Iran based on support vector machine (note: this map is based on 1km units). ........ 81
4-1 Topological architecture of multi-layer perceptron neural network (MLPNN) used in this study. ............................................................................................. 103
4-2 Spatial distribution of training, cross-validation, and test data used for modeling log (STIR). ......................................................................................... 103
4-3 The frequency of TB cases (left) and the cumulative TB incidence rate (right) across the continental US (2006–2010). .......................................................... 104
9
4-4 Hotspot map for the STIR in the continental US identified by hotspot analysis (Getis–Ord Gi*) technique, 2006–2010. ........................................................... 104
4-5 The Normal P-P Plot of LR model. ................................................................... 105
4-6 Scatter plot of observed and predicted log (STIR) (by single hidden layer MLP model) for test data in the continental US. ............................................... 106
4-7 The contribution of input features on predicting log (STIR) according to sensitivity analysis of single hidden layer MLP. RMSE: Root mean square error. ................................................................................................................. 106
B-1 Habitat suitability map of Ph. Papatasi based on logistic regression classifier . 113
B-2 Habitat suitability map of Ph. Papatasi based on random forest classifier ........ 114
10
LIST OF ABBREVIATIONS
AIDS Acquired Immune Deficiency Syndrome
AMOEBA A Multidirectional Optimal Ecotope- based Algorithm
ANN Artificial Neural Network
CDC Center for Disease Control and prevention
CSR Complete Spatial Randomness
CT Connecticut
CTDPH Connecticut Department of Public Health
DEM Digital Elevation Model
DF Dengue Fever
EBS Empirical Bayes Smoothing
ENFA Ecological Niche Factor Analysis
ENM Ecological Niche Models
ESDA Exploratory Spatial Data Analysis
EWS Early Warning Systems
GAM Geographic analysis machine
GARP Genetic Algorithm for Rule-set Production
GIS Geographic Information System
GLM General Linear Model
GPS Global Positioning System
HIV Human Immunodeficiency Virus
KDE Kernel Density Estimation
LASSO Least Absolute Shrinkage and Selection Operator
LD Lyme Disease
LISA Local Indicator Spatial Autocorrelation
11
LR Logistic Regression (in Chapter 3)
LR Linear Regression (in Chapter 4)
LST Land Surface Temperature
MAE Mean Absolute Error
MaxEnt Maximum Entropy
MCDA Multi Criteria Decision Analysis
MLP Multi-Layer Perceptron
MLT Machine Learning Technique
MODIS Moderate-resolution Imaging Spectroradiometer
NDVI Normalized Difference Vegetation Index
NPV Negative Predicative Value
PPV Positive Predictive Value
RF Random Forest
RMSE Root Mean Square Error
ROC Receiver Operating Characteristic
RS Remote Sensing
SA Spatial Autocorrelation
SDM Species Distribution Model
SSS Spatial Scan Statistics
STIR Smoothed Tuberculosis Incidence Rate
SVM Support Vector Machine
TB Tuberculosis
USGS United States Geological Survey
VIF Variance Inflation Factor
VL Visceral Leishmaniasis
12
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
GIS-BASED, DATA-DRIVEN TECHNIQUES FOR SPATIAL ANALYSIS OF
INFECTIOUS DISEASES AT THE REGIONAL, STATE, AND NATIONAL LEVELS
By
Abolfazl Mollalo
August 2019
Chair: Gregory E. Glass Major: Geography
The present research emphasized the spatial epidemiological aspects of three
different infectious diseases: Lyme disease, zoonotic cutaneous leishmaniasis, and
tuberculosis at the state, regional, and national levels, respectively. We combined
relatively innovative methods/tools, for example, remote sensing, exploratory spatial
data analyses, GIS and data science techniques. The findings of these studies can
provide useful insight to health authorities on prioritizing resource allocation to risk-
prone areas.
We examined changes in the spatial distribution of significant spatial clusters of
Lyme disease incidence rates at the town level from 1991 to 2014 as an approach for
targeted interventions. Local clustering was measured using a local indicator of spatial
autocorrelation (LISA). Elliptic spatial scan statistics (SSS) in different shapes and
directions were also performed in SaTScan. The accuracy of these two cluster detection
methods was assessed and compared for sensitivity, specificity, and overall accuracy.
In another case study, we compared several approaches to model the spatial
distribution of Phlebotomus papatasi, the primary vector of zoonotic cutaneous
leishmaniasis, in an endemic region of the disease in Golestan province, northeast of
13
Iran. We gathered and prepared data on related environmental factors
including topography, weather variables, distance to main rivers and remotely sensed
data such as normalized difference vegetation cover and land surface temperature
(LST) in a GIS framework. Applicability of three classifiers: (vanilla) logistic regression,
random forest and support vector machine (SVM) were compared for predicting
presence/absence of the vector. Predictive performances were compared using an
independent dataset to generate area under the ROC curve (AUC) and Kappa statistics.
Despite the usefulness of artificial neural networks (ANNs) in the study of various
complex problems, ANNs have not been applied for modeling the geographic
distribution of tuberculosis (TB) in the US. Likewise, ecological level researches on TB
incidence rate at the national level are inadequate for epidemiologic inferences. We
collected 278 exploratory variables including environmental and a broad range of socio-
economic features for modeling the disease across the continental US. We investigated
the applicability of multilayer perceptron (MLP) ANN for predicting the disease
incidence.
14
CHAPTER 1 BACKGROUND
"Person", "place", and "time" are known as three key components in every
descriptive epidemiologic research 1. Application of place (geography) in the study of
diseases backs to more than 2000 years ago 2. Hippocrates (460-377 BC) in his essay
entitled “On Airs, Water, and Places” proposed a theory suggesting that environmental
factors can have a significant influence on the occurrence of diseases 3. In the late
seventeenth century, Filippo Arrieta (1694) utilized maps to investigate the outbreak of
plague in Bari, Italy 4. At the end of the eighteenth century, Valentine Seaman (1795)
mapped fatal cases of yellow fever in New York city to detect a possible association
between the location of yellow fever cases and waste sites 5. Perhaps the most famous
application of place (map) in the study of disease distribution is the work of John Snow
in mapping cholera death cases in the 1850s in London. Using the distribution map of
(cholera) deaths, he could trace and detect the source of the outbreak to a
contaminated public water pump 6.
Before 1950, there were very few studies that incorporated spatial components in
health researches. During the 1950s, Jacques May (1958) proposed a new concept
titled “Disease-ecology” or “geographic pathology” 7 which surprisingly is still valid in
epidemiological studies 8. “Disease- ecology” investigates the interaction of people with
the environment. His viewpoint was "process-oriented" and applied natural scientific
explanations: "Once the person, disease, and place are known, we may be able to
understand why someone is afflicted and someone else is not” 9. The principal objective
of his approach was to better comprehend the dynamics of disease which varies per
weather condition, mineral particles in water, vegetation cover and other influencing
15
factors 10. May divided an environmental health problem into three main categories: 1)
inorganic environment such as weather/climate variables which can reflect many
features of traditional environmental philosophy, 2) organic environment which can
reflect disease pattern because of animals/plants activities, and 3) socio-economic
factors which can explain the disease associated with the behavior and culture of
human 11.
By the end of the 1960s, remarkable changes in the concepts of May’s approach
began to emerge within the literature. The variations were partially associated with the
raised attention of geographers to the disease causation through visual interpretation.
This was in contrasts with May’s approach that studies disease occurrence as
understanding the processes of disease-ecology 10. In this decade, the World Atlas of
Disease was issued under the supervision of May. The atlas represented a significant
milestone in medical geography in this century. It contained both small and relatively
large scale maps of disease morbidity which were shaded by colors or plain symbols 1.
Mapping continued as one of the essential fields where geographer contributed
substantially and is widely used in spatial epidemiological researches as the exploratory
tool for generating hypotheses useful in health care planning 12.
Before 2000, medical geography mainly focused on clustering and cluster
detection analysis. Along with the advances in technology, the field of medical
geography has progressively evolved 13. Several sophisticated spatial tools such as
geographic information system (GIS), remote sensing (RS), and novel spatial statistical
analysis such as machine learning algorithms played an important role.
16
Although historical researches in medical geography rarely considered
geographical aspects, recent (decade) studies increasingly include "spatial" component
in epidemiological inferences 14. This increment might be due to the broader availability
of spatial data and quality of spatial health data than in the past because of the
advances in technology 15. Moreover, increased availability and sharing more accurate
environmental, socio-economic, cultural, and biological data has fueled researches in
medical geography. The availability of user-friendly software packages designed to
expedite statistical analysis is another explanation, particularly for non-expert users.
Therefore, medical geographers now have more powerful tools and high-quality and
update data at hands, and thus, spatial epidemiological studies are expected to
continue to play a significant role in the epidemiology of diseases. In the later sections
of Chapter 1 we will briefly demonstrate the application of a wide range of new tools and
technologies applied in medical geography, such as GIS, RS, GPS, and spatial
statistics and models.
Application of New Technologies and Tools in Public Health
Over the last two decades, mapping techniques, spatial analyses of disease
patterns, and risk modeling of diseases have substantially improved with the advent of
GIS 16. GIS has provided unprecedented opportunities to collect, organize and integrate
geospatial data from various sources and in different formats. It has enabled conducting
very complex analysis in public health in conjunction with population and environment
that in earlier researches were very hard or impossible to be studied. This tool has
equipped medical geographers to find answers for overly complicated questions more
efficiently 17. The application of GIS in recent decades have been progressively
increased. In Table 1-1, we have summarized some of the widely-used applications of
17
GIS in the studies of infectious diseases. An overview of GIS and its application in the
studies of infectious diseases has been published by Nykiforuk et al. (2011) 18.
According to Campbell and Wynne (2011) remote sensing refers to obtain some
characteristics of an object without making physical contact with it (i.e., through an
aerial sensor) 19. It allows collecting a vast amount of data in a short time and high
accuracy which facilitates disease surveillance and control. An overview of RS and its
application in public health has been reviewed by Hay et al. (1997) 20. High-resolution
satellite images can help medical geographers to investigate geographic variations of
diseases at a small-area scale. Moreover, with the help of RS, some products such as
normalized difference vegetation cover (NDVI) which reflects vegetation cover on the
ground, land surface temperature (LST), soil property, fog, and land cover can be
derived and incorporated as candidate factors in modeling. Also, several biophysical
features can be quantified by this technology including, biomass, chlorophyll absorption,
surface texture, and moisture content 21. In public health, RS data can be used for
generating habitat suitability maps for species, predicting vector or host population or
presence/absence and identifying vector or host habitat. Table 1-2 summarizes some
applications of RS in the study of infectious diseases.
Another technology that has made a significant role in collecting geospatial data
is the global positioning system (GPS). This technology has made collecting massive
amounts of spatial and attributes data in surveillance of infectious disease faster, easier
and with an affordable cost 22. Table 1-3 summarizes some applications of GPS in the
study of diseases.
18
Numerous researches have integrated the above technologies in various types of
infectious diseases researches. These studies include mapping disease prevalence,
predicting habitat suitability map of vectors, identifying risk-prone areas of infections,
etc. As an early study that integrated the above techniques, Glass et al. (1990) used
LANDSAT TM satellite images to derive land cover/land use in developing an
environmental geodatabase. These remotely sensed products and GIS were combined
to identify risk factors and consequently high-risk areas of Lyme disease in Maryland 23.
Mollalo et al. (2018) captured the location of collected sandflies of leishmaniasis in an
endemic province in Iran using handheld GPS. They then used several environmental
factors including NDVI in a GIS framework. They could predict the presence/absence of
the sand fly with an accuracy of 90% for a hold-out dataset 24.
Spatial statistics in epidemiology are statistical tools that are used to mainly map,
explain, and predict the spatial distribution of health problems. In general, the four main
types of spatial statistics frequently used in the studies of infectious diseases are: 1)
mapping (visualization) disease count or rate 2) clustering and cluster detection analysis
(exploratory spatial data analysis) 3) Correlation analysis and 4) modeling (explain or
predict disease rates/counts) 25. The very first step in spatial statistics is linking disease
counts/rates data to their corresponding locations and visualize them as maps. The
mapping step, however, may reveal some useful information, can also conceal the
actual affected areas 26. Therefore, some statistical analyses are required to evaluate
the observed pattern of disease occurrences statistically. Spatial clustering and cluster
detection techniques can consider locations and attributes and can be used to address
this issue by identifying hotspot(s) and cold spot(s) of infectious diseases. Identifying
19
hotspots are very helpful in generating hypothesis and can provide useful information
for further analysis 27. Finally, using a proper model the possible relationship between
disease frequency/rate and several explanatory variables (such as environmental and
socio-economic factors) can be explained. A properly developed model can also be
utilized to predict (spatially/temporally) disease distribution. Detailed information about
each step of the application of spatial statistics in the study of infectious diseases is
provided as follows.
Disease Mapping
The topic of disease mapping has a very long history and is an ongoing topic
among health researchers and organizations to produce more accurate atlas/maps of
morbidity or mortality rates of various infectious diseases. An early example of this topic
is mapping the spatial distribution of cancer mortality in England and Whales by Stocks
(1936) 28. Maps of infectious diseases can illustrate a summary of the complicated
status of spatial distributions of infectious diseases. The disease map can disclose
spatial patterns in the data that are not readily detectable in a tabular format. Some
applications of disease mapping are basic descriptions of the spatial disease
distribution, hypothesis generation of possible relations between factors and health
outcome and risk-prone areas representations useful for budget and resource
allocations 29.
Choropleth maps also known as shaded or thematic maps are widely used to
illustrate mortality and morbidity rates of infectious diseases 26. These maps are
particularly useful to identify health disparities, generating hypothesis and representing
variations of infectious disease counts/rates over time. However, several issues
regarding choropleth maps exist. This representation can be non-informative or to some
20
extent misleading regarding the choice of coloring; classification techniques and the
choice of cut-off values 30. Moreover, shaded mapping for areas with a small population
can lead to high variations of estimated risks compared to the large populations 31.
In the situation of sparse or highly clustered data, Bayesian models can be a
suitable alternative 32. It provides a more robust solution as it can avoid unstable
estimates of rates/risks. These models are increasingly applied in small area
estimations and disease mapping by spatial epidemiologists. They are especially useful
for rare mapping diseases 33. For instance, empirical Bayesian estimation, a type of
Bayesian models, can either shrink unstable risks/rates toward the local mean by
obtaining information from neighboring areas or weights the unstable risks/rates toward
global average from all areas 34. It is evident that the areas with higher rates are less
smoothed than areas with lower rates. Thus, smoothing produces more stable
estimations of disease risks/rates. However, it should be noted that global mean
smoothing can produce over-smoothed rates/risks which can mask informative local
variations 35. An exciting feature of Bayesian methods is that they can help to remove
random components resulting from correlated and unmeasured factors from the maps
36. Another benefit of this method is that it accounts for uncertainty measures associated
with the relative risks with confidence intervals 37. It is evident that including uncertainty
as confidence interval are more reliable for decision makers. Sharmin et al. (2016) used
a Bayesian generalized linear model to adjust for underreporting cases of dengue by
incorporating climate factors in Dhaka, Bangladesh 38. Randremanana et al. (2010)
used the integration of a Bayesian approach and a generalized linear mixed model to
analyze the geospatial distribution of TB in an endemic area, Madagascar. They
21
generated (smoothed) heat maps of TB to compare the association of TB rate with the
nationwide TB indicators 39. The more detailed information regarding Bayesian
approach can be found in Lawson (2013) 40.
Global Clustering Techniques
As already mentioned, although disease mapping can help obtain some useful
information about the spatial pattern of infectious diseases, they can be misleading and
unable to evaluate the significance of the pattern statistically. Spatial dependence or
spatial autocorrelation (SA) is a fundamental concept that is applied to evaluate the
pattern from the statistical viewpoint. SA indicates the association of a feature with itself
located nearby 41. SA proposed by Tobler (1979) as the first law of geography:"
everything is related to everything else but near things are more alike than the distant
things." 42. SA can be positive (i.e., similar attributes are closer to each other), negative
(i.e., different variables are closer to each other), or none (i.e., random distribution). SA
is an essential concept in medical geography and should be investigated in analyzing
spatial data 43. There are several spatial statistic tools to evaluate the presence of SA
(overall pattern) known as global clustering techniques. The null hypothesis significance
(i.e random distribution) is tested against the alternative hypothesis (i.e clustered
distribution).
One of the most straightforward measures of spatial autocorrelation, when the
variable of interest is categorical, is join counts statistics. The null hypothesis of join
counts states that the distribution is random. In this technique, a binary attribute is
classified into black and white colors, and a join (or connections between the zones) is
named as either WW (0,0), BB (1,1), or BW (1,0). By counting the number of WW, BB
and BW, type of SA can be determined. The overall distribution is clustered (i.e.,
22
positive SA) if the number of BW joins is significantly lower than expected by chance 44.
The distribution is dispersed (i.e., negative SA) if the number of BW is significantly
higher than expected by chance. The distribution is random if the number of BW joins
approximately the same as what would be expected by chance. The test of significance
evaluating the BW statistic as a standard deviate is as follows 45:
Z(BW) =BW − E(BW)
�σBW2
(1-1)
In this formula, Z(BW) indicates the magnitude of SA. Join counts statistics have been
applied in the disease studies. Gilbert et al. (1994) used join counts statistics to
evaluate the spatial distribution of canker disease of trees in Panama. They counted the
number of cankered-cankered tree joins (BB), healthy-healthy tree joins (WW), and
cankered -healthy tree joins (BW) and compared them with the expected value 46.
Global Moran’s statistic is the most widely used measure of SA in continuous
variables which relies on both the variable's location and attribute 47. This index ranges
from -1 to +1. In general, the values close to the value of -1 indicate a dispersion, the
values close to the value of +1 shows a clustering, while the values close to the value of
0 indicate a random pattern 48. The null hypothesis states that the distribution is random.
This index is calculated as 47:
I =N∑ ∑ wij(xi − x�)(xj − x�)N
j=1Ni=1
∑ ∑ wijNj=1
Ni=1 ∑ (xi − x�)2N
i=1 (1-2)
Where N is the count of spatial units 𝑖𝑖, 𝑗𝑗; 𝑥𝑥 is the variable of interest (e.g., malaria
incidence rate); 𝑤𝑤𝑖𝑖𝑖𝑖 is the connection between units 𝑖𝑖 and 𝑗𝑗. The spatial weight matrix
can be constructed in two main ways 49: 1) contiguity-based neighbors: sharing a border
(rook) or sharing a border or point (queen) 2) distance-based neighbors: k-nearest
23
neighbors or buffer (threshold distance). Many studies utilized this index to investigate
the overall pattern of infectious diseases. For instance, Naish et al. (2014) used global
Moran's statistic to assess the presence of SA in the study of dengue incidence rates in
Queensland, Canada and found a positive SA 50.
Geary’s C is another common method for measuring SA which has similar
calculations to Moran's I 51. The value of this statistic ranges from 0 to 2 where the
values close to 0 shows a positive SA, the value close to +2 shows a negative SA, while
the index value close to +1 shows a random distribution 52. Therefore, this index is
inversely related to the global Moran’s statistic. The null hypothesis expresses that there
is no SA (i.e., C=1). Using the same notation as for global Moran’s I, the Geary’s C
statistic is computed as follow 51:
C =(N− 1)∑ ∑ wij(xi − xj)2N
j=1Ni=1
2(∑ ∑ wijNj=1
Ni=1 )∑ (xi − x�)2N
i=1 (1-3)
Compared to Moran’s I, which is a global indicator of SA, Geary’s 𝐶𝐶 is highly sensitive to
the difference in neighbors 53. Simões et al. (2004) applied Geary's 𝐶𝐶 to investigate the
distribution of paracoccidioidomycosis, a fungal infection, in southern Brazil. Their
finding rejected the null hypothesis of SA and showed a significant spatial
autocorrelation 54. Other global clustering techniques are widely used in assessing
overall pattern are average nearest neighbors 55, K-function (used in point pattern
analysis) 56; general 𝐺𝐺 57 and Cuzick-Edwards 58.
Local Cluster Detection Techniques
In global clustering techniques, only a single SA value is assigned to the entire
pattern. Thus, global techniques are unable to identify the location of clusters. While, in
local measures of clustering (i.e., cluster detection techniques), an SA is calculated for
24
each areal unit and its significance is evaluated with statistical tests. Also, the
techniques can identify hotspot(s), coldspot(s) or outlier(s) in a pattern.
Perhaps one of the simplest ways to visualize the location of clusters is by using
density function such as kernel density estimation (KDE). In the KDE method, a
weighting kernel function is fitted over each point or line. The weight decreases as we
move away from the point/line feature 59. This method is criticized for the subjective
choice of search radius (bandwidth) and not providing statistical evaluation of the
results. As an example, Aikembayev et al. (2010) used KDE in a GIS environment to
identify the areas of outbreak concentration of Bacillus anthracis by livestock species,
Kazakhstan 60.
The Geographic analysis machine (GAM) is an automated cluster detection
technique developed by Openshaw et al. (1987) for point data. The method developed
to identify spatial clusters of childhood leukemia incidence. In this method, a two-
dimension grid is superimposed over the study area. Then, a series of different circles
with various sizes is generated and scans the whole study area. The observed intensity
of events within each circle is compared with a threshold based on Monte Carlo
simulation. If the observed intensity exceeds the threshold, it draws a circle on the map.
The final output of GAM is the map of significant circles 61. A significant disadvantage of
this technique is that no conclusion can be drawn about the significant level of clusters
62.
Local indicator spatial autocorrelation (LISA) is perhaps the most widely used
cluster detection technique for identifying clusters of various infectious diseases. The
statistic is based on the decomposition of global Moran's I for each areal unit 63. LISA
25
coefficients for each unit 𝑖𝑖 can be calculated as the deviation of values from the mean of
neighbors 63:
Li =(yi − y�)∑ wij(yj − y�)n
j=1,j≠i
si2 (1-4)
si2 =∑ (yi − y�)2nj=1,j≠i
n − 1 (1-5)
One of the advantages of this statistic compared to some other cluster detection
techniques is its ability to identify outliers (i.e., areas with high values of an attribute are
surrounded with low values and vice versa). Using LISA, Hassarangsee et al. (2015)
detected hotspots of tuberculosis incidence in several districts in Thailand 64. Szonyi et
al. (2015) identified several outliers of Lyme disease in western Texas by applying this
technique 65.
Getis-Ord 𝐺𝐺𝑖𝑖∗ statistic 66 is one of the most popular hotspot detection methods. It
is a polygon-based method; however, point events can be aggregated by
superimposing a grid. The statistic is calculated as follows 66:
Gi∗ =
∑ wijxj − X�∑ wijnj=1
nj=1
S �[n∑ wij
2 − �∑ wijnj=1 �
2]n
j=1n − 1
(1-6)
S = �∑ (xj − x�)2nj=1,j≠i
n − 1− x�2 (1-7)
The statistic can identify both hotspots and cold spots. The 𝐺𝐺𝑖𝑖∗ is usually standardized to
z-scores (of normal distribution) so that a large (positive) Z-score is corresponded to a
significant hotspot (i.e. Z-score>1.96 standard deviation); a large negative value of z-
score is associated with a location as coolspot (Z-score < -1.96 standard deviation); and
26
value close to zero shows a location that is neither hotspots nor coolspot. Sun et al.
(2017) utilized the Getis-Ord Gi* statistic to identify hotspots of the dengue fever
epidemic in Sri Lanka 67.
One of the novel tools to determine statistically significant elevated disease rate
is spatial scan statistic (SSS). The tool developed by Kulldorff (1997) 68 for point and
polygon units in SaTScan software. The SSS can also find significant clusters of
individual health events (i.e., in case-control studies) using the Bernoulli model. In this
method, circular windows with varying radius sizes or ellipsoidal windows in different
shapes and directions are drawn around each point or centroid of polygon data. The
size of the window is changed from 0 to the specified cluster size. For each window, a
maximum likelihood test is applied to compare the risk within the window with the
outside. Monte Carlo simulation is used to test for a significance level of clusters. The
cluster with maximum likelihood is considered as primary clusters and the rest of the
cluster(s) as a secondary cluster(s) 69. A significant advantage of this method over the
most cluster detection techniques is that it can take confounding into account by
adjusting for variables 70.
Space-Time Clustering Techniques
Space-time analysis of disease clusters is another crucial topic in epidemiological
studies which is usually overlooked. Space-time analysis is an ongoing research topic
which provides useful insight for public health decision-makers into how an infectious
disease propagates over the landscape. It can help to find direction and periodic
patterns helpful for predictions. Some space-time clustering techniques have been
proposed. In the following section, we have provided a general description of two widely
used space-time cluster detection techniques.
27
Knox method (1964) 71 is a primary method that is used for examining space-time
interactions (clusters) than expected by chance. Space-time interactions exist when
many of cases that are close in time are also close in space. Therefore, in this method
for each pair of cases, the interval (i.e., time and space interval) is examined. A 2*2
contingency table is constructed with cross-comparison proximity in space and time
(Figure 1-1). The number of pairs for each situation is counted, and the actual number
of pairs that fall into each category is compared with the expected number (i.e., cross
products of rows and column totals). Then a final Chi-square test with (n-1)*(r-1) degree
of freedom measures a significant level of difference. The CrimeStat software
(https://www.icpsr.umich.edu/CrimeStat/) allows users to apply Knox test. This
approach has been criticized due to the subjective cut-off value for closeness in space
and time.
Similar to the pure spatial scan statistic, in space-time approach a window scans
the whole study region, and the risk within the window is compared with the risk outside
of the window 72. However, instead of a 2-dimension circle/ellipse window, a cylindrical
window with the base of a circle/ellipse is used. The base of the cylinder depicts space
while the height corresponds to time. Also, p-value for each cylinder is computed using
Monte Carlo simulation. Compared to the Knox test, this method accounts for possible
geographic/temporal population heterogeneity (i.e., different population growth in
different regions) 73.
Environment and Infectious Diseases
The occurrence of infectious diseases represents a serious public health
challenge. Infectious diseases annually kill over 15 million people (>25% of the total
death), worldwide 74. The number is progressively growing, and the geographic
28
distribution is expanding to non-endemic areas 75. One strategy to fight against the
infectious diseases is to monitor and control the causative agents such as a virus,
bacteria of the disease. In this regard, identifying geographic patterns of infection and
underlying risk factors such as environmental and socio-economic factors is helpful.
Layers of environmental data can be obtained from various sources such as landcover
data from remote sensing, weather data from weather stations or WorldClim, land use
data from government records, soil maps from the department of agriculture, field
investigation, etc. The data are provided in different formats like excel, grid, shapefile,
text, etc., then using spatial analysis tools like GIS they are integrated and combined for
further analysis. Another strategy for control and monitor infectious diseases is the
surveillance of risk factors for diseased patients or deaths. An explanation for this
approach is because of a strong correlation between underlying risk factors and the
distribution of human cases (rates). For instance, to control Lyme disease in human,
monitoring and controlling Ixodes scapularis population is applied as a proxy for disease
morbidity.
The diseases (or parasites/vectors) highly dependent on the local and global
environment and consequently can influence disease prevalence in a community 76.
Environmental determinants (such as weather/climate, vegetation) can provide
favorable conditions for breeding, feeding, resting sites of certain vector-borne
diseases. However, it should be noted that socio-economic factors such as race,
gender, poverty, lifestyle (such as drinking alcohols or smoking) can play an essential
role in the occurrence of infectious disease such as Tuberculosis 77. Climate change is
an essential determining factor in explaining variations of most infectious diseases
29
especially vector-borne diseases 78. It can lead to migration of pathogens (causative
agents) and animals to new areas and consequently expansion of the diseases to
uninfected areas 79. The relationship between climate condition and infectious diseases
have levels of complexity. For instance, weather and climate factors can have different
influences on the vector or pathogen abundance of vector-borne diseases. The
differences are mainly because of differences in the life cycle or life stage of vectors.
For instance, four stages of tick-borne diseases’ life cycle (i.e., eggs, larvae, nymph,
adult) last two years of being completed. While the life stage of mosquitos (i.e., eggs,
multiple larvae, pupae, adults) only take a few weeks to a few months to be completed.
This difference indicates that mosquito’s abundance responds to short-duration
variations in weather or climate, while tick’s population reacts to longer-term changes in
weather condition and with little inter-variations 80. Also, extreme high/low temperature
prevents both mosquitoes and ticks host for blood meal seeking 81. Under other climate
conditions, meteorological factors can have relatively complex effects on mortality rates.
In harsh weather conditions, all the four stages of tick's life remain secured (because
ticks spend most of their life time under layers of soil), while the only dipteran can seek
refuges. Thus, the direct consequences of temperature have fewer impacts on the
survival of the tick's population compared with mosquitoes' abundance. As almost all
development stages of mosquitoes are affected by the presence of stagnant water, the
reproduction rates depend on the amount of rainfall 82. However, heavy rainfall can flush
away larval and eggs. Therefore, reproduction rates of ticks cannot be influenced by
weather changes apart from long-lasting impacts on host densities 80.
30
Disease Modeling
Although correlation analysis can indicate associations between factors and
disease morbidity/mortality, the associations do not indicate causality. Disease
modeling is an advanced level of spatial analysis in medical geography 83. It includes
testing generated hypothesis obtained from cluster detection or observations. Disease
modeling can involve the integration of previously mentioned tools (i.e., GIS, RS, and
GPS) with statistical and epidemiological analysis. Modeling disease occurrence helps
medical geographers to describe morbidity/mortality rates and identify underlying
factors. It can also be used for predictions: classification (such as presence/absence of
vectors) or regression (such as predicting disease rate/count). The prediction can be
spatial (i.e., for locations with unknown mortality/morbidity) or spatiotemporal (i.e., future
(near/far) status of disease) or for other areas (projection). Space-time predictions can
be used for early detection of infection or outbreak 84. Forecasting disease occurrence
can provide valuable guidelines for public health decision makers for cost-effective
planning and targeted interventions of future outbreaks.
Knowledge-Driven Models
In general, disease modeling techniques can be classified into two categories:
data-driven and knowledge-driven models 85. According to Pfeiffer, data-driven models
are mainly based on statistical analysis to quantify the relationship between disease
morbidity/mortality and underlying factors. While knowledge-driven models use
knowledge of experts as evidence 86. These models depend heavily on environmental
layers and are widely used in spatial epidemiology for mapping suitable areas for
disease/vector transmission. Some techniques such as multi-criteria decision analysis
(MCDA) 87 can use experts' knowledge to provide a habitat suitability map by converting
31
knowledge to decision rules. Analytical hierarchy process (AHP) proposed by Saaty
(1990) is one of the most common ways to define the weights of factors 88. By overlying
(combining) factors based on their corresponding weights, suitability for each pixel in
the study area can be computed. Moreover, uncertainty associated with variables, the
relationship between independent factors and the dependent variable, and the degree
of risk can be modeled by incorporating fuzzy logic 89. One of the significant limitations
of the knowledge-driven methods is that incorporating risk factors, weights and type of
membership functions of fuzzy rules are subjective. Knowledge-driven models have
been applied to modeling several infectious diseases. Clements et al. (2006) used them
in modeling rift-valley fever in Africa 90, Mollalo and Khodabandehloo (2016) applied
them in the study of zoonotic cutaneous leishmaniasis in an endemic area in Iran 91; and
Rakotomanana et al. (2007) applied them in modeling malaria vector control in
Madagascar 92.
Data-Driven Models
Data-driven models can generally be classified into two main categories: 1)
presence-absence models 2) presence-only models. Here, we first describe the
characteristics of several presence-absence models.
Classification and regression tree (CART) is a decision-tree based method that
can be applied for solving both classification and regression problems in spatial
epidemiology. CART is particularly useful in working with noisy data (such as datasets
with missing values or highly skewed data) and extensive and complex dataset. CART
is flexible in modeling a variety of response variables such as categorical, survival and
continuous data. Another advantage of CART is that the results obtained from the
model are easily interpretable. However, it should be acknowledged that the model
32
does not have a statistical test for the significance level of variables and is subject to
overfitting 93. Other features of CART have been discussed in detail in Franklin (2009)
94.
One of the significant problems of CART is attributed to the low predictive ability
and high variations in results. To address this drawback, bagging, boosting and random
forest can be useful. These algorithms generate numerous decision trees by random
selection of variables and random selection of samples with replacement (bootstrap
sampling). In bagging, in each repetition of sampling, one-third of data is used for
testing the model performance (i.e., out-of-bag samples). Boosting is like bagging: in
bagging, samples have equal weights, while in boosting samples are weighted. Random
forest (RF) is a type of bagging 95-97.
Support vector machine (SVM) is another data-driven model that is used for
binary classification problems. In SVM, a dataset of high dimensional points viewed as
vectors {𝑥𝑥𝑖𝑖∈𝑅𝑅𝑑𝑑:𝑖𝑖=1,…𝑛𝑛}, 𝑑𝑑>1, where each point belongs to one of two classes defined
by {𝑦𝑦𝑖𝑖∈{0,1}:𝑖𝑖=1,…𝑛𝑛}. Here 𝑦𝑦𝑖𝑖 corresponds to the presence/absence of points. If we
assume these points to be linearly separable (i.e., can be separated via a linear
boundary), the goal of SVM is to find the d-dimensional hyperplane maximizing the
margin (i.e., the distance between the closest points or support vectors) as illustrated in
Figure 1-2. More detailed explanations can be found in Noble (2006) 98.
Artificial neural networks (ANNs) is another data-driven model that works with
presence-absence data. More detailed information of ANN is provided in Chapter 4 of
this dissertation.
33
Presence-only model in data-driven techniques is a relatively new concept.
Traditional statistical models such as the generalized linear model (GLM), however,
have a well-established algorithm, require presence and absence of the diseases (or
vector) data. However, collecting absence data can be very expensive and may not
cover most of the study region 85. Species distribution model (SDMs) can be used for
modeling presence-only data and a vast area. They are highly applied in conservation
and ecology for modeling geographic distribution of species (i.e., disease macro-
organism). Another advantage of the SDM model is that they can be used for projection
(i.e., extrapolation beyond the study region) with decent accuracy 99.
Ecological niche models (ENMs) use background or pseudo-absence data
instead of real absence data to predict niche (i.e., conditions that enables species to
maintain population without immigration) 100. These models often are developed at
coarse spatial scales. Most important ENMs are ecological niche factor analysis
(ENFA), genetic algorithm for rule-set production (GARP) and maximum entropy
(MaxEnt).
ENFA assumes that species occurrence is not random and is clustered in a small
section of the study region. ENFA extracts marginality and specialization. Marginality
would be close to 1 when variables contribute to the life of species in a specific region,
and the value would be close to 0 when species could be found everywhere.
Specialization shows species tolerance. Thus, it can summarize all the variables into a
few factors 101. As an example of the application of ENFA in infectious diseases, Ayala
et al. (2009) used ENFA to provide habitat suitability map of malaria species in
Cameroon 102.
34
GARP uses four rules to select or reject, evaluate, and test rules from the final
model. The set of rules with the highest predictive performances are selected and used
for habitat suitability. GARP can obtain high accuracy even with a small sample size. It
can also be used for projection or extrapolation. More information about GARP has
been provided in Stockwell and Noble (1992) 103 and Stockwell and Peterson (1999) 104.
Maximum Entropy (MaxEnt) is another presence-only ecological niche model
(ENM) that is widely used in ecology and geography. In this model, the contribution of
each explanatory variable to the final model is computed by removing each variable and
examining the AUC of the model. The MaxEnt can be used even with minimal sample
size (n<100) and is easily interpretable 105. MaxEnt has been successfully used in
predicting the distribution of several infectious diseases (vectors) such as leishmaniasis
in France 106 and the West Nile virus in Iowa 107. More detailed information presented in
Phillips et al. (2006) 108.
Although, many published pieces of research have examined the relationships
between environmental conditions and infectious diseases, novel techniques such as
machine learning techniques have been underutilized in spatial epidemiology. In this
dissertation, we examine three different infectious diseases (i.e., Lyme disease,
zoonotic cutaneous leishmaniasis, tuberculosis) from a spatial perspective and at three
different levels. Table 1-4 summarizes some applications of GIS, RS, and machine
learning techniques in the study of the mentioned infectious diseases.
Research Questions and Hypotheses
The main research questions and hypotheses explored in this dissertation are as
follows:
35
In Chapter 2, we examine changes in the spatial distribution of significant spatial
clusters of Lyme disease (LD) incidence rates in Connecticut at the town level from
1991 to 2014 as an approach for targeted interventions. Spatial distribution of LD in CT
has been rarely investigated. Thus, we intend to help policymakers for targeted
interventions by prioritizing towns with high incidence rate. The primary objective of the
first paper is to integrate GIS and ESDA to better describe changes in the spatial
pattern of LD, supposing the passive cases of LD serve as a spatially random sample of
the infection in the state. Our specific research questions in Chapter 2 include:
• What is the spatial distribution (random/ dispersed/ clustered) of LD incidence over the past 24 years?
• If not-random, where are the locations of the clusters/hotspots in CT?
• Can we avoid or minimize the effects of edges in the spatial scan statistic technique?
• Which cluster detection technique (LISA/ spatial scan statistic) is more sensitive, specific, and accurate?
In Chapter 3, we compare three popular machine learning classifiers (i.e. SVM,
LR, and RF) in predicting spatial distribution of the Phlebotomus Papatasi, one of the
major vectors of zoonotic cutaneous leishmaniasis (ZCL). Support vector machine and
random forest are two-widely used classifier in environmental studies but have been
rarely applied for studying the species geographic distribution. We suppose that
potential sampling and measurement errors during collecting sand flies do not have
much influence on the accuracy of the classifiers. Also, we assume that ecological
factors used in this study are sufficient predictors of Ph.paptasi. The main research
questions in Chapter 1 are:
• Which machine learning technique (i.e., SVM, LR, and RF) is more sensitive, specific, and accurate in predicting Ph.papatasi in the study area?
36
• What are the most ecological contributing factors in predicting Ph. Papatasi?
• Where are the high-risk areas of Ph. papatasi in the study area?
In Chapter 4, we integrated geographic information system (GIS), spatial
statistics, and artificial neural networks (ANNs) in analyzing TB distribution in the US to
assist national TB control programs. ANN is another data science technique which is a
favorite modeling tool in other scientific domain but has never been investigated for
geographic modeling of TB. We examine the spatial distribution of the disease and
applicability of machine learning techniques in TB modeling with the following
assumptions 1) all the reported county-level TB incidence rates represent the status of
TB in the US and 2) ecological and socio-economic factors influence the TB infection.
Our specific research questions in Chapter 4 include:
• What are the most important ecological and socio-economic contributing factors in predicting TB incidence rate in the US?
• Which modeling technique (i.e., multilayer perceptron (MLP) or linear regression) is more accurate in predicting TB incidence rate?
37
Tables Table 1-1. Application of GIS in the study of infectious diseases
Basic GIS Function Applications in Infectious Diseases
Organizing data from multiple sources and in different formats
Sofizadeh et al. (2016) 109 developed a geodatabase of sandflies and explanatory variables to study the distribution of Phlebotomus Paptasi. Sandflies were collected during the field operation. The explanatory data were climatic conditions and the elevation layer. These data were obtained from the WorldClim database. Normalized differentiated vegetation index (NDVI) was used to reflect vegetation cover and obtained from MODIS satellite products.
Computing slope, and aspect of study region as a candidate factor
Hasyim et al. (2016) 110 derived topographic features (slope and aspect) of the resampled digital elevation model (DEM) of the study area as two explanatory variables in modeling malaria cases with ordinary least square and geographically weighted regression models.
Calculating Euclidean distance to water bodies
Franke et al. (2015) 111 used Spatial Analyst Tool, to compute Euclidean distance to inland and wetland waters for malaria risk modeling.
Buffer analysis Nakahapakor and Tripathi (2005) 112 applied 500 m and 1000 m buffering operation to identify the geographic environment conditions surrounding village affected by Dengue fever.
Overlay of environmental layers
Palaniyandi et al. (2014) 113 used overlay analysis of various environmental factors to map potential breeding areas of vector-borne diseases in an endemic area in India.
Identifying spatial pattern and clusters
Mollalo et al. (2017) 114 used global and local Moran's I to identify geospatial pattern and location of clusters of Lyme Disease incidence rate in Connecticut in using Spatial Statistics tool in a GIS environment.
WebGIS Li et al. (2013) 115 established a decision support system, as an accessible and inexpensive approach, for the response to infectious disease surveillance based on WebGIS and intelligent mobile services.
38
Table 1-2. Application of RS in the study of infectious diseases RS Function Applications in Infectious Diseases
NDVI and LST Mollalo et al. (2018) 24 derived NDVI and LST from MODIS satellite images and used it for prediction habitat suitability of Phlebotomus Papatasi, in an endemic area in Iran.
Sea surface temperature and Sea surface height
Lobitz et al. (2000) 116 linked public remotely sensed data (Sea surface temperature and height) with cholera cases in Bangladesh.
Crop disease management
Franke and Menz (2007) 117 used high-resolution multi-spectral data to detect in-field heterogeneities of crop diseases over time.
Image Classification
Hugh-Jones et al. (1992) 118 used a land cover map derived from a Landsat TM image to distinguish grazing areas with several levels of animal’s tick infestation.
Tick habitat suitability map
Using Landsat TM satellite images, Glass et al. (1995) 23 indicated that many proper tick habitats coincide with residential properties in proximity to the forested areas in Baltimore County, Maryland.
Table 1-3. Application of GPS in the study of infectious diseases GPS function Applications in Infectious Diseases
Mapping Coburn and Blower (2013) 119 used handheld GPS to establish geographic coordinates at each sampling sites to map HIV epidemics in sub Saharan Africa.
Dynamic mobility network
Paz-Soldan et al. (2010) 120 used GPS device to quantify human mobility among tracked individuals to study dengue virus transmission in Peru.
Outbreak investigation
Masthi et al. (2015) 121 used GPS along with google earth to investigate outbreak of cholera in a village in India to accurately pinpoint the location of household of cases and follow up them.
39
Table 1-4. Application of GIS, RS, Spatial statistics and machine learning in the study of infectious diseases (Lyme disease, leishmaniasis, tuberculosis)
Reference Infectious disease
Aim Study Area Techniques/tools
Garcia-Marti et al. (2017) 122
Lyme disease Mapping and modeling tick dynamics using volunteered data
Netherlands
Random forest; Remote sensing (NDVI); Volunteer geographic information
Ostfeld et al. (2006) 123
Lyme disease
To investigate the effect of variations of temperature, humidity, and deer and mice on the risk of Lyme disease
New York
GPS (field data); GIS; Spatial statistics (density mapping)
Pepin et al. (2012) 124
Lyme disease To investigate relations between human Lyme disease incidence and density of nymphs
36 eastern states
GPS (field data); GIS; Spatial statistic (zonal statistics, density mapping of nymphs; spatial autocorrelation); Modeling (negative binomial model)
Ramezankhani et al. (2018) 125
Leishmaniasis Predicting cutaneous leishmaniasis incidence based on environmental factors
Isfahan, Iran Decision trees Remote sensing (NDVI)
Nieto et al. (2006) 126
Leishmaniasis To predict spatial distribution and potential risk of visceral leishmaniasis
Bahia, Brazil Ecological niche model (GARP); GIS
Paixao Seva et al. (2017) 127
Leishmaniasis To predict future status of visceral leishmaniasis and identify underlying risk factors
Sao Paulo, Brazil
Spatio-temporal Bayesian model; GIS
Yamamura et al. (2016) 128
Tuberculosis
To explain spatial pattern of hospitalization due to tuberculosis, and to identify spatial and space-time clusters of TB
Riberiaro, Brazil
GIS; Spatial statistics (spatial scan statistics)
Patterson et al. (2017) 129
Tuberculosis To investigate social behavior of individuals who developed TB
A town in South Africa
GIS, GPS (locations of houses)
40
Figures
Figure 1-1. Contingency table of Knox method
Figure 1-2. Principle of linearly separable SVM using maximum margin
41
CHAPTER 2 A 24-YEAR EXPLORATORY SPATIAL DATA ANALYSIS OF LYME DISEASE
INCIDENCE RATE IN CONNECTICUT
Lyme disease (LD), a tick-borne, bacterial, zoonotic infection, remains a serious
challenge for public health.∗ The disease is distributed globally, predominantly in
temperate portions of the Northern Hemisphere such as Europe, Canada and USA 130.
In the United States (US), the geographical distribution of LD is primarily confined to the
north-eastern and mid-western areas 124. Past studies have shown that in these areas,
LD is caused by Borrelia burgdorferi sensu stricto. The pathogen is mainly transmitted
to humans during blood meals by the bite of infected blacklegged ticks (Ixodes
scapularis) with white footed mice (Peromyscus leucopus) serving as the primary
reservoir for this bacterium 131. The disease is the most common vector-borne disease
in U.S. with an estimated average number of 30,000 new cases every year 132;
however, the genuine number is likely much higher. The mean incidence rate of LD in
the top 13 U.S. states with the highest incidence rate during 2005-2009 progressively
rose from 29.6±10.6 per 100,000 in 2005 to 49.6±15.5 per 100,000 in 2009 124. At the
same time, in 11 states with the lowest incidence rate, the mean incidence developed
from 1.3±0.7 to 2.3±1.7 per 100,000 individuals 124. Although this common zoonotic
disease rarely leads to death, it can cause severe symptoms related to skin, joints and
heart in addition to anxiety and depression if untreated 133. LD can also be a
socioeconomic burden to society.
∗ Reprinted with permission from Mollalo, A., Blackburn, J. K., Morris, L. R., & Glass, G. E. (2017). A 24-year exploratory spatial data analysis of Lyme disease incidence rate in Connecticut, USA. Geospatial health, 12(2).
42
In recent years, exploratory spatial data analyses (ESDA) to describe spatial
patterns of LD has increased significantly as a strategy to improve our understanding of
disease transmission and risk. Several recent studies from different parts of the U.S.
have examined the spatial pattern of LD using ESDA. For example, Kugeler and
colleagues (2015) applied circular scan statistics to detect high-risk counties of LD in
the U.S. from 1993 to 2012. They showed that the number of counties with high
incidence of LD successively increased from 69 (1993-1997) to 130 (1998-2002) and
further to 197 (2003-2007) and 260 (2008-2012) counties, respectively 134. In Texas,
which is a nonendemic LD area, Szonya et al. (2015) applied global and local Moran’s I-
tests to determine the distribution and location of possible clusters, respectively, with
respect to the spatial distribution of LD at the county level (2000-2011). They observed
a clustered distribution with a high incidence cluster in central parts of the state, mainly
in a cross-timbers eco-region 65. In Virginia, Li et al. (2014) utilized the Empirical Bayes
smoothing (EBS) method on census tract LD cases with the aim of lessening random
variations, especially in censuses with small populations. Then, they applied space-time
scan statistic and found a primary cluster in northern Virginia which had experienced
population growth and urban-sub-urban improvements between 2008 and 2011 135.
The town of Lyme in Connecticut was the first spot that LD was recognized in the
U.S. The initial cluster in 1976 was observed in children 136. Since then, in spite of all
endeavors conducted by the Connecticut Department of Public Health (CTDPH) to
control the disease, it remains endemic with substantial morbidity rates. Although LD is
a well-investigated epidemiological subject in Connecticut, historical changes in patterns
of disease have been minimally studied 137. Additionally, even though geographical
43
information systems (GIS) is a useful tool to study infectious diseases 1, powerful GIS-
based studies of LD from this region are insufficient for prioritizing counties for
intervention. Thus, the main objective of this study is to use the combination of GIS and
ESDA to better describe changes in the spatial pattern of LD, supposing that the
reported passive cases of LD represent a spatially random subsample of the disease in
the state. Our specific research questions included: 1) what is the spatial distribution
(random/ dispersed/ clustered) of LD incidence over the past 24 years?; 2) If clustered,
where have the clusters/hotspots occurred? 3) Can we avoid or minimize the effects of
edges in the spatial scan statistic technique?; and 4) Which cluster detection technique
(LISA/spatial scan statistic) is more sensitive, specific and accurate?
Materials and Methods
Data Collection and Preparation
We used passively reported indigenous LD cases over a period of 24 years from
1991 to 2014 throughout the state of Connecticut. We retrieved data from the CTDPH
containing yearly counts and rates of LD at the town level. The CTDPH has a well-
established LD surveillance system operating since 1987 138. Reports of cases were
based on the National Surveillance Case Definition from the Centers for Disease
Control and Prevention (CDC) for LD 139-141. Data were geocoded and grouped into four
equal intervals (each period included six years: 1991–1996, 1997– 2002, 2003–2008
and 2009–2014) to further explore clustering, possible clusters and how hotspots had
changed.
Administrative boundaries of towns were obtained from the Map and Geographic
Information Center of Connecticut GIS data using the shapefile format
(http://magic.lib.uconn.edu/). Similarly, annual population statistics were downloaded
44
from CTDPH (http://www.ct.gov/dph/site/default.asp). The study area and the names of
the towns mentioned in this paper are shown in Figure 2-1.
Global Clustering
We applied global clustering techniques to statistically evaluate whether the
existing pattern of LD incidence was random, clustered, or dispersed. We used the
global Moran’s I statistic 142. to measure spatial autocorrelation using GeoDa software
version 1.6.7 143. The null hypothesis assumes that there is no spatial pattern among
the incidence of LD in different towns (i.e. complete spatial randomness) 144. This
statistic employs a covariance term between each town and its neighbours as follows
145:
𝐈𝐈 =𝐍𝐍𝐒𝐒𝟎𝟎
∑ ∑ 𝐰𝐰𝐢𝐢𝐢𝐢 (𝐱𝐱𝐢𝐢 − 𝐱𝐱�)(𝐱𝐱𝐢𝐢 − 𝐱𝐱�)𝐍𝐍𝐢𝐢=𝟏𝟏,𝐢𝐢≠𝐢𝐢
𝐍𝐍𝐢𝐢=𝟏𝟏
∑ (𝐱𝐱𝐢𝐢 − 𝐱𝐱�)𝟐𝟐𝐍𝐍𝐢𝐢=𝟏𝟏
(2-1)
𝐒𝐒𝟎𝟎 = ��𝐰𝐰𝐢𝐢𝐢𝐢
𝐍𝐍
𝐢𝐢=𝟏𝟏
𝐍𝐍
𝐢𝐢=𝟏𝟏
(2-2)
where xi and xj are incidences of LD in the ith and jth towns, respectively; N the
aggregate number of towns; and wij the spatial neighborhood weight for towns i and j
generated based on the first order Queen’s contiguity which shares all common points
including boundaries and vertices. The generated spatial weight is used as a criterion
for recognizing neighbors of each town. The weight is defined taking into account
adjacent neighbors and written as:
𝐰𝐰𝐢𝐢𝐢𝐢 = � 𝟏𝟏 𝐢𝐢𝐢𝐢 𝐢𝐢 𝐚𝐚𝐚𝐚𝐚𝐚 𝐢𝐢 𝐚𝐚𝐚𝐚𝐚𝐚 𝐚𝐚𝐚𝐚𝐢𝐢𝐚𝐚𝐚𝐚𝐚𝐚𝐚𝐚𝐚𝐚 𝐚𝐚𝐚𝐚𝐢𝐢𝐧𝐧𝐧𝐧𝐧𝐧𝐧𝐧𝐧𝐧𝐚𝐚𝐧𝐧 𝟎𝟎 𝐧𝐧𝐚𝐚𝐧𝐧𝐚𝐚𝐚𝐚𝐰𝐰𝐢𝐢𝐧𝐧𝐚𝐚 (2-3)
The Moran’s I index varies between -1 and +1, with 0 showing spatially random
distribution, while negative values indicate dispersed distributions and positive values
45
for clustered distributions. We assessed significance of the index using both the Z-score
and P-value.
Local Clustering
Spatial smoothing
The global clustering techniques provide information about the overall distribution
of LD (random, clustered or dispersed), but we were also interested in identifying local
clusters. First, we applied the EBS routine to account for variation in town sizes and
populations. Contrasts in population size among the spatial areal units (i.e. towns of
Connecticut) may lead to variance instability and spurious outliers 143. This is due to the
observed raw rate in spatial areal units with small population being profoundly affected
by small changes of adding or removing few cases. Thus, crude rates might not reflect
underlying risk compared with other areal units with large populations. EBS provides a
solution to avoid this type of possible bias as it adjusts the estimated risk toward the
global mean to reduce variance instability 63; areas with low population are adjusted
more than areas with larger populations. Since there was a considerable difference in
areas of some towns (e.g., Derby and New London are approximately 5 mi2 while
Woodstock and New Milford cover more than 60 mi2) and also population size of towns
(e.g., Union and Canaan have about 1,000 people, whereas New Haven and Bridgeport
have more than 120,000 individuals) applying the EBS is justifiable. We calculated
spatial weights for each time interval using the first-order Queen’s contiguity. EBS
smoothed rates were employed in local cluster detection analyses.
Local moran’s
To detect local clusters of the LD rates smoothed by EBS, we applied Anselin’s
local indicator of spatial autocorrelation (LISA) statistics 143. LISA identifies hotspots
46
(towns with a high incidence surrounded by high incidences); coldspots (towns with a
low incidence surrounded by low incidences) and outliers (towns with a high incidence
surrounded by low incidences, or towns with a low incidence surrounding by high
incidences). We used GeoDa (https://geodacenter.github.io/) for LISA analyses. We set
the number of permutation tests to 999 and 95% significance level (p<0.05). We
mapped significant clusters using ArcGIS 10.2 (ESRI, Redlands, CA, USA).
Spatial scan statistics
We were interested in comparing LD clusters identified by LISA and the Spatial
Scan Statistic (SSS), for which we used an ellipsoidal, moving window situated on the
centroid of each town so that at any point the window incorporated different sets of
neighbors. At each position, the radii of ellipse was set to vary continuously from 0 to a
maximum that never included more than half of the total population at risk. If the window
contained the centroid of the neighboring towns, then that whole town was included.
This procedure produces a very large number of ellipsoidal windows and each one can
be a possible cluster of LD with different set of neighbors. The ratio of the length of
longest to the shortest axis of the ellipse was 1.5, 2, 3, 4 or 5. For each shape, a
different number of angles (i.e. the angle between the horizontal east-west line and the
longest axis of the ellipse) of the ellipse were also tested. For each ellipse, a likelihood
ratio statistic was computed based on the number of observed and expected cases
within and outside the ellipse. The null hypothesis (i.e. LD incidence is equal inside and
outside of the window) was tested against the alternative hypothesis that the risk was
elevated within the ellipse. The likelihood ratio is reported by P values calculated based
on the Monte Carlo simulation approach which finds the maximum likelihood ratio over
the entire study region. The ellipse with the maximum likelihood was signified the most
47
likely (primary) cluster. This approach also detected secondary clusters which were
additional ranked clusters that had high likelihood ratios but did not overlap the primary
cluster 147.
Three sets of data were built for the analysis based on discrete Poisson
probability model in SaTScan software version 9.4.2 147. The datasets were: Case file
representing the annual number of cases of LD for each town (n=169) from 1991 to
2014; Coordinate file was the two-dimensional Cartesian coordinates of centroid of each
town; and Population file was the population size of each town. We used the mean
population and mean number of cases per time interval. To scan the study area, we
applied two criteria: no geographic overlap between clusters and 5% of the population
as the maximum to search for hotspots with the aim of comparing the results with
LISA’s high-high clusters as the LISA analysis only considered the neighboring towns.
Also in another run, we used no geographic overlap between clusters and 50%
population at risk to investigate whether the results depended on the primary settings.
To ensure statistical power, the number of Monte Carlo replications was set to 999 and
only clusters at 99% confidence interval were considered (p<0.01).
Accuracy Assessments
Table 2-1 shows a 2x2 confusion matrix used to compare the performance of
LISA and SSS (for 5% and 50% of populations at risk) to detect hotspots, sensitivity,
specificity and overall accuracy 148. In this case, sensitivity measures how well the
cluster detection techniques correctly detected the presence of LD hotspots, whereas
specificity provides a measure of how well the techniques correctly identified the
absence of LD hotspots. Overall accuracy shows the ability of cluster detection
techniques to identify true positive and true negative LD hotspots. Locations of the
48
detected clusters with both LISA and SSS techniques were compared with the location
of true clusters as defined by Birnbaum et al. (1996) that “true clusters explain fewer
than 5% of all reported clusters” 149. Therefore, the top 5% towns with regard to high LD
incidence were considered as True Hotspots. We calculated these statistics for each
technique in each period, separately.
Results
There were 54,478 reported human LD cases from 1991 to 2014, out of which
10,328 cases (19.0%) occurred in first period (1991-1996), 20,234 cases (37.1%) in
second period (1997-2002), 11,210 cases (20.6%) in third period (2003-2008) and
12,706 cases (23.3%) in the last period (2009-2014). The annual incidence of LD
ranged between 84.7 (in 1993) and 305.6 (in 2002) cases per 100,000 individuals with a
mean of 144.8 cases per 100,000 people (Figure 2-2). The P values for the global
Moran’s I statistic were close to zero which reject the null hypothesis of complete spatial
randomness (CSR) for all time periods (Table 2-2). The index values ranged from 0.55
to 0.71, which indicates significant clustering. Although, we detected clustering with both
raw and smoothed methods, and in all four periods, the variation in rates required EBS
before running LISA. Based on the results of EB smoothing technique, LISA found High-
High clusters (hotspots), which varied for each study period. Results of LISA showed
that in the first and last period, the hotspots were completely restricted to the towns in
the East. In other words, for the first period, 24 towns and for the last period 30 towns
were identified as the hotspot in eastern Connecticut. The towns in the West were more
influenced in the second and third periods. The number of towns in the West identified
as hotspots in the second period was 8, while 7 towns were hotspot in the East of
Connecticut. In addition, in the third period, 10 towns in the West and 6 towns in the
49
East were identified as hotspots. It should be noted that applying EBS before running
LISA, reduces the likelihood of detecting a false cluster in low-population areas where
the cases were detected. But comparison of the detected clusters by raw and EBS
methods in LISA showed small differences with regard to the location of the detected
clusters.
Spatial scan statistics with 5% and 50% of the populations at risk with no
geographic cluster overlaps identified clusters with high incidence rates for each period.
Except for the second period, The SSS results revealed that the most likely cluster
predominantly occurred in the eastern region of the state, while the secondary cluster
occurred in towns in the West (Figure 2-3). Primary and secondary hotspots were
observed in different locations when 5% of the population at risk was investigated using
SSS for comparison with LISA. For the first period, the primary cluster occurred in the
eastern parts of the state and included 22 towns and 792 cases. The risk of LD
incidence within the primary cluster was 7.67 times greater than outside the cluster. The
secondary cluster occurred in the western region and contained 5 towns and 152 cases.
During the second period, the primary cluster occurred only in the eastern parts of the
state and included 23 towns and 1,208 cases. The risk of LD incidence within this
cluster was 3.40 times higher than outside. During the 2003-2008 period, the disease
largely affected the eastern region with 22 towns and 738 cases as compared to the
western parts with 10 towns and 145 cases. The relative risks of primary and secondary
clusters were 3.33 and 9.18, respectively. In the last period, only eastern parts of the
state showed statically significant clustering (no secondary cluster). The primary cluster
50
contained 24 towns and 942 cases. The risk of LD incidence in this cluster was 3.69
times more than in other areas (Table 2-3).
Comparison of the results of accuracy assessments of the spatial scan statistics
at 5% and 50% of the populations at risk, shows that sensitivity of this method increases
with an increment of the population defined to be at risk. However, increasing the
population at risk also leads to a decrease in the specificity of the results. In addition,
LISA tended to have higher specificity and had the highest overall accuracy (Figure 2-
4).
Discussion
This retrospective study examined the spatial structure of LD incidence
distribution in Connecticut based on 24 years of reported data with the aim of describing
the spatial distribution and the changes that have occurred with regard to the disease. It
differs from most other studies in the region by not focusing on local studies of the
pathogen, reservoir or vector and their associations with the environment. Instead, it
focuses on the changing patterns of documented human disease occurrences. We
assumed that the number of affected human cases corresponded to the density of
Ixodes scapularis in Connecticut 150-152. In addition, it should be noted that ticks have
limited capabilities to move to new areas because of their small size 153. Therefore, one
of the means to fight against the risk of the disease in the study area would be targeted
intervention in the areas (towns in this case) that were constantly affected with a very
high morbidity rate. Targeted control, planning and management of the disease can
assist with resource allocation to the towns with persistent high incidence rates resulting
in time and costs savings.
51
Results of this study confirm and extend the findings of Xue et al. (2015) in
Connecticut 154. They analyzed yearly clusters of LD and showed that the distribution
was clustered and that this clustering occurred in western and eastern Connecticut with
few cases in the central region supposing the yearly incidence weighted-geographic
mean analysis represents the clusters of the case distributions. According to the paper,
the epidemic of LD reached equilibrium in 2007 in the western parts of Connecticut,
while this happened in 2009 in the East. Comparison of the results shows that periods
1-3 included epidemic conditions while period 4 contained equilibrium condition.
Therefore, actual/sudden shifts in the locations of clusters occurred roughly before
equilibrium (period 1-3); and it may be that epidemic conditions affected both western
and eastern parts of the state; while equilibrium conditions only occurred in the eastern
regions.
This study identified towns with high LD incidence rates using the LISA test and
spatial scan statistics approaches. By intersecting the locations of the clusters identified
by LISA some towns, including Chaplin, Windham and Scotland, were persistently
detected to have high-rate clusters. Likewise, neighbors of these towns including
Andover, Columbia and Lebanon, were recognized as having high-rate clusters in 3 out
of the 4 periods of study (Figure 2-1). This indicates that these areas were almost
persistently affected by a high incidence of LD during the 24 years of study and
therefore they deserve closer consideration.
In this study, we used an elliptical window in the scan rather than highly applied
circular shape used in other studies for several reasons. First, according to previously
published papers, the elliptic shape had better power and precision compared with
52
circular one and also follows more accurately certain geographic features with varying
shapes and directions 155. Another reason was to avoid edge effects at the borders of
the study area. When circles are used for scanning the study area, particularly when the
circle should center on the centroid of border towns, considerable parts of it inevitably
covers the neighboring state (i.e. Rhode Island in the East, Massachusetts in the North,
Long Island Sound to the South and New York in West, where states are also LD-
endemic areas where data were not accessible) and the true number of cases would be
underestimated. The use of ellipses with different shapes and directions helps to reduce
this problem to some degree.
There were similarities in the locations of the spatial hotspots identified by
SaTScan and LISA even though these methods apply different methodologies to detect
hotspots. It should be noted that we smoothed the incidence rate for the LISA cluster
detection method, while this is not available for spatial scan statistics. The results were
also in agreement with the findings of Barro et al. (2015), who compared different
cluster detection techniques including Getis-Ord Gi*statistic, a multidirectional optimal
ecotope- based algorithm (AMOEBA) and the spatial scan statistic for identifying
hotspots of human cutaneous anthrax in Georgia for point data 156. They also found that
SSS was more sensitive (due to augmentation in the quantity of true positives) but less
specific (by increment of the population at risk because of a declining number of true
negative clusters). Here, the results were highly dependent on defining the weight
matrix in LISA or the percentage of the population at risk in spatial scan statistics.
The most important limitation of this study is attributed to the data reported that
were used in this study. As indicated by the CDC, surveillance data are subject to
53
under-reporting and misclassification in highly endemic areas such as Connecticut;
however, this problem would not be very severe in light of well-designed LD surveillance
system in this state. Additionally, as long as there is simply a spatially random thinning
of reported cases, the overall spatial analysis should not be affected. However, if there
is persistent over- or under-reporting in selected regions, it may be difficult to recognize
the impact of such biases. Another factor influencing the results is that the definition of
LD has changed several times: from using a two-test approach for laboratory affirmation
139 to the addition of probable and suspect categories with less strict criteria 140 and
subtle changes in the specification of confirmed cases 141. However, empirically (Figure
2-3), these changes do not seem to substantially have altered the areas of high-risk
clusters. Therefore, although the analyses conducted were based on data reported by
CTDPH, the findings of this study should be interpreted with some caution.
The findings of this study can be regarded as a basis for generating hypotheses
about underlying risk components. For instance, visual comparison of the locations of
towns that were never identified as cluster areas (Figure 2-3) and the digital elevation
model (DEM) of the study area suggest that low-risk towns are located in low-lying
areas; thus it seems that people who live at moderate or higher altitudes in Connecticut
are at a higher risk. This would be consistent with the earliest epidemiologic
investigations 157 that identified increased risk in areas away from the seashore. Thus,
altitude as a proxy of other environmental factors can uncover the reasons of the spatial
distribution of the disease in the study area. Moreover, except for Windham, all the
other towns identified as persistent clusters had populations lower than 8,000 people
showing that persistent towns as clusters occurred in less populated areas. Therefore,
54
further studies should incorporate other environmental factors that have significant
influence on the spatial and temporal patterns of the disease.
55
Tables
Table 2-1. Comparison of LISA and SaTScan clusters by confusion matrix and its derivatives statistics Identified hotspots (LISA or SaTScan)
Actual hotspots (top 5%
incidence rates)
Presence Absence
Presence True Positive (TP)
False Negative (FN)
Absence False Positive (FP)
True Negative (TN)
Sensitivity = (TP)/(TP+FN); Specificity = (TN)/(TN+FP); Overall accuracy = (TP+TN)/(TP+FN+FP+TN)
Table 2-2. Results of the global Moran statistic of LD incidence rate, Connecticut, 1991-2014
Year Index Z-score P-value Type of distribution
1991-1996 0.71 16.05 ≈ 0 Clustered 1997-2002 0.59 13.54 ≈ 0 Clustered 2003-2008 0.55 12.73 ≈ 0 Clustered 2009-2014 0.59 13.28 ≈ 0 Clustered
Table 2-3. Characteristics of LD spatial clusters detected by SSS with 5% of the
population at risk throughout Connecticut, USA for the period 1991-2014
Cluster (C) Year Observed Cases (N)
Expected Cases
Ratio (Obs/Exp)
P-value Relative Risk (RR)
Log likelihood ratio
Most Likely (East) 1991-1996 792 171.26 4.62 <0.0001 7.67 736.02
Secondary (West) 1991-1996 152 32.26 4.71 <0.0001 5.07 120.17
Most Likely (East) 1997-2002 1,208 474.52 2.55 <0.0001 3.40 496.50
Most Likely (East) 2003-2008 738 305.46 2.42 <0.0001 3.33 284.31
Secondary (West) 2003-2008 145 16.96 8.55 <0.0001 9.18 187.61
Most Likely (East) 2009-2014 942 376.77 2.50 <0.0001 3.69 400.80
P-values were obtained from the Monte Carlo hypothesis test
56
Figures
Figure 2-1. Geographic location of the Connecticut, its towns and approximate populations. The names on the map show the location of the towns which are mentioned in this paper.
57
Figure 2-2. Temporal trend of LD incidence throughout Connecticut from 1991 to 2014. Black dots show the incidence rate for each year and the blue line represents a scatter with smooth lines.
0
50
100
150
200
250
300
350
1990
1992
1994
1996
1998
2000
2002
2004
2006
2008
2010
2012
2014
Inci
denc
e ra
te (p
er 1
00,0
00)
Year
58
Figure 2-3. Locations of spatial clusters of LD incidence in Connecticut, USA based on
the true-cluster definition of with LISA and SSS methods targeting 5% of the population at risk (A), for the period 1991-1996; (B), for the period 1997-2002; (C), for the period 2003-2008; (D), for the period 2009-2014.
59
Figure 2-4. Comparison of LISA and spatial scan statistics with 5% and 50% of the
population at risk of LD in Connecticut, USA for each period from 1991 to 2014. Sensitivity, specificity and overall accuracy expressed as per cent (%)
40
50
60
70
80
90
100
Sens
itivi
ty
Spec
ifici
ty
Ove
rall
Accu
racy
Sens
itivi
ty
Spec
ifici
ty
Ove
rall
Accu
racy
Sens
itivi
ty
Spec
ifici
ty
Ove
rall
Accu
racy
Sens
itivi
ty
Spec
ifici
ty
Ove
rall
Accu
racy
Period1 Period2 Period3 Period4
LISA
SSS 5% pop. at risk
SSS 50% pop. at risk
60
CHAPTER 3 MACHINE LEARNING APPROACHES IN GIS-BASED ECOLOGICAL MODELING OF
THE SAND FLY PHLEBOTOMUS PAPATASI, A VECTOR OF ZOONOTIC CUTANEOUS LEISHMANIASIS
Zoonotic cutaneous leishmaniasis (ZCL) is an important neglected tropical
disease due to insufficient considerations for controlling and ignored remarkable public
health impacts in endemic countries ∗158. It poses unpleasant skin lesions to the patients
which may take a long time to heal and injects serious socio-economic burdens to the
society 159,160. This vector-borne disease is highly prevalent in developing countries
including Iran 161. Previous studies in Iran have demonstrated that Leishmania major is
the causative agent of ZCL and Ph. papatasi which is the target species in this study
serves as the major vector of the disease. It is reported that the disease is still endemic
in many rural areas of 17 of 31 provinces of Iran and almost 70% of
cutaneous leishmaniasis cases in Iran are ZCL 162,91. A large number of factors impact
the distribution of Ph. papatasi and/or its pathogen which have made it a
multidimensional and complex health problem particularly when the underlying
relationship is ambiguous 163.
In recent years, novel machine learning techniques such as random forest (RF)
and support vector machine (SVM) have become popular research tools for ecologists
and medical researchers due to improved predictive performances of the techniques,
particularly in solving binary classification tasks 164. The rapid application of these
techniques might be due to their abilities to approximate almost any complex functional
∗ Reprinted with permission from Mollalo, A., Sadeghian, A., Israel, G. D., Rashidi, P., Sofizadeh, A., & Glass, G. E. (2018). Machine learning approaches in GIS-based ecological modeling of the sand fly Phlebotomus papatasi, a vector of zoonotic cutaneous leishmaniasis in Golestan province, Iran. Acta tropica, 188, 187-194.
61
relationship 165. These techniques are mathematical models that can map the
relationship between input data and the target output by receiving several examples
(training data) 166. A trained model subsequently can predict a variable of interest for a
new independent (test) dataset.
To date, there are few studies using machine learning methods that model the
spatial distribution of leishmaniasis. In South America, Peterson and Shaw (2003) used
the genetic algorithm for rule-set prediction (GARP) to model ecological niche of
three sand fly vector species: Lutzomyia whitmani, Lutzomyia intermedia,
and Lutzomyia migonei. They projected the geographic distribution of vectors for the
year 2050 under global climate change scenarios and predicted that Lutzomyia
whitmani will dramatically increase in southern Brazil 167. Samy et al. (2016) used
Maximum Entropy (MaxEnt), a presence-only ecological niche model (ENM), to map the
potential distribution of ZCL across Libya. Their model successfully predicted the risk
areas and distribution of most species associated with ZCL including Ph. Papatasi. This
sand fly was present mostly in lowland areas and the risk areas of ZCL were almost
confined to the northern coast of the country 168. In a study in northwestern Iran, Rajabi
et al. (2014) used radial basis functional link nets (RBFLN), a neural network modeling
technique, to map potential areas of visceral leishmaniasis (VL), or kala-azar, a fatal
form of leishmaniasis. They produced a susceptibility map of VL, with the overall
accuracy of 92% in endemic villages and concluded that riverside and nomadic villages
without health centers were high-risk areas 169. RF and SVM have been applied in the
classification of other vector-borne diseases such as malaria and dengue fever. In an
endemic district in South Africa, Kapwata and Gebreslasie (2016) successfully applied a
62
random forest approach to select important features to create predictive maps of
malaria. They found that the most determining features were altitude, normalized
difference vegetation index (NDVI), rainfall and temperature 170. Gomes et al.
(2010) trained an SVM model for a small sample size (n = 28 patients) to classify
dengue fever (DF) and dengue hemorrhagic fever (DHF) patients based on gene
expression data. They achieved 85% accuracy for the validation dataset and identified 7
out of 12 genes that can differentiate DF patients from DHF patients 171.
To the best of our knowledge, no effort has been made to model Ph. papatasi or
other vectors of leishmaniasis distribution using SVM and RF, which are two effective
supervised techniques in ecological studies 172. This study analyses the applicability of
these techniques in GIS-based modeling of the species in Golestan province, northeast
of Iran. These techniques are used to classify this endemic area of ZCL into two
distinctive classes: ‘presence’ and ‘absence’ of Ph. papatasi. The accuracy of
these classifiers was compared with the frequently applied (vanilla) logistic regression.
This study is a starting point for further studies and application in control measures such
as “automatic early warning systems”. Findings of this study can help decision makers
target surveillance, and to control and mitigate species propagation into new foci.
Material and Methods
Study Area
Surveillance was conducted in Golestan province, a notorious endemic area
of leishmaniasis. The province has over 1,700,000 inhabitants and covers an area of
20,368 square kilometers in the northeast of Iran. This area lies within latitudes 36°30′
and 38°8′ N and longitudes 53°57′ and 56°22′ E (Figure 3-1). Geographically, this
province is bounded from north to Turkmenistan and from west to the Caspian Sea and
63
Mazandaran province. The altitude of the study area ranges from −32 to 3,945 m above
mean sea level. The climate of province varies from cold to moderate weather in the
south because of Alborz mountain ranges and arid and semi-arid with hot summers in
northern areas. As one moves from southern to northern areas, generally, the amount
of rainfall and percentage of humidity decreases and air temperature increases.
Data Collection and Preparation
Several categories of data in different formats including vector and raster data
models and excel spreadsheets were incorporated. The data were obtained from the
Worldclim, the Moderate Resolution Imaging Spectroradiometer (MODIS) satellite
images, the United States Geological Survey (USGS) and GIScenter of Golestan
province. These data were used as candidate explanatory variables to model spatial
distribution of vector presence/absence in the study region. It was essential to
determine candidate factors to model species distribution. Thus, in the first step, a ZCL
related geo-database was developed in GIS and stored in a common projection system.
Sand fly collection
We conducted sampling during the primary activity period of sand flies from early
April 2014 to late October 2014 (adult sand flies are absent in other months). In each of
the 16 counties of the province, 5 to 10 villages were selected based on their
topographic conditions, spatial distribution of sampling sites (almost distributed
uniformly across the study area), and area of each county. For each sampling site,
sticky paper traps smeared with castor oil with the size of 15*20 cm used and placed
indoors (bedrooms, stables,…) and outdoors (animal shelters, rodents burrows,..). In
each site, 30 indoor and 30 outdoor sticky traps were monthly installed. We installed
traps at sunset and collected them the next day before sunrise. Also, no unprecedented
64
changes in climate condition or ZCL incidence was observed for the study period.
Collected sand flies were then washed to remove oil and other debris. We preserved
the sand flies in 70% ethanol and transferred them to an entomology laboratory for
species identification. Identification of collected material was based on
morphological identification keys of dissected sand flies. In 44 out of 142 sampling
sites, Ph. papatasi was present (Figure 3-1B). Although other species of sand flies were
also found in the study area, we did not consider them in habitat modelling, as Ph.
papatasi is the main vector of ZCL in the study area. Detailed information on collected
sand flies has been presented in Sofizadeh et al. (2016) 109.
Exploratory data
To investigate the effects of temperature and precipitation on habitat suitability
of Ph. papatasi in this area, we used the WorldClim database 173. Nineteen climate
variables (bio1 through bio19) were obtained from the database
(http://www.worldclim.org) and clipped with the boundaries of the study area (Table 3-
1). These variables are at 1 km (30 arcsec) spatial resolution. In addition, to investigate
the possible effects of proximity to water on presence/absence of the sand fly, we
calculated the Euclidean distance from each sampling site to the nearest river in the
GIS environment. The river coverage obtained from GIS and Environment Center of
Golestan province.
Two widely used remotely sensed products including land surface temperature
(LST) and NDVI were used to measure land temperature and vegetation cover,
respectively. These products were downloaded from the Moderate Resolution Imaging
Spectroradiometer (MODIS) satellite imagery from March to October 2014
(corresponding to the same time of the sand fly collection). Both LST and NDVI are at
65
1 km spatial resolution; the temporal resolution for the LST dataset is 8 days which
aggregated to a monthly resolution. LST downloaded for day and night, separately. The
NDVI dataset is at a monthly temporal resolution. They were pre-processed in ENVI
5.4 image analysis software. All remotely sensed satellite images are freely available at
the MODIS website: http://www.modis.gsfc.nasa.gov.lp.hscl.ufl.edu.
We used NDVI to reflect the density of vegetation cover on the ground. This
index, which has been highly correlated with the plant species presence, is calculated
via the red (R) and near-infrared (NIR) spectral bands from the following formula:
𝐍𝐍𝐍𝐍𝐍𝐍𝐈𝐈 =𝐍𝐍𝐈𝐈𝐍𝐍 − 𝐍𝐍𝐍𝐍𝐈𝐈𝐍𝐍 + 𝐍𝐍 (3-1)
The value of NDVI differs from −1 to +1 so that a greater value of NDVI represents a
higher amount of vegetation cover 158, 160, 174.
The 30-meter shuttle radar topography mission (SRTM) elevation data was
obtained from the United States Geological Survey (USGS) website (https://www-usgs-
gov.lp.hscl.ufl.edu/). Also, a slope map was generated from the digital elevation model
(DEM) of the study area. To calculate slope, the average maximum algorithm was used
in ArcGIS 10.2 software (ESRI, Redlands, CA, USA).
In summary, we developed a set of 31 candidate variables representing different
weather conditions, topographies, vegetation regimes, land temperature and proximity
to rivers that might plausibly impact sand fly occurrence (Supplementary Table). The
selection of these environmental variables was based on previously published literature
that found direct and/or indirect impacts on the distribution of the sand fly populations.
The values of each raster variable were extracted for each of the 142 sampling sites
66
through “Extract values to points” function in ArcGIS 10.2. The presence and absence
of Ph. papatasi are coded with 1 and 0, respectively.
Classifiers Used in This Study
Logistic regression (LR) is a generalized linear regression model that is
appropriate for situations where the dependent variable is dichotomous. The output of
LR is a likelihood of species occurrence as a function of several exploratory variables
175.
The random forest (RF) classifier, an ensemble tree-based algorithm, generates
a large number of binary decision trees (for instance 1000 trees) and then combines
them to provide a final classification for the presence/absence of the outcome 96. This
algorithm injects randomness through random selection of observations
using bootstrap sampling (i.e. sampling with replacement) and random selection of
exploratory variables. This process learns binary trees that are shallow (i.e. have a
small number of nodes) but highly specialized on a subset of the dataset/variables.
Ensembling the decisions of this diverse set of trees (forest) leads to an improvement in
the performance of the method 97. A final decision is achieved based on the majority of
labels (i.e. votes or outputs of trees) obtained from individual trees. In this study, the
number of trees was optimally selected from the set of {10, 30, 50, 100, 500} by cross-
validation. In addition, the optimal maximum depth (i.e., number of layers or number of
edges from the root to the node) of the trees was chosen from the set of {2, 3, 4}. The
maximum depth is selected via the cross-validation done on the training data.
SVM, originally proposed by Vapnik (1999), can extract patterns from complex
non-linear data 176. We used this SVM's ability to extract spatial patterns from the
collected data and features. SVM has several appealing characteristics compared to
67
most machine learning techniques. It guarantees global optima 177 and leads to a
unique solution when solved using iterative methods, rather than stochastic
randomness. We used a grid search to find the optimum values for the SVM
parameters. Detailed explanations of SVM can be found in Noble 98.
We trained all above-mentioned models (i.e. LR, RF, and SVM) using sklearn
(http://scikit-learn.org/stable/) in Python programming language to predict Ph.
papatasi occurrences.
Pre-processing of the Data
In the first step, to avoid/reduce the existence of the multicollinearity and
redundancy of the variables, we applied Pearson’s correlation analysis in SPSSver. 23
software. We investigated Pearson correlations among all pairs of selected variables.
We omitted highly correlated factors (i.e. correlation coefficient >0.7) before modeling.
All of the models are sensitive to the number of exploratory variables because
redundant variables add extra noise which reduces model accuracy and generalizability
178. A proper selection of the most effective input variables is crucial to improving
predictions. To address this, feature selection via L1 regularization was utilized before
developing the models to reduce the effects of possible noise. More detailed information
of L1 regularization method can be found in Park and Hastie (2007) 179.
After selecting the variables, we standardized (i.e., scaled) the input dataset
before incorporating the variables into the models because the data have different
ranges and units. This important pre-processing step can increase the performance of
models by converging faster 180, particularly in SVM and RF which are solved with
iteration-based optimization techniques. There are several standardization formulas
68
presented in literature and, here, we use the following equation which re-scales input
data to z-scores:
𝐙𝐙𝐚𝐚 =𝐙𝐙𝐢𝐢 − 𝛍𝛍𝛔𝛔
(3-2)
Where Zi is the initial (actual) value; μ and σ are the average and standard
deviation of the initial value, respectively and Zn is the standardized value. Moreover,
after developing the models, the output of the model is returned to the original form
through the following equation:
𝐙𝐙𝐢𝐢 = (𝐙𝐙𝐚𝐚 ∗ 𝛔𝛔) + 𝛍𝛍 (3-3)
Overview of Training Models
To develop models, we used 5-fold cross-validation where in each fold, the
dataset is randomly partitioned into two different categories including training set, 75%
of total data (ntrain = 106), and a remaining 25% (ntest = 36) as a hold-out test set used to
assess the accuracy and predictive power of the models after training process. We
saved and used the same training and test datasets for all models for further
comparisons. Also, the data are randomly selected, thus, on average 75% of the
training set is the same from one fold to the other.
To fine-tune each model’s hyper-parameters (e.g. parameters in SVM), using the
earlier mentioned grid search method, we performed a 3-fold cross validation within the
training set, breaking it into 66.6% and 33.3% subsets for each fold. As a final step, LR,
SVM and RF classifiers with optimal parameters were trained and then used to
distinguish presence from the absence of Ph. papatasi. All the reported results are the
average from the 5-fold cross-validation.
69
Models Validation
A confusion matrix or error matrix 148 is used to assess the generalization
capabilities of the classifiers. Sensitivity (or true positive rate) and specificity (true
negative rate) were used to evaluate the ability of the model to correctly identify
presence and absence of Ph. papatasi, respectively 181. Also, two important statistics
have been calculated to measure the performance of classifiers, namely positive
predictive value (PPV) and negative predictive value (NPV). PPV is the probability that
the sampling sites identified by the model with the presence of Ph. papatasi that actually
have that species, while NPV is the probability that the sampling sites identified by the
model as the absence of Ph. papatasi that actually don’t have that species 182. The
models’ accuracies were assessed using overall accuracy and area under receiver
operating characteristics (ROC) curve 183. In this study, we also used Cohen’s Kappa
index to exclude the probability of chance agreement between predictions and ground
truth 184.
Due to black-box nature of SVM, we are unaware of the weights/contribution of
each variable in predicting presence/absence of Ph. papatasi. We applied sensitivity
analysis to understand the contribution of each factor on the output variable by
excluding each variable from the model one by one and comparing the performance of
the model in terms of Cohen’s Kappa index. The absence of a variable that reduces the
Kappa index of the model the most is considered as the most important factor and so
forth. Finally, variables were imported into the most accurate model to generate habitat
suitability map for the entire study area. The obtained map was classified into 5 different
categories: very low, low, moderate, high and very high probabilities. Natural break
70
classification (a.k.a. jenks) was utilized to classify the obtained suitability map. The
overall flowchart of the procedures is shown in Figure 3-2.
Results
After conducting variable selection, 7 variables selected as inputs of the
models. Table 3-2 shows the list of selected variables and the correlation coefficients
among them.
As stated in the Methods section, the parameters C and γ in the SVM model
were determined experimentally by systematically increasing these values from 0.5 to
20 and 0.005 to 0.1, respectively. The best accuracy of SVM was obtained with C= 17.5
and γ= 0.105 and the radial basis function (RBF) kernel.
Results of this study using the test dataset (25% of the sampled areas not used
in developing the models) demonstrated that all three classifiers have good
generalization capabilities. The SVM model generated the highest overall accuracy
(91%) followed by the LR (88%) and the RF (85%). The ROC curves (prediction rates)
and AUCs for each method (Figure 3-3) showed all the classifiers generated high AUCs
(more than 90%); however, there were slight differences between AUCs of the
classifiers. The AUCs were 0.974, 0.965 and 0.935 for SVM, LR, and RF classifiers,
respectively.
Kappa indices were 0.79, 0.73 and 0.65 for the SVM, LR, and RF models,
respectively, suggesting substantial agreements between models outputs and ground
truth. Similarly, the lowest accuracy expected better than chance was obtained by the
RF method.
The highest positive predictive value (PPV) was achieved by the SVM model
(85%), indicating that the probability that the sampling sites identified by the SVM
71
classifier with the presence of Ph. papatasi that actually have that species was 85%. It
was followed by the LR model (82%) and the RF model (79.0%). Similarly, the SVM had
the highest negative predictive value (NPV = 93.2%) suggesting that the probability that
the sampling sites identified by the SVM as the absence of Ph. papatasi that actually
don’t have that species was 93.2% and followed by the LR (90.7%), and RF models
(87.0%).
The SVM model was the most sensitive model, with 84.9% of the correctly
classified sampling sites, followed by LR model with 80.0% of the correctly classified
sampling sites whereas the RF model was the least sensitive classifier with 73.4%. The
SVM model also had the most specific model (93.2%), implying that over 90% of
absence locations are correctly classified by this method. It was followed by the LR
model (91.9%) and RF models (90.4%). A graphical comparison of evaluation metrics of
classifiers (i.e. LR, SVM and RF) is shown in Figure 3-4.
Based on the above results, the best accuracy was obtained by SVM (with the
RBF kernel) compared to the other classifiers. Next, we assessed the contribution of
each variable for this best classifier using sensitivity analysis. Results of this analysis
indicated that the lowest Cohen’s Kappa index obtained when the slope was removed
from the SVM model. This suggested that this factor is the most influential factor in
predicting Ph. papatasi in the study region. The most influential factors in order of
contributions (from higher to lower) as predictors for Ph. papatasi were: slope, nighttime
LST in October, bio8 (mean temperature of wettest quarter), bio4 (temperature
seasonality), bio18 (precipitation of warmest quarter), bio5 (max temperature of
72
warmest month), bio15 (precipitation seasonality), distance to major rivers and bio13
(precipitation of wettest month), respectively.
According to the produced habitat suitability map (Figure 3-5) from the SVM
method, considerable parts of the study area are under very high probabilities of the
species occurrence. Results of the natural break classification of the map show that
28.8%, 5.0% and 7.7% of the total study area were predicted in the very low (0–0.18),
low (0.18–0.48) and moderate (0.48–0.73) probabilities of existence classes, while
areas covering high (0.73–0.90) and very high (>90) probabilities of Ph.
papatasi occurrences represent 16.5% and 42% of the total area, respectively. Very
high probabilities of the species occurrence happened mostly in central and north
(eastern) parts, while no and low occurrence of the species was restricted to the
southern half of the study area.
Discussion
This study focused on geospatial features that can predict the presence or
absence of Ph. papatasi in an endemic province of ZCL in Iran. We combined relatively
novel techniques/tools: remote sensing (to extract NDVI and LST), GIS (mainly for data
preparation and mapping) and machine learning (for model development). Our results
successfully demonstrated the utility of machine learning techniques in predicting the
spatial distribution of Ph. papatasi across Golestan province, Iran, from a surveillance
sampling strategy. Among the compared models, the SVM was an
efficient classifier with the highest prediction capability for producing habitat suitability
map of Ph. papatasi occurrence. Phlebotomine sand flies have limited flight range of
about 300 m 185 and a few have been known to travel 1.5–2 km 186. Regarding 1 km
73
spatial resolution of LST and bioclim variables, it seems that it is safe to assume that
this scale is operative for intervention or allocation of resources.
Regardless of the SVM benefits in Ph. papatasi habitat modeling, our results did
not indicate that the SVM classifier necessarily outperforms other binary classifiers for
species modeling or that RF is not a very accurate classifier. For instance, Ding et al.
(2018) found SVM inferior to the other models including random forest and gradient
boosting machine in modeling the global distribution of Aedes aegypti and Aedes
albopictus 187. Also, Williams et al. (2009) identified RF as the best model in predicting
appropriate habitats of six rare plant species in Creek Terrane, northern California. The
weaker results of RF compared to the other classifiers might be due to overfitting
(memorization). But generally, it is recommended to take SVM into consideration for
binary classification problems when having a dataset with a small sample size and a
relatively large number of exploratory variables 188.
The findings from this study can be compared with previous work conducted in
the same area by Sofizadeh et al. (2016). They utilized presence-only ecological niche
models (i.e. MaxEnt and GARP) for Ph. papatasi habitat modeling. The SVM model had
a 27% improvement in accuracy over the MaxEnt model 109. All three presence-absence
models developed in this study were more accurate than those developed using
ecological niche models. However, niche models generate pseudo-absence data from
the background, AUCs derived from pseudo-absence models are difficult to interpret
and thus a spurious pseudo-model might be selected. In this study, the lowest accuracy
for test data obtained by RF was 85%, while the overall accuracy of MaxEnt, which
outperformed GARP in the study of Sofizadeh et al. (2016), was no better than 64%.
74
This suggests that presence-absence models perform better for habitat suitability
models of the sand flies 109. Our results also confirm and extend the findings
of Carvalho et al. (2015) in predicting the vector of Leishmania amazonensis in South
America. In their study, random forest, as a presence-absence model, had the best
performance compared to several models including ENMs. They concluded that
incorporating species absence data significantly improved model performance 189.
However, it should be acknowledged that we incorporated some new variables such as
monthly NDVI (as opposed to yearly NDVI) and monthly LST (for day and night,
separately) and distance to main rivers together with considering absence points in
modeling. Among these new variables, nighttime LST and distance to main rivers were
found to be important determinants.
Sensitivity analysis indicated that slope was the most influential factor which is in
agreement with the results of previous ecological niche model. Topography and its
derivatives such as slope can have influence on biological and physical factors. For
instance, an increase in altitude leads to decreases in temperature and rainfall, and
these changes cause different plant types to grow 190. Also, Mollalo et al. (2014) found
altitude (and slope) a determining factor in the incidence of human ZCL. This suggests
that there is a strong correlation between human cases of ZCL and occurrence of Ph.
papatasi. Nighttime LST found the most important factor after slope which rarely is
incorporated in modeling Ph. papatasi 191. Land temperature can be associated with soil
moisture, evapotranspiration and vegetation water stress, which can be determining
factors in larval development and egg survival 192. Boussaa et al. (2016) also found LST
a significant factor in describing the distribution of Ph. papatasi. Similarly, no significant
75
association of daytime LST was observed which might be attributed to the nighttime
activities of sand flies. Temperature and precipitation variables (bio8, bio4, bio 13, bio
15, bio18) were found as the next important contributing predictors 193. Cross et al.
(1996) and Medlock et al. (2014) found that temperature and precipitation play
important role in development and survival of different life stages of Ph. papatasi 194-195.
These two factors have different effects on sand fly populations. For instance, lack of
rainfall sometimes doesn’t reduce larval frequency while moderate rainfall facilitates
transmission and intense rainfall reduces transmission because it flushes away suitable
resting sites for adult sand flies and destroys immature stages of the life cycle 196. In
regards to temperature, tropical climate provides favorable conditions to facilitate
transmission, while, the sand fly is not able to survive in temperature less than 15 °C
under laboratory conditions 197.
It should be noted that the accuracy of the classifiers used in this study depends
entirely on the data. Additionally, there are potential sources of errors (random errors or
systematic errors) in the dataset such as sampling bias or errors during vector collection
which can influence the accuracy of the classifiers. Therefore, the results of this study
should be used with some caution. We also recommend testing the performances of the
models on other datasets (e.g., regions) with comparable environmental characteristics
to select the best unbiased estimator for predictions. As a next step, a new sampling
validation in the same area of study is suggested to be performed in absence and
presence sites not sampled previously. Future studies should also consider actual
counts of the Ph. papatasifor each sampling site. Depending on the frequency
distribution of sand flies, proper modeling techniques such as Poisson regression,
76
negative binomial models, and artificial neural networks can be used. However, higher
counts of vector imply a greater risk of ZCL, the main purpose of this study was to
compare predictive capabilities of three different binary classifiers and modeling for
count data was not the purpose of this study.
In this paper, the use of novel machine learning techniques to predict the
presence/absence of an arthropod vector, Ph. papatasi, was investigated. The findings
showed that the SVM classifier was a reliable tool to solve specific problems in medical
geography and ecological modeling. The results of this study can be extended to
develop standalone early warning services (EWS) to actively monitor sand flies’
presence in endemic areas. Due to sophisticated predictive capabilities of SVM
compared to other classification methods, this classifier can be considered/
incorporated into the design of early warning systems (i.e. predictions of vector
presence greater than a threshold as a warning). This is applicable even for
other vector-borne diseases such as tick- and mosquito-borne diseases. However, the
implementation of EWS is more feasible in countries with well-established climate
stations that systematically record data.
77
Tables
Table 3-1. Bioclimate variables used in this study slope bio13 bio15 bio5 bio8 LST_OctN Dist_major slope 1 -.567** .376** -.610** .031 -.356** .457** bio13 -.567** 1 -.539** .467** -.087 .665** -.388** bio15 .376** -.539** 1 .013 -.144 -.479** .380** bio5 -.610** .467** .013 1 -.422** .640** -.331** bio8 .031 -.087 -.144 -.422** 1 -.221** .077 LST_OctN -.356** .665** -.479** .640** -.221** 1 -.375** Dist_major .457** -.388** .380** -.331** .077 -.375** 1
Table 3-2. Pearson correlation coefficients between selected variables
Variable Description Bio1 Annual Mean Temperature Bio2 Mean Diurnal Range (Mean of monthly (max temp – min temp)) Bio3 Isothermality (BIO2/BIO7) (*100) Bio4 Temperature Seasonality (standard deviation *100) Bio5 Max Temperature of Warmest Month Bio6 Min Temperature of Coldest Month Bio7 Temperature Annual Range (BIO5-BIO6) Bio8 Mean Temperature of Wettest Quarter Bio9 Mean Temperature of Driest Quarter Bio10 Mean Temperature of Warmest Quarter Bio11 Mean Temperature of Coldest Quarter Bio12 Annual Precipitation Bio13 Precipitation of Wettest Month Bio14 Precipitation of Driest Month Bio15 Precipitation Seasonality (Coefficient of Variation) Bio16 Precipitation of Wettest Quarter Bio17 Precipitation of Driest Quarter Bio18 Precipitation of Warmest Quarter Bio19 Precipitation of Coldest Quarter
Correlation is significant at the 0.01 level (2-tailed).
78
Figures
Figure 3-1. Geographic location of the study area and its counties, NE Iran. The map inset (B) illustrates the location of presence/absence sampling of Ph. papatasi.
79
Figure 3-2. Methodology flowchart used in this study
Figure 3-3. The ROC curves and AUCs for SVM, RF and LR classifiers
80
Figure 3-4. Comparison of the SVM, LR and RF classifiers. Overall accuracy, AUC, Kappa index, sensitivity, specificity, PPV and NPV expressed as %.
60
70
80
90
100
OverallAccuracy
AUC Kappa Sensitivity Specificity PPV NPV
Comparison of classifiers
SVM
LR
RF
81
Figure 3-5. Habitat suitability map of Ph. papatasi in Golestan province, northeast Iran based on support vector machine (note: this map is based on 1km units).
82
CHAPTER 4 A GIS-BASED ARTIFICIAL NEURAL NETWORK MODEL FOR SPATIAL
DISTRIBUTION OF TUBERCULOSIS ACROSS THE CONTINENTAL UNITED STATE
Tuberculosis (TB) is a contagious disease caused by Mycobacterium
tuberculosis ∗198. The disease is primarily transmitted through the respiratory route by
coughing or sneezing 199. The disease mostly attacks the lungs but can also affect other
organs such as kidney and brain 200. It can promote the course of human
immunodeficiency virus (HIV) infection into acquired immune deficiency syndrome
(AIDS) 201. According to the World Health Organization (WHO) global TB report, it is
estimated that 10.4 million incident cases in 2016 developed the disease, of which
almost 1.7 million patients died 202. This agency has ranked TB as the leading cause of
death among HIV patients, the most common killer from a single infectious agent, and
the 9th leading cause of death, worldwide.
According to the statistics by the WHO, most TB cases (>90%) are reported in
developing countries; however, it can also occur in developed countries 203,204. Despite
efforts to eradicate TB in the US, the disease remains a major public health challenge
204. In 2016, more than 9200 TB cases were reported in various parts of the US, which
placed the disease among the top notifiable infectious diseases in the country 205.
Although the frequency of TB has decreased in recent years, it is not expected that the
US will achieve the goal of TB elimination in this century 206.
There are many factors that influence the spatial distribution of TB, which has
made the disease a multidimensional and complex public health problem 207-210.
∗ Reprinted with permission from Mollalo, A., Mao, L., Rashidi, P., & Glass, G. E. (2019). A GIS-Based Artificial Neural Network Model for Spatial Distribution of Tuberculosis across the Continental United States. International journal of environmental research and public health, 16(1), 157.
83
Previous researches from different parts of the world have demonstrated that TB
transmission is related with various individual factors, for example, age, gender,
education level, race, migration, drinking alcohol, and presence of diseases (such as
HIV and diabetes) 211-213. Moreover, at the ecological level, factors such as climate,
altitude, air pollution, economic level, unemployment rate, and poverty have found
significant on TB occurrence 214,215. One of the major drawbacks of the highly applied
traditional statistical models in the study of TB is that these models are often based on
several hard-to-meet assumptions 191,91. This can bias the estimations of TB frequency/
incidence rate 216. For instance, some assumptions of the linear regression (LR) model
are normality of all variables, the linear relationship between inputs and output, constant
variance of errors, and little or no multicollinearity. They also often need a complete
and/or long-term recorded dataset to achieve unbiased estimations 217. On the other
hand, machine learning techniques (MLTs) may lead to appropriate estimations even
with noise-contaminated and incomplete data 218,219. As advanced tools, MLTs have
been successfully used in analyzing and modeling various complex environmental
disciplines, including in ecology, geography, biomedicine, and epidemiology
24,159,220,221,166. The growing popularity of MLTs can be attributed to their abilities to
approximate almost any complex non-linear functional relationship 222-224. Despite their
capabilities in working with noisy and incomplete data as in most epidemiological
studies, they have been underused in spatial epidemiology 169,225.
Inspired by human neural processing, artificial neural networks (ANNs) are
among the most popular MLTs used in recent years in environmental studies. Artificial
neural networks have a large number of highly interconnected processing elements
84
(neurons) working in unison to solve specific problems 226. Compared to the traditional
statistical models, ANNs are independent of the statistical distribution of data and do not
require a priori knowledge about the data for deriving patterns. ANNs are simplified
mathematical models that can map the relationship between input and output layers by
receiving several examples (training data). A properly trained network can further be
used to predict outcome(s) from new data (test data) 24.
There have been few published ANN architectures in spatial modeling of
infectious diseases, worldwide. Aburas et al. applied an ANN model with a back-
propagation algorithm to predict the frequency of confirmed dengue cases using
Singaporean National Environment Agency data 227. Results of their model showed a
correlation coefficient of 0.91 between actual and predicted values. They also identified
influential environmental factors predicting the number of dengue cases including mean
temperature, mean relative humidity, and total rainfall. Laureano-Rosario et al. used
ANN to predict dengue fever (DF) occurrence in a region in Puerto Rico, and in several
coastal municipalities in Mexico 228. The developed ANN models were trained with 19
years of DF data for Puerto Rico and six years’ data for Mexico. Sea surface
temperature, precipitation, air temperature, humidity, previous DF cases, and population
size were used as explanatory variables. Their results showed that the ANN
successfully modeled DF outbreak occurrences with the overall power of 70% in both
areas. The variables with the most influence on predicting DF outbreak were population
size, previous DF cases, air temperature, and date. In the US, Xue et al. developed the
least square (LS) regression analysis and a neural network trained by the genetic
algorithm to evaluate influenza activity in 10 geographic regions 229. They compared the
85
models using three evaluation metrics: mean square error, mean absolute percentage
error, and relative mean square error. All three evaluation indices used in their study
were lower than the corresponding metrics for the LS regression model showing the
superiority of the genetic algorithm-based neural network to the LS regression model.
To date, TB control efforts have relied on the empirically developed WHO DOTS
(directly observed treatment, short-course) control strategy which focuses on “case-
finding” rather than “place-finding” 230. Previous studies from different parts of the US
and at different levels have shown the association of TB with socio-economic status.
Mullins et al. used purely spatial scan statistics to identify spatial clusters of census
tracts with high TB prevalence rates in Connecticut. They found six clusters of TB
containing 126 census tracts 231. Persons in these clusters were more likely to be black
non-Hispanic and less likely to be Asian. Bennett et al. used multivariate logistic
regression to assess the association between demographic and clinical characteristics
and latent TB infection in refugees in San Diego, California 232. They found that the
highest prevalence rate was among refugees from sub-Saharan Africa and those with
less education. In a study in Harris County (in Texas), Feske et al. showed a positive
association between the percent of individuals using public transportation in census
tract and location of clusters detected by Getis–Ord’s Gi* hotspot analysis 233.
Ecological level researches on TB incidence rate at the national level are
inadequate for epidemiologic inferences especially in the US. Therefore, it is crucial to
perform an ecological study across the continental US to identify the location of
statistically significant hotspots and to determine the relationship between
environmental and socio-economic factors and TB incidence to provide useful insight to
86
policymakers in planning for TB control at a larger scale. To our knowledge, no study
has utilized ANNs in modeling the geographic distribution of TB incidence rate in the
US. Integration of the GIS and ANN can improve policymakers’ insight in identifying
potential TB high-risk areas and risk factors useful for future mitigation efforts. We
examined the spatial distribution of the disease and applicability of MLTs in TB
modeling with the following assumptions (1) all reported county-level TB incidence rates
represent the status of TB in the continental US and (2) the TB incidence rate is
influenced by environmental and socioeconomic factors.
Material and Methods
Tuberculosis Data
Data on all reported TB cases in the continental US between 2006 and 2010
were obtained from the paper of Scales et al. in the American Journal of Preventive
Medicine 234. All data are at the county level (n = 3109) and are publicly available 235.
Latent TB cases were not reported and included in this study. To alleviate variations of
TB incidences, particularly in counties with a small population size such as Loving and
King counties in Texas (n < 300 populations), the cumulative incidence was calculated
(2006–2010). For this purpose, five-year corresponding population estimates from the
American Community Survey (ASC) 236 were used. Tuberculosis incidence rates were
imported into ArcGIS 10.5 (ESRI, Redlands, CA, US) and geocoded at the county level.
Explanatory Data
In this study, 278 environmental and socio-economic factors were collected from
various sources and considered as explanatory variables based on previous studies
and domain knowledge. From the Center for Disease Control and Prevention (CDC)
wonder database 237, climate data were derived including daily maximum and minimum
87
air temperature (°F), and daily maximum heat index (°F). Also, the average number of
diabetes cases 238 during the study period were obtained from this database.
Topographic data including minimum, maximum and mean of altitude and slope were
obtained from the national map website 239. The county-level values of these data were
calculated using zonal statistics in ArcGIS Spatial Analyst extension. Socioeconomic
data were acquired from the US Census Bureau 240. A broad range of socio-economic
factors including age group, agriculture, immigration, education level, employment rate,
health, Hispanic or Latino population, income, poverty, and race were obtained across
the nation from this database. All population-based predictors were normalized to the
county’s population size. The full dataset description of exploratory variables is in
Supplementary Table. All data used in this study are downloadable from the above
sources.
Global and Local Clustering
After mapping TB incidence rates at the county level, the empirical Bayes
smoothing method was implemented in the GeoDa software 241 to adjust the crude
incidence rates toward the global mean. This helps to reduce the variance instability
associated with counties with a small population size 114. Thus, the response variable
changed to (logged) smoothed TB incidence rate (STIR) rather than the crude incidence
rate with more than 900 counties with 0 value which makes modeling difficult. All these
counties have non-zero STIR values. Next, the spatial pattern was statistically
evaluated using the global Moran’s I and Getis–Ord General G.
Global Moran’s I measures the similarities between the TB incidence rates of
neighboring counties as follows 142,158:
88
𝐈𝐈 =𝐚𝐚∑ ∑ 𝐰𝐰𝐢𝐢𝐢𝐢𝐳𝐳𝐢𝐢𝐳𝐳𝐢𝐢𝐚𝐚
𝐢𝐢=𝟏𝟏,𝐢𝐢≠𝐢𝐢𝐚𝐚𝐢𝐢=𝟏𝟏
𝐧𝐧𝟎𝟎 ∑ 𝐳𝐳𝐢𝐢𝟐𝟐𝐚𝐚𝐢𝐢=𝟏𝟏
(4-1)
𝐧𝐧𝟎𝟎 = ��𝐰𝐰𝐢𝐢𝐢𝐢
𝐚𝐚
𝐢𝐢=𝟏𝟏
𝐚𝐚
𝐢𝐢=𝟏𝟏
(4-2)
where 𝑧𝑧𝑖𝑖 and 𝑧𝑧𝑖𝑖 are the deviations of STIR for counties 𝑖𝑖 and 𝑗𝑗 from average incidences
(i.e., (𝑥𝑥𝑖𝑖 − �̅�𝑥), (𝑥𝑥𝑖𝑖 − �̅�𝑥)), respectively; 𝑤𝑤𝑖𝑖𝑖𝑖 is the spatial weight based on Rook’s contiguity
(i.e., common borders between counties 𝑖𝑖 and 𝑗𝑗); and 𝑠𝑠0 is the aggregation of all spatial
weights.
Moreover, the Getis–Ord General G statistics, developed by Getis and Ord was
used as a measure of clustering of the high or low value of STIR 57. A positive or
negative Z-score for G indicates spatial clustering of high (hotspot) or low (coldspot)
values, respectively. The formula for the general G of spatial association is:
𝐆𝐆 =∑ ∑ 𝐰𝐰𝐢𝐢𝐢𝐢𝐱𝐱𝐢𝐢𝐱𝐱𝐢𝐢𝐚𝐚
𝐢𝐢=𝟏𝟏𝐚𝐚𝐢𝐢=𝟏𝟏
∑ ∑ 𝐰𝐰𝐢𝐢𝐢𝐢𝐚𝐚𝐢𝐢=𝟏𝟏
𝐚𝐚𝐢𝐢=𝟏𝟏
∀𝐢𝐢 ≠ 𝐢𝐢 (4-3)
Getis–Ord Gi* statistics 66 was applied on the smoothed rates to identify the locations of
statistically significant hotspots of the STIR (p<0.05). Using the same notation as in
Equations (1) and (2), this statistic is computed as follows 233,114:
𝐆𝐆𝐢𝐢∗ =∑ 𝐰𝐰𝐢𝐢𝐢𝐢𝐱𝐱𝐢𝐢 − 𝐗𝐗�∑ 𝐰𝐰𝐢𝐢𝐢𝐢
𝐚𝐚𝐢𝐢=𝟏𝟏
𝐚𝐚𝐢𝐢=𝟏𝟏
𝐒𝐒 �[𝐚𝐚∑ 𝐰𝐰𝐢𝐢𝐢𝐢
𝟐𝟐 − �∑ 𝐰𝐰𝐢𝐢𝐢𝐢𝐚𝐚𝐢𝐢=𝟏𝟏 �
𝟐𝟐]𝐚𝐚
𝐢𝐢=𝟏𝟏𝐚𝐚 − 𝟏𝟏
(4-4)
𝐒𝐒 = �∑ (𝐱𝐱𝐢𝐢 − 𝐱𝐱�)𝟐𝟐𝐚𝐚𝐢𝐢=𝟏𝟏,𝐢𝐢≠𝐢𝐢
𝐚𝐚 − 𝟏𝟏− 𝐱𝐱�𝟐𝟐 (4-5)
89
Artificial Neural Networks
An ANN is a computational model, which consists of several simple processing
elements called neurons 242. The neurons are usually structured in layers: the input
layer, the hidden layer(s) and the output layer. In this study, we used ANNs with one
and two hidden layers, however, theoretical research has shown that almost any
complex and non-linear function can be estimated by an ANN with a single hidden layer
243. Additionally, having more hidden layers increases the number of parameters, which
may lead to over-fitting (i.e., memorizing the data while training). The aim of the hidden
layer is to find a multi-dimensional expansion of the input layer, which can be better
transformed to the pattern in the output layer 244. The neurons in the input layer are
connected to all neurons in the hidden layer. Similarly, all the neurons in the hidden
layer are connected to every neuron in the output layer (Figure 4-1). In this system,
selected explanatory variables are fed to the input layer and passed through the hidden
layer that processes them using simple mathematical operations. The relationship
between input and output layers of ANNs is trained by observing a series of known
examples from the training dataset and adjusting the weights accordingly. Once the
training phase is completed, the network is usually able to generalize what it has
learned to the test data with similar attributes of input. We developed a multi-layer
perceptron (MLP) neural network with one and two hidden layers to approximate the
dependency of log (STIR) in the continental US to environmental and socio-economic
factors.
Multi-layer perceptron is the most commonly used ANN structure in
environmental modeling 245,246. During the training phase of MLP, each input feature is
multiplied by its corresponding weight. The results are then summed and passed
90
through a smooth non-linear activation function to produce the output. The Logistic
function and hyperbolic tangent are among the widely used activation functions 247,248. In
a supervised learning setup, the difference between the (MLP) output and actual
output/target (i.e., the error or cost function) can be calculated as in Equation (5) 249:
𝐚𝐚𝐚𝐚𝐚𝐚𝐧𝐧𝐚𝐚 =𝟏𝟏𝟐𝟐�(𝐚𝐚𝐢𝐢 − 𝐧𝐧𝐢𝐢)𝟐𝟐𝐚𝐚
𝐢𝐢=𝟏𝟏
(4-6)
where 𝑜𝑜 and 𝑡𝑡 are model output and target respectively, and 𝑛𝑛 is the training sample
size. To minimize the cost function, a back-propagation algorithm based on stochastic
gradient descent is used to adjust each weight in the MLP model. The updated value for
each weight is calculated as in Equation (6):
𝐰𝐰(𝐤𝐤+𝟏𝟏): = 𝐰𝐰(𝐤𝐤) − 𝛂𝛂𝛛𝛛𝐚𝐚𝐚𝐚𝐚𝐚𝐧𝐧𝐚𝐚𝛛𝛛𝐰𝐰(𝐤𝐤) (4-7)
where 𝛼𝛼 or learning rate controls how much the coefficients can change on the 𝑘𝑘th
update. More detailed information about back-propagation MLP is presented in 250,251.
Model Pre-Processing
To develop the models, the entire dataset was randomly divided into three
different partitions: (1) training data: to learn and update the weights and biases in
network (60% of the total data) (2) cross-validation data: to avoid overfitting problem by
tuning models’ parameters during training phase (15% of the total data) (3) test data: to
evaluate accuracy and predictive power of the network after the training process (25%
of the total data). The spatial distribution of training, cross-validation, and test data as
shown in Figure 4-2. For the purpose of comparison, we used the same training, cross-
validation and test data for all developed models.
91
The next step was to standardize the input data before using them in the ANN
models. Input data have different ranges and units, this pre-processing step can
enhance the performance of models by faster convergence 24,252. There are several
standardization formulas presented in the literature. Here, we use the following equation
which transforms the input data to the range of 0 to 1 as in Equation (8):
𝐗𝐗𝐧𝐧 =𝐗𝐗𝐢𝐢 − 𝐗𝐗𝐦𝐦𝐢𝐢𝐚𝐚𝐗𝐗𝐦𝐦𝐚𝐚𝐱𝐱 − 𝐗𝐗𝐦𝐦𝐢𝐢𝐚𝐚
(4-8)
where 𝑋𝑋𝑖𝑖 is the initial (actual) value; 𝑋𝑋𝑚𝑚𝑖𝑖𝑚𝑚 and 𝑋𝑋𝑚𝑚𝑚𝑚𝑚𝑚 are the minimum and maximum of
the initial values and 𝑋𝑋𝑠𝑠 is the respective standardized value. Moreover, after training
the network, the output of the model is returned to the original form through the
Equation (9):
𝐗𝐗𝐢𝐢 = 𝐗𝐗𝐧𝐧 ∗ (𝐗𝐗𝐦𝐦𝐚𝐚𝐱𝐱 − 𝐗𝐗𝐦𝐦𝐢𝐢𝐚𝐚) + 𝐗𝐗𝐦𝐦𝐢𝐢𝐚𝐚 (4-9)
Multi-layer perceptron and linear regression (LR) models are sensitive to
redundant explanatory variables because noise reduces model accuracy and
generalizability. Thus, a ‘proper’ selection of independent variables is crucial for better
performance. We applied L1-regularization or least absolute shrinkage and selection
operator (LASSO) on the training and cross-validation dataset before developing the
models. This process produces a sparse solution (i.e., few non-zero coefficients) which
reduces overfitting and enhances interpretability of the results 253. Detailed information
of L1 regularization technique can be found in Park and Hastie 179.
In the MLP model, 8 and 1 neurons were used in the input and output layers,
respectively. These numbers correspond to the 8 explanatory variables selected during
L1 regularization and one response variable (i.e. log (STIR)). There is no deterministic
92
rule to determine the number of neurons in the hidden layer. The grid search was used
to tune hyper-parameters in MLP with one and two hidden layer(s). This approach
systematically evaluates the developed model for each combination of model
parameters. For MLP, the tangent hyperbolic activation function �∅(𝑥𝑥) = 1−𝑒𝑒−𝑥𝑥
1+𝑒𝑒−𝑥𝑥�, a non-
linear and symmetric function which maps any real value to [–1,1], was used in the
hidden layer(s) 254. In addition, a linear identity function (𝑓𝑓(𝑥𝑥) = 𝑥𝑥) was applied as an
activation function in the output layer. All computer codes were developed in the Python
programming language.
Model Evaluation
We computed three types of evaluation metrics to assess and compare the
generalization capability of MLP with LR in predicting log (STIR) in the continental US.
These statistics include root mean square error (RMSE), mean absolute error (MAE)
and correlation coefficient (R) between model output and ground-truth.
Due to the fully connected architecture of MLP, it is difficult to define the explicit
relationship between input and output variables by coefficients 255. Sensitivity analysis is
a common way to address this problem. In this analysis, each factor was excluded from
the model, individually, and the RMSE of resulting models were compared. The most
influential factor, among the selected features, is the one that its absence increases the
RMSE of the model the most. Using sensitivity analysis, we identified the most
influential factors in predictions of the log (STIR). Finally, we ranked them according to
their decreasing importance.
93
Results
Between 2006 and 2010, 64,496 TB cases were reported across the continental
US. The number of TB cases showed a consistent declining trend from 14,119 to
11,284 annual cases (Figure 4-3). Among the states, the highest average TB incidence
rates were identified in Louisiana (5.30 cases per 100,000), Arizona (5.05 per 100,000)
and Georgia (4.7 per 100,000).
The global Moran’s I and general G indicated significant spatial clustering of
STIR in the continental US for the study period (Moran’s I = 0.13, Z-score = 32.13,
p<0.005; General G = 0.002, Z-score = 15.3, p<0.005). The hotspot analysis (Getis–Ord
Gi*) identified that about 7% of the continental US counties (n = 216) were part of
hotspots. The hotspots of STIR were distributed unequally, almost restricted to the
southern half of the country and particularly in the southern and southeastern counties
of the US (Figure 4-4). Table 4-1 summarizes the top 10 states with the largest number
of counties detected as part of STIR hotspots by the Getis–Ord Gi* technique.
Based on the results of L1 regularization, the environmental and several socio-
economic factors were selected. Out of 278 explanatory variables, only 8 factors were
incorporated as input variables for the models: (1) RHI820: resident population: not
Hispanic, white alone (July 1-estimate) (proportion of county population); (2) LFE330:
employed persons by industry (NAICS)-agriculture, forestry, fishing and hunting, and
mining (proportion of county population); (3) minimum temperature; (4) POP778: year of
entry by citizenship status in the United States entered 2000 or later-foreign-born
(proportion of county population) (5) IPE110: people of all ages in poverty (proportion of
county population); (6) SPR440: social security-benefit recipients (proportion of county
population); (7) HIS305: Hispanic or Latino persons, educational attainment, 25 years
94
and over, male (proportion of county population); (8) POP730: population one year and
over by residence-moved from different county within same state 2005–2009
(proportion of county population).
All Pearson correlation values among the selected variables were under 0.5, thus
we considered the selected factors as relatively uncorrelated (Table 4-2). The log
(STIR) was positively correlated with all variables except for household income
(p<0.05).
Preliminary results of the developed LR model showed that the selected
predictors generated R = 0.666, R2 = 0.443, and F = 184.246 (Sig. F Change < 0.001).
The R value which represents the simple correlation between predictions and reality
indicated an acceptable degree of correlation. The R2 value indicates that 44.3% of total
variations in the log (STIR) can be explained by the predictors. The F-test was
significant showing the developed LR model as a whole has statistically significant
predictive capability. Durbin–Watson test was close to 2 which verifies independency of
errors assumption (Durbin–Watson statistic = 2.04) (Table 4-3).
The t-test showed that all selected variables are statistically significant at a 99%
confidence interval. Based on the standardized coefficients, among the variables, in
order of strength, “RHI820”, “LFE330”, “HIS305”, “SPR440”, and “POP730” have
negative impacts on the log (STIR) while the variables “POP778”, “Min Temp”, and
“IPE110” have positive impacts on the dependent variable, respectively. Tolerance and
the variance inflation factor (VIF) were used as two collinearity diagnostic tests to
assess multicollinearity level for all variables. As the values of VIF didn’t exceed 3 and
95
tolerance statistic were above 0.1, it seems that there is no cause for concern about
collinearity (Table 4-4).
The normal probability-probability plot (P-P Plot) of LR residuals showed that
data points were closely aligned with the diagonal line suggesting the distribution of the
residuals was almost normal. This indicated that the assumption of normality of errors
was almost met. Nevertheless, there were a few samples which departed from the
diagonal line (Figure 4-5).
Table 4-5 presents the performance of MLP (1-hidden layer), MLP (2-hidden
layer), and LR models, in terms of MAE, RMSE, and R, for training, cross-validation and
test datasets, respectively. The correlation coefficient for a single hidden layer MLP,
with 20 nodes in the hidden layer, was larger than double hidden layers MLP, and LR
(Table 4-5), which showed a better agreement between the predicted and the ground-
truth. In addition, RMSE in MLP (single layer) was lower than the other models.
However, MAE in MLP with a two-hidden layer (20 nodes in first and 10 nodes in
second hidden layers) had the same test errors as single layer MLP. Results suggest
that the single layer MLP model outperformed the rest of models with higher
generalizability for predicting the log (STIR) in the continental US. Similarly, double
hidden layer MLP outperformed the LR model.
The scatter plot between the output of the MLP model and the corresponding
observed log (STIR) (i.e., ground-truth values) for test data (Figure 4-6) showed that the
model was able to predict the average variations of log (STIR), while it was unable to
predict some counties with exceptional rates.
96
The best model accuracy was achieved by the single layer MLP compared with
the other models. We examined the contribution/relative importance of each input
feature on the log (STIR) using sensitivity analysis. The results revealed that the highest
RMSE occurred when “resident population of American Indian and Alaska native” was
removed from the model, which implies that this factor has the maximum contribution,
among the selected factors, in predicting log (STIR) at the county level in the continental
US. The most influential factors in order of contributions were: “RHI820: resident
population: not Hispanic, White alone (July 1-estimate) (proportion)”, “LFE330:
employed persons by industry (NAICS)-agriculture, forestry, fishing and hunting, and
mining (proportion)”, “Minimum Temperature”, “POP778: year of entry by citizenship
status in the United States entered 2000 or later-foreign-born (proportion)”, “IPE110:
people of all ages in poverty (proportion)”, “SPR440: supplemental security income-
average monthly payments per recipient”, “HIS305: Hispanic or Latino persons,
educational attainment, 25 years and over, male (proportion)” and “POP730: population
one year and over by residence-moved from different county within same state 2005-
2009 (proportion)”. (Figure 4-7).
Discussion
According to the Institute of Medicine (2000), eliminating TB will require the
development of new tools to identify risk factors and high-risk areas 256. As expressed
by Feske et al., effective TB elimination in the US would require geographic elucidation
of high-risk areas and systematic surveillance of location-based risk factors 233. In this
study, we combined more advanced tools to examine the relationship between
environmental and socio-economic factors, and the log (STIR) across the continental
US. Integration of GIS, spatial statistics, and ANNs resulted in an efficient multi-
97
disciplinary approach, which can provide helpful guidelines for health decision makers.
The benefits obtained from this approach can enhance mitigation efforts such as budget
allocation, educating people who live in high-risk areas and drug distribution. Due to
limited research on the spatial modeling of TB at the national level, our study can be
regarded as a basis for future nationwide TB program researches.
In this study, we examined the spatial pattern and hotspots of the STIR in the
continental US. Moran’s I and General G statistic were used to investigate the presence
of spatial autocorrelation of local STIR. Our results showed that the distribution of STIR
at the county levels is clustered at the county level (p<0.05). We then conducted the
hotspot analysis to identify the counties with statistically significant STIR using the
hotspot analysis (Getis–Ord Gi*) approach. Our findings showed that STIR hotspots
were concentrated in the southern and western states (Figure 4-4); however, the South,
Southeast, and Southwest counties of the country were more severely affected. The
concentration of hotspots suggests that there were more cases observed in these areas
that would be expected if everyone were equally at risk.
Visual comparison of the location of identified hotspots in some states which had
a very high proportion of counties falling into hotspots (such as Georgia and Florida)
with recent surveillance reports of Georgia 257 and Florida 258 showed pieces of
evidence of similarities with some differences. This suggests the stability of location of
counties falling into hotspots which require close attention. Since after detecting the
statistically significant hotspots of STIR in the region, associated risk factors were not
known, we used ANN to model the relationship between environmental and socio-
economic factors and the log (STIR).
98
Based on sensitivity analysis, the environmental factors identified the contribution
of the selected variables to the log (STIR) at the county level. We found that the
average daily minimum temperature (with a positive effect) was an important climate
factor in the log (STIR). This is probably due to the adverse effects of minimum air
temperature on the respiratory system of patients, and more close lifestyle of people in
cold weather which increases the risk of exposure to infectious agents 259. Similar
results were reported in other studies. In a time-series analysis study in Fukuoka
(Japan), Onozuka and Hagihara found a positive significant relationship between
extreme cold temperature and TB incident cases 259. Our results are also consistent
with the findings of Mourtzoukou et al. 260 and Khalid et al. in Pakistan 261.
Sensitivity analysis showed that economic factors are important for log (STIR) in
the continental US. One of the most important economic factors was “proportion of
population county of all ages in poverty” which suggests that underserved segments of
the population are at higher-risk of STIR in the US. Conversely, counties with a higher
proportion of the population employed by industry or higher security income had lower
STIR. These factors potentially describe the impact of poverty/deprivation on lifestyle
choice. These findings agree with the individual-level studies of McKenna et al. 262, Ho
263, and Weis et al. 264. Unemployment rate has been found to be an important factor in
TB transmission in the studies of Munch et al. who indicated the strongest correlation
with TB caseload 265, and the studies of Dos Santos 266 and Jackubowiak 267; while Sun
et al. 207 did not find it significant in China. In a cohort study in Georgia, Djibuti et al. 268
showed that TB patients with lower-income households are at higher risk of poor TB
treatment. Bamrah et al. analyzed the genotyping of homeless persons in the US,
99
1994–2010. Their results showed that homeless TB patients had an approximately 10-
fold increase in TB incidence and had more than twice the odds of not completing
treatments 269.
Results of this study also showed the importance of race distribution and STIR as
there was a strong negative relationship between the proportion of county population
who are white and STIR. Visual comparison of the locations of the hotspots (Figure 4-4)
with the distribution map of race and ethnicity provided by US Census Bureau 270,
confirms that, in general, counties with more than 80% of the white population didn’t fall
into the hotspots. This agrees with the findings of Cantwell et al. 208 and CDC report 271.
Artificial neural networks and GIS were effective in modeling log (STIR), but
several limitations exist. First, the data used in this study were collected from multiple
online sources. It should be noted that TB reported data are subject to spatial
differences in case detection, thus, a standard passive case-finding approach needs to
be considered. In addition, methods of data collection and preparation are different
which may result in biased estimations. In this study, we only included active reported
TB cases. It should be noted that there are many people with latent TB infection who
have not developed TB disease, and therefore, are not detected or counted as cases.
This is more prevalent (exists in greater proportions) among those from other countries.
The last limitation stems from the study design. Our study at the county level should be
considered to characterize population rather than individual-level characteristics of risk
(ecological fallacy). Thus, the findings of this study should be applied to population-level
targeting rather than considering individual treatment.
100
This study showed machine learning techniques in spatial modeling can be
applied to TB incidence rate across the continental US. For future works, we
recommend conducting researches at multiple scales, particularly at finer scales such
as at the census tracts or block group level for more focal interventions. However, as
most policy decisions on TB control are performed at the state and federal government
levels, this represents a meaningful first step. To optimize the structure of ANNs, choice
of parameters (e.g., learning rate, activation functions) and training the weights in ANN,
and heuristic algorithms such as genetic algorithm may be useful leading to global
optima, because it can help to escape from local optima 272. Also, hyper-parameter
tuning coupled with more recent ANN approaches in terms of activation function or
optimization (e.g., Adam or other adaptive methods) is highly recommended. We also
recommend incorporating genetics data into the model or examine the genetic
characteristics of individual patients in counties with high modeling errors. In this regard,
national TB genotyping surveillance coverage which refers to the proportion of TB cases
with culture-positive with at least one genotyped isolate in the US can be useful. Also,
other statistical techniques such as Poisson model (through including the size of local
population at risk as an offset in the model) or the logistic regression model (through
binomial distribution) can be investigated in modeling TB incidence rate. Coupled with
GIS, the ANNs techniques successfully identified some determinant factors of log
(STIR), ranked them, and had better prediction ability than the traditional linear
regression analysis. The findings of this study can provide useful insight to health
authorities on prioritizing resource allocation to risk-prone areas.
101
Tables Table 4-1. Top 10 states with the largest number of hotspot counties (p < 0.10) of
smoothed tuberculosis (TB) incidence rate (STIR) in the continental US, 2006–2010.
Rank State No. Hotspot Counties
Percentage (#hotspots/#counties)
1 Georgia 57 35.8% 2 Texas 30 11.8% 3 North Carolina 23 23.0% 4 Louisiana 22 34.3% 5 Florida 20 29.9% 6 California 17 22.7% 7 South Carolina 17 37% 8 Arkansas 12 16.0% 9 Mississippi 12 14.6% 10 Alabama 10 14.9%
Table 4-2. Pearson correlation analysis between selected variables for modelling STIR,
continental US. POP730 LFE330 IPE110 POP778 Min
Temp SPR440 HIS305 RHI820
POP730 1.000 0.051 0.041 0.064 −0.124 −0.078 −0.138 −0.024 LFE330 0.051 1.000 0.018 0.057 0.136 −0.499 −0.186 −0.040 IPE110 0.041 0.018 1.000 0.266 −0.231 −0.108 0.066 0.384 POP778 0.064 0.057 0.266 1.000 −0.005 0.091 −0.390 0.248 Min Temp −0.124 0.136 −0.231 −0.005 1.000 0.066 −0.032 0.308
SPR440 −0.078 −0.499 −0.108 0.091 0.066 1.000 −0.015 0.003 HIS305 −0.138 −0.186 0.066 −0.390 −0.032 −0.015 1.000 0.403 RHI820 −0.024 −0.040 0.384 0.248 0.308 0.003 0.403 1.000
Table 4-3. Results of linear regression (LR) model for modeling log (STIR), continental US.
Model R R Square
Adj. R Square
Change Statistics Durbin–Watson
R Square Change
F df1 df2 Sig.
LR 0.666 a 0.443 0.440 0.443 184.246 8 1854 0.000 2.041 a. Predictors: (Constant), POP73, LFE330, IPE110, POP778, Min Temp, SPR440, HIS305, RHI820. Dependent Variable: log (STIR).
102
Table 4-4. Effects of environment and socio-economic factors on the log (STIR) using LR model. Unstandardized Coefficient
Standardized Coefficient
t Sig. 95.0% Confidence Interval for B
Collinearity Statistics
Variables B Beta
Lower Bound
Upper Bound
Tolerance VIF
(Constant) 0.001
0.009 0.993 −0.198 0.200
RHI820 −0.007 −0.294 −11.117
0.000 −0.009 −0.006 0.429 2.328
LFE330 −0.023 −0.166 −7.929 0.000 −0.029 −0.017 0.683 1.463 Min Temp 0.013 0.210 9.809 0.000 0.010 0.016 0.653 1.532 POP778 0.083 0.282 12.621 0.000 0.070 0.095 0.602 1.661 IPE110 0.012 0.140 6.662 0.000 0.008 0.015 0.677 1.477 SPR440 −0.009 −0.097 −4.703 0.000 −0.013 −0.005 0.701 1.426 HIS305 −0.019 −0.145 −5.976 0.000 −0.026 −0.013 0.508 1.968 POP730 −0.015 −0.080 −4.489 0.000 −0.021 −0.008 0.950 1.053
VIF: Variance inflation Factor. Table 4-5. Comparison of multi-layer perceptron (MLP; one and two hidden layers), and LR model’ performance for
predicting log (STIR) in the continental US. Model Training Cross-Validation Test
MAE RMSE R MAE RMSE R MAE RMSE R
LR 0.27 0.35 0.66 0.27 0.36 0.65 0.28 0.36 0.61 MLP (1 hidden layer) 0.25 0.33 0.70 0.26 0.35 0.67 0.27 0.35 0.63
MLP (2 hidden layers) 0.26 0.34 0.69 0.26 0.35 0.65 0.27 0.36 0.62
103
Figures
Figure 4-1. Topological architecture of multi-layer perceptron neural network (MLPNN) used in this study 63.
Figure 4-2. Spatial distribution of training, cross-validation, and test data used for modeling log (STIR).
104
Figure 4-3. The frequency of TB cases (left) and the cumulative TB incidence rate (right) across the continental US (2006–2010).
Figure 4-4. Hotspot map for the STIR in the continental US identified by hotspot analysis (Getis–Ord Gi*) technique, 2006–2010.
105
Figure 4-5. The Normal P-P Plot of LR model.
106
Figure 4-6. Scatter plot of observed and predicted log (STIR) (by single hidden layer
MLP model) for test data in the continental US.
Figure 4-7. The contribution of input features on predicting log (STIR) according to
sensitivity analysis of single hidden layer MLP. RMSE: Root mean square error.
0.28
0.190.17
0.33
0.18
0.12
0.15
0.12
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
RHI820 LFE330 Min Temp POP778 IPE110 SPR440 HIS305 POP730
Cha
nge
in R
MSE
Variable
107
CHAPTER 5 CONCLUSION
Summary
Despite a significant growth in spatial analysis of various types of infectious
diseases, most studies remain non-spatial. Our primary focus of this research was to
study spatial aspects of (three different) infectious diseases. Our studies confirmed that
“geography” is an essential part of the infectious diseases. In other words, we showed
that LD hotspots in CT occurred in specific geographic towns. Also, using geographic
factors we could predict presence/absence of one of main vector of ZCL and TB
incidence with a decent accuracy. Thus, it is necessary to incorporate it in future control
programs and predictions to fight against the diseases.
Due to interdisciplinary nature of the most infectious diseases, we underscored
innovative techniques such as RS, machine learning and GPS in a GIS platform and as
cost-effective and time saving approaches. Our main intent was to make contribution in
predicting presence/absence of the vector(s) that cause the disease and predicting
incidence of the diseases with higher accuracy than highly-applied traditional
techniques in literature. We examined a range of machine learning techniques that are
widely-used in many disciplines but underutilized in spatial epidemiology.
In Chapter 2, using ESDA in a GIS framework, we analyzed resulting pattern of
LD distribution in CT. We identified the towns where controlling LD was most required
and could provide the highest benefits in allocating resources. In Chapter 3, we
promoted machine learning classifiers such as random forest and support vector
machine as powerful algorithms in binary classifications tasks. However, there are many
other machine learning classifiers that have not been included in the research such as
108
k-means and Naïve Bayes classifier. Most of the binary classifiers are transferable to
medical geography and are likely to have better prediction abilities the than traditional
statistical techniques. We therefore recommend medical geographers to develop and
evaluate novel models in infectious disease research to examine the applicability of
techniques that better predict the presence/absence of other vectors or disease
morbidity/mortality rates.
In Chapter 4, the results of ANN showed an improvement of prediction accuracy
of continuous health outcome (TB rate) compared to the classic linear regression. ANN
coupled with GIS provided insight about the heterogeneity of TB in the US. We applied
the data science technique in the context of national level control strategy which are
rarely implemented in nationwide control programs of infectious diseases. However,
less insight is obtained in county level studies compared to individual level studies. Our
search of ANN in spatial epidemiology indicated that very few studies have applied the
technique. We recommend examining other robust and novel ANN techniques.
Depending on the nature and complexity of the infectious disease, size of the study
area, spatial distribution of the vectors, various types of the neural networks could be
applied, such as radial basis function, regulatory feedback, and recurrent neural
networks.
Limitations
The success of classifiers and models used in this research highly depend on
the quality of collected data and the variables used for predictions. Final spatial data
that represent a feature in the models undergo several transformations such as
measurement error or analysis errors (such as converting addresses to locations
(geocoding)). Also, there are several obstacles and sources of uncertainty that need to
109
be considered. First, neglecting the role of spatial autocorrelation especially in sparse
data may result in biased estimation of variables’ importance 273. SA is one of the most
important concepts that are usually ignored in spatial epidemiology. Misclassification
due to population migration or movements is another major source of uncertainty in
spatial and temporal pattern that can influence the validity of the results of our LD and
TB studies. Thus, misclassification can enhance or reduce the strength of associations.
One of the strengths of aggregated data compared to individual level studies is that they
are less influenced by misclassification. Selection bias can be another important
limitation. Inaccurate or incomplete distribution of collected health data (such as
collected sand flies) may lead to improper representation of the population of interest. In
our ZCL study, selection bias during sand fly collection can distort associations between
prediction of sand fly occurrence and descriptive variables.
Another problem is attributed to the administrative boundaries. All LD and TB
data and associated variables which were collected as a single value that represent
administrative boundaries (because of data restrictions, we used aggregated data)
rather than physical boundaries. This might be problematic for spatial and particularly
space-time analysis when comparing changes of pattern over time. The boundaries
may be subject to change (split or merge of some areas) because of governmental
policies, however in our study very few changes in the administrative boundaries
occurred. Spatial scale was another major concern in our LD and TB researches. In
these two papers, data were obtained at the town and county levels, respectively. The
values within each division is uniform but there might be sharp contrasts between
neighboring polygons. Thus, selection of geographic resolution for conducting analysis
110
is important. Because aggregation at various spatial scales can lead to different results.
Also, different aggregation techniques (median, mode, sum, etc.) can lead to different
results. In our studies, the choice of spatial unit was dictated by the available data.
These biases can cause spurious associations. By taking advantage of
technologic advances such as GIS as well as data collection techniques such as GPS
and RS, it is likely to improve reliability of results. Spatial data with higher spatial and
temporal resolutions can provide more accurate estimations of risk and predictions in
researches in medical geography. However, even with having accurate data spurious
associations are inevitable.
Outlook and Future Challenges
Future researches in GIS and spatial analysis need to find optimal solutions for
challenges such as scale and edge effect. Data collection techniques need to make
accurate and reliable spatial data available. Organizations that provide demographic
data need to make the same data available at multiple scales such as census tracts,
black group and county level. Analysis has to be (re)conducted at all the possible
scales, to address Modifiable areal unit problem (MAUP). Due to exponential growth of
availability and use of spatial data which is likely to continue, it is anticipated that GIS
software packages become more available. The software should include advanced and
robust techniques such as data science techniques and other approaches from other
disciplines. Also, spatial autocorrelation and uncertainty inherent in spatial data need to
be incorporated in the packages. These two areas deserve closer attention of medical
geographers. Analyzing big-data particularly in national and global health in near future
can be a new era of investigations in spatial epidemiology. In this regard, deep-learning
111
neural network is a machine learning technique that can be useful when working with
noisy and very large datasets 274.
112
APPENDIX A STATISTICAL COMPARISON OF CLASSIFIERS
Table A-1. Statistical test for comparing AUCs of SVM, LR and RF classifiers: Test
Statistic AUC 1 AUC 2 AUC 1-AUC 2 Lower 95% CL
Upper 95% CL
Test Statistic Value
Prob Level
Reject H0 at
α=0.05 Wald Z 0.974 (SVM) 0.965 (LR) 0.0090 -0.0704 0.0884 0.222 0.8243 No Wald Z 0.974 (SVM) 0.935 (RF) 0.0309 -0.0569 0.1349 0.794 0.4272 No Wald Z 0.965 (LR) 0.935 (RF) 0.0300 -0.0704 0.1304 0.584 0.5592 No
H0: AUC 1 = AUC 2 H1: AUC 1 ≠ AUC 2 Sample Size = n test = 36 Therefore, the P-values do not indicate a significant difference between the AUCs.
113
APPENDIX B HABITAT SUITABILITY MAPS
Figure B-1. Habitat suitability map of Ph. Papatasi based on logistic regression classifier
114
Figure B-2. Habitat suitability map of Ph. Papatasi based on random forest classifier
115
LIST OF REFERENCES
1. Moore, D. A., & Carpenter, T. E. (1999). Spatial analytical methods and geographic information systems: use in health research and epidemiology. Epidemiologic reviews, 21(2), 143-161.
2. Valenčius, C. B. (2000). Histories of medical geography. Medical History, 44(S20), 3-28.
3. Miller, G. (1962). " Airs, Waters, and Places" in History. Journal of the history of medicine and allied sciences, 129-140.
4. Cliff, A. D., Smallman-Raynor, M. R., & Stevens, P. M. (2009). Controlling the geogrpahical spread of infectious disease: Plague in Italy, 1347-1851. Acta medico-historica Adriatica, 7(2), 197-236.
5. Stevenson, L. G. (1965). Putting disease on the map: the early use of spot maps in the study of yellow fever. Journal of the History of Medicine and Allied Sciences, 226-261.
6. McLeod, K. S. (2000). Our sense of Snow: the myth of John Snow in medical geography. Social science & medicine, 50(7-8), 923-935.
7. May, J. M. (1959). The Ecology of Human Disease. The Ecology of Human Disease.
8. Meade, M. S., & Emch, M. (2010). Medical geography. Guilford Press.
9. May, J. M. (1950). Medical geography: its methods and objectives. Geographical Review, 9-41.
10. Gilbert, E. W. (1958). Pioneer maps of health and disease in England. Geographical Journal, 172-183.
11. Armstrong, R. W. (1965). Medical Geography-An Emerging Specialty?. International Pathology, 6(3), 61-63.
12. Musa, G. J., Chiang, P. H., Sylk, T., Bavley, R., Keating, W., Lakew, B., ... & Hoven, C. W. (2013). Use of GIS Mapping as a Public Health Tool–-From Cholera to Cancer. Health services insights, 6, HSI-S10471.
13. Meade, M. S. (2014). Medical geography. The Wiley Blackwell Encyclopedia of Health, Illness, Behavior, and Society, 1375-1381.
14. Kirby, R. S., Delmelle, E., & Eberth, J. M. (2017). Advances in spatial epidemiology and geographic information systems. Annals of epidemiology, 27(1), 1-9.
116
15. Morris, E., & Bone, C. (2016). Identifying spatial data availability and spatial data needs for Chagas disease mitigation in South America. Spatial and spatio-temporal epidemiology, 17, 45-58.
16. Mitchell, S. E. (2011). Using GIS to Explore the Relationship between Socioeconomic Status and Demographic Variables and Crime in Pittsburgh, Pennsylvania. Papers in Resource Analysis, 13(1), 1-11.
17. Cromley, E. K., & McLafferty, S. L. (2011). GIS and public health. Guilford Press.
18. Nykiforuk, C. I., & Flaman, L. M. (2011). Geographic information systems (GIS) for health promotion and public health: a review. Health promotion practice, 12(1), 63-73.
19. Campbell, J. B., & Wynne, R. H. (2011). Introduction to remote sensing. Guilford Press.
20. Hay, S. I. (2000). An overview of remote sensing and geodesy for epidemiology and public health application. Advances in parasitology, 47, 1-35.
21. Ali, I., Greifeneder, F., Stamenkovic, J., Neumann, M., & Notarnicola, C. (2015). Review of machine learning approaches for biomass and soil moisture retrievals from remote sensing data. Remote Sensing, 7(12), 16398-16421.
22. Im, S. B., Hurlebaus, S., & Kang, Y. J. (2011). Summary review of GPS technology for structural health monitoring. Journal of Structural Engineering, 139(10), 1653-1664.
23. Glass, G. E., Schwartz, B. S., Morgan III, J. M., Johnson, D. T., Noy, P. M., & Israel, E. (1995). Environmental risk factors for Lyme disease identified with geographic information systems. American journal of public health, 85(7), 944-948.
24. Mollalo, A., Sadeghian, A., Israel, G. D., Rashidi, P., Sofizadeh, A., & Glass, G. E. (2018). Machine learning approaches in GIS-based ecological modeling of the sand fly Phlebotomus papatasi, a vector of zoonotic cutaneous leishmaniasis in Golestan province, Iran. Acta tropica, 188, 187-194.
25. Elliot, P., Wakefield, J. C., Best, N. G., & Briggs, D. J. (2000). Spatial epidemiology: methods and applications. Oxford University Press.
26. Monmonier, M. (2005). Lying with maps. Statistical science, 20(3), 215-222.
27. Han, J., Kamber, M., & Tung, A. K. (2001). Spatial clustering methods in data mining. Geographic data mining and knowledge discovery, 188-217.
117
28. Stocks, P. (1953). Studies of cancer death rates at different ages in England and Wales in 1921 to 1950: uterus, breast and lung. British journal of cancer, 7(3), 283.
29. Wakefield, J. (2006). Disease mapping and spatial regression with count data. Biostatistics, 8(2), 158-183.
30. Indrayan, A., & Kumar, R. (1996). Statistical choropleth cartography in epidemiology. International journal of epidemiology, 25(1), 181-189.
31. Gebreslasie, M. T. (2015). A review of spatial technologies with applications for malaria transmission modelling and control in Africa. Geospatial health, 10(2).
32. Patil, A. P., Gething, P. W., Piel, F. B., & Hay, S. I. (2011). Bayesian geostatistics in health cartography: the perspective of malaria. Trends in Parasitology, 27(6), 246-253.
33. Best, N., Richardson, S., & Thomson, A. (2005). A comparison of Bayesian spatial models for disease mapping. Statistical methods in medical research, 14(1), 35-59.
34. Carlin, B. P., & Louis, T. A. (2010). Bayes and empirical Bayes methods for data analysis. Chapman and Hall/CRC.
35. Caprarelli, G., & Fletcher, S. (2014). A brief review of spatial analysis concepts and tools used for mapping, containment and risk modelling of infectious diseases and other illnesses. Parasitology, 141(5), 581-601.
36. Elliott, P., & Wartenberg, D. (2004). Spatial epidemiology: current approaches and future challenges. Environmental health perspectives, 112(9), 998-1006.
37. Nychka, D. (1988). Bayesian confidence intervals for smoothing splines. Journal of the American Statistical Association, 83(404), 1134-1143.
38. Sharmin, S., Glass, K., Viennet, E., & Harley, D. (2018). Geostatistical mapping of the seasonal spread of under-reported dengue cases in Bangladesh. PLoS neglected tropical diseases, 12(11), e0006947.
39. Randremanana, R. V., Richard, V., Rakotomanana, F., Sabatier, P., & Bicout, D. J. (2010). Bayesian mapping of pulmonary tuberculosis in Antananarivo, Madagascar. BMC infectious diseases, 10(1), 21.
40. Lawson, A. B. (2013). Bayesian disease mapping: hierarchical modeling in spatial epidemiology. Chapman and Hall/CRC.
41. Dubin, R. A. (1998). Spatial autocorrelation: a primer. Journal of housing economics, 7(4), 304-327.
118
42. Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroit region. Economic geography, 46(sup1), 234-240.
43. Miron, J. (1984). Spatial autocorrelation in regression analysis: A beginner’s guide. In Spatial statistics and models (pp. 201-222). Springer, Dordrecht.
44. De Jong, P. D., & De Bree, J. (1995). Analysis of the spatial distribution of rust-infected leek plants with the Black-White join-count statistic. European Journal of Plant Pathology, 101(2), 133-137.
45. Cliff, A. D., & Ord, J. K. (1981). Spatial processes: models & applications. Taylor & Francis.
46. Gilbert, G. S., Foster, R. B., & Hubbell, S. P. (1994). Density and distance-to-adult effects of a canker disease of trees in a moist tropical forest. Oecologia, 98(1), 100-108.
47. Moran, P. A. (1948). The interpretation of statistical maps. Journal of the Royal Statistical Society. Series B (Methodological), 10(2), 243-251.
48. Wilhelm, A., & Steck, R. (1998). Interactive graphics and local statistics. Journal of the Royal Statistical Society: Series D (The Statistician), 47(3), 423-430.
49. Chen, Y. (2012). On the four types of weight functions for spatial contiguity matrix. Letters in Spatial and Resource Sciences, 5(2), 65-72.
50. Naish, S., Dale, P., Mackenzie, J. S., McBride, J., Mengersen, K., & Tong, S. (2014). Spatial and temporal patterns of locally-acquired dengue transmission in northern Queensland, Australia, 1993–2012. PloS one, 9(4), e92524.
51. Geary, R. C. (1954). The contiguity ratio and statistical mapping. The incorporated statistician, 5(3), 115-146.
52. Chen, Y. (2013). New approaches for calculating Moran’s index of spatial autocorrelation. PloS one, 8(7), e68336.
53. Junior, G. B., de Paiva, A. C., Silva, A. C., & de Oliveira, A. C. M. (2009). Classification of breast tissues using Moran's index and Geary's coefficient as texture signatures and SVM. Computers in Biology and Medicine, 39(12), 1063-1072.
54. Souza, S. P. D., Jorge, V. M., & Orzechowski Xavier, M. (2014). Paracoccidioidomycosis in southern Rio Grande do Sul: a retrospective study of histopathologically diagnosed cases. Brazilian Journal of Microbiology, 45(1), 243-247.
55. Ripley, B. D. (1979). Tests of'randomness' for spatial point patterns. Journal of the Royal Statistical Society: Series B (Methodological), 41(3), 368-374.
119
56. Haase, P. (1995). Spatial pattern analysis in ecology based on Ripley's K‐function: Introduction and methods of edge correction. Journal of vegetation science, 6(4), 575-582.
57. Getis, A., & Ord, J. K. (2010). The analysis of spatial association by use of distance statistics. In Perspectives on Spatial Data Analysis (pp. 127-145). Springer, Berlin, Heidelberg.
58. Cuzick, J., & Edwards, R. (1996). Methods for investigating localized clustering of disease. Cuzick-Edwards one-sample and inverse two-sampling statistics. IARC scientific publications, (135), 200-202.
59. Silverman, B. W. (2018). Density estimation for statistics and data analysis. Routledge.
60. Aikembayev, A. M., Lukhnova, L., Temiraliyeva, G., Meka-Mechenko, T., Pazylov, Y., Zakaryan, S., ... & Francesconi, S. C. (2010). Historical distribution and molecular diversity of Bacillus anthracis, Kazakhstan. Emerging Infectious Diseases, 16(5), 789.
61. Openshaw, S., Charlton, M., Wymer, C., & Craft, A. (1987). A mark 1 geographical analysis machine for the automated analysis of point data sets. International Journal of Geographical Information System, 1(4), 335-358.
62. Clarke, K. C., McLafferty, S. L., & Tempalski, B. J. (1996). On epidemiology and geographic information systems: a review and discussion of future directions. Emerging infectious diseases, 2(2), 85.
63. Anselin, L. (1995). Local indicators of spatial association—LISA. Geographical analysis, 27(2), 93-115.
64. Hassarangsee, S., Tripathi, N., & Souris, M. (2015). Spatial pattern detection of tuberculosis: a case study of Si Sa Ket Province, Thailand. International journal of environmental research and public health, 12(12), 16005-16018.
65. Szonyi, B., Srinath, I., Esteve-Gassent, M., Lupiani, B., & Ivanek, R. (2015). Exploratory spatial analysis of Lyme disease in Texas–what can we learn from the reported cases?. BMC public health, 15(1), 924.
66. Getis, A., & Ord, J. K. (1996). Local spatial statistics: an overview. Spatial analysis: modelling in a GIS environment, 374, 261-277.
67. Sun, W., Xue, L., & Xie, X. (2017). Spatial-temporal distribution of dengue and climate characteristics for two clusters in Sri Lanka from 2012 to 2016. Scientific reports, 7(1), 12884.
68. Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics-Theory and methods, 26(6), 1481-1496.
120
69. Kulldorff, M., & Nagarwalla, N. (1995). Spatial disease clusters: detection and inference. Statistics in medicine, 14(8), 799-810.
70. Ward, M. P. (2001). Blowfly strike in sheep flocks as an example of the use of a time–space scan statistic to control confounding. Preventive Veterinary Medicine, 49(1-2), 61-69.
71. Knox, E. G., & Bartlett, M. S. (1964). The detection of space-time interactions. Journal of the Royal Statistical Society. Series C (Applied Statistics), 13(1), 25-30.
72. Kulldorff, M., Heffernan, R., Hartman, J., Assunçao, R., & Mostashari, F. (2005). A space–time permutation scan statistic for disease outbreak detection. PLoS medicine, 2(3), e59.
73. Robertson, C., & Nelson, T. A. (2010). Review of software for space-time disease surveillance. International journal of health geographics, 9(1), 16.
74. Fauci, A. S., Touchette, N. A., & Folkers, G. K. (2005). Emerging infectious diseases: a 10-year perspective from the National Institute of Allergy and Infectious Diseases. International Journal of Risk & Safety in Medicine, 17(3, 4), 157-167.
75. Fauci, A. S. (2001). Infectious diseases: considerations for the 21st century. Clinical Infectious Diseases, 32(5), 675-685.
76. Eisenberg, J. N., Cevallos, W., Ponce, K., Levy, K., Bates, S. J., Scott, J. C., ... & Trueba, G. (2006). Environmental change and infectious disease: how new roads affect the transmission of diarrheal pathogens in rural Ecuador. Proceedings of the National Academy of Sciences, 103(51), 19460-19465.
77. Mollalo, A., Mao, L., Rashidi, P., & Glass, G. E. (2019). A GIS-Based Artificial Neural Network Model for Spatial Distribution of Tuberculosis across the Continental United States. International journal of environmental research and public health, 16(1), 157.
78. Kovats, R. S., Campbell-Lendrum, D. H., McMichel, A. J., Woodward, A., & Cox, J. S. H. (2001). Early effects of climate change: do they include changes in vector-borne disease?. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 356(1411), 1057-1068.
79. Cudmore, T. J., Björklund, N., Carroll, A. L., & Lindgren, B. S. (2010). Climate change and range expansion of an aggressive bark beetle: evidence of higher beetle reproduction in naïve host tree populations. Journal of Applied Ecology, 47(5), 1036-1043.
121
80. Ogden, N. H., & Lindsay, L. R. (2016). Effects of climate and climate change on vectors and vector-borne diseases: ticks are different. Trends in parasitology, 32(8), 646-656.
81. Vail, S. G., & Smith, G. (1998). Air temperature and relative humidity effects on behavioral activity of blacklegged tick (Acari: Ixodidae) nymphs in New Jersey. Journal of medical entomology, 35(6), 1025-1028.
82. Lima, A., Lovin, D. D., Hickner, P. V., & Severson, D. W. (2016). Evidence for an overwintering population of Aedes aegypti in Capitol Hill neighborhood, Washington, DC. The American journal of tropical medicine and hygiene, 94(1), 231-235.
83. Jerrett, M., Gale, S., & Kontgis, C. (2010). Spatial modeling in environmental and public health research. International journal of environmental research and public health, 7(4), 1302-1329.
84. Takahashi, K., Kulldorff, M., Tango, T., & Yih, K. (2008). A flexibly shaped space-time scan statistic for disease outbreak detection and monitoring. International Journal of Health Geographics, 7(1), 14.
85. Stevens, K. B., & Pfeiffer, D. U. (2011). Spatial modelling of disease using data-and knowledge-driven approaches. Spatial and spatio-temporal epidemiology, 2(3), 125-133.
86. Pfeiffer, D., Robinson, T. P., Stevenson, M., Stevens, K. B., Rogers, D. J., & Clements, A. C. (2008). Spatial analysis in epidemiology (Vol. 142, No. 10.1093). Oxford: Oxford University Press.
87. Ishizaka, A., & Nemery, P. (2013). Multi-criteria decision analysis: methods and software. John Wiley & Sons.
88. Saaty, T. L. (2008). Decision making with the analytic hierarchy process. International journal of services sciences, 1(1), 83-98.
89. Chang, D. Y. (1996). Applications of the extent analysis method on fuzzy AHP. European journal of operational research, 95(3), 649-655.
90. Clements, A. C., Pfeiffer, D. U., & Martin, V. (2006). Application of knowledge-driven spatial modelling approaches and uncertainty management to a study of Rift Valley fever in Africa. International Journal of Health Geographics, 5(1), 57.
91. Mollalo, A., & Khodabandehloo, E. (2016). Zoonotic cutaneous leishmaniasis in northeastern Iran: A GIS-based spatio-temporal multi-criteria decision-making approach. Epidemiology & Infection, 144(10), 2217-2229.
122
92. Rakotomanana, F., Randremanana, R. V., Rabarijaona, L. P., Duchemin, J. B., Ratovonjato, J., Ariey, F., ... & Jeanne, I. (2007). Determining areas that require indoor insecticide spraying using Multi Criteria Evaluation, a decision-support tool for malaria vector control programmes in the Central Highlands of Madagascar. International Journal of Health Geographics, 6(1), 2.
93. Loh, W. Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 14-23.
94. Franklin, J. (2003). Clustering versus regression trees for determining ecological land units in the southern California mountains and foothills. Forest Science, 49(3), 354-368.
95. Lewis, R. J. (2000). An introduction to classification and regression tree (CART) analysis. In Annual meeting of the society for academic emergency medicine in San Francisco, California (Vol. 14).
96. Rodriguez-Galiano, V. F., Ghimire, B., Rogan, J., Chica-Olmo, M., & Rigol-Sanchez, J. P. (2012). An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS Journal of Photogrammetry and Remote Sensing, 67, 93-104.
97. Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., & Feuston, B. P. (2003). Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of chemical information and computer sciences, 43(6), 1947-1958.
98. Noble, W. S. (2006). What is a support vector machine?. Nature biotechnology, 24(12), 1565.
99. Gomes, V. H., IJff, S. D., Raes, N., Amaral, I. L., Salomão, R. P., Coelho, L. S., ... & Guevara, J. E. (2018). Species Distribution Modelling: Contrasting presence-only models with plot abundance data. Scientific reports, 8(1), 1003.
100. Grinnell, J. (1917). The niche-relationships of the California Thrasher. Auk, 34(4), 427-433.
101. Hirzel, A. H., Hausser, J., Chessel, D., & Perrin, N. (2002). Ecological‐niche factor analysis: how to compute habitat‐suitability maps without absence data?. Ecology, 83(7), 2027-2036.
102. Ayala, D., Costantini, C., Ose, K., Kamdem, G. C., Antonio-Nkondjio, C., Agbor, J. P., ... & Simard, F. (2009). Habitat suitability and ecological niche profile of major malaria vectors in Cameroon. Malaria Journal, 8(1), 307.
103. Stockwell, D. R., & Noble, I. R. (1992). Induction of sets of rules from animal distribution data: a robust and informative method of data analysis. Mathematics and computers in simulation, 33(5-6), 385-390.
123
104. Stockwell, D. (1999). The GARP modelling system: problems and solutions to automated spatial prediction. International journal of geographical information science, 13(2), 143-158.
105. Phillips, S. J., & Dudík, M. (2008). Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography, 31(2), 161-175.
106. Chamaillé, L., Tran, A., Meunier, A., Bourdoiseau, G., Ready, P., & Dedet, J. P. (2010). Environmental risk mapping of canine leishmaniasis in France. Parasites & vectors, 3(1), 31.
107. Larson, S. R., Degroot, J. P., Bartholomay, L. C., & Sugumaran, R. (2010). Ecological niche modeling of potential West Nile virus vector mosquito species in Iowa. Journal of Insect Science, 10(1), 110.
108. Phillips, S. J., Anderson, R. P., & Schapire, R. E. (2006). Maximum entropy modeling of species geographic distributions. Ecological modelling, 190(3-4), 231-259.
109. Sofizadeh, A., Rassi, Y., Vatandoost, H., Hanafi-Bojd, A. A., Mollalo, A., Rafizadeh, S., & Akhavan, A. A. (2016). Predicting the distribution of phlebotomus papatasi (diptera: psychodidae), the primary vector of zoonotic cutaneous leishmaniasis, in golestan province of Iran Using ecological niche modeling: comparison of MaxEnt and GARP models. Journal of medical entomology, 54(2), 312-320.
110. Hasyim, H., Nursafingi, A., Haque, U., Montag, D., Groneberg, D. A., Dhimal, M., ... & Müller, R. (2018). Spatial modelling of malaria cases associated with environmental factors in South Sumatra, Indonesia. Malaria journal, 17(1), 87.
111. Franke, J., Gebreslasie, M., Bauwens, I., Deleu, J., & Siegert, F. (2015). Earth observation in support of malaria control and epidemiology: MALAREO monitoring approaches. Geospatial health, (1), 117-131.
112. Nakhapakorn, K., & Tripathi, N. K. (2005). An information value based analysis of physical and climatic factors affecting dengue fever and dengue haemorrhagic fever incidence. International journal of health geographics, 4(1), 13.
113. Palaniyandi, M., Anand, P., & Maniyosai, R. (2014). Spatial cognition: a geospatial analysis of vector borne disease transmission and the environment, using remote sensing and GIS. Information Systems, 1, 52.
114. Mollalo, A., Blackburn, J. K., Morris, L. R., & Glass, G. E. (2017). A 24-year exploratory spatial data analysis of Lyme disease incidence rate in Connecticut, USA. Geospatial health, 12(2), 588.
124
115. Li, Y. P., Fang, L. Q., Gao, S. Q., Wang, Z., Gao, H. W., Liu, P., ... & Xu, B. (2013). Decision support system for the response to infectious disease emergencies based on WebGIS and mobile services in China. PLoS One, 8(1), e54842.
116. Lobitz, B., Beck, L., Huq, A., Wood, B., Fuchs, G., Faruque, A. S. G., & Colwell, R. (2000). Climate and infectious disease: use of remote sensing for detection of Vibrio cholerae by indirect measurement. Proceedings of the National Academy of Sciences, 97(4), 1438-1443.
117. Franke, J., & Menz, G. (2007). Multi-temporal wheat disease detection by multi-spectral remote sensing. Precision Agriculture, 8(3), 161-172.
118. Hugh-Jones, M., Barré, N., Nelson, G., Wehnes, K., Warner, J., Garvin, J., & Garris, G. (1992). Landsat-TM identification of Amblyomma variegatum (Acari: Ixodidae) habitats in Guadeloupe. Remote Sensing of Environment, 40(1), 43-55.
119. Coburn, B. J., & Blower, S. (2013). Mapping HIV epidemics in sub-Saharan Africa with use of GPS data. The lancet global health, 1(5), e251-e253.
120. Paz-Soldan, V., Stoddard, S., Vazquez-Prokopec, G., Morrison, A., Elder, J., Kitron, U., ... & Scott, T. (2010). Assessing and maximizing the acceptability of GPS device use for studying the role of human movement in dengue virus transmission in Iquitos, Peru. Am J Trop Med Hyg, 82(4), 723-730.
121. Masthi, N. R., Madhusudan, M., & Puthussery, Y. P. (2015). Global positioning system & Google Earth in the investigation of an outbreak of cholera in a village of Bengaluru Urban district, Karnataka. The Indian journal of medical research, 142(5), 533.
122. Garcia-Martí, I., Zurita-Milla, R., Van Vliet, A. J., & Takken, W. (2017). Modelling and mapping tick dynamics using volunteered observations. International journal of health geographics, 16(1), 41.
123. Ostfeld, R. S., Canham, C. D., Oggenfuss, K., Winchcombe, R. J., & Keesing, F. (2006). Climate, deer, rodents, and acorns as determinants of variation in Lyme-disease risk. PLoS biology, 4(6), e145.
124. Pepin, K. M., Eisen, R. J., Mead, P. S., Piesman, J., Fish, D., Hoen, A. G., ... & Diuk-Wasser, M. A. (2012). Geographic variation in the relationship between human Lyme disease incidence and density of infected host-seeking Ixodes scapularis nymphs in the Eastern United States. The American journal of tropical medicine and hygiene, 86(6), 1062-1071.
125. Ramezankhani, R., Sajjadi, N., Jozi, S. A., & Shirzadi, M. R. (2018). Climate and environmental factors affecting the incidence of cutaneous leishmaniasis in Isfahan, Iran. Environmental Science and Pollution Research, 25(12), 11516-11526.
125
126. Nieto, P., Malone, J. B., & Bavia, M. E. (2006). Ecological niche modeling forvisceral leishmaniasis in the state of Bahia, Brazil, using genetic algorithm for rule-set prediction and growing degree day-water budget analysis. Geospatialhealth, 1(1), 115-126.
127. Paixão Sevá, A., Mao, L., Galvis-Ovallos, F., Lima, J. M. T., & Valle, D. (2017).Risk analysis and prediction of visceral leishmaniasis dispersion in São PauloState, Brazil. PLoS neglected tropical diseases, 11(2), e0005353.
128. Yamamura, M., Freitas, I. M. D., Santo Neto, M., Chiaravalloti Neto, F., Popolin,M. A. P., Arroyo, L. H., ... & Arcêncio, R. A. (2016). Spatial analysis of avoidablehospitalizations due to tuberculosis in Ribeirao Preto, SP, Brazil (2006-2012). Revista de saude publica, 50, 20.
129. Patterson, B., Morrow, C. D., Kohls, D., Deignan, C., Ginsburg, S., & Wood, R.(2017). Mapping sites of high TB transmission risk: integrating the shared air andsocial behaviour of TB cases and adolescents in a South Africantownship. Science of the Total Environment, 583, 97-103.
130. Oliver, J. H., Lin, T., Gao, L., Clark, K. L., Banks, C. W., Durden, L. A., ... &Chandler, F. W. (2003). An enzootic transmission cycle of Lyme borreliosisspirochetes in the southeastern United States. Proceedings of the NationalAcademy of Sciences, 100(20), 11642-11645.
131. Anderson, J. F. (1989). Ecology of Lyme disease. Connecticut medicine, 53(6),343-346.
132. Radolf, J. D., Caimano, M. J., Stevenson, B., & Hu, L. T. (2012). Of ticks, miceand men: understanding the dual-host lifestyle of Lyme diseasespirochaetes. Nature reviews microbiology, 10(2), 87.
133. Centers for Disease Control and Prevention. Lyme Disease. Available at:http://www.cdc.gov/lyme/signs_symptoms. Accessed March 20, 2016.
134. Kugeler, K. J., Farley, G. M., Forrester, J. D., & Mead, P. S. (2015). Geographicdistribution and expansion of human Lyme Disease, United States. Emerginginfectious diseases, 21(8), 1455-1457.
135. Li, J., Kolivras, K. N., Hong, Y., Duan, Y., Seukep, S. E., Prisley, S. P., ... &Gaines, D. N. (2014). Spatial and temporal emergence pattern of Lyme disease inVirginia. The American journal of tropical medicine and hygiene, 91(6), 1166-1172.
136. Steere, A. C., Hardin, J. A., & Malawista, S. E. (1977). Erythema chronicummigrans and Lyme arthritis: cryoimmunoglobulins and clinical activity of skin andjoints. Science, 196(4294), 1121-1122.
126
137. Cromley, E. K., Cartter, M. L., Mrozinski, R. D., & Ertel, S. H. (1998). Residentialsetting as a risk factor for Lyme disease in a hyperendemic region. Americanjournal of epidemiology, 147(5), 472-477.
138. The Connecticut Department of Public Health (CTDPH). Available at:http://www.ct.gov/dph/cwp/view.asp?a=3136&q=399694. Accessed February 10,2016.
139. Centers for Disease Control and Prevention. Lyme Disease (Borreliaburgdorferi): 1996 Case Definition. Available at:https://wwwn.cdc.gov/nndss/conditions/lyme-disease/case-definition/1996/.Accessed March 20, 2016.
140. Centers for Disease Control and Prevention. Lyme Disease (Borreliaburgdorferi): 2008 Case Definition. Available at:https://wwwn.cdc.gov/nndss/conditions/lyme-disease/case-definition/2008/.Accessed March 20, 2016.
141. Centers for Disease Control and Prevention. Lyme Disease (Borreliaburgdorferi): 2011 Case Definition. Available at:https://wwwn.cdc.gov/nndss/conditions/lyme-disease/case-definition/2011/.Accessed March 20, 2016.
142. Moran, P. A. (1950). Notes on continuous stochasticphenomena. Biometrika, 37(1/2), 17-23.
143. Anselin, L. (2005). Exploring spatial data with GeoDaTM: a workbook. Center forspatially integrated social science.
144. O'sullivan, D., & Unwin, D. (2014). Geographic information analysis. John Wiley& Sons.
145. Mitchell, A., & Minami, M. (1999). The ESRI guide to GIS analysis: geographicpatterns & relationships (Vol. 1). ESRI, Inc..
146. Clayton, D., & Kaldor, J. (1987). Empirical Bayes estimates of age-standardizedrelative risks for use in disease mapping. Biometrics, 671-681.
147. Kulldorff, M. (2010). SaTScan user guide for version 9.0.
148. Fielding, A. H., & Bell, J. F. (1997). A review of methods for the assessment ofprediction errors in conservation presence/absence models. Environmentalconservation, 24(1), 38-49.
149. Birnbaum, D., Jacquez, G. M., Waller, L. A., Grimson, R., & Wartenberg, D.(1996). The analysis of disease clusters, part I: state of the art. Infection Control &Hospital Epidemiology, 17(5), 319-327.
127
150. Connally, N. P., Ginsberg, H. S., & Mather, T. N. (2006). Assessing peridomestic entomological factors as predictors for Lyme disease. Journal of Vector Ecology, 31(2), 364-371.
151. Mather, T. N., Nicholson, M. C., Donnelly, E. F., & Matyas, B. T. (1996). Entomologic index for human risk of Lyme disease. American Journal of Epidemiology, 144(11), 1066-1069.
152. Stafford, K. C., Cartter, M. L., Magnarelli, L. A., Ertel, S. H., & Mshar, P. A. (1998). Temporal correlations between tick abundance and prevalence of ticks infected with Borrelia burgdorferi and increasing incidence of Lyme disease. Journal of clinical microbiology, 36(5), 1240-1244.
153. Ogden, N. H., St-Onge, L., Barker, I. K., Brazeau, S., Bigras-Poulin, M., Charron, D. F., ... & Michel, P. (2008). Risk maps for range expansion of the Lyme disease vector, Ixodes scapularis, in Canada now and with climate change. International journal of health geographics, 7(1), 24.
154. Xue, L., Scoglio, C., McVey, D. S., Boone, R., & Cohnstaedt, L. W. (2015). Two introductions of Lyme disease into Connecticut: a geospatial analysis of human cases from 1984 to 2012. Vector-Borne and Zoonotic Diseases, 15(9), 523-528.
155. Root, E. D., Meyer, R. E., & Emch, M. E. (2009). Evidence of localized clustering of gastroschisis births in North Carolina, 1999–2004. Social science & medicine, 68(8), 1361-1367.
156. Barro, A. S., Kracalik, I. T., Malania, L., Tsertsvadze, N., Manvelyan, J., Imnadze, P., & Blackburn, J. K. (2015). Identifying hotspots of human anthrax transmission using three local clustering techniques. Applied Geography, 60, 29-36.
157. Steere, A. C., Malawista, S. E., Snydman, D. R., Shope, R. E., Andiman, W. A., Ross, M. R., & Steele, F. M. (1977). An epidemic of oligoarticular arthritis in children and adults in three Connecticut communities. Arthritis & Rheumatism: Official Journal of the American College of Rheumatology, 20(1), 7-17.
158. Mollalo, A., Alimohammadi, A., Shirzadi, M. R., & Malek, M. R. (2015). Geographic information system‐based analysis of the spatial and spatio‐temporal distribution of zoonotic cutaneous leishmaniasis in Golestan Province, north‐east of Iran. Zoonoses and public health, 62(1), 18-28.
159. Shirzadi, M. R., Mollalo, A., & Yaghoobi-Ershadi, M. R. (2015). Dynamic relations between incidence of zoonotic cutaneous leishmaniasis and climatic factors in Golestan Province, Iran. Journal of arthropod-borne diseases, 9(2), 148.
160. Ramezankhani, R., Hosseini, A., Sajjadi, N., Khoshabi, M., & Ramezankhani, A. (2017). Environmental risk factors for the incidence of cutaneous leishmaniasis in an endemic area of Iran: A GIS-based approach. Spatial and spatio-temporal epidemiology, 21, 57-66.
128
161. Hanafi-Bojd, A. A., Khoobdel, M., Soleimani-Ahmadi, M., Azizi, K., Aghaei Afshar, A., Jaberhashemi, S. A., ... & Safari, R. (2017). Species composition of sand flies (Diptera: Psychodidae) and modeling the spatial distribution of main vectors of cutaneous leishmaniasis in Hormozgan Province, Southern Iran. Journal of medical entomology, 55(2), 292-299.
162. Afshar, A. A., Rassi, Y. A. V. A. R., Sharifi, I., Abai, M. R., Oshaghi, M. A., Yaghoobi-Ershadi, M. R., & Vatandoost, H. (2011). Susceptibility status of Phlebotomus papatasi and P. sergenti (Diptera: Psychodidae) to DDT and deltamethrin in a focus of cutaneous leishmaniasis after earthquake strike in Bam, Iran. Iranian journal of arthropod-borne diseases, 5(2), 32.
163. Postigo, J. A. R. (2010). Leishmaniasis in the world health organization eastern mediterranean region. International journal of antimicrobial agents, 36, S62-S65.
164. Kampichler, C., Wieland, R., Calmé, S., Weissenberger, H., & Arriaga-Weiss, S. (2010). Classification in conservation biology: a comparison of five machine-learning methods. Ecological Informatics, 5(6), 441-450.
165. Huang, G. B., Zhu, Q. Y., & Siew, C. K. (2006). Extreme learning machine: theory and applications. Neurocomputing, 70(1-3), 489-501.
166. Bagheri, M., Bazvand, A., & Ehteshami, M. (2017). Application of artificial intelligence for the management of landfill leachate penetration into groundwater, and assessment of its environmental impacts. Journal of Cleaner Production, 149, 784-796.
167. Peterson, A. T., & Shaw, J. (2003). Lutzomyia vectors for cutaneous leishmaniasis in Southern Brazil: ecological niche models, predicted geographic distributions, and climate change effects. International journal for parasitology, 33(9), 919-931.
168. Samy, A. M., Annajar, B. B., Dokhan, M. R., Boussaa, S., & Peterson, A. T. (2016). Coarse-resolution ecology of etiological agent, vector, and reservoirs of zoonotic cutaneous leishmaniasis in Libya. PLoS neglected tropical diseases, 10(2), e0004381.
169. Rajabi, M., Mansourian, A., Pilesjö, P., & Bazmani, A. (2014). Environmental modelling of visceral leishmaniasis by susceptibility-mapping using neural networks: A case study in north-western Iran. Geospatial health, 179-191.
170. Kapwata, T., & Gebreslasie, M. T. (2016). Random forest variable selection in spatial malaria transmission modelling in Mpumalanga Province, South Africa. Geospatial health.
171. Gomes, A. L. V., Wee, L. J., Khan, A. M., Gil, L. H., Marques Jr, E. T., Calzavara-Silva, C. E., & Tan, T. W. (2010). Classification of dengue fever patients based on gene expression data using support vector machines. PloS one, 5(6), e11267.
129
172. Armitage, D. W., & Ober, H. K. (2010). A comparison of supervised learning techniques in the classification of bat echolocation calls. Ecological Informatics, 5(6), 465-473.
173. Hijmans, R. J., Cameron, S. E., Parra, J. L., Jones, P. G., & Jarvis, A. (2005). Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology: A Journal of the Royal Meteorological Society, 25(15), 1965-1978.
174. Janalipour, M., & Taleai, M. (2017). Building change detection after earthquake using multi-criteria decision analysis based on extracted information from high spatial resolution satellite images. International journal of remote sensing, 38(1), 82-99.
175. Freedman, D. A. (2009). Statistical models: theory and practice. cambridge university press.
176. Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE transactions on neural networks, 10(5), 988-999.
177. Bishop, C. M., & Tipping, M. E. (2000). Variational relevance vector machines. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence (pp. 46-53). Morgan Kaufmann Publishers Inc..
178. Tuv, E., Borisov, A., Runger, G., & Torkkola, K. (2009). Feature selection with ensembles, artificial variables, and redundancy elimination. Journal of Machine Learning Research, 10(Jul), 1341-1366.
179. Park, M. Y., & Hastie, T. (2007). L1‐regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4), 659-677.
180. Bagheri, M., Mirbagheri, S. A., Kamarkhani, A. M., & Bagheri, Z. (2016). Modeling of effluent quality parameters in a submerged membrane bioreactor with simultaneous upward and downward aeration treating municipal wastewater using hybrid models. Desalination and Water Treatment, 57(18), 8068-8089.
181. Altman, D. G., & Bland, J. M. (1994). Diagnostic tests. 1: Sensitivity and specificity. BMJ: British Medical Journal, 308(6943), 1552.
182. Diel, R., Loddenkemper, R., Niemann, S., Meywald-Walter, K., & Nienhaus, A. (2011). Negative and positive predictive value of a whole-blood interferon-γ release assay for developing active tuberculosis: an update. American journal of respiratory and critical care medicine, 183(1), 88-95.
183. Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29-36.
130
184. Landis, J. R., & Koch, G. G. (1977). An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics, 363-374.
185. Aparicio, C., & Bitencourt, M. D. (2004). Spatial and temporal analysis of cutaneous leishmaniasis incidence in São Paulo–Brazil. In XXth International Society for Photogrammetry and Remote Sensing Congress (pp. 12-23).
186. Orshan, L., Elbaz, S., Ben-Ari, Y., Akad, F., Afik, O., Ben-Avi, I., ... & Zonstein, I. (2016). Distribution and dispersal of Phlebotomus papatasi (Diptera: Psychodidae) in a zoonotic cutaneous leishmaniasis focus, the Northern Negev, Israel. PLoS neglected tropical diseases, 10(7), e0004819.
187. Ding, F., Fu, J., Jiang, D., Hao, M., & Lin, G. (2018). Mapping the spatial distribution of Aedes aegypti and Aedes albopictus. Acta tropica, 178, 155-162.
188. Williams, J. N., Seo, C., Thorne, J., Nelson, J. K., Erwin, S., O’Brien, J. M., & Schwartz, M. W. (2009). Using species distribution models to predict new occurrences for rare plants. Diversity and Distributions, 15(4), 565-576.
189. Carvalho, B. M., Rangel, E. F., Ready, P. D., & Vale, M. M. (2015). Ecological niche modelling predicts southward expansion of Lutzomyia (Nyssomyia) flaviscutellata (Diptera: Psychodidae: Phlebotominae), vector of Leishmania (Leishmania) amazonensis in South America, under climate change. PLoS One, 10(11), e0143282.
190. Wang, C. T., Long, R. J., Wang, Q. J., Ding, L. M., & Wang, M. P. (2007). Effects of altitude on plant-species diversity and productivity in an alpine meadow, Qinghai–Tibetan plateau. Australian Journal of Botany, 55(2), 110-117.
191. Mollalo, A., Alimohammadi, A., Shahrisvand, M., Shirzadi, M. R., & Malek, M. R. (2014). Spatial and statistical analyses of the relations between vegetation cover and incidence of cutaneous leishmaniasis in an endemic province, northeast of Iran. Asian Pacific journal of tropical disease, 4(3), 176-180.
192. Kustas, W. P., & Norman, J. M. (1996). Use of remote sensing for evapotranspiration monitoring over land surfaces. Hydrological Sciences Journal, 41(4), 495-516.
193. Boussaa, S., Kahime, K., Samy, A. M., Salem, A. B., & Boumezzough, A. (2016). Species composition of sand flies and bionomics of Phlebotomus papatasi and P. sergenti (Diptera: Psychodidae) in cutaneous leishmaniasis endemic foci, Morocco. Parasites & vectors, 9(1), 60.
194. Cross, E. R., Newcomb, W. W., & Tucker, C. J. (1996). Use of weather data and remote sensing to predict the geographic and seasonal distribution of Phlebotomus papatasi in southwest Asia. The American journal of tropical medicine and hygiene, 54(5), 530-536.
131
195. Medlock, J. M., Hansford, K. M., Van Bortel, W., Zeller, H., & Alten, B. (2014). A summary of the evidence for the change in European distribution of phlebotomine sand flies (Diptera: Psychodidae) of public health importance. Journal of Vector Ecology, 39(1), 72-77.
196. Simsek, F. M., Alten, B., Caglar, S. S., Ozbel, Y., Aytekin, A. M., Kaynas, S., ... & Rastgeldi, S. (2007). Distribution and altitudinal structuring of phlebotomine sand flies (Diptera: Psychodidae) in southern Anatolia, Turkey: their relation to human cutaneous leishmaniasis. Journal of Vector Ecology, 32(2), 269-280.
197. Kasap, O. E., & Alten, B. (2006). Comparative demography of the sand fly Phlebotomus papatasi (Diptera: Psychodidae) at constant temperatures. Journal of Vector Ecology, 31(2), 378-386.
198. Lalvani, A., Pathan, A. A., Durkan, H., Wilkinson, K. A., Whelan, A., Deeks, J. J., ... & Hill, A. V. (2001). Enhanced contact tracing and spatial tracking of Mycobacterium tuberculosis infection by enumeration of antigen-specific T cells. The Lancet, 357(9273), 2017-2021.
199. Sreeramareddy, C. T., Kumar, H. H., & Arokiasamy, J. T. (2013). Prevalence of self-reported tuberculosis, knowledge about tuberculosis transmission and its determinants among adults in India: results from a nation-wide cross-sectional household survey. BMC infectious diseases, 13(1), 16.
200. Smith, I. (2003). Mycobacterium tuberculosis pathogenesis and molecular determinants of virulence. Clinical microbiology reviews, 16(3), 463-496.
201. Whalen, C., Horsburgh, C. R., Hom, D., Lahart, C., Simberkoff, M., & Ellner, J. (1995). Accelerated course of human immunodeficiency virus infection after tuberculosis. American journal of respiratory and critical care medicine, 151(1), 129-135.
202. World Health Organization. Global Tuberculosis Report 2017. Available online: www.who.int/tb/publications/global_report/en/ (accessed on 12 July 2017).
203. Thomas, T. Y., & Rajagopalan, S. (2001). Tuberculosis and aging: a global health problem. Clinical infectious diseases, 33(7), 1034-1039.
204. Figueroa-Munoz, J. I., & Ramon-Pardo, P. (2008). Tuberculosis control in vulnerable groups. Bulletin of the World Health Organization, 86, 733-735.
205. Schmit, K. M., Wansaula, Z., Pratt, R., Price, S. F., & Langer, A. J. (2017). Tuberculosis—United States, 2016. MMWR. Morbidity and mortality weekly report, 66(11), 289.
206. Hill, A. N., Becerra, J. E., & Castro, K. G. (2012). Modelling tuberculosis trends in the USA. Epidemiology & Infection, 140(10), 1862-1872.
132
207. Sun, W., Gong, J., Zhou, J., Zhao, Y., Tan, J., Ibrahim, A., & Zhou, Y. (2015). A spatial, social and environmental study of tuberculosis in China using statistical and GIS technology. International journal of environmental research and public health, 12(2), 1425-1448.
208. Cantwell, M. F., McKENNA, M. T., McCRAY, E. E., & Onorato, I. M. (1998). Tuberculosis and race/ethnicity in the United States: impact of socioeconomic status. American Journal of Respiratory and Critical Care Medicine, 157(4), 1016-1020.
209. Wubuli, A., Xue, F., Jiang, D., Yao, X., Upur, H., & Wushouer, Q. (2015). Socio-demographic predictors and distribution of pulmonary tuberculosis (TB) in Xinjiang, China: A spatial analysis. PloS one, 10(12), e0144010.
210. Mahara, G., Yang, K., Chen, S., Wang, W., & Guo, X. (2018). Socio-Economic Predictors and Distribution of Tuberculosis Incidence in Beijing, China: A Study Using a Combination of Spatial Statistics and GIS Technology. Medical Sciences, 6(2), 26.
211. Harling, G., & Castro, M. C. (2014). A spatial analysis of social and economic determinants of tuberculosis in Brazil. Health & place, 25, 56-67.
212. Krieger, N., Chen, J. T., Waterman, P. D., Rehkopf, D. H., & Subramanian, S. V. (2003). Race/ethnicity, gender, and monitoring socioeconomic gradients in health: a comparison of area-based socioeconomic measures—the public health disparities geocoding project. American journal of public health, 93(10), 1655-1671.
213. Kistemann, T., Munzinger, A., & Dangendorf, F. (2002). Spatial patterns of tuberculosis incidence in Cologne (Germany). Social science & medicine, 55(1), 7-19.
214. Jia, Z. W., Jia, X. W., Liu, Y. X., Dye, C., Chen, F., Chen, C. S., ... & Liu, H. L. (2008). Spatial analysis of tuberculosis cases in migrants and permanent residents, Beijing, 2000–2006. Emerging infectious diseases, 14(9), 1413.
215. Hawker, J. I., Bakhshi, S. S., Ali, S., & Farrington, C. P. (1999). Ecological analysis of ethnic differences in relation between tuberculosis and poverty. Bmj, 319(7216), 1031-1034.
216. Shaweno, D., Karmakar, M., Alene, K. A., Ragonnet, R., Clements, A. C., Trauer, J. M., ... & McBryde, E. S. (2018). Methods used in the spatial analysis of tuberculosis epidemiology: a systematic review. BMC medicine, 16(1), 193.
217. Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996). Applied linear statistical models (Vol. 4, p. 318). Chicago: Irwin.
133
218. Naghibi, S. A., Pourghasemi, H. R., & Dixon, B. (2016). GIS-based groundwater potential mapping using boosted regression tree, classification and regression tree, and random forest machine learning models in Iran. Environmental monitoring and assessment, 188(1), 44.
219. Sadeghian, A., Lim, D., Karlsson, J., & Li, J. (2015). Automatic target recognition using discrimination based on optimal transport. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2604-2608). IEEE.
220. Vahedi, B., Kuhn, W., & Ballatore, A. (2016). Question-based spatial computing—A case study. In Geospatial Data in a Changing World (pp. 37-50). Springer, Cham.
221. Hatami, M., & Ameri Siahooei, E. (2013). Examines criteria applicable in the optimal location new cities, with approach for sustainable urban development. Middle-East Journal of Scientific Research, 14(5), 734-743.
222. Sadeghian, A., Sundaram, L., Wang, D., Hamilton, W., Branting, K., & Pfeifer, C. (2016). Semantic edge labeling over legal citation graphs. In Proceedings of the workshop on legal text, document, and corpus analytics (LTDCA-2016) (pp. 70-75).
223. Janalipour, M., & Mohammadzadeh, A. (2017). A fuzzy-ga based decision making system for detecting damaged buildings from high-spatial resolution optical images. Remote Sensing, 9(4), 349.
224. Shafieardekani, M., & Hatami, M. (2013). Forecasting Land Use Change in suburb by using Time series and Spatial Approach; Evidence from Intermediate Cities of Iran. European Journal of Scientific Research, 116(2), 199-208.
225. Mollalo, A., Alimohammadi, A., & Khoshabi, M. (2014). Spatial and spatio-temporal analysis of human brucellosis in Iran. Transactions of The Royal Society of Tropical Medicine and Hygiene, 108(11), 721-728.
226. Sordo, M. (2002). Introduction to neural networks in healthcare. Open Clinical knowledge management for medical care.
227. Aburas, H. M., Cetiner, B. G., & Sari, M. (2010). Dengue confirmed-cases prediction: A neural network model. Expert Systems with Applications, 37(6), 4256-4260.
228. Laureano-Rosario, A., Duncan, A., Mendez-Lazaro, P., Garcia-Rejon, J., Gomez-Carro, S., Farfan-Ale, J., ... & Muller-Karger, F. (2018). Application of artificial neural networks for dengue fever outbreak predictions in the northwest coast of Yucatan, Mexico and San Juan, Puerto Rico. Tropical medicine and infectious disease, 3(1), 5.
134
229. Xue, H., Bai, Y., Hu, H., & Liang, H. (2017). Influenza activity surveillance basedon multiple regression model and artificial neural network. IEEE Access, 6, 563-575.
230. Murray, E. J., Marais, B. J., Mans, G., Beyers, N., Ayles, H., Godfrey-Faussett,P., ... & Bond, V. (2009). A multidisciplinary method to map potential tuberculosistransmission ‘hot spots’ in high-burden communities. The international journal oftuberculosis and lung disease, 13(6), 767-774.
231. Mullins, J., Lobato, M. N., Bemis, K., & Sosa, L. (2018). Spatial clusters of latenttuberculous infection, Connecticut, 2010–2014. The International Journal ofTuberculosis and Lung Disease, 22(2), 165-170.
232. Bennett, R. J., Brodine, S., Waalen, J., Moser, K., & Rodwell, T. C. (2014).Prevalence and treatment of latent tuberculosis infection among newly arrivedrefugees in San Diego County, January 2010–October 2012. American journal ofpublic health, 104(4), e95-e102.
233. Feske, M. L., Teeter, L. D., Musser, J. M., & Graviss, E. A. (2011). Including thethird dimension: a spatial analysis of TB cases in Houston HarrisCounty. Tuberculosis, 91, S24-S33.
234. Scales, D., Brownstein, J. S., Khan, K., & Cetron, M. S. (2014). Toward a county-level map of tuberculosis rates in the US. American journal of preventivemedicine, 46(5), e49.
235. HealthMap. Available online: https://healthmap.org/tb (accessed on 25 July2018).
236. American Community Survey (ASC). Available online:https://www.census.gov/programs-surveys/acs/ (accessed on 25 July 2018).
237. Center for Disease Control and Prevention (CDC) wonder. Available online:http://wonder.cdc.gov/ (accessed on 25 July 2018).
238. García-Elorriaga, G., & Del Rey-Pineda, G. (2014). Type 2 diabetes mellitus as arisk factor for tuberculosis. J Mycobac Dis, 4(144), 2161-1068.
239. The National Map. Available online: http://nationalmap.gov/ (accessed on 25 July2018).
240. The US Census Bureau. Available online: https://www.census.gov/ (accessed on25 July 2018).
241. Anselin, L., Syabri, I., & Kho, Y. (2006). GeoDa: an introduction to spatial dataanalysis. Geographical analysis, 38(1), 5-22.
135
242. Civco, D. L. (1993). Artificial neural networks for land-cover classification and mapping. International journal of geographical information science, 7(2), 173-186.
243. Woods, K., & Bowyer, K. W. (1997). Generating ROC curves for artificial neural networks. IEEE Transactions on medical imaging, 16(3), 329-337.
244. Huang, G. B. (2003). Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2), 274-281.
245. Yilmaz, I., & Kaynar, O. (2011). Multiple regression, ANN (RBF, MLP) and ANFIS odels for prediction of swell potential of clayey soils. Expert systems with applications, 38(5), 5958-5966.
246. Behrang, M. A., Assareh, E., Ghanbarzadeh, A., & Noghrehabadi, A. R. (2010). The potential of different artificial neural network (ANN) techniques in daily global solar radiation modeling based on meteorological data. Solar Energy, 84(8), 1468-1480.
247. Karlik, B., & Olgac, A. V. (2011). Performance analysis of various activation functions in generalized MLP architectures of neural networks. International Journal of Artificial Intelligence and Expert Systems, 1(4), 111-122.
248. Zadeh, M. R., Amin, S., Khalili, D., & Singh, V. P. (2010). Daily outflow prediction by multi layer perceptron with logistic sigmoid and tangent sigmoid activation functions. Water resources management, 24(11), 2673-2688.
249. Bagheri, M., Mirbagheri, S. A., Ehteshami, M., & Bagheri, Z. (2015). Modeling of a sequencing batch reactor treating municipal wastewater using multi-layer perceptron and radial basis function artificial neural networks. Process Safety and Environmental Protection, 93, 111-123.
250. Gardner, M. W., & Dorling, S. R. (1998). Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric environment, 32(14-15), 2627-2636.
251. Riedmiller, M. (1994). Advanced supervised learning in multi-layer perceptrons—from backpropagation to adaptive learning algorithms. Computer Standards & Interfaces, 16(3), 265-278.
252. Shanker, M., Hu, M. Y., & Hung, M. S. (1996). Effect of data standardization on neural network training. Omega, 24(4), 385-397.
253. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
254. Anastassiou, G. A. (2011). Multivariate hyperbolic tangent neural network approximation. Computers & Mathematics with Applications, 61(4), 809-821.
136
255. Singh, P., & Deo, M. C. (2007). Suitability of different neural networks in daily flow forecasting. Applied Soft Computing, 7(3), 968-978.
256. Institute of Medicine: Ending Neglect: The Elimination of Tuberculosis in the United States; National Academy Press: Washington, DC, USA, 2000.
257. Georgia Tuberculosis Report (2017). Available online: https://dph.georgia.gov/sites/dph.georgia.gov/files/2016%20GA%20TB%20Report%20FINAL.pdf (accessed on 25 November 2018).
258. Florida Tuberculosis Report (2016). Available online: http://www.floridahealth.gov/diseases-and-conditions/tuberculosis/tb-statistics/index.html (accessed on 25 November 2018).
259. Onozuka, D., & Hagihara, A. (2015). The association of extreme temperatures and the incidence of tuberculosis in Japan. International journal of biometeorology, 59(8), 1107-1114.
260. Mourtzoukou, E. G., & Falagas, M. E. (2007). Exposure to cold and respiratory tract infections. The International Journal of Tuberculosis and Lung Disease, 11(9), 938-943.
261. Khalid, A., Baqai, T., Bukhari, M., & Khan, M. M. (2015). Comparison of the incidence of tuberculosis in different geographical zones in the state of Jammu and Kashmir. Pakistan Journal of Chest Medicine, 19(1).
262. McKenna, M. T., McCray, E., & Onorato, I. (1995). The epidemiology of tuberculosis among foreign-born persons in the United States, 1986 to 1993. New England Journal of Medicine, 332(16), 1071-1076.
263. Ho, M. J. (2004). Sociocultural aspects of tuberculosis: a literature review and a case study of immigrant tuberculosis. Social science & medicine, 59(4), 753-762.
264. Weis, S. E., Moonan, P. K., Pogoda, J. M., Turk, L. E., King, B., Freeman-Thompson, S., & Burgess, G. (2001). Tuberculosis in the foreign-born population of Tarrant county, Texas by immigration status. American journal of respiratory and critical care medicine, 164(6), 953-957.
265. Munch, Z., Van Lill, S. W. P., Booysen, C. N., Zietsman, H. L., Enarson, D. A., & Beyers, N. (2003). Tuberculosis transmission patterns in a high-incidence area: a spatial analysis. International journal of tuberculosis and lung disease, 7(3), 271-277.
266. dos Santos, M. A., Albuquerque, M. F., Ximenes, R. A., Lucena-Silva, N. L., Braga, C., Campelo, A. R., ... & Rodrigues, L. C. (2005). Risk factors for treatment delay in pulmonary tuberculosis in Recife, Brazil. BMC Public Health, 5(1), 25.
137
267. Jakubowiak, W. M., Bogorodskaya, E. M., Borisov, E. S., Danilova, D. I., & Kourbatova, E. K. (2007). Risk factors associated with default among new pulmonary TB patients and social support in six Russian regions. The international journal of tuberculosis and lung disease, 11(1), 46-53.
268. Djibuti, M., Mirvelashvili, E., Makharashvili, N., & Magee, M. J. (2014). Household income and poor treatment outcome among patients with tuberculosis in Georgia: a cohort study. BMC Public Health, 14(1), 88.
269. Bamrah, S., Yelk Woodruff, R. S., Powell, K., Ghosh, S., Kammerer, J. S., & Haddad, M. B. (2013). Tuberculosis among the homeless, United States, 1994–2010. The International Journal of Tuberculosis and Lung Disease, 17(11), 1414-1419.
270. Mapping and analyzing race and ethnicity. Available online: https://www2.census.gov/programs-surveys/sis/activities/geography/hg-2_teacher.pdf (accessed on 25 November 2018).
271. Centers for Disease Control and Prevention Trends in tuberculosis--United States, 2010. MMWR. Morb. Mortal. Wkl. Rep. 2011, 60, 333.
272. Lam, H. K., Ling, S. H., Leung, F. H., & Tam, P. K. S. (2001). Tuning of the structure and parameters of neural network using an improved genetic algorithm. In IECON'01. 27th Annual Conference of the IEEE Industrial Electronics Society (Cat. No. 37243) (Vol. 1, pp. 25-30). IEEE.
273. Hawkins, B. A., Diniz‐Filho, J. A. F., Mauricio Bini, L., De Marco, P., & Blackburn, T. M. (2007). Red herrings revisited: spatial autocorrelation and parameter estimation in geographical ecology. Ecography, 30(3), 375-384.
274. Rolnick, D., Veit, A., Belongie, S., & Shavit, N. (2017). Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694.
138
BIOGRAPHICAL SKETCH
Abolfazl Mollalo started his PhD program in the Department of Geography,
University of Florida in August 2015. He received his Ph.D. from this university in
August 2019. His areas of expertise include medical geography, spatial epidemiology,
GIS and applied machine learning in public health. He has obtained his master degree
in GIS and bachelor degree in surveying engineering from national universities in Iran.