GIS-BASED, DATA-DRIVEN TECHNIQUES FOR SPATIAL ANALYSIS … · 2019-10-22 · Overview of Training Models ... the applicability of multilayer perceptron (MLP) ANN for predicting the

GIS-BASED, DATA-DRIVEN TECHNIQUES FOR SPATIAL ANALYSIS OF INFECTIOUS DISEASES AT THE REGIONAL, STATE, AND NATIONAL LEVELS

By

ABOLFAZL MOLLALO

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2019

© 2019 Abolfazl Mollalo

To my family

4

ACKNOWLEDGMENTS

I want to thank my parents for their unconditional love and immeasurable

supports in my life. Words cannot express how grateful I am for all you have done for

me. I am extending my heartfelt thanks to my brother and lovely sister for their love,

encouragement, and help in my life. You are the most important people in my life, and I

dedicate this dissertation to you.

This work wouldn’t have been possible without the support and advice of a

number of individuals. First and foremost, I would like to express my deepest and

sincere gratitude to my mentor professor Gregory Glass for believing in me, giving me

the freedom to bridge my interests in data science and medical geography and

providing feedback over the years. I also would like to express my appreciation to my

other committee members, Dr. Liang Mao, Dr. Jason Blackburn, and Dr. Parisa Rashidi

for their insight, collaboration, feedback, and ideas throughout the entire PhD program.

I would like to give a special thank you to my friends especially those who live in

Gainesville for being there for me and are and will be like a family to me.

Finally, many thanks go to the UF Geography department including my teachers,

staff and all other people who helped me, directly or indirectly. Thanks for working hard

to create a positive learning environment. It was a great pleasure and honor for me to

be a part of this program.

5

TABLE OF CONTENTS page

ACKNOWLEDGMENTS .................................................................................................. 4

LIST OF TABLES ............................................................................................................ 7

LIST OF FIGURES .......................................................................................................... 8

LIST OF ABBREVIATIONS ........................................................................................... 10

ABSTRACT ................................................................................................................... 12

CHAPTER

1 BACKGROUND ...................................................................................................... 14

Application of New Technologies and Tools in Public Health ................................. 16 Disease Mapping .................................................................................................... 19 Global Clustering Techniques ................................................................................. 21 Local Cluster Detection Techniques ....................................................................... 23 Space-Time Clustering Techniques ........................................................................ 26 Environment and Infectious Diseases ..................................................................... 27 Disease Modeling ................................................................................................... 30

Knowledge-Driven Models ................................................................................ 30 Data-Driven Models .......................................................................................... 31

Research Questions and Hypotheses..................................................................... 34

2 A 24-YEAR EXPLORATORY SPATIAL DATA ANALYSIS OF LYME DISEASE INCIDENCE RATE IN CONNECTICUT .................................................................. 41

Materials and Methods............................................................................................ 43 Data Collection and Preparation ....................................................................... 43 Global Clustering .............................................................................................. 44 Local Clustering ................................................................................................ 45

Spatial smoothing ...................................................................................... 45 Local moran’s ............................................................................................. 45 Spatial scan statistics ................................................................................. 46

Accuracy Assessments .................................................................................... 47 Results .................................................................................................................... 48 Discussion .............................................................................................................. 50

3 MACHINE LEARNING APPROACHES IN GIS-BASED ECOLOGICAL MODELING OF THE SAND FLY PHLEBOTOMUS PAPATASI, A VECTOR OF ZOONOTIC CUTANEOUS LEISHMANIASIS ......................................................... 60

Material and Methods ............................................................................................. 62

6

Study Area ........................................................................................................ 62 Data Collection and Preparation ....................................................................... 63

Sand fly collection ...................................................................................... 63 Exploratory data ......................................................................................... 64

Classifiers Used in This Study .......................................................................... 66 Pre-processing of the Data ............................................................................... 67 Overview of Training Models ............................................................................ 68 Models Validation ............................................................................................. 69

Results .................................................................................................................... 70 Discussion .............................................................................................................. 72

4 A GIS-BASED ARTIFICIAL NEURAL NETWORK MODEL FOR SPATIAL DISTRIBUTION OF TUBERCULOSIS ACROSS THE CONTINENTAL UNITED STATE .................................................................................................................... 82

Material and Methods ............................................................................................. 86 Tuberculosis Data ............................................................................................ 86 Explanatory Data .............................................................................................. 86 Global and Local Clustering ............................................................................. 87 Artificial Neural Networks ................................................................................. 89 Model Pre-Processing ...................................................................................... 90 Model Evaluation .............................................................................................. 92

Results .................................................................................................................... 93 Discussion .............................................................................................................. 96

5 CONCLUSION ...................................................................................................... 107

Summary .............................................................................................................. 107 Limitations ............................................................................................................. 108 Outlook and Future Challenges ............................................................................ 110

APPENDIX

A STATISTICAL COMPARISON OF CLASSIFIERS ............................................... 112

B HABITAT SUITABILITY MAPS ............................................................................. 113

LIST OF REFERENCES ............................................................................................. 115

BIOGRAPHICAL SKETCH .......................................................................................... 138

7

LIST OF TABLES

Table page 1-1 Application of GIS in the study of infectious diseases ........................................ 37

1-2 Application of RS in the study of infectious diseases .......................................... 38

1-3 Application of GPS in the study of infectious diseases ....................................... 38

1-4 Application of GIS, RS, Spatial statistics and machine learning in the study of infectious diseases (Lyme disease, leishmaniasis, tuberculosis) ....................... 39

2-1 Comparison of LISA and SaTScan clusters by confusion matrix and its derivatives statistics ............................................................................................ 55

2-2 Results of the global Moran statistic of LD incidence rate, Connecticut, 1991-2014 ................................................................................................................... 55

2-3 Characteristics of LD spatial clusters detected by SSS with 5% of the population at risk throughout Connecticut, USA for the period 1991-2014 ......... 55

3-1 Bioclimate variables used in this study ............................................................... 77

3-2 Pearson correlation coefficients between selected variables ............................. 77

4-1 Top 10 states with the largest number of hotspot counties (p < 0.10) of smoothed tuberculosis (TB) incidence rate (STIR) in the continental US, 2006–2010........................................................................................................ 101

4-2 Pearson correlation analysis between selected variables for modelling STIR, continental US. ................................................................................................. 101

4-3 Results of linear regression (LR) model for modeling log (STIR), continental US. ................................................................................................................... 101

4-4 Effects of environment and socio-economic factors on the log (STIR) using LR model. ......................................................................................................... 102

4-5 Comparison of multi-layer perceptron (MLP; one and two hidden layers), and LR model’ performance for predicting log (STIR) in the continental US. .......... 102

A-1 Statistical test for comparing AUCs of SVM, LR and RF classifiers: ................. 112

8

LIST OF FIGURES

Figure page 1-1 Contingency table of Knox method ..................................................................... 40

1-2 Principle of linearly separable SVM using maximum margin .............................. 40

2-1 Geographic location of the Connecticut, its towns and approximate populations. The names on the map show the location of the towns which are mentioned in this paper. ..................................................................................... 56

2-2 Temporal trend of LD incidence throughout Connecticut from 1991 to 2014. Black dots show the incidence rate for each year and the blue line represents a scatter with smooth lines. ................................................................................ 57

2-3 Locations of spatial clusters of LD incidence in Connecticut, USA based on the true-cluster definition of with LISA and SSS methods targeting 5% of the population at risk. ............................................................................................... 58

2-4 Comparison of LISA and spatial scan statistics with 5% and 50% of the population at risk of LD in Connecticut, USA for each period from 1991 to 2014. Sensitivity, specificity and overall accuracy expressed as per cent (%).... 59

3-1 Geographic location of the study area and its counties, NE Iran. The map inset (B) illustrates the location of presence/absence sampling of Ph. papatasi. ............................................................................................................. 78

3-2 Methodology flowchart used in this study ........................................................... 79

3-3 The ROC curves and AUCs for SVM, RF and LR classifiers .............................. 79

3-4 Comparison of the SVM, LR and RF classifiers. Overall accuracy, AUC, Kappa index, sensitivity, specificity, PPV and NPV expressed as %. ................. 80

3-5 Habitat suitability map of Ph. papatasi in Golestan province, northeast Iran based on support vector machine (note: this map is based on 1km units). ........ 81

4-1 Topological architecture of multi-layer perceptron neural network (MLPNN) used in this study. ............................................................................................. 103

4-2 Spatial distribution of training, cross-validation, and test data used for modeling log (STIR). ......................................................................................... 103

4-3 The frequency of TB cases (left) and the cumulative TB incidence rate (right) across the continental US (2006–2010). .......................................................... 104

9

4-4 Hotspot map for the STIR in the continental US identified by hotspot analysis (Getis–Ord Gi*) technique, 2006–2010. ........................................................... 104

4-5 The Normal P-P Plot of LR model. ................................................................... 105

4-6 Scatter plot of observed and predicted log (STIR) (by single hidden layer MLP model) for test data in the continental US. ............................................... 106

4-7 The contribution of input features on predicting log (STIR) according to sensitivity analysis of single hidden layer MLP. RMSE: Root mean square error. ................................................................................................................. 106

B-1 Habitat suitability map of Ph. Papatasi based on logistic regression classifier . 113

B-2 Habitat suitability map of Ph. Papatasi based on random forest classifier ........ 114

10

LIST OF ABBREVIATIONS

AIDS Acquired Immune Deficiency Syndrome

AMOEBA A Multidirectional Optimal Ecotope- based Algorithm

ANN Artificial Neural Network

CDC Center for Disease Control and prevention

CSR Complete Spatial Randomness

CT Connecticut

CTDPH Connecticut Department of Public Health

DEM Digital Elevation Model

DF Dengue Fever

EBS Empirical Bayes Smoothing

ENFA Ecological Niche Factor Analysis

ENM Ecological Niche Models

ESDA Exploratory Spatial Data Analysis

EWS Early Warning Systems

GAM Geographic analysis machine

GARP Genetic Algorithm for Rule-set Production

GIS Geographic Information System

GLM General Linear Model

GPS Global Positioning System

HIV Human Immunodeficiency Virus

KDE Kernel Density Estimation

LASSO Least Absolute Shrinkage and Selection Operator

LD Lyme Disease

LISA Local Indicator Spatial Autocorrelation

11

LR Logistic Regression (in Chapter 3)

LR Linear Regression (in Chapter 4)

LST Land Surface Temperature

MAE Mean Absolute Error

MaxEnt Maximum Entropy

MCDA Multi Criteria Decision Analysis

MLP Multi-Layer Perceptron

MLT Machine Learning Technique

MODIS Moderate-resolution Imaging Spectroradiometer

NDVI Normalized Difference Vegetation Index

NPV Negative Predicative Value

PPV Positive Predictive Value

RF Random Forest

RMSE Root Mean Square Error

ROC Receiver Operating Characteristic

RS Remote Sensing

SA Spatial Autocorrelation

SDM Species Distribution Model

SSS Spatial Scan Statistics

STIR Smoothed Tuberculosis Incidence Rate

SVM Support Vector Machine

TB Tuberculosis

USGS United States Geological Survey

VIF Variance Inflation Factor

VL Visceral Leishmaniasis

12

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

GIS-BASED, DATA-DRIVEN TECHNIQUES FOR SPATIAL ANALYSIS OF

INFECTIOUS DISEASES AT THE REGIONAL, STATE, AND NATIONAL LEVELS

By

Abolfazl Mollalo

August 2019

Chair: Gregory E. Glass Major: Geography

The present research emphasized the spatial epidemiological aspects of three

different infectious diseases: Lyme disease, zoonotic cutaneous leishmaniasis, and

tuberculosis at the state, regional, and national levels, respectively. We combined

relatively innovative methods/tools, for example, remote sensing, exploratory spatial

data analyses, GIS and data science techniques. The findings of these studies can

provide useful insight to health authorities on prioritizing resource allocation to risk-

prone areas.

We examined changes in the spatial distribution of significant spatial clusters of

Lyme disease incidence rates at the town level from 1991 to 2014 as an approach for

targeted interventions. Local clustering was measured using a local indicator of spatial

autocorrelation (LISA). Elliptic spatial scan statistics (SSS) in different shapes and

directions were also performed in SaTScan. The accuracy of these two cluster detection

methods was assessed and compared for sensitivity, specificity, and overall accuracy.

In another case study, we compared several approaches to model the spatial

distribution of Phlebotomus papatasi, the primary vector of zoonotic cutaneous

leishmaniasis, in an endemic region of the disease in Golestan province, northeast of

13

Iran. We gathered and prepared data on related environmental factors

including topography, weather variables, distance to main rivers and remotely sensed

data such as normalized difference vegetation cover and land surface temperature

(LST) in a GIS framework. Applicability of three classifiers: (vanilla) logistic regression,

random forest and support vector machine (SVM) were compared for predicting

presence/absence of the vector. Predictive performances were compared using an

independent dataset to generate area under the ROC curve (AUC) and Kappa statistics.

Despite the usefulness of artificial neural networks (ANNs) in the study of various

complex problems, ANNs have not been applied for modeling the geographic

distribution of tuberculosis (TB) in the US. Likewise, ecological level researches on TB

incidence rate at the national level are inadequate for epidemiologic inferences. We

collected 278 exploratory variables including environmental and a broad range of socio-

economic features for modeling the disease across the continental US. We investigated

the applicability of multilayer perceptron (MLP) ANN for predicting the disease

incidence.

14

CHAPTER 1 BACKGROUND

"Person", "place", and "time" are known as three key components in every

descriptive epidemiologic research 1. Application of place (geography) in the study of

diseases backs to more than 2000 years ago 2. Hippocrates (460-377 BC) in his essay

entitled “On Airs, Water, and Places” proposed a theory suggesting that environmental

factors can have a significant influence on the occurrence of diseases 3. In the late

seventeenth century, Filippo Arrieta (1694) utilized maps to investigate the outbreak of

plague in Bari, Italy 4. At the end of the eighteenth century, Valentine Seaman (1795)

mapped fatal cases of yellow fever in New York city to detect a possible association

between the location of yellow fever cases and waste sites 5. Perhaps the most famous

application of place (map) in the study of disease distribution is the work of John Snow

in mapping cholera death cases in the 1850s in London. Using the distribution map of

(cholera) deaths, he could trace and detect the source of the outbreak to a

contaminated public water pump 6.

Before 1950, there were very few studies that incorporated spatial components in

health researches. During the 1950s, Jacques May (1958) proposed a new concept

titled “Disease-ecology” or “geographic pathology” 7 which surprisingly is still valid in

epidemiological studies 8. “Disease- ecology” investigates the interaction of people with

the environment. His viewpoint was "process-oriented" and applied natural scientific

explanations: "Once the person, disease, and place are known, we may be able to

understand why someone is afflicted and someone else is not” 9. The principal objective

of his approach was to better comprehend the dynamics of disease which varies per

weather condition, mineral particles in water, vegetation cover and other influencing

15

factors 10. May divided an environmental health problem into three main categories: 1)

inorganic environment such as weather/climate variables which can reflect many

features of traditional environmental philosophy, 2) organic environment which can

reflect disease pattern because of animals/plants activities, and 3) socio-economic

factors which can explain the disease associated with the behavior and culture of

human 11.

By the end of the 1960s, remarkable changes in the concepts of May’s approach

began to emerge within the literature. The variations were partially associated with the

raised attention of geographers to the disease causation through visual interpretation.

This was in contrasts with May’s approach that studies disease occurrence as

understanding the processes of disease-ecology 10. In this decade, the World Atlas of

Disease was issued under the supervision of May. The atlas represented a significant

milestone in medical geography in this century. It contained both small and relatively

large scale maps of disease morbidity which were shaded by colors or plain symbols 1.

Mapping continued as one of the essential fields where geographer contributed

substantially and is widely used in spatial epidemiological researches as the exploratory

tool for generating hypotheses useful in health care planning 12.

Before 2000, medical geography mainly focused on clustering and cluster

detection analysis. Along with the advances in technology, the field of medical

geography has progressively evolved 13. Several sophisticated spatial tools such as

geographic information system (GIS), remote sensing (RS), and novel spatial statistical

analysis such as machine learning algorithms played an important role.

16

Although historical researches in medical geography rarely considered

geographical aspects, recent (decade) studies increasingly include "spatial" component

in epidemiological inferences 14. This increment might be due to the broader availability

of spatial data and quality of spatial health data than in the past because of the

advances in technology 15. Moreover, increased availability and sharing more accurate

environmental, socio-economic, cultural, and biological data has fueled researches in

medical geography. The availability of user-friendly software packages designed to

expedite statistical analysis is another explanation, particularly for non-expert users.

Therefore, medical geographers now have more powerful tools and high-quality and

update data at hands, and thus, spatial epidemiological studies are expected to

continue to play a significant role in the epidemiology of diseases. In the later sections

of Chapter 1 we will briefly demonstrate the application of a wide range of new tools and

technologies applied in medical geography, such as GIS, RS, GPS, and spatial

statistics and models.

Application of New Technologies and Tools in Public Health

Over the last two decades, mapping techniques, spatial analyses of disease

patterns, and risk modeling of diseases have substantially improved with the advent of

GIS 16. GIS has provided unprecedented opportunities to collect, organize and integrate

geospatial data from various sources and in different formats. It has enabled conducting

very complex analysis in public health in conjunction with population and environment

that in earlier researches were very hard or impossible to be studied. This tool has

equipped medical geographers to find answers for overly complicated questions more

efficiently 17. The application of GIS in recent decades have been progressively

increased. In Table 1-1, we have summarized some of the widely-used applications of

17

GIS in the studies of infectious diseases. An overview of GIS and its application in the

studies of infectious diseases has been published by Nykiforuk et al. (2011) 18.

According to Campbell and Wynne (2011) remote sensing refers to obtain some

characteristics of an object without making physical contact with it (i.e., through an

aerial sensor) 19. It allows collecting a vast amount of data in a short time and high

accuracy which facilitates disease surveillance and control. An overview of RS and its

application in public health has been reviewed by Hay et al. (1997) 20. High-resolution

satellite images can help medical geographers to investigate geographic variations of

diseases at a small-area scale. Moreover, with the help of RS, some products such as

normalized difference vegetation cover (NDVI) which reflects vegetation cover on the

ground, land surface temperature (LST), soil property, fog, and land cover can be

derived and incorporated as candidate factors in modeling. Also, several biophysical

features can be quantified by this technology including, biomass, chlorophyll absorption,

surface texture, and moisture content 21. In public health, RS data can be used for

generating habitat suitability maps for species, predicting vector or host population or

presence/absence and identifying vector or host habitat. Table 1-2 summarizes some

applications of RS in the study of infectious diseases.

Another technology that has made a significant role in collecting geospatial data

is the global positioning system (GPS). This technology has made collecting massive

amounts of spatial and attributes data in surveillance of infectious disease faster, easier

and with an affordable cost 22. Table 1-3 summarizes some applications of GPS in the

study of diseases.

18

Numerous researches have integrated the above technologies in various types of

infectious diseases researches. These studies include mapping disease prevalence,

predicting habitat suitability map of vectors, identifying risk-prone areas of infections,

etc. As an early study that integrated the above techniques, Glass et al. (1990) used

LANDSAT TM satellite images to derive land cover/land use in developing an

environmental geodatabase. These remotely sensed products and GIS were combined

to identify risk factors and consequently high-risk areas of Lyme disease in Maryland 23.

Mollalo et al. (2018) captured the location of collected sandflies of leishmaniasis in an

endemic province in Iran using handheld GPS. They then used several environmental

factors including NDVI in a GIS framework. They could predict the presence/absence of

the sand fly with an accuracy of 90% for a hold-out dataset 24.

Spatial statistics in epidemiology are statistical tools that are used to mainly map,

explain, and predict the spatial distribution of health problems. In general, the four main

types of spatial statistics frequently used in the studies of infectious diseases are: 1)

mapping (visualization) disease count or rate 2) clustering and cluster detection analysis

(exploratory spatial data analysis) 3) Correlation analysis and 4) modeling (explain or

predict disease rates/counts) 25. The very first step in spatial statistics is linking disease

counts/rates data to their corresponding locations and visualize them as maps. The

mapping step, however, may reveal some useful information, can also conceal the

actual affected areas 26. Therefore, some statistical analyses are required to evaluate

the observed pattern of disease occurrences statistically. Spatial clustering and cluster

detection techniques can consider locations and attributes and can be used to address

this issue by identifying hotspot(s) and cold spot(s) of infectious diseases. Identifying

19

hotspots are very helpful in generating hypothesis and can provide useful information

for further analysis 27. Finally, using a proper model the possible relationship between

disease frequency/rate and several explanatory variables (such as environmental and

socio-economic factors) can be explained. A properly developed model can also be

utilized to predict (spatially/temporally) disease distribution. Detailed information about

each step of the application of spatial statistics in the study of infectious diseases is

provided as follows.

Disease Mapping

The topic of disease mapping has a very long history and is an ongoing topic

among health researchers and organizations to produce more accurate atlas/maps of

morbidity or mortality rates of various infectious diseases. An early example of this topic

is mapping the spatial distribution of cancer mortality in England and Whales by Stocks

(1936) 28. Maps of infectious diseases can illustrate a summary of the complicated

status of spatial distributions of infectious diseases. The disease map can disclose

spatial patterns in the data that are not readily detectable in a tabular format. Some

applications of disease mapping are basic descriptions of the spatial disease

distribution, hypothesis generation of possible relations between factors and health

outcome and risk-prone areas representations useful for budget and resource

allocations 29.

Choropleth maps also known as shaded or thematic maps are widely used to

illustrate mortality and morbidity rates of infectious diseases 26. These maps are

particularly useful to identify health disparities, generating hypothesis and representing

variations of infectious disease counts/rates over time. However, several issues

regarding choropleth maps exist. This representation can be non-informative or to some

20

extent misleading regarding the choice of coloring; classification techniques and the

choice of cut-off values 30. Moreover, shaded mapping for areas with a small population

can lead to high variations of estimated risks compared to the large populations 31.

In the situation of sparse or highly clustered data, Bayesian models can be a

suitable alternative 32. It provides a more robust solution as it can avoid unstable

estimates of rates/risks. These models are increasingly applied in small area

estimations and disease mapping by spatial epidemiologists. They are especially useful

for rare mapping diseases 33. For instance, empirical Bayesian estimation, a type of

Bayesian models, can either shrink unstable risks/rates toward the local mean by

obtaining information from neighboring areas or weights the unstable risks/rates toward

global average from all areas 34. It is evident that the areas with higher rates are less

smoothed than areas with lower rates. Thus, smoothing produces more stable

estimations of disease risks/rates. However, it should be noted that global mean

smoothing can produce over-smoothed rates/risks which can mask informative local

variations 35. An exciting feature of Bayesian methods is that they can help to remove

random components resulting from correlated and unmeasured factors from the maps

36. Another benefit of this method is that it accounts for uncertainty measures associated

with the relative risks with confidence intervals 37. It is evident that including uncertainty

as confidence interval are more reliable for decision makers. Sharmin et al. (2016) used

a Bayesian generalized linear model to adjust for underreporting cases of dengue by

incorporating climate factors in Dhaka, Bangladesh 38. Randremanana et al. (2010)

used the integration of a Bayesian approach and a generalized linear mixed model to

analyze the geospatial distribution of TB in an endemic area, Madagascar. They

21

generated (smoothed) heat maps of TB to compare the association of TB rate with the

nationwide TB indicators 39. The more detailed information regarding Bayesian

approach can be found in Lawson (2013) 40.

Global Clustering Techniques

As already mentioned, although disease mapping can help obtain some useful

information about the spatial pattern of infectious diseases, they can be misleading and

unable to evaluate the significance of the pattern statistically. Spatial dependence or

spatial autocorrelation (SA) is a fundamental concept that is applied to evaluate the

pattern from the statistical viewpoint. SA indicates the association of a feature with itself

located nearby 41. SA proposed by Tobler (1979) as the first law of geography:"

everything is related to everything else but near things are more alike than the distant

things." 42. SA can be positive (i.e., similar attributes are closer to each other), negative

(i.e., different variables are closer to each other), or none (i.e., random distribution). SA

is an essential concept in medical geography and should be investigated in analyzing

spatial data 43. There are several spatial statistic tools to evaluate the presence of SA

(overall pattern) known as global clustering techniques. The null hypothesis significance

(i.e random distribution) is tested against the alternative hypothesis (i.e clustered

distribution).

One of the most straightforward measures of spatial autocorrelation, when the

variable of interest is categorical, is join counts statistics. The null hypothesis of join

counts states that the distribution is random. In this technique, a binary attribute is

classified into black and white colors, and a join (or connections between the zones) is

named as either WW (0,0), BB (1,1), or BW (1,0). By counting the number of WW, BB

and BW, type of SA can be determined. The overall distribution is clustered (i.e.,

22

positive SA) if the number of BW joins is significantly lower than expected by chance 44.

The distribution is dispersed (i.e., negative SA) if the number of BW is significantly

higher than expected by chance. The distribution is random if the number of BW joins

approximately the same as what would be expected by chance. The test of significance

evaluating the BW statistic as a standard deviate is as follows 45:

Z(BW) =BW − E(BW)

�σBW2

(1-1)

In this formula, Z(BW) indicates the magnitude of SA. Join counts statistics have been

applied in the disease studies. Gilbert et al. (1994) used join counts statistics to

evaluate the spatial distribution of canker disease of trees in Panama. They counted the

number of cankered-cankered tree joins (BB), healthy-healthy tree joins (WW), and

cankered -healthy tree joins (BW) and compared them with the expected value 46.

Global Moran’s statistic is the most widely used measure of SA in continuous

variables which relies on both the variable's location and attribute 47. This index ranges

from -1 to +1. In general, the values close to the value of -1 indicate a dispersion, the

values close to the value of +1 shows a clustering, while the values close to the value of

0 indicate a random pattern 48. The null hypothesis states that the distribution is random.

This index is calculated as 47:

I =N∑ ∑ wij(xi − x�)(xj − x�)N

j=1Ni=1

∑ ∑ wijNj=1

Ni=1 ∑ (xi − x�)2N

i=1 (1-2)

Where N is the count of spatial units 𝑖𝑖, 𝑗𝑗; 𝑥𝑥 is the variable of interest (e.g., malaria

incidence rate); 𝑤𝑤𝑖𝑖𝑖𝑖 is the connection between units 𝑖𝑖 and 𝑗𝑗. The spatial weight matrix

can be constructed in two main ways 49: 1) contiguity-based neighbors: sharing a border

(rook) or sharing a border or point (queen) 2) distance-based neighbors: k-nearest

23

neighbors or buffer (threshold distance). Many studies utilized this index to investigate

the overall pattern of infectious diseases. For instance, Naish et al. (2014) used global

Moran's statistic to assess the presence of SA in the study of dengue incidence rates in

Queensland, Canada and found a positive SA 50.

Geary’s C is another common method for measuring SA which has similar

calculations to Moran's I 51. The value of this statistic ranges from 0 to 2 where the

values close to 0 shows a positive SA, the value close to +2 shows a negative SA, while

the index value close to +1 shows a random distribution 52. Therefore, this index is

inversely related to the global Moran’s statistic. The null hypothesis expresses that there

is no SA (i.e., C=1). Using the same notation as for global Moran’s I, the Geary’s C

statistic is computed as follow 51:

C =(N− 1)∑ ∑ wij(xi − xj)2N

j=1Ni=1

2(∑ ∑ wijNj=1

Ni=1 )∑ (xi − x�)2N

i=1 (1-3)

Compared to Moran’s I, which is a global indicator of SA, Geary’s 𝐶𝐶 is highly sensitive to

the difference in neighbors 53. Simões et al. (2004) applied Geary's 𝐶𝐶 to investigate the

distribution of paracoccidioidomycosis, a fungal infection, in southern Brazil. Their

finding rejected the null hypothesis of SA and showed a significant spatial

autocorrelation 54. Other global clustering techniques are widely used in assessing

overall pattern are average nearest neighbors 55, K-function (used in point pattern

analysis) 56; general 𝐺𝐺 57 and Cuzick-Edwards 58.

Local Cluster Detection Techniques

In global clustering techniques, only a single SA value is assigned to the entire

pattern. Thus, global techniques are unable to identify the location of clusters. While, in

local measures of clustering (i.e., cluster detection techniques), an SA is calculated for

24

each areal unit and its significance is evaluated with statistical tests. Also, the

techniques can identify hotspot(s), coldspot(s) or outlier(s) in a pattern.

Perhaps one of the simplest ways to visualize the location of clusters is by using

density function such as kernel density estimation (KDE). In the KDE method, a

weighting kernel function is fitted over each point or line. The weight decreases as we

move away from the point/line feature 59. This method is criticized for the subjective

choice of search radius (bandwidth) and not providing statistical evaluation of the

results. As an example, Aikembayev et al. (2010) used KDE in a GIS environment to

identify the areas of outbreak concentration of Bacillus anthracis by livestock species,

Kazakhstan 60.

The Geographic analysis machine (GAM) is an automated cluster detection

technique developed by Openshaw et al. (1987) for point data. The method developed

to identify spatial clusters of childhood leukemia incidence. In this method, a two-

dimension grid is superimposed over the study area. Then, a series of different circles

with various sizes is generated and scans the whole study area. The observed intensity

of events within each circle is compared with a threshold based on Monte Carlo

simulation. If the observed intensity exceeds the threshold, it draws a circle on the map.

The final output of GAM is the map of significant circles 61. A significant disadvantage of

this technique is that no conclusion can be drawn about the significant level of clusters

62.

Local indicator spatial autocorrelation (LISA) is perhaps the most widely used

cluster detection technique for identifying clusters of various infectious diseases. The

statistic is based on the decomposition of global Moran's I for each areal unit 63. LISA

25

coefficients for each unit 𝑖𝑖 can be calculated as the deviation of values from the mean of

neighbors 63:

Li =(yi − y�)∑ wij(yj − y�)n

j=1,j≠i

si2 (1-4)

si2 =∑ (yi − y�)2nj=1,j≠i

n − 1 (1-5)

One of the advantages of this statistic compared to some other cluster detection

techniques is its ability to identify outliers (i.e., areas with high values of an attribute are

surrounded with low values and vice versa). Using LISA, Hassarangsee et al. (2015)

detected hotspots of tuberculosis incidence in several districts in Thailand 64. Szonyi et

al. (2015) identified several outliers of Lyme disease in western Texas by applying this

technique 65.

Getis-Ord 𝐺𝐺𝑖𝑖∗ statistic 66 is one of the most popular hotspot detection methods. It

is a polygon-based method; however, point events can be aggregated by

superimposing a grid. The statistic is calculated as follows 66:

Gi∗ =

∑ wijxj − X�∑ wijnj=1

nj=1

S �[n∑ wij

2 − �∑ wijnj=1 �

2]n

j=1n − 1

(1-6)

S = �∑ (xj − x�)2nj=1,j≠i

n − 1− x�2 (1-7)

The statistic can identify both hotspots and cold spots. The 𝐺𝐺𝑖𝑖∗ is usually standardized to

z-scores (of normal distribution) so that a large (positive) Z-score is corresponded to a

significant hotspot (i.e. Z-score>1.96 standard deviation); a large negative value of z-

score is associated with a location as coolspot (Z-score < -1.96 standard deviation); and

26

value close to zero shows a location that is neither hotspots nor coolspot. Sun et al.

(2017) utilized the Getis-Ord Gi* statistic to identify hotspots of the dengue fever

epidemic in Sri Lanka 67.

One of the novel tools to determine statistically significant elevated disease rate

is spatial scan statistic (SSS). The tool developed by Kulldorff (1997) 68 for point and

polygon units in SaTScan software. The SSS can also find significant clusters of

individual health events (i.e., in case-control studies) using the Bernoulli model. In this

method, circular windows with varying radius sizes or ellipsoidal windows in different

shapes and directions are drawn around each point or centroid of polygon data. The

size of the window is changed from 0 to the specified cluster size. For each window, a

maximum likelihood test is applied to compare the risk within the window with the

outside. Monte Carlo simulation is used to test for a significance level of clusters. The

cluster with maximum likelihood is considered as primary clusters and the rest of the

cluster(s) as a secondary cluster(s) 69. A significant advantage of this method over the

most cluster detection techniques is that it can take confounding into account by

adjusting for variables 70.

Space-Time Clustering Techniques

Space-time analysis of disease clusters is another crucial topic in epidemiological

studies which is usually overlooked. Space-time analysis is an ongoing research topic

which provides useful insight for public health decision-makers into how an infectious

disease propagates over the landscape. It can help to find direction and periodic

patterns helpful for predictions. Some space-time clustering techniques have been

proposed. In the following section, we have provided a general description of two widely

used space-time cluster detection techniques.

27

Knox method (1964) 71 is a primary method that is used for examining space-time

interactions (clusters) than expected by chance. Space-time interactions exist when

many of cases that are close in time are also close in space. Therefore, in this method

for each pair of cases, the interval (i.e., time and space interval) is examined. A 2*2

contingency table is constructed with cross-comparison proximity in space and time

(Figure 1-1). The number of pairs for each situation is counted, and the actual number

of pairs that fall into each category is compared with the expected number (i.e., cross

products of rows and column totals). Then a final Chi-square test with (n-1)*(r-1) degree

of freedom measures a significant level of difference. The CrimeStat software

(https://www.icpsr.umich.edu/CrimeStat/) allows users to apply Knox test. This

approach has been criticized due to the subjective cut-off value for closeness in space

and time.

Similar to the pure spatial scan statistic, in space-time approach a window scans

the whole study region, and the risk within the window is compared with the risk outside

of the window 72. However, instead of a 2-dimension circle/ellipse window, a cylindrical

window with the base of a circle/ellipse is used. The base of the cylinder depicts space

while the height corresponds to time. Also, p-value for each cylinder is computed using

Monte Carlo simulation. Compared to the Knox test, this method accounts for possible

geographic/temporal population heterogeneity (i.e., different population growth in

different regions) 73.

Environment and Infectious Diseases

The occurrence of infectious diseases represents a serious public health

challenge. Infectious diseases annually kill over 15 million people (>25% of the total

death), worldwide 74. The number is progressively growing, and the geographic

https://www.icpsr.umich.edu/CrimeStat/

28

distribution is expanding to non-endemic areas 75. One strategy to fight against the

infectious diseases is to monitor and control the causative agents such as a virus,

bacteria of the disease. In this regard, identifying geographic patterns of infection and

underlying risk factors such as environmental and socio-economic factors is helpful.

Layers of environmental data can be obtained from various sources such as landcover

data from remote sensing, weather data from weather stations or WorldClim, land use

data from government records, soil maps from the department of agriculture, field

investigation, etc. The data are provided in different formats like excel, grid, shapefile,

text, etc., then using spatial analysis tools like GIS they are integrated and combined for

further analysis. Another strategy for control and monitor infectious diseases is the

surveillance of risk factors for diseased patients or deaths. An explanation for this

approach is because of a strong correlation between underlying risk factors and the

distribution of human cases (rates). For instance, to control Lyme disease in human,

monitoring and controlling Ixodes scapularis population is applied as a proxy for disease

morbidity.

The diseases (or parasites/vectors) highly dependent on the local and global

environment and consequently can influence disease prevalence in a community 76.

Environmental determinants (such as weather/climate, vegetation) can provide

favorable conditions for breeding, feeding, resting sites of certain vector-borne

diseases. However, it should be noted that socio-economic factors such as race,

gender, poverty, lifestyle (such as drinking alcohols or smoking) can play an essential

role in the occurrence of infectious disease such as Tuberculosis 77. Climate change is

an essential determining factor in explaining variations of most infectious diseases

29

especially vector-borne diseases 78. It can lead to migration of pathogens (causative

agents) and animals to new areas and consequently expansion of the diseases to

uninfected areas 79. The relationship between climate condition and infectious diseases

have levels of complexity. For instance, weather and climate factors can have different

influences on the vector or pathogen abundance of vector-borne diseases. The

differences are mainly because of differences in the life cycle or life stage of vectors.

For instance, four stages of tick-borne diseases’ life cycle (i.e., eggs, larvae, nymph,

adult) last two years of being completed. While the life stage of mosquitos (i.e., eggs,

multiple larvae, pupae, adults) only take a few weeks to a few months to be completed.

This difference indicates that mosquito’s abundance responds to short-duration

variations in weather or climate, while tick’s population reacts to longer-term changes in

weather condition and with little inter-variations 80. Also, extreme high/low temperature

prevents both mosquitoes and ticks host for blood meal seeking 81. Under other climate

conditions, meteorological factors can have relatively complex effects on mortality rates.

In harsh weather conditions, all the four stages of tick's life remain secured (because

ticks spend most of their life time under layers of soil), while the only dipteran can seek

refuges. Thus, the direct consequences of temperature have fewer impacts on the

survival of the tick's population compared with mosquitoes' abundance. As almost all

development stages of mosquitoes are affected by the presence of stagnant water, the

reproduction rates depend on the amount of rainfall 82. However, heavy rainfall can flush

away larval and eggs. Therefore, reproduction rates of ticks cannot be influenced by

weather changes apart from long-lasting impacts on host densities 80.

30

Disease Modeling

Although correlation analysis can indicate associations between factors and

disease morbidity/mortality, the associations do not indicate causality. Disease

modeling is an advanced level of spatial analysis in medical geography 83. It includes

testing generated hypothesis obtained from cluster detection or observations. Disease

modeling can involve the integration of previously mentioned tools (i.e., GIS, RS, and

GPS) with statistical and epidemiological analysis. Modeling disease occurrence helps

medical geographers to describe morbidity/mortality rates and identify underlying

factors. It can also be used for predictions: classification (such as presence/absence of

vectors) or regression (such as predicting disease rate/count). The prediction can be

spatial (i.e., for locations with unknown mortality/morbidity) or spatiotemporal (i.e., future

(near/far) status of disease) or for other areas (projection). Space-time predictions can

be used for early detection of infection or outbreak 84. Forecasting disease occurrence

can provide valuable guidelines for public health decision makers for cost-effective

planning and targeted interventions of future outbreaks.

Knowledge-Driven Models

In general, disease modeling techniques can be classified into two categories:

data-driven and knowledge-driven models 85. According to Pfeiffer, data-driven models

are mainly based on statistical analysis to quantify the relationship between disease

morbidity/mortality and underlying factors. While knowledge-driven models use

knowledge of experts as evidence 86. These models depend heavily on environmental

layers and are widely used in spatial epidemiology for mapping suitable areas for

disease/vector transmission. Some techniques such as multi-criteria decision analysis

(MCDA) 87 can use experts' knowledge to provide a habitat suitability map by converting

31

knowledge to decision rules. Analytical hierarchy process (AHP) proposed by Saaty

(1990) is one of the most common ways to define the weights of factors 88. By overlying

(combining) factors based on their corresponding weights, suitability for each pixel in

the study area can be computed. Moreover, uncertainty associated with variables, the

relationship between independent factors and the dependent variable, and the degree

of risk can be modeled by incorporating fuzzy logic 89. One of the significant limitations

of the knowledge-driven methods is that incorporating risk factors, weights and type of

membership functions of fuzzy rules are subjective. Knowledge-driven models have

been applied to modeling several infectious diseases. Clements et al. (2006) used them

in modeling rift-valley fever in Africa 90, Mollalo and Khodabandehloo (2016) applied

them in the study of zoonotic cutaneous leishmaniasis in an endemic area in Iran 91; and

Rakotomanana et al. (2007) applied them in modeling malaria vector control in

Madagascar 92.

Data-Driven Models

Data-driven models can generally be classified into two main categories: 1)

presence-absence models 2) presence-only models. Here, we first describe the

characteristics of several presence-absence models.

Classification and regression tree (CART) is a decision-tree based method that

can be applied for solving both classification and regression problems in spatial

epidemiology. CART is particularly useful in working with noisy data (such as datasets

with missing values or highly skewed data) and extensive and complex dataset. CART

is flexible in modeling a variety of response variables such as categorical, survival and

continuous data. Another advantage of CART is that the results obtained from the

model are easily interpretable. However, it should be acknowledged that the model

32

does not have a statistical test for the significance level of variables and is subject to

overfitting 93. Other features of CART have been discussed in detail in Franklin (2009)

94.

One of the significant problems of CART is attributed to the low predictive ability

and high variations in results. To address this drawback, bagging, boosting and random

forest can be useful. These algorithms generate numerous decision trees by random

selection of variables and random selection of samples with replacement (bootstrap

sampling). In bagging, in each repetition of sampling, one-third of data is used for

testing the model performance (i.e., out-of-bag samples). Boosting is like bagging: in

bagging, samples have equal weights, while in boosting samples are weighted. Random

forest (RF) is a type of bagging 95-97.

Support vector machine (SVM) is another data-driven model that is used for

binary classification problems. In SVM, a dataset of high dimensional points viewed as

vectors {𝑥𝑥𝑖𝑖∈𝑅𝑅𝑑𝑑:𝑖𝑖=1,…𝑛𝑛}, 𝑑𝑑>1, where each point belongs to one of two classes defined

by {𝑦𝑦𝑖𝑖∈{0,1}:𝑖𝑖=1,…𝑛𝑛}. Here 𝑦𝑦𝑖𝑖 corresponds to the presence/absence of points. If we

assume these points to be linearly separable (i.e., can be separated via a linear

boundary), the goal of SVM is to find the d-dimensional hyperplane maximizing the

margin (i.e., the distance between the closest points or support vectors) as illustrated in

Figure 1-2. More detailed explanations can be found in Noble (2006) 98.

Artificial neural networks (ANNs) is another data-driven model that works with

presence-absence data. More detailed information of ANN is provided in Chapter 4 of

this dissertation.

33

Presence-only model in data-driven techniques is a relatively new concept.

Traditional statistical models such as the generalized linear model (GLM), however,

have a well-established algorithm, require presence and absence of the diseases (or

vector) data. However, collecting absence data can be very expensive and may not

cover most of the study region 85. Species distribution model (SDMs) can be used for

modeling presence-only data and a vast area. They are highly applied in conservation

and ecology for modeling geographic distribution of species (i.e., disease macro-

organism). Another advantage of the SDM model is that they can be used for projection

(i.e., extrapolation beyond the study region) with decent accuracy 99.

Ecological niche models (ENMs) use background or pseudo-absence data

instead of real absence data to predict niche (i.e., conditions that enables species to

maintain population without immigration) 100. These models often are developed at

coarse spatial scales. Most important ENMs are ecological niche factor analysis

(ENFA), genetic algorithm for rule-set production (GARP) and maximum entropy

(MaxEnt).

ENFA assumes that species occurrence is not random and is clustered in a small

section of the study region. ENFA extracts marginality and specialization. Marginality

would be close to 1 when variables contribute to the life of species in a specific region,

and the value would be close to 0 when species could be found everywhere.

Specialization shows species tolerance. Thus, it can summarize all the variables into a

few factors 101. As an example of the application of ENFA in infectious diseases, Ayala

et al. (2009) used ENFA to provide habitat suitability map of malaria species in

Cameroon 102.

34

GARP uses four rules to select or reject, evaluate, and test rules from the final

model. The set of rules with the highest predictive performances are selected and used

for habitat suitability. GARP can obtain high accuracy even with a small sample size. It

can also be used for projection or extrapolation. More information about GARP has

been provided in Stockwell and Noble (1992) 103 and Stockwell and Peterson (1999) 104.

Maximum Entropy (MaxEnt) is another presence-only ecological niche model

(ENM) that is widely used in ecology and geography. In this model, the contribution of

each explanatory variable to the final model is computed by removing each variable and

examining the AUC of the model. The MaxEnt can be used even with minimal sample

size (n<100) and is easily interpretable 105. MaxEnt has been successfully used in

predicting the distribution of several infectious diseases (vectors) such as leishmaniasis

in France 106 and the West Nile virus in Iowa 107. More detailed information presented in

Phillips et al. (2006) 108.

Although, many published pieces of research have examined the relationships

between environmental conditions and infectious diseases, novel techniques such as

machine learning techniques have been underutilized in spatial epidemiology. In this

dissertation, we examine three different infectious diseases (i.e., Lyme disease,

zoonotic cutaneous leishmaniasis, tuberculosis) from a spatial perspective and at three

different levels. Table 1-4 summarizes some applications of GIS, RS, and machine

learning techniques in the study of the mentioned infectious diseases.

Research Questions and Hypotheses

The main research questions and hypotheses explored in this dissertation are as

follows:

35

In Chapter 2, we examine changes in the spatial distribution of significant spatial

clusters of Lyme disease (LD) incidence rates in Connecticut at the town level from

1991 to 2014 as an approach for targeted interventions. Spatial distribution of LD in CT

has been rarely investigated. Thus, we intend to help policymakers for targeted

interventions by prioritizing towns with high incidence rate. The primary objective of the

first paper is to integrate GIS and ESDA to better describe changes in the spatial

pattern of LD, supposing the passive cases of LD serve as a spatially random sample of

the infection in the state. Our specific research questions in Chapter 2 include:

• What is the spatial distribution (random/ dispersed/ clustered) of LD incidence over the past 24 years?

• If not-random, where are the locations of the clusters/hotspots in CT?

• Can we avoid or minimize the effects of edges in the spatial scan statistic technique?

• Which cluster detection technique (LISA/ spatial scan statistic) is more sensitive, specific, and accurate?

In Chapter 3, we compare three popular machine learning classifiers (i.e. SVM,

LR, and RF) in predicting spatial distribution of the Phlebotomus Papatasi, one of the

major vectors of zoonotic cutaneous leishmaniasis (ZCL). Support vector machine and

random forest are two-widely used classifier in environmental studies but have been

rarely applied for studying the species geographic distribution. We suppose that

potential sampling and measurement errors during collecting sand flies do not have

much influence on the accuracy of the classifiers. Also, we assume that ecological

factors used in this study are sufficient predictors of Ph.paptasi. The main research

questions in Chapter 1 are:

• Which machine learning technique (i.e., SVM, LR, and RF) is more sensitive, specific, and accurate in predicting Ph.papatasi in the study area?

36

• What are the most ecological contributing factors in predicting Ph. Papatasi?

• Where are the high-risk areas of Ph. papatasi in the study area?

In Chapter 4, we integrated geographic information system (GIS), spatial

statistics, and artificial neural networks (ANNs) in analyzing TB distribution in the US to

assist national TB control programs. ANN is another data science technique which is a

favorite modeling tool in other scientific domain but has never been investigated for

geographic modeling of TB. We examine the spatial distribution of the disease and

applicability of machine learning techniques in TB modeling with the following

assumptions 1) all the reported county-level TB incidence rates represent the status of

TB in the US and 2) ecological and socio-economic factors influence the TB infection.

Our specific research questions in Chapter 4 include:

• What are the most important ecological and socio-economic contributing factors in predicting TB incidence rate in the US?

• Which modeling technique (i.e., multilayer perceptron (MLP) or linear regression) is more accurate in predicting TB incidence rate?

37

Tables Table 1-1. Application of GIS in the study of infectious diseases

Basic GIS Function Applications in Infectious Diseases

Organizing data from multiple sources and in different formats

Sofizadeh et al. (2016) 109 developed a geodatabase of sandflies and explanatory variables to study the distribution of Phlebotomus Paptasi. Sandflies were collected during the field operation. The explanatory data were climatic conditions and the elevation layer. These data were obtained from the WorldClim database. Normalized differentiated vegetation index (NDVI) was used to reflect vegetation cover and obtained from MODIS satellite products.

Computing slope, and aspect of study region as a candidate factor

Hasyim et al. (2016) 110 derived topographic features (slope and aspect) of the resampled digital elevation model (DEM) of the study area as two explanatory variables in modeling malaria cases with ordinary least square and geographically weighted regression models.

Calculating Euclidean distance to water bodies

Franke et al. (2015) 111 used Spatial Analyst Tool, to compute Euclidean distance to inland and wetland waters for malaria risk modeling.

Buffer analysis Nakahapakor and Tripathi (2005) 112 applied 500 m and 1000 m buffering operation to identify the geographic environment conditions surrounding village affected by Dengue fever.

Overlay of environmental layers

Palaniyandi et al. (2014) 113 used overlay analysis of various environmental factors to map potential breeding areas of vector-borne diseases in an endemic area in India.

Identifying spatial pattern and clusters

Mollalo et al. (2017) 114 used global and local Moran's I to identify geospatial pattern and location of clusters of Lyme Disease incidence rate in Connecticut in using Spatial Statistics tool in a GIS environment.

WebGIS Li et al. (2013) 115 established a decision support system, as an accessible and inexpensive approach, for the response to infectious disease surveillance based on WebGIS and intelligent mobile services.

38

Table 1-2. Application of RS in the study of infectious diseases RS Function Applications in Infectious Diseases

NDVI and LST Mollalo et al. (2018) 24 derived NDVI and LST from MODIS satellite images and used it for prediction habitat suitability of Phlebotomus Papatasi, in an endemic area in Iran.

Sea surface temperature and Sea surface height

Lobitz et al. (2000) 116 linked public remotely sensed data (Sea surface temperature and height) with cholera cases in Bangladesh.

Crop disease management

Franke and Menz (2007) 117 used high-resolution multi-spectral data to detect in-field heterogeneities of crop diseases over time.

Image Classification

Hugh-Jones et al. (1992) 118 used a land cover map derived from a Landsat TM image to distinguish grazing areas with several levels of animal’s tick infestation.

Tick habitat suitability map

Using Landsat TM satellite images, Glass et al. (1995) 23 indicated that many proper tick habitats coincide with residential properties in proximity to the forested areas in Baltimore County, Maryland.

Table 1-3. Application of GPS in the study of infectious diseases GPS function Applications in Infectious Diseases

Mapping Coburn and Blower (2013) 119 used handheld GPS to establish geographic coordinates at each sampling sites to map HIV epidemics in sub Saharan Africa.

Dynamic mobility network

Paz-Soldan et al. (2010) 120 used GPS device to quantify human mobility among tracked individuals to study dengue virus transmission in Peru.

Outbreak investigation

Masthi et al. (2015) 121 used GPS along with google earth to investigate outbreak of cholera in a village in India to accurately pinpoint the location of household of cases and follow up them.

39

Table 1-4. Application of GIS, RS, Spatial statistics and machine learning in the study of infectious diseases (Lyme disease, leishmaniasis, tuberculosis)

Reference Infectious disease

Aim Study Area Techniques/tools

Garcia-Marti et al. (2017) 122

Lyme disease Mapping and modeling tick dynamics using volunteered data

Netherlands

Random forest; Remote sensing (NDVI); Volunteer geographic information

Ostfeld et al. (2006) 123

Lyme disease

To investigate the effect of variations of temperature, humidity, and deer and mice on the risk of Lyme disease

New York

GPS (field data); GIS; Spatial statistics (density mapping)

Pepin et al. (2012) 124

Lyme disease To investigate relations between human Lyme disease incidence and density of nymphs

36 eastern states

GPS (field data); GIS; Spatial statistic (zonal statistics, density mapping of nymphs; spatial autocorrelation); Modeling (negative binomial model)

Ramezankhani et al. (2018) 125

Leishmaniasis Predicting cutaneous leishmaniasis incidence based on environmental factors

Isfahan, Iran Decision trees Remote sensing (NDVI)

Nieto et al. (2006) 126

Leishmaniasis To predict spatial distribution and potential risk of visceral leishmaniasis

Bahia, Brazil Ecological niche model (GARP); GIS

Paixao Seva et al. (2017) 127

Leishmaniasis To predict future status of visceral leishmaniasis and identify underlying risk factors

Sao Paulo, Brazil

Spatio-temporal Bayesian model; GIS

Yamamura et al. (2016) 128

Tuberculosis

To explain spatial pattern of hospitalization due to tuberculosis, and to identify spatial and space-time clusters of TB

Riberiaro, Brazil

GIS; Spatial statistics (spatial scan statistics)

Patterson et al. (2017) 129

Tuberculosis To investigate social behavior of individuals who developed TB

A town in South Africa

GIS, GPS (locations of houses)

40

Figures

Figure 1-1. Contingency table of Knox method

Figure 1-2. Principle of linearly separable SVM using maximum margin

41

CHAPTER 2 A 24-YEAR EXPLORATORY SPATIAL DATA ANALYSIS OF LYME DISEASE

INCIDENCE RATE IN CONNECTICUT

Lyme disease (LD), a tick-borne, bacterial, zoonotic infection, remains a serious

challenge for public health.∗ The disease is distributed globally, predominantly in

temperate portions of the Northern Hemisphere such as Europe, Canada and USA 130.

In the United States (US), the geographical distribution of LD is primarily confined to the

north-eastern and mid-western areas 124. Past studies have shown that in these areas,

LD is caused by Borrelia burgdorferi sensu stricto. The pathogen is mainly transmitted

to humans during blood meals by the bite of infected blacklegged ticks (Ixodes

scapularis) with white footed mice (Peromyscus leucopus) serving as the primary

reservoir for this bacterium 131. The disease is the most common vector-borne disease

in U.S. with an estimated average number of 30,000 new cases every year 132;

however, the genuine number is likely much higher. The mean incidence rate of LD in

the top 13 U.S. states with the highest incidence rate during 2005-2009 progressively

rose from 29.6±10.6 per 100,000 in 2005 to 49.6±15.5 per 100,000 in 2009 124. At the

same time, in 11 states with the lowest incidence rate, the mean incidence developed

from 1.3±0.7 to 2.3±1.7 per 100,000 individuals 124. Although this common zoonotic

disease rarely leads to death, it can cause severe symptoms related to skin, joints and

heart in addition to anxiety and depression if untreated 133. LD can also be a

socioeconomic burden to society.

∗ Reprinted with permission from Mollalo, A., Blackburn, J. K., Morris, L. R., & Glass, G. E. (2017). A 24-year exploratory spatial data analysis of Lyme disease incidence rate in Connecticut, USA. Geospatial health, 12(2).

42

In recent years, exploratory spatial data analyses (ESDA) to describe spatial

patterns of LD has increased significantly as a strategy to improve our understanding of

disease transmission and risk. Several recent studies from different parts of the U.S.

have examined the spatial pattern of LD using ESDA. For example, Kugeler and

colleagues (2015) applied circular scan statistics to detect high-risk counties of LD in

the U.S. from 1993 to 2012. They showed that the number of counties with high

incidence of LD successively increased from 69 (1993-1997) to 130 (1998-2002) and

further to 197 (2003-2007) and 260 (2008-2012) counties, respectively 134. In Texas,

which is a nonendemic LD area, Szonya et al. (2015) applied global and local Moran’s I-

tests to determine the distribution and location of possible clusters, respectively, with

respect to the spatial distribution of LD at the county level (2000-2011). They observed

a clustered distribution with a high incidence cluster in central parts of the state, mainly

in a cross-timbers eco-region 65. In Virginia, Li et al. (2014) utilized the Empirical Bayes

smoothing (EBS) method on census tract LD cases with the aim of lessening random

variations, especially in censuses with small populations. Then, they applied space-time

scan statistic and found a primary cluster in northern Virginia which had experienced

population growth and urban-sub-urban improvements between 2008 and 2011 135.

The town of Lyme in Connecticut was the first spot that LD was recognized in the

U.S. The initial cluster in 1976 was observed in children 136. Since then, in spite of all

endeavors conducted by the Connecticut Department of Public Health (CTDPH) to

control the disease, it remains endemic with substantial morbidity rates. Although LD is

a well-investigated epidemiological subject in Connecticut, historical changes in patterns

of disease have been minimally studied 137. Additionally, even though geographical

43

information systems (GIS) is a useful tool to study infectious diseases 1, powerful GIS-

based studies of LD from this region are insufficient for prioritizing counties for

intervention. Thus, the main objective of this study is to use the combination of GIS and

ESDA to better describe changes in the spatial pattern of LD, supposing that the

reported passive cases of LD represent a spatially random subsample of the disease in

the state. Our specific research questions included: 1) what is the spatial distribution

(random/ dispersed/ clustered) of LD incidence over the past 24 years?; 2) If clustered,

where have the clusters/hotspots occurred? 3) Can we avoid or minimize the effects of

edges in the spatial scan statistic technique?; and 4) Which cluster detection technique

(LISA/spatial scan statistic) is more sensitive, specific and accurate?

Materials and Methods

Data Collection and Preparation

We used passively reported indigenous LD cases over a period of 24 years from

1991 to 2014 throughout the state of Connecticut. We retrieved data from the CTDPH

containing yearly counts and rates of LD at the town level. The CTDPH has a well-

established LD surveillance system operating since 1987 138. Reports of cases were

based on the National Surveillance Case Definition from the Centers for Disease

Control and Prevention (CDC) for LD 139-141. Data were geocoded and grouped into four

equal intervals (each period included six years: 1991–1996, 1997– 2002, 2003–2008

and 2009–2014) to further explore clustering, possible clusters and how hotspots had

changed.

Administrative boundaries of towns were obtained from the Map and Geographic

Information Center of Connecticut GIS data using the shapefile format

(http://magic.lib.uconn.edu/). Similarly, annual population statistics were downloaded

http://magic.lib.uconn.edu

44

from CTDPH (http://www.ct.gov/dph/site/default.asp). The study area and the names of

the towns mentioned in this paper are shown in Figure 2-1.

Global Clustering

We applied global clustering techniques to statistically evaluate whether the

existing pattern of LD incidence was random, clustered, or dispersed. We used the

global Moran’s I statistic 142. to measure spatial autocorrelation using GeoDa software

version 1.6.7 143. The null hypothesis assumes that there is no spatial pattern among

the incidence of LD in different towns (i.e. complete spatial randomness) 144. This

statistic employs a covariance term between each town and its neighbours as follows

145:

𝐈𝐈 =𝐍𝐍𝐒𝐒𝟎𝟎

∑ ∑ 𝐰𝐰𝐢𝐢𝐢𝐢 (𝐱𝐱𝐢𝐢 − 𝐱𝐱�)(𝐱𝐱𝐢𝐢 − 𝐱𝐱�)𝐍𝐍𝐢𝐢=𝟏𝟏,𝐢𝐢≠𝐢𝐢

𝐍𝐍𝐢𝐢=𝟏𝟏

∑ (𝐱𝐱𝐢𝐢 − 𝐱𝐱�)𝟐𝟐𝐍𝐍𝐢𝐢=𝟏𝟏

(2-1)

𝐒𝐒𝟎𝟎 = ��𝐰𝐰𝐢𝐢𝐢𝐢

𝐍𝐍

𝐢𝐢=𝟏𝟏

𝐍𝐍

𝐢𝐢=𝟏𝟏

(2-2)

where xi and xj are incidences of LD in the ith and jth towns, respectively; N the

aggregate number of towns; and wij the spatial neighborhood weight for towns i and j

generated based on the first order Queen’s contiguity which shares all common points

including boundaries and vertices. The generated spatial weight is used as a criterion

for recognizing neighbors of each town. The weight is defined taking into account

adjacent neighbors and written as:

𝐰𝐰𝐢𝐢𝐢𝐢 = � 𝟏𝟏 𝐢𝐢𝐢𝐢 𝐢𝐢 𝐚𝐚𝐚𝐚𝐚𝐚 𝐢𝐢 𝐚𝐚𝐚𝐚𝐚𝐚 𝐚𝐚𝐚𝐚𝐢𝐢𝐚𝐚𝐚𝐚𝐚𝐚𝐚𝐚𝐚𝐚 𝐚𝐚𝐚𝐚𝐢𝐢𝐧𝐧𝐧𝐧𝐧𝐧𝐧𝐧𝐧𝐧𝐚𝐚𝐧𝐧 𝟎𝟎 𝐧𝐧𝐚𝐚𝐧𝐧𝐚𝐚𝐚𝐚𝐰𝐰𝐢𝐢𝐧𝐧𝐚𝐚 (2-3)

The Moran’s I index varies between -1 and +1, with 0 showing spatially random

distribution, while negative values indicate dispersed distributions and positive values

http://www.ct.gov/dph/site/default.asp

45

for clustered distributions. We assessed significance of the index using both the Z-score

and P-value.

Local Clustering

Spatial smoothing

The global clustering techniques provide information about the overall distribution

of LD (random, clustered or dispersed), but we were also interested in identifying local

clusters. First, we applied the EBS routine to account for variation in town sizes and

populations. Contrasts in population size among the spatial areal units (i.e. towns of

Connecticut) may lead to variance instability and spurious outliers 143. This is due to the

observed raw rate in spatial areal units with small population being profoundly affected

by small changes of adding or removing few cases. Thus, crude rates might not reflect

underlying risk compared with other areal units with large populations. EBS provides a

solution to avoid this type of possible bias as it adjusts the estimated risk toward the

global mean to reduce variance instability 63; areas with low population are adjusted

more than areas with larger populations. Since there was a considerable difference in

areas of some towns (e.g., Derby and New London are approximately 5 mi2 while

Woodstock and New Milford cover more than 60 mi2) and also population size of towns

(e.g., Union and Canaan have about 1,000 people, whereas New Haven and Bridgeport

have more than 120,000 individuals) applying the EBS is justifiable. We calculated

spatial weights for each time interval using the first-order Queen’s contiguity. EBS

smoothed rates were employed in local cluster detection analyses.

Local moran’s

To detect local clusters of the LD rates smoothed by EBS, we applied Anselin’s

local indicator of spatial autocorrelation (LISA) statistics 143. LISA identifies hotspots

46

(towns with a high incidence surrounded by high incidences); coldspots (towns with a

low incidence surrounded by low incidences) and outliers (towns with a high incidence

surrounded by low incidences, or towns with a low incidence surrounding by high

incidences). We used GeoDa (https://geodacenter.github.io/) for LISA analyses. We set

the number of permutation tests to 999 and 95% significance level (p<0.05). We

mapped significant clusters using ArcGIS 10.2 (ESRI, Redlands, CA, USA).

Spatial scan statistics

We were interested in comparing LD clusters identified by LISA and the Spatial

Scan Statistic (SSS), for which we used an ellipsoidal, moving window situated on the

centroid of each town so that at any point the window incorporated different sets of

neighbors. At each position, the radii of ellipse was set to vary continuously from 0 to a

maximum that never included more than half of the total population at risk. If the window

contained the centroid of the neighboring towns, then that whole town was included.

This procedure produces a very large number of ellipsoidal windows and each one can

be a possible cluster of LD with different set of neighbors. The ratio of the length of

longest to the shortest axis of the ellipse was 1.5, 2, 3, 4 or 5. For each shape, a

different number of angles (i.e. the angle between the horizontal east-west line and the

longest axis of the ellipse) of the ellipse were also tested. For each ellipse, a likelihood

ratio statistic was computed based on the number of observed and expected cases

within and outside the ellipse. The null hypothesis (i.e. LD incidence is equal inside and

outside of the window) was tested against the alternative hypothesis that the risk was

elevated within the ellipse. The likelihood ratio is reported by P values calculated based

on the Monte Carlo simulation approach which finds the maximum likelihood ratio over

the entire study region. The ellipse with the maximum likelihood was signified the most

https://geodacenter.github.io/

47

likely (primary) cluster. This approach also detected secondary clusters which were

additional ranked clusters that had high likelihood ratios but did not overlap the primary

cluster 147.

Three sets of data were built for the analysis based on discrete Poisson

probability model in SaTScan software version 9.4.2 147. The datasets were: Case file

representing the annual number of cases of LD for each town (n=169) from 1991 to

2014; Coordinate file was the two-dimensional Cartesian coordinates of centroid of each

town; and Population file was the population size of each town. We used the mean

population and mean number of cases per time interval. To scan the study area, we

applied two criteria: no geographic overlap between clusters and 5% of the population

as the maximum to search for hotspots with the aim of comparing the results with

LISA’s high-high clusters as the LISA analysis only considered the neighboring towns.

Also in another run, we used no geographic overlap between clusters and 50%

population at risk to investigate whether the results depended on the primary settings.

To ensure statistical power, the number of Monte Carlo replications was set to 999 and

only clusters at 99% confidence interval were considered (p<0.01).

Accuracy Assessments

Table 2-1 shows a 2x2 confusion matrix used to compare the performance of

LISA and SSS (for 5% and 50% of populations at risk) to detect hotspots, sensitivity,

specificity and overall accuracy 148. In this case, sensitivity measures how well the

cluster detection techniques correctly detected the presence of LD hotspots, whereas

specificity provides a measure of how well the techniques correctly identified the

absence of LD hotspots. Overall accuracy shows the ability of cluster detection

techniques to identify true positive and true negative LD hotspots. Locations of the

48

detected clusters with both LISA and SSS techniques were compared with the location

of true clusters as defined by Birnbaum et al. (1996) that “true clusters explain fewer

than 5% of all reported clusters” 149. Therefore, the top 5% towns with regard to high LD

incidence were considered as True Hotspots. We calculated these statistics for each

technique in each period, separately.

Results

There were 54,478 reported human LD cases from 1991 to 2014, out of which

10,328 cases (19.0%) occurred in first period (1991-1996), 20,234 cases (37.1%) in

second period (1997-2002), 11,210 cases (20.6%) in third period (2003-2008) and

12,706 cases (23.3%) in the last period (2009-2014). The annual incidence of LD

ranged between 84.7 (in 1993) and 305.6 (in 2002) cases per 100,000 individuals with a

mean of 144.8 cases per 100,000 people (Figure 2-2). The P values for the global

Moran’s I statistic were close to zero which reject the null hypothesis of complete spatial

randomness (CSR) for all time periods (Table 2-2). The index values ranged from 0.55

to 0.71, which indicates significant clustering. Although, we detected clustering with both

raw and smoothed methods, and in all four periods, the variation in rates required EBS

before running LISA. Based on the results of EB smoothing technique, LISA found High-

High clusters (hotspots), which varied for each study period. Results of LISA showed

that in the first and last period, the hotspots were completely restricted to the towns in

the East. In other words, for the first period, 24 towns and for the last period 30 towns

were identified as the hotspot in eastern Connecticut. The towns in the West were more

influenced in the second and third periods. The number of towns in the West identified

as hotspots in the second period was 8, while 7 towns were hotspot in the East of

Connecticut. In addition, in the third period, 10 towns in the West and 6 towns in the

49

East were identified as hotspots. It should be noted that applying EBS before running

LISA, reduces the likelihood of detecting a false cluster in low-population areas where

the cases were detected. But comparison of the detected clusters by raw and EBS

methods in LISA showed small differences with regard to the location of the detected

clusters.

Spatial scan statistics with 5% and 50% of the populations at risk with no

geographic cluster overlaps identified clusters with high incidence rates for each period.

Except for the second period, The SSS results revealed that the most likely cluster

predominantly occurred in the eastern region of the state, while the secondary cluster

occurred in towns in the West (Figure 2-3). Primary and secondary hotspots were

observed in different locations when 5% of the population at risk was investigated using

SSS for comparison with LISA. For the first period, the primary cluster occurred in the

eastern parts of the state and included 22 towns and 792 cases. The risk of LD

incidence within the primary cluster was 7.67 times greater than outside the cluster. The

secondary cluster occurred in the western region and contained 5 towns and 152 cases.

During the second period, the primary cluster occurred only in the eastern parts of the

state and included 23 towns and 1,208 cases. The risk of LD incidence within this

cluster was 3.40 times higher than outside. During the 2003-2008 period, the disease

largely affected the eastern region with 22 towns and 738 cases as compared to the

western parts with 10 towns and 145 cases. The relative risks of primary and secondary

clusters were 3.33 and 9.18, respectively. In the last period, only eastern parts of the

state showed statically significant clustering (no secondary cluster). The primary cluster

https://geospatialhealth.net/index.php/gh/article/view/588/610

50

contained 24 towns and 942 cases. The risk of LD incidence in this cluster was 3.69

times more than in other areas (Table 2-3).

Comparison of the results of accuracy assessments of the spatial scan statistics

at 5% and 50% of the populations at risk, shows that sensitivity of this method increases

with an increment of the population defined to be at risk. However, increasing the

population at risk also leads to a decrease in the specificity of the results. In addition,

LISA tended to have higher specificity and had the highest overall accuracy (Figure 2-

4).

Discussion

This retrospective study examined the spatial structure of LD incidence

distribution in Connecticut based on 24 years of reported data with the aim of describing

the spatial distribution and the changes that have occurred with regard to the disease. It

differs from most other studies in the region by not focusing on local studies of the

pathogen, reservoir or vector and their associations with the environment. Instead, it

focuses on the changing patterns of documented human disease occurrences. We

assumed that the number of affected human cases corresponded to the density of

Ixodes scapularis in Connecticut 150-152. In addition, it should be noted that ticks have

limited capabilities to move to new areas because of their small size 153. Therefore, one

of the means to fight against the risk of the disease in the study area would be targeted

intervention in the areas (towns in this case) that were constantly affected with a very

high morbidity rate. Targeted control, planning and management of the disease can

assist with resource allocation to the towns with persistent high incidence rates resulting

in time and costs savings.

51

Results of this study confirm and extend the findings of Xue et al. (2015) in

Connecticut 154. They analyzed yearly clusters of LD and showed that the distribution

was clustered and that this clustering occurred in western and eastern Connecticut with

few cases in the central region supposing the yearly incidence weighted-geographic

mean analysis represents the clusters of the case distributions. According to the paper,

the epidemic of LD reached equilibrium in 2007 in the western parts of Connecticut,

while this happened in 2009 in the East. Comparison of the results shows that periods

1-3 included epidemic conditions while period 4 contained equilibrium condition.

Therefore, actual/sudden shifts in the locations of clusters occurred roughly before

equilibrium (period 1-3); and it may be that epidemic conditions affected both western

and eastern parts of the state; while equilibrium conditions only occurred in the eastern

regions.

This study identified towns with high LD incidence rates using the LISA test and

spatial scan statistics approaches. By intersecting the locations of the clusters identified

by LISA some towns, including Chaplin, Windham and Scotland, were persistently

detected to have high-rate clusters. Likewise, neighbors of these towns including

Andover, Columbia and Lebanon, were recognized as having high-rate clusters in 3 out

of the 4 periods of study (Figure 2-1). This indicates that these areas were almost

persistently affected by a high incidence of LD during the 24 years of study and

therefore they deserve closer consideration.

In this study, we used an elliptical window in the scan rather than highly applied

circular shape used in other studies for several reasons. First, according to previously

published papers, the elliptic shape had better power and precision compared with

52

circular one and also follows more accurately certain geographic features with varying

shapes and directions 155. Another reason was to avoid edge effects at the borders of

the study area. When circles are used for scanning the study area, particularly when the

circle should center on the centroid of border towns, considerable parts of it inevitably

covers the neighboring state (i.e. Rhode Island in the East, Massachusetts in the North,

Long Island Sound to the South and New York in West, where states are also LD-

endemic areas where data were not accessible) and the true number of cases would be

underestimated. The use of ellipses with different shapes and directions helps to reduce

this problem to some degree.

There were similarities in the locations of the spatial hotspots identified by

SaTScan and LISA even though these methods apply different methodologies to detect

hotspots. It should be noted that we smoothed the incidence rate for the LISA cluster

detection method, while this is not available for spatial scan statistics. The results were

also in agreement with the findings of Barro et al. (2015), who compared different

cluster detection techniques including Getis-Ord Gi*statistic, a multidirectional optimal

ecotope- based algorithm (AMOEBA) and the spatial scan statistic for identifying

hotspots of human cutaneous anthrax in Georgia for point data 156. They also found that

SSS was more sensitive (due to augmentation in the quantity of true positives) but less

specific (by increment of the population at risk because of a declining number of true

negative clusters). Here, the results were highly dependent on defining the weight

matrix in LISA or the percentage of the population at risk in spatial scan statistics.

The most important limitation of this study is attributed to the data reported that

were used in this study. As indicated by the CDC, surveillance data are subject to

53

under-reporting and misclassification in highly endemic areas such as Connecticut;

however, this problem would not be very severe in light of well-designed LD surveillance

system in this state. Additionally, as long as there is simply a spatially random thinning

of reported cases, the overall spatial analysis should not be affected. However, if there

is persistent over- or under-reporting in selected regions, it may be difficult to recognize

the impact of such biases. Another factor influencing the results is that the definition of

LD has changed several times: from using a two-test approach for laboratory affirmation

139 to the addition of probable and suspect categories with less strict criteria 140 and

subtle changes in the specification of confirmed cases 141. However, empirically (Figure

2-3), these changes do not seem to substantially have altered the areas of high-risk

clusters. Therefore, although the analyses conducted were based on data reported by

CTDPH, the findings of this study should be interpreted with some caution.

The findings of this study can be regarded as a basis for generating hypotheses

about underlying risk components. For instance, visual comparison of the locations of

towns that were never identified as cluster areas (Figure 2-3) and the digital elevation

model (DEM) of the study area suggest that low-risk towns are located in low-lying

areas; thus it seems that people who live at moderate or higher altitudes in Connecticut

are at a higher risk. This would be consistent with the earliest epidemiologic

investigations 157 that identified increased risk in areas away from the seashore. Thus,

altitude as a proxy of other environmental factors can uncover the reasons of the spatial

distribution of the disease in the study area. Moreover, except for Windham, all the

other towns identified as persistent clusters had populations lower than 8,000 people

showing that persistent towns as clusters occurred in less populated areas. Therefore,

54

further studies should incorporate other environmental factors that have significant

influence on the spatial and temporal patterns of the disease.

55

Tables

Table 2-1. Comparison of LISA and SaTScan clusters by confusion matrix and its derivatives statistics Identified hotspots (LISA or SaTScan)

Actual hotspots (top 5%

incidence rates)

Presence Absence

Presence True Positive (TP)

False Negative (FN)

Absence False Positive (FP)

True Negative (TN)

Sensitivity = (TP)/(TP+FN); Specificity = (TN)/(TN+FP); Overall accuracy = (TP+TN)/(TP+FN+FP+TN)

Table 2-2. Results of the global Moran statistic of LD incidence rate, Connecticut, 1991-2014

Year Index Z-score P-value Type of distribution

1991-1996 0.71 16.05 ≈ 0 Clustered 1997-2002 0.59 13.54 ≈ 0 Clustered 2003-2008 0.55 12.73 ≈ 0 Clustered 2009-2014 0.59 13.28 ≈ 0 Clustered

Table 2-3. Characteristics of LD spatial clusters detected by SSS with 5% of the

population at risk throughout Connecticut, USA for the period 1991-2014

Cluster (C) Year Observed Cases (N)

Expected Cases

Ratio (Obs/Exp)

P-value Relative Risk (RR)

Log likelihood ratio

Most Likely (East) 1991-1996 792 171.26 4.62 <0.0001 7.67 736.02

Secondary (West) 1991-1996 152 32.26 4.71 <0.0001 5.07 120.17

Most Likely (East) 1997-2002 1,208 474.52 2.55 <0.0001 3.40 496.50

Most Likely (East) 2003-2008 738 305.46 2.42 <0.0001 3.33 284.31

Secondary (West) 2003-2008 145 16.96 8.55 <0.0001 9.18 187.61

Most Likely (East) 2009-2014 942 376.77 2.50 <0.0001 3.69 400.80

P-values were obtained from the Monte Carlo hypothesis test

56

Figures

Figure 2-1. Geographic location of the Connecticut, its towns and approximate populations. The names on the map show the location of the towns which are mentioned in this paper.

57

Figure 2-2. Temporal trend of LD incidence throughout Connecticut from 1991 to 2014. Black dots show the incidence rate for each year and the blue line represents a scatter with smooth lines.

0

50

100

150

200

250

300

350

1990

1992

1994

1996

1998

2000

2002

2004

2006

2008

2010

2012

2014

Inci

denc

e ra

te (p

er 1

00,0

00)

Year

58

Figure 2-3. Locations of spatial clusters of LD incidence in Connecticut, USA based on

the true-cluster definition of with LISA and SSS methods targeting 5% of the population at risk (A), for the period 1991-1996; (B), for the period 1997-2002; (C), for the period 2003-2008; (D), for the period 2009-2014.

59

Figure 2-4. Comparison of LISA and spatial scan statistics with 5% and 50% of the

population at risk of LD in Connecticut, USA for each period from 1991 to 2014. Sensitivity, specificity and overall accuracy expressed as per cent (%)

40

50

60

70

80

90

100

Sens

itivi

ty

Spec

ifici

ty

Ove

rall

Accu

racy

Sens

itivi

ty

Spec

ifici

ty

Ove

rall

Accu

racy

Sens

itivi

ty

Spec

ifici

ty

Ove

rall

Accu

racy

Sens

itivi

ty

Spec

ifici

ty

Ove

rall

Accu

racy

Period1 Period2 Period3 Period4

LISA

SSS 5% pop. at risk

SSS 50% pop. at risk

60

CHAPTER 3 MACHINE LEARNING APPROACHES IN GIS-BASED ECOLOGICAL MODELING OF

THE SAND FLY PHLEBOTOMUS PAPATASI, A VECTOR OF ZOONOTIC CUTANEOUS LEISHMANIASIS

Zoonotic cutaneous leishmaniasis (ZCL) is an important neglected tropical

disease due to insufficient considerations for controlling and ignored remarkable public

health impacts in endemic countries ∗158. It poses unpleasant skin lesions to the patients

which may take a long time to heal and injects serious socio-economic burdens to the

society 159,160. This vector-borne disease is highly prevalent in developing countries

including Iran 161. Previous studies in Iran have demonstrated that Leishmania major is

the causative agent of ZCL and Ph. papatasi which is the target species in this study

serves as the major vector of the disease. It is reported that the disease is still endemic

in many rural areas of 17 of 31 provinces of Iran and almost 70% of

cutaneous leishmaniasis cases in Iran are ZCL 162,91. A large number of factors impact

the distribution of Ph. papatasi and/or its pathogen which have made it a

multidimensional and complex health problem particularly when the underlying

relationship is ambiguous 163.

In recent years, novel machine learning techniques such as random forest (RF)

and support vector machine (SVM) have become popular research tools for ecologists

and medical researchers due to improved predictive performances of the techniques,

particularly in solving binary classification tasks 164. The rapid application of these

techniques might be due to their abilities to approximate almost any complex functional

∗ Reprinted with permission from Mollalo, A., Sadeghian, A., Israel, G. D., Rashidi, P., Sofizadeh, A., & Glass, G. E. (2018). Machine learning approaches in GIS-based ecological modeling of the sand fly Phlebotomus papatasi, a vector of zoonotic cutaneous leishmaniasis in Golestan province, Iran. Acta tropica, 188, 187-194.

61

relationship 165. These techniques are mathematical models that can map the

relationship between input data and the target output by receiving several examples

(training data) 166. A trained model subsequently can predict a variable of interest for a

new independent (test) dataset.

To date, there are few studies using machine learning methods that model the

spatial distribution of leishmaniasis. In South America, Peterson and Shaw (2003) used

the genetic algorithm for rule-set prediction (GARP) to model ecological niche of

three sand fly vector species: Lutzomyia whitmani, Lutzomyia intermedia,

and Lutzomyia migonei. They projected the geographic distribution of vectors for the

year 2050 under global climate change scenarios and predicted that Lutzomyia

whitmani will dramatically increase in southern Brazil 167. Samy et al. (2016) used

Maximum Entropy (MaxEnt), a presence-only ecological niche model (ENM), to map the

potential distribution of ZCL across Libya. Their model successfully predicted the risk

areas and distribution of most species associated with ZCL including Ph. Papatasi. This

sand fly was present mostly in lowland areas and the risk areas of ZCL were almost

confined to the northern coast of the country 168. In a study in northwestern Iran, Rajabi

et al. (2014) used radial basis functional link nets (RBFLN), a neural network modeling

technique, to map potential areas of visceral leishmaniasis (VL), or kala-azar, a fatal

form of leishmaniasis. They produced a susceptibility map of VL, with the overall

accuracy of 92% in endemic villages and concluded that riverside and nomadic villages

without health centers were high-risk areas 169. RF and SVM have been applied in the

classification of other vector-borne diseases such as malaria and dengue fever. In an

endemic district in South Africa, Kapwata and Gebreslasie (2016) successfully applied a

62

random forest approach to select important features to create predictive maps of

malaria. They found that the most determining features were altitude, normalized

difference vegetation index (NDVI), rainfall and temperature 170. Gomes et al.

(2010) trained an SVM model for a small sample size (n = 28 patients) to classify

dengue fever (DF) and dengue hemorrhagic fever (DHF) patients based on gene

expression data. They achieved 85% accuracy for the validation dataset and identified 7

out of 12 genes that can differentiate DF patients from DHF patients 171.

To the best of our knowledge, no effort has been made to model Ph. papatasi or

other vectors of leishmaniasis distribution using SVM and RF, which are two effective

supervised techniques in ecological studies 172. This study analyses the applicability of

these techniques in GIS-based modeling of the species in Golestan province, northeast

of Iran. These techniques are used to classify this endemic area of ZCL into two

distinctive classes: ‘presence’ and ‘absence’ of Ph. papatasi. The accuracy of

these classifiers was compared with the frequently applied (vanilla) logistic regression.

This study is a starting point for further studies and application in control measures such

as “automatic early warning systems”. Findings of this study can help decision makers

target surveillance, and to control and mitigate species propagation into new foci.

Material and Methods

Study Area

Surveillance was conducted in Golestan province, a notorious endemic area

of leishmaniasis. The province has over 1,700,000 inhabitants and covers an area of

20,368 square kilometers in the northeast of Iran. This area lies within latitudes 36°30′

and 38°8′ N and longitudes 53°57′ and 56°22′ E (Figure 3-1). Geographically, this

province is bounded from north to Turkmenistan and from west to the Caspian Sea and

63

Mazandaran province. The altitude of the study area ranges from −32 to 3,945 m above

mean sea level. The climate of province varies from cold to moderate weather in the

south because of Alborz mountain ranges and arid and semi-arid with hot summers in

northern areas. As one moves from southern to northern areas, generally, the amount

of rainfall and percentage of humidity decreases and air temperature increases.

Data Collection and Preparation

Several categories of data in different formats including vector and raster data

models and excel spreadsheets were incorporated. The data were obtained from the

Worldclim, the Moderate Resolution Imaging Spectroradiometer (MODIS) satellite

images, the United States Geological Survey (USGS) and GIScenter of Golestan

province. These data were used as candidate explanatory variables to model spatial

distribution of vector presence/absence in the study region. It was essential to

determine candidate factors to model species distribution. Thus, in the first step, a ZCL

related geo-database was developed in GIS and stored in a common projection system.

Sand fly collection

We conducted sampling during the primary activity period of sand flies from early

April 2014 to late October 2014 (adult sand flies are absent in other months). In each of

the 16 counties of the province, 5 to 10 villages were selected based on their

topographic conditions, spatial distribution of sampling sites (almost distributed

uniformly across the study area), and area of each county. For each sampling site,

sticky paper traps smeared with castor oil with the size of 15*20 cm used and placed

indoors (bedrooms, stables,…) and outdoors (animal shelters, rodents burrows,..). In

each site, 30 indoor and 30 outdoor sticky traps were monthly installed. We installed

traps at sunset and collected them the next day before sunrise. Also, no unprecedented

64

changes in climate condition or ZCL incidence was observed for the study period.

Collected sand flies were then washed to remove oil and other debris. We preserved

the sand flies in 70% ethanol and transferred them to an entomology laboratory for

species identification. Identification of collected material was based on

morphological identification keys of dissected sand flies. In 44 out of 142 sampling

sites, Ph. papatasi was present (Figure 3-1B). Although other species of sand flies were

also found in the study area, we did not consider them in habitat modelling, as Ph.

papatasi is the main vector of ZCL in the study area. Detailed information on collected

sand flies has been presented in Sofizadeh et al. (2016) 109.

Exploratory data

To investigate the effects of temperature and precipitation on habitat suitability

of Ph. papatasi in this area, we used the WorldClim database 173. Nineteen climate

variables (bio1 through bio19) were obtained from the database

(http://www.worldclim.org) and clipped with the boundaries of the study area (Table 3-

1). These variables are at 1 km (30 arcsec) spatial resolution. In addition, to investigate

the possible effects of proximity to water on presence/absence of the sand fly, we

calculated the Euclidean distance from each sampling site to the nearest river in the

GIS environment. The river coverage obtained from GIS and Environment Center of

Golestan province.

Two widely used remotely sensed products including land surface temperature

(LST) and NDVI were used to measure land temperature and vegetation cover,

respectively. These products were downloaded from the Moderate Resolution Imaging

Spectroradiometer (MODIS) satellite imagery from March to October 2014

(corresponding to the same time of the sand fly collection). Both LST and NDVI are at

http://www.worldclim.org

65

1 km spatial resolution; the temporal resolution for the LST dataset is 8 days which

aggregated to a monthly resolution. LST downloaded for day and night, separately. The

NDVI dataset is at a monthly temporal resolution. They were pre-processed in ENVI

5.4 image analysis software. All remotely sensed satellite images are freely available at

the MODIS website: http://www.modis.gsfc.nasa.gov.lp.hscl.ufl.edu.

We used NDVI to reflect the density of vegetation cover on the ground. This

index, which has been highly correlated with the plant species presence, is calculated

via the red (R) and near-infrared (NIR) spectral bands from the following formula:

𝐍𝐍𝐍𝐍𝐍𝐍𝐈𝐈 =𝐍𝐍𝐈𝐈𝐍𝐍 − 𝐍𝐍𝐍𝐍𝐈𝐈𝐍𝐍 + 𝐍𝐍 (3-1)

The value of NDVI differs from −1 to +1 so that a greater value of NDVI represents a

higher amount of vegetation cover 158, 160, 174.

The 30-meter shuttle radar topography mission (SRTM) elevation data was

obtained from the United States Geological Survey (USGS) website (https://www-usgs-

gov.lp.hscl.ufl.edu/). Also, a slope map was generated from the digital elevation model

(DEM) of the study area. To calculate slope, the average maximum algorithm was used

in ArcGIS 10.2 software (ESRI, Redlands, CA, USA).

In summary, we developed a set of 31 candidate variables representing different

weather conditions, topographies, vegetation regimes, land temperature and proximity

to rivers that might plausibly impact sand fly occurrence (Supplementary Table). The

selection of these environmental variables was based on previously published literature

that found direct and/or indirect impacts on the distribution of the sand fly populations.

The values of each raster variable were extracted for each of the 142 sampling sites

http://www.modis.gsfc.nasa.gov.lp.hscl.ufl.edu

https://www-usgs-gov.lp.hscl.ufl.edu/

66

through “Extract values to points” function in ArcGIS 10.2. The presence and absence

of Ph. papatasi are coded with 1 and 0, respectively.

Classifiers Used in This Study

Logistic regression (LR) is a generalized linear regression model that is

appropriate for situations where the dependent variable is dichotomous. The output of

LR is a likelihood of species occurrence as a function of several exploratory variables

175.

The random forest (RF) classifier, an ensemble tree-based algorithm, generates

a large number of binary decision trees (for instance 1000 trees) and then combines

them to provide a final classification for the presence/absence of the outcome 96. This

algorithm injects randomness through random selection of observations

using bootstrap sampling (i.e. sampling with replacement) and random selection of

exploratory variables. This process learns binary trees that are shallow (i.e. have a

small number of nodes) but highly specialized on a subset of the dataset/variables.

Ensembling the decisions of this diverse set of trees (forest) leads to an improvement in

the performance of the method 97. A final decision is achieved based on the majority of

labels (i.e. votes or outputs of trees) obtained from individual trees. In this study, the

number of trees was optimally selected from the set of {10, 30, 50, 100, 500} by cross-

validation. In addition, the optimal maximum depth (i.e., number of layers or number of

edges from the root to the node) of the trees was chosen from the set of {2, 3, 4}. The

maximum depth is selected via the cross-validation done on the training data.

SVM, originally proposed by Vapnik (1999), can extract patterns from complex

non-linear data 176. We used this SVM's ability to extract spatial patterns from the

collected data and features. SVM has several appealing characteristics compared to

67

most machine learning techniques. It guarantees global optima 177 and leads to a

unique solution when solved using iterative methods, rather than stochastic

randomness. We used a grid search to find the optimum values for the SVM

parameters. Detailed explanations of SVM can be found in Noble 98.

We trained all above-mentioned models (i.e. LR, RF, and SVM) using sklearn

(http://scikit-learn.org/stable/) in Python programming language to predict Ph.

papatasi occurrences.

Pre-processing of the Data

In the first step, to avoid/reduce the existence of the multicollinearity and

redundancy of the variables, we applied Pearson’s correlation analysis in SPSSver. 23

software. We investigated Pearson correlations among all pairs of selected variables.

We omitted highly correlated factors (i.e. correlation coefficient >0.7) before modeling.

All of the models are sensitive to the number of exploratory variables because

redundant variables add extra noise which reduces model accuracy and generalizability

178. A proper selection of the most effective input variables is crucial to improving

predictions. To address this, feature selection via L1 regularization was utilized before

developing the models to reduce the effects of possible noise. More detailed information

of L1 regularization method can be found in Park and Hastie (2007) 179.

After selecting the variables, we standardized (i.e., scaled) the input dataset

before incorporating the variables into the models because the data have different

ranges and units. This important pre-processing step can increase the performance of

models by converging faster 180, particularly in SVM and RF which are solved with

iteration-based optimization techniques. There are several standardization formulas

http://scikit-learn.org/stable/

68

presented in literature and, here, we use the following equation which re-scales input

data to z-scores:

𝐙𝐙𝐚𝐚 =𝐙𝐙𝐢𝐢 − 𝛍𝛍𝛔𝛔

(3-2)

Where Zi is the initial (actual) value; μ and σ are the average and standard

deviation of the initial value, respectively and Zn is the standardized value. Moreover,

after developing the models, the output of the model is returned to the original form

through the following equation:

𝐙𝐙𝐢𝐢 = (𝐙𝐙𝐚𝐚 ∗ 𝛔𝛔) + 𝛍𝛍 (3-3)

Overview of Training Models

To develop models, we used 5-fold cross-validation where in each fold, the

dataset is randomly partitioned into two different categories including training set, 75%

of total data (ntrain = 106), and a remaining 25% (ntest = 36) as a hold-out test set used to

assess the accuracy and predictive power of the models after training process. We

saved and used the same training and test datasets for all models for further

comparisons. Also, the data are randomly selected, thus, on average 75% of the

training set is the same from one fold to the other.

To fine-tune each model’s hyper-parameters (e.g. parameters in SVM), using the

earlier mentioned grid search method, we performed a 3-fold cross validation within the

training set, breaking it into 66.6% and 33.3% subsets for each fold. As a final step, LR,

SVM and RF classifiers with optimal parameters were trained and then used to

distinguish presence from the absence of Ph. papatasi. All the reported results are the

average from the 5-fold cross-validation.

69

Models Validation

A confusion matrix or error matrix 148 is used to assess the generalization

capabilities of the classifiers. Sensitivity (or true positive rate) and specificity (true

negative rate) were used to evaluate the ability of the model to correctly identify

presence and absence of Ph. papatasi, respectively 181. Also, two important statistics

have been calculated to measure the performance of classifiers, namely positive

predictive value (PPV) and negative predictive value (NPV). PPV is the probability that

the sampling sites identified by the model with the presence of Ph. papatasi that actually

have that species, while NPV is the probability that the sampling sites identified by the

model as the absence of Ph. papatasi that actually don’t have that species 182. The

models’ accuracies were assessed using overall accuracy and area under receiver

operating characteristics (ROC) curve 183. In this study, we also used Cohen’s Kappa

index to exclude the probability of chance agreement between predictions and ground

truth 184.

Due to black-box nature of SVM, we are unaware of the weights/contribution of

each variable in predicting presence/absence of Ph. papatasi. We applied sensitivity

analysis to understand the contribution of each factor on the output variable by

excluding each variable from the model one by one and comparing the performance of

the model in terms of Cohen’s Kappa index. The absence of a variable that reduces the

Kappa index of the model the most is considered as the most important factor and so

forth. Finally, variables were imported into the most accurate model to generate habitat

suitability map for the entire study area. The obtained map was classified into 5 different

categories: very low, low, moderate, high and very high probabilities. Natural break

70

classification (a.k.a. jenks) was utilized to classify the obtained suitability map. The

overall flowchart of the procedures is shown in Figure 3-2.

Results

After conducting variable selection, 7 variables selected as inputs of the

models. Table 3-2 shows the list of selected variables and the correlation coefficients

among them.

As stated in the Methods section, the parameters C and γ in the SVM model

were determined experimentally by systematically increasing these values from 0.5 to

20 and 0.005 to 0.1, respectively. The best accuracy of SVM was obtained with C= 17.5

and γ= 0.105 and the radial basis function (RBF) kernel.

Results of this study using the test dataset (25% of the sampled areas not used

in developing the models) demonstrated that all three classifiers have good

generalization capabilities. The SVM model generated the highest overall accuracy

(91%) followed by the LR (88%) and the RF (85%). The ROC curves (prediction rates)

and AUCs for each method (Figure 3-3) showed all the classifiers generated high AUCs

(more than 90%); however, there were slight differences between AUCs of the

classifiers. The AUCs were 0.974, 0.965 and 0.935 for SVM, LR, and RF classifiers,

respectively.

Kappa indices were 0.79, 0.73 and 0.65 for the SVM, LR, and RF models,

respectively, suggesting substantial agreements between models outputs and ground

truth. Similarly, the lowest accuracy expected better than chance was obtained by the

RF method.

The highest positive predictive value (PPV) was achieved by the SVM model

(85%), indicating that the probability that the sampling sites identified by the SVM

71

classifier with the presence of Ph. papatasi that actually have that species was 85%. It

was followed by the LR model (82%) and the RF model (79.0%). Similarly, the SVM had

the highest negative predictive value (NPV = 93.2%) suggesting that the probability that

the sampling sites identified by the SVM as the absence of Ph. papatasi that actually

don’t have that species was 93.2% and followed by the LR (90.7%), and RF models

(87.0%).

The SVM model was the most sensitive model, with 84.9% of the correctly

classified sampling sites, followed by LR model with 80.0% of the correctly classified

sampling sites whereas the RF model was the least sensitive classifier with 73.4%. The

SVM model also had the most specific model (93.2%), implying that over 90% of

absence locations are correctly classified by this method. It was followed by the LR

model (91.9%) and RF models (90.4%). A graphical comparison of evaluation metrics of

classifiers (i.e. LR, SVM and RF) is shown in Figure 3-4.

Based on the above results, the best accuracy was obtained by SVM (with the

RBF kernel) compared to the other classifiers. Next, we assessed the contribution of

each variable for this best classifier using sensitivity analysis. Results of this analysis

indicated that the lowest Cohen’s Kappa index obtained when the slope was removed

from the SVM model. This suggested that this factor is the most influential factor in

predicting Ph. papatasi in the study region. The most influential factors in order of

contributions (from higher to lower) as predictors for Ph. papatasi were: slope, nighttime

LST in October, bio8 (mean temperature of wettest quarter), bio4 (temperature

seasonality), bio18 (precipitation of warmest quarter), bio5 (max temperature of

72

warmest month), bio15 (precipitation seasonality), distance to major rivers and bio13

(precipitation of wettest month), respectively.

According to the produced habitat suitability map (Figure 3-5) from the SVM

method, considerable parts of the study area are under very high probabilities of the

species occurrence. Results of the natural break classification of the map show that

28.8%, 5.0% and 7.7% of the total study area were predicted in the very low (0–0.18),

low (0.18–0.48) and moderate (0.48–0.73) probabilities of existence classes, while

areas covering high (0.73–0.90) and very high (>90) probabilities of Ph.

papatasi occurrences represent 16.5% and 42% of the total area, respectively. Very

high probabilities of the species occurrence happened mostly in central and north

(eastern) parts, while no and low occurrence of the species was restricted to the

southern half of the study area.

Discussion

This study focused on geospatial features that can predict the presence or

absence of Ph. papatasi in an endemic province of ZCL in Iran. We combined relatively

novel techniques/tools: remote sensing (to extract NDVI and LST), GIS (mainly for data

preparation and mapping) and machine learning (for model development). Our results

successfully demonstrated the utility of machine learning techniques in predicting the

spatial distribution of Ph. papatasi across Golestan province, Iran, from a surveillance

sampling strategy. Among the compared models, the SVM was an

efficient classifier with the highest prediction capability for producing habitat suitability

map of Ph. papatasi occurrence. Phlebotomine sand flies have limited flight range of

about 300 m 185 and a few have been known to travel 1.5–2 km 186. Regarding 1 km

73

spatial resolution of LST and bioclim variables, it seems that it is safe to assume that

this scale is operative for intervention or allocation of resources.

Regardless of the SVM benefits in Ph. papatasi habitat modeling, our results did

not indicate that the SVM classifier necessarily outperforms other binary classifiers for

species modeling or that RF is not a very accurate classifier. For instance, Ding et al.

(2018) found SVM inferior to the other models including random forest and gradient

boosting machine in modeling the global distribution of Aedes aegypti and Aedes

albopictus 187. Also, Williams et al. (2009) identified RF as the best model in predicting

appropriate habitats of six rare plant species in Creek Terrane, northern California. The

weaker results of RF compared to the other classifiers might be due to overfitting

(memorization). But generally, it is recommended to take SVM into consideration for

binary classification problems when having a dataset with a small sample size and a

relatively large number of exploratory variables 188.

The findings from this study can be compared with previous work conducted in

the same area by Sofizadeh et al. (2016). They utilized presence-only ecological niche

models (i.e. MaxEnt and GARP) for Ph. papatasi habitat modeling. The SVM model had

a 27% improvement in accuracy over the MaxEnt model 109. All three presence-absence

models developed in this study were more accurate than those developed using

ecological niche models. However, niche models generate pseudo-absence data from

the background, AUCs derived from pseudo-absence models are difficult to interpret

and thus a spurious pseudo-model might be selected. In this study, the lowest accuracy

for test data obtained by RF was 85%, while the overall accuracy of MaxEnt, which

outperformed GARP in the study of Sofizadeh et al. (2016), was no better than 64%.

74

This suggests that presence-absence models perform better for habitat suitability

models of the sand flies 109. Our results also confirm and extend the findings

of Carvalho et al. (2015) in predicting the vector of Leishmania amazonensis in South

America. In their study, random forest, as a presence-absence model, had the best

performance compared to several models including ENMs. They concluded that

incorporating species absence data significantly improved model performance 189.

However, it should be acknowledged that we incorporated some new variables such as

monthly NDVI (as opposed to yearly NDVI) and monthly LST (for day and night,

separately) and distance to main rivers together with considering absence points in

modeling. Among these new variables, nighttime LST and distance to main rivers were

found to be important determinants.

Sensitivity analysis indicated that slope was the most influential factor which is in

agreement with the results of previous ecological niche model. Topography and its

derivatives such as slope can have influence on biological and physical factors. For

instance, an increase in altitude leads to decreases in temperature and rainfall, and

these changes cause different plant types to grow 190. Also, Mollalo et al. (2014) found

altitude (and slope) a determining factor in the incidence of human ZCL. This suggests

that there is a strong correlation between human cases of ZCL and occurrence of Ph.

papatasi. Nighttime LST found the most important factor after slope which rarely is

incorporated in modeling Ph. papatasi 191. Land temperature can be associated with soil

moisture, evapotranspiration and vegetation water stress, which can be determining

factors in larval development and egg survival 192. Boussaa et al. (2016) also found LST

a significant factor in describing the distribution of Ph. papatasi. Similarly, no significant

75

association of daytime LST was observed which might be attributed to the nighttime

activities of sand flies. Temperature and precipitation variables (bio8, bio4, bio 13, bio

15, bio18) were found as the next important contributing predictors 193. Cross et al.

(1996) and Medlock et al. (2014) found that temperature and precipitation play

important role in development and survival of different life stages of Ph. papatasi 194-195.

These two factors have different effects on sand fly populations. For instance, lack of

rainfall sometimes doesn’t reduce larval frequency while moderate rainfall facilitates

transmission and intense rainfall reduces transmission because it flushes away suitable

resting sites for adult sand flies and destroys immature stages of the life cycle 196. In

regards to temperature, tropical climate provides favorable conditions to facilitate

transmission, while, the sand fly is not able to survive in temperature less than 15 °C

under laboratory conditions 197.

It should be noted that the accuracy of the classifiers used in this study depends

entirely on the data. Additionally, there are potential sources of errors (random errors or

systematic errors) in the dataset such as sampling bias or errors during vector collection

which can influence the accuracy of the classifiers. Therefore, the results of this study

should be used with some caution. We also recommend testing the performances of the

models on other datasets (e.g., regions) with comparable environmental characteristics

to select the best unbiased estimator for predictions. As a next step, a new sampling

validation in the same area of study is suggested to be performed in absence and

presence sites not sampled previously. Future studies should also consider actual

counts of the Ph. papatasifor each sampling site. Depending on the frequency

distribution of sand flies, proper modeling techniques such as Poisson regression,

76

negative binomial models, and artificial neural networks can be used. However, higher

counts of vector imply a greater risk of ZCL, the main purpose of this study was to

compare predictive capabilities of three different binary classifiers and modeling for

count data was not the purpose of this study.

In this paper, the use of novel machine learning techniques to predict the

presence/absence of an arthropod vector, Ph. papatasi, was investigated. The findings

showed that the SVM classifier was a reliable tool to solve specific problems in medical

geography and ecological modeling. The results of this study can be extended to

develop standalone early warning services (EWS) to actively monitor sand flies’

presence in endemic areas. Due to sophisticated predictive capabilities of SVM

compared to other classification methods, this classifier can be considered/

incorporated into the design of early warning systems (i.e. predictions of vector

presence greater than a threshold as a warning). This is applicable even for

other vector-borne diseases such as tick- and mosquito-borne diseases. However, the

implementation of EWS is more feasible in countries with well-established climate

stations that systematically record data.

77

Tables

Table 3-1. Bioclimate variables used in this study slope bio13 bio15 bio5 bio8 LST_OctN Dist_major slope 1 -.567** .376** -.610** .031 -.356** .457** bio13 -.567** 1 -.539** .467** -.087 .665** -.388** bio15 .376** -.539** 1 .013 -.144 -.479** .380** bio5 -.610** .467** .013 1 -.422** .640** -.331** bio8 .031 -.087 -.144 -.422** 1 -.221** .077 LST_OctN -.356** .665** -.479** .640** -.221** 1 -.375** Dist_major .457** -.388** .380** -.331** .077 -.375** 1

Table 3-2. Pearson correlation coefficients between selected variables

Variable Description Bio1 Annual Mean Temperature Bio2 Mean Diurnal Range (Mean of monthly (max temp – min temp)) Bio3 Isothermality (BIO2/BIO7) (*100) Bio4 Temperature Seasonality (standard deviation *100) Bio5 Max Temperature of Warmest Month Bio6 Min Temperature of Coldest Month Bio7 Temperature Annual Range (BIO5-BIO6) Bio8 Mean Temperature of Wettest Quarter Bio9 Mean Temperature of Driest Quarter Bio10 Mean Temperature of Warmest Quarter Bio11 Mean Temperature of Coldest Quarter Bio12 Annual Precipitation Bio13 Precipitation of Wettest Month Bio14 Precipitation of Driest Month Bio15 Precipitation Seasonality (Coefficient of Variation) Bio16 Precipitation of Wettest Quarter Bio17 Precipitation of Driest Quarter Bio18 Precipitation of Warmest Quarter Bio19 Precipitation of Coldest Quarter

Correlation is significant at the 0.01 level (2-tailed).

78

Figures

Figure 3-1. Geographic location of the study area and its counties, NE Iran. The map inset (B) illustrates the location of presence/absence sampling of Ph. papatasi.

79

Figure 3-2. Methodology flowchart used in this study

Figure 3-3. The ROC curves and AUCs for SVM, RF and LR classifiers

80

Figure 3-4. Comparison of the SVM, LR and RF classifiers. Overall accuracy, AUC, Kappa index, sensitivity, specificity, PPV and NPV expressed as %.

60

70

80

90

100

OverallAccuracy

AUC Kappa Sensitivity Specificity PPV NPV

Comparison of classifiers

SVM

LR

RF

81

Figure 3-5. Habitat suitability map of Ph. papatasi in Golestan province, northeast Iran based on support vector machine (note: this map is based on 1km units).

82

CHAPTER 4 A GIS-BASED ARTIFICIAL NEURAL NETWORK MODEL FOR SPATIAL

DISTRIBUTION OF TUBERCULOSIS ACROSS THE CONTINENTAL UNITED STATE

Tuberculosis (TB) is a contagious disease caused by Mycobacterium

tuberculosis ∗198. The disease is primarily transmitted through the respiratory route by

coughing or sneezing 199. The disease mostly attacks the lungs but can also affect other

organs such as kidney and brain 200. It can promote the course of human

immunodeficiency virus (HIV) infection into acquired immune deficiency syndrome

(AIDS) 201. According to the World Health Organization (WHO) global TB report, it is

estimated that 10.4 million incident cases in 2016 developed the disease, of which

almost 1.7 million patients died 202. This agency has ranked TB as the leading cause of

death among HIV patients, the most common killer from a single infectious agent, and

the 9th leading cause of death, worldwide.

According to the statistics by the WHO, most TB cases (>90%) are reported in

developing countries; however, it can also occur in developed countries 203,204. Despite

efforts to eradicate TB in the US, the disease remains a major public health challenge

204. In 2016, more than 9200 TB cases were reported in various parts of the US, which

placed the disease among the top notifiable infectious diseases in the country 205.

Although the frequency of TB has decreased in recent years, it is not expected that the

US will achieve the goal of TB elimination in this century 206.

There are many factors that influence the spatial distribution of TB, which has

made the disease a multidimensional and complex public health problem 207-210.

∗ Reprinted with permission from Mollalo, A., Mao, L., Rashidi, P., & Glass, G. E. (2019). A GIS-Based Artificial Neural Network Model for Spatial Distribution of Tuberculosis across the Continental United States. International journal of environmental research and public health, 16(1), 157.

83

Previous researches from different parts of the world have demonstrated that TB

transmission is related with various individual factors, for example, age, gender,

education level, race, migration, drinking alcohol, and presence of diseases (such as

HIV and diabetes) 211-213. Moreover, at the ecological level, factors such as climate,

altitude, air pollution, economic level, unemployment rate, and poverty have found

significant on TB occurrence 214,215. One of the major drawbacks of the highly applied

traditional statistical models in the study of TB is that these models are often based on

several hard-to-meet assumptions 191,91. This can bias the estimations of TB frequency/

incidence rate 216. For instance, some assumptions of the linear regression (LR) model

are normality of all variables, the linear relationship between inputs and output, constant

variance of errors, and little or no multicollinearity. They also often need a complete

and/or long-term recorded dataset to achieve unbiased estimations 217. On the other

hand, machine learning techniques (MLTs) may lead to appropriate estimations even

with noise-contaminated and incomplete data 218,219. As advanced tools, MLTs have

been successfully used in analyzing and modeling various complex environmental

disciplines, including in ecology, geography, biomedicine, and epidemiology

24,159,220,221,166. The growing popularity of MLTs can be attributed to their abilities to

approximate almost any complex non-linear functional relationship 222-224. Despite their

capabilities in working with noisy and incomplete data as in most epidemiological

studies, they have been underused in spatial epidemiology 169,225.

Inspired by human neural processing, artificial neural networks (ANNs) are

among the most popular MLTs used in recent years in environmental studies. Artificial

neural networks have a large number of highly interconnected processing elements

84

(neurons) working in unison to solve specific problems 226. Compared to the traditional

statistical models, ANNs are independent of the statistical distribution of data and do not

require a priori knowledge about the data for deriving patterns. ANNs are simplified

mathematical models that can map the relationship between input and output layers by

receiving several examples (training data). A properly trained network can further be

used to predict outcome(s) from new data (test data) 24.

There have been few published ANN architectures in spatial modeling of

infectious diseases, worldwide. Aburas et al. applied an ANN model with a back-

propagation algorithm to predict the frequency of confirmed dengue cases using

Singaporean National Environment Agency data 227. Results of their model showed a

correlation coefficient of 0.91 between actual and predicted values. They also identified

influential environmental factors predicting the number of dengue cases including mean

temperature, mean relative humidity, and total rainfall. Laureano-Rosario et al. used

ANN to predict dengue fever (DF) occurrence in a region in Puerto Rico, and in several

coastal municipalities in Mexico 228. The developed ANN models were trained with 19

years of DF data for Puerto Rico and six years’ data for Mexico. Sea surface

temperature, precipitation, air temperature, humidity, previous DF cases, and population

size were used as explanatory variables. Their results showed that the ANN

successfully modeled DF outbreak occurrences with the overall power of 70% in both

areas. The variables with the most influence on predicting DF outbreak were population

size, previous DF cases, air temperature, and date. In the US, Xue et al. developed the

least square (LS) regression analysis and a neural network trained by the genetic

algorithm to evaluate influenza activity in 10 geographic regions 229. They compared the

85

models using three evaluation metrics: mean square error, mean absolute percentage

error, and relative mean square error. All three evaluation indices used in their study

were lower than the corresponding metrics for the LS regression model showing the

superiority of the genetic algorithm-based neural network to the LS regression model.

To date, TB control efforts have relied on the empirically developed WHO DOTS

(directly observed treatment, short-course) control strategy which focuses on “case-

finding” rather than “place-finding” 230. Previous studies from different parts of the US

and at different levels have shown the association of TB with socio-economic status.

Mullins et al. used purely spatial scan statistics to identify spatial clusters of census

tracts with high TB prevalence rates in Connecticut. They found six clusters of TB

containing 126 census tracts 231. Persons in these clusters were more likely to be black

non-Hispanic and less likely to be Asian. Bennett et al. used multivariate logistic

regression to assess the association between demographic and clinical characteristics

and latent TB infection in refugees in San Diego, California 232. They found that the

highest prevalence rate was among refugees from sub-Saharan Africa and those with

less education. In a study in Harris County (in Texas), Feske et al. showed a positive

association between the percent of individuals using public transportation in census

tract and location of clusters detected by Getis–Ord’s Gi* hotspot analysis 233.

Ecological level researches on TB incidence rate at the national level are

inadequate for epidemiologic inferences especially in the US. Therefore, it is crucial to

perform an ecological study across the continental US to identify the location of

statistically significant hotspots and to determine the relationship between

environmental and socio-economic factors and TB incidence to provide useful insight to

86

policymakers in planning for TB control at a larger scale. To our knowledge, no study

has utilized ANNs in modeling the geographic distribution of TB incidence rate in the

US. Integration of the GIS and ANN can improve policymakers’ insight in identifying

potential TB high-risk areas and risk factors useful for future mitigation efforts. We

examined the spatial distribution of the disease and applicability of MLTs in TB

modeling with the following assumptions (1) all reported county-level TB incidence rates

represent the status of TB in the continental US and (2) the TB incidence rate is

influenced by environmental and socioeconomic factors.

Material and Methods

Tuberculosis Data

Data on all reported TB cases in the continental US between 2006 and 2010

were obtained from the paper of Scales et al. in the American Journal of Preventive

Medicine 234. All data are at the county level (n = 3109) and are publicly available 235.

Latent TB cases were not reported and included in this study. To alleviate variations of

TB incidences, particularly in counties with a small population size such as Loving and

King counties in Texas (n < 300 populations), the cumulative incidence was calculated

(2006–2010). For this purpose, five-year corresponding population estimates from the

American Community Survey (ASC) 236 were used. Tuberculosis incidence rates were

imported into ArcGIS 10.5 (ESRI, Redlands, CA, US) and geocoded at the county level.

Explanatory Data

In this study, 278 environmental and socio-economic factors were collected from

various sources and considered as explanatory variables based on previous studies

and domain knowledge. From the Center for Disease Control and Prevention (CDC)

wonder database 237, climate data were derived including daily maximum and minimum

87

air temperature (°F), and daily maximum heat index (°F). Also, the average number of

diabetes cases 238 during the study period were obtained from this database.

Topographic data including minimum, maximum and mean of altitude and slope were

obtained from the national map website 239. The county-level values of these data were

calculated using zonal statistics in ArcGIS Spatial Analyst extension. Socioeconomic

data were acquired from the US Census Bureau 240. A broad range of socio-economic

factors including age group, agriculture, immigration, education level, employment rate,

health, Hispanic or Latino population, income, poverty, and race were obtained across

the nation from this database. All population-based predictors were normalized to the

county’s population size. The full dataset description of exploratory variables is in

Supplementary Table. All data used in this study are downloadable from the above

sources.

Global and Local Clustering

After mapping TB incidence rates at the county level, the empirical Bayes

smoothing method was implemented in the GeoDa software 241 to adjust the crude

incidence rates toward the global mean. This helps to reduce the variance instability

associated with counties with a small population size 114. Thus, the response variable

changed to (logged) smoothed TB incidence rate (STIR) rather than the crude incidence

rate with more than 900 counties with 0 value which makes modeling difficult. All these

counties have non-zero STIR values. Next, the spatial pattern was statistically

evaluated using the global Moran’s I and Getis–Ord General G.

Global Moran’s I measures the similarities between the TB incidence rates of

neighboring counties as follows 142,158:

88

𝐈𝐈 =𝐚𝐚∑ ∑ 𝐰𝐰𝐢𝐢𝐢𝐢𝐳𝐳𝐢𝐢𝐳𝐳𝐢𝐢𝐚𝐚

𝐢𝐢=𝟏𝟏,𝐢𝐢≠𝐢𝐢𝐚𝐚𝐢𝐢=𝟏𝟏

𝐧𝐧𝟎𝟎 ∑ 𝐳𝐳𝐢𝐢𝟐𝟐𝐚𝐚𝐢𝐢=𝟏𝟏

(4-1)

𝐧𝐧𝟎𝟎 = ��𝐰𝐰𝐢𝐢𝐢𝐢

𝐚𝐚

𝐢𝐢=𝟏𝟏

𝐚𝐚

𝐢𝐢=𝟏𝟏

(4-2)

where 𝑧𝑧𝑖𝑖 and 𝑧𝑧𝑖𝑖 are the deviations of STIR for counties 𝑖𝑖 and 𝑗𝑗 from average incidences

(i.e., (𝑥𝑥𝑖𝑖 − �̅�𝑥), (𝑥𝑥𝑖𝑖 − �̅�𝑥)), respectively; 𝑤𝑤𝑖𝑖𝑖𝑖 is the spatial weight based on Rook’s contiguity

(i.e., common borders between counties 𝑖𝑖 and 𝑗𝑗); and 𝑠𝑠0 is the aggregation of all spatial

weights.

Moreover, the Getis–Ord General G statistics, developed by Getis and Ord was

used as a measure of clustering of the high or low value of STIR 57. A positive or

negative Z-score for G indicates spatial clustering of high (hotspot) or low (coldspot)

values, respectively. The formula for the general G of spatial association is:

𝐆𝐆 =∑ ∑ 𝐰𝐰𝐢𝐢𝐢𝐢𝐱𝐱𝐢𝐢𝐱𝐱𝐢𝐢𝐚𝐚

𝐢𝐢=𝟏𝟏𝐚𝐚𝐢𝐢=𝟏𝟏

∑ ∑ 𝐰𝐰𝐢𝐢𝐢𝐢𝐚𝐚𝐢𝐢=𝟏𝟏

𝐚𝐚𝐢𝐢=𝟏𝟏

∀𝐢𝐢 ≠ 𝐢𝐢 (4-3)

Getis–Ord Gi* statistics 66 was applied on the smoothed rates to identify the locations of

statistically significant hotspots of the STIR (p<0.05). Using the same notation as in

Equations (1) and (2), this statistic is computed as follows 233,114:

𝐆𝐆𝐢𝐢∗ =∑ 𝐰𝐰𝐢𝐢𝐢𝐢𝐱𝐱𝐢𝐢 − 𝐗𝐗�∑ 𝐰𝐰𝐢𝐢𝐢𝐢



𝐒𝐒 �[𝐚𝐚∑ 𝐰𝐰𝐢𝐢𝐢𝐢

𝟐𝟐 − �∑ 𝐰𝐰𝐢𝐢𝐢𝐢𝐚𝐚𝐢𝐢=𝟏𝟏 �

𝟐𝟐]𝐚𝐚

𝐢𝐢=𝟏𝟏𝐚𝐚 − 𝟏𝟏

(4-4)

𝐒𝐒 = �∑ (𝐱𝐱𝐢𝐢 − 𝐱𝐱�)𝟐𝟐𝐚𝐚𝐢𝐢=𝟏𝟏,𝐢𝐢≠𝐢𝐢

𝐚𝐚 − 𝟏𝟏− 𝐱𝐱�𝟐𝟐 (4-5)

89

Artificial Neural Networks

An ANN is a computational model, which consists of several simple processing

elements called neurons 242. The neurons are usually structured in layers: the input

layer, the hidden layer(s) and the output layer. In this study, we used ANNs with one

and two hidden layers, however, theoretical research has shown that almost any

complex and non-linear function can be estimated by an ANN with a single hidden layer

243. Additionally, having more hidden layers increases the number of parameters, which

may lead to over-fitting (i.e., memorizing the data while training). The aim of the hidden

layer is to find a multi-dimensional expansion of the input layer, which can be better

transformed to the pattern in the output layer 244. The neurons in the input layer are

connected to all neurons in the hidden layer. Similarly, all the neurons in the hidden

layer are connected to every neuron in the output layer (Figure 4-1). In this system,

selected explanatory variables are fed to the input layer and passed through the hidden

layer that processes them using simple mathematical operations. The relationship

between input and output layers of ANNs is trained by observing a series of known

examples from the training dataset and adjusting the weights accordingly. Once the

training phase is completed, the network is usually able to generalize what it has

learned to the test data with similar attributes of input. We developed a multi-layer

perceptron (MLP) neural network with one and two hidden layers to approximate the

dependency of log (STIR) in the continental US to environmental and socio-economic

factors.

Multi-layer perceptron is the most commonly used ANN structure in

environmental modeling 245,246. During the training phase of MLP, each input feature is

multiplied by its corresponding weight. The results are then summed and passed

90

through a smooth non-linear activation function to produce the output. The Logistic

function and hyperbolic tangent are among the widely used activation functions 247,248. In

a supervised learning setup, the difference between the (MLP) output and actual

output/target (i.e., the error or cost function) can be calculated as in Equation (5) 249:

𝐚𝐚𝐚𝐚𝐚𝐚𝐧𝐧𝐚𝐚 =𝟏𝟏𝟐𝟐�(𝐚𝐚𝐢𝐢 − 𝐧𝐧𝐢𝐢)𝟐𝟐𝐚𝐚

𝐢𝐢=𝟏𝟏

(4-6)

where 𝑜𝑜 and 𝑡𝑡 are model output and target respectively, and 𝑛𝑛 is the training sample

size. To minimize the cost function, a back-propagation algorithm based on stochastic

gradient descent is used to adjust each weight in the MLP model. The updated value for

each weight is calculated as in Equation (6):

𝐰𝐰(𝐤𝐤+𝟏𝟏): = 𝐰𝐰(𝐤𝐤) − 𝛂𝛂𝛛𝛛𝐚𝐚𝐚𝐚𝐚𝐚𝐧𝐧𝐚𝐚𝛛𝛛𝐰𝐰(𝐤𝐤) (4-7)

where 𝛼𝛼 or learning rate controls how much the coefficients can change on the 𝑘𝑘th

update. More detailed information about back-propagation MLP is presented in 250,251.

Model Pre-Processing

To develop the models, the entire dataset was randomly divided into three

different partitions: (1) training data: to learn and update the weights and biases in

network (60% of the total data) (2) cross-validation data: to avoid overfitting problem by

tuning models’ parameters during training phase (15% of the total data) (3) test data: to

evaluate accuracy and predictive power of the network after the training process (25%

of the total data). The spatial distribution of training, cross-validation, and test data as

shown in Figure 4-2. For the purpose of comparison, we used the same training, cross-

validation and test data for all developed models.

91

The next step was to standardize the input data before using them in the ANN

models. Input data have different ranges and units, this pre-processing step can

enhance the performance of models by faster convergence 24,252. There are several

standardization formulas presented in the literature. Here, we use the following equation

which transforms the input data to the range of 0 to 1 as in Equation (8):

𝐗𝐗𝐧𝐧 =𝐗𝐗𝐢𝐢 − 𝐗𝐗𝐦𝐦𝐢𝐢𝐚𝐚𝐗𝐗𝐦𝐦𝐚𝐚𝐱𝐱 − 𝐗𝐗𝐦𝐦𝐢𝐢𝐚𝐚

(4-8)

where 𝑋𝑋𝑖𝑖 is the initial (actual) value; 𝑋𝑋𝑚𝑚𝑖𝑖𝑚𝑚 and 𝑋𝑋𝑚𝑚𝑚𝑚𝑚𝑚 are the minimum and maximum of

the initial values and 𝑋𝑋𝑠𝑠 is the respective standardized value. Moreover, after training

the network, the output of the model is returned to the original form through the

Equation (9):

𝐗𝐗𝐢𝐢 = 𝐗𝐗𝐧𝐧 ∗ (𝐗𝐗𝐦𝐦𝐚𝐚𝐱𝐱 − 𝐗𝐗𝐦𝐦𝐢𝐢𝐚𝐚) + 𝐗𝐗𝐦𝐦𝐢𝐢𝐚𝐚 (4-9)

Multi-layer perceptron and linear regression (LR) models are sensitive to

redundant explanatory variables because noise reduces model accuracy and

generalizability. Thus, a ‘proper’ selection of independent variables is crucial for better

performance. We applied L1-regularization or least absolute shrinkage and selection

operator (LASSO) on the training and cross-validation dataset before developing the

models. This process produces a sparse solution (i.e., few non-zero coefficients) which

reduces overfitting and enhances interpretability of the results 253. Detailed information

of L1 regularization technique can be found in Park and Hastie 179.

In the MLP model, 8 and 1 neurons were used in the input and output layers,

respectively. These numbers correspond to the 8 explanatory variables selected during

L1 regularization and one response variable (i.e. log (STIR)). There is no deterministic

92

rule to determine the number of neurons in the hidden layer. The grid search was used

to tune hyper-parameters in MLP with one and two hidden layer(s). This approach

systematically evaluates the developed model for each combination of model

parameters. For MLP, the tangent hyperbolic activation function �∅(𝑥𝑥) = 1−𝑒𝑒−𝑥𝑥

1+𝑒𝑒−𝑥𝑥�, a non-

linear and symmetric function which maps any real value to [–1,1], was used in the

hidden layer(s) 254. In addition, a linear identity function (𝑓𝑓(𝑥𝑥) = 𝑥𝑥) was applied as an

activation function in the output layer. All computer codes were developed in the Python

programming language.

Model Evaluation

We computed three types of evaluation metrics to assess and compare the

generalization capability of MLP with LR in predicting log (STIR) in the continental US.

These statistics include root mean square error (RMSE), mean absolute error (MAE)

and correlation coefficient (R) between model output and ground-truth.

Due to the fully connected architecture of MLP, it is difficult to define the explicit

relationship between input and output variables by coefficients 255. Sensitivity analysis is

a common way to address this problem. In this analysis, each factor was excluded from

the model, individually, and the RMSE of resulting models were compared. The most

influential factor, among the selected features, is the one that its absence increases the

RMSE of the model the most. Using sensitivity analysis, we identified the most

influential factors in predictions of the log (STIR). Finally, we ranked them according to

their decreasing importance.

93

Results

Between 2006 and 2010, 64,496 TB cases were reported across the continental

US. The number of TB cases showed a consistent declining trend from 14,119 to

11,284 annual cases (Figure 4-3). Among the states, the highest average TB incidence

rates were identified in Louisiana (5.30 cases per 100,000), Arizona (5.05 per 100,000)

and Georgia (4.7 per 100,000).

The global Moran’s I and general G indicated significant spatial clustering of

STIR in the continental US for the study period (Moran’s I = 0.13, Z-score = 32.13,

p<0.005; General G = 0.002, Z-score = 15.3, p<0.005). The hotspot analysis (Getis–Ord

Gi*) identified that about 7% of the continental US counties (n = 216) were part of

hotspots. The hotspots of STIR were distributed unequally, almost restricted to the

southern half of the country and particularly in the southern and southeastern counties

of the US (Figure 4-4). Table 4-1 summarizes the top 10 states with the largest number

of counties detected as part of STIR hotspots by the Getis–Ord Gi* technique.

Based on the results of L1 regularization, the environmental and several socio-

economic factors were selected. Out of 278 explanatory variables, only 8 factors were

incorporated as input variables for the models: (1) RHI820: resident population: not

Hispanic, white alone (July 1-estimate) (proportion of county population); (2) LFE330:

employed persons by industry (NAICS)-agriculture, forestry, fishing and hunting, and

mining (proportion of county population); (3) minimum temperature; (4) POP778: year of

entry by citizenship status in the United States entered 2000 or later-foreign-born

(proportion of county population) (5) IPE110: people of all ages in poverty (proportion of

county population); (6) SPR440: social security-benefit recipients (proportion of county

population); (7) HIS305: Hispanic or Latino persons, educational attainment, 25 years

94

and over, male (proportion of county population); (8) POP730: population one year and

over by residence-moved from different county within same state 2005–2009

(proportion of county population).

All Pearson correlation values among the selected variables were under 0.5, thus

we considered the selected factors as relatively uncorrelated (Table 4-2). The log

(STIR) was positively correlated with all variables except for household income

(p<0.05).

Preliminary results of the developed LR model showed that the selected

predictors generated R = 0.666, R2 = 0.443, and F = 184.246 (Sig. F Change < 0.001).

The R value which represents the simple correlation between predictions and reality

indicated an acceptable degree of correlation. The R2 value indicates that 44.3% of total

variations in the log (STIR) can be explained by the predictors. The F-test was

significant showing the developed LR model as a whole has statistically significant

predictive capability. Durbin–Watson test was close to 2 which verifies independency of

errors assumption (Durbin–Watson statistic = 2.04) (Table 4-3).

The t-test showed that all selected variables are statistically significant at a 99%

confidence interval. Based on the standardized coefficients, among the variables, in

order of strength, “RHI820”, “LFE330”, “HIS305”, “SPR440”, and “POP730” have

negative impacts on the log (STIR) while the variables “POP778”, “Min Temp”, and

“IPE110” have positive impacts on the dependent variable, respectively. Tolerance and

the variance inflation factor (VIF) were used as two collinearity diagnostic tests to

assess multicollinearity level for all variables. As the values of VIF didn’t exceed 3 and

95

tolerance statistic were above 0.1, it seems that there is no cause for concern about

collinearity (Table 4-4).

The normal probability-probability plot (P-P Plot) of LR residuals showed that

data points were closely aligned with the diagonal line suggesting the distribution of the

residuals was almost normal. This indicated that the assumption of normality of errors

was almost met. Nevertheless, there were a few samples which departed from the

diagonal line (Figure 4-5).

Table 4-5 presents the performance of MLP (1-hidden layer), MLP (2-hidden

layer), and LR models, in terms of MAE, RMSE, and R, for training, cross-validation and

test datasets, respectively. The correlation coefficient for a single hidden layer MLP,

with 20 nodes in the hidden layer, was larger than double hidden layers MLP, and LR

(Table 4-5), which showed a better agreement between the predicted and the ground-

truth. In addition, RMSE in MLP (single layer) was lower than the other models.

However, MAE in MLP with a two-hidden layer (20 nodes in first and 10 nodes in

second hidden layers) had the same test errors as single layer MLP. Results suggest

that the single layer MLP model outperformed the rest of models with higher

generalizability for predicting the log (STIR) in the continental US. Similarly, double

hidden layer MLP outperformed the LR model.

The scatter plot between the output of the MLP model and the corresponding

observed log (STIR) (i.e., ground-truth values) for test data (Figure 4-6) showed that the

model was able to predict the average variations of log (STIR), while it was unable to

predict some counties with exceptional rates.

96

The best model accuracy was achieved by the single layer MLP compared with

the other models. We examined the contribution/relative importance of each input

feature on the log (STIR) using sensitivity analysis. The results revealed that the highest

RMSE occurred when “resident population of American Indian and Alaska native” was

removed from the model, which implies that this factor has the maximum contribution,

among the selected factors, in predicting log (STIR) at the county level in the continental

US. The most influential factors in order of contributions were: “RHI820: resident

population: not Hispanic, White alone (July 1-estimate) (proportion)”, “LFE330:

employed persons by industry (NAICS)-agriculture, forestry, fishing and hunting, and

mining (proportion)”, “Minimum Temperature”, “POP778: year of entry by citizenship

status in the United States entered 2000 or later-foreign-born (proportion)”, “IPE110:

people of all ages in poverty (proportion)”, “SPR440: supplemental security income-

average monthly payments per recipient”, “HIS305: Hispanic or Latino persons,

educational attainment, 25 years and over, male (proportion)” and “POP730: population

one year and over by residence-moved from different county within same state 2005-

2009 (proportion)”. (Figure 4-7).

Discussion

According to the Institute of Medicine (2000), eliminating TB will require the

development of new tools to identify risk factors and high-risk areas 256. As expressed

by Feske et al., effective TB elimination in the US would require geographic elucidation

of high-risk areas and systematic surveillance of location-based risk factors 233. In this

study, we combined more advanced tools to examine the relationship between

environmental and socio-economic factors, and the log (STIR) across the continental

US. Integration of GIS, spatial statistics, and ANNs resulted in an efficient multi-

97

disciplinary approach, which can provide helpful guidelines for health decision makers.

The benefits obtained from this approach can enhance mitigation efforts such as budget

allocation, educating people who live in high-risk areas and drug distribution. Due to

limited research on the spatial modeling of TB at the national level, our study can be

regarded as a basis for future nationwide TB program researches.

In this study, we examined the spatial pattern and hotspots of the STIR in the

continental US. Moran’s I and General G statistic were used to investigate the presence

of spatial autocorrelation of local STIR. Our results showed that the distribution of STIR

at the county levels is clustered at the county level (p<0.05). We then conducted the

hotspot analysis to identify the counties with statistically significant STIR using the

hotspot analysis (Getis–Ord Gi*) approach. Our findings showed that STIR hotspots

were concentrated in the southern and western states (Figure 4-4); however, the South,

Southeast, and Southwest counties of the country were more severely affected. The

concentration of hotspots suggests that there were more cases observed in these areas

that would be expected if everyone were equally at risk.

Visual comparison of the location of identified hotspots in some states which had

a very high proportion of counties falling into hotspots (such as Georgia and Florida)

with recent surveillance reports of Georgia 257 and Florida 258 showed pieces of

evidence of similarities with some differences. This suggests the stability of location of

counties falling into hotspots which require close attention. Since after detecting the

statistically significant hotspots of STIR in the region, associated risk factors were not

known, we used ANN to model the relationship between environmental and socio-

economic factors and the log (STIR).

98

Based on sensitivity analysis, the environmental factors identified the contribution

of the selected variables to the log (STIR) at the county level. We found that the

average daily minimum temperature (with a positive effect) was an important climate

factor in the log (STIR). This is probably due to the adverse effects of minimum air

temperature on the respiratory system of patients, and more close lifestyle of people in

cold weather which increases the risk of exposure to infectious agents 259. Similar

results were reported in other studies. In a time-series analysis study in Fukuoka

(Japan), Onozuka and Hagihara found a positive significant relationship between

extreme cold temperature and TB incident cases 259. Our results are also consistent

with the findings of Mourtzoukou et al. 260 and Khalid et al. in Pakistan 261.

Sensitivity analysis showed that economic factors are important for log (STIR) in

the continental US. One of the most important economic factors was “proportion of

population county of all ages in poverty” which suggests that underserved segments of

the population are at higher-risk of STIR in the US. Conversely, counties with a higher

proportion of the population employed by industry or higher security income had lower

STIR. These factors potentially describe the impact of poverty/deprivation on lifestyle

choice. These findings agree with the individual-level studies of McKenna et al. 262, Ho

263, and Weis et al. 264. Unemployment rate has been found to be an important factor in

TB transmission in the studies of Munch et al. who indicated the strongest correlation

with TB caseload 265, and the studies of Dos Santos 266 and Jackubowiak 267; while Sun

et al. 207 did not find it significant in China. In a cohort study in Georgia, Djibuti et al. 268

showed that TB patients with lower-income households are at higher risk of poor TB

treatment. Bamrah et al. analyzed the genotyping of homeless persons in the US,

99

1994–2010. Their results showed that homeless TB patients had an approximately 10-

fold increase in TB incidence and had more than twice the odds of not completing

treatments 269.

Results of this study also showed the importance of race distribution and STIR as

there was a strong negative relationship between the proportion of county population

who are white and STIR. Visual comparison of the locations of the hotspots (Figure 4-4)

with the distribution map of race and ethnicity provided by US Census Bureau 270,

confirms that, in general, counties with more than 80% of the white population didn’t fall

into the hotspots. This agrees with the findings of Cantwell et al. 208 and CDC report 271.

Artificial neural networks and GIS were effective in modeling log (STIR), but

several limitations exist. First, the data used in this study were collected from multiple

online sources. It should be noted that TB reported data are subject to spatial

differences in case detection, thus, a standard passive case-finding approach needs to

be considered. In addition, methods of data collection and preparation are different

which may result in biased estimations. In this study, we only included active reported

TB cases. It should be noted that there are many people with latent TB infection who

have not developed TB disease, and therefore, are not detected or counted as cases.

This is more prevalent (exists in greater proportions) among those from other countries.

The last limitation stems from the study design. Our study at the county level should be

considered to characterize population rather than individual-level characteristics of risk

(ecological fallacy). Thus, the findings of this study should be applied to population-level

targeting rather than considering individual treatment.

100

This study showed machine learning techniques in spatial modeling can be

applied to TB incidence rate across the continental US. For future works, we

recommend conducting researches at multiple scales, particularly at finer scales such

as at the census tracts or block group level for more focal interventions. However, as

most policy decisions on TB control are performed at the state and federal government

levels, this represents a meaningful first step. To optimize the structure of ANNs, choice

of parameters (e.g., learning rate, activation functions) and training the weights in ANN,

and heuristic algorithms such as genetic algorithm may be useful leading to global

optima, because it can help to escape from local optima 272. Also, hyper-parameter

tuning coupled with more recent ANN approaches in terms of activation function or

optimization (e.g., Adam or other adaptive methods) is highly recommended. We also

recommend incorporating genetics data into the model or examine the genetic

characteristics of individual patients in counties with high modeling errors. In this regard,

national TB genotyping surveillance coverage which refers to the proportion of TB cases

with culture-positive with at least one genotyped isolate in the US can be useful. Also,

other statistical techniques such as Poisson model (through including the size of local

population at risk as an offset in the model) or the logistic regression model (through

binomial distribution) can be investigated in modeling TB incidence rate. Coupled with

GIS, the ANNs techniques successfully identified some determinant factors of log

(STIR), ranked them, and had better prediction ability than the traditional linear

regression analysis. The findings of this study can provide useful insight to health

authorities on prioritizing resource allocation to risk-prone areas.

101

Tables Table 4-1. Top 10 states with the largest number of hotspot counties (p < 0.10) of

smoothed tuberculosis (TB) incidence rate (STIR) in the continental US, 2006–2010.

Rank State No. Hotspot Counties

Percentage (#hotspots/#counties)

1 Georgia 57 35.8% 2 Texas 30 11.8% 3 North Carolina 23 23.0% 4 Louisiana 22 34.3% 5 Florida 20 29.9% 6 California 17 22.7% 7 South Carolina 17 37% 8 Arkansas 12 16.0% 9 Mississippi 12 14.6% 10 Alabama 10 14.9%

Table 4-2. Pearson correlation analysis between selected variables for modelling STIR,

continental US. POP730 LFE330 IPE110 POP778 Min

Temp SPR440 HIS305 RHI820

POP730 1.000 0.051 0.041 0.064 −0.124 −0.078 −0.138 −0.024 LFE330 0.051 1.000 0.018 0.057 0.136 −0.499 −0.186 −0.040 IPE110 0.041 0.018 1.000 0.266 −0.231 −0.108 0.066 0.384 POP778 0.064 0.057 0.266 1.000 −0.005 0.091 −0.390 0.248 Min Temp −0.124 0.136 −0.231 −0.005 1.000 0.066 −0.032 0.308

SPR440 −0.078 −0.499 −0.108 0.091 0.066 1.000 −0.015 0.003 HIS305 −0.138 −0.186 0.066 −0.390 −0.032 −0.015 1.000 0.403 RHI820 −0.024 −0.040 0.384 0.248 0.308 0.003 0.403 1.000

Table 4-3. Results of linear regression (LR) model for modeling log (STIR), continental US.

Model R R Square

Adj. R Square

Change Statistics Durbin–Watson

R Square Change

F df1 df2 Sig.

LR 0.666 a 0.443 0.440 0.443 184.246 8 1854 0.000 2.041 a. Predictors: (Constant), POP73, LFE330, IPE110, POP778, Min Temp, SPR440, HIS305, RHI820. Dependent Variable: log (STIR).

102

Table 4-4. Effects of environment and socio-economic factors on the log (STIR) using LR model. Unstandardized Coefficient

Standardized Coefficient

t Sig. 95.0% Confidence Interval for B

Collinearity Statistics

Variables B Beta

Lower Bound

Upper Bound

Tolerance VIF

(Constant) 0.001

0.009 0.993 −0.198 0.200

RHI820 −0.007 −0.294 −11.117

0.000 −0.009 −0.006 0.429 2.328

LFE330 −0.023 −0.166 −7.929 0.000 −0.029 −0.017 0.683 1.463 Min Temp 0.013 0.210 9.809 0.000 0.010 0.016 0.653 1.532 POP778 0.083 0.282 12.621 0.000 0.070 0.095 0.602 1.661 IPE110 0.012 0.140 6.662 0.000 0.008 0.015 0.677 1.477 SPR440 −0.009 −0.097 −4.703 0.000 −0.013 −0.005 0.701 1.426 HIS305 −0.019 −0.145 −5.976 0.000 −0.026 −0.013 0.508 1.968 POP730 −0.015 −0.080 −4.489 0.000 −0.021 −0.008 0.950 1.053

VIF: Variance inflation Factor. Table 4-5. Comparison of multi-layer perceptron (MLP; one and two hidden layers), and LR model’ performance for

predicting log (STIR) in the continental US. Model Training Cross-Validation Test

MAE RMSE R MAE RMSE R MAE RMSE R

LR 0.27 0.35 0.66 0.27 0.36 0.65 0.28 0.36 0.61 MLP (1 hidden layer) 0.25 0.33 0.70 0.26 0.35 0.67 0.27 0.35 0.63

MLP (2 hidden layers) 0.26 0.34 0.69 0.26 0.35 0.65 0.27 0.36 0.62

103

Figures

Figure 4-1. Topological architecture of multi-layer perceptron neural network (MLPNN) used in this study 63.

Figure 4-2. Spatial distribution of training, cross-validation, and test data used for modeling log (STIR).

104

Figure 4-3. The frequency of TB cases (left) and the cumulative TB incidence rate (right) across the continental US (2006–2010).

Figure 4-4. Hotspot map for the STIR in the continental US identified by hotspot analysis (Getis–Ord Gi*) technique, 2006–2010.

105

Figure 4-5. The Normal P-P Plot of LR model.

106

Figure 4-6. Scatter plot of observed and predicted log (STIR) (by single hidden layer

MLP model) for test data in the continental US.

Figure 4-7. The contribution of input features on predicting log (STIR) according to

sensitivity analysis of single hidden layer MLP. RMSE: Root mean square error.

0.28

0.190.17

0.33

0.18

0.12

0.15

0.12

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

RHI820 LFE330 Min Temp POP778 IPE110 SPR440 HIS305 POP730

Cha

nge

in R

MSE

Variable

107

CHAPTER 5 CONCLUSION

Summary

Despite a significant growth in spatial analysis of various types of infectious

diseases, most studies remain non-spatial. Our primary focus of this research was to

study spatial aspects of (three different) infectious diseases. Our studies confirmed that

“geography” is an essential part of the infectious diseases. In other words, we showed

that LD hotspots in CT occurred in specific geographic towns. Also, using geographic

factors we could predict presence/absence of one of main vector of ZCL and TB

incidence with a decent accuracy. Thus, it is necessary to incorporate it in future control

programs and predictions to fight against the diseases.

Due to interdisciplinary nature of the most infectious diseases, we underscored

innovative techniques such as RS, machine learning and GPS in a GIS platform and as

cost-effective and time saving approaches. Our main intent was to make contribution in

predicting presence/absence of the vector(s) that cause the disease and predicting

incidence of the diseases with higher accuracy than highly-applied traditional

techniques in literature. We examined a range of machine learning techniques that are

widely-used in many disciplines but underutilized in spatial epidemiology.

In Chapter 2, using ESDA in a GIS framework, we analyzed resulting pattern of

LD distribution in CT. We identified the towns where controlling LD was most required

and could provide the highest benefits in allocating resources. In Chapter 3, we

promoted machine learning classifiers such as random forest and support vector

machine as powerful algorithms in binary classifications tasks. However, there are many

other machine learning classifiers that have not been included in the research such as

108

k-means and Naïve Bayes classifier. Most of the binary classifiers are transferable to

medical geography and are likely to have better prediction abilities the than traditional

statistical techniques. We therefore recommend medical geographers to develop and

evaluate novel models in infectious disease research to examine the applicability of

techniques that better predict the presence/absence of other vectors or disease

morbidity/mortality rates.

In Chapter 4, the results of ANN showed an improvement of prediction accuracy

of continuous health outcome (TB rate) compared to the classic linear regression. ANN

coupled with GIS provided insight about the heterogeneity of TB in the US. We applied

the data science technique in the context of national level control strategy which are

rarely implemented in nationwide control programs of infectious diseases. However,

less insight is obtained in county level studies compared to individual level studies. Our

search of ANN in spatial epidemiology indicated that very few studies have applied the

technique. We recommend examining other robust and novel ANN techniques.

Depending on the nature and complexity of the infectious disease, size of the study

area, spatial distribution of the vectors, various types of the neural networks could be

applied, such as radial basis function, regulatory feedback, and recurrent neural

networks.

Limitations

The success of classifiers and models used in this research highly depend on

the quality of collected data and the variables used for predictions. Final spatial data

that represent a feature in the models undergo several transformations such as

measurement error or analysis errors (such as converting addresses to locations

(geocoding)). Also, there are several obstacles and sources of uncertainty that need to

109

be considered. First, neglecting the role of spatial autocorrelation especially in sparse

data may result in biased estimation of variables’ importance 273. SA is one of the most

important concepts that are usually ignored in spatial epidemiology. Misclassification

due to population migration or movements is another major source of uncertainty in

spatial and temporal pattern that can influence the validity of the results of our LD and

TB studies. Thus, misclassification can enhance or reduce the strength of associations.

One of the strengths of aggregated data compared to individual level studies is that they

are less influenced by misclassification. Selection bias can be another important

limitation. Inaccurate or incomplete distribution of collected health data (such as

collected sand flies) may lead to improper representation of the population of interest. In

our ZCL study, selection bias during sand fly collection can distort associations between

prediction of sand fly occurrence and descriptive variables.

Another problem is attributed to the administrative boundaries. All LD and TB

data and associated variables which were collected as a single value that represent

administrative boundaries (because of data restrictions, we used aggregated data)

rather than physical boundaries. This might be problematic for spatial and particularly

space-time analysis when comparing changes of pattern over time. The boundaries

may be subject to change (split or merge of some areas) because of governmental

policies, however in our study very few changes in the administrative boundaries

occurred. Spatial scale was another major concern in our LD and TB researches. In

these two papers, data were obtained at the town and county levels, respectively. The

values within each division is uniform but there might be sharp contrasts between

neighboring polygons. Thus, selection of geographic resolution for conducting analysis

110

is important. Because aggregation at various spatial scales can lead to different results.

Also, different aggregation techniques (median, mode, sum, etc.) can lead to different

results. In our studies, the choice of spatial unit was dictated by the available data.

These biases can cause spurious associations. By taking advantage of

technologic advances such as GIS as well as data collection techniques such as GPS

and RS, it is likely to improve reliability of results. Spatial data with higher spatial and

temporal resolutions can provide more accurate estimations of risk and predictions in

researches in medical geography. However, even with having accurate data spurious

associations are inevitable.

Outlook and Future Challenges

Future researches in GIS and spatial analysis need to find optimal solutions for

challenges such as scale and edge effect. Data collection techniques need to make

accurate and reliable spatial data available. Organizations that provide demographic

data need to make the same data available at multiple scales such as census tracts,

black group and county level. Analysis has to be (re)conducted at all the possible

scales, to address Modifiable areal unit problem (MAUP). Due to exponential growth of

availability and use of spatial data which is likely to continue, it is anticipated that GIS

software packages become more available. The software should include advanced and

robust techniques such as data science techniques and other approaches from other

disciplines. Also, spatial autocorrelation and uncertainty inherent in spatial data need to

be incorporated in the packages. These two areas deserve closer attention of medical

geographers. Analyzing big-data particularly in national and global health in near future

can be a new era of investigations in spatial epidemiology. In this regard, deep-learning

111

neural network is a machine learning technique that can be useful when working with

noisy and very large datasets 274.

112

APPENDIX A STATISTICAL COMPARISON OF CLASSIFIERS

Table A-1. Statistical test for comparing AUCs of SVM, LR and RF classifiers: Test

Statistic AUC 1 AUC 2 AUC 1-AUC 2 Lower 95% CL

Upper 95% CL

Test Statistic Value

Prob Level

Reject H0 at

α=0.05 Wald Z 0.974 (SVM) 0.965 (LR) 0.0090 -0.0704 0.0884 0.222 0.8243 No Wald Z 0.974 (SVM) 0.935 (RF) 0.0309 -0.0569 0.1349 0.794 0.4272 No Wald Z 0.965 (LR) 0.935 (RF) 0.0300 -0.0704 0.1304 0.584 0.5592 No

H0: AUC 1 = AUC 2 H1: AUC 1 ≠ AUC 2 Sample Size = n test = 36 Therefore, the P-values do not indicate a significant difference between the AUCs.

113

APPENDIX B HABITAT SUITABILITY MAPS

Figure B-1. Habitat suitability map of Ph. Papatasi based on logistic regression classifier

114

Figure B-2. Habitat suitability map of Ph. Papatasi based on random forest classifier

115

LIST OF REFERENCES

1. Moore, D. A., & Carpenter, T. E. (1999). Spatial analytical methods and geographic information systems: use in health research and epidemiology. Epidemiologic reviews, 21(2), 143-161.

2. Valenčius, C. B. (2000). Histories of medical geography. Medical History, 44(S20), 3-28.

3. Miller, G. (1962). " Airs, Waters, and Places" in History. Journal of the history of medicine and allied sciences, 129-140.

4. Cliff, A. D., Smallman-Raynor, M. R., & Stevens, P. M. (2009). Controlling the geogrpahical spread of infectious disease: Plague in Italy, 1347-1851. Acta medico-historica Adriatica, 7(2), 197-236.

5. Stevenson, L. G. (1965). Putting disease on the map: the early use of spot maps in the study of yellow fever. Journal of the History of Medicine and Allied Sciences, 226-261.

6. McLeod, K. S. (2000). Our sense of Snow: the myth of John Snow in medical geography. Social science & medicine, 50(7-8), 923-935.

7. May, J. M. (1959). The Ecology of Human Disease. The Ecology of Human Disease.

8. Meade, M. S., & Emch, M. (2010). Medical geography. Guilford Press.

9. May, J. M. (1950). Medical geography: its methods and objectives. Geographical Review, 9-41.

10. Gilbert, E. W. (1958). Pioneer maps of health and disease in England. Geographical Journal, 172-183.

11. Armstrong, R. W. (1965). Medical Geography-An Emerging Specialty?. International Pathology, 6(3), 61-63.

12. Musa, G. J., Chiang, P. H., Sylk, T., Bavley, R., Keating, W., Lakew, B., ... & Hoven, C. W. (2013). Use of GIS Mapping as a Public Health Tool–-From Cholera to Cancer. Health services insights, 6, HSI-S10471.

13. Meade, M. S. (2014). Medical geography. The Wiley Blackwell Encyclopedia of Health, Illness, Behavior, and Society, 1375-1381.

14. Kirby, R. S., Delmelle, E., & Eberth, J. M. (2017). Advances in spatial epidemiology and geographic information systems. Annals of epidemiology, 27(1), 1-9.

116

15. Morris, E., & Bone, C. (2016). Identifying spatial data availability and spatial data needs for Chagas disease mitigation in South America. Spatial and spatio-temporal epidemiology, 17, 45-58.

16. Mitchell, S. E. (2011). Using GIS to Explore the Relationship between Socioeconomic Status and Demographic Variables and Crime in Pittsburgh, Pennsylvania. Papers in Resource Analysis, 13(1), 1-11.

17. Cromley, E. K., & McLafferty, S. L. (2011). GIS and public health. Guilford Press.

18. Nykiforuk, C. I., & Flaman, L. M. (2011). Geographic information systems (GIS) for health promotion and public health: a review. Health promotion practice, 12(1), 63-73.

19. Campbell, J. B., & Wynne, R. H. (2011). Introduction to remote sensing. Guilford Press.

20. Hay, S. I. (2000). An overview of remote sensing and geodesy for epidemiology and public health application. Advances in parasitology, 47, 1-35.

21. Ali, I., Greifeneder, F., Stamenkovic, J., Neumann, M., & Notarnicola, C. (2015). Review of machine learning approaches for biomass and soil moisture retrievals from remote sensing data. Remote Sensing, 7(12), 16398-16421.

22. Im, S. B., Hurlebaus, S., & Kang, Y. J. (2011). Summary review of GPS technology for structural health monitoring. Journal of Structural Engineering, 139(10), 1653-1664.

23. Glass, G. E., Schwartz, B. S., Morgan III, J. M., Johnson, D. T., Noy, P. M., & Israel, E. (1995). Environmental risk factors for Lyme disease identified with geographic information systems. American journal of public health, 85(7), 944-948.

24. Mollalo, A., Sadeghian, A., Israel, G. D., Rashidi, P., Sofizadeh, A., & Glass, G. E. (2018). Machine learning approaches in GIS-based ecological modeling of the sand fly Phlebotomus papatasi, a vector of zoonotic cutaneous leishmaniasis in Golestan province, Iran. Acta tropica, 188, 187-194.

25. Elliot, P., Wakefield, J. C., Best, N. G., & Briggs, D. J. (2000). Spatial epidemiology: methods and applications. Oxford University Press.

26. Monmonier, M. (2005). Lying with maps. Statistical science, 20(3), 215-222.

27. Han, J., Kamber, M., & Tung, A. K. (2001). Spatial clustering methods in data mining. Geographic data mining and knowledge discovery, 188-217.

117

28. Stocks, P. (1953). Studies of cancer death rates at different ages in England and Wales in 1921 to 1950: uterus, breast and lung. British journal of cancer, 7(3), 283.

29. Wakefield, J. (2006). Disease mapping and spatial regression with count data. Biostatistics, 8(2), 158-183.

30. Indrayan, A., & Kumar, R. (1996). Statistical choropleth cartography in epidemiology. International journal of epidemiology, 25(1), 181-189.

31. Gebreslasie, M. T. (2015). A review of spatial technologies with applications for malaria transmission modelling and control in Africa. Geospatial health, 10(2).

32. Patil, A. P., Gething, P. W., Piel, F. B., & Hay, S. I. (2011). Bayesian geostatistics in health cartography: the perspective of malaria. Trends in Parasitology, 27(6), 246-253.

33. Best, N., Richardson, S., & Thomson, A. (2005). A comparison of Bayesian spatial models for disease mapping. Statistical methods in medical research, 14(1), 35-59.

34. Carlin, B. P., & Louis, T. A. (2010). Bayes and empirical Bayes methods for data analysis. Chapman and Hall/CRC.

35. Caprarelli, G., & Fletcher, S. (2014). A brief review of spatial analysis concepts and tools used for mapping, containment and risk modelling of infectious diseases and other illnesses. Parasitology, 141(5), 581-601.

36. Elliott, P., & Wartenberg, D. (2004). Spatial epidemiology: current approaches and future challenges. Environmental health perspectives, 112(9), 998-1006.

37. Nychka, D. (1988). Bayesian confidence intervals for smoothing splines. Journal of the American Statistical Association, 83(404), 1134-1143.

38. Sharmin, S., Glass, K., Viennet, E., & Harley, D. (2018). Geostatistical mapping of the seasonal spread of under-reported dengue cases in Bangladesh. PLoS neglected tropical diseases, 12(11), e0006947.

39. Randremanana, R. V., Richard, V., Rakotomanana, F., Sabatier, P., & Bicout, D. J. (2010). Bayesian mapping of pulmonary tuberculosis in Antananarivo, Madagascar. BMC infectious diseases, 10(1), 21.

40. Lawson, A. B. (2013). Bayesian disease mapping: hierarchical modeling in spatial epidemiology. Chapman and Hall/CRC.

41. Dubin, R. A. (1998). Spatial autocorrelation: a primer. Journal of housing economics, 7(4), 304-327.

118

42. Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroit region. Economic geography, 46(sup1), 234-240.

43. Miron, J. (1984). Spatial autocorrelation in regression analysis: A beginner’s guide. In Spatial statistics and models (pp. 201-222). Springer, Dordrecht.

44. De Jong, P. D., & De Bree, J. (1995). Analysis of the spatial distribution of rust-infected leek plants with the Black-White join-count statistic. European Journal of Plant Pathology, 101(2), 133-137.

45. Cliff, A. D., & Ord, J. K. (1981). Spatial processes: models & applications. Taylor & Francis.

46. Gilbert, G. S., Foster, R. B., & Hubbell, S. P. (1994). Density and distance-to-adult effects of a canker disease of trees in a moist tropical forest. Oecologia, 98(1), 100-108.

47. Moran, P. A. (1948). The interpretation of statistical maps. Journal of the Royal Statistical Society. Series B (Methodological), 10(2), 243-251.

48. Wilhelm, A., & Steck, R. (1998). Interactive graphics and local statistics. Journal of the Royal Statistical Society: Series D (The Statistician), 47(3), 423-430.

49. Chen, Y. (2012). On the four types of weight functions for spatial contiguity matrix. Letters in Spatial and Resource Sciences, 5(2), 65-72.

50. Naish, S., Dale, P., Mackenzie, J. S., McBride, J., Mengersen, K., & Tong, S. (2014). Spatial and temporal patterns of locally-acquired dengue transmission in northern Queensland, Australia, 1993–2012. PloS one, 9(4), e92524.

51. Geary, R. C. (1954). The contiguity ratio and statistical mapping. The incorporated statistician, 5(3), 115-146.

52. Chen, Y. (2013). New approaches for calculating Moran’s index of spatial autocorrelation. PloS one, 8(7), e68336.

53. Junior, G. B., de Paiva, A. C., Silva, A. C., & de Oliveira, A. C. M. (2009). Classification of breast tissues using Moran's index and Geary's coefficient as texture signatures and SVM. Computers in Biology and Medicine, 39(12), 1063-1072.

54. Souza, S. P. D., Jorge, V. M., & Orzechowski Xavier, M. (2014). Paracoccidioidomycosis in southern Rio Grande do Sul: a retrospective study of histopathologically diagnosed cases. Brazilian Journal of Microbiology, 45(1), 243-247.

55. Ripley, B. D. (1979). Tests of'randomness' for spatial point patterns. Journal of the Royal Statistical Society: Series B (Methodological), 41(3), 368-374.

119

56. Haase, P. (1995). Spatial pattern analysis in ecology based on Ripley's K‐function: Introduction and methods of edge correction. Journal of vegetation science, 6(4), 575-582.

57. Getis, A., & Ord, J. K. (2010). The analysis of spatial association by use of distance statistics. In Perspectives on Spatial Data Analysis (pp. 127-145). Springer, Berlin, Heidelberg.

58. Cuzick, J., & Edwards, R. (1996). Methods for investigating localized clustering of disease. Cuzick-Edwards one-sample and inverse two-sampling statistics. IARC scientific publications, (135), 200-202.

59. Silverman, B. W. (2018). Density estimation for statistics and data analysis. Routledge.

60. Aikembayev, A. M., Lukhnova, L., Temiraliyeva, G., Meka-Mechenko, T., Pazylov, Y., Zakaryan, S., ... & Francesconi, S. C. (2010). Historical distribution and molecular diversity of Bacillus anthracis, Kazakhstan. Emerging Infectious Diseases, 16(5), 789.

61. Openshaw, S., Charlton, M., Wymer, C., & Craft, A. (1987). A mark 1 geographical analysis machine for the automated analysis of point data sets. International Journal of Geographical Information System, 1(4), 335-358.

62. Clarke, K. C., McLafferty, S. L., & Tempalski, B. J. (1996). On epidemiology and geographic information systems: a review and discussion of future directions. Emerging infectious diseases, 2(2), 85.

63. Anselin, L. (1995). Local indicators of spatial association—LISA. Geographical analysis, 27(2), 93-115.

64. Hassarangsee, S., Tripathi, N., & Souris, M. (2015). Spatial pattern detection of tuberculosis: a case study of Si Sa Ket Province, Thailand. International journal of environmental research and public health, 12(12), 16005-16018.

65. Szonyi, B., Srinath, I., Esteve-Gassent, M., Lupiani, B., & Ivanek, R. (2015). Exploratory spatial analysis of Lyme disease in Texas–what can we learn from the reported cases?. BMC public health, 15(1), 924.

66. Getis, A., & Ord, J. K. (1996). Local spatial statistics: an overview. Spatial analysis: modelling in a GIS environment, 374, 261-277.

67. Sun, W., Xue, L., & Xie, X. (2017). Spatial-temporal distribution of dengue and climate characteristics for two clusters in Sri Lanka from 2012 to 2016. Scientific reports, 7(1), 12884.

68. Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics-Theory and methods, 26(6), 1481-1496.

120

69. Kulldorff, M., & Nagarwalla, N. (1995). Spatial disease clusters: detection and inference. Statistics in medicine, 14(8), 799-810.

70. Ward, M. P. (2001). Blowfly strike in sheep flocks as an example of the use of a time–space scan statistic to control confounding. Preventive Veterinary Medicine, 49(1-2), 61-69.

71. Knox, E. G., & Bartlett, M. S. (1964). The detection of space-time interactions. Journal of the Royal Statistical Society. Series C (Applied Statistics), 13(1), 25-30.

72. Kulldorff, M., Heffernan, R., Hartman, J., Assunçao, R., & Mostashari, F. (2005). A space–time permutation scan statistic for disease outbreak detection. PLoS medicine, 2(3), e59.

73. Robertson, C., & Nelson, T. A. (2010). Review of software for space-time disease surveillance. International journal of health geographics, 9(1), 16.

74. Fauci, A. S., Touchette, N. A., & Folkers, G. K. (2005). Emerging infectious diseases: a 10-year perspective from the National Institute of Allergy and Infectious Diseases. International Journal of Risk & Safety in Medicine, 17(3, 4), 157-167.

75. Fauci, A. S. (2001). Infectious diseases: considerations for the 21st century. Clinical Infectious Diseases, 32(5), 675-685.

76. Eisenberg, J. N., Cevallos, W., Ponce, K., Levy, K., Bates, S. J., Scott, J. C., ... & Trueba, G. (2006). Environmental change and infectious disease: how new roads affect the transmission of diarrheal pathogens in rural Ecuador. Proceedings of the National Academy of Sciences, 103(51), 19460-19465.

77. Mollalo, A., Mao, L., Rashidi, P., & Glass, G. E. (2019). A GIS-Based Artificial Neural Network Model for Spatial Distribution of Tuberculosis across the Continental United States. International journal of environmental research and public health, 16(1), 157.

78. Kovats, R. S., Campbell-Lendrum, D. H., McMichel, A. J., Woodward, A., & Cox, J. S. H. (2001). Early effects of climate change: do they include changes in vector-borne disease?. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 356(1411), 1057-1068.

79. Cudmore, T. J., Björklund, N., Carroll, A. L., & Lindgren, B. S. (2010). Climate change and range expansion of an aggressive bark beetle: evidence of higher beetle reproduction in naïve host tree populations. Journal of Applied Ecology, 47(5), 1036-1043.

121

80. Ogden, N. H., & Lindsay, L. R. (2016). Effects of climate and climate change on vectors and vector-borne diseases: ticks are different. Trends in parasitology, 32(8), 646-656.

81. Vail, S. G., & Smith, G. (1998). Air temperature and relative humidity effects on behavioral activity of blacklegged tick (Acari: Ixodidae) nymphs in New Jersey. Journal of medical entomology, 35(6), 1025-1028.

82. Lima, A., Lovin, D. D., Hickner, P. V., & Severson, D. W. (2016). Evidence for an overwintering population of Aedes aegypti in Capitol Hill neighborhood, Washington, DC. The American journal of tropical medicine and hygiene, 94(1), 231-235.

83. Jerrett, M., Gale, S., & Kontgis, C. (2010). Spatial modeling in environmental and public health research. International journal of environmental research and public health, 7(4), 1302-1329.

84. Takahashi, K., Kulldorff, M., Tango, T., & Yih, K. (2008). A flexibly shaped space-time scan statistic for disease outbreak detection and monitoring. International Journal of Health Geographics, 7(1), 14.

85. Stevens, K. B., & Pfeiffer, D. U. (2011). Spatial modelling of disease using data-and knowledge-driven approaches. Spatial and spatio-temporal epidemiology, 2(3), 125-133.

86. Pfeiffer, D., Robinson, T. P., Stevenson, M., Stevens, K. B., Rogers, D. J., & Clements, A. C. (2008). Spatial analysis in epidemiology (Vol. 142, No. 10.1093). Oxford: Oxford University Press.

87. Ishizaka, A., & Nemery, P. (2013). Multi-criteria decision analysis: methods and software. John Wiley & Sons.

88. Saaty, T. L. (2008). Decision making with the analytic hierarchy process. International journal of services sciences, 1(1), 83-98.

89. Chang, D. Y. (1996). Applications of the extent analysis method on fuzzy AHP. European journal of operational research, 95(3), 649-655.

90. Clements, A. C., Pfeiffer, D. U., & Martin, V. (2006). Application of knowledge-driven spatial modelling approaches and uncertainty management to a study of Rift Valley fever in Africa. International Journal of Health Geographics, 5(1), 57.

91. Mollalo, A., & Khodabandehloo, E. (2016). Zoonotic cutaneous leishmaniasis in northeastern Iran: A GIS-based spatio-temporal multi-criteria decision-making approach. Epidemiology & Infection, 144(10), 2217-2229.

122

92. Rakotomanana, F., Randremanana, R. V., Rabarijaona, L. P., Duchemin, J. B., Ratovonjato, J., Ariey, F., ... & Jeanne, I. (2007). Determining areas that require indoor insecticide spraying using Multi Criteria Evaluation, a decision-support tool for malaria vector control programmes in the Central Highlands of Madagascar. International Journal of Health Geographics, 6(1), 2.

93. Loh, W. Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 14-23.

94. Franklin, J. (2003). Clustering versus regression trees for determining ecological land units in the southern California mountains and foothills. Forest Science, 49(3), 354-368.

95. Lewis, R. J. (2000). An introduction to classification and regression tree (CART) analysis. In Annual meeting of the society for academic emergency medicine in San Francisco, California (Vol. 14).

96. Rodriguez-Galiano, V. F., Ghimire, B., Rogan, J., Chica-Olmo, M., & Rigol-Sanchez, J. P. (2012). An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS Journal of Photogrammetry and Remote Sensing, 67, 93-104.

97. Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., & Feuston, B. P. (2003). Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of chemical information and computer sciences, 43(6), 1947-1958.

98. Noble, W. S. (2006). What is a support vector machine?. Nature biotechnology, 24(12), 1565.

99. Gomes, V. H., IJff, S. D., Raes, N., Amaral, I. L., Salomão, R. P., Coelho, L. S., ... & Guevara, J. E. (2018). Species Distribution Modelling: Contrasting presence-only models with plot abundance data. Scientific reports, 8(1), 1003.

100. Grinnell, J. (1917). The niche-relationships of the California Thrasher. Auk, 34(4), 427-433.

101. Hirzel, A. H., Hausser, J., Chessel, D., & Perrin, N. (2002). Ecological‐niche factor analysis: how to compute habitat‐suitability maps without absence data?. Ecology, 83(7), 2027-2036.

102. Ayala, D., Costantini, C., Ose, K., Kamdem, G. C., Antonio-Nkondjio, C., Agbor, J. P., ... & Simard, F. (2009). Habitat suitability and ecological niche profile of major malaria vectors in Cameroon. Malaria Journal, 8(1), 307.

103. Stockwell, D. R., & Noble, I. R. (1992). Induction of sets of rules from animal distribution data: a robust and informative method of data analysis. Mathematics and computers in simulation, 33(5-6), 385-390.

123

104. Stockwell, D. (1999). The GARP modelling system: problems and solutions to automated spatial prediction. International journal of geographical information science, 13(2), 143-158.

105. Phillips, S. J., & Dudík, M. (2008). Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography, 31(2), 161-175.

106. Chamaillé, L., Tran, A., Meunier, A., Bourdoiseau, G., Ready, P., & Dedet, J. P. (2010). Environmental risk mapping of canine leishmaniasis in France. Parasites & vectors, 3(1), 31.

107. Larson, S. R., Degroot, J. P., Bartholomay, L. C., & Sugumaran, R. (2010). Ecological niche modeling of potential West Nile virus vector mosquito species in Iowa. Journal of Insect Science, 10(1), 110.

108. Phillips, S. J., Anderson, R. P., & Schapire, R. E. (2006). Maximum entropy modeling of species geographic distributions. Ecological modelling, 190(3-4), 231-259.

109. Sofizadeh, A., Rassi, Y., Vatandoost, H., Hanafi-Bojd, A. A., Mollalo, A., Rafizadeh, S., & Akhavan, A. A. (2016). Predicting the distribution of phlebotomus papatasi (diptera: psychodidae), the primary vector of zoonotic cutaneous leishmaniasis, in golestan province of Iran Using ecological niche modeling: comparison of MaxEnt and GARP models. Journal of medical entomology, 54(2), 312-320.

110. Hasyim, H., Nursafingi, A., Haque, U., Montag, D., Groneberg, D. A., Dhimal, M., ... & Müller, R. (2018). Spatial modelling of malaria cases associated with environmental factors in South Sumatra, Indonesia. Malaria journal, 17(1), 87.

111. Franke, J., Gebreslasie, M., Bauwens, I., Deleu, J., & Siegert, F. (2015). Earth observation in support of malaria control and epidemiology: MALAREO monitoring approaches. Geospatial health, (1), 117-131.

112. Nakhapakorn, K., & Tripathi, N. K. (2005). An information value based analysis of physical and climatic factors affecting dengue fever and dengue haemorrhagic fever incidence. International journal of health geographics, 4(1), 13.

113. Palaniyandi, M., Anand, P., & Maniyosai, R. (2014). Spatial cognition: a geospatial analysis of vector borne disease transmission and the environment, using remote sensing and GIS. Information Systems, 1, 52.

114. Mollalo, A., Blackburn, J. K., Morris, L. R., & Glass, G. E. (2017). A 24-year exploratory spatial data analysis of Lyme disease incidence rate in Connecticut, USA. Geospatial health, 12(2), 588.

124

115. Li, Y. P., Fang, L. Q., Gao, S. Q., Wang, Z., Gao, H. W., Liu, P., ... & Xu, B. (2013). Decision support system for the response to infectious disease emergencies based on WebGIS and mobile services in China. PLoS One, 8(1), e54842.

116. Lobitz, B., Beck, L., Huq, A., Wood, B., Fuchs, G., Faruque, A. S. G., & Colwell, R. (2000). Climate and infectious disease: use of remote sensing for detection of Vibrio cholerae by indirect measurement. Proceedings of the National Academy of Sciences, 97(4), 1438-1443.

117. Franke, J., & Menz, G. (2007). Multi-temporal wheat disease detection by multi-spectral remote sensing. Precision Agriculture, 8(3), 161-172.

118. Hugh-Jones, M., Barré, N., Nelson, G., Wehnes, K., Warner, J., Garvin, J., & Garris, G. (1992). Landsat-TM identification of Amblyomma variegatum (Acari: Ixodidae) habitats in Guadeloupe. Remote Sensing of Environment, 40(1), 43-55.

119. Coburn, B. J., & Blower, S. (2013). Mapping HIV epidemics in sub-Saharan Africa with use of GPS data. The lancet global health, 1(5), e251-e253.

120. Paz-Soldan, V., Stoddard, S., Vazquez-Prokopec, G., Morrison, A., Elder, J., Kitron, U., ... & Scott, T. (2010). Assessing and maximizing the acceptability of GPS device use for studying the role of human movement in dengue virus transmission in Iquitos, Peru. Am J Trop Med Hyg, 82(4), 723-730.

121. Masthi, N. R., Madhusudan, M., & Puthussery, Y. P. (2015). Global positioning system & Google Earth in the investigation of an outbreak of cholera in a village of Bengaluru Urban district, Karnataka. The Indian journal of medical research, 142(5), 533.

122. Garcia-Martí, I., Zurita-Milla, R., Van Vliet, A. J., & Takken, W. (2017). Modelling and mapping tick dynamics using volunteered observations. International journal of health geographics, 16(1), 41.

123. Ostfeld, R. S., Canham, C. D., Oggenfuss, K., Winchcombe, R. J., & Keesing, F. (2006). Climate, deer, rodents, and acorns as determinants of variation in Lyme-disease risk. PLoS biology, 4(6), e145.

124. Pepin, K. M., Eisen, R. J., Mead, P. S., Piesman, J., Fish, D., Hoen, A. G., ... & Diuk-Wasser, M. A. (2012). Geographic variation in the relationship between human Lyme disease incidence and density of infected host-seeking Ixodes scapularis nymphs in the Eastern United States. The American journal of tropical medicine and hygiene, 86(6), 1062-1071.

125. Ramezankhani, R., Sajjadi, N., Jozi, S. A., & Shirzadi, M. R. (2018). Climate and environmental factors affecting the incidence of cutaneous leishmaniasis in Isfahan, Iran. Environmental Science and Pollution Research, 25(12), 11516-11526.

125

126. Nieto, P., Malone, J. B., & Bavia, M. E. (2006). Ecological niche modeling forvisceral leishmaniasis in the state of Bahia, Brazil, using genetic algorithm for rule-set prediction and growing degree day-water budget analysis. Geospatialhealth, 1(1), 115-126.

127. Paixão Sevá, A., Mao, L., Galvis-Ovallos, F., Lima, J. M. T., & Valle, D. (2017).Risk analysis and prediction of visceral leishmaniasis dispersion in São PauloState, Brazil. PLoS neglected tropical diseases, 11(2), e0005353.

128. Yamamura, M., Freitas, I. M. D., Santo Neto, M., Chiaravalloti Neto, F., Popolin,M. A. P., Arroyo, L. H., ... & Arcêncio, R. A. (2016). Spatial analysis of avoidablehospitalizations due to tuberculosis in Ribeirao Preto, SP, Brazil (2006-2012). Revista de saude publica, 50, 20.

129. Patterson, B., Morrow, C. D., Kohls, D., Deignan, C., Ginsburg, S., & Wood, R.(2017). Mapping sites of high TB transmission risk: integrating the shared air andsocial behaviour of TB cases and adolescents in a South Africantownship. Science of the Total Environment, 583, 97-103.

130. Oliver, J. H., Lin, T., Gao, L., Clark, K. L., Banks, C. W., Durden, L. A., ... &Chandler, F. W. (2003). An enzootic transmission cycle of Lyme borreliosisspirochetes in the southeastern United States. Proceedings of the NationalAcademy of Sciences, 100(20), 11642-11645.

131. Anderson, J. F. (1989). Ecology of Lyme disease. Connecticut medicine, 53(6),343-346.

132. Radolf, J. D., Caimano, M. J., Stevenson, B., & Hu, L. T. (2012). Of ticks, miceand men: understanding the dual-host lifestyle of Lyme diseasespirochaetes. Nature reviews microbiology, 10(2), 87.

133. Centers for Disease Control and Prevention. Lyme Disease. Available at:http://www.cdc.gov/lyme/signs_symptoms. Accessed March 20, 2016.

134. Kugeler, K. J., Farley, G. M., Forrester, J. D., & Mead, P. S. (2015). Geographicdistribution and expansion of human Lyme Disease, United States. Emerginginfectious diseases, 21(8), 1455-1457.

135. Li, J., Kolivras, K. N., Hong, Y., Duan, Y., Seukep, S. E., Prisley, S. P., ... &Gaines, D. N. (2014). Spatial and temporal emergence pattern of Lyme disease inVirginia. The American journal of tropical medicine and hygiene, 91(6), 1166-1172.

136. Steere, A. C., Hardin, J. A., & Malawista, S. E. (1977). Erythema chronicummigrans and Lyme arthritis: cryoimmunoglobulins and clinical activity of skin andjoints. Science, 196(4294), 1121-1122.

http://www.cdc.gov/lyme/signs_symptoms

126

137. Cromley, E. K., Cartter, M. L., Mrozinski, R. D., & Ertel, S. H. (1998). Residentialsetting as a risk factor for Lyme disease in a hyperendemic region. Americanjournal of epidemiology, 147(5), 472-477.

138. The Connecticut Department of Public Health (CTDPH). Available at:http://www.ct.gov/dph/cwp/view.asp?a=3136&q=399694. Accessed February 10,2016.

139. Centers for Disease Control and Prevention. Lyme Disease (Borreliaburgdorferi): 1996 Case Definition. Available at:https://wwwn.cdc.gov/nndss/conditions/lyme-disease/case-definition/1996/.Accessed March 20, 2016.



142. Moran, P. A. (1950). Notes on continuous stochasticphenomena. Biometrika, 37(1/2), 17-23.

143. Anselin, L. (2005). Exploring spatial data with GeoDaTM: a workbook. Center forspatially integrated social science.

144. O'sullivan, D., & Unwin, D. (2014). Geographic information analysis. John Wiley& Sons.

145. Mitchell, A., & Minami, M. (1999). The ESRI guide to GIS analysis: geographicpatterns & relationships (Vol. 1). ESRI, Inc..

146. Clayton, D., & Kaldor, J. (1987). Empirical Bayes estimates of age-standardizedrelative risks for use in disease mapping. Biometrics, 671-681.

147. Kulldorff, M. (2010). SaTScan user guide for version 9.0.

148. Fielding, A. H., & Bell, J. F. (1997). A review of methods for the assessment ofprediction errors in conservation presence/absence models. Environmentalconservation, 24(1), 38-49.

149. Birnbaum, D., Jacquez, G. M., Waller, L. A., Grimson, R., & Wartenberg, D.(1996). The analysis of disease clusters, part I: state of the art. Infection Control &Hospital Epidemiology, 17(5), 319-327.

http://www.ct.gov/dph/cwp/view.asp?a=3136&q=399694

https://wwwn.cdc.gov/nndss/conditions/lyme-disease/case-definition/1996/



127

150. Connally, N. P., Ginsberg, H. S., & Mather, T. N. (2006). Assessing peridomestic entomological factors as predictors for Lyme disease. Journal of Vector Ecology, 31(2), 364-371.

151. Mather, T. N., Nicholson, M. C., Donnelly, E. F., & Matyas, B. T. (1996). Entomologic index for human risk of Lyme disease. American Journal of Epidemiology, 144(11), 1066-1069.

152. Stafford, K. C., Cartter, M. L., Magnarelli, L. A., Ertel, S. H., & Mshar, P. A. (1998). Temporal correlations between tick abundance and prevalence of ticks infected with Borrelia burgdorferi and increasing incidence of Lyme disease. Journal of clinical microbiology, 36(5), 1240-1244.

153. Ogden, N. H., St-Onge, L., Barker, I. K., Brazeau, S., Bigras-Poulin, M., Charron, D. F., ... & Michel, P. (2008). Risk maps for range expansion of the Lyme disease vector, Ixodes scapularis, in Canada now and with climate change. International journal of health geographics, 7(1), 24.

154. Xue, L., Scoglio, C., McVey, D. S., Boone, R., & Cohnstaedt, L. W. (2015). Two introductions of Lyme disease into Connecticut: a geospatial analysis of human cases from 1984 to 2012. Vector-Borne and Zoonotic Diseases, 15(9), 523-528.

155. Root, E. D., Meyer, R. E., & Emch, M. E. (2009). Evidence of localized clustering of gastroschisis births in North Carolina, 1999–2004. Social science & medicine, 68(8), 1361-1367.

156. Barro, A. S., Kracalik, I. T., Malania, L., Tsertsvadze, N., Manvelyan, J., Imnadze, P., & Blackburn, J. K. (2015). Identifying hotspots of human anthrax transmission using three local clustering techniques. Applied Geography, 60, 29-36.

157. Steere, A. C., Malawista, S. E., Snydman, D. R., Shope, R. E., Andiman, W. A., Ross, M. R., & Steele, F. M. (1977). An epidemic of oligoarticular arthritis in children and adults in three Connecticut communities. Arthritis & Rheumatism: Official Journal of the American College of Rheumatology, 20(1), 7-17.

158. Mollalo, A., Alimohammadi, A., Shirzadi, M. R., & Malek, M. R. (2015). Geographic information system‐based analysis of the spatial and spatio‐temporal distribution of zoonotic cutaneous leishmaniasis in Golestan Province, north‐east of Iran. Zoonoses and public health, 62(1), 18-28.

159. Shirzadi, M. R., Mollalo, A., & Yaghoobi-Ershadi, M. R. (2015). Dynamic relations between incidence of zoonotic cutaneous leishmaniasis and climatic factors in Golestan Province, Iran. Journal of arthropod-borne diseases, 9(2), 148.

160. Ramezankhani, R., Hosseini, A., Sajjadi, N., Khoshabi, M., & Ramezankhani, A. (2017). Environmental risk factors for the incidence of cutaneous leishmaniasis in an endemic area of Iran: A GIS-based approach. Spatial and spatio-temporal epidemiology, 21, 57-66.

128

161. Hanafi-Bojd, A. A., Khoobdel, M., Soleimani-Ahmadi, M., Azizi, K., Aghaei Afshar, A., Jaberhashemi, S. A., ... & Safari, R. (2017). Species composition of sand flies (Diptera: Psychodidae) and modeling the spatial distribution of main vectors of cutaneous leishmaniasis in Hormozgan Province, Southern Iran. Journal of medical entomology, 55(2), 292-299.

162. Afshar, A. A., Rassi, Y. A. V. A. R., Sharifi, I., Abai, M. R., Oshaghi, M. A., Yaghoobi-Ershadi, M. R., & Vatandoost, H. (2011). Susceptibility status of Phlebotomus papatasi and P. sergenti (Diptera: Psychodidae) to DDT and deltamethrin in a focus of cutaneous leishmaniasis after earthquake strike in Bam, Iran. Iranian journal of arthropod-borne diseases, 5(2), 32.

163. Postigo, J. A. R. (2010). Leishmaniasis in the world health organization eastern mediterranean region. International journal of antimicrobial agents, 36, S62-S65.

164. Kampichler, C., Wieland, R., Calmé, S., Weissenberger, H., & Arriaga-Weiss, S. (2010). Classification in conservation biology: a comparison of five machine-learning methods. Ecological Informatics, 5(6), 441-450.

165. Huang, G. B., Zhu, Q. Y., & Siew, C. K. (2006). Extreme learning machine: theory and applications. Neurocomputing, 70(1-3), 489-501.

166. Bagheri, M., Bazvand, A., & Ehteshami, M. (2017). Application of artificial intelligence for the management of landfill leachate penetration into groundwater, and assessment of its environmental impacts. Journal of Cleaner Production, 149, 784-796.

167. Peterson, A. T., & Shaw, J. (2003). Lutzomyia vectors for cutaneous leishmaniasis in Southern Brazil: ecological niche models, predicted geographic distributions, and climate change effects. International journal for parasitology, 33(9), 919-931.

168. Samy, A. M., Annajar, B. B., Dokhan, M. R., Boussaa, S., & Peterson, A. T. (2016). Coarse-resolution ecology of etiological agent, vector, and reservoirs of zoonotic cutaneous leishmaniasis in Libya. PLoS neglected tropical diseases, 10(2), e0004381.

169. Rajabi, M., Mansourian, A., Pilesjö, P., & Bazmani, A. (2014). Environmental modelling of visceral leishmaniasis by susceptibility-mapping using neural networks: A case study in north-western Iran. Geospatial health, 179-191.

170. Kapwata, T., & Gebreslasie, M. T. (2016). Random forest variable selection in spatial malaria transmission modelling in Mpumalanga Province, South Africa. Geospatial health.

171. Gomes, A. L. V., Wee, L. J., Khan, A. M., Gil, L. H., Marques Jr, E. T., Calzavara-Silva, C. E., & Tan, T. W. (2010). Classification of dengue fever patients based on gene expression data using support vector machines. PloS one, 5(6), e11267.

129

172. Armitage, D. W., & Ober, H. K. (2010). A comparison of supervised learning techniques in the classification of bat echolocation calls. Ecological Informatics, 5(6), 465-473.

173. Hijmans, R. J., Cameron, S. E., Parra, J. L., Jones, P. G., & Jarvis, A. (2005). Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology: A Journal of the Royal Meteorological Society, 25(15), 1965-1978.

174. Janalipour, M., & Taleai, M. (2017). Building change detection after earthquake using multi-criteria decision analysis based on extracted information from high spatial resolution satellite images. International journal of remote sensing, 38(1), 82-99.

175. Freedman, D. A. (2009). Statistical models: theory and practice. cambridge university press.

176. Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE transactions on neural networks, 10(5), 988-999.

177. Bishop, C. M., & Tipping, M. E. (2000). Variational relevance vector machines. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence (pp. 46-53). Morgan Kaufmann Publishers Inc..

178. Tuv, E., Borisov, A., Runger, G., & Torkkola, K. (2009). Feature selection with ensembles, artificial variables, and redundancy elimination. Journal of Machine Learning Research, 10(Jul), 1341-1366.

179. Park, M. Y., & Hastie, T. (2007). L1‐regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4), 659-677.

180. Bagheri, M., Mirbagheri, S. A., Kamarkhani, A. M., & Bagheri, Z. (2016). Modeling of effluent quality parameters in a submerged membrane bioreactor with simultaneous upward and downward aeration treating municipal wastewater using hybrid models. Desalination and Water Treatment, 57(18), 8068-8089.

181. Altman, D. G., & Bland, J. M. (1994). Diagnostic tests. 1: Sensitivity and specificity. BMJ: British Medical Journal, 308(6943), 1552.

182. Diel, R., Loddenkemper, R., Niemann, S., Meywald-Walter, K., & Nienhaus, A. (2011). Negative and positive predictive value of a whole-blood interferon-γ release assay for developing active tuberculosis: an update. American journal of respiratory and critical care medicine, 183(1), 88-95.

183. Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29-36.

130

184. Landis, J. R., & Koch, G. G. (1977). An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics, 363-374.

185. Aparicio, C., & Bitencourt, M. D. (2004). Spatial and temporal analysis of cutaneous leishmaniasis incidence in São Paulo–Brazil. In XXth International Society for Photogrammetry and Remote Sensing Congress (pp. 12-23).

186. Orshan, L., Elbaz, S., Ben-Ari, Y., Akad, F., Afik, O., Ben-Avi, I., ... & Zonstein, I. (2016). Distribution and dispersal of Phlebotomus papatasi (Diptera: Psychodidae) in a zoonotic cutaneous leishmaniasis focus, the Northern Negev, Israel. PLoS neglected tropical diseases, 10(7), e0004819.

187. Ding, F., Fu, J., Jiang, D., Hao, M., & Lin, G. (2018). Mapping the spatial distribution of Aedes aegypti and Aedes albopictus. Acta tropica, 178, 155-162.

188. Williams, J. N., Seo, C., Thorne, J., Nelson, J. K., Erwin, S., O’Brien, J. M., & Schwartz, M. W. (2009). Using species distribution models to predict new occurrences for rare plants. Diversity and Distributions, 15(4), 565-576.

189. Carvalho, B. M., Rangel, E. F., Ready, P. D., & Vale, M. M. (2015). Ecological niche modelling predicts southward expansion of Lutzomyia (Nyssomyia) flaviscutellata (Diptera: Psychodidae: Phlebotominae), vector of Leishmania (Leishmania) amazonensis in South America, under climate change. PLoS One, 10(11), e0143282.

190. Wang, C. T., Long, R. J., Wang, Q. J., Ding, L. M., & Wang, M. P. (2007). Effects of altitude on plant-species diversity and productivity in an alpine meadow, Qinghai–Tibetan plateau. Australian Journal of Botany, 55(2), 110-117.

191. Mollalo, A., Alimohammadi, A., Shahrisvand, M., Shirzadi, M. R., & Malek, M. R. (2014). Spatial and statistical analyses of the relations between vegetation cover and incidence of cutaneous leishmaniasis in an endemic province, northeast of Iran. Asian Pacific journal of tropical disease, 4(3), 176-180.

192. Kustas, W. P., & Norman, J. M. (1996). Use of remote sensing for evapotranspiration monitoring over land surfaces. Hydrological Sciences Journal, 41(4), 495-516.

193. Boussaa, S., Kahime, K., Samy, A. M., Salem, A. B., & Boumezzough, A. (2016). Species composition of sand flies and bionomics of Phlebotomus papatasi and P. sergenti (Diptera: Psychodidae) in cutaneous leishmaniasis endemic foci, Morocco. Parasites & vectors, 9(1), 60.

194. Cross, E. R., Newcomb, W. W., & Tucker, C. J. (1996). Use of weather data and remote sensing to predict the geographic and seasonal distribution of Phlebotomus papatasi in southwest Asia. The American journal of tropical medicine and hygiene, 54(5), 530-536.

131

195. Medlock, J. M., Hansford, K. M., Van Bortel, W., Zeller, H., & Alten, B. (2014). A summary of the evidence for the change in European distribution of phlebotomine sand flies (Diptera: Psychodidae) of public health importance. Journal of Vector Ecology, 39(1), 72-77.

196. Simsek, F. M., Alten, B., Caglar, S. S., Ozbel, Y., Aytekin, A. M., Kaynas, S., ... & Rastgeldi, S. (2007). Distribution and altitudinal structuring of phlebotomine sand flies (Diptera: Psychodidae) in southern Anatolia, Turkey: their relation to human cutaneous leishmaniasis. Journal of Vector Ecology, 32(2), 269-280.

197. Kasap, O. E., & Alten, B. (2006). Comparative demography of the sand fly Phlebotomus papatasi (Diptera: Psychodidae) at constant temperatures. Journal of Vector Ecology, 31(2), 378-386.

198. Lalvani, A., Pathan, A. A., Durkan, H., Wilkinson, K. A., Whelan, A., Deeks, J. J., ... & Hill, A. V. (2001). Enhanced contact tracing and spatial tracking of Mycobacterium tuberculosis infection by enumeration of antigen-specific T cells. The Lancet, 357(9273), 2017-2021.

199. Sreeramareddy, C. T., Kumar, H. H., & Arokiasamy, J. T. (2013). Prevalence of self-reported tuberculosis, knowledge about tuberculosis transmission and its determinants among adults in India: results from a nation-wide cross-sectional household survey. BMC infectious diseases, 13(1), 16.

200. Smith, I. (2003). Mycobacterium tuberculosis pathogenesis and molecular determinants of virulence. Clinical microbiology reviews, 16(3), 463-496.

201. Whalen, C., Horsburgh, C. R., Hom, D., Lahart, C., Simberkoff, M., & Ellner, J. (1995). Accelerated course of human immunodeficiency virus infection after tuberculosis. American journal of respiratory and critical care medicine, 151(1), 129-135.

202. World Health Organization. Global Tuberculosis Report 2017. Available online: www.who.int/tb/publications/global_report/en/ (accessed on 12 July 2017).

203. Thomas, T. Y., & Rajagopalan, S. (2001). Tuberculosis and aging: a global health problem. Clinical infectious diseases, 33(7), 1034-1039.

204. Figueroa-Munoz, J. I., & Ramon-Pardo, P. (2008). Tuberculosis control in vulnerable groups. Bulletin of the World Health Organization, 86, 733-735.

205. Schmit, K. M., Wansaula, Z., Pratt, R., Price, S. F., & Langer, A. J. (2017). Tuberculosis—United States, 2016. MMWR. Morbidity and mortality weekly report, 66(11), 289.

206. Hill, A. N., Becerra, J. E., & Castro, K. G. (2012). Modelling tuberculosis trends in the USA. Epidemiology & Infection, 140(10), 1862-1872.

www.who.int/tb/publications/global_report/en/

132

207. Sun, W., Gong, J., Zhou, J., Zhao, Y., Tan, J., Ibrahim, A., & Zhou, Y. (2015). A spatial, social and environmental study of tuberculosis in China using statistical and GIS technology. International journal of environmental research and public health, 12(2), 1425-1448.

208. Cantwell, M. F., McKENNA, M. T., McCRAY, E. E., & Onorato, I. M. (1998). Tuberculosis and race/ethnicity in the United States: impact of socioeconomic status. American Journal of Respiratory and Critical Care Medicine, 157(4), 1016-1020.

209. Wubuli, A., Xue, F., Jiang, D., Yao, X., Upur, H., & Wushouer, Q. (2015). Socio-demographic predictors and distribution of pulmonary tuberculosis (TB) in Xinjiang, China: A spatial analysis. PloS one, 10(12), e0144010.

210. Mahara, G., Yang, K., Chen, S., Wang, W., & Guo, X. (2018). Socio-Economic Predictors and Distribution of Tuberculosis Incidence in Beijing, China: A Study Using a Combination of Spatial Statistics and GIS Technology. Medical Sciences, 6(2), 26.

211. Harling, G., & Castro, M. C. (2014). A spatial analysis of social and economic determinants of tuberculosis in Brazil. Health & place, 25, 56-67.

212. Krieger, N., Chen, J. T., Waterman, P. D., Rehkopf, D. H., & Subramanian, S. V. (2003). Race/ethnicity, gender, and monitoring socioeconomic gradients in health: a comparison of area-based socioeconomic measures—the public health disparities geocoding project. American journal of public health, 93(10), 1655-1671.

213. Kistemann, T., Munzinger, A., & Dangendorf, F. (2002). Spatial patterns of tuberculosis incidence in Cologne (Germany). Social science & medicine, 55(1), 7-19.

214. Jia, Z. W., Jia, X. W., Liu, Y. X., Dye, C., Chen, F., Chen, C. S., ... & Liu, H. L. (2008). Spatial analysis of tuberculosis cases in migrants and permanent residents, Beijing, 2000–2006. Emerging infectious diseases, 14(9), 1413.

215. Hawker, J. I., Bakhshi, S. S., Ali, S., & Farrington, C. P. (1999). Ecological analysis of ethnic differences in relation between tuberculosis and poverty. Bmj, 319(7216), 1031-1034.

216. Shaweno, D., Karmakar, M., Alene, K. A., Ragonnet, R., Clements, A. C., Trauer, J. M., ... & McBryde, E. S. (2018). Methods used in the spatial analysis of tuberculosis epidemiology: a systematic review. BMC medicine, 16(1), 193.

217. Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996). Applied linear statistical models (Vol. 4, p. 318). Chicago: Irwin.

133

218. Naghibi, S. A., Pourghasemi, H. R., & Dixon, B. (2016). GIS-based groundwater potential mapping using boosted regression tree, classification and regression tree, and random forest machine learning models in Iran. Environmental monitoring and assessment, 188(1), 44.

219. Sadeghian, A., Lim, D., Karlsson, J., & Li, J. (2015). Automatic target recognition using discrimination based on optimal transport. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2604-2608). IEEE.

220. Vahedi, B., Kuhn, W., & Ballatore, A. (2016). Question-based spatial computing—A case study. In Geospatial Data in a Changing World (pp. 37-50). Springer, Cham.

221. Hatami, M., & Ameri Siahooei, E. (2013). Examines criteria applicable in the optimal location new cities, with approach for sustainable urban development. Middle-East Journal of Scientific Research, 14(5), 734-743.

222. Sadeghian, A., Sundaram, L., Wang, D., Hamilton, W., Branting, K., & Pfeifer, C. (2016). Semantic edge labeling over legal citation graphs. In Proceedings of the workshop on legal text, document, and corpus analytics (LTDCA-2016) (pp. 70-75).

223. Janalipour, M., & Mohammadzadeh, A. (2017). A fuzzy-ga based decision making system for detecting damaged buildings from high-spatial resolution optical images. Remote Sensing, 9(4), 349.

224. Shafieardekani, M., & Hatami, M. (2013). Forecasting Land Use Change in suburb by using Time series and Spatial Approach; Evidence from Intermediate Cities of Iran. European Journal of Scientific Research, 116(2), 199-208.

225. Mollalo, A., Alimohammadi, A., & Khoshabi, M. (2014). Spatial and spatio-temporal analysis of human brucellosis in Iran. Transactions of The Royal Society of Tropical Medicine and Hygiene, 108(11), 721-728.

226. Sordo, M. (2002). Introduction to neural networks in healthcare. Open Clinical knowledge management for medical care.

227. Aburas, H. M., Cetiner, B. G., & Sari, M. (2010). Dengue confirmed-cases prediction: A neural network model. Expert Systems with Applications, 37(6), 4256-4260.

228. Laureano-Rosario, A., Duncan, A., Mendez-Lazaro, P., Garcia-Rejon, J., Gomez-Carro, S., Farfan-Ale, J., ... & Muller-Karger, F. (2018). Application of artificial neural networks for dengue fever outbreak predictions in the northwest coast of Yucatan, Mexico and San Juan, Puerto Rico. Tropical medicine and infectious disease, 3(1), 5.

134

229. Xue, H., Bai, Y., Hu, H., & Liang, H. (2017). Influenza activity surveillance basedon multiple regression model and artificial neural network. IEEE Access, 6, 563-575.

230. Murray, E. J., Marais, B. J., Mans, G., Beyers, N., Ayles, H., Godfrey-Faussett,P., ... & Bond, V. (2009). A multidisciplinary method to map potential tuberculosistransmission ‘hot spots’ in high-burden communities. The international journal oftuberculosis and lung disease, 13(6), 767-774.

231. Mullins, J., Lobato, M. N., Bemis, K., & Sosa, L. (2018). Spatial clusters of latenttuberculous infection, Connecticut, 2010–2014. The International Journal ofTuberculosis and Lung Disease, 22(2), 165-170.

232. Bennett, R. J., Brodine, S., Waalen, J., Moser, K., & Rodwell, T. C. (2014).Prevalence and treatment of latent tuberculosis infection among newly arrivedrefugees in San Diego County, January 2010–October 2012. American journal ofpublic health, 104(4), e95-e102.

233. Feske, M. L., Teeter, L. D., Musser, J. M., & Graviss, E. A. (2011). Including thethird dimension: a spatial analysis of TB cases in Houston HarrisCounty. Tuberculosis, 91, S24-S33.

234. Scales, D., Brownstein, J. S., Khan, K., & Cetron, M. S. (2014). Toward a county-level map of tuberculosis rates in the US. American journal of preventivemedicine, 46(5), e49.

235. HealthMap. Available online: https://healthmap.org/tb (accessed on 25 July2018).

236. American Community Survey (ASC). Available online:https://www.census.gov/programs-surveys/acs/ (accessed on 25 July 2018).

237. Center for Disease Control and Prevention (CDC) wonder. Available online:http://wonder.cdc.gov/ (accessed on 25 July 2018).

238. García-Elorriaga, G., & Del Rey-Pineda, G. (2014). Type 2 diabetes mellitus as arisk factor for tuberculosis. J Mycobac Dis, 4(144), 2161-1068.

239. The National Map. Available online: http://nationalmap.gov/ (accessed on 25 July2018).

240. The US Census Bureau. Available online: https://www.census.gov/ (accessed on25 July 2018).

241. Anselin, L., Syabri, I., & Kho, Y. (2006). GeoDa: an introduction to spatial dataanalysis. Geographical analysis, 38(1), 5-22.

https://www.census.gov/programs-surveys/acs/

http://wonder.cdc.gov/

http://nationalmap.gov/

https://www.census.gov/

https://healthmap.org/tb

135

242. Civco, D. L. (1993). Artificial neural networks for land-cover classification and mapping. International journal of geographical information science, 7(2), 173-186.

243. Woods, K., & Bowyer, K. W. (1997). Generating ROC curves for artificial neural networks. IEEE Transactions on medical imaging, 16(3), 329-337.

244. Huang, G. B. (2003). Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2), 274-281.

245. Yilmaz, I., & Kaynar, O. (2011). Multiple regression, ANN (RBF, MLP) and ANFIS odels for prediction of swell potential of clayey soils. Expert systems with applications, 38(5), 5958-5966.

246. Behrang, M. A., Assareh, E., Ghanbarzadeh, A., & Noghrehabadi, A. R. (2010). The potential of different artificial neural network (ANN) techniques in daily global solar radiation modeling based on meteorological data. Solar Energy, 84(8), 1468-1480.

247. Karlik, B., & Olgac, A. V. (2011). Performance analysis of various activation functions in generalized MLP architectures of neural networks. International Journal of Artificial Intelligence and Expert Systems, 1(4), 111-122.

248. Zadeh, M. R., Amin, S., Khalili, D., & Singh, V. P. (2010). Daily outflow prediction by multi layer perceptron with logistic sigmoid and tangent sigmoid activation functions. Water resources management, 24(11), 2673-2688.

249. Bagheri, M., Mirbagheri, S. A., Ehteshami, M., & Bagheri, Z. (2015). Modeling of a sequencing batch reactor treating municipal wastewater using multi-layer perceptron and radial basis function artificial neural networks. Process Safety and Environmental Protection, 93, 111-123.

250. Gardner, M. W., & Dorling, S. R. (1998). Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric environment, 32(14-15), 2627-2636.

251. Riedmiller, M. (1994). Advanced supervised learning in multi-layer perceptrons—from backpropagation to adaptive learning algorithms. Computer Standards & Interfaces, 16(3), 265-278.

252. Shanker, M., Hu, M. Y., & Hung, M. S. (1996). Effect of data standardization on neural network training. Omega, 24(4), 385-397.

253. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.

254. Anastassiou, G. A. (2011). Multivariate hyperbolic tangent neural network approximation. Computers & Mathematics with Applications, 61(4), 809-821.

136

255. Singh, P., & Deo, M. C. (2007). Suitability of different neural networks in daily flow forecasting. Applied Soft Computing, 7(3), 968-978.

256. Institute of Medicine: Ending Neglect: The Elimination of Tuberculosis in the United States; National Academy Press: Washington, DC, USA, 2000.

257. Georgia Tuberculosis Report (2017). Available online: https://dph.georgia.gov/sites/dph.georgia.gov/files/2016%20GA%20TB%20Report%20FINAL.pdf (accessed on 25 November 2018).

258. Florida Tuberculosis Report (2016). Available online: http://www.floridahealth.gov/diseases-and-conditions/tuberculosis/tb-statistics/index.html (accessed on 25 November 2018).

259. Onozuka, D., & Hagihara, A. (2015). The association of extreme temperatures and the incidence of tuberculosis in Japan. International journal of biometeorology, 59(8), 1107-1114.

260. Mourtzoukou, E. G., & Falagas, M. E. (2007). Exposure to cold and respiratory tract infections. The International Journal of Tuberculosis and Lung Disease, 11(9), 938-943.

261. Khalid, A., Baqai, T., Bukhari, M., & Khan, M. M. (2015). Comparison of the incidence of tuberculosis in different geographical zones in the state of Jammu and Kashmir. Pakistan Journal of Chest Medicine, 19(1).

262. McKenna, M. T., McCray, E., & Onorato, I. (1995). The epidemiology of tuberculosis among foreign-born persons in the United States, 1986 to 1993. New England Journal of Medicine, 332(16), 1071-1076.

263. Ho, M. J. (2004). Sociocultural aspects of tuberculosis: a literature review and a case study of immigrant tuberculosis. Social science & medicine, 59(4), 753-762.

264. Weis, S. E., Moonan, P. K., Pogoda, J. M., Turk, L. E., King, B., Freeman-Thompson, S., & Burgess, G. (2001). Tuberculosis in the foreign-born population of Tarrant county, Texas by immigration status. American journal of respiratory and critical care medicine, 164(6), 953-957.

265. Munch, Z., Van Lill, S. W. P., Booysen, C. N., Zietsman, H. L., Enarson, D. A., & Beyers, N. (2003). Tuberculosis transmission patterns in a high-incidence area: a spatial analysis. International journal of tuberculosis and lung disease, 7(3), 271-277.

266. dos Santos, M. A., Albuquerque, M. F., Ximenes, R. A., Lucena-Silva, N. L., Braga, C., Campelo, A. R., ... & Rodrigues, L. C. (2005). Risk factors for treatment delay in pulmonary tuberculosis in Recife, Brazil. BMC Public Health, 5(1), 25.

https://dph.georgia.gov/sites/dph.georgia.gov/files/2016%20GA%20TB%20Report%20FINAL.pdf

http://www.floridahealth.gov/diseases-and-conditions/tuberculosis/tb-statistics/index.html

137

267. Jakubowiak, W. M., Bogorodskaya, E. M., Borisov, E. S., Danilova, D. I., & Kourbatova, E. K. (2007). Risk factors associated with default among new pulmonary TB patients and social support in six Russian regions. The international journal of tuberculosis and lung disease, 11(1), 46-53.

268. Djibuti, M., Mirvelashvili, E., Makharashvili, N., & Magee, M. J. (2014). Household income and poor treatment outcome among patients with tuberculosis in Georgia: a cohort study. BMC Public Health, 14(1), 88.

269. Bamrah, S., Yelk Woodruff, R. S., Powell, K., Ghosh, S., Kammerer, J. S., & Haddad, M. B. (2013). Tuberculosis among the homeless, United States, 1994–2010. The International Journal of Tuberculosis and Lung Disease, 17(11), 1414-1419.

270. Mapping and analyzing race and ethnicity. Available online: https://www2.census.gov/programs-surveys/sis/activities/geography/hg-2_teacher.pdf (accessed on 25 November 2018).

271. Centers for Disease Control and Prevention Trends in tuberculosis--United States, 2010. MMWR. Morb. Mortal. Wkl. Rep. 2011, 60, 333.

272. Lam, H. K., Ling, S. H., Leung, F. H., & Tam, P. K. S. (2001). Tuning of the structure and parameters of neural network using an improved genetic algorithm. In IECON'01. 27th Annual Conference of the IEEE Industrial Electronics Society (Cat. No. 37243) (Vol. 1, pp. 25-30). IEEE.

273. Hawkins, B. A., Diniz‐Filho, J. A. F., Mauricio Bini, L., De Marco, P., & Blackburn, T. M. (2007). Red herrings revisited: spatial autocorrelation and parameter estimation in geographical ecology. Ecography, 30(3), 375-384.

274. Rolnick, D., Veit, A., Belongie, S., & Shavit, N. (2017). Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694.

https://www2.census.gov/programs-surveys/sis/activities/geography/hg-2_teacher.pdf

138

BIOGRAPHICAL SKETCH

Abolfazl Mollalo started his PhD program in the Department of Geography,

University of Florida in August 2015. He received his Ph.D. from this university in

August 2019. His areas of expertise include medical geography, spatial epidemiology,

GIS and applied machine learning in public health. He has obtained his master degree

in GIS and bachelor degree in surveying engineering from national universities in Iran.

Documents

GIS-BASED, DATA-DRIVEN TECHNIQUES FOR SPATIAL ANALYSIS … · 2019-10-22 · Overview of Training Models ... the applicability of multilayer perceptron (MLP) ANN for predicting the