Geographically Weighted Discriminant
Analysis
Chris Brunsdon,1 Stewart Fotheringham,2 Martin Charlton2
1Department of Geography, University of Leicester, Leicester, U.K. 2National Centre for Geocomputation,
National University of Ireland, Maynooth, U.K.
In this article, we propose a novel analysis technique for geographical data, Geo-
graphically Weighted Discriminant Analysis. This approach adapts the method of
Geographically Weighted Regression (GWR), allowing the modeling and prediction of
categorical response variables. As with GWR, the relationship between predictor and
response variables may alter over space, and calibration is achieved using a moving
kernel window approach. The methodology is outlined and is illustrated with an ex-
ample analysis of voting patterns in the 2005 UK general election. The example shows
that similar social conditions can lead to different voting outcomes in different parts of
England and Wales. Also discussed are techniques for visualizing the results of the
analysis and methods for choosing the extent of the moving kernel window.
Introduction
In this article, an extension to discriminant analysis is proposed, which allows the
discrimination rule to vary over space. The term Geographically Weighted Disc-
riminant Analysis (GWDA) is proposed for this new method. The motivation here is
similar to that for Geographically Weighted Regression (GWR)—in some situations,
relationships between variables are not universal, but dependent on location. A
major distinction between the two techniques is that, while GWR attempts to predict
a measurement or ratio scale variable y given a set of predictors x = {x1, . . ., xm},
GWDA attempts to predict a categorical y variable. This is not the first
article to suggest an application of geographical weighting to categorical data fol-
lowing the suggestions of Fotheringham, Brunsdon, and Charlton (2002). Atkinson
et al. (2003) explore the relationships between riverbank erosion and various
environmental variables using geographically weighted logistic regression, and
Paez (2006) examines geographical variations in land use/transportation relation-
ships using a geographically weighted probit model. There are similarities between
discriminant analysis and logistic regression in that both are used to predict group
membership from a set of predictor variables. The assumptions underlying each
Correspondence: Chris Brunsdon, Department of Geography, University of Leicester, Leicester, UK
e-mail: [email protected]
Geographical Analysis 39 (2007) 376–396 © 2007 The Ohio State University
Geographical Analysis ISSN 0016-7363
technique are rather different. Logistic regression may be the method of choice
when the dependent variable has two groups due to its more relaxed assumptions
(Maddala 1983).
In common with existing discussions of discriminant analysis, it is helpful to
regard the data used here as having been drawn from a number of distinct pop-
ulations, one for each unique category of y. The task of GWDA is then to assess
which population a given, unlabeled {x} is likely to have come from. The difference
between GWDA and standard discriminant analysis is that for GWDA this decision
is made taking the geographical location of {x} into account. The GWDA technique
proposed here exploits the fact that linear and quadratic discriminant analyses
(LDA and QDA) rely only on the mean vector and covariance matrix of {x} for each
population (the former assumes that the covariance is the same for all m popula-
tions). This being the case, the route to localizing LDA and QDA is via geograph-
ically weighted means and covariances, such as those proposed in Brunsdon,
Fotheringham, and Charlton (2002) and Fotheringham, Brunsdon, and Charlton
(2002). In the next section, LDA and QDA are reviewed. Following this, extending
these techniques to forms of GWDA is considered. Finally, an empirical example
using voting data in England and Wales is presented.
A review of discriminant analysis
Discriminant analysis is a technique used to identify which population a certain
observation vector x belongs to, given a list of possible populations {1, . . ., m} and a
training set of observations {xij} where ij indicates the ith observation from population j. The simplest case occurs when m = 2 so that a binary classification has to
be made. In this case, Fisher (1936) has used a decision theoretic approach to show
that an optimal decision rule is to assign to population 1 if
\[ \frac{f_1(x)}{f_2(x)} > \frac{C(1|2)\,p_2}{C(2|1)\,p_1} \tag{1} \]
and to assign to population 2 otherwise. Here, fj(x) is the probability density function
for population j, pj is the prior probability that an observation comes from popu-
lation j, and C(i|j) is the cost associated with wrongly classifying an observation in
population i to population j. For this article, it will be assumed that all costs of wrong
classification are the same.¹ In this case, (1) reduces to assigning x to population 1 if
\[ p_1 f_1(x) > p_2 f_2(x) \tag{2} \]
and assigning to population 2 otherwise. The situation may be generalized (Rao
1948; Bryan 1951) to m > 2, by noting that equation (2) simply compares two scores
of the form pjfj(x), and extending this rule to state that x is assigned to population
k ∈ {1, . . ., m} where
\[ k = \arg\max_{j \in \{1, \ldots, m\}} \{ p_j f_j(x) \} \tag{3} \]
The interpretation of equation (3) is intuitive in terms of Bayes Theorem;
pjfj(x) is proportional to the posterior probability of an observation coming from
population j once the value of x is known. Thus, assignment is to the population with
the largest posterior probability. This is a general rule for assignment to
populations, but it assumes that the prior probabilities {pj} and the probability
density functions {fj} are known. In general, they are not, and they must be estimat-
ed from a set of training data. Much of the technique of discriminant analysis lies in
this estimation task. The estimation of the pj’s is straightforward, if the training data
are a random sample from the complete population obtained by merging each of the
populations associated with individual y values. In this case, the estimate for pj is just
\[ p_j = \frac{n_j}{n_1 + n_2 + \ldots + n_m} \tag{4} \]
where nj is the number of observations from population j.
If the training data set from each population is not generated in this way—for
example, if they come from a case–control study where there are the same number
of observations in each sample regardless of their relative abundance in reality—and
there is no other information about the prior probabilities for each population, then a
minimax argument shows that the optimal decision rule is found by setting each pj to
1/m. In Bayesian terminology this is a noninformative prior.
Estimating fj is less trivial. Two general possibilities are
1. Assume each fj takes a known parametric form (such as multivariate Gaussian)
and estimate the parameters from the data.
2. Use a nonparametric technique such as kernel density estimation to estimate fj for each sample.
In this article, we concentrate on the first option and assume that the fj's are
multivariate Gaussian. This assumption gives rise to the standard techniques of LDA
and QDA. The second option leads to a technique called Kernel Discriminant
Analysis (KDA), which we will consider in the future.
In the multivariate Gaussian model, the decision rule (3) is
\[ k = \arg\max_{j \in \{1, \ldots, m\}} \left[ p_j \frac{1}{(2\pi|\Sigma_j|)^{q/2}} \exp\left( -\frac{1}{2}(x - \mu_j)' \Sigma_j^{-1} (x - \mu_j) \right) \right] \tag{5} \]
where Σj is the variance–covariance matrix for population j, q is the number of
predictor variables in x, and μj is the mean value for population j. Taking logs,
changing signs, and removing terms that are constant for all values of j, the rule may
be written as
\[ k = \arg\min_{j \in \{1, \ldots, m\}} \left[ \frac{q}{2}\log\left(|\Sigma_j|\right) + \frac{1}{2}(x - \mu_j)' \Sigma_j^{-1} (x - \mu_j) - \log(p_j) \right] \tag{6} \]
This is the QDA decision rule. Each of the m score functions within the
square brackets in equation (6) is quadratic in x. If we make a further assumption
that all populations have the same variance–covariance matrix, then the rule
simplifies to
\[ k = \arg\min_{j \in \{1, \ldots, m\}} \left[ \frac{q}{2}\log\left(|\Sigma|\right) + \frac{1}{2}(x - \mu_j)' \Sigma^{-1} (x - \mu_j) - \log(p_j) \right] \tag{7} \]
Expanding the quadratic expression, and noting that the term quadratic in x
appears identically in each score, the above rule simplifies to
\[ k = \arg\min_{j \in \{1, \ldots, m\}} \left[ -x' \Sigma^{-1} \mu_j + \frac{1}{2}\mu_j' \Sigma^{-1} \mu_j - \log(p_j) \right] \tag{8} \]
which is linear in x. This is the LDA decision rule. Note that maximum likelihood
estimates may be used to estimate each μj and Σj, and that a pooled estimate for Σ
in the case of LDA may be found by
\[ \Sigma = \frac{n_1 \Sigma_1 + n_2 \Sigma_2 + \ldots + n_m \Sigma_m}{n_1 + n_2 + \ldots + n_m} \tag{9} \]
where {Σ1, . . ., Σm} are maximum likelihood estimators of the respective variance–covariance matrices in each population.
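The QDA and LDA rules can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names (`solve_and_logdet`, `qda_score`, `classify`) are hypothetical, covariance matrices are assumed to be small, positive-definite lists of lists, and the score uses the standard multivariate Gaussian log-density, in which the log-determinant enters with a factor of 1/2 and the constant involving 2π is dropped because it is the same for every population.

```python
import math

def solve_and_logdet(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting.

    Returns (z, log|det a|). Assumes a is a small, nonsingular matrix
    given as a list of lists (hypothetical helper for illustration).
    """
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    logdet = 0.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        logdet += math.log(abs(m[c][c]))
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))
        z[r] = s / m[r][r]
    return z, logdet

def qda_score(x, mean, cov, prior):
    """Negative log of prior times Gaussian density, up to a constant."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    z, logdet = solve_and_logdet(cov, d)
    mahalanobis = sum(di * zi for di, zi in zip(d, z))
    return 0.5 * logdet + 0.5 * mahalanobis - math.log(prior)

def classify(x, means, covs, priors):
    """Index of the population with the smallest score (arg min rule)."""
    scores = [qda_score(x, m, c, p) for m, c, p in zip(means, covs, priors)]
    return min(range(len(scores)), key=scores.__getitem__)
```

Passing a separate covariance matrix for each population gives a QDA-style rule; passing the same pooled matrix for every population gives the same assignments as the LDA rule, because the log-determinant and the term quadratic in x are then identical across populations.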
The two rules are compared in Fig. 1. In this case, x is two-dimensional, m = 3,
and the populations are clearly separated. Here, it is clear that LDA is just as good
as QDA for separating the populations, although this is not often the case. In many
real-world examples, there is a degree of overlap between the populations, so that
neither LDA nor QDA can provide a perfect discrimination rule.
Figure 1. Discriminant analysis illustrated. LHS shows the result of LDA applied to the
points; RHS shows the same for QDA. LDA, linear discriminant analysis; QDA,
quadratic discriminant analysis.
GWDA—a definition by extending discriminant analysis
Having outlined the ideas underlying QDA and LDA, this section discusses how
these may be extended into geographically weighted methods. Essentially, the idea
is that one now assumes that one or more of the quantities Σj, Σ, μj, and pj are
not fixed, but depend on spatial location u. In effect, the decision rule now becomes
localized, in the sense that any of the probabilities used to derive the decision rules
are now conditional on u. The dependency is modeled by assuming that the vector μj
is a function of u, so that each element of μj is a geographically weighted mean.
Similarly, each element of Σj is a geographically weighted variance or covariance.
These methods of estimation are in keeping with a local likelihood approach to es-
timating the population models of the form
\[ f_j(x \mid u) = \frac{1}{(2\pi|\Sigma_j(u)|)^{q/2}} \exp\left( -\frac{1}{2}(x - \mu_j(u))' \Sigma_j^{-1}(u) (x - \mu_j(u)) \right) \tag{10} \]
Paez, Uchida, and Miyamoto (2002a, b) and Paez (2004) have made use of
geographically weighted variances to provide an alternative interpretation of GWR.
Estimation
As suggested in the introduction, calibration of local discriminant functions is
achieved through calibration of local coefficients. Under the Gaussian assumption,
this amounts to the task of calibrating {μ1(u), . . ., μm(u)} and {Σ1(u), . . ., Σm(u)}
in equation (10). This is essentially a nonparametric estimation task, because the
objects to be estimated are functions of the vector u. More specifically, {μ1(u), . . .,
μm(u)} are vector functions of u while {Σ1(u), . . ., Σm(u)} are matrix functions
of u. As a shorthand, this collection of functions associated with the point u will
henceforth be referred to as Cj(u). One approach to estimating Cj(u) at any point u is
through local likelihood estimation. This is equivalent to calibrating a weighted
nonlocalized model where the weights are obtained from a kernel function centered on u. Weights are applied to log-likelihoods, and so for equation (10) Cj(u) is²
chosen to minimize
\[ L(C_j(u) \mid x) = \sum_{i=1}^{n} w_i(u) \left( -\log\left(|\Sigma_j(u)|\right) - \frac{1}{2}(x_i - \mu_j)' \Sigma_j^{-1} (x_i - \mu_j) \right) \tag{11} \]
where wi(u) is the weight applied to observation i when calibration is taking place
at the point u. As stated earlier, this weight is determined by a kernel function, so
that if ui is the location associated with observation i, and di(u) = |u − ui|, then wi(u)
is a decreasing function of di(u), tending to 0 as di(u) increases. One possibility is
the bisquare kernel function
\[ F_h(d) = \begin{cases} \left( 1 - \left( \dfrac{d}{h} \right)^2 \right)^2 & \text{if } d < h \\ 0 & \text{otherwise} \end{cases} \tag{12} \]
where d is the distance, and h is the kernel bandwidth—a quantity controlling the
smoothness of the nonparametric function estimation.
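As a minimal sketch of equation (12), the bisquare kernel and the resulting weights wi(u) might be coded as follows. The function names are hypothetical, and planar Euclidean distance between coordinate tuples is an assumption; the article does not specify a distance metric beyond |u − ui|.

```python
import math

def bisquare(d, h):
    """Bisquare kernel of equation (12): zero at and beyond the bandwidth h."""
    if d < h:
        return (1.0 - (d / h) ** 2) ** 2
    return 0.0

def kernel_weights(u, locations, h):
    """Weights w_i(u) for each observation location, given calibration point u.

    Assumes planar coordinates and Euclidean distance (an assumption made
    for illustration).
    """
    return [bisquare(math.dist(u, loc), h) for loc in locations]
```

For example, with h = 10 an observation at distance 5 receives weight (1 − 0.25)² = 0.5625, while any observation at distance 10 or more receives weight 0.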
This approach leads to the following estimators for Cj(u):
\[ \mu_j(u) = \frac{\sum_{i=1}^{n} w_i(u)\, x_i}{\sum_{i=1}^{n} w_i(u)} \tag{13} \]
and
\[ \Sigma_j(u) = \frac{\sum_{i=1}^{n} w_i(u)\,(x_i - \mu_j(u))(x_i - \mu_j(u))'}{\sum_{i=1}^{n} w_i(u)} \tag{14} \]
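The two estimators can be sketched directly from their definitions. In the illustration below, `gw_mean_cov` is a hypothetical name, `xs` is assumed to hold only the training observations belonging to population j, and `weights` is assumed to hold the corresponding kernel weights wi(u), computed beforehand for the calibration point u.

```python
def gw_mean_cov(xs, weights):
    """Geographically weighted mean vector and covariance matrix.

    xs: list of observation vectors for one population (assumption).
    weights: kernel weights w_i(u), one per observation, not all zero.
    """
    total = sum(weights)
    q = len(xs[0])
    # Weighted mean, as in equation (13).
    mean = [sum(w * x[k] for w, x in zip(weights, xs)) / total
            for k in range(q)]
    # Weighted covariance about the local mean, as in equation (14).
    cov = [[sum(w * (x[a] - mean[a]) * (x[b] - mean[b])
                for w, x in zip(weights, xs)) / total
            for b in range(q)]
           for a in range(q)]
    return mean, cov
```

Observations with zero weight, that is, those beyond the bandwidth, drop out of both sums, so the estimates depend only on data near u.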
The next step is to estimate {p1, . . ., pm}. As before, this is straightforward, if the
training data in the neighborhood of u are a random sample from the complete
population obtained by merging each of the populations associated with individual
y values. In this case, the pj’s are estimated as in equation (4). If we further assume
that Σ is the same for each population, as in LDA, we may obtain an estimate for Σ
using equation (9).
Note that in the above we are assuming that the proportions from different
populations in the sample reflect the overall proportions. If this is not the
case, then each pj is assumed to be equal to 1/m. A final complication
is that the composition of the merged population, that is, the proportions
associated with each of the y values, may vary geographically. In this case, local
likelihood estimates of pj (written as pj(u)) should be used. These are obtained by
\[ p_j(u) = \frac{\sum_{x_i \in j} w_i(u)}{\sum_{j'=1}^{m} \sum_{x_i \in j'} w_i(u)} \tag{15} \]
This choice between local and global estimation occurs in other aspects of the
calibration, as discussed in the next subsection.
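A short sketch of the local prior estimate in equation (15) follows. The name `local_priors` is hypothetical, and the inputs are assumed to be the population label of each training observation together with its kernel weight wi(u) at the calibration point.

```python
def local_priors(labels, weights):
    """Local prior p_j(u): share of total kernel weight held by population j."""
    totals = {}
    for label, w in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + w
    grand_total = sum(totals.values())
    return {label: t / grand_total for label, t in totals.items()}
```

With equal weights for every observation this reduces to the global proportion estimate of equation (4).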
Choices
One key issue in the above approach is the decision as to what parts of the
model should vary geographically. For any QDA, we must estimate
{Σ1, . . ., Σm}, {p1, . . ., pm} and {μ1, . . ., μm}. Similarly, for LDA we must
estimate {μ1, . . ., μm}, {p1, . . ., pm} and Σ. We have a choice as to which of these we
allow to vary geographically. In this study, we allow the mean, the variance–
covariance matrix, and the prior probabilities to vary geographically although it
would be possible to fix any combination of these to constant global values. How-
ever, the circumstances leading to such an action are not immediately obvious and
the more logical approach would be the one followed here of allowing all three
components to vary spatially.
These choices imply that there are a number of ways in which GWDA could be
specified. Different choices are likely to perform better in different
situations, so some methodology for selecting the best approach for a given
data set is needed. In addition to these choices relating to the probability
model, there is also the matter of choice of bandwidth for the geographical weight-
ing. Again, a selection methodology is required. A tentative approach proposed
here is cross-validation. A comprehensive cross-validation would involve removing
each observation from the training data, applying the GWDA to the rest of the data
set, and then assigning the removed observation to a population using the disc-
riminant rule thus derived. Applying this to every observation in the data set and
noting the proportion of correct assignments gives a performance score for a given
model working with a given bandwidth. This cross-validation score may then be
used to identify the ‘‘best’’ combination of model and bandwidth for a given data
set. An alternative, faster approach would be to randomly select a large proportion
of the training data (say 90%), calibrate the GWDA using this, and then apply the
cross-validation procedure to the remaining 10%.
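The comprehensive (leave-one-out) procedure can be sketched as below. This is an illustration under stated assumptions: `loo_accuracy` is a hypothetical name, and `fit_predict` stands for any rule that is calibrated on the retained data and then classifies the removed observation. The `nearest_mean_rule` shown is only a stand-in so that the sketch runs on its own; in a GWDA setting it would be replaced by a rule calibrated at the location of the removed observation with a given bandwidth.

```python
def loo_accuracy(xs, ys, fit_predict):
    """Proportion of observations correctly assigned when each is held out in turn."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if fit_predict(train_x, train_y, xs[i]) == ys[i]:
            hits += 1
    return hits / len(xs)

def nearest_mean_rule(train_x, train_y, x):
    """Stand-in classifier (hypothetical): assign x to the closest class mean."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, mean))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

Evaluating `loo_accuracy` for each candidate bandwidth and model, and keeping the combination with the highest value, corresponds to the selection procedure described above.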
A more sophisticated approach to cross-validation scoring may be to use cross-
validation likelihood, rather than the proportion of correct assignments. Note that
discriminant analysis assigns observations to populations on the basis of posterior
probabilities. Applying Bayes Theorem, the probability that observation x is in
population j is given by
\[ \Pr(j \mid x, u) = \frac{\Pr(x \mid j, u)}{\Pr(x \mid 1, u) + \ldots + \Pr(x \mid m, u)} \tag{16} \]
For each observation in the held-back cross-validation data set, we know the
actual value of j, and using a given model and bandwidth we can estimate the
posterior probability that the observation is in its true population. Taking logs and
adding up these quantities, a likelihood of the validation data set for a given choice
of bandwidth and model is obtained:
\[ \text{Score} = \sum_{\text{validation set}} \log\left( \Pr(j \mid x, u) \right) \tag{17} \]
This quantity can be used to select the best combination of model and band-
width. This approach has one clear advantage over a straightforward score of clas-
sifications because it is a continuous function with respect to bandwidth—which is
often advantageous in optimization algorithms, such as gradient descent (Fletcher
and Reeves 1964) or the simplex method (Nelder and Mead 1965). The downside
is that this score is computationally more expensive.
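Equations (16) and (17) can be sketched as follows. The function names are hypothetical. Each validation record is assumed to supply the log of Pr(x|j,u) for every population, evaluated under the model calibrated without that record, together with the index of its true population. Working with logs and subtracting the maximum before exponentiating is a numerical-stability choice made for this sketch, not something prescribed in the article.

```python
import math

def posteriors(log_densities):
    """Posterior probabilities as in equation (16), from log densities."""
    top = max(log_densities)
    scaled = [math.exp(v - top) for v in log_densities]
    total = sum(scaled)
    return [s / total for s in scaled]

def cv_log_score(records):
    """Cross-validation log-likelihood score of equation (17).

    records: iterable of (log_densities, true_index) pairs (assumed layout).
    """
    return sum(math.log(posteriors(ld)[j]) for ld, j in records)
```

If local prior probabilities are to enter the posterior, one option is to add log pj(u) to each log density before calling `posteriors`; this is an assumption about usage rather than part of equation (16) as printed.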
Spatially adaptive kernel bandwidths
As with GWR, one issue relating to the choice of h, the kernel bandwidth, is that
one may expect the degree of spatial variability in the discriminant functions to vary
geographically. For example, if the subjects of the analysis are people, then it is
possible that in urban areas with denser populations, there may be more variety in
social, economic, and cultural characteristics of places than there would be in
equivalent-sized rural areas. If that were the case, one could reasonably expect
geographical drift in GWDA models to be notable in much smaller geographical
windows. In such situations, a one-size-fits-all approach to bandwidth choice is
unhelpful, because it assumes that the spatial scale of variation is the same every-
where. To overcome this problem, it is proposed to use a local bandwidth selection
procedure, based on the lth nearest neighbor in the data set to the point u at the
center of the kernel. To calibrate the model at u, we simply choose the lth nearest-
neighbor distance from u as the bandwidth. In this way, in denser areas, where
there are more sample observations, the bandwidth will be smaller, and the model
calibration will take place on a smaller geographical scale. This is essentially the
same approach as used in GWR (Fotheringham, Brunsdon, and Charlton 2002) and
has been found to work well in practice.
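A minimal sketch of the adaptive bandwidth follows, assuming planar coordinates and Euclidean distance; `adaptive_bandwidth` is a hypothetical name.

```python
import math

def adaptive_bandwidth(u, locations, l):
    """Distance from u to its l-th nearest observation location (l >= 1)."""
    distances = sorted(math.dist(u, loc) for loc in locations)
    return distances[l - 1]
```

Two details are left as assumptions here. If u coincides with an observation in `locations`, that observation counts as the first neighbor at distance zero. Also, with the bisquare kernel of equation (12) the lth neighbor itself receives weight zero, so only the l − 1 closer observations contribute to the local estimates.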
At this stage, we are still left with the choice of an appropriate value for l. The
approach to this is much the same as that described for bandwidth selection in
‘‘Choices.’’ Essentially, one of the cross-validation scoring methods outlined in that
section is carried out over a series of possible l values, and the value with the best
score is chosen. Unfortunately, because l is a discrete quantity, one cannot take
advantage of some of the automatic optimization procedures suggested in the pre-
vious section, and so computation can take slightly longer. Once an optimal score
has been found for the nearest-neighbor bandwidth method, this can be compared
with that for the fixed bandwidth method, and the best performing of the two can be
presented as the final GWDA model.
Visualization and interpretation
An alternative to mapping coefficients
A major difficulty when presenting the results of GWDA is that of visualization. For
example, expanding the expression in equation (7) shows that for each of the
populations 1, . . ., m there are q + 1 coefficients. For a local QDA, there would be
even more than this. Thus, in effect, for linear GWDA there is an m by (q + 1) matrix
associated with each point in space. One possibility would be to plot m(q + 1) maps
side by side (which is in keeping with Tufte’s 1990 principle of ‘‘small multiples’’),
which is essentially mapping each of the coefficients in the way a typical GWR
analysis would do. The main problem here is one of interpretability. For each vari-
able, there is a separate coefficient for each population. For a given observation,
one needs to compare these m variables to understand the relative likelihood that
this observation will be assigned to a given group. This can be a large amount
of information to take in, and there is often no obvious visual connection between
the shading of a single map and likely group membership. It is for this reason that
approaches to visualization other than the mapping of coefficients are considered
here.
An alternative approach—and one that it is felt embodies the spirit of local
modeling—is to consider the variation between a given fixed set of predictor
variables {x} and the probability of membership of each of the populations Pr(j|x,u),
as the location u varies. In a global model, Pr(j|x,u) would not depend on u and so
for a given x we would expect any map based on these probabilities to show no
variations. The fact that local models do exhibit variation is their defining charac-
teristic, and mapping this variation should provide an intuitive image of the nature
of this variation for a given model. For example, if we were to choose x to be a
global average or median of the sample data set, our map would show the geo-
graphical variation in the probable category to which a typical observation is likely
to belong. Similarly, by choosing one of the variables of x to be above average (say
the 80th percentile), we could assess changes in the assignment to category due to
this increase in value across geographical space. In the example, we will consider
how the level of deprivation affects voting behavior and show that higher levels of
deprivation affect voting choice in different ways in different parts of England and
Wales.
Mapping probabilities of membership
It has been argued that a helpful visualization strategy is to map variation in Pr(j|x,u)
for a set of given x values. However, this in itself presents a minor challenge, as this
is essentially a vector quantity because there are k potential values of j. Using the
notational shorthand pj(u) = Pr(j|x,u) we need a way of mapping the vector
p(u) = (p1(u), p2(u), . . ., pk(u)). Here, two approaches will be suggested. First,
assume that the map is divided into l zones Z1, . . ., Zl each associated with a
representative point {u1, . . ., ul}. These zones could be regular (such as rectangular
pixels) or irregular (such as electoral wards). The representative point may be the
centroid of the zone, but need not be, particularly in cases where the centroid of a
zone is not within that zone. The approaches are:
1. ‘‘Majority Vote Approach’’: For each Zi, find the largest element pj(ui) in p(ui).
Color Zi with a color associated with j.
2. ‘‘Mixture Approach’’: Specify a color associated with each value of j using a
color coordinate system (such as RGB or HSV). Use the elements of p(ui) to
obtain a weighted average of these colors in the color space. (This is possible
because the elements of p(ui) sum to 1.) Color Zi with this weighted average
color.
The majority vote approach is less informative: for example, it is not possible
to distinguish between probability vectors such as (0.34, 0.33, 0.33) and (0.98,
0.01, 0.01), and thus one cannot tell how strong the degree of potential
membership is to the most likely population. On the other hand, the mixture
approach allows one to distinguish between more subtle changes in the profile, but
requires that maps are produced in full color. Thus, it is not appropriate to display
this kind of map in many refereed journals that can reproduce only monochrome
images.
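The loss of information under the majority vote approach is easy to see in a small sketch. The name `majority_vote` is hypothetical; the function returns the winning population index together with its probability, although only the index would determine the map color.

```python
def majority_vote(p):
    """Index of the largest element of p, and that element's value."""
    winner = max(range(len(p)), key=p.__getitem__)
    return winner, p[winner]
```

Both (0.34, 0.33, 0.33) and (0.98, 0.01, 0.01) yield index 0 and would therefore be shaded identically, even though the associated probabilities differ greatly.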
For the mixture approach, choice of color space is important. Colors for each of
the populations should be clearly distinguishable—this should avoid confusion
between high values of pj for each possible value of j. Given that colors are chosen
by applying linear operations to the color coordinate space, it would also be helpful
if the differences in perception of different colors correspond to distances between
coordinates in the color space. This implies that, for example, if k = 3, the color
used to represent (0.5, 0.5, 0) should be equally distinguishable from the two
colors used to represent the classes j = 1 and j = 2.
The issue of defining a color space whose distances corresponded to human
color perception was addressed by the Commission Internationale de l’Eclairage
(the CIE) in 1976. Two such spaces are the CIELAB and CIELUV specifications. Both
are based on perceptual color spaces, with CIELUV generally preferred for work
with additive color technologies, such as LCD displays and projection equipment,
while CIELAB is more appropriate for subtractive color systems, such as the use of
inks or dyes. Here, CIELUV will be considered, as a typical use of the mixture ap-
proach will be the creation of maps to be shown on LCD overhead projection
equipment during conferences, lectures, or seminars.
The CIELUV space has three coordinates, L, u, and v, where L ∈ [0, 100] is a
degree of luminance, and (u, v) ∈ [−100, 100] × [−100, 100] correspond to balance
between red/green and yellow/blue, respectively. Following the advice of
Cleveland and McGill (1983), we note that areas on a graph with greater luminance
tend to look larger, and to avoid any optical illusions that may draw unmerited
attention to certain map regions purely on the grounds of color choice, we work
with a fixed value of L. Thus, our color model resides entirely on the (u, v) plane.
Next, it is important to note that not all (L, u, v) triplets correspond to a color—
because the coordinate system is a nonlinear transform of the conventional RGB
color cube it does not correspond to a regular cuboid itself, and so some points near
the edges of the color space are undefined.
Thus, for the mixture approach to work, we must choose a set of colors for the k
populations corresponding to k points in the (u, v) plane for a given L, and we must
ensure that the convex hull of these points consists entirely of well-defined colors.
A further condition is that the points should be as far apart as possible, to meet the
requirement that colors for each of the populations should be clearly distinguishable.
Finding such a set of points for any given L is an exercise in practical
computational geometry.³ For k = 3, a possible solution is illustrated in Fig. 2.⁴ This uses
a luminance value of 70, which has been found to work well in practice. For the
three mixing colors, this scheme uses (70.0, −37.0, 63.6), (70.0, −37.0, −71.4),
and (70.0, 80.0, −4.0).
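The mixture approach with these three colors can be sketched as follows. The weighted average in `mix_luv` follows the description in the text directly. The conversion in `luv_to_srgb` is an addition for illustration: it uses the standard CIE 1976 L*u*v* to XYZ formulas followed by the usual sRGB matrix and transfer function, and it assumes a D65 reference white and clips any out-of-gamut values, none of which is specified in the article. Both function names are hypothetical.

```python
# D65 reference white (an assumption; the article does not state one).
XN, YN, ZN = 0.95047, 1.0, 1.08883
_D = XN + 15.0 * YN + 3.0 * ZN
UN, VN = 4.0 * XN / _D, 9.0 * YN / _D

# Mixing colors (L, u, v) quoted in the text for k = 3.
ANCHORS = [(70.0, -37.0, 63.6), (70.0, -37.0, -71.4), (70.0, 80.0, -4.0)]

def mix_luv(p, anchors=ANCHORS):
    """Weighted average of the anchor colors, with weights p summing to 1."""
    return tuple(sum(w * c[k] for w, c in zip(p, anchors)) for k in range(3))

def luv_to_srgb(L, u, v):
    """Convert CIELUV to gamma-encoded sRGB in [0, 1], clipping if needed."""
    if L <= 0.0:
        return (0.0, 0.0, 0.0)
    up = u / (13.0 * L) + UN
    vp = v / (13.0 * L) + VN
    y = YN * ((L + 16.0) / 116.0) ** 3 if L > 8.0 else YN * L * (3.0 / 29.0) ** 3
    x = y * 9.0 * up / (4.0 * vp)
    z = y * (12.0 - 3.0 * up - 20.0 * vp) / (4.0 * vp)
    linear = (
        3.2406 * x - 1.5372 * y - 0.4986 * z,
        -0.9689 * x + 1.8758 * y + 0.0415 * z,
        0.0557 * x - 0.2040 * y + 1.0570 * z,
    )

    def encode(c):
        c = min(1.0, max(0.0, c))  # clip values outside the sRGB gamut
        return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1.0 / 2.4) - 0.055

    return tuple(encode(c) for c in linear)
```

A zone with probability vector p would then be shaded with `luv_to_srgb(*mix_luv(p))`. Because all three anchors share L = 70, every mixture also has L = 70, in keeping with the fixed-luminance requirement.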
An example using the 2005 UK election results for England and Wales
The results of the May 2005 UK General Election may be downloaded from the UK
Electoral Commission.⁵ As of May 6, 2005, the results for the 569 Parliamentary
Constituencies contested in England and Wales are summarized in Table 1. The
results in map form are shown in Fig. 3.
To obtain a greater understanding of the linkage of voting patterns with social
and economic conditions in each of the constituencies, a number of socioeco-
nomic variables were derived for each constituency. These are tabulated in Table 2.
These variables may be used in a straightforward (nongeographically weighted)
discriminant analysis to predict the party elected in each constituency. This gives
rise to the map in Fig. 4. One interesting feature of this map is that it predicts no
seats for the Liberal Democrats or others. Clearly, this is at odds with the reality of
voting patterns. Looking at Fig. 3, it seems that the Liberal Democrat/Other vote is
stronger in certain regions than in others. In particular, North Wales and the West of
England show a strong tendency to vote away from the Conservative/Labour axis.
Figure 2. Using the CIELUV color space to visualize a three-way probability vector.
(Plot of the CIELUV (u, v) plane for L = 70; both axes run from −100 to 100.)

Table 1 May 2005 Election Results (England and Wales)
Party                          No. of seats
Conservative                   196
Labour                         314
Liberal Democrat and Others    59

Figure 3. Election results for parliamentary constituencies in England and Wales.

Table 2 Constituency-Level Socioeconomic Variables Used in this Example
Variable name    Description
Unemp            Percentage of economically active males unemployed
NoQual           Percentage of adult population with no qualifications
OwnOcc           Percentage of owner occupied households
Pension          Percentage of pensioners in population
NonWhite         Percentage of nonwhite people in population
LoneParHH        Percentage of lone-parent households

Figure 4. Predicted election results for England and Wales using global discriminant analysis.

Table 3 Principal Component Analysis of Predictor Variables
Component    Unemp     NoQual    OwnOcc    Pension   NonWhite  LoneParHH
1            −0.135    0.027     0.090     0.111     −0.975    −0.096
2            0.716     0.471     −0.204    −0.021    −0.151    0.448

However, because there are six predictor variables, an extra step of dimensional
reduction using principal component analysis is carried out. This overcomes
the difficulties associated with correlation among the predictor variables. In order
to allow for differences in scale between the predictor variables, the principal
component analysis is based on the correlation matrix, rather than the covariance
matrix, of the variables in Table 2. The first two principal components are listed in
Table 3. These could be interpreted as follows:
1. Mainstream Britain: high levels of home owner-occupiers and pensioners; low
levels of unemployment, lone-parent households, and nonwhites.
2. Deprivation: high unemployment, high levels of people without qualifications,
and lone-parent households.
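To illustrate why the correlation matrix removes differences in scale, the sketch below standardizes each variable, forms the correlation matrix, and extracts the leading principal component. Power iteration is used here purely for simplicity; it is a substitute chosen for this sketch, not the method used in the article, and in practice a full eigendecomposition routine would give all components at once. The function names are hypothetical, and each variable is assumed to have nonzero variance.

```python
import math

def correlation_matrix(rows):
    """Correlation matrix of the columns of rows (list of observation vectors)."""
    n, q = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(q)]
    sds = [math.sqrt(sum((r[k] - means[k]) ** 2 for r in rows) / n)
           for k in range(q)]
    # Standardized scores: each column has mean 0 and variance 1.
    z = [[(r[k] - means[k]) / sds[k] for k in range(q)] for r in rows]
    return [[sum(zr[a] * zr[b] for zr in z) / n for b in range(q)]
            for a in range(q)]

def leading_component(corr, iterations=200):
    """Unit-length loadings of the first principal component (power iteration)."""
    q = len(corr)
    v = [1.0 / (k + 1) for k in range(q)]  # arbitrary nonzero starting vector
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(q)) for a in range(q)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v
```

In a small example with two perfectly correlated variables measured on scales differing by a factor of 100, both variables receive equal loadings, whereas a covariance-based analysis would be dominated by the variable with the larger numerical range.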
The histograms of the principal components are shown in Fig. 5. Although not
perfectly normal in distribution (an assumption of LDA), they are reasonably close.
Furthermore, for the geographically weighted approach, normality on a global
scale is less important than the normality of components in the vicinity of any given
local prediction point.
Next, both a fixed bandwidth and an adaptive GWDA were applied to this
data. In each case, the optimal smoothing parameters were selected using cross-
validation likelihood, as illustrated in Fig. 6 (fixed) and Fig. 7 (adaptive). From these
it may be seen that the optimal fixed bandwidth is around 100 km, and the optimal
adaptive bandwidth is at six nearest neighbors. It may also be seen that the optimal
adaptive model has a better cross-validation likelihood than the fixed.
The results of predicting the party of the candidate returned by each constituency
are shown for the fixed bandwidth GWDA in Fig. 8 and for the adaptive
bandwidth GWDA in Fig. 9. In both cases, the models have to some extent
captured the greater tendency to vote Liberal Democrat in the southwest of
England and Plaid Cymru in Wales. This is also shown in the confusion matrices
for the three methods (Tables 4–6), which confirm the superiority of GWDA over
global LDA.
Next, the relationship between the predictor variables and the categorical out-
comes will be considered. Mapping of the kind suggested in ‘‘An alternative to
mapping coefficients’’ is carried out. Fixing the variables so that the first and second
principal components take the values {� 1, 0, 1} in sequence gives the matrix of
Principal Component 1
PC1
Freq
uenc
y
0–1–2–3 1 2
0
20
40
60
80
100
Freq
uenc
y
0
20
40
60
80
Principal Component 2
PC2
0.0–0.5–1.0 0.5 1.0 1.5
Figure 5. Histograms of principal components.
Geographically Weighted Discriminant AnalysisChris Brunsdon et al.
389
maps in Fig. 10 (see Note 6). These demonstrate the way in which the predicted parties
elected alter as two key elements of the structure of the predictor variables alter.
Because the principal component scores are standardized, a value of 1 or −1
corresponds to one standard deviation above or below the mean value, while a score of 0
corresponds to the mean value itself for that particular component. From this, a
number of features can be deduced. First, in all cases, increasing the ‘‘middle Britain’’
component tends to increase the number of Conservative MPs returned. When
the second (deprivation) component is at the average level, increasing the
middle-Britain component causes a Conservative core in the south and east of England to
grow until it covers most of the lower half of England. However, a localized effect is
apparent, where in the southwestern part of England (Devon and Cornwall) the vote
Figure 6. Fixed bandwidth versus cross-validation likelihood. [Axes: Bandwidth (km); Cross-Validation Likelihood.]
Figure 7. lth nearest-neighbor bandwidth versus cross-validation likelihood.
switches to Liberal Democrat rather than Conservative. Another interesting
effect is observed when fixing the middle-Britain component at the mean level
and varying the deprivation component from −1 to 1.
Again, this changes the area covered by the Conservatives,
perhaps to a larger degree than varying the first component. However,
the different voting pattern in the southwest of England, and also in Wales, is again
apparent here.
Figure 8. Predicted election results for England and Wales using fixed bandwidth GWDA.
GWDA, geographically weighted discriminant analysis. [Legend: Conservative; Labour; LibDem/Other.]
Concluding discussion
In this article, the idea of local modeling has been extended to discriminant analysis,
enabling the prediction of categorical data to be treated in a geographically
weighted framework similar to GWR. Both adaptive and fixed kernel approaches
have been considered, and a method for selecting the degree of smoothing to be
applied has been outlined for these. There are still a number of methodological and
theoretical issues to be considered, and these will now be outlined. First, the
technique outlined here relies on the assumption that the predictor (x) variables are
Figure 9. Predicted election results for England and Wales using adaptive bandwidth
GWDA. GWDA, geographically weighted discriminant analysis. [Legend: Conservative; Labour; LibDem/Other.]
continuous and follow a Gaussian distribution. In many practical situations, this
will not be the case. One potential approach here would be to consider KDA, as
discussed earlier, although there will be some methodological issues to be
considered (see, e.g., Hastie, Tibshirani, and Friedman 2001).
A further issue is dealing with categorical predictors. Again, the current Gaussian
approach is not well equipped to deal with these. One approach might be to
consider all interactions between a categorical predictor and the distributions of
continuous predictors, so that, for each level of the categorical variable, there is a
different multivariate Gaussian distribution for the continuous predictors. This may
work well for a single categorical predictor, but there is a risk of multiplying
potential parameterizations of distributions if there are a large number of discrete
Table 4 Predictive Accuracy of the Adaptive Kernel GWDA Method
                     Prediction
Result           Conservative   Labour   Other
Conservative              160       33       3
Labour                     16      297       1
Other                      24       22      13
GWDA, Geographically Weighted Discriminant Analysis.
Table 5 Predictive Accuracy of the Fixed Kernel GWDA Method
                     Prediction
Result           Conservative   Labour   Other
Conservative              161       33       2
Labour                     21      291       2
Other                      25       24      10
GWDA, Geographically Weighted Discriminant Analysis.
Table 6 Predictive Accuracy of the Global LDA Method
                     Prediction
Result           Conservative   Labour   Other
Conservative              150       46       0
Labour                     24      290       0
Other                      30       29       0
LDA, linear discriminant analysis.
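As a quick check on Tables 4–6, overall accuracy and per-party recall can be computed directly from the confusion matrices. The counts below are copied from the tables; the helper names are illustrative only.

```python
import numpy as np

# Rows: actual result; columns: prediction (Conservative, Labour, Other).
tables = {
    "adaptive GWDA (Table 4)": np.array([[160, 33, 3], [16, 297, 1], [24, 22, 13]]),
    "fixed GWDA (Table 5)": np.array([[161, 33, 2], [21, 291, 2], [25, 24, 10]]),
    "global LDA (Table 6)": np.array([[150, 46, 0], [24, 290, 0], [30, 29, 0]]),
}


def overall_accuracy(cm):
    """Share of constituencies on the diagonal (correctly predicted)."""
    return np.trace(cm) / cm.sum()


def per_class_recall(cm):
    """For each actual party, the share of its seats predicted correctly."""
    return np.diag(cm) / cm.sum(axis=1)


for name, cm in tables.items():
    print(name, round(overall_accuracy(cm), 3), np.round(per_class_recall(cm), 3))
```

The three matrices give overall accuracies of roughly 82.6%, 81.2%, and 77.3%. The recall for the Other category is zero under global LDA, which never predicts that category.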
predictors, and every possible combination of these corresponds to a different
multivariate Gaussian distribution. A simpler approach might be to consider treating
the categorical predictors as having independent effects with respect to the
continuous predictors, so that
\[ \Pr(j \mid x, u, z = m) \propto \Pr(j \mid z = m, u) \, \Pr(j \mid x, u) \qquad (18) \]
where m is one of the possible categorical values of the predictor variable z.
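The combination step in equation (18) amounts to an elementwise product of two local probability vectors followed by renormalization over the classes j. The sketch below shows only that step. How Pr(j | z = m, u) would be estimated locally is not specified here, and the function name and example probabilities are hypothetical.

```python
import numpy as np


def combine_posteriors(p_given_z, p_given_x):
    """Combine Pr(j | z = m, u) and Pr(j | x, u) as in equation (18)."""
    p_given_z = np.asarray(p_given_z, dtype=float)
    p_given_x = np.asarray(p_given_x, dtype=float)
    unnormalized = p_given_z * p_given_x      # right-hand side of (18)
    total = unnormalized.sum()
    if total <= 0.0:
        raise ValueError("the two sets of probabilities share no support")
    return unnormalized / total               # rescale so the classes sum to 1


# Hypothetical local probabilities for (Conservative, Labour, Other) at one location u.
p_z = [0.5, 0.3, 0.2]   # from the categorical predictor z = m
p_x = [0.2, 0.6, 0.2]   # from the continuous predictors x
combined = combine_posteriors(p_z, p_x)
print(combined)
```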
Figure 10. Predicted election results for England and Wales using fixed bandwidth GWDA.
GWDA, geographically weighted discriminant analysis. [Legend: Conservative; Labour; Other. Panels: PC1 = 0, PC2 = 0; PC1 = 0, PC2 = 1; PC1 = 1, PC2 = 0; PC1 = 1, PC2 = 1.]
A further important issue is that of robustness. Currently, the mean and variance
estimates at a given point u are based on weighted means, and
such statistics are vulnerable to contamination by outliers. An outlier in a
particular locality could be particularly harmful, as it could have a great deal of
leverage on local parameter estimates. Although the distortion would be restricted
to the region of this rogue observation, the degree of distortion could be quite large
if there were not many other normal observations nearby. This could clearly lead
to misleading map patterns. For this reason, in the future we intend to investigate
robust estimators of local mean and variance, such as those outlined in
Rousseeuw and Van Driessen (1999) and Hubert and Van Driessen (2002), and
adapt these for localized parameter estimation.
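The leverage argument can be illustrated with a toy calculation. The sketch assumes a Gaussian kernel and made-up values: one outlying observation sits at the estimation point, surrounded first by three and then by thirty ordinary observations. It illustrates the problem only, not the robust estimators cited above.

```python
import numpy as np


def gw_mean(values, dists, h):
    """Geographically weighted mean with an (assumed) Gaussian kernel."""
    w = np.exp(-0.5 * (np.asarray(dists, dtype=float) / h) ** 2)
    return float(np.sum(w * np.asarray(values, dtype=float)) / np.sum(w))


h = 10.0                              # hypothetical bandwidth
neighbor_dists = [5.0, 8.0, 10.0]     # distances of ordinary observations
neighbor_values = [1.0, 0.0, -1.0]    # ordinary values scattered around zero

# Sparse locality: the outlier (value 50) plus three ordinary neighbors.
sparse = gw_mean([50.0] + neighbor_values, [0.0] + neighbor_dists, h)
# Dense locality: the same outlier plus thirty ordinary neighbors.
dense = gw_mean([50.0] + neighbor_values * 10, [0.0] + neighbor_dists * 10, h)
# Reference: the sparse locality without the outlier.
clean = gw_mean(neighbor_values, neighbor_dists, h)

print(f"clean: {clean:.2f}, sparse with outlier: {sparse:.2f}, dense with outlier: {dense:.2f}")
```

With few ordinary neighbors the local mean is pulled far toward the outlier. With many, the same outlier shifts it much less.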
Notes
1 It is a conceptually simple but tedious task to modify results if this assumption does not
hold.
2 The hat is intentional, denoting an estimate of Cj(u).
3 That is, trial and error plus a little guessing.
4 A color version of this diagram is available from
homepage.mac.com/chrisbrunsdon/GWR_in_R/FileSharing5.html
5 http://www.electoralcommission.org.uk/elections/generalelection2005.cfm as of
12-12-2005.
6 A color version of these results using the CIELUV scheme is available from
homepage.mac.com/chrisbrunsdon/GWR_in_R/FileSharing5.html
References
Atkinson, P. M., S. German, D. Sear, and M. Clark. (2003). ‘‘Exploring the Relations Between
Riverbank Erosion and Geomorphological Controls Using Geographically Weighted
Logistic Regression.’’ Geographical Analysis 35, 58–82.
Brunsdon, C., A. Fotheringham, and M. Charlton. (2002). ‘‘Geographically Weighted
Summary Statistics: A Framework for Localized Exploratory Data Analysis.’’ Computers
Environment and Urban Systems 26, 501–24.
Bryan, J. (1951). ‘‘The Generalized Discriminant Function: Mathematical Foundations and
Computational Routine.’’ Harvard Educational Review 21, 90–95.
Cleveland, W., and R. McGill. (1983). ‘‘A Color-Caused Optical Illusion on a Statistical
Graph.’’ The American Statistician 37, 101–05.
Fisher, R. (1936). ‘‘The Use of Multiple Measurements in Taxonomic Problems.’’ Annals of
Eugenics 7, 179–88.
Fletcher, R., and C. Reeves. (1964). ‘‘Function Minimisation by Conjugate Gradients.’’
Computer Journal 7, 148–54.
Fotheringham, A., C. Brunsdon, and M. Charlton. (2002). Geographically Weighted
Regression: The Analysis of Spatially Varying Relationships. Chichester, UK: Wiley.
Hastie, T., R. Tibshirani, and J. Friedman. (2001). The Elements of Statistical Learning: Data
Mining, Inference and Prediction. New York: Springer.
Hubert, M., and K. Van Driessen. (2002). ‘‘Fast and Robust Discriminant Analysis.’’ Technical
Report, Department of Mathematics, Katholieke Universiteit Leuven.
Maddala, G. S. (1983). Limited Dependent and Qualitative Variables in Econometrics.
Cambridge, UK: Cambridge University Press.
Nelder, J., and R. Mead. (1965). ‘‘A Simplex Method for Function Minimisation.’’
Computer Journal 7, 308–13.
Paez, A., T. Uchida, and K. Miyamoto. (2002a). ‘‘A General Framework for Estimation and
Inference of Geographically Weighted Regression Models: 1. Location Specific Kernel
Bandwidths and a Test for Locational Heterogeneity.’’ Environment and Planning A 34,
733–54.
Paez, A., T. Uchida, and K. Miyamoto. (2002b). ‘‘A General Framework for Estimation and
Inference of Geographically Weighted Regression Models: 2. Spatial Association and
Model Specification Tests.’’ Environment and Planning A 34, 883–904.
Paez, A. (2004). ‘‘Anisotropic Variance Functions in Geographically Weighted Regression
Models.’’ Geographical Analysis 36, 299–314.
Paez, A. (2006). ‘‘Exploring Contextual Variations in Land Use and Transport Analysis
Using a Probit Model with Geographical Weights.’’ Journal of Transport Geography 14,
167–76.
Rao, C. (1948). ‘‘The Utilization of Multiple Measurements in Problems of Biological
Classification (with Discussion).’’ Journal of the Royal Statistical Society (B) 10, 159–203.
Rousseeuw, P., and K. Van Driessen. (1999). ‘‘A Fast Algorithm for the Minimum Covariance
Determinant Estimator.’’ Technometrics 41, 212–23.
Tufte, E. (1990). Envisioning Information. Cheshire, CT: Graphics Press.