Discriminant Procedures Based on Efficient Robust Discriminant
Coordinates
Kimberly Crimin
Wyeth Research
Joseph W. McKean
Western Michigan University
Simon J. Sheather
Texas A & M University
Abstract
For multivariate data collected over groups, discriminant analysis is a two-stage procedure: separation
and allocation. For the traditional least squares procedure, separation of training data into groups is
accomplished by the maximization of the Lawley-Hotelling test for differences between group means. This
produces a set of discriminant coordinates which are used to visualize the data. Using the nearest center
rule, the discriminant representation can be used for allocation of data of unknown group membership.
In this paper, we propose an approach to discriminant analysis based on efficient robust discriminant
coordinates. These coordinates are obtained by the maximization of a Lawley-Hotelling test based on
robust estimates. The design matrix used in the fitting is the usual one-way incidence matrix of zeros
and ones; hence, our procedure uses highly efficient robust estimators to do the fitting. This produces
efficient robust discriminant coordinates which allow the user to visually assess the differences among
groups. Further, the allocation is based on the robust discriminant representation of the data using
the nearest robust center rule. We discuss our procedure in terms of an affine-equivariant estimating
procedure. The robustness of our procedure is verified in several examples. In a Monte Carlo study
on probabilities of misclassifications of the procedures over a variety of error distributions, the robust
discriminant analysis performs practically as well as the traditional procedure for good data and is
much more efficient than the traditional procedure in the presence of outliers and heavy tailed error
distributions. Further, our procedure is much more efficient than a high breakdown procedure.
KEY WORDS: Affine-equivariant estimators; Least squares; Linear discriminant rule; Nearest center rule;
Nonparametrics; Rank-based analysis; Wilcoxon analysis; Visualization.
1 Introduction
Consider a multivariate data set where items belong to one of g groups. For such data, discriminant analysis
can be thought of as a two-stage process: separation and allocation (see, for instance, Johnson and Wichern
(1998) or Seber (1984)). In the separation stage, the goal is to find a representation of the observations
that clearly separates the groups. This stage is exploratory in nature and statistical procedures at this stage
are inherently graphical. The separation stage results in a kernel and the associated graphical procedure
(visualization) is based on the spectral decomposition of this kernel. In the allocation stage, the goal is to
assign an unclassified object to one of the known groups using the rule that optimally separates the training
data.
In Section 2 we review a discriminant analysis procedure based on discriminant coordinates and traditional
least squares estimates. Discriminant coordinates are obtained from maximizing the Lawley-Hotelling test for
differences between group means and can be used to graphically display the data. In the allocation stage, we
use the discriminant representation of the data and the simple “nearest center” rule (in terms of the
Mahalanobis distance) to assign an unclassified object to one of the known groups. Assuming a multivariate
normal distribution for each group (homogeneous covariance structure) and equal prior probabilities, the
simple rule is equivalent to the traditional rule with the usual estimates substituted (“plugged-in”) for the
parameters.
In Section 3, we propose a discriminant analysis procedure based on efficient robust discriminant coordinates. As with the traditional analysis, the robust discriminant analysis is a two-stage process (separation
and allocation) based on the robust discriminant coordinate representation. This representation is obtained
by maximizing a robust Lawley-Hotelling test for differences between group centers. Furthermore, the efficiency of our procedure is based on how well the procedure separates small differences among these centers
(local alternatives). We show that this efficiency is the same as the efficiency of the robust estimators.
Because the fitting is based on the usual one-way incidence design matrix of 0s and 1s, highly efficient robust
estimates can be used, which results in a highly efficient discrimination procedure. These robust discriminant
coordinates allow the user to visually (graphically) assess the differences among groups and to robustly
explore the data. Most robust estimation schemes can be used in our procedure. All that is required is a
√n-consistent equivariant estimator of location with an asymptotic linearity result and a consistent estimate
of its asymptotic variance-covariance matrix. The allocation rule is the simple “nearest center” rule in
terms of a Mahalanobis distance using the variance-covariance estimate found in the robust version of the
Lawley-Hotelling test statistic.
In Section 4 we use the affine equivariant robust estimators of multivariate location and scatter proposed
by Hettmansperger and Randles (2002) in the generic procedure discussed in the previous section. Their
proposed estimator combines the L1 (spatial) median with Tyler’s (1987) M-estimator of scatter. The
resulting estimates have a bounded influence function, a positive breakdown and are highly efficient for
heavy tailed error structures. Furthermore, if multivariate normal errors are assumed, the “nearest center”
allocation rule can be fine tuned to be a consistent estimate of the optimal rule, similar to the traditional
plug-in rule. In Section 5 the robustness of our procedure is illustrated with examples.
In Section 6 we present the results of a simulation study for the following three methods: the proposed
method described in the last paragraph based on Hettmansperger and Randles’ (2002) estimates (HR);
the traditional least squares procedure (LS); and a high breakdown but low efficiency method proposed by
Hawkins and McLachlan (1997), (HM). Besides the multivariate normal distribution, we generated data from
the elliptical contaminated normal and t distributions. The theoretical robustness and efficiency properties
of the HR procedure discussed in Section 4 are verified for the situations investigated. Except for the
normally distributed data, the HR procedure was more efficient than the LS procedure in terms of empirical
misclassification probabilities. Further, it was much more efficient than the high breakdown but low efficiency
procedure, even at the elliptical Cauchy distribution.
There are other robust discriminant analysis procedures in the literature, some of which, such as the
Hawkins and McLachlan (1997) procedure, substitute robust estimates for traditional estimates in the linear
discriminant rule. An example, which illustrates how such procedures differ from ours, was proposed
by Randles et al. (1978). Their procedure replaces the sample means by Huber type location estimates and
the sample variance-covariance matrix with a weighted estimate. Our procedure, though, maximizes a robust
Lawley-Hotelling test for differences between group centers. If Huber estimates are used in our procedure,
then the location estimates are similar to those of Randles et al., but the estimates of scatter differ. Our
estimates use the standardization which is required by the associated Lawley-Hotelling test statistic. Hence
in this case, the efficiency of our procedure is the same as the efficiency of the Huber estimator. Thus our
procedure is highly efficient. Randles et al.’s weighted estimate of scatter does not estimate the same matrix,
and its efficiency properties will differ. Generally, the weighting will result in lower efficiency (see Chapter 5
of Hettmansperger and McKean, 1998).
2 Traditional Discriminant Analysis
2.1 Notation
Suppose there are g distinct groups. Let xij represent the k × 1 random vector of the measured characteristics
made on the jth object in the ith group, j = 1, . . . , ni, and let n = n1 + · · · + ng. The n × k data matrix X
contains the n row vectors x′ij of multivariate observations. Let µi, i = 1, . . . , g, denote the mean for the
ith group and let µ denote the g × k matrix whose ith row is µ′i. Let πi, i = 1, . . . , g, denote the prior
probability that observation x belongs to the ith group. In this scenario, the model of interest is the one-way
multivariate linear model

X = Wµ + e, (2.1)

where W is the n × g incidence matrix and e is an n × k matrix of random errors with E(eij) = 0 and
Var(ei) = Σ, where e′i is the ith row of the matrix e. Assume that ei has density function f(x) and
distribution function F(x), and denote the jth marginal cdf and pdf of ei by Fj(xj) and fj(xj). Let Ω
denote the column space of W and let PΩ denote the projection matrix onto the subspace Ω.
We next briefly describe the traditional analysis; see, for instance, Chapters 5 and 6 of Seber (1984) for
more details.
2.2 Separation
Discriminant coordinates were introduced as a dimension reduction technique useful for “examining clustering
effects in the data”; see, e.g., Gnanadesikan (1977) or Seber (1984). The goal of discriminant coordinates is to
find linear combinations of the data that “best” separate the groups of observations.
The amount of separation in the groups is proportional to the size of the test statistic for testing
H0 : Aµ = 0 versus HA : Aµ ≠ 0, (2.2)
where A is the usual contrast matrix for testing equality of g group means. Since we are interested in
the “maximum” amount of separation in the groups, this is equivalent to finding c that maximizes the
Lawley-Hotelling type of test statistic for
H0 : Aµc = 0 versus HA : Aµc ≠ 0. (2.3)
Let µLS be the argument that minimizes tr (X − Wµ)′(X − Wµ); then µLS is the traditional least
squares estimate of µ. The associated Lawley-Hotelling test statistic is
TLS = tr(AµLSc)′(A(W ′W )−1A′)−1(AµLSc)(c′Σc)−1, (2.4)
where Σ = (n − g)−1X′(I − PΩ)X is the usual estimate of the variance-covariance matrix. Under the null
hypothesis (2.3), TLS has an approximate χ2 distribution with g − 1 degrees of freedom.
By the generalized Cauchy-Schwarz inequality, the maximum value of TLS, (2.4), is λ1, the maximum
eigenvalue of

Σ−1(AµLS)′(A(W ′W )−1A′)−1(AµLS), (2.5)
and the direction of maximum separation is c1, the corresponding orthonormal eigenvector. Then proceed
as in principal components, obtaining k orthogonal directions c1, c2, . . . , ck, which are the k orthonormal
eigenvectors corresponding to the eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λk ≥ 0 of the matrix in expression (2.5). The
eigenvalues of the matrix in expression (2.5) are the same as the eigenvalues of
KLS = Σ−1/2(AµLS)′(A(W ′W )−1A′)−1(AµLS)Σ−1/2, (2.6)

which is symmetric and, hence, easier to handle numerically. Let ai, i = 1, . . . , k, denote the corresponding
eigenvectors of KLS, (2.6). It can easily be shown that ci = Σ−1/2ai, i = 1, . . . , k. The vector ci is called
the ith discriminant direction. Let C = [c1, . . . , ck]. Then the discriminant coordinate representation of the
matrix X is Z = XC, where the columns of Z are the discriminant coordinates.
The matrix KLS, (2.6), is called the kernel for the traditional procedure and the associated visualization
procedures are graphical methods based on the discriminant coordinates.
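To make the construction concrete, here is a minimal numerical sketch (our illustration, assuming NumPy; the function name and the particular choice of contrast matrix A are ours, not the authors’) of computing the LS discriminant coordinates through the symmetric kernel KLS of (2.6):

```python
import numpy as np

def discriminant_coordinates(X, groups):
    """LS discriminant coordinates via the symmetric kernel K_LS of (2.6).
    X: (n, k) data matrix; groups: length-n integer labels."""
    n, k = X.shape
    labels = np.unique(groups)
    g = len(labels)
    W = (np.asarray(groups)[:, None] == labels[None, :]).astype(float)  # incidence matrix
    WtW = W.T @ W
    mu = np.linalg.solve(WtW, W.T @ X)                    # g x k matrix of group means
    P = W @ np.linalg.solve(WtW, W.T)                     # projection onto col(W)
    Sigma = X.T @ (np.eye(n) - P) @ X / (n - g)           # pooled covariance estimate
    A = np.hstack([np.ones((g - 1, 1)), -np.eye(g - 1)])  # contrasts: group 1 vs. others
    M = A @ mu
    G = np.linalg.inv(A @ np.linalg.inv(WtW) @ A.T)
    evals, evecs = np.linalg.eigh(Sigma)                  # Sigma^{-1/2} by eigen-decomposition
    S_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    K = S_inv_half @ M.T @ G @ M @ S_inv_half             # the kernel K_LS
    lam, a_vecs = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1]                         # eigenvalues in decreasing order
    C = S_inv_half @ a_vecs[:, order]                     # directions c_i = Sigma^{-1/2} a_i
    return X @ C, C, lam[order]                           # Z = XC, directions, eigenvalues
```

Note that with g = 2 the kernel has rank one, so the first coordinate carries all of the separation.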
2.3 Allocation
The objective of allocation is to classify an unknown object to one of g known groups. Recall that the rule
that minimizes the total probability of misclassification (TPM) is
Assign x to Gi if πifi(x) ≥ πjfj(x), for all j = 1, . . . , g; (2.7)
see, for instance, Seber (1984). If fi(x) is the pdf of a Nk(µi,Σ) distribution then the optimal rule becomes:
Assign x to Gi if Li(x) ≥ Lj(x), for all j = 1, 2, . . . , g, (2.8)
where, up to an additive constant common to all groups,

Li(x) = ln(πifi(x)) = ln πi + µ′iΣ−1(x − µi/2).
If the prior probabilities are assumed equal, then a short algebraic derivation shows that the rule in expression
(2.8) is equivalent to the nearest center rule, where distance is measured by the Mahalanobis distance. That
is, expression (2.8) is equivalent to

Assign x to Gi if (x − µi)′Σ−1(x − µi) ≤ (x − µj)′Σ−1(x − µj) for all j = 1, 2, . . . , g. (2.9)
In practice, the traditional estimates are substituted for the parameters.
The discriminant coordinate representation, Z = XC, is the representation of the data which gives
maximal separation, so it is the appropriate representation from which to work. In practice, we may use
only the first several discriminant coordinates to do the allocation. The nearest center rule for
discriminant coordinates is:
Assign x to Gi if Di(z) ≤ Dj(z) for all j = 1, 2, . . . , g, (2.10)
where Σz = C′ΣxC, µ′zi is the ith row of the g × k matrix

µz,LS = µLSC, (2.11)

and

Di(z) = (z − µzi)′Σ−1z(z − µzi).
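The nearest center rule (2.10) then amounts to a one-line Mahalanobis comparison. A short sketch (ours, assuming NumPy; the function name nearest_center is hypothetical):

```python
import numpy as np

def nearest_center(z, mu_z, Sigma_z):
    """Rule (2.10): assign z to the group with the smallest Mahalanobis
    distance D_i(z) = (z - mu_zi)' Sigma_z^{-1} (z - mu_zi).
    mu_z: (g, k) matrix whose rows are the group centers in discriminant
    coordinates; returns the index of the nearest center."""
    S_inv = np.linalg.inv(Sigma_z)
    d = z - mu_z                                # (g, k) differences from each center
    D = np.einsum('ij,jk,ik->i', d, S_inv, d)   # g squared Mahalanobis distances
    return int(np.argmin(D))
```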
3 Efficient Robust Discriminant Analysis
In this section, we outline a generic robust discrimination procedure, which is analogous to the traditional
procedure. The separation stage is based on maximizing a robust Lawley-Hotelling test statistic. The
efficiency of this stage is based on how powerful this test is in separating small differences of location.
3.1 Separation
To derive robust discriminant coordinates, begin with a robust estimate µP of µ in model (2.1), where P
denotes a generic robust estimating procedure. Recall, W is an incidence matrix so a highly efficient robust
estimator can be used. Assume that, under regularity conditions,

µP is asymptotically Ng,k(µ, (W ′W )−1, ΣP), (3.12)

where ΣP is the asymptotic variance-covariance matrix of µP. Note that the square root of this matrix,
Σ1/2P, is the multivariate analog of the standard error of the estimate. Let Σ̂P be a consistent estimate of
ΣP. Then the Lawley-Hotelling type test statistic for the hypotheses (2.3) is:

TP = (AµP c)′(A(W ′W )−1A′)−1(AµP c)(c′Σ̂P c)−1. (3.13)
Proceeding as in the traditional discriminant coordinates, obtain the k orthogonal directions aP1, . . . , aPk
corresponding to the eigenvalues λ1 ≥ · · · ≥ λk ≥ 0 of the matrix

KP = Σ̂−1/2P(AµP)′(A(W ′W )−1A′)−1(AµP)Σ̂−1/2P. (3.14)
The robust discriminant coordinates are the columns of ZP = XCP, where CP = [cP1, . . . , cPk] and
cPi = Σ̂−1/2P aPi. In particular, the vector cP1 gives the direction of maximal separation for the generic
robust procedure P.
The matrix KP , (3.14), is the kernel for the robust procedure and the associated visualization procedure
is based on the robust discriminant coordinates, the columns of ZP .
3.2 Efficiency
The efficiency of the procedure depends on how well the test statistic TP detects small differences among the
means. A way to measure this is to determine the asymptotic power of the test TP under local alternatives,
Hn : Aµn = n−1/2Aµ0, (3.15)
where µ0 is a g × k matrix not equal to the zero matrix and A is the (g − 1) × g contrast matrix given in
the hypotheses (2.2). Assume a sequence of linear models of the form (2.1) indexed by n. Let W n denote
the incidence matrix and assume that
limn→∞ n−1W ′nW n = ΣW , (3.16)
where ΣW is positive definite. The asymptotic power of the test statistic TP,n can be determined from its
asymptotic distribution under the sequence of alternatives. Assuming certain conditions, we can generally
show that
TP,n has an asymptotic noncentral χ2(θP) distribution, (3.17)

with g − 1 degrees of freedom and noncentrality parameter

θP = tr (Aµ0)′[AΣW A′]−1(Aµ0)Σ−1P. (3.18)
The conditions depend on the specific robust estimator chosen, but often a uniform linearity (quadraticity)
result is required. The robust procedure discussed in Section 4 satisfies such a condition.
Provided the variance-covariance matrix of the random vector ei is finite, under the sequence of local
alternatives defined in expression (3.15), the LS test statistic TLS,n, (2.4), has an asymptotic noncentral χ2
distribution with g − 1 degrees of freedom and noncentrality parameter

θLS = tr (Aµ0)′[AΣW A′]−1(Aµ0)Σ−1. (3.19)
It follows that the asymptotic relative efficiency between the robust procedure and LS is the ratio of non-
centrality parameters
ARE(P, LS) = θP/θLS. (3.20)
In a special case, this result reduces to the univariate formula. Suppose the components of the error random
vector ei are iid with variance σ2. Then Σ = σ2Ik and ΣP = τ2P Ik, for some scale parameter τP which
depends on the robust procedure used. Thus in the iid case, the ARE (3.20) simplifies to the univariate
formula ARE(P, LS) = σ2/τ2P. For the general multivariate case, expression (3.20) does not simplify;
however, by comparing the noncentrality parameter θP, (3.18), with the asymptotic distribution of µP,
(3.12), one sees that the efficiency properties of the separation phase of the robust procedure are essentially
the same as the efficiency properties of the robust estimates. Because the fitting is based on the incidence
matrix, there are no outliers in factor space; hence, we recommend highly efficient robust estimates.
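As a concrete instance of the univariate formula σ2/τ2P, consider the Wilcoxon procedure, for which τW = (√12 ∫ f2)−1; at the normal distribution this yields the classical value 3/π ≈ 0.955 (see Hettmansperger and McKean, 1998). A quick numerical check (our illustration, not from the paper):

```python
import math

# Univariate ARE sigma^2 / tau_W^2 for the Wilcoxon procedure at N(0, sigma^2),
# with tau_W = 1 / (sqrt(12) * integral of f^2).  The classical value is 3/pi.
sigma = 1.0
int_f2 = 1.0 / (2.0 * math.sqrt(math.pi) * sigma)  # integral of f^2 for a normal density
tau_w = 1.0 / (math.sqrt(12.0) * int_f2)
are = sigma ** 2 / tau_w ** 2                      # equals 3/pi up to floating point
```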
3.3 Allocation
Let z = C′P x, where CP is the matrix of robust discriminant directions based on procedure P. The robust
Mahalanobis distance is:

DP,i(z) = (z − µP,zi)′Σ̂−1P,z(z − µP,zi), (3.21)

where the estimates of location and scatter are given by

µP,zi = C′P µP,xi and Σ̂P,z = C′P Σ̂P CP , (3.22)

respectively. Then the nearest center, robust linear discriminant rule is

Assign x to Gi if DP,i(z) ≤ DP,j(z), for all j = 1, 2, . . . , g. (3.23)
As discussed in Section 2.3, under the assumption that the prior probabilities of group membership are
all equal, the equivalence between the traditional rules (2.8) and (2.9) is purely algebraic. Hence, the
robust rule

Assign x to Gi if Li(z) ≥ Lj(z), for all j = 1, 2, . . . , g, (3.24)

where

Li(z) = − ln g + µ′P,ziΣ̂−1P,z(z − µP,zi/2), i = 1, 2, . . . , g,

is equivalent to rule (3.23). For the remainder of this article, we will use the nearest center rule (3.23).
The proposed robust discriminant rule (3.23) can be used with most robust estimators. All that is required
is a √n-consistent estimate of location and a consistent estimate of its asymptotic variance-covariance matrix.
The efficiency of the procedure is the same as the efficiency of the robust estimator µP. While the rule is
based on asymptotic theory, the Monte Carlo study presented in Section 6 verifies, over the situations covered,
the robustness and validity of the procedure based on the estimator discussed in Section 4. This empirical
study involved estimates based on group sample sizes of 25.
4 Robust Affine Equivariant Estimate
Hettmansperger and Randles (2002) proposed an M-estimate for multivariate location which is affine equiv-
ariant and robust with positive breakdown. The Hettmansperger and Randles estimator (HR) is the L1
(spatial) median combined with the M-estimate of scatter proposed by Tyler (1987). The estimate proposed
by HR minimizes the dispersion function

∑i=1,...,n ‖AT (xi − µ)‖, (4.25)

where AT is a k × k upper triangular, positive definite matrix (with a one in the upper left corner) chosen
to satisfy

n−1 ∑i=1,...,n [AT (xi − µ)(xi − µ)′A′T / ‖AT (xi − µ)‖2] = k−1I, (4.26)

where I is the k × k identity matrix and ‖ · ‖ denotes the Euclidean norm. Let µHR be the value that
minimizes (4.25).
Under model (2.1),

µHR is asymptotically Ng,k(µ, (W ′W )−1, B−1A∗B−1), (4.27)

where

A∗ = E[AT (X − µ)(X − µ)′A′T / ‖AT (X − µ)‖2]

and

B = E[(AT / ‖AT (X − µ)‖)(I − (X − µ)(X − µ)′/‖AT (X − µ)‖2)]. (4.28)
Further, the sample version ÂT is a consistent estimator of AT; see Hettmansperger and Randles (2002) for
discussion. Let Σ̂HR = B̂−1Â∗B̂−1, where B̂ and Â∗ are the respective matrices B and A∗ with AT replaced
by ÂT and the expectations replaced by sample averages. Then Σ̂HR is a consistent estimate of
ΣHR = B−1A∗B−1.
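To make the estimating equations concrete, the following sketch alternates a Weiszfeld step for the transformed spatial median with Tyler’s fixed-point step for the scatter. This is only our illustration under simplifying assumptions (we iterate directly on V = (A′TAT)−1 and fix the scale by setting V11 = 1); it is not the algorithm of Hettmansperger and Randles (2002).

```python
import numpy as np

def hr_estimate(X, n_iter=200, tol=1e-8):
    """Sketch of the HR idea: Weiszfeld step for the transformed spatial
    median, alternated with Tyler's fixed-point step for the scatter.
    Iterates on V = (A_T' A_T)^{-1}; scale fixed by V[0, 0] = 1."""
    n, k = X.shape
    mu = X.mean(axis=0)                       # starting values
    V = np.cov(X, rowvar=False)
    for _ in range(n_iter):
        d = X - mu
        V_inv = np.linalg.inv(V)
        # r_i = ||A_T (x_i - mu)||, using ||A_T d||^2 = d' V^{-1} d
        r = np.maximum(np.sqrt(np.einsum('ij,jk,ik->i', d, V_inv, d)), 1e-12)
        w = 1.0 / r
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()        # Weiszfeld step
        V_new = k * (d / r[:, None]).T @ (d / r[:, None]) / n  # Tyler step
        V_new /= V_new[0, 0]                  # fix the scale of the scatter
        converged = np.linalg.norm(mu_new - mu) < tol
        mu, V = mu_new, V_new
        if converged:
            break
    return mu, V
```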
4.1 Separation
Using the HR estimators µHR and Σ̂HR, the Lawley-Hotelling test statistic for the hypotheses (2.3), under
model (2.1), is the statistic THR defined in the following theorem.

Theorem 4.1. Assume the regularity conditions in Hettmansperger and Randles (2002) hold. Let

THR = tr (AµHRc)′(A(W ′W )−1A′)−1(AµHRc)(c′Σ̂HRc)−1.

Then, under the null hypothesis, THR is asymptotically χ2 with g − 1 degrees of freedom.
Proof. From the asymptotic distribution of µHR given in equation (4.27), we have that

AµHRc is asymptotically N(Aµc, (c′ΣHRc)A(W ′W )−1A′).

Further, under the null hypothesis, Aµc = 0. From these two results, the theorem follows immediately.
Based on this theorem, the kernel of the HR discriminant coordinate procedure is

KHR = Σ̂−1/2HR(AµHR)′(A(W ′W )−1A′)−1(AµHR)Σ̂−1/2HR. (4.29)

Let aHR1, . . . , aHRk denote the eigenvectors corresponding to the eigenvalues λ1 ≥ · · · ≥ λk ≥ 0 of the
matrix KHR. Then the HR robust discriminant coordinates are the columns of ZHR = XCHR, where
CHR = [cHR1, . . . , cHRk] and cHRi = Σ̂−1/2HR aHRi. The associated HR visualization procedure is based on
these discriminant coordinates.
4.2 Efficiency
For efficiency results, consider the set up of Section 3.2 with the sequence of local alternatives (3.15). Based
on the linearization result given in Hettmansperger and Randles (2002), it follows that under this sequence
of local alternatives
THR,n has an asymptotic noncentral χ2(θHR) distribution, (4.30)

with g − 1 degrees of freedom and noncentrality parameter

θHR = tr (Aµ0)′[AΣW A′]−1(Aµ0)Σ−1HR. (4.31)
The efficiency of the separation procedure is the same as the efficiency of the HR estimator which is discussed
in Section 3 of Hettmansperger and Randles (2002). In particular, it appears to be highly efficient for heavy-
tailed error distributions relative to the LS procedure.
4.3 Allocation
The nearest center rule is

Assign x to Gi if DHR,i(z) ≤ DHR,j(z), z = C′HRx, for all j = 1, 2, . . . , g, (4.32)

where the robust HR Mahalanobis distance is:

DHR,i(z) = (z − µHR,zi)′Σ̂−1HR,z(z − µHR,zi), (4.33)
and the estimates of location and scatter are given by

µHR,zi = C′HRµHR,xi and Σ̂HR,z = C′HRΣ̂HRCHR. (4.34)
Because of affine equivariance of the HR estimator, these same estimates would be obtained from the trans-
formed data, Z.
4.4 Equivalence to the Traditional Rule
As with most robust procedures, interest centers on how efficient the robust estimate is relative to the
traditional estimate under the multivariate normal distribution. Suppose the rows of e in model (2.1) have
a symmetric elliptical error distribution with density proportional to ck h(t′t). As discussed in
Hettmansperger and Randles (2002), r2 = ‖e‖2 has density

fr2(y) = (ck πk/2/Γ(k/2)) yk/2−1 h(y).

Then the asymptotic relative efficiency of µHR relative to µLS is

ARE(µHR, µLS) = k−2(k − 1)2 E(r2)[E(r−1)]2.
At the multivariate normal, the asymptotic relative efficiency of µHR to the least squares µLS is

ARE(µHR, µLS) = [(√2 k Γ(k/2) / {(k − 1)Γ((k − 1)/2)})2 (1/k)]−1. (4.35)
If Σ̂HR is divided by the constant in equation (4.35), then the resulting estimate is consistent for Σ and rule
(4.32) is asymptotically equivalent to the traditional nearest center rule (2.9).
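A quick numerical check of (4.35) as reconstructed here (our code; the values agree with the standard spatial-median efficiencies, e.g. π/4 ≈ 0.785 at k = 2, with the efficiency increasing in the dimension k):

```python
import math

def are_hr_ls(k):
    """ARE(mu_HR, mu_LS) of (4.35) at the k-variate normal, k >= 2."""
    num = math.sqrt(2.0) * k * math.gamma(k / 2.0)
    den = (k - 1) * math.gamma((k - 1) / 2.0)
    return 1.0 / ((num / den) ** 2 / k)   # inverse of the bracketed quantity
```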
5 Examples
To investigate the robustness of the procedures, we used Fisher’s (1936) classic Iris data set and four
contaminated versions of it. Recall that the Iris data set consists of three species of Iris with 50
observations on each species. The four variables are sepal length, sepal width, petal length and petal width.
Group one is denoted with red circles, group two with green triangles pointing up and group three with
blue triangles pointing down. We contaminated the Iris data set with a single outlier in group one, with 5
outliers in group one, with 5 clustered outliers in group one, and with 3 outliers in group one and two outliers
in group two. For each data set, the visualizations were constructed using the first two discriminant
coordinates of the kernels KLS and KHR. We also calculated the probabilities of misclassification (PMC)
using leave-one-out cross-validation. The allocations were based on the nearest center rule using the first
two discriminant coordinates.
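The leave-one-out estimation of the PMC can be sketched as follows (our illustration; the fit_rule interface and the simple nearest-mean classifier are hypothetical stand-ins for the LS and HR allocation rules):

```python
import numpy as np

def loo_pmc(X, groups, fit_rule):
    """Leave-one-out estimate of the probability of misclassification.
    fit_rule(X_train, g_train) must return a classifier f(x) -> label."""
    groups = np.asarray(groups)
    n = len(groups)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i                 # hold out observation i
        rule = fit_rule(X[keep], groups[keep])
        errors += int(rule(X[i]) != groups[i])   # count misclassifications
    return errors / n

def nearest_mean_rule(X_train, g_train):
    """Toy allocation rule: nearest group mean in Euclidean distance."""
    labels = np.unique(g_train)
    centers = np.array([X_train[g_train == l].mean(axis=0) for l in labels])
    return lambda x: labels[np.argmin(((x - centers) ** 2).sum(axis=1))]
```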
5.1 Visualization
Figure 1 displays the plots of the first two traditional discriminant coordinates and the HR robust discriminant
coordinates for the original Iris data set. In each of the plots, the first coordinate shows a difference in
location among the three groups.

[Figure 1: Iris Data (original data). Two panels, Least Squares and HR, plotting Coordinate 1 against Coordinate 2.]
Figure 2 displays the plots of the first two traditional discriminant coordinates and the HR robust
discriminant coordinates from the Iris data set with one outlier. From these plots, the traditional discriminant
coordinates fail to separate the groups or identify the outlier (first plot, Figure 2) whereas the robust
discriminant coordinates identify the outlier (second plot, Figure 2) and separate the groups (third plot,
Figure 2).
[Figure 2: Iris Data (1 outlier). Three panels, Least Squares, HR, and HR Zoomed In, plotting Coordinate 1 against Coordinate 2.]
Figure 3 displays the plots of the first two traditional discriminant coordinates and the HR robust
discriminant coordinates from the Iris data set with 5 outliers. From these plots, the traditional discriminant
coordinates separate the groups, but only 3 of the 5 outliers are clearly identified in the plot (first plot,
Figure 3). The robust discriminant coordinates identify the 5 outliers and separate the groups (second and
third plots, Figure 3).

[Figure 3: Iris Data (5 outliers). Three panels, Least Squares, HR, and HR Zoomed In, plotting Coordinate 1 against Coordinate 2.]
Figure 4 displays the plots of the first two traditional discriminant coordinates and the HR robust
discriminant coordinates from the Iris data set with 5 outliers in a cluster. From these plots, the traditional
discriminant coordinates identify the 5 outliers but do not separate the groups (first plot, Figure 4). The
robust discriminant coordinates identify the 5 outliers and separate the groups (second and third plots,
Figure 4).
Figure 5 displays the plots of the first two traditional discriminant coordinates and the HR robust
discriminant coordinates from the Iris data set with 3 outliers in group one and 2 outliers in group two.
From these plots, the traditional discriminant coordinates do not separate the groups or identify the outliers
(first plot, Figure 5), whereas the robust discriminant coordinates identify the 5 outliers and separate the
groups (second and third plots, Figure 5).

[Figure 4: Iris Data (5 outliers in a cluster). Three panels, Least Squares, HR, and HR Zoomed In, plotting Coordinate 1 against Coordinate 2.]

[Figure 5: Iris Data (3 outliers in group 1 and 2 outliers in group 2). Three panels, Least Squares, HR, and HR Zoomed In, plotting Coordinate 1 against Coordinate 2.]
Thus, in terms of separation, the HR discriminant procedure agrees with the traditional LS procedure on
the original data. For the contaminated data, in all cases the HR procedure separates the groups and
identifies all the outliers. In contrast, on the contaminated data the LS procedure either fails to separate
the groups or fails to identify all the outliers. In terms of separation, the HR procedure is robust.
5.2 Allocation
Table 1 displays the estimated probability of misclassification (PMC) for each variation of the Iris data set.
From the results presented in the table, when the Iris data set is not contaminated the PMC is the same
for the LS and HR procedures, but when contamination is added to the data set, the HR procedure has a
much lower PMC. In terms of allocation, the HR procedure is robust; the outliers severely hampered the
allocation ability of the LS procedure.

Table 1: Estimated PMCs for each variation of the Iris Data Set

Data Set                               Least Squares    HR
Original                               0.0267           0.0267
1 Outlier                              0.4733           0.0333
5 Outliers                             0.38             0.06
5 Clustered Outliers                   0.52             0.06
3 Outliers Grp 1 & 2 Outliers Grp 2    0.55             0.06
6 Simulation Results
In this section, we present the results of a Monte Carlo study which investigates the behavior of the nearest
center rules of three procedures in terms of their TPMs over various error distributions. For our procedure
we chose the highly efficient robust procedure described in Section 4 based on the HR estimator. For
comparison, we included the traditional procedure (LS) as described in Section 2. In order to investigate
our efficiency claims, as our third procedure we selected the high breakdown procedure proposed by Hawkins
and McLachlan (1997), which uses the minimum covariance determinant (MCD) as an estimator of scatter.
This procedure has high breakdown but low efficiency. Their procedure accommodates a certain percentage
of outliers, and the estimates are based on the “inliers”; the outliers are the set of points which, when removed,
minimize the within-group covariance determinant. For the simulation, we used 50% coverage, since this
choice has the highest breakdown and the lowest efficiency.
For the simulation results presented in this section, we consider situations where there are two groups and
four dimensions, i.e., g = 2 and k = 4. For the mean and variance matrices, we chose the sample mean and
variance matrices of the beetle data on page 295 of Seber (1984). One thousand data sets were randomly
generated from a variety of error distributions. The error distributions used were multivariate normal
(MVN), the contaminated multivariate normal (CN), and the multivariate t. For each error distribution,
fifty observations were generated for both the training and test data sets with twenty-five observations
randomly assigned to each group. The training data set was used to develop the linear classification rule and
then this rule was used to classify the test data set and the probability of misclassification was recorded.
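One replication of this design can be sketched as follows (our illustration with stand-in parameters, not the beetle-data values; the allocation rule here is the plain nearest-center rule on the raw data):

```python
import numpy as np

rng = np.random.default_rng(42)
mu = [np.zeros(4), np.full(4, 1.5)]   # stand-in group means (not the beetle data)
Sigma = np.eye(4)                     # stand-in common covariance matrix

def one_rep():
    """One replication: training and test sets of 25 observations per group."""
    def sample():
        X = np.vstack([rng.multivariate_normal(m, Sigma, 25) for m in mu])
        return X, np.repeat([0, 1], 25)
    X_tr, g_tr = sample()
    X_te, g_te = sample()
    centers = np.array([X_tr[g_tr == j].mean(axis=0) for j in (0, 1)])
    # nearest-center rule on the raw data (here Mahalanobis = Euclidean)
    pred = np.argmin(((X_te[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    return (pred != g_te).mean()      # test misclassification proportion

pmc = np.mean([one_rep() for _ in range(200)])   # empirical PMC over replications
```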
The empirical TPM was used as a benchmark for the performance of each procedure. To calculate the
empirical TPM, we used the rule

Assign xi to group 1 if f(xi|G1)/f(xi|G2) > 1, i = 1, . . . , n,

to classify the data. Then for our empirical TPM, we used the proportion misclassified. At the multivariate
normal, the true TPM is

TPM = Φ(−∆/2),

where

∆2 = (µ1 − µ2)′Σ−1(µ1 − µ2)

is the squared Mahalanobis distance between the group means. Table 2 displays the empirical TPM for the
distributions used in the simulation.
For the multivariate normal, the true TPM is presented in parentheses.
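For illustration (our numbers, not the beetle-data parameters used in the study), the true TPM at the normal can be computed directly from ∆:

```python
import math
import numpy as np

# TPM = Phi(-Delta/2) at the multivariate normal with equal priors, where
# Delta^2 = (mu1 - mu2)' Sigma^{-1} (mu1 - mu2).  Illustrative parameters only.
mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

diff = mu1 - mu2
delta = math.sqrt(diff @ np.linalg.solve(Sigma, diff))        # Mahalanobis distance
phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))    # standard normal cdf
tpm = phi(-delta / 2.0)
```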
Table 3 displays 84% confidence intervals for the probabilities of misclassification (PMC) of the three
procedures. The 84% confidence intervals were chosen because, for a two-sample analysis based on one-sample
confidence intervals, 84% one-sample intervals yield roughly a 95% two-sample confidence
interval; see Section 1.12 of Hettmansperger and McKean (1998). From the results displayed in the table,
the traditional procedure has the smallest PMC at the multivariate normal. In all the other cases, the HR
procedure has lower PMCs than the LS procedure. In fact, in seven of these situations, the confidence
intervals do not overlap. Thus the HR procedure is more robust than the LS procedure for moderate to
heavy contamination. For comparisons between the two robust procedures, the HR procedure always has
a lower PMC than the HM procedure. For the situations considered, the confidence intervals of the two
procedures never overlap. The HR procedure is more efficient than the HM procedure over all situations in
Table 2: Empirical TPM
Distribution Empirical TPM
MVN (0.1142) 0.1251
CN ε = 0.10, σ2 = 9 0.1560
CN ε = 0.20, σ2 = 9 0.1791
CN ε = 0.10, σ2 = 25 0.1487
CN ε = 0.20, σ2 = 25 0.1904
CN ε = 0.10, σ2 = 100 0.1665
CN ε = 0.20, σ2 = 100 0.1996
t df = 1 0.2278
t df = 2 0.1854
t df = 3 0.1546
t df = 4 0.1545
t df = 5 0.1530
this study, including the elliptical Cauchy distribution. In comparing the HM and LS procedures, the HM
procedure has lower PMCs than the LS procedure for heavy-tailed distributions.
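The 84% interval choice discussed above can be checked numerically: non-overlap of two independent 84% one-sample intervals corresponds to a gap of about 2 z0.92 ≈ 2.81 standard errors, close to the √2 z0.975 ≈ 2.77 required for a 95% two-sample comparison. A quick sketch:

```python
from statistics import NormalDist

z84 = NormalDist().inv_cdf(1 - 0.16 / 2)   # one-sample 84% critical value, about 1.41
z95 = NormalDist().inv_cdf(1 - 0.05 / 2)   # two-sample 95% critical value, about 1.96

# Non-overlap of two independent 84% intervals requires a gap of 2*z84
# standard errors; a 95% test on the difference requires sqrt(2)*z95.
gap_84 = 2 * z84            # about 2.81
gap_95 = 2 ** 0.5 * z95     # about 2.77
```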
Next, consider the comparison between the empirical TPMs of Table 2 and the simulated PMCs of Table
3. The HR values are generally much closer to the TPM values than the values of the other procedures.
7 Conclusion
In this paper, we have proposed a discriminant analysis based on efficient robust discriminant coordinates.
Like the traditional analysis, this robust analysis is a two-stage process: separation and allocation. Maximizing a robust Lawley-Hotelling test based on robust estimates of group centers achieves the separation and produces the discriminant coordinates. The efficiency of the procedure follows from its power to detect small differences in group centers; further, it inherits the efficiency of the robust estimates used in the Lawley-Hotelling test. The robust discriminant coordinates can be used to visualize
the data as we demonstrated with examples. This visualization is much less sensitive to outliers than the
visualization obtained from traditional discriminant coordinates. The robust discriminant coordinates can
be further used to form nearest center rules for the allocation of new data to the groups.
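The nearest-center allocation step can be sketched as follows, assuming the discriminant-coordinate scores and the group centers have already been computed by the fitting procedure (all names below are ours):

```python
import numpy as np

def nearest_center(scores, centers):
    """Assign each row of `scores` (observations in discriminant coordinates)
    to the group whose center (row of `centers`) is closest in Euclidean distance."""
    # Pairwise squared distances between observations and group centers.
    d2 = ((scores[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

centers = np.array([[0.0, 0.0], [3.0, 0.0]])              # hypothetical group centers
scores = np.array([[0.2, 0.1], [2.8, -0.4], [1.4, 0.0]])
labels = nearest_center(scores, centers)                  # assigns groups 0, 1, 0
```

With the robust fit, `centers` would be the robust group estimates, so a few outlying observations do not drag the centers and corrupt the allocation.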
Table 3: Simulated PMCs for the Procedures
Distribution LS HR HM
MVN (0.1365,0.1410) (0.1426,0.1473) (0.2323,0.2411)
CN ε = 0.10, σ2 = 9 (0.1684,0.1737) (0.1643,0.1692) (0.2436,0.2520)
CN ε = 0.20, σ2 = 9 (0.1970,0.2027) (0.1875,0.1926) (0.2477,0.2552)
CN ε = 0.10, σ2 = 25 (0.1919,0.1978) (0.1697,0.1747) (0.2449,0.2530)
CN ε = 0.20, σ2 = 25 (0.2310,0.2377) (0.1986,0.2038) (0.2538,0.2612)
CN ε = 0.10, σ2 = 100 (0.2356,0.2434) (0.1738,0.1788) (0.2484,0.2564)
CN ε = 0.20, σ2 = 100 (0.2953,0.3044) (0.2074,0.2126) (0.2605,0.2679)
t df = 1 (0.3370,0.3466) (0.2411,0.2467) (0.2611,0.2676)
t df = 2 (0.2282,0.2345) (0.2008,0.2059) (0.2373,0.2440)
t df = 3 (0.1943,0.1997) (0.1853,0.1902) (0.2315,0.2387)
t df = 4 (0.1784,0.1835) (0.1758,0.1808) (0.2288,0.2359)
t df = 5 (0.1670,0.1752) (0.1665,0.1717) (0.2318,0.2399)
Our procedure is generic in the sense that any robust fitting procedure can be used, provided its estimates are root-n consistent, satisfy an asymptotic linearity result, and admit a consistent estimate of their variance-covariance matrix. The design matrix for the fitting is an incidence matrix, so highly efficient robust estimators are recommended, which makes the associated discriminant procedure highly efficient. In this paper
we used the affine equivariant estimator proposed by Hettmansperger and Randles (2002) but any highly
efficient robust estimator could be used.
The examples that we presented showed the robustness of the procedures on real data. On the original
data, the HR robust procedure behaved similarly to the traditional LS procedure in terms of visualization
and classification (PMC). However, when outliers were introduced in the Iris data, the results were quite
different. The behavior of the robust procedure was quite similar to its behavior on the original data but
the traditional procedure’s PMC rate changed from 3% to 48% (on average) and its visualization was quite
poor.
In our Monte Carlo study, we investigated the behavior of the nearest center rules in terms of misclas-
sifications for three procedures. The data were split into two data sets: “training” and “test”. Empirical
PMCs for the procedures were obtained for families of multivariate t- and contaminated multivariate normal
distributions. We selected the highly efficient procedure (HR) described in Section 4. As competitors we
selected the LS procedure and a high-breakdown, but less efficient, procedure proposed by Hawkins and
McLachlan (1997). The HR procedure was comparable to the LS procedure when the errors had a multivari-
ate normal distribution but it generally performed much better than the LS procedure for the heavier tailed
error distributions. Further, over all situations simulated, the HR procedure had lower empirical PMCs than
the high breakdown HM procedure.
In summary, the discriminant procedures that we have proposed form an attractive robust alternative
to the traditional procedure. The procedures are highly efficient relative to the traditional procedure and
they are quick to compute. Further, they produce robust discriminant coordinates which allow the user to
visually explore the data and assess the differences among groups.
Acknowledgment
The authors thank the associate editor and a referee whose comments led to an improvement of this paper.
References
Davis, J. B. and McKean, J. W. (1993), Rank based methods for multivariate linear models, Journal of the
American Statistical Association, 88, 245-251.
Fisher, R. A. (1936), The use of multiple measurements in taxonomic problems, Annals of Eugenics, 7, 179-188.
Flury, B. and Riedwyl, H. (1988), Multivariate Statistics: A Practical Approach, London: Chapman and
Hall.
Gnanadesikan, R. (1977), Methods for Statistical Analysis of Multivariate Observations, New York: John
Wiley & Sons.
Hawkins, D. M. and McLachlan, G. J. (1997), High-breakdown linear discriminant analysis, Journal of the American Statistical Association, 92, 136-143.
Hettmansperger, T. P. and McKean, J. W. (1998), Robust Nonparametric Statistical Methods, London:
Arnold.
Hettmansperger, T.P. and Randles, R. H. (2002), A practical affine equivariant multivariate median, Biometrika,
89, 851-860.
Jaeckel, L. A. (1972), Estimating regression coefficients by minimizing the dispersion of the residuals, Annals
of Mathematical Statistics, 43, 1449-1458.
Johnson, R. A. and Wichern, D. W. (1998), Applied Multivariate Statistical Analysis, 4th Ed., Upper Saddle
River, New Jersey: Prentice Hall.
Randles, R. H., Broffitt, J. D., Ramberg, J. S., and Hogg, R. V. (1978), Generalized linear and quadratic
discriminant functions using robust estimates, Journal of the American Statistical Association, 73, 564-
568.
Reaven, G. M. and Miller, R. G. (1979), An attempt to define the nature of chemical diabetes using a multidimensional analysis, Diabetologia, 16, 17-24.
Seber, G. A. F. (1984), Multivariate Observations, New York: John Wiley & Sons.
Tyler, D. E. (1987), A distribution-free M-estimator of scatter, Annals of Statistics, 15, 234-251.