
Discriminant Procedures Based on Efficient Robust Discriminant Coordinates

Kimberly Crimin

Wyeth Research

Joseph W. McKean

Western Michigan University

Simon J. Sheather

Texas A & M University

Abstract

For multivariate data collected over groups, discriminant analysis is a two-stage procedure: separation and allocation. For the traditional least squares procedure, separation of training data into groups is accomplished by the maximization of the Lawley-Hotelling test for differences between group means. This produces a set of discriminant coordinates which are used to visualize the data. Using the nearest center rule, the discriminant representation can be used for allocation of data of unknown group membership. In this paper, we propose an approach to discriminant analysis based on efficient robust discriminant coordinates. These coordinates are obtained by the maximization of a Lawley-Hotelling test based on robust estimates. The design matrix used in the fitting is the usual one-way incidence matrix of zeros and ones; hence, our procedure uses highly efficient robust estimators to do the fitting. This produces efficient robust discriminant coordinates which allow the user to visually assess the differences among groups. Further, the allocation is based on the robust discriminant representation of the data using the nearest robust center rule. We discuss our procedure in terms of an affine-equivariant estimating procedure. The robustness of our procedure is verified in several examples. In a Monte Carlo study on probabilities of misclassification of the procedures over a variety of error distributions, the robust discriminant analysis performs practically as well as the traditional procedure for good data and is much more efficient than the traditional procedure in the presence of outliers and heavy-tailed error distributions. Further, our procedure is much more efficient than a high breakdown procedure.

KEY WORDS: Affine-equivariant estimators; Least squares; Linear discriminant rule; Nearest center rule; Nonparametrics; Rank-based analysis; Wilcoxon analysis; Visualization.


1 Introduction

Consider a multivariate data set where items belong to one of g groups. For such data, discriminant analysis

can be thought of as a two stage process: separation and allocation (see for instance Johnson and Wichern

(1998) or Seber (1984)). In the separation stage, the goal is to find a representation of the observations

that clearly separates the groups. This stage is exploratory in nature and statistical procedures at this stage

are inherently graphical. The separation stage results in a kernel and the associated graphical procedure

(visualization) is based on the spectral decomposition of this kernel. In the allocation stage, the goal is to

assign an unclassified object to one of the known groups using the rule that optimally separates the training

data.

In Section 2 we review a discriminant analysis procedure based on discriminant coordinates and traditional

least squares estimates. Discriminant coordinates are obtained from maximizing the Lawley-Hotelling test for

differences between group means and can be used to graphically display the data. In the allocation stage, we

use the discriminant representation of the data and the simple “nearest center” rule (in terms of the Mahalanobis distance) to assign an unspecified object to one of the known groups. Assuming a multivariate

normal distribution for each group (homogeneous covariance structure) and equal prior probabilities, the

simple rule is equivalent to the traditional rule with the usual estimates substituted (“plugged-in”) for the

parameters.

In Section 3, we propose a discriminant analysis procedure based on efficient robust discriminant coordinates. As with the traditional analysis, the robust discriminant analysis is a two stage process (separation

and allocation) based on the robust discriminant coordinate representation. This representation is obtained

by maximizing a robust Lawley-Hotelling test for differences between group centers. Furthermore, the efficiency of our procedure is based on how well the procedure separates small differences among these centers

(local alternatives). We show that this efficiency is the same as the efficiency of the robust estimators.

Because the fitting is based on the usual one-way incidence design matrix of 0s and 1s, highly efficient robust

estimates can be used, which results in a highly efficient discrimination procedure. These robust discriminant

coordinates allow the user to visually (graphically) assess the differences among groups and to robustly

explore the data. Most robust estimation schemes can be used in our procedure. All that is required is a \sqrt{n}-consistent equivariant estimator of location with an asymptotic linearity result and a consistent estimate of its asymptotic variance-covariance matrix. The allocation rule is the simple “nearest center” rule in

terms of a Mahalanobis distance using the variance-covariance estimate found in the robust version of the

Lawley-Hotelling test statistic.


In Section 4 we use the affine equivariant robust estimators of multivariate location and scatter proposed

by Hettmansperger and Randles (2002) in the generic procedure discussed in the previous section. Their

proposed estimator combines the L1, or spatial, median with Tyler's (1987) M-estimator of scatter. The resulting estimates have a bounded influence function, a positive breakdown, and are highly efficient for heavy-tailed error structures. Furthermore, if multivariate normal errors are assumed, the “nearest center” allocation rule can be fine-tuned to be a consistent estimate of the optimal rule, similar to the traditional

plug-in rule. In Section 5 the robustness of our procedure is illustrated with examples.

In Section 6 we present the results of a simulation study for the following three methods: the proposed

method described in the last paragraph based on Hettmansperger and Randles’ (2002) estimates (HR);

the traditional least squares procedure (LS); and a high breakdown but low efficiency method proposed by

Hawkins and McLachlan (1997) (HM). Besides the multivariate normal distribution, we generated data from

the elliptical contaminated normal and t distributions. The theoretical robustness and efficiency properties

of the HR procedure discussed in Section 4 are verified for the situations investigated. Other than for the normally distributed data, the HR procedure was more efficient than the LS procedure in terms of empirical misclassification probabilities. Further, it was much more efficient than the high breakdown but low efficiency procedure, even at the elliptical Cauchy distribution.

There are other robust discriminant analysis procedures in the literature, some of which, such as the

Hawkins and McLachlan (1997) procedure, substitute robust estimates for traditional estimates in the linear

discriminant rule. An example, which illustrates the difference of such procedures from ours, was proposed

by Randles et al. (1978). Their procedure replaces the sample means by Huber type location estimates and

the sample variance-covariance matrix with a weighted estimate. Our procedure, though, maximizes a robust

Lawley-Hotelling test for differences between group centers. If Huber estimates are used in our procedure,

then the location estimates are similar to those of Randles et al., but the estimates of scatter differ. Our

estimates use the standardization which is required by the associated Lawley-Hotelling test statistic. Hence

in this case, the efficiency of our procedure is the same as the efficiency of the Huber estimator. Thus our

procedure is highly efficient. The Randles et al. weighted estimate of scatter does not estimate the same matrix, and its efficiency properties will differ. Generally the weighting will result in lower efficiency (see Chapter 5

of Hettmansperger and McKean, 1998).


2 Traditional Discriminant Analysis

2.1 Notation

Suppose there are g distinct groups. Let x_{ij} represent the k × 1 random vector of the measured characteristics made on the jth object in the ith group, j = 1, \ldots, n_i, with n = \sum_{i=1}^{g} n_i. The n × k data matrix X contains the n row vectors x_{ij}' of multivariate observations. Let \mu_i, i = 1, \ldots, g, denote the mean for the ith group and let \mu denote the g × k matrix whose ith row is \mu_i'. Let \pi_i, i = 1, \ldots, g, denote the prior probability that observation x belongs to the ith group. In this scenario, the model of interest is the one-way multivariate linear model

X = W\mu + e,    (2.1)

where W is the n × g incidence matrix and e is an n × k matrix of random errors with E(e_{ij}) = 0 and Var(e_i) = \Sigma, where e_i' is the ith row of the matrix e. Assume that e_i has density function f(x) and distribution function F(x). Denote the jth marginal cdf and pdf of e_i by F_j(x_j) and f_j(x_j). Let \Omega denote the column space of W and let P_\Omega denote the projection matrix onto the subspace \Omega.

We next briefly describe the traditional analysis; see, for instance, Chapters 5 and 6 of Seber (1984) for

more details.

2.2 Separation

Discriminant coordinates were introduced as a dimension reduction technique useful for “examining clustering

effects in the data”, e.g., Gnanadesikan (1977) or Seber (1984). The goal in discriminant coordinates is to

find linear combinations of the data that “best” separate the groups of observations.

The amount of separation in the groups is proportional to the size of the test statistic for testing

H_0 : A\mu = 0 versus H_A : A\mu \neq 0,    (2.2)

where A is the usual contrast matrix for testing equality of the g group means. Since we are interested in the “maximum” amount of separation in the groups, this is equivalent to finding the vector c that maximizes the Lawley-Hotelling type of test statistic for

H_0 : A\mu c = 0 versus H_A : A\mu c \neq 0.    (2.3)

Let \hat{\mu}_{LS} be the argument that minimizes \mathrm{tr}\,(X - W\mu)'(X - W\mu); then \hat{\mu}_{LS} is the traditional least squares estimate of \mu. The associated Lawley-Hotelling test statistic is

T_{LS} = \mathrm{tr}\,(A\hat{\mu}_{LS}c)'(A(W'W)^{-1}A')^{-1}(A\hat{\mu}_{LS}c)(c'\hat{\Sigma}c)^{-1},    (2.4)

where \hat{\Sigma} = (n - g)^{-1}X'(I - P_\Omega)X is the usual estimate of the variance-covariance matrix. Under the null hypothesis (2.3), T_{LS} has an approximate \chi^2_{g-1} distribution.

By the generalized Cauchy-Schwarz inequality, the maximum value of T_{LS}, (2.4), is \lambda_1, the maximum eigenvalue of

\hat{\Sigma}^{-1}(A\hat{\mu}_{LS})'(A(W'W)^{-1}A')^{-1}(A\hat{\mu}_{LS}),    (2.5)

and the direction of maximum separation is c_1, the corresponding eigenvector. Then proceed as in principal components, obtaining the k orthogonal directions c_1, c_2, \ldots, c_k corresponding to the eigenvalues \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k \geq 0 of the matrix in expression (2.5). The eigenvalues of the matrix in expression (2.5) are the same as the eigenvalues of

K_{LS} = \hat{\Sigma}^{-1/2}(A\hat{\mu}_{LS})'(A(W'W)^{-1}A')^{-1}(A\hat{\mu}_{LS})\hat{\Sigma}^{-1/2},    (2.6)

which is symmetric and, hence, easier to handle numerically. Let a_i, i = 1, \ldots, k, denote the corresponding orthonormal eigenvectors of K_{LS}, (2.6). It can easily be shown that c_i = \hat{\Sigma}^{-1/2}a_i, i = 1, \ldots, k. The vector c_i is called the ith discriminant direction. Let C = [c_1, \ldots, c_k]. Then the discriminant coordinate representation of the matrix X is Z = XC, where the columns of Z are the discriminant coordinates.

The matrix K_{LS}, (2.6), is called the kernel for the traditional procedure, and the associated visualization procedures are graphical methods based on the discriminant coordinates.
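For concreteness, the following is a minimal NumPy sketch of this construction; it is our illustration, not code from the paper, and all function and variable names are ours.

```python
import numpy as np

def sqrtm_spd(S):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(w)) @ V.T

def ls_discriminant_coordinates(X, labels):
    """Least squares discriminant coordinates via the kernel (2.6)."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n, k = X.shape
    g = len(groups)
    # One-way incidence matrix W of zeros and ones.
    W = (labels[:, None] == groups[None, :]).astype(float)
    # LS fit: the rows of mu_hat are the group sample means.
    mu_hat = np.linalg.solve(W.T @ W, W.T @ X)
    # Pooled variance-covariance estimate (n - g)^{-1} X'(I - P_Omega)X.
    resid = X - W @ mu_hat
    Sigma_hat = resid.T @ resid / (n - g)
    # Contrast matrix A ((g-1) x g) for equality of the g group means.
    A = np.hstack([np.ones((g - 1, 1)), -np.eye(g - 1)])
    # Kernel K = S^{-1/2} (A mu)' [A (W'W)^{-1} A']^{-1} (A mu) S^{-1/2}.
    S_inv_half = np.linalg.inv(sqrtm_spd(Sigma_hat))
    M = A @ mu_hat
    mid = M.T @ np.linalg.solve(A @ np.linalg.solve(W.T @ W, A.T), M)
    K = S_inv_half @ mid @ S_inv_half
    # Eigenvectors a_i of K; discriminant directions c_i = S^{-1/2} a_i.
    evals, avecs = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1]
    C = S_inv_half @ avecs[:, order]
    return X @ C, C   # coordinates Z = XC and the directions C
```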

2.3 Allocation

The objective of allocation is to classify an object of unknown group membership into one of the g known groups. Recall that the rule that minimizes the total probability of misclassification (TPM) is

Assign x to G_i if \pi_i f_i(x) \geq \pi_j f_j(x), for all j = 1, \ldots, g;    (2.7)

see, for instance, Seber (1984). If f_i(x) is the pdf of a N_k(\mu_i, \Sigma) distribution, then the optimal rule becomes:

Assign x to G_i if L_i(x) \geq L_j(x), for all j = 1, 2, \ldots, g,    (2.8)

where, dropping from \ln(\pi_i f_i(x)) the terms common to all groups,

L_i(x) = \ln \pi_i + \mu_i'\Sigma^{-1}(x - \tfrac{1}{2}\mu_i).

If the prior probabilities are assumed equal, then a short algebraic derivation shows that the rule in expression (2.8) is equivalent to the nearest center rule, with distance measured by the Mahalanobis distance. That is, expression (2.8) is equivalent to

Assign x to G_i if (x - \mu_i)'\Sigma^{-1}(x - \mu_i) \leq (x - \mu_j)'\Sigma^{-1}(x - \mu_j), for all j = 1, 2, \ldots, g.    (2.9)
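The algebraic step is short enough to record here. Expanding the quadratic form gives

(x - \mu_i)'\Sigma^{-1}(x - \mu_i) = x'\Sigma^{-1}x - 2\,\mu_i'\Sigma^{-1}\left(x - \tfrac{1}{2}\mu_i\right),

and x'\Sigma^{-1}x is common to all groups, so minimizing the distance over i is the same as maximizing \mu_i'\Sigma^{-1}(x - \tfrac{1}{2}\mu_i), which under equal priors (\pi_i = 1/g) differs from L_i(x) only by the common constant \ln \pi_i = -\ln g.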


In practice, the traditional estimates are substituted for the parameters.

The discriminant coordinate representation, Z = XC, is the representation of the data which gives maximal separation, so it is the appropriate representation from which to work. In practice, we may use only the first several discriminant coordinates to do the allocation. The nearest center rule for discriminant coordinates is:

Assign x to G_i if D_i(z) \leq D_j(z) for all j = 1, 2, \ldots, g,    (2.10)

where z = C'x, \hat{\Sigma}_z = C'\hat{\Sigma}C, \hat{\mu}_{z_i}' is the ith row of the g × k matrix

\hat{\mu}_{z,LS} = \hat{\mu}_{LS}C,    (2.11)

and

D_i(z) = (z - \hat{\mu}_{z_i})'\hat{\Sigma}_z^{-1}(z - \hat{\mu}_{z_i}).
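A matching sketch of this rule, again ours rather than the paper's, allocates a new observation using the first d discriminant coordinates:

```python
import numpy as np

def allocate_nearest_center(x_new, C, mu_hat, Sigma_hat, d=2):
    """Assign x_new to the group whose center is nearest in the
    Mahalanobis distance of the discriminant representation (2.10)."""
    Cd = C[:, :d]                      # first d discriminant directions
    z = Cd.T @ x_new                   # z = C'x
    mu_z = mu_hat @ Cd                 # rows are the group centers mu_{z_i}
    Sigma_z = Cd.T @ Sigma_hat @ Cd    # Sigma_z = C' Sigma_x C
    Sz_inv = np.linalg.inv(Sigma_z)
    # Mahalanobis distance D_i(z) to each group center.
    D = [(z - m) @ Sz_inv @ (z - m) for m in mu_z]
    return int(np.argmin(D))           # index of the assigned group
```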

3 Efficient Robust Discriminant Analysis

In this section, we outline a generic robust discrimination procedure, which is analogous to the traditional

procedure. The separation stage is based on maximizing a robust Lawley-Hotelling test statistic. The

efficiency of this stage is based on how powerful this test is in separating small differences of location.

3.1 Separation

To derive robust discriminant coordinates, begin with a robust estimate \hat{\mu}_P of \mu in model (2.1), where P denotes a generic robust estimating procedure. Recall that W is an incidence matrix, so a highly efficient robust estimator can be used. Assume that, under regularity conditions,

\hat{\mu}_P is asymptotically N_{g,k}(\mu, (W'W)^{-1}, \Sigma_P),    (3.12)

where \Sigma_P is the asymptotic variance-covariance matrix of \hat{\mu}_P. Note that the square root of this matrix, \Sigma_P^{1/2}, is the multivariate analog of the standard error of the estimate. Let \hat{\Sigma}_P be a consistent estimate of \Sigma_P. Then the Lawley-Hotelling type test statistic for the hypotheses (2.3) is:

T_P = (A\hat{\mu}_P c)'(A(W'W)^{-1}A')^{-1}(A\hat{\mu}_P c)(c'\hat{\Sigma}_P c)^{-1}.    (3.13)

Proceeding as in traditional discriminant coordinates, obtain the k orthogonal directions a_{P_1}, \ldots, a_{P_k} corresponding to the eigenvalues \lambda_1 \geq \cdots \geq \lambda_k \geq 0 of the matrix

K_P = \hat{\Sigma}_P^{-1/2}(A\hat{\mu}_P)'(A(W'W)^{-1}A')^{-1}(A\hat{\mu}_P)\hat{\Sigma}_P^{-1/2}.    (3.14)

The robust discriminant coordinates are the columns of Z_P = XC_P, where C_P = [c_{P_1}, \ldots, c_{P_k}] and c_{P_i} = \hat{\Sigma}_P^{-1/2}a_{P_i}. In particular, the vector c_{P_1} gives the direction of maximal separation for the generic robust procedure P.

The matrix K_P, (3.14), is the kernel for the robust procedure, and the associated visualization procedure is based on the robust discriminant coordinates, the columns of Z_P.

3.2 Efficiency

The efficiency of the procedure depends on how well the test statistic T_P detects small differences among the means. A way to measure this is to determine the asymptotic power of the test T_P under the local alternatives

H_n : A\mu_n = \frac{1}{\sqrt{n}}A\mu_0,    (3.15)

where \mu_0 is a g × k matrix not equal to the zero matrix and A is the (g - 1) × g contrast matrix given in the hypotheses (2.2). Assume a sequence of linear models of the form (2.1) indexed by n. Let W_n denote the incidence matrix and assume that

\lim_{n \to \infty} n^{-1}W_n'W_n = \Sigma_W,    (3.16)

where \Sigma_W is positive definite. The asymptotic power of the test statistic T_{P,n} can be determined from its asymptotic distribution under the sequence of alternatives. Assuming certain conditions, we can generally show that

T_{P,n} has an asymptotic noncentral \chi^2_{g-1}(\theta_P) distribution,    (3.17)

with g - 1 degrees of freedom and noncentrality parameter

\theta_P = \mathrm{tr}\,(A\mu_0)'[A\Sigma_W A']^{-1}(A\mu_0)\Sigma_P^{-1}.    (3.18)

The conditions depend on the specific robust estimator chosen, but often a uniform linearity (quadraticity)

result is required. The robust procedure discussed in Section 4 satisfies such a condition.

Provided the variance-covariance matrix of the random vector e_i is finite, under the sequence of local alternatives defined in expression (3.15) the LS test statistic T_{LS,n}, (2.4), has an asymptotic noncentral \chi^2 distribution with g - 1 degrees of freedom and noncentrality parameter

\theta_{LS} = \mathrm{tr}\,(A\mu_0)'[A\Sigma_W A']^{-1}(A\mu_0)\Sigma^{-1}.    (3.19)

It follows that the asymptotic relative efficiency between the robust procedure and LS is the ratio of noncentrality parameters,

\mathrm{ARE}(P, LS) = \frac{\theta_P}{\theta_{LS}}.    (3.20)


This result reduces, in a special case, to the univariate result. Suppose the components of the error random vector e_i are iid with variance \sigma^2. Then \Sigma = \sigma^2 I_k and \Sigma_P = \tau_P^2 I_k, for some parameter \tau_P which depends on the robust procedure used. Thus, in the iid case, the ARE (3.20) simplifies to the univariate formula \mathrm{ARE}(P, LS) = \sigma^2/\tau_P^2. For the general multivariate case, expression (3.20) does not simplify; however, comparing the noncentrality parameter \theta_P, (3.18), with the asymptotic distribution of \hat{\mu}_P, (3.12), shows that the efficiency properties of the separation phase of the robust procedure are essentially the same as the efficiency properties of the robust estimates. Because the fitting is based on the incidence matrix, there are no outliers in factor space; hence, we recommend highly efficient robust estimates.
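As a familiar benchmark (our example, not one worked in the paper), for a Wilcoxon rank-based fit the scale parameter is \tau_P = \left[\sqrt{12}\int f^2(t)\,dt\right]^{-1}, so at normal errors \mathrm{ARE}(\text{Wilcoxon}, LS) = \sigma^2/\tau_P^2 = 3/\pi \approx 0.955, the classical high efficiency of the Wilcoxon analysis.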

3.3 Allocation

Let z = C_P'x, where C_P is the matrix of robust discriminant directions based on procedure P. The robust Mahalanobis distance is:

D_{P,i}(z) = (z - \hat{\mu}_{P,z_i})'\hat{\Sigma}_{P,z}^{-1}(z - \hat{\mu}_{P,z_i}),    (3.21)

where the estimates of location and scatter are given by

\hat{\mu}_{P,z_i} = C_P'\hat{\mu}_{P,x_i} and \hat{\Sigma}_{P,z} = C_P'\hat{\Sigma}_P C_P,    (3.22)

respectively. Then a nearest center, robust linear discriminant rule is

Assign x to G_i if D_{P,i}(z) \leq D_{P,j}(z), for all j = 1, 2, \ldots, g.    (3.23)

As discussed in Section 2.3, under the assumption that the prior probabilities of group membership are all equal, the derivation connecting the traditional rules (2.8) and (2.9) is purely algebraic. Hence, the robust rule

Assign x to G_i if L_i(z) \geq L_j(z), for all j = 1, 2, \ldots, g,    (3.24)

where

L_i(z) = -\ln g + \hat{\mu}_{P,z_i}'\hat{\Sigma}_{P,z}^{-1}\left(z - \tfrac{1}{2}\hat{\mu}_{P,z_i}\right), i = 1, 2, \ldots, g,

is equivalent to rule (3.23). For the remainder of this article, we use the nearest center rule (3.23).

The proposed robust discriminant rule (3.23) can be used with most robust estimators. All that is required is a \sqrt{n}-consistent estimate of location and a consistent estimate of its asymptotic variance-covariance matrix. The efficiency of the procedure is the same as the efficiency of the robust estimator \hat{\mu}_P. While the rule is based on asymptotic theory, the Monte Carlo study presented in Section 6 verifies, over the situations covered, the robustness and validity of the procedure based on the estimator discussed in Section 4. This empirical study involved estimates based on a sample size of 25.


4 Robust Affine Equivariant Estimate

Hettmansperger and Randles (2002) proposed an M-estimate of multivariate location which is affine equivariant and robust with positive breakdown. The Hettmansperger and Randles (HR) estimator combines the L_1, or spatial, median with the M-estimate of scatter proposed by Tyler (1987). The HR estimate minimizes the dispersion function

\sum_{i=1}^{n} \|A_T(x_i - \mu)\|,    (4.25)

where A_T is a k × k upper triangular, positive definite matrix (with a one in the upper left corner) chosen to satisfy

n^{-1}\sum_{i=1}^{n} \frac{A_T(x_i - \mu)(x_i - \mu)'A_T'}{\|A_T(x_i - \mu)\|^2} = k^{-1}I,    (4.26)

where I is the k × k identity matrix and \|\cdot\| denotes the Euclidean norm. Let \hat{\mu}_{HR} be the value that minimizes (4.25).
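To make the computation concrete, here is a minimal sketch of one plausible fixed-point iteration for this estimate: a Weiszfeld step for the spatial median of the transformed data alternated with a Tyler-type rescaling of A_T toward the constraint (4.26). This is our illustrative construction under those assumptions, not the algorithm of Hettmansperger and Randles (2002).

```python
import numpy as np

def hr_estimate(X, n_iter=200, tol=1e-8):
    """Sketch of the HR estimate: spatial median of the A_T-transformed
    data plus Tyler's scatter constraint (4.26).  Illustrative only."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    mu = np.median(X, axis=0)              # starting value
    A = np.eye(k)                          # upper triangular, A[0, 0] = 1
    for _ in range(n_iter):
        Y = (X - mu) @ A.T                 # transformed residuals
        r = np.maximum(np.linalg.norm(Y, axis=1), 1e-12)
        # Weiszfeld step for the L1 (spatial) median: weighted mean
        # with weights 1 / ||A_T (x_i - mu)||.
        w = 1.0 / r
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        # Tyler step: push n^{-1} sum u_i u_i' toward k^{-1} I, cf. (4.26).
        U = Y / r[:, None]
        S = U.T @ U / n
        L = np.linalg.cholesky(np.linalg.inv(S) / k)
        A_new = L.T @ A                    # stays upper triangular
        A_new /= A_new[0, 0]               # renormalize the (1,1) entry
        done = np.linalg.norm(mu_new - mu) < tol
        mu, A = mu_new, A_new
        if done:
            break
    return mu, A                           # location estimate and fitted A_T
```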

Under model (2.1),

\hat{\mu}_{HR} is asymptotically N_{g,k}(\mu, (W'W)^{-1}, B^{-1}A^{*}B^{-1}),    (4.27)

where

A^{*} = E\left[\frac{A_T(X - \mu)(X - \mu)'A_T'}{\|A_T(X - \mu)\|^2}\right] and B = E\left[\frac{A_T}{\|A_T(X - \mu)\|}\left(I - \frac{A_T(X - \mu)(X - \mu)'A_T'}{\|A_T(X - \mu)\|^2}\right)\right].    (4.28)

Further, \hat{A}_T is a consistent estimator of A_T; see Hettmansperger and Randles (2002) for discussion. Let \hat{\Sigma}_{HR} = \hat{B}^{-1}\hat{A}^{*}\hat{B}^{-1}, where \hat{B} and \hat{A}^{*} are the respective matrices B and A^{*} with A_T replaced by \hat{A}_T. Then \hat{\Sigma}_{HR} is a consistent estimate of \Sigma_{HR} = B^{-1}A^{*}B^{-1}.

4.1 Separation

Using the HR estimates \hat{\mu}_{HR} and \hat{\Sigma}_{HR}, the Lawley-Hotelling test statistic for the hypothesis (2.3), under model (2.1), is the statistic T_{HR} defined in the following theorem.

Theorem 4.1. Assume the regularity conditions in Hettmansperger and Randles (2002) hold. Let

T_{HR} = \mathrm{tr}\,(A\hat{\mu}_{HR}c)'(A(W'W)^{-1}A')^{-1}(A\hat{\mu}_{HR}c)(c'\hat{\Sigma}_{HR}c)^{-1}.

Then, under the null hypothesis, T_{HR} is asymptotically \chi^2_{g-1}.

Proof. From the asymptotic distribution of \hat{\mu}_{HR} given in equation (4.27), we have that A\hat{\mu}_{HR}c is asymptotically N_{g-1,1}(A\mu c, A(W'W)^{-1}A', c'\Sigma_{HR}c). Further, under the null hypothesis, A\mu c = 0. From these two results, the theorem follows immediately.

Based on this theorem, the kernel of the HR discriminant coordinate procedure is

K_{HR} = \hat{\Sigma}_{HR}^{-1/2}(A\hat{\mu}_{HR})'(A(W'W)^{-1}A')^{-1}(A\hat{\mu}_{HR})\hat{\Sigma}_{HR}^{-1/2}.    (4.29)

Let a_{HR_1}, \ldots, a_{HR_k} denote the eigenvectors corresponding to the eigenvalues \lambda_1 \geq \cdots \geq \lambda_k \geq 0 of the matrix K_{HR}. Then the HR robust discriminant coordinates are the columns of Z_{HR} = XC_{HR}, where C_{HR} = [c_{HR_1}, \ldots, c_{HR_k}] and c_{HR_i} = \hat{\Sigma}_{HR}^{-1/2}a_{HR_i}. The associated HR visualization procedure is based on these discriminant coordinates.

4.2 Efficiency

For efficiency results, consider the setup of Section 3.2 with the sequence of local alternatives (3.15). Based on the linearization result given in Hettmansperger and Randles (2002), it follows that under this sequence of local alternatives

T_{HR,n} has an asymptotic noncentral \chi^2_{g-1}(\theta_{HR}) distribution,    (4.30)

with g - 1 degrees of freedom and noncentrality parameter

\theta_{HR} = \mathrm{tr}\,(A\mu_0)'[A\Sigma_W A']^{-1}(A\mu_0)\Sigma_{HR}^{-1}.    (4.31)

The efficiency of the separation procedure is the same as the efficiency of the HR estimator, which is discussed in Section 3 of Hettmansperger and Randles (2002). In particular, it appears to be highly efficient for heavy-tailed error distributions relative to the LS procedure.

4.3 Allocation

The nearest center rule is

Assign x to G_i if D_{HR,i}(z) \leq D_{HR,j}(z), z = C_{HR}'x, for all j = 1, 2, \ldots, g,    (4.32)

where the robust HR Mahalanobis distance is

D_{HR,i}(z) = (z - \hat{\mu}_{HR,z_i})'\hat{\Sigma}_{HR,z}^{-1}(z - \hat{\mu}_{HR,z_i}),    (4.33)

and the estimates of location and scatter are given by

\hat{\mu}_{HR,z_i} = C_{HR}'\hat{\mu}_{HR,x_i} and \hat{\Sigma}_{HR,z} = C_{HR}'\hat{\Sigma}_{HR}C_{HR}.    (4.34)

Because of the affine equivariance of the HR estimator, these same estimates would be obtained from the transformed data Z.

4.4 Equivalence to the Traditional Rule

As with most robust procedures, interest centers on how efficient the robust estimate is relative to the traditional estimate under the multivariate normal distribution. Suppose the rows of e in model (2.1) have a symmetric elliptical error distribution with density proportional to c_k h(t't). As discussed in Hettmansperger and Randles (2002), r^2 = \|e\|^2 has density

f_{r^2}(y) = \frac{c_k \pi^{k/2}}{\Gamma(k/2)}y^{k/2-1}h(y).

Then the asymptotic relative efficiency of \hat{\mu}_{HR} relative to \hat{\mu}_{LS} is

\mathrm{ARE}(\hat{\mu}_{HR}, \hat{\mu}_{LS}) = k^{-2}(k - 1)^2 E(r^2)[E(r^{-1})]^2.

At the multivariate normal, the asymptotic relative efficiency of \hat{\mu}_{HR} to the least squares \hat{\mu}_{LS} is

\mathrm{ARE}(\hat{\mu}_{HR}, \hat{\mu}_{LS}) = \left[\left(\frac{2^{1/2}\,k\,\Gamma(k/2)}{(k - 1)\,\Gamma(\frac{k-1}{2})}\right)^{2}\frac{1}{k}\right]^{-1}.    (4.35)

If \hat{\Sigma}_{HR} is divided by the constant in equation (4.35), then the resulting estimate is consistent for \Sigma and rule (4.32) is asymptotically equivalent to the traditional nearest center rule (2.9).
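As a quick numerical check of (4.35) (our sketch, not the paper's), evaluating the expression for a few dimensions reproduces the familiar efficiencies of the spatial median at the normal, which increase with the dimension k:

```python
import math

def are_hr_ls(k):
    """Efficiency of the HR estimate relative to LS at the k-variate
    normal, evaluating expression (4.35)."""
    inner = (math.sqrt(2.0) * k * math.gamma(k / 2.0)
             / ((k - 1) * math.gamma((k - 1) / 2.0))) ** 2 / k
    return 1.0 / inner

for k in (2, 4, 6):
    print(k, round(are_hr_ls(k), 3))   # approx 0.785, 0.884, 0.920
```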

5 Examples

To investigate the robustness of the procedures, we used Fisher's (1936) classic Iris data set and four contaminated versions of it. Recall that the Iris data set consists of three species of Iris with 50 observations on each species. The four variables are sepal length, sepal width, petal length, and petal width. In the plots, group one is denoted with red circles, group two with green triangles pointing up, and group three with blue triangles pointing down. We contaminated the Iris data set with a single outlier in group one; with 5 outliers in group one; with 5 clustered outliers in group one; and with 3 outliers in group one and 2 outliers in group two. For each data set, the visualizations were constructed using the first two discriminant coordinates of the kernels K_{LS} and K_{HR}. We also calculated the probabilities of misclassification (PMC) using leave-one-out cross-validation, with allocations based on the nearest center rule using the first two discriminant coordinates.
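The leave-one-out estimate admits a compact sketch (our code; `fit_and_classify` is a hypothetical stand-in for fitting either the LS or the HR procedure on the retained rows and classifying the held-out row):

```python
import numpy as np

def loo_pmc(X, labels, fit_and_classify):
    """Leave-one-out estimate of the probability of misclassification."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    n = X.shape[0]
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i                       # hold out row i
        pred = fit_and_classify(X[keep], labels[keep], X[i])
        errors += int(pred != labels[i])
    return errors / n                                  # estimated PMC
```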

5.1 Visualization

Figure 1 displays the plots of the first two traditional discriminant coordinates and the HR robust discriminant coordinates for the original Iris data set. In each of the plots, the first coordinate shows a difference in location among the three groups.

[Figure 1: Iris Data (original data). Two panels, Least Squares and HR, each plotting Coordinate 1 versus Coordinate 2.]

Figure 2 displays the plots of the first two traditional discriminant coordinates and the HR robust

discriminant coordinates from the Iris data set with one outlier. From these plots, the traditional discriminant

coordinates fail to separate the groups or identify the outlier (first plot, Figure 2) whereas the robust

discriminant coordinates identify the outlier (second plot, Figure 2) and separate the groups (third plot,

Figure 2).

[Figure 2: Iris Data (1 outlier). Three panels, Least Squares, HR, and HR Zoomed In, each plotting Coordinate 1 versus Coordinate 2.]

Figure 3 displays the plots of the first two traditional discriminant coordinates and the HR robust

discriminant coordinates from the Iris data set with 5 outliers. From these plots, the traditional discriminant coordinates separate the groups, but only 3 of the 5 outliers are clearly identified in the plot (first plot, Figure 3). The robust discriminant coordinates identify all 5 outliers and separate the groups (second and third plots, Figure 3).

[Figure 3: Iris Data (5 outliers). Three panels, Least Squares, HR, and HR Zoomed In, each plotting Coordinate 1 versus Coordinate 2.]

Figure 4 displays the plots of the first two traditional discriminant coordinates and the HR robust

discriminant coordinates from the Iris data set with 5 outliers in a cluster. From these plots, the traditional

discriminant coordinates identify the 5 outliers but do not separate the groups (first plot, Figure 4). The

robust discriminant coordinates identify the 5 outliers and separate the groups (second and third plots,

Figure 4).

[Figure 4: Iris Data (5 outliers in a cluster). Three panels, Least Squares, HR, and HR Zoomed In, each plotting Coordinate 1 versus Coordinate 2.]

Figure 5 displays the plots of the first two traditional discriminant coordinates and the HR robust discriminant coordinates from the Iris data set with 3 outliers in group one and 2 outliers in group two. From these plots, the traditional discriminant coordinates neither separate the groups nor identify the outliers (first plot, Figure 5), whereas the robust discriminant coordinates identify the 5 outliers and separate the groups (second and third plots, Figure 5).

[Figure 5: Iris Data (3 outliers in group 1 and 2 outliers in group 2). Three panels, Least Squares, HR, and HR Zoomed In, each plotting Coordinate 1 versus Coordinate 2.]

Thus in terms of separation, the HR discriminant procedure agrees with the traditional LS procedure on

the original data. For the contaminated data, in all cases, the HR procedure separates the groups and identifies all the outliers. In contrast, on the contaminated data, the LS procedure either fails to separate or

fails to identify all the outliers. In terms of separation, the HR procedure is robust.

5.2 Allocation

Table 1 displays the estimated probability of misclassification (PMC) for each variation of the Iris data set.

Table 1: Estimated PMCs for each variation of the Iris Data Set

Data Set                               Least Squares   HR
Original                               0.0267          0.0267
1 Outlier                              0.4733          0.0333
5 Outliers                             0.38            0.06
5 Clustered Outliers                   0.52            0.06
3 Outliers Grp 1 & 2 Outliers Grp 2    0.55            0.06

From the results presented in the table, when the Iris data set is not contaminated the PMC is the same for the LS and HR procedures, but when contamination is added to the data set, the HR procedure has a much lower PMC. In terms of allocation, the HR procedure is robust; the outliers severely hampered the allocation ability of the LS procedure.

6 Simulation Results

In this section, we present the results of a Monte Carlo study which investigates the behavior of the nearest

center rules of three procedures in terms of their TPMs over various error distributions. For our procedure

we chose the highly efficient robust procedure described in Section 4 based on the HR estimator. For

comparison, we included the traditional procedure (LS) as described in Section 2. In order to investigate

our efficiency claims, as our third procedure we selected the high breakdown procedure proposed by Hawkins

and McLachlan (1997) using minimum covariance determinants (MCD) as an estimator of scatter. This

procedure has high breakdown but low efficiency. Their procedure accommodates a certain percentage of


outliers, and the estimates are based on the “inliers”. The outliers are the set of points which, when removed, minimize the within-group covariance determinant. For the simulation, we used 50% coverage, because this gives the highest breakdown and lowest efficiency.

For the simulation results presented in this section, we consider situations where there are two groups and

four dimensions, i.e., g = 2 and k = 4. For the mean and variance matrices, we chose the sample mean and

variance matrices of the beetle data on page 295 of Seber (1984). One thousand data sets were randomly

generated from a variety of error distributions. The error distributions used were multivariate normal

(MVN), the contaminated multivariate normal (CN), and the multivariate t. For each error distribution,

fifty observations were generated for both the training and test data sets with twenty-five observations

randomly assigned to each group. The training data set was used to develop the linear classification rule and

then this rule was used to classify the test data set and the probability of misclassification was recorded.
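For concreteness, here is a sketch (ours) of one way to draw the contaminated normal errors of the study; the covariance structure taken from the beetle data is omitted here, and the generator layout is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def contaminated_normal_errors(n, k, eps, sigma2):
    """Elliptical contaminated normal: N(0, I) with probability 1 - eps,
    N(0, sigma2 * I) with probability eps."""
    e = rng.standard_normal((n, k))
    inflate = rng.random(n) < eps          # which rows are contaminated
    e[inflate] *= np.sqrt(sigma2)
    return e
```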

The empirical TPM was used as a benchmark for the performance of each procedure. To calculate the empirical TPM, we used the rule

Assign x_i to group 1 if \frac{f(x_i \mid G_1)}{f(x_i \mid G_2)} > 1, i = 1, \ldots, n,

to classify the data, and took the proportion misclassified as the empirical TPM. At the multivariate normal, the true TPM is

TPM = \Phi(-\Delta/2),

where

\Delta^2 = (\mu_1 - \mu_2)'\Sigma^{-1}(\mu_1 - \mu_2)

is the squared Mahalanobis distance between the group means. Table 2 displays the empirical TPM for the distributions used in the simulation. For the multivariate normal, the true TPM is presented in parentheses.

Table 2: Empirical TPM

Distribution             Empirical TPM
MVN (0.1142)             0.1251
CN ε = 0.10, σ² = 9      0.1560
CN ε = 0.20, σ² = 9      0.1791
CN ε = 0.10, σ² = 25     0.1487
CN ε = 0.20, σ² = 25     0.1904
CN ε = 0.10, σ² = 100    0.1665
CN ε = 0.20, σ² = 100    0.1996
t, df = 1                0.2278
t, df = 2                0.1854
t, df = 3                0.1546
t, df = 4                0.1545
t, df = 5                0.1530
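For instance, the tabled true value 0.1142 corresponds to \Phi(-\Delta/2) with \Delta \approx 2.41, a back-calculation on our part:

```python
from scipy.stats import norm

delta = 2.41                     # roughly the separation implied by 0.1142
print(norm.cdf(-delta / 2.0))    # about 0.114
```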

Table 3 displays 84% confidence intervals for the probabilities of misclassification (PMC) of the three

procedures. The 84% confidence intervals were chosen because, for a two-sample analysis based on one-sample confidence intervals, 84% one-sample intervals yield roughly a 95% two-sample confidence

interval; see Section 1.12 of Hettmansperger and McKean (1998). From the results displayed in the table,

the traditional procedure has the smallest PMC at the multivariate normal. In all the other cases, the HR

procedure has lower PMCs than the LS procedure. In fact, in seven of these situations, the confidence

intervals do not overlap. Thus the HR procedure is more robust than the LS procedure for moderate to

heavy contamination. For comparisons between the two robust procedures, the HR procedure always has

a lower PMC than the HM procedure. For the situations considered, the confidence intervals of the two

procedures never overlap. The HR procedure is more efficient than the HM procedure over all situations in


this study, including the elliptical Cauchy distribution. In comparing the HM and LS procedures, the HM

procedure has lower PMCs than the LS procedure for heavy-tailed distributions.

Next, consider the comparison between the empirical TPMs of Table 2 and the simulated PMCs of Table

3. The HR values are generally much closer to the TPM values than the values of the other procedures.

Table 3: Simulated PMCs for the Procedures (84% confidence intervals)

Distribution             LS                  HR                  HM
MVN                      (0.1365, 0.1410)    (0.1426, 0.1473)    (0.2323, 0.2411)
CN ε = 0.10, σ² = 9      (0.1684, 0.1737)    (0.1643, 0.1692)    (0.2436, 0.2520)
CN ε = 0.20, σ² = 9      (0.1970, 0.2027)    (0.1875, 0.1926)    (0.2477, 0.2552)
CN ε = 0.10, σ² = 25     (0.1919, 0.1978)    (0.1697, 0.1747)    (0.2449, 0.2530)
CN ε = 0.20, σ² = 25     (0.2310, 0.2377)    (0.1986, 0.2038)    (0.2538, 0.2612)
CN ε = 0.10, σ² = 100    (0.2356, 0.2434)    (0.1738, 0.1788)    (0.2484, 0.2564)
CN ε = 0.20, σ² = 100    (0.2953, 0.3044)    (0.2074, 0.2126)    (0.2605, 0.2679)
t, df = 1                (0.3370, 0.3466)    (0.2411, 0.2467)    (0.2611, 0.2676)
t, df = 2                (0.2282, 0.2345)    (0.2008, 0.2059)    (0.2373, 0.2440)
t, df = 3                (0.1943, 0.1997)    (0.1853, 0.1902)    (0.2315, 0.2387)
t, df = 4                (0.1784, 0.1835)    (0.1758, 0.1808)    (0.2288, 0.2359)
t, df = 5                (0.1670, 0.1752)    (0.1665, 0.1717)    (0.2318, 0.2399)

7 Conclusion

In this paper, we have proposed a discriminant analysis based on efficient robust discriminant coordinates.

Like the traditional analysis, this robust analysis is a two stage process: separation and allocation. Maximizing a robust Lawley-Hotelling test based on robust estimates of group centers achieves the separation

and produces the discriminant coordinates. The efficiency of the procedure follows from the power of the

procedure to detect small differences in group centers. Further, it has the same efficiency as the robust

estimates used for the Lawley-Hotelling test. The robust discriminant coordinates can be used to visualize

the data as we demonstrated with examples. This visualization is much less sensitive to outliers than the

visualization obtained from traditional discriminant coordinates. The robust discriminant coordinates can

be further used to form nearest center rules for the allocation of new data to the groups.


Our procedure is generic in the sense that any robust fitting procedure can be used provided its estimates

are root n consistent with an asymptotic linearity result and a consistent estimate of its variance-covariance

matrix. The design matrix for the fitting is an incidence matrix, so highly efficient robust estimators are recommended, which results in the associated discriminant procedure being highly efficient. In this paper

we used the affine equivariant estimator proposed by Hettmansperger and Randles (2002) but any highly

efficient robust estimator could be used.

The examples that we presented showed the robustness of the procedures on real data. On the original

data, the HR robust procedure behaved similarly to the traditional LS procedure in terms of visualization

and classification (PMC). However, when outliers were introduced in the Iris data, the results were quite

different. The behavior of the robust procedure was quite similar to its behavior on the original data but

the traditional procedure’s PMC rate changed from 3% to 48% (on average) and its visualization was quite

poor.

In our Monte Carlo study, we investigated the behavior of the nearest center rules in terms of misclassifications for three procedures. The data were split into two data sets: “training” and “test”. Empirical

PMCs for the procedures were obtained for families of multivariate t- and contaminated multivariate normal

distributions. We selected the highly efficient procedure (HR) described in Section 4. As competitors we


selected the LS procedure and a high breakdown, but low efficiency, procedure proposed by Hawkins and McLachlan (1997). The HR procedure was comparable to the LS procedure when the errors had a multivariate normal distribution, but it generally performed much better than the LS procedure for the heavier tailed

error distributions. Further, over all situations simulated, the HR procedure had lower empirical PMCs than

the high breakdown HM procedure.

In summary, the discriminant procedures that we have proposed form an attractive robust alternative

to the traditional procedure. The procedures are highly efficient relative to the traditional procedure and

they are quick to compute. Further, they produce robust discriminant coordinates which allow the user to

visually explore the data and assess the differences among groups.

Acknowledgment

The authors thank the associate editor and a referee whose comments led to an improvement of this paper.

References

Davis, J. B. and McKean, J. W. (1993), Rank-based methods for multivariate linear models, Journal of the American Statistical Association, 88, 245-251.

Fisher, R. A. (1936), The use of multiple measurements in taxonomic problems, Annals of Eugenics, 7, 179-188.

Flury, B. and Riedwyl, H. (1988), Multivariate Statistics: A Practical Approach, London: Chapman and Hall.

Gnanadesikan, R. (1977), Methods for Statistical Analysis of Multivariate Observations, New York: John Wiley & Sons.

Hawkins, D. M. and McLachlan, G. J. (1997), High-breakdown linear discriminant analysis, Journal of the American Statistical Association, 92, 136-143.

Hettmansperger, T. P. and McKean, J. W. (1998), Robust Nonparametric Statistical Methods, London: Arnold.

Hettmansperger, T. P. and Randles, R. H. (2002), A practical affine equivariant multivariate median, Biometrika, 89, 851-860.

Jaeckel, L. A. (1972), Estimating regression coefficients by minimizing the dispersion of the residuals, Annals of Mathematical Statistics, 43, 1449-1458.

Johnson, R. A. and Wichern, D. W. (1998), Applied Multivariate Statistical Analysis, 4th Ed., Upper Saddle River, New Jersey: Prentice Hall.

Randles, R. H., Broffitt, J. D., Ramberg, J. S., and Hogg, R. V. (1978), Generalized linear and quadratic discriminant functions using robust estimates, Journal of the American Statistical Association, 73, 564-568.

Reaven, G. M. and Miller, R. G. (1986), Robust Regression and Outlier Detection, New York: John Wiley & Sons.

Seber, G. A. F. (1984), Multivariate Observations, New York: John Wiley & Sons.

Tyler, D. E. (1987), A distribution-free M-estimator of scatter, Annals of Statistics, 15, 234-251.