
On the classification of observations structured into groups


APPLIED STOCHASTIC MODELS AND DATA ANALYSIS, VOL. 4, 239-251 (1988)

Clustering and classification

ON THE CLASSIFICATION OF OBSERVATIONS STRUCTURED INTO GROUPS

SUMMARY

The paper is concerned with the problem of classifying a specific group into one of two populations (for example, insect eggs of the same clutch, which therefore belong to the same species). Two approaches, one parametric and the other non-parametric, are described. The classical likelihood ratio procedure is derived. An interpretation and a decomposition of the test criteria are given. A misclassification estimate using the Chernoff-Kullback-Kailath region is provided.

KEY WORDS  Classification and discrimination of a specific group; Mahalanobis distance and observation shape; Relative discriminant contribution; Estimate of misclassification; Sebestyen approach; Chernoff-Kullback-Kailath region

INTRODUCTION

Two distinct populations $\omega_1$ and $\omega_2$ are considered whose individuals, known by means of a vector of $v$ variables, are structured into groups.

The problem dealt with is the classification of a set of observations $y_k \in \mathbb{R}^v$ ($k = 1, \ldots, K$), all belonging to the same specific group $\gamma$, into one of the two populations $\omega_i$ ($i = 1, 2$).

Two approaches are used. The first, parametric, investigates a special component of variance model of the type

$$y_k = \xi_\gamma + \varepsilon_k$$

where $\xi_\gamma$ is the barycentre of the group that $y_k$ comes from; $\xi_\gamma$ and $\varepsilon_k$ are independent random variables with known probability distribution.

Whereas in classic problems of discriminant analysis (DA) we arrive at a discriminant rule (DR) based on the comparison between two Mahalanobis distances, in the problem addressed here the DR is based on the comparison between two sets of distance indices: one due to the position, the other to the shape of the observations to discriminate.

The second approach, non-parametric, is a generalization of Sebestyen's model [1, 2], which provides as many metrics as there are populations to be discriminated. As is known, in this approach the classical likelihood ratio test cannot be used and the resulting DR is based on the comparison between two distances.

This paper on DA arises from an entomological problem (described in detail by Pucci and Forcina [3] and solved by Forcina [4]), where it is desired to classify a set of insect eggs from the same clutch (the specific group $\gamma$) certainly coming from one of the two species Sesamia cretica and S. nonagrioides (the populations $\omega_i$). An example is given at the end of this paper.

8755-0024/88/040239-13$06.50 © 1988 by John Wiley & Sons, Ltd.

Received January 1986
Revised 23 November 1987

F. BERTOLINO

The specific object of this paper is to provide the entomologist with a reliable and easily applicable DR. The general aim is to investigate the discrimination of populations structured into groups.

The problem of classification of homozygous or heterozygous twins, described in detail by Stocks [5], is almost certainly the first example where the object to be categorized is a group of individuals rather than one. In all the discrimination models proposed (those of Penrose [6], Smith [7], Bartlett and Please [8] for twins, and Desu and Geisser [9], which also considers the case of triplets) the classification problem is reduced to a particular case of quadratic DA (with a single observation) with equal means and particular different covariances. For twins there exists one observation given by the difference between the two, $z = x_1 - x_2$, whereas for triplets we have two observations, $z_1 = x_1 - x_2$ and $z_2 = x_1 - x_3$.

Obviously, not all of these classification models can be used for discriminating among groups of indefinite size.

DATA, DEFINITIONS AND HYPOTHESES

The data, well classified, required to discriminate future observations $y_k$ ($k = 1, \ldots, K$) were collected according to the following procedure: from the population $\omega_i$ ($i = 1, 2$), $N_i$ clutches $\gamma_{in}$ ($n = 1, \ldots, N_i$) are sampled; then from each $\gamma_{in}$, which contains a large number of eggs, $J_{in}$ eggs are sampled and for each egg a vector $x_{inj}$ of $v$ variables is observed.

We want to classify a set of $K$ future observations which certainly belong to one of the two populations $\omega_i$, all coming from the same batch $\gamma$.

The nature of the data suggests the definitions and hypotheses illustrated below. The (known) probability that a group $\gamma$ belongs to one of the two populations $\omega_i$ is

$$\mathrm{Prob}(\gamma \subset \omega_i) = p(\omega_i), \quad (i = 1, 2) \tag{1}$$

Hypothesis 1

The conditional probability density function (p.d.f.) of $y_k$, given $\xi_\gamma$ and $\gamma \subset \omega_i$, is

$$y_k \sim N(y \mid \xi_\gamma, \Xi_i), \quad y \in \mathbb{R}^v, \quad (i = 1, 2), \quad \forall k = 1, \ldots, K \tag{2}$$

where $\xi_\gamma$ is the (random) vector of the centre of the group $\gamma$ and $\Xi_i$ the covariance matrix of the vector $y$ in the group $\gamma$ to which it belongs. Note that $\Xi_i$ only depends on the population $\omega_i$; that is, there is homoscedasticity between the groups coming from the same population and heteroscedasticity between those originating from different populations.

Hypothesis 2

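The vector $\xi_\gamma$ relating to the group $\gamma \subset \omega_i$ has multinormal p.d.f. given by

$$\xi_\gamma \sim N(\xi \mid \mu_i, \Sigma_i), \quad \xi \in \mathbb{R}^v, \quad (i = 1, 2) \tag{3}$$

where $\mu_i$ is the vector of the centre of the population $\omega_i$ and $\Sigma_i$ the covariance matrix of the vector $\xi_\gamma$.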

Hypothesis 3

All the $K$ observations to be classified $(y_1, \ldots, y_K)$ originate from the same group $\gamma \subset \omega_i$ ($i = 1, 2$). It follows that, since they are interchangeable within the group $\gamma$ they belong to, they are mutually independent (given $\xi_\gamma$ and $\Xi_i$).

It is readily seen that hypotheses 1, 2 and 3 lead to the well-known component of variance model

$$y_k = \xi_\gamma + \varepsilon_k, \quad \forall k = 1, \ldots, K,$$

where $\xi_\gamma$ and $\varepsilon_k$ are independent, $\xi_\gamma$ obeys the law (3) and $\varepsilon_k \sim N(\varepsilon \mid 0, \Xi_i)$, $\varepsilon \in \mathbb{R}^v$.
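The component of variance model above can be illustrated numerically. The following is a minimal sketch, assuming NumPy; the helper name `simulate_group` and the parameter values `mu`, `Sigma` and `Xi` are hypothetical and chosen only for illustration (they are not the estimates used in the example of this paper).

```python
import numpy as np

def simulate_group(mu, Sigma, Xi, K, rng):
    """Draw one group of K observations y_k = xi + eps_k (hypothetical helper)."""
    # Group centre, drawn from the law (3): xi ~ N(mu, Sigma)
    xi = rng.multivariate_normal(mu, Sigma)
    # Within-group deviations: eps_k ~ N(0, Xi), independent of xi
    eps = rng.multivariate_normal(np.zeros(len(mu)), Xi, size=K)
    return xi + eps  # array of shape (K, v)

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])                    # illustrative population centre
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # illustrative covariance of the centres
Xi = np.array([[0.5, 0.1], [0.1, 0.2]])      # illustrative within-group covariance

group = simulate_group(mu, Sigma, Xi, K=5, rng=rng)
ybar = group.mean(axis=0)           # sample mean of the group
S = np.cov(group, rowvar=False)     # sample covariance (divisor K - 1)
```

Under hypotheses 1-3 the group mean has covariance $\Sigma_i + K^{-1}\Xi_i$, which can be checked empirically by simulating many groups and computing the covariance of their means.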

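Recalling (2) we have

$$p(y_1, \ldots, y_K \mid \xi_\gamma, \gamma \subset \omega_i) = c_y^{-1}\, N(\bar{y} \mid \xi_\gamma, K^{-1}\Xi_i)\, W((K-1)S \mid K-1, \Xi_i), \quad (i = 1, 2) \tag{4}$$

where $\bar{y}$ and $S$, the sample mean and covariance matrix, respectively,

$$\bar{y} = K^{-1}\sum_{k=1}^{K} y_k, \qquad S = (K-1)^{-1}\sum_{k=1}^{K}(y_k - \bar{y})(y_k - \bar{y})^T$$

are sufficient estimates of $\xi_\gamma$ and $\Xi_i$. $W((K-1)S \mid K-1, \Xi_i)$ denotes Wishart's p.d.f. with parameters $\Xi_i$ and $K-1$, and $c_y$ is a normalizing factor (involving powers of $2\pi$, the gamma functions $\Gamma(\tfrac{1}{2}(K-i))$, $i = 1, \ldots, v$, and the determinant $|(K-1)S|$) which does not depend on the population $\omega_i$ but only on the observations $(y_1, \ldots, y_K)$.

It ensues that the joint p.d.f. of the same $K$ observations belonging to the population $\omega_i$ and to the same group, referred to below as (5), is obtained from (4) by integrating out $\xi_\gamma$ with respect to the law (3).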

DISCRIMINATION RULE

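Let $c(j \mid i)$, $i, j = 1, 2$, be the loss as a result of assigning the group $\gamma$ to the population $\omega_j$ when $\gamma$ belongs to $\omega_i$, with $c(i \mid i) = 0$. The Bayes discrimination rule (DR) which allocates $\gamma$ to one of the two populations $\omega_i$, recalling (4), is given by the likelihood ratio (LR) [10-12]:

$$\frac{p(y_1, \ldots, y_K \mid \gamma \subset \omega_1)}{p(y_1, \ldots, y_K \mid \gamma \subset \omega_2)} \gtrless t \;\Rightarrow\; \gamma \subset \begin{cases} \omega_1 \\ \omega_2 \end{cases} \tag{6}$$

with $t = [c(1 \mid 2)\,p(\omega_2)]/[c(2 \mid 1)\,p(\omega_1)]$.

Taking logarithms, the DR becomes

$$-[d^2(\bar{y}, \mu_1) + \delta^2(S, \Xi_1)] + [d^2(\bar{y}, \mu_2) + \delta^2(S, \Xi_2)] \gtrless \tau \;\Rightarrow\; \gamma \subset \begin{cases} \omega_1 \\ \omega_2 \end{cases} \tag{7}$$

where $d^2(\bar{y}, \mu_i) = (\bar{y} - \mu_i)^T \Psi_i^{-1} (\bar{y} - \mu_i)$ is the Mahalanobis distance between the barycentre $\bar{y}$ of the $K$ observations to be discriminated and that of the population $\omega_i$ ($\Psi_i$ being the covariance matrix of $\bar{y}$ when $\gamma \subset \omega_i$), where $\delta^2(S, \Xi_i) = \mathrm{tr}\{(K-1)\Xi_i^{-1}S\}$, and where the threshold $\tau$ collects $t$ together with the terms of the log-LR that do not depend on $\bar{y}$ and $S$.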

The LR (6) reduces to a simple algebraic comparison between the components $d^2$ and $\delta^2$ (7). Since $\bar{y}$ and $S$ are independent, $d^2$ and $\delta^2$ are also independent.

The quantity $\delta^2$ can be considered as a distance index between the shape of the $K$ observations (given by $S$) and the shape of the group $\gamma \subset \omega_i$ (given by $\Xi_i$). Although $\delta^2$ is not a distance ($\delta^2$ does not enjoy the property $\delta^2(A, B) = 0$ iff $A = B$), note that it has numerous interesting distributional properties, quite similar to those of the Mahalanobis distance.

It is easily shown that

(i) the $\delta^2$ are invariant with respect to any non-singular linear transformation of the random vectors $y_1, \ldots, y_K$;

(ii) if the observations $y_1, \ldots, y_K$ to discriminate come from $\omega_i$, then $\delta^2(S, \Xi_i)$ is distributed according to a $\chi^2$ distribution with $v(K-1)$ d.f.; if the $y_1, \ldots, y_K$ come from $\omega_j$, the following relation holds:

$$\delta^2(S, \Xi_i) = \sum_{\alpha=1}^{v} \lambda_\alpha D_\alpha(\nu)$$

where $D_\alpha(\nu)$ is distributed according to $\chi^2$ with $\nu = K - 1$ d.f. and where $\lambda_\alpha$ is the $\alpha$th eigenvalue of $A^T \Xi_i^{-1} A$ ($A$ is a triangular matrix such that $A A^T = \Xi_j$).
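Property (i) is easy to check numerically. The sketch below, assuming NumPy, computes the shape index $\delta^2(S, \Xi) = \mathrm{tr}\{(K-1)\Xi^{-1}S\}$ before and after a non-singular linear transformation; the function name `shape_index`, the matrix `B` and the data are hypothetical and purely illustrative.

```python
import numpy as np

def shape_index(S, Xi, K):
    """delta^2(S, Xi) = tr{(K - 1) Xi^{-1} S} (hypothetical helper name)."""
    return float((K - 1) * np.trace(np.linalg.solve(Xi, S)))

rng = np.random.default_rng(1)
K, v = 6, 3
Y = rng.normal(size=(K, v))                   # illustrative group of K observations
Xi = np.array([[2.0, 0.3, 0.0],
               [0.3, 1.0, 0.2],
               [0.0, 0.2, 1.5]])              # illustrative within-group covariance
S = np.cov(Y, rowvar=False)                   # sample covariance (divisor K - 1)

# A fixed non-singular transformation (its determinant equals 5)
B = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])
S_t = np.cov(Y @ B.T, rowvar=False)           # covariance of the transformed data
Xi_t = B @ Xi @ B.T                           # transformed population covariance

delta = shape_index(S, Xi, K)
delta_t = shape_index(S_t, Xi_t, K)           # equal to delta up to rounding
```

The invariance follows from $\mathrm{tr}\{(B\Xi B^T)^{-1} B S B^T\} = \mathrm{tr}\{\Xi^{-1} S\}$.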

For the sake of brevity, some specific cases of undeniable interest are not considered here. However, mention should be made of the problem dealt with in References 6-9, referred to in the introduction ($\mu_1 = \mu_2$, $\Xi_1 \neq \Xi_2$, $\Sigma_1 \neq \Sigma_2$).

Again for brevity the case of more than two populations is not examined here.

DECOMPOSITION OF DISCRIMINATION CRITERIA

In the well-known case when $y \sim N(y \mid \mu_i, \Sigma_i)$, $i = 1, 2$, $K = 1$, the DR is based on the quantity

$$-d^2(y, \mu_1) + d^2(y, \mu_2) + \ln(|\Sigma_2| / |\Sigma_1|)$$

If $\Sigma_1 = \Sigma_2 = \Sigma$, the Mahalanobis distance

$$d^2(\mu_1, \mu_2) = d^2(\mu_2, \mu_1) = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)$$

plays a crucial role in calculating the probability of misclassification. However, in general, the Mahalanobis distance alone is not a complete index of the distance between $\omega_1$ and $\omega_2$: when, for instance, $\mu_1 = \mu_2$ and $\Sigma_1 \neq \Sigma_2$, we have $(\mu_1 - \mu_2)^T \Sigma_i^{-1} (\mu_1 - \mu_2) = 0$, $i = 1, 2$, even if $\omega_1$ and $\omega_2$ differ greatly.

It is easily shown that the expected value of the random quantity on the left-hand side of the DR concerned, when $y$ comes from the population $\omega_i$, is equal (apart from a few known constants) to the Kullback-Leibler distance between $\omega_i$ and $\omega_j$. In fact we have

$$I(\omega_i : \omega_j) = \tfrac{1}{2} \int [-d^2(y, \mu_i) + d^2(y, \mu_j) + \ln(|\Sigma_j| / |\Sigma_i|)]\, N(y \mid \mu_i, \Sigma_i)\, dy$$

In a few steps we obtain

$$I(\omega_i : \omega_j) = \tfrac{1}{2}\left[(\mu_i - \mu_j)^T \Sigma_j^{-1} (\mu_i - \mu_j) + \mathrm{tr}\{\Sigma_i \Sigma_j^{-1}\} - v + \ln(|\Sigma_j| / |\Sigma_i|)\right], \quad i \neq j = 1, 2$$

which shows that the Mahalanobis and Kullback-Leibler distances coincide (up to the constant factor $\tfrac{1}{2}$) iff $\Sigma_1 = \Sigma_2 = \Sigma$.
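The relation between the two distances can be checked with a short computation. The following sketch, assuming NumPy, uses the standard closed form of the Kullback-Leibler distance between two multinormal densities (with the usual factor 1/2); the function name `kl_normal` and the numerical values are hypothetical.

```python
import numpy as np

def kl_normal(mu_i, Sigma_i, mu_j, Sigma_j):
    """Kullback-Leibler distance I(w_i : w_j) between N(mu_i, Sigma_i) and N(mu_j, Sigma_j)."""
    v = len(mu_i)
    diff = mu_i - mu_j
    Sj_inv = np.linalg.inv(Sigma_j)
    maha = diff @ Sj_inv @ diff                  # Mahalanobis term w.r.t. Sigma_j
    trace = np.trace(Sigma_i @ Sj_inv)
    logdet_i = np.linalg.slogdet(Sigma_i)[1]     # ln|Sigma_i|
    logdet_j = np.linalg.slogdet(Sigma_j)[1]     # ln|Sigma_j|
    return 0.5 * (maha + trace - v + logdet_j - logdet_i)

mu_1 = np.array([0.0, 0.0])
mu_2 = np.array([1.0, 2.0])
I2 = np.eye(2)

# Equal covariances: the distance is half the Mahalanobis distance and is symmetric
equal_cov = kl_normal(mu_1, I2, mu_2, I2)

# Equal means, different covariances: the Mahalanobis term vanishes,
# yet the Kullback-Leibler distance is positive and non-symmetric
same_mean_12 = kl_normal(mu_1, I2, mu_1, 4 * I2)
same_mean_21 = kl_normal(mu_1, 4 * I2, mu_1, I2)
```

The second case illustrates why the Mahalanobis distance alone is not a complete index when the covariance matrices differ.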

In the case at hand, it seems clear that we require a distance that takes into account not only the shape ($\Sigma_i$) of the centres $\xi_\gamma$, but also the shape ($\Xi_i$) of the group $\gamma$.

Recalling (5), we consider the Kullback-Leibler distance $I_{\bar{y},S}(\omega_i : \omega_j)$ between the joint p.d.f.s of $\bar{y}$ and $S$ under $\omega_i$ and $\omega_j$.

Since $\bar{y}$ and $S$ are independent, on the strength of Reference 13, Theorem 2.1, p. 12, we obtain

$$I_{\bar{y},S}(\omega_i : \omega_j) = I_{\bar{y}}(\omega_i : \omega_j) + I_S(\omega_i : \omega_j), \quad (i \neq j = 1, 2)$$

where (see Reference 13, p. 189)

$$I_{\bar{y}}(\omega_i : \omega_j) = \tfrac{1}{2}\left[\ln(|\Psi_j| / |\Psi_i|) + \mathrm{tr}\{\Psi_i(\Psi_j^{-1} - \Psi_i^{-1})\} + K(\mu_i - \mu_j)^T \Psi_j^{-1} (\mu_i - \mu_j)\right]$$

$$I_S(\omega_i : \omega_j) = \tfrac{1}{2}(K - 1)\left[\ln(|\Xi_j| / |\Xi_i|) + \mathrm{tr}\{\Xi_i(\Xi_j^{-1} - \Xi_i^{-1})\}\right]$$

It can be readily asserted that the distances $I_{\bar{y}}$ and $I_S$, which only depend on sample size $K$ (subordinately to $\mu_i$, $\Xi_i$ and $\Sigma_i$), are invariant with respect to any non-singular linear transformation of the vector $y$.

It is easily checked that the Kullback-Leibler distance of $\omega_i$ from $\omega_j$ decomposes the discrimination criteria into the two quantities $I_{\bar{y}}$ and $I_S$.

This property is particularly useful both for drawing comparisons between procedures relating to different problems and for estimating the relative discriminant contribution (RDC) of $d^2$ and $\delta^2$ to the procedure used. If, for example, it happens that $\Xi_i = \Xi_j$, $\mu_i \neq \mu_j$ and $\Sigma_i \neq \Sigma_j$, the greater RDC goes to the distances $d^2$, since $\delta^2(S, \Xi_i) = \delta^2(S, \Xi_j)$. In the case briefly mentioned at the end of the preceding section, $\mu_i = \mu_j$, $\Xi_i \neq \Xi_j$ and $\Sigma_i \neq \Sigma_j$, the greater RDC goes to the distance $\delta^2$. The fact that the Kullback-Leibler distance is non-symmetric (generally $I_{\cdot}(\omega_i : \omega_j) \neq I_{\cdot}(\omega_j : \omega_i)$) does not in itself prevent it from being used; the difference between the two populations will be expressed by two distances.

The term RDC denotes the following indices:

$$\rho_{\bar{y},K}(\omega_i : \omega_j) = I_{\bar{y}}(\omega_i : \omega_j) / I_{\bar{y},S}(\omega_i : \omega_j)$$

$$\rho_{S,K}(\omega_i : \omega_j) = I_S(\omega_i : \omega_j) / I_{\bar{y},S}(\omega_i : \omega_j)$$

which also only depend on $K$. For $K \to \infty$ the RDCs converge to the following values:

$$\rho_{\bar{y},\infty}(\omega_i : \omega_j) = (\mu_i - \mu_j)^T \Sigma_j^{-1} (\mu_i - \mu_j) / R$$

$$\rho_{S,\infty}(\omega_i : \omega_j) = \left[\ln(|\Xi_j| / |\Xi_i|) + \mathrm{tr}\{\Xi_i(\Xi_j^{-1} - \Xi_i^{-1})\}\right] / R$$

with

$$R = (\mu_i - \mu_j)^T \Sigma_j^{-1} (\mu_i - \mu_j) + \ln(|\Xi_j| / |\Xi_i|) + \mathrm{tr}\{\Xi_i(\Xi_j^{-1} - \Xi_i^{-1})\}$$
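The RDCs and their limiting values can be computed as in the following sketch, assuming NumPy. The finite-$K$ function follows the expressions for $I_{\bar{y}}$ and $I_S$ as given in the text (including the factor $K$ in the Mahalanobis term). The form $\Psi_i = \Sigma_i + K^{-1}\Xi_i$ used below is an assumption (the covariance of the group mean under hypotheses 1-3), and the function names and numerical values are hypothetical.

```python
import numpy as np

def shape_term(A_i, A_j):
    """ln(|A_j|/|A_i|) + tr{A_i (A_j^{-1} - A_i^{-1})}."""
    v = A_i.shape[0]
    logdet_i = np.linalg.slogdet(A_i)[1]
    logdet_j = np.linalg.slogdet(A_j)[1]
    return logdet_j - logdet_i + np.trace(A_i @ np.linalg.inv(A_j)) - v

def rdc(mu_i, mu_j, Sigma_i, Sigma_j, Xi_i, Xi_j, K):
    """Relative discriminant contributions (rho_ybar_K, rho_S_K)."""
    # ASSUMPTION: Psi = Sigma + Xi / K (covariance of the group mean)
    Psi_i = Sigma_i + Xi_i / K
    Psi_j = Sigma_j + Xi_j / K
    diff = mu_i - mu_j
    I_y = 0.5 * (shape_term(Psi_i, Psi_j) + K * (diff @ np.linalg.solve(Psi_j, diff)))
    I_S = 0.5 * (K - 1) * shape_term(Xi_i, Xi_j)
    total = I_y + I_S
    return I_y / total, I_S / total

def rdc_limits(mu_i, mu_j, Sigma_j, Xi_i, Xi_j):
    """Limiting values of the RDCs for K -> infinity."""
    diff = mu_i - mu_j
    maha = diff @ np.linalg.solve(Sigma_j, diff)
    shape = shape_term(Xi_i, Xi_j)
    R = maha + shape
    return maha / R, shape / R

# Illustrative (made-up) parameters
mu_1, mu_2 = np.array([0.0, 0.0]), np.array([1.0, 0.5])
Sigma_1 = np.array([[1.0, 0.2], [0.2, 1.0]])
Sigma_2 = np.array([[2.0, 0.0], [0.0, 1.0]])
Xi_1 = 0.5 * np.eye(2)
Xi_2 = np.array([[1.0, 0.3], [0.3, 2.0]])

rho_y, rho_S = rdc(mu_1, mu_2, Sigma_1, Sigma_2, Xi_1, Xi_2, K=10)
lim_y, lim_S = rdc_limits(mu_1, mu_2, Sigma_2, Xi_1, Xi_2)
```

By construction the two contributions sum to one for every $K$, and for large $K$ the finite-$K$ values approach the limits.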

In the example, illustrated later, the role played by the RDCs will be quantitatively shown.

ESTIMATE OF MISCLASSIFICATION

With regard to the main basic problem of the probability of misclassification (MC), very little is found in the literature which can be used directly.

In particular it is neither possible to reduce the problem to quadratic discrimination, according to the geometric approach due to Clunies-Ross [14] and subsequently resumed by Anderson and Bahadur [15], nor to equal-mean discrimination [6-9].

On the other hand, we cannot resort to the Fukunaga [16] method either, owing to the difficulty in finding suitable algebraic operators which, applied to the vector $y$, transform the covariance matrices $\Xi_i$ and $\Sigma_i$ into the same number of diagonal matrices. Lastly, recourse to a series expansion such as that obtained by Okamoto [17] is also to be discarded on account of its complexity.

Let

$$\alpha = \mathrm{Prob}(\gamma \subset \omega_2 \mid \omega_1) \quad \text{and} \quad \beta = \mathrm{Prob}(\gamma \subset \omega_1 \mid \omega_2)$$

be the two MCs and

$$\varepsilon = p(\omega_1)\alpha + p(\omega_2)\beta$$

the total misclassification (TMC): $\alpha$ and $\beta$ (and hence $\varepsilon$) depend on $K$.

Two methods have been used for estimating $\alpha$ and $\beta$. The first is based on the usual simulation methods; the second finds a region $D_K$ contained in the open square $(0 < \alpha < 1) \cap (0 < \beta < 1)$ such that $(\alpha, \beta) \in D_K$.

The region $D_K$ is defined by the known Chernoff [19], Kullback [13] and Kailath [20] inequalities, where $s \in [0, 1]$ is such that the right-hand side of (9) is minimum.

Since $\bar{y}$ and $S$ are sufficient estimates and independent, the quantities entering these inequalities follow from the properties described in Reference 19 and from Theorem 1 in Reference 18.

In the example illustrated below, the practical use of $D_K$ is shown.

A NON-PARAMETRIC APPROACH

It can often happen that the populations to discriminate are composed of groups that are strongly heterogeneous in mean and variance, and in any case their probabilistic structure is unknown or too complicated. In this case, it may be inconvenient, or extremely difficult, to construct a probabilistic model based on parametric hypotheses of the type given in the second section of this paper.

Therefore a non-parametric approach, based solely on the concept of metrics, is certainly more convenient. In other words, as a DR of type (6) cannot be constructed (ratio between probabilities) we attempt to obtain a DR formally similar to the DR (7) (comparison between distances).

The non-parametric approach considered here is the generalization to the case of $K$ observations of Sebestyen's approach [1, 2], which considers instead one single observation $y$ to discriminate among $g$ given populations $\omega_i$ ($i = 1, \ldots, g$, with $g \geq 2$).

No hypothesis is advanced in References 1 and 2 as to the distribution of $y$ (or other quantities) and the construction of the DR, of type (12), is based solely on $g$ metrics: as many as the $g$ populations to discriminate. These $g$ metrics are obtained by means of a suitable optimality criterion.

For ease of reading Sebestyen's approach is recalled below with $g = 2$. Given the symmetric positive definite matrix $M_i$ and the pair of vectors $x, y \in \mathbb{R}^v$, the $M_i$-distance [1, 2] is

$$d_{M_i}(x, y) = (x - y)^T M_i (x - y)$$

The $M_i$-distance of an individual $y$ from $\omega_i$, consisting of $C_i$ individuals $x_{in}$ ($i = 1, 2$; $n = 1, \ldots, C_i$), is defined as

$$d_{M_i}(y, \omega_i) = C_i^{-1} \sum_{n=1}^{C_i} d_{M_i}(y, x_{in}) = C_i^{-1} \sum_{n=1}^{C_i} (y - x_{in})^T M_i (y - x_{in}) = \mathrm{tr}\{M_i S_i\} + (y - \bar{x}_i)^T M_i (y - \bar{x}_i)$$

with $\bar{x}_i$ and $S_i$, $i = 1, 2$, the means vector and covariance matrix, respectively, of the populations $\omega_i$.

The $M_i$-aggregation of $\omega_i$ is defined by the mean distance of each of its points from all the others:

$$d_{M_i}(\omega_i) = [C_i(C_i - 1)]^{-1} \sum_{n=1}^{C_i} \sum_{l=1}^{C_i} d_{M_i}(x_{in}, x_{il}) = (C_i - 1)^{-1} \sum_{n=1}^{C_i} d_{M_i}(x_{in}, \omega_i) = 2 C_i (C_i - 1)^{-1} \mathrm{tr}\{M_i S_i\}$$

Sebestyen's approach concludes with the search for those metrics $M_i^*$ which minimize the $M_i$-aggregations $d_{M_i}(\omega_i)$, ($i = 1, 2$).

The matrix $M_i^*$ can be found by solving the following optimum problems:

$$\min_{M_i} \{d_{M_i}(\omega_i)\} \quad \text{subject to} \quad |M_i| = 1, \quad (i = 1, 2)$$

whose solutions are [1, 2]

$$M_i^* = |S_i|^{1/v} S_i^{-1}, \quad (i = 1, 2).$$
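The optimal metric can be verified numerically. The following is a minimal sketch, assuming NumPy and a covariance matrix computed with divisor $C_i$; the function names `sebestyen_metric` and `distance_to_population` and the simulated data are hypothetical.

```python
import numpy as np

def sebestyen_metric(S):
    """Unit-determinant metric M* = |S|^(1/v) S^{-1}."""
    v = S.shape[0]
    return np.linalg.det(S) ** (1.0 / v) * np.linalg.inv(S)

def distance_to_population(y, X, M):
    """Mean M-distance of the point y from the rows of X."""
    diffs = y - X
    return float(np.mean(np.einsum('nj,jk,nk->n', diffs, M, diffs)))

rng = np.random.default_rng(3)
v = 3
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 2.0, 0.3],
                   [0.0, 0.0, 0.7]])
X = rng.normal(size=(40, v)) @ mixing        # illustrative population sample
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False, bias=True)       # covariance with divisor C

M_star = sebestyen_metric(S)
y = np.array([0.5, -1.0, 0.2])               # illustrative individual to classify

d_mean = distance_to_population(y, X, M_star)
# Closed form: tr{M S} + (y - xbar)^T M (y - xbar)
d_closed = np.trace(M_star @ S) + (y - xbar) @ M_star @ (y - xbar)
```

The constraint $|M_i| = 1$ is satisfied by construction, and $\mathrm{tr}\{M_i^* S_i\} = v|S_i|^{1/v}$ cannot exceed the value $\mathrm{tr}\{S_i\}$ obtained with the identity metric, which also has unit determinant.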

The $M_i^*$-distance of an individual $y$ from $\omega_i$ thus becomes

$$d_{M_i^*}(y, \omega_i) = |S_i|^{1/v} \{v + (y - \bar{x}_i)^T S_i^{-1} (y - \bar{x}_i)\}, \quad (i = 1, 2).$$

Similarly to (6) (see Reference 21, p. 140), Sebestyen's non-parametric approach leads to the following DR:

$$-d(y, \omega_1) + d(y, \omega_2) \gtrless t \;\Rightarrow\; y \in \begin{cases} \omega_1 \\ \omega_2 \end{cases} \tag{11}$$

where we can still assume $t = [c(1 \mid 2)\,p(\omega_2)]/[c(2 \mid 1)\,p(\omega_1)]$.

Starting from the situation illustrated at the beginning of the second section, Sebestyen's approach can be generalized as follows.

Let $(\bar{x}_{in}, S_{in})$ and $(\bar{y}, S_\gamma)$ be the barycentre and covariance matrix, respectively, of $\gamma_{in}$ and $\gamma$.

The $M_{in}^*$-distance of $y_k \in \gamma$ from the group $\gamma_{in}$ is given by

$$d_{M_{in}^*}(y_k, \gamma_{in}) = |S_{in}|^{1/v} \{v + (y_k - \bar{x}_{in})^T S_{in}^{-1} (y_k - \bar{x}_{in})\}$$

since now

$$M_{in}^* = |S_{in}|^{1/v} S_{in}^{-1}$$

It follows that the mean of the $M_{in}^*$-distance of $K$ observations $(y_1, \ldots, y_K) \in \gamma$ from the group $\gamma_{in}$ is

$$d_{M_{in}^*}(\gamma, \gamma_{in}) = K^{-1} |S_{in}|^{1/v} \sum_{k=1}^{K} \{v + (y_k - \bar{x}_{in})^T S_{in}^{-1} (y_k - \bar{x}_{in})\}$$

and since

$$K^{-1} \sum_{k=1}^{K} (y_k - \bar{x}_{in})^T S_{in}^{-1} (y_k - \bar{x}_{in}) = \mathrm{tr}\{S_{in}^{-1} S_\gamma\} + (\bar{y} - \bar{x}_{in})^T S_{in}^{-1} (\bar{y} - \bar{x}_{in})$$

we obtain

$$d_{M_{in}^*}(\gamma, \gamma_{in}) = |S_{in}|^{1/v} \{v + \mathrm{tr}\{S_{in}^{-1} S_\gamma\} + (\bar{y} - \bar{x}_{in})^T S_{in}^{-1} (\bar{y} - \bar{x}_{in})\}$$

The mean distance of $\gamma$ from the groups $\gamma_{in} \subset \omega_i$ is given by

$$d(\gamma, \omega_i) = N_i^{-1} \sum_{n=1}^{N_i} d_{M_{in}^*}(\gamma, \gamma_{in}) \tag{12}$$
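The generalized distances can be sketched in a few lines, assuming NumPy and covariance matrices computed with divisor equal to the group size (which is what the trace identity above requires). The function names, the unweighted mean over the groups of a population, the simulated data and the threshold value `t=1.0` (equal priors and equal losses) are assumptions made for illustration.

```python
import numpy as np

def group_distance(Y, X):
    """Mean M*-distance of the K rows of Y (group gamma) from the group whose rows are X."""
    v = X.shape[1]
    xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)          # covariance of the reference group
    S_gamma = np.cov(Y, rowvar=False, bias=True)    # covariance of the group to classify
    S_inv = np.linalg.inv(S)
    diff = ybar - xbar
    scale = np.linalg.det(S) ** (1.0 / v)
    return float(scale * (v + np.trace(S_inv @ S_gamma) + diff @ S_inv @ diff))

def population_distance(Y, groups):
    """Unweighted mean of the group distances over the groups of one population."""
    return float(np.mean([group_distance(Y, X) for X in groups]))

def classify(Y, groups_1, groups_2, t):
    """Return 1 or 2 by comparing the two population distances with the threshold t."""
    score = -population_distance(Y, groups_1) + population_distance(Y, groups_2)
    return 1 if score >= t else 2

rng = np.random.default_rng(2)
groups_1 = [rng.normal(loc=0.0, size=(30, 2)) for _ in range(4)]   # made-up population 1
groups_2 = [rng.normal(loc=5.0, size=(30, 2)) for _ in range(4)]   # made-up population 2
Y = rng.normal(loc=0.0, size=(6, 2))                               # group to classify

label = classify(Y, groups_1, groups_2, t=1.0)
```

With these well-separated simulated populations the group `Y` is allocated to the first population.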

Similarly to the DR (11) we can construct the following DR:

$$-d(\gamma, \omega_1) + d(\gamma, \omega_2) \gtrless t \;\Rightarrow\; \gamma \subset \begin{cases} \omega_1 \\ \omega_2 \end{cases} \tag{13}$$

EXAMPLE

The application presented here concerns, as mentioned in the Introduction, the two species of insects Sesamia cretica and S. nonagrioides. The problem of discriminating between the two species is of some consequence in that the way of treating possible infestation differs for the two species examined.

The eggs, perfectly classified by electron microscopy, are known with respect to a common set of variables measured with an optical microscope. Hence the covariance structure of the discriminant variables describes and measures the difference between the various groups in each population.

Once we have chosen the DR (7) or (13), i.e. the parametric or the non-parametric approach, it will be useful for the entomologist to know the number $K$ of eggs (all coming from the same clutch to discriminate) that must be sampled in order that the MCs $\alpha$ and $\beta$ are reasonably low.

For this purpose the MCs $\alpha$ and $\beta$ have been estimated as functions of $K$, in the two different approaches (see Tables I and III).

Both populations are known with respect to the following three variables:

(i) diameter of the maximum horizontal section

(ii) number of micropylic orifices (small holes in the middle of the main corolla)

(iii) number of primary structures (petal-shaped relief).

In the natural habitat where the two species were found and studied [3], it is reasonable to assume $p(\omega_1) = p(\omega_2) = \tfrac{1}{2}$; moreover, $c(2 \mid 1) = c(1 \mid 2)$.

The statistical basis consisted of 490 observations, of which 250 pertaining to S. cretica come from five clutches of eggs (50 eggs per clutch) whereas the remaining 240 refer to S. nonagrioides, again from five clutches of eggs (50 eggs from each of the first four clutches and 40 from the fifth).

We assume that the vector of the three variables has a multivariate normal distribution, and more generally it is supposed that all those conditions occur such that (5) holds and likewise the DR (7).

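For $\mu_i$, $\Xi_i$ and $\Sigma_i$ we used the estimates

$$\hat{\mu}_i = T_i^{-1} \sum_{n=1}^{N_i} J_{in} \bar{x}_{in}$$

$$\hat{\Xi}_i = (T_i - N_i)^{-1} \sum_{n=1}^{N_i} \sum_{j=1}^{J_{in}} (x_{inj} - \bar{x}_{in})(x_{inj} - \bar{x}_{in})^T$$

$$\Sigma_i^* = (N_i - 1)^{-1} \sum_{n=1}^{N_i} J_{in} (\bar{x}_{in} - \hat{\mu}_i)(\bar{x}_{in} - \hat{\mu}_i)^T$$

with

$$T_i = \sum_{n=1}^{N_i} J_{in}, \qquad \bar{x}_{in} = J_{in}^{-1} \sum_{j=1}^{J_{in}} x_{inj}$$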

Table I. Estimate of MCs $\alpha$ and $\beta$ using the parametric approach

K     E_1    alpha-hat    E_2    beta-hat
2     108    0.169 - 1    576    0.900 - 1
3      40    0.625 - 2    324    0.506 - 1
4      25    0.391 - 2    151    0.236 - 1
5      22    0.344 - 2     71    0.111 - 1
6      11    0.172 - 2     26    0.306 - 2
7       7    0.109 - 2     21    0.328 - 2
8       6    0.938 - 3     18    0.281 - 2
9       4    0.625 - 3     12    0.187 - 2
10      2    0.313 - 3      4    0.625 - 3

Table II. Discriminant Kullback-Leibler distances

K     I_ybar(w1:w2)    I_S(w1:w2)    I_ybar(w2:w1)    I_S(w2:w1)
2     0.499 + 1        0.145 + 1     0.347 + 2        0.530 + 1
3     0.651 + 1        0.291 + 1     0.536 + 2        0.106 + 2
4     0.792 + 1        0.436 + 1     0.732 + 2        0.159 + 2
5     0.926 + 1        0.581 + 1     0.930 + 2        0.212 + 2
6     0.106 + 2        0.726 + 1     0.113 + 3        0.265 + 2
8     0.131 + 2        0.102 + 2     0.154 + 3        0.371 + 2
10    0.156 + 2        0.131 + 2     0.195 + 3        0.477 + 2
12    0.181 + 2        0.160 + 2     0.236 + 3        0.583 + 2
14    0.205 + 2        0.189 + 2     0.277 + 3        0.689 + 2
16    0.229 + 2        0.218 + 2     0.318 + 3        0.794 + 2
20    0.277 + 2        0.276 + 2     0.401 + 3        0.101 + 3
24    0.325 + 2        0.334 + 2     0.484 + 3        0.122 + 3
28    0.373 + 2        0.392 + 2     0.567 + 3        0.143 + 3
32    0.421 + 2        0.450 + 2     0.650 + 3        0.164 + 3
40    0.517 + 2        0.567 + 2     0.816 + 3        0.207 + 3
48    0.612 + 2        0.683 + 2     0.983 + 3        0.249 + 3
56    0.708 + 2        0.799 + 2     0.115 + 4        0.291 + 3
64    0.803 + 2        0.915 + 2     0.135 + 4        0.334 + 3

Figure 1. Relative discriminant contribution $\rho_{S,K}(\omega_i : \omega_j)$ versus $K$

For the matrices $\Sigma_i$, as opposed to $\mu_i$ and $\Xi_i$, it was not possible to resort to the unbiased estimate

$$\hat{\Sigma}_i = J^{-1}(\Sigma_i^* - \hat{\Xi}_i)$$

which holds in the balanced case ($J_{in} = J$ = constant) (see Reference 22, p. 226), or to the much more complicated estimate in the unbalanced case. This is due to the fact that the $\hat{\Sigma}_i$ may either have negative variances, or be negative definite.

Figure 2. Regions $D_K(\alpha, \beta)$ for $K = 1, \ldots, 12$ (panels (a)-(d); horizontal axis $\alpha$, vertical axis $\beta$)

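In any case, the use of the estimate $\Sigma_i^*$ seems all the less incorrect if we consider that in our example the $J_{in}$ are much greater than $N_i$.

The numerical values presented below are in exponential notation (the matrices are symmetric and only the lower triangle is shown):

$$\hat{\mu}_1^T = [\,0.689 \quad 0.147+2 \quad 0.443+1\,], \qquad \hat{\mu}_2^T = [\,0.778 \quad 0.163+2 \quad 0.569+1\,],$$

$$\hat{\Xi}_1 = \begin{bmatrix} 0.402-3 & & \\ 0.316-1 & 0.299+1 & \\ 0.174-1 & 0.135+1 & 0.120+1 \end{bmatrix}, \qquad \hat{\Xi}_2 = \begin{bmatrix} 0.151-2 & & \\ 0.479-1 & 0.255+1 & \\ 0.431-1 & 0.199+1 & 0.245+1 \end{bmatrix},$$

$$\Sigma_1^* = \begin{bmatrix} 0.159-2 & & \\ 0.894-1 & 0.997+1 & \\ 0.983-1 & 0.665+1 & 0.668+1 \end{bmatrix}, \qquad \Sigma_2^* = \begin{bmatrix} 0.407-2 & & \\ 0.542-1 & 0.147+1 & \\ 0.314-1 & 0.771 & 0.415+1 \end{bmatrix}.$$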

The estimation of the MCs $\alpha$ and $\beta$ based on simulation is relatively simple, despite the large computational effort involved.

For each population $\omega_i$ ($i = 1, 2$), $N$ barycentres $\xi_{in}$ ($n = 1, \ldots, N$) are generated, in accordance with (3). For each $\xi_{in}$ a group of size $K$, $y_1, \ldots, y_K$, is generated, in accordance with (2).

Determine $E_i$, $i = 1, 2$, the number of incorrect classifications from $\omega_i$ made by the DR (7), and estimate $\alpha$ and $\beta$: $\hat{\alpha} = E_1/N$, $\hat{\beta} = E_2/N$.
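The simulation procedure just described can be sketched as follows, assuming NumPy, equal priors and equal losses (so that the group is allocated to the population with the larger likelihood). The covariance $\Psi_i = \Sigma_i + K^{-1}\Xi_i$ of the group mean is an assumption derived from hypotheses 1-3; the function names and the parameter values are hypothetical and are not the estimates of this example.

```python
import numpy as np

def log_score(Y, mu, Sigma, Xi):
    """Log joint density of (ybar, S) up to terms that do not depend on the population."""
    K = Y.shape[0]
    ybar = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)                    # divisor K - 1
    Psi = Sigma + Xi / K                           # ASSUMPTION: covariance of ybar
    diff = ybar - mu
    d2 = diff @ np.linalg.solve(Psi, diff)         # Mahalanobis distance of ybar from mu
    delta2 = (K - 1) * np.trace(np.linalg.solve(Xi, S))   # shape index
    logdet_psi = np.linalg.slogdet(Psi)[1]
    logdet_xi = np.linalg.slogdet(Xi)[1]
    return -0.5 * (logdet_psi + d2 + (K - 1) * logdet_xi + delta2)

def estimate_mc(params_1, params_2, K, N, rng):
    """Monte Carlo estimates (alpha_hat, beta_hat) = (E_1 / N, E_2 / N)."""
    rates = []
    for true, other in ((params_1, params_2), (params_2, params_1)):
        mu, Sigma, Xi = true
        errors = 0
        for _ in range(N):
            xi = rng.multivariate_normal(mu, Sigma)        # barycentre, law (3)
            Y = rng.multivariate_normal(xi, Xi, size=K)    # group of size K, law (2)
            if log_score(Y, *other) > log_score(Y, *true): # allocated to the wrong population
                errors += 1
        rates.append(errors / N)
    return tuple(rates)

rng = np.random.default_rng(4)
params_1 = (np.array([0.0, 0.0]), np.eye(2), np.eye(2))    # made-up (mu, Sigma, Xi)
params_2 = (np.array([3.0, 3.0]), np.eye(2), np.eye(2))
alpha_hat, beta_hat = estimate_mc(params_1, params_2, K=4, N=500, rng=rng)
```

Repeating the call for several values of $K$ gives estimated misclassification rates as functions of the group size.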

The estimate of $\alpha$ and $\beta$ was restricted to the cases $K = 2, \ldots, 10$ with $N = 6400$. Table I shows the results obtained.

The fact that we found $\hat{\alpha}, \hat{\beta} < 0.001$ for $K = 10$ indicates a good discrimination between the two populations examined.

Table II shows the discriminant Kullback-Leibler distance versus sample size $K$; Figure 1 gives the RDC $\rho_{S,K}(\omega_i : \omega_j)$ versus $K$.

The significant role played by the distances $\delta^2$ in discriminating between the two populations $\omega_i$ clearly emerges (at least in the example here concerned) from the increasingly asymptotic behaviour of the contributions $\rho_{S,K}$.

As for the second procedure used for determining the MCs $\alpha$ and $\beta$ (by constructing the regions $D_K$), only the cases $K = 1, \ldots, 12$ were considered (see Figures 2(a)-(d)). The regions $D_K$ show already for $K \geq 10$ that the MCs are very small.

Finally, with regard to the non-parametric approach DR (13), even lower estimates $\hat{\alpha}$ and $\hat{\beta}$ are obtained.

The estimates $\hat{\alpha}$ and $\hat{\beta}$ were obtained for $K = 1, \ldots, 10$ with $N = 17,600$ using the usual simulation methods (of the jack-knife type). Despite the high value of $N$, none of the simulated groups of size $K \geq 6$ were misclassified. Table III shows the results obtained.

Table III. Estimate of MCs $\alpha$ and $\beta$ using the non-parametric approach

K       E_1    alpha-hat    E_2    beta-hat
2        16    0.909 - 3    261    0.157 - 1
3         3    0.170 - 3     50    0.284 - 2
4         1    0.568 - 4      6    0.341 - 3
5         0    0              3    0.170 - 3
6-10      0    0              0    0

REFERENCES

1. J.-M. Romeder, Méthodes et programmes d'analyse discriminante, Dunod, Paris, 1973.
2. G. Sebestyen, Decision-Making Processes in Pattern Recognition, Macmillan Publishing Co., Inc., New York, 1962.
3. C. Pucci and A. Forcina, 'Morphological differences between the eggs of Sesamia cretica (Led.) and S. nonagrioides (Lef.) (Lepidoptera: Noctuidae)', Int. J. Insect Morphol. & Embryol., 13, (3), 249-253 (1984).
4. A. Forcina, 'Analisi discriminatoria con osservazioni strutturate in gruppi o grappoli' (in Italian), Metron, XLI, 213-221 (1983).
5. P. Stocks, 'A biometric investigation of twins, Part II', Ann. Eugen., 5, 1-55 (1933).
6. L. S. Penrose, 'Some notes on discrimination', Ann. Eugen., 13, 228-237 (1947).
7. C. A. B. Smith, 'Some examples of discrimination', Ann. Eugen., 13, 272-283 (1947).
8. M. S. Bartlett and N. W. Please, 'Discrimination in the case of zero mean differences', Biometrika, 50, 17-21 (1963).
9. M. M. Desu and S. Geisser, 'Methods and applications of equal-mean discrimination', in T. Cacoullos (ed.), Discriminant Analysis and Applications, Academic Press, New York, 1973, pp. 139-159.
10. T. W. Anderson, An Introduction to Multivariate Statistical Analysis, Wiley, New York, 1958.
11. A. M. Kshirsagar, Multivariate Analysis, M. Dekker Inc., New York, 1972.
12. P. A. Lachenbruch, Discriminant Analysis, Hafner Press, New York, 1975.
13. S. Kullback, Information Theory and Statistics, Wiley, New York, 1959.
14. C. W. Clunies-Ross and R. H. Riffenburgh, 'Geometry and linear discrimination', Biometrika, 47, 185-189 (1960).
15. T. W. Anderson and R. R. Bahadur, 'Classification into two multivariate normal distributions with different covariance matrices', Ann. Math. Stat., 33, 420-431 (1962).
16. K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York, 1972.
17. M. Okamoto, 'An asymptotic expansion for the distribution of the linear discriminant function', Ann. Math. Stat., 34, 1286-1301 (1963).
18. F. Bertolino and W. Racugno, 'On error bounds in choosing between two simple hypotheses', Metron, XLV, (1-2), 223-248 (1987).
19. H. Chernoff, 'A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations', Ann. Math. Stat., 23, 493-507 (1952).
20. T. Kailath, 'The divergence and Bhattacharyya distance measures in signal selection', IEEE Trans. Commun. Technol., COM-15, (1), 52-60 (1967).
21. J. D. Broffitt, 'Nonparametric classification', in P. R. Krishnaiah and L. N. Kanal (eds), Handbook of Statistics, Vol. 2, North-Holland Publishing Company, Amsterdam, 1982, pp. 139-168.
22. H. Scheffé, The Analysis of Variance, Wiley, New York, 1959.
23. A. Wald, 'A note on the analysis of variance with unequal class frequencies', Ann. Math. Stat., 11, 96-100 (1940).
24. N. L. Johnson and S. Kotz, Continuous Univariate Distributions, Vols 1 and 2, Wiley, New York, 1970.