This article was downloaded by: [University of Leeds]
On: 08 November 2014, At: 18:10
Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954. Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Communications in Statistics - Simulation and Computation
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/lssp20

k-Means Algorithm in Statistical Shape Analysis
Getulio J. A. Amaral (a), Luiz H. Dore (b), Rosangela P. Lessa (c) & Borko Stosic (b)
(a) Departamento de Estatistica, Universidade Federal de Pernambuco, CCEN, Cidade Universitaria, Recife, Brasil
(b) Departamento de Estatística e Informática, Universidade Federal Rural de Pernambuco, Dois Irmãos, Brasil
(c) Departamento de Pesca, Universidade Federal Rural de Pernambuco, Dois Irmãos, Brasil
Published online: 10 May 2010.

To cite this article: Getulio J. A. Amaral, Luiz H. Dore, Rosangela P. Lessa & Borko Stosic (2010) k-Means Algorithm in Statistical Shape Analysis, Communications in Statistics - Simulation and Computation, 39:5, 1016-1026, DOI: 10.1080/03610911003765777

To link to this article: http://dx.doi.org/10.1080/03610911003765777

Taylor & Francis makes every effort to ensure the accuracy of all the information (the "Content") contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions



Communications in Statistics—Simulation and Computation®, 39: 1016–1026, 2010
Copyright © Taylor & Francis Group, LLC
ISSN: 0361-0918 print / 1532-4141 online
DOI: 10.1080/03610911003765777

k-Means Algorithm in Statistical Shape Analysis

GETULIO J. A. AMARAL1, LUIZ H. DORE2, ROSANGELA P. LESSA3, AND BORKO STOSIC2

1 Departamento de Estatistica, Universidade Federal de Pernambuco, CCEN, Cidade Universitaria, Recife, Brasil
2 Departamento de Estatística e Informática, Universidade Federal Rural de Pernambuco, Dois Irmãos, Brasil
3 Departamento de Pesca, Universidade Federal Rural de Pernambuco, Dois Irmãos, Brasil

In this work it is shown how the k-means method for clustering objects can be applied in the context of statistical shape analysis. Because the choice of a suitable distance measure is a key issue for shape analysis, the Hartigan and Wong k-means algorithm is adapted for this situation. Simulations on controlled artificial data sets demonstrate that distances on the pre-shape space are more appropriate than the Euclidean distance on the tangent space. Finally, results are presented of an application to a real problem of oceanography, which in fact motivated the current work.

Keywords Geodesic distance; Landmarks; Non-Euclidean spaces.

Mathematics Subject Classification 62H11.

1. Introduction

With the advent of easily accessible computing resources over the past decades, the development and application of diverse (supervised and unsupervised) clustering techniques have been steadily gaining momentum. The success of applications in diverse areas of knowledge has turned this field of research into a truly interdisciplinary focus of interest.

The current work was motivated by a concrete problem in oceanography: how the shape of fish specimens can be used for the classification of similar fish species. More precisely, the traditional criteria for classification of fish species introduced by Collette (1965) often turn out to be difficult to implement in practice, and reliable automated procedures for this task are highly desirable. In the current work the shape information (landmarks) is used to classify the specimens

Received May 19, 2008; Accepted March 9, 2010.
Address correspondence to Getulio J. A. Amaral, Departamento de Estatistica, Universidade Federal de Pernambuco, Av. Prof. Moraes Rego S/n, Cidade Universitaria, Recife 50670-901, Brasil; E-mail: [email protected]


into one of the two similar fish species, through an application of a modified Hartigan and Wong (1979) k-means algorithm. For further details about clustering analysis, see Hartigan (1975), Hastie et al. (2001), Johnson and Wichern (1992), and Ripley (1996).

To adapt the k-means method to clustering objects based on landmarks, it is necessary to consider some basic concepts of shape analysis. The shape of an object is what is left after allowing for the effects of translation, scale, and rotation. The seminal paper on this topic is Kendall (1977), and Kendall (1984) introduced a coordinate system for shapes. Goodall (1991) made a significant contribution to this field by introducing procrustes analysis and tests for one and two populations. Extensive accounts of shape analysis are given in the books by Small (1996), Dryden and Mardia (1998), and Kendall et al. (1999).

The article is organized as follows. Section 2 concerns basic concepts of statistical shape analysis. Section 3 shows how the k-means method may be adapted for shape analysis, and Sec. 4 presents a simulation study and a solution to a real problem of classification of fish species.

2. Shape Distances

To fix the notation used in the rest of this work, a brief overview is given in what follows of the definitions and concepts necessary for statistical shape analysis.

A configuration is represented by a k × m matrix X of Cartesian coordinates of the k landmarks in m dimensions. The effects of translation, scale, and rotation of X are removed one at a time.

The translation effect is removed by pre-multiplying X by the Helmert submatrix (see Dryden and Mardia, 1998, p. 34). The full Helmert matrix H_F is a k × k orthogonal matrix whose first row has all elements equal to 1/\sqrt{k}, and whose row j + 1, for j = 1, \ldots, k - 1, is

    (\, h_j, \ldots, h_j, \; -j h_j, \; 0, \ldots, 0 \,),

where the first j elements are equal to

    h_j = -\{ j(j+1) \}^{-1/2},

the (j + 1)th element is -j h_j, and the remaining k - j - 1 elements are zero. The location effect (translation) is removed by multiplying the configuration X by the (k - 1) × k Helmert submatrix H, which is the Helmert matrix H_F with the first row removed, and the scale effect by dividing this product by its norm:

    z^{(m)} = \frac{HX}{\|HX\|}.    (2.1)

The resulting (k - 1) × m matrix z^{(m)} is called a pre-shape, and it is invariant under the effects of translation and scaling (see Dryden and Mardia, 1998, p. 55). For the most common case of two-dimensional shapes (m = 2), this notation can be simplified by treating the pre-shape z as a complex vector. More precisely, when m = 2, the shape configuration

    X = \begin{pmatrix} x_{1,1} & x_{1,2} \\ \vdots & \vdots \\ x_{k,1} & x_{k,2} \end{pmatrix}

may be written in complex notation as

    z_0 = (z_{01}, \ldots, z_{0k})^T = (x_{1,1} + i x_{1,2}, \ldots, x_{k,1} + i x_{k,2})^T,

and the pre-shape expression (2.1) reduces to

    z^{(2)} = \frac{H z_0}{\sqrt{(H z_0)^* H z_0}}.

Next, the relevant spaces need to be defined. The pre-shape space is the space of all possible pre-shapes z^{(m)}. For planar pre-shapes, the pre-shape space is the complex sphere

    CS^{k-2} = \{\, z^{(2)} : (z^{(2)})^* z^{(2)} = 1, \; z^{(2)} \in \mathbb{C}^{k-1} \,\},

where \mathbb{C}^{k-1} is the complex space of dimension k - 1.

The shape [z^{(m)}] of a pre-shape z^{(m)} is the set of all rotated versions of z^{(m)}. For the case m = 2, the shape of z^{(2)} is given by

    [z^{(2)}] = \{\, z^{(2)} e^{i\theta} : 0 \le \theta < 2\pi \,\}.
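The Helmert submatrix and the pre-shape (2.1) are straightforward to compute. The sketch below uses NumPy; the function names are our own, not from the paper.

```python
import numpy as np

def helmert_submatrix(k):
    """(k-1) x k Helmert submatrix: the full Helmert matrix minus its first row.
    Row j has its first j entries equal to h_j = -1/sqrt(j(j+1)), the next
    entry equal to -j*h_j, and zeros thereafter."""
    H = np.zeros((k - 1, k))
    for j in range(1, k):
        hj = -1.0 / np.sqrt(j * (j + 1))
        H[j - 1, :j] = hj          # first j entries equal h_j
        H[j - 1, j] = -j * hj      # (j+1)-th entry; remaining entries stay zero
    return H

def preshape(X):
    """Pre-shape of a k x m configuration X: remove translation by the
    Helmert submatrix, then remove scale by dividing by the norm (eq. 2.1)."""
    HX = helmert_submatrix(X.shape[0]) @ X
    return HX / np.linalg.norm(HX)
```

By construction the result is invariant under translation and rescaling of the configuration, which can be checked directly.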

The shape space can be defined as the set of all possible shapes.

Another important concept that needs to be defined for shape analysis is the mean shape. For three- or higher-dimensional statistical shape analysis, calculating the mean shape is accomplished using the so-called generalized procrustes analysis (GPA) algorithm developed by Gower (1975). If X_1, \ldots, X_n is a random sample of configurations, the mean shape can be estimated by

    \hat{\mu}^{(m)} = \frac{1}{n} \sum_{i=1}^{n} X_i^P,

where X_i^P is the full procrustes fit of each X_i, obtained from the GPA algorithm (see Dryden and Mardia, 1998, pp. 90–91). On the other hand, for planar pre-shapes the mean shape is readily calculated by considering the n planar pre-shapes z_1^{(2)}, \ldots, z_n^{(2)} corresponding to the n configurations (observations) X_1, \ldots, X_n. Here it can be shown that the mean shape \hat{\mu}^{(2)} is the eigenvector associated with the largest eigenvalue of \sum_{i=1}^{n} z_i^{(2)} (z_i^{(2)})^*.

Another space that is commonly used in shape analysis is the tangent space, which represents a linearized version of the shape space. Although the use of this concept by construction brings about a certain loss of information (contained in the curvature of the original shape space), the advantage is that the Euclidean metric of the tangent space permits the use of standard multivariate analysis. The coordinates of the objects in the tangent space are obtained from the m-dimensional pre-shapes z^{(m)} and the corresponding mean shape \hat{\mu}^{(m)} using the expression

    v^{(m)} = \left[ I_{km-m} - \mathrm{vec}(\hat{\mu}^{(m)}) \, \mathrm{vec}(\hat{\mu}^{(m)})^T \right] \mathrm{vec}(z^{(m)} \Gamma),

where the vectorize operator vec(X) stacks the columns of a k × m matrix X to yield a km-vector, I is the identity matrix, and \Gamma minimizes \|\hat{\mu} - z\Gamma\|^2.

For the planar case, the tangent space is represented by the tangent plane T(\gamma) at the point \gamma, where \theta = -\arg(\gamma^* z) minimizes \|\gamma - z e^{i\theta}\|^2 (see Dryden and Mardia, 1998, p. 72). The partial tangent coordinates for planar shapes are then given by

    v^{(2)} = e^{i\theta} \left[ I_{k-1} - \gamma \gamma^* \right] z^{(2)},

where \gamma usually corresponds to \hat{\mu}^{(2)}.

The fundamental concept used for classification of objects within the shape analysis paradigm is the distance, which may be defined (and calculated) both for the pre-shape space and for the tangent space in a variety of ways. In this work, the four most common definitions of distance are considered: the full procrustes distance, the partial procrustes distance, the procrustes distance, and the Euclidean distance on the tangent space.
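The planar partial tangent coordinates above can be sketched in a few lines (NumPy; the function name is ours, and the pole mu is assumed to be a unit-norm complex (k-1)-vector):

```python
import numpy as np

def planar_tangent_coords(z, mu):
    """Partial tangent coordinates of a planar pre-shape z at the pole mu
    (both complex (k-1)-vectors):  v = e^{i*theta} (I - mu mu^*) z,
    with theta = -arg(mu^* z)."""
    theta = -np.angle(np.vdot(mu, z))      # np.vdot conjugates its first argument
    proj = z - mu * np.vdot(mu, z)         # (I - mu mu^*) z
    return np.exp(1j * theta) * proj
```

The rotation factor e^{i*theta} aligns z with the pole before projecting, so rotated copies of the same pre-shape get identical tangent coordinates.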

Consider two configurations X_1 and X_2, and let \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{m-1} \ge |\lambda_m| be the square roots of the eigenvalues of z_1^{(m)} (z_2^{(m)})^T z_2^{(m)} (z_1^{(m)})^T, where z_i^{(m)} is the pre-shape defined in (2.1) of the ith configuration X_i, and \lambda_m is the negative square root if and only if \det[\, z_1^{(m)} (z_2^{(m)})^T \,] < 0. The full procrustes distance, the procrustes distance, and the partial procrustes distance are then given by

    d_F^{(m)} = d_F(X_1, X_2) = \left\{ 1 - \left( \sum_{i=1}^{m} \lambda_i \right)^2 \right\}^{1/2},    (2.2)

    \rho^{(m)} = \rho(X_1, X_2) = \arccos\left( \sum_{i=1}^{m} \lambda_i \right),    (2.3)

and

    d_P^{(m)} = d_P(X_1, X_2) = \sqrt{2} \left( 1 - \sum_{i=1}^{m} \lambda_i \right)^{1/2},    (2.4)

respectively.

For planar shapes (m = 2), the formulas (2.2)–(2.4) can be simplified further. Let y = H^T H y_0 / \|H^T H y_0\| and w = H^T H w_0 / \|H^T H w_0\| be centered pre-shapes; then \|y\| = 1 = \|w\| and y^* 1_k = 0 = w^* 1_k. The distance expressions (2.2)–(2.4) reduce to

    d_F^{(2)} = \sqrt{ 1 - \left| \sum_{j=1}^{k} \bar{y}_j w_j \right|^2 },    (2.5)

    \rho^{(2)} = \arccos( |y^* w| ),    (2.6)

and

    d_p^{(2)} = \sqrt{ 2 \left( 1 - \cos \rho^{(2)} \right) }.    (2.7)

For the tangent space, the Euclidean distance

    d_t^{(m)} = \sqrt{ \left( v_i^{(m)} - v_j^{(m)} \right)^T \left( v_i^{(m)} - v_j^{(m)} \right) }    (2.8)

may be used, where v_i and v_j are the tangent coordinates calculated at the pole \hat{\mu}^{(m)}. Equation (2.8) remains valid when m = 2.

It has been shown (Dryden and Mardia, 1998, p. 75) that for highly concentrated data sets, standard multivariate methods can be used for tangent coordinates. Moreover, in this case the distances are close, i.e., d_F^{(m)} \approx \rho^{(m)} \approx d_P^{(m)} \approx d_t^{(m)}. However, if the data sets are not highly concentrated, the tangent approximation may not work well. In particular, clustering analysis using Euclidean distances on the tangent coordinates may then turn out to be inappropriate; a numerical illustration of this fact for the Goodall, Hotelling, James, and \lambda tests is given by Amaral et al. (2007). Therefore, the k-means method needs to be adapted for other metrics.

3. k-Means Algorithm for Pre-Shapes

The problem of oceanography that motivated the current work is how to separate a sample of fishes containing individuals from two similar species (Hemiramphus balao and Hemiramphus brasiliensis) found in the waters of northeastern Brazil, based on their shapes. The number of groups is two by the nature of the problem at hand, where the traditional phenomenological approach proposed by Collette (1965) turns out to be inefficient (more details about this problem are given in Sec. 4), but otherwise the approach presented in what follows may be applied in other situations, with shapes of higher dimension.

The k-means algorithm can be used for the current purpose of unsupervised shape classification because it uses the concept of cluster centroids to identify the groups in such a way that the distances within each cluster are minimized, while the distances between different clusters are maximized. Therefore, application of k-means should lead to a separation of the fish specimens into groups, where fish of different groups are less similar to one another than the fish within each group.

To adapt the Hartigan and Wong (1979) k-means algorithm for pre-shapes, consider the matrix

    Z = [\, z_1^{(m)} \,|\, \cdots \,|\, z_n^{(m)} \,]

of n pre-shapes. For planar shapes, Z is a (k - 1) × n complex matrix, whereas for higher dimensions, e.g., m \ge 3, Z is a (km) × n matrix.

The issue of calculating the cluster centers should be seen in some detail. Let C be the number of clusters and suppose that each cluster has an initial cluster center. Let NC(L) denote the number of points in cluster L. If z_1^{(m)}, \ldots, z_L^{(m)} are the pre-shapes of the objects in cluster L, the Fréchet mean is

    \hat{\mu} = \arg\min_{z^{(m)} \in M} \left[ \frac{1}{2n} \sum_{j=1}^{L} \mathrm{dist}\left( z^{(m)}, z_j^{(m)} \right)^2 \right],    (3.1)


where the function dist(·) depends on the metric to be used and (M, dist) is a metric space (Kendall et al., 1999).

If the Riemannian metric (great circle arc distance) is used (see Eqs. (2.3) and (2.6)), Eq. (3.1) needs to be calculated numerically by the algorithm of Pennec (1994). In case the procrustean metric associated with the partial procrustes distances (2.4) and (2.7) is used, the solution of (3.1) is obtained by the algorithm of Ziezold (1994).

For the full procrustes metric and planar shapes, one just needs to obtain \hat{\mu} as the eigenvector associated with the largest eigenvalue of \sum_{i=1}^{L} z_i^{(2)} (z_i^{(2)})^*. For higher dimensions, the GPA algorithm is necessary (Dryden and Mardia, 1998, pp. 90–91). If tangent coordinates are used, the cluster center of L is simply the vector mean of those coordinates.

The distance between a point z_i^{(m)} and the cluster L is given by the distance between z_i^{(m)} and the center of L; it is denoted by D(z_i^{(m)}, L), where only the center of the cluster is considered, i.e., D(z_i^{(m)}, L) = D(z_i^{(m)}, \bar{z}_L^{(m)}). For planar shapes, the distance D(z_i^{(2)}, L) can be calculated from (2.5)–(2.7) and (2.8), whereas for higher dimensions the distance D(z_i^{(m)}, L) can be calculated by using Eqs. (2.2)–(2.4) and (2.8).

The k-means algorithm suitable for shapes has steps similar to those of Hartigan and Wong (1979), and is given as follows:

Step 1. For each pre-shape z_i^{(m)}, i = 1, \ldots, n, find its closest and second closest cluster centers and denote them by IC1(i) and IC2(i), respectively. The pre-shape z_i^{(m)} is assigned to cluster IC1(i).

Step 2. Update the cluster centers: the Fréchet mean (3.1) is recalculated for each cluster with the new observations.

Step 3. All clusters are included in the live set.

Step 4. This stage is called optimal transfer (OPTRA). For the points z_1^{(m)}, \ldots, z_n^{(m)}, verify which ones were reallocated in the last quick transfer (QTRAN) stage; these points are kept in the live set throughout this stage. Let z_i^{(m)} be a point of cluster L1. If L1 is in the live set, do Step 4a; otherwise, do Step 4b.

Step 4a. For the clusters L (L \ne L1, L = 1, 2, \ldots, C), compute R2 = NC(L) D(I, L)^2 / (NC(L) + 1). Let L2 be the cluster with the smallest such value, denoted R2*. If R2* \ge NC(L1) D(I, L1)^2 / (NC(L1) - 1), there is no reallocation and L2 is the new IC2(I). Otherwise, point I is allocated to cluster L2 and L1 is the new IC2(I). The mean shapes of the clusters that changed are updated, and the live set incorporates the two clusters involved in the transfer.

Step 4b. The operations of Step 4a are performed with R2 calculated only for the clusters in the live set.

Step 5. If the live set is empty, stop. Otherwise, go to Step 6.

Step 6. This is the quick transfer (QTRAN) stage. Consider each point I (I = 1, 2, \ldots, n) in turn. Set L1 = IC1(I) and L2 = IC2(I), and calculate

    R1 = NC(L1) D(I, L1)^2 / (NC(L1) - 1)   and   R2 = NC(L2) D(I, L2)^2 / (NC(L2) + 1).


If R2 is bigger than R1, point I remains in cluster L1. Otherwise, switch IC1(I) and IC2(I) and update the mean shapes of clusters L1 and L2.

Step 7. If there was no transfer in the last n steps, go to Step 4. Otherwise, go to Step 6.

One should note that the k-means method of this section generalizes directly to more than two groups.
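To make the overall flow concrete, the sketch below implements a simplified, Lloyd-style variant of shape k-means under the full procrustes distance; it omits the Hartigan–Wong live-set and transfer machinery detailed above, and the function name, the farthest-point initialization, and the stopping rule are our own choices, not the paper's.

```python
import numpy as np

def shape_kmeans(Z, C, n_iter=100):
    """Lloyd-style k-means for planar pre-shapes under the full procrustes
    distance.  Z is (k-1) x n complex, one unit-norm pre-shape per column.
    A sketch only: the paper uses the Hartigan-Wong transfer scheme."""
    n = Z.shape[1]
    # farthest-point initialization: deterministic and spreads the seeds
    M = Z[:, [0]]
    for _ in range(1, C):
        d = 1.0 - np.max(np.abs(Z.conj().T @ M), axis=1) ** 2   # d_F^2 to seeds
        M = np.column_stack([M, Z[:, np.argmax(d)]])
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # assign: maximizing |z^* mu| minimizes d_F = sqrt(1 - |z^* mu|^2)
        new_labels = np.argmax(np.abs(Z.conj().T @ M), axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # update: full procrustes mean = leading eigenvector of sum z z^*
        for c in range(C):
            Zc = Z[:, labels == c]
            if Zc.shape[1] > 0:
                M[:, c] = np.linalg.eigh(Zc @ Zc.conj().T)[1][:, -1]
    return labels
```

Because both the assignment and the center update depend only on |z^* mu|, the clustering is invariant under rotations of the individual pre-shapes, as shape clustering must be.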

4. Results

4.1. Numerical Simulation

To test the performance of the k-means algorithm for shape analysis, a simulation experiment was performed with 5,000 Monte Carlo replicates, with results presented in Table 1. For each Monte Carlo replicate, two samples of sizes n_1 and n_2 are generated from the complex normal distribution. The means of the two populations are (1 + 1i, 1 + 2i, 1.5 + 1i, 1.5 + 2i) and (1 + 5i, 1 + 15i, 1.5 + 13i, 2 + 20i), and the covariance matrices are \Sigma_j = \sigma_j I_8, where \sigma_1 and \sigma_2 are given in the table. The k-means algorithm is applied to the combination of the two samples. For highly concentrated data, the allocation rates of the four methods are similar; for example, when \sigma_1 = \sigma_2 = 0.1 and n_1 = n_2 = 30, the allocation rates are 0.909, 0.909, 0.909, and 0.924. For data with low concentration, the allocation rate for the tangent-space distance d_t^{(2)} is smaller than the allocation rates for the distances on the pre-shape space, d_F^{(2)}, \rho^{(2)}, and d_p^{(2)}; for example, when \sigma_1 = \sigma_2 = 4 and n_1 = n_2 = 30, the allocation rate for d_t^{(2)} is 0.627, while the allocation rates for d_F^{(2)}, \rho^{(2)}, and d_p^{(2)} are 0.756, 0.761, and 0.760, respectively.

4.2. Distinguishing Fish Species

As mentioned earlier, the problem from oceanography that motivated this work is how two species of fish, Hemiramphus balao and Hemiramphus brasiliensis, can be distinguished from each other. The traditional phenomenological criteria proposed by Collette (1965) turn out to be difficult to apply in practice and do not yield a clear distinction between the two species.

Table 1
Estimated allocation rates and mean square errors of the four distances considered (see Eqs. (2.5)–(2.8))

n1  n2  σ1    σ2    d_F^(2)  ρ^(2)  d_p^(2)  d_T^(2)  Mse(d_F)  Mse(ρ)  Mse(d_p)  Mse(d_T)
20  20  4     4     0.761    0.764  0.764    0.649    13.027    12.965  12.983    12.983
20  30  4     4     0.795    0.800  0.800    0.652    14.483    14.395  14.408    17.015
30  30  4     4     0.756    0.761  0.760    0.627    20.497    20.938  20.242    24.005
30  30  3     3     0.765    0.767  0.767    0.648    19.457    19.381  19.396    23.235
30  30  2     2     0.770    0.770  0.770    0.683    18.275    18.255  18.250    22.011
30  30  1     1     0.771    0.771  0.771    0.731    16.660    16.659  16.656    19.639
30  30  0.1   0.1   0.909    0.909  0.909    0.924    6.950     6.950   6.950     7.000
30  30  0.01  0.01  1        1      1        1        0.922     0.922   0.922     0.922


Figure 1. Eleven landmarks of the 49 fish of the sample.

The criteria defined by Collette (1965) are:

  • Criterion: Pectoral fin

    • Hemiramphus balao has long pectoral fins, reaching beyond the anterior margin of the nasal pit when folded forward.
    • Hemiramphus brasiliensis has short pectoral fins, not reaching the nasal pit when folded forward.

  • Criterion: Color of the upper lobe of the caudal fin

    • Upper and lower lobes of the caudal fin of Hemiramphus balao are bluish-violet.
    • Upper and lower lobes of the caudal fin of Hemiramphus brasiliensis are reddish-orange.

To illustrate this issue, a sample of 49 specimens of Hemiramphus balao and Hemiramphus brasiliensis was collected from a beach near Recife, Brazil. In Fig. 1 the landmarks of the 49 specimens are displayed, where it is seen that it is difficult to identify the two groups without a numerical procedure.

The Collette criteria were applied to the sample of 49 fish, yielding two groups, and four statistical tests (Amaral et al., 2007) were applied to check the validity of this procedure. The results of the four tests are shown in Table 2. The tests were implemented in the R function "resampletest" of the library shapes (see R Development Core Team, 2006). The function "resampletest" delivers p-values with and without bootstrap resamples, and because the bootstrap versions of the tests have a smaller order of error (see Amaral et al., 2007), one should focus on the results of

Table 2
Tests for the two groups defined by Collette (1965)

                       Lambda   Hotelling  James    Goodall
Statistic              58.0449  1.3626     54.8535  6.6273
p-value                0.0000   0.2208     0.0260   0.0000
p-value (resampling)   0.1244   0.4577     0.686    0.0050


Figure 2. Eleven landmarks of the 31 fish in group 1.

these bootstrap versions. The p-values of three of the tests with 200 bootstrap samples indicate that the two groups are not different, whereas only the bootstrap version of Goodall's test suggests that they are different. The latter test, however, assumes isotropy, which is not indicated by Fig. 1.

At this point the conclusion can be drawn that the traditional Collette (1965) criteria do not yield a satisfactory distinction between the species; more precisely, the shapes of the two groups obtained through the phenomenological approach cannot be considered different.

The approach adopted in this article, applying the k-means algorithm adapted for shape analysis described in Sec. 3, delivers two groups of sizes 31 and 18. Landmarks of the specimens belonging to the two identified groups are shown in Figs. 2 and 3.

The results of the four tests described by Amaral et al. (2007) are given in Table 3. The p-values of the tests suggest that the means of the two groups have different shapes, and the current procedure can therefore be considered satisfactory.

Figure 3. Eleven landmarks of the 18 fish in group 2.


Table 3
Tests for the two groups obtained by the k-means method

                       Lambda    Hotelling  James     Goodall
Statistic              251.5020  6.4673     181.1170  37.6159
p-value                0.0000    0.0000     0.0000    0.0000
p-value (resampling)   0.0050    0.0149     0.0348    0.0050

5. Summary

It is shown in this work how the Hartigan and Wong k-means algorithm can be adapted for statistical shape analysis. Although direct application of standard k-means is straightforward with tangent coordinates, numerical simulation results show that this approach may be acceptable only for highly concentrated data sets. In practical situations where the data are not highly concentrated, it is more appropriate to use one of the distance measures on the spherical pre-shape space. Application to a real problem of oceanography, which in fact motivated the current article, yields satisfactory classification results, in contrast with the traditional phenomenological approach.

Acknowledgments

We thank the anonymous referee for constructive criticism. Support of the Brazilian agencies CAPES and CNPq is acknowledged. GJAA acknowledges support of the Brazilian agency FACEPE (APQ-0461-1.02/06).

References

Amaral, G. J. A., Dryden, I. L., Wood, A. T. W. (2007). Pivotal bootstrap methods for k-sample problems in directional statistics and shape analysis. Journal of the American Statistical Association 102:695–707.

Collette, B. B. (1965). Hemiramphidae (Pisces, Synentognathi) from tropical West Africa. Atlantide Reports 8:217–235.

Dryden, I. L., Mardia, K. V. (1998). Statistical Shape Analysis. Chichester: John Wiley & Sons.

Goodall, C. R. (1991). Procrustes methods in the statistical analysis of shape (with discussion). Journal of the Royal Statistical Society, Series B 53:285–339.

Gower, J. C. (1975). Generalized procrustes analysis. Psychometrika 40:33–50.

Hartigan, J. A. (1975). Clustering Algorithms. New York: John Wiley & Sons.

Hartigan, J. A., Wong, M. A. (1979). A k-means clustering algorithm. Applied Statistics 28:100–108.

Hastie, T., Tibshirani, R., Friedman, J. (2001). The Elements of Statistical Learning. New York: Springer.

Johnson, R. A., Wichern, D. W. (1992). Applied Multivariate Statistical Analysis. Englewood Cliffs, NJ: Prentice-Hall.

Kendall, D. G. (1977). The diffusion of shape. Advances in Applied Probability 9:428–430.

Kendall, D. G. (1984). Shape manifolds, procrustean metrics and complex projective spaces. Bulletin of the London Mathematical Society 16:81–121.

Kendall, D. G., Barden, D., Carne, T. K., Le, H. (1999). Shape and Shape Theory. London: John Wiley & Sons.

Pennec, X. (1994). Probabilities and statistics on Riemannian manifolds: basic tools for geometric measurements. In: Proceedings of the IEEE Workshop on Nonlinear Signal and Image Processing.

R Development Core Team (2006). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.

Small, C. G. (1996). The Statistical Theory of Shape. New York: Springer-Verlag.

Ziezold, H. (1994). Mean figures and mean shapes applied to biological figure and shape distributions in the plane. Biometrical Journal 36:491–510.
