
Spatial k-means clustering in archaeology – variations on a theme

M. J. Baxter

16 Lady Bay Road, West Bridgford, Nottingham NG2 5BJ, UK.

WORKING PAPER – November 2015

Abstract

In archaeological applications involving the spatial clustering of two-dimensional spatial data, k-means cluster analysis has proved a popular method for over 40 years. There are alternatives to k-means analysis which can be thought of as either ‘competitive’ or ‘complementary’, such as k-medoids, model-based clustering, fuzzy clustering and density-based clustering, among many possibilities. Most of these have been little used in archaeology. That k-means has been a method of choice is probably because it is readily understood, is perceived as being geared to archaeological needs, and was rendered accessible at a time when computational resources were limited compared to what is now available. It is, in fact, a long-established approach to clustering that pre-dates archaeological interest in it by some years. The theses of the present paper are that (a) other methods are available that potentially improve on what is possible with k-means; (b) these are (mostly) as readily understood as k-means; (c) they are now as easy to implement as k-means is; and (d) they merit more attention than they have received from practitioners who find k-means useful. The arguments are illustrated by extensive application to a data set that has been the subject of several previous studies.

1 Introduction

Spatial k-means clustering, as popularised by Kintigh and Ammerman (1982), is a standard technique, often used in archaeological applications for identifying patterns in point scatters at both the intrasite and intersite level (e.g. Blankholm, 1991; Enloe et al., 1994; McAndrews et al., 1997; Savage, 1997; Vaquero, 1999; Ladefoged and Pearson, 2000; Alconini, 2004; Dixon et al., 2008; Lemke, 2013). The method has been extended to point clouds in three dimensions (e.g. Koetje, 1987; Anderson and Burke, 2008). In fact the methodology predates Kintigh and Ammerman (1982) by about a quarter of a century (see Steinley, 2006, who provides a useful overview of k-means methodology). Hodson (1970) is an early (non-spatial) archaeological application (see, also, Doran and Hodson, 1975: 180–4, 235–7) that was not much emulated at the time.

The method is to be distinguished from unconstrained spatial clustering (Whallon, 1984), which involves clustering assemblage compositions and mapping cluster memberships back to their spatial location. Comparative studies of the methodologies include Kintigh (1990), Blankholm (1991) and Gregg et al. (1991), the general consensus being that they are complementary. Kintigh’s (1990) preferred terminology for spatial k-means clustering is ‘pure locational clustering’, which better differentiates between the methodologies; the present paper will only be concerned with this method and only with the two-dimensional case.

Aldenderfer (1998: 103) and others have noted the problem of linking spatial pattern to process. This is a matter for archaeological interpretation rather than a statistical one. Kintigh and Ammerman (1982) emphasised the heuristic nature of their proposed methodology, intended to be ‘concordant’ with archaeological aims and data, with clustering output intended as a starting point for intelligent archaeological interpretation rather than an end in itself.

There are more technical problems with k-means that have been acknowledged from the outset. Kintigh (1990: 190) identified the tendency to form circular clusters as ‘the most serious problem with pure locational clustering’, while also acknowledging that determination of the number of clusters was problematic. Ducke (2015: 360–1) states that the need to assume k is one of the ‘major drawbacks’ of k-means analysis. His comparative study examines approaches to spatial clustering that have had little or no previous application in the relevant literature. This includes a negative appraisal of k-means, which also lists its inability to deal with noise (i.e. points not belonging to any cluster) as a matter for concern. It will be argued in Section 4 that some of this critique is overstated.

Among the desiderata often suggested for statistical methods to merit archaeological attention are ease of understanding and application. They should be both useful to and used by archaeologists; unfortunately these are not the same thing. The popularity of k-means clustering suggests it meets all these criteria. The availability of software written specifically for archaeologists (Kintigh, 1994) undoubtedly contributed to this popularity. It may also have hindered exploration of alternative methodologies.

The arguments of this paper are firstly that there are alternatives to k-means analysis that can deliver benefits beyond the reach of k-means. Secondly, these methods are as readily understood as k-means. Thirdly, they are easy to implement (see the Appendix). Fourthly, these methods are little used; given the previous three arguments they merit more exploration than they have received.

Section 2 reviews the ideas underlying the methods explored in this paper. Mathematical and algorithmic details are largely eschewed other than by reference. Many methods of cluster analysis are defined by the algorithms used to obtain the clusters; users need not be concerned about the details of implementation so long as they have an idea about what a method attempts.

The data set used for illustrative purposes is from Barmose I, an Early Mesolithic camp site from South Zealand, Denmark (Blankholm, 1991). Apart from Blankholm’s analyses these data were also used for illustration in Ducke (2015). The data are given by Blankholm and available from David Carlson’s archdata package for the open-source R software (R Core Team, 2015). They are explored in Section 3.

Unless otherwise stated, analyses have been undertaken with the R package. Code is given in the Appendix. The intention is to draw attention to and explore rather than ‘sell’ (or otherwise) the methods covered, though I have views that will intrude. I’ve allowed myself some latitude of expression in Section 4. There is life beyond spatial k-means clustering; the life may be worth living; and, computationally, it can be easily led.

2 Methods

2.1 k-means and related methods

The term ‘k-means’ is sometimes used in the literature, particularly more technical papers, in a general sense to refer to a particular type of methodology that can be differently implemented. It is also used to refer to a specific algorithm used to obtain clusters; this is probably the most common usage in archaeology and that followed here.

k-means

For fixed k the k-means algorithm as typically applied starts from an initial clustering based on k centroids, clusters being formed by assigning each point to its closest centroid. Centroids are then recalculated and points reallocated between clusters if a new centroid is closer. This proceeds until no changes occur. The ideal (which is rarely achieved) is that clusters will be compact and distinct from other clusters.

Cluster compactness is measured by S_i for cluster i, defined as the sum of squared (Euclidean) distances, d² say, of points in the cluster to the centroid. The sum of the S_i, which will be denoted S, is an objective function that the k-means method seeks to minimize; this is also called the sum of squared errors (SSE) or within-clusters sum of squares (WCSS).
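To fix ideas, a minimal sketch in R follows; the object name xy is an assumption (any two-column matrix or data frame of Easting/Northing coordinates will do), and kmeans() is in base R.

    set.seed(1)
    km <- kmeans(xy, centers = 9, nstart = 100)  # keep the smallest S from 100 random starts
    km$tot.withinss                              # S, the WCSS/SSE
    km$cluster                                   # crisp cluster labels
    plot(xy, col = km$cluster, asp = 1, pch = 16)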


Practical and interpretive considerations include the following:

• Cluster initialisation – Clusters can be initialised in various ways: by user definition, by taking the outcome of a hierarchical clustering method as a starting point, or by randomly selecting k cluster centers, for example. The k-means algorithm is not guaranteed to converge to a global optimum, so the usual advice is to run an analysis several times, selecting from these the outcome with the smallest S. Steinley (2006: 6) cautions that this is not foolproof, since there may be many local optima in data sets of moderate size, so even then detection of a global optimum is not guaranteed. In the present study, where results depend on the initialisation, analyses have been run with at least 100 initialisations and then this process has been repeated several times to check that results are repeated. A corollary of this is that studies that have been based on a single initialisation may have ended up interpreting results that are somewhat less than optimal.

• Choosing a value of k – A common way of selecting k is to plot S (or its logarithm) against k as k varies (a scree plot) and look for a ‘kink’ or ‘elbow’ in the plot. Since S should decrease as k increases, the idea is that a ‘kink’ identifies the point at which a further decrease becomes unimportant. ‘Kinkless’ plots are not uncommon, in which case they are of limited use for selecting k. Many other rules have been proposed for selection but may not help much (Section 3.2). The sensible thing is to map cluster labels onto point locations and exercise (albeit ‘subjective’) common sense in interpreting the merits of different values of k.

• Predestined patterning – The well-advertised ‘problem’ of k-means, that it tends to produce circular clusters that don’t necessarily respect the observable ‘geometry’ of a point scatter, can be understood as a consequence of its close relationship to a model-based methodology that is optimal if clusters are spherical and of equal size. If you are wedded to k-means there is not much to be done about this; resorting to other methods – the subject of much of this paper – might be contemplated.

There is an emphasis in some of the recent archaeological literature on the multiscalar nature of spatial patterning (e.g. Bevan and Conolly, 2006) – that is, different levels of archaeologically interpretable pattern may exist at different spatial scales. Choice of k is an issue, but experimentation with different k might be viewed more positively as the conscious exploitation of a multiscalar approach.

Cluster analysis is also sometimes described as ‘data segmentation’ (e.g. Hastie et al., 2009: 501), where the aim is to partition the data in a sensible way that does not necessarily identify separate clusters. It’s possible to view k-means in this light: using a larger value of k than is probably needed, inspecting the results visually, and mentally aggregating those ‘clusters’ where common sense suggests a partition is unhelpful.

k-medoids

Alternatives to k-means that modify the basic algorithm, or use different objective functions, are available. The k-medoids method (Kaufman and Rousseeuw, 1990; Hastie et al., 2009: 515–520) replaces centroids with medoids, representative points within clusters that can be thought of as analogous to the median for one-dimensional data. This uses unsquared distances d of points from the medoid; otherwise it operates in the same way as k-means (Steinley, 2006: 19) and much the same issues of application and interpretation arise. The main advantages claimed, as a result of the differences from k-means, are that it is less sensitive to outliers, so more robust, and less affected by the ‘spherical cluster’ problem.
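As a sketch, k-medoids clustering is available in R via the pam() function from the cluster package (one of several possible implementations); xy is the assumed coordinate matrix from before.

    library(cluster)                      # for pam()
    pm <- pam(xy, k = 9)                  # k-medoids (partitioning around medoids)
    pm$medoids                            # the representative points
    plot(xy, col = pm$clustering, asp = 1, pch = 16)
    points(pm$medoids, pch = 8, cex = 2)  # mark the medoids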

Trimmed k-means and medoids

Trimmed k-means (Cuesta-Albertos et al., 1997; García-Escudero and Gordaliza, 1999; García-Escudero et al., 2003), as the name suggests, removes what can be thought of as noise by trimming a proportion, a, of points from the cluster. ‘Noisy’ points can be isolated; be internal to a point scatter, acting as ‘bridges’ between clusters; or be small clusters that get trimmed for larger a. The idea is conceptually simple, computationally less so. García-Escudero et al. (2003), the most ‘accessible’ of the papers listed, describe an algorithm for the method.

A practical complication is that results depend on the choice of a. Rather than taking an initial k-means solution and simply ‘peeling’ it, the algorithm operates by starting from a selection of k centers; finding the (1 − a) proportion of points closest to the centers; implementing the usual k-means algorithm using these; then re-defining centers and iterating to convergence. The difference the choice of a makes will be illustrated in Figure 8.

Writers on the subject are usually at pains to emphasise that, other than dealing with noise, the method inherits many of the (less desirable) properties of k-means, specifically the tendency to produce spherical clusters of similar size. This has led naturally to the development of algorithms that avoid some of these problems, trimmed k-medoids being an obvious candidate.
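A minimal sketch, assuming the trimkmeans() function from the trimcluster package (one possible implementation; the tclust package offers another):

    library(trimcluster)                   # package choice is an assumption
    tkm <- trimkmeans(xy, k = 9, trim = 0.10, runs = 100)
    # trimmed ('noise') points are coded as cluster k + 1 in the classification
    table(tkm$classification)
    plot(xy, col = tkm$classification, asp = 1, pch = 16)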

Fuzzy clustering

The k-means and k-medoids methods are examples of ‘hard’ or ‘crisp’ clustering methods. Points (spatial locations) are assigned to only one of k possible clusters; that is, they have a membership of 1 for that cluster, and zero elsewhere. The fuzzy clustering methods discussed here are examples of ‘soft’ clustering; the membership of a case, which sums to 1, is ‘spread out’ across all k clusters. A crisp clustering can be obtained by assigning a case to the cluster for which it has its highest membership (which would seem to defeat the object of the exercise); the memberships can be used to assess how secure the assignment is.

Fuzzy k-means, or c-means, clustering (Bezdek, 1974; Dunn, 1974) includes k-means as a special case. The idea is similar to k-means. The objective function depends on weighted squared distances (Baxter, 2009; Everitt et al., 2011: Section 8.7). The weights depend on the unknown memberships and a fuzzifier, m, that the analyst needs to specify. The value m = 2 is commonly used, but can be varied. These analyses return the estimated memberships, which can be exploited in a manner to be illustrated.

The same issues as with k-means – initialisation, choice of k and the tendency to produce circular clusters – arise, along with the need to choose m. The fanny fuzzy clustering algorithm (Kaufman and Rousseeuw, 1990), if implemented using an objective function based on weighted distances (d), is promoted as a more robust method. Kaufman and Rousseeuw (1990: 189–190) demonstrate that if d² is used rather than d, and m = 2, the method is equivalent to fuzzy c-means clustering; if m = 1 is used with d² a crisp clustering is obtained. Baxter (2009) summarises some of the detail.
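In R the fanny() function from the cluster package implements this; a minimal sketch follows (the metric argument shown in the comment, for the d² variant, is an assumption worth checking against the package documentation):

    library(cluster)                      # for fanny()
    fz <- fanny(xy, k = 9, memb.exp = 2)  # memb.exp is the fuzzifier m
    head(fz$membership)                   # memberships; each row sums to 1
    fz$clustering                         # nearest crisp clustering
    # fanny(xy, k = 9, memb.exp = 2, metric = "SqEuclidean") would use d-squared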

2.2 Model-based clustering

In general, model-based methodologies depend on specifying a probabilistic model for the clusters sought. The more specific case discussed here – the most usual – assumes, given k clusters, that their shape has a bivariate normal distribution. It is also assumed here that models will be fitted using the Mclust function from the mclust package in R, and it is the methodology used there that is summarised (see Fraley and Raftery, 2000 for theory and applications, and Fraley et al., 2012 for the version current at the time of writing).

A statistical model is involved, the parameters of which must be estimated. This is more demanding than maximizing a ‘reasonable’ objective function. The bivariate normality assumption implies that the members of cluster i are sampled from an N(µ_i, Σ_i) distribution, where µ_i is the cluster centroid and Σ_i is its covariance matrix. This implies that clusters are ellipsoidal. Constraints may be imposed on the covariance structures. In the simplest case model clusters can be defined to be circular and of the same size; in the fully unconstrained case the size, shape and orientation of clusters is free to vary. The simplest model can be thought of as a more ‘formal’ approach to doing what k-means attempts; this goes some way to explaining why k-means tends to produce circular clusters, since this is the aim of the model that implicitly underlies it (Steinley, 2006: 17–18).

The method of maximum likelihood, with an expectation-maximization (EM) algorithm, is used to estimate the means and covariances and assign cases to the clusters that they have the highest probability of belonging to. Baxter (2003: 102–4) and Ducke (2015) are rare applications to archaeological spatial clustering, the latter under the name ‘EM clustering’.

A selling point of model-based clustering is that it produces a measure of the ‘fit’ of a clustering, the maximized likelihood, allowing solutions for differently constrained models and differing k to be compared in a ‘principled’ way. Other things being equal, larger values of k will lead to better-fitting models (because they use more parameters); similarly, for fixed k, given two models, one of which is a special case of the other, the more complex model will produce the better fit. This means that the maximized likelihoods are not directly suited for model comparisons because they will tend to favour models with more parameters. This can be compensated for by subtracting from the maximized likelihood some function of the number of parameters, a penalty; that favoured in Mclust is the Bayes Information Criterion (BIC) (Fraley and Raftery, 2000: 7–9; Baxter, 2003: 101). The methodology is best appreciated when seen in action (Section 3.7).
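A minimal sketch of the default Mclust workflow (xy as before):

    library(mclust)             # for Mclust()
    mc <- Mclust(xy, G = 1:15)  # fits a range of models and k, selecting by BIC
    summary(mc)                 # reports the best model (e.g. VII) and its k
    plot(mc, what = "BIC")      # BIC against number of components, by model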

2.3 DBSCAN

As Ducke (2015: 367) doesn’t quite say, there are rather a lot of ‘cluster detectors’ around and life is much too short to explore many of these in articles of reasonably finite length. Amen to that. Writing over 30 years ago, Blashfield et al. (1982: 168) bemoaned the fact that ‘the task of searching through the voluminous literature on clustering is simply too great’!

The DBSCAN (density-based spatial clustering of applications with noise) algorithm, which has been around since the mid-1990s (Ester et al., 1996) and is popular in some literatures, has – Ducke (2015) excepted – largely escaped archaeological attention. It is a density-based method, designed for large data sets and for detecting ‘dense’ clusters of irregular shape (Ester et al., 1996).

The mantra that accompanies presentations of DBSCAN typically includes statements along the lines that it is ‘robust’, can identify noise, does not require k to be known, and can deal with complex non-spherical shapes (e.g. Argote-Espino et al., 2012: 53; Ducke, 2015: 362). The method may perform poorly if clusters of strongly varying densities are present.

The user is required to specify two ‘parameters’, a distance threshold, ε (or eps), and a minimum number of points, MinPts, that constitute a cluster. A core point is one with at least MinPts other points within a distance ε of it. A boundary point is one within a distance ε of a core point that is not itself a core point; other points are classified as noise. It is not required that k be specified in advance, the method producing (given ε and MinPts) a unique solution. Guidelines have been suggested for choosing ε and MinPts but may not be especially helpful. Later analyses (Figures 9 and 10) indicate that results can be very sensitive to the choice.
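A minimal sketch using the dbscan() function from the fpc package (the implementation used for the figures below), with the parameter values of Figure 9a:

    library(fpc)          # for dbscan()
    db <- fpc::dbscan(xy, eps = 0.30, MinPts = 10)
    table(db$cluster)     # cluster 0 holds the points classified as noise
    plot(xy, col = db$cluster + 1, asp = 1, pch = 16)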

The method, which originates in the data-mining literature, is described in its Wikipedia entry as ‘one of the most common clustering algorithms and also most cited in scientific literature’. The few accessible archaeological applications – not just to spatial data – I’ve located include Argote-Espino et al. (2012, 2013) and Lopez et al. (2015), who use sample sizes so small (< 20) that the need for any clustering method, never mind DBSCAN, is questionable. Crema (2013) cites his use of the method in two figure captions but otherwise does not discuss it; Smith et al. (2015) discuss but reject the idea of using it for the reasons alluded to above, leaving Ducke (2015) as the only comprehensible, substantive archaeological application to spatial data I’ve seen. His assessment of the method is contrasted with that of this paper in Section 4.


3 Examples – the Barmose I data

3.1 Data

Barmose I is an Early Mesolithic camp site from South Zealand, Denmark, excavated in the late 1960s and early 1970s. The data on 473 flint artefact locations have been published by Blankholm (1991), inviting subsequent exploitation for the purposes of statistical-methodological illustration, and are available from David Carlson’s archdata package. The flints are classified into 11 categories. Following Ducke (2015), class differences are ignored here. Orton (2004) has explored other methods of point-pattern analysis, primarily Ripley’s K-function, after disaggregating the data into classes.
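A minimal sketch of loading the data; the dataset and column names used here (BarmoseI.pp, East, North) are assumptions to be checked against the archdata documentation.

    library(archdata)                        # David Carlson's package
    data(BarmoseI.pp)                        # dataset name is an assumption
    xy <- BarmoseI.pp[, c("East", "North")]  # column names are an assumption
    plot(xy, asp = 1, pch = 16, cex = 0.5)   # cf. Figure 1a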

Blankholm (1991: 183–205) includes a detailed analysis of the spatial distribution of individual artefact types and, to aid later discussion, this is discussed in a little detail. He identified seven activity areas, an archaeological interpretation based on his initial analysis. Subsequent analyses investigated the extent to which different methods of identifying artefact clusters were coincident with this proposed division, and if they suggested other aspects of ‘structure’ subsumed within that division [1].

His k-means analysis was based on individual types with the results then synthesized to suggest seven activity areas which are not, and not intended to be, coincident with any particular clustering. That is, his use differs from that here and in Ducke (2015) in examining different classes of artefact. The other methods investigated suggested 15, 16 and 19 clusters, interpreted as smaller numbers of activity areas: 11, 9 and 15. The methods together were judged to produce complementary and useful information. For later reference the important point to note is that interpreted activity areas are not to be equated with clusters, and their number, however defined, is less than the number of clusters that might exist in the spatial scatter.

Figure 1a shows a plot of the artefact scatter, uninterpreted in the sense that no cues are provided to indicate what might be interpreted as clusters in the data. Figure 1b shows the same data overlaid on a kernel estimate of the intensity function of the point process that generated the point pattern.

Figure 1: (a) A plot of the flint artefact scatter at Barmose I. (b) The data overlaid on a kernel estimate of the intensity function.

There is an obviously dense ‘horseshoe’ shaped scatter of points, around a central hearth, appearances suggesting this might readily be subdivided into smaller clusters. Blankholm’s (1991: 191) preliminary interpretation, based on the distribution of individual classes of artefacts, was that three work areas adjacent to the hearth could be identified, to the north-northeast, east-southeast and south-southwest. He states that these cannot be ‘fully separated’ and are overlapping. Away from these areas the scatter is less dense; with the ‘eye-of-faith’ it is possible to suggest that some can be interpreted as clusters, albeit not dense ones. Blankholm comments on the irregularity of the point distribution, but identifies four areas some of which might tentatively be associated with different forms of activity and distinctive artefact clusters.

[1] The methods were k-means, unconstrained spatial clustering, correspondence analysis and a method of his own devising, PRESAB – subsequently not much sighted or cited.

It will be of interest in what follows to see how different ‘objective’ methods of cluster detection perform, both in relation to each other and to the above interpretation. More specifically, to what extent do results ‘cut across’ the areas identified by Blankholm and/or suggest subdivisions of the areas he identifies?

3.2 k-means and k-medoids analysis

Figure 2a shows a scree-plot of log(S) against k for k = 1, 2, . . . , 12, based on the best result from 100 random starts [2]. There is no discernible ‘kink’ in the plot to guide the choice of k; for illustration the result for k = 9 is shown in Figure 2b. The random runs are obtained by randomizing one of the coordinates and running the k-means analysis, repeating this – 250 times in the figure – to get a picture of what the graph would look like if there were no clusters. There is clear evidence of clustering in this case, even if pinpointing a ‘best’ value of k is not possible.
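A sketch of how such a plot can be constructed (xy as before; fewer randomization runs than the 250 used in the figure, to keep the computation light):

    ks <- 1:12
    S <- sapply(ks, function(k) kmeans(xy, k, nstart = 100)$tot.withinss)
    plot(ks, log(S), type = "b", xlab = "number of clusters (k)",
         ylab = "log of within cluster sums of squares (S)")
    # baseline: destroy the spatial structure by randomizing one coordinate
    for (i in 1:50) {
      xy.r <- cbind(xy[, 1], sample(xy[, 2]))
      S.r <- sapply(ks, function(k) kmeans(xy.r, k, nstart = 10)$tot.withinss)
      lines(ks, log(S.r), col = "grey")
    }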

Figure 2: (a) A scree plot for k-means analyses of the Barmose I data. (b) A plot of the 9-cluster solution showing 95% confidence ellipsoids.

The 90% confidence ellipsoids illustrate the general tendency for clusters to be circular. On the whole, and in comparison with Figure 1, the clustering to my eye seems sensible. The large cluster to the bottom-right is the least convincing, subsuming several small groups of points that are visually separate. Given this is a crisp clustering the points have to go into some cluster. Elsewhere, most clusters look acceptable as a partition of the data, but there are obviously points associated with them that don’t ‘convincingly belong’ to the cluster and might be better treated as noise, an option explored in later analyses.

The dense ‘horseshoe’ shaped conglomeration of points evident in Figure 1, bordering the hearth, is, for the most part, sensibly divided. Some of the bordering cases between clusters could as well belong to one as the other without damaging the picture. One is at liberty to mentally merge or disaggregate clusters – the method is an aid to pattern recognition rather than a tool for producing a ‘definitive’ partition. The central group, centred at about (7, 8), is one that might merit this treatment. If a 12-cluster solution is examined (Figure 3a) this is subdivided into two, with some of the outer clusters becoming more tightly defined, leaving others that could be interpreted as noise.

[2] This was obtained using a modified version of Matt Peeples’ (2011) R script for k-means analysis, which implements the approach of Kintigh (1994). http://www.mattpeeples.net/kmeans.html (accessed 06/11/2015).

Figure 3: (a) A k-means analysis of the Barmose I data with 12 clusters. Shaded areas are the convex hulls for each cluster. (b) A k-means analysis of the Barmose I data with 6 clusters. The Voronoi diagram and 90% confidence ellipsoids are superimposed.

The choice of a 9-cluster solution is a little arbitrary. A large number of indices have been developed for identifying an appropriate choice of k, 27 of which are available in the NbClust function from the package of the same name. The most common number of clusters suggested is 2 (for seven indices), followed by 7 (six indices) and 3 (five indices); nine indices suggest 14 or more clusters. A pessimistic interpretation of this is that these indices aren’t of much help; a positive view is that examination of results at different scales of clustering is worthwhile. Although, if confined to a single analysis, a choice of k is needed, it is possible to be overly concerned about making a ‘correct’ choice of k. In an archaeological context Kintigh (1990: 190) observes that the method often ‘identifies clustering at several spatial scales’ and that there may be ‘sound ethnographic reasons to believe that clustering may be evident at more than one spatial scale’.
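A sketch of the index computation (NbClust can be slow for a large max.nc):

    library(NbClust)
    nb <- NbClust(xy, min.nc = 2, max.nc = 15, method = "kmeans", index = "all")
    nb$Best.nc   # the best k suggested by each index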

There are several ways in which output can be displayed, the most obvious one being to label points by cluster on a scatterplot of the data; colour is used in Figure 2b. The use of confidence ellipsoids can help clarify patterns, possibly omitting the data points, particularly when there are a large number of them. Some alternative methods of presentation are illustrated in Figure 3.

For variety a 12-cluster solution is shown in Figure 3a using shaded polygons that are the convex hulls of the clusters; compared to the use of point labelling (with or without confidence ellipsoids) a sharper picture of the differences between clusters perhaps emerges, at the possible expense of exaggerating differences. The choice of a 6-cluster solution along with the Voronoi diagram in Figure 3b was designed to emulate the analysis shown in Figure 8 of Ducke (2015) but doesn’t; this is discussed further in Section 4. The Voronoi diagram is a complete partition of the plane within which the points lie; cluster centroids are indicated by the larger red dots; all points within the associated region defined by the partition are closer to this than to any other centroid. It is a mathematically correct ‘description’ of what k-means does, favoured by some analysts. Viewing the diagram without reference to the point scatter is, however, potentially misleading. In particular it does not, by itself, indicate whether segments of the partition can be equated with clusters or not, as the term ‘cluster’ would often be understood. This is fine if all that is needed is a sensible segmentation of the data, but unhelpful if the existence or otherwise of distinct clusters is the focus of interest.
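A sketch of such a display; the deldir package is one way to compute the tessellation (its use here is an assumption, not necessarily the paper’s Appendix code):

    library(deldir)                      # Dirichlet/Voronoi tessellation
    km6 <- kmeans(xy, centers = 6, nstart = 100)
    dd <- deldir(km6$centers[, 1], km6$centers[, 2],
                 rw = c(range(xy[, 1]), range(xy[, 2])))  # bounding window
    plot(xy, col = km6$cluster, asp = 1, pch = 16, cex = 0.5)
    plot(dd, wlines = "tess", add = TRUE)       # tile boundaries
    points(km6$centers, col = "red", pch = 19)  # centroids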


I quite like the colourful appearance of Figure 3a; it’s reminiscent of a late Matisse paper cutout (The Snail). This doubtless testifies to a lack of taste on my part, and probably violates the tenets of good graphical presentation. For comparative purposes the colours can be a distraction and a more sober variant is used in later analyses. A first illustration is Figure 4, which contrasts the 9-cluster solutions for k-means and k-medoids.

Figure 4: Analyses of the Barmose I data with 9 clusters, (a) k-means, (b) k-medoids.

Both analyses were run from 100 different random starts and repeating this suggested that results were stable. There are clearly differences between the plots, particularly in the denser regions of scatter. In part this arises because of the way in which points peripheral to a cluster are treated. Careful examination shows that this affects most clusters, changing, and probably exaggerating, differences in the size and shape of clusters. Subsequent analyses explore how such points might be identified and treated as noise. The other thing to notice is that the two methods partition some of the denser regions differently. This is particularly clear in the upper-centre of the plot, where the k-means analysis identifies one cluster and the k-medoids analysis two.

3.3 Fuzzy c-means analyses

Ideas will initially be explored using fuzzy c-means analysis. As discussed in Section 2.1 the method depends on specifying a ‘fuzzifier’, m, which to keep things simple will be taken as 2 in what follows. Point i is associated with a set of memberships, M_ij for j = 1, 2, . . . , k, for each of the k clusters. The idea here is to define a membership threshold, M, and treat as noise any point whose maximum M_ij < M. Sensible choices of M will depend on the value of k and/or the proportion of points one is willing to consider as noise.

This approach, which will be called ‘peeling’, differs from trimmed k-means in that clusters aren’t changed in the process; they are divided into a ‘core’ and a ‘periphery’, or noise, the size of the core diminishing and the proportion of noise, a, growing as M increases. In trimmed k-means the proportion of trimming, or noise, a, is chosen before applying the k-means analysis, and the clusters obtained may be sensitive to this choice. For k = 9 clusters and M = 0.5 about 25% of the points for the Barmose I data are classified as noise. Using this for illustration, Figure 5 shows two ways in which results might be displayed.
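A sketch of the peeling step, using the cmeans() function from the e1071 package (a choice of implementation that is an assumption; fanny() memberships could be handled identically):

    library(e1071)                                   # for cmeans()
    cm <- cmeans(as.matrix(xy), centers = 9, m = 2)  # fuzzy c-means, fuzzifier m = 2
    M <- 0.5                                         # membership threshold
    noise <- apply(cm$membership, 1, max) < M
    mean(noise)                                      # proportion of points peeled as noise
    plot(xy, asp = 1, pch = 16, col = ifelse(noise, "grey", "black"))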

Results are based on the best solutions from 100 random starts and are stable. The large red polygon in Figure 5 is the convex hull for the points being treated as noise. The blue polygons are convex hulls for the peeled clusters – core points – rather than the crisp clustering that can be derived from the c-means analysis. Core points and noise are distinguished by black and white dots; note that the latter include points internal to the scatter. This form of presentation identifies a central area of five fairly dense clusters, surrounding and nicely delineating the central hearth, and four less dense peripheral clusters. The method inherits the tendency of k-means to produce spherical clusters.

Figure 5: (a) Fuzzy c-means analysis with k = 9, m = 2, M = 0.5; white points are noise, black points core. (b) The core/noise classification derived from c-means superimposed on the k-means 9-cluster solution.

Figure 5b overlays the points, identified as core/noise using fuzzy c-means, on the convex hulls defined by the 9-cluster k-means solution. Reading clockwise round the periphery from the top-left, the first five clusters (to six o’clock in the south) are the same, except that (by construction) some of the more outlying points within the k-means clusters are ‘stripped out’. The central band of three clusters, reading from nine o’clock in the west, is different. The peripheral cluster identified in k-means is largely treated as noise and the c-means analysis suggests a finer subdivision of the data.

This kind of analysis obviously, and unavoidably, depends on k and the degree of peeling implicit in the choice of M. To get a sense of what is happening, Figure 6 shows the results of thinning the original point scatter by plotting only points derived from a 10-cluster solution, (a) removing 72 (15%) of the points with all membership values less than 0.4 and (b) removing 183 (39%) of the points with all membership values less than 0.6. The plots are otherwise uninterpreted in the sense that neither convex hulls nor colours are provided as cues for identifying clusters.

With less thinning (Figure 6a) it is less easy to see the ‘imposed’ structure; the suggested peripheral clusters are clear enough, those round the hearth less so, though the possibilities can be seen. Obviously the greater the thinning (Figure 6b) the clearer the suggested clustering will be. The 10-cluster solution sought in specifying k is now much more evident; note, incidentally, that in moving to 10 clusters and thinning more, the central upper cluster seen in earlier k-means analyses is now subdivided into two.

The important general point, rather than any specific interpretation, is that as well as peeling the more obviously outlying points, those that are intermediate between the denser concentrations are also identified as noise, in the sense that their membership of at least two clusters is sufficiently great that it prevents any of them from attaining the threshold defined by a suitably chosen M. This is consistent with the idea that clusters can overlap, so that, from a strictly statistical perspective, there are points whose cluster membership is inherently ‘undecidable’. The presentational ideas in Figure 6 exploit this and are applicable to any method where the uncertainty of cluster membership can be quantified.


Figure 6: Results for the 10-cluster solution using c-means cluster analysis, (a) omitting cases with less than 0.4 membership value for all clusters, (b) omitting cases with less than 0.6 membership value for all clusters.

3.4 Fuzzy c-medoids analyses

In illustrating this further using fuzzy c-medoids, problems were encountered. Thus far, analyses have been carried out using random starts with 100 initialisations, producing stable results. In the case of fuzzy c-means discussed above some similarity between the clusters produced and k-means can be expected, but identical cluster locations are not imposed; the method may peel clusters identified by k-means but can also suggest alternative clusterings.

Figure 7: Fuzzy c-medoids for 9 clusters and m = 2, (a) for M = 0.5 and (b) M = 0.28.

For fuzzy c-medoids, stability was not achieved by using random starts, even with a large number of initialisations. The expedient was adopted of fixing the initial cluster centres at those determined by a k-medoids analysis, with the effect, for the data used here, that the analyses simply peeled the crisp clusters of Figure 4b. A second issue was that emulating the fuzzy c-means analysis by using m = 2 and M = 0.5 identified about 80% of the data as noise, resulting in Figure 7a.


There is nothing inherently incorrect about a large number of points being classified as noise, but to my eye the outcome here is deeply unsatisfactory [3]. Reducing m and/or M is one way of dealing with this; fixing m = 2, M = 0.28 results in about 25% of the points being classified as noise, comparable to that in the fuzzy c-means analysis of Figure 5; Figure 7b is the result.

Comparison with Figure 4b confirms that cluster cores remain within the original crisp clusters, with within-cluster outliers and boundary points between clusters treated as noise. Comparison with the fuzzy c-means analysis of Figure 5a shows broad similarity; the less dense c-means cluster in the south-east is treated as noise in the c-medoids analysis, while the dense upper-central cluster is divided into two.

3.5 Trimmed k-means analysis

The outcome of two 9-cluster trimmed k-means analyses is shown in Figure 8. The main intention is to show how the location, and not just the size, of the clusters can vary as the trimming proportion, a, is varied; a = 0.10 and a = 0.25 are illustrated, the last of these corresponding to the proportion of noise illustrated in some of the fuzzy clustering analyses.

Figure 8: Trimmed k-means cluster analyses with trimming proportions of (a) a = 0.10 and (b) a = 0.25.

That there are differences in the analyses is obvious; the heavier trimming results in the ‘disappearance’ of the less dense peripheral clusters, in compensation subdividing the more obviously dense regions somewhat more than previously comparable analyses. The lighter trimming, as is possibly to be expected, retains much of the structure of the original k-means analysis of Figure 4, losing the less dense south-east cluster and splitting one of the more central ones. It is dangerous to draw conclusions from a single analysis, but if this were to prove a typical outcome it would suggest that the tendency of trimmed k-means is to treat as noise points in the less dense regions of a plot for larger values of a.

It had been intended to discuss trimmed k-medoids in a similar way. There is an R function, trimmedoid, from the Anthropometry package that ought to allow this. It seems straightforward to use but I’ve yet to find a way of persuading it to produce stable results. Other packages might allow this but I haven’t located them yet, so have left trimmed k-medoids unillustrated.

[3] In some approaches to analysis the hypothesis of complete spatial randomness (CSR), of the kind generated by an homogeneous Poisson process, is tested before undertaking a cluster analysis. This seems superfluous here, but where the hypothesis is tenable it can be thought of as 100% noise.


3.6 DBSCAN analyses

The obvious appeal of DBSCAN and methods like it is that they purportedly deal with noise, can detect irregularly shaped clusters, and don’t require prior specification of k. The phrase ‘too good to be true’ possibly springs to (the more cynical) mind. It was clear when experimenting with the method that cluster configurations were sometimes such that representing them by convex hulls or ellipsoids was inappropriate; colour coding, with noise as white dots, was used.

Figure 9 uses the same values of eps (ε) and MinPts as Figure 8 in Ducke (2015). Six rather than the seven clusters in Ducke are produced, presumably because of differences in the software implementation of the algorithm. Guidelines exist in the literature for choosing eps and MinPts but did not help much here; quite often the advice emphasises that the choice is domain-specific and depends on what one is looking for.
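One commonly cited guideline is the sorted k-nearest-neighbour distance plot; a sketch assuming the kNNdistplot() function from the dbscan package (distinct from fpc’s dbscan() used for the figures):

    library(dbscan)                     # for kNNdistplot()
    kNNdistplot(as.matrix(xy), k = 10)  # look for a 'knee'; k chosen to match MinPts
    abline(h = 0.30, lty = 2)           # the eps value of Figure 9a, for reference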

Figure 9: DBSCAN cluster analyses for the Barmose I data, varying the eps and MinPts arguments in the dbscan function from the fpc package in R. Points in white are noise and not clustered.

Figure 10: DBSCAN cluster analyses for the Barmose I data, varying only slightly in the choice of ‘parameters’ from Figure 9a.

It is obvious, with a little experimentation, that results are very sensitive to the choice of eps and MinPts. Figure 9b shows the outcome when eps is reduced to 0.25 and MinPts to 5. If Figure 9a is judged to be satisfactory then Figure 9b isn’t, most particularly because of the large, undivided ‘horseshoe-shaped’ (red) cluster. It might be argued that this is a vindication of the method, in that it is highlighting an undeniably dense and non-spherical region of points. Against this, inspection of Figure 1 suggests that some subdivision may be merited; other methods mostly do this; and Blankholm’s (1991: 191) archaeological interpretation also suggests a division into three (overlapping) work areas. What is particularly troubling is how sensitive results are to even small perturbations in the parameters that control the clustering. This is illustrated in Figure 10.

Varying eps by ±0.02 and MinPts by ±1, two such outcomes are shown in Figure 10. These result in 4- and 7-cluster solutions compared to the original 6-cluster solution, the differences in the 7-cluster solution being particularly marked. The other aspect of the DBSCAN analysis worth noticing is that, presumably because of problems dealing with clusters of different densities, all points away from the central hearth area are treated as noise. Nearly all the other methods investigated associate some of the peripheral scatter with clusters (trimmed k-means with 25% trimming is the exception). In addition to three general work areas around the central hearth, Blankholm (1991: 191–2) identifies four peripheral areas, corresponding to regions suggested by clusters for methods other than DBSCAN, for which archaeological interpretations can be offered with varying degrees of conviction.

3.7 Normal mixture modelling

Figure 11 shows plots based on the ‘default’ analysis using the Mclust function. See the documentation on this for an explanation of the legend in the BIC plot of Figure 11a. The best model according to the BIC criterion has seven clusters. This is a VII model, which assumes spherical clusters of unequal size. This is shown in the uncertainty plot of Figure 11b, where the size of the circles represents the clusters modelled by the method (as opposed to the actual shape of points in a cluster, which may differ).

Figure 11: The plots show the ‘default’ results from an initial normal mixture model cluster analysis of the Barmose I data. The BIC plot, (a), suggests models worth investigating. The uncertainty plot, (b), displays the estimated shape of the model clusters underpinning the analysis; the larger dots are associated with the points whose classification is least certain – see the text for details.

The analysis is not especially satisfactory and the uncertainty plot needs a little explanation. The EM algorithm used estimates the parameters of the model assumed and the conditional probabilities of point i belonging to cluster j, p_ij. Points are assigned to the cluster for which p_ij is a maximum – call this max p_ij – and their uncertainty is defined as (1 − max p_ij). The default (which can be varied if the mclust2Dplot function is used) is to plot the 5% of points with the highest uncertainty (i.e. the most poorly classified) with the larger black dots; the next 20% most uncertain are plotted with the second-largest, grey dots. The idea is similar to that of using the memberships from a fuzzy clustering to define noise and core points; it’s possible to extract the matrix of the p_ij and handle them in the same way the memberships were exploited, though this is not illustrated.
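For reference, a sketch of the extraction itself (mc as in the earlier Mclust sketch); mclust also stores the uncertainty directly:

    p <- mc$z                                 # n x k matrix of the p_ij
    uncert <- 1 - apply(p, 1, max)            # uncertainty, as defined above
    # mc$uncertainty holds the same quantity
    noisy <- uncert > quantile(uncert, 0.75)  # e.g. flag the most uncertain quarter
    plot(xy, asp = 1, pch = 16, col = ifelse(noisy, "grey", "black"))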

The unsatisfactory aspect of the analysis is the large central circular cluster, associated with points that are ‘all over the place’ – essentially a noise cluster. Treating it as such, and using the same presentational devices as previously illustrated, results in Figure 12a. The phenomenon, that this kind of analysis can result in a noise cluster, has been noted in Papageorgiou et al. (2001) in a more general (non-spatial) archaeological application of model-based clustering. Several models are competitive with the 7-cluster VII model. One of these is the 9-cluster EII model, which assumes spherical clusters of equal size; that is, it assumes a structure of the kind frequently manifest in applications of k-means. Observe from Figure 11a that the EII model performs very poorly up to the 7-cluster solution, with a marked improvement at eight and then nine clusters. This might be taken as evidence that a 9-cluster solution is reasonable for the k-means method. The solution for the 9-cluster EII model is shown in Figure 12b.

Figure 12: (a) The 7-cluster VII solution using EM clustering in Mclust; (b) the 9-cluster EII solution using EM clustering in Mclust.

The disposition of clusters in Figure 12b is broadly similar to the k-means clustering, but cluster sizes differ somewhat, particularly in the central regions. In part this is because of the way points on the periphery of clusters are assigned; for example, the cluster in the upper-right for the k-means analysis has three of its points on the left edge reassigned to the two clusters to its left in the model-based analysis, noticeably changing both the size and shape of the clusters.

It’s possible to take this kind of analysis further by exploiting some of the finer control possible within the mclust package, but this requires more effort to execute than is the case for other methods discussed in the paper and is not illustrated. It’s possible, for example, to model the presence of noise or, as already noted, to extract the p_ij values and manipulate them in any way deemed suitable for presentation. The BIC criterion used in selecting models exacts a harsher penalty than other criteria that might be used and will tend to select the simpler models. It is thus legitimate to experiment with more complex models, less satisfactory as judged by the BIC, if that initially suggested, as here, is judged to be unsatisfactory.
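A sketch of the noise-modelling option just mentioned; the crude random initialisation of the noise indicator is purely illustrative (a nearest-neighbour cleaning step would make a better starting point):

    set.seed(1)
    noise.init <- runif(nrow(xy)) < 0.1  # crude initial guess: about 10% noise
    mcN <- Mclust(xy, initialization = list(noise = noise.init))
    table(mcN$classification == 0)       # 0 labels points assigned to noise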


4 Discussion

Comparison with analyses in Ducke (2015), an original intention, is not straightforward, presumably because of algorithmic differences in the software used. This is of most concern for the results of his 6-cluster k-means analysis, which contributes to his harsh assessment of the method. Our results (Figure 3) would presumably have produced a less negative reaction since they are similar to the output from his EM clustering analysis, which is more kindly judged. It will be apparent from earlier comment on the DBSCAN output that I cannot agree that this is obviously to be preferred – quite the opposite.

Ducke’s assessment strategy is arguably problematic. A diagram from Orton (2004) is used to define six activity areas, rather than the seven Blankholm (1991: 191–2) describes. These seem to be equated with ‘clusters’ that a good ‘cluster detector’ should reveal. As noted in Section 3.2, Blankholm’s (1991) interpretation was a synthesis of a large number of analyses, and not a cluster analysis per se, and he was prepared to contemplate the existence of somewhat more clusters than activity areas.

Imposing a 6-cluster solution on the k-means and EM clustering analyses forces them into an inappropriate straitjacket that does not allow them to compete ‘on equal terms’ with methods not so constrained. The EM clustering method as implemented in Mclust allows the competing claims of different k to be assessed; for k-means, if larger values of k are allowed, a perfectly sensible solution can be obtained. The problem of noise remains; the tendency to produce spherical clusters of similar size remains. The latter can have the effect of unwarrantedly subdividing larger/elongated/non-circular clusters but, provided a value of k is chosen that, if anything, is ‘too large’ rather than ‘too small’, clusters can be mentally merged if it makes sense to do so.

To the desiderata that Ducke (2015: 357–8) lists for an ideal cluster detector might be added a requirement that a cluster detector be capable of dealing with clusters of different densities. On this basis alone DBSCAN, used in isolation, is arguably and obviously unsuited to the analysis of much archaeological spatial data, unless the focus is on identifying and partitioning dense regions. It is evident both from Blankholm’s discussion of Barmose I, and similar discussions of the Mask site studied by Binford (1978) (e.g. Kintigh, 1990), that archaeologically interpretable, more peripheral, and less dense artefact scatters can occur, and any reasonable cluster detector used with such data should be capable of identifying these. Ducke (2015: 364) commends DBSCAN on the basis that it fits the expected site structure (i.e. the six clusters assumed a priori) ‘much better . . . with all clusters compactly outlined and neatly arranged around the central hearth, as anticipated by the nature of the activity areas’. Blankholm identified three activity areas around the hearth and four more peripheral areas, classified as noise in the DBSCAN analysis.

Most of the illustrations in this paper are based on the use of an arbitrarily chosen 9-cluster solution; fewer clusters, probably down to about seven, or more could clearly have been used. Some of the following summary is conditioned by views that I shall try and be explicit about. One is that the difficulty of choosing k can be exaggerated; unless one believes that there is a ‘best’ or ‘correct’ value of k, attempting to find one is a chimerical pursuit. Accordingly, exploring different values of k is both sensible and justifiable if the existence of interpretable pattern at different spatial scales is contemplated.

Likewise, the tendency of k-means to produce similar-sized circular clusters troubles me less than it probably should. The idea of ‘over-fitting’ by using a larger value of k than looks necessary and mentally merging circular clusters to get elliptical clusters, or banana-shaped ones etc., doesn’t fill me with horror. I realise that this, and the need eventually to fix on values – the plural is intentional – of k to examine, can be very ‘subjective’, a term often used as if it is a dirty word.

There are areas of application of cluster analysis where human agency is not much involved in the interpretation of output; I doubt if archaeology is one such. Cluster analysis and cognate methods are sometimes promoted as ‘objective’ methodologies; I find it difficult to think of archaeological applications that are not imbued with ‘subjectivity’, often disguised and/or unacknowledged. This includes the choice of a clustering algorithm, the choice of things like measures of dissimilarity between cases to use with it, the selection of the number(s) of clusters to interpret, and the interpretation itself. These things can matter, despite the occasional assertion one sees to the contrary. Archaeological publications using cluster analysis tend not to be over-burdened with a comparative analysis of the effect of different choices, or expressions of angst about the validity of an interpretation [4].

With complex (i.e. high-dimensional) multivariate data, methods of multivariate analysis can be valuable for gaining a ‘toehold’ (if not ‘stranglehold’) on aspects of pattern perception and interpretation. Spatial data, in the sense of a collection of spatial coordinates – and provided an enormous amount of data is not involved – are not complex. They lend themselves naturally to two-dimensional display; subjective interpretation of patterns in such data is almost certainly unavoidable, and it is probably criminal to attempt otherwise. Such ‘subjective’, or ‘common sense’, interpretation will, ideally, work ‘hand-in-glove’ with more formal methods of cluster detection when the data set is ‘large enough’ and pattern not ‘obvious’; where the two are in conflict ‘common sense’ should probably trump ‘objective’ clustering on most occasions.

How might the effectiveness, or otherwise, of a clustering method be judged for the Barmose I data if you think this way? There is a visually obvious 'horseshoe-shaped' dense pattern of points that defines (or is defined by) a hearth, the area of which includes a number of artefacts (Figure 1). It would be nice if a method clearly delineated the hearth but, because of the artefacts within the area, no method that produces a crisp clustering will do this.

It would be surprising, provided a large enough value of k is used, if this dense region around the hearth was not subdivided. Subjectively, the intensity map of Figure 1b suggests a partition into four clusters might be expected, possibly three or five. If a 9-cluster solution is essayed and expectations are met, where do the other four to six clusters go? The obvious answer is that they will be found in the more peripheral regions. If you really believe that these are noise then the 9-cluster solution has to be abandoned; if archaeological considerations lead you to think that less dense and/or peripheral clusters can be 'real' you want a method that will suggest them, even if the suggestion is then rejected.

A lot of this comes down to 'noise': what is it, how is it detected, and what do you do with it? One form of noise is outliers that lie at some distance from any other point; though not explored here, nearest-neighbour statistics might be, and have been, exploited to identify such outliers and remove them prior to analysis. This, of course, requires the specification of a nearest-neighbour threshold that defines an outlier.
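A minimal sketch of this kind of screening, in base R, follows (the threshold D is hypothetical and must be chosen by the analyst; data is as in the Appendix):

D <- 0.5                        # hypothetical nearest-neighbour threshold
dmat <- as.matrix(dist(data))   # pairwise distances between all points
diag(dmat) <- Inf               # ignore self-distances
nnd <- apply(dmat, 1, min)      # nearest-neighbour distance for each point
data.clean <- data[nnd <= D, ]  # remove outliers before clustering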

Small clusters of points, otherwise distant from other locations, might be defined as 'noise clusters' and similarly removed. As well as defining 'distant', the term 'small' also needs to be defined. I'm tempted to suggest that if this is a concern such clusters might as readily be identified subjectively and removed, without the sanction of a more formal identification process designed to confirm what you believe in any case.

Points that are regarded as noise because they occupy the less dense areas of denser regions, and can't, with relative certainty, be assigned to any one cluster, are more interesting. To designate them as 'noise' some notion of cluster membership, and a membership threshold, M, seems desirable. Fuzzy clustering ideas, model-based probabilistic ideas and peeling/trimming ideas all lend themselves to this kind of approach. The choice of M is not trivial; the idea has not been explored in detail here, but choosing M so that 'obvious' features such as the hearth are delineated, and core clusters are visually distinct, while keeping the proportion of noise as small as possible, seems a sensible avenue to explore. For any value of k a large enough proportion of noise, suitably defined, will isolate core clusters, so in itself this may not help much in choosing k. Solutions which require a large amount of noise before clarity is achieved might, perhaps, be rejected.
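One way to get a feel for the choice, sketched here on the assumption that a fuzzy analysis has produced a membership matrix cl.mem (rows are points, columns are clusters, as in the Appendix code), is to tabulate the proportion of points classed as noise against M:

maxmem <- apply(cl.mem, 1, max)   # maximum membership for each point
for (M in seq(0.3, 0.7, by = 0.1))
    cat("M =", M, "noise proportion =", round(mean(maxmem < M), 3), "\n")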

Putting these ideas together, DBSCAN and methods that behave like it should not be regarded as competitive, because of the difficulty of coping with clusters of differing density. Trimmed k-means with a large trimming proportion (a = 0.25 in Figure 8b) probably falls into this category.

Although it is mathematically correct that k-means and other crisp clustering methods define a complete partition of space, rejecting them on these grounds alone is misconceived. If clusters are tightly defined and well separated there is not a problem; a representation in terms of convex hulls rather than Voronoi diagrams will emphasise that there is plenty of unoccupied space. Departing from this ideal situation, the real problem is that all points must be allocated to a cluster, even if they are unequivocally noise. From this point of view such methods are often unsatisfactory, but peeling/trimming – either using fuzzy clustering and related methods directly, or using them to define noise and reinterpreting crisp clusters in this light – is a possibility that has been illustrated.

If the ability to delineate the hearth is taken as a touchstone, the fuzzy c-means analysis of Figure 5 does a nice job, as well as suggesting the more peripheral clusters. This treats about 25% of the data as noise; a value of M < 0.5 might produce an equally satisfactory solution with less noise if the analysis was pursued. Overlaying results from the fuzzy clustering on those for k-means reveals the main problem with the latter, chiefly that some clusters are far too large, in terms of areal extent, because of the inclusion of points well away from the core. The fuzzy c-medoids analysis of Figure 7b also seems to me to be reasonable; it's broadly similar to the fuzzy c-means analysis, treating the south-east quadrant as noise and adding an extra cluster to the denser upper-central scatter. I'm not sure, other than subjectively, how you might assess the relative merits of these.

The more lightly trimmed k-means analysis of Figure 8a fails the 'hearth test' so doesn't really convince me; varying the trimming proportion a did not resolve this. The crisp mixture-model based clusters of Figure 12 don't really convince me either. It's possible to model noise within the Mclust framework.5 I've experimented with this – it depends on a good initial estimate of noise – with limited success, so the results, which isolated a small cluster within the hearth and tended not to suggest peripheral clusters, are not presented. There are other approaches that involve more formal noise modelling, notably in the vegclust package, that I have yet to explore and that would take up more space than is allowed for here.

5 As a homogeneous Poisson process over the region of the point scatter, with clusters defined by a similar process over a sub-region of the data.

At the outset of this paper it was stated that the intention was to explore rather than 'sell' alternatives to k-means methodology, using methods closely related to it. That I had views which would 'intrude' was acknowledged; I wasn't entirely sure what these would turn out to be at the time. A central thesis of the paper is that the ideas involved in several of the methods studied are as easily understood as k-means. In fact, with the added and unobjectionable idea that noise identification is advantageous, the basic ideas of k-medoids and fuzzy variants of this and k-means are pretty much the same.

The advertised charms of alternatives that stray from the k-means idiom, especially DBSCAN, don't much persuade me. Mixture modelling is probably worth a more thorough examination beyond the treatment accorded it here, but it's more complex to apply and I'm not as convinced as I once was (Baxter, 2003: 102–4) that it will deliver much in the way of practical advantages.

Virtually nothing has been said about issues involving the choice of k and the tendency of k-means to produce spherical clusters, beyond indicating that I suspect these are sometimes exaggerated. If you believe the publicity that typically accompanies the method, k-medoids is more robust and less prone to producing circular clusters. There is no reason why this shouldn't be explored more than it has been. Unless I've overlooked great swathes of literature on the subject, it's tempting to say that the absence of applications of fuzzy clustering in archaeology – not just to spatial data – surprises me, but it wouldn't be true. I've suggested elsewhere that archaeologists are sometimes conservative in their use of statistical methods, so the dominance of 'custom and practice' when potentially useful, easily applied alternatives are available (see the Appendix) is perhaps regrettable.

This will not, I hope, be read as a negative appraisal of spatial k-means clustering. It seems to be a perfectly sensible starting point for an analysis; it also seems to me that the potential exists, with virtually no extra effort, to improve on the method while remaining within the bounds of what is readily comprehensible. There is no excuse for not exploring these avenues, even if they are eventually judged to lead to a dead end.

References

Alconini, S. (2004) The southeastern Inka frontier against the Chiriguanos: structure and dynamics of the Inka imperial borderlands. Latin American Antiquity 15, 389–418.

Aldenderfer, M. S. (1998) Quantitative methods in archaeology: A review of recent trends and developments. Journal of Archaeological Research 6, 91–120.

Anderson, K. I. and Burke, A. (2008) Refining the definition of cultural levels at Karabi Tamchin: a quantitative approach to vertical intra-site spatial analysis. Journal of Archaeological Science 35, 2274–2285.

Argote-Espino, D., Solé, J., López-García, P. and Sterpone, O. (2012) Obsidian subsource identification in the Sierra de Pachuca and Otumba volcanic regions, Central Mexico, by ICP-MS and DBSCAN statistical analysis. Geoarchaeology 27, 48–62.

Argote-Espino, D., Solé, J., López-García, P. and Sterpone, O. (2013) Geochemical characterisation of Otumba obsidian sub-sources (Central Mexico) by inductively coupled plasma mass spectrometry and density-based spatial clustering of applications with noise statistical analysis. Open Journal of Archaeometry 1, e18.

Baxter, M. J. (2003) Statistics in Archaeology. London: Arnold.

Baxter, M. J. (2009) Archaeological data analysis and fuzzy clustering. Archaeometry 51, 1035–1054.

Baxter, M. J. (2015) Notes on Quantitative Archaeology and R. https://nottinghamtrent.academia.edu/MikeBaxter

Bevan, A. and Conolly, J. (2006) Multi-scalar approaches to settlement pattern analysis. In Lock, G. and Molyneaux, B. (eds.) Confronting Scale in Archaeology: Issues of Theory and Practice. New York: Springer, 217–234.

Bezdek, J. C. (1974) Numerical taxonomy with fuzzy sets. Journal of Mathematical Biology 1, 57–71.

Binford, L. R. (1978) Dimensional analysis of behavior and site structure: learning from an Eskimo hunting stand. American Antiquity 43, 330–361.

Blankholm, H. P. (1991) Intrasite Spatial Analysis in Theory and Practice. Aarhus: Aarhus University Press.

Blashfield, R., Aldenderfer, M. and Morey, L. C. (1982) Validating a cluster analytic solution. In Hudson, H. A. H. (ed.) Classifying Social Data. San Francisco, CA: Jossey-Bass, 167–176.

Crema, E. R. (2013) Cycles of change in Jomon settlement: a case study from eastern Tokyo Bay. Antiquity 87, 1169–1181.

Cuesta-Albertos, J. A., Gordaliza, A. and Matrán, C. (1997) Trimmed k-means: an attempt to robustify quantizers. The Annals of Statistics 25, 553–576.

Dixon, B., Gosser, D. and Williams, S. S. (2008) Traditional Hawaiian men's houses and their socio-political context in Lualualei, Leeward West O'ahu, Hawai'i. Journal of the Polynesian Society 117, 267–295.

Doran, J. E. and Hodson, F. R. (1975) Mathematics and Computers in Archaeology. Edinburgh: Edinburgh University Press.

Ducke, B. (2015) Spatial cluster detection in archaeology: Current theory and practice. In Barceló, J. A. and Bogdanovich, I. (eds.) Mathematics and Archaeology. Boca Raton: CRC Press, 353–368.

Dunn, J. C. (1974) Some recent investigations of a new fuzzy partitioning algorithm and its application to pattern classification problems. Journal of Cybernetics 4, 1–15.

Enloe, J. G., David, F. and Hare, T. S. (1994) Patterns of faunal processing at Section 27 of Pincevent: the use of spatial analysis and ethnoarchaeological data in the interpretation of archaeological site structure. Journal of Anthropological Archaeology 13, 105–124.

Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In Simoudis, E., Han, J. and Fayyad, U. M. (eds.) Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). Palo Alto, CA: AAAI Press, 226–231.

Everitt, B. S., Landau, S., Leese, M. and Stahl, D. (2011) Cluster Analysis: Fifth Edition. Chichester: Wiley.

Fraley, C. and Raftery, A. E. (2000) Model-based Clustering, Discriminant Analysis, and Density Estimation. Technical Report 380, Department of Statistics, University of Washington, USA.

Fraley, C., Raftery, A. E., Murphy, T. B. and Scrucca, L. (2012) MCLUST Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report no. 597, Department of Statistics, University of Washington, USA.

García-Escudero, L. A. and Gordaliza, A. (1999) Robustness properties of k-means and trimmed k-means. Journal of the American Statistical Association 94, 956–969.

García-Escudero, L. A., Gordaliza, A. and Matrán, C. (2003) Trimming tools in exploratory data analysis. Journal of Computational and Graphical Statistics 12, 434–449.

Gregg, S. A., Kintigh, K. W. and Whallon, R. (1991) Linking ethnoarchaeological interpretation and archaeological data: the sensitivity of spatial analytical methods to post-depositional disturbance. In Kroll, E. and Price, T. D. (eds.) The Interpretation of Archaeological Spatial Patterning. New York: Plenum Press, 149–196.

Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Second Edition. New York: Springer.

Hodson, F. R. (1970) Cluster analysis and archaeology: some new developments and applications. World Archaeology 1, 299–320.

Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.

Kintigh, K. W. (1990) Intrasite spatial analysis: a commentary on major methods. In Voorrips, A. (ed.) Mathematics and Information Science in Archaeology: A Flexible Framework. Studies in Modern Archaeology 3. Bonn: Holos, 165–200.

Kintigh, K. W. and Ammerman, A. J. (1982) Heuristic approaches to spatial analysis in archaeology. American Antiquity 47, 31–63.

Koetje, T. A. (1987) Spatial Patterns in Magdalenian Open Air Sites from the Isle Valley, Southwestern France. British Archaeological Reports International Series 346. Oxford: British Archaeological Reports.

Ladefoged, T. N. and Pearson, R. (2000) Fortified castles on Okinawa Island during the Gusuku Period, AD 1200–1600. Antiquity 74, 404–412.

Lemke, A. K. (2013) Cutmark systematics: analyzing morphometrics and spatial patterning at Palangana. Journal of Anthropological Archaeology 32, 16–27.

López, P., Lira, J. and Hein, I. (2015) Discrimination of ceramic types using digital image processing by means of morphological filters. Archaeometry 57, 146–162.

McAndrews, T. L., Albarracin-Jordan, J. and Bermann, M. (1997) Regional settlement patterns in the Tiwanaku Valley of Bolivia. Journal of Field Archaeology 24, 67–83.

Orton, C. (2004) Point pattern analysis revisited. Archeologia e Calcolatori 15, 299–315.

Papageorgiou, I., Baxter, M. J. and Cau, M. A. (2001) Model-based cluster analysis of artefact compositional data. Archaeometry 43, 571–588.

R Core Team (2015) R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.

Savage, S. H. (1997) Descent group competition and economic strategies in Predynastic Egypt. Journal of Anthropological Archaeology 16, 226–268.

Smith, B. A., Davies, T. M. and Higham, C. F. W. (2015) Spatial and social variables in the Bronze Age Phase 4 cemetery of Ban Non Wat, Northeast Thailand. Journal of Archaeological Science: Reports 4, 362–370.

Steinley, D. (2006) K-means clustering: a half century synthesis. British Journal of Mathematical and Statistical Psychology 59, 1–34.

Vaquero, M. (1999) Intrasite spatial organization of lithic production in the Middle Palaeolithic: the evidence of the Abric Romaní (Capellades, Spain). Antiquity 73, 493–504.

Whallon, R. W. (1984) Unconstrained clustering for the analysis of spatial distributions in archaeology. In Hietala, H. (ed.) Intrasite Spatial Analysis in Archaeology. Cambridge: Cambridge University Press, 242–277.

Wickham, H. (2009) ggplot2. New York: Springer.

Appendix – R code.

For those unfamiliar with R, Baxter (2015) is a book-length account of how R can be used for quantitative archaeological analysis. It does not provide a systematic treatment of R but there is enough to get started. David Carlson's web page, http://people.tamu.edu/~dcarlson/quant/, includes a useful and more systematic treatment of R aimed at students doing an introductory quantitative methods course. There are well over 100 books on R out there, as well as plenty of free introductory material, to which Carlson provides a brief guide.

Apart from installing R and getting data into it, the main thing to know is that there are a lot of contributed packages available that need to be both installed and then loaded before use. After installation, loading is done using the library function; most of the code to follow makes use of contributed packages.
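For example, for the ggplot2 package used in the plotting code below (installation is needed once only):

install.packages("ggplot2")   # install once
library(ggplot2)              # load in each session that uses it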


k-means clustering

It is assumed that the coordinates to be clustered are contained in a two-column data frame, data, with names X and Y. The 'guts' of the analysis is PART A, which will need to be varied according to the method. The points labelled by cluster can be plotted in any way you wish; the data and cluster ids., in df, can be exported for plotting in a package other than R if preferred. The code in PART B uses the ggplot2 package for R (Wickham, 2009) to label clusters with different colours. As written, this plots 90% confidence ellipsoids, which can be varied. Code for plotting polygons is in PART C; PART D gives code for overlaying output from an analysis with a noise cluster on a crisp clustering – this distinguishes between noise and non-noise. The number of random starts is controlled using the nstart argument in the kmeans function; the argument differs according to the method used, other possibilities being covered in PART E.

PART A – k-means cluster analysis

nc <- 9                              # set number of clusters
Cl <- kmeans(data, nc, nstart = 100) # execute analysis; 100 random starts
cl.out <- Cl$cluster                 # extract cluster labels
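A quick, optional, check on the outcome is to tabulate the cluster sizes and inspect the total within-cluster sum of squares:

table(cl.out)    # cluster sizes
Cl$tot.withinss  # total within-cluster sum of squares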

PART B – plot using ellipsoids

library(ggplot2)                      # load library used for plotting
dfcl <- data.frame(as.factor(cl.out)) # convert ids. to factor
df <- cbind(data, dfcl)
names(df) <- c("X", "Y", "Cluster")
P <- ggplot(data = df[df$Cluster != "0", ], aes(x = X, y = Y, colour = Cluster)) +
  geom_point() + coord_fixed() +
  stat_ellipse(level = 0.90) +
  geom_point(data = df[df$Cluster == "0", ], aes(x = X, y = Y), colour = "white") +
  # The rest of this controls labelling and appearance
  xlab("Easting") + ylab("Northing") + ggtitle("Title") +
  theme(legend.position = "none") +
  theme(plot.title = element_text(size = 16, face = "bold")) +
  theme(panel.grid.minor = element_blank())
P

This has been kept simple; see the help facilities in R and the documentation for the ggplot2 package for more detail about fine control (which can be messy). As written this does not include the legends shown in the text. Change legend.position = "none" to "top" for a legend at the top (by default it is to the right side).

The [df$Cluster != "0", ] components can be omitted, along with the second geom point linewhen using k-means. They are included to plot, as white points, outliers/noise points labelled with 0 inother methods, and don’t do any harm here.

PART C – plot using polygons

Modify the code in PART B as follows. Add to the 'preamble' preceding construction of the plot, P, the code library(plyr); then, following names(df), add

chulls <- ddply(df, .(Cluster), function(df) df[chull(df$X, df$Y), ])

then, in the plot, replace stat_ellipse(level = 0.90) + with

geom_polygon(data = chulls, aes(x = X, y = Y, fill = Cluster)) + coord_fixed() +
geom_point(colour = "black") +


The polygons will be disjoint for a crisp clustering; an outlier cluster (if any, and coded as 0) will appear as a filled polygon on which other polygons will be superimposed. For a more sober presentation with a crisp clustering use something like

geom_polygon(data = chulls, aes(x = X, y = Y), fill = "skyblue")

For applications with a noise cluster use something like

scale_fill_manual(values = c("red", rep("skyblue", nlevels(df$Cluster) - 1)))

Colors can be chosen almost at will; if you want to change the default grey background add something like

theme(panel.background = element_rect(fill = "green", colour = "red"))

to the plot, where the first colour is the background and the second the border colour.

PART D – superimpose fuzzy clusters on crisp clusters

Assuming that the method used generates cluster labels with noise coded as 0, set up the data as you would if plotting polygons, inserting

FC <- cl.out
FC[FC > 0] <- 1
FC <- as.factor(FC)
FC0 <- data[FC == "0", ]; names(FC0) <- c("X", "Y")
FC1 <- data[FC != "0", ]; names(FC1) <- c("X", "Y")

before creating dfcl, then

P <- ggplot() + coord_fixed() +
  geom_polygon(data = chulls, aes(x = X, y = Y, fill = Cluster)) +
  scale_fill_manual(values = c(rep("skyblue", nlevels(df$Cluster)))) +
  geom_point(data = FC0, aes(x = X, y = Y), color = "white") +
  geom_point(data = FC1, aes(x = X, y = Y), color = "black") +

remembering to add the labelling commands.

PART E – other methods

Replace the code in PART A with

k-medoids

library(WeightedCluster)                 # load required library
nc <- 9
Cl <- wcKMedoids(dist(data), nc, npass = 100)
cl.out <- Cl$clustering
cl.out <- as.numeric(as.factor(cl.out))  # relabel medoid ids. as 1, 2, ...

Trimmed k-means

library(trimcluster)
Cl <- trimkmeans(data, nc, trim = 0.10, runs = 100)
cl.out <- Cl$classification
cl.out[cl.out == nc + 1] <- 0  # trimmed points are labelled nc + 1; recode as 0

Vary the trim argument to control the amount of trimming.
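For example, a sketch comparing several trimming proportions (the values shown are illustrative; criterion is the objective function value reported by trimkmeans):

for (a in c(0.05, 0.10, 0.25)) {
    Cl <- trimkmeans(data, nc, trim = a, runs = 100)
    cat("trim =", a, "criterion =", round(Cl$criterion, 2), "\n")
}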


Fuzzy c-means

library(vegclust)
nc <- 9
M <- 0.5                                 # set membership threshold
Cl <- vegclust(data, nc, method = "FCM", nstart = 50)
cl.out <- defuzzify(Cl)$cluster          # a crisp clustering
cl.mem <- Cl$memb
maxmem <- apply(cl.mem, 1, max)          # maximum membership for each point
cl.out <- ifelse(maxmem < M, 0, cl.out)  # define "outliers"; depends on M

This requires a little extra effort to extract the cluster labels in a form suitable for plotting, treating cases whose membership is less than M for all clusters as noise. Set M = 0 if a crisp clustering is wanted.

Fuzzy c-medoids

library(vegclust)
Cl0 <- vegclust(data, nc, method = "KMdd", nstart = 100)  # crisp k-medoids
Centers <- Cl0$mobileCenters             # medoids used as initial centres below
M <- 0.28                                # set membership threshold
Cl <- vegclust(data, mobileCenters = Centers, method = "FCMdd", m = 2, nstart = 100)
cl.out <- defuzzify(Cl)$cluster          # a crisp clustering
cl.mem <- Cl$memb
maxmem <- apply(cl.mem, 1, max)          # define outliers; depends on M
cl.out <- ifelse(maxmem < M, 0, cl.out)

This is more fiddly because using random starts, even a large number, failed to produce a stable clustering. The first two lines after loading the package carry out a crisp k-medoids analysis and then extract its centres, which are used as the initial cluster centres in the fuzzy analysis.

DBSCAN

library(fpc)
EPS <- 0.30; MINPTS <- 10  # set parameters to control clustering
Cl <- dbscan(data, eps = EPS, MinPts = MINPTS)
cl.out <- Cl$cluster       # noise points are labelled 0

In general the output is not suited to plotting with ellipses or polygons, so use the code for plotting ellipses omitting stat_ellipse.
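The choice of eps has not been addressed here; a common rule-of-thumb sketch, in base R, is to look for an 'elbow' in the sorted distances to each point's MINPTS-th nearest neighbour (an informal heuristic, not part of the analyses reported in the text):

dmat <- as.matrix(dist(data))
kdist <- apply(dmat, 1, function(d) sort(d)[MINPTS + 1])  # + 1 skips the self-distance
plot(sort(kdist), type = "l", ylab = "distance to MINPTS-th neighbour")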

Model-based clustering

library(mclust)
Cl <- Mclust(data, G = 1:15)  # consider 1- to 15-cluster solutions
cl.out <- Cl$classification
print(summary(Cl))

This is a minimal analysis; you'd usually aspire to more. As written, 1- to 15-cluster solutions are examined. Use summary(Cl) to see what model has been selected; use plot(Cl, "BIC") for the BIC plot and "uncertainty" for the uncertainty plot. To vary the default choice of model use something like

Mclust(data, G = 12, modelNames = "EII")

which would fit a 12-cluster model assuming spherical clusters of equal size. See the help on mclustModelNames for what's available.

If it transpires that one of the clusters can be interpreted as a noise cluster and you want to indicate this, locate the offending cluster, change its label to 0 and proceed as normal. Thus, if Cluster 5 is the offending cluster, cl.out[cl.out == 5] <- 0 will do the trick.
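For the noise modelling mentioned in the text (see Footnote 5), Mclust accepts an initial estimate of the noise points via its initialization argument; noise is then labelled 0 in the classification. A sketch, using a crude nearest-neighbour screen with a hypothetical threshold as the initial estimate (the text notes that results depend on this estimate being a good one):

dmat <- as.matrix(dist(data)); diag(dmat) <- Inf
noiseInit <- apply(dmat, 1, min) > 0.5   # hypothetical threshold
Cl <- Mclust(data, G = 1:15, initialization = list(noise = noiseInit))
cl.out <- Cl$classification              # noise labelled 0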
