
Chapter 9
Clustering and Unsupervised Classification

9.1 How Clustering is Used

The classification techniques treated in Chap. 8 all require the availability of labelled training data with which the parameters of the respective class models are estimated. As a result, they are called supervised techniques because, in a sense, the analyst supervises an algorithm's learning about those parameters. Sometimes labelled training data is not available and yet it would still be of interest to convert remote sensing image data into a thematic map of labels. Such an approach is called unsupervised classification since the analyst, in principle, takes no part in an algorithm's learning process. Several methods are available for unsupervised learning. Perhaps the most common in remote sensing is based on the use of clustering algorithms, which seek to identify pixels in an image that are spectrally similar. That is one of the applications of clustering treated in this chapter.

Clustering techniques find other applications in remote sensing, particularly in resolving sets of Gaussian modes (single multivariate normal distributions) in image data before the Gaussian maximum likelihood classifier can be used successfully. This is necessary since each class has to be modelled by a single normal probability distribution, as discussed in Chap. 8. If a class happens to be multimodal, and that is not resolved, then the modelling will not be very effective. Users of remotely sensed data can only specify the information classes. Occasionally it might be possible to guess the number of spectral classes in a given information class but, in general, the user would have little idea of the number of distinct unimodal groups into which the data falls in spectral space. Gaussian mixture modelling can be used for that purpose (Sect. 8.4) but the complexity of estimating simultaneously the number of Gaussian components, and their parameters, can make that approach difficult to use. Clustering procedures are practical alternatives and have been used in many fields to enable inherent data structures to be determined.

There are many clustering methods. In this chapter only those commonly employed with remote sensing data are treated.


9.2 Similarity Metrics and Clustering Criteria

In clustering we try to identify groups of pixels because they are somehow similar to each other. The only real attributes that we can use to check similarity are the spectral measurements recorded by the sensor used to acquire the data.1 Here, therefore, clustering will imply a grouping of pixels in the spectral domain. Pixels belonging to a particular cluster will be spectrally similar. In order to quantify their spectral proximity it is necessary to devise a measure of similarity. Many similarity measures, or metrics, have been proposed but those used commonly in clustering procedures are usually simple distance measures in spectral space. The most frequently encountered are the Euclidean (L2) distance and the city block or Manhattan (L1) distance. If x_1 and x_2 are the measurement vectors of two pixels whose similarity is to be checked then the Euclidean distance between them is

d(x_1, x_2) = \|x_1 - x_2\| = \{(x_1 - x_2) \cdot (x_1 - x_2)\}^{1/2} = \{(x_1 - x_2)^T (x_1 - x_2)\}^{1/2} = \left\{ \sum_{n=1}^{N} (x_{1n} - x_{2n})^2 \right\}^{1/2} \qquad (9.1)

where N is the number of spectral components.

The city block (L1) distance between the pixels is just the accumulated difference along each spectral dimension, similar to walking between two locations in a city laid out on a rectangular street grid. It is given by

d_{L1}(x_1, x_2) = \sum_{n=1}^{N} |x_{1n} - x_{2n}| \qquad (9.2)

Clearly the latter is faster to compute but questions must be raised as to how spectrally similar all the pixels within a given L1 distance of each other will be. The Euclidean and city block distance measures are two special cases of the Minkowski Lp distance metric2

d_{Lp}(x_1, x_2) = \left\{ \sum_{n=1}^{N} |x_{1n} - x_{2n}|^p \right\}^{1/p}

When p = 1 we have the city block distance, while when p = 2 we have the Euclidean distance.
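All three metrics are easy to evaluate for a pair of pixel vectors. The fragment below is a minimal NumPy sketch; the helper function name and the example vectors are invented for the illustration and do not come from the text.

import numpy as np

def minkowski_distance(x1, x2, p=2.0):
    # Minkowski Lp distance: p = 1 gives the city block distance of (9.2),
    # p = 2 gives the Euclidean distance of (9.1)
    return np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)

# two hypothetical pixel vectors with N = 3 spectral components
x1 = np.array([4.0, 4.0, 4.0])
x2 = np.array([5.0, 5.0, 5.0])

print(minkowski_distance(x1, x2, p=2))   # Euclidean (L2) distance
print(minkowski_distance(x1, x2, p=1))   # city block (L1) distance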

1 We might also decide that spatially neighbouring pixels are likely to belong to the same group; some clustering algorithms use combined spectral and spatial similarity.
2 See J. Lattin, J. Douglas Carroll and P. E. Green, Analyzing Multivariate Data, Thompson, Brooks/Cole, Canada, 2003.


By using a distance measure it should be possible to determine clusters in data. Often there can be several acceptable cluster assignments, as illustrated in Fig. 9.1, so that once a candidate clustering has been identified it is desirable to have a means by which the "quality" of that clustering can be measured. The availability of such a measure should allow one cluster assignment of the data to be chosen over all others.

A common clustering criterion, or quality indicator, is the sum of squared error (SSE) measure; when based on Euclidean distance it is defined as

\mathrm{SSE} = \sum_{C_i} \sum_{x \in C_i} \|x - m_i\|^2 = \sum_{C_i} \sum_{x \in C_i} (x - m_i)^T (x - m_i) \qquad (9.3)

in which m_i is the mean vector of the ith cluster and x ∈ C_i is a pixel assigned to that cluster. The inner sum computes the aggregated squared distance of all the pixels in the cluster to the respective cluster mean, while the outer sum adds the results over all the clusters. It will be small for tightly grouped clusters and large otherwise, thereby allowing an assessment of the quality of clustering.

Note that SSE has a theoretical minimum of zero, which corresponds to all clusters containing a single data point. As a result, if an iterative method is used to seek the natural clusters or spectral classes in a set of data then it has a guaranteed termination point, at least in principle. In practice it may be too expensive to allow natural termination. Instead, iterative procedures are often stopped when an acceptable degree of clustering has been achieved.
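The SSE of (9.3) is straightforward to evaluate for any candidate clustering. The following is a minimal NumPy sketch; the function name, variable names and the small data set are invented for the illustration.

import numpy as np

def sum_squared_error(pixels, labels):
    # pixels is a (K, N) array of pixel vectors; labels holds one cluster index per pixel
    sse = 0.0
    for c in np.unique(labels):
        members = pixels[labels == c]           # pixels assigned to cluster c
        mean = members.mean(axis=0)             # cluster mean m_i
        sse += np.sum((members - mean) ** 2)    # aggregated squared distance to the mean
    return sse

# small two dimensional example with two compact groups
pixels = np.array([[1.0, 1.0], [2.0, 1.0], [8.0, 8.0], [9.0, 9.0]])
labels = np.array([0, 0, 1, 1])
print(sum_squared_error(pixels, labels))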

It is possible now to consider the implementation of an actual clustering algorithm. While it should depend on finding the clustering that minimises SSE, that is impracticable since it requires the calculation of an enormous number of values of SSE to evaluate all candidate clusterings.

Fig. 9.1 Two apparently acceptable clusterings of a set of two dimensional data


For example, there are approximately C^K/C! ways of placing K pixel vectors into C clusters;3 that will be an enormous number for any practical image size. Rather than embark on such a rigorous and computationally expensive approach, the heuristic procedure of the following section is usually adopted in practice.

9.3 k Means Clustering

The k means clustering method, also called migrating means and iterative optimisation, is one of the most common approaches used in image analysis applications. With certain refinements it becomes the Isodata technique treated in the next section.

The k means approach requires an initial assignment of the available measurement vectors into a user-specified number of clusters, with arbitrarily specified initial cluster centres that are represented by the means of the pixel vectors assigned to them. This will generate a very crude set of clusters. The pixel vectors are then reassigned to the cluster with the closest mean, and the means are recomputed. The process is repeated as many times as necessary such that there is no further movement of pixels between clusters. In practice, with large data sets, the process is not run to completion and some other stopping rule is used, as discussed in the following. The SSE measure progressively reduces with iteration, as will be seen in the example to follow.

9.3.1 The k Means Algorithm

The k means or iterative optimisation algorithm is implemented in the following steps.

1. Select a value for C, the number of clusters into which the pixels are to be grouped.4 This requires some feel beforehand as to the number of clusters that might naturally represent the image data set. Depending on the reason for using clustering some guidelines are available (see Sect. 11.4.2).

2. Initialise cluster generation by selecting C points in spectral space to serve as candidate cluster centres. Call these

\hat{m}_c, \quad c = 1 \ldots C

In principle the choice of the m̂_c at this stage is arbitrary, with the exception that no two can be the same.

3 See R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, 2nd ed., John Wiley & Sons, N.Y., 2001.
4 By its name the k means algorithm actually searches for k clusters; here, however, we use C for the total number of clusters but retain the common name by which the method is known.


To avoid anomalous cluster generation with unusual data sets it is generally best to space the initial cluster centres uniformly over the data (see Sect. 9.5). That can also aid convergence.

3. Assign each pixel vector x to the candidate cluster of the nearest mean, using an appropriate distance metric in the spectral domain between the pixel and the cluster means. Euclidean distance is commonly used. That generates a cluster of pixel vectors about each candidate cluster mean.

4. Compute a new set of cluster means from the groups formed in Step 3; call these

m_c, \quad c = 1 \ldots C

5. If m_c = m̂_c for all c then the procedure is complete. Otherwise the m̂_c are set to the current values of m_c and the procedure returns to Step 3.

This process is illustrated with a simple two dimensional data set in Fig. 9.2.
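The five steps above translate almost directly into code. The following is a minimal NumPy sketch, not the MultiSpec implementation; the initialisation along the data diagonal (anticipating Sect. 9.5), the iteration cap and the treatment of empty clusters are assumptions made for the example.

import numpy as np

def k_means(pixels, n_clusters, max_iter=100):
    # Steps 1 and 2: choose C and initial centres spaced along the diagonal of the data extremes
    lo, hi = pixels.min(axis=0), pixels.max(axis=0)
    fractions = (np.arange(n_clusters) + 1.0) / (n_clusters + 1.0)
    means = lo + fractions[:, None] * (hi - lo)

    for _ in range(max_iter):
        # Step 3: assign each pixel to the cluster with the nearest mean (Euclidean distance)
        dists = np.linalg.norm(pixels[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the cluster means (an empty cluster keeps its previous mean)
        new_means = np.array([pixels[labels == c].mean(axis=0) if np.any(labels == c)
                              else means[c] for c in range(n_clusters)])
        # Step 5: finish when the means no longer move
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

# hypothetical two dimensional data with two obvious groups
data = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]], dtype=float)
labels, means = k_means(data, n_clusters=2)
print(labels)
print(means)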

9.4 Isodata Clustering

The Isodata clustering algorithm5 builds on the k means approach by introducing a number of checks on the clusters formed, either during or at the end of the iterative assignment process. Those checks relate to the number of pixels assigned to clusters and their shapes in the spectral domain.

9.4.1 Merging and Deleting Clusters

At any suitable stage clusters can be examined to see whether:

(a) any contain so few points as to be meaningless; for example, if the statistical distributions of pixels within clusters are important, as they might be when clustering is used as a pre-processing operation for maximum likelihood classification (see Sect. 11.3.4), sufficient pixels per cluster must be available to generate reliable mean and covariance estimates;

(b) any are so close together that they represent an unnecessary or inappropriate division of the data, in which case they should be merged.

5 G.H. Ball and D.J. Hall, A novel method of data analysis and pattern recognition, Stanford Research Institute, Menlo Park, California, 1965.


In view of the material of Sect. 8.3.6 a guideline exists for (a). A cluster cannot reliably be modelled by a multivariate normal distribution unless it contains about 10N members, where N is the number of spectral components.

Fig. 9.2 Illustration of clustering with the k means, or iterative optimisation, algorithm, showing a progressive reduction in SSE; also shown is how the cluster means migrate during the process


Decisions in (b) about when to merge adjacent clusters can be made by assessing how similar they are spectrally. Similarity can be assessed simply by the distance between them in the spectral domain, although more sophisticated similarity measures are available (see Chap. 10).

9.4.2 Splitting Elongated Clusters

Another test sometimes incorporated in the Isodata algorithm concerns the shapes of clusters in spectral space. Clusters that are elongated can be split in two, if required. Such a decision can be made on the basis of pre-specifying a standard deviation in each spectral band beyond which a cluster should be halved.
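In skeletal form the checks of Sects. 9.4.1 and 9.4.2 might be coded as follows. The cluster representation and the three threshold parameters are assumptions made for this sketch rather than values prescribed by the Isodata algorithm.

import numpy as np

def isodata_checks(clusters, min_size=30, merge_distance=2.0, split_stddev=4.0):
    # clusters is a list of (K_i, N) arrays, one array of pixel vectors per cluster
    # (a) delete clusters with too few pixels for reliable statistics
    clusters = [c for c in clusters if len(c) >= min_size]

    # (b) merge clusters whose means are very close in spectral space
    merged, used = [], set()
    for i in range(len(clusters)):
        if i in used:
            continue
        current = clusters[i]
        for j in range(i + 1, len(clusters)):
            if j not in used and np.linalg.norm(
                    current.mean(axis=0) - clusters[j].mean(axis=0)) < merge_distance:
                current = np.vstack([current, clusters[j]])
                used.add(j)
        merged.append(current)

    # split any cluster that is too elongated in one band (Sect. 9.4.2)
    result = []
    for c in merged:
        stddev = c.std(axis=0)
        if stddev.max() > split_stddev and len(c) >= 2 * min_size:
            band = stddev.argmax()                      # the most elongated band
            below = c[:, band] <= np.median(c[:, band])
            result.extend([c[below], c[~below]])        # halve the cluster about its median
        else:
            result.append(c)
    return result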

9.5 Choosing the Initial Cluster Centres

Initialising the k means and Isodata procedures requires specification of the number of clusters and their initial mean positions. In practice the actual or optimum number of clusters to choose will not be known. Therefore it is often chosen conservatively high, having in mind that any spectrally similar clusters that result can be consolidated after the process is completed, or at intervening iterations, if a merging option is available.

The choice of the initial locations of the cluster centres is not critical, although it can influence the time it takes to reach a final, acceptable clustering. In some extreme cases it might influence the final set of clusters found. Several procedures are in common practice. In one, the initial cluster centres are chosen uniformly spaced along the multidimensional diagonal of the spectral space. That is a line from the origin to the point corresponding to the maximum brightness value in each spectral component. The choice can be refined if the user has some idea of the actual range of brightness values in each spectral component, say by having previously computed histograms. The cluster centres would then be initialised along a diagonal through the actual multidimensional extremities of the data. An alternative, implemented as an option in the MultiSpec package,6 is to distribute the initial centres uniformly along the first eigenvector (principal component) of the data. Since most data exhibits a high degree of correlation, the eigenvector approach is essentially a refined version of the first method.

The choice of the initial cluster locations using these methods is reasonable and effective since they are then spread over a region of the spectral space in which many classes occur, particularly for correlated data such as that for soils, rocks, concretes, etc.
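Both initialisation strategies are simple to express in code. The sketch below is illustrative only; the function names are invented and the principal component variant assumes the covariance matrix of the data can be formed directly.

import numpy as np

def centres_along_diagonal(pixels, n_clusters):
    # centres uniformly spaced along the diagonal joining the data extremes in each band
    lo, hi = pixels.min(axis=0), pixels.max(axis=0)
    fractions = (np.arange(n_clusters) + 1.0) / (n_clusters + 1.0)
    return lo + fractions[:, None] * (hi - lo)

def centres_along_first_pc(pixels, n_clusters):
    # centres uniformly spaced along the first principal component of the data
    mean = pixels.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(pixels, rowvar=False))
    pc1 = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
    proj = (pixels - mean) @ pc1                # projections of the data onto that axis
    positions = np.linspace(proj.min(), proj.max(), n_clusters)
    return mean + positions[:, None] * pc1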

6 See https://engineering.purdue.edu/~biehl/MultiSpec/.


9.6 Cost of k Means and Isodata Clustering

The need to check every pixel against all cluster centres at each iteration means that the basic k means algorithm can be time consuming to operate, particularly for large data sets. For C clusters and P pixels, P × C distances have to be computed at each iteration, and the smallest found. For N band data, each Euclidean distance calculation will require N multiplications and N additions, ignoring the square root operation, since that need not be carried out. Thus for 20 classes and 10,000 pixels, 100 iterations of k means clustering requires 20 million multiplications per band of data.
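Spelling that figure out, with P = 10,000 pixels, C = 20 clusters and I = 100 iterations,

P \times C \times I = 10{,}000 \times 20 \times 100 = 2 \times 10^{7}

distance evaluations are required, and each evaluation uses one multiplication (and one addition) per band, giving the 20 million multiplications per band quoted above.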

9.7 Unsupervised Classification

At the completion of clustering, pixels belonging to each cluster are usually given a symbol or colour to indicate that they belong to the same group or spectral class. Based on those symbols, a cluster map can be produced; that is a map corresponding to the image which has been clustered, but in which the pixels are represented by their symbol rather than by the original measurement vector. Sometimes only part of an image is used to generate the clusters, but all pixels can be allocated to one of the clusters through, say, a minimum distance assignment of pixels to clusters.

The availability of a cluster map allows a classification to be made. If some pixels with a given label can be identified with a particular ground cover type (by means of maps, site visits or other forms of reference data) then all pixels with the same label can be assumed to be from that class. Cluster identification is often aided by the spatial patterns evident; elongated features, such as roads and rivers, are usually easily recognisable. This method of image classification, depending as it does on a posteriori7 recognition of the classes is, as noted earlier, called unsupervised classification since the analyst plays no part in class definition until the computational aspects are complete.

Unsupervised classification can be used as a stand-alone technique, particularly when reliable training data for supervised classification cannot be obtained or is too expensive to acquire. However, it is also of value, as noted earlier, to determine the spectral classes that should be considered in a subsequent supervised approach. This is pursued in detail in Chap. 11. It also forms the basis of the cluster space method for handling high dimensional data sets, treated in Sect. 9.13 below.

7 That is, after the event.


9.8 An Example of Clustering with the k Means Algorithm

To illustrate the nature of the results produced by the k means algorithm consider the segment of HyMap imagery in Fig. 9.3a, which shows a highway interchange near the city of Perth in Western Australia. It was recorded in January 2010 and consists of vegetation, roadway pavements, water and bare and semi-bare areas. Figure 9.3b shows a scatter diagram for the image in which a near infrared channel (29) is plotted against a visible red channel (15). This is a subspace of the five channels used for the clustering, as summarised in Table 9.1.

The data was clustered using the k means (Isodata) procedure available in MultiSpec. The algorithm was asked to determine six clusters, since a visual inspection of the image showed that to be reasonable. No merging and splitting options were employed, but any clusters with fewer than 125 pixels at the end of the process were eliminated. The results shown in Fig. 9.3c were generated after 8 iterations. The cluster means are plotted in just two dimensions in Fig. 9.3d, while Table 9.2 shows the full five dimensional means which, as a pattern, exhibit the spectral reflectance characteristics of the class names assigned to the clusters. The class labels were able to be found in this exercise both because of the spatial distributions of the clusters and the spectral dependences seen in Table 9.2.

It is important to realise that the results generated in this example are not unique but depend on the clustering parameters chosen, and the starting number of clusters. In practice the user may need to run the algorithm a number of times to generate a segmentation that matches the needs of a particular analysis. Also, in this simple case, each cluster is associated with a single information class; that is usually not the case in more complex situations.

9.9 A Single Pass Clustering Technique

Alternatives to the k means and Isodata algorithms have been proposed and are widely implemented in software packages for remote sensing image analysis. One, which requires only a single pass through the data, is described in the following.

9.9.1 The Single Pass Algorithm

The single pass process was designed originally to be a faster alternative to iterative procedures when the image data was only available in sequential format, such as on a magnetic tape. Nevertheless, the method is still used in some remote sensing image analysis software packages.

A randomly selected set of samples is chosen to generate the clusters, rather than using the full image segment of interest.


The samples are arranged into a two dimensional array. The first row of samples is employed to obtain a starting set of cluster centres. This is initiated by adopting the first sample as the centre of the first cluster. If the second sample in the first row is further away from the first sample than a user-specified critical spectral distance then it is used to form another cluster centre. Otherwise the two samples are said to belong to the same cluster and their mean is computed as the new cluster centre. This procedure, which is illustrated in Fig. 9.4, is applied to all samples in the first row.

Table 9.1 HyMap channels used in the k means clustering example

Channel                  Band centre (nm)    Band width (nm)
7 (visible green)        511.3               17.6
15 (visible red)         634.0               16.4
29 (near infrared)       846.7               16.3
80 (middle infrared)     1616.9              14.8
108 (middle infrared)    2152.7              30.2

Fig. 9.3 a Segment of a HyMap image of Perth, Western Australia, b scatterplot of the image in a near infrared–visible red subspace, c k means clustering result, searching for 6 clusters using the channels specified in Table 9.1, d cluster means in the near infrared–visible red subspace


Once that row has been exhausted, the multidimensional standard deviations of the clusters are computed. Each sample in the second and subsequent rows is checked to see to which cluster it is closest. It is assigned to that cluster if it lies within a user-prescribed number of standard deviations; the cluster statistics are then recomputed. Otherwise that sample is used to form a new cluster centre (which is assigned a nominal standard deviation), as shown in Fig. 9.5. In that manner all the samples are clustered, and clusters with fewer than a prescribed number of pixels are deleted. Should a cluster map be required then the original segment of image data is scanned pixel by pixel and each pixel is labelled according to the cluster to which it is closest, usually on the basis of Euclidean distance. Should it be an outlying pixel, in terms of the available cluster centres, it is not labelled.
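A condensed sketch of the procedure is given below. It follows the description above only in outline: the critical distance, the standard deviation multiplier and the nominal standard deviation given to newly created clusters are illustrative assumptions, and the cluster statistics are simply recomputed from the accumulated members.

import numpy as np

def single_pass_cluster(samples, critical_distance, n_stddev=2.0, nominal_std=1.0):
    # samples is a (rows, cols, N) array of pixel vectors; returns a list of member lists
    clusters = []
    # first row: build the starting cluster centres using the critical distance
    for pixel in samples[0]:
        means = [np.mean(c, axis=0) for c in clusters]
        if means:
            d = [np.linalg.norm(pixel - m) for m in means]
            nearest = int(np.argmin(d))
            if d[nearest] < critical_distance:
                clusters[nearest].append(pixel)      # same cluster; its mean moves accordingly
                continue
        clusters.append([pixel])                     # otherwise start a new cluster

    # second and subsequent rows: assign by a multiple of the cluster standard deviation
    for row in samples[1:]:
        for pixel in row:
            means = [np.mean(c, axis=0) for c in clusters]
            stds = [np.std(c, axis=0).mean() if len(c) > 1 else nominal_std for c in clusters]
            d = [np.linalg.norm(pixel - m) for m in means]
            nearest = int(np.argmin(d))
            if d[nearest] < n_stddev * stds[nearest]:
                clusters[nearest].append(pixel)
            else:
                clusters.append([pixel])             # an outlier starts a new cluster
    return clusters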

9.9.2 Advantages and Limitations of the Single Pass Algorithm

Apart from speed, a major advantage of this approach over the iterative Isodata and k means procedures is its ability to create cluster centres as it proceeds. The user does not need to specify the required number of clusters beforehand. However, the method has limitations. First, the user has to have a feel for the necessary parameters. The critical distance parameter needs to be specified carefully to enable a satisfactory set of initial cluster centres to be established. In addition, the user has to know how many standard deviations to use when assigning pixels in the second and subsequent lines of samples to existing clusters. With experience, those parameters can be estimated reasonably.

Another limitation is the method's dependence on the first line of samples to initiate the clustering. Since it is only a one pass algorithm, and has no feedback checking mechanism by way of iteration, the final set of cluster centres can depend significantly on the character of the first line of samples.

9.9.3 Strip Generation Parameter

Adjacent pixels along a line of image data frequently belong to the same cluster, particularly for images of cultivated regions.

Table 9.2 Cluster centres for the k means (Isodata) exercise in Fig. 9.3

Cluster mean vectors (on 16 bit scale)

Cluster   Label        Channel 7   Channel 15   Channel 29   Channel 80   Channel 108
1         Building     3511.9      3855.7       4243.7       4944.2       4931.6
2         Sparse veg   1509.6      1609.3       4579.5       3641.7       2267.0
3         Bare         1333.9      1570.7       2734.3       2715.1       2058.7
4         Trees        725.6       650.6        3282.4       1676.2       866.6
5         Road         952.3       1037.1       1503.7       1438.5       1202.3
6         Water        479.2       391.1        354.8        231.0        171.6


A method for enhancing the speed of clustering is to compare a pixel with its predecessor and immediately assign it to the same cluster if it is similar. The similarity measure used can be straightforward, consisting of a check of the brightness difference in each spectral band. The difference allowable for two pixels to be part of the same cluster is called the strip generation parameter.

9.9.4 Variations on the Single Pass Algorithm

The single pass technique has a number of variations. For example, the initial cluster centres can be specified by the analyst as an alternative to using the critical distance parameter in Fig. 9.4. Also, rather than use a multiplier of standard deviation for assigning pixels from the second and subsequent rows of samples, some algorithms proceed exactly as for the first row, without employing standard deviation. Some algorithms use the L1 metric of (9.2), rather than Euclidean distance, and some check inter-cluster distances and merge if desired; periodically, small clusters can also be eliminated.

Fig. 9.5 If later samples lie within a set number of standard deviations (dotted circles) they are included in existing clusters, otherwise they start new clusters

Fig. 9.4 Generation of cluster centres using the first row of samples


MultiSpec uses critical distance parameters over the full range, although the user can specify a different critical distance for the second and later rows of samples.

9.9.5 An Example of Clustering with the Single Pass Algorithm

The single pass option available in MultiSpec was applied to the data set of Fig. 9.3. The critical distances for the first and subsequent rows were chosen as 2,500 and 2,800 respectively. Those numbers are so large because the data we are dealing with is 16 bit (on a scale of 0–65,535) and there are five bands involved. The results are shown in Fig. 9.6 and Table 9.3.

Several points are important to note. First, the image and map as displayed were rotated 90 degrees clockwise after clustering to bring them to a north–south orientation from the east–west flight line recorded by the HyMap instrument for this mission. (The same was the case for the data of Fig. 9.3.) Therefore the line of samples used to generate the original set of cluster centres is that down the right hand side of the image. Secondly, the clusters are different from those in Fig. 9.3, and a slightly different class naming has been adopted. Again, it was possible to assign information class labels to the clusters because of the mean vector behaviour seen in Table 9.3, and the spatial distribution of the clusters. In this case, compared with Fig. 9.3, there are two spectral classes called "bare."

9.10 Hierarchical Clustering

Another approach that does not require the user to specify the number of classes beforehand is hierarchical clustering. This method produces an output that allows the user to decide on the set of natural groupings into which the data falls. There are two types of hierarchical algorithm. The first commences by assuming that all the pixels individually are distinct clusters; it then systematically merges neighbouring clusters by checking distances between means. That is continued until all pixels have been grouped into one single, large cluster. The approach is known as agglomerative hierarchical clustering. The second method, called divisive hierarchical clustering, starts by assuming all the pixels belong to one large, single cluster, which is progressively subdivided until all pixels form individual clusters. This is a computationally more expensive approach than the agglomerative method and is not considered further here.


9.10.1 Agglomerative Hierarchical Clustering

In the agglomerative approach a history of the mergings, or fusions, is displayed on a dendrogram.

Fig. 9.6 a Segment of a HyMap image of Perth, Western Australia, b scatterplot of the image in a near infrared–visible red subspace, c single pass clustering result using the channels specified in Table 9.1, d cluster means in the near infrared–visible red subspace

Table 9.3 Cluster centres for the single pass exercise in Fig. 9.6

Cluster mean vectors (on 16 bit scale)

Cluster   Label        Channel 7   Channel 15   Channel 29   Channel 80   Channel 108
1         Bare 1       2309.7      2632.9       3106.1       3713.2       3663.4
2         Building     3585.8      3901.5       4300.5       4880.7       4870.2
3         Road/trees   900.4       940.0        2307.6       1640.2       1143.4
4         Grassland    1441.3      1447.2       4798.6       3455.6       2028.6
5         Water        490.4       408.2        409.0        274.9        207.5
6         Bare 2       1372.7      1630.5       3105.7       3033.3       2214.8


That is a diagram that shows at what distances between centres particular clusters are merged. An example of hierarchical clustering, along with its fusion dendrogram, is shown in Fig. 9.7. It uses the same two dimensional data set as in Fig. 9.2, but note that the ultimate cluster compositions are slightly different. This demonstrates again that different algorithms can and do produce different cluster results.

The fusion dendrogram of a particular hierarchical clustering exercise can be inspected to determine the intrinsic number of clusters or spectral classes in the data. Long vertical sections between fusions in the dendrogram indicate regions of "stability" which reflect natural data groupings. In Fig. 9.7 the longest stretch on the distance scale between fusions corresponds to two clusters in the data. One could conclude therefore that this data falls most naturally into two groups. In the example presented, similarity between clusters was judged on the basis of Euclidean distance. Other similarity measures are sometimes used, as noted below.
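For modest numbers of samples the fusion history is easy to generate with standard tools. The sketch below uses SciPy's hierarchical clustering routines on a small invented data set (not the data of Fig. 9.2), with centroid (Euclidean) linkage assumed as the merging rule.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# hypothetical two dimensional pixel vectors
data = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]], dtype=float)

# agglomerative clustering: repeatedly merge the two closest clusters
fusion_history = linkage(data, method="centroid", metric="euclidean")

# the fusion dendrogram structure; set no_plot=False (with matplotlib installed) to draw it
tree = dendrogram(fusion_history, no_plot=True)

# cut the hierarchy to recover, say, two clusters
labels = fcluster(fusion_history, t=2, criterion="maxclust")
print(labels)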

9.11 Other Clustering Metrics

Clustering metrics other than simple distance measures exist. One derives a within cluster scatter measure by computing the average covariance matrix over all the clusters, and a between cluster scatter measure by looking at how the means of the clusters scatter about the global mean of the data. Those two measures are combined into a single figure of merit8 based on minimising the within cluster scatter while attempting to maximise the among cluster measure. It can be shown that figures of merit such as these are similar to the sum of squared error criterion.
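A minimal sketch of such a figure of merit follows; the particular combination used, the trace of the product of the inverted within cluster scatter and the between cluster scatter, is an assumption chosen for illustration rather than the specific measure referenced in the footnote.

import numpy as np

def scatter_figure_of_merit(pixels, labels):
    # larger values indicate compact clusters whose means are well spread out
    global_mean = pixels.mean(axis=0)
    clusters = np.unique(labels)
    n_bands = pixels.shape[1]
    within = np.zeros((n_bands, n_bands))        # average within cluster scatter
    between = np.zeros((n_bands, n_bands))       # between (among) cluster scatter
    for c in clusters:
        members = pixels[labels == c]            # assumes at least two pixels per cluster
        mean = members.mean(axis=0)
        within += np.cov(members, rowvar=False) / len(clusters)
        diff = (mean - global_mean)[:, None]
        between += (len(members) / len(pixels)) * (diff @ diff.T)
    return np.trace(np.linalg.inv(within) @ between)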

Similarity metrics can incorporate measures other than spectral likeness. Spatial proximity might be important in some applications, as might properties that account for categorical information. For example, clustering crop pixels might be guided by all of spectral measurements, soil type and spatial contiguity.

9.12 Other Clustering Techniques

From time to time other clustering algorithms have been applied to remote sensing image data, although with the increasing spectral dimensionality of imagery some have fallen into disuse.

8 See Duda, Hart and Stork, loc. cit., or G.B. Coleman and H.C. Andrews, Image segmentation by clustering, Proc. IEEE, vol. 67, no. 5, 1979, pp. 773–785.


If the dimensionality is small, say 3 or 4 data channels with limited radiometric resolution, clustering by histogram peak selection is viable.9 That is the multidimensional form of histogram thresholding often used to segment scenes in picture processing.10

Fig. 9.7 Agglomerative hierarchical clustering of the data in Fig. 9.2

9 See P.A. Letts, Unsupervised classification in the Aries image analysis system, Proc. 5th Canadian Symposium on Remote Sensing, 1978, pp. 61–71, or J.A. Richards and X. Jia, Remote Sensing Digital Image Analysis, 4th ed., Springer, Berlin, 2006, Sect. 9.8.
10 See K.R. Castleman, Digital Image Processing, 2nd ed., Prentice-Hall, N.J., 1996.


Not unlike histogram peak selection is the technique of mountain clustering. It seeks to define cluster centres as local density maxima in the spectral domain. In its original form the spectral space was overlaid with a grid; the grid intersections were then chosen as sites for evaluating density.11 More recently, the density maxima have been evaluated at each pixel site rather than at overlaid grid positions.12 The method, which could be used as a clustering technique in its own right, or as a process to initialise cluster centres for algorithms such as Isodata, sets up a function, called a mountain function, that measures the local density about each pixel. A typical mountain function for indicating density in the vicinity of pixel x_i could be

m(x_i) = \sum_j \exp\left[ -\beta\, d(x_j, x_i)^2 \right]

in which d(x_j, x_i) is the distance from that pixel to another pixel x_j, and β is a constant that effectively controls the region of neighbours. Once the largest m(x_i) has been found, that density maximum is removed or de-emphasised and the next highest density is found, and so on.
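A direct implementation of that procedure is sketched below. The de-emphasis step subtracts a scaled copy of the peak just found from the mountain function; the constants beta and beta2, and the function name, are assumptions made for the example.

import numpy as np

def mountain_cluster_centres(pixels, n_centres, beta=0.5, beta2=0.25):
    # pairwise squared distances between all pixel vectors
    d2 = np.sum((pixels[:, None, :] - pixels[None, :, :]) ** 2, axis=2)
    mountain = np.exp(-beta * d2).sum(axis=1)     # m(x_i) = sum_j exp(-beta d(x_j, x_i)^2)
    centres = []
    for _ in range(n_centres):
        peak = int(np.argmax(mountain))
        peak_value = mountain[peak]
        centres.append(pixels[peak])
        # de-emphasise the density in the neighbourhood of the peak just found
        mountain = mountain - peak_value * np.exp(-beta2 * d2[:, peak])
        mountain[peak] = -np.inf                  # never select the same pixel twice
    return np.array(centres)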

9.13 Cluster Space Classification

Whereas the high dimensionality of hyperspectral data sets presents a processing challenge to statistical supervised classifiers such as the maximum likelihood rule (see Sect. 8.3.7), clustering and thus unsupervised classification with hyperspectral imagery is less of a problem because there are no parameters to be estimated. As a result, clustering can be used as a convenient bridge to assist in thematic mapping with high dimensional data.

The cluster space technique now to be developed is based first on clustering image data and then using reference data to link clusters with information classes.13 Importantly, the power of the method rests on the fact that there does not need to be a one-to-one association of clusters (spectral classes) and information classes.

11 R.R. Yager and D.P. Filev, Approximate clustering via the mountain method, IEEE Transactions on Systems, Man and Cybernetics, vol. 24, 1994, pp. 1279–1284.
12 S.L. Chiu, Fuzzy model identification based on cluster estimation, J. Intelligent Fuzzy Systems, vol. 2, 1994, pp. 267–278, and M-S. Yang and K-L Wu, A modified mountain clustering algorithm, Pattern Analysis Applications, vol. 8, 2005, pp. 125–138.
13 See X. Jia and J.A. Richards, Cluster space representation for hyperspectral classification, IEEE Transactions on Geoscience and Remote Sensing, vol. 40, no. 3, March 2002, pp. 593–598. This approach is a generalisation of that given by M.D. Fleming, J.S. Berkebile and R.M. Hofer, Computer aided analysis of Landsat-1 MSS data: a comparison of three approaches, including a modified clustering approach, Information Note 072475, Laboratory for Applications of Remote Sensing, Purdue University, West Lafayette, Indiana, 1979.


That has the benefit of allowing the analyst the flexibility of generating as many clusters as needed to segment the spectral domain appropriately without worrying too much about the precise class meanings of the clusters produced. The significance of that lies in the fact that the spectral domain is rarely naturally composed of discrete groups of pixels; rather, it is generally more of the nature of a multidimensional continuum, with a few density maxima that might be associated with spectrally well-defined classes such as water.14

The method starts by assuming that we have clustered the spectral domain as shown in the two dimensional illustration of Fig. 9.8. Suppose, by the use of reference data, we are able to associate clusters with information classes, as shown by the overlaid boundaries in the figure. An information class can include more than one cluster and some clusters appear in more than one information class.

By counting the pixels in each cluster we can estimate the set of cluster conditional probabilities

p(x|c), \quad c = 1 \ldots C

in which C is the total number of clusters. For convenience we might assume the clusters are normally distributed, so this cluster conditional density function is represented by its mean and covariance, which can be estimated from the relevant pixels if the dimensionality is acceptable. Clustering algorithms such as k means and Isodata tend to generate hyperspherical clusters, so we can generally assume a diagonal covariance matrix with identical elements, in which case there are many fewer parameters to estimate.

Fig. 9.8 Relationship between data, clusters and information classes

14 See J.A. Richards and D.J. Kelly, On the concept of spectral class, Int. J. Remote Sensing, vol. 5, no. 6, 1984, pp. 987–991.


From Bayes' theorem we can find the posterior probability of a given cluster being correct for a particular pixel measurement vector

p(c|x) = \frac{p(x|c)\, p(c)}{p(x)}, \quad c = 1 \ldots C \qquad (9.4)

in which p(c) is the "prior" probability of the existence of cluster c. That can be estimated from the relative populations of the clusters.

By examining the distribution of the information classes over the clusters we can generate the class conditional probabilities

p(c|\omega_i), \quad i = 1 \ldots M

where M is the total number of information classes. Again, from Bayes' theorem we have

p(\omega_i|c) = \frac{p(c|\omega_i)\, p(\omega_i)}{p(c)}, \quad i = 1 \ldots M \qquad (9.5)

We are interested in the likely information class for a given pixel, expressed in the set of posterior probabilities p(ω_i|x), i = 1 ... M. These can be written

p(\omega_i|x) = \sum_{c=1}^{C} p(\omega_i|c)\, p(c|x), \quad i = 1 \ldots M

Substituting from (9.4) and (9.5) gives

p(\omega_i|x) = \frac{1}{p(x)} \sum_{c=1}^{C} p(c|\omega_i)\, p(x|c)\, p(\omega_i), \quad i = 1 \ldots M \qquad (9.6)

Since p(x) does not aid discrimination we can use the decision rule

x \in \omega_i \quad \text{if} \quad p'(\omega_i|x) > p'(\omega_j|x) \quad \text{for all } j \neq i \qquad (9.7)

to determine the correct class for the pixel at x, where p'(ω_i|x) = p(x) p(ω_i|x).

It is instructive now to consider a simple example to see how this method works. For this we use the two dimensional data in Fig. 9.9, which contains two information classes A and B. For the clustering phase of the exercise the class labels are not significant. Instead, the full set of training pixels, in this case 500 from each class, is clustered using the k means algorithm. Here we have searched for just 4 clusters. The results for more clusters are given in Table 9.5. The resulting cluster mean positions are seen in the figure. Although two appear among the pixels of information class A and two among those for information class B, that is simply the result of the distribution of the training pixels and has nothing to do with the class labels as such.


Using reference data (a knowledge of which pixels are class A and which are class B) it is possible to determine the mapping between information classes and clusters. Table 9.4 demonstrates that, both in terms of the number of pixels and the resulting posterior probabilities p(c|ω_i) derived from a normalisation of the counts. Also shown are the prior probabilities of the clusters.

From (9.6) and (9.7), and using the data in Table 9.4, we have (using c_1 etc. to represent the clusters)

p'(A|x) = p(c_1|A)\,p(x|c_1)\,p(c_1) + p(c_2|A)\,p(x|c_2)\,p(c_2) + p(c_3|A)\,p(x|c_3)\,p(c_3) + p(c_4|A)\,p(x|c_4)\,p(c_4)

and

p'(B|x) = p(c_1|B)\,p(x|c_1)\,p(c_1) + p(c_2|B)\,p(x|c_2)\,p(c_2) + p(c_3|B)\,p(x|c_3)\,p(c_3) + p(c_4|B)\,p(x|c_4)\,p(c_4)

The cluster conditional distribution functions p(x|c), c ∈ {c_1, c_2, c_3, c_4}, are obtained from the pixels in each cluster and, in this example, have been modelled by spherical Gaussian distributions in which the diagonal covariances for each cluster are assumed to be the same and equal to the average covariance over the four clusters.
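The quantities in (9.6) and (9.7) are simple counts and Gaussian evaluations, so the whole labelling rule fits in a few lines. The sketch below is an illustrative NumPy implementation under the same assumptions as this example (spherical Gaussian cluster models with a shared variance); the cluster means, the test pixel, the variance and all names are invented so that the fragment runs on its own.

import numpy as np

def cluster_space_classify(x, cluster_means, shared_var, p_cluster_given_class, p_class):
    # cluster_means: (C, N); p_cluster_given_class: (M, C); p_class: (M,)
    n_bands = len(x)
    # spherical Gaussian cluster conditional densities p(x|c) with a common variance per band
    d2 = np.sum((cluster_means - x) ** 2, axis=1)
    p_x_given_c = np.exp(-0.5 * d2 / shared_var) / (2.0 * np.pi * shared_var) ** (n_bands / 2.0)
    # p'(class_i | x) = sum_c p(c | class_i) p(x | c) p(class_i), following (9.6) and (9.7)
    scores = (p_cluster_given_class * p_x_given_c[None, :]).sum(axis=1) * p_class
    return int(np.argmax(scores))

# probabilities patterned on Table 9.4 (classes A and B, four clusters); the cluster means,
# the test pixel and the shared variance are invented purely to make the fragment executable
p_c_given_class = np.array([[0.014, 0.000, 0.600, 0.386],
                            [0.370, 0.566, 0.000, 0.064]])
p_class = np.array([0.5, 0.5])
cluster_means = np.array([[2.0, 6.0], [1.0, 8.0], [7.0, 2.0], [5.0, 4.0]])
label = cluster_space_classify(np.array([6.5, 2.5]), cluster_means, 1.5,
                               p_c_given_class, p_class)
print("class", "AB"[label])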

Equation (9.7) can now be used to label the pixels. Based on the 1,000 pixels of training data used to generate the cluster space model, an overall accuracy of 95.7% is obtained. Using a different testing set of 500 pixels from each class, an accuracy of 95.9% is obtained.

Fig. 9.9 Two dimensional data set used to illustrate the cluster space technique, and the four cluster centres generated by applying the k means algorithm to the data


Clearly, the performance of the method depends on how effectively the data space is segmented during the clustering step. Table 9.5 shows how the results depend on the numbers of clusters used.

9.14 Bibliography on Clustering and Unsupervised Classification

Cluster analysis is a common tool in many fields that involve large amounts of data. As a result, material on clustering algorithms will be found in the social and physical sciences, and particularly fields such as numerical taxonomy. Because of the enormous amounts of data used in remote sensing, the range of viable techniques is limited, so that some treatments contain methods not generally encountered in remote sensing. Standard texts on image processing and remote sensing could be consulted. Perhaps the most comprehensive of these treatments is

R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, 2nd ed., John Wiley & Sons, N.Y., 2001.

Another detailed, and more recent, coverage of clustering and unsupervised learning is in

T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer Science+Business Media LLC, N.Y., 2009.

The seminal work on the Isodata algorithm is

G.H. Ball and D.J. Hall, A novel method of data analysis and pattern recognition, Stanford Research Institute, Menlo Park, California, 1965.

Some more general treatments are

Table 9.4 Association of clusters and information classes

                     Cluster 1   Cluster 2   Cluster 3   Cluster 4
Number of pixels
  Class A            7           0           300         194
  Class B            185         283         0           32
p(cluster|class)
  Class A            0.014       0.000       0.600       0.386
  Class B            0.370       0.566       0.000       0.064
p(cluster)           0.192       0.283       0.300       0.226

Table 9.5 Performance of the cluster space method as a function of the number of clusters

No. of clusters    2       3       4       5       6       10      14      18
On training set    93.4%   90.3%   95.7%   96.0%   99.0%   99.3%   99.6%   100%
On testing set     93.4%   90.7%   95.9%   96.0%   99.2%   99.6%   99.5%   100%


M.R. Andberg, Cluster Analysis for Applications, Academic, N.Y., 1973.

B.S. Everitt, S. Landau, M. Leese and D. Stahl, Cluster Analysis, 5th ed., John Wiley and Sons, N.Y., 2011.

G. Gan, C. Ma and J. Wu, Data Clustering: Theory, Algorithms and Applications, ASA-SIAM Series on Statistics and Applied Probability, SIAM, Philadelphia, ASA, Alexandria, Virginia, 2007.

J.A. Hartigan, Clustering Algorithms, John Wiley & Sons, N.Y., 1975.

J. van Ryzin, Classification and Clustering, Academic, N.Y., 1977.

9.15 Problems

9.1 Find the Euclidean and city block distances between the following three pixel vectors. Which two are similar?

x_1 = \begin{bmatrix} 4 \\ 4 \\ 4 \end{bmatrix} \qquad x_2 = \begin{bmatrix} 5 \\ 5 \\ 5 \end{bmatrix} \qquad x_3 = \begin{bmatrix} 4 \\ 4 \\ 7 \end{bmatrix}

9.2 Repeat the exercise of Fig. 9.2 but with

• two initial cluster centres at (2, 3) and (5, 6),
• three initial cluster centres at (1, 1), (3, 3) and (5, 5), and
• three initial cluster centres at (2, 1), (4, 2) and (15, 15).

9.3 From a knowledge of how a particular clustering algorithm works it is sometimes possible to infer the multidimensional spectral shapes of the clusters generated. For example, methods that depend entirely on Euclidean distance as a similarity metric would tend to produce hyperspheroidal clusters. Comment on the cluster shapes you would expect to be generated by the migrating means technique based on Euclidean distance and the single pass procedure, also based on Euclidean distance.

9.4 Suppose two different techniques have given two different clusterings of a particular set of data and you wish to assess which of the two segmentations is the better. One approach might be to evaluate the sum of squared error measure treated in Sect. 9.2. Another could be based on covariance matrices. For example, it is possible to define an "among clusters" covariance matrix that describes how the clusters themselves are scattered about the data space, and an average "within class" covariance matrix that describes the average shape and size of the clusters. Let these matrices be called C_A and C_W respectively. How could they be used together to assess the quality of the two clustering results? (See G.B. Coleman and H.C. Andrews, Image segmentation by clustering, Proc. IEEE, vol. 67, 1979, pp. 773–785.) Here you may wish to use measures of the "size" of a matrix, such as its trace or determinant (see Appendix C).


9.5 Different clustering methods often produce quite different segmentations of the same set of data, as illustrated in the examples of Figs. 9.3 and 9.6. Yet the results generated for remote sensing applications are generally usable. Why do you think that is the case? (Is it related to the number of clusters generated?)

9.6 The Mahalanobis distance of (8.26) can be used as the similarity metric for a clustering algorithm. Invent a possible clustering technique based on (8.26) and comment on the nature of the clusters generated.

9.7 Do you see value in having a two stage clustering process in which a single pass procedure is used to generate initial clusters and then an iterative technique is used to refine them?

9.8 Recompute the agglomerative hierarchical clustering example of Fig. 9.7 but use the L1 distance measure in (9.2) as a similarity metric.

9.9 Consider the two dimensional data shown in Fig. 9.2 and suppose the three pixels at the upper right form one cluster and the remainder another cluster. Such an assignment might have been generated by some clustering algorithm other than the k means method. Calculate the sum of squared error for this new assignment and compare it with the value of 16 found in Fig. 9.2.

9.10 In the cluster space technique, how is (9.6) modified if there is uniquely only one cluster per information class?
