10 Cluster Analysis


    Cluster analysis

The basic assumption with these methods is that measurements made for related samples tend to be similar.

Overall, the distance between similar samples is smaller than for unrelated samples.

    Clustering methods

We'll look at three unsupervised clustering methods.

Univariate clustering
Evaluates individual variables (raw or scaled). Groups samples into homogeneous classes.

Hierarchical cluster analysis (HCA)
Reduction of multiple variables for a sample to a single distance value. Rank and link samples based on relative distances.

k-means clustering
Grouping of samples into a set number of classes. Uses all variables to determine relative distances.

Univariate clustering.

Creates k homogeneous classes.

Uses the within-class variance as the measure of homogeneity.

Can be used to convert a quantitative variable into a discrete ordinal variable.

Another use is to simply evaluate if a variable has any classification-type information.
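As a rough illustration (not the XLStat routine used later in these notes), a one-dimensional k-means run on a single variable produces exactly this kind of partition by minimizing the within-class variance; the values and k below are made up:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical measurements of a single variable (e.g., petal width in mm).
    x = np.array([2, 2, 3, 3, 4, 13, 14, 15, 15, 16, 23, 24, 25, 26], dtype=float)

    # 1-D k-means: partitions the values into k classes by minimizing the
    # total within-class variance (squared deviations from each class mean).
    km = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = km.fit_predict(x.reshape(-1, 1))

    # The class labels can now be used as a discrete ordinal variable.
    for c in range(3):
        members = x[labels == c]
        print(f"class {c}: n={len(members)}, mean={members.mean():.1f}")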

(Figure: histogram of petal width - relative frequency vs. petal width - for the iris data.)

    Iris dataset

Species: I. setosa, I. versicolor, I. virginica
Properties: petal width, petal length, sepal width, sepal length

We'll look at a single property - petal width.

    Univariate clustering.

(Figure: histogram of petal width - relative frequency vs. petal width.)

The goal is to partition the data so that you have k clusters of data.

Iris data

A simple ranking of the data indicates that we would get reasonable clustering based on petal width.


Iris data

Body Temp (from exam)

Not exactly the best classification. It does show that there is some skew to the results (more men in class one and more women in class two) - and there is a fair amount of overlap.

So what's it good for?

Really only useful for an initial evaluation of individual variables.

Only want to use it when you have a small number of classes (or potential classes).

Main use is to convert quantitative (continuous) data to ordinal data.

HCA

Distance and similarity

Actual distances between your samples will vary based on the type and number of measurements present.

Similarity values are calculated to normalize the data to a standard scale:

    s_ij = 1 - d_ij / d_max

For similar samples, s_ij approaches 1.
For dissimilar samples, s_ij approaches 0.
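A minimal sketch of that conversion, assuming Euclidean distances (the slide does not specify the distance metric):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # Hypothetical data matrix: rows = samples, columns = measurements.
    X = np.array([[1.0, 2.0],
                  [1.2, 2.1],
                  [5.0, 7.5],
                  [5.3, 7.4]])

    # Pairwise sample distances (Euclidean is an assumption here).
    D = squareform(pdist(X, metric="euclidean"))

    # Convert to similarities on a 0-1 scale: s_ij = 1 - d_ij / d_max.
    S = 1.0 - D / D.max()
    print(np.round(S, 2))   # similar samples -> near 1, dissimilar -> near 0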


    Clustering

After all our distances or similarities have been calculated, we need a way of determining how closely our samples are related or grouped.

We start with the two most related samples and link them - forming an initial cluster.

The process is repeated until all samples have been linked.

    Clustering

Several methods of linking our samples are available. The three most common are:

Single link
Complete link
Centroid link

Let's start by looking at the simplest method - single link (in two dimensions).

Single link

This approach determines linkage based on the distance to the closest point in a cluster.

You start by assuming that the two closest points are a cluster.

All points are initially compared as pairs and then the search for links is expanded.

Now let's look at an example.

    d_(ij),C = 0.5 d_iC + 0.5 d_jC - 0.5 |d_iC - d_jC|   (= the smaller of d_iC and d_jC)

(Figure slides: the single-link example worked step by step, showing the pairwise distances d_ij.)


Other linkage methods

Complete link

Linkage is based on the farthest point in a cluster - gives a conservative linkage.

    d_(ij),C = 0.5 d_iC + 0.5 d_jC + 0.5 |d_iC - d_jC|   (= the larger of d_iC and d_jC)

Other linkage methods

Centroid link (Ward's method)

Linkage is based on the center of the cluster.

    d²_(ij),C = (n_i / (n_i + n_j)) d²_iC + (n_j / (n_i + n_j)) d²_jC - (n_i n_j / (n_i + n_j)²) d²_ij
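The three update rules can be tried with a small helper that mirrors the formulas above (the centroid rule is written in its squared-distance form; this is only an illustration, not the XLStat implementation):

    # Distance from the newly merged cluster (i with j) to an existing cluster C.
    def merge_distance(method, d_iC, d_jC, d_ij, n_i=1, n_j=1):
        if method == "single":      # closest point: equals min(d_iC, d_jC)
            return 0.5 * d_iC + 0.5 * d_jC - 0.5 * abs(d_iC - d_jC)
        if method == "complete":    # farthest point: equals max(d_iC, d_jC)
            return 0.5 * d_iC + 0.5 * d_jC + 0.5 * abs(d_iC - d_jC)
        if method == "centroid":    # center of the cluster, squared distances
            n = n_i + n_j
            return (n_i / n) * d_iC**2 + (n_j / n) * d_jC**2 - (n_i * n_j / n**2) * d_ij**2
        raise ValueError(method)

    print(merge_distance("single", 2.0, 5.0, 1.0))    # 2.0
    print(merge_distance("complete", 2.0, 5.0, 1.0))  # 5.0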

    HCA dendrogram

After conducting your linkage, you need a way of visualizing the results.

Dendrograms can be used for this purpose and provide a very simple two-dimensional plot that indicates clustering, similarities and linkage.
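In practice the linkage and the dendrogram are usually produced in one step; a minimal SciPy sketch (the data and linkage method here are placeholders):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Hypothetical data: rows = samples, columns = measurements.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (5, 3)),    # one group of samples
                   rng.normal(6, 1, (5, 3))])   # a second, well-separated group

    # Build the linkage tree ("single", "complete" or "centroid").
    Z = linkage(X, method="single", metric="euclidean")

    # The height at which samples join reflects their distance:
    # the higher the linkage level, the lower the similarity.
    dendrogram(Z, labels=[f"s{i}" for i in range(len(X))])
    plt.ylabel("Distance")
    plt.show()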


    Dendrograms

We can now see how our samples are linked.

The higher the linkage level, the lower the similarity.

(Figure: dendrogram with a similarity scale running from 1.0 down to 0.0.)

    Dendrograms

This plot appears to indicate that there are three groups of samples that can only be linked at very low similarity values.

(Figure: dendrogram of samples A through J.)

    Dendrograms

Let's look again at our single linkage example and see what the dendrogram would look like.

Example dendrogram

(Figure: dendrogram for the single-link example, similarity scale 1.0 down to 0.0.)

A real example

Substances commonly used as accelerants were assayed by capillary column GC/MS.

At present, accelerants are identified based on boiling point range.

Class assignments: A, B, C, D, E

Goal: To determine if multivariate data treatment has the potential for classification of accelerants.

Analysis conditions

Neat samples were spiked with a known amount of an internal standard.

SP-5 25 m x 0.2 mm I.D. column
1 µl sample, 100:1 split injection
50 °C, 5 min; 10 °C/min ramp; hold at 250 °C
Total run time: 30 minutes
Mass range: 50-150 AMU
ISTD: octadeuteronaphthalene


    Preprocessing of data

A total ion chromatographic profile was extracted and normalized using the internal standard.

Triplicate samples were averaged.

The first minute of data was discarded due to the presence of a solvent tail.

The remaining data was simply summed at one-minute intervals - 19 variables.
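A sketch of that preprocessing, assuming each chromatogram is available as an intensity array sampled once per second (the array names, run length and sampling rate are hypothetical):

    import numpy as np

    # Hypothetical total ion chromatogram: one point per second, 20-minute trace.
    tic = np.random.default_rng(1).random(20 * 60)

    istd_area = 5.0e4            # hypothetical internal standard peak area
    tic = tic / istd_area        # normalize to the internal standard

    tic = tic[60:]               # discard the first minute (solvent tail)

    # Sum the remaining signal in one-minute blocks -> 19 variables per sample.
    variables = tic.reshape(19, 60).sum(axis=1)
    print(variables.shape)       # (19,)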

Classes

A. Light petroleum distillates - petroleum ethers, lighter fluid, naphtha, camping fuels, ...
B. Gasoline
C. Medium petroleum distillates - paint thinners, mineral spirits, ...
D. Kerosene - #1 fuel oil, jet A fuel, ...
E. Heavy petroleum distillates - #2 fuel oil, diesel fuel, ...

    Representative data profiles

As can be seen, classes B, C and D show a significant level of overlap.

(Figure: representative profiles for classes A through E.)

    Production of dendrograms

Both raw and autoscaled data were processed and dendrograms were produced using single linkage.

For the autoscaled data, complete and centroid linkages were also evaluated.

For the dendrograms, classes are color coded and labeled. The classes were not used in producing the dendrograms.
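Autoscaling here just means centering each of the 19 variables and dividing by its standard deviation so that all variables carry equal weight; a short sketch (the matrix name and size are placeholders):

    import numpy as np

    # Hypothetical data matrix: rows = samples, columns = the 19 summed variables.
    X = np.random.default_rng(2).random((46, 19))

    # Autoscale: subtract each column's mean and divide by its standard deviation.
    X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)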

Raw - single linkage

(Dendrogram: single linkage, raw data; similarity scale 0.70-1.00. Classes c, d and e form clean blocks; classes a and b are intermixed.)

Raw - complete linkage

(Dendrogram: complete linkage, raw data; similarity scale 0.00-1.00. Classes c, d and e form clean blocks; classes a and b are intermixed.)


Raw - centroidal linkage

(Dendrogram: centroid linkage, raw data; similarity scale 0.30-1.00. Classes c, d and e form clean blocks; classes a and b are intermixed.)

    Raw - comparison

    Centroidal linkage appears to give the best results.

(Figure: the three raw-data dendrograms - single, complete and centroid linkage - shown side by side for comparison.)

Autoscaled - single link

(Dendrogram: single linkage, autoscaled data; similarity scale 0.56-0.96. Classes c, d and e form clean blocks; classes a and b are intermixed.)

Autoscaled - complete link

(Dendrogram: complete linkage, autoscaled data; similarity scale -0.78 to 0.82. Classes largely separate, with some mixing of classes d/e and a/b.)

Autoscaled - centroid link

(Dendrogram: centroid linkage, autoscaled data; similarity scale -0.34 to 0.86. All five classes form distinct blocks, with only slight a/b mixing.)



    Iris dataset

A pretty famous data set published by R.A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, 179-188 (1936).

He measured four physical properties of irises to see if they could be used to classify any of three different species.

Used length and width of the sepal and petal.

    Iris dataset

Species: I. setosa, I. versicolor, I. virginica
Properties: petal width, petal length, sepal width, sepal length

150 samples - no missing values.

HCA analysis was conducted on both raw and scaled data. Both single linkage and complete linkage were evaluated.
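A sketch of that analysis using the copy of Fisher's iris data shipped with scikit-learn (the linkage choices simply mirror the slides; this is not the original XLStat workflow):

    from sklearn.datasets import load_iris
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.stats import zscore

    iris = load_iris()
    X = zscore(iris.data)                    # autoscale the four measurements

    Z = linkage(X, method="centroid")        # also try "single" and "complete"
    clusters = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 groups

    # Compare the recovered clusters with the known species labels.
    for s in range(3):
        print(iris.target_names[s], clusters[iris.target == s])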

    Autoscaled data, centroidal linkage

(Dendrogram of the 150 samples; dissimilarity scale 0-40.)

One class is distinct but the other two overlap.

    Iris dataset

So it should be possible to classify samples. HCA just does not provide as useful a view as we had hoped for.

(Figure: scatter plot of petal width vs. petal length, raw data.)

    Iris dataset

    So there was useful information in the dataset.

HCA - not a good tool here. Reducing the four measurements into a single one actually makes the data worse.

Autoscaling - had little or no effect. The actual numbers were all of a similar range.

Moral - just because a method doesn't work does not mean that there is no useful information.

    Classification of Mycobacteria

Investigators at the CDC wanted to see if it was possible to identify mycobacteria using pattern recognition of an HPLC analysis of mycolic acids.

Mycobacteria include a number of respiratory and non-respiratory pathogens such as M. tuberculosis.

C70-C90 branched β-hydroxy mycolic acids were selected as they are known to be in the cell walls of these bacteria.


    Classification of Mycobacteria

Eight species were investigated:

M. asiaticum    M. bovis
M. gastri       M. gordonae
M. kansasii     M. marinum
M. szulgai      M. tuberculosis

22 mycolic acids were used for the classification.

175 total samples.

    Classification of Mycobacteria

Limitation.

Although the paper specified that it was necessary to normalize the data to account for variations in sample size, no standards were provided.

I chose to normalize to the total peak area for each sample. This assumes that each species produces about the same amount of total mycolic acid and that the response/concentration is the same for each component.
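A sketch of that normalization, assuming the peak areas are already arranged as a samples-by-peaks matrix (the names and sizes are placeholders):

    import numpy as np

    # Hypothetical peak-area matrix: 175 samples x 22 mycolic acid peaks.
    areas = np.random.default_rng(3).random((175, 22))

    # Normalize each sample to its total peak area so that variations in
    # sample size cancel out; each row then sums to 1.
    normalized = areas / areas.sum(axis=1, keepdims=True)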

Single linkage

Single linkage shows some clustering of the samples but is not very useful.

Complete linkage

Complete linkage gives somewhat better results.

We'll look at this sample again later using other tools.

    Identification of Coffee

An attempt was made to identify the source of coffee beans:

Sulawesi     Costa Rica
Ethiopia     Sumatra
Kenya        Colombia

Method.

Mass spectral analysis of the headspace of bean samples. The m/e range of 47-99 was used.

Six samples were obtained from each source.

Identification of Coffee

The mass spectra represented the sum of spectra for all components present.

As is normal with mass spectra, each was normalized to the largest peak.

Only raw data was evaluated.


Representative spectra, 47-99 m/e

(Figure: representative headspace spectra for each source.)

Single linkage

(Dendrogram: single linkage, leaves labeled by source - Sulawesi, Costa Rica, Ethiopia, Sumatra, Kenya, Colombia.)

Complete linkage

(Dendrogram: complete linkage, same source labels.)

So what's it good for?

This is a fast method of initial data exploration.

Try all of the options with both raw and scaled data. The plots can be rapidly evaluated.

You can also use principal component data. This will be covered in the next unit.

When you get ready to go on to other methods of clustering, knowing the best methods for linkage will also be useful.

k-means clustering

An iterative method where samples are initially partitioned into k classes and a centroid is calculated for each.

Must use quantitative variables, but they can be raw, scaled or PCA based.

The positions of all samples are then calculated relative to the centroids, samples are reassigned to new clusters (if needed), and the process is repeated.

Classification criteria can include the within-class variance, pooled covariance matrix or total inertia matrix.

The number of clusters and assignments can vary based on the initial starting points, so several iterations are commonly used to find a stable solution.

k-means clustering

The basic cycle (sketched in code below): position initial class centroids, test class memberships, adjust the centroids, then retest and repeat.
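A minimal version of that loop in plain NumPy (the data, k and stopping rule are placeholders; the slides use XLStat, and in practice several random starts would be compared):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        # Position initial class centroids by picking k samples at random.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Test class memberships: assign each sample to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Adjust centroids: move each one to the mean of its members.
            new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
            if np.allclose(new_centroids, centroids):   # stable solution reached
                break
            centroids = new_centroids
        return labels, centroids

    # Example: two well-separated groups in two dimensions.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(8, 1, (10, 2))])
    labels, centroids = kmeans(X, k=2)
    print(labels)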


Using XLStat

Classification criteria that can be minimized:

Trace. Minimizes the within-class variance, giving the most homogeneous clusters. Data should be autoscaled if this is used.

Determinant. Minimizes the determinant of the pooled within-class covariance matrix. More appropriate to use with unscaled data, but gives less homogeneous clusters.

Wilks' lambda. A normalized version of the determinant approach.

Trace/median. The centroid ends up being based on the median, not the mean as in the other approaches. Better when there is subclustering of the data.
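For reference, the trace criterion is just the total within-class sum of squared deviations from the class centroids; a small sketch of how it would be computed (names are placeholders):

    import numpy as np

    def trace_criterion(X, labels):
        # Sum of squared deviations of each sample from its class centroid -
        # the within-class variance measure minimized by the "trace" option.
        total = 0.0
        for c in np.unique(labels):
            members = X[labels == c]
            total += ((members - members.mean(axis=0)) ** 2).sum()
        return total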

    Using XLStat

XLStat's version of HCA (Agglomerative Hierarchical Clustering - AHC) will do a k-means analysis, but only with the trace method.

The k-means option provides more clustering control and is faster because no HCA is conducted.

However, AHC has an option that allows the routine to automatically set the number of clusters that appear to exist.

Iris dataset (again).

(Figures: the iris data revisited with k-means clustering, including a scatter plot of petal width vs. petal length, raw data.)


    Arson dataset.

Here are the final class results from the k-means clustering.

(Figure: k-means class assignments shown against the centroid-linkage, autoscaled dendrogram; similarity scale -0.34 to 0.86.)

Coffee (a more complete data set)

(Dendrogram: samples labeled by source - Costa Rican, Sulawesi, Sumatra, Ethiopia, Kenya, Colombia; distance scale 0-140. The sources largely form their own blocks, with some intermixing of Costa Rican/Sulawesi and Kenya/Colombia samples.)

Mycobacteria

    This data set was VERY difficult to visualize using a dendrogram.

(Dendrogram: 175 samples; dissimilarity scale 0-1000.)

Mycobacteria - autoclustering.


So what's it good for?

Can be used as a way to subdivide a dataset into related clusters.

Clusters are objectively determined based on similarities in multidimensional space.

While results can vary based on the starting point, the effect can be minimized by using multiple starting points and repetitions.

Results are easier to see than with HCA. k-means and HCA complement each other.