10 Cluster Analysis


    Cluster analysis

The basic assumption with these methods is that measurements made for related samples tend to be similar.

Overall, the distance between similar samples is smaller than for unrelated samples.

    Clustering methods

We'll look at three unsupervised clustering methods.

Univariate clustering
Evaluates individual variables (raw or scaled). Groups samples into homogeneous classes.

Hierarchical cluster analysis (HCA)
Reduction of multiple variables for a sample to a single distance value. Rank and link samples based on relative distances.

k-means clustering
Grouping of samples into a set number of classes. Uses all variables to determine relative distances.

Univariate clustering.

Creates k homogeneous classes.

Uses the within-class variance as the measure of homogeneity.

Can be used to convert a quantitative variable into a discrete ordinal variable.

Another use is to simply evaluate if a variable has any classification-type information.
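As a rough illustration (not the XLStat routine used later in these notes), a one-dimensional k-means run on a single variable produces exactly this kind of partition by minimizing the within-class variance; the values and k below are made up:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical measurements of a single variable (e.g., petal width in mm).
    x = np.array([2, 2, 3, 3, 4, 13, 14, 15, 15, 16, 23, 24, 25, 26], dtype=float)

    # 1-D k-means: partitions the values into k classes by minimizing the
    # total within-class variance (squared deviations from each class mean).
    km = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = km.fit_predict(x.reshape(-1, 1))

    # The class labels can now be used as a discrete ordinal variable.
    for c in range(3):
        members = x[labels == c]
        print(f"class {c}: n={len(members)}, mean={members.mean():.1f}")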

(Figure: histogram of petal width - relative frequency vs. petal width - for the iris data.)

    Iris dataset

Species: I. setosa, I. versicolor, I. virginica
Properties: petal width, petal length, sepal width, sepal length

We'll look at a single property - petal width.

    Univariate clustering.

(Figure: histogram of petal width - relative frequency vs. petal width.)

The goal is to partition the data so that you have k clusters of data.

Iris data

A simple ranking of the data indicates that we would get reasonable clustering based on petal width.


Iris data

Body Temp (from exam)

Not exactly the best classification. It does show that there is some skew to the results (more men in class one and more women in class two) - and there is a fair amount of overlap.

So what's it good for?

Really only useful for an initial evaluation of individual variables.

Only want to use it when you have a small number of classes (or potential classes).

Main use is to convert quantitative (continuous) data to ordinal data.

HCA

Distance and similarity

Actual distances between your samples will vary based on the type and number of measurements present.

Similarity values are calculated to normalize the data to a standard scale:

    s_ij = 1 - d_ij / d_max

For similar samples, s_ij approaches 1.
For dissimilar samples, s_ij approaches 0.
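A minimal sketch of that conversion, assuming Euclidean distances (the slide does not specify the distance metric):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # Hypothetical data matrix: rows = samples, columns = measurements.
    X = np.array([[1.0, 2.0],
                  [1.2, 2.1],
                  [5.0, 7.5],
                  [5.3, 7.4]])

    # Pairwise sample distances (Euclidean is an assumption here).
    D = squareform(pdist(X, metric="euclidean"))

    # Convert to similarities on a 0-1 scale: s_ij = 1 - d_ij / d_max.
    S = 1.0 - D / D.max()
    print(np.round(S, 2))   # similar samples -> near 1, dissimilar -> near 0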


    Clustering

After all our distances or similarities have been calculated, we need a way of determining how closely our samples are related or grouped.

We start with the two most related samples and link them - forming an initial cluster.

The process is repeated until all samples have been linked.

    Clustering

Several methods of linking our samples are available. The three most common are:

Single link
Complete link
Centroid link

Let's start by looking at the simplest method - single link (in two dimensions).

Single link

This approach determines linkage based on the distance to the closest point in a cluster.

You start by assuming that the two closest points are a cluster.

All points are initially compared as pairs and then the search for links is expanded.

Now let's look at an example.

    d_(ij),C = 0.5 d_iC + 0.5 d_jC - 0.5 |d_iC - d_jC|   (= the smaller of d_iC and d_jC)

(Figure slides: the single-link example worked step by step, showing the pairwise distances d_ij.)


Other linkage methods

Complete link

Linkage is based on the farthest point in a cluster - gives a conservative linkage.

    d_(ij),C = 0.5 d_iC + 0.5 d_jC + 0.5 |d_iC - d_jC|   (= the larger of d_iC and d_jC)

Other linkage methods

Centroid link (Ward's method)

Linkage is based on the center of the cluster.

    d²_(ij),C = (n_i / (n_i + n_j)) d²_iC + (n_j / (n_i + n_j)) d²_jC - (n_i n_j / (n_i + n_j)²) d²_ij
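The three update rules can be tried with a small helper that mirrors the formulas above (the centroid rule is written in its squared-distance form; this is only an illustration, not the XLStat implementation):

    # Distance from the newly merged cluster (i with j) to an existing cluster C.
    def merge_distance(method, d_iC, d_jC, d_ij, n_i=1, n_j=1):
        if method == "single":      # closest point: equals min(d_iC, d_jC)
            return 0.5 * d_iC + 0.5 * d_jC - 0.5 * abs(d_iC - d_jC)
        if method == "complete":    # farthest point: equals max(d_iC, d_jC)
            return 0.5 * d_iC + 0.5 * d_jC + 0.5 * abs(d_iC - d_jC)
        if method == "centroid":    # center of the cluster, squared distances
            n = n_i + n_j
            return (n_i / n) * d_iC**2 + (n_j / n) * d_jC**2 - (n_i * n_j / n**2) * d_ij**2
        raise ValueError(method)

    print(merge_distance("single", 2.0, 5.0, 1.0))    # 2.0
    print(merge_distance("complete", 2.0, 5.0, 1.0))  # 5.0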

    HCA dendrogram

After conducting your linkage, you need a way of visualizing the results.

Dendrograms can be used for this purpose and provide a very simple two-dimensional plot that indicates clustering, similarities and linkage.
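In practice the linkage and the dendrogram are usually produced in one step; a minimal SciPy sketch (the data and linkage method here are placeholders):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Hypothetical data: rows = samples, columns = measurements.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (5, 3)),    # one group of samples
                   rng.normal(6, 1, (5, 3))])   # a second, well-separated group

    # Build the linkage tree ("single", "complete" or "centroid").
    Z = linkage(X, method="single", metric="euclidean")

    # The height at which samples join reflects their distance:
    # the higher the linkage level, the lower the similarity.
    dendrogram(Z, labels=[f"s{i}" for i in range(len(X))])
    plt.ylabel("Distance")
    plt.show()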


    Dendrograms

We can now see how our samples are linked.

The higher the linkage level, the lower the similarity.

(Figure: dendrogram with a similarity scale running from 1.0 down to 0.0.)

    Dendrograms

This plot appears to indicate that there are three groups of samples that can only be linked at very low similarity values.

(Figure: dendrogram of samples A through J.)

    Dendrograms

Let's look again at our single linkage example and see what the dendrogram would look like.

Example dendrogram

(Figure: dendrogram for the single-link example, similarity scale 1.0 down to 0.0.)

A real example

Substances commonly used as accelerants were assayed by capillary column GC/MS.

At present, accelerants are identified based on boiling point range.

Class assignments: A, B, C, D, E

Goal: To determine if multivariate data treatment has the potential for classification of accelerants.

Analysis conditions

Neat samples were spiked with a known amount of an internal standard.

SP-5 25 m x 0.2 mm I.D. column
1 µl sample, 100:1 split injection
50 °C, 5 min; 10 °C/min ramp; hold at 250 °C
Total run time: 30 minutes
Mass range: 50-150 AMU
ISTD: octadeuteronaphthalene


    Preprocessing of data

A total ion chromatographic profile was extracted and normalized using the internal standard.

Triplicate samples were averaged.

The first minute of data was discarded due to the presence of a solvent tail.

The remaining data was simply summed at one-minute intervals - 19 variables.
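A sketch of that preprocessing, assuming each chromatogram is available as an intensity array sampled once per second (the array names, run length and sampling rate are hypothetical):

    import numpy as np

    # Hypothetical total ion chromatogram: one point per second, 20-minute trace.
    tic = np.random.default_rng(1).random(20 * 60)

    istd_area = 5.0e4            # hypothetical internal standard peak area
    tic = tic / istd_area        # normalize to the internal standard

    tic = tic[60:]               # discard the first minute (solvent tail)

    # Sum the remaining signal in one-minute blocks -> 19 variables per sample.
    variables = tic.reshape(19, 60).sum(axis=1)
    print(variables.shape)       # (19,)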

Classes

A. Light petroleum distillates - petroleum ethers, lighter fluid, naphtha, camping fuels, ...
B. Gasoline
C. Medium petroleum distillates - paint thinners, mineral spirits, ...
D. Kerosene - #1 fuel oil, jet A fuel, ...
E. Heavy petroleum distillates - #2 fuel oil, diesel fuel, ...

    Representative data profiles

As can be seen, classes B, C and D show a significant level of overlap.

(Figure: representative profiles for classes A through E.)

    Production of dendrograms

Both raw and autoscaled data were processed and dendrograms were produced using single linkage.

For the autoscaled data, complete and centroid linkages were also evaluated.

For the dendrograms, classes are color coded and labeled. The classes were not used in producing the dendrograms.
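Autoscaling here just means centering each of the 19 variables and dividing by its standard deviation so that all variables carry equal weight; a short sketch (the matrix name and size are placeholders):

    import numpy as np

    # Hypothetical data matrix: rows = samples, columns = the 19 summed variables.
    X = np.random.default_rng(2).random((46, 19))

    # Autoscale: subtract each column's mean and divide by its standard deviation.
    X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)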

Raw - single linkage

(Dendrogram: single linkage, raw data; similarity scale 0.70-1.00. Classes c, d and e form clean blocks; classes a and b are intermixed.)

Raw - complete linkage

(Dendrogram: complete linkage, raw data; similarity scale 0.00-1.00. Classes c, d and e form clean blocks; classes a and b are intermixed.)


Raw - centroidal linkage

(Dendrogram: centroid linkage, raw data; similarity scale 0.30-1.00. Classes c, d and e form clean blocks; classes a and b are intermixed.)

    Raw - comparison

    Centroidal linkage appears to give the best results.

(Figure: the three raw-data dendrograms - single, complete and centroid linkage - shown side by side for comparison.)

Autoscaled - single link

(Dendrogram: single linkage, autoscaled data; similarity scale 0.56-0.96. Classes c, d and e form clean blocks; classes a and b are intermixed.)

Autoscaled - complete link

(Dendrogram: complete linkage, autoscaled data; similarity scale -0.78 to 0.82. Classes largely separate, with some mixing of classes d/e and a/b.)

Autoscaled - centroid link

(Dendrogram: centroid linkage, autoscaled data; similarity scale -0.34 to 0.86. All five classes form distinct blocks, with only slight a/b mixing.)



    Iris dataset

A pretty famous data set published by R.A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, 179-188 (1936).

He measured four physical properties of irises to see if they could be used to classify any of three different species.

Used length and width of the sepal and petal.

    Iris dataset

Species: I. setosa, I. versicolor, I. virginica
Properties: petal width, petal length, sepal width, sepal length

150 samples - no missing values.

HCA analysis was conducted on both raw and scaled data. Both single linkage and complete linkage were evaluated.
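A sketch of that analysis using the copy of Fisher's iris data shipped with scikit-learn (the linkage choices simply mirror the slides; this is not the original XLStat workflow):

    from sklearn.datasets import load_iris
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.stats import zscore

    iris = load_iris()
    X = zscore(iris.data)                    # autoscale the four measurements

    Z = linkage(X, method="centroid")        # also try "single" and "complete"
    clusters = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 groups

    # Compare the recovered clusters with the known species labels.
    for s in range(3):
        print(iris.target_names[s], clusters[iris.target == s])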

    Autoscaled data, centroidal linkage

(Dendrogram of the 150 samples; dissimilarity scale 0-40.)

One class is distinct but the other two overlap.

    Iris dataset

So it should be possible to classify samples. HCA just does not provide as useful a view as we had hoped for.

(Figure: scatter plot of petal width vs. petal length, raw data.)

    Iris dataset

    So there was useful information in the dataset.

HCA - not a good tool here. Reducing the four measurements into a single one actually makes the data worse.

Autoscaling - had little or no effect. The actual numbers were all of a similar range.

Moral - just because a method doesn't work does not mean that there is no useful information.

    Classification of Mycobacteria

Investigators at the CDC wanted to see if it was possible to identify mycobacteria using pattern recognition of an HPLC analysis of mycolic acids.

Mycobacteria include a number of respiratory and non-respiratory pathogens such as M. tuberculosis.

C70-C90 branched β-hydroxy mycolic acids were selected as they are known to be in the cell walls of these bacteria.


    Classification of Mycobacteria

Eight species were investigated:

M. asiaticum    M. bovis
M. gastri       M. gordonae
M. kansasii     M. marinum
M. szulgai      M. tuberculosis

22 mycolic acids were used for the classification.

175 total samples.

    Classification of Mycobacteria

Limitation.

Although the paper specified that it was necessary to normalize the data to account for variations in sample size, no standards were provided.

I chose to normalize to the total peak area for each sample. This assumes that each species produces about the same amount of total mycolic acid and that the response/concentration is the same for each component.
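A sketch of that normalization, assuming the peak areas are already arranged as a samples-by-peaks matrix (the names and sizes are placeholders):

    import numpy as np

    # Hypothetical peak-area matrix: 175 samples x 22 mycolic acid peaks.
    areas = np.random.default_rng(3).random((175, 22))

    # Normalize each sample to its total peak area so that variations in
    # sample size cancel out; each row then sums to 1.
    normalized = areas / areas.sum(axis=1, keepdims=True)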

Single linkage

Single linkage shows some clustering of the samples but is not very useful.

Complete linkage

Complete linkage gives somewhat better results.

We'll look at this sample again later using other tools.

    Identification of Coffee

An attempt was made to identify the source of coffee beans:

Sulawesi     Costa Rica
Ethiopia     Sumatra
Kenya        Colombia

Method.

Mass spectral analysis of the headspace of bean samples. The m/e range of 47-99 was used.

Six samples were obtained from each source.

Identification of Coffee

The mass spectra represented the sum of spectra for all components present.

As is normal with mass spectra, each was normalized to the largest peak.

Only raw data was evaluated.


Representative spectra, 47-99 m/e

(Figure: representative headspace spectra for each source.)

Single linkage

(Dendrogram: single linkage, leaves labeled by source - Sulawesi, Costa Rica, Ethiopia, Sumatra, Kenya, Colombia.)

Complete linkage

(Dendrogram: complete linkage, same source labels.)

So what's it good for?

This is a fast method of initial data exploration.

Try all of the options with both raw and scaled data. The plots can be rapidly evaluated.

You can also use principal component data. This will be covered in the next unit.

When you get ready to go on to other methods of clustering, knowing the best methods for linkage will also be useful.

k-means clustering

An iterative method where samples are initially partitioned into k classes and a centroid is calculated for each.

Must use quantitative variables, but they can be raw, scaled or PCA based.

The positions of all samples are then calculated relative to the centroids, samples are reassigned to new clusters (if needed), and the process is repeated.

Classification criteria can include the within-class variance, pooled covariance matrix or total inertia matrix.

The number of clusters and assignments can vary based on the initial starting points, so several iterations are commonly used to find a stable solution.

k-means clustering

The basic cycle (sketched in code below): position initial class centroids, test class memberships, adjust the centroids, then retest and repeat.
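A minimal version of that loop in plain NumPy (the data, k and stopping rule are placeholders; the slides use XLStat, and in practice several random starts would be compared):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        # Position initial class centroids by picking k samples at random.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Test class memberships: assign each sample to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Adjust centroids: move each one to the mean of its members.
            new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
            if np.allclose(new_centroids, centroids):   # stable solution reached
                break
            centroids = new_centroids
        return labels, centroids

    # Example: two well-separated groups in two dimensions.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(8, 1, (10, 2))])
    labels, centroids = kmeans(X, k=2)
    print(labels)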


Using XLStat

Classification criteria that can be minimized:

Trace. Minimizes the within-class variance, giving the most homogeneous clusters. Data should be autoscaled if this is used.

Determinant. Minimizes the determinant of the pooled within-class covariance matrix. More appropriate to use with unscaled data, but gives less homogeneous clusters.

Wilks' lambda. A normalized version of the determinant approach.

Trace/median. The centroid ends up being based on the median, not the mean as in the other approaches. Better when there is subclustering of the data.
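For reference, the trace criterion is just the total within-class sum of squared deviations from the class centroids; a small sketch of how it would be computed (names are placeholders):

    import numpy as np

    def trace_criterion(X, labels):
        # Sum of squared deviations of each sample from its class centroid -
        # the within-class variance measure minimized by the "trace" option.
        total = 0.0
        for c in np.unique(labels):
            members = X[labels == c]
            total += ((members - members.mean(axis=0)) ** 2).sum()
        return total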

    Using XLStat

XLStat's version of HCA (Agglomerative Hierarchical Clustering - AHC) will do a k-means analysis, but only with the trace method.

The k-means option provides more clustering control and is faster because no HCA is conducted.

However, AHC has an option that allows the routine to automatically set the number of clusters that appear to exist.

Iris dataset (again).

(Figures: the iris data revisited with k-means clustering, including a scatter plot of petal width vs. petal length, raw data.)


    Arson dataset.

Here are the final class results from the k-means clustering.

(Figure: k-means class assignments shown against the centroid-linkage, autoscaled dendrogram; similarity scale -0.34 to 0.86.)

Coffee (a more complete data set)

(Dendrogram: samples labeled by source - Costa Rican, Sulawesi, Sumatra, Ethiopia, Kenya, Colombia; distance scale 0-140. The sources largely form their own blocks, with some intermixing of Costa Rican/Sulawesi and Kenya/Colombia samples.)

Mycobacteria

    This data set was VERY difficult to visualize using a dendrogram.

(Dendrogram: 175 samples; dissimilarity scale 0-1000.)

Mycobacteria - autoclustering.


So what's it good for?

Can be used as a way to subdivide a dataset into related clusters.

Clusters are objectively determined based on similarities in multidimensional space.

While results can vary based on the starting point, the effect can be minimized by using multiple starting points and repetitions.

Results are easier to see than with HCA. k-means and HCA complement each other.