An Introduction to Cluster Analysis
Zhaoxia Yu
Department of Statistics
Vice Chair of Undergraduate Affairs
[email protected]


What can you say about the figure?

[Figure: scatter plot of signal C (vertical axis) against signal T (horizontal axis)]

• ≈1500 subjects
• Two measurements per subject

[Figure: scatter plot of signal C (vertical axis) against signal T (horizontal axis)]

[Figure: scatter plot of signal C (vertical axis) against signal T (horizontal axis), with three groups labeled CC, CT, and TT]

Cluster Analysis

• Seeks rules to group data
  – Large between-cluster difference
  – Small within-cluster difference
• Exploratory
• Aims to understand/learn the unknown substructure of multivariate data

Cluster Analysis vs Classification

Cluster analysis:
• Data are unlabeled
• The number of clusters is unknown
• “Unsupervised” learning
• Goal: find unknown structures

Classification:
• The labels for training data are known
• The number of classes is known
• “Supervised” learning
• Goal: allocate new observations, whose labels are unknown, to one of the known classes

The Iris Data

• It was introduced by R. A. Fisher (the measurements were collected by Edgar Anderson)
• A famous data set that has been widely used in textbooks
• Four features:
  – sepal length in cm
  – sepal width in cm
  – petal length in cm
  – petal width in cm

The Iris Data

• Three types:
  – Setosa
  – Versicolor
  – Virginica

The Iris Data

Iris Setosa
        Sepal L.  Sepal W.  Petal L.  Petal W.
 [1,]     5.1       3.5       1.4       0.2
 [2,]     4.9       3.0       1.4       0.2
 [3,]     4.7       3.2       1.3       0.2
 [4,]     4.6       3.1       1.5       0.2
 [5,]     5.0       3.6       1.4       0.2
 [6,]     5.4       3.9       1.7       0.4
 [7,]     4.6       3.4       1.4       0.3
 [8,]     5.0       3.4       1.5       0.2
 [9,]     4.4       2.9       1.4       0.2
  …        …         …         …         …
[45,]     5.1       3.8       1.9       0.4
[46,]     4.8       3.0       1.4       0.3
[47,]     5.1       3.8       1.6       0.2
[48,]     4.6       3.2       1.4       0.2
[49,]     5.3       3.7       1.5       0.2
[50,]     5.0       3.3       1.4       0.2

Iris Virginica
        Sepal L.  Sepal W.  Petal L.  Petal W.
 [1,]     6.3       3.3       6.0       2.5
 [2,]     5.8       2.7       5.1       1.9
 [3,]     7.1       3.0       5.9       2.1
 [4,]     6.3       2.9       5.6       1.8
 [5,]     6.5       3.0       5.8       2.2
 [6,]     7.6       3.0       6.6       2.1
 [7,]     4.9       2.5       4.5       1.7
 [8,]     7.3       2.9       6.3       1.8
 [9,]     6.7       2.5       5.8       1.8
  …        …         …         …         …
[45,]     6.7       3.3       5.7       2.5
[46,]     6.7       3.0       5.2       2.3
[47,]     6.3       2.5       5.0       1.9
[48,]     6.5       3.0       5.2       2.0
[49,]     6.2       3.4       5.4       2.3
[50,]     5.9       3.0       5.1       1.8

Iris Versicolor
        Sepal L.  Sepal W.  Petal L.  Petal W.
 [1,]     7.0       3.2       4.7       1.4
 [2,]     6.4       3.2       4.5       1.5
 [3,]     6.9       3.1       4.9       1.5
 [4,]     5.5       2.3       4.0       1.3
 [5,]     6.5       2.8       4.6       1.5
 [6,]     5.7       2.8       4.5       1.3
 [7,]     6.3       3.3       4.7       1.6
 [8,]     4.9       2.4       3.3       1.0
 [9,]     6.6       2.9       4.6       1.3
  …        …         …         …         …
[45,]     5.6       2.7       4.2       1.3
[46,]     5.7       3.0       4.2       1.2
[47,]     5.7       2.9       4.2       1.3
[48,]     6.2       2.9       4.3       1.3
[49,]     5.1       2.5       3.0       1.1
[50,]     5.7       2.8       4.1       1.3

The Iris Data

[Figure: the four features (Sepal L., Sepal W., Petal L., Petal W.) shown for Setosa, Versicolor, and Virginica]

Clustering Methods

• Model-free:
  – Nonhierarchical clustering: K-means
  – Hierarchical clustering: based on similarity measures
• Model-based clustering

Model-Free Clustering
Nonhierarchical Clustering: K-Means

K-Means

• Assign each observation to the cluster with the nearest mean
• “Nearest” is usually defined based on Euclidean distance

K-Means: Algorithm

• Step 0: Preprocess the data. Standardize the data if appropriate
• Step 1: Partition the observations into K initial clusters
• Step 2
  – 2.a (update step): Calculate the centroid of each cluster
  – 2.b (assignment step): Assign each observation to the cluster whose centroid is nearest
• Repeat Step 2 until there are no more changes in the assignments
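The steps above can be sketched in plain Python. This is a minimal illustration rather than the code behind the slides: the names `sq_dist`, `centroid`, and `kmeans`, the random initial partition, the fixed seed, and the rule for reseeding an empty cluster are all assumptions made for this example, and Step 0 (standardization) is left out.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: random initial partition (assumption: round-robin over a
    # shuffled order, so every cluster starts non-empty when n >= k).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for pos, i in enumerate(order):
        labels[i] = pos % k
    centers = []
    for _ in range(max_iter):
        # Step 2.a (update step): centroid of each current cluster.
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is reseeded at a random observation.
            centers.append(centroid(members) if members else rng.choice(points))
        # Step 2.b (assignment step): move each point to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # no change in assignments: stop
            break
        labels = new_labels
    return labels, centers


# Toy data (hypothetical): two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

Cluster numbers are arbitrary: only which observations share a label is meaningful.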

[Figure from “An Introduction to Statistical Learning”]

Remarks

• Before convergence, each step is guaranteed to decrease the within-cluster sum of squares objective
• The algorithm stops within a finite number of steps, but the solution it reaches may be only a local minimum
• Run the algorithm from several different random initial values and compare the results
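The remarks can be checked numerically with a small sketch. It records the within-cluster sum of squares after every assignment step and repeats K-means from several random starts, keeping the run with the smallest final value. The names `kmeans_run` and `kmeans_restarts`, the number of starts, the seed, and the toy data are assumptions for illustration only.

```python
import random


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_run(points, k, rng, max_iter=100):
    """One K-means run from a random initial partition.

    Returns (labels, centers, trace), where trace holds the within-cluster
    sum of squares recorded after each assignment step.
    """
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for pos, i in enumerate(order):
        labels[i] = pos % k
    trace = []
    centers = []
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                # Assumption: reseed an empty cluster at a random observation.
                centers.append(rng.choice(points))
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        trace.append(sum(sq_dist(p, centers[lab])
                         for p, lab in zip(points, new_labels)))
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers, trace


def kmeans_restarts(points, k, n_starts=10, seed=0):
    """Run K-means n_starts times; return the best run and all runs."""
    rng = random.Random(seed)
    runs = [kmeans_run(points, k, rng) for _ in range(n_starts)]
    best = min(runs, key=lambda run: run[2][-1])
    return best, runs


# Toy data (hypothetical): three small groups in the plane.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 5.1), (5.1, 4.8),
          (0.0, 5.0), (0.3, 5.2), (0.1, 4.9)]
best, runs = kmeans_restarts(points, k=3, n_starts=10, seed=1)
print("final objective of each run:", [run[2][-1] for run in runs])
print("best labels:", best[0])
```

If different starts end at different final objective values, those runs have stopped at different local minima.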

Different Initial Values

[Figure from “An Introduction to Statistical Learning”]

Example: Cluster Analysis of Iris Data (Petal L & W)

• Pretend that the iris types of the observations are unknown, so that the problem becomes a cluster analysis
• As an example, and for illustration purposes, we will use petal length and petal width
• Choose K = 3
• Apply K-means

K-Means Clustering: Iris (Petal L & W)

[Animated figure: K-means on petal length and width, shown in panels for iterations 1 through 9; legend: Setosa, Versicolor, Virginica]

Note: the animation in the figure does not work properly on a Mac.

Model-Free Clustering: Hierarchical Clustering

Hierarchical Clustering

• The number of clusters does not need to be specified in advance
• Gives a tree-based representation of the observations: a dendrogram
• Each leaf represents an observation
• Leaves that are similar to each other are fused into branches
• Leaves/branches that are similar to each other are fused into larger branches
• …

[Figure: tree (dendrogram) of the iris data, with groups labeled Setosa, Virginica, and Versicolor]

Hierarchical Clustering

• To grow a tree, we need to define dissimilarities (distances) between leaves/branches
  – Two leaves: easy. One can use a dissimilarity measure
  – A leaf and a branch: there are different options
  – Two branches: similar to “a leaf and a branch”, there are different options

Distance between Two Branches/Clusters

• Single linkage: the smallest distance between an observation in one cluster and an observation in the other
• Complete linkage: the largest such distance
• Average linkage: the average of all such distances
• Many other options!
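The three linkage rules can be made concrete with a short sketch. It uses Euclidean distance as the dissimilarity between leaves, which is one possible choice, and grows the tree by repeatedly fusing the two closest clusters. The names `linkage_distance` and `agglomerate` and the toy points are hypothetical.

```python
import math


def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters (lists of points) under a linkage rule."""
    dists = [math.dist(a, b) for a in cluster_a for b in cluster_b]
    if method == "single":      # closest pair
        return min(dists)
    if method == "complete":    # farthest pair
        return max(dists)
    if method == "average":     # mean over all pairs
        return sum(dists) / len(dists)
    raise ValueError("unknown linkage: " + method)


def agglomerate(points, method="average"):
    """Fuse the two closest clusters until one remains.

    Returns the list of merges as (cluster_a, cluster_b, height), which is
    the information a dendrogram displays.
    """
    clusters = [[p] for p in points]  # every leaf starts as its own cluster
    merges = []
    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage_distance(
            clusters[ij[0]], clusters[ij[1]], method))
        height = linkage_distance(clusters[i], clusters[j], method)
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Toy data (hypothetical): two pairs of points.
A = [(0, 0), (0, 1)]
B = [(3, 0), (3, 1)]
for method in ("single", "complete", "average"):
    print(method, linkage_distance(A, B, method))

merges = agglomerate(A + B, method="single")
print([height for _, _, height in merges])
```

Cutting the resulting tree at a chosen height yields a partition, which is why the number of clusters need not be fixed beforehand.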

Model-Based Clustering

Model-Based Clustering: Mixture Model

• Consider a random variable X
• We say it follows a mixture of K distributions if its distribution can be represented using K distributions:

  $$f(x) = \sum_{k=1}^{K} p_k f_k(x)$$

  where $f_k$ is the density of the k-th distribution
• The weights $p_k$, k = 1, …, K, are nonnegative numbers and they add up to 1
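A mixture can be made concrete by evaluating its density and drawing from it. The sketch below assumes Gaussian components and uses N(1, 1) and N(3, 1) with equal weights; the equal weights, the function names, and the integration grid are assumptions for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, sds):
    """Weighted sum of the component densities."""
    return sum(p * normal_pdf(x, m, s) for p, m, s in zip(weights, means, sds))


def sample_mixture(n, weights, means, sds, seed=0):
    """Draw n values: pick a component by its weight, then sample from it."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        draws.append(rng.gauss(means[k], sds[k]))
    return draws


# Assumed example: two components with equal weights.
weights, means, sds = [0.5, 0.5], [1.0, 3.0], [1.0, 1.0]

# Because the weights add up to 1, the mixture is itself a density:
# a Riemann sum over a wide grid should be very close to 1.
step = 0.01
grid = [-10.0 + step * i for i in range(2401)]
area = step * sum(mixture_pdf(x, weights, means, sds) for x in grid)
print("approximate area under the mixture density:", area)
print("five draws:", sample_mixture(5, weights, means, sds))
```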

Cluster Analysis Based on Mixture Model

• I present a frequentist version
  – Choose an appropriate model, e.g., a Gaussian mixture model with K = 2 clusters
  – Write down the likelihood function
  – Find the maximum likelihood estimate of the parameters
  – Calculate Pr(cluster k | observation x_i) for i = 1, …, n, k = 1, 2
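For the Gaussian mixture with K = 2, the likelihood and the cluster probabilities in the steps above can be written out explicitly. The following is a sketch under assumed notation: the parameters are $\mu_1, \mu_0, \sigma_1, \sigma_0, p$, taking $p$ to be the weight of cluster 1 is an assumption made here, and $\phi(\cdot\,;\mu,\sigma^2)$ denotes the normal density.

```latex
\begin{align}
L(\mu_1,\mu_0,\sigma_1,\sigma_0,p)
  &= \prod_{i=1}^{n}\Bigl[\, p\,\phi(x_i;\mu_1,\sigma_1^2)
     + (1-p)\,\phi(x_i;\mu_0,\sigma_0^2) \Bigr] \\[4pt]
\Pr(\text{cluster } 1 \mid x_i)
  &= \frac{p\,\phi(x_i;\mu_1,\sigma_1^2)}
          {p\,\phi(x_i;\mu_1,\sigma_1^2) + (1-p)\,\phi(x_i;\mu_0,\sigma_0^2)} \\[4pt]
\Pr(\text{cluster } 0 \mid x_i)
  &= 1 - \Pr(\text{cluster } 1 \mid x_i)
\end{align}
```

In practice the cluster probabilities are evaluated at the maximum likelihood estimates of the parameters.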

The Maximum Likelihood Estimate (MLE) of the Parameters

• An easy-to-implement algorithm for finding the MLEs is the Expectation-Maximization (EM) algorithm
• Initialize the parameters
• E step: calculate a “conditional” expectation
  – “Conditional” means conditional on the current estimate of the parameters
  – This step involves calculating Pr(cluster k | observation x_i, current estimate of the parameters), k = 1, …, K, i = 1, …, n
  – This step is similar to the assignment step in the K-means algorithm

The Maximum Likelihood Estimate (MLE) of the Parameters

• The M step: find the set of parameter values that maximize the conditional expectation calculated in the E step. This step updates the parameter values
• Repeat the E and M steps until convergence
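For a two-component Gaussian mixture, the E and M steps have closed forms. The updates below are the standard ones for this model, written under assumed notation: $\gamma_i$ is the probability of cluster 1 for observation $x_i$ computed in the E step from the current estimates (superscript $(t)$), and $p$ is taken to be the weight of cluster 1.

```latex
\begin{align}
\text{E step:}\quad
\gamma_i &= \frac{p^{(t)}\,\phi\bigl(x_i;\mu_1^{(t)},\sigma_1^{2\,(t)}\bigr)}
                 {p^{(t)}\,\phi\bigl(x_i;\mu_1^{(t)},\sigma_1^{2\,(t)}\bigr)
                  + \bigl(1-p^{(t)}\bigr)\,\phi\bigl(x_i;\mu_0^{(t)},\sigma_0^{2\,(t)}\bigr)} \\[6pt]
\text{M step:}\quad
p^{(t+1)} &= \frac{1}{n}\sum_{i=1}^{n}\gamma_i \\
\mu_1^{(t+1)} &= \frac{\sum_{i}\gamma_i\,x_i}{\sum_{i}\gamma_i},
\qquad
\mu_0^{(t+1)} = \frac{\sum_{i}(1-\gamma_i)\,x_i}{\sum_{i}(1-\gamma_i)} \\
\sigma_1^{2\,(t+1)} &= \frac{\sum_{i}\gamma_i\,\bigl(x_i-\mu_1^{(t+1)}\bigr)^2}{\sum_{i}\gamma_i},
\qquad
\sigma_0^{2\,(t+1)} = \frac{\sum_{i}(1-\gamma_i)\,\bigl(x_i-\mu_0^{(t+1)}\bigr)^2}{\sum_{i}(1-\gamma_i)}
\end{align}
```

Each update is a weighted version of the usual sample proportion, mean, and variance, with the E-step probabilities as weights.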

EM vs K-Means

EM:
• Step 1: initialization
• E step: calculate conditional probabilities
• M step: find optimal values for the parameters
• Repeat the E and M steps until convergence
• Allows clusters to have different shapes

K-Means:
• Step 1: initialization
• Step 2a: guess cluster membership
• Step 2b: find cluster centers
• Repeat 2a-2b until convergence

Example: Gaussian Mixture Model

• Observed data (simulated from two normal distributions)
  – 0.37 1.18 0.16 2.60 1.33 0.18 1.49 1.74 3.58 2.69 4.51 3.39 2.38 0.79 4.12 2.96 2.98 3.94 3.82 3.59
• Assuming K = 2
• Parameters: μ1, μ0, σ1, σ0, p
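The EM algorithm can be run on the 20 observations above with a short plain-Python sketch. It is an illustration under stated assumptions, not the code behind the slides: `p` is taken to be the weight of cluster 1, the starting values (smallest and largest observation as means, the overall standard deviation for both components, `p = 0.5`), the stopping tolerance, and the small floor on the standard deviations are all choices made here.

```python
import math

# Observed data from the slide (20 values).
x = [0.37, 1.18, 0.16, 2.60, 1.33, 0.18, 1.49, 1.74, 3.58, 2.69,
     4.51, 3.39, 2.38, 0.79, 4.12, 2.96, 2.98, 3.94, 3.82, 3.59]


def normal_pdf(v, mean, sd):
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior(x, mu0, mu1, sd0, sd1, p):
    """Pr(cluster 1 | x_i) for every observation, given the parameters."""
    out = []
    for v in x:
        a = p * normal_pdf(v, mu1, sd1)
        b = (1.0 - p) * normal_pdf(v, mu0, sd0)
        out.append(a / (a + b))
    return out


def log_likelihood(x, mu0, mu1, sd0, sd1, p):
    return sum(math.log(p * normal_pdf(v, mu1, sd1)
                        + (1.0 - p) * normal_pdf(v, mu0, sd0)) for v in x)


def em_two_gaussians(x, n_iter=200, tol=1e-8, sd_floor=1e-3):
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    # Assumed starting values.
    mu0, mu1, sd0, sd1, p = min(x), max(x), sd, sd, 0.5
    history = [log_likelihood(x, mu0, mu1, sd0, sd1, p)]
    for _ in range(n_iter):
        # E step: cluster probabilities under the current estimates.
        resp = posterior(x, mu0, mu1, sd0, sd1, p)
        # M step: weighted proportion, means, and standard deviations.
        n1 = sum(resp)
        n0 = n - n1
        p = n1 / n
        mu1 = sum(r * v for r, v in zip(resp, x)) / n1
        mu0 = sum((1.0 - r) * v for r, v in zip(resp, x)) / n0
        var1 = sum(r * (v - mu1) ** 2 for r, v in zip(resp, x)) / n1
        var0 = sum((1.0 - r) * (v - mu0) ** 2 for r, v in zip(resp, x)) / n0
        # Assumption: floor the standard deviations to avoid degenerate fits.
        sd1 = max(math.sqrt(var1), sd_floor)
        sd0 = max(math.sqrt(var0), sd_floor)
        history.append(log_likelihood(x, mu0, mu1, sd0, sd1, p))
        if history[-1] - history[-2] < tol:
            break
    params = {"mu0": mu0, "mu1": mu1, "sd0": sd0, "sd1": sd1, "p": p}
    return params, posterior(x, mu0, mu1, sd0, sd1, p), history


params, resp, history = em_two_gaussians(x)
print(params)
print([round(r, 2) for r in resp])
```

The returned probabilities play the role of soft cluster assignments; a hard clustering can be obtained by assigning each observation to the cluster with the larger probability.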

Example: simulated data

[Animated figure; labels: Group 1 ~ N(1, 1), Group 2 ~ N(3, 1)]

Note: the animation in the figure does not work properly on a Mac.

Example: Cluster Analysis of Iris Data Using Petal Length

[Animated figure; legend: Setosa, Versicolor]

Note: the animation in the figure does not work properly on a Mac.

R Package: MCLUST

• Developed by Adrian Raftery and colleagues
• Gaussian mixture model
• EM
• Clustering, classification, density estimation
• Please try it out!

Clustering Analysis for Multidimensional Data

Multidimensional Data

• Human faces, images
• 3D objects
• Text documents
• Brain imaging

Whole Brain Connectivity

[Figure: panels for Sub 1, Sub 2, Sub 3, and Sub 4 under task1, task2, task3, rest1, rest2, and rest3]

Brain Connectivity vs Fingerprint

[Figure; axis label: Subject ID]

[Figure; labels: task1, task2, task3, rest1, rest2, rest3]

Some Technical Details