Practical Guide To Principal Component Methods in R (Multivariate Analysis Book 2)

Alboukadel KASSAMBARA

Preface

0.1 What you will learn

Large data sets containing multiple samples and variables are collected every day by researchers in various fields, such as the biomedical, marketing, and geospatial fields.

Discovering knowledge from these data requires specific techniques for analyzing data sets containing multiple variables. Multivariate analysis (MVA) refers to a set of techniques used for analyzing a data set containing more than one variable.

Among these techniques, there are:

- Cluster analysis, for identifying groups of observations with similar profiles according to specific criteria.
- Principal component methods, which consist of summarizing and visualizing the most important information contained in a multivariate data set.

Previously, we published a book entitled "Practical Guide To Cluster Analysis in R" (https://goo.gl/DmJ5y5). The aim of the current book is to provide solid practical guidance on principal component methods in R. Additionally, we developed an R package named factoextra to easily create elegant ggplot2-based plots of the results of principal component methods. Official online documentation for factoextra: http://www.sthda.com/english/rpkgs/factoextra

One of the difficulties inherent in multivariate analysis is the problem of visualizing data that has many variables. In R, there are many functions and packages for displaying a graph of the relationship between two variables (http://www.sthda.com/english/wiki/data-visualization). There are also commands for displaying different three-dimensional views. But when there are more than three variables, it is more difficult to visualize their relationships.

Fortunately, in data sets with many variables, some variables are often correlated. This can be explained by the fact that more than one variable might be measuring the same driving principle governing the behavior of the system. Correlation indicates that there is redundancy in the data. When this happens, you can simplify the problem by replacing a group of correlated variables with a single new variable.

Principal component analysis is a rigorous statistical method used for achieving this simplification. The method creates a new set of variables, called principal components. Each principal component is a linear combination of the original variables. All the principal components are orthogonal to each other, so there is no redundant information.

The type of principal component method to use depends on the variable types contained in the data set. This practical guide will describe the following methods:

1. Principal Component Analysis (PCA), which is one of the most popular multivariate analysis methods. The goal of PCA is to summarize the information contained in continuous (i.e., quantitative) multivariate data by reducing the dimensionality of the data without losing important information.

2. Correspondence Analysis (CA), which is an extension of principal component analysis for analyzing a large contingency table formed by two qualitative variables (or categorical data).

3. Multiple Correspondence Analysis (MCA), which is an adaptation of CA to a data table containing more than two categorical variables.

4. Factor Analysis of Mixed Data (FAMD), dedicated to analyzing a data set containing both quantitative and qualitative variables.

5. Multiple Factor Analysis (MFA), dedicated to analyzing data sets in which variables are organized into groups (qualitative and/or quantitative variables).

Additionally, we'll discuss the HCPC (Hierarchical Clustering on Principal Components) method. It applies agglomerative hierarchical clustering to the results of principal component methods (PCA, CA, MCA, FAMD, MFA). It allows us, for example, to perform clustering analysis on any type of data (quantitative, qualitative or mixed data).

Figure 1 illustrates the type of analysis to be performed depending on the type of variables contained in the data set.

Principal component methods

0.2 Key features of this book

Although there are several good books on principal component methods and related topics, we felt that many of them are either too theoretical or too advanced.

Our goal was to write a practical guide to multivariate analysis, visualization and interpretation, focusing on principal component methods.

The book presents the basic principles of the different methods and provides many examples in R. This book offers solid guidance in data mining for students and researchers.

Key features

- Covers principal component methods and implementation in R
- Short, self-contained chapters with tested examples that allow for flexibility in designing a course and for easy reference

At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter. Additionally, we provide links to other resources and to our hand-curated list of videos on principal component methods for further learning.

0.3 How this book is organized

This book is divided into 4 parts and 6 chapters. Part I provides a quick introduction to R (chapter 2) and presents the required R packages for the analysis and visualization (chapter 3).

In Part II, we describe classical multivariate analysis methods:

- Principal Component Analysis - PCA (chapter 4)
- Correspondence Analysis - CA (chapter 5)
- Multiple Correspondence Analysis - MCA (chapter 6)

In Part III, we continue by discussing advanced methods for analyzing a data set containing a mix of variables (qualitative & quantitative), organized or not into groups:

- Factor Analysis of Mixed Data - FAMD (chapter 7)
- Multiple Factor Analysis - MFA (chapter 8)

Finally, we show in Part IV how to perform hierarchical clustering on principal components (HCPC) (chapter 9), which is useful for performing clustering with a data set containing only qualitative variables or with mixed data of qualitative and quantitative variables.

Some examples of plots generated in this book are shown hereafter. You'll learn how to create, customize and interpret these plots.

1. Eigenvalues/variances of principal components. Proportion of information retained by each principal component.

2. PCA - Graph of variables:

Control variable colors using their contributions to the principal components.

Highlight the most contributing variables to each principal dimension:

3. PCA - Graph of individuals:

Automatically control the color of individuals using the cos2 (the quality of the individuals on the factor map)

Change the point size according to the cos2 of the corresponding individuals:

4. PCA - Biplot of individuals and variables

5. Correspondence analysis. Association between categorical variables.

6. FAMD - Analyzing mixed data

7. Clustering on principal components

0.4 Book website

The website for this book is located at: http://www.sthda.com/english/. It contains a number of resources.

0.5 Executing the R codes from the PDF

For a single line of R code, you can just copy the code from the PDF to the R console.

For multiple-line R code, an error is sometimes generated when you copy and paste the R code directly from the PDF to the R console. If this happens, a solution is to:

- First paste the code into your R code editor or your text editor
- Copy the code from your text/code editor to the R console

0.6 Acknowledgment

I sincerely thank all developers for their efforts behind the packages that factoextra depends on, namely, ggplot2 (Hadley Wickham, Springer-Verlag New York, 2009), FactoMineR (Sebastien Le et al., Journal of Statistical Software, 2008), dendextend (Tal Galili, Bioinformatics, 2015), cluster (Martin Maechler et al., 2016) and more.

0.7 Colophon

This book was built with:

- R 3.3.2
- factoextra 1.0.5
- FactoMineR 1.36
- ggpubr 0.1.5
- dplyr 0.7.2
- bookdown 0.4.3

1 About the author

Alboukadel Kassambara holds a PhD in Bioinformatics and Cancer Biology. He has worked for many years on genomic data analysis and visualization (read more: http://www.alboukadel.com/).

He has work experience in statistical and computational methods to identify prognostic and predictive biomarker signatures through integrative analysis of large-scale genomic and clinical data sets.

He created a bioinformatics web tool named GenomicScape (www.genomicscape.com), which is an easy-to-use web tool for gene expression data analysis and visualization.

He also developed a training website on data science, named STHDA (Statistical Tools for High-throughput Data Analysis, www.sthda.com/english), which contains many tutorials on data analysis and visualization using R software and packages.

He is the author of many popular R packages for:

- multivariate data analysis (factoextra, http://www.sthda.com/english/rpkgs/factoextra),
- survival analysis (survminer, http://www.sthda.com/english/rpkgs/survminer/),
- correlation analysis (ggcorrplot, http://www.sthda.com/english/wiki/ggcorrplot-visualization-of-a-correlation-matrix-using-ggplot2),
- creating publication-ready plots in R (ggpubr, http://www.sthda.com/english/rpkgs/ggpubr).

Recently, he published three books on data analysis and visualization:

1. Practical Guide to Cluster Analysis in R (https://goo.gl/DmJ5y5)
2. Guide to Create Beautiful Graphics in R (https://goo.gl/vJ0OYb).
3. Complete Guide to 3D Plots in R (https://goo.gl/v5gwl0).

2 Introduction to R

R is a free and powerful statistical software for analyzing and visualizing data. If you want to easily learn the essentials of R programming, visit our series of tutorials available on STHDA: http://www.sthda.com/english/wiki/r-basics-quick-and-easy.

In this chapter, we provide a very brief introduction to R, covering how to install R/RStudio and how to import your data into R for computing principal component methods.

2.1 Installing R and RStudio

R and RStudio can be installed on Windows, Mac OS X and Linux platforms. RStudio is an integrated development environment for R that makes using R easier. It includes a console, code editor and tools for plotting.

1. R can be downloaded and installed from the Comprehensive R Archive Network (CRAN) webpage (http://cran.r-project.org/)

2. After installing R, also install RStudio, available at: http://www.rstudio.com/products/RStudio/.

3. Launch RStudio and start using R inside RStudio.

RStudio interface

2.2 Installing and loading R packages

An R package is an extension of R containing data sets and specific R functions to solve specific questions.

For example, in this book, you'll learn how to compute and visualize principal component methods using the FactoMineR and factoextra R packages.

There are thousands of other R packages available for download and installation from CRAN, Bioconductor (biology-related R packages) and GitHub repositories.

1. How to install packages from CRAN? Use the function install.packages():

install.packages("FactoMineR")
install.packages("factoextra")

2. How to install packages from GitHub? You should first install devtools if you don't have it already installed on your computer.

For example, the following R code installs the latest development version of the factoextra R package developed by A. Kassambara (https://github.com/kassambara/factoextra) for multivariate data analysis and elegant visualization.

install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Note that GitHub contains the latest development version of R packages.

3. After installation, you must first load the package to use its functions. The function library() is used for this task.

library("FactoMineR")
library("factoextra")

Now, we can use R functions, such as PCA() [in the FactoMineR package], for performing principal component analysis.

2.3 Getting help with functions in R

If you want to learn more about a given function, say PCA(), type this in the R console:

?PCA

2.4 Importing your data into R

1. Prepare your file as follows:

- Use the first row as column names. Generally, columns represent variables.
- Use the first column as row names. Generally, rows represent observations or individuals.
- Each row/column name should be unique, so remove duplicated names.
- Avoid names with blank spaces. Good column names: Long_jump or Long.jump. Bad column name: Long jump.
- Avoid names with special symbols: ?, $, *, +, #, (, ), -, /, }, {, |, >, < etc. Only underscore can be used.
- Avoid beginning variable names with a number. Use a letter instead. Good column names: sport_100m or x100m. Bad column name: 100m
- R is case sensitive. This means that Name is different from name or NAME.
- Avoid blank rows in your data.
- Delete any comments in your file.
- Replace missing values by NA (for not available)
- If you have a column containing dates, use the four-digit year format. Good format: 01/01/2016. Bad format: 01/01/16

2. The final file should look like this:

General data format for importation into R

3. Save your file

We recommend saving your file in .txt (tab-delimited text file) or .csv (comma-separated value file) format.

4. Get your data into R:

Use the R code below. You will be asked to choose a file:

# .txt file: Read tab separated values
my_data <- read.delim(file.choose(), row.names = 1)
# .csv file: Read comma (",") separated values
my_data <- read.csv(file.choose(), row.names = 1)
# .csv file: Read semicolon (";") separated values
my_data <- read.csv2(file.choose(), row.names = 1)

Using these functions, the imported data will be of class data.frame (R terminology).

You can read more about how to import data into R at this link: http://www.sthda.com/english/wiki/importing-data-into-r
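
As a minimal sketch of the import step that does not require choosing a file interactively, the code below writes a tiny data set to a temporary .csv file and reads it back with read.csv(). The athlete names, the values and the temporary file are made up for illustration; only the read.csv() call with row.names = 1 mirrors the code above.

```r
# Build a small, hypothetical data set that follows the file rules above:
# first column = unique row names, valid column names, NA for missing values.
demo <- data.frame(
  athlete   = c("athlete_A", "athlete_B"),
  X100m     = c(11.0, 10.8),
  Long_jump = c(7.58, NA)
)

# Save it as a comma-separated file in a temporary location
tmp <- tempfile(fileext = ".csv")
write.csv(demo, tmp, row.names = FALSE)

# Read it back, using the first column as row names
my_data <- read.csv(tmp, row.names = 1)

class(my_data)      # a data.frame
rownames(my_data)   # names taken from the first column of the file
my_data
```

Because the first column is consumed as row names, my_data contains only the two numeric columns, which is the layout expected later by the analysis functions.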

2.5 Demo data sets

R comes with several built-in data sets, which are generally used as demo data for playing with R functions. The most used R demo data sets include: USArrests, iris and mtcars. To load a demo data set, use the function data() as follows:

data("USArrests") # Loading
head(USArrests, 3) # Print the first 3 rows

##         Murder Assault UrbanPop Rape
## Alabama   13.2     236       58 21.2
## Alaska    10.0     263       48 44.5
## Arizona    8.1     294       80 31.0

If you want to learn more about the USArrests data set, type this:

?USArrests

To select just certain columns from a data frame, you can either refer to the columns by name or by their location (i.e., column 1, 2, 3, etc.).

# Access the data in 'Murder' column
# dollar sign is used
head(USArrests$Murder)

## [1] 13.2 10.0  8.1  8.8  9.0  7.9

# Or use this
USArrests[, 'Murder']
# Or use this
USArrests[, 1] # column number 1
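
The same bracket notation extends to several columns at once. The short sketch below is an added illustration; the object names sub_by_name, sub_by_pos and one_col are chosen here and do not appear elsewhere in the book.

```r
data("USArrests")

# Select two columns by name ...
sub_by_name <- USArrests[, c("Murder", "Assault")]
# ... or the same two columns by position
sub_by_pos <- USArrests[, 1:2]

# Both approaches return the same data frame
identical(sub_by_name, sub_by_pos)

# Selecting a single column returns a plain vector by default;
# drop = FALSE keeps the result as a one-column data frame
one_col <- USArrests[, "Murder", drop = FALSE]
head(one_col, 3)
```

Keeping a data frame (rather than a vector) is convenient when the selection is passed on to functions that expect rows and columns.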

2.6 Close your R/RStudio session

Each time you close R/RStudio, you will be asked whether you want to save the data from your R session. If you decide to save, the data will be available in future R sessions.

3 Required R packages

3.1 FactoMineR & factoextra

There are a number of R packages implementing principal component methods. These packages include: FactoMineR, ade4, stats, ca, MASS and ExPosition.

However, the result is presented differently depending on the package used.

To help in the interpretation and in the visualization of multivariate analysis - such as cluster analysis and principal component methods - we developed an easy-to-use R package named factoextra (official online documentation: http://www.sthda.com/english/rpkgs/factoextra) (Kassambara and Mundt 2017).

No matter which package you decide to use for computing principal component methods, the factoextra R package can help to easily extract, in a human-readable data format, the analysis results from the different packages mentioned above. factoextra also provides convenient solutions to create beautiful ggplot2-based graphs.

In this book, we'll mainly use:

- the FactoMineR package (F. Husson et al. 2017) to compute principal component methods;
- and the factoextra package (Kassambara and Mundt 2017) for extracting, visualizing and interpreting the results.

The other packages - ade4, ExPosition, etc - will be presented briefly.

Figure 2.1 illustrates the key functionality of FactoMineR and factoextra.

Key features of FactoMineR and factoextra for multivariate analysis

Methods whose outputs can be visualized using the factoextra package are shown in Figure 2.2:

Principal component methods and clustering methods supported by the factoextra R package

3.2 Installation

3.2.1 Installing FactoMineR

The FactoMineR package can be installed and loaded as follows:

# Install
install.packages("FactoMineR")
# Load
library("FactoMineR")

3.2.2 Installing factoextra

factoextra can be installed from CRAN as follows:

install.packages("factoextra")

Or, install the latest development version from GitHub:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")

Load factoextra as follows:

library("factoextra")

3.3 Main R functions

3.3.1 Main functions in FactoMineR

Functions for computing principal component methods and clustering:

Functions   Description
PCA         Principal component analysis.
CA          Correspondence analysis.
MCA         Multiple correspondence analysis.
FAMD        Factor analysis of mixed data.
MFA         Multiple factor analysis.
HCPC        Hierarchical clustering on principal components.
dimdesc     Dimension description.

3.3.2 Main functions in factoextra

factoextra functions covered in this book are listed in the tables below. See the online documentation (http://www.sthda.com/english/rpkgs/factoextra) for a complete list.

Visualizing principal component method outputs

Functions                       Description
fviz_eig (or fviz_eigenvalue)   Visualize eigenvalues.
fviz_pca                        Graph of PCA results.
fviz_ca                         Graph of CA results.
fviz_mca                        Graph of MCA results.
fviz_mfa                        Graph of MFA results.
fviz_famd                       Graph of FAMD results.
fviz_hmfa                       Graph of HMFA results.
fviz_ellipses                   Plot ellipses around groups.
fviz_cos2                       Visualize element cos2 (see note 1).
fviz_contrib                    Visualize element contributions (see note 2).

Extracting data from principal component method outputs. The following functions extract all the results (coordinates, squared cosine, contributions) for the active individuals/variables from the analysis outputs.

Functions         Description
get_eigenvalue    Access to the dimension eigenvalues.
get_pca           Access to PCA outputs.
get_ca            Access to CA outputs.
get_mca           Access to MCA outputs.
get_mfa           Access to MFA outputs.
get_famd          Access to FAMD outputs.
get_hmfa          Access to HMFA outputs.
facto_summarize   Summarize the analysis.

Clustering analysis and visualization

Functions      Description
fviz_dend      Enhanced Visualization of Dendrogram.
fviz_cluster   Visualize Clustering Results.

1. Cos2: quality of representation of the row/column variables on the principal component maps.

2. This is the contribution of row/column elements to the definition of the principal components.

4 Principal Component Analysis

4.1 Introduction

Principal component analysis (PCA) allows us to summarize and to visualize the information in a data set containing individuals/observations described by multiple inter-correlated quantitative variables. Each variable could be considered as a different dimension. If you have more than 3 variables in your data sets, it could be very difficult to visualize a multi-dimensional hyperspace.

Principal component analysis is used to extract the important information from a multivariate data table and to express this information as a set of few new variables called principal components. These new variables correspond to a linear combination of the originals. The number of principal components is less than or equal to the number of original variables.

The information in a given data set corresponds to the total variation it contains. The goal of PCA is to identify directions (or principal components) along which the variation in the data is maximal.

In other words, PCA reduces the dimensionality of multivariate data to two or three principal components that can be visualized graphically, with minimal loss of information.

In this chapter, we describe the basic idea of PCA and demonstrate how to compute and visualize PCA using R software. Additionally, we'll show how to reveal the most important variables that explain the variations in a data set.

4.2 Basics

Understanding the details of PCA requires knowledge of linear algebra. Here, we'll explain only the basics with a simple graphical representation of the data.

In Plot 1A below, the data are represented in the X-Y coordinate system. The dimension reduction is achieved by identifying the principal directions, called principal components, in which the data varies.

PCA assumes that the directions with the largest variances are the most “important” (i.e., the most principal).

In the figure below, the PC1 axis is the first principal direction along which the samples show the largest variation. The PC2 axis is the second most important direction and it is orthogonal to the PC1 axis.

The dimensionality of our two-dimensional data can be reduced to a single dimension by projecting each sample onto the first principal component (Plot 1B).

Technically speaking, the amount of variance retained by each principal component is measured by the so-called eigenvalue.

Note that the PCA method is particularly useful when the variables within the data set are highly correlated. Correlation indicates that there is redundancy in the data. Due to this redundancy, PCA can be used to reduce the original variables into a smaller number of new variables (= principal components) explaining most of the variance in the original variables.
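
To make these ideas concrete, here is a small sketch on simulated two-dimensional data. It uses the base R function prcomp() rather than FactoMineR, and the simulated variables x and y, the seed and all object names are assumptions made for this illustration only.

```r
# Simulate two correlated variables (hypothetical data)
set.seed(123)
x  <- rnorm(100)
y  <- 0.8 * x + rnorm(100, sd = 0.4)
xy <- cbind(x, y)

# PCA on the centered (unscaled) data
pca <- prcomp(xy, center = TRUE, scale. = FALSE)

# Eigenvalues = variances retained by each principal component
eig <- pca$sdev^2
# They match the eigenvalues of the covariance matrix
ev <- eigen(cov(xy))$values
all.equal(eig, ev)

# The principal directions are orthogonal:
# crossprod() of the rotation matrix is the identity matrix
round(crossprod(pca$rotation), 10)

# Reducing to one dimension: coordinates of each sample on PC1
pc1_scores <- pca$x[, 1]
head(pc1_scores)

# Share of the total variance kept by PC1 alone
eig[1] / sum(eig)
```

Because x and y are strongly correlated in this simulation, most of the variance ends up on the first component, which is the redundancy argument made in the note above.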

Taken together, the main purpose of principal component analysis is to:

- identify hidden patterns in a data set,
- reduce the dimensionality of the data by removing the noise and redundancy in the data,
- identify correlated variables

4.3 Computation

4.3.1 R packages

Several functions from different packages are available in the R software for computing PCA:

- prcomp() and princomp() [built-in R stats package],
- PCA() [FactoMineR package],
- dudi.pca() [ade4 package],
- and epPCA() [ExPosition package]

No matter what function you decide to use, you can easily extract and visualize the results of PCA using R functions provided in the factoextra R package.

Here, we'll use the two packages FactoMineR (for the analysis) and factoextra (for ggplot2-based visualization).

Install the two packages as follows:

install.packages(c("FactoMineR", "factoextra"))

Load them in R, by typing this:

library("FactoMineR")
library("factoextra")

4.3.2 Data format

We'll use the demo data set decathlon2 from the factoextra package:

data(decathlon2)
# head(decathlon2)

As illustrated in Figure 3.1, the data used here describes athletes' performance during two sporting events (Decastar and OlympicG). It contains 27 individuals (athletes) described by 13 variables.

Principal component analysis data format

Note that only some of these individuals and variables will be used to perform the principal component analysis. The coordinates of the remaining individuals and variables on the factor map will be predicted after the PCA.

In PCA terminology, our data contains:

- Active individuals (in light blue, rows 1:23): Individuals that are used during the principal component analysis.
- Supplementary individuals (in dark blue, rows 24:27): The coordinates of these individuals will be predicted using the PCA information and parameters obtained with active individuals/variables.
- Active variables (in pink, columns 1:10): Variables that are used for the principal component analysis.
- Supplementary variables: As with supplementary individuals, the coordinates of these variables will be predicted also. These can be:
  - Supplementary continuous variables (red): Columns 11 and 12, corresponding respectively to the rank and the points of athletes.
  - Supplementary qualitative variables (green): Column 13, corresponding to the two athletic meetings (2004 Olympic Game or 2004 Decastar). This is a categorical (or factor) variable. It can be used to color individuals by groups.

We start by subsetting active individuals and active variables for the principal component analysis:

decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6], 4)

##         X100m Long.jump Shot.put High.jump X400m X110m.hurdle
## SEBRLE   11.0      7.58     14.8      2.07  49.8         14.7
## CLAY     10.8      7.40     14.3      1.86  49.4         14.1
## BERNARD  11.0      7.23     14.2      1.92  48.9         15.0
## YURKOV   11.3      7.09     15.2      2.10  50.4         15.3

4.3.3 Data standardization

In principal component analysis, variables are often scaled (i.e. standardized). This is particularly recommended when variables are measured in different scales (e.g., kilograms, kilometers, centimeters, ...); otherwise, the PCA outputs obtained will be severely affected.

The goal is to make the variables comparable. Generally, variables are scaled to have i) standard deviation one and ii) mean zero.

The standardization of data is an approach widely used in the context of gene expression data analysis before PCA and clustering analysis. We might also want to scale the data when the mean and/or the standard deviation of variables are largely different.

When scaling variables, the data can be transformed as follows:

$$\frac{x_i - mean(x)}{sd(x)}$$

Where mean(x) is the mean of x values, and sd(x) is the standard deviation (SD).

The R base function scale() can be used to standardize the data. It takes a numeric matrix as an input and performs the scaling on the columns.
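
The following sketch checks the formula against scale() on the built-in USArrests data set. The choice of data set and the object names scaled and manual are ours, picked only for illustration.

```r
data("USArrests")

# Standardize every column: subtract the mean, divide by the SD
scaled <- scale(USArrests)

# The same transformation written out by hand for one variable
manual <- (USArrests$Murder - mean(USArrests$Murder)) / sd(USArrests$Murder)
all.equal(as.numeric(scaled[, "Murder"]), manual)

# After scaling, each column has mean zero and standard deviation one
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

Note that scale() returns a matrix, with the column means and standard deviations it used stored as attributes.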

4.3.4 R code

The function PCA() [FactoMineR package] can be used. A simplified format is:

PCA(X, scale.unit = TRUE, ncp = 5, graph = TRUE)

- X: a data frame. Rows are individuals and columns are numeric variables
- scale.unit: a logical value. If TRUE, the data are scaled to unit variance before the analysis. This standardization to the same scale prevents some variables from becoming dominant just because of their large measurement units. It makes variables comparable.
- ncp: number of dimensions kept in the final results.
- graph: a logical value. If TRUE a graph is displayed.

Note that, by default, the function PCA() [in FactoMineR] standardizes the data automatically during the PCA; so you don't need to do this transformation before the PCA.

The R code below computes principal component analysis on the active individuals/variables:

library("FactoMineR")
res.pca <- PCA(decathlon2.active, graph = FALSE)

The output of the function PCA() is a list, including the following components:

print(res.pca)

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 23 individuals, described by 10 variables
## *The results are available in the following objects:
##
##    name               description
## 1  "$eig"             "eigenvalues"
## 2  "$var"             "results for the variables"
## 3  "$var$coord"       "coord. for the variables"
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"
## 6  "$var$contrib"     "contributions of the variables"
## 7  "$ind"             "results for the individuals"
## 8  "$ind$coord"       "coord. for the individuals"
## 9  "$ind$cos2"        "cos2 for the individuals"
## 10 "$ind$contrib"     "contributions of the individuals"
## 11 "$call"            "summary statistics"
## 12 "$call$centre"     "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w"      "weights for the individuals"
## 15 "$call$col.w"      "weights for the variables"

The object that is created using the function PCA() contains a lot of information found in many different lists and matrices. These values are described in the next section.

4.4 Visualization and Interpretation

We'll use the factoextra R package to help in the interpretation of PCA. No matter what function you decide to use [stats::prcomp(), FactoMineR::PCA(), ade4::dudi.pca(), ExPosition::epPCA()], you can easily extract and visualize the results of PCA using R functions provided in the factoextra R package.

These functions include:

- get_eigenvalue(res.pca): Extract the eigenvalues/variances of principal components
- fviz_eig(res.pca): Visualize the eigenvalues
- get_pca_ind(res.pca), get_pca_var(res.pca): Extract the results for individuals and variables, respectively.
- fviz_pca_ind(res.pca), fviz_pca_var(res.pca): Visualize the results for individuals and variables, respectively.
- fviz_pca_biplot(res.pca): Make a biplot of individuals and variables.

In the next sections, we'll illustrate each of these functions.

4.4.1 Eigenvalues / Variances

As described in previous sections, the eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs. That is, the first PCs correspond to the directions with the maximum amount of variation in the data set.

We examine the eigenvalues to determine the number of principal components to be considered. The eigenvalues and the proportion of variances (i.e., information) retained by the principal components (PCs) can be extracted using the function get_eigenvalue() [factoextra package].

library("factoextra")
eig.val <- get_eigenvalue(res.pca)
eig.val

##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1       4.124            41.24                        41.2
## Dim.2       1.839            18.39                        59.6
## Dim.3       1.239            12.39                        72.0
## Dim.4       0.819             8.19                        80.2
## Dim.5       0.702             7.02                        87.2
## Dim.6       0.423             4.23                        91.5
## Dim.7       0.303             3.03                        94.5
## Dim.8       0.274             2.74                        97.2
## Dim.9       0.155             1.55                        98.8
## Dim.10      0.122             1.22                       100.0

The sum of all the eigenvalues gives a total variance of 10.

The proportion of variation explained by each eigenvalue is given in the second column. For example, 4.124 divided by 10 equals 0.4124, or, about 41.24% of the variation is explained by this first eigenvalue. The cumulative percentage explained is obtained by adding the successive proportions of variation explained to obtain the running total. For instance, 41.242% plus 18.385% equals 59.627%, and so forth. Therefore, about 59.627% of the variation is explained by the first two eigenvalues together.
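
The arithmetic above can be reproduced in a few lines of base R. This sketch starts from the rounded eigenvalues printed in the table (typed in by hand here, not extracted from res.pca), so the last digits can differ slightly from the unrounded percentages quoted in the text.

```r
# Eigenvalues copied from the printed table (rounded to 3 decimals)
eig <- c(4.124, 1.839, 1.239, 0.819, 0.702,
         0.423, 0.303, 0.274, 0.155, 0.122)

# Total variance: equals the number of standardized variables (10)
sum(eig)

# Percentage of variance explained by each component
variance.percent <- eig / sum(eig) * 100

# Running total of the percentages
cumulative.percent <- cumsum(variance.percent)

round(data.frame(eigenvalue = eig,
                 variance.percent,
                 cumulative.percent), 2)
```

The second and third columns of the result reproduce, up to rounding, the variance.percent and cumulative.variance.percent columns returned by get_eigenvalue().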

Eigenvalues can be used to determine the number of principal components to retain after PCA (Kaiser 1961):

- An eigenvalue > 1 indicates that the PC accounts for more variance than is accounted for by one of the original variables in standardized data. This is commonly used as a cutoff point for which PCs are retained. This holds true only when the data are standardized.

- You can also limit the number of components to the number that accounts for a certain fraction of the total variance. For example, if you are satisfied with 70% of the total variance explained, then use the number of components needed to achieve that.

Unfortunately, there is no well-accepted objective way to decide how many principal components are enough. This will depend on the specific field of application and the specific data set. In practice, we tend to look at the first few principal components in order to find interesting patterns in the data.

In our analysis, the first three principal components explain 72% of the variation. This is an acceptably large percentage.
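
Both retention rules can be written as one-liners. The sketch below again uses the rounded eigenvalues from the printed table, typed in by hand; the 70% threshold is the example value mentioned above, and the names kaiser_keep and n_for_70 are ours.

```r
# Eigenvalues copied from the printed table (rounded to 3 decimals)
eig <- c(4.124, 1.839, 1.239, 0.819, 0.702,
         0.423, 0.303, 0.274, 0.155, 0.122)
cum_pct <- cumsum(eig / sum(eig) * 100)

# Rule 1 (Kaiser): keep the components whose eigenvalue exceeds 1
kaiser_keep <- which(eig > 1)
kaiser_keep   # components 1, 2 and 3

# Rule 2: smallest number of components reaching 70% of total variance
n_for_70 <- which(cum_pct >= 70)[1]
n_for_70      # 3
```

For this data set the two rules happen to agree on three components; that is not guaranteed in general, which is why the choice remains a judgment call.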

An alternative method to determine the number of principal components is to look at a scree plot, which is the plot of eigenvalues ordered from largest to smallest. The number of components is determined at the point beyond which the remaining eigenvalues are all relatively small and of comparable size (Jolliffe 2002; Peres-Neto, Jackson, and Somers 2005).

The scree plot can be produced using the function fviz_eig() or fviz_screeplot() [factoextra package].

fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))

4.4.2 Graph of variables

4.4.2.1 Results

A simple method to extract the results, for variables, from a PCA output is to use the function get_pca_var() [factoextra package]. This function provides a list of matrices containing all the results for the active variables (coordinates, correlation between variables and axes, squared cosine and contributions).

var <- get_pca_var(res.pca)
var

## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the variables"
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"

The components returned by get_pca_var() can be used in the plot of variables as follows:

- var$coord: coordinates of variables to create a scatter plot.
- var$cos2: represents the quality of representation for variables on the factor map. It's calculated as the squared coordinates: var.cos2 = var.coord * var.coord.
- var$contrib: contains the contributions (in percentage) of the variables to the principal components. The contribution of a variable (var) to a given principal component is (in percentage): (var.cos2 * 100) / (total cos2 of the component).
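To make the two formulas concrete, here is a minimal base R sketch that rebuilds coordinates, cos2 and contributions from a prcomp() fit on the built-in iris measurements. It is an illustration under the assumption of a standardized PCA, not the internal code of factoextra, and the signs of the components may differ from those returned by PCA().

```r
X <- iris[, -5]                          # four numeric variables
pca <- prcomp(X, scale. = TRUE)          # standardized PCA (base R)

# Coordinates of variables: loadings scaled by the component standard deviations
coord <- sweep(pca$rotation, 2, pca$sdev, "*")

# For standardized data, these coordinates equal the variable/component correlations
all.equal(unname(coord), unname(cor(X, pca$x)))

# Quality of representation: squared coordinates
cos2 <- coord^2

# Contribution (%): cos2 * 100 / (total cos2 of the component)
contrib <- sweep(cos2, 2, colSums(cos2), "/") * 100

round(contrib, 2)
colSums(contrib)                         # every component sums to 100
```

Note that the total cos2 of a component is its eigenvalue (pca$sdev^2), which is why the contributions within a component add up to 100%.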

The different components can be accessed as follows:

# Coordinates
head(var$coord)
# Cos2: quality on the factor map
head(var$cos2)
# Contributions to the principal components
head(var$contrib)

Note that, from the scree plot produced by fviz_eig() above, we might want to stop at the fifth principal component: 87% of the information (variance) contained in the data is retained by the first five principal components.

Note also that it's possible to plot variables and to color them according to either i) their quality on the factor map (cos2) or ii) their contribution values to the principal components (contrib).

In this section, we describe how to visualize variables and draw conclusions about their correlations. Next, we highlight variables according to either i) their quality of representation on the factor map or ii) their contributions to the principal components.

4.4.2.2 Correlation circle

The correlation between a variable and a principal component (PC) is used as the coordinates of the variable on the PC. The representation of variables differs from the plot of the observations: the observations are represented by their projections, but the variables are represented by their correlations (Abdi and Williams 2010).

# Coordinates of variables
head(var$coord, 4)

##            Dim.1   Dim.2  Dim.3   Dim.4  Dim.5
## X100m     -0.851 -0.1794  0.302  0.0336 -0.194
## Long.jump  0.794  0.2809 -0.191 -0.1154  0.233
## Shot.put   0.734  0.0854  0.518  0.1285 -0.249
## High.jump  0.610 -0.4652  0.330  0.1446  0.403

To plot variables, type this:

fviz_pca_var(res.pca, col.var = "black")

The plot above is also known as a variable correlation plot. It shows the relationships between all variables. It can be interpreted as follows:

- Positively correlated variables are grouped together.
- Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants).
- The distance between variables and the origin measures the quality of the variables on the factor map. Variables that are away from the origin are well represented on the factor map.

4.4.2.3 Quality of representation

The quality of representation of the variables on the factor map is called cos2 (square cosine, squared coordinates). You can access the cos2 as follows:

head(var$cos2, 4)

##           Dim.1   Dim.2  Dim.3   Dim.4  Dim.5
## X100m     0.724 0.03218 0.0909 0.00113 0.0378
## Long.jump 0.631 0.07888 0.0363 0.01331 0.0544
## Shot.put  0.539 0.00729 0.2679 0.01650 0.0619
## High.jump 0.372 0.21642 0.1090 0.02089 0.1622

You can visualize the cos2 of variables on all the dimensions using the corrplot package:

library("corrplot")
corrplot(var$cos2, is.corr = FALSE)

It's also possible to create a bar plot of variables cos2 using the function fviz_cos2() [in factoextra]:

# Total cos2 of variables on Dim.1 and Dim.2
fviz_cos2(res.pca, choice = "var", axes = 1:2)

Note that,

- A high cos2 indicates a good representation of the variable on the principal component. In this case the variable is positioned close to the circumference of the correlation circle.
- A low cos2 indicates that the variable is not perfectly represented by the PCs. In this case the variable is close to the center of the circle.

For a given variable, the sum of the cos2 on all the principal components is equal to one.

If a variable is perfectly represented by only two principal components (Dim.1 & Dim.2), the sum of the cos2 on these two PCs is equal to one. In this case the variable will be positioned on the circle of correlations.

For some of the variables, more than 2 components might be required to perfectly represent the data. In this case the variables are positioned inside the circle of correlations.
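The properties listed above can be verified on a small example. The following base R sketch uses prcomp() on the iris measurements (an assumption made for self-containment; the chapter itself uses PCA() on decathlon2) and checks that the cos2 of each variable sums to one over all components, and that the cos2 on Dim.1 and Dim.2 is the squared distance of the variable from the origin of the correlation circle.

```r
pca <- prcomp(iris[, -5], scale. = TRUE)
coord <- sweep(pca$rotation, 2, pca$sdev, "*")   # variable coordinates
cos2 <- coord^2

# Summed over all components, the cos2 of each variable equals 1
rowSums(cos2)

# Quality on the first factor map = squared arrow length on Dim.1-Dim.2
quality.12 <- rowSums(cos2[, 1:2])
dist.origin <- sqrt(coord[, 1]^2 + coord[, 2]^2)

# Arrow lengths never exceed 1, the radius of the correlation circle
round(cbind(quality.12, dist.origin), 3)
```

A variable whose dist.origin is close to 1 sits near the circle and is well summarized by the first two components; a short arrow signals that other components are needed.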

In summary:

- The cos2 values are used to estimate the quality of the representation.
- The closer a variable is to the circle of correlations, the better its representation on the factor map (and the more important it is to interpret these components).
- Variables that are close to the center of the plot are less important for the first components.

It's possible to color variables by their cos2 values using the argument col.var = "cos2". This produces gradient colors. In this case, the argument gradient.cols can be used to provide custom colors. For instance, gradient.cols = c("white", "blue", "red") means that:

- variables with low cos2 values will be colored in "white"
- variables with mid cos2 values will be colored in "blue"
- variables with high cos2 values will be colored in "red"

# Color by cos2 values: quality on the factor map
fviz_pca_var(res.pca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
             )

Note that, it's also possible to change the transparency of the variables according to their cos2 values using the option alpha.var = "cos2". For example, type this:

# Change the transparency by cos2 values
fviz_pca_var(res.pca, alpha.var = "cos2")

4.4.2.4 Contributions of variables to PCs

The contributions of variables in accounting for the variability in a given principal component are expressed in percentage.

- Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set.
- Variables that are not correlated with any PC, or that are correlated only with the last dimensions, are variables with low contribution and might be removed to simplify the overall analysis.

The contribution of variables can be extracted as follows:

head(var$contrib, 4)

##           Dim.1  Dim.2 Dim.3 Dim.4 Dim.5
## X100m     17.54  1.751  7.34 0.138  5.39
## Long.jump 15.29  4.290  2.93 1.625  7.75
## Shot.put  13.06  0.397 21.62 2.014  8.82
## High.jump  9.02 11.772  8.79 2.550 23.12

It's possible to use the function corrplot() [corrplot package] to highlight the most contributing variables for each dimension:

library("corrplot")
corrplot(var$contrib, is.corr = FALSE)

The function fviz_contrib() [factoextra package] can be used to draw a bar plot of variable contributions. If your data contains many variables, you can decide to show only the top contributing variables. The R code below shows the top 10 variables contributing to the principal components:

# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)

The larger the value of the contribution, the more the variable contributes to the component.

The total contribution to PC1 and PC2 is obtained with the following R code:

fviz_contrib(res.pca, choice = "var", axes = 1:2, top = 10)

The red dashed line on the graph above indicates the expected average contribution. If the contributions of the variables were uniform, the expected value would be 1/length(variables) = 1/10 = 10%. For a given component, a variable with a contribution larger than this cutoff could be considered as important in contributing to the component.

Note that, the total contribution of a given variable, on explaining the variations retained by two principal components, say PC1 and PC2, is calculated as contrib = [(C1 * Eig1) + (C2 * Eig2)] / (Eig1 + Eig2), where

- C1 and C2 are the contributions of the variable on PC1 and PC2, respectively
- Eig1 and Eig2 are the eigenvalues of PC1 and PC2, respectively. Recall that eigenvalues measure the amount of variation retained by each PC.

In this case, the expected average contribution (cutoff) is calculated as follows: as mentioned above, if the contributions of the 10 variables were uniform, the expected average contribution on a given PC would be 1/10 = 10%. The expected average contribution of a variable for PC1 and PC2 is: [(10 * Eig1) + (10 * Eig2)] / (Eig1 + Eig2), which again equals 10%.
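As a worked example of the formula, the sketch below plugs in the values printed earlier for X100m (contributions 17.54 and 1.751 on Dim.1 and Dim.2) together with the first two eigenvalues (4.124 and 1.839). The variable names are illustrative only.

```r
# Values copied by hand from the outputs shown earlier in this chapter
eig1 <- 4.124   # eigenvalue of PC1
eig2 <- 1.839   # eigenvalue of PC2
c1 <- 17.54     # contribution (%) of X100m to PC1
c2 <- 1.751     # contribution (%) of X100m to PC2

# Total contribution of X100m to the plane PC1-PC2 (eigenvalue-weighted mean)
total.contrib <- (c1 * eig1 + c2 * eig2) / (eig1 + eig2)

# Expected average contribution if all 10 variables contributed uniformly
cutoff <- (10 * eig1 + 10 * eig2) / (eig1 + eig2)

round(c(total.contrib = total.contrib, cutoff = cutoff), 2)
```

The result, about 12.67%, is above the 10% cutoff, which is consistent with X100m being flagged among the most contributing variables on dimensions 1 and 2.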

The most important (or, contributing) variables can be highlighted on the correlation plot as follows:

fviz_pca_var(res.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
             )

It can be seen that the variables X100m, Long.jump and Pole.vault contribute the most to the dimensions 1 and 2.

Note that, it's also possible to change the transparency of variables according to their contrib values using the option alpha.var = "contrib". For example, type this:

# Change the transparency by contrib values
fviz_pca_var(res.pca, alpha.var = "contrib")

4.4.2.5 Color by a custom continuous variable

In the previous sections, we showed how to color variables by their contributions and their cos2. Note that, it's possible to color variables by any custom continuous variable. The coloring variable should have the same length as the number of active variables in the PCA (here n = 10).

For example, type this:

# Create a random continuous variable of length 10
set.seed(123)
my.cont.var <- rnorm(10)
# Color variables by the continuous variable
fviz_pca_var(res.pca, col.var = my.cont.var,
             gradient.cols = c("blue", "yellow", "red"),
             legend.title = "Cont.Var")

4.4.2.6 Color by groups

It's also possible to change the color of variables by groups defined by a qualitative/categorical variable, also called factor in R terminology.

As we don't have any grouping variable in our data set for classifying variables, we'll create one.

In the following demo example, we start by classifying the variables into 3 groups using the kmeans clustering algorithm. Next, we use the clusters returned by the kmeans algorithm to color variables.

# Create a grouping variable using kmeans
# Create 3 groups of variables (centers = 3)
set.seed(123)
res.km <- kmeans(var$coord, centers = 3, nstart = 25)
grp <- as.factor(res.km$cluster)
# Color variables by groups
fviz_pca_var(res.pca, col.var = grp,
             palette = c("#0073C2FF", "#EFC000FF", "#868686FF"),
             legend.title = "Cluster")

Note that, if you are interested in learning clustering, we previously published a book named "Practical Guide To Cluster Analysis in R" (https://goo.gl/DmJ5y5).

4.4.3 Dimension description

In the section 4.4.2.4, we described how to highlight variables according to their contributions to the principal components.

Note also that, the function dimdesc() [in FactoMineR], for dimension description, can be used to identify the most significantly associated variables with a given principal component. It can be used as follows:

res.desc <- dimdesc(res.pca, axes = c(1, 2), proba = 0.05)
# Description of dimension 1
res.desc$Dim.1

## $quanti
##              correlation  p.value
## Long.jump          0.794 6.06e-06
## Discus             0.743 4.84e-05
## Shot.put           0.734 6.72e-05
## High.jump          0.610 1.99e-03
## Javeline           0.428 4.15e-02
## X400m             -0.702 1.91e-04
## X110m.hurdle      -0.764 2.20e-05
## X100m             -0.851 2.73e-07

Note that, in the fviz_pca_var() plots of sections 4.4.2.5 and 4.4.2.6, the argument palette should be used to change the color of groups, and the argument gradient.cols should be used to change gradient colors.

# Description of dimension 2
res.desc$Dim.2

## $quanti
##            correlation  p.value
## Pole.vault       0.807 3.21e-06
## X1500m           0.784 9.38e-06
## High.jump       -0.465 2.53e-02

In the output above, $quanti means results for quantitative variables. Note that, in the output shown, variables are listed by decreasing correlation coefficient, each with the p-value of the correlation.
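A table of this kind can be approximated in base R. The sketch below is not the dimdesc() implementation; it assumes that each row is a Pearson correlation between a variable and the component scores together with the p-value of cor.test(), keeps rows below a 0.05 threshold, and sorts them by decreasing correlation. It uses prcomp() on iris so that it runs on its own.

```r
X <- iris[, -5]
pca <- prcomp(X, scale. = TRUE)
dim1 <- pca$x[, 1]                       # scores of individuals on component 1

# Correlation of each variable with the component, and its test p-value
desc <- t(sapply(X, function(v) {
  ct <- cor.test(v, dim1)
  c(correlation = unname(ct$estimate), p.value = ct$p.value)
}))

# Keep significant variables (threshold playing the role of proba = 0.05)
desc <- desc[desc[, "p.value"] < 0.05, , drop = FALSE]

# Sort by decreasing correlation, as in the printed output
desc <- desc[order(desc[, "correlation"], decreasing = TRUE), , drop = FALSE]
desc
```

The sign of a component is arbitrary, so positive and negative correlations may be swapped relative to another PCA implementation.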

4.4.4 Graph of individuals

4.4.4.1 Results

The results for individuals can be extracted using the function get_pca_ind() [factoextra package]. Similarly to get_pca_var(), the function get_pca_ind() provides a list of matrices containing all the results for the individuals (coordinates, squared cosine and contributions).

ind <- get_pca_ind(res.pca)
ind

## Principal Component Analysis Results for individuals
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the individuals"
## 2 "$cos2"    "Cos2 for the individuals"
## 3 "$contrib" "contributions of the individuals"

To get access to the different components, use this:

# Coordinates of individuals
head(ind$coord)
# Quality of individuals
head(ind$cos2)
# Contributions of individuals
head(ind$contrib)

4.4.4.2 Plots: quality and contribution

The function fviz_pca_ind() is used to produce the graph of individuals. To create a simple plot, type this:

fviz_pca_ind(res.pca)

Like variables, it's also possible to color individuals by their cos2 values:

fviz_pca_ind(res.pca, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping (slow if many points)
             )

You can also change the point size according to the cos2 of the corresponding individuals:

fviz_pca_ind(res.pca, pointsize = "cos2",
             pointshape = 21, fill = "#E7B800",
             repel = TRUE # Avoid text overlapping (slow if many points)
             )

Note that, individuals that are similar are grouped together on the plot.

To change both point size and color by cos2, try this:

fviz_pca_ind(res.pca, col.ind = "cos2", pointsize = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping (slow if many points)
             )

To create a bar plot of the quality of representation (cos2) of individuals on the factor map, you can use the function fviz_cos2() as previously described for variables:

fviz_cos2(res.pca, choice = "ind")

To visualize the contribution of individuals to the first two principal components, type this:

# Total contribution on PC1 and PC2
fviz_contrib(res.pca, choice = "ind", axes = 1:2)

4.4.4.3 Color by a custom continuous variable

As for variables, individuals can be colored by any custom continuous variable by specifying the argument col.ind.

For example, type this:

# Create a random continuous variable of length 23,
# same length as the number of active individuals in the PCA
set.seed(123)
my.cont.var <- rnorm(23)
# Color individuals by the continuous variable
fviz_pca_ind(res.pca, col.ind = my.cont.var,
             gradient.cols = c("blue", "yellow", "red"),
             legend.title = "Cont.Var")

4.4.4.4 Color by groups

Here, we describe how to color individuals by group. Additionally, we show how to add concentration ellipses and confidence ellipses by groups. For this, we'll use the iris data as a demo data set.

The iris data set looks like this:

head(iris, 3)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa

The column "Species" will be used as grouping variable. We start by computing principal component analysis as follows:

# The variable Species (index = 5) is removed
# before PCA analysis
iris.pca <- PCA(iris[, -5], graph = FALSE)

In the R code below, the argument habillage or col.ind can be used to specify the factor variable for coloring the individuals by groups.

To add a concentration ellipse around each group, specify the argument addEllipses = TRUE. The argument palette can be used to change group colors.

fviz_pca_ind(iris.pca,
             geom.ind = "point", # show points only (but not "text")
             col.ind = iris$Species, # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
             legend.title = "Groups"
             )

# Add confidence ellipses
fviz_pca_ind(iris.pca, geom.ind = "point", col.ind = iris$Species,
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, ellipse.type = "confidence",
             legend.title = "Groups"
             )

To remove the group mean point, specify the argument mean.point = FALSE.

If you want confidence ellipses instead of concentration ellipses, use ellipse.type = "confidence".

Note that, allowed values for palette include:

- "grey" for grey color palettes;
- brewer palettes e.g. "RdBu", "Blues", ...; to view all, type this in R: RColorBrewer::display.brewer.all();
- custom color palette e.g. c("blue", "red");
- and scientific journal palettes from the ggsci R package, e.g.: "npg", "aaas", "lancet", "jco", "ucscgb", "uchicago", "simpsons" and "rickandmorty".

For example, to use the jco (Journal of Clinical Oncology) color palette, type this:

fviz_pca_ind(iris.pca,
             label = "none", # hide individual labels
             habillage = iris$Species, # color by groups
             addEllipses = TRUE, # Concentration ellipses
             palette = "jco"
             )

4.4.5 Graph customization

Note that, fviz_pca_ind(), fviz_pca_var() and related functions are wrappers around the core function fviz() [in factoextra]. fviz() is a wrapper around the function ggscatter() [in ggpubr]. Therefore, further arguments, to be passed to the functions fviz() and ggscatter(), can be specified in fviz_pca_ind() and fviz_pca_var().

Here, we present some of these additional arguments to customize the PCA graph of variables and individuals.

4.4.5.1 Dimensions

By default, variables/individuals are represented on dimensions 1 and 2. If you want to visualize them on dimensions 2 and 3, for example, you should specify the argument axes = c(2, 3).

# Variables on dimensions 2 and 3
fviz_pca_var(res.pca, axes = c(2, 3))
# Individuals on dimensions 2 and 3
fviz_pca_ind(res.pca, axes = c(2, 3))

4.4.5.2 Plot elements: point, text, arrow

The argument geom (for geometry) and derivatives are used to specify the geometry elements or graphical elements to be used for plotting.

1. geom.var: a text specifying the geometry to be used for plotting variables. Allowed values are the combination of c("point", "arrow", "text").

- Use geom.var = "point", to show only points;
- Use geom.var = "text" to show only text labels;
- Use geom.var = c("point", "text") to show both points and text labels;
- Use geom.var = c("arrow", "text") to show arrows and labels (default).

For example, type this:

# Show variable points and text labels
fviz_pca_var(res.pca, geom.var = c("point", "text"))

2. geom.ind: a text specifying the geometry to be used for plotting individuals. Allowed values are the combination of c("point", "text").

- Use geom.ind = "point", to show only points;
- Use geom.ind = "text" to show only text labels;
- Use geom.ind = c("point", "text") to show both point and text labels (default).

For example, type this:

# Show individuals text labels only
fviz_pca_ind(res.pca, geom.ind = "text")

4.4.5.3 Size and shape of plot elements

1. labelsize: font size for the text labels, e.g.: labelsize = 4.
2. pointsize: the size of points, e.g.: pointsize = 1.5.
3. arrowsize: the size of arrows. Controls the thickness of arrows, e.g.: arrowsize = 0.5.
4. pointshape: the shape of points, e.g.: pointshape = 21. Type ggpubr::show_point_shapes() to see available point shapes.

# Change the size of arrows and labels
fviz_pca_var(res.pca, arrowsize = 1, labelsize = 5,
             repel = TRUE)
# Change points size, shape and fill color
# Change label size
fviz_pca_ind(res.pca,
             pointsize = 3, pointshape = 21, fill = "lightblue",
             labelsize = 5, repel = TRUE)

4.4.5.4 Ellipses

As we described in the previous section 4.4.4.4, when coloring individuals by groups, you can add point concentration ellipses using the argument addEllipses = TRUE.

Note that, the argument ellipse.type can be used to change the type of ellipses. Possible values are:

- "convex": plot the convex hull of a set of points.
- "confidence": plot confidence ellipses around group mean points, as the function coord.ellipse() [in FactoMineR] does.
- "t": assumes a multivariate t-distribution.
- "norm": assumes a multivariate normal distribution.
- "euclid": draws a circle with the radius equal to level, representing the euclidean distance from the center. This ellipse probably won't appear circular unless coord_fixed() is applied.

The argument ellipse.level is also available to change the size of the concentration ellipse in normal probability. For example, specify ellipse.level = 0.95 or ellipse.level = 0.66.

# Add confidence ellipses
fviz_pca_ind(iris.pca, geom.ind = "point",
             col.ind = iris$Species, # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, ellipse.type = "confidence",
             legend.title = "Groups"
             )

# Convex hull
fviz_pca_ind(iris.pca, geom.ind = "point",
             col.ind = iris$Species, # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, ellipse.type = "convex",
             legend.title = "Groups"
             )

4.4.5.5 Group mean points

When coloring individuals by groups (section 4.4.4.4), the mean points of groups (barycenters) are also displayed by default.

To remove the mean points, use the argument mean.point = FALSE.

fviz_pca_ind(iris.pca,
             geom.ind = "point", # show points only (but not "text")
             col.ind = iris$Species, # color by groups (as in section 4.4.4.4)
             legend.title = "Groups",
             mean.point = FALSE)

4.4.5.6 Axis lines

The argument axes.linetype can be used to specify the line type of axes. Default is "dashed". Allowed values include "blank", "solid", "dotted", etc. To see all possible values type ggpubr::show_line_types() in R.

To remove axis lines, use axes.linetype = "blank":

fviz_pca_var(res.pca, axes.linetype = "blank")

4.4.5.7 Graphical parameters

To easily change the graphical parameters of any ggplot, you can use the function ggpar() [ggpubr package].

The graphical parameters that can be changed using ggpar() include:

- Main titles, axis labels and legend titles.
- Legend position. Possible values: "top", "bottom", "left", "right", "none".
- Color palette.
- Themes. Allowed values include: theme_gray(), theme_bw(), theme_minimal(), theme_classic(), theme_void().

ind.p <- fviz_pca_ind(iris.pca, geom = "point", col.ind = iris$Species)
ggpubr::ggpar(ind.p,
              title = "Principal Component Analysis",
              subtitle = "Iris data set",
              caption = "Source: factoextra",
              xlab = "PC1", ylab = "PC2",
              legend.title = "Species", legend.position = "top",
              ggtheme = theme_gray(), palette = "jco"
              )

4.4.6 Biplot

To make a simple biplot of individuals and variables, type this:

fviz_pca_biplot(res.pca, repel = TRUE,
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969"  # Individuals color
                )

Now, using the iris.pca output, let's:

- make a biplot of individuals and variables
- change the color of individuals by groups: col.ind = iris$Species
- show only the labels for variables: label = "var" or use geom.ind = "point"

fviz_pca_biplot(iris.pca,
                col.ind = iris$Species, palette = "jco",
                addEllipses = TRUE, label = "var",
                col.var = "black", repel = TRUE,
                legend.title = "Species")

Note that, the biplot might only be useful when there is a low number of variables and individuals in the data set; otherwise the final plot would be unreadable.

Note also that, the coordinates of individuals and variables are not constructed on the same space. Therefore, on a biplot, you should mainly focus on the direction of variables but not on their absolute positions on the plot.

Roughly speaking, a biplot can be interpreted as follows:

- an individual that is on the same side of a given variable has a high value for this variable;
- an individual that is on the opposite side of a given variable has a low value for this variable.

In the following example, we want to color both individuals and variables by groups. The trick is to use pointshape = 21 for individual points. This particular point shape can be filled by a color using the argument fill.ind. The border line color of individual points is set to "black" using col.ind. To color variables by groups, the argument col.var will be used.

To customize individuals and variable colors, we use the helper functions fill_palette() and color_palette() [in ggpubr package].

fviz_pca_biplot(iris.pca,
                # Fill individuals by groups
                geom.ind = "point",
                pointshape = 21,
                pointsize = 2.5,
                fill.ind = iris$Species,
                col.ind = "black",
                # Color variable by groups
                col.var = factor(c("sepal", "sepal", "petal", "petal")),
                legend.title = list(fill = "Species", color = "Clusters"),
                repel = TRUE # Avoid label overplotting
                ) +
  ggpubr::fill_palette("jco") + # Individual fill color
  ggpubr::color_palette("npg")  # Variable colors

Another complex example is to color individuals by groups (discrete color) and variables by their contributions to the principal components (gradient colors). Additionally, we'll change the transparency of variables by their contributions using the argument alpha.var.

fviz_pca_biplot(iris.pca,
                # Individuals
                geom.ind = "point",
                fill.ind = iris$Species, col.ind = "black",
                pointshape = 21, pointsize = 2,
                palette = "jco",
                addEllipses = TRUE,
                # Variables
                alpha.var = "contrib", col.var = "contrib",
                gradient.cols = "RdYlBu",
                legend.title = list(fill = "Species", color = "Contrib",
                                    alpha = "Contrib")
                )

4.5 Supplementary elements

4.5.1 Definition and types

As described above (section 4.3.2), the decathlon2 data set contains supplementary continuous variables (quanti.sup, columns 11:12), supplementary qualitative variables (quali.sup, column 13) and supplementary individuals (ind.sup, rows 24:27).

Supplementary variables and individuals are not used for the determination of the principal components. Their coordinates are predicted using only the information provided by the performed principal component analysis on active variables/individuals.
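The idea of predicting coordinates for supplementary elements can be sketched in base R. The example below is an illustration only: it uses prcomp() and an arbitrary, hypothetical split of the iris data (rows 141:150 as supplementary individuals, Petal.Width as a supplementary quantitative variable), not the decathlon2 setup, and prcomp() scales with the sample standard deviation, so values would differ slightly from those of PCA().

```r
active  <- iris[1:140, 1:3]     # active individuals and active variables
sup.ind <- iris[141:150, 1:3]   # supplementary individuals (hypothetical split)
sup.var <- iris[1:140, 4]       # Petal.Width, treated as supplementary variable

# The components are computed from the active elements only
pca <- prcomp(active, scale. = TRUE)

# Supplementary individuals: centered and scaled with the active means and
# standard deviations, then projected onto the existing components
ind.sup.coord <- predict(pca, newdata = sup.ind)

# Supplementary quantitative variable: its coordinates are its correlations
# with the component scores of the active individuals
quanti.sup.coord <- cor(sup.var, pca$x)

head(ind.sup.coord, 3)
round(quanti.sup.coord, 3)
```

In both cases the principal components themselves are left unchanged; the supplementary elements are only positioned relative to them.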

4.5.2 Specification in PCA

To specify supplementary individuals and variables, the function PCA() can be used as follows:

PCA(X, ind.sup = NULL,
    quanti.sup = NULL, quali.sup = NULL, graph = TRUE)

- X: a data frame. Rows are individuals and columns are numeric variables.
- ind.sup: a numeric vector specifying the indexes of the supplementary individuals.
- quanti.sup, quali.sup: a numeric vector specifying, respectively, the indexes of the quantitative and qualitative variables.
- graph: a logical value. If TRUE a graph is displayed.

For example, type this:

res.pca <- PCA(decathlon2, ind.sup = 24:27,
               quanti.sup = 11:12, quali.sup = 13, graph = FALSE)

4.5.3 Quantitative variables

Predicted results (coordinates, correlation and cos2) for the supplementary quantitative variables:

res.pca$quanti.sup

## $coord
##         Dim.1   Dim.2  Dim.3   Dim.4   Dim.5
## Rank   -0.701 -0.2452 -0.183  0.0558 -0.0738
## Points  0.964  0.0777  0.158 -0.1662 -0.0311
##
## $cor
##         Dim.1   Dim.2  Dim.3   Dim.4   Dim.5
## Rank   -0.701 -0.2452 -0.183  0.0558 -0.0738
## Points  0.964  0.0777  0.158 -0.1662 -0.0311
##
## $cos2
##        Dim.1   Dim.2  Dim.3   Dim.4   Dim.5
## Rank   0.492 0.06012 0.0336 0.00311 0.00545
## Points 0.929 0.00603 0.0250 0.02763 0.00097

Visualize all variables (active and supplementary ones):

fviz_pca_var(res.pca)

Further arguments to customize the plot:

# Change color of variables
fviz_pca_var(res.pca,
             col.var = "black",     # Active variables
             col.quanti.sup = "red" # Suppl. quantitative variables
             )
# Hide active variables on the plot,
# show only supplementary variables
fviz_pca_var(res.pca, invisible = "var")
# Hide supplementary variables
fviz_pca_var(res.pca, invisible = "quanti.sup")

Note that, by default, supplementary quantitative variables are shown in blue color and dashed lines.

Using fviz_pca_var(), the quantitative supplementary variables are displayed automatically on the correlation circle plot. Note that, you can add the quanti.sup variables manually, using the fviz_add() function, for further customization. An example is shown below.

# Plot of active variables
p <- fviz_pca_var(res.pca, invisible = "quanti.sup")
# Add supplementary quantitative variables
fviz_add(p, res.pca$quanti.sup$coord,
         geom = c("arrow", "text"),
         color = "red")

4.5.4 Individuals

Predicted results for the supplementary individuals (ind.sup):

res.pca$ind.sup

Visualize all individuals (active and supplementary ones). On the graph, you can also add the supplementary qualitative variables (quali.sup), whose coordinates are accessible using res.pca$quali.sup$coord.

p <- fviz_pca_ind(res.pca, col.ind.sup = "blue", repel = TRUE)
p <- fviz_add(p, res.pca$quali.sup$coord, color = "red")
p

Supplementary individuals are shown in blue. The levels of the supplementary qualitative variable are shown in red color.

4.5.5 Qualitative variables

In the previous section, we showed that you can add the supplementary qualitative variables on the individuals plot using fviz_add().

Note that, the supplementary qualitative variables can also be used for coloring individuals by groups. This can help to interpret the data. The data set decathlon2 contains a supplementary qualitative variable at column 13 corresponding to the type of competition.

The results concerning the supplementary qualitative variable are:

res.pca$quali.sup

To color individuals by a supplementary qualitative variable, the argument habillage is used to specify the index of the supplementary qualitative variable. Historically, this argument name comes from the FactoMineR package. It's a French word meaning "dressing" in English. To keep consistency between FactoMineR and factoextra, we decided to keep the same argument name.

fviz_pca_ind(res.pca, habillage = 13,
             addEllipses = TRUE, ellipse.type = "confidence",
             palette = "jco", repel = TRUE)

Recall that, to remove the mean points of groups, specify the argument mean.point = FALSE.

4.6 Filtering results

If you have many individuals/variables, it's possible to visualize only some of them using the arguments select.ind and select.var.

select.ind, select.var: a selection of individuals/variables to be plotted. Allowed values are NULL or a list containing the arguments name, cos2 or contrib:

- name: is a character vector containing individuals/variable names to be plotted
- cos2: if cos2 is in [0, 1], ex: 0.6, then individuals/variables with a cos2 > 0.6 are plotted
- if cos2 > 1, ex: 5, then the top 5 active individuals/variables and top 5 supplementary columns/rows with the highest cos2 are plotted
- contrib: if contrib > 1, ex: 5, then the top 5 individuals/variables with the highest contributions are plotted

# Visualize variables with cos2 >= 0.6
fviz_pca_var(res.pca, select.var = list(cos2 = 0.6))
# Top 5 active variables with the highest cos2
fviz_pca_var(res.pca, select.var = list(cos2 = 5))
# Select by names
name <- list(name = c("Long.jump", "High.jump", "X100m"))
fviz_pca_var(res.pca, select.var = name)
# Top 5 contributing individuals and variables
fviz_pca_biplot(res.pca, select.ind = list(contrib = 5),
                select.var = list(contrib = 5),
                ggtheme = theme_minimal())

When the selection is done according to the contribution values, supplementary individuals/variables are not shown because they don't contribute to the construction of the axes.

4.7 Exporting results

4.7.1 Export plots to PDF/PNG files

The factoextra package produces ggplot2-based graphs. To save any ggplot, the standard R code is as follows:

# Print the plot to a pdf file
pdf("myplot.pdf")
print(myplot)
dev.off()

In the following examples, we'll show you how to save the different graphs into pdf or png files.

The first step is to create the plots you want as R objects:

# Scree plot
scree.plot <- fviz_eig(res.pca)
# Plot of individuals
ind.plot <- fviz_pca_ind(res.pca)
# Plot of variables
var.plot <- fviz_pca_var(res.pca)

Next, the plots can be exported into a single pdf file as follows:

pdf("PCA.pdf") # Create a new pdf device
print(scree.plot)
print(ind.plot)
print(var.plot)
dev.off() # Close the pdf device

To print each plot to a specific png file, the R code looks like this:

# Print scree plot to a png file
png("pca-scree-plot.png")
print(scree.plot)
dev.off()

# Print variables plot to a png file
png("pca-variables.png")
print(var.plot)
dev.off()
# Print individuals plot to a png file
png("pca-individuals.png")
print(ind.plot)
dev.off()

Note that, using the above R code will create the PDF and PNG files in your current working directory. To see the path of your current working directory, type getwd() in the R console.
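The same device-based approach also lets you control the page size. The following sketch uses only base R: it writes a placeholder base plot to a PDF in the session's temporary directory, with an explicit width and height in inches. The file name and the placeholder plot are illustrative assumptions; with a factoextra plot you would call print() on the ggplot object instead of plot().

```r
# Write to a temporary location so the working directory is left untouched
out.file <- file.path(tempdir(), "example-plot.pdf")

pdf(out.file, width = 7, height = 5)     # page size in inches
plot(1:10, main = "Placeholder plot")    # stand-in for print(myplot)
invisible(dev.off())                     # close the device to finish the file

file.exists(out.file)
```

Forgetting dev.off() is a common source of empty or unreadable files, because the output is only finalized when the device is closed.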

Another alternative, to export ggplots, is to use the function ggexport() [in ggpubr package]. We like ggexport(), because it's very simple. With one line of R code, it allows us to export individual plots to a file (pdf, eps or png) (one plot per page). It can also arrange the plots (2 plots per page, for example) before exporting them. The examples below demonstrate how to export ggplots using ggexport().

Export individual plots to a pdf file (one plot per page):

library(ggpubr)
ggexport(plotlist = list(scree.plot, ind.plot, var.plot),
         filename = "PCA.pdf")

Arrange and export. Specify nrow and ncol to display multiple plots on the same page:

ggexport(plotlist = list(scree.plot, ind.plot, var.plot),
         nrow = 2, ncol = 2,
         filename = "PCA.pdf")

Export plots to png files. If you specify a list of plots, then multiple png files will be automatically created to hold each plot.

ggexport(plotlist = list(scree.plot, ind.plot, var.plot),
         filename = "PCA.png")

4.7.2 Export results to txt/csv files

All the outputs of the PCA (individuals/variables coordinates, contributions, etc.) can be exported at once into a TXT/CSV file, using the function write.infile() [in the FactoMineR package]:

# Export into a TXT file
write.infile(res.pca, "pca.txt", sep = "\t")

# Export into a CSV file
write.infile(res.pca, "pca.csv", sep = ";")


4.8 Summary

In conclusion, we described how to perform and interpret principal component analysis (PCA). We computed PCA using the PCA() function [FactoMineR]. Next, we used the factoextra R package to produce ggplot2-based visualization of the PCA results.

There are other functions [packages] to compute PCA in R:

1. Using prcomp() [stats]

res.pca <- prcomp(iris[, -5], scale. = TRUE)

Read more: http://www.sthda.com/english/wiki/pca-using-prcomp-and-princomp

2. Using princomp() [stats]

res.pca <- princomp(iris[, -5], cor = TRUE)

Read more: http://www.sthda.com/english/wiki/pca-using-prcomp-and-princomp

3. Using dudi.pca() [ade4]

library("ade4")
res.pca <- dudi.pca(iris[, -5], scannf = FALSE, nf = 5)

Read more: http://www.sthda.com/english/wiki/pca-using-ade4-and-factoextra

4. Using epPCA() [ExPosition]

library("ExPosition")
res.pca <- epPCA(iris[, -5], graph = FALSE)

No matter what functions you decide to use in the list above, the factoextra package can handle the output for creating beautiful plots similar to what we described in the previous sections for FactoMineR:

fviz_eig(res.pca) # Scree plot
fviz_pca_ind(res.pca) # Graph of individuals
fviz_pca_var(res.pca) # Graph of variables
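As a rough illustration of what a scree plot summarizes, the eigenvalues can be recovered by hand from a prcomp() result. The sketch below uses base R only; the names eig and eig.table are illustrative and are not part of factoextra.

```r
# PCA on the four numeric iris variables, scaled to unit variance
res.pca <- prcomp(iris[, -5], scale. = TRUE)

# The eigenvalues are the squared standard deviations of the components
eig <- res.pca$sdev^2

# Percentage of variance and its running total (illustrative table)
eig.table <- data.frame(
  eigenvalue = eig,
  variance.percent = 100 * eig / sum(eig),
  cumulative.variance.percent = cumsum(100 * eig / sum(eig))
)
print(round(eig.table, 2))
```

With scaled data the eigenvalues sum to the number of variables (4 here), which is a quick sanity check on the result.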


4.9 Further reading

For the mathematical background behind PCA, refer to the following video courses, articles and books:

- Principal component analysis (article) (Abdi and Williams 2010). https://goo.gl/1Vtwq1.
- Principal Component Analysis Course Using FactoMineR (Video courses). https://goo.gl/VZJsnM
- Exploratory Multivariate Analysis by Example Using R (book) (F. Husson, Le, and Pagès 2017).
- Principal Component Analysis (book) (Jollife 2002).

See also:

- PCA using prcomp() and princomp() (tutorial). http://www.sthda.com/english/wiki/pca-using-prcomp-and-princomp
- PCA using ade4 and factoextra (tutorial). http://www.sthda.com/english/wiki/pca-using-ade4-and-factoextra


5 Correspondence Analysis

5.1 Introduction

Correspondence analysis (CA) is an extension of principal component analysis (Chapter 4) suited to exploring relationships among qualitative variables (or categorical data). Like principal component analysis, it provides a solution for summarizing and visualizing a data set in two-dimensional plots.

Here, we describe simple correspondence analysis, which is used to analyze frequencies formed by two categorical variables, a data table known as a contingency table. It provides factor scores (coordinates) for both the row and column points of the contingency table. These coordinates are used to visualize graphically the association between row and column elements in the contingency table.

When analyzing a two-way contingency table, a typical question is whether certain row elements are associated with some column elements. Correspondence analysis is a geometric approach for visualizing the rows and columns of a two-way contingency table as points in a low-dimensional space, such that the positions of the row and column points are consistent with their associations in the table. The aim is to have a global view of the data that is useful for interpretation.

In the current chapter, we'll show how to compute and interpret correspondence analysis using two R packages: i) FactoMineR for the analysis and ii) factoextra for data visualization. Additionally, we'll show how to reveal the most important variables that explain the variations in a data set. We continue by explaining how to apply correspondence analysis using supplementary rows and columns. This is important if you want to make predictions with CA. The last sections of this guide also describe how to filter CA results in order to keep only the most contributing variables. Finally, we'll see how to deal with outliers.


5.2 Computation

5.2.1 R packages

Several functions from different packages are available in the R software for computing correspondence analysis:

CA() [FactoMineR package], ca() [ca package], dudi.coa() [ade4 package], corresp() [MASS package], and epCA() [ExPosition package]

No matter what function you decide to use, you can easily extract and visualize the results of correspondence analysis using R functions provided in the factoextra R package.

Here, we'll use FactoMineR (for the analysis) and factoextra (for ggplot2-based elegant visualization). To install the two packages, type this:

install.packages(c("FactoMineR", "factoextra"))

Load the packages:

library("FactoMineR")
library("factoextra")

5.2.2 Data format

The data should be a contingency table. We'll use the demo data set housetasks available in the factoextra R package:

data(housetasks)
# head(housetasks)

The data is a contingency table containing 13 housetasks and their distribution within the couple:

- rows are the different tasks
- values are the frequencies of the tasks done:
  - by the wife only
  - alternatively
  - by the husband only
  - or jointly

The data is illustrated in the following image:


5.2.3 Graph of contingency tables and chi-square test

The above contingency table is not very large. Therefore, it's easy to visually inspect and interpret row and column profiles:

- It's evident that the housetasks Laundry, Main_meal and Dinner are more frequently done by the "Wife".
- Repairs and driving are dominantly done by the husband.
- Holidays are frequently associated with the column "jointly".

Exploratory data analysis and visualization of contingency tables have been covered in our previous article: Chi-Square test of independence in R. Briefly, a contingency table can be visualized using the functions balloonplot() [gplots package] and mosaicplot() [graphics package]:

library("gplots")
# 1. Convert the data as a table
dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main = "housetasks", xlab = "", ylab = "",
            label = FALSE, show.margins = FALSE)

Note that row and column sums are printed by default in the bottom and right margins, respectively. These values are hidden, in the above plot, using the argument show.margins = FALSE.

For a small contingency table, you can use the chi-square test to evaluate whether there is a significant dependence between row and column categories:

chisq <- chisq.test(housetasks)
chisq

##
##  Pearson's Chi-squared test
##
## data:  housetasks
## X-squared = 2000, df = 40, p-value <2e-16

In our example, the row and the column variables are statistically significantly associated (p-value < 2e-16).
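To make the test less of a black box, the chi-square statistic can be reproduced by hand. The following is a minimal sketch in base R; the table tab contains made-up counts (it is not the housetasks data), and all object names are illustrative.

```r
# Hypothetical 5 x 4 contingency table (made-up counts, not housetasks)
tab <- matrix(c(20,  5,  2,  3,
                15,  8,  4,  6,
                 3, 12, 18,  4,
                 2,  6, 25,  7,
                 5,  9,  6, 20),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("r", 1:5), c("A", "B", "C", "D")))

n <- sum(tab)
# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(tab), colSums(tab)) / n
# Pearson statistic: sum of (observed - expected)^2 / expected
chi2 <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
pval <- pchisq(chi2, df = df, lower.tail = FALSE)

# The same statistic as returned by chisq.test()
res <- chisq.test(tab)
c(by.hand = chi2, chisq.test = unname(res$statistic))
```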


5.2.4 R code to compute CA

The function CA() [FactoMineR package] can be used. A simplified format is:

CA(X, ncp = 5, graph = TRUE)

- X: a data frame (contingency table)
- ncp: number of dimensions kept in the final results.
- graph: a logical value. If TRUE a graph is displayed.

To compute correspondence analysis, type this:

library("FactoMineR")
res.ca <- CA(housetasks, graph = FALSE)

The output of the function CA() is a list including:

print(res.ca)

## **Results of the Correspondence Analysis (CA)**
## The row variable has 13 categories; the column variable has 4 categories
## The chi square of independence between the two variables is equal to 1944 (p-value = 0 ).
## *The results are available in the following objects:
##
##    name              description
## 1  "$eig"            "eigenvalues"
## 2  "$col"            "results for the columns"
## 3  "$col$coord"      "coord. for the columns"
## 4  "$col$cos2"       "cos2 for the columns"
## 5  "$col$contrib"    "contributions of the columns"
## 6  "$row"            "results for the rows"
## 7  "$row$coord"      "coord. for the rows"
## 8  "$row$cos2"       "cos2 for the rows"
## 9  "$row$contrib"    "contributions of the rows"
## 10 "$call"           "summary called parameters"
## 11 "$call$marge.col" "weights of the columns"
## 12 "$call$marge.row" "weights of the rows"

The object that is created using the function CA() contains a lot of information stored in many different lists and matrices. These values are described in the next section.


5.3 Visualization and interpretation

We'll use the following functions [in factoextra] to help in the interpretation and the visualization of the correspondence analysis:

- get_eigenvalue(res.ca): Extract the eigenvalues/variances retained by each dimension (axis)
- fviz_eig(res.ca): Visualize the eigenvalues
- get_ca_row(res.ca), get_ca_col(res.ca): Extract the results for rows and columns, respectively.
- fviz_ca_row(res.ca), fviz_ca_col(res.ca): Visualize the results for rows and columns, respectively.
- fviz_ca_biplot(res.ca): Make a biplot of rows and columns.

In the next sections, we'll illustrate each of these functions.

5.3.1 Statistical significance

To interpret correspondence analysis, the first step is to evaluate whether there is a significant dependency between the rows and columns.

A rigorous method is to use the chi-square statistic for examining the association between row and column variables. This appears at the top of the report generated by the function summary(res.ca) or print(res.ca), see section 5.2.4. A high chi-square statistic means a strong link between row and column variables.

# Chi-square statistic
chi2 <- 1944.456
# Degrees of freedom
df <- (nrow(housetasks) - 1) * (ncol(housetasks) - 1)
# P-value
pval <- pchisq(chi2, df = df, lower.tail = FALSE)
pval

## [1] 0

In our example, the association is highly significant (chi-square: 1944.456, p = 0).

5.3.2 Eigenvalues / Variances

Recall that we examine the eigenvalues to determine the number of axes to be considered. The eigenvalues and the proportion of variances retained by the different axes can be extracted using the function get_eigenvalue() [factoextra package]. Eigenvalues are large for the first axis and small for the subsequent axes.

library("factoextra")
eig.val <- get_eigenvalue(res.ca)
eig.val

##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1      0.543             48.7                        48.7
## Dim.2      0.445             39.9                        88.6
## Dim.3      0.127             11.4                       100.0

Eigenvalues correspond to the amount of information retained by each axis. Dimensions are ordered decreasingly and listed according to the amount of variance explained in the solution. Dimension 1 explains the most variance in the solution, followed by dimension 2 and so on.

The cumulative percentage explained is obtained by adding the successive proportions of variation explained to obtain the running total. For instance, 48.69% plus 39.91% equals 88.6%, and so forth. Therefore, about 88.6% of the variation is explained by the first two dimensions.

Eigenvalues can be used to determine the number of axes to retain. There is no "rule of thumb" to choose the number of dimensions to keep for the data interpretation. It depends on the research question and the researcher's need. For example, if you are satisfied with 80% of the total variance explained, then use the number of dimensions necessary to achieve that.

In our analysis, the first two axes explain 88.6% of the variation. This is an acceptably large percentage.
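For readers who want to see where such eigenvalues come from, here is a hedged sketch of the standard textbook computation: a singular value decomposition of the standardized residuals of the table. It uses base R and made-up counts (not housetasks); FactoMineR may organize the computation differently internally, and the object names are illustrative.

```r
# Hypothetical 5 x 4 contingency table (made-up counts, not housetasks)
tab <- matrix(c(20,  5,  2,  3,
                15,  8,  4,  6,
                 3, 12, 18,  4,
                 2,  6, 25,  7,
                 5,  9,  6, 20),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("r", 1:5), c("A", "B", "C", "D")))

P <- tab / sum(tab)              # correspondence matrix (relative frequencies)
rmass <- rowSums(P)              # row masses
cmass <- colSums(P)              # column masses
E <- outer(rmass, cmass)         # expected proportions under independence
resid <- (P - E) / sqrt(E)       # standardized residuals

k <- min(dim(tab)) - 1           # number of non-trivial dimensions
eig <- svd(resid)$d[1:k]^2       # eigenvalues = squared singular values

variance.percent <- 100 * eig / sum(eig)
cumulative.variance.percent <- cumsum(variance.percent)
round(cbind(eigenvalue = eig, variance.percent, cumulative.variance.percent), 3)

# The total inertia (sum of eigenvalues) equals chi-square / n
chi2 <- unname(chisq.test(tab)$statistic)
c(total.inertia = sum(eig), chi2.over.n = chi2 / sum(tab))
```

The last line shows the link with the previous section: the sum of the eigenvalues is the chi-square statistic divided by the grand total of the table.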

An alternative method to determine the number of dimensions is to look at a scree plot, which is the plot of eigenvalues/variances ordered from largest to smallest. The number of components is determined at the point beyond which the remaining eigenvalues are all relatively small and of comparable size.

The scree plot can be produced using the function fviz_eig() or fviz_screeplot() [factoextra package].

fviz_screeplot(res.ca, addlabels = TRUE, ylim = c(0, 50))

Note that a good dimension reduction is achieved when the first few dimensions account for a large proportion of the variability.

The point at which the scree plot shows a bend (the so-called "elbow") can be considered as indicating an optimal dimensionality.

It's also possible to calculate an average eigenvalue above which the axis should be kept in the solution.

Our data contains 13 rows and 4 columns.

If the data were random, the expected value of the eigenvalue for each axis would be 1/(nrow(housetasks)-1) = 1/12 = 8.33% in terms of rows.

Likewise, the average axis should account for 1/(ncol(housetasks)-1) = 1/3 = 33.33% in terms of the 4 columns.

According to (M. T. Bendixen 1995):

Any axis with a contribution larger than the maximum of these two percentages should be considered as important and included in the solution for the interpretation of the data.

The R code below draws the scree plot with a red dashed line specifying the average eigenvalue:

fviz_screeplot(res.ca) +
  geom_hline(yintercept = 33.33, linetype = 2, color = "red")

According to the graph above, only dimensions 1 and 2 should be used in the solution. Dimension 3 explains only 11.4% of the total inertia, which is below the average eigenvalue (33.33%) and too little to be kept for further analysis.
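The average-eigenvalue rule can be written down in a few lines. This is only a sketch: the table dimensions (13 and 4) and the percentages of variance are copied from the text and output above, and the object names are illustrative.

```r
# Dimensions of the housetasks table, as stated in the text
n.row <- 13
n.col <- 4

# Average percentage of inertia per axis, in terms of rows and of columns
avg.row <- 100 / (n.row - 1)   # 8.33%
avg.col <- 100 / (n.col - 1)   # 33.33%

# An axis is kept if it exceeds the larger of the two averages
cutoff <- max(avg.row, avg.col)

# Percentages of variance copied from the eig.val output
variance.percent <- c(Dim.1 = 48.7, Dim.2 = 39.9, Dim.3 = 11.4)
keep <- names(variance.percent)[variance.percent > cutoff]
keep
```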

Dimensions 1 and 2 explain approximately 48.7% and 39.9% of the total inertia respectively. This corresponds to a cumulative total of 88.6% of total inertia retained by the 2 dimensions. The higher the retention, the more subtlety in the original data is retained in the low-dimensional solution (M. Bendixen 2003).

Note that you can use more than 2 dimensions. However, the supplementary dimensions are unlikely to contribute significantly to the interpretation of the nature of the association between the rows and columns.

5.3.3 Biplot

The function fviz_ca_biplot() [factoextra package] can be used to draw the biplot of row and column variables.

# repel = TRUE to avoid text overlapping (slow if many points)
fviz_ca_biplot(res.ca, repel = TRUE)

The graph above is called a symmetric plot and shows a global pattern within the data. Rows are represented by blue points and columns by red triangles.

The distance between any row points or column points gives a measure of their similarity (or dissimilarity). Row points with similar profiles are close on the factor map. The same holds true for column points.
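The statement about distances can be checked numerically. The sketch below, in base R with made-up counts (not housetasks), compares the chi-square distance between two row profiles with the Euclidean distance between the same rows in principal coordinates. The two agree exactly only when all dimensions are kept; on a two-dimensional map the plotted distance is an approximation. Object names are illustrative.

```r
# Hypothetical 5 x 4 contingency table (made-up counts, not housetasks)
tab <- matrix(c(20,  5,  2,  3,
                15,  8,  4,  6,
                 3, 12, 18,  4,
                 2,  6, 25,  7,
                 5,  9,  6, 20),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("r", 1:5), c("A", "B", "C", "D")))

P <- tab / sum(tab)
rmass <- rowSums(P)
cmass <- colSums(P)
profiles <- P / rmass            # row profiles: each row sums to 1

# Chi-square distance between the profiles of rows r1 and r2
d.chi2 <- sqrt(sum((profiles["r1", ] - profiles["r2", ])^2 / cmass))

# Principal row coordinates from the SVD of the standardized residuals
E <- outer(rmass, cmass)
sv <- svd((P - E) / sqrt(E))
k <- min(dim(tab)) - 1
row.coord <- sweep(sv$u[, 1:k, drop = FALSE] / sqrt(rmass), 2, sv$d[1:k], "*")
rownames(row.coord) <- rownames(tab)

# Euclidean distance between the same two rows over all k dimensions
d.map <- sqrt(sum((row.coord["r1", ] - row.coord["r2", ])^2))
c(chi.square.distance = d.chi2, map.distance = d.map)
```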

The biplot above shows that:

- Housetasks such as dinner, breakfast and laundry are done more often by the wife
- Driving and repairs are done by the husband
- ...

The symmetric plot represents the row and column profiles simultaneously in a common space. In this case, only the distance between row points or the distance between column points can be really interpreted.

The distance between any row and column items is not meaningful! You can only make general statements about the observed pattern.

In order to interpret the distance between column and row points, the column profiles must be presented in row space or vice-versa. This type of map is called an asymmetric biplot and is discussed in section 5.3.6.2.

The next step for the interpretation is to determine which row and column variables contribute the most to the definition of the different dimensions retained in the model.

5.3.4 Graph of row variables

5.3.4.1 Results

The function get_ca_row() [in factoextra] is used to extract the results for row variables. This function returns a list containing the coordinates, the cos2, the contribution and the inertia of row variables:

row <- get_ca_row(res.ca)
row

## Correspondence Analysis - Results for rows
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the rows"
## 2 "$cos2"    "Cos2 for the rows"
## 3 "$contrib" "contributions of the rows"
## 4 "$inertia" "Inertia of the rows"

The components of the get_ca_row() function can be used in the plot of rows as follows:

- row$coord: coordinates of each row point in each dimension (1, 2 and 3). Used to create the scatter plot.
- row$cos2: quality of representation of rows.
- row$contrib: contribution of rows (in %) to the definition of the dimensions.

The different components can be accessed as follows:

# Coordinates
head(row$coord)

# Cos2: quality on the factor map
head(row$cos2)

# Contributions to the principal components
head(row$contrib)

In this section, we describe how to visualize row points only. Next, we highlight rows according to either i) their quality of representation on the factor map or ii) their contributions to the dimensions.

5.3.4.2 Coordinates of row points

The R code below displays the coordinates of each row point in each dimension (1, 2 and 3):

head(row$coord)

##             Dim 1  Dim 2   Dim 3
## Laundry    -0.992  0.495 -0.3167
## Main_meal  -0.876  0.490 -0.1641
## Dinner     -0.693  0.308 -0.2074
## Breakfeast -0.509  0.453  0.2204
## Tidying    -0.394 -0.434 -0.0942
## Dishes     -0.189 -0.442  0.2669

Use the function fviz_ca_row() [in factoextra] to visualize only row points:

fviz_ca_row(res.ca, repel = TRUE)

Note that it's possible to plot row points and to color them according to either i) their quality on the factor map (cos2) or ii) their contribution values to the definition of dimensions (contrib).

It's possible to change the color and the shape of the row points using the arguments col.row and shape.row as follows:

fviz_ca_row(res.ca, col.row = "steelblue", shape.row = 15)

The plot above shows the relationships between row points:

- Rows with a similar profile are grouped together.
- Negatively correlated rows are positioned on opposite sides of the plot origin (opposed quadrants).
- The distance between row points and the origin measures the quality of the row points on the factor map. Row points that are away from the origin are well represented on the factor map.

5.3.4.3 Quality of representation of rows

The result of the analysis shows that the contingency table has been successfully represented in low-dimensional space using correspondence analysis. The two dimensions 1 and 2 are sufficient to retain 88.6% of the total inertia (variation) contained in the data.

However, not all the points are equally well displayed in the two dimensions.

Recall that the quality of representation of the rows on the factor map is called the squared cosine (cos2) or the squared correlations.

The cos2 measures the degree of association between rows/columns and a particular axis. The cos2 of row points can be extracted as follows:

head(row$cos2, 4)

##            Dim 1 Dim 2  Dim 3
## Laundry    0.740 0.185 0.0755
## Main_meal  0.742 0.232 0.0260
## Dinner     0.777 0.154 0.0697
## Breakfeast 0.505 0.400 0.0948

The values of the cos2 lie between 0 and 1. The sum of the cos2 for rows on all the CA dimensions is equal to one.

If a row item is well represented by two dimensions, the sum of the cos2 is close to one. For some of the row items, more than 2 dimensions are required to perfectly represent the data.

It's possible to color row points by their cos2 values using the argument col.row = "cos2". This produces a color gradient, which can be customized using the argument gradient.cols. For instance, gradient.cols = c("white", "blue", "red") means that:

- variables with low cos2 values will be colored in "white"
- variables with mid cos2 values will be colored in "blue"
- variables with high cos2 values will be colored in "red"

# Color by cos2 values: quality on the factor map
fviz_ca_row(res.ca, col.row = "cos2",
            gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
            repel = TRUE)

The quality of representation of a row or column in n dimensions is simply the sum of the squared cosine of that row or column over the n dimensions.
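As a hedged illustration of this definition, the cos2 of each row can be computed from its principal coordinates: the squared coordinate on an axis divided by the squared distance of the point to the origin. The sketch uses base R and made-up counts (not housetasks); the object names are illustrative.

```r
# Hypothetical 5 x 4 contingency table (made-up counts, not housetasks)
tab <- matrix(c(20,  5,  2,  3,
                15,  8,  4,  6,
                 3, 12, 18,  4,
                 2,  6, 25,  7,
                 5,  9,  6, 20),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("r", 1:5), c("A", "B", "C", "D")))

P <- tab / sum(tab)
rmass <- rowSums(P)
cmass <- colSums(P)
E <- outer(rmass, cmass)
sv <- svd((P - E) / sqrt(E))
k <- min(dim(tab)) - 1

# Principal coordinates of the rows on the k dimensions
row.coord <- sweep(sv$u[, 1:k, drop = FALSE] / sqrt(rmass), 2, sv$d[1:k], "*")
dimnames(row.coord) <- list(rownames(tab), paste0("Dim.", 1:k))

# cos2: squared coordinate divided by the squared distance to the origin
cos2 <- row.coord^2 / rowSums(row.coord^2)

# Quality of representation on the first two dimensions
quality.2d <- rowSums(cos2[, 1:2])
round(cbind(cos2, quality.2d), 3)
```

Rows whose quality.2d is well below 1 need the third dimension to be described accurately.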

Note that it's also possible to change the transparency of the row points according to their cos2 values using the option alpha.row = "cos2". For example, type this:

# Change the transparency by cos2 values
fviz_ca_row(res.ca, alpha.row = "cos2")

You can visualize the cos2 of row points on all the dimensions using the corrplot package:

library("corrplot")
corrplot(row$cos2, is.corr = FALSE)

It's also possible to create a bar plot of rows cos2 using the function fviz_cos2() [in factoextra]:

# Cos2 of rows on Dim.1 and Dim.2
fviz_cos2(res.ca, choice = "row", axes = 1:2)

Note that all row points except Official are well represented by the first two dimensions. This implies that the position of the point corresponding to the item Official on the scatter plot should be interpreted with some caution. A higher dimensional solution is probably necessary for the item Official.

5.3.4.4 Contributions of rows to the dimensions

The contribution of rows (in %) to the definition of the dimensions can be extracted as follows:

head(row$contrib)

##             Dim 1 Dim 2 Dim 3
## Laundry    18.287  5.56 7.968
## Main_meal  12.389  4.74 1.859
## Dinner      5.471  1.32 2.097
## Breakfeast  3.825  3.70 3.069
## Tidying     1.998  2.97 0.489
## Dishes      0.426  2.84 3.634

Rows that contribute the most to Dim.1 and Dim.2 are the most important in explaining the variability in the data set. Rows that do not contribute much to any dimension or that contribute to the last dimensions are less important.

The row variables with the larger values contribute the most to the definition of the dimensions.

It's possible to use the function corrplot() [corrplot package] to highlight the most contributing row points for each dimension:

library("corrplot")
corrplot(row$contrib, is.corr = FALSE)

The function fviz_contrib() [factoextra package] can be used to draw a bar plot of row contributions. If your data contains many rows, you can decide to show only the top contributing rows. The R code below shows the top 10 rows contributing to the dimensions:

# Contributions of rows to dimension 1
fviz_contrib(res.ca, choice = "row", axes = 1, top = 10)

# Contributions of rows to dimension 2
fviz_contrib(res.ca, choice = "row", axes = 2, top = 10)

The total contribution to dimensions 1 and 2 can be obtained as follows:

# Total contribution to dimensions 1 and 2
fviz_contrib(res.ca, choice = "row", axes = 1:2, top = 10)

The red dashed line on the graph above indicates the expected average value, if the contributions were uniform. The calculation of the expected contribution value, under the null hypothesis, has been detailed in the principal component analysis chapter (4).
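The contributions and the uniform reference value can be sketched as follows. The code assumes the usual CA definition (mass times squared principal coordinate, divided by the eigenvalue) and, for two axes, an eigenvalue-weighted average of the per-axis contributions. It uses base R and made-up counts (not housetasks); whether factoextra draws its reference line in exactly this way is an assumption, and the object names are illustrative.

```r
# Hypothetical 5 x 4 contingency table (made-up counts, not housetasks)
tab <- matrix(c(20,  5,  2,  3,
                15,  8,  4,  6,
                 3, 12, 18,  4,
                 2,  6, 25,  7,
                 5,  9,  6, 20),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("r", 1:5), c("A", "B", "C", "D")))

P <- tab / sum(tab)
rmass <- rowSums(P)
cmass <- colSums(P)
E <- outer(rmass, cmass)
sv <- svd((P - E) / sqrt(E))
k <- min(dim(tab)) - 1
eig <- sv$d[1:k]^2

row.coord <- sweep(sv$u[, 1:k, drop = FALSE] / sqrt(rmass), 2, sv$d[1:k], "*")
rownames(row.coord) <- rownames(tab)

# Contribution (in %) of each row to each axis; every column sums to 100
contrib <- 100 * sweep(rmass * row.coord^2, 2, eig, "/")

# Total contribution to dimensions 1 and 2, weighted by the eigenvalues
contrib.12 <- (contrib[, 1] * eig[1] + contrib[, 2] * eig[2]) / (eig[1] + eig[2])

# Expected value if all rows contributed equally
expected <- 100 / nrow(tab)
names(which(contrib.12 > expected))
```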

It can be seen that:

- the row items Repairs, Laundry, Main_meal and Driving are the most important in the definition of the first dimension.
- the row items Holidays and Repairs contribute the most to dimension 2.

The most important (or contributing) row points can be highlighted on the scatter plot as follows:

fviz_ca_row(res.ca, col.row = "contrib",
            gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
            repel = TRUE)

The scatter plot gives an idea of what pole of the dimensions the row categories are actually contributing to. It is evident that the row categories Repairs and Driving have an important contribution to the positive pole of the first dimension, while the categories Laundry and Main_meal have a major contribution to the negative pole of the first dimension, etc.

In other words, dimension 1 is mainly defined by the opposition of Repairs and Driving (positive pole), and Laundry and Main_meal (negative pole).

Note that it's also possible to control the transparency of row points according to their contribution values using the option alpha.row = "contrib". For example, type this:

# Change the transparency by contrib values
fviz_ca_row(res.ca, alpha.row = "contrib",
            repel = TRUE)

5.3.5 Graph of column variables

5.3.5.1 Results

The function get_ca_col() [in factoextra] is used to extract the results for column variables. This function returns a list containing the coordinates, the cos2, the contribution and the inertia of column variables:

col <- get_ca_col(res.ca)
col

## Correspondence Analysis - Results for columns
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the columns"
## 2 "$cos2"    "Cos2 for the columns"
## 3 "$contrib" "contributions of the columns"
## 4 "$inertia" "Inertia of the columns"

To get access to the different components, use this:

# Coordinates of column points
head(col$coord)

# Quality of representation
head(col$cos2)

# Contributions
head(col$contrib)

5.3.5.2 Plots: quality and contribution

The result for columns gives the same information as described for rows. For this reason, we'll just display the result for columns in this section with only a very brief comment.

The function fviz_ca_col() is used to produce the graph of column points. To create a simple plot, type this:

fviz_ca_col(res.ca)

Like row points, it's also possible to color column points by their cos2 values:

fviz_ca_col(res.ca, col.col = "cos2",
            gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
            repel = TRUE)

Recall that the value of the cos2 is between 0 and 1. A cos2 close to 1 corresponds to a column/row variable that is well represented on the factor map.

The R code below creates a bar plot of columns cos2:

fviz_cos2(res.ca, choice = "col", axes = 1:2)

Note that only the column item Alternating is not very well displayed on the first two dimensions. The position of this item must be interpreted with caution in the space formed by dimensions 1 and 2.

To visualize the contribution of columns to the first two dimensions, type this:

fviz_contrib(res.ca, choice = "col", axes = 1:2)

5.3.6 Biplot options

A biplot is a graphical display of rows and columns in 2 or 3 dimensions. We have already described how to create CA biplots in section 5.3.3. Here, we'll describe different types of CA biplots.

5.3.6.1 Symmetric biplot

As mentioned above, the standard plot of correspondence analysis is a symmetric biplot in which both rows (blue points) and columns (red triangles) are represented in the same space using the principal coordinates. These coordinates represent the row and column profiles. In this case, only the distance between row points or the distance between column points can be really interpreted.

fviz_ca_biplot(res.ca, repel = TRUE)

With a symmetric plot, the inter-distance between rows and columns can't be interpreted. Only general statements can be made about the pattern.

5.3.6.2 Asymmetric biplot

To make an asymmetric biplot, row (or column) points are plotted from the standard coordinates (S) and the profiles of the columns (or the rows) are plotted from the principal coordinates (P) (M. Bendixen 2003).

For a given axis, the standard and principal coordinates are related as follows:

P = sqrt(eigenvalue) × S

- P: the principal coordinate of a row (or a column) on the axis
- S: the standard coordinate of the same row (or column) on the axis
- eigenvalue: the eigenvalue of the axis
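The relation between the two coordinate systems can be illustrated with a short sketch in base R on made-up counts (not housetasks). Standard coordinates have a mass-weighted variance of 1 on every axis, while principal coordinates have a weighted variance equal to the eigenvalue of the axis. The object names are illustrative.

```r
# Hypothetical 5 x 4 contingency table (made-up counts, not housetasks)
tab <- matrix(c(20,  5,  2,  3,
                15,  8,  4,  6,
                 3, 12, 18,  4,
                 2,  6, 25,  7,
                 5,  9,  6, 20),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("r", 1:5), c("A", "B", "C", "D")))

P <- tab / sum(tab)
rmass <- rowSums(P)
cmass <- colSums(P)
E <- outer(rmass, cmass)
sv <- svd((P - E) / sqrt(E))
k <- min(dim(tab)) - 1
eig <- sv$d[1:k]^2

# Standard coordinates (S) of the rows
row.std <- sv$u[, 1:k, drop = FALSE] / sqrt(rmass)

# Principal coordinates: P = sqrt(eigenvalue) x S, axis by axis
row.pri <- sweep(row.std, 2, sqrt(eig), "*")

# Mass-weighted variance on each axis
var.std <- colSums(rmass * row.std^2)   # equals 1 on every axis
var.pri <- colSums(rmass * row.pri^2)   # equals the eigenvalues
rbind(var.std, var.pri, eig)
```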

Note that, in order to interpret the distance between column points and row points, the simplest way is to make an asymmetric plot. This means that the column profiles must be presented in row space or vice-versa.

Depending on the situation, other types of display can be set using the argument map (Nenadic and Greenacre 2007) in the function fviz_ca_biplot() [in factoextra].

The allowed options for the argument map are:

1. "rowprincipal" or "colprincipal" - these are the so-called asymmetric biplots, with either rows in principal coordinates and columns in standard coordinates, or vice versa (also known as row-metric-preserving or column-metric-preserving, respectively).

- "rowprincipal": columns are represented in row space
- "colprincipal": rows are represented in column space

2. "symbiplot" - both rows and columns are scaled to have variances equal to the singular values (square roots of eigenvalues), which gives a symmetric biplot but does not preserve row or column metrics.

3. "rowgab" or "colgab": Asymmetric maps proposed by Gabriel & Odoroff (Gabriel and Odoroff 1990):

- "rowgab": rows in principal coordinates and columns in standard coordinates multiplied by the mass.
- "colgab": columns in principal coordinates and rows in standard coordinates multiplied by the mass.

4. "rowgreen" or "colgreen": The so-called contribution biplots showing visually the most contributing points (Greenacre 2006b).

- "rowgreen": rows in principal coordinates and columns in standard coordinates multiplied by the square root of the mass.
- "colgreen": columns in principal coordinates and rows in standard coordinates multiplied by the square root of the mass.

The R code below draws a standard asymmetric biplot:

fviz_ca_biplot(res.ca,
               map = "rowprincipal", arrows = c(TRUE, TRUE),
               repel = TRUE)

We used the argument arrows, which is a vector of two logicals specifying if the plot should contain points (FALSE, default) or arrows (TRUE). The first value sets the rows and the second value sets the columns.

If the angle between two arrows is acute, then there is a strong association between the corresponding row and column.

To interpret the distance between rows and a column, you should perpendicularly project row points on the column arrow.

5.3.6.3 Contribution biplot

In the standard symmetric biplot (mentioned in the previous section), it's difficult to know the most contributing points to the solution of the CA.

Michael Greenacre proposed a new scaling display (called the contribution biplot) which incorporates the contribution of points (M. Greenacre 2013). In this display, points that contribute very little to the solution are close to the center of the biplot and are relatively unimportant to the interpretation.

A contribution biplot can be drawn using the argument map = "rowgreen" or map = "colgreen".

Firstly, you have to decide whether to analyse the contributions of rows or columns to the definition of the axes.

In our example we'll interpret the contribution of rows to the axes. The argument map = "colgreen" is used. In this case, recall that columns are in principal coordinates and rows in standard coordinates multiplied by the square root of the mass. For a given row, the square of the new coordinate on an axis i is exactly the contribution of this row to the inertia of the axis i.

fviz_ca_biplot(res.ca, map = "colgreen", arrows = c(TRUE, FALSE),
               repel = TRUE)
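The property stated above (squared rescaled coordinate equals contribution) can be verified with a short sketch. It uses base R and made-up counts (not housetasks); contributions are expressed here as proportions between 0 and 1 rather than percentages, and the object names are illustrative.

```r
# Hypothetical 5 x 4 contingency table (made-up counts, not housetasks)
tab <- matrix(c(20,  5,  2,  3,
                15,  8,  4,  6,
                 3, 12, 18,  4,
                 2,  6, 25,  7,
                 5,  9,  6, 20),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("r", 1:5), c("A", "B", "C", "D")))

P <- tab / sum(tab)
rmass <- rowSums(P)
cmass <- colSums(P)
E <- outer(rmass, cmass)
sv <- svd((P - E) / sqrt(E))
k <- min(dim(tab)) - 1
eig <- sv$d[1:k]^2

# Standard and principal coordinates of the rows
row.std <- sv$u[, 1:k, drop = FALSE] / sqrt(rmass)
row.pri <- sweep(row.std, 2, sqrt(eig), "*")

# Contribution-biplot scaling: standard coordinates times sqrt(mass)
row.green <- row.std * sqrt(rmass)

# Usual contribution: mass * (principal coordinate)^2 / eigenvalue
contrib <- sweep(rmass * row.pri^2, 2, eig, "/")

# The squared rescaled coordinates reproduce the contributions
max(abs(row.green^2 - contrib))
```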

In the graph above, the position of the column profile points is unchanged relative to that in the conventional biplot. However, the distances of the row points from the plot origin are related to their contributions to the two-dimensional factor map.

The closer an arrow is (in terms of angular distance) to an axis, the greater is the contribution of the row category on that axis relative to the other axis. If the arrow is halfway between the two, its row category contributes to the two axes to the same extent.

It is evident that the row category Repairs has an important contribution to the positive pole of the first dimension, while the categories Laundry and Main_meal have a major contribution to the negative pole of the first dimension.

Dimension 2 is mainly defined by the row category Holidays.

The row category Driving contributes to the two axes to the same extent.

5.3.7 Dimension description

To easily identify row and column points that are the most associated with the principal dimensions, you can use the function dimdesc() [in FactoMineR]. Row/column variables are sorted by their coordinates in the dimdesc() output.

# Dimension description
res.desc <- dimdesc(res.ca, axes = c(1, 2))

Description of dimension 1:

# Description of dimension 1 by row points
head(res.desc[[1]]$row, 4)

##             coord
## Laundry    -0.992
## Main_meal  -0.876
## Dinner     -0.693
## Breakfeast -0.509

# Description of dimension 1 by column points
head(res.desc[[1]]$col, 4)

##               coord
## Wife        -0.8376
## Alternating -0.0622
## Jointly      0.1494
## Husband      1.1609

Description of dimension 2:

# Description of dimension 2 by row points
res.desc[[2]]$row

# Description of dimension 2 by column points
res.desc[[2]]$col

5.4 Supplementary elements

5.4.1 Data format

We'll use the data set children [in the FactoMineR package]. It contains 18 rows and 8 columns:

data(children)
# head(children)

Data format for correspondence analysis with supplementary elements

The data used here is a contingency table describing the answers given by different categories of people to the following question: What are the reasons that can make a woman or a couple hesitate to have children?

Only some of the rows and columns will be used to perform the correspondence analysis (CA). The coordinates of the remaining (supplementary) rows/columns on the factor map will be predicted after the CA.

In CA terminology, our data contains:

- Active rows (rows 1:14): Rows that are used during the correspondence analysis.
- Supplementary rows (row.sup 15:18): The coordinates of these rows will be predicted using the CA information and parameters obtained with active rows/columns
- Active columns (columns 1:5): Columns that are used for the correspondence analysis.
- Supplementary columns (col.sup 6:8): As with supplementary rows, the coordinates of these columns will also be predicted.

5.4.2 Specification in CA

As mentioned above, supplementary rows and columns are not used for the definition of the principal dimensions. Their coordinates are predicted using only the information provided by the performed CA on active rows/columns.

To specify supplementary rows/columns, the function CA() [in FactoMineR] can be used as follows:

CA(X, ncp = 5, row.sup = NULL, col.sup = NULL,
   graph = TRUE)

- X: a data frame (contingency table)
- row.sup: a numeric vector specifying the indexes of the supplementary rows
- col.sup: a numeric vector specifying the indexes of the supplementary columns
- ncp: number of dimensions kept in the final results.
- graph: a logical value. If TRUE a graph is displayed.

For example, type this:

res.ca <- CA(children, row.sup = 15:18, col.sup = 6:8,
             graph = FALSE)
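To give an idea of how the coordinates of a supplementary row can be predicted, the sketch below applies the standard CA transition formula: the profile of the new row multiplied by the standard coordinates of the active columns. It uses base R and made-up counts (not the children data); that FactoMineR follows exactly these steps is an assumption, and predict.row is a hypothetical helper, not a package function.

```r
# Hypothetical table (made-up counts): rows r1-r4 are active, r5 is supplementary
tab <- matrix(c(20,  5,  2,  3,
                15,  8,  4,  6,
                 3, 12, 18,  4,
                 2,  6, 25,  7,
                 5,  9,  6, 20),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("r", 1:5), c("A", "B", "C", "D")))
active <- tab[1:4, ]
sup <- tab[5, , drop = FALSE]

# CA of the active rows only
P <- active / sum(active)
rmass <- rowSums(P)
cmass <- colSums(P)
E <- outer(rmass, cmass)
sv <- svd((P - E) / sqrt(E))
k <- min(dim(active)) - 1

# Principal coordinates of active rows, standard coordinates of columns
row.coord <- sweep(sv$u[, 1:k, drop = FALSE] / sqrt(rmass), 2, sv$d[1:k], "*")
col.std <- sv$v[, 1:k, drop = FALSE] / sqrt(cmass)

# Hypothetical helper: row profile times the column standard coordinates
predict.row <- function(counts) (counts / sum(counts)) %*% col.std

# Predicted coordinates of the supplementary row
sup.coord <- predict.row(sup)
sup.coord

# Sanity check: the same formula reproduces the coordinates of an active row
check <- predict.row(active[1, , drop = FALSE])
rbind(predicted = as.numeric(check), computed = as.numeric(row.coord[1, ]))
```

Because the supplementary row enters only through its profile, it has no influence on the axes themselves.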

5.4.3 Biplot of rows and columns

fviz_ca_biplot(res.ca, repel = TRUE)

- Active rows are in blue
- Supplementary rows are in darkblue
- Columns are in red
- Supplementary columns are in darkred

It's also possible to hide supplementary rows and columns using the argument invisible:

fviz_ca_biplot(res.ca, repel = TRUE,
               invisible = c("row.sup", "col.sup"))

5.4.4 Supplementary rows

Predicted results (coordinates and cos2) for the supplementary rows:

res.ca$row.sup

## $coord
##              Dim 1 Dim 2  Dim 3  Dim 4
## comfort      0.210 0.703 0.0711  0.307
## disagreement 0.146 0.119 0.1711 -0.313
## world        0.523 0.143 0.0840 -0.106
## to_live      0.308 0.502 0.5209  0.256
##
## $cos2
##               Dim 1  Dim 2   Dim 3  Dim 4
## comfort      0.0689 0.7752 0.00793 0.1479
## disagreement 0.1313 0.0869 0.17965 0.6021
## world        0.8759 0.0654 0.02256 0.0362
## to_live      0.1390 0.3685 0.39683 0.0956

Plot of active and supplementary row points:

fviz_ca_row(res.ca, repel = TRUE)

5.4.5 Supplementary columns

Predicted results (coordinates and cos2) for the supplementary columns:

res.ca$col.sup

## $coord
##              Dim 1   Dim 2   Dim 3   Dim 4
## thirty      0.1054 -0.0597 -0.1032  0.0698
## fifty      -0.0171  0.0491 -0.0157 -0.0131
## more_fifty -0.1771 -0.0481  0.1008 -0.0852
##
## $cos2
##             Dim 1  Dim 2   Dim 3   Dim 4
## thirty     0.1376 0.0441 0.13191 0.06028
## fifty      0.0109 0.0899 0.00919 0.00637
## more_fifty 0.2861 0.0211 0.09267 0.06620

Plot of active and supplementary column points:

fviz_ca_col(res.ca, repel = TRUE)

On these plots, supplementary rows are shown in dark blue and supplementary columns in dark red.


5.5 Filtering results

If you have many row/column variables, it's possible to visualize only some of them using the arguments select.row and select.col.

select.col, select.row: a selection of columns/rows to be drawn. Allowed values are NULL or a list containing the arguments name, cos2 or contrib:

- name: a character vector containing column/row names to be drawn
- cos2: if cos2 is in [0, 1], e.g. 0.6, then columns/rows with a cos2 > 0.6 are drawn. If cos2 > 1, e.g. 5, then the top 5 active columns/rows and top 5 supplementary columns/rows with the highest cos2 are drawn
- contrib: if contrib > 1, e.g. 5, then the top 5 columns/rows with the highest contributions are drawn
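The selection rules above can be mimicked in a few lines of base R. The helper select_items() below is hypothetical (it is not part of factoextra) and only sketches the documented behaviour on a named vector of cos2 or contribution values. How factoextra treats ties, or values exactly equal to the threshold, is not covered by this sketch.

```r
# Hypothetical helper mimicking the select.row / select.col rules
select_items <- function(values, x) {
  if (x <= 1) {
    # x is a threshold: keep items whose value exceeds it
    names(values)[values > x]
  } else {
    # x is a count: keep the x items with the highest values
    n <- min(x, length(values))
    names(sort(values, decreasing = TRUE))[seq_len(n)]
  }
}

# Made-up cos2 values for five rows
cos2 <- c(a = 0.91, b = 0.35, c = 0.82, d = 0.10, e = 0.66)

select_items(cos2, 0.8)  # rows with a cos2 above 0.8
select_items(cos2, 3)    # the three rows with the highest cos2
```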

# Visualize rows with cos2 >= 0.8
fviz_ca_row(res.ca, select.row = list(cos2 = 0.8))

# Top 5 active rows and 5 suppl. rows with the highest cos2
fviz_ca_row(res.ca, select.row = list(cos2 = 5))

# Select by names
name <- list(name = c("employment", "fear", "future"))
fviz_ca_row(res.ca, select.row = name)

# Top 5 contributing rows and columns
fviz_ca_biplot(res.ca, select.row = list(contrib = 5),
               select.col = list(contrib = 5)) +
  theme_minimal()


5.6 Outliers

If one or more "outliers" are present in the contingency table, they can dominate the interpretation of the axes (M. Bendixen 2003).

Outliers are points that have high absolute coordinate values and high contributions. They are represented, on the graph, very far from the centroid. In this case, the remaining row/column points tend to be tightly clustered in the graph, which becomes difficult to interpret.

In the CA output, the coordinates of row/column points represent the number of standard deviations the row/column is away from the barycentre (M. Bendixen 2003).

According to (M. Bendixen 2003):

Outliers are points that are at least one standard deviation away from the barycentre. They also contribute significantly to the interpretation of one pole of an axis (M. Bendixen 2003).

There are no apparent outliers in our data. If there were outliers in the data, they should be removed or treated as supplementary points when re-running the correspondence analysis.
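The rule above can be turned into a quick screening step. The sketch below uses made-up coordinates and contributions, and a hypothetical helper flag_outliers(). The "at least one standard deviation" cutoff follows the rule quoted above, whereas the contribution cutoff (more than the average contribution, 100 divided by the number of points) is an assumption chosen for illustration and is not taken from Bendixen's text.

```r
# Hypothetical helper: flag points that are at least 1 unit away from the
# barycentre on a given axis AND contribute more than the average to it
flag_outliers <- function(coord, contrib, axis = 1) {
  far   <- abs(coord[, axis]) >= 1
  heavy <- contrib[, axis] > 100 / nrow(contrib)
  rownames(coord)[far & heavy]
}

# Made-up coordinates and contributions (in %) for four row points
coord <- matrix(c(0.2, -0.3, 1.8,  0.1,    # Dim1
                  0.1,  0.4, 0.2, -0.5),   # Dim2
                ncol = 2,
                dimnames = list(c("r1", "r2", "r3", "r4"), c("Dim1", "Dim2")))
contrib <- matrix(c( 5, 10, 80,  5,        # Dim1
                    10, 40, 10, 40),       # Dim2
                  ncol = 2, dimnames = dimnames(coord))

flag_outliers(coord, contrib, axis = 1)
flag_outliers(coord, contrib, axis = 2)
```

With a real analysis, the two matrices would typically be res.ca$row$coord and res.ca$row$contrib (or their column counterparts).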


5.7 Exporting results

5.7.1 Export plots to PDF/PNG files

To save the different graphs into pdf or png files, we start by creating the plot of interest as an R object:

# Scree plot
scree.plot <- fviz_eig(res.ca)

# Biplot of row and column variables
biplot.ca <- fviz_ca_biplot(res.ca)

Next, the plots can be exported into a single pdf file as follows (one plot per page):

library(ggpubr)
ggexport(plotlist = list(scree.plot, biplot.ca),
         filename = "CA.pdf")

More options at: Chapter 4 (section: Exporting results).

5.7.2 Export results to txt/csv files

An easy-to-use R function is write.infile() [in FactoMineR package]:

# Export into a TXT file
write.infile(res.ca, "ca.txt", sep = "\t")

# Export into a CSV file
write.infile(res.ca, "ca.csv", sep = ";")
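If only one component of the results is needed, the base R function write.csv() is an alternative. The sketch below uses a made-up coordinate matrix and a temporary file so that it runs without FactoMineR; with a real result one would pass, for example, res.ca$row$coord instead of the placeholder coords.

```r
# Made-up coordinate matrix standing in for res.ca$row$coord
coords <- matrix(c(0.21, 0.15, 0.52,
                   0.70, 0.12, 0.14),
                 ncol = 2,
                 dimnames = list(c("comfort", "disagreement", "world"),
                                 c("Dim1", "Dim2")))

# Write to a temporary CSV file
out.file <- tempfile(fileext = ".csv")
write.csv(coords, file = out.file)

# Read the file back to check what was written
check <- read.csv(out.file, row.names = 1)
check
```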


5.8 Summary

In conclusion, we described how to perform and interpret correspondence analysis (CA). We computed CA using the CA() function [FactoMineR package]. Next, we used the factoextra R package to produce ggplot2-based visualization of the CA results.

Other functions [packages] to compute CA in R include:

1. Using dudi.coa() [ade4]

library("ade4")
res.ca <- dudi.coa(housetasks, scannf = FALSE, nf = 5)

Read more: http://www.sthda.com/english/wiki/ca-using-ade4

2. Using ca() [ca]

library(ca)
res.ca <- ca(housetasks)

Read more: http://www.sthda.com/english/wiki/ca-using-ca-package

3. Using corresp() [MASS]

library(MASS)
res.ca <- corresp(housetasks, nf = 3)

Read more: http://www.sthda.com/english/wiki/ca-using-mass

4. Using epCA() [ExPosition]

library("ExPosition")
res.ca <- epCA(housetasks, graph = FALSE)

No matter which function you decide to use from the list above, the factoextra package can handle the output.

fviz_eig(res.ca)        # Scree plot
fviz_ca_biplot(res.ca)  # Biplot of rows and columns


5.9 Further reading

For the mathematical background behind CA, refer to the following video courses, articles and books:

- Correspondence Analysis Course Using FactoMineR (Video courses). https://goo.gl/Hhh6hC
- Exploratory Multivariate Analysis by Example Using R (book) (F. Husson, Le, and Pagès 2017).
- Principal component analysis (article). (Abdi and Williams 2010). https://goo.gl/1Vtwq1.
- Correspondence analysis basics (blog post). https://goo.gl/Xyk8KT.
- Understanding the Math of Correspondence Analysis with Examples in R (blog post). https://goo.gl/H9hxf9


6 Multiple Correspondence Analysis

6.1 Introduction

Multiple correspondence analysis (MCA) is an extension of simple correspondence analysis (chapter 5) for summarizing and visualizing a data table containing more than two categorical variables. It can also be seen as a generalization of principal component analysis when the variables to be analyzed are categorical instead of quantitative (Abdi and Williams 2010).

MCA is generally used to analyse a data set from a survey. The goal is to identify:

- A group of individuals with similar profiles in their answers to the questions
- The associations between variable categories

Previously, we described how to compute and interpret simple correspondence analysis (chapter 5). In the current chapter, we demonstrate how to compute and visualize multiple correspondence analysis in R software using FactoMineR (for the analysis) and factoextra (for data visualization). Additionally, we'll show how to reveal the variables that contribute the most to explaining the variations in the data set. We continue by explaining how to predict the results for supplementary individuals and variables. Finally, we'll demonstrate how to filter MCA results in order to keep only the most contributing variables.


6.2 Computation

6.2.1 R packages

Several functions from different packages are available in the R software for computing multiple correspondence analysis. These functions/packages include:

- MCA() function [FactoMineR package]
- dudi.acm() function [ade4 package]
- epMCA() [ExPosition package]

No matter which function you decide to use, you can easily extract and visualize the MCA results using R functions provided in the factoextra R package.

Here, we'll use FactoMineR (for the analysis) and factoextra (for ggplot2-based elegant visualization). To install the two packages, type this:

install.packages(c("FactoMineR", "factoextra"))

Load the packages:

library("FactoMineR")
library("factoextra")

6.2.2 Data format

We'll use the demo data set poison available in FactoMineR package:

data(poison)
head(poison[, 1:7], 3)

##   Age Time   Sick Sex   Nausea Vomiting Abdominals
## 1   9   22 Sick_y   F Nausea_y  Vomit_n     Abdo_y
## 2   5    0 Sick_n   F Nausea_n  Vomit_n     Abdo_n
## 3   6   16 Sick_y   F Nausea_n  Vomit_y     Abdo_y

This data is the result of a survey carried out on primary school children who suffered from food poisoning. They were asked about their symptoms and about what they ate.

The data contains 55 rows (individuals) and 15 columns (variables). We'll use only some of these individuals (children) and variables to perform the multiple correspondence analysis. The coordinates of the remaining individuals and variables on the factor map will be predicted from the previous MCA results.

In MCA terminology, our data contains:

- Active individuals (rows 1:55): Individuals that are used in the multiple correspondence analysis.
- Active variables (columns 5:15): Variables that are used in the MCA.
- Supplementary variables: They don't participate in the MCA. The coordinates of these variables will be predicted.
  - Supplementary quantitative variables (quanti.sup): Columns 1 and 2, corresponding to the columns age and time, respectively.
  - Supplementary qualitative variables (quali.sup): Columns 3 and 4, corresponding to the columns Sick and Sex, respectively. These factor variables will be used to color individuals by groups.

Subset only active individuals and variables for multiple correspondence analysis:

poison.active <- poison[1:55, 5:15]
head(poison.active[, 1:6], 3)

##     Nausea Vomiting Abdominals   Fever   Diarrhae   Potato
## 1 Nausea_y  Vomit_n     Abdo_y Fever_y Diarrhea_y Potato_y
## 2 Nausea_n  Vomit_n     Abdo_n Fever_n Diarrhea_n Potato_y
## 3 Nausea_n  Vomit_y     Abdo_y Fever_y Diarrhea_y Potato_y

6.2.3 Data summary

The R base function summary() can be used to compute the frequency of variable categories. As the data table contains a large number of variables, we'll display only the results for the first 4 variables.

Statistical summaries:

# Summary of the 4 first variables
summary(poison.active)[, 1:4]

##       Nausea      Vomiting    Abdominals       Fever
##  Nausea_n:43   Vomit_n:33    Abdo_n:18    Fever_n:20
##  Nausea_y:12   Vomit_y:22    Abdo_y:37    Fever_y:35

The summary() function returns the size of each variable category.

It's also possible to plot the frequency of variable categories. The R code below plots the first 4 columns:

for (i in 1:4) {
  plot(poison.active[, i], main = colnames(poison.active)[i],
       ylab = "Count", col = "steelblue", las = 2)
}
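Category counts can also be checked numerically, which helps to spot very rare categories before running the analysis. The sketch below works on a small made-up data frame; the helper rare_categories() and the cutoff min.count are hypothetical names and values chosen for illustration. The same call could be tried on poison.active.

```r
# Toy survey data (hypothetical answers)
toy <- data.frame(
  Nausea = factor(c("n", "n", "n", "n", "n", "y")),
  Fever  = factor(c("y", "n", "y", "n", "y", "n"))
)

# Hypothetical helper: list the categories observed fewer than min.count times
rare_categories <- function(data, min.count = 2) {
  counts <- unlist(lapply(names(data), function(v) {
    tab <- table(data[[v]])
    # Label each count as "<variable>_<category>"
    setNames(as.vector(tab), paste(v, names(tab), sep = "_"))
  }))
  counts[counts < min.count]
}

rare_categories(toy)
```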


6.2.4 R code

The function MCA() [FactoMineR package] can be used. A simplified format is:

MCA(X, ncp = 5, graph = TRUE)

- X: a data frame with n rows (individuals) and p columns (categorical variables)
- ncp: number of dimensions kept in the final results.
- graph: a logical value. If TRUE a graph is displayed.

In the R code below, the MCA is performed only on the active individuals/variables:

res.mca <- MCA(poison.active, graph = FALSE)

The output of the MCA() function is a list including:

print(res.mca)

## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 55 individuals, described by 11 variables
## *The results are available in the following objects:
##
##    name               description
## 1  "$eig"             "eigenvalues"
## 2  "$var"             "results for the variables"
## 3  "$var$coord"       "coord. of the categories"
## 4  "$var$cos2"        "cos2 for the categories"
## 5  "$var$contrib"     "contributions of the categories"
## 6  "$var$v.test"      "v-test for the categories"
## 7  "$ind"             "results for the individuals"
## 8  "$ind$coord"       "coord. for the individuals"
## 9  "$ind$cos2"        "cos2 for the individuals"
## 10 "$ind$contrib"     "contributions of the individuals"
## 11 "$call"            "intermediate results"
## 12 "$call$marge.col"  "weights of columns"
## 13 "$call$marge.li"   "weights of rows"

The object that is created using the function MCA() contains a lot of information, stored in several lists and matrices. These values are described in the next section.

Note that the frequency plots of section 6.2.3 can be used to identify variable categories with a very low frequency. Such variable categories can distort the analysis and should be removed.


6.3 Visualization and interpretation

We'll use the factoextra R package to help in the interpretation and the visualization of the multiple correspondence analysis. No matter which function you decide to use [FactoMineR::MCA(), ade4::dudi.acm()], you can easily extract and visualize the results of multiple correspondence analysis using R functions provided in the factoextra R package.

These factoextra functions include:

- get_eigenvalue(res.mca): Extract the eigenvalues/variances retained by each dimension (axis)
- fviz_eig(res.mca): Visualize the eigenvalues/variances
- get_mca_ind(res.mca), get_mca_var(res.mca): Extract the results for individuals and variables, respectively.
- fviz_mca_ind(res.mca), fviz_mca_var(res.mca): Visualize the results for individuals and variables, respectively.
- fviz_mca_biplot(res.mca): Make a biplot of rows and columns.

In the next sections, we'll illustrate each of these functions.

6.3.1 Eigenvalues / Variances

The proportion of variances retained by the different dimensions (axes) can be extracted using the function get_eigenvalue() [factoextra package] as follows:

library("factoextra")
eig.val <- get_eigenvalue(res.mca)
# head(eig.val)

To visualize the percentages of inertia explained by each MCA dimension, use the function fviz_eig() or fviz_screeplot() [factoextra package]:

fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 45))

Note that MCA results are interpreted in the same way as the results of a simple correspondence analysis (CA). Therefore, it's strongly recommended to read the interpretation of simple CA, which has been comprehensively described in Chapter 5.
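One common way to see why MCA reads like a simple CA is that MCA can be obtained by applying CA to the indicator (complete disjunctive) matrix of the categorical variables. The base R sketch below builds such a matrix for a small made-up data set and checks a known property of this approach: the total inertia equals J/Q - 1, where J is the total number of categories and Q the number of variables. The data and the helper indicator() are hypothetical, and the FactoMineR implementation may differ in its details.

```r
# Toy data: 3 categorical variables with 2 categories each (hypothetical answers)
toy <- data.frame(
  Nausea = factor(c("y", "n", "n", "y", "n", "n")),
  Fever  = factor(c("y", "y", "n", "y", "n", "n")),
  Potato = factor(c("y", "y", "y", "n", "y", "n"))
)

# Hypothetical helper: one 0/1 column per category of a factor
indicator <- function(f, name) {
  m <- outer(as.character(f), levels(f), "==") * 1
  colnames(m) <- paste(name, levels(f), sep = "_")
  m
}
Z <- do.call(cbind, unname(Map(indicator, toy, names(toy))))

# Simple CA of the indicator matrix via SVD of the standardized residuals
P      <- Z / sum(Z)
r.mass <- rowSums(P)
c.mass <- colSums(P)
S      <- (P - r.mass %o% c.mass) / sqrt(r.mass %o% c.mass)
eig    <- svd(S)$d^2

total.inertia <- sum(eig)
total.inertia
ncol(Z) / ncol(toy) - 1   # J/Q - 1
```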


6.3.2 Biplot

The function fviz_mca_biplot() [factoextra package] is used to draw the biplot of individuals and variable categories:

fviz_mca_biplot(res.mca,
                repel = TRUE, # Avoid text overlapping (slow if many points)
                ggtheme = theme_minimal())

The plot above shows a global pattern within the data. Rows (individuals) are represented by blue points and columns (variable categories) by red triangles.

The distance between any row points or column points gives a measure of their similarity (or dissimilarity). Row points with similar profiles are close to each other on the factor map. The same holds true for column points.

6.3.3 Graph of variables

6.3.3.1 Results

The function get_mca_var() [in factoextra] is used to extract the results for variable categories. This function returns a list containing the coordinates, the cos2 and the contribution of variable categories:

var <- get_mca_var(res.mca)
var

## Multiple Correspondence Analysis Results for variables
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for categories"
## 2 "$cos2"    "Cos2 for categories"
## 3 "$contrib" "contributions of categories"

The components returned by get_mca_var() can be used to plot the variable categories as follows:

- var$coord: coordinates of variables to create a scatter plot
- var$cos2: represents the quality of the representation for variables on the factor map.
- var$contrib: contains the contributions (in percentage) of the variables to the definition of the dimensions.

The different components can be accessed as follows:

# Coordinates
head(var$coord)

# Cos2: quality on the factor map
head(var$cos2)

# Contributions to the principal components
head(var$contrib)

In this section, we'll describe how to visualize variable categories only. Next, we'll highlight variable categories according to either i) their quality of representation on the factor map or ii) their contributions to the dimensions.

6.3.3.2 Correlation between variables and principal dimensions

Note that it's possible to plot variable categories and to color them according to either i) their quality on the factor map (cos2) or ii) their contribution values to the definition of dimensions (contrib).


To visualize the correlation between variables and MCA principal dimensions, type this:

fviz_mca_var(res.mca, choice = "mca.cor",
             repel = TRUE, # Avoid text overlapping (slow)
             ggtheme = theme_minimal())

The correlation plot (choice = "mca.cor") helps to identify variables that are the most correlated with each dimension. The squared correlations between variables and the dimensions are used as coordinates.

It can be seen that the variables Diarrhae, Abdominals and Fever are the most correlated with dimension 1. Similarly, the variables Courgette and Potato are the most correlated with dimension 2.

6.3.3.3 Coordinates of variable categories

The R code below displays the coordinates of each variable category on each dimension:

head(round(var$coord, 2), 4)

##          Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
## Nausea_n  0.27  0.12 -0.27  0.03  0.07
## Nausea_y -0.96 -0.43  0.95 -0.12 -0.26
## Vomit_n   0.48 -0.41  0.08  0.27  0.05
## Vomit_y  -0.72  0.61 -0.13 -0.41 -0.08

Use the function fviz_mca_var() [in factoextra] to visualize only variable categories:

fviz_mca_var(res.mca,
             repel = TRUE, # Avoid text overlapping (slow)
             ggtheme = theme_minimal())


It's possible to change the color and the shape of the variable points using the arguments col.var and shape.var as follows:

fviz_mca_var(res.mca, col.var = "black", shape.var = 15,
             repel = TRUE)

The plot above shows the relationships between variable categories. It can be interpreted as follows:

- Variable categories with a similar profile are grouped together.
- Negatively correlated variable categories are positioned on opposite sides of the plot origin (opposed quadrants).
- The distance between category points and the origin measures the quality of the variable category on the factor map. Category points that are away from the origin are well represented on the factor map.
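As a small worked example of these rules, the distances between category points, and between each point and the origin, can be computed by hand. The sketch below re-enters the Dim 1 and Dim 2 coordinates printed in section 6.3.3.3 so that it runs without FactoMineR; with a real result one would use var$coord[, 1:2]. Keep in mind that distances measured on two dimensions only approximate the distances in the full space.

```r
# Dim 1 and Dim 2 coordinates copied from the var$coord output of section 6.3.3.3
coord <- matrix(c( 0.27,  0.12,
                  -0.96, -0.43,
                   0.48, -0.41,
                  -0.72,  0.61),
                ncol = 2, byrow = TRUE,
                dimnames = list(c("Nausea_n", "Nausea_y", "Vomit_n", "Vomit_y"),
                                c("Dim1", "Dim2")))

# Euclidean distances between category points on the first factor plane
d <- as.matrix(dist(coord))
round(d, 2)

# Distance of each category point from the origin
dist.origin <- sqrt(rowSums(coord^2))
round(dist.origin, 2)
```

In this output, Nausea_y lies further from the origin than Nausea_n, and the two categories of the same variable sit on opposite sides of the origin, as described in the list above.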

6.3.3.4 Quality of representation of variable categories

The two dimensions 1 and 2 are sufficient to retain 46% of the total inertia (variation) contained in the data. Not all the points are equally well displayed in the two dimensions.

The quality of the representation is called the squared cosine (cos2), which measures the degree of association between variable categories and a particular axis. The cos2 of variable categories can be extracted as follows:

head(var$cos2, 4)

##          Dim 1  Dim 2  Dim 3   Dim 4   Dim 5
## Nausea_n 0.256 0.0528 0.2527 0.00408 0.01947
## Nausea_y 0.256 0.0528 0.2527 0.00408 0.01947
## Vomit_n  0.344 0.2512 0.0107 0.11229 0.00413
## Vomit_y  0.344 0.2512 0.0107 0.11229 0.00413

If a variable category is well represented by two dimensions, the sum of the cos2 is close to one. For some of the variable categories, more than 2 dimensions are required to perfectly represent the data.
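As a small worked example, the quality of representation of a category on the first factor plane is the sum of its cos2 on dimensions 1 and 2. The sketch below re-enters by hand the cos2 values printed above, so it runs without FactoMineR; with a real result one would use var$cos2 directly.

```r
# cos2 values copied from the head(var$cos2, 4) output above
cos2 <- matrix(c(0.256, 0.0528, 0.2527, 0.00408, 0.01947,
                 0.256, 0.0528, 0.2527, 0.00408, 0.01947,
                 0.344, 0.2512, 0.0107, 0.11229, 0.00413,
                 0.344, 0.2512, 0.0107, 0.11229, 0.00413),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("Nausea_n", "Nausea_y", "Vomit_n", "Vomit_y"),
                               paste0("Dim", 1:5)))

# Quality of representation on the plane defined by dimensions 1 and 2
quality.12 <- rowSums(cos2[, 1:2])
quality.12

# Quality when the five retained dimensions are used
quality.5 <- rowSums(cos2)
quality.5
```

Neither sum is close to one for the Nausea categories on the first plane, which suggests that these categories need more than two dimensions to be well represented.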

It's possible to color variable categories by their cos2 values using the argument col.var = "cos2". This produces a color gradient, which can be customized using the argument gradient.cols. For instance, gradient.cols = c("white", "blue", "red") means that:

- variable categories with low cos2 values will be colored in "white"
- variable categories with mid cos2 values will be colored in "blue"
- variable categories with high cos2 values will be colored in "red"

# Color by cos2 values: quality on the factor map
fviz_mca_var(res.mca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE, # Avoid text overlapping
             ggtheme = theme_minimal())

Note that it's also possible to change the transparency of the variable categories according to their cos2 values using the option alpha.var = "cos2". For example, type this:

# Change the transparency by cos2 values
fviz_mca_var(res.mca, alpha.var = "cos2",
             repel = TRUE,
             ggtheme = theme_minimal())

You can visualize the cos2 of variable categories on all the dimensions using the corrplot package:

library("corrplot")
corrplot(var$cos2, is.corr = FALSE)

It's also possible to create a bar plot of variable cos2 using the function fviz_cos2() [in factoextra]:

# Cos2 of variable categories on Dim.1 and Dim.2
fviz_cos2(res.mca, choice = "var", axes = 1:2)

6.3.3.5 Contribution of variable categories to the dimensions

The contribution of the variable categories (in %) to the definition of the dimensions can be extracted as follows:

head(round(var$contrib, 2), 4)

##          Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
## Nausea_n  1.52  0.81  4.67  0.08  0.49
## Nausea_y  5.43  2.91 16.73  0.30  1.76
## Vomit_n   3.73  7.07  0.36  4.26  0.19
## Vomit_y   5.60 10.61  0.54  6.39  0.29

The variable categories with the larger values contribute the most to the definition of the dimensions. Variable categories that contribute the most to Dim.1 and Dim.2 are the most important in explaining the variability in the data set.

Note that, in the cos2 plots of the previous section, the variable categories Fish_n, Fish_y, Icecream_n and Icecream_y are not very well represented by the first two dimensions. This implies that the position of the corresponding points on the scatter plot should be interpreted with some caution. A higher dimensional solution is probably necessary.

The function fviz_contrib() [factoextra package] can be used to draw a bar plot of the contribution of variable categories. The R code below shows the top 15 variable categories contributing to the dimensions:

# Contributions of variable categories to dimension 1
fviz_contrib(res.mca, choice = "var", axes = 1, top = 15)

# Contributions of variable categories to dimension 2
fviz_contrib(res.mca, choice = "var", axes = 2, top = 15)

The total contributions to dimensions 1 and 2 are obtained as follows:

# Total contribution to dimensions 1 and 2
fviz_contrib(res.mca, choice = "var", axes = 1:2, top = 15)

The red dashed line on the graph above indicates the expected average value if the contributions were uniform. The calculation of the expected contribution value, under the null hypothesis, has been detailed in the principal component analysis chapter.
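For a single dimension, a uniform contribution simply means that each category contributes 100 divided by the number of categories. The sketch below illustrates this reference value on made-up contributions; the category names and values are hypothetical, and the exact rule used by fviz_contrib() when several axes are combined is not reproduced here.

```r
# Made-up contributions (in %) of four categories to one dimension
contrib <- c(cat_a = 40, cat_b = 35, cat_c = 15, cat_d = 10)

# Expected contribution if all categories contributed equally
expected <- 100 / length(contrib)
expected

# Categories contributing more than the expected average
above <- names(contrib)[contrib > expected]
above
```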

It can be seen that:

- The categories Abdo_n, Diarrhea_n, Fever_n and Mayo_n are the most important in the definition of the first dimension.
- The categories Courg_n, Potato_n, Vomit_y and Icecream_n contribute the most to dimension 2.

The most important (or contributing) variable categories can be highlighted on the scatter plot as follows:

fviz_mca_var(res.mca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE, # Avoid text overlapping (slow)
             ggtheme = theme_minimal()
             )


Note that it's also possible to control the transparency of variable categories according to their contribution values using the option alpha.var = "contrib". For example, type this:

# Change the transparency by contrib values
fviz_mca_var(res.mca, alpha.var = "contrib",
             repel = TRUE,
             ggtheme = theme_minimal())

The plot above gives an idea of which pole of the dimensions the categories are actually contributing to.

It is evident that the categories Abdo_n, Diarrhea_n, Fever_n and Mayo_n have an important contribution to the positive pole of the first dimension, while the categories Fever_y and Diarrhea_y have a major contribution to the negative pole of the first dimension, and so on.

6.3.4 Graph of individuals

6.3.4.1 Results

The function get_mca_ind() [in factoextra] is used to extract the results for individuals. This function returns a list containing the coordinates, the cos2 and the contributions of individuals:

ind <- get_mca_ind(res.mca)
ind

## Multiple Correspondence Analysis Results for individuals
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the individuals"
## 2 "$cos2"    "Cos2 for the individuals"
## 3 "$contrib" "contributions of the individuals"

To get access to the different components, use this:

# Coordinates of individuals
head(ind$coord)

# Quality of representation
head(ind$cos2)

# Contributions
head(ind$contrib)

6.3.4.2 Plots: quality and contribution

The function fviz_mca_ind() [in factoextra] is used to visualize only individuals. As with variable categories, it's also possible to color individuals by their cos2 values:

fviz_mca_ind(res.mca, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE, # Avoid text overlapping (slow if many points)
             ggtheme = theme_minimal())

The results for individuals give the same information as described for variable categories. For this reason, we'll just display the results for individuals in this section without comment.


The R code below creates bar plots of the cos2 and contributions of individuals:

# Cos2 of individuals
fviz_cos2(res.mca, choice = "ind", axes = 1:2, top = 20)

# Contribution of individuals to the dimensions
fviz_contrib(res.mca, choice = "ind", axes = 1:2, top = 20)

6.3.5 Color individuals by groups

The R code below colors the individuals by groups using the levels of the variable Vomiting. The argument habillage is used to specify the factor variable for coloring the individuals by groups. A concentration ellipse can also be added around each group using the argument addEllipses = TRUE. If you want a confidence ellipse around the mean point of categories, use ellipse.type = "confidence". The argument palette is used to change group colors.

fviz_mca_ind(res.mca,
             label = "none", # hide individual labels
             habillage = "Vomiting", # color by groups
             palette = c("#00AFBB", "#E7B800"),
             addEllipses = TRUE, ellipse.type = "confidence",
             ggtheme = theme_minimal())

Note that it's possible to color the individuals using any of the qualitative variables in the initial data table (poison).


Note that, to specify the value of the argument habillage, it's also possible to use the index of the column as follows (habillage = 2). Additionally, you can provide an external grouping variable as follows: habillage = poison$Vomiting. For example:

# habillage = index of the column to be used as grouping variable
fviz_mca_ind(res.mca, habillage = 2, addEllipses = TRUE)

# habillage = external grouping variable
fviz_mca_ind(res.mca, habillage = poison$Vomiting, addEllipses = TRUE)

If you want to color individuals using multiple categorical variables at the same time, use the function fviz_ellipses() [in factoextra] as follows:

fviz_ellipses(res.mca, c("Vomiting", "Fever"),
              geom = "point")

Alternatively, you can specify categorical variable indices:

fviz_ellipses(res.mca, 1:4, geom = "point")

6.3.6 Dimension description

The function dimdesc() [in FactoMineR] can be used to identify the variables that are the most correlated with a given dimension:

res.desc <- dimdesc(res.mca, axes = c(1, 2))

# Description of dimension 1
res.desc[[1]]

# Description of dimension 2
res.desc[[2]]


6.4 Supplementary elements

6.4.1 Definition and types

As described above (section 6.2.2), the data set poison contains:

- supplementary continuous variables (quanti.sup = 1:2, columns 1 and 2 corresponding to the columns age and time, respectively)
- supplementary qualitative variables (quali.sup = 3:4, corresponding to the columns Sick and Sex, respectively). These factor variables are used to color individuals by groups

The data doesn't contain supplementary individuals. However, for demonstration, we'll use the individuals 53:55 as supplementary individuals.

6.4.2 Specification in MCA

To specify supplementary individuals and variables, the function MCA() can be used as follows:

MCA(X, ind.sup = NULL, quanti.sup = NULL, quali.sup = NULL,
    graph = TRUE, axes = c(1, 2))

- X: a data frame. Rows are individuals and columns are variables.
- ind.sup: a numeric vector specifying the indexes of the supplementary individuals.
- quanti.sup, quali.sup: a numeric vector specifying, respectively, the indexes of the supplementary quantitative and qualitative variables.
- graph: a logical value. If TRUE a graph is displayed.
- axes: a vector of length 2 specifying the components to be plotted.

For example, type this:

res.mca <- MCA(poison, ind.sup = 53:55,
               quanti.sup = 1:2, quali.sup = 3:4, graph = FALSE)

6.4.3 Results

Supplementary variables and individuals are not used for the determination of the principal dimensions. Their coordinates are predicted using only the information provided by the performed multiple correspondence analysis on active variables/individuals.

The predicted results for supplementary individuals/variables can be extracted as follows:

# Supplementary qualitative variable categories
res.mca$quali.sup

# Supplementary quantitative variables
res.mca$quanti.sup

# Supplementary individuals
res.mca$ind.sup

6.4.4 Plots

To make a biplot of individuals and variable categories, type this:

# Biplot of individuals and variable categories
fviz_mca_biplot(res.mca, repel = TRUE,
                ggtheme = theme_minimal())

- Active individuals are in blue
- Supplementary individuals are in dark blue
- Active variable categories are in red
- Supplementary variable categories are in dark green

If you want to highlight the correlation between variables (active & supplementary) and dimensions, use the function fviz_mca_var() with the argument choice = "mca.cor":

fviz_mca_var(res.mca, choice = "mca.cor",
             repel = TRUE)

The R code below plots qualitative variable categories (active & supplementary variables):

fviz_mca_var(res.mca, repel = TRUE,
             ggtheme = theme_minimal())

For supplementary quantitative variables, type this:

fviz_mca_var(res.mca, choice = "quanti.sup",
             ggtheme = theme_minimal())

To visualize supplementary individuals, type this:

fviz_mca_ind(res.mca,
             label = "ind.sup", # Show the label of ind.sup only
             ggtheme = theme_minimal())


6.5 Filtering results

If you have many individuals/variable categories, it's possible to visualize only some of them using the arguments select.ind and select.var.

select.ind, select.var: a selection of individuals/variable categories to be drawn. Allowed values are NULL or a list containing the arguments name, cos2 or contrib:

- name: a character vector containing individuals/variable category names to be plotted
- cos2: if cos2 is in [0, 1], e.g. 0.6, then individuals/variable categories with a cos2 > 0.6 are plotted. If cos2 > 1, e.g. 5, then the top 5 active individuals/variable categories and top 5 supplementary individuals/variable categories with the highest cos2 are plotted
- contrib: if contrib > 1, e.g. 5, then the top 5 individuals/variable categories with the highest contributions are plotted

# Visualize variable categories with cos2 >= 0.4
fviz_mca_var(res.mca, select.var = list(cos2 = 0.4))

# Top 10 active variable categories with the highest cos2
fviz_mca_var(res.mca, select.var = list(cos2 = 10))

# Select by names
name <- list(name = c("Fever_n", "Abdo_y", "Diarrhea_n",
                      "Fever_y", "Vomit_y", "Vomit_n"))
fviz_mca_var(res.mca, select.var = name)

# Top 5 contributing individuals and variable categories
fviz_mca_biplot(res.mca, select.ind = list(contrib = 5),
                select.var = list(contrib = 5),
                ggtheme = theme_minimal())

When the selection is done according to the contribution values, supplementary individuals/variable categories are not shown because they don't contribute to the construction of the axes.


6.6 Exporting results

6.6.1 Export plots to PDF/PNG files

Two steps:

1. Create the plot of interest as an R object:

# Scree plot
scree.plot <- fviz_eig(res.mca)

# Biplot of individuals and variable categories
biplot.mca <- fviz_mca_biplot(res.mca)

2. Export the plots into a single pdf file as follows (one plot per page):

library(ggpubr)
ggexport(plotlist = list(scree.plot, biplot.mca),
         filename = "MCA.pdf")

More options at: Chapter 4 (section: Exporting results).

6.6.2 Export results to txt/csv files

An easy-to-use R function is write.infile() [in FactoMineR package].

# Export into a TXT file
write.infile(res.mca, "mca.txt", sep = "\t")

# Export into a CSV file
write.infile(res.mca, "mca.csv", sep = ";")


6.7 Summary

In conclusion, we described how to perform and interpret multiple correspondence analysis (MCA). We computed MCA using the MCA() function [FactoMineR package]. Next, we used the factoextra R package to produce ggplot2-based visualization of the MCA results.

Other functions [packages] to compute MCA in R include:

1. Using dudi.acm() [ade4]

library("ade4")
res.mca <- dudi.acm(poison.active, scannf = FALSE, nf = 5)

2. Using epMCA() [ExPosition]

library("ExPosition")
res.mca <- epMCA(poison.active, graph = FALSE, correction = "bg")

No matter which function you decide to use from the list above, the factoextra package can handle the output.

fviz_eig(res.mca)         # Scree plot
fviz_mca_biplot(res.mca)  # Biplot of rows and columns


6.8 Further reading

For the mathematical background behind MCA, refer to the following video courses, articles and books:

- Correspondence Analysis Course Using FactoMineR (Video courses). https://goo.gl/Hhh6hC
- Exploratory Multivariate Analysis by Example Using R (book) (F. Husson, Le, and Pagès 2017).
- Principal component analysis (article) (Abdi and Williams 2010). https://goo.gl/1Vtwq1.
- Correspondence analysis basics (blog post). https://goo.gl/Xyk8KT.


7FactorAnalysisofMixedData

Page 151: Practical Guide To Principal Component Methods in R (Multivariate Analysis Book 2)

7.1Introduction

Factoranalysisofmixeddata(FAMD)isaprincipalcomponentmethoddedicatedtoanalyzeadatasetcontainingbothquantitativeandqualitativevariables(J.Pagès2004).Itmakesitpossibletoanalyzethesimilaritybetweenindividualsbytakingintoaccountamixedtypesofvariables.Additionally,onecanexploretheassociationbetweenallvariables,bothquantitativeandqualitativevariables.

Roughlyspeaking,theFAMDalgorithmcanbeseenasamixedbetweenprincipalcomponentanalysis(PCA)(Chapter4)andmultiplecorrespondenceanalysis(MCA)(Chapter6).Inotherwords,itactsasPCAquantitativevariablesandasMCAforqualitativevariables.

Quantitativeandqualitativevariablesarenormalizedduringtheanalysisinordertobalancetheinfluenceofeachsetofvariables.
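The normalization step can be illustrated with a small base R sketch. It assumes the usual description of FAMD (quantitative variables are standardized; indicator columns of qualitative variables are divided by the square root of the category proportion, then centered) and uses a made-up toy data frame. It is not the FactoMineR implementation, whose internals may differ in details.

```r
# Toy data (hypothetical): two numeric variables and one 3-level factor
toy <- data.frame(
  x1 = c(2.1, 3.4, 1.8, 4.0, 2.9, 3.3),
  x2 = c(10, 14, 9, 15, 12, 11),
  g  = factor(c("a", "b", "a", "c", "b", "c"))
)
n <- nrow(toy)

# Quantitative part: center and scale to unit (population) variance, as in PCA
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))
quanti <- sapply(toy[c("x1", "x2")], function(v) (v - mean(v)) / pop_sd(v))

# Qualitative part: indicator coding, each column divided by sqrt(proportion),
# then centered (the MCA-like scaling assumed in this sketch)
dummies <- model.matrix(~ g - 1, data = toy)
prop    <- colMeans(dummies)
quali   <- sweep(dummies, 2, sqrt(prop), "/")
quali   <- sweep(quali, 2, colMeans(quali), "-")

# A plain PCA of the combined table gives the eigenvalues
Z   <- cbind(quanti, quali)
eig <- svd(Z / sqrt(n))$d^2
total_inertia <- sum(eig)
total_inertia
```

With this scaling each quantitative variable carries an inertia of 1 and a qualitative variable with m categories carries m - 1, so neither type dominates merely because of its coding.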

In the current chapter, we demonstrate how to compute and visualize factor analysis of mixed data using the FactoMineR (for the analysis) and factoextra (for data visualization) R packages.

7.2 Computation

7.2.1 R packages

Install the required packages as follows:

install.packages(c("FactoMineR", "factoextra"))

Load the packages:

library("FactoMineR")

library("factoextra")

7.2.2 Data format

We'll use a subset of the wine data set available in the FactoMineR package:

library("FactoMineR")

data(wine)

df <- wine[, c(1, 2, 16, 22, 29, 28, 30, 31)]

head(df[, 1:7], 4)

##           Label Soil Plante Acidity Harmony Intensity Overall.quality
## 2EL      Saumur Env1   2.00    2.11    3.14      2.86            3.39
## 1CHA     Saumur Env1   2.00    2.11    2.96      2.89            3.21
## 1FON Bourgueuil Env1   1.75    2.18    3.14      3.07            3.54
## 1VAU     Chinon Env2   2.30    3.18    2.04      2.46            2.46

To see the structure of the data, type this:

str(df)

The data contains 21 rows (wines, individuals) and 8 columns (variables):

The first two columns are factors (categorical variables): label (Saumur, Bourgueil or Chinon) and soil (Reference, Env1, Env2 or Env4).

The remaining columns are numeric (continuous variables).
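Because the analysis treats a column according to its type, it can help to check which columns are factors and which are numeric before running it. The sketch below uses the built-in iris data only as a stand-in, so that it runs without FactoMineR; the same calls can be applied to df.

```r
# Stand-in data: iris has four numeric columns and one factor column
is_quali  <- vapply(iris, is.factor,  logical(1))
is_quanti <- vapply(iris, is.numeric, logical(1))

quali_vars  <- names(iris)[is_quali]   # columns that would be handled as qualitative
quanti_vars <- names(iris)[is_quanti]  # columns that would be handled as quantitative

quali_vars
quanti_vars
```

A column that should be categorical but is stored as numbers (for example a coded group) can be converted with factor() before the analysis.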

7.2.3 R code

The function FAMD() [FactoMineR package] can be used to compute FAMD. A simplified format is:

FAMD(base, ncp = 5, sup.var = NULL, ind.sup = NULL, graph = TRUE)

base: a data frame with n rows (individuals) and p columns (variables).

ncp: the number of dimensions kept in the results (by default 5).

sup.var: a vector indicating the indexes of the supplementary variables.

ind.sup: a vector indicating the indexes of the supplementary individuals.

graph: a logical value. If TRUE a graph is displayed.

The goal of this study is to analyze the characteristics of the wines.

To compute FAMD, type this:

library(FactoMineR)

res.famd <- FAMD(df, graph = FALSE)

The output of the FAMD() function is a list including:

print(res.famd)

## *The results are available in the following objects:
##
##   name          description
## 1 "$eig"        "eigenvalues and inertia"
## 2 "$var"        "Results for the variables"
## 3 "$ind"        "results for the individuals"
## 4 "$quali.var"  "Results for the qualitative variables"
## 5 "$quanti.var" "Results for the quantitative variables"

7.3 Visualization and interpretation

We'll use the following factoextra functions:

get_eigenvalue(res.famd): Extract the eigenvalues/variances retained by each dimension (axis).

fviz_eig(res.famd): Visualize the eigenvalues/variances.

get_famd_ind(res.famd): Extract the results for individuals.

get_famd_var(res.famd): Extract the results for quantitative and qualitative variables.

fviz_famd_ind(res.famd), fviz_famd_var(res.famd): Visualize the results for individuals and variables, respectively.

In the next sections, we'll illustrate each of these functions.

7.3.1 Eigenvalues/Variances

The proportion of variances retained by the different dimensions (axes) can be extracted using the function get_eigenvalue() [factoextra package] as follows:

library("factoextra")

eig.val <- get_eigenvalue(res.famd)

head(eig.val)

##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1      4.832            43.92                        43.9
## Dim.2      1.857            16.88                        60.8
## Dim.3      1.582            14.39                        75.2
## Dim.4      1.149            10.45                        85.6
## Dim.5      0.652             5.93                        91.6
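The percentage columns follow directly from the eigenvalues. The sketch below recomputes them from the five eigenvalues printed above. The total inertia of 11 is an assumption: it is inferred from the table (4.832 is about 43.9% of 11) and matches 6 quantitative variables plus (3 - 1) + (4 - 1) category dimensions; it is not a value printed by the package here.

```r
# Eigenvalues copied from the table above (first five dimensions only)
eig <- c(Dim.1 = 4.832, Dim.2 = 1.857, Dim.3 = 1.582, Dim.4 = 1.149, Dim.5 = 0.652)

# Assumed total inertia (sum of ALL eigenvalues, not just the five shown)
total_inertia <- 11

variance_percent   <- 100 * eig / total_inertia
cumulative_percent <- cumsum(variance_percent)

round(cbind(eigenvalue = eig, variance_percent, cumulative_percent), 2)
```

Small differences in the last digit compared with the printed table are expected, because the eigenvalues above are already rounded.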

The function fviz_eig() or fviz_screeplot() [factoextra package] can be used to draw the scree plot (the percentages of inertia explained by each FAMD dimension):

fviz_screeplot(res.famd)

To help in the interpretation of FAMD, we highly recommend reading the interpretation of principal component analysis (Chapter 4) and multiple correspondence analysis (Chapter 6). Many of the graphs presented here have already been described in our previous chapters.

7.3.2 Graph of variables

7.3.2.1 All variables

The function get_famd_var() [in factoextra] is used to extract the results for variables. By default, this function returns a list containing the coordinates, the cos2 and the contribution of all variables:

var<-get_famd_var(res.famd)

var

## FAMD results for variables
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates"
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"

The different components can be accessed as follows:

# Coordinates of variables
head(var$coord)

# Cos2: quality of representation on the factor map
head(var$cos2)

# Contributions to the dimensions
head(var$contrib)

The following figure shows the correlation between variables - both quantitative and qualitative variables - and the principal dimensions, as well as the contribution of variables to the dimensions 1 and 2. The following functions [in the factoextra package] are used:

fviz_famd_var() to plot both quantitative and qualitative variables

fviz_contrib() to visualize the contribution of variables to the principal dimensions

# Plot of variables
fviz_famd_var(res.famd, repel = TRUE)

# Contribution to the first dimension
fviz_contrib(res.famd, "var", axes = 1)

# Contribution to the second dimension
fviz_contrib(res.famd, "var", axes = 2)

The red dashed line on the graph above indicates the expected average value if the contributions were uniform. Read more in Chapter 4.

From the plots above, it can be seen that:

variables that contribute the most to the first dimension are: Overall.quality and Harmony.

variables that contribute the most to the second dimension are: Soil and Acidity.
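The reference line used to read these bar plots can be reproduced by hand. For a single dimension, a uniform contribution would be 100 divided by the number of variables. The sketch below uses invented contribution values for eight placeholder variables (v1 to v8); they are not results from the wine analysis, and the threshold rule is our reading of the plot rather than a quote from the factoextra source.

```r
# Hypothetical contributions (in %) of 8 variables to one dimension; they sum to 100
contrib <- c(v1 = 22, v2 = 20, v3 = 15, v4 = 13, v5 = 11, v6 = 9, v7 = 6, v8 = 4)

# Expected contribution if every variable contributed equally
expected <- 100 / length(contrib)

# Variables above the reference line are the ones worth interpreting first
above <- names(contrib)[contrib > expected]

expected
above
```

With real results, the same comparison can be applied to one column of var$contrib.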

7.3.2.2 Quantitative variables

To extract the results for quantitative variables, type this:

quanti.var <- get_famd_var(res.famd, "quanti.var")

quanti.var

## FAMD results for quantitative variables
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates"
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"

In this section, we'll describe how to visualize quantitative variables. Additionally, we'll show how to highlight variables according to either i) their quality of representation on the factor map or ii) their contributions to the dimensions.

The R code below plots quantitative variables. We use repel = TRUE to avoid text overlapping.

fviz_famd_var(res.famd, "quanti.var", repel = TRUE,
              col.var = "black")

Briefly, the graph of variables (correlation circle) shows the relationship between variables, the quality of the representation of variables, as well as the correlation between variables and the dimensions. Read more at PCA (Chapter 4), MCA (Chapter 6) and MFA (Chapter 8).

The most contributing quantitative variables can be highlighted on the scatter plot using the argument col.var = "contrib". This produces a color gradient, which can be customized using the argument gradient.cols.

fviz_famd_var(res.famd,"quanti.var",col.var="contrib",

gradient.cols=c("#00AFBB","#E7B800","#FC4E07"),

repel=TRUE)

Similarly, you can highlight quantitative variables using their cos2 values representing the quality of representation on the factor map. If a variable is well represented by two dimensions, the sum of the cos2 is close to one. For some of the variables, more than 2 dimensions might be required to perfectly represent the data.

# Color by cos2 values: quality on the factor map

fviz_famd_var(res.famd,"quanti.var",col.var="cos2",

gradient.cols=c("#00AFBB","#E7B800","#FC4E07"),

repel=TRUE)
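The statement that the cos2 values of a variable sum to one over all dimensions can be checked with a plain PCA. The sketch below is only an illustration on the built-in USArrests data with base R's prcomp(); it does not use the FAMD results, and the cos2 here is computed as the squared correlation between each standardized variable and each principal component.

```r
# Standardize the variables, as in a PCA on the correlation matrix
X   <- scale(USArrests)
pca <- prcomp(X)

# Coordinates of variables = correlations with the principal components
coord <- cor(X, pca$x)
cos2  <- coord^2

# Quality of representation on the first factor map (Dim 1 and Dim 2)
cos2_two_dims <- rowSums(cos2[, 1:2])

# Over all dimensions the cos2 of each variable adds up to 1
cos2_all_dims <- rowSums(cos2)

round(cbind(cos2_two_dims, cos2_all_dims), 3)
```

A variable whose value in the first column is well below 1 needs a third or fourth dimension to be described properly.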

7.3.2.3 Graph of qualitative variables

Like quantitative variables, the results for qualitative variables can be extracted as follows:

quali.var <- get_famd_var(res.famd, "quali.var")

quali.var

## FAMD results for qualitative variable categories
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates"
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"

To visualize qualitative variables, type this:

fviz_famd_var(res.famd, "quali.var", col.var = "contrib",
              gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
              )

The plot above shows the categories of the categorical variables.

7.3.3 Graph of individuals

To get the results for individuals, type this:

ind <- get_famd_ind(res.famd)

ind

## FAMD results for individuals
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates"
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"

To plot individuals, use the function fviz_famd_ind() [in factoextra]. By default, individuals are colored in blue. However, like variables, it's also possible to color individuals by their cos2 and contribution values:

fviz_famd_ind(res.famd,col.ind="cos2",

gradient.cols=c("#00AFBB","#E7B800","#FC4E07"),

repel=TRUE)

Individuals with similar profiles are close to each other on the factor map. For the interpretation, read more at Chapter 6 (MCA) and Chapter 8 (MFA).

Note that it's possible to color the individuals using any of the qualitative variables in the initial data table. To do this, the argument habillage is used in the fviz_famd_ind() function. For example, if you want to color the wines according to the qualitative variable "Label", type this:

fviz_famd_ind(res.famd,
              habillage = "Label", # color by groups
              palette = c("#00AFBB", "#E7B800", "#FC4E07"),
              addEllipses = TRUE, ellipse.type = "confidence",
              repel = TRUE # Avoid text overlapping
              )

In the plot above, the qualitative variable categories are shown in black. Reference, Env1, Env2 and Env4 are the categories of the soil. Saumur, Bourgueuil and Chinon are the categories of the wine Label. If you don't want to show them on the plot, use the argument invisible = "quali.var".

If you want to color individuals using multiple categorical variables at the same time, use the function fviz_ellipses() [in factoextra] as follows:

fviz_ellipses(res.famd,c("Label","Soil"),repel=TRUE)

Alternatively, you can specify categorical variable indices:

fviz_ellipses(res.famd,1:2,geom="point")

7.4 Summary

Factor analysis of mixed data (FAMD) makes it possible to analyze a data set in which individuals are described by both qualitative and quantitative variables. In this chapter, we described how to perform and interpret FAMD using the FactoMineR and factoextra R packages.

7.5 Further reading

Factor Analysis of Mixed Data Using FactoMineR (video course). https://goo.gl/64gY3R

8 Multiple Factor Analysis

8.1 Introduction

Multiple factor analysis (MFA) (J. Pagès 2002) is a multivariate data analysis method for summarizing and visualizing a complex data table in which individuals are described by several sets of variables (quantitative and/or qualitative) structured into groups. It takes into account the contribution of all active groups of variables to define the distance between individuals. The number of variables in each group may differ and the nature of the variables (qualitative or quantitative) can vary from one group to the other, but the variables should be of the same nature in a given group (Abdi and Williams 2010).

MFA may be considered as a general factor analysis. Roughly, the core of MFA is based on:

Principal component analysis (PCA) (Chapter 4) when variables are quantitative,

Multiple correspondence analysis (MCA) (Chapter 6) when variables are qualitative.

This global analysis, where multiple sets of variables are simultaneously considered, requires balancing the influence of each set of variables. Therefore, in MFA, the variables are weighted during the analysis. Variables in the same group are normalized using the same weighting value, which can vary from one group to another. Technically, MFA assigns to each variable of group j a weight equal to the inverse of the first eigenvalue of the analysis (PCA or MCA according to the type of variable) of the group j.
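The weighting rule can be sketched in base R for quantitative groups. The example below is an illustration under stated assumptions, not the FactoMineR implementation: it uses the built-in iris measurements as a stand-in, splits them into two hypothetical groups (sepal and petal), and applies the weight 1/lambda1 by dividing each column by the square root of its group's first eigenvalue.

```r
# Stand-in data: four standardized measurements, split into two hypothetical groups
X <- scale(iris[, 1:4])
groups <- list(sepal = 1:2, petal = 3:4)

# First eigenvalue of the separate PCA of each group
first_eig <- sapply(groups, function(idx) eigen(cor(X[, idx]))$values[1])

# A variable weight of 1 / lambda1 amounts to dividing columns by sqrt(lambda1)
Xw <- X
for (g in names(groups)) {
  Xw[, groups[[g]]] <- X[, groups[[g]]] / sqrt(first_eig[[g]])
}

# After weighting, the first eigenvalue of every group equals 1
check <- sapply(groups, function(idx) eigen(cov(Xw[, idx]))$values[1])

# Global PCA on the weighted table
global <- prcomp(Xw, center = FALSE, scale. = FALSE)

first_eig
check
```

Since every group now has a maximal axial inertia of 1, no single group can generate the first global dimension on its own.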

Multiple factor analysis can be used in a variety of fields (J. Pagès 2002), where the variables are organized into groups:

1. Survey analysis, where an individual is a person; a variable is a question. Questions are organized by themes (groups of questions).

2. Sensory analysis, where an individual is a food product. A first set of variables includes sensory variables (sweetness, bitterness, etc.); a second one includes chemical variables (pH, glucose rate, etc.).

3. Ecology, where an individual is an observation place. A first set of variables describes soil characteristics; a second one describes flora.

4. Time series, where several individuals are observed at different dates. In this situation, there are commonly two ways of defining groups of variables:

Generally, variables observed at the same time (date) are gathered together.

When variables are the same from one date to the others, each set can gather the different dates for one variable.

In the current chapter, we show how to compute and visualize multiple factor analysis in R software using FactoMineR (for the analysis) and factoextra (for data visualization). Additionally, we'll show how to reveal the variables that contribute the most to explaining the variations in the data set.

8.2 Computation

8.2.1 R packages

Install FactoMineR and factoextra as follows:

install.packages(c("FactoMineR", "factoextra"))

Load the packages:

library("FactoMineR")

library("factoextra")

8.2.2 Data format

We'll use the demo data set wine available in the FactoMineR package. This data set is about a sensory evaluation of wines by different judges.

library("FactoMineR")

data(wine)

colnames(wine)

##  [1] "Label"                         "Soil"
##  [3] "Odor.Intensity.before.shaking" "Aroma.quality.before.shaking"
##  [5] "Fruity.before.shaking"         "Flower.before.shaking"
##  [7] "Spice.before.shaking"          "Visual.intensity"
##  [9] "Nuance"                        "Surface.feeling"
## [11] "Odor.Intensity"                "Quality.of.odour"
## [13] "Fruity"                        "Flower"
## [15] "Spice"                         "Plante"
## [17] "Phenolic"                      "Aroma.intensity"
## [19] "Aroma.persistency"             "Aroma.quality"
## [21] "Attack.intensity"              "Acidity"
## [23] "Astringency"                   "Alcohol"
## [25] "Balance"                       "Smooth"
## [27] "Bitterness"                    "Intensity"
## [29] "Harmony"                       "Overall.quality"
## [31] "Typical"

An image of the data is shown below:

Data format for Multiple Factor Analysis

(Image source, FactoMineR, http://factominer.free.fr)

The data contains 21 rows (wines, individuals) and 31 columns (variables):

The first two columns are categorical variables: label (Saumur, Bourgueil or Chinon) and soil (Reference, Env1, Env2 or Env4).

The next 29 columns are continuous sensory variables. For each wine, the value is the mean score for all the judges.

The goal of this study is to analyze the characteristics of the wines.

The variables are organized in groups as follows:

1. First group - A group of categorical variables specifying the origin of the wines, including the variables label and soil corresponding to the first 2 columns in the data table. In FactoMineR terminology, the argument group = 2 is used to define the first 2 columns as a group.

2. Second group - A group of continuous variables describing the odor of the wines before shaking, including the variables: Odor.Intensity.before.shaking, Aroma.quality.before.shaking, Fruity.before.shaking, Flower.before.shaking and Spice.before.shaking. These variables correspond to the next 5 columns after the first group. FactoMineR terminology: group = 5.

3. Third group - A group of continuous variables quantifying the visual inspection of the wines, including the variables: Visual.intensity, Nuance and Surface.feeling. These variables correspond to the next 3 columns after the second group. FactoMineR terminology: group = 3.

4. Fourth group - A group of continuous variables concerning the odor of the wines after shaking, including the variables: Odor.Intensity, Quality.of.odour, Fruity, Flower, Spice, Plante, Phenolic, Aroma.intensity, Aroma.persistency and Aroma.quality. These variables correspond to the next 10 columns after the third group. FactoMineR terminology: group = 10.

5. Fifth group - A group of continuous variables evaluating the taste of the wines, including the variables Attack.intensity, Acidity, Astringency, Alcohol, Balance, Smooth, Bitterness, Intensity and Harmony. These variables correspond to the next 9 columns after the fourth group. FactoMineR terminology: group = 9.

6. Sixth group - A group of continuous variables concerning the overall judgement of the wines, including the variables Overall.quality and Typical. These variables correspond to the next 2 columns after the fifth group. FactoMineR terminology: group = 2.

In summary:

We have 6 groups of variables, which can be specified to FactoMineR as follows: group = c(2, 5, 3, 10, 9, 2).

These groups can be named as follows: name.group = c("origin", "odor", "visual", "odor.after.shaking", "taste", "overall").

Among the 6 groups of variables, one is categorical and five groups contain continuous variables. It's recommended to standardize the continuous variables during the analysis. Standardization makes variables comparable in the situation where the variables are measured in different units. In FactoMineR, the argument type = "s" specifies that a given group of variables should be standardized. If you don't want standardization, use type = "c". To specify categorical variables, type = "n" is used. In our example, we'll use type = c("n", "s", "s", "s", "s", "s").

8.2.3 R code

The function MFA() [FactoMineR package] can be used. A simplified format is:

MFA(base, group, type = rep("s", length(group)), ind.sup = NULL,
    name.group = NULL, num.group.sup = NULL, graph = TRUE)

base: a data frame with n rows (individuals) and p columns (variables).

group: a vector with the number of variables in each group.

type: the type of variables in each group. By default, all variables are quantitative and scaled to unit variance. Allowed values include:

"c" or "s" for quantitative variables. If "s", the variables are scaled to unit variance.

"n" for categorical variables.

"f" for frequencies (from a contingency table).

ind.sup: a vector indicating the indexes of the supplementary individuals.

name.group: a vector containing the names of the groups (by default, NULL and the groups are named group.1, group.2 and so on).

num.group.sup: the indexes of the illustrative groups (by default, NULL and no group is illustrative).

graph: a logical value. If TRUE a graph is displayed.
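Since the group argument is positional (it counts consecutive columns), a quick sanity check before calling MFA() can catch a mis-specified layout. The sketch below uses only base R and the group sizes, names and types given in this chapter; the helper objects (column_group, layout) are names introduced here for illustration.

```r
group      <- c(2, 5, 3, 10, 9, 2)
name.group <- c("origin", "odor", "visual", "odor.after.shaking", "taste", "overall")
type       <- c("n", "s", "s", "s", "s", "s")

# The three vectors must describe the same number of groups
stopifnot(length(group) == length(name.group), length(group) == length(type))

# The group sizes must add up to the number of columns in the data table
n_columns <- sum(group)

# Group label of every column, in column order
column_group <- rep(name.group, times = group)

# First and last column index of each group
last_col  <- cumsum(group)
first_col <- last_col - group + 1
layout <- data.frame(group = name.group, type, first_col, last_col)

n_columns
layout
```

With the wine data loaded, comparing n_columns with ncol(wine) confirms that every column has been assigned to exactly one group.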

The R code below performs the MFA on the wines data using the groups: odor, visual, odor after shaking and taste. These groups are named active groups. The remaining groups of variables - origin (the first group) and overall judgement (the sixth group) - are named supplementary groups; num.group.sup = c(1, 6):

library(FactoMineR)

data(wine)

res.mfa<-MFA(wine,

group=c(2,5,3,10,9,2),

type=c("n","s","s","s","s","s"),

name.group=c("origin","odor","visual",

"odor.after.shaking","taste","overall"),

num.group.sup=c(1,6),

graph=FALSE)

The output of the MFA() function is a list including:

print(res.mfa)

## **Results of the Multiple Factor Analysis (MFA)**
## The analysis was performed on 21 individuals, described by 31 variables
## *Results are available in the following objects:
##
##    name                 description
## 1  "$eig"               "eigenvalues"
## 2  "$separate.analyses" "separate analyses for each group of variables"
## 3  "$group"             "results for all the groups"
## 4  "$partial.axes"      "results for the partial axes"
## 5  "$inertia.ratio"     "inertia ratio"
## 6  "$ind"               "results for the individuals"
## 7  "$quanti.var"        "results for the quantitative variables"
## 8  "$quanti.var.sup"    "results for the quantitative supplementary variables"
## 9  "$quali.var.sup"     "results for the categorical supplementary variables"
## 10 "$summary.quanti"    "summary for the quantitative variables"
## 11 "$summary.quali"     "summary for the categorical variables"
## 12 "$global.pca"        "results for the global PCA"

8.3 Visualization and interpretation

We'll use the factoextra R package to help in the interpretation and the visualization of the multiple factor analysis.

The functions below [in factoextra package] will be used:

get_eigenvalue(res.mfa): Extract the eigenvalues/variances retained by each dimension (axis).

fviz_eig(res.mfa): Visualize the eigenvalues/variances.

get_mfa_ind(res.mfa): Extract the results for individuals.

get_mfa_var(res.mfa): Extract the results for quantitative and qualitative variables, as well as for groups of variables.

fviz_mfa_ind(res.mfa), fviz_mfa_var(res.mfa): Visualize the results for individuals and variables, respectively.

In the next sections, we'll illustrate each of these functions.

8.3.1 Eigenvalues/Variances

The proportion of variances retained by the different dimensions (axes) can be extracted using the function get_eigenvalue() [factoextra package] as follows:

library("factoextra")

eig.val <- get_eigenvalue(res.mfa)

head(eig.val)

##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1      3.462            49.38                        49.4
## Dim.2      1.367            19.49                        68.9
## Dim.3      0.615             8.78                        77.7
## Dim.4      0.372             5.31                        83.0
## Dim.5      0.270             3.86                        86.8
## Dim.6      0.202             2.89                        89.7

The function fviz_eig() or fviz_screeplot() [factoextra package] can be used to draw the scree plot:

fviz_screeplot(res.mfa)

To help in the interpretation of MFA, we highly recommend reading the interpretation of principal component analysis (Chapter 4), simple correspondence analysis (the correspondence analysis chapter) and multiple correspondence analysis (Chapter 6). Many of the graphs presented here have already been described in previous chapters.

8.3.2 Graph of variables

8.3.2.1 Groups of variables

The function get_mfa_var() [in factoextra] is used to extract the results for groups of variables. This function returns a list containing the coordinates, the cos2 and the contribution of groups, as well as the correlation between groups and the principal dimensions:

group<-get_mfa_var(res.mfa,"group")

group

## Multiple Factor Analysis results for variable groups
##  ===================================================
##   Name           Description
## 1 "$coord"       "Coordinates"
## 2 "$cos2"        "Cos2, quality of representation"
## 3 "$contrib"     "Contributions"
## 4 "$correlation" "Correlation between groups and principal dimensions"

The different components can be accessed as follows:

# Coordinates of groups
head(group$coord)

# Cos2: quality of representation on the factor map
head(group$cos2)

# Contributions to the dimensions
head(group$contrib)

To plot the groups of variables, type this:

fviz_mfa_var(res.mfa, "group")

red color = active groups of variables

green color = supplementary groups of variables

The plot above illustrates the correlation between groups and dimensions. The coordinates of the four active groups on the first dimension are almost identical. This means that they contribute similarly to the first dimension. Concerning the second dimension, the two groups - odor and odor.after.shaking - have the highest coordinates, indicating the highest contribution to the second dimension.

To draw a bar plot of the groups' contribution to the dimensions, use the function fviz_contrib():

# Contribution to the first dimension

fviz_contrib(res.mfa,"group",axes=1)

# Contribution to the second dimension

fviz_contrib(res.mfa, "group", axes = 2)

8.3.2.2 Quantitative variables

The function get_mfa_var() [in factoextra] is used to extract the results for quantitative variables. This function returns a list containing the coordinates, the cos2 and the contribution of variables:

quanti.var<-get_mfa_var(res.mfa,"quanti.var")

quanti.var

## Multiple Factor Analysis results for quantitative variables
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates"
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"

The different components can be accessed as follows:

# Coordinates
head(quanti.var$coord)

# Cos2: quality on the factor map
head(quanti.var$cos2)

# Contributions to the dimensions
head(quanti.var$contrib)

In this section, we'll describe how to visualize quantitative variables colored by groups. Next, we'll highlight variables according to either i) their quality of representation on the factor map or ii) their contributions to the dimensions.

To interpret the graphs presented here, read the chapters on PCA (Chapter 4) and MCA (Chapter 6).

Correlation between quantitative variables and dimensions. The R code below plots quantitative variables colored by groups. The argument palette is used to change group colors (see ?ggpubr::ggpar for more information about palette). Supplementary quantitative variables are shown as dashed arrows in violet. We use repel = TRUE to avoid text overlapping.

fviz_mfa_var(res.mfa,"quanti.var",palette="jco",

col.var.sup="violet",repel=TRUE)

To make the plot more readable, we can use geom = c("point", "text") instead of geom = c("arrow", "text"). We'll also change the legend position from "right" to "bottom", using the argument legend = "bottom":

fviz_mfa_var(res.mfa,"quanti.var",palette="jco",

col.var.sup="violet",repel=TRUE,

geom=c("point","text"),legend="bottom")

Briefly, the graph of variables (correlation circle) shows the relationship between variables, the quality of the representation of variables, as well as the correlation between variables and the dimensions:

Positively correlated variables are grouped together, whereas negatively correlated ones are positioned on opposite sides of the plot origin (opposed quadrants).

The distance between variable points and the origin measures the quality of the variable on the factor map. Variable points that are away from the origin are well represented on the factor map.

For a given dimension, the variables most correlated with it lie close to its axis.

For example, the first dimension represents the positive sentiments about wines: "intensity" and "harmony". The most correlated variables to the second dimension are: i) Spice before shaking and Odor intensity before shaking for the odor group; ii) Spice, Plante and Odor intensity for the odor after shaking group and iii) Bitterness for the taste group. This dimension represents essentially the "spiciness" and the vegetal characteristic due to olfaction.

The contribution of quantitative variables (in %) to the definition of the dimensions can be visualized using the function fviz_contrib() [factoextra package]. Variables are colored by groups. The R code below shows the top 20 variables contributing to the dimensions:

# Contributions to dimension 1
fviz_contrib(res.mfa, choice = "quanti.var", axes = 1, top = 20,
             palette = "jco")

# Contributions to dimension 2
fviz_contrib(res.mfa, choice = "quanti.var", axes = 2, top = 20,
             palette = "jco")

The red dashed line on the graph above indicates the expected average value if the contributions were uniform. The calculation of the expected contribution value, under the null hypothesis, has been detailed in the principal component analysis chapter (Chapter 4).

The most contributing quantitative variables can be highlighted on the scatter plot using the argument col.var = "contrib". This produces a color gradient, which can be customized using the argument gradient.cols.

fviz_mfa_var(res.mfa,"quanti.var",col.var="contrib",

gradient.cols=c("#00AFBB","#E7B800","#FC4E07"),

col.var.sup="violet",repel=TRUE,

geom=c("point","text"))

The variables with the larger values contribute the most to the definition of the dimensions. Variables that contribute the most to Dim.1 and Dim.2 are the most important in explaining the variability in the data set.

Similarly, you can highlight quantitative variables using their cos2 values representing the quality of representation on the factor map. If a variable is well represented by two dimensions, the sum of the cos2 is close to one. For some of the variables, more than 2 dimensions might be required to perfectly represent the data.

# Color by cos2 values: quality on the factor map

fviz_mfa_var(res.mfa,col.var="cos2",

gradient.cols=c("#00AFBB","#E7B800","#FC4E07"),

col.var.sup="violet",repel=TRUE)

To create a bar plot of variables cos2, type this:

fviz_cos2(res.mfa, choice = "quanti.var", axes = 1)

8.3.3 Graph of individuals

To get the results for individuals, type this:

ind <- get_mfa_ind(res.mfa)

ind

## Multiple Factor Analysis results for individuals
##  ===================================================
##   Name                      Description
## 1 "$coord"                  "Coordinates"
## 2 "$cos2"                   "Cos2, quality of representation"
## 3 "$contrib"                "Contributions"
## 4 "$coord.partiel"          "Partial coordinates"
## 5 "$within.inertia"         "Within inertia"
## 6 "$within.partial.inertia" "Within partial inertia"

To plot individuals, use the function fviz_mfa_ind() [in factoextra]. By default, individuals are colored in blue. However, like variables, it's also possible to color individuals by their cos2 values:

fviz_mfa_ind(res.mfa,col.ind="cos2",

gradient.cols=c("#00AFBB","#E7B800","#FC4E07"),

repel=TRUE)

Individuals with similar profiles are close to each other on the factor map. The first axis mainly opposes the wine 1DAM to the wines 1VAU and 2ING. As described in the previous section, the first dimension represents the harmony and the intensity of wines. Thus, the wine 1DAM (positive coordinates) was evaluated as the most "intense" and "harmonious", contrary to wines 1VAU and 2ING (negative coordinates), which are the least "intense" and "harmonious". The second axis is essentially associated with the two wines T1 and T2, characterized by a strong value of the variables Spice.before.shaking and Odor.Intensity.before.shaking.

Most of the supplementary qualitative variable categories are close to the origin of the map. This result indicates that the concerned categories are not related to the first axis (wine "intensity" & "harmony") or the second axis (wine T1 and T2).

The category Env4 has high coordinates on the second axis related to T1 and T2.

The category "Reference" is known to be related to an excellent wine-producing soil. As expected, our analysis demonstrates that the category "Reference" has high coordinates on the first axis, which is positively correlated with wines "intensity" and "harmony".

In the plot above, the supplementary qualitative variable categories are shown in black. Reference, Env1, Env2 and Env4 are the categories of the soil. Saumur, Bourgueuil and Chinon are the categories of the wine Label. If you don't want to show them on the plot, use the argument invisible = "quali.var".

Note that it's possible to color the individuals using any of the qualitative variables in the initial data table. To do this, the argument habillage is used in the fviz_mfa_ind() function. For example, if you want to color the wines according to the supplementary qualitative variable "Label", type this:

fviz_mfa_ind(res.mfa,
             habillage = "Label", # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, ellipse.type = "confidence",
             repel = TRUE # Avoid text overlapping
             )

If you want to color individuals using multiple categorical variables at the same time, use the function fviz_ellipses() [in factoextra] as follows:

fviz_ellipses(res.mfa,c("Label","Soil"),repel=TRUE)

Alternatively, you can specify categorical variable indices:

fviz_ellipses(res.mfa, 1:2, geom = "point")

8.3.4 Graph of partial individuals

The results for individuals obtained from the analysis performed with a single group are named partial individuals. In other words, an individual considered from the point of view of a single group is called a partial individual.

In the default fviz_mfa_ind() plot, for a given individual, the point corresponds to the mean individual or the center of gravity of the partial points of the individual. That is, the individual viewed by all groups of variables.

For a given individual, there are as many partial points as groups of variables.
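The relation between partial points and the mean individual can be written out in a few lines. The coordinates below are invented for one hypothetical wine seen by the four active groups; they are not taken from res.mfa. The sketch only shows that the mean point is the average of the partial points, and how far each group's view lies from it.

```r
# Hypothetical partial coordinates of one individual (rows = groups, columns = dimensions)
partial <- rbind(
  odor               = c(2.1,  0.4),
  visual             = c(1.5, -0.2),
  odor.after.shaking = c(1.8,  0.1),
  taste              = c(1.4,  0.5)
)
colnames(partial) <- c("Dim.1", "Dim.2")

# Mean individual = barycenter (center of gravity) of the partial points
mean_point <- colMeans(partial)

# Distance of each partial point to the barycenter: large values flag groups
# whose description of this individual departs from the consensus
spread <- sqrt(rowSums(sweep(partial, 2, mean_point)^2))

mean_point
round(spread, 3)
```

With real results, the partial coordinates are stored in the $coord.partiel component listed above.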

Thegraphofpartialindividualsrepresentseachwineviewedbyeachgroupanditsbarycenter.Toplotthepartialpointsofallindividuals,typethis:

fviz_mfa_ind(res.mfa,partial="all")


If you want to visualize partial points for wines of interest, let's say c("1DAM", "1VAU", "2ING"), use this:

fviz_mfa_ind(res.mfa, partial = c("1DAM", "1VAU", "2ING"))

Red color represents the wines seen by only the odor variables; violet color represents the wines seen by only the visual variables, and so on.

The wine 1DAM has been described in the previous section as particularly "intense" and "harmonious", particularly by the odor group: it has a high coordinate on the first axis from the point of view of the odor variables group compared to the point of view of the other groups.

From the odor group's point of view, 2ING was more "intense" and "harmonious" than 1VAU, but from the taste group's point of view, 1VAU was more "intense" and "harmonious" than 2ING.

8.3.5 Graph of partial axes

The graph of partial axes shows the relationship between the principal axes of the MFA and the ones obtained from analyzing each group using either a PCA (for groups of continuous variables) or an MCA (for qualitative variables).


fviz_mfa_axes(res.mfa)

It can be seen that the first dimension of each group is highly correlated to the MFA's first one. The second dimension of the MFA is essentially correlated to the second dimension of the olfactory groups.


8.4 Summary

Multiple factor analysis (MFA) makes it possible to analyse individuals characterized by multiple sets of variables. In this chapter, we described how to perform and interpret MFA using the FactoMineR and factoextra R packages.


8.5 Further reading

For the mathematical background behind MFA, refer to the following video courses, articles and books:

- Multiple Factor Analysis Course Using FactoMineR (Video courses). https://goo.gl/WcmHHt.
- Exploratory Multivariate Analysis by Example Using R (book) (F. Husson, Le, and Pagès 2017).
- Principal component analysis (article) (Abdi and Williams 2010). https://goo.gl/1Vtwq1.
- Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: Multiple Factor Analysis approach (Tayrac et al. 2009).


9 HCPC: Hierarchical Clustering on Principal Components


9.1 Introduction

Clustering is one of the important data mining methods for discovering knowledge in multivariate data sets. The goal is to identify groups (i.e. clusters) of similar objects within a data set of interest. To learn more about clustering, you can read our book entitled "Practical Guide to Cluster Analysis in R" (https://goo.gl/DmJ5y5).

Briefly, the two most common clustering strategies are:

1. Hierarchical clustering, used for identifying groups of similar observations in a data set.
2. Partitioning clustering such as the k-means algorithm, used for splitting a data set into several groups.

The HCPC (Hierarchical Clustering on Principal Components) approach allows us to combine the three standard methods used in multivariate data analyses (Husson, Josse, and Pagès 2010):

1. Principal component methods (PCA, CA, MCA, FAMD, MFA),
2. Hierarchical clustering and
3. Partitioning clustering, particularly the k-means method.

This chapter describes WHY and HOW to combine principal components and clustering methods. Finally, we demonstrate how to compute and visualize HCPC using R software.


9.2 Why HCPC?

Combining principal component methods and clustering methods is useful in at least three situations.

9.2.1 Case 1: Continuous variables

When you have a multidimensional data set containing multiple continuous variables, principal component analysis (PCA) can be used to reduce the dimension of the data to a few continuous variables containing the most important information in the data. Next, you can perform cluster analysis on the PCA results.

The PCA step can be considered as a denoising step which can lead to a more stable clustering. This might be very useful if you have a large data set with multiple variables, such as gene expression data.

9.2.2 Case 2: Clustering on categorical data

In order to perform clustering analysis on categorical data, correspondence analysis (CA, for analyzing a contingency table) and multiple correspondence analysis (MCA, for analyzing multidimensional categorical variables) can be used to transform categorical variables into a set of a few continuous variables (the principal components). The cluster analysis can then be applied on the (M)CA results.

In this case, the (M)CA method can be considered as a pre-processing step which makes it possible to compute clustering on categorical data.

9.2.3 Case 3: Clustering on mixed data

When you have mixed data containing both continuous and categorical variables, you can first perform FAMD (factor analysis of mixed data) or MFA (multiple factor analysis). Next, you can apply cluster analysis on the FAMD/MFA outputs.


9.3 Algorithm of the HCPC method

The algorithm of the HCPC method, as implemented in the FactoMineR package, can be summarized as follows:

1. Compute principal component methods: PCA, (M)CA or MFA depending on the types of variables in the data set and the structure of the data set. At this step, you can choose the number of dimensions to be retained in the output by specifying the argument ncp. The default value is 5.

2. Compute hierarchical clustering: Hierarchical clustering is performed using Ward's criterion on the selected principal components. Ward's criterion is used in the hierarchical clustering because, like principal component analysis, it is based on the multidimensional variance.

3. Choose the number of clusters based on the hierarchical tree: An initial partitioning is performed by cutting the hierarchical tree.

4. Perform k-means clustering to improve the initial partition obtained from hierarchical clustering. The final partitioning solution, obtained after consolidation with k-means, can be (slightly) different from the one obtained with the hierarchical clustering.
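The four steps above can be sketched with base R functions only. This is an illustrative approximation, not the FactoMineR implementation: prcomp(), hclust() with method = "ward.D2", cutree() and kmeans() stand in for the corresponding steps, the number of clusters k = 4 is fixed by hand as an assumption, and the object names (pcs, tree, init, final) are hypothetical. Small numerical differences from HCPC() are to be expected, for instance because prcomp() standardizes with the sample standard deviation.

```r
# Base R sketch of the HCPC idea on the built-in USArrests data.
# Assumption: k = 4 clusters is chosen by hand for this example.

# Step 1: PCA on standardized variables, keep the first 3 components
pca <- prcomp(USArrests, scale. = TRUE)
pcs <- pca$x[, 1:3]

# Step 2: hierarchical clustering with Ward's criterion on the components
tree <- hclust(dist(pcs), method = "ward.D2")

# Step 3: initial partition obtained by cutting the tree
k <- 4
init <- cutree(tree, k = k)

# Step 4: k-means consolidation, started from the centers of the initial partition
centers <- apply(pcs, 2, function(col) tapply(col, init, mean))
final <- kmeans(pcs, centers = centers)

# Cross-tabulate the partition before and after consolidation
table(tree = init, kmeans = final$cluster)
```

Off-diagonal counts in the resulting table correspond to individuals that changed cluster during the k-means consolidation.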


9.4 Computation

9.4.1 R packages

We'll use two R packages: i) FactoMineR for computing HCPC and ii) factoextra for visualizing the results.

To install the packages, type this:

install.packages(c("FactoMineR", "factoextra"))

After the installation, load the packages as follows:

library(factoextra)
library(FactoMineR)

9.4.2 R function

The function HCPC() [in FactoMineR package] can be used to compute hierarchical clustering on principal components.

A simplified format is:

HCPC(res, nb.clust = 0, min = 3, max = NULL, graph = TRUE)

- res: Either the result of a factor analysis or a data frame.
- nb.clust: an integer specifying the number of clusters. Possible values are:
    - 0: the tree is cut at the level the user clicks on
    - -1: the tree is automatically cut at the suggested level
    - Any positive integer: the tree is cut with nb.clust clusters
- min, max: the minimum and the maximum number of clusters to be generated, respectively
- graph: if TRUE, graphics are displayed
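As a hedged illustration of the nb.clust argument, the sketch below runs HCPC() non-interactively twice: once with the automatically suggested cut (nb.clust = -1) and once with a fixed number of clusters. The object names res.auto and res.k3 are made up for this example, and the code is skipped when FactoMineR is not installed.

```r
# Non-interactive use of HCPC(): automatic cut versus a fixed number of clusters.
if (requireNamespace("FactoMineR", quietly = TRUE)) {
  res.pca <- FactoMineR::PCA(USArrests, ncp = 3, graph = FALSE)

  # nb.clust = -1: cut the tree at the level suggested by HCPC()
  res.auto <- FactoMineR::HCPC(res.pca, nb.clust = -1, graph = FALSE)

  # nb.clust = 3: ask for a partition into 3 clusters
  res.k3 <- FactoMineR::HCPC(res.pca, nb.clust = 3, graph = FALSE)

  # Cluster sizes for each solution
  print(table(res.auto$data.clust$clust))
  print(table(res.k3$data.clust$clust))
}
```

Because nb.clust = 0 waits for a click on the tree, the values -1 or a positive integer are the more convenient choices in scripts and reports.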

9.4.3 Case of continuous variables

We start by computing the principal component analysis (PCA) again. The argument ncp = 3 is used in the function PCA() to keep only the first three principal components. Next, HCPC is applied on the result of the PCA.

library(FactoMineR)
# Compute PCA with ncp = 3
res.pca <- PCA(USArrests, ncp = 3, graph = FALSE)
# Compute hierarchical clustering on principal components
res.hcpc <- HCPC(res.pca, graph = FALSE)

To visualize the dendrogram generated by the hierarchical clustering, we'll use the function fviz_dend() [in factoextra package]:

fviz_dend(res.hcpc,
          cex = 0.7, # Label size
          palette = "jco", # Color palette see ?ggpubr::ggpar
          rect = TRUE, rect_fill = TRUE, # Add rectangle around groups
          rect_border = "jco", # Rectangle color
          labels_track_height = 0.8 # Augment the room for labels
          )

It's possible to visualize individuals on the principal component map and to color individuals according to the cluster they belong to. The function fviz_cluster() [in factoextra] can be used to visualize individuals clusters.

fviz_cluster(res.hcpc,
             repel = TRUE, # Avoid label overlapping
             show.clust.cent = TRUE, # Show cluster centers
             palette = "jco", # Color palette see ?ggpubr::ggpar
             ggtheme = theme_minimal(),
             main = "Factor map"
             )

The dendrogram suggests a 4-cluster solution.


You can also draw a three-dimensional plot combining the hierarchical clustering and the factorial map using the R base function plot():

# Principal components + tree
plot(res.hcpc, choice = "3D.map")


The function HCPC() returns a list containing:

- data.clust: The original data with a supplementary column called clust containing the partition.
- desc.var: The variables describing clusters
- desc.ind: The more typical individuals of each cluster
- desc.axes: The axes describing clusters

To display the original data with cluster assignments, type this:

head(res.hcpc$data.clust, 10)

##             Murder Assault UrbanPop Rape clust
## Alabama       13.2     236       58 21.2     3
## Alaska        10.0     263       48 44.5     4
## Arizona        8.1     294       80 31.0     4
## Arkansas       8.8     190       50 19.5     3
## California     9.0     276       91 40.6     4
## Colorado       7.9     204       78 38.7     4
## Connecticut    3.3     110       77 11.1     2
## Delaware       5.9     238       72 15.8     2
## Florida       15.4     335       80 31.9     4
## Georgia       17.4     211       60 25.8     3


In the table above, the last column contains the cluster assignments.

To display the quantitative variables that best describe each cluster, type this:

res.hcpc$desc.var$quanti

Here, we show only some columns of interest: "Mean in category", "Overall mean", "p.value"

## $`1`
##          Mean in category Overall mean  p.value
## UrbanPop             52.1        65.54 9.68e-05
## Murder                3.6         7.79 5.57e-05
## Rape                 12.2        21.23 5.08e-05
## Assault              78.5       170.76 3.52e-06
##
## $`2`
##          Mean in category Overall mean p.value
## UrbanPop            73.88        65.54 0.00522
## Murder               5.66         7.79 0.01759
##
## $`3`
##          Mean in category Overall mean  p.value
## Murder               13.9         7.79 1.32e-05
## Assault             243.6       170.76 6.97e-03
## UrbanPop             53.8        65.54 1.19e-02
##
## $`4`
##          Mean in category Overall mean  p.value
## Rape                 33.2        21.23 8.69e-08
## Assault             257.4       170.76 1.32e-05
## UrbanPop             76.0        65.54 2.45e-03
## Murder               10.8         7.79 3.58e-03

From the output above, it can be seen that:

- the variables UrbanPop, Murder, Rape and Assault are most significantly associated with cluster 1. For example, the mean value of the Assault variable in cluster 1 is 78.53, which is less than its overall mean (170.76) across all clusters. Therefore, it can be concluded that cluster 1 is characterized by a low rate of Assault compared to all clusters.
- the variables UrbanPop and Murder are most significantly associated with cluster 2.
- ...and so on...
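The comparison between the mean within a cluster and the overall mean can be approximated by hand. The sketch below uses base R only and a partition obtained from hclust() and cutree() rather than from HCPC(), so the cluster numbering and the exact cluster means are not expected to match the output above; grp, cluster.means and comparison are hypothetical names introduced for the example.

```r
# Compare per-cluster means with the overall mean of each variable.
# Assumption: clusters come from a plain Ward tree cut into 4 groups,
# not from HCPC(), so labels and values may differ from the book output.
pcs <- prcomp(USArrests, scale. = TRUE)$x[, 1:3]
grp <- cutree(hclust(dist(pcs), method = "ward.D2"), k = 4)

# Mean of each variable within each cluster (one row per cluster)
cluster.means <- aggregate(USArrests, by = list(cluster = grp), FUN = mean)

# Overall mean of each variable across all 50 states
overall.means <- colMeans(USArrests)

# Assault: mean in each cluster next to the overall mean
comparison <- data.frame(
  cluster      = cluster.means$cluster,
  mean.in.clus = round(cluster.means$Assault, 2),
  overall.mean = round(unname(overall.means["Assault"]), 2)
)
comparison
```

A cluster whose mean lies well below (or above) the overall mean is characterized by low (or high) values of that variable. HCPC() additionally attaches a test and a p-value to each comparison, which this sketch does not compute.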

Similarly, to show the principal dimensions that are the most associated with clusters, type this:

res.hcpc$desc.axes$quanti

## $`1`
##       Mean in category Overall mean  p.value
## Dim.1            -1.96    -5.64e-16 2.27e-07
##
## $`2`
##       Mean in category Overall mean  p.value
## Dim.2            0.743    -5.37e-16 0.000336
##
## $`3`
##       Mean in category Overall mean  p.value
## Dim.1            1.061    -5.64e-16 3.96e-02
## Dim.3            0.397     3.54e-17 4.25e-02
## Dim.2           -1.477    -5.37e-16 5.72e-06
##
## $`4`
##       Mean in category Overall mean  p.value
## Dim.1             1.89    -5.64e-16 6.15e-07

Finally, representative individuals of each cluster can be extracted as follows:

res.hcpc$desc.ind$para

## Cluster: 1
##         Idaho  South Dakota         Maine          Iowa New Hampshire
##         0.367         0.499         0.501         0.553         0.589
## --------------------------------------------------------
## Cluster: 2
##         Ohio     Oklahoma Pennsylvania       Kansas      Indiana
##        0.280        0.505        0.509        0.604        0.710
## --------------------------------------------------------
## Cluster: 3
##        Alabama South Carolina        Georgia      Tennessee      Louisiana
##          0.355          0.534          0.614          0.852          0.878
## --------------------------------------------------------
## Cluster: 4
##   Michigan    Arizona New Mexico   Maryland      Texas
##      0.325      0.453      0.518      0.901      0.924
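To make the meaning of these distances concrete, here is a base R sketch that ranks individuals by their Euclidean distance to their cluster center in the space of the retained components. It relies on prcomp() and a plain Ward partition (an assumption made to keep the example self-contained), so the states and distances will not necessarily reproduce the HCPC() output above. The names d.center and closest are hypothetical.

```r
# Rank individuals by distance to their own cluster center (base R sketch).
pcs <- prcomp(USArrests, scale. = TRUE)$x[, 1:3]
grp <- cutree(hclust(dist(pcs), method = "ward.D2"), k = 4)

# Cluster centers: mean coordinates of the members of each cluster
centers <- apply(pcs, 2, function(col) tapply(col, grp, mean))

# Euclidean distance between each individual and the center of its cluster
d.center <- sqrt(rowSums((pcs - centers[as.character(grp), ])^2))

# For each cluster, the 5 individuals closest to the center
closest <- lapply(split(d.center, grp), function(d) head(sort(d), 5))
closest
```

Individuals with the smallest distances are the most typical members of their cluster, in the sense that they sit nearest to its center of gravity.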

The desc.axes results above indicate that individuals in clusters 1 and 4 have extreme coordinates on axis 1 (negative for cluster 1, positive for cluster 4). Individuals in cluster 2 have high coordinates on the second axis. Individuals who belong to the third cluster are characterized by axes 1, 2 and 3.

In the desc.ind$para output, the 5 individuals closest to each cluster center are shown, together with the distance between each individual and the cluster center. For example, representative individuals for cluster 1 include: Idaho, South Dakota, Maine, Iowa and New Hampshire.

9.4.4 Case of categorical variables

For categorical variables, compute CA or MCA and then apply the function HCPC() on the results as described above.

Here, we'll use the tea data [in FactoMineR] as demo data set: Rows represent the individuals and columns represent categorical variables.


We start by performing an MCA on the individuals. We keep the first 20 axes of the MCA, which retain 87% of the information.

# Loading data
library(FactoMineR)
data(tea)
# Performing MCA
res.mca <- MCA(tea,
               ncp = 20, # Number of components kept
               quanti.sup = 19, # Quantitative supplementary variables
               quali.sup = c(20:36), # Qualitative supplementary variables
               graph = FALSE)

Next, we apply hierarchical clustering on the results of the MCA:

res.hcpc <- HCPC(res.mca, graph = FALSE, max = 3)

The results can be visualized as follows:

# Dendrogram
fviz_dend(res.hcpc, show_labels = FALSE)
# Individuals factor map
fviz_cluster(res.hcpc, geom = "point", main = "Factor map")

As mentioned above, clusters can be described by i) variables and/or categories, ii) principal axes and iii) individuals. In the example below, we display only a subset of the results.

Description by variables and categories

# Description by variables
res.hcpc$desc.var$test.chi2

##          p.value df
## where   8.47e-79  4
## how     3.14e-47  4
## price   1.86e-28 10
## tearoom 9.62e-19  2


# Description by variable categories
res.hcpc$desc.var$category

## $`1`
##                     Cla/Mod Mod/Cla Global  p.value
## where=chain store      85.9    93.8   64.0 2.09e-40
## how=tea bag            84.1    81.2   56.7 1.48e-25
## tearoom=Not.tearoom    70.7    97.2   80.7 1.08e-18
## price=p_branded        83.2    44.9   31.7 1.63e-09
##
## $`2`
##                  Cla/Mod Mod/Cla Global  p.value
## where=tea shop      90.0    84.4   10.0 3.70e-30
## how=unpackaged      66.7    75.0   12.0 5.35e-20
## price=p_upscale     49.1    81.2   17.7 2.39e-17
## Tea=green           27.3    28.1   11.0 4.44e-03
##
## $`3`
##                            Cla/Mod Mod/Cla Global  p.value
## where=chain store+tea shop    85.9    72.8   26.0 5.73e-34
## how=tea bag+unpackaged        67.0    68.5   31.3 1.38e-19
## tearoom=tearoom               77.6    48.9   19.3 1.25e-16
## pub=pub                       63.5    43.5   21.0 1.13e-09

Description by principal components

res.hcpc$desc.axes

Description by individuals

res.hcpc$desc.ind$para

The variables that characterize the clusters the most are "where" and "how". Each cluster is characterized by a category of the variables "where" and "how". For example, individuals who belong to the first cluster buy tea as tea bags in chain stores.
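The columns Cla/Mod, Mod/Cla and Global in the category description are percentages derived from the cross-tabulation of a category with the clusters. The toy example below is built on made-up data (the vectors cluster and where are hypothetical, not the tea data) and shows how the three quantities can be computed for one category in one cluster.

```r
# Toy data: 10 individuals, their cluster and where they buy tea (hypothetical values)
cluster <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3)
where   <- c("chain store", "chain store", "chain store", "tea shop",
             "tea shop", "tea shop", "chain store",
             "chain store", "tea shop", "tea shop")

in.cluster  <- cluster == 1            # members of cluster 1
in.category <- where == "chain store"  # individuals with the category

# Cla/Mod: % of individuals with the category that belong to the cluster
cla.mod <- 100 * sum(in.cluster & in.category) / sum(in.category)

# Mod/Cla: % of individuals of the cluster that have the category
mod.cla <- 100 * sum(in.cluster & in.category) / sum(in.cluster)

# Global: % of all individuals that have the category
global <- 100 * mean(in.category)

c(Cla.Mod = cla.mod, Mod.Cla = mod.cla, Global = global)
```

Read this way, the first row of the tea output says that 85.9% of the individuals who buy in chain stores belong to cluster 1, that 93.8% of the individuals in cluster 1 buy in chain stores, and that 64% of all individuals buy in chain stores.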


9.5 Summary

We described how to compute hierarchical clustering on principal components (HCPC). This approach is useful in several situations, including:

- When you have a large data set containing continuous variables, a principal component analysis can be used to reduce the dimension of the data before the hierarchical clustering analysis.
- When you have a data set containing categorical variables, a (multiple) correspondence analysis can be used to transform the categorical variables into a few continuous principal components, which can be used as the input of the cluster analysis.

We used the FactoMineR package to compute the HCPC and the factoextra R package for ggplot2-based elegant data visualization.


9.6 Further reading

- Practical guide to cluster analysis in R (Book). https://goo.gl/DmJ5y5
- HCPC: Hierarchical Clustering on Principal Components (Videos). https://goo.gl/jdYGoK


References

Abdi, Hervé, and Lynne J. Williams. 2010. “Principal Component Analysis.” John Wiley and Sons, Inc. WIREs Comp Stat 2: 433–59. http://staff.ustc.edu.cn/~zwp/teach/MVA/abdi-awPCA2010.pdf.

Bendixen, Mike. 2003. “A Practical Guide to the Use of Correspondence Analysis in Marketing Research.” Marketing Bulletin 14. http://marketing-bulletin.massey.ac.nz/V14/MB_V14_T2_Bendixen.pdf.

Bendixen, Mike T. 1995. “Compositional Perceptual Mapping Using Chi-squared Trees Analysis and Correspondence Analysis.” Journal of Marketing Management 11 (6): 571–81. doi:10.1080/0267257X.1995.9964368.

Gabriel, K. Ruben, and Charles L. Odoroff. 1990. “Biplots in Biomedical Research.” Statistics in Medicine 9 (5). Wiley Subscription Services, Inc., A Wiley Company: 469–85. doi:10.1002/sim.4780090502.

Greenacre, Michael. 2013. “Contribution Biplots.” Journal of Computational and Graphical Statistics 22 (1): 107–22. http://dx.doi.org/10.1080/10618600.2012.702494.

Husson, Francois, Julie Josse, Sebastien Le, and Jeremy Mazet. 2017. FactoMineR: Multivariate Exploratory Data Analysis and Data Mining. https://CRAN.R-project.org/package=FactoMineR.

Husson, Francois, Sebastien Le, and Jérôme Pagès. 2017. Exploratory Multivariate Analysis by Example Using R. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. http://factominer.free.fr/bookV2/index.html.

Husson, François, J. Josse, and J. Pagès. 2010. “Principal Component Methods - Hierarchical Clustering - Partitional Clustering: Why Would We Need to Choose for Visualizing Data?” Unpublished Data. http://www.sthda.com/english/upload/hcpc_husson_josse.pdf.

Jolliffe, I. T. 2002. Principal Component Analysis. 2nd ed. New York: Springer-Verlag. https://goo.gl/SB86SR.

Kaiser, Henry F. 1961. “A Note on Guttman’s Lower Bound for the Number of Common Factors.” British Journal of Statistical Psychology 14: 1–2.

Kassambara, Alboukadel, and Fabian Mundt. 2017. Factoextra: Extract and Visualize the Results of Multivariate Data Analyses. http://www.sthda.com/english/rpkgs/factoextra.

Nenadic, O., and M. Greenacre. 2007. “Correspondence Analysis in R, with Two- and Three-Dimensional Graphics: The ca Package.” Journal of Statistical Software 20 (3): 1–13. http://www.jstatsoft.org.

Pagès, J. 2002. “Analyse Factorielle Multiple Appliquée Aux Variables Qualitatives et Aux Données Mixtes.” Revue Statistique Appliquee 4: 5–37.

Pagès, J. 2004. “Analyse Factorielle de Donnees Mixtes.” Revue Statistique Appliquee 4: 93–111.

Peres-Neto, Pedro R., Donald A. Jackson, and Keith M. Somers. 2005. “How Many Principal Components? Stopping Rules for Determining the Number of Non-Trivial Axes Revisited.” Computational Statistics & Data Analysis 49: 974–97.

Tayrac, Marie de, Sébastien Lê, Marc Aubry, Jean Mosser, and François Husson. 2009. “Simultaneous Analysis of Distinct Omics Data Sets with Integration of Biological Knowledge: Multiple Factor Analysis Approach.” BMC Genomics 10 (1): 32. https://doi.org/10.1186/1471-2164-10-32.