Practical Guide to Principal Component Methods in R
Multivariate Analysis
Alboukadel KASSAMBARA
Preface
0.1 What you will learn
Large data sets containing multiple samples and variables are collected every day by researchers in various fields, such as the biomedical, marketing, and geospatial fields.
Discovering knowledge from these data requires specific techniques for analyzing data sets containing multiple variables. Multivariate analysis (MVA) refers to a set of techniques used for analyzing a data set containing more than one variable.
Among these techniques, there are:
- Cluster analysis, for identifying groups of observations with similar profiles according to specific criteria.
- Principal component methods, which consist of summarizing and visualizing the most important information contained in a multivariate data set.
Previously, we published a book entitled "Practical Guide To Cluster Analysis in R" (https://goo.gl/DmJ5y5). The aim of the current book is to provide solid practical guidance on principal component methods in R. Additionally, we developed an R package named factoextra to easily create elegant ggplot2-based plots of the results of principal component methods. Factoextra official online documentation: http://www.sthda.com/english/rpkgs/factoextra
One of the difficulties inherent in multivariate analysis is the problem of visualizing data that has many variables. In R, there are many functions and packages for displaying a graph of the relationship between two variables (http://www.sthda.com/english/wiki/data-visualization). There are also commands for displaying different three-dimensional views. But when there are more than three variables, it is more difficult to visualize their relationships.
Fortunately, in data sets with many variables, some variables are often correlated. This can be explained by the fact that more than one variable might be measuring the same driving principle governing the behavior of the system. Correlation indicates that there is redundancy in the data. When this happens, you can simplify the problem by replacing a group of correlated variables with a single new variable.
Principal component analysis is a rigorous statistical method used for achieving this simplification. The method creates a new set of variables, called principal components. Each principal component is a linear combination of the original variables. All the principal components are orthogonal to each other, so there is no redundant information.
The type of principal component method to use depends on the variable types contained in the data set. This practical guide will describe the following methods:
1. Principal Component Analysis (PCA), which is one of the most popular multivariate analysis methods. The goal of PCA is to summarize the information contained in continuous (i.e., quantitative) multivariate data by reducing the dimensionality of the data without losing important information.
2. Correspondence Analysis (CA), which is an extension of principal component analysis for analyzing a large contingency table formed by two qualitative variables (or categorical data).
3. Multiple Correspondence Analysis (MCA), which is an adaptation of CA to a data table containing more than two categorical variables.
4. Factor Analysis of Mixed Data (FAMD), dedicated to analyzing a data set containing both quantitative and qualitative variables.
5. Multiple Factor Analysis (MFA), dedicated to analyzing data sets in which variables are organized into groups (qualitative and/or quantitative variables).
Additionally, we'll discuss the HCPC (Hierarchical Clustering on Principal Components) method. It applies agglomerative hierarchical clustering to the results of principal component methods (PCA, CA, MCA, FAMD, MFA). It allows us, for example, to perform cluster analysis on any type of data (quantitative, qualitative or mixed data).
Figure 1 illustrates the type of analysis to be performed depending on the type of variables contained in the data set.
Principal component methods
0.2 Key features of this book
Although there are several good books on principal component methods and related topics, we felt that many of them are either too theoretical or too advanced.
Our goal was to write a practical guide to multivariate analysis, visualization and interpretation, focusing on principal component methods.
The book presents the basic principles of the different methods and provides many examples in R. This book offers solid guidance in data mining for students and researchers.
Key features
- Covers principal component methods and implementation in R
- Short, self-contained chapters with tested examples that allow for flexibility in designing a course and for easy reference
At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter. Additionally, we provide links to other resources and to our hand-curated list of videos on principal component methods for further learning.
0.3 How this book is organized
This book is divided into 4 parts and 6 chapters. Part I provides a quick introduction to R (chapter 2) and presents required R packages for the analysis and visualization (chapter 3).
In Part II, we describe classical multivariate analysis methods:
- Principal Component Analysis - PCA (chapter 4)
- Correspondence Analysis - CA (chapter 5)
- Multiple Correspondence Analysis - MCA (chapter 6)
In Part III, we continue by discussing advanced methods for analyzing a data set containing a mix of variables (qualitative & quantitative), organized or not into groups:
- Factor Analysis of Mixed Data - FAMD (chapter 7) and,
- Multiple Factor Analysis - MFA (chapter 8).
Finally, we show in Part IV how to perform hierarchical clustering on principal components (HCPC) (chapter 9), which is useful for clustering a data set containing only qualitative variables or a mix of qualitative and quantitative variables.
Some examples of plots generated in this book are shown hereafter. You'll learn how to create, customize and interpret these plots.
1. Eigenvalues/variances of principal components. Proportion of information retained by each principal component.
2. PCA - Graph of variables:
- Control variable colors using their contributions to the principal components.
- Highlight the most contributing variables to each principal dimension.
3. PCA - Graph of individuals:
- Control automatically the color of individuals using the cos2 (the quality of the individuals on the factor map).
- Change the point size according to the cos2 of the corresponding individuals.
4. PCA - Biplot of individuals and variables
5. Correspondence analysis. Association between categorical variables.
6. FAMD - Analyzing mixed data
7. Clustering on principal components
0.4 Book website
The website for this book is located at: http://www.sthda.com/english/. It contains a number of resources.
0.5 Executing the R codes from the PDF
For a single line of R code, you can just copy the code from the PDF to the R console.
For multiple-line R code, an error is sometimes generated when you copy and paste the code directly from the PDF to the R console. If this happens, a solution is to:
1. First paste the code into your R code editor or text editor.
2. Copy the code from your text/code editor to the R console.
0.6 Acknowledgment
I sincerely thank all developers for their efforts behind the packages that factoextra depends on, namely, ggplot2 (Hadley Wickham, Springer-Verlag New York, 2009), FactoMineR (Sebastien Le et al., Journal of Statistical Software, 2008), dendextend (Tal Galili, Bioinformatics, 2015), cluster (Martin Maechler et al., 2016) and more.
0.7 Colophon
This book was built with:
- R 3.3.2
- factoextra 1.0.5
- FactoMineR 1.36
- ggpubr 0.1.5
- dplyr 0.7.2
- bookdown 0.4.3
1 About the author
Alboukadel Kassambara holds a PhD in Bioinformatics and Cancer Biology. He has worked for many years on genomic data analysis and visualization (read more: http://www.alboukadel.com/).
He has work experience in statistical and computational methods to identify prognostic and predictive biomarker signatures through integrative analysis of large-scale genomic and clinical data sets.
He created a bioinformatics web tool named GenomicScape (www.genomicscape.com), which is an easy-to-use web tool for gene expression data analysis and visualization.
He also developed a training website on data science, named STHDA (Statistical Tools for High-throughput Data Analysis, www.sthda.com/english), which contains many tutorials on data analysis and visualization using R software and packages.
He is the author of many popular R packages for:
- multivariate data analysis (factoextra, http://www.sthda.com/english/rpkgs/factoextra),
- survival analysis (survminer, http://www.sthda.com/english/rpkgs/survminer/),
- correlation analysis (ggcorrplot, http://www.sthda.com/english/wiki/ggcorrplot-visualization-of-a-correlation-matrix-using-ggplot2),
- creating publication-ready plots in R (ggpubr, http://www.sthda.com/english/rpkgs/ggpubr).
Recently, he published three books on data analysis and visualization:
1. Practical Guide to Cluster Analysis in R (https://goo.gl/DmJ5y5)
2. Guide to Create Beautiful Graphics in R (https://goo.gl/vJ0OYb).
3. Complete Guide to 3D Plots in R (https://goo.gl/v5gwl0).
2 Introduction to R
R is a free and powerful statistical software for analyzing and visualizing data. If you want to easily learn the essentials of R programming, visit our series of tutorials available on STHDA: http://www.sthda.com/english/wiki/r-basics-quick-and-easy.
In this chapter, we provide a very brief introduction to R, covering the installation of R/RStudio as well as importing your data into R for computing principal component methods.
2.1 Installing R and RStudio
R and RStudio can be installed on Windows, MAC OSX and Linux platforms. RStudio is an integrated development environment for R that makes using R easier. It includes a console, a code editor and tools for plotting.
1. R can be downloaded and installed from the Comprehensive R Archive Network (CRAN) webpage (http://cran.r-project.org/)
2. After installing R software, install also the RStudio software available at: http://www.rstudio.com/products/RStudio/.
3. Launch RStudio and start using R inside RStudio.
RStudio interface
2.2 Installing and loading R packages
An R package is an extension of R containing data sets and specific R functions to solve specific questions.
For example, in this book, you'll learn how to compute and visualize principal component methods using the FactoMineR and factoextra R packages.
There are thousands of other R packages available for download and installation from CRAN, Bioconductor (biology-related R packages) and GitHub repositories.
1. How to install packages from CRAN? Use the function install.packages():
install.packages("FactoMineR")
install.packages("factoextra")
2. How to install packages from GitHub? You should first install devtools if you don't have it already installed on your computer.
For example, the following R code installs the latest developmental version of the factoextra R package developed by A. Kassambara (https://github.com/kassambara/factoextra) for multivariate data analysis and elegant visualization:
install.packages("devtools")
devtools::install_github("kassambara/factoextra")
Note that GitHub contains the latest developmental version of R packages.
3. After installation, you must first load the package to use its functions. The function library() is used for this task.
library("FactoMineR")
library("factoextra")
Now, we can use R functions, such as PCA() [in the FactoMineR package], for performing principal component analysis.
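If you run a script repeatedly, it can be convenient to check which packages are still missing before calling install.packages(). The sketch below is one possible way to do this with base R only; missing_packages() is a hypothetical helper written for this illustration, not a function from FactoMineR or factoextra.

```r
# Hypothetical helper (illustration only): return the names in `pkgs`
# that cannot be found in the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("FactoMineR", "factoextra"))
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

The install.packages() call is left commented out so that running the snippet never changes your library without your consent.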
2.3 Getting help with functions in R
If you want to learn more about a given function, say PCA(), type this in the R console:
?PCA
2.4 Importing your data into R
1. Prepare your file as follows:
- Use the first row as column names. Generally, columns represent variables.
- Use the first column as row names. Generally, rows represent observations or individuals.
- Each row/column name should be unique, so remove duplicated names.
- Avoid names with blank spaces. Good column names: Long_jump or Long.jump. Bad column name: Long jump.
- Avoid names with special symbols: ?, $, *, +, #, (, ), -, /, }, {, |, >, < etc. Only underscore can be used.
- Avoid beginning variable names with a number. Use a letter instead. Good column names: sport_100m or x100m. Bad column name: 100m.
- R is case sensitive. This means that name is different from Name or NAME.
- Avoid blank rows in your data.
- Delete any comments in your file.
- Replace missing values by NA (for not available).
- If you have a column containing dates, use the four-digit year format. Good format: 01/01/2016. Bad format: 01/01/16.
2. The final file should look like this:
General data format for importation into R
3. Save your file
We recommend saving your file in .txt (tab-delimited text file) or .csv (comma-separated value file) format.
4. Get your data into R:
Use the R code below. You will be asked to choose a file:
# .txt file: Read tab separated values
my_data <- read.delim(file.choose(), row.names = 1)
# .csv file: Read comma (",") separated values
my_data <- read.csv(file.choose(), row.names = 1)
# .csv file: Read semicolon (";") separated values
my_data <- read.csv2(file.choose(), row.names = 1)
Using these functions, the imported data will be of class data.frame (R terminology).
You can read more about how to import data into R at this link: http://www.sthda.com/english/wiki/importing-data-into-r
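To see what row.names = 1 does without an interactive file dialog, the following sketch writes a tiny tab-delimited file to a temporary location and reads it back. The file content (athlete names and values) is made up for illustration, and a fixed path replaces file.choose().

```r
# Write a small tab-delimited file (made-up content) to a temporary path
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "name\tLong_jump\tsport_100m",
  "athlete_A\t7.58\t11.0",
  "athlete_B\t7.40\tNA"
), tmp)

# Same function as in the text, with a fixed path instead of file.choose()
my_data <- read.delim(tmp, row.names = 1)

class(my_data)      # the result is a data frame
rownames(my_data)   # the first column of the file became the row names
is.na(my_data)      # the NA entry is read as a missing value
```

Note that the column names follow the naming advice given earlier in this section: no blank spaces, and no name starting with a digit.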
2.5 Demo data sets
R comes with several built-in data sets, which are generally used as demo data for playing with R functions. The most used R demo data sets include: USArrests, iris and mtcars. To load a demo data set, use the function data() as follows:
data("USArrests")     # Loading
head(USArrests, 3)    # Print the first 3 rows
##         Murder Assault UrbanPop Rape
## Alabama   13.2     236       58 21.2
## Alaska    10.0     263       48 44.5
## Arizona    8.1     294       80 31.0
If you want to learn more about the USArrests data set, type this:
?USArrests
To select just certain columns from a data frame, you can either refer to the columns by name or by their location (i.e., column 1, 2, 3, etc.).
# Access the data in 'Murder' column
# dollar sign is used
head(USArrests$Murder)
## [1] 13.2 10.0  8.1  8.8  9.0  7.9
# Or use this
USArrests[, 'Murder']
# Or use this
USArrests[, 1] # column number 1
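The same indexing syntax extends to several columns at once. The short sketch below, using only base R and the built-in USArrests data, selects two columns by name and by position and shows the drop = FALSE argument, which keeps a single selected column as a data frame rather than a plain vector.

```r
data("USArrests")

# Several columns by name
by_name <- USArrests[, c("Murder", "Assault")]
# The same columns by position
by_position <- USArrests[, 1:2]
identical(by_name, by_position)   # TRUE

# drop = FALSE keeps a one-column data frame instead of a numeric vector
one_col <- USArrests[, "Murder", drop = FALSE]
class(one_col)                    # "data.frame"
```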
2.6 Close your R/RStudio session
Each time you close R/RStudio, you will be asked whether you want to save the data from your R session. If you decide to save, the data will be available in future R sessions.
3 Required R packages
3.1 FactoMineR & factoextra
There are a number of R packages implementing principal component methods. These packages include: FactoMineR, ade4, stats, ca, MASS and ExPosition.
However, the result is presented differently depending on the package used.
To help in the interpretation and in the visualization of multivariate analysis - such as cluster analysis and principal component methods - we developed an easy-to-use R package named factoextra (official online documentation: http://www.sthda.com/english/rpkgs/factoextra) (Kassambara and Mundt 2017).
No matter which package you decide to use for computing principal component methods, the factoextra R package can help to easily extract, in a human-readable data format, the analysis results from the different packages mentioned above. factoextra also provides convenient solutions to create beautiful ggplot2-based graphs.
In this book, we'll mainly use:
- the FactoMineR package (F. Husson et al. 2017) to compute principal component methods;
- and the factoextra package (Kassambara and Mundt 2017) for extracting, visualizing and interpreting the results.
The other packages - ade4, ExPosition, etc - will be presented briefly.
Figure 2.1 illustrates the key functionality of FactoMineR and factoextra.
Key features of FactoMineR and factoextra for multivariate analysis
Methods whose outputs can be visualized using the factoextra package are shown in Figure 2.2:
Principal component methods and clustering methods supported by the factoextra R package
3.2 Installation
3.2.1 Installing FactoMineR
The FactoMineR package can be installed and loaded as follows:
# Install
install.packages("FactoMineR")
# Load
library("FactoMineR")
3.2.2 Installing factoextra
factoextra can be installed from CRAN as follows:
install.packages("factoextra")
Or, install the latest developmental version from GitHub:
if (!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
Load factoextra as follows:
library("factoextra")
3.3 Main R functions
3.3.1 Main functions in FactoMineR
Functions for computing principal component methods and clustering:
Functions    Description
PCA          Principal component analysis.
CA           Correspondence analysis.
MCA          Multiple correspondence analysis.
FAMD         Factor analysis of mixed data.
MFA          Multiple factor analysis.
HCPC         Hierarchical clustering on principal components.
dimdesc      Dimension description.
3.3.2 Main functions in factoextra
factoextra functions covered in this book are listed in the tables below. See the online documentation (http://www.sthda.com/english/rpkgs/factoextra) for a complete list.
Visualizing principal component method outputs
Functions                        Description
fviz_eig (or fviz_eigenvalue)    Visualize eigenvalues.
fviz_pca                         Graph of PCA results.
fviz_ca                          Graph of CA results.
fviz_mca                         Graph of MCA results.
fviz_mfa                         Graph of MFA results.
fviz_famd                        Graph of FAMD results.
fviz_hmfa                        Graph of HMFA results.
fviz_ellipses                    Plot ellipses around groups.
fviz_cos2                        Visualize element cos2. [1]
fviz_contrib                     Visualize element contributions. [2]
Extracting data from principal component method outputs. The following functions extract all the results (coordinates, squared cosine, contributions) for the active individuals/variables from the analysis outputs.
Functions          Description
get_eigenvalue     Access to the dimension eigenvalues.
get_pca            Access to PCA outputs.
get_ca             Access to CA outputs.
get_mca            Access to MCA outputs.
get_mfa            Access to MFA outputs.
get_famd           Access to FAMD outputs.
get_hmfa           Access to HMFA outputs.
facto_summarize    Summarize the analysis.
Clustering analysis and visualization
Functions       Description
fviz_dend       Enhanced Visualization of Dendrogram.
fviz_cluster    Visualize Clustering Results.
[1] Cos2: quality of representation of the row/column variables on the principal component maps.
[2] This is the contribution of row/column elements to the definition of the principal components.
4 Principal Component Analysis
4.1 Introduction
Principal component analysis (PCA) allows us to summarize and to visualize the information in a data set containing individuals/observations described by multiple inter-correlated quantitative variables. Each variable could be considered as a different dimension. If you have more than 3 variables in your data set, it could be very difficult to visualize a multi-dimensional hyperspace.
Principal component analysis is used to extract the important information from a multivariate data table and to express this information as a set of a few new variables called principal components. These new variables correspond to linear combinations of the original variables. The number of principal components is less than or equal to the number of original variables.
The information in a given data set corresponds to the total variation it contains. The goal of PCA is to identify directions (or principal components) along which the variation in the data is maximal.
In other words, PCA reduces the dimensionality of multivariate data to two or three principal components, which can be visualized graphically, with minimal loss of information.
In this chapter, we describe the basic idea of PCA and demonstrate how to compute and visualize PCA using R software. Additionally, we'll show how to reveal the most important variables that explain the variations in a data set.
4.2 Basics
Understanding the details of PCA requires knowledge of linear algebra. Here, we'll explain only the basics with a simple graphical representation of the data.
In Plot 1A below, the data are represented in the X-Y coordinate system. The dimension reduction is achieved by identifying the principal directions, called principal components, in which the data varies.
PCA assumes that the directions with the largest variances are the most "important" (i.e., the most principal).
In the figure below, the PC1 axis is the first principal direction along which the samples show the largest variation. The PC2 axis is the second most important direction and it is orthogonal to the PC1 axis.
The dimensionality of our two-dimensional data can be reduced to a single dimension by projecting each sample onto the first principal component (Plot 1B).
Technically speaking, the amount of variance retained by each principal component is measured by the so-called eigenvalue.
Note that the PCA method is particularly useful when the variables within the data set are highly correlated. Correlation indicates that there is redundancy in the data. Due to this redundancy, PCA can be used to reduce the original variables into a smaller number of new variables (= principal components) explaining most of the variance in the original variables.
Taken together, the main purpose of principal component analysis is to:
- identify hidden patterns in a data set,
- reduce the dimensionality of the data by removing the noise and redundancy in the data,
- identify correlated variables.
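The two-dimensional picture described in this section can be reproduced numerically. The sketch below is an illustration on simulated data, not the data behind the book's plots: it assumes two strongly correlated variables generated with rnorm() and uses the base R function prcomp() rather than FactoMineR.

```r
set.seed(123)
x <- rnorm(100)
y <- 0.8 * x + rnorm(100, sd = 0.3)   # y is strongly correlated with x
xy <- cbind(x, y)

pca <- prcomp(xy, scale. = TRUE)

# Eigenvalues (variance retained by each principal component)
eig <- pca$sdev^2
prop <- eig / sum(eig)
round(prop, 3)            # PC1 carries most of the variation

# The principal directions are orthogonal: this is the identity matrix
round(crossprod(pca$rotation), 10)

# Reducing the data to one dimension: keep only the PC1 coordinates
pc1_scores <- pca$x[, 1]
length(pc1_scores)        # one coordinate per sample
```

Because the two variables are highly redundant, keeping only the PC1 coordinates discards little of the total variance, which is the situation in which PCA is most useful.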
4.3 Computation
4.3.1 R packages
Several functions from different packages are available in the R software for computing PCA:
- prcomp() and princomp() [built-in R stats package],
- PCA() [FactoMineR package],
- dudi.pca() [ade4 package],
- and epPCA() [ExPosition package]
No matter what function you decide to use, you can easily extract and visualize the results of PCA using R functions provided in the factoextra R package.
Here, we'll use the two packages FactoMineR (for the analysis) and factoextra (for ggplot2-based visualization).
Install the two packages as follows:
install.packages(c("FactoMineR", "factoextra"))
Load them in R, by typing this:
library("FactoMineR")
library("factoextra")
4.3.2 Data format
We'll use the demo data set decathlon2 from the factoextra package:
data(decathlon2)
# head(decathlon2)
As illustrated in Figure 3.1, the data used here describes athletes' performance during two sporting events (Decastar and OlympicG). It contains 27 individuals (athletes) described by 13 variables.
Principal component analysis data format
Note that only some of these individuals and variables will be used to perform the principal component analysis. The coordinates of the remaining individuals and variables on the factor map will be predicted after the PCA.
In PCA terminology, our data contains:
- Active individuals (in light blue, rows 1:23): Individuals that are used during the principal component analysis.
- Supplementary individuals (in dark blue, rows 24:27): The coordinates of these individuals will be predicted using the PCA information and parameters obtained with active individuals/variables.
- Active variables (in pink, columns 1:10): Variables that are used for the principal component analysis.
- Supplementary variables: As with supplementary individuals, the coordinates of these variables will be predicted also. These can be:
  - Supplementary continuous variables (red): Columns 11 and 12, corresponding respectively to the rank and the points of athletes.
  - Supplementary qualitative variables (green): Column 13, corresponding to the two athletic meetings (2004 Olympic Game or 2004 Decastar). This is a categorical (or factor) variable. It can be used to color individuals by groups.
We start by subsetting active individuals and active variables for the principal component analysis:
decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6], 4)
##         X100m Long.jump Shot.put High.jump X400m X110m.hurdle
## SEBRLE   11.0      7.58     14.8      2.07  49.8         14.7
## CLAY     10.8      7.40     14.3      1.86  49.4         14.1
## BERNARD  11.0      7.23     14.2      1.92  48.9         15.0
## YURKOV   11.3      7.09     15.2      2.10  50.4         15.3
4.3.3 Data standardization
In principal component analysis, variables are often scaled (i.e., standardized). This is particularly recommended when variables are measured in different scales (e.g., kilograms, kilometers, centimeters, ...); otherwise, the PCA outputs obtained will be severely affected.
The goal is to make the variables comparable. Generally, variables are scaled to have i) standard deviation one and ii) mean zero.
The standardization of data is an approach widely used in the context of gene expression data analysis before PCA and clustering analysis. We might also want to scale the data when the means and/or the standard deviations of the variables are very different.
When scaling variables, the data can be transformed as follows:
$$\frac{x_i - mean(x)}{sd(x)}$$
where mean(x) is the mean of the x values, and sd(x) is the standard deviation (SD).
The R base function scale() can be used to standardize the data. It takes a numeric matrix as an input and performs the scaling on the columns.
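As a quick check of the formula, the following sketch standardizes one column by hand and compares it with the output of scale(). It uses the built-in USArrests data instead of decathlon2 so that it runs with base R alone; that choice of data set is an assumption made for this illustration.

```r
data("USArrests")

# Standardize one variable by hand: (x_i - mean(x)) / sd(x)
x <- USArrests$Murder
z_manual <- (x - mean(x)) / sd(x)

# scale() applies the same transformation to every column
scaled <- scale(USArrests)
all.equal(z_manual, as.numeric(scaled[, "Murder"]))

# After scaling, each column has mean (numerically) zero and SD one
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

scale() returns a matrix, with the column means and standard deviations it used stored as attributes.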
4.3.4 R code
The function PCA() [FactoMineR package] can be used. A simplified format is:
PCA(X, scale.unit = TRUE, ncp = 5, graph = TRUE)
- X: a data frame. Rows are individuals and columns are numeric variables.
- scale.unit: a logical value. If TRUE, the data are scaled to unit variance before the analysis. This standardization to the same scale prevents some variables from becoming dominant just because of their large measurement units. It makes variables comparable.
- ncp: number of dimensions kept in the final results.
- graph: a logical value. If TRUE a graph is displayed.
Note that, by default, the function PCA() [in FactoMineR] standardizes the data automatically during the PCA, so you don't need to do this transformation before the PCA.
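To illustrate why scaling matters, the sketch below compares an unscaled and a scaled PCA. It deliberately swaps in the base R function prcomp() and the built-in USArrests data (both are assumptions of this example, chosen so that it runs without extra packages); the scale. argument of prcomp() plays a role comparable to scale.unit in PCA().

```r
data("USArrests")

# Raw variances differ by orders of magnitude (Assault is by far the largest)
round(apply(USArrests, 2, var), 1)

unscaled <- prcomp(USArrests, scale. = FALSE)
scaled   <- prcomp(USArrests, scale. = TRUE)

# Absolute weights of each variable in the first principal component
round(abs(unscaled$rotation[, 1]), 2)  # dominated by Assault
round(abs(scaled$rotation[, 1]), 2)    # weights are much more balanced
```

Without scaling, the first component is driven almost entirely by the variable with the largest variance, which here reflects its measurement unit rather than its importance.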
The R code below computes principal component analysis on the active individuals/variables:
library("FactoMineR")
res.pca <- PCA(decathlon2.active, graph = FALSE)
The output of the function PCA() is a list, including the following components:
print(res.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 23 individuals, described by 10 variables
## *The results are available in the following objects:
##
##    name               description
## 1  "$eig"             "eigenvalues"
## 2  "$var"             "results for the variables"
## 3  "$var$coord"       "coord. for the variables"
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"
## 6  "$var$contrib"     "contributions of the variables"
## 7  "$ind"             "results for the individuals"
## 8  "$ind$coord"       "coord. for the individuals"
## 9  "$ind$cos2"        "cos2 for the individuals"
## 10 "$ind$contrib"     "contributions of the individuals"
## 11 "$call"            "summary statistics"
## 12 "$call$centre"     "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w"      "weights for the individuals"
## 15 "$call$col.w"      "weights for the variables"
The object that is created using the function PCA() contains a lot of information stored in many different lists and matrices. These values are described in the next section.
4.4 Visualization and Interpretation
We'll use the factoextra R package to help in the interpretation of PCA. No matter what function you decide to use [stats::prcomp(), FactoMineR::PCA(), ade4::dudi.pca(), ExPosition::epPCA()], you can easily extract and visualize the results of PCA using R functions provided in the factoextra R package.
These functions include:
- get_eigenvalue(res.pca): Extract the eigenvalues/variances of principal components
- fviz_eig(res.pca): Visualize the eigenvalues
- get_pca_ind(res.pca), get_pca_var(res.pca): Extract the results for individuals and variables, respectively.
- fviz_pca_ind(res.pca), fviz_pca_var(res.pca): Visualize the results for individuals and variables, respectively.
- fviz_pca_biplot(res.pca): Make a biplot of individuals and variables.
In the next sections, we'll illustrate each of these functions.
4.4.1 Eigenvalues / Variances
As described in previous sections, the eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs. That is, the first PCs correspond to the directions with the maximum amount of variation in the data set.
We examine the eigenvalues to determine the number of principal components to be considered. The eigenvalues and the proportion of variances (i.e., information) retained by the principal components (PCs) can be extracted using the function get_eigenvalue() [factoextra package].
library("factoextra")
eig.val <- get_eigenvalue(res.pca)
eig.val
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1       4.124            41.24                        41.2
## Dim.2       1.839            18.39                        59.6
## Dim.3       1.239            12.39                        72.0
## Dim.4       0.819             8.19                        80.2
## Dim.5       0.702             7.02                        87.2
## Dim.6       0.423             4.23                        91.5
## Dim.7       0.303             3.03                        94.5
## Dim.8       0.274             2.74                        97.2
## Dim.9       0.155             1.55                        98.8
## Dim.10      0.122             1.22                       100.0
The sum of all the eigenvalues gives a total variance of 10.
The proportion of variation explained by each eigenvalue is given in the second column. For example, 4.124 divided by 10 equals 0.4124, so about 41.24% of the variation is explained by this first eigenvalue. The cumulative percentage explained is obtained by adding the successive proportions of variation explained to obtain the running total. For instance, 41.242% plus 18.385% equals 59.627%, and so forth. Therefore, about 59.627% of the variation is explained by the first two eigenvalues together.
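The arithmetic described here can be reproduced directly from the eigenvalues. The sketch below is not the implementation of get_eigenvalue(); it simply retypes the rounded eigenvalues from the printed output and recomputes the two percentage columns with base R, so the last digits may differ slightly from the package output.

```r
# Rounded eigenvalues copied by hand from the output of get_eigenvalue()
eig <- c(4.124, 1.839, 1.239, 0.819, 0.702, 0.423, 0.303, 0.274, 0.155, 0.122)
names(eig) <- paste0("Dim.", seq_along(eig))

sum(eig)   # total variance: 10 (the number of standardized variables)

# Percentage of variance explained by each component, and running total
variance.percent <- 100 * eig / sum(eig)
cumulative.variance.percent <- cumsum(variance.percent)

round(data.frame(eigenvalue = eig,
                 variance.percent,
                 cumulative.variance.percent), 2)
```

With standardized data each variable contributes a variance of one, which is why the eigenvalues sum to the number of variables.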
Eigenvalues can be used to determine the number of principal components to retain after PCA (Kaiser 1961):
- An eigenvalue > 1 indicates that the PC accounts for more variance than one of the original variables in standardized data. This is commonly used as a cutoff point for deciding which PCs are retained. This holds true only when the data are standardized.
- You can also limit the number of components to the number that accounts for a certain fraction of the total variance. For example, if you are satisfied with 70% of the total variance explained, then use the number of components needed to achieve that.
Unfortunately, there is no well-accepted objective way to decide how many principal components are enough. This will depend on the specific field of application and the specific data set. In practice, we tend to look at the first few principal components in order to find interesting patterns in the data.
In our analysis, the first three principal components explain 72% of the variation. This is an acceptably large percentage.
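Both rules of thumb are easy to apply in code. The following sketch uses the rounded eigenvalues printed earlier in this section, retyped by hand; the 70% threshold is the example value from the text, not a recommended default.

```r
# Rounded eigenvalues copied by hand from the output of get_eigenvalue()
eig <- c(4.124, 1.839, 1.239, 0.819, 0.702, 0.423, 0.303, 0.274, 0.155, 0.122)

# Rule 1: keep components with an eigenvalue greater than 1
n_kaiser <- sum(eig > 1)
n_kaiser          # 3

# Rule 2: keep the smallest number of components reaching 70% of the variance
cumulative <- cumsum(100 * eig / sum(eig))
n_70 <- which(cumulative >= 70)[1]
n_70              # 3
```

Here the two rules agree on three components, but on other data sets they can give different answers, which is why the choice remains a judgment call.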
An alternative method to determine the number of principal components is to look at a scree plot, which is the plot of eigenvalues ordered from largest to smallest. The number of components is determined at the point beyond which the remaining eigenvalues are all relatively small and of comparable size (Jolliffe 2002; Peres-Neto, Jackson, and Somers 2005).
The scree plot can be produced using the function fviz_eig() or fviz_screeplot() [factoextra package].
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))
4.4.2 Graph of variables
4.4.2.1 Results
A simple method to extract the results, for variables, from a PCA output is to use the function get_pca_var() [factoextra package]. This function provides a list of matrices containing all the results for the active variables (coordinates, correlation between variables and axes, squared cosine and contributions).
var <- get_pca_var(res.pca)
var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the variables"
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
The components of get_pca_var() can be used in the plot of variables as follows:
- var$coord: coordinates of variables to create a scatter plot
- var$cos2: represents the quality of representation for variables on the factor map. It's calculated as the squared coordinates: var.cos2 = var.coord * var.coord.
- var$contrib: contains the contributions (in percentage) of the variables to the principal components. The contribution of a variable (var) to a given principal component is (in percentage): (var.cos2 * 100) / (total cos2 of the component).
The different components can be accessed as follows:
# Coordinates
head(var$coord)
# Cos2: quality on the factor map
head(var$cos2)
# Contributions to the principal components
head(var$contrib)
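The two formulas given for cos2 and contrib can be checked by hand. The sketch below does not call get_pca_var(); as an assumption for illustration, it rebuilds comparable quantities from a base R prcomp() fit on the built-in USArrests data, where the variable coordinates are obtained as loadings multiplied by the component standard deviations.

```r
data("USArrests")
pca <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: loadings multiplied by the component standard deviations
var.coord <- sweep(pca$rotation, 2, pca$sdev, "*")

# cos2 = squared coordinates
var.cos2 <- var.coord * var.coord

# contrib = (var.cos2 * 100) / (total cos2 of the component)
var.contrib <- sweep(var.cos2, 2, colSums(var.cos2), "/") * 100

colSums(var.contrib)   # contributions sum to 100 on each component
colSums(var.cos2)      # total cos2 of a component equals its eigenvalue
pca$sdev^2
```

The last two lines show why the denominator in the contribution formula is simply the eigenvalue of the component.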
From the scree plot in section 4.4.1, we might want to stop at the fifth principal component. 87% of the information (variances) contained in the data is retained by the first five principal components.
Note that it's possible to plot variables and to color them according to either i) their quality on the factor map (cos2) or ii) their contribution values to the principal components (contrib).
In this section, we describe how to visualize variables and draw conclusions about their correlations. Next, we highlight variables according to either i) their quality of representation on the factor map or ii) their contributions to the principal components.
4.4.2.2 Correlation circle
The correlation between a variable and a principal component (PC) is used as the coordinates of the variable on the PC. The representation of variables differs from the plot of the observations: The observations are represented by their projections, but the variables are represented by their correlations (Abdi and Williams 2010).
# Coordinates of variables
head(var$coord, 4)
##            Dim.1   Dim.2  Dim.3   Dim.4  Dim.5
## X100m     -0.851 -0.1794  0.302  0.0336 -0.194
## Long.jump  0.794  0.2809 -0.191 -0.1154  0.233
## Shot.put   0.734  0.0854  0.518  0.1285 -0.249
## High.jump  0.610 -0.4652  0.330  0.1446  0.403
To plot variables, type this:
fviz_pca_var(res.pca, col.var = "black")
The plot above is also known as a variable correlation plot. It shows the relationships between all variables. It can be interpreted as follows:
- Positively correlated variables are grouped together.
- Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants).
- The distance between variables and the origin measures the quality of the variables on the factor map. Variables that are away from the origin are well represented on the factor map.
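The statement that variable coordinates are correlations can be verified numerically. The sketch below is an illustration with base R only: it assumes a standardized prcomp() fit on the built-in USArrests data as a stand-in for the FactoMineR analysis, and compares the coordinates with the correlations between each original variable and each component.

```r
data("USArrests")
pca <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: loadings multiplied by the component standard deviations
var.coord <- sweep(pca$rotation, 2, pca$sdev, "*")

# Correlation between each original variable and each principal component
cor_with_pcs <- cor(USArrests, pca$x)

all.equal(var.coord, cor_with_pcs)

# Correlations lie between -1 and 1, so every variable falls inside a unit circle
range(var.coord)
```

This is why the plot of variables is drawn inside a circle of radius one.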
4.4.2.3 Quality of representation
The quality of representation of the variables on the factor map is called cos2 (square cosine, squared coordinates). You can access the cos2 as follows:
head(var$cos2, 4)
##           Dim.1   Dim.2  Dim.3   Dim.4  Dim.5
## X100m     0.724 0.03218 0.0909 0.00113 0.0378
## Long.jump 0.631 0.07888 0.0363 0.01331 0.0544
## Shot.put  0.539 0.00729 0.2679 0.01650 0.0619
## High.jump 0.372 0.21642 0.1090 0.02089 0.1622
You can visualize the cos2 of variables on all the dimensions using the corrplot package:
library("corrplot")
corrplot(var$cos2, is.corr = FALSE)
It's also possible to create a bar plot of variables cos2 using the function fviz_cos2() [in factoextra]:
# Total cos2 of variables on Dim.1 and Dim.2
fviz_cos2(res.pca, choice = "var", axes = 1:2)
Note that:
- A high cos2 indicates a good representation of the variable on the principal component. In this case the variable is positioned close to the circumference of the correlation circle.
- A low cos2 indicates that the variable is not perfectly represented by the PCs. In this case the variable is close to the center of the circle.
For a given variable, the sum of the cos2 on all the principal components is equal to one.
If a variable is perfectly represented by only two principal components (Dim.1 & Dim.2), the sum of the cos2 on these two PCs is equal to one. In this case the variable will be positioned on the circle of correlations.
For some of the variables, more than 2 components might be required to perfectly represent the data. In this case the variables are positioned inside the circle of correlations.
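These properties of cos2 can be checked with a few lines of base R. As in the rest of these sketches, a standardized prcomp() fit on the built-in USArrests data is assumed as a stand-in for the FactoMineR output.

```r
data("USArrests")
pca <- prcomp(USArrests, scale. = TRUE)

# cos2 of each variable on each component (squared coordinates)
var.cos2 <- sweep(pca$rotation, 2, pca$sdev, "*")^2

# Summed over all components, the cos2 of each variable is one
rowSums(var.cos2)

# Quality of representation on the Dim.1-Dim.2 map
cos2_plane <- rowSums(var.cos2[, 1:2])
round(cos2_plane, 3)

# Distance of each variable from the origin on that map
arrow_length <- sqrt(cos2_plane)
round(arrow_length, 3)   # never exceeds 1, the radius of the circle
```

A variable whose distance is close to one lies near the circle and is almost fully described by the first two components; a shorter distance means part of its variation lives on later components.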
In summary:
- The cos2 values are used to estimate the quality of the representation.
- The closer a variable is to the circle of correlations, the better its representation on the factor map (and the more important it is to interpret these components).
- Variables that are close to the center of the plot are less important for the first components.
It's possible to color variables by their cos2 values using the argument col.var = "cos2". This produces a color gradient. In this case, the argument gradient.cols can be used to provide custom colors. For instance, gradient.cols = c("white", "blue", "red") means that:
- variables with low cos2 values will be colored in "white"
- variables with mid cos2 values will be colored in "blue"
- variables with high cos2 values will be colored in "red"
# Color by cos2 values: quality on the factor map
fviz_pca_var(res.pca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
             )
Note that it's also possible to change the transparency of the variables according to their cos2 values using the option alpha.var = "cos2". For example, type this:
# Change the transparency by cos2 values
fviz_pca_var(res.pca, alpha.var = "cos2")
4.4.2.4 Contributions of variables to PCs
The contributions of variables in accounting for the variability in a given principal component are expressed in percentage.
Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set. Variables that are not correlated with any PC, or that are correlated only with the last dimensions, have a low contribution and might be removed to simplify the overall analysis.
The contribution of variables can be extracted as follows:
head(var$contrib, 4)
##           Dim.1  Dim.2 Dim.3 Dim.4 Dim.5
## X100m     17.54  1.751  7.34 0.138  5.39
## Long.jump 15.29  4.290  2.93 1.625  7.75
## Shot.put  13.06  0.397 21.62 2.014  8.82
## High.jump  9.02 11.772  8.79 2.550 23.12
It’s possible to use the function corrplot() [corrplot package] to highlight the most contributing variables for each dimension:
library("corrplot")
corrplot(var$contrib, is.corr = FALSE)
The function fviz_contrib() [factoextra package] can be used to draw a bar plot of variable contributions. If your data contains many variables, you can decide to show only the top contributing variables. The R code below shows the top 10 variables contributing to the principal components:
# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)
The larger the value of the contribution, the more the variable contributes to the component.
The total contribution to PC1 and PC2 is obtained with the following R code:
fviz_contrib(res.pca, choice = "var", axes = 1:2, top = 10)
The red dashed line on the graph above indicates the expected average contribution. If the contributions of the variables were uniform, the expected value would be 1/length(variables) = 1/10 = 10%. For a given component, a variable with a contribution larger than this cutoff could be considered as important in contributing to the component.
Note that, the total contribution of a given variable, on explaining the variations retained by two principal components, say PC1 and PC2, is calculated as contrib = [(C1 * Eig1) + (C2 * Eig2)] / (Eig1 + Eig2), where
C1 and C2 are the contributions of the variable on PC1 and PC2, respectively
Eig1 and Eig2 are the eigenvalues of PC1 and PC2, respectively. Recall that eigenvalues measure the amount of variation retained by each PC.
In this case, the expected average contribution (cutoff) is calculated as follows. As mentioned above, if the contributions of the 10 variables were uniform, the expected average contribution on a given PC would be 1/10 = 10%. The expected average contribution of a variable for PC1 and PC2 is therefore [(10 * Eig1) + (10 * Eig2)] / (Eig1 + Eig2) = 10%.
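The formula above can be reproduced by hand. The sketch below is a base R illustration only: it assumes a prcomp() fit on iris (4 variables, so the uniform cutoff is 25% rather than 10%), and it relies on the fact that the contribution of a variable to a component is its cos2 divided by the eigenvalue of that component. The names var.contrib, contrib.12 and cutoff.12 are illustrative.

```r
# Assumption: prcomp() on standardized iris data, used only for illustration
pca <- prcomp(iris[, -5], scale. = TRUE)
eig <- pca$sdev^2                                   # eigenvalues (Eig1, Eig2, ...)

# cos2 of variables, then contributions in percent for each component
var.cos2    <- sweep(pca$rotation, 2, pca$sdev, "*")^2
var.contrib <- 100 * sweep(var.cos2, 2, eig, "/")   # each column sums to 100

# Total contribution to PC1 and PC2:
# contrib = [(C1 * Eig1) + (C2 * Eig2)] / (Eig1 + Eig2)
contrib.12 <- (var.contrib[, 1] * eig[1] + var.contrib[, 2] * eig[2]) /
  (eig[1] + eig[2])

# Expected average contribution if all variables contributed uniformly
uniform   <- 100 / nrow(var.contrib)
cutoff.12 <- (uniform * eig[1] + uniform * eig[2]) / (eig[1] + eig[2])

# Variables above the cutoff could be considered as important
print(round(contrib.12, 2))
print(names(contrib.12)[contrib.12 > cutoff.12])
```

Because the same uniform value is weighted on both components, the combined cutoff always equals the single-component cutoff.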
The most important (or, contributing) variables can be highlighted on the correlation plot as follows:
fviz_pca_var(res.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
             )
It can be seen that the variables X100m, Long.jump and Pole.vault contribute the most to the dimensions 1 and 2.
Note that, it's also possible to change the transparency of variables according to their contrib values using the option alpha.var = "contrib". For example, type this:
# Change the transparency by contrib values
fviz_pca_var(res.pca, alpha.var = "contrib")
4.4.2.5 Color by a custom continuous variable
In the previous sections, we showed how to color variables by their contributions and their cos2. Note that, it's possible to color variables by any custom continuous variable. The coloring variable should have the same length as the number of active variables in the PCA (here n = 10).
For example, type this:
# Create a random continuous variable of length 10
set.seed(123)
my.cont.var <- rnorm(10)
# Color variables by the continuous variable
fviz_pca_var(res.pca, col.var = my.cont.var,
             gradient.cols = c("blue", "yellow", "red"),
             legend.title = "Cont.Var")
4.4.2.6 Color by groups
It's also possible to change the color of variables by groups defined by a qualitative/categorical variable, also called factor in R terminology.
As we don't have any grouping variable in our data sets for classifying variables, we'll create it.
In the following demo example, we start by classifying the variables into 3 groups using the kmeans clustering algorithm. Next, we use the clusters returned by the kmeans algorithm to color variables.
# Create a grouping variable using kmeans
# Create 3 groups of variables (centers = 3)
set.seed(123)
res.km <- kmeans(var$coord, centers = 3, nstart = 25)
grp <- as.factor(res.km$cluster)
# Color variables by groups
fviz_pca_var(res.pca, col.var = grp,
             palette = c("#0073C2FF", "#EFC000FF", "#868686FF"),
             legend.title = "Cluster")
Note that, to change the color of groups the argument palette should be used. To change gradient colors, the argument gradient.cols should be used.
Note that, if you are interested in learning clustering, we previously published a book named "Practical Guide To Cluster Analysis in R" (https://goo.gl/DmJ5y5).
4.4.3 Dimension description
In the section 4.4.2.4, we described how to highlight variables according to their contributions to the principal components.
Note also that, the function dimdesc() [in FactoMineR], for dimension description, can be used to identify the most significantly associated variables with a given principal component. It can be used as follows:
res.desc <- dimdesc(res.pca, axes = c(1, 2), proba = 0.05)
# Description of dimension 1
res.desc$Dim.1
## $quanti
##              correlation  p.value
## Long.jump          0.794 6.06e-06
## Discus             0.743 4.84e-05
## Shot.put           0.734 6.72e-05
## High.jump          0.610 1.99e-03
## Javeline           0.428 4.15e-02
## X400m             -0.702 1.91e-04
## X110m.hurdle      -0.764 2.20e-05
## X100m             -0.851 2.73e-07
# Description of dimension 2
res.desc$Dim.2
## $quanti
##            correlation  p.value
## Pole.vault       0.807 3.21e-06
## X1500m           0.784 9.38e-06
## High.jump       -0.465 2.53e-02
In the output above, $quanti means results for quantitative variables. Only variables whose correlation with the dimension is significant at the chosen threshold (proba = 0.05) are reported. Note that, in the output shown here, variables are listed in decreasing order of their correlation with the dimension.
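To make the idea behind this output concrete, here is a rough base R sketch of the same kind of table. It is not the dimdesc() implementation: it assumes a prcomp() fit on iris, and describe.dim() is a hypothetical helper that correlates each variable with the coordinates of individuals on one dimension using cor.test(), keeps the significant ones and sorts them by correlation.

```r
# Assumption: prcomp() on standardized iris data, for illustration only
pca <- prcomp(iris[, -5], scale. = TRUE)
dim1.scores <- pca$x[, 1]   # coordinates of individuals on dimension 1

# Hypothetical helper: correlation and p-value of each variable with a dimension
describe.dim <- function(data, scores, proba = 0.05) {
  res <- t(sapply(data, function(v) {
    ct <- cor.test(v, scores)
    c(correlation = unname(ct$estimate), p.value = ct$p.value)
  }))
  # Keep significant variables, sorted by decreasing correlation
  res <- res[res[, "p.value"] < proba, , drop = FALSE]
  res[order(res[, "correlation"], decreasing = TRUE), , drop = FALSE]
}

desc.dim1 <- describe.dim(iris[, -5], dim1.scores, proba = 0.05)
print(desc.dim1)
```

Variables at the top of the table are positively associated with the dimension; those at the bottom are negatively associated with it.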
4.4.4 Graph of individuals
4.4.4.1 Results
The results for individuals can be extracted using the function get_pca_ind() [factoextra package]. Similarly to get_pca_var(), the function get_pca_ind() provides a list of matrices containing all the results for the individuals (coordinates, squared cosine and contributions).
ind <- get_pca_ind(res.pca)
ind
## Principal Component Analysis Results for individuals
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the individuals"
## 2 "$cos2"    "Cos2 for the individuals"
## 3 "$contrib" "contributions of the individuals"
To get access to the different components, use this:
# Coordinates of individuals
head(ind$coord)
# Quality of individuals
head(ind$cos2)
# Contributions of individuals
head(ind$contrib)
4.4.4.2 Plots: quality and contribution
The function fviz_pca_ind() is used to produce the graph of individuals. To create a simple plot, type this:
fviz_pca_ind(res.pca)
Like variables, it's also possible to color individuals by their cos2 values:
fviz_pca_ind(res.pca, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping (slow if many points)
             )
You can also change the point size according to the cos2 of the corresponding individuals:
fviz_pca_ind(res.pca, pointsize = "cos2",
             pointshape = 21, fill = "#E7B800",
             repel = TRUE # Avoid text overlapping (slow if many points)
             )
Note that, individuals that are similar are grouped together on the plot.
To change both point size and color by cos2, try this:
fviz_pca_ind(res.pca, col.ind = "cos2", pointsize = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping (slow if many points)
             )
To create a bar plot of the quality of representation (cos2) of individuals on the factor map, you can use the function fviz_cos2() as previously described for variables:
fviz_cos2(res.pca, choice = "ind")
To visualize the contribution of individuals to the first two principal components, type this:
# Total contribution on PC1 and PC2
fviz_contrib(res.pca, choice = "ind", axes = 1:2)
4.4.4.3 Color by a custom continuous variable
As for variables, individuals can be colored by any custom continuous variable by specifying the argument col.ind.
For example, type this:
# Create a random continuous variable of length 23,
# Same length as the number of active individuals in the PCA
set.seed(123)
my.cont.var <- rnorm(23)
# Color individuals by the continuous variable
fviz_pca_ind(res.pca, col.ind = my.cont.var,
             gradient.cols = c("blue", "yellow", "red"),
             legend.title = "Cont.Var")
4.4.4.4 Color by groups
Here, we describe how to color individuals by group. Additionally, we show how to add concentration ellipses and confidence ellipses by groups. For this, we'll use the iris data as a demo data set.
The iris data set looks like this:
head(iris, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
The column "Species" will be used as grouping variable. We start by computing principal component analysis as follows:
# The variable Species (index = 5) is removed
# before PCA analysis
iris.pca <- PCA(iris[, -5], graph = FALSE)
In the R code below, the argument habillage or col.ind can be used to specify the factor variable for coloring the individuals by groups.
To add a concentration ellipse around each group, specify the argument addEllipses = TRUE. The argument palette can be used to change group colors.
fviz_pca_ind(iris.pca,
             geom.ind = "point", # show points only (but not "text")
             col.ind = iris$Species, # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, # Concentration ellipses
             legend.title = "Groups"
             )
To remove the group mean point, specify the argument mean.point = FALSE.
If you want confidence ellipses instead of concentration ellipses, use ellipse.type = "confidence".
# Add confidence ellipses
fviz_pca_ind(iris.pca, geom.ind = "point", col.ind = iris$Species,
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, ellipse.type = "confidence",
             legend.title = "Groups"
             )
Note that, allowed values for palette include:
"grey" for grey color palettes;
brewer palettes, e.g. "RdBu", "Blues", ...; to view all, type this in R: RColorBrewer::display.brewer.all();
custom color palette, e.g. c("blue", "red");
and scientific journal palettes from the ggsci R package, e.g.: "npg", "aaas", "lancet", "jco", "ucscgb", "uchicago", "simpsons" and "rickandmorty".
For example, to use the jco (journal of clinical oncology) color palette, type this:
fviz_pca_ind(iris.pca,
             label = "none", # hide individual labels
             habillage = iris$Species, # color by groups
             addEllipses = TRUE, # Concentration ellipses
             palette = "jco"
             )
4.4.5 Graph customization
Note that, fviz_pca_ind(), fviz_pca_var() and related functions are wrappers around the core function fviz() [in factoextra]. fviz() is a wrapper around the function ggscatter() [in ggpubr]. Therefore, further arguments, to be passed to the functions fviz() and ggscatter(), can be specified in fviz_pca_ind() and fviz_pca_var().
Here, we present some of these additional arguments to customize the PCA graph of variables and individuals.
4.4.5.1 Dimensions
By default, variables/individuals are represented on dimensions 1 and 2. If you want to visualize them on dimensions 2 and 3, for example, you should specify the argument axes = c(2, 3).
# Variables on dimensions 2 and 3
fviz_pca_var(res.pca, axes = c(2, 3))
# Individuals on dimensions 2 and 3
fviz_pca_ind(res.pca, axes = c(2, 3))
4.4.5.2 Plot elements: point, text, arrow
The argument geom (for geometry) and derivatives are used to specify the geometry elements or graphical elements to be used for plotting.
1. geom.var: a text specifying the geometry to be used for plotting variables. Allowed values are the combination of c("point", "arrow", "text").
Use geom.var = "point", to show only points;
Use geom.var = "text" to show only text labels;
Use geom.var = c("point", "text") to show both points and text labels;
Use geom.var = c("arrow", "text") to show arrows and labels (default).
For example, type this:
# Show variable points and text labels
fviz_pca_var(res.pca, geom.var = c("point", "text"))
2. geom.ind: a text specifying the geometry to be used for plotting individuals. Allowed values are the combination of c("point", "text").
Use geom.ind = "point", to show only points;
Use geom.ind = "text" to show only text labels;
Use geom.ind = c("point", "text") to show both point and text labels (default).
For example, type this:
# Show individuals text labels only
fviz_pca_ind(res.pca, geom.ind = "text")
4.4.5.3 Size and shape of plot elements
1. labelsize: font size for the text labels, e.g.: labelsize = 4.
2. pointsize: the size of points, e.g.: pointsize = 1.5.
3. arrowsize: the size of arrows. Controls the thickness of arrows, e.g.: arrowsize = 0.5.
4. pointshape: the shape of points, pointshape = 21. Type ggpubr::show_point_shapes() to see available point shapes.
# Change the size of arrows and labels
fviz_pca_var(res.pca, arrowsize = 1, labelsize = 5,
             repel = TRUE)
# Change points size, shape and fill color
# Change labelsize
fviz_pca_ind(res.pca,
             pointsize = 3, pointshape = 21, fill = "lightblue",
             labelsize = 5, repel = TRUE)
4.4.5.4 Ellipses
As we described in the previous section 4.4.4.4, when coloring individuals by groups, you can add point concentration ellipses using the argument addEllipses = TRUE.
Note that, the argument ellipse.type can be used to change the type of ellipses. Possible values are:
"convex": plot convex hull of a set of points.
"confidence": plot confidence ellipses around group mean points as the function coord.ellipse() [in FactoMineR].
"t": assumes a multivariate t-distribution.
"norm": assumes a multivariate normal distribution.
"euclid": draws a circle with the radius equal to level, representing the euclidean distance from the center. This ellipse probably won't appear circular unless coord_fixed() is applied.
The argument ellipse.level is also available to change the size of the concentration ellipse in normal probability. For example, specify ellipse.level = 0.95 or ellipse.level = 0.66.
# Add confidence ellipses
fviz_pca_ind(iris.pca, geom.ind = "point",
             col.ind = iris$Species, # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, ellipse.type = "confidence",
             legend.title = "Groups"
             )
# Convex hull
fviz_pca_ind(iris.pca, geom.ind = "point",
             col.ind = iris$Species, # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, ellipse.type = "convex",
             legend.title = "Groups"
             )
4.4.5.5 Group mean points
When coloring individuals by groups (section 4.4.4.4), the mean points of groups (barycenters) are also displayed by default.
To remove the mean points, use the argument mean.point = FALSE.
fviz_pca_ind(iris.pca,
             geom.ind = "point", # show points only (but not "text")
             group.ind = iris$Species, # color by groups
             legend.title = "Groups",
             mean.point = FALSE)
4.4.5.6 Axis lines
The argument axes.linetype can be used to specify the line type of axes. Default is "dashed". Allowed values include "blank", "solid", "dotted", etc. To see all possible values type ggpubr::show_line_types() in R.
To remove axis lines, use axes.linetype = "blank":
fviz_pca_var(res.pca, axes.linetype = "blank")
4.4.5.7 Graphical parameters
To easily change the graphical parameters of any ggplot, you can use the function ggpar() [ggpubr package].
The graphical parameters that can be changed using ggpar() include:
Main titles, axis labels and legend titles.
Legend position. Possible values: "top", "bottom", "left", "right", "none".
Color palette.
Themes. Allowed values include: theme_gray(), theme_bw(), theme_minimal(), theme_classic(), theme_void().
ind.p <- fviz_pca_ind(iris.pca, geom = "point", col.ind = iris$Species)
ggpubr::ggpar(ind.p,
              title = "Principal Component Analysis",
              subtitle = "Iris data set",
              caption = "Source: factoextra",
              xlab = "PC1", ylab = "PC2",
              legend.title = "Species", legend.position = "top",
              ggtheme = theme_gray(), palette = "jco"
              )
4.4.6 Biplot
To make a simple biplot of individuals and variables, type this:
fviz_pca_biplot(res.pca, repel = TRUE,
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969"  # Individuals color
                )
Now, using the iris.pca output, let's:
make a biplot of individuals and variables
change the color of individuals by groups: col.ind = iris$Species
show only the labels for variables: label = "var" or use geom.ind = "point"
fviz_pca_biplot(iris.pca,
                col.ind = iris$Species, palette = "jco",
                addEllipses = TRUE, label = "var",
                col.var = "black", repel = TRUE,
                legend.title = "Species")
Note that, the biplot might be only useful when there is a low number of variables and individuals in the data set; otherwise the final plot would be unreadable.
Note also that, the coordinates of individuals and variables are not constructed on the same space. Therefore, on a biplot, you should mainly focus on the direction of variables but not on their absolute positions on the plot.
Roughly speaking a biplot can be interpreted as follows:
an individual that is on the same side of a given variable has a high value for this variable;
an individual that is on the opposite side of a given variable has a low value for this variable.
In the following example, we want to color both individuals and variables by groups. The trick is to use pointshape = 21 for individual points. This particular point shape can be filled by a color using the argument fill.ind. The border line color of individual points is set to "black" using col.ind. To color variables by groups, the argument col.var will be used.
To customize individuals and variable colors, we use the helper functions fill_palette() and color_palette() [in ggpubr package].
fviz_pca_biplot(iris.pca,
                # Fill individuals by groups
                geom.ind = "point",
                pointshape = 21,
                pointsize = 2.5,
                fill.ind = iris$Species,
                col.ind = "black",
                # Color variable by groups
                col.var = factor(c("sepal", "sepal", "petal", "petal")),
                legend.title = list(fill = "Species", color = "Clusters"),
                repel = TRUE # Avoid label overplotting
                ) +
  ggpubr::fill_palette("jco") + # Individual fill color
  ggpubr::color_palette("npg")  # Variable colors
Another complex example is to color individuals by groups (discrete color) and variables by their contributions to the principal components (gradient colors). Additionally, we'll change the transparency of variables by their contributions using the argument alpha.var.
fviz_pca_biplot(iris.pca,
                # Individuals
                geom.ind = "point",
                fill.ind = iris$Species, col.ind = "black",
                pointshape = 21, pointsize = 2,
                palette = "jco",
                addEllipses = TRUE,
                # Variables
                alpha.var = "contrib", col.var = "contrib",
                gradient.cols = "RdYlBu",
                legend.title = list(fill = "Species", color = "Contrib",
                                    alpha = "Contrib")
                )
4.5 Supplementary elements
4.5.1 Definition and types
As described above (section 4.3.2), the decathlon2 data set contains supplementary continuous variables (quanti.sup, columns 11:12), supplementary qualitative variables (quali.sup, column 13) and supplementary individuals (ind.sup, rows 24:27).
Supplementary variables and individuals are not used for the determination of the principal components. Their coordinates are predicted using only the information provided by the performed principal component analysis on active variables/individuals.
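The prediction step can be illustrated with a small base R sketch. It is only an analogy under stated assumptions: it uses prcomp() and predict() on iris rather than FactoMineR's PCA(), the split into 140 active and 10 supplementary rows is arbitrary, and the standardization conventions differ slightly from FactoMineR, so the numbers are not comparable with the decathlon2 results.

```r
# Assumption: iris rows 1-140 play the role of active individuals,
# rows 141-150 the role of supplementary individuals.
active <- iris[1:140, -5]
supp   <- iris[141:150, -5]

# The principal components are determined by the active individuals only
pca <- prcomp(active, scale. = TRUE)

# Supplementary individuals are projected afterwards, using the centering,
# scaling and loadings estimated from the active individuals
supp.coord <- predict(pca, newdata = supp)

# The same projection written out by hand
supp.manual <- scale(supp, center = pca$center, scale = pca$scale) %*% pca$rotation

print(round(supp.coord[, 1:2], 3))
```

Adding or removing supplementary rows leaves pca$rotation unchanged, which is exactly what distinguishes them from active individuals.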
4.5.2 Specification in PCA
To specify supplementary individuals and variables, the function PCA() can be used as follows:
PCA(X, ind.sup = NULL,
    quanti.sup = NULL, quali.sup = NULL, graph = TRUE)
X: a data frame. Rows are individuals and columns are numeric variables.
ind.sup: a numeric vector specifying the indexes of the supplementary individuals.
quanti.sup, quali.sup: a numeric vector specifying, respectively, the indexes of the quantitative and qualitative variables.
graph: a logical value. If TRUE a graph is displayed.
For example, type this:
res.pca <- PCA(decathlon2, ind.sup = 24:27,
               quanti.sup = 11:12, quali.sup = 13, graph = FALSE)
4.5.3 Quantitative variables
Predicted results (coordinates, correlation and cos2) for the supplementary quantitative variables:
res.pca$quanti.sup
## $coord
##         Dim.1   Dim.2  Dim.3   Dim.4   Dim.5
## Rank   -0.701 -0.2452 -0.183  0.0558 -0.0738
## Points  0.964  0.0777  0.158 -0.1662 -0.0311
##
## $cor
##         Dim.1   Dim.2  Dim.3   Dim.4   Dim.5
## Rank   -0.701 -0.2452 -0.183  0.0558 -0.0738
## Points  0.964  0.0777  0.158 -0.1662 -0.0311
##
## $cos2
##        Dim.1   Dim.2  Dim.3   Dim.4   Dim.5
## Rank   0.492 0.06012 0.0336 0.00311 0.00545
## Points 0.929 0.00603 0.0250 0.02763 0.00097
Visualize all variables (active and supplementary ones):
fviz_pca_var(res.pca)
Further arguments to customize the plot:
# Change color of variables
fviz_pca_var(res.pca,
             col.var = "black",     # Active variables
             col.quanti.sup = "red" # Suppl. quantitative variables
             )
# Hide active variables on the plot,
# show only supplementary variables
fviz_pca_var(res.pca, invisible = "var")
# Hide supplementary variables
fviz_pca_var(res.pca, invisible = "quanti.sup")
Note that, by default, supplementary quantitative variables are shown in blue color and dashed lines.
Using the fviz_pca_var(), the quantitative supplementary variables are displayed automatically on the correlation circle plot. Note that, you can add the quanti.sup variables manually, using the fviz_add() function, for further customization. An example is shown below.
# Plot of active variables
p <- fviz_pca_var(res.pca, invisible = "quanti.sup")
# Add supplementary quantitative variables
fviz_add(p, res.pca$quanti.sup$coord,
         geom = c("arrow", "text"),
         color = "red")
4.5.4 Individuals
Predicted results for the supplementary individuals (ind.sup):
res.pca$ind.sup
Visualize all individuals (active and supplementary ones). On the graph, you can also add the supplementary qualitative variables (quali.sup), whose coordinates are accessible using res.pca$quali.sup$coord.
p <- fviz_pca_ind(res.pca, col.ind.sup = "blue", repel = TRUE)
p <- fviz_add(p, res.pca$quali.sup$coord, color = "red")
p
Supplementary individuals are shown in blue. The levels of the supplementary qualitative variable are shown in red color.
4.5.5 Qualitative variables
In the previous section, we showed that you can add the supplementary qualitative variables on the individuals plot using fviz_add().
Note that, the supplementary qualitative variables can also be used for coloring individuals by groups. This can help to interpret the data. The decathlon2 data set contains a supplementary qualitative variable at column 13 corresponding to the type of competitions.
The results concerning the supplementary qualitative variable are:
res.pca$quali.sup
To color individuals by a supplementary qualitative variable, the argument habillage is used to specify the index of the supplementary qualitative variable. Historically, this argument name comes from the FactoMineR package. It's a French word meaning "dressing" in English. To keep consistency between FactoMineR and factoextra, we decided to keep the same argument name.
fviz_pca_ind(res.pca, habillage = 13,
             addEllipses = TRUE, ellipse.type = "confidence",
             palette = "jco", repel = TRUE)
Recall that, to remove the mean points of groups, specify the argument mean.point = FALSE.
4.6 Filtering results
If you have many individuals/variables, it's possible to visualize only some of them using the arguments select.ind and select.var.
select.ind, select.var: a selection of individuals/variables to be plotted. Allowed values are NULL or a list containing the arguments name, cos2 or contrib:
name: is a character vector containing individuals/variable names to be plotted
cos2: if cos2 is in [0, 1], ex: 0.6, then individuals/variables with a cos2 > 0.6 are plotted
if cos2 > 1, ex: 5, then the top 5 active individuals/variables and top 5 supplementary columns/rows with the highest cos2 are plotted
contrib: if contrib > 1, ex: 5, then the top 5 individuals/variables with the highest contributions are plotted
# Visualize variable with cos2 >= 0.6
fviz_pca_var(res.pca, select.var = list(cos2 = 0.6))
# Top 5 active variables with the highest cos2
fviz_pca_var(res.pca, select.var = list(cos2 = 5))
# Select by names
name <- list(name = c("Long.jump", "High.jump", "X100m"))
fviz_pca_var(res.pca, select.var = name)
# top 5 contributing individuals and variable
fviz_pca_biplot(res.pca, select.ind = list(contrib = 5),
                select.var = list(contrib = 5),
                ggtheme = theme_minimal())
When the selection is done according to the contribution values, supplementary individuals/variables are not shown because they don't contribute to the construction of the axes.
4.7 Exporting results
4.7.1 Export plots to PDF/PNG files
The factoextra package produces ggplot2-based graphs. To save any ggplot, the standard R code is as follows:
# Print the plot to a pdf file
pdf("myplot.pdf")
print(myplot)
dev.off()
In the following examples, we'll show you how to save the different graphs into pdf or png files.
The first step is to create the plots you want as R objects:
# Scree plot
scree.plot <- fviz_eig(res.pca)
# Plot of individuals
ind.plot <- fviz_pca_ind(res.pca)
# Plot of variables
var.plot <- fviz_pca_var(res.pca)
Next, the plots can be exported into a single pdf file as follows:
pdf("PCA.pdf") # Create a new pdf device
print(scree.plot)
print(ind.plot)
print(var.plot)
dev.off() # Close the pdf device
Note that, using the above R code will create the PDF file into your current working directory. To see the path of your current working directory, type getwd() in the R console.
To print each plot to a specific png file, the R code looks like this:
# Print scree plot to a png file
png("pca-scree-plot.png")
print(scree.plot)
dev.off()
# Print variables plot to a png file
png("pca-variables.png")
print(var.plot)
dev.off()
# Print individuals plot to a png file
png("pca-individuals.png")
print(ind.plot)
dev.off()
Another alternative, to export ggplots, is to use the function ggexport() [in ggpubr package]. We like ggexport(), because it's very simple. With one line of R code, it allows us to export individual plots to a file (pdf, eps or png) (one plot per page). It can also arrange the plots (2 plots per page, for example) before exporting them. The examples below demonstrate how to export ggplots using ggexport().
Export individual plots to a pdf file (one plot per page):
library(ggpubr)
ggexport(plotlist = list(scree.plot, ind.plot, var.plot),
         filename = "PCA.pdf")
Arrange and export. Specify nrow and ncol to display multiple plots on the same page:
ggexport(plotlist = list(scree.plot, ind.plot, var.plot),
         nrow = 2, ncol = 2,
         filename = "PCA.pdf")
Export plots to png files. If you specify a list of plots, then multiple png files will be automatically created to hold each plot.
ggexport(plotlist = list(scree.plot, ind.plot, var.plot),
         filename = "PCA.png")
4.7.2 Export results to txt/csv files
All the outputs of the PCA (individuals/variables coordinates, contributions, etc) can be exported at once, into a TXT/CSV file, using the function write.infile() [in FactoMineR package]:
# Export into a TXT file
write.infile(res.pca, "pca.txt", sep = "\t")
# Export into a CSV file
write.infile(res.pca, "pca.csv", sep = ";")
4.8 Summary
In conclusion, we described how to perform and interpret principal component analysis (PCA). We computed PCA using the PCA() function [FactoMineR]. Next, we used the factoextra R package to produce ggplot2-based visualization of the PCA results.
There are other functions [packages] to compute PCA in R:
1. Using prcomp() [stats]
res.pca <- prcomp(iris[, -5], scale. = TRUE)
Read more: http://www.sthda.com/english/wiki/pca-using-prcomp-and-princomp
2. Using princomp() [stats]
res.pca <- princomp(iris[, -5], cor = TRUE)
Read more: http://www.sthda.com/english/wiki/pca-using-prcomp-and-princomp
3. Using dudi.pca() [ade4]
library("ade4")
res.pca <- dudi.pca(iris[, -5], scannf = FALSE, nf = 5)
Read more: http://www.sthda.com/english/wiki/pca-using-ade4-and-factoextra
4. Using epPCA() [ExPosition]
library("ExPosition")
res.pca <- epPCA(iris[, -5], graph = FALSE)
No matter what functions you decide to use, in the list above, the factoextra package can handle the output for creating beautiful plots similar to what we described in the previous sections for FactoMineR:
fviz_eig(res.pca)     # Scree plot
fviz_pca_ind(res.pca) # Graph of individuals
fviz_pca_var(res.pca) # Graph of variables
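As a quick sanity check that these alternatives describe the same analysis, the base R sketch below compares the eigenvalues returned by prcomp() and princomp() on the standardized iris data. It covers only the two functions from the stats package; the object names eig.prcomp, eig.princomp and variance.percent are illustrative.

```r
# Standardized PCA with the two functions from the stats package
res.prcomp   <- prcomp(iris[, -5], scale. = TRUE)
res.princomp <- princomp(iris[, -5], cor = TRUE)

# Eigenvalues = squared standard deviations of the components
eig.prcomp   <- res.prcomp$sdev^2
eig.princomp <- unname(res.princomp$sdev^2)

# Percentage of variance retained by each component
variance.percent <- 100 * eig.prcomp / sum(eig.prcomp)

print(round(cbind(eig.prcomp, eig.princomp, variance.percent), 3))
```

With standardized variables both functions decompose the correlation matrix, so the eigenvalues agree and sum to the number of variables. The signs of the coordinates may still differ between functions, which does not change the interpretation.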
4.9 Further reading
For the mathematical background behind PCA, refer to the following video courses, articles and books:
Principal component analysis (article) (Abdi and Williams 2010). https://goo.gl/1Vtwq1.
Principal Component Analysis Course Using FactoMineR (Video courses). https://goo.gl/VZJsnM
Exploratory Multivariate Analysis by Example Using R (book) (F. Husson, Le, and Pagès 2017).
Principal Component Analysis (book) (Jolliffe 2002).
See also:
PCA using prcomp() and princomp() (tutorial). http://www.sthda.com/english/wiki/pca-using-prcomp-and-princomp
PCA using ade4 and factoextra (tutorial). http://www.sthda.com/english/wiki/pca-using-ade4-and-factoextra
5 Correspondence Analysis
5.1 Introduction
Correspondence analysis (CA) is an extension of principal component analysis (Chapter 4) suited to explore relationships among qualitative variables (or categorical data). Like principal component analysis, it provides a solution for summarizing and visualizing a data set in two-dimensional plots.
Here, we describe the simple correspondence analysis, which is used to analyze frequencies formed by two categorical variables, a data table known as a contingency table. It provides factor scores (coordinates) for both row and column points of the contingency table. These coordinates are used to visualize graphically the association between row and column elements in the contingency table.
When analyzing a two-way contingency table, a typical question is whether certain row elements are associated with some of the column elements. Correspondence analysis is a geometric approach for visualizing the rows and columns of a two-way contingency table as points in a low-dimensional space, such that the positions of the row and column points are consistent with their associations in the table. The aim is to have a global view of the data that is useful for interpretation.
In the current chapter, we'll show how to compute and interpret correspondence analysis using two R packages: i) FactoMineR for the analysis and ii) factoextra for data visualization. Additionally, we'll show how to reveal the most important variables that explain the variations in a data set. We continue by explaining how to apply correspondence analysis using supplementary rows and columns. This is important, if you want to make predictions with CA. The last sections of this guide also describe how to filter CA results in order to keep only the most contributing variables. Finally, we'll see how to deal with outliers.
5.2 Computation
5.2.1 R packages
Several functions from different packages are available in the R software for computing correspondence analysis:
CA() [FactoMineR package], ca() [ca package], dudi.coa() [ade4 package], corresp() [MASS package], and epCA() [ExPosition package]
No matter what function you decide to use, you can easily extract and visualize the results of correspondence analysis using R functions provided in the factoextra R package.
Here, we'll use FactoMineR (for the analysis) and factoextra (for ggplot2-based elegant visualization). To install the two packages, type this:
install.packages(c("FactoMineR", "factoextra"))
Load the packages:
library("FactoMineR")
library("factoextra")
5.2.2 Data format
The data should be a contingency table. We'll use the demo data set housetasks available in the factoextra R package.
data(housetasks)
# head(housetasks)
The data is a contingency table containing 13 housetasks and their repartition in the couple:
rows are the different tasks
values are the frequencies of the tasks done:
by the wife only
alternatively
by the husband only
or jointly
The data is illustrated in the following image:
5.2.3 Graph of contingency tables and chi-square test
The above contingency table is not very large. Therefore, it's easy to visually inspect and interpret row and column profiles:
It's evident that the housetasks Laundry, Main_Meal and Dinner are more frequently done by the "Wife".
Repairs and driving are dominantly done by the husband.
Holidays are frequently associated with the column "jointly".
Exploratory data analysis and visualization of contingency tables have been covered in our previous article: Chi-Square test of independence in R. Briefly, a contingency table can be visualized using the functions balloonplot() [gplots package] and mosaicplot() [graphics package]:
library("gplots")
# 1. convert the data as a table
dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main = "housetasks", xlab = "", ylab = "",
            label = FALSE, show.margins = FALSE)
Note that, row and column sums are printed by default in the bottom and right margins, respectively. These values are hidden, in the above plot, using the argument show.margins = FALSE.
For a small contingency table, you can use the Chi-square test to evaluate whether there is a significant dependence between row and column categories:
chisq <- chisq.test(housetasks)
chisq
##
##  Pearson's Chi-squared test
##
## data:  housetasks
## X-squared = 2000, df = 40, p-value < 2e-16
In our example, the row and the column variables are statistically significantly associated (p-value < 2e-16).
5.2.4 R code to compute CA
The function CA() [FactoMineR package] can be used. A simplified format is:
CA(X, ncp = 5, graph = TRUE)
X: a data frame (contingency table)
ncp: number of dimensions kept in the final results.
graph: a logical value. If TRUE a graph is displayed.
To compute correspondence analysis, type this:
library("FactoMineR")
res.ca <- CA(housetasks, graph = FALSE)
The output of the function CA() is a list including:
print(res.ca)
## **Results of the Correspondence Analysis (CA)**
## The row variable has 13 categories; the column variable has 4 categories
## The chi square of independence between the two variables is equal to 1944 (p-value = 0).
## *The results are available in the following objects:
##
##    name              description
## 1  "$eig"            "eigenvalues"
## 2  "$col"            "results for the columns"
## 3  "$col$coord"      "coord. for the columns"
## 4  "$col$cos2"       "cos2 for the columns"
## 5  "$col$contrib"    "contributions of the columns"
## 6  "$row"            "results for the rows"
## 7  "$row$coord"      "coord. for the rows"
## 8  "$row$cos2"       "cos2 for the rows"
## 9  "$row$contrib"    "contributions of the rows"
## 10 "$call"           "summary called parameters"
## 11 "$call$marge.col" "weights of the columns"
## 12 "$call$marge.row" "weights of the rows"
The object that is created using the function CA() contains a lot of information stored in many different lists and matrices. These values are described in the next section.
5.3 Visualization and interpretation
We'll use the following functions [in factoextra] to help in the interpretation and the visualization of the correspondence analysis:
get_eigenvalue(res.ca): Extract the eigenvalues/variances retained by each dimension (axis)
fviz_eig(res.ca): Visualize the eigenvalues
get_ca_row(res.ca), get_ca_col(res.ca): Extract the results for rows and columns, respectively.
fviz_ca_row(res.ca), fviz_ca_col(res.ca): Visualize the results for rows and columns, respectively.
fviz_ca_biplot(res.ca): Make a biplot of rows and columns.
In the next sections, we'll illustrate each of these functions.
5.3.1 Statistical significance
To interpret correspondence analysis, the first step is to evaluate whether there is a significant dependency between the rows and columns.
A rigorous method is to use the chi-square statistic for examining the association between row and column variables. This appears at the top of the report generated by the function summary(res.ca) or print(res.ca), see section 5.2.4. A high chi-square statistic means a strong link between row and column variables.
# Chi-square statistics
chi2 <- 1944.456
# Degree of freedom
df <- (nrow(housetasks) - 1) * (ncol(housetasks) - 1)
# P-value
pval <- pchisq(chi2, df = df, lower.tail = FALSE)
pval
## [1] 0
5.3.2Eigenvalues/Variances
Recallthat,weexaminetheeigenvaluestodeterminethenumberofaxistobeconsidered.Theeigenvaluesandtheproportionofvariancesretainedbythedifferentaxescanbeextractedusingthefunctionget_eigenvalue()[factoextrapackage].Eigenvaluesarelargeforthefirstaxisandsmallforthesubsequentaxis.
library("factoextra")
eig.val<-get_eigenvalue(res.ca)
eig.val
##eigenvaluevariance.percentcumulative.variance.percent
##Dim.10.54348.748.7
Inourexample,theassociationishighlysignificant(chi-square:1944.456,p=0).
##Dim.20.44539.988.6
##Dim.30.12711.4100.0
Eigenvalues correspond to the amount of information retained by each axis. Dimensions are ordered decreasingly and listed according to the amount of variance explained in the solution. Dimension 1 explains the most variance in the solution, followed by dimension 2 and so on.
The cumulative percentage explained is obtained by adding the successive proportions of variation explained to obtain the running total. For instance, 48.69% plus 39.91% equals 88.6%, and so forth. Therefore, about 88.6% of the variation is explained by the first two dimensions.
Eigenvalues can be used to determine the number of axes to retain. There is no "rule of thumb" to choose the number of dimensions to keep for the data interpretation. It depends on the research question and the researcher's need. For example, if you are satisfied with 80% of the total variance explained, then use the number of dimensions necessary to achieve that.
In our analysis, the first two axes explain 88.6% of the variation. This is an acceptably large percentage.
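The percentages above can be reproduced from the eigenvalues alone. The sketch below uses the three eigenvalues printed earlier; the 80% target is only the illustrative figure mentioned in the text, not a recommendation.

```r
# Eigenvalues reported above for the three CA dimensions
eig <- c(Dim.1 = 0.543, Dim.2 = 0.445, Dim.3 = 0.127)

# Share of the total inertia carried by each dimension, and its running total
variance.percent   <- 100 * eig / sum(eig)
cumulative.percent <- cumsum(variance.percent)

# Illustrative target: smallest number of dimensions reaching 80% of the variance
target <- 80
n.dims <- which(cumulative.percent >= target)[1]

round(variance.percent, 1)
round(cumulative.percent, 1)
n.dims
```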
An alternative method to determine the number of dimensions is to look at a scree plot, which is the plot of eigenvalues/variances ordered from largest to smallest. The number of components is determined at the point beyond which the remaining eigenvalues are all relatively small and of comparable size.
The scree plot can be produced using the function fviz_eig() or fviz_screeplot() [factoextra package].
fviz_screeplot(res.ca, addlabels = TRUE, ylim = c(0, 50))
The point at which the scree plot shows a bend (the so-called "elbow") can be considered as indicating an optimal dimensionality.
Note that a good dimension reduction is achieved when the first few dimensions account for a large proportion of the variability.
It's also possible to calculate an average eigenvalue above which the axis should be kept in the solution.
Our data contains 13 rows and 4 columns.
If the data were random, the expected value of the eigenvalue for each axis would be 1/(nrow(housetasks)-1) = 1/12 = 8.33% in terms of rows.
Likewise, the average axis should account for 1/(ncol(housetasks)-1) = 1/3 = 33.33% in terms of the 4 columns.
According to (M. T. Bendixen 1995):
Any axis with a contribution larger than the maximum of these two percentages should be considered as important and included in the solution for the interpretation of the data.
The R code below draws the scree plot with a red dashed line specifying the average eigenvalue:
fviz_screeplot(res.ca) +
  geom_hline(yintercept = 33.33, linetype = 2, color = "red")
According to the graph above, only dimensions 1 and 2 should be used in the solution. Dimension 3 explains only 11.4% of the total inertia, which is below the average eigenvalue (33.33%) and too little to be kept for further analysis.
Dimensions 1 and 2 explain approximately 48.7% and 39.9% of the total inertia, respectively. This corresponds to a cumulative total of 88.6% of total inertia retained by the 2 dimensions. The higher the retention, the more subtlety in the original data is retained in the low-dimensional solution (M. Bendixen 2003).
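The average-eigenvalue rule can be written out in a few lines. This is a sketch using only the table dimensions (13 rows, 4 columns) and the percentages of variance reported in this section; the variable names are chosen here for illustration.

```r
# Table dimensions and percentages of variance reported in this section
n.row <- 13
n.col <- 4
variance.percent <- c(Dim.1 = 48.7, Dim.2 = 39.9, Dim.3 = 11.4)

# Expected percentage per axis if the data were random
avg.row <- 100 / (n.row - 1)   # in terms of rows
avg.col <- 100 / (n.col - 1)   # in terms of columns

# Keep the axes whose percentage exceeds the larger of the two averages
threshold <- max(avg.row, avg.col)
keep <- names(variance.percent)[variance.percent > threshold]

round(c(rows = avg.row, columns = avg.col), 2)
keep
```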
Note that you can use more than 2 dimensions. However, the supplementary dimensions are unlikely to contribute significantly to the interpretation of the nature of the association between the rows and columns.
5.3.3 Biplot
The function fviz_ca_biplot() [factoextra package] can be used to draw the biplot of row and column variables.
# repel = TRUE to avoid text overlapping (slow if many points)
fviz_ca_biplot(res.ca, repel = TRUE)
The graph above is called a symmetric plot and shows a global pattern within the data. Rows are represented by blue points and columns by red triangles.
The distance between any row points or column points gives a measure of their similarity (or dissimilarity). Row points with similar profiles are close to each other on the factor map. The same holds true for column points.
This graph shows that:
Housetasks such as dinner, breakfast and laundry are done more often by the wife.
Driving and repairs are done by the husband.
......
A symmetric plot represents the row and column profiles simultaneously in a common space. In this case, only the distance between row points or the distance between column points can be really interpreted.
The distance between any row and column items is not meaningful! You can only make general statements about the observed pattern.
In order to interpret the distance between column and row points, the column profiles must be presented in row space or vice-versa. This type of map is called an asymmetric biplot and is discussed in section 5.3.6.2.
The next step for the interpretation is to determine which row and column variables contribute the most to the definition of the different dimensions retained in the model.
5.3.4 Graph of row variables
5.3.4.1 Results
The function get_ca_row() [in factoextra] is used to extract the results for row variables. This function returns a list containing the coordinates, the cos2, the contribution and the inertia of row variables:
row <- get_ca_row(res.ca)
row
## Correspondence Analysis - Results for rows
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the rows"
## 2 "$cos2"    "Cos2 for the rows"
## 3 "$contrib" "contributions of the rows"
## 4 "$inertia" "Inertia of the rows"
The components of the get_ca_row() output can be used in the plot of rows as follows:
row$coord: coordinates of each row point in each dimension (1, 2 and 3). Used to create the scatter plot.
row$cos2: quality of representation of rows.
row$contrib: contribution of rows (in %) to the definition of the dimensions.
The different components can be accessed as follows:
# Coordinates
head(row$coord)
# Cos2: quality on the factor map
head(row$cos2)
# Contributions to the principal dimensions
head(row$contrib)
In this section, we describe how to visualize row points only. Next, we highlight rows according to either i) their quality of representation on the factor map or ii) their contributions to the dimensions.
5.3.4.2 Coordinates of row points
The R code below displays the coordinates of each row point in each dimension (1, 2 and 3):
head(row$coord)
##             Dim 1  Dim 2   Dim 3
## Laundry    -0.992  0.495 -0.3167
## Main_meal  -0.876  0.490 -0.1641
## Dinner     -0.693  0.308 -0.2074
## Breakfeast -0.509  0.453  0.2204
## Tidying    -0.394 -0.434 -0.0942
## Dishes     -0.189 -0.442  0.2669
Use the function fviz_ca_row() [in factoextra] to visualize only row points:
fviz_ca_row(res.ca, repel = TRUE)
Note that it's possible to plot row points and to color them according to either i) their quality on the factor map (cos2) or ii) their contribution values to the definition of dimensions (contrib).
It's possible to change the color and the shape of the row points using the arguments col.row and shape.row as follows:
fviz_ca_row(res.ca, col.row = "steelblue", shape.row = 15)
The plot above shows the relationships between row points:
Rows with a similar profile are grouped together.
Negatively correlated rows are positioned on opposite sides of the plot origin (opposed quadrants).
The distance between row points and the origin measures the quality of the row points on the factor map. Row points that are away from the origin are well represented on the factor map.
5.3.4.3 Quality of representation of rows
The result of the analysis shows that the contingency table has been successfully represented in a low-dimensional space using correspondence analysis. The two dimensions 1 and 2 are sufficient to retain 88.6% of the total inertia (variation) contained in the data.
However, not all the points are equally well displayed in the two dimensions.
Recall that the quality of representation of the rows on the factor map is called the squared cosine (cos2) or the squared correlation.
The cos2 measures the degree of association between rows/columns and a particular axis. The cos2 of row points can be extracted as follows:
head(row$cos2, 4)
##            Dim 1 Dim 2  Dim 3
## Laundry    0.740 0.185 0.0755
## Main_meal  0.742 0.232 0.0260
## Dinner     0.777 0.154 0.0697
## Breakfeast 0.505 0.400 0.0948
The values of the cos2 lie between 0 and 1. The sum of the cos2 for rows on all the CA dimensions is equal to one.
If a row item is well represented by two dimensions, the sum of the cos2 is close to one. For some of the row items, more than 2 dimensions are required to perfectly represent the data.
It's possible to color row points by their cos2 values using the argument col.row = "cos2". This produces a color gradient, which can be customized using the argument gradient.cols. For instance, gradient.cols = c("white", "blue", "red") means that:
rows with low cos2 values will be colored in "white"
rows with mid cos2 values will be colored in "blue"
rows with high cos2 values will be colored in "red"
# Color by cos2 values: quality on the factor map
fviz_ca_row(res.ca, col.row = "cos2",
            gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
            repel = TRUE)
The quality of representation of a row or column in n dimensions is simply the sum of the squared cosine of that row or column over the n dimensions.
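As a small worked example, the sketch below re-enters the four cos2 rows printed above and sums them over the first two dimensions. The matrix is typed in by hand from that output, so the totals are only accurate up to rounding.

```r
# cos2 values printed above for the first four rows
cos2 <- matrix(c(0.740, 0.185, 0.0755,
                 0.742, 0.232, 0.0260,
                 0.777, 0.154, 0.0697,
                 0.505, 0.400, 0.0948),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("Laundry", "Main_meal", "Dinner", "Breakfeast"),
                               c("Dim 1", "Dim 2", "Dim 3")))

# Quality of representation on the 2-dimensional map: sum over Dim 1 and Dim 2
quality.2d <- rowSums(cos2[, 1:2])

# Over all three dimensions the cos2 add up to (about) one
total <- rowSums(cos2)

quality.2d
round(total, 2)
```

Breakfeast has the lowest two-dimensional quality of the four rows, which reflects its larger cos2 on the third dimension.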
Note that it's also possible to change the transparency of the row points according to their cos2 values using the option alpha.row = "cos2". For example, type this:
# Change the transparency by cos2 values
fviz_ca_row(res.ca, alpha.row = "cos2")
You can visualize the cos2 of row points on all the dimensions using the corrplot package:
library("corrplot")
corrplot(row$cos2, is.corr = FALSE)
It's also possible to create a bar plot of rows cos2 using the function fviz_cos2() [in factoextra]:
# Cos2 of rows on Dim.1 and Dim.2
fviz_cos2(res.ca, choice = "row", axes = 1:2)
Note that all row points except Official are well represented by the first two dimensions. This implies that the position of the point corresponding to the item Official on the scatter plot should be interpreted with some caution. A higher dimensional solution is probably necessary for the item Official.
5.3.4.4 Contributions of rows to the dimensions
The contribution of rows (in %) to the definition of the dimensions can be extracted as follows:
head(row$contrib)
##             Dim 1 Dim 2 Dim 3
## Laundry    18.287  5.56 7.968
## Main_meal  12.389  4.74 1.859
## Dinner      5.471  1.32 2.097
## Breakfeast  3.825  3.70 3.069
## Tidying     1.998  2.97 0.489
## Dishes      0.426  2.84 3.634
Rows that contribute the most to Dim.1 and Dim.2 are the most important in explaining the variability in the data set. Rows that do not contribute much to any dimension or that contribute to the last dimensions are less important.
It's possible to use the function corrplot() [corrplot package] to highlight the most contributing row points for each dimension:
library("corrplot")
corrplot(row$contrib, is.corr = FALSE)
The function fviz_contrib() [factoextra package] can be used to draw a bar plot of row contributions. If your data contains many rows, you can decide to show only the top contributing rows. The R code below shows the top 10 rows contributing to the dimensions:
# Contributions of rows to dimension 1
fviz_contrib(res.ca, choice = "row", axes = 1, top = 10)
The row variables with the larger values contribute the most to the definition of the dimensions.
# Contributions of rows to dimension 2
fviz_contrib(res.ca, choice = "row", axes = 2, top = 10)
The total contribution to dimensions 1 and 2 can be obtained as follows:
# Total contribution to dimensions 1 and 2
fviz_contrib(res.ca, choice = "row", axes = 1:2, top = 10)
The red dashed line on the graph above indicates the expected average value if the contributions were uniform. The calculation of the expected contribution value, under the null hypothesis, has been detailed in the principal component analysis chapter (4).
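The uniform reference value is easy to check by hand: with 13 rows, each row would contribute 100/13 (about 7.7%) to an axis. The sketch below applies this to the six contributions printed earlier. The eigenvalue-weighted average used to combine two axes is an assumption made here for illustration, not a documented description of what fviz_contrib() does internally.

```r
# Contributions (in %) printed above for the first six rows, Dim 1 and Dim 2
contrib <- matrix(c(18.287, 5.56,
                    12.389, 4.74,
                     5.471, 1.32,
                     3.825, 3.70,
                     1.998, 2.97,
                     0.426, 2.84),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(c("Laundry", "Main_meal", "Dinner",
                                    "Breakfeast", "Tidying", "Dishes"),
                                  c("Dim 1", "Dim 2")))

# Expected contribution if all 13 rows of the table contributed equally
n.rows   <- 13
expected <- 100 / n.rows

# Rows (among the six shown) above the uniform value on dimension 1
above.dim1 <- rownames(contrib)[contrib[, "Dim 1"] > expected]

# Assumption: combine two axes by weighting each contribution by its eigenvalue
eig <- c(0.543, 0.445)
combined <- as.vector(contrib %*% eig) / sum(eig)
names(combined) <- rownames(contrib)

round(expected, 2)
above.dim1
round(combined, 2)
```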
It can be seen that:
the row items Repairs, Laundry, Main_meal and Driving are the most important in the definition of the first dimension.
the row items Holidays and Repairs contribute the most to dimension 2.
The most important (or contributing) row points can be highlighted on the scatter plot as follows:
fviz_ca_row(res.ca, col.row = "contrib",
            gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
            repel = TRUE)
The scatter plot gives an idea of which pole of the dimensions the row categories are actually contributing to. It is evident that the row categories Repairs and Driving have an important contribution to the positive pole of the first dimension, while the categories Laundry and Main_meal have a major contribution to the negative pole of the first dimension, and so on.
In other words, dimension 1 is mainly defined by the opposition of Repairs and Driving (positive pole), and Laundry and Main_meal (negative pole).
Note that it's also possible to control the transparency of row points according to their contribution values using the option alpha.row = "contrib". For example, type this:
# Change the transparency by contrib values
fviz_ca_row(res.ca, alpha.row = "contrib",
            repel = TRUE)
5.3.5 Graph of column variables
5.3.5.1 Results
The function get_ca_col() [in factoextra] is used to extract the results for column variables. This function returns a list containing the coordinates, the cos2, the contribution and the inertia of column variables:
col <- get_ca_col(res.ca)
col
## Correspondence Analysis - Results for columns
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the columns"
## 2 "$cos2"    "Cos2 for the columns"
## 3 "$contrib" "contributions of the columns"
## 4 "$inertia" "Inertia of the columns"
To get access to the different components, use this:
# Coordinates of column points
head(col$coord)
# Quality of representation
head(col$cos2)
# Contributions
head(col$contrib)
5.3.5.2 Plots: quality and contribution
The function fviz_ca_col() is used to produce the graph of column points. To create a simple plot, type this:
fviz_ca_col(res.ca)
As with row points, it's also possible to color column points by their cos2 values:
fviz_ca_col(res.ca, col.col = "cos2",
            gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
            repel = TRUE)
The result for columns gives the same information as described for rows. For this reason, we'll just display the results for columns in this section with only a very brief comment.
The R code below creates a bar plot of columns cos2:
fviz_cos2(res.ca, choice = "col", axes = 1:2)
Recall that the value of the cos2 is between 0 and 1. A cos2 close to 1 corresponds to a column/row variable that is well represented on the factor map.
Note that only the column item Alternating is not very well displayed on the first two dimensions. The position of this item must be interpreted with caution in the space formed by dimensions 1 and 2.
To visualize the contribution of columns to the first two dimensions, type this:
fviz_contrib(res.ca, choice = "col", axes = 1:2)
5.3.6 Biplot options
A biplot is a graphical display of rows and columns in 2 or 3 dimensions. We have already described how to create CA biplots in section 5.3.3. Here, we'll describe different types of CA biplots.
5.3.6.1 Symmetric biplot
As mentioned above, the standard plot of correspondence analysis is a symmetric biplot in which both rows (blue points) and columns (red triangles) are represented in the same space using the principal coordinates. These coordinates represent the row and column profiles. In this case, only the distance between row points or the distance between column points can be really interpreted.
fviz_ca_biplot(res.ca, repel = TRUE)
With a symmetric plot, the distance between rows and columns can't be interpreted. Only general statements can be made about the pattern.
5.3.6.2 Asymmetric biplot
To make an asymmetric biplot, row (or column) points are plotted from the standard coordinates (S) and the profiles of the columns (or the rows) are plotted from the principal coordinates (P) (M. Bendixen 2003).
For a given axis, the standard and principal coordinates are related as follows:
P = sqrt(eigenvalue) × S
P: the principal coordinate of a row (or a column) on the axis
S: the standard coordinate of the same row (or column) on the axis
eigenvalue: the eigenvalue of the axis
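To make the relation concrete, the sketch below converts the first-dimension principal coordinates of the six rows printed in section 5.3.4.2 into standard coordinates, using the first eigenvalue (0.543) reported in section 5.3.2. The values are typed in from those outputs, so they carry the same rounding.

```r
# First eigenvalue and first-dimension principal coordinates reported earlier
eig1 <- 0.543
principal <- c(Laundry = -0.992, Main_meal = -0.876, Dinner = -0.693,
               Breakfeast = -0.509, Tidying = -0.394, Dishes = -0.189)

# P = sqrt(eigenvalue) x S, hence S = P / sqrt(eigenvalue)
standard <- principal / sqrt(eig1)

# Going back from standard to principal coordinates recovers the input
back <- sqrt(eig1) * standard

round(standard, 3)
```

Because the eigenvalue is below 1, the standard coordinates are stretched away from the origin compared with the principal ones.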
Depending on the situation, other types of display can be set using the argument map (Nenadic and Greenacre 2007) in the function fviz_ca_biplot() [in factoextra].
The allowed options for the argument map are:
1. "rowprincipal" or "colprincipal" - these are the so-called asymmetric biplots, with either rows in principal coordinates and columns in standard coordinates, or vice versa (also known as row-metric-preserving or column-metric-preserving, respectively).
"rowprincipal": columns are represented in row space
"colprincipal": rows are represented in column space
2. "symbiplot" - both rows and columns are scaled to have variances equal to the singular values (square roots of eigenvalues), which gives a symmetric biplot but does not preserve row or column metrics.
3. "rowgab" or "colgab": Asymmetric maps proposed by Gabriel & Odoroff (Gabriel and Odoroff 1990):
"rowgab": rows in principal coordinates and columns in standard coordinates multiplied by the mass.
"colgab": columns in principal coordinates and rows in standard coordinates multiplied by the mass.
4. "rowgreen" or "colgreen": The so-called contribution biplots showing visually the most contributing points (Greenacre 2006b).
"rowgreen": rows in principal coordinates and columns in standard coordinates multiplied by the square root of the mass.
"colgreen": columns in principal coordinates and rows in standard coordinates multiplied by the square root of the mass.
Note that, in order to interpret the distance between column points and row points, the simplest way is to make an asymmetric plot. This means that the column profiles must be presented in row space or vice-versa.
The R code below draws a standard asymmetric biplot:
fviz_ca_biplot(res.ca,
               map = "rowprincipal", arrows = c(TRUE, TRUE),
               repel = TRUE)
We used the argument arrows, which is a vector of two logicals specifying if the plot should contain points (FALSE, default) or arrows (TRUE). The first value sets the rows and the second value sets the columns.
If the angle between two arrows is acute, then there is a strong association between the corresponding row and column.
To interpret the distance between a row and a column, you should perpendicularly project the row point onto the column arrow.
5.3.6.3 Contribution biplot
In the standard symmetric biplot (mentioned in the previous section), it's difficult to know the most contributing points to the solution of the CA.
Michael Greenacre proposed a new scaling of the display (called contribution biplot) which incorporates the contribution of points (M. Greenacre 2013). In this display, points that contribute very little to the solution are close to the center of the biplot and are relatively unimportant to the interpretation.
A contribution biplot can be drawn using the argument map = "rowgreen" or map = "colgreen".
Firstly, you have to decide whether to analyse the contributions of rows or columns to the definition of the axes.
In our example we'll interpret the contribution of rows to the axes. The argument map = "colgreen" is used. In this case, recall that columns are in principal coordinates and rows in standard coordinates multiplied by the square root of the mass. For a given row, the square of the new coordinate on an axis i is exactly the contribution of this row to the inertia of the axis i.
fviz_ca_biplot(res.ca, map = "colgreen", arrows = c(TRUE, FALSE),
               repel = TRUE)
In the graph above, the position of the column profile points is unchanged relative to that in the conventional biplot. However, the distances of the row points from the plot origin are related to their contributions to the two-dimensional factor map.
The closer an arrow is (in terms of angular distance) to an axis, the greater is the contribution of the row category on that axis relative to the other axis. If the arrow is halfway between the two, its row category contributes to the two axes to the same extent.
It is evident that the row category Repairs has an important contribution to the positive pole of the first dimension, while the categories Laundry and Main_meal have a major contribution to the negative pole of the first dimension.
Dimension 2 is mainly defined by the row category Holidays.
The row category Driving contributes to the two axes to the same extent.
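The statement that the squared contribution coordinate equals the contribution can be verified numerically. The sketch below computes a correspondence analysis from scratch in base R, via a singular value decomposition of the standardized residuals. The table N is made up for illustration, and the code is a textbook-style computation under that assumption, not the internal code of CA().

```r
# Hypothetical contingency table (made-up counts, 4 rows x 3 columns)
N <- matrix(c(30, 10,  5,
              12, 25,  8,
               6,  9, 28,
              10, 12, 11),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("row", 1:4), paste0("col", 1:3)))

P      <- N / sum(N)      # correspondence matrix
r.mass <- rowSums(P)      # row masses
c.mass <- colSums(P)      # column masses

# Standardized residuals and their singular value decomposition
S   <- (P - outer(r.mass, c.mass)) / sqrt(outer(r.mass, c.mass))
dec <- svd(S)
k   <- min(dim(N)) - 1    # number of non-trivial axes
eig <- dec$d[1:k]^2       # eigenvalues (principal inertias)

# Row coordinates under the three scalings discussed in this section
row.std   <- dec$u[, 1:k, drop = FALSE] / sqrt(r.mass)  # standard
row.pri   <- sweep(row.std, 2, sqrt(eig), "*")          # principal
row.green <- row.std * sqrt(r.mass)                     # standard x sqrt(mass)

# Contribution of each row to each axis, as a proportion of the axis inertia
row.contrib <- sweep(r.mass * row.pri^2, 2, eig, "/")

round(row.green^2, 4)
round(row.contrib, 4)
```

Both matrices print the same values, and each column sums to 1, i.e. 100% of the inertia of the axis.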
5.3.7 Dimension description
To easily identify row and column points that are the most associated with the principal dimensions, you can use the function dimdesc() [in FactoMineR]. Row/column variables are sorted by their coordinates in the dimdesc() output.
# Dimension description
res.desc <- dimdesc(res.ca, axes = c(1, 2))
Description of dimension 1:
# Description of dimension 1 by row points
head(res.desc[[1]]$row, 4)
##             coord
## Laundry    -0.992
## Main_meal  -0.876
## Dinner     -0.693
## Breakfeast -0.509
# Description of dimension 1 by column points
head(res.desc[[1]]$col, 4)
##               coord
## Wife        -0.8376
## Alternating -0.0622
## Jointly      0.1494
## Husband      1.1609
Description of dimension 2:
# Description of dimension 2 by row points
res.desc[[2]]$row
# Description of dimension 2 by column points
res.desc[[2]]$col
5.4 Supplementary elements
5.4.1 Data format
We'll use the data set children [in FactoMineR package]. It contains 18 rows and 8 columns:
data(children)
# head(children)
Data format for correspondence analysis with supplementary elements
The data used here is a contingency table describing the answers given by different categories of people to the following question: What are the reasons that can make a woman or a couple hesitate to have children?
Only some of the rows and columns will be used to perform the correspondence analysis (CA). The coordinates of the remaining (supplementary) rows/columns on the factor map will be predicted after the CA.
In CA terminology, our data contains:
Active rows (rows 1:14): Rows that are used during the correspondence analysis.
Supplementary rows (row.sup 15:18): The coordinates of these rows will be predicted using the CA information and parameters obtained with active rows/columns.
Active columns (columns 1:5): Columns that are used for the correspondence analysis.
Supplementary columns (col.sup 6:8): As with supplementary rows, the coordinates of these columns will also be predicted.
5.4.2 Specification in CA
As mentioned above, supplementary rows and columns are not used for the definition of the principal dimensions. Their coordinates are predicted using only the information provided by the CA performed on active rows/columns.
To specify supplementary rows/columns, the function CA() [in FactoMineR] can be used as follows:
CA(X, ncp = 5, row.sup = NULL, col.sup = NULL,
   graph = TRUE)
X: a data frame (contingency table)
row.sup: a numeric vector specifying the indexes of the supplementary rows
col.sup: a numeric vector specifying the indexes of the supplementary columns
ncp: number of dimensions kept in the final results.
graph: a logical value. If TRUE a graph is displayed.
For example, type this:
res.ca<-CA(children,row.sup=15:18,col.sup=6:8,
graph=FALSE)
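To illustrate what "predicted" means for a supplementary row, the sketch below uses the standard transition formula of correspondence analysis: the principal coordinate of a row is its profile multiplied by the column standard coordinates. The active table N and the supplementary row sup.row are made-up values, and project.row() is a hypothetical helper written for this example. Whether CA() computes the projection in exactly this way internally is not shown here.

```r
# Hypothetical active table (made-up counts, 4 rows x 3 columns)
N <- matrix(c(30, 10,  5,
              12, 25,  8,
               6,  9, 28,
              10, 12, 11),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("row", 1:4), paste0("col", 1:3)))

# CA of the active table via SVD of the standardized residuals
P      <- N / sum(N)
r.mass <- rowSums(P)
c.mass <- colSums(P)
S      <- (P - outer(r.mass, c.mass)) / sqrt(outer(r.mass, c.mass))
dec    <- svd(S)
k      <- min(dim(N)) - 1
eig    <- dec$d[1:k]^2

col.std <- dec$v[, 1:k, drop = FALSE] / sqrt(c.mass)   # column standard coordinates
row.pri <- sweep(dec$u[, 1:k, drop = FALSE] / sqrt(r.mass),
                 2, sqrt(eig), "*")                    # active row principal coordinates

# Hypothetical helper: project a row of counts onto the existing axes
project.row <- function(counts) {
  profile <- counts / sum(counts)      # row profile
  as.vector(profile %*% col.std)       # transition formula
}

# A made-up supplementary row, not used to build the axes
sup.row   <- c(col1 = 20, col2 = 5, col3 = 15)
sup.coord <- project.row(sup.row)

# Sanity check: projecting an active row gives back its own coordinates
check <- project.row(N[1, ])

round(sup.coord, 3)
```

The sanity check is the useful part of the design: since active rows satisfy the same formula, a supplementary row is placed on the map exactly as an active row with the same profile would be, without influencing the axes.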
5.4.3 Biplot of rows and columns
fviz_ca_biplot(res.ca, repel = TRUE)
Active rows are in blue
Supplementary rows are in dark blue
Columns are in red
Supplementary columns are in dark red
It's also possible to hide supplementary rows and columns using the argument invisible:
fviz_ca_biplot(res.ca, repel = TRUE,
               invisible = c("row.sup", "col.sup"))
5.4.4 Supplementary rows
Predicted results (coordinates and cos2) for the supplementary rows:
res.ca$row.sup
## $coord
##              Dim 1 Dim 2  Dim 3  Dim 4
## comfort      0.210 0.703 0.0711  0.307
## disagreement 0.146 0.119 0.1711 -0.313
## world        0.523 0.143 0.0840 -0.106
## to_live      0.308 0.502 0.5209  0.256
##
## $cos2
##               Dim 1  Dim 2   Dim 3  Dim 4
## comfort      0.0689 0.7752 0.00793 0.1479
## disagreement 0.1313 0.0869 0.17965 0.6021
## world        0.8759 0.0654 0.02256 0.0362
## to_live      0.1390 0.3685 0.39683 0.0956
Plot of active and supplementary row points:
fviz_ca_row(res.ca, repel = TRUE)
Supplementary rows are shown in dark blue.
5.4.5 Supplementary columns
Predicted results (coordinates and cos2) for the supplementary columns:
res.ca$col.sup
## $coord
##              Dim 1   Dim 2   Dim 3   Dim 4
## thirty      0.1054 -0.0597 -0.1032  0.0698
## fifty      -0.0171  0.0491 -0.0157 -0.0131
## more_fifty -0.1771 -0.0481  0.1008 -0.0852
##
## $cos2
##             Dim 1  Dim 2   Dim 3   Dim 4
## thirty     0.1376 0.0441 0.13191 0.06028
## fifty      0.0109 0.0899 0.00919 0.00637
## more_fifty 0.2861 0.0211 0.09267 0.06620
Plot of active and supplementary column points:
fviz_ca_col(res.ca, repel = TRUE)
Supplementary columns are shown in dark red.
5.5 Filtering results
If you have many row/column variables, it's possible to visualize only some of them using the arguments select.row and select.col.
select.col, select.row: a selection of columns/rows to be drawn. Allowed values are NULL or a list containing the arguments name, cos2 or contrib:
name: is a character vector containing column/row names to be drawn
cos2: if cos2 is in [0, 1], ex: 0.6, then columns/rows with a cos2 > 0.6 are drawn
if cos2 > 1, ex: 5, then the top 5 active columns/rows and top 5 supplementary columns/rows with the highest cos2 are drawn
contrib: if contrib > 1, ex: 5, then the top 5 columns/rows with the highest contributions are drawn
# Visualize rows with cos2 >= 0.8
fviz_ca_row(res.ca, select.row = list(cos2 = 0.8))
# Top 5 active rows and 5 suppl. rows with the highest cos2
fviz_ca_row(res.ca, select.row = list(cos2 = 5))
# Select by names
name <- list(name = c("employment", "fear", "future"))
fviz_ca_row(res.ca, select.row = name)
# Top 5 contributing rows and columns
fviz_ca_biplot(res.ca, select.row = list(contrib = 5),
               select.col = list(contrib = 5)) +
  theme_minimal()
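The two filtering modes (threshold versus top-n) can be mimicked in a few lines of base R. The helper select.points() below is hypothetical and written only to illustrate the rule described above; it is not a factoextra function. The cos2 values are the two-dimensional qualities of four housetasks rows, obtained by adding the Dim 1 and Dim 2 columns of the cos2 output in section 5.3.4.3.

```r
# Hypothetical helper mimicking the selection rule described above
select.points <- function(values, x) {
  if (x > 1) {
    # x > 1: keep the x points with the highest values
    names(sort(values, decreasing = TRUE))[seq_len(min(x, length(values)))]
  } else {
    # x in [0, 1]: keep points reaching the threshold
    # (>= follows the code comment above; the prose states a strict >)
    names(values)[values >= x]
  }
}

# Two-dimensional cos2 of four housetasks rows (Dim 1 + Dim 2, section 5.3.4.3)
cos2 <- c(Laundry = 0.925, Main_meal = 0.974, Dinner = 0.931, Breakfeast = 0.905)

select.points(cos2, 0.93)   # threshold mode
select.points(cos2, 2)      # top-2 mode
```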
5.6 Outliers
If one or more "outliers" are present in the contingency table, they can dominate the interpretation of the axes (M. Bendixen 2003).
Outliers are points that have high absolute coordinate values and high contributions. They are represented, on the graph, very far from the centroid. In this case, the remaining row/column points tend to be tightly clustered in the graph, which becomes difficult to interpret.
In the CA output, the coordinates of row/column points represent the number of standard deviations the row/column is away from the barycentre (M. Bendixen 2003).
According to (M. Bendixen 2003):
Outliers are points that are at least one standard deviation away from the barycentre. They also contribute significantly to the interpretation of one pole of an axis.
There are no apparent outliers in our data. If there were outliers in the data, they should be removed or treated as supplementary points when re-running the correspondence analysis.
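A rough screening along these lines could look as follows. This is a sketch using the first-dimension coordinates and contributions of the six housetasks rows printed in section 5.3.4. Treating a contribution above the uniform share (100/13) as "high" is an assumption made here, since the text does not give a numeric cut-off.

```r
# Dim 1 coordinates and contributions (in %) of six housetasks rows (section 5.3.4)
coord   <- c(Laundry = -0.992, Main_meal = -0.876, Dinner = -0.693,
             Breakfeast = -0.509, Tidying = -0.394, Dishes = -0.189)
contrib <- c(Laundry = 18.287, Main_meal = 12.389, Dinner = 5.471,
             Breakfeast = 3.825, Tidying = 1.998, Dishes = 0.426)

# At least one standard deviation away from the barycentre
is.far <- abs(coord) >= 1

# Assumption: "high" contribution = above the uniform share for 13 rows
is.influential <- contrib > 100 / 13

outliers <- names(coord)[is.far & is.influential]
outliers
```

None of the six rows is flagged, although Laundry (coordinate -0.992, contribution 18.3%) sits very close to the one-standard-deviation limit.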
5.7 Exporting results
5.7.1 Export plots to PDF/PNG files
To save the different graphs into pdf or png files, we start by creating the plot of interest as an R object:
# Scree plot
scree.plot <- fviz_eig(res.ca)
# Biplot of row and column variables
biplot.ca <- fviz_ca_biplot(res.ca)
Next, the plots can be exported into a single pdf file as follows (one plot per page):
library(ggpubr)
ggexport(plotlist = list(scree.plot, biplot.ca),
         filename = "CA.pdf")
More options at: Chapter 4 (section: Exporting results).
5.7.2 Export results to txt/csv files
Easy to use R function: write.infile() [in FactoMineR package]:
# Export into a TXT file
write.infile(res.ca, "ca.txt", sep = "\t")
# Export into a CSV file
write.infile(res.ca, "ca.csv", sep = ";")
5.8 Summary
In conclusion, we described how to perform and interpret correspondence analysis (CA). We computed CA using the CA() function [FactoMineR package]. Next, we used the factoextra R package to produce ggplot2-based visualization of the CA results.
Other functions [packages] to compute CA in R include:
1. Using dudi.coa() [ade4]
library("ade4")
res.ca <- dudi.coa(housetasks, scannf = FALSE, nf = 5)
Read more: http://www.sthda.com/english/wiki/ca-using-ade4
2. Using ca() [ca]
library(ca)
res.ca <- ca(housetasks)
Read more: http://www.sthda.com/english/wiki/ca-using-ca-package
3. Using corresp() [MASS]
library(MASS)
res.ca <- corresp(housetasks, nf = 3)
Read more: http://www.sthda.com/english/wiki/ca-using-mass
4. Using epCA() [ExPosition]
library("ExPosition")
res.ca <- epCA(housetasks, graph = FALSE)
No matter which of the functions in the list above you decide to use, the factoextra package can handle the output.
fviz_eig(res.ca) # Scree plot
fviz_ca_biplot(res.ca) # Biplot of rows and columns
5.9 Further reading
For the mathematical background behind CA, refer to the following video courses, articles and books:
Correspondence Analysis Course Using FactoMineR (Video courses). https://goo.gl/Hhh6hC
Exploratory Multivariate Analysis by Example Using R (book) (F. Husson, Le, and Pagès 2017).
Principal component analysis (article). (Abdi and Williams 2010). https://goo.gl/1Vtwq1.
Correspondence analysis basics (blog post). https://goo.gl/Xyk8KT.
Understanding the Math of Correspondence Analysis with Examples in R (blog post). https://goo.gl/H9hxf9
6 Multiple Correspondence Analysis
6.1 Introduction
Multiple correspondence analysis (MCA) is an extension of the simple correspondence analysis (chapter 5) for summarizing and visualizing a data table containing more than two categorical variables. It can also be seen as a generalization of principal component analysis when the variables to be analyzed are categorical instead of quantitative (Abdi and Williams 2010).
MCA is generally used to analyse a data set from a survey. The goal is to identify:
A group of individuals with similar profiles in their answers to the questions
The associations between variable categories
Previously, we described how to compute and interpret the simple correspondence analysis (chapter 5). In the current chapter, we demonstrate how to compute and visualize multiple correspondence analysis in R software using FactoMineR (for the analysis) and factoextra (for data visualization). Additionally, we'll show how to reveal the variables that contribute the most in explaining the variations in the data set. We continue by explaining how to predict the results for supplementary individuals and variables. Finally, we'll demonstrate how to filter MCA results in order to keep only the most contributing variables.
6.2 Computation
6.2.1 R packages
Several functions from different packages are available in the R software for computing multiple correspondence analysis. These functions/packages include:
MCA() function [FactoMineR package]
dudi.mca() function [ade4 package]
and epMCA() [ExPosition package]
No matter what function you decide to use, you can easily extract and visualize the MCA results using R functions provided in the factoextra R package.
Here, we'll use FactoMineR (for the analysis) and factoextra (for ggplot2-based elegant visualization). To install the two packages, type this:
install.packages(c("FactoMineR", "factoextra"))
Load the packages:
library("FactoMineR")
library("factoextra")
6.2.2 Data format
We'll use the demo data set poison available in the FactoMineR package:
data(poison)
head(poison[, 1:7], 3)
##   Age Time   Sick Sex   Nausea Vomiting Abdominals
## 1   9   22 Sick_y   F Nausea_y  Vomit_n     Abdo_y
## 2   5    0 Sick_n   F Nausea_n  Vomit_n     Abdo_n
## 3   6   16 Sick_y   F Nausea_n  Vomit_y     Abdo_y
This data is the result of a survey carried out on primary school children who suffered from food poisoning. They were asked about their symptoms and about what they ate.
The data contains 55 rows (individuals) and 15 columns (variables). We'll use only some of these individuals (children) and variables to perform the multiple correspondence analysis. The coordinates of the remaining individuals and variables on the factor map will be predicted from the previous MCA results.
In MCA terminology, our data contains:
Active individuals (rows 1:55): Individuals that are used in the multiple correspondence analysis.
Active variables (columns 5:15): Variables that are used in the MCA.
Supplementary variables: They don't participate in the MCA. The coordinates of these variables will be predicted.
Supplementary quantitative variables (quanti.sup): Columns 1 and 2 corresponding to the columns Age and Time, respectively.
Supplementary qualitative variables (quali.sup): Columns 3 and 4 corresponding to the columns Sick and Sex, respectively. These factor variables will be used to color individuals by groups.
Subset only active individuals and variables for multiple correspondence analysis:
poison.active <- poison[1:55, 5:15]
head(poison.active[, 1:6], 3)
##     Nausea Vomiting Abdominals   Fever   Diarrhae   Potato
## 1 Nausea_y  Vomit_n     Abdo_y Fever_y Diarrhea_y Potato_y
## 2 Nausea_n  Vomit_n     Abdo_n Fever_n Diarrhea_n Potato_y
## 3 Nausea_n  Vomit_y     Abdo_y Fever_y Diarrhea_y Potato_y
6.2.3 Data summary
The R base function summary() can be used to compute the frequency of variable categories. As the data table contains a large number of variables, we'll display only the results for the first 4 variables.
Statistical summaries:
# Summary of the 4 first variables
summary(poison.active)[, 1:4]
##       Nausea     Vomiting   Abdominals      Fever
##  Nausea_n:43  Vomit_n:33   Abdo_n:18   Fever_n:20
##  Nausea_y:12  Vomit_y:22   Abdo_y:37   Fever_y:35
The summary() function returns the size of each variable category.
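Category counts can also be turned into proportions to spot rare levels, since categories with a very low frequency can distort the analysis. The sketch below uses a small made-up data frame: the Nausea counts mirror the summary above, while the column Rare and the 5% cut-off (min.share) are hypothetical and chosen only for illustration.

```r
# Hypothetical survey answers: 55 individuals, two categorical variables
answers <- data.frame(
  Nausea = factor(c(rep("Nausea_n", 43), rep("Nausea_y", 12))),
  Rare   = factor(c(rep("Item_n", 54), "Item_y"))
)

# Proportion of individuals in each category, variable by variable
freq <- lapply(answers, function(x) prop.table(table(x)))

# Assumption: flag categories observed in fewer than 5% of individuals
min.share <- 0.05
rare <- lapply(freq, function(p) names(p)[p < min.share])

rare
```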
It's also possible to plot the frequency of variable categories. The R code below plots the first 4 columns:
for (i in 1:4) {
  plot(poison.active[, i], main = colnames(poison.active)[i],
       ylab = "Count", col = "steelblue", las = 2)
}
The graphs above can be used to identify variable categories with a very low frequency. These types of variables can distort the analysis and should be removed.
6.2.4 R code
The function MCA() [FactoMineR package] can be used. A simplified format is:
MCA(X, ncp = 5, graph = TRUE)
X: a data frame with n rows (individuals) and p columns (categorical variables)
ncp: number of dimensions kept in the final results.
graph: a logical value. If TRUE a graph is displayed.
In the R code below, the MCA is performed only on the active individuals/variables:
res.mca <- MCA(poison.active, graph = FALSE)
The output of the MCA() function is a list including:
print(res.mca)
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 55 individuals, described by 11 variables
## *The results are available in the following objects:
##
##    name              description
## 1  "$eig"            "eigenvalues"
## 2  "$var"            "results for the variables"
## 3  "$var$coord"      "coord. of the categories"
## 4  "$var$cos2"       "cos2 for the categories"
## 5  "$var$contrib"    "contributions of the categories"
## 6  "$var$v.test"     "v-test for the categories"
## 7  "$ind"            "results for the individuals"
## 8  "$ind$coord"      "coord. for the individuals"
## 9  "$ind$cos2"       "cos2 for the individuals"
## 10 "$ind$contrib"    "contributions of the individuals"
## 11 "$call"           "intermediate results"
## 12 "$call$marge.col" "weights of columns"
## 13 "$call$marge.li"  "weights of rows"
The object created by the function MCA() contains a lot of information, stored in several lists and matrices. These values are described in the next section.
6.3 Visualization and interpretation
We'll use the factoextra R package to help in the interpretation and the visualization of the multiple correspondence analysis. No matter what function you decide to use [FactoMineR::MCA(), ade4::dudi.mca()], you can easily extract and visualize the results of multiple correspondence analysis using R functions provided in the factoextra R package.
These factoextra functions include:
get_eigenvalue(res.mca): Extract the eigenvalues/variances retained by each dimension (axis)
fviz_eig(res.mca): Visualize the eigenvalues/variances
get_mca_ind(res.mca), get_mca_var(res.mca): Extract the results for individuals and variables, respectively.
fviz_mca_ind(res.mca), fviz_mca_var(res.mca): Visualize the results for individuals and variables, respectively.
fviz_mca_biplot(res.mca): Make a biplot of rows and columns.
In the next sections, we'll illustrate each of these functions.
6.3.1 Eigenvalues / Variances
The proportion of variances retained by the different dimensions (axes) can be extracted using the function get_eigenvalue() [factoextra package] as follows:
library("factoextra")
eig.val <- get_eigenvalue(res.mca)
# head(eig.val)
To visualize the percentages of inertia explained by each MCA dimension, use the function fviz_eig() or fviz_screeplot() [factoextra package]:
fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 45))
Note that the MCA results are interpreted in the same way as the results from a simple correspondence analysis (CA). Therefore, it's strongly recommended to read the interpretation of simple CA, which has been comprehensively described in Chapter 5.
6.3.2 Biplot
The function fviz_mca_biplot() [factoextra package] is used to draw the biplot of individuals and variable categories:
fviz_mca_biplot(res.mca,
                repel = TRUE, # Avoid text overlapping (slow if many points)
                ggtheme = theme_minimal())
The plot above shows a global pattern within the data. Rows (individuals) are represented by blue points and columns (variable categories) by red triangles.
The distance between any row points or column points gives a measure of their similarity (or dissimilarity). Row points with similar profiles are close to each other on the factor map. The same holds true for column points.
6.3.3 Graph of variables
6.3.3.1 Results
The function get_mca_var() [in factoextra] is used to extract the results for variable categories. This function returns a list containing the coordinates, the cos2 and the contributions of variable categories:
var <- get_mca_var(res.mca)
var
## Multiple Correspondence Analysis Results for variables
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for categories"
## 2 "$cos2"    "Cos2 for categories"
## 3 "$contrib" "contributions of categories"
The components returned by get_mca_var() can be used in the plot of variable categories as follows:
var$coord: coordinates of variable categories, used to create a scatter plot.
var$cos2: represents the quality of the representation of variable categories on the factor map.
var$contrib: contains the contributions (in percentage) of the variable categories to the definition of the dimensions.
The different components can be accessed as follows:
# Coordinates
head(var$coord)
# Cos2: quality on the factor map
head(var$cos2)
# Contributions to the principal components
head(var$contrib)
In this section, we'll describe how to visualize variable categories only. Next, we'll highlight variable categories according to either i) their quality of representation on the factor map or ii) their contributions to the dimensions.
6.3.3.2 Correlation between variables and principal dimensions
Note that it's possible to plot variable categories and to color them according to either i) their quality on the factor map (cos2) or ii) their contribution values to the definition of dimensions (contrib).
To visualize the correlation between variables and MCA principal dimensions, type this:
fviz_mca_var(res.mca, choice = "mca.cor",
             repel = TRUE, # Avoid text overlapping (slow)
             ggtheme = theme_minimal())
The plot above helps to identify the variables that are the most correlated with each dimension. The squared correlations between variables and the dimensions are used as coordinates.
It can be seen that the variables Diarrhae, Abdominals and Fever are the most correlated with dimension 1. Similarly, the variables Courgette and Potato are the most correlated with dimension 2.
6.3.3.3 Coordinates of variable categories
The R code below displays the coordinates of the variable categories on each dimension:
head(round(var$coord, 2), 4)
##          Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
## Nausea_n  0.27  0.12 -0.27  0.03  0.07
## Nausea_y -0.96 -0.43  0.95 -0.12 -0.26
## Vomit_n   0.48 -0.41  0.08  0.27  0.05
## Vomit_y  -0.72  0.61 -0.13 -0.41 -0.08
Use the function fviz_mca_var() [in factoextra] to visualize only variable categories:
fviz_mca_var(res.mca,
             repel = TRUE, # Avoid text overlapping (slow)
             ggtheme = theme_minimal())
It's possible to change the color and the shape of the variable points using the arguments col.var and shape.var as follows:
fviz_mca_var(res.mca, col.var = "black", shape.var = 15,
             repel = TRUE)
The plot above shows the relationships between variable categories. It can be interpreted as follows:
Variable categories with a similar profile are grouped together.
Negatively correlated variable categories are positioned on opposite sides of the plot origin (opposed quadrants).
The distance between category points and the origin measures the quality of the variable category on the factor map. Category points that are away from the origin are well represented on the factor map.
6.3.3.4 Quality of representation of variable categories
The first two dimensions are sufficient to retain 46% of the total inertia (variation) contained in the data. Not all the points are equally well displayed in these two dimensions.
The quality of the representation is called the squared cosine (cos2), which measures the degree of association between variable categories and a particular axis. The cos2 of variable categories can be extracted as follows:
head(var$cos2, 4)
##          Dim 1  Dim 2  Dim 3   Dim 4   Dim 5
## Nausea_n 0.256 0.0528 0.2527 0.00408 0.01947
## Nausea_y 0.256 0.0528 0.2527 0.00408 0.01947
## Vomit_n  0.344 0.2512 0.0107 0.11229 0.00413
## Vomit_y  0.344 0.2512 0.0107 0.11229 0.00413
If a variable category is well represented by two dimensions, the sum of its cos2 over these dimensions is close to one. For some of the variable categories, more than 2 dimensions are required to perfectly represent the data.
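As a small illustration of the statement above, the sketch below sums the cos2 values over dimensions 1 and 2 for the four categories printed in the output. The numbers are copied by hand from that output (first three dimensions only); with a real analysis you would start from var$cos2 instead.

```r
# cos2 values copied from the output above (first three dimensions only)
cos2 <- matrix(
  c(0.256, 0.0528, 0.2527,
    0.256, 0.0528, 0.2527,
    0.344, 0.2512, 0.0107,
    0.344, 0.2512, 0.0107),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("Nausea_n", "Nausea_y", "Vomit_n", "Vomit_y"),
                  c("Dim 1", "Dim 2", "Dim 3"))
)

# Quality of representation on the plane formed by dimensions 1 and 2
quality_12 <- rowSums(cos2[, 1:2])
round(quality_12, 3)
```

Neither sum is close to one, which suggests that these categories are only partly described by the first factor plane.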
It's possible to color variable categories by their cos2 values using the argument col.var = "cos2". This produces a color gradient, which can be customized using the argument gradient.cols. For instance, gradient.cols = c("white", "blue", "red") means that:
variable categories with low cos2 values will be colored in "white"
variable categories with mid cos2 values will be colored in "blue"
variable categories with high cos2 values will be colored in "red"
# Color by cos2 values: quality on the factor map
fviz_mca_var(res.mca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE, # Avoid text overlapping
             ggtheme = theme_minimal())
Note that it's also possible to change the transparency of the variable categories according to their cos2 values using the option alpha.var = "cos2". For example, type this:
# Change the transparency by cos2 values
fviz_mca_var(res.mca, alpha.var = "cos2",
             repel = TRUE,
             ggtheme = theme_minimal())
You can visualize the cos2 of variable categories on all the dimensions using the corrplot package:
library("corrplot")
corrplot(var$cos2, is.corr = FALSE)
It's also possible to create a bar plot of the cos2 of variable categories using the function fviz_cos2() [in factoextra]:
# Cos2 of variable categories on Dim.1 and Dim.2
fviz_cos2(res.mca, choice = "var", axes = 1:2)
Note that the variable categories Fish_n, Fish_y, Icecream_n and Icecream_y are not very well represented by the first two dimensions. This implies that the position of the corresponding points on the scatter plot should be interpreted with some caution. A higher dimensional solution is probably necessary.
6.3.3.5 Contribution of variable categories to the dimensions
The contribution of the variable categories (in %) to the definition of the dimensions can be extracted as follows:
head(round(var$contrib, 2), 4)
##          Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
## Nausea_n  1.52  0.81  4.67  0.08  0.49
## Nausea_y  5.43  2.91 16.73  0.30  1.76
## Vomit_n   3.73  7.07  0.36  4.26  0.19
## Vomit_y   5.60 10.61  0.54  6.39  0.29
The variable categories with the largest values contribute the most to the definition of the dimensions. Variable categories that contribute the most to Dim.1 and Dim.2 are the most important in explaining the variability in the data set.
The function fviz_contrib() [factoextra package] can be used to draw a bar plot of the contribution of variable categories. The R code below shows the top 15 variable categories contributing to the dimensions:
# Contributions of variable categories to dimension 1
fviz_contrib(res.mca, choice = "var", axes = 1, top = 15)
# Contributions of variable categories to dimension 2
fviz_contrib(res.mca, choice = "var", axes = 2, top = 15)
The total contributions to dimensions 1 and 2 are obtained as follows:
# Total contribution to dimensions 1 and 2
fviz_contrib(res.mca, choice = "var", axes = 1:2, top = 15)
The red dashed line on the graph above indicates the expected average value if the contributions were uniform. The calculation of the expected contribution value, under the null hypothesis, is detailed in the principal component analysis chapter.
It can be seen that:
the categories Abdo_n, Diarrhea_n, Fever_n and Mayo_n are the most important in the definition of the first dimension.
the categories Courg_n, Potato_n, Vomit_y and Icecream_n contribute the most to dimension 2.
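The expected average contribution mentioned above can be sketched in a few lines of base R. If every category contributed equally, each would contribute 100 divided by the number of categories. The sketch assumes 22 active categories (11 binary variables), a count that is not stated in this section, and uses only the four Dim 1 contributions printed earlier, so it illustrates the comparison rather than reproducing the full bar plot.

```r
# Dim 1 contributions (in %) copied from the output above
contrib_dim1 <- c(Nausea_n = 1.52, Nausea_y = 5.43,
                  Vomit_n = 3.73, Vomit_y = 5.60)

# ASSUMPTION: 22 active categories in total (11 binary variables)
n_categories <- 22

# Expected contribution (in %) if all categories contributed equally
expected <- 100 / n_categories

# Categories lying above the reference line
above_average <- names(contrib_dim1)[contrib_dim1 > expected]
round(expected, 2)
above_average
```

Categories above this threshold can be regarded as important contributors to the dimension.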
The most important (or, contributing) variable categories can be highlighted on the scatter plot as follows:
fviz_mca_var(res.mca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE, # Avoid text overlapping (slow)
             ggtheme = theme_minimal()
             )
Note that it's also possible to control the transparency of variable categories according to their contribution values using the option alpha.var = "contrib". For example, type this:
# Change the transparency by contrib values
fviz_mca_var(res.mca, alpha.var = "contrib",
             repel = TRUE,
             ggtheme = theme_minimal())
The plots above give an idea of which pole of the dimensions the categories are actually contributing to.
It is evident that the categories Abdo_n, Diarrhea_n, Fever_n and Mayo_n have an important contribution to the positive pole of the first dimension, while the categories Fever_y and Diarrhea_y have a major contribution to the negative pole of the first dimension, and so on.
6.3.4 Graph of individuals
6.3.4.1 Results
The function get_mca_ind() [in factoextra] is used to extract the results for individuals. This function returns a list containing the coordinates, the cos2 and the contributions of individuals:
ind <- get_mca_ind(res.mca)
ind
## Multiple Correspondence Analysis Results for individuals
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the individuals"
## 2 "$cos2"    "Cos2 for the individuals"
## 3 "$contrib" "contributions of the individuals"
To get access to the different components, use this:
# Coordinates of individuals
head(ind$coord)
# Quality of representation
head(ind$cos2)
# Contributions
head(ind$contrib)
6.3.4.2 Plots: quality and contribution
The function fviz_mca_ind() [in factoextra] is used to visualize only individuals. Like variable categories, it's also possible to color individuals by their cos2 values:
fviz_mca_ind(res.mca, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE, # Avoid text overlapping (slow if many points)
             ggtheme = theme_minimal())
The results for individuals are interpreted in the same way as those for variable categories. For this reason, we only display the results for individuals in this section, without further comment.
The R code below creates bar plots of the cos2 and the contributions of individuals:
# Cos2 of individuals
fviz_cos2(res.mca, choice = "ind", axes = 1:2, top = 20)
# Contribution of individuals to the dimensions
fviz_contrib(res.mca, choice = "ind", axes = 1:2, top = 20)
6.3.5 Color individuals by groups
The R code below colors the individuals by groups using the levels of the variable Vomiting. The argument habillage is used to specify the factor variable for coloring the individuals by groups. A concentration ellipse can also be added around each group using the argument addEllipses = TRUE. If you want a confidence ellipse around the mean point of categories, use ellipse.type = "confidence". The argument palette is used to change group colors.
fviz_mca_ind(res.mca,
             label = "none", # hide individual labels
             habillage = "Vomiting", # color by groups
             palette = c("#00AFBB", "#E7B800"),
             addEllipses = TRUE, ellipse.type = "confidence",
             ggtheme = theme_minimal())
Note that it's possible to color the individuals using any of the qualitative variables in the initial data table (poison).
Note that, to specify the value of the argument habillage, it's also possible to use the index of the column (habillage = 2). Additionally, you can provide an external grouping variable as follows: habillage = poison$Vomiting. For example:
# habillage = index of the column to be used as grouping variable
fviz_mca_ind(res.mca, habillage = 2, addEllipses = TRUE)
# habillage = external grouping variable
fviz_mca_ind(res.mca, habillage = poison$Vomiting, addEllipses = TRUE)
If you want to color individuals using multiple categorical variables at the same time, use the function fviz_ellipses() [in factoextra] as follows:
fviz_ellipses(res.mca, c("Vomiting", "Fever"),
              geom = "point")
Alternatively, you can specify categorical variable indices:
fviz_ellipses(res.mca, 1:4, geom = "point")
6.3.6 Dimension description
The function dimdesc() [in FactoMineR] can be used to identify the variables that are the most correlated with a given dimension:
res.desc <- dimdesc(res.mca, axes = c(1, 2))
# Description of dimension 1
res.desc[[1]]
# Description of dimension 2
res.desc[[2]]
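To give an intuition for what "correlated with a dimension" means for a qualitative variable, the sketch below computes a correlation ratio (the R-squared of a one-way analysis of variance) between a factor and the coordinates of individuals on one dimension. The data are made up for illustration, and treating this quantity as the link measure is an assumption about the general idea rather than a description of the dimdesc() internals.

```r
# Hypothetical coordinates of 8 individuals on one dimension
coord <- c(-1.2, -0.8, -1.0, 0.9, 1.1, 0.7, 0.2, -0.1)
# Hypothetical qualitative variable observed on the same individuals
fever <- factor(c("y", "y", "y", "n", "n", "n", "n", "y"))

# R-squared of the one-way ANOVA: coordinate explained by the factor
fit <- lm(coord ~ fever)
r2 <- summary(fit)$r.squared

# Same quantity by hand: between-group sum of squares / total sum of squares
group_means <- tapply(coord, fever, mean)
group_sizes <- as.vector(table(fever))
between <- sum(group_sizes * (as.vector(group_means) - mean(coord))^2)
total <- sum((coord - mean(coord))^2)
eta2 <- between / total

c(r2 = r2, eta2 = eta2)
```

A value close to 1 means that the categories of the variable separate the individuals well along that dimension.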
6.4 Supplementary elements
6.4.1 Definition and types
As described above (section 6.2.2), the data set poison contains:
supplementary continuous variables (quanti.sup = 1:2, columns 1 and 2 corresponding to the columns age and time, respectively)
supplementary qualitative variables (quali.sup = 3:4, corresponding to the columns Sick and Sex, respectively). These factor variables are used to color individuals by groups.
The data doesn't contain supplementary individuals. However, for demonstration, we'll use the individuals 53:55 as supplementary individuals.
6.4.2 Specification in MCA
To specify supplementary individuals and variables, the function MCA() can be used as follows:
MCA(X, ind.sup = NULL, quanti.sup = NULL, quali.sup = NULL,
    graph = TRUE, axes = c(1, 2))
X: a data frame. Rows are individuals and columns are variables.
ind.sup: a numeric vector specifying the indexes of the supplementary individuals.
quanti.sup, quali.sup: numeric vectors specifying, respectively, the indexes of the supplementary quantitative and qualitative variables.
graph: a logical value. If TRUE a graph is displayed.
axes: a vector of length 2 specifying the components to be plotted.
For example, type this:
res.mca <- MCA(poison, ind.sup = 53:55,
               quanti.sup = 1:2, quali.sup = 3:4, graph = FALSE)
6.4.3 Results
Supplementary variables and individuals are not used for the determination of the principal dimensions. Their coordinates are predicted using only the information provided by the multiple correspondence analysis performed on the active variables/individuals.
The predicted results for supplementary individuals/variables can be extracted as follows:
# Supplementary qualitative variable categories
res.mca$quali.sup
# Supplementary quantitative variables
res.mca$quanti.sup
# Supplementary individuals
res.mca$ind.sup
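The idea that supplementary individuals are positioned using only the active analysis can be sketched in base R. The code below runs a small correspondence analysis of an indicator matrix by hand (one standard way of writing MCA) and then places a new individual at the rescaled average of the coordinates of its categories. The toy data, the object names and the helper predict_sup() are all hypothetical, and this is a sketch of the general principle, not the FactoMineR implementation.

```r
# Toy data: 6 active individuals, 2 categorical variables (hypothetical values)
toy <- data.frame(
  Fever = factor(c("y", "y", "n", "n", "y", "n")),
  Vomit = factor(c("y", "n", "n", "n", "y", "y"))
)
J <- ncol(toy)  # number of variables

# Indicator (complete disjunctive) matrix: one column per category
Z <- cbind(model.matrix(~ Fever - 1, toy), model.matrix(~ Vomit - 1, toy))

# Correspondence analysis of the indicator matrix
P <- Z / sum(Z)
r <- rowSums(P)   # row masses
cm <- colSums(P)  # column masses
S <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
sv <- svd(S)
keep <- sv$d > 1e-8  # drop the trivial (null) dimensions
d <- sv$d[keep]      # square roots of the eigenvalues

# Principal coordinates of active individuals and of categories
ind_coord <- diag(1 / sqrt(r)) %*% sv$u[, keep] %*% diag(d)
cat_coord <- diag(1 / sqrt(cm)) %*% sv$v[, keep] %*% diag(d)

# Hypothetical helper: average of the category coordinates of the individual,
# divided by the square root of each eigenvalue
predict_sup <- function(z_new) {
  as.vector((z_new / J) %*% cat_coord %*% diag(1 / d))
}

# A new individual with Fever = y and Vomit = n
new_ind <- c(Fevern = 0, Fevery = 1, Vomitn = 1, Vomity = 0)
predict_sup(new_ind)
```

Applied to an active individual, the same formula returns its own coordinates, which is a convenient sanity check of the sketch.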
6.4.4 Plots
To make a biplot of individuals and variable categories, type this:
# Biplot of individuals and variable categories
fviz_mca_biplot(res.mca, repel = TRUE,
                ggtheme = theme_minimal())
Active individuals are in blue
Supplementary individuals are in dark blue
Active variable categories are in red
Supplementary variable categories are in dark green
If you want to highlight the correlation between variables (active & supplementary) and dimensions, use the function fviz_mca_var() with the argument choice = "mca.cor":
fviz_mca_var(res.mca, choice = "mca.cor",
             repel = TRUE)
The R code below plots qualitative variable categories (active & supplementary variables):
fviz_mca_var(res.mca, repel = TRUE,
             ggtheme = theme_minimal())
For supplementary quantitative variables, type this:
fviz_mca_var(res.mca, choice = "quanti.sup",
             ggtheme = theme_minimal())
To visualize supplementary individuals, type this:
fviz_mca_ind(res.mca,
             label = "ind.sup", # Show the label of ind.sup only
             ggtheme = theme_minimal())
6.5 Filtering results
If you have many individuals/variable categories, it's possible to visualize only some of them using the arguments select.ind and select.var.
select.ind, select.var: a selection of individuals/variable categories to be drawn. Allowed values are NULL or a list containing the arguments name, cos2 or contrib:
name: a character vector containing the names of the individuals/variable categories to be plotted.
cos2: if cos2 is in [0, 1], e.g. 0.6, then individuals/variable categories with a cos2 > 0.6 are plotted. If cos2 > 1, e.g. 5, then the top 5 active individuals/variable categories and the top 5 supplementary columns/rows with the highest cos2 are plotted.
contrib: if contrib > 1, e.g. 5, then the top 5 individuals/variable categories with the highest contributions are plotted.
# Visualize variable categories with cos2 >= 0.4
fviz_mca_var(res.mca, select.var = list(cos2 = 0.4))
# Top 10 active variable categories with the highest cos2
fviz_mca_var(res.mca, select.var = list(cos2 = 10))
# Select by names
name <- list(name = c("Fever_n", "Abdo_y", "Diarrhea_n",
                      "Fever_y", "Vomit_y", "Vomit_n"))
fviz_mca_var(res.mca, select.var = name)
# Top 5 contributing individuals and variable categories
fviz_mca_biplot(res.mca, select.ind = list(contrib = 5),
                select.var = list(contrib = 5),
                ggtheme = theme_minimal())
When the selection is done according to the contribution values, supplementary individuals/variable categories are not shown because they don't contribute to the construction of the axes.
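The selection rule described above (a threshold when the value lies in [0, 1], a top-n ranking when it is greater than 1) can be mimicked in base R. The helper select_by_cos2() below is hypothetical and is not the factoextra source code. The cos2 values are the sums over dimensions 1 and 2 of the four categories printed in section 6.3.3.4, and the use of >= for the threshold is an assumption.

```r
# Hypothetical helper mimicking the rule described above
select_by_cos2 <- function(cos2, value) {
  if (value > 1) {
    # value > 1: keep the 'value' items with the highest cos2
    ranked <- names(sort(cos2, decreasing = TRUE))
    ranked[seq_len(min(value, length(cos2)))]
  } else {
    # value in [0, 1]: keep the items reaching the threshold (assumed >=)
    names(cos2)[cos2 >= value]
  }
}

# cos2 summed over dimensions 1 and 2 (values taken from section 6.3.3.4)
cos2_12 <- c(Nausea_n = 0.3088, Nausea_y = 0.3088,
             Vomit_n = 0.5952, Vomit_y = 0.5952)

select_by_cos2(cos2_12, 0.4)  # threshold selection
select_by_cos2(cos2_12, 2)    # top-2 selection
```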
6.6 Exporting results
6.6.1 Export plots to PDF/PNG files
Two steps:
1. Create the plot of interest as an R object:
# Scree plot
scree.plot <- fviz_eig(res.mca)
# Biplot of row and column variables
biplot.mca <- fviz_mca_biplot(res.mca)
2. Export the plots into a single pdf file as follows (one plot per page):
library(ggpubr)
ggexport(plotlist = list(scree.plot, biplot.mca),
         filename = "MCA.pdf")
More options at: Chapter 4 (section: Exporting results).
6.6.2 Export results to txt/csv files
An easy-to-use R function is write.infile() [in the FactoMineR package]:
# Export into a TXT file
write.infile(res.mca, "mca.txt", sep = "\t")
# Export into a CSV file
write.infile(res.mca, "mca.csv", sep = ";")
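If only one table is needed rather than the whole result object, base R can write it directly. The sketch below is an alternative and not part of the workflow above: it exports a small table of category coordinates to a CSV file in a temporary directory and reads it back. The values are copied from the coordinate output in section 6.3.3.3; with a real analysis the data frame would be built from var$coord.

```r
# Coordinates of four categories on the first two dimensions
# (values copied from the output in section 6.3.3.3)
coord <- data.frame(
  Dim.1 = c(0.27, -0.96, 0.48, -0.72),
  Dim.2 = c(0.12, -0.43, -0.41, 0.61),
  row.names = c("Nausea_n", "Nausea_y", "Vomit_n", "Vomit_y")
)

# Write to a temporary file, then read it back to check the round trip
out_file <- file.path(tempdir(), "mca_var_coord.csv")
write.csv(coord, out_file)
check <- read.csv(out_file, row.names = 1)
check
```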
6.7 Summary
In conclusion, we described how to perform and interpret multiple correspondence analysis (MCA). We computed MCA using the MCA() function [FactoMineR package]. Next, we used the factoextra R package to produce ggplot2-based visualizations of the MCA results.
Other functions [packages] to compute MCA in R include:
1. Using dudi.acm() [ade4]
library("ade4")
res.mca <- dudi.acm(poison.active, scannf = FALSE, nf = 5)
2. Using epMCA() [ExPosition]
library("ExPosition")
res.mca <- epMCA(poison.active, graph = FALSE, correction = "bg")
No matter which of the functions listed above you decide to use, the factoextra package can handle the output.
fviz_eig(res.mca) # Scree plot
fviz_mca_biplot(res.mca) # Biplot of rows and columns
6.8 Further reading
For the mathematical background behind MCA, refer to the following video courses, articles and books:
Correspondence Analysis Course Using FactoMineR (Video courses). https://goo.gl/Hhh6hC
Exploratory Multivariate Analysis by Example Using R (book) (F. Husson, Le, and Pagès 2017).
Principal component analysis (article) (Abdi and Williams 2010). https://goo.gl/1Vtwq1.
Correspondence analysis basics (blog post). https://goo.gl/Xyk8KT.
7 Factor Analysis of Mixed Data
7.1 Introduction
Factor analysis of mixed data (FAMD) is a principal component method dedicated to analyzing a data set containing both quantitative and qualitative variables (J. Pagès 2004). It makes it possible to analyze the similarity between individuals by taking into account mixed types of variables. Additionally, one can explore the association between all variables, both quantitative and qualitative.
Roughly speaking, the FAMD algorithm can be seen as a mix of principal component analysis (PCA) (Chapter 4) and multiple correspondence analysis (MCA) (Chapter 6). In other words, it acts as PCA for quantitative variables and as MCA for qualitative variables.
Quantitative and qualitative variables are normalized during the analysis in order to balance the influence of each set of variables.
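One common way to write the normalization described above is sketched below in base R, using the built-in iris data as a stand-in (4 quantitative variables and 1 qualitative variable with 3 categories). Quantitative variables are standardized, indicator columns of the qualitative variable are divided by the square root of the category proportion and centered, and a PCA is run on the combined table. This coding is an assumption about the usual formulation of FAMD, not a copy of the FactoMineR code.

```r
n <- nrow(iris)

# Standardize with the population standard deviation (denominator n)
std_pop <- function(x) {
  x <- x - mean(x)
  x / sqrt(mean(x^2))
}
X_quanti <- apply(iris[, 1:4], 2, std_pop)

# Indicator matrix of the qualitative variable, one column per category
Z <- model.matrix(~ Species - 1, data = iris)
p <- colMeans(Z)  # proportion of individuals in each category

# Divide each indicator column by sqrt(proportion), then center it
X_quali <- sweep(Z, 2, sqrt(p), "/")
X_quali <- sweep(X_quali, 2, colMeans(X_quali), "-")

# PCA of the combined table through a singular value decomposition
X <- cbind(X_quanti, X_quali)
eig <- svd(X / sqrt(n))$d^2

# Total inertia: 4 quantitative variables + (3 categories - 1) = 6
total_inertia <- sum(eig)
total_inertia
```

With this coding each quantitative variable brings an inertia of 1 and a qualitative variable brings its number of categories minus 1, which is how the two types of variables are put on a comparable footing.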
In the current chapter, we demonstrate how to compute and visualize factor analysis of mixed data using the FactoMineR (for the analysis) and factoextra (for data visualization) R packages.
7.2 Computation
7.2.1 R packages
Install the required packages as follows:
install.packages(c("FactoMineR", "factoextra"))
Load the packages:
library("FactoMineR")
library("factoextra")
7.2.2 Data format
We'll use a subset of the wine data set available in the FactoMineR package:
library("FactoMineR")
data(wine)
df <- wine[, c(1, 2, 16, 22, 29, 28, 30, 31)]
head(df[, 1:7], 4)
##      Label      Soil Plante Acidity Harmony Intensity Overall.quality
## 2EL  Saumur     Env1   2.00    2.11    3.14      2.86            3.39
## 1CHA Saumur     Env1   2.00    2.11    2.96      2.89            3.21
## 1FON Bourgueuil Env1   1.75    2.18    3.14      3.07            3.54
## 1VAU Chinon     Env2   2.30    3.18    2.04      2.46            2.46
To see the structure of the data, type this:
str(df)
The data contains 21 rows (wines, individuals) and 8 columns (variables):
The first two columns are factors (categorical variables): label (Saumur, Bourgueil or Chinon) and soil (Reference, Env1, Env2 or Env4).
The remaining columns are numeric (continuous variables).
The goal of this study is to analyze the characteristics of the wines.
7.2.3 R code
The function FAMD() [FactoMineR package] can be used to compute FAMD. A simplified format is:
FAMD(base, ncp = 5, sup.var = NULL, ind.sup = NULL, graph = TRUE)
base: a data frame with n rows (individuals) and p columns (variables).
ncp: the number of dimensions kept in the results (by default 5).
sup.var: a vector indicating the indexes of the supplementary variables.
ind.sup: a vector indicating the indexes of the supplementary individuals.
graph: a logical value. If TRUE a graph is displayed.
To compute FAMD, type this:
library(FactoMineR)
res.famd <- FAMD(df, graph = FALSE)
The output of the FAMD() function is a list including:
print(res.famd)
## *The results are available in the following objects:
##
##   name          description
## 1 "$eig"        "eigenvalues and inertia"
## 2 "$var"        "Results for the variables"
## 3 "$ind"        "results for the individuals"
## 4 "$quali.var"  "Results for the qualitative variables"
## 5 "$quanti.var" "Results for the quantitative variables"
7.3 Visualization and interpretation
We'll use the following factoextra functions:
get_eigenvalue(res.famd): Extract the eigenvalues/variances retained by each dimension (axis).
fviz_eig(res.famd): Visualize the eigenvalues/variances.
get_famd_ind(res.famd): Extract the results for individuals.
get_famd_var(res.famd): Extract the results for quantitative and qualitative variables.
fviz_famd_ind(res.famd), fviz_famd_var(res.famd): Visualize the results for individuals and variables, respectively.
In the next sections, we'll illustrate each of these functions.
7.3.1 Eigenvalues / Variances
The proportion of variance retained by the different dimensions (axes) can be extracted using the function get_eigenvalue() [factoextra package] as follows:
library("factoextra")
eig.val <- get_eigenvalue(res.famd)
head(eig.val)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1      4.832            43.92                        43.9
## Dim.2      1.857            16.88                        60.8
## Dim.3      1.582            14.39                        75.2
## Dim.4      1.149            10.45                        85.6
## Dim.5      0.652             5.93                        91.6
The function fviz_eig() or fviz_screeplot() [factoextra package] can be used to draw the scree plot (the percentage of inertia explained by each FAMD dimension):
fviz_screeplot(res.famd)
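The percentages in the eigenvalue table can be checked by hand. The sketch below assumes that the total inertia of this FAMD equals 11, that is the 6 quantitative variables plus (3 - 1) for Label and (4 - 1) for Soil. This total is not stated in the chapter, but it agrees with the printed percentages up to rounding. The eigenvalues are copied from the output above.

```r
# Eigenvalues copied from the output above
eig <- c(Dim.1 = 4.832, Dim.2 = 1.857, Dim.3 = 1.582,
         Dim.4 = 1.149, Dim.5 = 0.652)

# ASSUMPTION: total inertia = 6 quantitative variables + (3 - 1) + (4 - 1)
total_inertia <- 6 + (3 - 1) + (4 - 1)

# Percentage of inertia per dimension and cumulative percentage
variance_percent <- 100 * eig / total_inertia
cumulative_percent <- cumsum(variance_percent)

round(cbind(eigenvalue = eig, variance_percent, cumulative_percent), 2)
```

Small differences in the last digit come from the rounding of the eigenvalues in the printed table.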
To help in the interpretation of FAMD, we highly recommend reading the interpretation of principal component analysis (Chapter 4) and multiple correspondence analysis (Chapter 6). Many of the graphs presented here have already been described in the previous chapters.
7.3.2 Graph of variables
7.3.2.1 All variables
The function get_famd_var() [in factoextra] is used to extract the results for variables. By default, this function returns a list containing the coordinates, the cos2 and the contributions of all variables:
var <- get_famd_var(res.famd)
var
## FAMD results for variables
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates"
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"
The different components can be accessed as follows:
# Coordinates of variables
head(var$coord)
# Cos2: quality of representation on the factor map
head(var$cos2)
# Contributions to the dimensions
head(var$contrib)
The following figure shows the correlation between variables (both quantitative and qualitative) and the principal dimensions, as well as the contribution of variables to dimensions 1 and 2. The following functions [in the factoextra package] are used:
fviz_famd_var() to plot both quantitative and qualitative variables
fviz_contrib() to visualize the contribution of variables to the principal dimensions
# Plot of variables
fviz_famd_var(res.famd, repel = TRUE)
# Contribution to the first dimension
fviz_contrib(res.famd, "var", axes = 1)
# Contribution to the second dimension
fviz_contrib(res.famd, "var", axes = 2)
The red dashed line on the graph above indicates the expected average value if the contributions were uniform. Read more in Chapter 4.
From the plots above, it can be seen that:
variables that contribute the most to the first dimension are: Overall.quality and Harmony.
variables that contribute the most to the second dimension are: Soil and Acidity.
7.3.2.2 Quantitative variables
To extract the results for quantitative variables, type this:
quanti.var <- get_famd_var(res.famd, "quanti.var")
quanti.var
## FAMD results for quantitative variables
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates"
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"
In this section, we'll describe how to visualize quantitative variables. Additionally, we'll show how to highlight variables according to either i) their quality of representation on the factor map or ii) their contributions to the dimensions.
The R code below plots quantitative variables. We use repel = TRUE to avoid text overlapping.
fviz_famd_var(res.famd, "quanti.var", repel = TRUE,
              col.var = "black")
Briefly, the graph of variables (correlation circle) shows the relationship between variables, the quality of the representation of variables, as well as the correlation between variables and the dimensions. Read more at PCA (Chapter 4), MCA (Chapter 6) and MFA (Chapter 8).
The most contributing quantitative variables can be highlighted on the scatter plot using the argument col.var = "contrib". This produces a color gradient, which can be customized using the argument gradient.cols.
fviz_famd_var(res.famd, "quanti.var", col.var = "contrib",
              gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
              repel = TRUE)
Similarly, you can highlight quantitative variables using their cos2 values, which represent the quality of representation on the factor map. If a variable is well represented by two dimensions, the sum of its cos2 over these dimensions is close to one. For some of the variables, more than 2 dimensions might be required to perfectly represent the data.
# Color by cos2 values: quality on the factor map
fviz_famd_var(res.famd, "quanti.var", col.var = "cos2",
              gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
              repel = TRUE)
7.3.2.3 Graph of qualitative variables
Like quantitative variables, the results for qualitative variables can be extracted as follows:
quali.var <- get_famd_var(res.famd, "quali.var")
quali.var
## FAMD results for qualitative variable categories
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates"
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"
To visualize qualitative variables, type this:
fviz_famd_var(res.famd, "quali.var", col.var = "contrib",
              gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
              )
The plot above shows the categories of the categorical variables.
7.3.3 Graph of individuals
To get the results for individuals, type this:
ind <- get_famd_ind(res.famd)
ind
## FAMD results for individuals
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates"
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"
To plot individuals, use the function fviz_famd_ind() [in factoextra]. By default, individuals are colored in blue. However, like variables, it's also possible to color individuals by their cos2 and contribution values:
fviz_famd_ind(res.famd, col.ind = "cos2",
              gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
              repel = TRUE)
Individuals with similar profiles are close to each other on the factor map. For the interpretation, read more at Chapter 6 (MCA) and Chapter 8 (MFA).
Note that it's possible to color the individuals using any of the qualitative variables in the initial data table. To do this, the argument habillage is used in the fviz_famd_ind() function. For example, if you want to color the wines according to the qualitative variable "Label", type this:
fviz_famd_ind(res.famd,
              habillage = "Label", # color by groups
              palette = c("#00AFBB", "#E7B800", "#FC4E07"),
              addEllipses = TRUE, ellipse.type = "confidence",
              repel = TRUE # Avoid text overlapping
              )
In the plot above, the qualitative variable categories are shown in black. Reference, Env1, Env2 and Env4 are the categories of the soil. Saumur, Bourgueuil and Chinon are the categories of the wine Label. If you don't want to show them on the plot, use the argument invisible = "quali.var".
If you want to color individuals using multiple categorical variables at the same time, use the function fviz_ellipses() [in factoextra] as follows:
fviz_ellipses(res.famd, c("Label", "Soil"), repel = TRUE)
Alternatively, you can specify categorical variable indices:
fviz_ellipses(res.famd, 1:2, geom = "point")
7.4 Summary
The factor analysis of mixed data (FAMD) makes it possible to analyze a data set in which individuals are described by both qualitative and quantitative variables. In this chapter, we described how to perform and interpret FAMD using the FactoMineR and factoextra R packages.
7.5 Further reading
Factor Analysis of Mixed Data Using FactoMineR (video course). https://goo.gl/64gY3R
8 Multiple Factor Analysis
8.1 Introduction
Multiple factor analysis (MFA) (J. Pagès 2002) is a multivariate data analysis method for summarizing and visualizing a complex data table in which individuals are described by several sets of variables (quantitative and/or qualitative) structured into groups. It takes into account the contribution of all active groups of variables to define the distance between individuals. The number of variables in each group may differ, and the nature of the variables (qualitative or quantitative) can vary from one group to the other, but the variables should be of the same nature in a given group (Abdi and Williams 2010).
MFA may be considered as a general factor analysis. Roughly, the core of MFA is based on:
Principal component analysis (PCA) (Chapter 4) when variables are quantitative,
Multiple correspondence analysis (MCA) (Chapter 6) when variables are qualitative.
This global analysis, where multiple sets of variables are simultaneously considered, requires balancing the influence of each set of variables. Therefore, in MFA, the variables are weighted during the analysis. Variables in the same group are normalized using the same weighting value, which can vary from one group to another. Technically, MFA assigns to each variable of group j a weight equal to the inverse of the first eigenvalue of the analysis (PCA or MCA according to the type of variable) of group j.
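The weighting rule described above can be sketched in base R for quantitative groups. The example uses the built-in mtcars data with two made-up groups of variables (the group names and their composition are purely illustrative). Each group is standardized, its first eigenvalue is computed from a separate PCA, and every variable of the group is weighted by the inverse of that eigenvalue. This is a sketch of the principle, not the FactoMineR implementation.

```r
# Two hypothetical groups of quantitative variables
groups <- list(
  engine = c("cyl", "disp", "hp"),
  performance = c("mpg", "qsec")
)

# Standardize all variables (unit variance)
Z <- scale(mtcars[, unlist(groups)])

# First eigenvalue of the separate PCA of each group
lambda1 <- sapply(groups, function(cols) eigen(cor(mtcars[, cols]))$values[1])

# MFA weight of each group: inverse of its first eigenvalue
w <- 1 / lambda1

# Applying a weight w to a variable amounts to multiplying it by sqrt(w)
Zw <- Z
for (g in names(groups)) {
  Zw[, groups[[g]]] <- Z[, groups[[g]]] * sqrt(w[[g]])
}

# After weighting, the first eigenvalue of every group equals 1
first_eig_after <- sapply(groups, function(cols) eigen(cov(Zw[, cols]))$values[1])
round(rbind(lambda1, w, first_eig_after), 3)
```

Because the leading eigenvalue of every weighted group equals 1, no single group can dominate the first dimension of the global analysis simply because it contains more variables or stronger correlations.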
Multiple factor analysis can be used in a variety of fields (J. Pagès 2002) where the variables are organized into groups:
1. Survey analysis, where an individual is a person and a variable is a question. Questions are organized by themes (groups of questions).
2. Sensory analysis, where an individual is a food product. A first set of variables includes sensory variables (sweetness, bitterness, etc.); a second one includes chemical variables (pH, glucose rate, etc.).
3. Ecology, where an individual is an observation place. A first set of variables describes soil characteristics; a second one describes flora.
4. Time series, where several individuals are observed at different dates. In this situation, there are commonly two ways of defining groups of variables:
generally, variables observed at the same time (date) are gathered together.
when variables are the same from one date to the others, each set can gather the different dates for one variable.
In the current chapter, we show how to compute and visualize multiple factor analysis in R software using FactoMineR (for the analysis) and factoextra (for data visualization). Additionally, we'll show how to reveal the variables that contribute the most to explaining the variations in the data set.
8.2 Computation
8.2.1 R packages
Install FactoMineR and factoextra as follows:
install.packages(c("FactoMineR", "factoextra"))
Load the packages:
library("FactoMineR")
library("factoextra")
8.2.2 Data format
We'll use the demo data set wine available in the FactoMineR package. This data set is about a sensory evaluation of wines by different judges.
library("FactoMineR")
data(wine)
colnames(wine)
##  [1] "Label"                         "Soil"
##  [3] "Odor.Intensity.before.shaking" "Aroma.quality.before.shaking"
##  [5] "Fruity.before.shaking"         "Flower.before.shaking"
##  [7] "Spice.before.shaking"          "Visual.intensity"
##  [9] "Nuance"                        "Surface.feeling"
## [11] "Odor.Intensity"                "Quality.of.odour"
## [13] "Fruity"                        "Flower"
## [15] "Spice"                         "Plante"
## [17] "Phenolic"                      "Aroma.intensity"
## [19] "Aroma.persistency"             "Aroma.quality"
## [21] "Attack.intensity"              "Acidity"
## [23] "Astringency"                   "Alcohol"
## [25] "Balance"                       "Smooth"
## [27] "Bitterness"                    "Intensity"
## [29] "Harmony"                       "Overall.quality"
## [31] "Typical"
An image of the data is shown below:
Data format for Multiple Factor analysis
(Image source, FactoMineR, http://factominer.free.fr)
The data contains 21 rows (wines, individuals) and 31 columns (variables):
The first two columns are categorical variables: label (Saumur, Bourgueil or Chinon) and soil (Reference, Env1, Env2 or Env4).
The 29 next columns are continuous sensory variables. For each wine, the value is the mean score for all the judges.
The goal of this study is to analyze the characteristics of the wines.
The variables are organized in groups as follows:
1. First group - A group of categorical variables specifying the origin of the wines, including the variables label and soil corresponding to the first 2 columns in the data table. In FactoMineR terminology, the argument group = 2 is used to define the first 2 columns as a group.
2. Second group - A group of continuous variables describing the odor of the wines before shaking, including the variables: Odor.Intensity.before.shaking, Aroma.quality.before.shaking, Fruity.before.shaking, Flower.before.shaking and Spice.before.shaking. These variables correspond to the next 5 columns after the first group. FactoMineR terminology: group = 5.
3. Third group - A group of continuous variables quantifying the visual inspection of the wines, including the variables: Visual.intensity, Nuance and Surface.feeling. These variables correspond to the next 3 columns after the second group. FactoMineR terminology: group = 3.
4. Fourth group - A group of continuous variables concerning the odor of the wines after shaking, including the variables: Odor.Intensity, Quality.of.odour, Fruity, Flower, Spice, Plante, Phenolic, Aroma.intensity, Aroma.persistency and Aroma.quality. These variables correspond to the next 10 columns after the third group. FactoMineR terminology: group = 10.
5. Fifth group - A group of continuous variables evaluating the taste of the wines, including the variables Attack.intensity, Acidity, Astringency, Alcohol, Balance, Smooth, Bitterness, Intensity and Harmony. These variables correspond to the next 9 columns after the fourth group. FactoMineR terminology: group = 9.
6. Sixth group - A group of continuous variables concerning the overall judgement of the wines, including the variables Overall.quality and Typical. These variables correspond to the next 2 columns after the fifth group. FactoMineR terminology: group = 2.
8.2.3 R code
The function MFA() [FactoMineR package] can be used. A simplified format is:
MFA(base, group, type = rep("s", length(group)), ind.sup = NULL,
    name.group = NULL, num.group.sup = NULL, graph = TRUE)
base: a data frame with n rows (individuals) and p columns (variables).
group: a vector with the number of variables in each group.
type: the type of variables in each group. By default, all variables are quantitative and scaled to unit variance. Allowed values include:
"c" or "s" for quantitative variables. If "s", the variables are scaled to unit variance.
"n" for categorical variables.
"f" for frequencies (from a contingency table).
ind.sup: a vector indicating the indexes of the supplementary individuals.
name.group: a vector containing the names of the groups (by default, NULL and the groups are named group.1, group.2 and so on).
num.group.sup: the indexes of the illustrative groups (by default, NULL and no group is illustrative).
graph: a logical value. If TRUE a graph is displayed.
In summary:
We have 6 groups of variables, which can be specified to FactoMineR as follows: group = c(2, 5, 3, 10, 9, 2).
These groups can be named as follows: name.group = c("origin", "odor", "visual", "odor.after.shaking", "taste", "overall").
Among the 6 groups of variables, one is categorical and five contain continuous variables. It's recommended to standardize the continuous variables during the analysis. Standardization makes variables comparable when they are measured in different units. In FactoMineR, the argument type = "s" specifies that a given group of variables should be standardized. If you don't want standardization, use type = "c". To specify categorical variables, type = "n" is used. In our example, we'll use type = c("n", "s", "s", "s", "s", "s").
The R code below performs the MFA on the wines data using the groups odor, visual, odor after shaking and taste. These groups are named active groups. The remaining groups of variables, origin (the first group) and overall judgement (the sixth group), are named supplementary groups; they are specified with num.group.sup = c(1, 6):
library(FactoMineR)
data(wine)
res.mfa <- MFA(wine,
               group = c(2, 5, 3, 10, 9, 2),
               type = c("n", "s", "s", "s", "s", "s"),
               name.group = c("origin", "odor", "visual",
                              "odor.after.shaking", "taste", "overall"),
               num.group.sup = c(1, 6),
               graph = FALSE)
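The group, type, name.group and num.group.sup arguments must describe the same set of groups. The following base R sketch checks that and derives which groups are active; the helper names active.groups, sup.groups and n.active.var are illustrative and are not part of FactoMineR:

```r
# Arguments used in the MFA() call above
group         <- c(2, 5, 3, 10, 9, 2)
type          <- c("n", "s", "s", "s", "s", "s")
name.group    <- c("origin", "odor", "visual",
                   "odor.after.shaking", "taste", "overall")
num.group.sup <- c(1, 6)

# The three vectors must describe the same number of groups
stopifnot(length(group) == length(type),
          length(group) == length(name.group))

# Groups that take part in the construction of the axes
active.groups <- name.group[-num.group.sup]
# Groups that are only projected onto the axes afterwards
sup.groups <- name.group[num.group.sup]

# Number of active variables
n.active.var <- sum(group[-num.group.sup])
print(active.groups)
print(n.active.var)
```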
The output of the MFA() function is a list including:
print(res.mfa)
## **Results of the Multiple Factor Analysis (MFA)**
## The analysis was performed on 21 individuals, described by 31 variables
## *Results are available in the following objects:
##
##    name                 description
## 1  "$eig"               "eigenvalues"
## 2  "$separate.analyses" "separate analyses for each group of variables"
## 3  "$group"             "results for all the groups"
## 4  "$partial.axes"      "results for the partial axes"
## 5  "$inertia.ratio"     "inertia ratio"
## 6  "$ind"               "results for the individuals"
## 7  "$quanti.var"        "results for the quantitative variables"
## 8  "$quanti.var.sup"    "results for the quantitative supplementary variables"
## 9  "$quali.var.sup"     "results for the categorical supplementary variables"
## 10 "$summary.quanti"    "summary for the quantitative variables"
## 11 "$summary.quali"     "summary for the categorical variables"
## 12 "$global.pca"        "results for the global PCA"
8.3 Visualization and interpretation
We'll use the factoextra R package to help in the interpretation and the visualization of the multiple factor analysis.
The functions below [in factoextra package] will be used:
get_eigenvalue(res.mfa): Extract the eigenvalues/variances retained by each dimension (axis).
fviz_eig(res.mfa): Visualize the eigenvalues/variances.
get_mfa_ind(res.mfa): Extract the results for individuals.
get_mfa_var(res.mfa): Extract the results for quantitative and qualitative variables, as well as for groups of variables.
fviz_mfa_ind(res.mfa), fviz_mfa_var(res.mfa): Visualize the results for individuals and variables, respectively.
In the next sections, we'll illustrate each of these functions.
8.3.1 Eigenvalues / Variances
The proportion of variance retained by the different dimensions (axes) can be extracted using the function get_eigenvalue() [factoextra package] as follows:
library("factoextra")
eig.val <- get_eigenvalue(res.mfa)
head(eig.val)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1      3.462            49.38                        49.4
## Dim.2      1.367            19.49                        68.9
## Dim.3      0.615             8.78                        77.7
## Dim.4      0.372             5.31                        83.0
## Dim.5      0.270             3.86                        86.8
## Dim.6      0.202             2.89                        89.7
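The variance.percent column is each eigenvalue expressed as a percentage of the sum of all eigenvalues, and cumulative.variance.percent is its running total. The sketch below illustrates that arithmetic on a plain PCA of the USArrests data computed with stats::prcomp(), which is an assumption made to keep the example self-contained; the MFA table above is read in the same way.

```r
# PCA on standardized data (base R)
pca <- prcomp(USArrests, scale. = TRUE)

# Variance carried by each axis
eigenvalue <- pca$sdev^2
# Share of the total variance, in percent, and its running total
variance.percent <- 100 * eigenvalue / sum(eigenvalue)
cumulative.variance.percent <- cumsum(variance.percent)

eig.tab <- data.frame(eigenvalue, variance.percent,
                      cumulative.variance.percent,
                      row.names = paste0("Dim.", seq_along(eigenvalue)))
print(round(eig.tab, 2))
```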
The function fviz_eig() or fviz_screeplot() [factoextra package] can be used to draw the scree plot:
fviz_screeplot(res.mfa)
To help in the interpretation of MFA, we highly recommend reading the chapters on the interpretation of principal component analysis, simple correspondence analysis and multiple correspondence analysis. Many of the graphs presented here have already been described in previous chapters.
8.3.2 Graph of variables
8.3.2.1 Groups of variables
The function get_mfa_var() [in factoextra] is used to extract the results for groups of variables. This function returns a list containing the coordinates, the cos2 and the contribution of groups, as well as the correlation between groups and principal dimensions.
group<-get_mfa_var(res.mfa,"group")
group
## Multiple Factor Analysis results for variable groups
##  ===================================================
##   Name           Description
## 1 "$coord"       "Coordinates"
## 2 "$cos2"        "Cos2, quality of representation"
## 3 "$contrib"     "Contributions"
## 4 "$correlation" "Correlation between groups and principal dimensions"
The different components can be accessed as follows:
# Coordinates of groups
head(group$coord)
# Cos2: quality of representation on the factor map
head(group$cos2)
# Contributions to the dimensions
head(group$contrib)
To plot the groups of variables, type this:
fviz_mfa_var(res.mfa, "group")
red color = active groups of variables
green color = supplementary groups of variables
The plot above illustrates the correlation between groups and dimensions. The coordinates of the four active groups on the first dimension are almost identical. This means that they contribute similarly to the first dimension. Concerning the second dimension, the two groups odor and odor.after.shaking have the highest coordinates, indicating the highest contribution to the second dimension.
To draw a bar plot of the groups' contributions to the dimensions, use the function fviz_contrib():
# Contribution to the first dimension
fviz_contrib(res.mfa, "group", axes = 1)
# Contribution to the second dimension
fviz_contrib(res.mfa, "group", axes = 2)
8.3.2.2 Quantitative variables
The function get_mfa_var() [in factoextra] is used to extract the results for quantitative variables. This function returns a list containing the coordinates, the cos2 and the contribution of variables:
quanti.var <- get_mfa_var(res.mfa, "quanti.var")
quanti.var
## Multiple Factor Analysis results for quantitative variables
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates"
## 2 "$cos2"    "Cos2, quality of representation"
## 3 "$contrib" "Contributions"
The different components can be accessed as follows:
# Coordinates
head(quanti.var$coord)
# Cos2: quality on the factor map
head(quanti.var$cos2)
# Contributions to the dimensions
head(quanti.var$contrib)
In this section, we'll describe how to visualize quantitative variables colored by groups. Next, we'll highlight variables according to either i) their quality of representation on the factor map or ii) their contributions to the dimensions.
To interpret the graphs presented here, read the chapters on PCA and MCA.
Correlation between quantitative variables and dimensions. The R code below plots quantitative variables colored by groups. The argument palette is used to change group colors (see ?ggpubr::ggpar for more information about palette). Supplementary quantitative variables are drawn as dashed arrows in violet. We use repel = TRUE to avoid text overlapping.
fviz_mfa_var(res.mfa, "quanti.var", palette = "jco",
             col.var.sup = "violet", repel = TRUE)
To make the plot more readable, we can use geom = c("point", "text") instead of geom = c("arrow", "text"). We'll also change the legend position from "right" to "bottom", using the argument legend = "bottom":
fviz_mfa_var(res.mfa, "quanti.var", palette = "jco",
             col.var.sup = "violet", repel = TRUE,
             geom = c("point", "text"), legend = "bottom")
Briefly, the graph of variables (correlation circle) shows the relationship between variables, the quality of the representation of variables, as well as the correlation between variables and the dimensions:
Positively correlated variables are grouped together, whereas negatively correlated ones are positioned on opposite sides of the plot origin (opposed quadrants).
The distance between variable points and the origin measures the quality of the variable on the factor map. Variable points that are far from the origin are well represented on the factor map.
For a given dimension, the variables most correlated with the dimension are close to that dimension's axis.
For example, the first dimension represents the positive sentiments about wines: "intensity" and "harmony". The variables most correlated with the second dimension are: i) Spice.before.shaking and Odor.Intensity.before.shaking for the odor group; ii) Spice, Plante and Odor.Intensity for the odor after shaking group; and iii) Bitterness for the taste group. This dimension essentially represents the "spiciness" and the vegetal characteristic due to olfaction.
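On a correlation circle, the coordinate of a standardized variable on a dimension is its correlation with that dimension. The sketch below checks this on a plain PCA of USArrests computed with stats::prcomp(); using PCA instead of the MFA result is an assumption made for a self-contained example, and the names var.coord, var.cor and dist.origin are illustrative.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Coordinates of the variables: loadings scaled by the axis standard deviations
var.coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# The same values obtained as correlations between variables and components
var.cor <- cor(USArrests, pca$x)

# Distance of each variable from the origin on the first two dimensions
dist.origin <- sqrt(rowSums(var.coord[, 1:2]^2))
print(round(dist.origin, 3))
```

A variable whose distance from the origin is close to 1 lies near the circle and is well represented by the first two dimensions.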
The contribution of quantitative variables (in %) to the definition of the dimensions can be visualized using the function fviz_contrib() [factoextra package]. Variables are colored by groups. The R code below shows the top 20 variables contributing to the dimensions:
# Contributions to dimension 1
fviz_contrib(res.mfa, choice = "quanti.var", axes = 1, top = 20,
             palette = "jco")
# Contributions to dimension 2
fviz_contrib(res.mfa, choice = "quanti.var", axes = 2, top = 20,
             palette = "jco")
The red dashed line on the graph above indicates the expected average value if the contributions were uniform. The calculation of the expected contribution value, under the null hypothesis, has been detailed in the principal component analysis chapter (Chapter 4).
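As a rough illustration of that reference value, and assuming it is computed as 100 divided by the number of active quantitative variables (the uniform-contribution reasoning of the PCA chapter applied to this example), the expected average contribution can be worked out as follows. The names active.sizes, n.var and expected.contrib are illustrative.

```r
# Sizes of the active quantitative groups in the wine example
active.sizes <- c(odor = 5, visual = 3, odor.after.shaking = 10, taste = 9)

# Number of active quantitative variables
n.var <- sum(active.sizes)

# Contribution (in %) each variable would have if all contributed equally
expected.contrib <- 100 / n.var
print(round(expected.contrib, 1))
```

Under this assumption, a variable whose contribution exceeds this threshold can be considered important for the corresponding dimension.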
Variables with the largest contribution values contribute the most to the definition of the dimensions. Variables that contribute the most to Dim.1 and Dim.2 are the most important in explaining the variability in the data set.
The most contributing quantitative variables can be highlighted on the scatter plot using the argument col.var = "contrib". This produces a color gradient, which can be customized using the argument gradient.cols.
fviz_mfa_var(res.mfa, "quanti.var", col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             col.var.sup = "violet", repel = TRUE,
             geom = c("point", "text"))
Similarly, you can highlight quantitative variables using their cos2 values, which represent the quality of representation on the factor map. If a variable is well represented by two dimensions, the sum of its cos2 on these dimensions is close to one. For some of the variables, more than 2 dimensions might be required to perfectly represent the data.
# Color by cos2 values: quality on the factor map
fviz_mfa_var(res.mfa, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             col.var.sup = "violet", repel = TRUE)
To create a bar plot of variables cos2, type this:
fviz_cos2(res.mfa, choice = "quanti.var", axes = 1)
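The statement that the cos2 values of a variable add up to one can be checked numerically. The sketch below does so for a plain PCA of USArrests computed with stats::prcomp(), an assumption made to keep the example independent of the MFA result; cos2 is taken as the squared correlation between a variable and a dimension.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# cos2 of each variable on each dimension: squared correlation
cos2 <- cor(USArrests, pca$x)^2

# Quality of representation on the first factor map (Dim.1 and Dim.2)
cos2.plane <- rowSums(cos2[, 1:2])
# Summed over all dimensions, the cos2 of a variable equals 1
cos2.all <- rowSums(cos2)

print(round(cbind(cos2.plane, cos2.all), 3))
```

Variables whose cos2.plane value is well below 1 need additional dimensions to be represented faithfully.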
8.3.3 Graph of individuals
To get the results for individuals, type this:
ind <- get_mfa_ind(res.mfa)
ind
## Multiple Factor Analysis results for individuals
##  ===================================================
##   Name                      Description
## 1 "$coord"                  "Coordinates"
## 2 "$cos2"                   "Cos2, quality of representation"
## 3 "$contrib"                "Contributions"
## 4 "$coord.partiel"          "Partial coordinates"
## 5 "$within.inertia"         "Within inertia"
## 6 "$within.partial.inertia" "Within partial inertia"
To plot individuals, use the function fviz_mfa_ind() [in factoextra]. By default, individuals are colored in blue. However, as for variables, it's also possible to color individuals by their cos2 values:
fviz_mfa_ind(res.mfa, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)
In the plot above, the supplementary qualitative variable categories are shown in black. Env1, Env2, Env3 are the categories of the soil. Saumur, Bourgueuil and Chinon are the categories of the wine Label. If you don't want to show them on the plot, use the argument invisible = "quali.var".
Individuals with similar profiles are close to each other on the factor map. The first axis mainly opposes the wine 1DAM to the wines 1VAU and 2ING. As described in the previous section, the first dimension represents the harmony and the intensity of wines. Thus, the wine 1DAM (positive coordinates) was evaluated as the most "intense" and "harmonious", contrary to the wines 1VAU and 2ING (negative coordinates), which are the least "intense" and "harmonious". The second axis is essentially associated with the two wines T1 and T2, characterized by a strong value of the variables Spice.before.shaking and Odor.Intensity.before.shaking.
Most of the supplementary qualitative variable categories are close to the origin of the map. This result indicates that the categories concerned are not related to the first axis (wine "intensity" and "harmony") or to the second axis (wines T1 and T2).
The category Env4 has high coordinates on the second axis, related to T1 and T2.
The category "Reference" is known to be related to an excellent wine-producing soil. As expected, our analysis shows that the category "Reference" has high coordinates on the first axis, which is positively correlated with wine "intensity" and "harmony".
Note that it's possible to color the individuals using any of the qualitative variables in the initial data table. To do this, the argument habillage is used in the fviz_mfa_ind() function. For example, if you want to color the wines according to the supplementary qualitative variable "Label", type this:
fviz_mfa_ind(res.mfa,
             habillage = "Label", # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE, ellipse.type = "confidence",
             repel = TRUE # Avoid text overlapping
             )
If you want to color individuals using multiple categorical variables at the same time, use the function fviz_ellipses() [in factoextra] as follows:
fviz_ellipses(res.mfa, c("Label", "Soil"), repel = TRUE)
Alternatively, you can specify categorical variable indices:
fviz_ellipses(res.mfa, 1:2, geom = "point")
8.3.4 Graph of partial individuals
The results for individuals obtained from the analysis performed with a single group are named partial individuals. In other words, an individual considered from the point of view of a single group is called a partial individual.
In the default fviz_mfa_ind() plot, for a given individual, the point corresponds to the mean individual, that is, the center of gravity of the partial points of the individual. It represents the individual as viewed by all groups of variables.
For a given individual, there are as many partial points as groups of variables.
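The relation between partial points and the mean individual can be illustrated with a toy computation. The coordinates below are invented for illustration and are not taken from res.mfa; the sketch only shows that the mean point is the average of the partial points, one per active group.

```r
# Hypothetical partial coordinates of one wine on Dim.1 and Dim.2,
# one row per active group (values invented for illustration)
partial <- rbind(
  odor               = c(Dim.1 = 2.4, Dim.2 = -0.6),
  visual             = c(1.6, -0.2),
  odor.after.shaking = c(1.8, -0.4),
  taste              = c(1.4,  0.0)
)

# The mean individual is the center of gravity of its partial points
mean.point <- colMeans(partial)
print(mean.point)
```

A partial point far from the mean point indicates that the corresponding group of variables describes the individual differently from the other groups.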
The graph of partial individuals represents each wine viewed by each group, together with its barycenter. To plot the partial points of all individuals, type this:
fviz_mfa_ind(res.mfa, partial = "all")
If you want to visualize partial points for wines of interest, let's say c("1DAM", "1VAU", "2ING"), use this:
fviz_mfa_ind(res.mfa, partial = c("1DAM", "1VAU", "2ING"))
Red color represents the wines seen by only the odor variables; violet color represents the wines seen by only the visual variables, and so on.
The wine 1DAM has been described in the previous section as particularly "intense" and "harmonious", especially by the odor group: it has a higher coordinate on the first axis from the point of view of the odor variables group than from the point of view of the other groups.
From the odor group's point of view, 2ING was more "intense" and "harmonious" than 1VAU, but from the taste group's point of view, 1VAU was more "intense" and "harmonious" than 2ING.
8.3.5 Graph of partial axes
The graph of partial axes shows the relationship between the principal axes of the MFA and the ones obtained from analyzing each group using either a PCA (for groups of continuous variables) or an MCA (for qualitative variables).
fviz_mfa_axes(res.mfa)
It can be seen that the first dimension of each group is highly correlated with the MFA's first one. The second dimension of the MFA is essentially correlated with the second dimension of the olfactory groups.
8.4 Summary
The multiple factor analysis (MFA) makes it possible to analyse individuals characterized by multiple sets of variables. In this chapter, we described how to perform and interpret MFA using the FactoMineR and factoextra R packages.
8.5 Further reading
For the mathematical background behind MFA, refer to the following video courses, articles and books:
Multiple Factor Analysis Course Using FactoMineR (Video courses). https://goo.gl/WcmHHt.
Exploratory Multivariate Analysis by Example Using R (book) (Husson, Le, and Pagès 2017).
Principal component analysis (article) (Abdi and Williams 2010). https://goo.gl/1Vtwq1.
Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: Multiple Factor Analysis approach (Tayrac et al. 2009).
9 HCPC: Hierarchical Clustering on Principal Components
9.1 Introduction
Clustering is one of the important data mining methods for discovering knowledge in multivariate data sets. The goal is to identify groups (i.e. clusters) of similar objects within a data set of interest. To learn more about clustering, you can read our book entitled "Practical Guide to Cluster Analysis in R" (https://goo.gl/DmJ5y5).
Briefly, the two most common clustering strategies are:
1. Hierarchical clustering, used for identifying groups of similar observations in a data set.
2. Partitioning clustering such as the k-means algorithm, used for splitting a data set into several groups.
The HCPC (Hierarchical Clustering on Principal Components) approach allows us to combine the three standard methods used in multivariate data analyses (Husson, Josse, and Pagès 2010):
1. Principal component methods (PCA, CA, MCA, FAMD, MFA),
2. Hierarchical clustering and
3. Partitioning clustering, particularly the k-means method.
This chapter describes WHY and HOW to combine principal components and clustering methods. Finally, we demonstrate how to compute and visualize HCPC using R software.
9.2 Why HCPC?
Combining principal component methods and clustering methods is useful in at least three situations.
9.2.1 Case 1: Continuous variables
In the situation where you have a multidimensional data set containing multiple continuous variables, the principal component analysis (PCA) can be used to reduce the dimension of the data into a few continuous variables containing the most important information in the data. Next, you can perform cluster analysis on the PCA results.
The PCA step can be considered as a denoising step which can lead to a more stable clustering. This might be very useful if you have a large data set with multiple variables, such as in gene expression data.
9.2.2 Case 2: Clustering on categorical data
In order to perform clustering analysis on categorical data, the correspondence analysis (CA, for analyzing a contingency table) and the multiple correspondence analysis (MCA, for analyzing multidimensional categorical variables) can be used to transform categorical variables into a set of a few continuous variables (the principal components). The cluster analysis can then be applied on the (M)CA results.
In this case, the (M)CA method can be considered as a pre-processing step which makes it possible to compute clustering on categorical data.
9.2.3 Case 3: Clustering on mixed data
When you have mixed data containing both continuous and categorical variables, you can first perform FAMD (factor analysis of mixed data) or MFA (multiple factor analysis). Next, you can apply cluster analysis on the FAMD/MFA outputs.
9.3 Algorithm of the HCPC method
The algorithm of the HCPC method, as implemented in the FactoMineR package, can be summarized as follows:
1. Compute principal component methods: PCA, (M)CA or MFA, depending on the types of variables in the data set and the structure of the data set. At this step, you can choose the number of dimensions to be retained in the output by specifying the argument ncp. The default value is 5.
2. Compute hierarchical clustering: Hierarchical clustering is performed using Ward's criterion on the selected principal components. Ward's criterion is used in the hierarchical clustering because it is based on the multidimensional variance, like principal component analysis.
3. Choose the number of clusters based on the hierarchical tree: An initial partitioning is performed by cutting the hierarchical tree.
4. Perform k-means clustering to improve the initial partition obtained from hierarchical clustering. The final partitioning solution, obtained after consolidation with k-means, can be (slightly) different from the one obtained with the hierarchical clustering.
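The four steps above can be sketched with base R functions only. This is a simplified illustration under stated assumptions, not the FactoMineR implementation: it uses stats::prcomp() for the PCA, hclust() with method "ward.D2" for Ward's criterion, a fixed number of clusters k = 4, and kmeans() started from the centers of the initial clusters. Results may therefore differ from those of HCPC().

```r
# Step 1: principal components (PCA on standardized data), keep ncp axes
ncp <- 3
pca <- prcomp(USArrests, scale. = TRUE)
scores <- pca$x[, 1:ncp]

# Step 2: hierarchical clustering with Ward's criterion
tree <- hclust(dist(scores), method = "ward.D2")

# Step 3: initial partition obtained by cutting the tree
k <- 4
initial <- cutree(tree, k = k)

# Step 4: k-means consolidation, started from the centers of the initial clusters
centers <- apply(scores, 2, function(v) tapply(v, initial, mean))
final <- kmeans(scores, centers = centers)

# Cross-tabulate the partition before and after consolidation
print(table(initial = initial, consolidated = final$cluster))
```

Off-diagonal counts in the cross-table correspond to individuals that changed cluster during the k-means consolidation.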
9.4 Computation
9.4.1 R packages
We'll use two R packages: i) FactoMineR for computing HCPC and ii) factoextra for visualizing the results.
To install the packages, type this:
install.packages(c("FactoMineR", "factoextra"))
After the installation, load the packages as follows:
library(factoextra)
library(FactoMineR)
9.4.2 R function
The function HCPC() [in FactoMineR package] can be used to compute hierarchical clustering on principal components.
A simplified format is:
HCPC(res, nb.clust = 0, min = 3, max = NULL, graph = TRUE)
res: Either the result of a factor analysis or a data frame.
nb.clust: an integer specifying the number of clusters. Possible values are:
    0: the tree is cut at the level the user clicks on.
    -1: the tree is automatically cut at the suggested level.
    Any positive integer: the tree is cut with nb.clust clusters.
min, max: the minimum and the maximum number of clusters to be generated, respectively.
graph: if TRUE, graphics are displayed.
9.4.3 Case of continuous variables
We start by computing again the principal component analysis (PCA). The argument ncp = 3 is used in the function PCA() to keep only the first three principal components. Next, the HCPC is applied on the result of the PCA.
library(FactoMineR)
# Compute PCA with ncp = 3
res.pca <- PCA(USArrests, ncp = 3, graph = FALSE)
# Compute hierarchical clustering on principal components
res.hcpc <- HCPC(res.pca, graph = FALSE)
To visualize the dendrogram generated by the hierarchical clustering, we'll use the function fviz_dend() [in factoextra package]:
fviz_dend(res.hcpc,
          cex = 0.7,                     # Label size
          palette = "jco",               # Color palette see ?ggpubr::ggpar
          rect = TRUE, rect_fill = TRUE, # Add rectangle around groups
          rect_border = "jco",           # Rectangle color
          labels_track_height = 0.8      # Augment the room for labels
          )
The dendrogram suggests a 4-cluster solution.
It's possible to visualize individuals on the principal component map and to color individuals according to the cluster they belong to. The function fviz_cluster() [in factoextra] can be used to visualize the clusters of individuals.
fviz_cluster(res.hcpc,
             repel = TRUE,            # Avoid label overlapping
             show.clust.cent = TRUE,  # Show cluster centers
             palette = "jco",         # Color palette see ?ggpubr::ggpar
             ggtheme = theme_minimal(),
             main = "Factor map"
             )
You can also draw a three-dimensional plot combining the hierarchical clustering and the factorial map using the R base function plot():
# Principal components + tree
plot(res.hcpc, choice = "3D.map")
The function HCPC() returns a list containing:
data.clust: The original data with a supplementary column called clust containing the partition.
desc.var: The variables describing clusters.
desc.ind: The most typical individuals of each cluster.
desc.axes: The axes describing clusters.
To display the original data with cluster assignments, type this:
head(res.hcpc$data.clust, 10)
##             Murder Assault UrbanPop Rape clust
## Alabama       13.2     236       58 21.2     3
## Alaska        10.0     263       48 44.5     4
## Arizona        8.1     294       80 31.0     4
## Arkansas       8.8     190       50 19.5     3
## California     9.0     276       91 40.6     4
## Colorado       7.9     204       78 38.7     4
## Connecticut    3.3     110       77 11.1     2
## Delaware       5.9     238       72 15.8     2
## Florida       15.4     335       80 31.9     4
## Georgia       17.4     211       60 25.8     3
In the table above, the last column contains the cluster assignments.
To display the quantitative variables that best describe each cluster, type this:
res.hcpc$desc.var$quanti
Here, we show only some columns of interest: "Mean in category", "Overall mean" and "p.value".
## $`1`
##          Mean in category Overall mean  p.value
## UrbanPop             52.1        65.54 9.68e-05
## Murder                3.6         7.79 5.57e-05
## Rape                 12.2        21.23 5.08e-05
## Assault              78.5       170.76 3.52e-06
##
## $`2`
##          Mean in category Overall mean p.value
## UrbanPop            73.88        65.54 0.00522
## Murder               5.66         7.79 0.01759
##
## $`3`
##          Mean in category Overall mean  p.value
## Murder               13.9         7.79 1.32e-05
## Assault             243.6       170.76 6.97e-03
## UrbanPop             53.8        65.54 1.19e-02
##
## $`4`
##          Mean in category Overall mean  p.value
## Rape                 33.2        21.23 8.69e-08
## Assault             257.4       170.76 1.32e-05
## UrbanPop             76.0        65.54 2.45e-03
## Murder               10.8         7.79 3.58e-03
From the output above, it can be seen that:
the variables UrbanPop, Murder, Rape and Assault are most significantly associated with cluster 1. For example, the mean value of the Assault variable in cluster 1 is 78.53, which is less than its overall mean (170.76) across all clusters. Therefore, it can be concluded that cluster 1 is characterized by a low rate of Assault compared to all clusters.
the variables UrbanPop and Murder are most significantly associated with cluster 2.
...and so on...
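The comparison between "Mean in category" and "Overall mean" can be reproduced by hand. The sketch below uses an illustrative partition obtained with base R (Ward clustering on the scaled data, cut into 4 clusters); this partition is an assumption and is not guaranteed to match the HCPC partition shown above, so the cluster means will generally differ from the printed output. The overall means do not depend on the partition.

```r
# Illustrative partition of USArrests into 4 clusters (Ward on scaled data);
# it is not guaranteed to match the HCPC partition shown above
clust <- cutree(hclust(dist(scale(USArrests)), method = "ward.D2"), k = 4)

# Overall mean of each variable
overall.mean <- colMeans(USArrests)
# Mean of each variable within each cluster
mean.in.category <- aggregate(USArrests, by = list(clust = clust), FUN = mean)

# Difference between each cluster mean and the overall mean, per variable
delta <- sweep(as.matrix(mean.in.category[, -1]), 2, overall.mean, `-`)
rownames(delta) <- mean.in.category$clust
print(round(overall.mean, 2))
print(round(delta, 2))
```

Negative values in delta flag variables that are below average in a cluster, positive values those that are above average.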
Similarly, to show the principal dimensions that are the most associated with clusters, type this:
res.hcpc$desc.axes$quanti
## $`1`
##       Mean in category Overall mean  p.value
## Dim.1            -1.96    -5.64e-16 2.27e-07
##
## $`2`
##       Mean in category Overall mean  p.value
## Dim.2            0.743    -5.37e-16 0.000336
##
## $`3`
##       Mean in category Overall mean  p.value
## Dim.1            1.061    -5.64e-16 3.96e-02
## Dim.3            0.397     3.54e-17 4.25e-02
## Dim.2           -1.477    -5.37e-16 5.72e-06
##
## $`4`
##       Mean in category Overall mean  p.value
## Dim.1             1.89    -5.64e-16 6.15e-07
The results above indicate that individuals in clusters 1 and 4 have extreme coordinates on axis 1 (negative for cluster 1, positive for cluster 4). Individuals in cluster 2 have high coordinates on the second axis. Individuals who belong to the third cluster are significantly associated with axes 1, 2 and 3.
Finally, representative individuals of each cluster can be extracted as follows:
res.hcpc$desc.ind$para
## Cluster: 1
##         Idaho  South Dakota         Maine          Iowa New Hampshire
##         0.367         0.499         0.501         0.553         0.589
## --------------------------------------------------------
## Cluster: 2
##         Ohio     Oklahoma Pennsylvania       Kansas      Indiana
##        0.280        0.505        0.509        0.604        0.710
## --------------------------------------------------------
## Cluster: 3
##        Alabama South Carolina        Georgia      Tennessee      Louisiana
##          0.355          0.534          0.614          0.852          0.878
## --------------------------------------------------------
## Cluster: 4
##   Michigan    Arizona New Mexico   Maryland      Texas
##      0.325      0.453      0.518      0.901      0.924
For each cluster, the top 5 individuals closest to the cluster center are shown. The distance between each individual and the cluster center is provided. For example, representative individuals for cluster 1 include: Idaho, South Dakota, Maine, Iowa and New Hampshire.
9.4.4 Case of categorical variables
For categorical variables, compute CA or MCA and then apply the function HCPC() on the results as described above.
Here, we'll use the tea data [in FactoMineR] as demo data set: rows represent the individuals and columns represent categorical variables.
We start by performing an MCA on the individuals. We keep the first 20 axes of the MCA, which retain 87% of the information.
# Loading data
library(FactoMineR)
data(tea)
# Performing MCA
res.mca <- MCA(tea,
               ncp = 20,             # Number of components kept
               quanti.sup = 19,      # Quantitative supplementary variables
               quali.sup = c(20:36), # Qualitative supplementary variables
               graph = FALSE)
Next, we apply hierarchical clustering on the results of the MCA:
res.hcpc <- HCPC(res.mca, graph = FALSE, max = 3)
The results can be visualized as follows:
# Dendrogram
fviz_dend(res.hcpc, show_labels = FALSE)
# Individuals factor map
fviz_cluster(res.hcpc, geom = "point", main = "Factor map")
As mentioned above, clusters can be described by i) variables and/or categories, ii) principal axes and iii) individuals. In the example below, we display only a subset of the results.
Description by variables and categories
# Description by variables
res.hcpc$desc.var$test.chi2
##          p.value df
## where   8.47e-79  4
## how     3.14e-47  4
## price   1.86e-28 10
## tearoom 9.62e-19  2
# Description by variable categories
res.hcpc$desc.var$category
## $`1`
##                     Cla/Mod Mod/Cla Global  p.value
## where=chain store      85.9    93.8   64.0 2.09e-40
## how=tea bag            84.1    81.2   56.7 1.48e-25
## tearoom=Not.tearoom    70.7    97.2   80.7 1.08e-18
## price=p_branded        83.2    44.9   31.7 1.63e-09
##
## $`2`
##                  Cla/Mod Mod/Cla Global  p.value
## where=tea shop      90.0    84.4   10.0 3.70e-30
## how=unpackaged      66.7    75.0   12.0 5.35e-20
## price=p_upscale     49.1    81.2   17.7 2.39e-17
## Tea=green           27.3    28.1   11.0 4.44e-03
##
## $`3`
##                            Cla/Mod Mod/Cla Global  p.value
## where=chain store+tea shop    85.9    72.8   26.0 5.73e-34
## how=tea bag+unpackaged        67.0    68.5   31.3 1.38e-19
## tearoom=tearoom               77.6    48.9   19.3 1.25e-16
## pub=pub                       63.5    43.5   21.0 1.13e-09
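The three percentage columns of the category table can be read as follows: Cla/Mod is the share of the individuals having the category that fall in the cluster, Mod/Cla is the share of the cluster's individuals that have the category, and Global is the share of the category in the whole data set. The sketch below recomputes them for cluster 2 and the category where=tea shop. The counts are back-derived from the percentages printed above and are assumptions, not values read from the tea data.

```r
# Counts back-derived from the percentages printed above for cluster 2 and
# the category where=tea shop (assumed, not read from the data)
n.total    <- 300  # individuals in the tea data
n.category <- 30   # individuals with where = tea shop
n.cluster  <- 32   # individuals in cluster 2
n.both     <- 27   # individuals in cluster 2 with where = tea shop

cla.mod <- 100 * n.both / n.category   # share of the category found in the cluster
mod.cla <- 100 * n.both / n.cluster    # share of the cluster having the category
global  <- 100 * n.category / n.total  # share of the category in the whole data set

print(round(c("Cla/Mod" = cla.mod, "Mod/Cla" = mod.cla, Global = global), 1))
```

A category is over-represented in a cluster when Mod/Cla is clearly larger than Global, as is the case here.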
Description by principal components
res.hcpc$desc.axes
Description by individuals
res.hcpc$desc.ind$para
The variables that best characterize the clusters are the variables "where" and "how". Each cluster is characterized by a category of the variables "where" and "how". For example, individuals who belong to the first cluster buy tea as tea bags in chain stores.
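The desc.ind$para component lists, for each cluster, the individuals closest to the cluster center. The idea can be sketched in base R on the numeric USArrests data; the partition used here (Ward clustering on the first three principal components from stats::prcomp(), cut into 4 clusters) is an illustrative assumption and not the HCPC result, and the object name para is borrowed only for readability.

```r
scores <- prcomp(USArrests, scale. = TRUE)$x[, 1:3]
clust  <- cutree(hclust(dist(scores), method = "ward.D2"), k = 4)

# Center of each cluster in the space of the retained components
centers <- apply(scores, 2, function(v) tapply(v, clust, mean))

# Euclidean distance between each individual and the center of its own cluster
d <- sqrt(rowSums((scores - centers[clust, , drop = FALSE])^2))

# For each cluster, the (at most) 5 individuals closest to the center
para <- lapply(split(d, clust), function(x) head(sort(x), 5))
print(para)
```

The individuals listed first in each element are the most typical of their cluster in the sense of this distance.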
9.5 Summary
We described how to compute hierarchical clustering on principal components (HCPC). This approach is useful in situations including:
When you have a large data set containing continuous variables, a principal component analysis can be used to reduce the dimension of the data before the hierarchical clustering analysis.
When you have a data set containing categorical variables, a (multiple) correspondence analysis can be used to transform the categorical variables into a few continuous principal components, which can be used as the input of the cluster analysis.
We used the FactoMineR package to compute the HCPC and the factoextra R package for ggplot2-based elegant data visualization.
9.6 Further reading
Practical guide to cluster analysis in R (Book). https://goo.gl/DmJ5y5
HCPC: Hierarchical Clustering on Principal Components (Videos). https://goo.gl/jdYGoK
References
Abdi, Hervé, and Lynne J. Williams. 2010. “Principal Component Analysis.” John Wiley and Sons, Inc. WIREs Comp Stat 2: 433–59. http://staff.ustc.edu.cn/~zwp/teach/MVA/abdi-awPCA2010.pdf.
Bendixen, Mike. 2003. “A Practical Guide to the Use of Correspondence Analysis in Marketing Research.” Marketing Bulletin 14. http://marketing-bulletin.massey.ac.nz/V14/MB_V14_T2_Bendixen.pdf.
Bendixen, Mike T. 1995. “Compositional Perceptual Mapping Using Chi‐squared Trees Analysis and Correspondence Analysis.” Journal of Marketing Management 11 (6): 571–81. doi:10.1080/0267257X.1995.9964368.
Gabriel, K. Ruben, and Charles L. Odoroff. 1990. “Biplots in Biomedical Research.” Statistics in Medicine 9 (5). Wiley Subscription Services, Inc., A Wiley Company: 469–85. doi:10.1002/sim.4780090502.
Greenacre, Michael. 2013. “Contribution Biplots.” Journal of Computational and Graphical Statistics 22 (1): 107–22. http://dx.doi.org/10.1080/10618600.2012.702494.
Husson, Francois, Julie Josse, Sebastien Le, and Jeremy Mazet. 2017. FactoMineR: Multivariate Exploratory Data Analysis and Data Mining. https://CRAN.R-project.org/package=FactoMineR.
Husson, Francois, Sebastien Le, and Jérôme Pagès. 2017. Exploratory Multivariate Analysis by Example Using R. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. http://factominer.free.fr/bookV2/index.html.
Husson, François, Julie Josse, and Jérôme Pagès. 2010. “Principal Component Methods - Hierarchical Clustering - Partitional Clustering: Why Would We Need to Choose for Visualizing Data?” Unpublished Data. http://www.sthda.com/english/upload/hcpc_husson_josse.pdf.
Jolliffe, I. T. 2002. Principal Component Analysis. 2nd ed. New York: Springer-Verlag. https://goo.gl/SB86SR.
Kaiser, Henry F. 1961. “A Note on Guttman’s Lower Bound for the Number of Common Factors.” British Journal of Statistical Psychology 14: 1–2.
Kassambara, Alboukadel, and Fabian Mundt. 2017. Factoextra: Extract and Visualize the Results of Multivariate Data Analyses. http://www.sthda.com/english/rpkgs/factoextra.
Nenadic, O., and M. Greenacre. 2007. “Correspondence Analysis in R, with Two- and Three-Dimensional Graphics: The ca Package.” Journal of Statistical Software 20 (3): 1–13. http://www.jstatsoft.org.
Pagès, J. 2002. “Analyse Factorielle Multiple Appliquée Aux Variables Qualitatives et Aux Données Mixtes.” Revue Statistique Appliquee 4: 5–37.
Pagès, J. 2004. “Analyse Factorielle de Donnees Mixtes.” Revue Statistique Appliquee 4: 93–111.
Peres-Neto, Pedro R., Donald A. Jackson, and Keith M. Somers. 2005. “How Many Principal Components? Stopping Rules for Determining the Number of Non-Trivial Axes Revisited.” Computational Statistics & Data Analysis 49 (4): 974–97.
Tayrac, Marie de, Sébastien Lê, Marc Aubry, Jean Mosser, and François Husson. 2009. “Simultaneous Analysis of Distinct Omics Data Sets with Integration of Biological Knowledge: Multiple Factor Analysis Approach.” BMC Genomics 10 (1): 32. https://doi.org/10.1186/1471-2164-10-32.