70
Missing Data; Missing Data Methods in ML; Multiple Imputation via Predictive Mean Matching EPSY 905: Multivariate Analysis Spring 2016 Lecture #11: April 13, 2016 EPSY 905: Lecture 11 -- Missing Data Methods

Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MissingData;MissingDataMethodsinML;

MultipleImputationviaPredictiveMeanMatching

EPSY905:MultivariateAnalysisSpring2016

Lecture#11:April13,2016

EPSY905:Lecture11-- MissingDataMethods

Page 2: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

Today’sLecture

• Thebasicsofmissingdata:Ø Typesofmissingdata

• HowNOTtohandlemissingdataØ Deletionmethods (bothpairwiseandlistwise)Ø Mean-substitutionØ SingleImputation

• Howmaximumlikelihoodworkswithmissingdata

• MultipleimputationformissingdataØ HowimputationworksØ Howtoconductanalyseswithmissingdatausingimputation

EPSY905:Lecture11-- MissingDataMethods 2

Page 3: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

ExampleData#1

• Todemonstratesomeoftheideasoftypesofmissingdata,let’sconsiderasituationwhereyouhavecollectedtwovariables:

Ø IQscoresØ Jobperformance

• ImagineyouareanemployerlookingtohireemployeesforajobwhereIQisimportant

EPSY905:Lecture11-- MissingDataMethods 3

Page 4: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

EPSY905:Lecture11-- MissingDataMethods 4

IQ Performance78 984 1384 1085 887 791 792 994 994 1196 799 7105 10105 11106 15108 10112 10113 12115 14118 16134 12

CompleteDataFromEnders(2010)

Page 5: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

TYPESOFMISSINGDATA

EPSY905:Lecture11-- MissingDataMethods 5

Page 6: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

OurNotationalSetup

• Let’sletD denoteourdatamatrix,whichwillincludedependent(Y)andindependent(X)variables

𝐃 = 𝐗,𝐘

• Problem: someelementsof𝐃 aremissing

EPSY905:Lecture11-- MissingDataMethods 6

Page 7: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MissingnessIndicatorVariables

• WecanconstructanalternatematrixM consistingofindicatorsofmissingnessforeachelementinourdatamatrixD

𝑀'( = 0 ifthe𝑖+,observation’s𝑗+, variableisnotmissing𝑀'( = 1 ifthe𝑖+,observation’s𝑗+, variableismissing

• Let𝑴012 and𝑴3'2 denotetheobservedandmissingpartsof𝑴

𝑴 = {𝑴012,𝑴3'2}

EPSY905:Lecture11-- MissingDataMethods 7

Page 8: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

TypesofMissingData

• Averyroughtypologyofmissingdataputsmissingobservationsintothreecategories:

1. MissingCompletelyAtRandom(MCAR)2. MissingAtRandom(MAR)3. MissingNotAtRandom(MNAR)

EPSY905:Lecture11-- MissingDataMethods 8

Page 9: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MissingCompletelyAtRandom(MCAR)

• MissingdataareMCARiftheeventsthatleadtomissingnessareindependentof:

Ø Theobserved variables-and-

Ø Theunobserved parameters ofinterest

• Examples:Ø Plannedmissingness insurvey research

w Somelarge-scale testsaresampled usingbookletsw Students receiveonlyafewofthetotalnumberofitemsw The itemsnotreceived aretreatedasmissing – butthat iscompletely afunctionofsampling andnoothermechanism

EPSY905:Lecture11-- MissingDataMethods 9

Page 10: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

A(More)FormalMCARDefinition

• Ourmissingdataindicators,𝑴 arestatisticallyindependent ofourobserveddata𝑫

𝑃 𝑴|𝑫 = 𝑃 𝑴thiscomesfromhowindependenceworkswithpdfs

• Likesayingamissingobservationisduetopurerandomness(i.e.,flippingacoin)

EPSY905:Lecture11-- MissingDataMethods 10

Page 11: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

ImplicationsofMCAR

• Becausethemechanismofmissingisnotduetoanythingotherthanchance,inclusionofMCARindatawillnotbiasyourresults

Ø Canusemethodsbasedonlistwisedeletion,multipleimputation,ormaximumlikelihood

• Youreffectivesamplesizeislowered,thoughØ Lesspower, lessefficiency

EPSY905:Lecture11-- MissingDataMethods 11

Page 12: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

IQ Performance78 -84 1384 -85 887 791 792 994 994 1196 -99 7105 10105 11106 15108 10112 -113 12115 14118 16134 -

MCARData

Missingdataaredispersedrandomlythroughoutdata

MeanIQofcompletecases:99.7MeanIQofincompletecases:100.8

EPSY905:Lecture11-- MissingDataMethods 12

Page 13: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MissingAtRandom(MAR)

• DataareMARiftheprobabilityofmissingdependsonlyonsome(orall)oftheobserveddata

• 𝑴 isindependentof𝑫𝑚𝑖𝑠𝑃 𝑴 𝑫 = 𝑃(𝑴|𝑫012)

EPSY905:Lecture11-- MissingDataMethods 13

Page 14: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

IQ Perf Indicator78 - 184 - 184 - 185 - 187 - 191 7 092 9 094 9 094 11 096 7 099 7 0105 10 0105 11 0106 15 0108 10 0112 10 0113 12 0115 14 0118 16 0134 12 0

MARData

Missingdataarerelatedtootherdata:

AnyIQlessthan90didnothaveaperformancevariable

MeanIQofincompletecases:83.6MeanIQofcompletecases:105.5

EPSY905:Lecture11-- MissingDataMethods 14

Page 15: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

ImplicationsofMAR

• Ifdataaremissingatrandom,biasedresultscouldoccur

• Inferencesbasedonlistwisedeletionwillbebiasedandinefficient

Ø Fewerdatapoints=moreerror inanalysis

• Inferencesbasedonmaximumlikelihoodwillbeunbiasedbutinefficient

• WewillfocusonmethodsforMARdatatoday

EPSY905:Lecture11-- MissingDataMethods 15

Page 16: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MissingNotAtRandom(MNAR)

• DataareMNARiftheprobabilityofmissingdataisrelatedtovaluesofthevariableitself

𝑃 𝑴 𝑫 = 𝑃(𝑴|𝑫012, 𝑫3'2)

• Oftencallednon-ignorablemissingnessØ Inferences basedonlistwisedeletionormaximumlikelihoodwillbebiasedandinefficient

• Needtoprovidestatisticalmodelformissingdatasimultaneouslywithestimationoforiginalmodel

EPSY905:Lecture11-- MissingDataMethods 16

Page 17: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

SURVIVINGMISSINGDATA:ABRIEFGUIDE

EPSY905:Lecture11-- MissingDataMethods 17

Page 18: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

UsingStatisticalMethodswithMissingData

• Missingdatacanalteryouranalysisresultsdramaticallydependingupon:1.Thetypeofmissingdata2.Thetypeofanalysisalgorithm

• Thechoiceofanalgorithmandmissingdatamethodisimportantinavoidingissuesduetomissingdata

EPSY905:Lecture11-- MissingDataMethods 18

Page 19: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

TheWorstCaseScenario:MNAR

• TheworstcasescenarioiswhendataareMNAR:missingnotatrandom

Ø Non-ignorablemissing

• YoucannoteasilygetoutofthismessØ Insteadyouhave tobeclairvoyant

• Analysesalgorithmsmustincorporatemodelsformissingdata

Ø Andthesemodelsmustalsoberight

EPSY905:Lecture11-- MissingDataMethods 19

Page 20: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

TheReality

• Inmostempiricalstudies,MNARasaconditionisanafterthought

• ItisimpossibletoknowdefinitivelyifdatatrulyareMNARØ Sodataaretreated asMARorMCAR

• HypothesistestsdoexistforMCARØ Althoughtheyhavesomeissues

EPSY905:Lecture11-- MissingDataMethods 20

Page 21: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

TheBestCaseScenario:MCAR

• UnderMCAR,prettymuchanythingyoudowithyourdatawillgiveyouthe“right”(unbiased)estimatesofyourmodelparameters

• MCARisveryunlikelytooccurØ Inpractice,MCARistreated asequallyunlikelyasMNAR

EPSY905:Lecture11-- MissingDataMethods 21

Page 22: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

TheMiddleGround:MAR

• MARisthecommoncompromiseusedinmostempiricalresearch

Ø UnderMAR,maximumlikelihoodalgorithmsareunbiased

• Maximumlikelihoodisformanymethods:Ø Linearmixedmodels inPROCMIXED,gls,lavaanØ Modelswith“latent”randomeffects (CFA/SEMmodels) inMplus,lavaan

EPSY905:Lecture11-- MissingDataMethods 22

Page 23: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MISSINGDATAINMAXIMUMLIKELIHOOD

EPSY905:Lecture11-- MissingDataMethods 23

Page 24: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MissingDatawithMaximumLikelihood

• Handlingmissingdatainmaximumlikelihoodismuchmorestraightforwardduetothecalculationofthelog-likelihoodfunction

Ø Eachsubjectcontributesaportionduetotheirobservations

• Ifsomeofthedataaremissing,thelog-likelihoodfunctionusesareducedformoftheMVNdistribution

Ø CapitalizingonthepropertyoftheMVNthatsubsets ofvariables fromanMVNdistributionarealsoMVN

• Thetotallog-likelihoodisthenmaximizedØ Missingdatajustare“skipped”– theydonotcontribute

EPSY905:Lecture11-- MissingDataMethods 24

Page 25: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

EachPerson’sContributiontotheLog-Likelihood

• Foraperson𝑝,theMVNlog-likelihoodcanbewritten:

log 𝐿B = −𝑉2 log 2𝜋 −

12 log 𝚺𝒑 −

𝒚B − 𝝁BK𝚺BLM 𝒚B − 𝝁B2

Ø Fromourexampleswithmissingdata,subjectscouldeitherhavealloftheirdata…so their inputintolog 𝐿B uses:

𝒚B =𝑦B,OP𝑦B,QRST ;

𝝁B = 𝐗B𝜷 = 1 11 0

𝛽Y𝛽M

= 𝛽Y + 𝛽M𝛽Y

=𝜇OP𝜇QRST ;

𝚺B =𝜎OP] 𝜎OP,QRST

𝜎OP,QRST 𝜎QRST]

Ø …orcouldbemissingtheperformance variable, yielding:

𝒚B = 𝑦B,OP ; 𝝁B = 𝐗B𝜷 = 1 1𝛽Y𝛽M

= [𝛽Y + 𝛽M] = 𝜇OP ; 𝚺B = 𝜎OP]

EPSY905:Lecture11-- MissingDataMethods 25

Page 26: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

EvaluationofMissingDatainPROCMIXED(andprettymuchallotherpackages)

• Ifthedependentvariablesaremissing,PROCMIXEDautomaticallyskipsthosevariablesinthelikelihood

Ø TheREPEATED statement specifies observations withthesamesubject ID–andusesthenon-missingobservations fromthatsubjectonly

• Ifindependentvariablesaremissing,however,PROCMIXEDuseslistwisedeletion

Ø IfyouhavemissingIVs,thisisaproblemØ Youcansometimes phrase IVsasDVs,though

• SASSyntax(identicaltowhenyouhavecompletedata):

EPSY905:Lecture11-- MissingDataMethods 26

Page 27: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

AnalysisofMCARDatawithPROCMIXED

• Covariancematricesfromslide#4(MIXEDisclosertocomplete):

• Estimated𝐑 matrixfromPROCMIXED:

• Outputforeachobservation(obs #1=missing,obs #2=complete):

EPSY905:Lecture11-- MissingDataMethods 27

MCARData(Pairwise Deletion)

IQ 115.6 19.4

Performance 19.4 8.0

CompleteData

IQ 189.6 19.5

Performance 19.5 6.8

Page 28: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MCARAnalysis:EstimatedFixedEffects

• Estimatedmeanvectors:

• Estimatedfixedeffects:

• Means– IQ=89.36+10.64=100;Performance=10.64EPSY905:Lecture11-- MissingDataMethods 28

Variable MCARData(pairwise deletion)

CompleteData

IQ 93.73 100

Performance 10.6 10.35

Page 29: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

AnalysisofMARDatawithPROCMIXED

• Covariancematricesfromslide#4(MIXEDisclosertocomplete):

• Estimated𝐑 matrixfromPROCMIXED:

• Outputforeachobservation(obs #1=missing,obs #10=complete):

EPSY905:Lecture11-- MissingDataMethods 29

CompleteData

IQ 189.6 19.5

Performance 19.5 6.8

MAR Data(Pairwise Deletion)

IQ 130.2 19.5

Performance 19.5 7.3

Page 30: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MARAnalysis:EstimatedFixedEffects

• Estimatedmeanvectors:

• Estimatedfixedeffects:

• Means– IQ=90.15+9.85=100;Performance=9.85

EPSY905:Lecture11-- MissingDataMethods 30

Variable MCARData(pairwise deletion)

CompleteData

IQ 105.4 100

Performance 10.7 10.35

Page 31: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

AdditionalIssueswithMissingDataandMaximumLikelihood

• Giventhestructureofthemissingdata,thestandarderrorsoftheestimatedparametersmaybecomputeddifferently

Ø Standarderrorscomefrom-1*inverse informationmatrixw Informationmatrix=matrixofsecondderivatives =hessian

• SeveralversionsofthismatrixexistØ Somebasedonwhatisexpected underthemodel

w Thedefault inSAS– goodonlyforMCARdataØ Somebasedonwhatisobserved fromthedata

w Empirical optioninSAS– worksforMARdata(onlyforfixedeffects)

• Implication:someSEsmaybebiasedifdataareMARØ Mayleadtoincorrect hypothesis testresultsØ Correction needed forlikelihoodratio/deviance test statistics

w Notavailable inSAS;availableforsomemodels inMplusEPSY905:Lecture11-- MissingDataMethods 31

Page 32: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

WhenMLGoesBad…

• Forlinearmodelswithmissingdependentvariable(s)PROCMIXEDandalmosteveryotherstatpackageworksgreat

Ø ML“skips”overthemissingDVsinthelikelihoodfunction,usingonlythedatayouhaveobserved

• Forlinearmodelswithmissingindependentvariable(s),PROCMIXEDandalmosteveryotherstatpackageuseslist-wisedeletion

Ø Givesbiasedparameter estimates underMAR

EPSY905:Lecture11-- MissingDataMethods 32

Page 33: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

OptionsforMARforLinearModelswithMissingIndependentVariables

1. UseMLEstimatorsandhopeforMCAR

2. RephraseIVsasDVsØ InSAS:hardtodo,butpossible forsomemodels

w Dummycoding,correlated randomeffectsw Relyonproperties ofhowcorrelations/covariances arerelated tolinearmodel

coefficients 𝛽Ø InMplus:mucheasier…looksmore likeastructuralequationmodel

w Predicted variablesthenfunction likeDVsinMIXED

3. ImputeIVs(multipletimes)andthenuseMLEstimatorsØ Notusuallyagreatidea…but oftentheonlyoption

EPSY905:Lecture11-- MissingDataMethods 33

Page 34: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

HOWNOT TOHANDLEMISSINGDATA

EPSY905:Lecture11-- MissingDataMethods 34

Page 35: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

BadWaystoHandleMissingData

• Dealingwithmissingdataisimportant,asthemechanismsyouchoosecandramaticallyalteryourresults

• Thispointwasnotfullyrealizedwhenthefirstmethodsformissingdatawerecreated

Ø Eachofthemethodsdescribed inthissectionshouldneverbeusedØ Given toshowperspective – andtoallowyoutounderstandwhathappens ifyouwere tochooseeach

EPSY905:Lecture11-- MissingDataMethods 35

Page 36: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

DeletionMethods

• Deletionmethodsarejustthat:methodsthathandlemissingdatabydeletingobservations

Ø Listwisedeletion:delete theentireobservation ifanyvaluesaremissingØ Pairwisedeletion:deleteapairofobservations ifeitherofthevaluesaremissing

• Assumptions:DataareMCAR

• Limitations:Ø Reduction instatisticalpower ifMCARØ Biasedestimates ifMARorMNAR

EPSY905:Lecture11-- MissingDataMethods 36

Page 37: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

ListwiseDeletion

• Listwisedeletiondiscardsall ofthedatafromanobservationifoneormorevariablesaremissing

• Mostfrequentlyusedinstatisticalsoftwarepackagesthatarenotoptimizingalikelihoodfunction(needML)

• Inlinearmodels:Ø SASGLMlist-wisedeletes caseswhere IVs orDVs aremissing

EPSY905:Lecture11-- MissingDataMethods 37

Page 38: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

PairwiseDeletion

• Pairwisedeletiondiscardsapairofobservationsifeitheroneismissing

Ø Different fromlistwise:usesmoredata(restofdatanotthrownout)

• Assumes:MCAR

• Limitations:Ø Reduction instatisticalpower ifMCARØ Biasedestimates ifMARorMNAR

• Canbeanissuewhenformingcovariance/correlationmatrices

Ø Maymakethemnon-invertible, problem ifusedasinputintostatisticalprocedures

EPSY905:Lecture11-- MissingDataMethods 38

Page 39: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

SingleImputationMethods

• Singleimputationmethodsreplacemissingdatawithsometypeofvalue

Ø Single: onevalueusedØ Imputation: replacemissingdatawithvalue

• Upside:canuseentiredatasetifmissingvaluesarereplaced

• Downside:biasedparameterestimatesandstandarderrors(evenifmissingisMCAR)

Ø Type-Ierror issues

• Still:neverusethesetechniques

EPSY905:Lecture11-- MissingDataMethods 39

Page 40: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

UnconditionalMeanImputation• Unconditionalmeanimputationreplacesthemissingvaluesofavariablewithitsestimatedmean

Ø Unconditional=meanvaluewithoutanyinputfromothervariables• Example:missingOxygen=47.1;missingRunTime =10.7;

missingRunPulse =171.9

• Notice:uniformlysmallerstandarddeviationsEPSY905:Lecture11-- MissingDataMethods 40

BeforeSingleImputation: AfterSingleImputation:

Page 41: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

ConditionalMeanImputation(Regression)

• Conditionalmeanimputationusesregressionanalysestoimputemissingvalues

Ø Themissingvaluesareimputedusingthepredicted values ineachregression(conditionalmeans)

• Forourdatawewouldformregressionsforeachoutcomeusingtheothervariables

Ø OXYGEN=β01 +β11*RUNTIME +β21*PULSEØ RUNTIME=β02 +β12*OXYGEN +β22*PULSEØ PULSE=β03 +β13*OXYGEN+β23*RUNTIME

• MoreaccuratethanunconditionalmeanimputationØ Butstillprovides biasedparameters andSEs

EPSY905:Lecture11-- MissingDataMethods 41

Page 42: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

StochasticConditionalMeanImputation

• Stochasticconditionalmeanimputationaddsarandomcomponenttotheimputation

Ø Representing theerror term ineachregression equationØ AssumesMARrather thanMCAR

• Again,usesregressionanalysestoimputedata:Ø OXYGEN=β01 +β11*RUNTIME +β21*PULSE +ErrorØ RUNTIME=β02 +β12*OXYGEN +β22*PULSE + ErrorØ PULSE=β03 +β13*OXYGEN+β23*RUNTIME +Error

• Errorisrandom:drawnfromanormaldistributionØ Zeromeanandvarianceequal toresidualvariance𝜎R] forrespective regression

EPSY905:Lecture11-- MissingDataMethods 42

Page 43: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

ImputationbyProximity:HotDeckMatching

• Hotdeckmatchingusesrealdata– fromotherobservationsasitsbasisforimputing

• Observationsare“matched”usingsimilarscoresonvariablesinthedataset

Ø Imputedvaluescomedirectlyfrommatchedobservations

• Upside:Helpstopreserveunivariatedistributions;givesdatainanappropriaterange

• Downside:biasedestimates(especiallyofregressioncoefficients),too-smallstandarderrors

EPSY905:Lecture11-- MissingDataMethods 43

Page 44: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

ScaleImputationbyAveraging

• Inpsychometrictests,acommonmethodofimputationhasbeentouseascaleaverageratherthantotalscore

Ø Canre-scale tototalscorebytaking#items*averagescore

• Problem:treatingmissingitemsthiswayislikeusingpersonmean

Ø Reduces standarderrorsØ Makescalculationofreliabilitybiased

EPSY905:Lecture11-- MissingDataMethods 44

Page 45: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

LongitudinalImputation:LastObservationCarriedForward

• Acommonlyusedimputationmethodinlongitudinaldatahasbeentotreatobservationsthatdroppedoutbycarryingforwardthelastobservation

Ø Morecommoninmedicalstudiesandclinicaltrials

• Assumesscoresdonotchangeafterdropout– badideaØ Thoughttobeconservative

• CanexaggerategroupdifferencesØ Limitsstandard errors thathelpdetectgroupdifferences

EPSY905:Lecture11-- MissingDataMethods 45

Page 46: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

WhySingleImputationIsBadScience

• Overall,themethodsdescribedinthissectionarenotusefulforhandlingmissingdata

• Ifyouusethemyouwilllikelygetastatisticalanswerthatisanartifact

Ø Actualestimates youinterpret (parameter estimates) willbebiased(ineitherdirection)

Ø Standarderrorswillbetoosmallw LeadstoType-IErrors

• Puttingthistogether:youwilllikelyendupmakingconclusionsaboutyourdatathatarewrong

EPSY905:Lecture11-- MissingDataMethods 46

Page 47: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

WHATTODOWHENMLWON’TGO:MULTIPLEIMPUTATION

EPSY905:Lecture11-- MissingDataMethods 47

Page 48: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MultipleImputation

• Ratherthanusingsingleimputation,abettermethodistousemultipleimputation

Ø Themultiplyimputedvalueswillendupaddingvariabilitytoanalyses–helpingwithbiasedparameter andSEestimates

• Multipleimputationisamechanismbywhichyou“fillin”yourmissingdatawith“plausible”values

Ø Endupwithmultipledatasets– need torunmultipleanalysesØ Missingdataarepredicted usingastatisticalmodelusingtheobserved data(theMARassumption) foreachobservation

• MIispossibleduetostatisticalassumptionsØ Themostoftenusedassumption isthattheobserved dataaremultivariatenormal

EPSY905:Lecture11-- MissingDataMethods 48

Page 49: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MultipleImputationSteps

1. Themissingdataarefilledinanumberoftimes(say,mtimes)togeneratem completedatasets

2. Themcompletedatasetsareanalyzedusingstandardstatisticalanalyses

3. Theresultsfromthem completedatasetsarecombinedtoproduceinferentialresults

EPSY905:Lecture11-- MissingDataMethods 49

Page 50: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

Distributions:TheKeytoMultipleImputation

• Thekeyideabehindmultipleimputationisthateachmissingvaluehasadistribution oflikelyvalues

Ø Thedistribution reflects theuncertaintyaboutwhatthevariablemayhavebeen

• Multipleimputationcanbeaccomplishedusingvariablesoutsideananalysis

Ø Allcontribute tomultivariate normaldistributionØ Harder tojustifywhyun-important variables omitted

• Singleimputation,byanymethod,disregardstheuncertaintyineachmissingdatapoint

Ø Results fromsinglyimputeddatasetsmaybebiasedorhavehigherType-Ierrors

EPSY905:Lecture11-- MissingDataMethods 50

Page 51: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MultipleImputationinR

• Rhastwowell-knownpackagesformultipleimputation:MICEandAmelia

Ø MICE:MultipleImputationbyChainedEquationsØ Amelia:MIviaMCMCassumingMultivariate Normal

• TheMICEpackagehasanumberofmethodsavailableformultipleimputation

Ø Werestrict ourfocustodayonthosethatarecalled“predictivemeanmatching”asthese areverycommonlyused inresearch

EPSY905:Lecture11-- MissingDataMethods 51

Page 52: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

IMPUTATIONPHASE

EPSY905:Lecture11-- MissingDataMethods 52

Page 53: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MultipleImputationviaChainedEquations• ThekeytoMIviachainedequations iscreatingasetofunivariateregressionmodels

Ø Oneforeachvariablewithmissingdata

• Theprocessstartsbyrandomly imputing values forallmissingobservations

• Foreachvariable withmissingdata,insequence, aregression isrunwhereby apredictedmeaniscreated forallobservations withmissingdata

Ø Thepredictedmeancomesfromapplyingthesampledvaluesoftheposteriordistributionoftheregressionweights(sosamplingerrordoesn’tgetforgotten)

• Eachmissingobservation isthenimputed viaoneofahostofmethods (upnext)

• Thisprocesscontinues forallvariables withmissingdataØ Eachtimethrough,newlyimputedvaluesaidintheestimationofthebetas

• Attheendoftheprocess,anumberof“imputed” (complete) datasetsnowexist

EPSY905:Lecture11-- MissingDataMethods 53

Page 54: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

ExampleofMIviaCEandPMM• Todemonstratemultipleimputationwithchainedequationsusingpredictivemeanmatchingwewillconsiderthreevariablesfromouroft-usedmathperformancedata

Ø Performance (PERF),Sex(MALE),andCollegeCreditHours (CC)

• Amountofmissingdatainsample:

• WewillimputeforPERFandforCCEPSY905:Lecture11-- MissingDataMethods 54

Page 55: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

RegressionsforPERFandCC

• Tomakethechainedequationswork,weregressionmodelsforeachvariablewithmissingdata:𝑃𝐸𝑅𝐹B = 𝛽YQ + 𝛽dQ𝑀B + 𝛽eeQ 𝐶𝐶B + 𝛽d∗eeQ 𝑀B𝐶𝐶B + 𝑒BQ

𝐶𝐶B = 𝛽Ye + 𝛽de𝑀B + 𝛽Qe𝑃B + 𝛽d∗Qe 𝑀B𝑃B + 𝑒Be

• Eachvariable’spredictionmodelcanbemadeupofanything(we’lljustuseallothervariablesplustheinteraction)

• We’llcontinuethisprocessinourRexample

EPSY905:Lecture11-- MissingDataMethods 55

Page 56: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MIviaPredictiveMeanMatching

• Intheimputationphase,missingvaluesareimputedusingmatchingonthepredictedmean(fromR)

EPSY905:Lecture11-- MissingDataMethods 56

Page 57: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

RunningMICE

• Youdon’thavetousemyclunkyRcodetomakeanimputationrun…youcanuseMICE:

• HereissomeMICEimputationdistributions:

EPSY905:Lecture11-- MissingDataMethods 57

Page 58: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MULTIPLEIMPUTATION:ANALYSISPHASE

EPSY905:Lecture11-- MissingDataMethods 58

Page 59: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

UpNext:MultipleAnalyses

• OnceyourunMICE,thenextstepistouseeachoftheimputeddatasetsinitsownanalysis

Ø CalledtheanalysisphaseØ Forourexample, thatwouldbe100times

• Themultipleanalysesarethencompiledandprocessedintoasingleresult

Ø Yieldingtheanswers toyouranalysisquestions (estimates, SEs,andP-values)

• GOODNEWS:MICEwillautomateallofthisforyou

EPSY905:Lecture11-- MissingDataMethods 59

Page 60: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

AnalysisPhase

• AnalysisPhase:runtheanalysisonallimputeddatasets

• Syntaxrunsforeachdataset(with())

• Model01nowhasall100modelscontainedwithinit

EPSY905:Lecture11-- MissingDataMethods 60

Page 61: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

MULTIPLEIMPUTATION:POOLINGPHASE

EPSY905:Lecture11-- MissingDataMethods 61

Page 62: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

PoolingParametersfromAnalysesofImputedDataSets

• Inthepoolingphase,theresultsarepooledandreported

• Forparameterestimates,thepoolingisstraightforwardØ Theestimated parameter istheaverage parameter valueacrossallimputeddatasets

• Forstandarderrors,poolingismorecomplicatedØ Have toworryaboutsourcesofvariation:

w Variationfromsamplingerrorthatwouldhavebeenpresent hadthedatanotbeenmissing

w Variationfromsamplingerrorresultingfrommissing data

EPSY905:Lecture11-- MissingDataMethods 62

Page 63: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

PoolingStandardErrorsAcrossImputationAnalyses

• Standarderrorinformationcomesfromtwosourcesofvariationfromimputationanalyses(for𝑚 imputations)

• WithinImputationVariation:

𝑉i =1𝑚j𝑆𝐸']

3

'lM• BetweenImputationVariation(here𝜃 isanestimatedparameterfromanimputationanalysis):

𝑉n =1

𝑚 − 1j𝜃o' − �̅�

]3

'lM

• Then,thetotalsamplingvarianceis:𝑉K = 𝑉i + 𝑉n +qrd

• Thesubsequent(imputationpooled)SEis𝑆𝐸 = 𝑉KEPSY905:Lecture11-- MissingDataMethods 63

Page 64: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

PoolingPhaseinR:MICE’spool()Function

• MICEhasafunctionthatwillpoolallresults(resultsforusingcertainclassesofmodels,thatis)

• Behold,theresults:

EPSY905:Lecture11-- MissingDataMethods 64

Page 65: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

AdditionalPoolingInformation

• Thedecompositionofimputationvarianceleadstotwohelpfuldiagnosticmeasuresabouttheimputation:

• FractionofMissingInformation: 𝐹𝑀𝐼 =qrt

urv

qwØ Measure ofinfluenceofmissingdataonsamplingvariance

EPSY905:Lecture11-- MissingDataMethods 65

Page 66: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

ISSUESWITHIMPUTATION

EPSY905:Lecture11-- MissingDataMethods 66

Page 67: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

CommonIssuesthatcanHinderImputation

• MCMCConvergenceØ Need“stable”meanvector/covariance matrix

• Non-normaldata:counts,skeweddistributions,categorical(ordinalornominal)variables

Ø MplusisagoodoptionØ Someclaimitdoesn’tmatterasmuchwithmanyimputations

• PreservationofmodeleffectsØ Imputationcanstripouteffects indata

w Interactions aremostdifficult– formasauxiliaryvariable

• ImputationofmultileveldataØ Differingcovariancematrices

EPSY905:Lecture11-- MissingDataMethods 67

Page 68: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

NumberofImputations

• Thenumberofimputations(𝑚 fromthepreviousslides)isimportant:biggerisbetter

Ø Basically,runasmanyasyoucan(100s)

• TakealookattheSEsforourparametersasIvariedthenumberofimputations:

EPSY905:Lecture11-- MissingDataMethods 68

Parameter 𝑚 = 1 𝑚 = 10 𝑚 = 30 𝑚 = 100

Intercept 8.722 9.442 9.672 9.558

RunTime 0.366 0.386 0.399 0.389

RunPulse 0.053 0.053 0.057 0.056

Page 69: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

WRAPPINGUP

EPSY905:Lecture11-- MissingDataMethods 69

Page 70: Missing Data; Missing Data Methods in ML; Multiple ...€¦ · Analysis of MAR Data with PROC MIXED • Covariance matrices from slide #4 (MIXED is closer to complete): • Estimated

WrappingUp

• Missingdataarecommoninstatisticalanalyses

• TheyarefrequentlyneglectedØ MNAR:hardtomodelmissingdataandobserved datasimultaneouslyØ MCAR:doesn’toftenhappenØ MAR:mostmissingimputationassumesMVN

• Moreoftenthannot,MListhebestchoiceØ Software isgettingbetterathandlingmissingdata

EPSY905:Lecture11-- MissingDataMethods 70