Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 April 9, 2012

Page 1: Advanced Methods and Analysis for the Learning and Social Sciences

Advanced Methods and Analysis for the Learning and Social Sciences

PSY505

Spring term, 2012

April 9, 2012


Today’s Class

• Missing Data and Imputation


Missing Data

• Frequently, when collecting large amounts of data from diverse sources, there are missing values for some data sources


Examples

• It’s easy to think of examples; can anyone here give examples from your own current or past research or projects?


Classes of missing data

• Missing all data/“Unit nonresponse”

• Missing all of one source of data
  – E.g. student did not fill out questionnaire but used tutor

• Missing specific data/“Item nonresponse”
  – E.g. student did not answer one question on questionnaire

• Subject dropout/attrition
  – Subject ceased to be part of population during study
    • E.g. student was suspended for a fight


What do we do?


Case Deletion

• Simply delete any case that has at least one missing value

• Alternate form: Simply delete any case that is missing the dependent variable
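A minimal sketch of both forms of case deletion, assuming records stored as dictionaries with `None` marking a missing value. The field names and data are hypothetical and not from the course.

```python
# Hypothetical student records; None marks a missing value.
students = [
    {"id": 1, "pretest": 0.60, "survey": 4, "posttest": 0.80},
    {"id": 2, "pretest": 0.55, "survey": None, "posttest": 0.70},
    {"id": 3, "pretest": None, "survey": 3, "posttest": None},
    {"id": 4, "pretest": 0.40, "survey": 2, "posttest": 0.65},
]


def listwise_delete(rows):
    """Keep only cases with no missing value in any field."""
    return [r for r in rows if all(v is not None for v in r.values())]


def delete_missing_dv(rows, dv):
    """Alternate form: keep only cases whose dependent variable is observed."""
    return [r for r in rows if r[dv] is not None]


complete_cases = listwise_delete(students)
dv_observed = delete_missing_dv(students, "posttest")

print([r["id"] for r in complete_cases])
print([r["id"] for r in dv_observed])
```

The alternate form keeps student 2, who is missing only a predictor, so it discards less data than full case deletion.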


Case Deletion

• In what situations might this be acceptable?

• In what situations might this be unacceptable?

• In what situations might this be practically impossible?


Case Deletion

• In what situations might this be acceptable?
  – Relatively little missing data in sample
  – Dependent variable missing, and journal unlikely to accept imputed dependent variable
  – Almost all data missing for case
    • Example: A student who is absent during entire usage of tutor

• In what situations might this be unacceptable?

• In what situations might this be practically impossible?


Case Deletion

• In what situations might this be acceptable?

• In what situations might this be unacceptable?
  – Data loss appears to be non-random
    • Example: The students who fail to answer “How much marijuana do you smoke?” have lower GPA than the average student who does answer that question
  – Data loss is due to attrition, and you care about inference up until the point of the data loss
    • Student completes pre-test, tutor, and post-test, but not retention test

• In what situations might this be practically impossible?


Case Deletion

• In what situations might this be acceptable?

• In what situations might this be unacceptable?

• In what situations might this be practically impossible?
  – Almost all students missing at least some data


Analysis-by-Analysis Case Deletion

• Common approach

• Advantages?

• Disadvantages?


Analysis-by-Analysis Case Deletion

• Common approach

• Advantages?
  – Every analysis involves all available data

• Disadvantages?
  – Difficult to do omnibus or multivariate analyses
  – Are your analyses fully comparable to each other?
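A small sketch of analysis-by-analysis deletion for pairwise correlations, assuming hypothetical variables `pre`, `survey` and `post`. Each correlation uses every case where both variables are observed, so different analyses can rest on different cases.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pairwise_r(rows, a, b):
    """Correlate a and b over the cases where both are observed; returns (r, n)."""
    pairs = [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]
    xs, ys = zip(*pairs)
    return pearson_r(xs, ys), len(pairs)


# Hypothetical data; None marks a missing value.
rows = [
    {"pre": 1.0, "survey": 2.0, "post": 2.0},
    {"pre": 2.0, "survey": None, "post": 3.0},
    {"pre": 3.0, "survey": 3.0, "post": None},
    {"pre": 4.0, "survey": 5.0, "post": 6.0},
    {"pre": 5.0, "survey": 4.0, "post": 7.0},
    {"pre": None, "survey": None, "post": 1.0},
]

for a, b in [("pre", "survey"), ("pre", "post"), ("survey", "post")]:
    r, n = pairwise_r(rows, a, b)
    print(f"{a} vs {b}: r={r:.3f} using n={n} cases")
```

The differing `n` per pair is the comparability concern raised above: each coefficient describes a slightly different subsample.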


Mean Substitution

• Replace each missing value with the mean of the observed values for that variable

• Mathematically equivalent: standardize (unitize) all variables, and treat missing values as 0
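A minimal sketch of mean substitution on one variable, with made-up scores. It also shows one consequence worth keeping in mind: the mean is preserved, but the variance of the filled-in variable shrinks.

```python
from statistics import mean, pvariance


def mean_substitute(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


scores = [2.0, 4.0, None, 6.0, None, 8.0]  # hypothetical data
observed = [v for v in scores if v is not None]
filled = mean_substitute(scores)

print("mean before/after:", mean(observed), mean(filled))
print("variance before/after:", pvariance(observed), pvariance(filled))
```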


Mean Substitution

• Advantages?

• Disadvantages?


Mean Substitution

• Advantages?
  – Simple to conduct
  – Essentially drops missing data from analysis without dropping case from analysis entirely

• Disadvantages?
  – Distorts variances and correlations
  – May bias against significance in an over-conservative fashion


Distortion From Mean Substitution

• Imagine a sample in which, in truth, 50 out of 1000 students have smoked marijuana

• GPA
  – Smokers: M=2.6, SD=0.5
  – Non-Smokers: M=3.3, SD=0.5


Distortion From Mean Substitution

• However, 30 of the 50 smokers refuse to answer whether they smoke, and 20 of the 950 non-smokers refuse to answer

• And the respondents who remain are fully representative

• GPA
  – Smokers: M=2.6
  – Non-Smokers: M=3.3


Distortion From Mean Substitution

• GPA
  – Smokers: M=2.6
  – Non-Smokers: M=3.3
  – Overall Average: M=3.285


Distortion From Mean Substitution

• GPA
  – Smokers: M=2.6
  – Non-Smokers: M=3.3
  – Overall Average: M=3.285

• Smokers (Mean Sub): M= 3.02
• Non-Smokers (Mean Sub): M= 3.3
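The arithmetic behind this example can be retraced as follows. This is one reading of the slides, not a stated procedure: it assumes the overall average is taken over the 950 students who answered, and that this average is substituted for the GPA of each student who refused.

```python
smokers_total, nonsmokers_total = 50, 950
smokers_refused, nonsmokers_refused = 30, 20
gpa_smokers, gpa_nonsmokers = 2.6, 3.3

smokers_answered = smokers_total - smokers_refused
nonsmokers_answered = nonsmokers_total - nonsmokers_refused

# Mean GPA among respondents only (assumed source of the overall average).
overall = (
    smokers_answered * gpa_smokers + nonsmokers_answered * gpa_nonsmokers
) / (smokers_answered + nonsmokers_answered)

# Group means after substituting the overall mean for every refuser.
smokers_sub = (
    smokers_answered * gpa_smokers + smokers_refused * overall
) / smokers_total
nonsmokers_sub = (
    nonsmokers_answered * gpa_nonsmokers + nonsmokers_refused * overall
) / nonsmokers_total

print(round(overall, 3), round(smokers_sub, 3), round(nonsmokers_sub, 3))
```

Under these assumptions the smokers' mean comes out near 3.01, slightly below the 3.02 shown above; substituting 3.3 instead of 3.285 gives exactly 3.02, so the slide may have used a rounded value. Either way the gap between groups shrinks from 0.7 to roughly 0.3.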


MAR and MNAR

• “Missing At Random”
• “Missing Not At Random”


MAR

• Data is MAR if

  P(R | Ycom) = P(R | Yobs)

• R = Missingness (which data are missing)
• Ycom = Complete data set (if nothing missing)
• Yobs = Observed data set


MAR

• In other words

• If whether a value is missing (R) does not depend on the values that are missing, given the values that are observed, the data is MAR
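A small simulation can make the distinction concrete. It is a sketch with assumed rates: 5% of students smoke, and answers are lost either at a flat 20% rate regardless of the answer (often called missing completely at random, a term not used in these slides) or at rates that depend on the unobserved answer itself (MNAR). The refusal rates of 0.6 and 0.02 loosely echo the mean-substitution example.

```python
import random


def simulate(n, seed=0):
    """Return true answers plus two versions with different missingness."""
    rng = random.Random(seed)
    smokes = [rng.random() < 0.05 for _ in range(n)]
    # Loss unrelated to the answer, e.g. a server logging error.
    flat_loss = [None if rng.random() < 0.2 else s for s in smokes]
    # Loss that depends on the unobserved answer itself (MNAR).
    mnar = [None if rng.random() < (0.6 if s else 0.02) else s for s in smokes]
    return smokes, flat_loss, mnar


def observed_rate(answers):
    """Share of 'smokes' among the answers that were actually recorded."""
    seen = [a for a in answers if a is not None]
    return sum(seen) / len(seen)


smokes, flat_loss, mnar = simulate(20000)
true_rate = sum(smokes) / len(smokes)
print("true rate:", round(true_rate, 3))
print("flat loss:", round(observed_rate(flat_loss), 3))
print("MNAR     :", round(observed_rate(mnar), 3))
```

With flat loss the observed rate stays close to the true rate; under MNAR it is biased downward, and nothing in the observed answers alone reveals that.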


MAR and MNAR

• Are these MAR or MNAR?

• Students who smoke marijuana are less likely to answer whether they smoke marijuana

• Students who smoke marijuana are likely to lie and say they do not smoke marijuana

• Some students don’t answer all questions out of laziness

• Some data is not recorded due to server logging errors

• Some students are not present for whole study due to suspension from school due to fighting


MAR and MNAR

• MAR-based estimation may often be reasonably robust to violation of MAR assumption (Graham et al., 2007; Collins et al., 2001)

• Often difficult to verify for real data
  – In many cases, you don’t know why data is missing…


MAR-assuming approaches

• Single Imputation
• Multiple Imputation
• Maximum Likelihood Estimation


Single Imputation

• Replace all missing items with statistically plausible values and then conduct statistical analysis

• Mean substitution is a simple form of single imputation


Single Imputation

• Relatively simple to conduct

• Probably OK when limited missing data


Other Single Imputation Procedures

• Hot-Deck Substitution: Replace each missing value with a value randomly drawn from other students (for the same variable)

• Very conservative; biases strongly towards no effect by discarding any possible association for that value
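A minimal sketch of hot-deck substitution for one variable, assuming donors are drawn uniformly at random from all observed cases. The data and the function name `hot_deck` are hypothetical.

```python
import random


def hot_deck(values, seed=0):
    """Replace each None with a value drawn at random from the observed cases."""
    rng = random.Random(seed)
    donors = [v for v in values if v is not None]
    return [rng.choice(donors) if v is None else v for v in values]


survey = [4, None, 3, 5, None, 2, 4]  # hypothetical questionnaire item
filled = hot_deck(survey)
print(filled)
```

Because the donor is chosen without regard to the recipient's other variables, any real association between this item and those variables is diluted, which is the conservatism noted above.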


Other Single Imputation Procedures

• Linear regression/classification:

• For missing data for variable X
  – Build regressor or classifier predicting observed cases of variable X from all other variables
  – Substitute the predicted value of X for missing values


Other Single Imputation Procedures

• Linear regression/classification:

• For missing data for variable X
  – Build regressor or classifier predicting observed cases of variable X from all other variables
  – Substitute the predicted value of X for missing values

• Limitation: if you want to correlate X to other variables, this will artificially increase the strength of correlation
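The limitation can be seen in a small sketch, assuming a single fully observed predictor `z` and simulated data. A real analysis would use all other variables and a proper regression library; the names here are illustrative.

```python
import random
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sum((x - mx) ** 2 for x in xs)
                      * sum((y - my) ** 2 for y in ys))


rng = random.Random(1)
z = [rng.gauss(0, 1) for _ in range(200)]          # fully observed predictor
x_true = [0.5 * zi + rng.gauss(0, 1) for zi in z]  # variable with missing data
x = [None if i % 2 == 0 else xi for i, xi in enumerate(x_true)]

# Fit on observed cases, then substitute predictions for the missing ones.
obs = [(zi, xi) for zi, xi in zip(z, x) if xi is not None]
slope, intercept = fit_line([o[0] for o in obs], [o[1] for o in obs])
x_imputed = [slope * zi + intercept if xi is None else xi
             for zi, xi in zip(z, x)]

r_observed = pearson_r([o[0] for o in obs], [o[1] for o in obs])
r_imputed = pearson_r(z, x_imputed)
print("r on observed cases :", round(r_observed, 3))
print("r after imputation  :", round(r_imputed, 3))
```

Every imputed point lies exactly on the fitted line, so the correlation after imputation is stronger than among the observed cases.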


Other Single Imputation Procedures

• Distribution-based linear regression/classification:

• For missing data for variable X
  – Build regressor or classifier predicting observed cases of variable X from all other variables
  – Compute probability density function for X
    • Based on confidence interval if X normally distributed
  – Randomly draw from the probability density function for each missing value

• Limitation: A lot of work, still reduces data variance in undesirable fashions
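A sketch of the distribution-based variant, assuming one predictor `z` and normally distributed residuals. Each missing value is drawn from a normal distribution centered on the regression prediction, with the residual standard deviation estimated from observed cases. Names and data are hypothetical.

```python
import random
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def stochastic_impute(z, x, seed=0):
    """Impute missing x as prediction plus normally distributed noise."""
    rng = random.Random(seed)
    obs = [(zi, xi) for zi, xi in zip(z, x) if xi is not None]
    slope, intercept = fit_line([o[0] for o in obs], [o[1] for o in obs])
    residuals = [xi - (slope * zi + intercept) for zi, xi in obs]
    # Residual SD with n - 2 degrees of freedom (two fitted parameters).
    resid_sd = sqrt(sum(e * e for e in residuals) / (len(obs) - 2))
    return [slope * zi + intercept + rng.gauss(0, resid_sd) if xi is None else xi
            for zi, xi in zip(z, x)]


z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.1, None, 2.9, 4.2, None, 6.1]
x_filled = stochastic_impute(z, x)
print(x_filled)
```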


Multiple Imputation

• Conduct procedure similar to single imputation many times, creating many data sets
  – 10-20 times recommended by Schafer & Graham (2002)

• Conduct meta-analysis across data sets, to determine both overall answer and degree of uncertainty
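One common way to combine results across imputed data sets is Rubin's rules. The slides do not name the pooling method, so treat this as an assumed stand-in: the pooled estimate is the average of the per-data-set estimates, and the total variance adds the average within-data-set variance to the between-data-set variance, inflated by (1 + 1/m).

```python
from statistics import mean, variance


def pool(estimates, variances):
    """Pool m estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # variance across imputed data sets
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical regression coefficients from three imputed data sets.
estimate, total_variance = pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, total_variance)
```

The between-data-set term is what carries the extra uncertainty due to missing data; single imputation has no equivalent.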


Multiple Imputation Procedure

• Several procedures – essentially extensions of single imputation procedures

• One example


Multiple Imputation Procedure

• Conduct linear regression/classification

• For each data set
  – Add noise to each data point, drawn from a distribution which maps to the distribution of the original (non-missing) data set for that variable
  – Note: if original distribution is non-normal, use non-normal noise distribution
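A sketch of this procedure under two assumptions that go beyond the slides: noise is added only to the imputed values (one reading of "each data point"), and the noise is drawn by resampling the observed residuals, which avoids assuming normality. The function name `impute_many` and the data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def impute_many(z, x, m=10, seed=0):
    """Create m imputed copies of x: regression prediction plus resampled residual."""
    rng = random.Random(seed)
    obs = [(zi, xi) for zi, xi in zip(z, x) if xi is not None]
    slope, intercept = fit_line([o[0] for o in obs], [o[1] for o in obs])
    residuals = [xi - (slope * zi + intercept) for zi, xi in obs]
    data_sets = []
    for _ in range(m):
        data_sets.append([
            slope * zi + intercept + rng.choice(residuals) if xi is None else xi
            for zi, xi in zip(z, x)
        ])
    return data_sets


z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.1, None, 2.9, 4.2, None, 6.1]
data_sets = impute_many(z, x, m=10)
means = [sum(d) / len(d) for d in data_sets]
print(min(means), max(means))  # spread reflects imputation uncertainty
```

Each of the m data sets would then be analyzed separately and the results combined across data sets.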


Maximum Likelihood

• Use expectation maximization to iterate between

• Estimating the function to predict missing data, using both observed data and predictions of missing data

• Predicting missing data
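The iteration described above can be sketched for a single variable with one predictor. This is a simplified illustration, not a full EM implementation: proper EM for a normal model also carries the residual variance through the expectation step, while this loop tracks only the predictions. Names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def em_style_impute(z, x, tol=1e-12, max_iter=1000):
    """Alternate between fitting on filled-in data and re-predicting missing x."""
    observed = [v for v in x if v is not None]
    start = sum(observed) / len(observed)
    filled = [start if v is None else v for v in x]  # initial guess: the mean
    slope, intercept = fit_line(z, filled)
    for _ in range(max_iter):
        # Predict missing values from the current model.
        filled = [slope * zi + intercept if v is None else v
                  for zi, v in zip(z, x)]
        # Re-estimate the model from observed data plus predictions.
        new_slope, new_intercept = fit_line(z, filled)
        converged = (abs(new_slope - slope) < tol
                     and abs(new_intercept - intercept) < tol)
        slope, intercept = new_slope, new_intercept
        if converged:
            break
    return filled, slope, intercept


z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x = [1.2, None, 2.8, 4.1, None, 6.3, 6.9, None]
filled, slope_em, intercept_em = em_style_impute(z, x)
print(round(slope_em, 4), round(intercept_em, 4))
```

In this simple one-predictor case the loop settles on the same line as a regression fit to the observed cases alone, which illustrates why the point predictions by themselves understate uncertainty.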


Maximum Likelihood

• Limitations:

• Does not account for noise and variance in observations as well as Multiple Imputation does

• Not very robust to departures from MAR


MNAR Estimation

• Selection models
  – Predict missingness on a variable from other variables
  – Then attempt to predict missing cases using both the other variables, and the model of situations when the variable is missing


Reducing Missing Values

• Of course, the best way to deal with missing values is to not have missing values in the first place

• For more on this, please take PSY503!


Asgn. 9

• Students who completed Asgn. 9, please discuss your solutions
  – Christian
  – Mike Wixon
  – Zak


Wixon Solution: Over-fitting?

• Why might it be overfit?


Rogoff Solution: Over-fitting?

• StandardizedExam =
    0.2396 * studentID
  - 6.6658 * MathCareerInterest
  + 26.919 * MathCourseInHS
  - 12.6207


Now to compare the models of Wixon versus Rogoff

• On a new test set with no missing data


Results (r)

Student   Training Set   Test Set
Wixon     0.7786         0.7710
Rogoff    0.666          0.7867


Wait, what?

• Correlation in test set between student ID and standardized exam score = -0.0636

• But coefficient = 0.2396
• So weighted value ranges from 0.2396 to 11.98, which was not enough to hurt prediction
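The range quoted above can be checked directly. The sketch assumes student IDs run from 1 to 50, which is inferred from the endpoints 0.2396 and 11.98 rather than stated in the slides. The function name and argument names are illustrative; the coefficients are those of the Rogoff model.

```python
def predict_exam(student_id, math_career_interest, math_course_in_hs):
    """Rogoff model from the slides, written as a function."""
    return (0.2396 * student_id
            - 6.6658 * math_career_interest
            + 26.919 * math_course_in_hs
            - 12.6207)


# Contribution of the studentID term at the assumed lowest and highest IDs.
id_term_low = 0.2396 * 1
id_term_high = 0.2396 * 50
print(round(id_term_low, 4), round(id_term_high, 4))

# The MathCourseInHS coefficient alone exceeds the whole range of the ID term.
print(26.919 > id_term_high - id_term_low)
```

Set against a coefficient of 26.919 on MathCourseInHS, the spurious ID term moves predictions by less than 12 points across the whole assumed ID range.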


Asgn. 10

• Questions?
• Comments?


Next Class

• Wednesday, April 11
• 3pm-5pm
• AK232

• Power Analysis

• Readings
  – Lachin, J.M. (1981). Introduction to Sample Size Determination and Power Analysis for Clinical Trials. Controlled Clinical Trials, 2, 93-113.
  – Cohen, J. (1992). A Power Primer. Psychological Bulletin, 112(1), 155-159.

• Assignments Due: None


The End