Detecting Bad Data CARMA Research Module Jeff Stanton

May 18-20, 2006 Internet Data Collection Methods (Day 2-2)

Sources of Data Problems in Online Studies

Technical errors:

  Programming errors: Not common, but damaging when they occur

  Server errors: Can halt the collection of data

  Transmission errors: Uncommon and usually isolated to one record or field

Response fraud:

  Inadvertent multiple response and malicious multiple response

  Missing data

  Intentionally malicious patterns of response leading to outliers or self-contradictory data

Response Fraud

Deindividuation: Anonymous respondents, working at a distance from the researcher, have limited accountability to the research process

Participant incentives introduce mixed motives: necessity of completing the instrument, but not to any particular level of quality

Minimal frauds: skipping questions, not thinking through the answers

Maximal frauds: A robot that randomly answers

Duplicate Detection

Fingerprint each row, e.g., with the sum of the numeric columns multiplied by the SD of the same columns

Create a new variable that contains this unique “checksum” value for each row/case

Sort the dataset on the checksum

Create a lag difference variable that subtracts each row's checksum from that of the neighboring row

Sort on the lag variable and investigate all cases of zero or small differences
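
The checksum procedure can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS code from the module: the function names, the `tolerance` parameter, and the tiny dataset are all assumptions made for the example. Because the checksum ignores the order of the answers, two different rows can share a value, so flagged pairs still need to be inspected by hand.

```python
from statistics import stdev

def checksum(row):
    """Fingerprint one case: sum of its numeric responses times their SD."""
    return sum(row) * stdev(row)

def near_duplicates(rows, tolerance=1e-9):
    """Return pairs of case indices whose checksums differ by at most `tolerance`."""
    # Sort case indices by checksum so likely duplicates become neighbors.
    order = sorted(range(len(rows)), key=lambda i: checksum(rows[i]))
    flagged = []
    for prev, curr in zip(order, order[1:]):
        # Lag difference between neighboring checksums in the sorted order.
        lag_diff = checksum(rows[curr]) - checksum(rows[prev])
        if abs(lag_diff) <= tolerance:
            flagged.append((min(prev, curr), max(prev, curr)))
    return sorted(flagged)

# Hypothetical responses: one row per case, one value per numeric item.
responses = [
    [1, 2, 3, 4, 5],
    [5, 5, 4, 3, 1],
    [1, 2, 3, 4, 5],  # exact copy of case 0
    [2, 2, 2, 3, 2],
]

suspects = near_duplicates(responses)
print(suspects)  # pairs of case indices to investigate
```

Raising `tolerance` widens the net from exact copies to rows with "small differences", at the cost of more false alarms.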

Bogus Response Detection

Calculate common univariate statistics using the complete row of responses for each subject

Create new variables for the univariate summaries (mean, sd, skew, kurt, max, min)

Sort the cases by the mean value

Look for extreme outliers on the high and low ends

Sort the cases by standard deviation, skewness, kurtosis, maximum, minimum

Look for anomalies and trace them back to the original data for that subject
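
As a rough sketch of these steps, the Python below computes the row-wise summaries and sorts the cases on two of them. The respondent IDs and answers are invented for illustration, and the choice to report 0.0 for skewness and kurtosis on a constant row is an assumption of this sketch (both are undefined when the SD is zero).

```python
from statistics import mean, pstdev

def row_summary(row):
    """Summarize one respondent's answers with common univariate statistics."""
    m, sd = mean(row), pstdev(row)
    if sd == 0:
        # Skewness and kurtosis are undefined for a constant row; 0.0 is a
        # placeholder chosen for this sketch so the rows can still be sorted.
        skew = kurt = 0.0
    else:
        z = [(v - m) / sd for v in row]
        skew = mean(x ** 3 for x in z)
        kurt = mean(x ** 4 for x in z) - 3  # excess kurtosis
    return {"mean": m, "sd": sd, "skew": skew, "kurt": kurt,
            "max": max(row), "min": min(row)}

# Hypothetical data: respondent ID -> answers on a 1-5 scale.
responses = {
    "s1": [3, 4, 2, 5, 3, 4],
    "s2": [5, 5, 5, 5, 5, 5],  # same option every time
    "s3": [1, 5, 1, 5, 1, 5],  # alternating extremes
    "s4": [2, 3, 3, 4, 2, 3],
}

summaries = {sid: row_summary(row) for sid, row in responses.items()}

# Sort on each summary in turn and inspect both ends of the ordering.
by_sd = sorted(summaries, key=lambda sid: summaries[sid]["sd"])
by_mean = sorted(summaries, key=lambda sid: summaries[sid]["mean"])

for sid in by_sd:
    print(sid, summaries[sid])
```

In this toy dataset the zero-SD case and the largest-SD case sit at opposite ends of `by_sd`, which is where the slide suggests looking before tracing a case back to its raw answers.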

Multivariate Outlier Detection

Use Mahalanobis distance to detect outliers

  Regress an arbitrary dependent variable on a set of related items, saving each case's Mahalanobis distance

  Sort by Mahalanobis distance: Larger distances are suggestive of outliers

Use autocorrelation to detect unusual data patterns

  Flip the data: Cases become variables and variables become cases

  Run an autocorrelation function

  Look at the ACF graphs to find oddly regular patterns of responding (autocorrelations in excess of .5 across one or more lags)

I have provided example SPSS code in the utilities area of the LMS for each of these tests
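
For readers without SPSS, the Mahalanobis step can be approximated directly. The sketch below does not use the regression route described above; it computes the squared distance of each case from the item means using the sample covariance matrix, which is a swapped-in but equivalent way of getting at the same quantity. All function names and the two-item dataset are hypothetical.

```python
from statistics import mean

def covariance_matrix(data):
    """Return (column means, sample covariance matrix) for a list of rows."""
    n, k = len(data), len(data[0])
    means = [mean(col) for col in zip(*data)]
    cov = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
            for j in range(k)] for i in range(k)]
    return means, cov

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def mahalanobis_sq(data):
    """Squared Mahalanobis distance of every case from the centroid."""
    means, cov = covariance_matrix(data)
    distances = []
    for row in data:
        diff = [v - mu for v, mu in zip(row, means)]
        distances.append(sum(d * y for d, y in zip(diff, solve(cov, diff))))
    return distances

# Hypothetical data: two related items that normally move together.
items = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [2, 3], [4, 3], [1, 5]]

d2 = mahalanobis_sq(items)
# Sort cases from largest to smallest distance and inspect the top of the list.
ranked = sorted(range(len(items)), key=lambda i: d2[i], reverse=True)
print(ranked[0], items[ranked[0]])
```

The case that contradicts the usual relationship between the two items (low on one, high on the other) ends up at the top of `ranked`, even though neither of its answers is extreme on its own. The sketch assumes the items are not perfectly collinear; otherwise the covariance matrix cannot be inverted.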

Mahalanobis

Plot, Sort, and Examine

An ACF Indicating No Pattern

An ACF with a Suspicious Pattern
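
A small Python sketch of the autocorrelation check follows. It treats one respondent's answers, in questionnaire order, as a series and reports the lags whose autocorrelation exceeds .5. The function names, the `max_lag` default, and both example respondents are assumptions for illustration. In Python each respondent's row is already a sequence, so the "flip the data" step needed in SPSS is not required here.

```python
from statistics import mean

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given positive lag."""
    m = mean(series)
    denom = sum((v - m) ** 2 for v in series)
    if denom == 0:
        return 0.0  # constant series: no variation to correlate (sketch convention)
    num = sum((series[t] - m) * (series[t + lag] - m)
              for t in range(len(series) - lag))
    return num / denom

def suspicious_lags(series, max_lag=10, threshold=0.5):
    """Lags (1..max_lag) whose autocorrelation exceeds the threshold."""
    return [lag for lag in range(1, min(max_lag, len(series) - 1) + 1)
            if autocorrelation(series, lag) > threshold]

# Hypothetical respondents: answers to 20 items in questionnaire order.
patterned = [1, 2, 3, 4, 5] * 4  # repeats the same five-step run
varied = [3, 4, 2, 5, 3, 1, 4, 2, 5, 3, 2, 4, 1, 3, 5, 2, 4, 3, 1, 4]  # no deliberate pattern

flagged = suspicious_lags(patterned)
print("patterned:", flagged)
print("varied:", suspicious_lags(varied))
```

A respondent who cycles through the options produces a spike at the lag equal to the cycle length. Constant responding yields no variance at all and is better caught by the standard deviation check.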

Common Missing Data Mitigation Techniques

Item imputation

  For composite scales expressed as the average of a set of items, ignore any missing values that appear on a small subset of the items

Mean substitution

  Suppresses variability

Time series imputation

  Mean of neighboring points; suppresses spikes

Regression imputation

  Works well for highly intercorrelated variables

Full information maximum likelihood imputation

  Available in some SEM programs
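
The three simplest techniques in this list can be sketched as follows. This is an illustrative sketch only: missing values are represented as `None`, and the function names and the `max_missing` cutoff are assumptions rather than anything prescribed by the module. Regression and full information maximum likelihood imputation are not shown.

```python
from statistics import mean

def scale_score(items, max_missing=1):
    """Average the answered items; give up if too many are missing."""
    present = [v for v in items if v is not None]
    if len(items) - len(present) > max_missing:
        return None
    return mean(present)

def mean_substitute(column):
    """Replace each missing value with the mean of the observed values."""
    m = mean(v for v in column if v is not None)
    return [m if v is None else v for v in column]

def neighbor_mean(series):
    """Fill an isolated gap with the mean of the points on either side."""
    filled = list(series)
    for i, v in enumerate(series):
        has_neighbors = 0 < i < len(series) - 1
        if (v is None and has_neighbors
                and series[i - 1] is not None and series[i + 1] is not None):
            filled[i] = (series[i - 1] + series[i + 1]) / 2
    return filled

print(scale_score([4, None, 5, 3]))      # one item missing: still scored
print(scale_score([4, None, None, 3]))   # too many missing: no score
print(mean_substitute([2, None, 4]))
print(neighbor_mean([1, None, 5, 7]))
```

Mean substitution gives every imputed case the same value, which is why it suppresses variability; the neighbor mean smooths over whatever happened at the missing time point, which is why it suppresses spikes.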

Excel Tips

Your friend the “fill” function

The power of “Paste Special”

Sorting: Click on Data/Sort

Excel Statistical Formulas

=average(value, value…)

  Gives the arithmetic mean of a collection of cells and/or numeric values

=stdev(value, value…) // stdevp(value, value…)

  Gives the sample/population standard deviation of a collection of cells and/or numeric values

=sum(value, value…)

  Gives the sum of a collection of cells and/or numeric values

=correl(vector1, vector2)

  Gives the Pearson correlation between two vectors

=if(<logical test>,<value if true>,<value if false>)

  Makes a logical test and returns a different value depending on whether the test is true or false

  Example: =if(1=1, "Yes!", "No…")

=find(<find text>, <within text>, <start>)

  Looks for the string <find text> within the string <within text> and returns the position of the first occurrence at or after position <start>

  Example: =find("=", "fish=head", 1)

=Len(<string>)

  Returns the number of characters in a string

  Example: =Len("Ouch")

=Right(<string>,<length>)

  Returns the rightmost <length> characters in the string

  Example: =Right("fishhead",4)

=Left(<string>,<length>) works similarly

Summary of Bad Data Problems

Multiple submissions: Same participant clicks on Submit, then Back, then Submit, then Back…

Unmotivated responding: participant uses same option over and over again

Malicious patterns: Participant enters some unusually regular pattern of responses

There are at least five errors of these kinds in the exercise dataset (see below)