Detecting Bad Data
CARMA Research Module
Jeff Stanton
May 18-20, 2006 Internet Data Collection Methods (Day 2-2)
Sources of Data Problems in Online Studies
Technical errors:
  Programming errors: Not common, but damaging when they occur
  Server errors: Can halt the collection of data
  Transmission errors: Uncommon and usually isolated to one record or field
Response fraud:
  Inadvertent multiple response and malicious multiple response
  Missing data
  Intentionally malicious patterns of response leading to outliers or self-contradictory data
Response Fraud
Deindividuation: Anonymous respondents, working at a distance from the researcher, have limited accountability to the research process
Participant incentives introduce mixed motives: respondents need to complete the instrument, but not at any particular level of quality
Minimal frauds: skipping questions, not thinking through the answers
Maximal frauds: A robot that randomly answers
Duplicate Detection
Fingerprint each row, e.g., with the sum of the numeric columns multiplied by the SD of the same columns
Create a new variable that contains this (nearly) unique "checksum" value for each row/case
Sort the dataset on the checksum
Create a lag difference variable that subtracts each row's checksum from that of its neighboring row
Sort on the lag variable and investigate all cases of zero or small differences
Bogus Response Detection
Calculate common univariate statistics using the complete row of responses for each subject
Create new variables for the univariate summaries (mean, sd, skew, kurt, max, min)
Sort the cases by the mean value
Look for extreme outliers on the high and low ends
Sort the cases by standard deviation, skewness, kurtosis, maximum, minimum
Look for anomalies and trace them back to the original data for that subject
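As a rough sketch of the row-wise summaries described above, the following Python computes the six statistics for each subject and sorts cases on any one of them. The names `row_summary` and `rank_cases` are hypothetical, and skewness and kurtosis use simple moment estimators, which may differ slightly from the bias-corrected values a statistics package reports.

```python
from statistics import mean, stdev

def row_summary(row):
    """Univariate summaries computed across one subject's item responses."""
    n = len(row)
    m = mean(row)
    # Central moments of the row (population form, divided by n).
    m2 = sum((x - m) ** 2 for x in row) / n
    m3 = sum((x - m) ** 3 for x in row) / n
    m4 = sum((x - m) ** 4 for x in row) / n
    if m2 == 0:
        # Constant responding: shape statistics are undefined, so report 0.0
        # here and let the zero SD flag the case.
        skew = kurt = 0.0
    else:
        skew = m3 / m2 ** 1.5
        kurt = m4 / m2 ** 2 - 3.0  # excess kurtosis
    return {"mean": m, "sd": stdev(row), "skew": skew, "kurt": kurt,
            "max": max(row), "min": min(row)}

def rank_cases(rows, key):
    """Return (case index, summary) pairs sorted by one summary statistic."""
    summaries = [(index, row_summary(row)) for index, row in enumerate(rows)]
    return sorted(summaries, key=lambda pair: pair[1][key])

# Made-up data: case 1 gives the same answer to every item.
responses = [
    [1, 2, 3, 4, 5],
    [3, 3, 3, 3, 3],
    [5, 4, 4, 5, 3],
]
for index, summary in rank_cases(responses, "sd"):
    print(index, summary)
```

Sorting on `"sd"` brings same-answer rows to the top; sorting on `"mean"`, `"max"`, or `"min"` surfaces extreme or out-of-range responders at either end of the list.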
Multivariate Outlier Detection
Use Mahalanobis distance to detect outliers
  Regress an arbitrary dependent variable on a set of related items
  Sort by Mahalanobis distance: Larger distances are suggestive of outliers
Use autocorrelation to detect unusual data patterns
  Flip the data: Cases become variables and variables become cases
  Run an autocorrelation function
  Look at the ACF graphs to find oddly regular patterns of responding (autocorrelations in excess of .5 across one or more lags)
I have provided example SPSS code in the utilities area of the LMS for each of these tests
Mahalanobis
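For readers without the SPSS utilities, here is a minimal Python sketch of the same idea. It is not the regression route described in the slides: it computes each case's squared Mahalanobis distance directly from the item means and the sample covariance matrix. All function names are hypothetical, and the data are made up.

```python
from statistics import mean

def covariance_matrix(rows):
    """Return (column means, sample covariance matrix) for a list of cases."""
    n, k = len(rows), len(rows[0])
    means = [mean(row[j] for row in rows) for j in range(k)]
    cov = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
            for j in range(k)] for i in range(k)]
    return means, cov

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def mahalanobis_squared(rows):
    """Squared Mahalanobis distance of every case from the centroid."""
    means, cov = covariance_matrix(rows)
    distances = []
    for row in rows:
        diff = [x - mu for x, mu in zip(row, means)]
        weighted = solve(cov, diff)  # equals inverse(cov) applied to diff
        distances.append(sum(d * w for d, w in zip(diff, weighted)))
    return distances

# Two related items; the last case breaks the pattern the others follow.
cases = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (2, 3), (4, 3), (1, 5)]
d2 = mahalanobis_squared(cases)
# Sort by distance, largest first, as the slide suggests.
for index in sorted(range(len(cases)), key=lambda i: d2[i], reverse=True):
    print(index, cases[index], round(d2[index], 2))
```

The case at the top of the sorted list is the one to trace back to the raw data. Its individual values (1 and 5) are both in range, so only the combination is unusual, which univariate checks would miss.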
Plot, Sort, and Examine
An ACF Indicating No Pattern
An ACF with a Suspicious Pattern
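The ACF check can also be approximated without graphs. The sketch below treats one respondent's answers, in item order, as a series and reports the lags at which the autocorrelation exceeds .5. The function names and the use of the absolute value (so that strongly negative, alternating patterns are also flagged) are assumptions, not part of the original SPSS code.

```python
def autocorrelation(series, lag):
    """Standard sample autocorrelation of a series at a given lag."""
    n = len(series)
    m = sum(series) / n
    denominator = sum((x - m) ** 2 for x in series)
    if denominator == 0:
        return 0.0  # constant responding has no variation to correlate
    numerator = sum((series[t] - m) * (series[t + lag] - m)
                    for t in range(n - lag))
    return numerator / denominator

def suspicious_lags(series, max_lag, threshold=0.5):
    """Lags whose autocorrelation magnitude exceeds the threshold."""
    return [lag for lag in range(1, max_lag + 1)
            if abs(autocorrelation(series, lag)) > threshold]

# A made-up respondent who cycles 1, 2, 3, 4, 5 across fifteen items.
patterned = [1, 2, 3, 4, 5] * 3
print(suspicious_lags(patterned, max_lag=6))
```

For the cycling respondent, the flagged lag matches the length of the repeated pattern. Constant responding is not caught here and is better detected through a zero standard deviation.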
Common Missing Data Mitigation Techniques
Item imputation
  For composite scales expressed as the average of a set of items, ignore any missing values that appear on a small subset of the items
Mean substitution
  Suppresses variability
Time series imputation
  Mean of neighboring points; suppresses spikes
Regression imputation
  Works well for highly intercorrelated variables
Full information maximum likelihood imputation
  Available in some SEM programs
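The three simplest techniques above can be illustrated in Python. This is a sketch only: missing values are represented as `None`, all function names are hypothetical, and `max_missing` is an assumed stand-in for what counts as a "small subset" of items.

```python
from statistics import mean, stdev

def scale_score(items, max_missing=1):
    """Average the answered items; give up if too many items are missing."""
    answered = [x for x in items if x is not None]
    if len(items) - len(answered) > max_missing or not answered:
        return None
    return mean(answered)

def mean_substitute(values):
    """Replace each missing value with the mean of the observed values."""
    observed_mean = mean(x for x in values if x is not None)
    return [observed_mean if x is None else x for x in values]

def neighbor_mean_impute(series):
    """Replace a missing point with the mean of its two observed neighbors."""
    filled = list(series)
    for i in range(1, len(series) - 1):
        if series[i] is None and series[i - 1] is not None and series[i + 1] is not None:
            filled[i] = (series[i - 1] + series[i + 1]) / 2
    return filled

print(scale_score([4, None, 5, 3]))            # one item missing: still scored
print(scale_score([4, None, None, 3]))         # too many missing: no score
print(mean_substitute([1, None, 5]))
print(neighbor_mean_impute([2, None, 6, None]))

# Mean substitution shrinks the spread relative to the observed values.
print(stdev([1, 5]), stdev(mean_substitute([1, None, 5])))
```

The last line shows the variability suppression noted above: the filled-in series has a smaller standard deviation than the observed values alone.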
Excel Tips
Your friend the "fill" function
The power of "Paste Special"
Sorting: Click on Data/Sort
Excel Statistical Formulas
=average(value, value…)
  Gives the arithmetic mean of a collection of cells and/or numeric values
=stdev(value, value…) // stdevp(value, value…)
  Gives the sample/population standard deviation of a collection of cells and/or numeric values
=sum(value, value…)
  Gives the sum of a collection of cells and/or numeric values
=correl(vector1, vector2)
  Gives the Pearson correlation between two vectors
=if(<test>,<value if true>,<value if false>)
  Makes a logical test and returns a different value depending on whether the test is true or false
  Example: =if(1=1, "Yes!", "No…")
=find(<find text>, <within text>, <start>)
  Looks for the string <find text> within the string <within text> and returns the position of the first occurrence after <start>
  Example: =find("=", "fish=head", 1)
=Len(<string>)
  Returns the number of characters in a string
  Example: =Len("Ouch")
=Right(<string>,<length>)
  Returns the rightmost <length> characters in string
  Example: =Right("fishhead",4)
=Left(<string>,<length>) works similarly
Summary of Bad Data Problems
Multiple submissions: Same participant clicks on Submit, then Back, then Submit, then Back…
Unmotivated responding: Participant uses the same option over and over again
Malicious patterns: Participant enters some unusually regular pattern of responses
There are at least five errors of these kinds in the exercise dataset (see below)