Upload
lora-hampton
View
215
Download
0
Tags:
Embed Size (px)
Citation preview
Data Management Techniques: How to Scrub Your Data
Andrea D. Hart
Catherine Callow-Heusser
Quality Assurance: The Overlooked and Underused Part of Data Management
Plan carefully prior to starting data collection.
Careful planning will reduce later data problems and cost increases.
Plan with data analysis in mind!
Quality Assurance
Field EditsDirectly after an interview, the interviewer should
scan for missing data points, illogical skip patterns, quantitative items that need more complete answers, observational items, or any item that needs clarification from a supervisor. Field edits should be made in a different color (green pencil) to denote the field edit as separate from the data collection time point.
Quality AssuranceQuality Control (QC)
Whether one uses paper instruments or direct computer entry, there should be a second pair of eyes checking the data before data entry. This quality control person should be fully trained on the interviewing procedures and the data entry rules. Any edits made by the QC person should be done in a consistent color (red pen). Erasures should not be made beyond the interview, so a history of decisions is documented.
Quality Control
• Communication should be clear between researcher, supervisor, interviewer, and quality control personnel to maintain consistent data collection rules and interpretations.
• When the quality control person gives feedback to the interviewer regarding problems with the interview or needed clarification, this corrects similar errors in the future and prevents interviewer drift.
Double Data Entry• Entering data twice and then comparing the two
data files for differences has been a standard practice for researchers for many years.
• For example, datafile 1 should have the exact information as datafile 2. So var1 from datafile1 subtracted from var1 from datafile2 should equal zero. Simple code can check when they are anything but zero. Then either a scanned or hard copy of the original interview is checked to resolve the difference. Often, at this point, data entry rules are established that deal with ambiguous data points.
Double Data EntryPROS:
• Leaves a “paper trail” for checking data errors• Low tech, no specialized software needed• Allows another quality control check on data
points (reading clarifying comments written in margins of questionnaire)
CONS: • Time consuming• Labor intensive
Scanning DataSoftware programs are available that scan interview
forms and read data (generally filled in dots) based on a programmed template. These programs can assign variable and value labels and provide data files for use in SAS and SPSS.
Pros:• Faster (depending on the design of the interview form
and the ease of making a template), thereby reducing data entry budgets
• Facilitates archiving hard copies as digital media (searchable by id’s) reducing storage space and tedious filing
Scanning Data
Cons:• With few exceptions, generally requires expensive
equipment and software. Additionally, may require higher levels of expertise from data entry personnel.
• Must redo forms not designed with a scanning template in mind to allow for “dots” to be filled.
• Forms that have “dots” to fill in are not user friendly to a variety of populations.
• Does not allow for double data entry and the reliability of accurate entry varies across scanning programs and across the user’s ability to fill in the “dots”.
Direct Data EntryDirect data entry allows participants or
interviewers to directly enter data into a computer via interactive screens.
Pros: • Circumvents error introduced by multiple data
handling steps• Requires little or no additional data entry personnel• Eliminates out-of-range values and skip pattern errors
Cons:• Generally, no “paper trail” to check data errors or recreate
corrupt data• Requires fairly sophisticated programming to create a
foolproof data entry interface• Computer equipment is needed
Range ChecksQueries or frequencies for values that are not possible or
highly unlikely should be run. For example, a yearly income of $50,000 is not impossible, but highly unlikely for a teen mother who qualifies for Early Head Start.
Skip PatternsQueries should be run for values that are illogical. For example, if there is an item that says, “no, I’ve never drunk alcohol” on a screen for alcohol use, then there should be no data for the section on alcohol use.
Naming Protocols
• In a longitudinal study, large amounts of data will be collected. Some of this data will be collected across several time points. It is crucial to have consistent, easy to decipher file names, variable names, and variable labels.
• It’s an art to devise variable names for large data sets that can be “decoded” and fit into only 8 characters (if using SPSS for data analysis).
Variable Naming Protocols
One strategy is to divide the variable name into 3 parts: a prefix, a stem, and a suffix.
Prefix: 2 or 3 characters denoting the timepoint the data was gathered and from whom it was asked or how it was gathered.
• M1 = Time 1, asked of the mother• F2 = Time 2, asked of the father• C3 = Time 3, asked of the childcare provider• V1 = Time 1, video-coded data
Variable Naming Protocols
Stem: After the prefix, the next set of characters should denote the construct being measured.
• V1intr = time 1, video-coded, parent intrusiveness
• M3dep = time 3, mother data, depression
Variable Naming Protocols
Suffix: The suffix can be used for scales with item numbers or summary scores.
• M3dep01 = time 3, mother data, construct depression, item number 1
• M3deptot = time 3, mother data, construct depression, total summary score
• M3depsu = time 3, mother data, construct depression, suicide subscale score
Calculate Summary Scores or Reliability Checks
• The computer should be used to calculate summary scores, reliability scores, or check hand-calculated summary scores. This allows missing data to be treated consistently with rules.
• For example, a researcher may want to assign a missing value to any summary score that is missing more than 25% of the individual items.
Reports• Databases like MS Access can be used to
produce any useful collection of data that is supported by the database.
• These reports can be used for management of data collection, particularly when data collection time points during a longitudinal study overlap.E.g., weekly reports of subjects who are within the
window of data collection for times 1 and 2.
• They can also be used to produce data codebooks, lists of variable name prefixes, stems, suffixes, database rules, etc.
Have all the data been entered into appropriate data tables?• Matching tracking data (completion codes of
various stages of data collection) with data entry tables ensures all data that was collected is entered. This allows cleanup of inconsistent data OR tracking information. All data should tell the same story.
• If the tracking information states that Alice completed all protocols for timepoint 1 but is missing the video portion of timepoint 2, then the data tables should also reflect this pattern.
“Business Rules”
• Business rules are database management rules used to reduce the proliferation of database “junk.”
• When multiple people use a database and are designing tables and queries, it becomes necessary to have rules to live by so that the database is transparent to those who use it.
“Business Rules”These rules may include:
• Naming protocols for permanent queries, tables, reports, or syntax.
• Articulating cleanup procedures like automatic deletion of non-permanent queries, data tables, or syntax after a designated amount of time.
• Designating personal databases or folders for individual use.
• Keeping a computer file for logging data errors or changes.
• Maintaining strict control over your computer directory structure—helps eliminate clutter and confusion.
• Automating a backup schedule.
Data Codebooks• A database like MS Access can store ANY type
of data. Thus you can integrate making a codebook as part of the data collection/entry process.
• These types of data can include your variable labels, value labels, questionnaires, or appendices.
• In preparing a longitudinal dataset, it is important to maintain consistent coding schemes across time. MS Access can be used to maintain definitions of coding schemes.yes/no in EVERY DATASET coded no = 0, yes = 1.
Two Examples of Codebooks Produced by MS Access
• Differing amounts of information on each screen
• Ease of readability
Summary• Your research credibility depends on sound data
management techniques.
• Proper data cleaning ALWAYS takes longer and requires more investment than you think it will.
• Clear advance planning of data tracking, variable naming, business rules, and writing data codebooks will be invaluable for a dataset that is logical, consistent, and easy to use.
• Always respect your data subjects and remember to build in data procedures that maintain their right to privacy.
Contact InformationAndrea D. Hart
University of Arkansas for Medical Sciences
Partners for Inclusive Communities
2001 Pershing Circle, Suite 300
North Little Rock, AR 72114
501-682-9918
FAX 501-682-9991
Catherine Callow-Heusser
NSF MSP/RETA Evaluation Capacity Building Project
2810 Old Main Hill
Utah State University
Logan, UT 84322-2810
435-797-1111
FAX 435-797-1448