Data Management Techniques: How to Scrub Your Data Andrea D. Hart Catherine Callow-Heusser

Data Management Techniques: How to Scrub Your Data

Andrea D. Hart

Catherine Callow-Heusser

Quality Assurance: The Overlooked and Underused Part of Data Management

Plan carefully prior to starting data collection.

Careful planning will reduce later data problems and cost increases.

Plan with data analysis in mind!

Quality Assurance

Field EditsDirectly after an interview, the interviewer should

scan for missing data points, illogical skip patterns, quantitative items that need more complete answers, observational items, or any item that needs clarification from a supervisor. Field edits should be made in a different color (green pencil) to denote the field edit as separate from the data collection time point.

Quality AssuranceQuality Control (QC)

Whether one uses paper instruments or direct computer entry, there should be a second pair of eyes checking the data before data entry. This quality control person should be fully trained on the interviewing procedures and the data entry rules. Any edits made by the QC person should be done in a consistent color (red pen). Erasures should not be made beyond the interview, so a history of decisions is documented.

Quality Control

• Communication should be clear between researcher, supervisor, interviewer, and quality control personnel to maintain consistent data collection rules and interpretations.

• When the quality control person gives feedback to the interviewer regarding problems with the interview or needed clarification, this corrects similar errors in the future and prevents interviewer drift.

Double Data Entry• Entering data twice and then comparing the two

data files for differences has been a standard practice for researchers for many years.

• For example, datafile 1 should have the exact information as datafile 2. So var1 from datafile1 subtracted from var1 from datafile2 should equal zero. Simple code can check when they are anything but zero. Then either a scanned or hard copy of the original interview is checked to resolve the difference. Often, at this point, data entry rules are established that deal with ambiguous data points.

Double Data EntryPROS:

• Leaves a “paper trail” for checking data errors• Low tech, no specialized software needed• Allows another quality control check on data

points (reading clarifying comments written in margins of questionnaire)

CONS: • Time consuming• Labor intensive

Scanning DataSoftware programs are available that scan interview

forms and read data (generally filled in dots) based on a programmed template. These programs can assign variable and value labels and provide data files for use in SAS and SPSS.

Pros:• Faster (depending on the design of the interview form

and the ease of making a template), thereby reducing data entry budgets

• Facilitates archiving hard copies as digital media (searchable by id’s) reducing storage space and tedious filing

Scanning Data

Cons:• With few exceptions, generally requires expensive

equipment and software. Additionally, may require higher levels of expertise from data entry personnel.

• Must redo forms not designed with a scanning template in mind to allow for “dots” to be filled.

• Forms that have “dots” to fill in are not user friendly to a variety of populations.

• Does not allow for double data entry and the reliability of accurate entry varies across scanning programs and across the user’s ability to fill in the “dots”.

Direct Data EntryDirect data entry allows participants or

interviewers to directly enter data into a computer via interactive screens.

Pros: • Circumvents error introduced by multiple data

handling steps• Requires little or no additional data entry personnel• Eliminates out-of-range values and skip pattern errors

Cons:• Generally, no “paper trail” to check data errors or recreate

corrupt data• Requires fairly sophisticated programming to create a

foolproof data entry interface• Computer equipment is needed

Range ChecksQueries or frequencies for values that are not possible or

highly unlikely should be run. For example, a yearly income of $50,000 is not impossible, but highly unlikely for a teen mother who qualifies for Early Head Start.

Skip PatternsQueries should be run for values that are illogical. For example, if there is an item that says, “no, I’ve never drunk alcohol” on a screen for alcohol use, then there should be no data for the section on alcohol use.

Naming Protocols

• In a longitudinal study, large amounts of data will be collected. Some of this data will be collected across several time points. It is crucial to have consistent, easy to decipher file names, variable names, and variable labels.

• It’s an art to devise variable names for large data sets that can be “decoded” and fit into only 8 characters (if using SPSS for data analysis).

Variable Naming Protocols

One strategy is to divide the variable name into 3 parts: a prefix, a stem, and a suffix.

Prefix: 2 or 3 characters denoting the timepoint the data was gathered and from whom it was asked or how it was gathered.

• M1 = Time 1, asked of the mother• F2 = Time 2, asked of the father• C3 = Time 3, asked of the childcare provider• V1 = Time 1, video-coded data


Stem: After the prefix, the next set of characters should denote the construct being measured.

• V1intr = time 1, video-coded, parent intrusiveness

• M3dep = time 3, mother data, depression


Suffix: The suffix can be used for scales with item numbers or summary scores.

• M3dep01 = time 3, mother data, construct depression, item number 1

• M3deptot = time 3, mother data, construct depression, total summary score

• M3depsu = time 3, mother data, construct depression, suicide subscale score

Calculate Summary Scores or Reliability Checks

• The computer should be used to calculate summary scores, reliability scores, or check hand-calculated summary scores. This allows missing data to be treated consistently with rules.

• For example, a researcher may want to assign a missing value to any summary score that is missing more than 25% of the individual items.

Reports• Databases like MS Access can be used to

produce any useful collection of data that is supported by the database.

• These reports can be used for management of data collection, particularly when data collection time points during a longitudinal study overlap.E.g., weekly reports of subjects who are within the

window of data collection for times 1 and 2.

• They can also be used to produce data codebooks, lists of variable name prefixes, stems, suffixes, database rules, etc.

Have all the data been entered into appropriate data tables?• Matching tracking data (completion codes of

various stages of data collection) with data entry tables ensures all data that was collected is entered. This allows cleanup of inconsistent data OR tracking information. All data should tell the same story.

• If the tracking information states that Alice completed all protocols for timepoint 1 but is missing the video portion of timepoint 2, then the data tables should also reflect this pattern.

“Business Rules”

• Business rules are database management rules used to reduce the proliferation of database “junk.”

• When multiple people use a database and are designing tables and queries, it becomes necessary to have rules to live by so that the database is transparent to those who use it.

“Business Rules”These rules may include:

• Naming protocols for permanent queries, tables, reports, or syntax.

• Articulating cleanup procedures like automatic deletion of non-permanent queries, data tables, or syntax after a designated amount of time.

• Designating personal databases or folders for individual use.

• Keeping a computer file for logging data errors or changes.

• Maintaining strict control over your computer directory structure—helps eliminate clutter and confusion.

• Automating a backup schedule.

Data Codebooks• A database like MS Access can store ANY type

of data. Thus you can integrate making a codebook as part of the data collection/entry process.

• These types of data can include your variable labels, value labels, questionnaires, or appendices.

• In preparing a longitudinal dataset, it is important to maintain consistent coding schemes across time. MS Access can be used to maintain definitions of coding schemes.yes/no in EVERY DATASET coded no = 0, yes = 1.

Two Examples of Codebooks Produced by MS Access

• Differing amounts of information on each screen

• Ease of readability

Summary• Your research credibility depends on sound data

management techniques.

• Proper data cleaning ALWAYS takes longer and requires more investment than you think it will.

• Clear advance planning of data tracking, variable naming, business rules, and writing data codebooks will be invaluable for a dataset that is logical, consistent, and easy to use.

• Always respect your data subjects and remember to build in data procedures that maintain their right to privacy.

Contact InformationAndrea D. Hart

[email protected]

University of Arkansas for Medical Sciences

Partners for Inclusive Communities

2001 Pershing Circle, Suite 300

North Little Rock, AR 72114

501-682-9918

FAX 501-682-9991

Catherine Callow-Heusser

[email protected]

NSF MSP/RETA Evaluation Capacity Building Project

2810 Old Main Hill

Utah State University

Logan, UT 84322-2810

435-797-1111

FAX 435-797-1448

Documents

Data Management Techniques: How to Scrub Your Data Andrea D. Hart Catherine Callow-Heusser