Data Don’t Grow on Trees Data Collection Past and Present in the NLS

Randall J. Olsen

Center for Human Resource Research

Initiative for Population Research

Why Talk About This?

• How the data are collected affects their quality, structure, organization and presentation

• How the data are collected affects how the interviewer interacts with both the instrument and the respondent, as well as the interviewer's cognitive load and hence her ability to catch mistakes.

In the Beginning there was Census

• The original plan was for the NLS to last 5 years and stop

• CHRR did not receive the data from Census – it simply requested tabulations of the data

• There were no “public use data” at all for several years

• Cohorts were done personal-none-phone-none-phone, so R was interviewed 3 times in 5 years, with a phone interview preceding a personal one

Original Cohorts (cont.)

• Each case had a “control card” (heavy card stock) that listed each address R occupied, HH members, contacts, and history of cooperation. This card was used in up to three rounds.

• Data from prior rounds were transcribed, by hand, onto a “flap” the interviewer could refer to when a check item utilized such data for a skip.

• See pp. 8 (CPS), 16 (work history), 30 (HH roster), 32-33 (assets & income), 36 (siblings)

What Did This Mean for Data?

• Structure of interview was very simple as checks were hard to execute

• If interviewer made a skip error, it was very hard to recover from it

• Sometimes the wrong questions were asked and the right questions were not

• The data had to go through a labor-intensive “editing” process to correct or flag errors.

“Packed Variables”

Code-all-that-apply questions were “bit-packed”: if there were 6 categories of things you are doing to find work, the yes or no answers were represented as a string of bits such as 1-0-1-1-0-1, and the answer was recorded as the decimal value of that binary number, 45. The user had to “unpack” the 45 to recover the six 0-1 codes for methods used to find work. All of these have long since been recoded.
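A minimal sketch of the unpacking step, for users who meet a packed variable in an old file. The function names are illustrative, and the bit order (first category in the most significant bit) is an assumption: the example pattern 1-0-1-1-0-1 reads the same in both directions, so it does not settle which order was used.

```python
def unpack(value, n_categories):
    """Recover the 0/1 flags from a bit-packed code.

    Assumes (not confirmed by the source) that the first category
    is stored in the most significant bit.
    """
    if not 0 <= value < 2 ** n_categories:
        raise ValueError("value does not fit in the given number of categories")
    return [(value >> (n_categories - 1 - i)) & 1 for i in range(n_categories)]


def pack(flags):
    """Inverse of unpack: turn a list of 0/1 flags into one integer."""
    value = 0
    for flag in flags:
        value = (value << 1) | flag
    return value


# The job-search example: six yes/no categories recorded as 45.
flags = unpack(45, 6)
print(flags)        # [1, 0, 1, 1, 0, 1]
print(pack(flags))  # 45
```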

Errors

• Check items sometimes were coded incorrectly or items skipped.

• The editing process itself was done manually and was prone to error.

• Occasionally the answer to a question that should not have been asked is in the data file.

Structure

• The original cohorts' surveys had a structure known as “time series of cross-sections” because the same questions were repeated in the various waves.

• In the 1980’s we shifted the original cohorts to more of an event history approach for the employment data.

Early Public Use Data

• Data on magnetic tape, often several reels of tape.

• Codebook was printed and could be a foot thick

• There was a “Key word in context” (KWIC) file to help the user find variables. This file could be several inches thick.

• Magnetic tapes documented with a “tape map” that showed where to find a variable

Public use data (cont.)

• In the case of the NLSY79 there were so many variables that data exceeded block size limits on IBM 360/370 systems.

• NLSY79 data broken into record types and each variable put into one record type or another – but only one.

• By late 1980’s NLSY79 was on 7 reels of tape

CD-ROM

• Data dissemination problem was acute, and this change was undertaken to make data more accessible. First CD release in 1989.

• Record types from the magnetic tape era were continued as an indexing tool for CD-ROM – tapes and CD’s had to co-exist for several years until tape died out. The tendency for variables to have one “area of interest” is an atavism from record types used on magnetic tape. We are working toward variables having several areas of interest.

Data on Web

• Support costs for extraction executable grew with need to support new operating systems

• Faster web & greater penetration argue for end of CD’s after more than a decade

• The new Web package will also implement some new features to deal with the increasing size and complexity of the data files.

• The data extraction system is completely integrated with the data collection strategy we have developed over the past 20 years.

NLSY79 Data Collection

• Paper and pencil was used from 1979 to 1992, so the NLSY79 shared many of the QC problems Census had. The 1988 survey (R10) used a system called “computer assisted data entry”, which was a disaster and was never used again; R10 has about 100,000 data points in error.

• Information from previous interviews was printed on a separate piece of paper called the “information sheet”. Now those data are in a file, but the file is still called the information sheet.

Paper…

• Data about the respondent (names, addresses, contacts, history of cooperation) were printed on another piece of paper called the “face sheet” – similar to the Census “control card” except the face sheet was reprinted each year. When an address changed, this was noted on the face sheet.

• Editors went over each booklet and marked errors. When key variables were missed or in error, the case went to retrieval, and a staff person called R to “retrieve” the data. Ranges on numerical values were checked.

Paper, more paper

• After editors marked up the booklet, the responses were data entered on key punch machines (later data entry terminals that produced punch card images) and verified

• “Cleaning specification” programs were run and errors either corrected, data retrieved or coded as “invalid skip”.

• Process repeated until “all clean”, which really meant “we give up”.

• This process was very similar at Census

• Mag tape was medium of storage for decades

• The Census Bureau, in a fit of incompetence, has lost or let deteriorate, archival tapes for the NLS so some data suppressed by Title 13 cannot be recovered even from within Census

What did PAPI do to Data?

• Invalid skips were far more common – either because the FI failed to record an answer to a question or followed the wrong skip path

• Dollar and cents amounts often were bad – could be off by a factor of 100

• “Reserve value” codes sometimes taken as monetary amounts

– Missing value codes for Ref, DK, NA, out of range were coded by NORC as 995, 996, 997, etc. depending on field width – could be 9995, 9996, etc. Numerical values of this sort may be spurious. Most of these have been edited out in the recent income & assets edit and imputation.
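A rough sketch of how a user might flag amounts that could be leftover reserve codes. It assumes the pattern described above (all 9's followed by a final digit, sized to the field width). The function names and the default set of final digits (5, 6, 7) are illustrative assumptions: the slide lists those three and says “etc.”, so the full set should be checked against the codebook.

```python
def reserve_codes(field_width, trailing_digits=(5, 6, 7)):
    """Candidate reserve codes for a numeric field of the given width.

    trailing_digits is an assumption; pass the full set from the
    codebook if more codes were in use.
    """
    if field_width < 1:
        raise ValueError("field_width must be at least 1")
    prefix = "9" * (field_width - 1)
    return {int(prefix + str(digit)) for digit in trailing_digits}


def flag_suspect(values, field_width, trailing_digits=(5, 6, 7)):
    """Return True for each value that matches a candidate reserve code."""
    codes = reserve_codes(field_width, trailing_digits)
    return [value in codes for value in values]


# A 4-column income field: 9996 looks like a reserve code, 9500 does not.
print(sorted(reserve_codes(4)))               # [9995, 9996, 9997]
print(flag_suspect([1200, 9996, 9500], 4))    # [False, True, False]
```

A flagged value is only a candidate for review; a real amount can coincide with a code, which is why these values are described as possibly spurious rather than certainly wrong.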

Sources of Data on Location in NLSY79

• Because address changes were coded on the face sheet, respondents who did not move had no address data entered. If the FI forgot to enter new data, we assumed R did not move. This made the address data less reliable in the first decade.

• Origin of location data is not the questionnaire, but the face sheet.

• Address data for NLSY79 were often bad: address, city, state and zip often were in conflict

• Phone numbers usually good, so for several years we geocoded by phone number

• In the 1990’s we undertook a major effort to geocode addresses to lat and long. We occasionally “find” data on addresses and can review and update geocodes.

Info Sheets

Some data in the file are “straight pick-ups” from the information sheets. These data are in the data file, but one does not see where they are collected in the questionnaire. However, information sheet items can be very important to an understanding of the questionnaire as questions often involve a mixture of data from previous rounds and data from earlier questions in the current round.

“Household Interview” or HHI

In early years, we used different Household Interview forms depending on the age and residence of the respondent. Sometimes the parent acted as the informant and sometimes the respondent. Thus information about the household was often collected in a different manner across respondents, so the data file will differ across those respondents.

Bounded Interviewing

• NLSY79 was designed to collect event histories – a complete series of transitions (marital status, employment, welfare receipt, etc.) and the date of each transition

• That section of the survey starts by confirming R status at date of previous interview & asks about all the changes since then. This minimizes “seam problem” of inconsistent data between two survey waves. We don’t ask for data month by month, just changes.
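Because only the changes are collected, a user who wants a month-by-month series has to build it from the confirmed starting status and the dated transitions. A minimal sketch, assuming each transition is stored as a (year, month) pair with the new status; the function name and data layout are illustrative, not the layout of the NLS files.

```python
def monthly_status(status_at_last_interview, transitions, start, end):
    """Expand an event history into one status per month.

    transitions: list of ((year, month), new_status) pairs.
    start, end:  (year, month) pairs, inclusive.
    A transition takes effect in the month in which it is dated.
    """
    changes = sorted(transitions, key=lambda t: t[0])
    status = status_at_last_interview
    series = []
    i = 0
    year, month = start
    while (year, month) <= end:
        # Apply every change dated on or before the current month.
        while i < len(changes) and changes[i][0] <= (year, month):
            status = changes[i][1]
            i += 1
        series.append(((year, month), status))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return series


# R confirmed as married at the last interview, divorced in January 1991.
for row in monthly_status("married", [((1991, 1), "divorced")],
                          start=(1990, 11), end=(1991, 2)):
    print(row)
```

Starting each wave from the confirmed status at the previous interview date is what keeps the series consistent across the seam.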

Bounded Interviewing (cont.)

• With paper and pencil, it was hard to manage bounded interviewing as one needed the interviewer to use data from the previous interview. This makes the interview more complex and requires carrying data forward from the previous wave.

• The interview was done every year, and that gave us very little time to prepare bounded interviewing data for use in the next wave. We only used bounded interviewing for a few areas.

Bounded Interviewing (cont.)

• When we moved to CAPI in 1993, more sections of the interview were converted to bounded event history interviewing.

• Schooling and recipiency data for NLSY79 were originally not event history and had lots of “seam” problems.

Fertility Data

These data were carefully edited at CHRR because of the Child Study. Always use edited data, never raw. We did not finish editing the PAPI data from wave k until after NORC was in the field for wave k+1. The information sheets were not based on edited data and so info sheet data were flawed. Only with CAPI were we able to edit the data in time for the next wave and this greatly improved the data.

CAPI Transition

• We tested for the presence of “mode effects” with CAPI in Rounds 11 (1989) and 12 (1990). The only place we saw them was where amounts and units were on the same page on paper and on different screens with CAPI.

• Rounds 11 and 12 had the same content for CAPI as used for PAPI, so there were few changes to the instrument.

• CAPI only starting Round 15 (1993)

CAPI only rounds for ‘79

• Event history on schooling and recipiency

• Use of “text fills” increased

• To make the interview easier for interviewer and respondent, the instrument became more complex – for example rate of pay questions became more elaborate and tailored to the time unit R wanted to use for reporting.

CAPI Complexity

• More use of information carried forward from previous rounds to avoid repeating questions

• This added complexity helped interviewers and respondents, but made the data file more complex and harder to use.

• The later PAPI questionnaires are easier to follow and reveal what is going on.

CAPI Complexity (cont.)

• CAPI does a large number of calculations in the background, and these operations leave a long trail of “temporary” variables, rosters and indicators. These are labeled as “machine checks” in the file.

• The size and complexity of the instrument requires a holistic strategy for managing the instrument and documentation.

The Toolkit

• The NLS questionnaire and data dissemination system are managed using a large relational database – each question is represented as a data record and the survey operates by a sequence of operations on the question records.

• These question records are re-used by the documentation system. This substantially reduces the level of effort to manage the process and supports the use of a variety of utilities and reports that support each major step from design through data release.
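The idea of one set of question records driving both the interview and the documentation can be sketched as follows. This is an illustration only, not the actual Toolkit: the record fields, the question names (WORK-1 and so on) and the function names are all hypothetical, and a plain Python list stands in for the relational database.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class QuestionRecord:
    name: str                           # stable question name (hypothetical)
    text: str                           # wording; may contain {fills}
    default_next: Optional[str] = None  # None ends the interview
    branches: Dict[str, Optional[str]] = field(default_factory=dict)


RECORDS = [
    QuestionRecord("WORK-1", "Has {name} worked since the last interview?",
                   default_next="WORK-2", branches={"no": "WORK-3"}),
    QuestionRecord("WORK-2", "What is {name}'s rate of pay?",
                   default_next="WORK-3"),
    QuestionRecord("WORK-3", "Is {name} looking for work?"),
]


def run_interview(records, start, answer_fn, fills):
    """Walk the records, following each answer's branch or the default."""
    by_name = {rec.name: rec for rec in records}
    responses = {}
    current = start
    while current is not None:
        rec = by_name[current]
        answer = answer_fn(rec.text.format(**fills))  # apply text fills
        responses[rec.name] = answer
        current = rec.branches.get(answer, rec.default_next)
    return responses


def codebook(records):
    """Build documentation lines from the same records the interview uses."""
    lines = []
    for rec in records:
        lines.append(f"{rec.name}: {rec.text}")
        for answer, target in rec.branches.items():
            lines.append(f"  if answer is {answer!r}, go to {target or 'END'}")
        lines.append(f"  otherwise, go to {rec.default_next or 'END'}")
    return lines


scripted = iter(["no", "yes"])
print(run_interview(RECORDS, "WORK-1", lambda text: next(scripted), {"name": "R"}))
print("\n".join(codebook(RECORDS)))
```

Since the skip logic lives in the records rather than in program code, the codebook cannot drift away from what the interviewer actually saw.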

Toolkit (cont.)

• Instrument is not actually programmed – a substantial difference from how other projects are managed

• This linkage means that virtually every questionnaire resource available to the interviewer during the interview could be available to the documentation system – if it would help, we could link in help screens provided to the interviewer or even alternate language versions of the questionnaire.

• This creates a divide between the documentation resources available for 1993 and later rounds and those before. Puzzling discontinuities in documentation and questionnaire structure reflect about 5 generations of survey systems.

Toolkit & Nomenclature

– In NLSY79 questions were numbered consecutively, but those numbers changed year by year. Deck and column referred to the positions on a particular punch card occupied by a variable during the processing stage. The “R number” was assigned at the end of the process.

– With move to RDB methods, each data element needed a unique identifier and that identifier stayed the same over time.

– A question name (Q7-13b) identified the same variable year after year, again with R numbers assigned last.

Nomenclature

• We try to keep variables in the same part of the instrument with nearly contiguous R numbers.

• As we run out of R numbers, we use S or T numbers. Some variables that are highly related but come from different waves (health conditions at age 40) get similar H numbers.

Interview Mode

• While Original cohorts used P-x-T-x-T pattern, NLSY79 was almost all face to face until the ’90’s

• 1987 – ’79 done by phone unsatisfactorily

• 1988 NLSY79 has quality problems due to a flawed CADE process being used (it was never re-used)

• Face to face continued with steady increase in phone cases – in late 1990’s costs forced a switch to phone

– Instrument re-vamped to be phone friendly

– YA was always primarily phone

– ’79 pioneered virtual CATI shop technology – Web survey and VoIP enabled regular FIs to work from home – kept costs in check and kept FIs busy on NLS most of the time

Mode (cont.)

• ’79 continues primarily phone – ’97 is mostly in-person

– Both have relied on SAQ for confidential material

– SAQ in ’97 utilizes text fills whereas SAQ’s for ’79 did not

– Spanish language versions of quex have been phased out, primarily for cost reasons

• Efforts to interview overseas R have been reduced to a phone effort – we no longer send FIs to Mexico or PR.

NLSY97 – No Paper!

• NLSY97 was never collected on paper and hence there is no “simpler version” of the questionnaire to consult. NLSY79 has the R14 quex that is compact and readable. There is no analog for NLSY97.

• To save time, designers made an aggressive effort to streamline the instrument, but this made the data file more complex.

• Some questions asked by birth year

• More event histories and bounded interviewing

• For some questions, if R refuses, we attempt to collect the data the following year

NLSY97 complexity

• The desire to include a great many questions in the survey clashed with the limit on interview length. Questions were asked based on birth year, so users need to watch out for those check items. A large number of valid skips is your warning that this is going on.

• Questions on key concepts were asked in multiple locations – this makes data harder to use. As resources allow, we will try to “combine” these variables into one location.

NLSY97 complexity (cont.)

• The file size for this cohort is very large so there is a project underway to “hide” some of the variables users ought not extract.

• Complexity is a problem for users, but it is the price of being able to collect so much data. Simpler questionnaires take longer to administer and long surveys create attrition risks. Greater scope and complexity are a tied deal.

NLSY97 complexity (cont.)

• NLSY97 relies on rosters to a greater degree than other cohorts – it is especially important to use the right roster and not the questions that populate the roster. In the past several years, we have quietly removed variables users should not extract.

• Some rosters were edited at CHRR and contain reliable data; other rosters were temporary or work files, and we made no effort to edit them
