Res Meth Workshop Dec 04
Disclosure problems with design information for surveys
Gillian Raab
Kathy Buckner/Iona Waterston
Napier University
Susan Purdon
National Centre for Social Research
Background
• PEAS project
• Uses real survey data, not textbook examples
• Illustrates how they can be analysed using
  – Different methodologies
  – Their implementations in software packages
• Links the analyses with sections on the theory relevant to the design and analysis of surveys
Data availability
• ESRC stipulation
  – Data used in the exemplars must be available via the ESRC data archive
• But if this were the ONLY way the data were available, it would make the site hard to use
• So the exemplars use extracts of just a few variables, available on the web
Need to make survey design variables available
• Cluster (primary sampling unit) identifiers
  – If the sample is clustered (here it was)
• Indicators of the strata used
  – Here stratification was by local authority
• Weights
• Cluster and stratum identifiers may not be made available via the data archive, or may be in restricted files
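Why the design variables matter can be shown with a minimal sketch of a linearised standard error for a weighted mean. This is not the PEAS code: the function name `weighted_mean_se` and the row layout `(stratum, psu, weight, y)` are assumptions made for illustration, and finite population corrections are ignored.

```python
from collections import defaultdict
from math import sqrt


def weighted_mean_se(rows):
    """rows: iterable of (stratum, psu, weight, y). Returns (mean, se)."""
    rows = list(rows)
    total_w = sum(w for _, _, w, _ in rows)
    mean = sum(w * y for _, _, w, y in rows) / total_w

    # Linearised contribution of each respondent, summed within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in rows:
        psu_totals[stratum][psu] += w * (y - mean) / total_w

    variance = 0.0
    for totals in psu_totals.values():
        n = len(totals)
        if n < 2:
            continue  # single-PSU stratum: contributes nothing in this sketch
        centre = sum(totals.values()) / n
        variance += n / (n - 1) * sum((t - centre) ** 2 for t in totals.values())
    return mean, sqrt(variance)
```

Without the stratum and cluster columns the only option is to treat every respondent as their own cluster, which typically understates the standard error when answers are similar within clusters.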
Name      Variables                              Formats
UNIQD     Unique household identifier            ids have been scrambled
COUNCIL   code for Scottish local authority      see codebook
INTUSE    whether uses internet                  1=yes 0=no
SHS_6CLA  six fold regional classification       see codebook
RC5       number of hours of internet p/w        see codebook
AGE       in years
SEX       1=male 2=female                        1=male 2=female
RC7G      internet for non-grocery shopping      1=yes 0=no
RC7E      internet for grocery shopping          1=yes 0=no
PSU       primary sampling unit                  ids have been scrambled
EMP_STA   current employment status              see codebook
GRP_INC   grouped income data                    see codebook
IND_WT    weight for random adult
Clusters are about 10 respondents
Strata are local authorities
In other cases strata might be (e.g.) large firms in a business survey.
Disclosure can happen if
• We know the location of individual clusters
• We can identify an individual within a cluster
• A stratum is small and a large proportion of the stratum is sampled
• We have some means of linking the data on the web back to the full data source
Steps to prevent disclosure
• Change cluster identifiers so they no longer reveal location
• Change IDs so they cannot link back
• Add noise to the weights so they do not identify individuals
• Make the details of how the strata are defined unavailable (not in this exemplar)
• Maybe more things??
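The first three steps can be sketched as follows. The slides do not say how the identifiers were scrambled or how the noise was added, so the random relabelling, the multiplicative uniform noise and the `spread` parameter are all assumptions, as are the function names.

```python
import random


def scramble_ids(ids, rng):
    """Map each distinct id to a random label 1..k unrelated to the original."""
    distinct = sorted(set(ids))
    labels = list(range(1, len(distinct) + 1))
    rng.shuffle(labels)
    mapping = dict(zip(distinct, labels))  # must not be published
    return [mapping[i] for i in ids]


def perturb_weights(weights, rng, spread=0.05):
    """Multiply each weight by a random factor in [1 - spread, 1 + spread].

    Multiplicative uniform noise is an illustrative choice, not the
    method used for the exemplar.
    """
    return [w * rng.uniform(1 - spread, 1 + spread) for w in weights]
```

Records that shared a cluster still share the new label, so the design is preserved for analysis while the label itself carries no location information. A larger `spread` hides more but also perturbs the weighted estimates more.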
What are the principles?
• Do we need to worry about
  – Population unique individuals
  – Sample unique individuals
Logically we would expect the former
But the latter may also be important
  If you know you are in the survey?
  If you know that someone else was in the survey?
Principles for individuals and organisations may have to be different
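Sample uniqueness, unlike population uniqueness, can be checked from the released file alone. A minimal sketch, assuming records are dictionaries and that the choice of identifying variables (`keys`) is made by whoever releases the data:

```python
from collections import Counter


def sample_uniques(records, keys):
    """Return the records whose combination of `keys` occurs only once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if combos[tuple(r[k] for k in keys)] == 1]
```

For example, `sample_uniques(records, ["AGE", "SEX", "COUNCIL"])` would list respondents who are the only person of their age and sex sampled in their local authority. Whether such a person is also unique in the population cannot be decided from the sample.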
Another way round this
• Surveys come with sets of replicate weights
• Standard errors for surveys are provided using jackknife or bootstrap methods
• The user does not need to have access to the individual design variables
• This approach has been pioneered by Statistics Canada
• But a sharp investigator could still work out clusters
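A minimal sketch of the idea, using a delete-one-cluster jackknife within strata. This is a textbook scheme chosen for illustration, not a description of what Statistics Canada releases, and the function names are hypothetical. The data producer runs the first function; the user only ever needs the last one.

```python
from collections import defaultdict
from math import sqrt


def jackknife_replicates(strata, psus, weights):
    """Producer side: delete-one-PSU replicate weights and their multipliers."""
    psus_in = defaultdict(set)
    for h, c in zip(strata, psus):
        psus_in[h].add(c)

    replicates, multipliers = [], []
    for h, members in sorted(psus_in.items()):
        n = len(members)
        if n < 2:
            continue  # a single-PSU stratum yields no replicate in this sketch
        for dropped in sorted(members):
            rep = []
            for hi, ci, w in zip(strata, psus, weights):
                if hi != h:
                    rep.append(w)                # other strata unchanged
                elif ci == dropped:
                    rep.append(0.0)              # dropped cluster
                else:
                    rep.append(w * n / (n - 1))  # rest of stratum scaled up
            replicates.append(rep)
            multipliers.append((n - 1) / n)
    return replicates, multipliers


def weighted_mean(y, w):
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def replicate_se(y, weights, replicates, multipliers):
    """User side: needs only weights, never the stratum or PSU columns."""
    full = weighted_mean(y, weights)
    var = sum(m * (weighted_mean(y, rep) - full) ** 2
              for rep, m in zip(replicates, multipliers))
    return sqrt(var)
```

The sketch also shows how clusters can leak: in this scheme the records with a zero weight in the same replicate belong to the same cluster, so the replicate weights themselves reveal cluster membership unless they are further disguised.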
Relevance to researchers
• We have been able to get the data we wanted for our exemplars so far
• But there are some surveys at the ESRC data archive where the cluster identifiers are
  – Not available at all
  – Information is there, but it is obscure
• A consistent policy (perhaps with restrictions) would be helpful