
Res Meth Workshop Dec 04

Disclosure problems with design information for surveys

Gillian Raab

Kathy Buckner/Iona Waterston

Napier University

Susan Purdon

National Centre for Social Research


Background

• PEAS project
• Uses real survey data – not textbook examples
• Illustrates how they can be analysed using
– Different methodologies
– Their implementations in software packages
• Links the analyses with sections on the theory relevant to the design and analysis of surveys


Data availability

• ESRC stipulation
– Data used in the exemplars must be available via the ESRC data archive
• But if this is the ONLY way it is available it would make the site hard to use
• So the exemplars use extracts, of just a few variables, available on the web



Need to make survey design variables available

• Cluster (primary sampling unit) identifiers
– If the sample is clustered – here it was
• Indicators of the strata used
– Here stratification was by local authority
• Weights
• Cluster and stratum identifiers may not be made available via the data archive or may be in restricted files


Name       Variables                             Formats
UNIQD      Unique household identifier           ids have been scrambled
COUNCIL    code for Scottish local authority     see codebook
INTUSE     whether uses internet                 1=yes 0=no
SHS_6CLA   six fold regional classification      see codebook
RC5        number of hours of internet p/w       see codebook
AGE        in years
SEX        1=male 2=female                       1=male 2=female
RC7G       internet for non-grocery shopping     1=yes 0=no
RC7E       internet for grocery shopping         1=yes 0=no
PSU        primary sampling unit                 ids have been scrambled
EMP_STA    current employment status             see codebook
GRP_INC    grouped income data                   see codebook
IND_WT     weight for random adult


Clusters are about 10 respondents each
Strata are local authorities

In other cases strata might be (e.g.) large firms in a business survey.
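
To show why the design variables are needed, here is a minimal sketch of a design-based standard error for the weighted proportion of internet users. It assumes the extract has been read into a list of dicts keyed by the variable names above (COUNCIL as stratum, PSU as cluster, IND_WT as weight, INTUSE as outcome). The function name and the linearisation approach are illustrative choices, not taken from the PEAS exemplars.

```python
from collections import defaultdict
from math import sqrt


def weighted_proportion_se(rows):
    """Weighted proportion of INTUSE and a linearised standard error.

    rows: list of dicts with keys COUNCIL (stratum), PSU (cluster),
    IND_WT (weight) and INTUSE (0/1). Hypothetical helper for illustration.
    """
    total_w = sum(r["IND_WT"] for r in rows)
    p = sum(r["IND_WT"] * r["INTUSE"] for r in rows) / total_w

    # Linearised contribution of each record, totalled per PSU within stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for r in rows:
        z = r["IND_WT"] * (r["INTUSE"] - p) / total_w
        psu_totals[r["COUNCIL"]][r["PSU"]] += z

    # Between-PSU variance within each stratum, summed over strata.
    variance = 0.0
    for psus in psu_totals.values():
        n_h = len(psus)
        if n_h < 2:
            continue  # assumption: single-PSU strata are skipped in this sketch
        mean_h = sum(psus.values()) / n_h
        variance += n_h / (n_h - 1) * sum((t - mean_h) ** 2 for t in psus.values())
    return p, sqrt(variance)
```

Without COUNCIL and PSU the between-cluster term cannot be formed, and a user would be left treating the data as a simple random sample.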


Disclosure can happen if

• We know the location of individual clusters

• We can identify an individual within a cluster

• A stratum is small and a large proportion of it is sampled

• We have some means of linking the data on the web back to the full data source


Steps to prevent disclosure

• Change cluster identifiers so they no longer reveal location

• Change IDs so they cannot link back
• Add noise to the weights so they do not identify individuals
• Make the details of how the strata are defined unavailable (not in this exemplar)
• Maybe more things??
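
The first three steps above can be sketched as follows. This is an illustration only: the slides do not say how the identifiers were scrambled or how the noise was added, so the random relabelling, the multiplicative uniform noise, the `rel_noise` parameter and both function names are assumptions.

```python
import random


def scramble_ids(values, rng):
    """Replace each distinct identifier with a random label 1..k.

    Records that shared an identifier still share one, so clusters
    remain usable for analysis but no longer encode location.
    """
    distinct = sorted(set(values))
    labels = list(range(1, len(distinct) + 1))
    rng.shuffle(labels)
    mapping = dict(zip(distinct, labels))
    return [mapping[v] for v in values]


def perturb_weights(weights, rng, rel_noise=0.05):
    """Multiply each weight by a random factor, then rescale.

    rel_noise is a hypothetical tuning parameter: each factor is drawn
    uniformly from [1 - rel_noise, 1 + rel_noise]. Rescaling keeps the
    sum of the weights unchanged.
    """
    noisy = [w * rng.uniform(1 - rel_noise, 1 + rel_noise) for w in weights]
    scale = sum(weights) / sum(noisy)
    return [w * scale for w in noisy]
```

For example, `scramble_ids(psu_column, random.Random())` would be applied once to the PSU column and once to the household identifiers, and the mapping would then be discarded so the web extract cannot be linked back.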


What are the principles?

• Do we need to worry about
– Population unique individuals
– Sample unique individuals

Logically we would expect the former
But the latter may also be important

If you know you are in the survey?
If you know that someone else was in the survey?

Principles for individuals and organisations may have to be different
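
Sample uniqueness, at least, is easy to check on a released file. The sketch below flags records whose combination of key variables occurs only once in the sample. The choice of keys is an assumption for illustration, and the function name is hypothetical. Population uniqueness cannot be checked this way, since it needs information about people outside the sample.

```python
from collections import Counter


def sample_uniques(rows, keys):
    """Return the records whose combination of key values is unique in the sample.

    rows: list of dicts; keys: names of the identifying variables,
    e.g. ("COUNCIL", "AGE", "SEX", "EMP_STA").
    """
    combos = Counter(tuple(r[k] for k in keys) for r in rows)
    return [r for r in rows if combos[tuple(r[k] for k in keys)] == 1]
```

Adding more key variables can only increase the number of sample uniques, which is one reason the web extracts carry just a few variables.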


Another way round this

• Surveys come with sets of replicate weights
• Standard errors for surveys are provided using jackknife or bootstrap methods
• The user does not need to have access to the individual design variables
• This approach has been pioneered by Statistics Canada
• But a sharp investigator could still work out clusters
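
A minimal sketch of both halves of this slide, assuming an unstratified delete-one-cluster jackknife in which each replicate sets the weights of one cluster to zero and scales up the rest. Real surveys may use stratified or bootstrap replicates with different scaling factors, and the function names here are hypothetical. The first function gets a standard error from the replicate weights alone. The second shows how the zero pattern in such replicates gives the clusters away.

```python
from collections import defaultdict
from math import sqrt


def weighted_mean(y, w):
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def jackknife_se(y, full_weights, replicate_weights):
    """Standard error of the weighted mean from replicate weights.

    replicate_weights[r][i] is the weight of record i in replicate r.
    Assumes the (R - 1) / R scaling of an unstratified jackknife.
    """
    theta = weighted_mean(y, full_weights)
    reps = [weighted_mean(y, w) for w in replicate_weights]
    n_reps = len(reps)
    return sqrt((n_reps - 1) / n_reps * sum((t - theta) ** 2 for t in reps))


def recover_clusters(replicate_weights):
    """Group record indices by the replicate in which their weight is zero."""
    groups = defaultdict(list)
    for r, weights in enumerate(replicate_weights):
        for i, w in enumerate(weights):
            if w == 0:
                groups[r].append(i)
    return sorted(groups.values())
```

Records dropped together in the same replicate must belong to the same cluster, so replicate weights of this kind hide the cluster labels without hiding the cluster membership.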


Relevance to researchers

• We have been able to get the data we wanted for our exemplars so far

• But there are some surveys at the ESRC data archive where the cluster identifiers are
– Not available at all
– There, but obscure

• A consistent policy (perhaps with restrictions) would be helpful
