Disclosure Avoidance: An Overview Irene Wong ACCOLEDS/DLI Training December 8, 2003

Page 1: Disclosure Avoidance: An Overview

Disclosure Avoidance: An Overview

Irene Wong

ACCOLEDS/DLI Training

December 8, 2003

Page 2: Disclosure Avoidance: An Overview

Note:

The following slides were prepared in conjunction with the ACCOLEDS/DLI Training presentations at the University of Calgary (Alberta) on December 8, 2003, and are not intended for use as documentation of disclosure risk control and practices.

For more information about the slides, please contact the author at [email protected].

Page 3: Disclosure Avoidance: An Overview

Presentation Outline

• Overview of data confidentiality

• Different types of disclosure and output

• Some examples

• Facing the challenge

Page 4: Disclosure Avoidance: An Overview

Why is keeping data confidential so important?

• Retain and Respect Public Trust

  – Most household/population surveys do not have mandatory participation

  – Respondents volunteer their time and information

  – Respondents trust Statistics Canada to ensure their privacy and the confidentiality of their information

  – To ensure future data collection

  – Statistics Act - judiciously guarding respondents’ confidential information

Page 5: Disclosure Avoidance: An Overview

Types of data

• Aggregated data vs. Microdata

  – The type of data dictates the release method

• Enterprise data vs. Household data

  – Mandatory vs. voluntary participation

• Admin Data and Census vs. Sample Survey

  – Different degrees of disclosure risk

Page 6: Disclosure Avoidance: An Overview

Confidentiality and Disclosure

• Under the Statistics Act, Statistics Canada must protect the confidentiality of respondents’ data and identity.

• Disclosure relates to the inappropriate attribution of information to a data subject, whether the subject is an individual or an organization.

Page 7: Disclosure Avoidance: An Overview

So what’s the problem?

• Direct Identifiers (name, address, health number, etc.) uniquely identify a respondent. These are all stripped from released data files.

• Indirect Identifiers are variables (such as age, marital status, occupation, ethnicity, postal code, type of business, etc.) that, when combined, could be used to identify a respondent.

• Sensitive variables refer to information or characteristics relating to a respondent’s private life or business which are usually unknown to others (income, illness, behaviour, etc.).

Page 8: Disclosure Avoidance: An Overview

The concern is…

• Combining indirect identifiers with sensitive variables poses a disclosure risk, but…

• It is usually what researchers like to do

  – to relate specific characteristics of some response groups to some specific activities/characteristics

  – and how/why they are related

• Control method: restricted access, data reduction, disclosure analysis …

Page 9: Disclosure Avoidance: An Overview

Controls on microdata release

• Restricted Access

  – License and data sharing agreement

  – Strictly control record linkage (direct identifier)

  – Survey data access restricted within the organization

    • Employee access granted on a “need to know” basis only

  – Analytical (confidential) database with direct identifiers removed

    • Direct access – authorized employee/deemed employee only

    • Indirect data access (Remote Access services/Remote Data Access services) - screening

• Data Reduction – e.g. PUMF

Page 10: Disclosure Avoidance: An Overview

Public Use Microdata File (PUMF)

• Files of anonymous individual records

• Created for research purposes

• Follows Statistics Canada’s Policy on Microdata Release

• Expect some forms of data reduction and suppression

• Expect suppression of sample design information (cluster, stratification, etc.)

Page 11: Disclosure Avoidance: An Overview

PUMF disclosure risk control

• Suppress some indirect identifiers (e.g. small geographical code, race details, etc.)

• Avoid unique combinations of indirect identifiers that can disclose a response unit (such as gender, age, occupation, chronic conditions, religion, etc.)

• Perform univariate analyses and look for outliers

• Sometimes maximum/minimum values are capped

• And more…
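Two of the controls listed above (spotting unique combinations of indirect identifiers, and capping extreme values) can be sketched in a few lines of Python. This is an illustration only, not Statistics Canada's procedure: the records, the field names, and the cap of 100,000 are all hypothetical.

```python
from collections import Counter

# Hypothetical microdata; field names and values are invented for illustration.
records = [
    {"sex": "F", "age_group": "30-39", "occupation": "nurse", "income": 52_000},
    {"sex": "F", "age_group": "30-39", "occupation": "nurse", "income": 48_000},
    {"sex": "M", "age_group": "60-69", "occupation": "judge", "income": 310_000},
    {"sex": "M", "age_group": "30-39", "occupation": "clerk", "income": 39_000},
    {"sex": "M", "age_group": "30-39", "occupation": "clerk", "income": 41_000},
]


def unique_combinations(rows, keys):
    """Return the combinations of indirect identifiers held by exactly one record."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


def top_code(rows, field, cap):
    """Return copies of the records with values of `field` above `cap` set to `cap`."""
    return [{**row, field: min(row[field], cap)} for row in rows]


# Combinations that single out one respondent and would need coarser categories.
risky = unique_combinations(records, ["sex", "age_group", "occupation"])

# Hypothetical cap; the original records are left unchanged.
capped = top_code(records, "income", cap=100_000)

print(risky)
print(max(row["income"] for row in capped))
```

In practice the set of identifiers to combine, and the level at which values are capped, would be decided per survey.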

Page 12: Disclosure Avoidance: An Overview

Protection of confidential data

• Physical protection of the data storage area

• Protection of the computer systems

• Enforcement of data releasers’ and users’ responsibilities to protect respondent confidentiality

• Disclosure analysis on output that leaves the restricted data storage area

Page 13: Disclosure Avoidance: An Overview

Identity Disclosure

• Identity Disclosure - When a respondent can be identified from the released data.

  – Combine identifier with sensitive variables

Examples:

• Spontaneous recognition of well-known characteristic by others (e.g. from small sample)

• Self-disclosure (e.g., respondent self-identifies when complaining to the media on privacy violation)

Page 14: Disclosure Avoidance: An Overview

Attribute Disclosure

• Attribute Disclosure - When confidential information is revealed and can be attributed to an individual or a group.

  – For example, all persons with characteristic x have characteristic y

Examples:

• People in occupation W make $50,000-60,000/year…

• 100% of the respondents of age W in area X reported that they experimented with …

Page 15: Disclosure Avoidance: An Overview

Residual Disclosure

• Residual disclosure - when confidential information is disclosed by combining previously released output and information.

• Extra care is needed where risk of residual disclosure is high, such as

  – Subsequent cycles of longitudinal data files (e.g. NLSCY, NPHS, etc.)

  – Sample from dependent surveys (e.g. SLID and LFS)

  – Research projects using the same data file

  – Overlapping small geographical area (e.g. Health Region and Economic Region)

Page 16: Disclosure Avoidance: An Overview

Types of outputs

• Analytic studies (e.g. inferential statistics/model output)

  – Model parameters such as regression coefficients, etc.

  – Hypothesis test results such as p-value, t-statistics, etc.

• Descriptive studies (e.g. table output)

  – Frequencies, percentiles, cross-tabulation, standard errors, correlation matrix, etc.

Page 17: Disclosure Avoidance: An Overview

To lower disclosure risk

General rules we follow for household sample surveys:

• Do not report statistics or table cells with small number of respondents (e.g. fewer than 5 respondents)

• No anecdotal information may be given about specific respondents

• ‘Zero’ and ‘Full’ cell restriction

• Min. and Max. value restriction

• Saturated models, covariance/correlation matrices treated like underlying tables

• And more…..

Page 18: Disclosure Avoidance: An Overview

Some examples…

Page 19: Disclosure Avoidance: An Overview

Low frequency cells

(F, 0) is a low frequency cell.

Solution?

• Collapse column ‘M’ and ‘F’ = column ‘total’

• Collapse row ‘1’ and ‘0’ = row ‘total’

• Report column ‘M’ and row ‘1’, but not along with the ‘total’

        M     F     total
1       34    14    48
0       15    2     17
total   49    16    65

        M     F     total
1       34    14    48
0       15    X     17
total   49    16    65
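Why the totals matter can be shown with simple arithmetic on the table above: if only the (F, 0) cell is replaced by X while the margins are still published, the suppressed value follows by subtraction. A minimal Python sketch (the function name is hypothetical):

```python
def recover_suppressed(published_total, other_cells):
    """Recover one suppressed cell from its published row or column total."""
    return published_total - sum(other_cells)


# Row '0' of the table: M = 15, F = X (suppressed), total = 17.
recovered_from_row = recover_suppressed(17, [15])

# The same works down column 'F': row '1' = 14, row '0' = X, total = 16.
recovered_from_column = recover_suppressed(16, [14])

print(recovered_from_row, recovered_from_column)  # 2 2
```

This is why the remedies listed on the slide either collapse categories or withhold the totals rather than blanking a single cell.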

Page 20: Disclosure Avoidance: An Overview

Frequency distributions

Frequency curve, e.g.: user wishes to release the value of the observation at the 99th percentile

* child 1: family 1

child 2: family 1

child 3: family 2

child 4: family 2

child 5: family 3….

If fewer than 5 respondents are above the 99th percentile, there is a problem. One solution is to describe the distribution using the 95th percentile instead.

* If the survey is multilevel (NLSCY), then the 5 or more respondents from level 1 (child) must come from at least 3 different units from level 2 (household).
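The rule above can be written as a small check. The sketch below is one possible reading, not an official screening tool: the function name and the data are hypothetical, the thresholds of 5 respondents and 3 households are taken from the slide, and "above the percentile" is assumed to mean strictly greater than the cutoff value.

```python
def safe_to_release_cutoff(observations, cutoff, min_respondents=5, min_units=3):
    """Check the rule of thumb for releasing a percentile from a multilevel survey.

    observations: list of (value, household_id) pairs, one pair per child.
    cutoff: the percentile value the user wishes to release.
    """
    above = [(value, household) for value, household in observations if value > cutoff]
    households = {household for _, household in above}
    return len(above) >= min_respondents and len(households) >= min_units


# Hypothetical data: five children above the cutoff, drawn from three families.
spread_out = [
    (101, "family 1"), (102, "family 1"),
    (103, "family 2"), (104, "family 2"),
    (105, "family 3"), (50, "family 4"),
]

# Hypothetical data: five children above the cutoff, but only two families.
clustered = [
    (101, "family 1"), (102, "family 1"), (103, "family 1"),
    (104, "family 2"), (105, "family 2"), (50, "family 3"),
]

print(safe_to_release_cutoff(spread_out, cutoff=100))  # True
print(safe_to_release_cutoff(clustered, cutoff=100))   # False
```

When the check fails, a lower percentile (such as the 95th) leaves more respondents above the cutoff.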

Page 21: Disclosure Avoidance: An Overview

‘Zero’ and ‘Full’ cell

• (F, 1) is a full cell

• (F, 0) is a non-structural zero cell

  – Both could pose a confidentiality problem

• (Married, age <12) is a structural zero cell

  – Not a data confidentiality problem

  – No one is expected to be in this category

        M     F     total
1       52    64    116
0       13    0     13
total   65    64    129

age     married   single   total
<12     0         40       40
13-20   5         35       40
>20     32        8        40
total   37        83       120
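The distinction between the three kinds of cell can be made concrete with a short Python sketch using the two tables above. It is only one possible reading: the function name is hypothetical, a "full" cell is taken to be a cell that holds an entire non-empty column total, and structural zeros must be declared by the analyst because they cannot be detected from the counts alone.

```python
def flag_cells(table, structural_zeros=frozenset()):
    """Flag non-structural zero cells and full cells in a two-way table of counts.

    table: {row_label: {column_label: count}}
    structural_zeros: (row, column) pairs where a zero is expected by definition.
    """
    flags = {}
    columns = list(next(iter(table.values())))
    for col in columns:
        col_total = sum(table[row][col] for row in table)
        for row in table:
            count = table[row][col]
            if count == 0 and (row, col) not in structural_zeros:
                flags[(row, col)] = "non-structural zero"
            elif count > 0 and count == col_total:
                flags[(row, col)] = "full"
    return flags


# First table on the slide: every F respondent falls in row '1'.
by_sex = {"1": {"M": 52, "F": 64}, "0": {"M": 13, "F": 0}}

# Second table: (age <12, married) is declared a structural zero.
by_age = {
    "<12": {"married": 0, "single": 40},
    "13-20": {"married": 5, "single": 35},
    ">20": {"married": 32, "single": 8},
}

print(flag_cells(by_sex))
print(flag_cells(by_age, structural_zeros={("<12", "married")}))
```

The first call flags (1, F) as full and (0, F) as a non-structural zero; the second flags nothing, matching the slide's point that a structural zero is not a confidentiality problem.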

Page 22: Disclosure Avoidance: An Overview

Implied tables - residual disclosure

• Implied tables are tables produced by subtracting results from one or more published tables from another published table

• In this example, ‘non-married’ individuals can easily be calculated

Select if Married = 1

           Yes   No
1    20    13    40
2    20    5     35
3    1     32    8
     23    50    83

Select all cases

           Yes   No
1    20    20    41
2    20    9     52
3    4     30    16
     26    59    109
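The subtraction behind an implied table can be sketched in Python. The counts below are hypothetical and are not the figures from the slide; the function name and the group labels are also invented for illustration, and the threshold of 5 echoes the small-cell rule given earlier.

```python
def implied_table(all_cases, subset):
    """Subtract a published subset table from the published all-cases table."""
    return {
        row: {col: all_cases[row][col] - subset[row][col] for col in all_cases[row]}
        for row in all_cases
    }


# Hypothetical published tables: everyone, and married respondents only.
all_cases = {"group 1": {"Yes": 30, "No": 45}, "group 2": {"Yes": 12, "No": 50}}
married = {"group 1": {"Yes": 28, "No": 20}, "group 2": {"Yes": 6, "No": 30}}

# The table for non-married respondents was never published, yet it is implied.
non_married = implied_table(all_cases, married)

# Cells in the implied table that fall below the small-cell threshold of 5.
small_cells = {
    (row, col): count
    for row, cols in non_married.items()
    for col, count in cols.items()
    if count < 5
}

print(non_married)
print(small_cells)  # {('group 1', 'Yes'): 2}
```

Neither published table contains a small cell here, which is what makes this kind of residual disclosure easy to miss when each output is screened on its own.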

Page 23: Disclosure Avoidance: An Overview

When reporting information…

• Writing a report is no different from working with table output. Avoid statements such as:

• “… responded incomes ranging from $2,498 to $579,789.”

  – If necessary, give general indications (e.g. “no income was above $600,000”).

• “… all respondents of age 16 reported experimenting with drugs.”

  – This is equivalent to a full cell situation.

Page 24: Disclosure Avoidance: An Overview

Related Outputs

• If a PUMF as well as analytical outputs using confidential data are released for the same survey, the published results should not disclose sensitive information about individual respondents that was suppressed in the PUMF.

• That is, from the reported results, it should not be possible to infer information that allows the identification of a PUMF respondent.

Page 25: Disclosure Avoidance: An Overview

Facing Challenges

• No single control of all the releases

  – Remote Access, PUMFs, RDCs, survey data publications, etc.

• Potential residual disclosure

• Can residual disclosure be totally accounted for? Can it be better controlled?

Page 26: Disclosure Avoidance: An Overview

What RDCs are doing now…

• Educate data users to

  – Take precautions when dealing with confidential information

  – Recognize disclosure risk

  – Make use of alternative reporting and complementary suppression

  – Limit intermediary outputs

Page 27: Disclosure Avoidance: An Overview

What else should we do?

• Match against other types of file releases to assess overall disclosure risk?

• Future data reduction in PUMFs and publications?

• Follow the American RDC approach?

• Different disclosure analysis approach for different data files?

• Stricter screening process?

• ……