Measuring Disclosure Risk and Data Utility for Flexible Table Generators


Natalie Shlomo, Laszlo Antal, Mark Elliot University of [email protected]

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 262608 (DwB - Data without Boundaries).


Topics Covered

• Introduction

• Design of Flexible Table Generating Servers

• Information Based Risk-Utility Measures

• Example Application and Results

• Discussion


• Large demand for specialized and tailored tables from policy makers and researchers

• NSIs are considering the internet to disseminate outputs through flexible table generators, e.g. US Census Bureau, Australia ABS, Israel CBS

• Key questions:

(1) What data should be used to produce the tables?

• Original microdata with or without SDC methods, often aggregated to hypercubes

(2) At what stage to apply the SDC?

• Apply to underlying data and all tables considered safe

- Compounds SDC and reduces utility

• Apply to final output tables

- Problem to ensure consistency and additivity

Introduction


• Types of disclosure risk:

• Identity disclosure where small cells may lead to an identification

• Attribute disclosure where rows/columns have structural zeros and only one or two cells populated (small cells on margins)

• Differencing tables leading to higher risks of above disclosures

• Output-based query systems, e.g. a flexible table generator, need perturbative methods of SDC (see the CS literature on differential privacy)

• Flexible table generating requires ‘on the fly’ disclosure risk assessment, application of SDC methods and data utility measures

Introduction


SDC rules are easily programmed, some examples:

• Limit the number of dimensions

• Avoid disclosure by differencing by ensuring consistent and nested categories

• Minimum population thresholds, average cell size, etc.

Algorithm:

(1) Determine by SDC rules if table can be produced

(2) Assess disclosure risk

(3) Apply SDC method if needed

(4) Recalculate disclosure risk

(5) If safe table then output with utility measure, else go to (3)
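The five steps above can be sketched as a small control loop. This is only an illustrative sketch: the function name `release_table`, the callables it takes, the risk `threshold` and the `max_rounds` cap are all assumptions introduced here, and the toy rule, risk, perturbation and utility functions are placeholders rather than the measures used in this work.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def release_table(
    counts: Sequence[int],
    passes_rules: Callable[[Sequence[int]], bool],
    risk: Callable[[Sequence[int]], float],
    perturb: Callable[[Sequence[int]], Sequence[int]],
    utility: Callable[[Sequence[int], Sequence[int]], float],
    threshold: float,
    max_rounds: int = 10,
) -> Optional[Tuple[List[int], float]]:
    """Return (table, utility) if a safe table can be released, else None."""
    # Step 1: the SDC rules decide whether the table may be produced at all.
    if not passes_rules(counts):
        return None
    table = list(counts)
    rounds = 0
    # Steps 2 and 4: assess the risk, and reassess it after every perturbation.
    while risk(table) > threshold:
        if rounds == max_rounds:  # hypothetical cap so the loop always ends
            return None
        # Step 3: apply the SDC method.
        table = list(perturb(table))
        rounds += 1
    # Step 5: the table is safe, so output it with its utility measure.
    return table, utility(counts, table)


# Toy placeholders (assumptions for illustration only).
def toy_rules(c):
    return sum(c) >= 10                       # minimum population threshold


def toy_risk(c):
    return sum(1 for f in c if f in (1, 2)) / len(c)  # share of small cells


def toy_perturb(c):
    return [3 * round(f / 3) for f in c]      # deterministic rounding to base 3


def toy_utility(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))      # total absolute change


result = release_table([0, 1, 5, 12], toy_rules, toy_risk, toy_perturb,
                       toy_utility, threshold=0.0)
```

Passing the rule check, risk measure and SDC method in as callables keeps the loop independent of any particular measure, which matches the idea of assessing risk and utility "on the fly".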

Designing a Flexible Table Generating Server


Types of Data:

• Census data: whole-population counts

• European Census Hub with all member states providing common hypercubes

• Different SDC methods across member states reduce the utility of the hub

• Business data – different type of tables (magnitude) and not considered further

• Survey data from Social Surveys typically have non-perturbative SDC methods (coarsening)

• Weighted counts generally safe due to large and varying weights with low sample counts deleted for low quality

• Unweighted counts not differentially private due to sample uniques that are population uniques (Shlomo and Skinner 2012) and must be avoided

Designing a Flexible Table Generating Server


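• To assess attribute disclosure in tables, mainly caused by structural zeros, use the entropy

$$H\left(\frac{F}{N}\right) = -\sum_{i=1}^{K} \frac{F_i}{N}\log\frac{F_i}{N} = \frac{N\log N - \sum_{i=1}^{K} F_i\log F_i}{N}$$

where $F = \{F_1, F_2, \ldots, F_K\}$ is the vector of frequency counts and $N = \sum_{i=1}^{K} F_i$

• Entropy is bounded below by 0 (all cells are zero except one cell) and above by $\log K$ (all cell values are equal, i.e. cell proportions are $1/K$)

Risk measure:

$$1 - \frac{H(F/N)}{\log K}$$

• Combine with other measures (proportion of zeros and size of the population) and define the weighted average:

$$R(F, w_1, w_2) = w_1\frac{|A|}{K} + w_2\left[1 - \frac{N\log N - \sum_{i=1}^{K} F_i\log F_i}{N\log K}\right] - (1 - w_1 - w_2)\frac{1}{\sqrt{N}}\log\frac{1}{e\sqrt{N}}$$

where $A$ denotes the set of zero cells

Information Based Disclosure Risk and Data Utility Measures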
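As a minimal sketch, the entropy and the weighted risk measure can be computed as follows. It assumes the measure is the weighted average of the proportion of zero cells, one minus the entropy divided by log K, and a population-size term of the form -(1/sqrt(N)) log(1/(e sqrt(N))). The function names and the convention 0 log 0 = 0 are assumptions made for this example.

```python
import math


def entropy(counts):
    """H(F/N) = (N log N - sum_i F_i log F_i) / N, taking 0 log 0 as 0."""
    n = sum(counts)
    return (n * math.log(n) - sum(f * math.log(f) for f in counts if f > 0)) / n


def risk(counts, w1, w2):
    """Weighted average of three terms; the third weight is 1 - w1 - w2."""
    k = len(counts)
    n = sum(counts)
    zero_term = sum(1 for f in counts if f == 0) / k        # proportion of zeros
    entropy_term = 1.0 - entropy(counts) / math.log(k)      # 1 - H / log K
    # Population-size term; it is positive because log(1 / (e sqrt(N))) < 0.
    size_term = -(1.0 / math.sqrt(n)) * math.log(1.0 / (math.e * math.sqrt(n)))
    return w1 * zero_term + w2 * entropy_term + (1.0 - w1 - w2) * size_term


uniform = [5, 5, 5, 5]   # equal cells: entropy equals log K, entropy term is 0
degenerate = [0, 0, 7, 0]  # one populated cell: entropy 0, entropy term is 1
```

The two toy tables illustrate the bounds: the uniform table reaches the upper bound log K, while the table with a single populated cell has zero entropy and therefore the highest entropy-based risk.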


• Take into account perturbation that introduces random zeros:

• Adjust the first term by comparing the number of zeros before and after perturbation

• Smooth out perturbed cell counts based on their expectation under the transition matrix (lowers the second term)

Example: For random rounding, replace perturbed zeros with:

$$\frac{0\cdot n_0 + 1\cdot(2/3)\,n_1 + 2\cdot(1/3)\,n_2}{n'_0}$$

where $(n_0, n_1, n_2, n_3)$ are the frequencies of cell values and $(n'_0, n'_1, n'_2, n'_3)$ are the frequencies of perturbed cell values

• For sampling, smooth out sample counts by using the probabilistic log-linear Poisson model approach (Skinner and Shlomo 2008)

• Replace population counts in the entropy term by the fitted values $\hat{\mu}_k$

• Estimate the number of zeros by: $\sum_k \exp(-\hat{\mu}_k)$
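The two adjustments can be illustrated with a short sketch. It assumes unbiased random rounding to base 3, so a cell value of 1 becomes 0 with probability 2/3 and a value of 2 becomes 0 with probability 1/3, and it assumes a Poisson model in which a cell with fitted mean mu is zero with probability exp(-mu). The function names and the symbol mu are assumptions introduced for this example.

```python
import math


def smoothed_zero_value(n0, n1, n2, n0_perturbed):
    """Expected original count behind a perturbed zero (rounding to base 3).

    n0, n1, n2 are the numbers of cells with original values 0, 1 and 2;
    n0_perturbed is the number of zero cells after rounding.
    """
    return (0 * n0 + 1 * (2 / 3) * n1 + 2 * (1 / 3) * n2) / n0_perturbed


def expected_zeros(fitted_means):
    """Expected number of zero cells under independent Poisson cell counts."""
    return sum(math.exp(-mu) for mu in fitted_means)


# Illustration with 1534 zeros, 44 ones and 35 twos. The perturbed zero count
# is set to its expectation here (an assumption; in practice it is observed).
n0, n1, n2 = 1534, 44, 35
n0_after = n0 + (2 / 3) * n1 + (1 / 3) * n2
value = smoothed_zero_value(n0, n1, n2, n0_after)  # roughly 0.033
```

Replacing each perturbed zero by this small positive value reflects that some perturbed zeros were originally ones or twos, so they should not be treated as certain zeros in the entropy term.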


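• Utility measure: Hellinger's Distance

$$HD(F, F') = \sqrt{\frac{1}{2}\sum_{k=1}^{K}\left(\sqrt{F_k} - \sqrt{F'_k}\right)^2}$$

where $F = \{F_1, F_2, \ldots, F_K\}$ are the original counts and $F' = \{F'_1, F'_2, \ldots, F'_K\}$ are the perturbed counts

• Hellinger's Distance is bounded by 0 and $\sqrt{N}$ and can be used to compare SDC methods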
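A minimal sketch of Hellinger's Distance on cell counts, assuming the form sqrt(1/2 * sum_k (sqrt(F_k) - sqrt(F'_k))^2) with both tables listed in the same cell order. The function name is an assumption for this example.

```python
import math


def hellinger_distance(original, perturbed):
    """Hellinger's Distance between two tables of non-negative cell counts."""
    if len(original) != len(perturbed):
        raise ValueError("tables must have the same number of cells")
    squared = sum((math.sqrt(f) - math.sqrt(g)) ** 2
                  for f, g in zip(original, perturbed))
    return math.sqrt(0.5 * squared)


same = hellinger_distance([4, 0, 9], [4, 0, 9])   # identical tables
disjoint = hellinger_distance([4, 0], [0, 4])     # no overlap, N = 4
```

Identical tables give the lower bound 0, and two tables with the same total N and no overlapping populated cells give the upper bound sqrt(N), which is 2 in the second call.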


• Population N=1,500,000

- NUTS2 region: 2 regions

- Gender: 2 categories

- Banded age groups: 21 categories

- Current activity status: 5 categories

- Occupation: 13 categories

- Educational attainment: 9 categories

- Country of citizenship: 5 categories

• Calculate cell proportions from 2001 UK Census via iterative proportional fitting

• All proportions multiplied by population size and rounded

Example: Simulation Hypercube

Cell Value     Number of Cells   Percentage of Cells
0                      226,939                92.36%
1                        4,028                 1.64%
2                        2,112                 0.86%
3-5                      2,964                 1.21%
6-8                      1,664                 0.68%
9-10                       720                 0.29%
11 and over              7,273                 2.96%
Total                  245,700               100.00%


• Define a 3-dimensional table (banded age group by education group by occupation group), with one further variable defining the population: NUTS2=1

• Table has 2,457 cells, 854,539 individuals, average cell size of 347.8

• For comparison, we carry out a semi-controlled random rounding to base 3 on the output table calculated from original data

Flexible Table Generating Servers

Cell Value     Number of Cells   Percentage of Cells
0                        1,534                62.43%
1                           44                 1.79%
2                           35                 1.42%
3                           27                 1.10%
4                           20                 0.81%
5 and over                 797                32.44%
Total                    2,457               100.00%


• Random record swapping: select 5% of the individuals within a NUTS2 region and swap their LAU2 with that of a paired individual, so that a total of 10% of individuals are swapped

• Semi-controlled random rounding to base 3 controlled for two NUTS2 totals

• Invariant PRAM with control of totals for two NUTS regions

• Perturbation applied to cell values 1 to 10; cell values of 11 and over are not perturbed

• Low entropy, i.e. cells perturbed to neighbouring cells only

• Risk measure: weights: .1, .7 (small cells), .1, .1

• Adjust measure for perturbations by transition matrix

• Sample based measure: all 2 way interaction log-linear model (entropy term: population 0.318, sample 0.323, estimate 0.319)

SDC Methods for Hypercube
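As an illustration of the rounding method listed above, here is a minimal sketch of unbiased random rounding to base 3 for a single cell. It does not implement the "semi-controlled" part, i.e. the control of totals, and the function name and the `rng` parameter are assumptions for this example.

```python
import random


def random_round(count, base=3, rng=random):
    """Round a non-negative count to a multiple of `base`, unbiased in expectation."""
    residual = count % base
    if residual == 0:
        return count  # multiples of the base are left unchanged
    # Round up with probability residual/base and down otherwise, so that the
    # expected value of the rounded count equals the original count.
    if rng.random() < residual / base:
        return count + (base - residual)
    return count - residual


rng = random.Random(0)  # seeded only to make the illustration repeatable
rounded = [random_round(c, 3, rng) for c in range(12)]
```

With base 3, a count of 1 goes to 0 with probability 2/3 and to 3 with probability 1/3, and a count of 2 goes to 0 with probability 1/3 and to 3 with probability 2/3, which is the transition structure used when smoothing perturbed zeros.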


Results

                                      Disclosure Risk   Hellinger's Distance
Perturbed Input
  Original                                      0.352                      -
  1:50 sample table                             0.425                  59.05
  Swapping                                      0.351                   6.47
  Semi-controlled Random Rounding               0.237                   7.97
  Stochastic Perturbation                       0.230                  14.12
Perturbed Output
  Semi-Controlled Random Rounding               0.233                   5.90

• Record swapping applied to the hypercube did little to reduce disclosure risk since small cells remain, and utility is high

• Stochastic perturbation has lower disclosure risk but low utility

• Semi-controlled random rounding also reduces disclosure risk and has good utility, but consistency and additivity need to be ensured, which could lower utility

• Comparing the rounding before and after shows that SDC ‘on the fly’ has lower disclosure risk and the highest utility out of all the methods since perturbation is not confounded

• The sample based risk measure resulted in a higher risk value (future work), and the sample table has very low utility


Discussion

• While agencies can claim there is uncertainty in the tables from record swapping, there is little actual reduction in disclosure risk, which is problematic when disseminating tables freely over the internet

• Record swapping and the proposed stochastic perturbation have little impact on disclosure by differencing since they leave original counts in the table

• Perturbative methods where all cells are perturbed can provide more protection and can be made differentially private

• To avoid confounding SDC methods, apply perturbative method ‘on the fly’ within the table generating server on final output table

• Using stochastic perturbative methods allows users to account for the perturbation in their analysis

• Future research: improve SDC methods for additivity and consistency; consider conditional entropy to account for perturbation and sampling


Thank you for your attention
