11 Assessing Disclosure Risk in Sample Microdata Under Misclassification Natalie Shlomo Southampton Statistical Sciences Research Institute University

11

Assessing Disclosure Risk in Sample Microdata Under

Misclassification

Natalie Shlomo Southampton Statistical Sciences Research InstituteUniversity of [email protected]

This is joint work with Prof Chris Skinner

mailto:[email protected]

22

Topics Covered

• Introduction and motivation

• Disclosure risk assessment for sample microdata: • Probabilistic modelling extended for

misclassification• Probabilistic record linkage – linking the

frameworks

• Risk-utility framework for microdata subject to misclassification

• Discussion

33

• Disclosure risk scenario: ‘intruder’ attack on microdata through linking to available public data sources

Linkage via identifying key variables common to both sources, eg. sex, age, region, ethnicity

• Agencies limit risk of identification through statistical disclosure limitation (SDL) methods:• Non-perturbative – sub-sampling, recoding and

collapsing categories of key variables, deleting variables

• Perturbative – data swapping, additive noise, misclassification (PRAM) and synthetic data

Introduction

44

• Need to quantify the risk of identification to inform microdata release

• Probabilistic models for quantifying risk of identification based on population uniqueness on identifying key variables

• Population counts in contingency tables spanned by key variables unknown

• Distribution assumptions to draw inference from the sample for estimating population parameters

• Assumes that microdata has not been altered either through misclassification arising from data processes or purposely introduced for SDL

Introduction

55

• Expand probabilistic modelling for quantifying risk of identification under misclassification/perturbation

• For perturbative methods of SDL, risk assessment typically based on probabilistic record linkage • Conservative assessment of risk of identification • Assumes that intruder has access to original dataset

and does not take into account protection afforded by sampling

• Fit probabilistic record linkage into the probabilistic modelling framework for categorical matching variables

Introduction

6

Disclosure Risk Assessment

Probabilistic Modelling – no misclassification

• Let denote a q-way frequency table which is a sample from a population table where indicates a cell population count and sample count in cell

• Disclosure risk measure:

• For unknown population counts, estimate from the conditional distribution of

kF

kkf

k

kk FfI )1,1(1 k k

k FfI

1)1(2

)1|1(ˆ)1(1 kkk

k fFPfI )1|1

(ˆ)1(ˆ2 kkk

k fF

EfI

kk fF |

}{f kf

}{F kF),...,( 1 qkkk

7


• Natural assumption:

Bernoulli sampling:

It follows that: and

where are conditionally independent

)(~ kk PoissonF

),(~| kkkk FBinFf

)(~ kkk Poissonf

))1((~| kkkk PoissonfF

kk fF |

is the sampling fraction in cell kk

8


• Skinner and Holmes, 1998, Elamir and Skinner, 2006 use log linear models to estimate parameters

• Sample frequencies are independent Poisson distributed with a mean of

• Log-linear model for estimating expressed as:

where design matrix of key variables and their interactions

• MLE’s calculated by solving score function:

}{ k

kfkkk

kk x)log(

0x)]xexp([ kkkk

f

}{ k

x

9


• Fitted values calculated by: and

• Individual risk measures estimated by:

• Skinner and Shlomo (2009) develop goodness of fit criteria which minimize the bias of disclosure risk estimates, for example, for

)êxp(ˆ kku xk

kk

u

ˆˆ

))1(êxp()1|1(ˆkkkk fFP

)]1(ˆ/[))]1(êxp(1[)1|1

(ˆkkkkk

k

fF

E

1

)}2/(])ˆ)[(1()ˆ){(1)(êxp(ˆˆ 21 kkkk

kkkkkkk fffB

10


• Criteria related to tests for over and under-dispersion:

• over-fitting - sample marginal counts produce too many random zeros, leading to expected cell counts too high for non-zero cells and under-estimation of risk

• under-fitting - sample marginal counts don’t take into account structural zeros, leading to expected cell counts too low for non-zero cells and over-estimation of risk

• Criteria selects the model using a forward search algorithm which minimizes for where is the variance of ii vB ˆ/ˆ 2,1,ˆ ii

i iB

11


Example: Population of 944,793 from UK 2001 Census SRS sample size 9,448

Key: Area (2), Sex (2), Age (101), Marital Status (6), Ethnicity (17), Economic Activity (10) - 412,080 cells

Model Selection:

Starting solution: main-effects log-linear model which indicates under-fitting (minimum error statistics too large) Add in higher interaction terms until minimum error statistics indicate fit

12

Model Search Example (SRS n=9,448) True values ,

Independence - I 386.6 701.2 48.54 114.19

All 2 way - II 104.9 280.1 -1.57 -2.65

1: I + {a*ec} 243.4 494.3 54.75 59.22

2: 1 + {a*et} 180.1 411.6 3.07 9.82

3: 2 + {a*m} 152.3 343.3 0.88 1.73

4: 3 + {s*ec} 149.2 337.5 0.26 0.92

5a: 4 + {ar*a} 148.5 337.1 -0.01 0.84

5b: 4 + {s*m} 147.7 335.3 0.02 0.66

6b: 5b + {ar*a} 147.0 335.0 -0.24 0.56

6c: 5b + {ar*m} 148.9 337.1 -0.04 0.72

6d: 5b + {m*ec} 146.3 331.4 -0.24 0.03

7c: 6c + {m*ec} 147.5 333.2 -0.34 0.06

7d: 6d + {ar*a} 145.6 331.0 -0.44 -0.03

Area–ar, Sex-s, Age–a, Marital Status–m, Ethnicity–et, and Economic Activity-ec

,

159~1 9.355~

2

11 /ˆ vB 22 /ˆ vB1 2

13

Model Search ExamplePreferred Model: {a*ec}{a*et}{a*m}(s*ec}{ar*a}

True Global Risk: Estimated Global Risk

5.1481 1.337ˆ2

0.001

0.01

0.1

1

0.001 0.01 0.1 1

Log-scale

159~1 9.355~

2

Estimated per-record risk measure

True risk measure

14


• Skinner and Shlomo, 2009 address complex survey designs: • Sampling clusters introduce dependencies - key

variables cut across clusters and assumption holds in practice

• Stratification – include strata id in key variables to account for differential inclusion probabilities

• Survey weights - Use pseudo maximum likelihood estimation where score function modified to:

changed to: where

• Partition large tables and assess partitions separately (assumes that partitioning variable has an interaction with other key variables)

0)]exp(ˆ[ k

kkk xxF ˆˆ /k k kf F

ki

ik wFk

15

Disclosure Risk Assessment Under

Misclassification

• Model assumes no misclassification errors either arising from data processes or purposely introduced for SDL

• Shlomo and Skinner, 2010 address misclassification errors

Let: where cross-classified key variables: in population fixed in microdata subject to misclassification

)|~

( jXkXPM kj

X

X

X~

16

Disclosure Risk Assessment Under

misclassification

• The per-record disclosure risk measure of a match of external unit B to a unique record in microdata A that has undergone misclassification:

(1)

• For small misclassification and small sampling fractions: or (2)

• Global measure: estimated by:

(3)

where per-record risk:

kj

kjkjj

kkkkk FMMF

MMfBAP

1

)1/(

)1/()1

~|(

j

kjj

kk

MF

M

k

kk

F

M~

k k

kkk

F

MfI ~)1(2

k

k

kkk

k fF

EMfI~

|~1ˆ)1

~(ˆ2

1

~|~

1ˆk

k

kk fF

EM

17

PRAM ( Post-randomisation method)

• Probability transition matrix containing conditional probabilities for a category c:

• Let T be a vector of frequencies

• On each record, category c changed or not changed according to and the result of a draw of a random variate u

• vector of perturbed frequencies

• Unbiased moment estimator of the original data: assuming has an inverse (dominant on the diagonals)

Perturbation Methods of SDL

ckjM

cM

*T

cM

)1(*ˆ cMTTcM

)|~

( jXkXPM ccckj

18

PRAM ( Post-randomisation method) - cont.

• Invariant PRAM - Define:

(vector of the original frequencies eigenvector of )

• To ensure correct non-perturbation probability on diagonal, define

• Expected values of marginal distribution preserved

• Exact marginal distribution preserved using a without replacement selection strategy to select records for perturbation


TTM c

cM

ˆc c cR M Q

* (1 )c cR R I

l

cckl

cckj

cccjk lXPMjXPMkXjXPQ )](/[)()

~|(

1919

• Exchange values of a key variable between pairs of records

• Pairs selected within control strata to minimize bias• Typically geographical variable is swapped within a

large area:• Geography highly identifiable• Conditional independence assumption usually met

(sensitive variables relatively independent of geography)

• Does not produce inconsistent records• Marginal distributions preserved at higher

geographies• Can be targeted to high risk records

Random Record Swapping


2020

• Population of individuals from 2001 United Kingdom (UK) Census N=1,468,255

• 1% srs sample n=14,683

• Six key variables: Local Authority (LAD) (11), sex (2), age groups (24), marital status (6), ethnicity (17), economic activity (10) K=538,560.

Misclassification Example

2121

• Record Swapping: LAD swapped randomly, eg. for a 20% swap:

Diagonal: Off diagonal: where is the number of records in the sample from LAD k

• Pram: LAD misclassified, eg. for a 20% misclassification

Diagonal: Off diagonal: Parameter:


kn

55.0

8.0ckkM

kl lkckj nnM /2.0

8.0ckkM

)10/2.0(02.0ckjM

22

• Random 20% perturbation on LAD• Global risk measures: Expected correct matches

from SU’s


Global Risk Measure PRAM Swapping

True risk measure in original sample

358.1 362.4

Estimated naïve risk measure ignoring misclassification

349.5 358.6

Risk measure on non-perturbed records

292.2 292.8

Risk measure under misclassification (1)

Sample uniques

299.7

2,779

298.9

2,831

Approximation based on diagonals (2)

299.8 298.9

Estimated risk measure under misclassification (3)

283.1 286.8

Expected correct match per sample unique: Pram: 10.8% Record swapping: 10.6%

ckkM

2323

• Estimating individual per-record risk measures for 20% random swap based on log linear modelling (log scale):

• From perspective of intruder, difficult to identify high risk (population unique) records


Estimated Risk Measure (3)

Risk Measure (1)

24

• Utility measured by whether inference can be carried out on perturbed data similar to original data

• Use proxy information loss measures on distributions calculated from microdata:

• Distance Metrics: where number of cells in distribution Let

measure of average absolute perturbation compared to average cell size

Also, can consider Kolmogorov-Smirnov statistic,

Hellinger’s Distance and relative differences in means or variances

Information Loss Measures

RCcrDcrDDDAADcr origpertpertorig /|),(),(|),(

,

RC

avgavgpertorig DAADDDDRAAD /)(100),(

cr origavg RCcrDD,

/),(

2525

• Pram record swapping Risk-Utility Map

jjjj FfBA~

/)1~

|Pr(

jF

X j

SUj k jkjkkjjjj F ])1/(/[)]1/([

)1~

|~

/1(ˆ jj fFE

Random perturbation versus Targeted perturbation on non-white ethnicities

2626

• value of vector of cross-classified identifying key variables for unit in the microdata ( )

• corresponding value for unit in the external database ( ) ( )

• Misclassification mechanism via probability matrix:

• Comparison vector for pairs of units• For subset partition set of pairs in Matches (M) Non-matches (U) through likelihood ratio: where

Probabilistic Record Linkage

aX~

1saa

bX b

2sb

kjaa MjXkXP )|~

(

),~

( ba XX21),( ssba

21~ sss s~

)),(|),~

(()( MbaXXPm ba

)),(|),~

(()( UbaXXPu ba

Ps 2

)(/)( um

2727

• probability that pair is in M• Probability of a correct match:

• Estimate parameters using previous test data or EM algorithm and assuming conditional independence

Probabilistic Record Linkage

)),(( MbaPp

)]1)(()(/[)()),~

(|),((| pupmpmXXMbaPp baM

)),(|),~

(()....),(|),~

(()),(|),~

((

)),(|),~

(()(

21 MbaXXPMbaXXPMbaXXP

MbaXXPm

baKbaba

ba

2828

• No misclassification

Linking the Two Frameworks

)1()1( jj FfNn jfn jjFfNn

)1( jj Ffjf jj Ff

)1( Nn

Non-match Match Total

Disagree

Agree

Total n Nn

jjjj

jM FNnFfNnfN

nfNp

1

)1(/)1()/11(//1

//1|

nfm j /)( )1(/)1()( NnFfu jj Np /1

2929

• With misclassification, denote


jk

kjkjjjj fMfMf~

jjjjj fMFfnNn ~

jjj fMn jj FfNn~

jjjjj fMFf ~

jjj fMjj Ff

~

nNn n Nn


Disagree

Agree

Total

j

jj

j

jj

jjjjjjjj

jjjM

F

M

f

M

NnfMFfNnfMN

nfMNp ~~

)1(/)~

)(/11(//1

//1|

nfMm jjj /)( )1(/)~

()( NnfMFfu jjjjj Np /1

3030

• Matching 2,853 sample uniques to the population and blocking on all key variables except LAD result

in 1,534,293 possible pairs

• On average across blocks, probability of a correct match given an agreement on LAD



Disagree LAD 1,388,069 619 1,388,688

Agree LAD 143,321 2,234 145,555

Total 1,531,390 2,853 1,534,293

78.0)( m 09.0)( u 002.0p

015.0| Mp

3131

• Probability of a correct match given an agreement for each

• Compare to risk measure

• Summing over the global disclosure risk measure of 289.5.


jXX ba ),~

(

jjj FM~

/

|Mp

|Mp

3232

• Global disclosure risk measures accurately estimated for a risk-utility assessment assuming known non-misclassification probability

• Empirical evidence of connection between F&S record linkage and probabilistic modelling for estimating identification risk

• Estimation carried out through log linear modelling for the probabilistic modelling or the EM algorithm for the F&S record linkage

• Individual disclosure risk measures more difficult to estimate without knowing true population parameters in both frameworks

• From the perspective of the intruder, it is difficult to identify sample uniques that are population uniques

Discussion

33

• Statistical Agencies (MRP) need to: - Assess disclosure risk objectively - Set tolerable risk thresholds according to different access

modes - Optimize and combine SDL techniques - Provide guidelines on how to analyze disclosure

controlled datasets

• Future dissemination strategies presents new challenges: - Synthetic data for web access prior to accessing real data - Online SDL techniques for flexible table generating

software and remote access - Auditing query systems

• Bridge the Statistical and Computer Science literature on privacy preserving algorithms

Discussion

34

Thank you for your attention

Documents

11 Assessing Disclosure Risk in Sample Microdata Under Misclassification Natalie Shlomo Southampton Statistical Sciences Research Institute University