Assessing Disclosure Risk in Sample Microdata Under Misclassification
Natalie Shlomo, Southampton Statistical Sciences Research Institute, University of Southampton
This is joint work with Prof Chris Skinner

Topics Covered
• Introduction and motivation
• Disclosure risk assessment for sample microdata:
  • Probabilistic modelling extended for misclassification
  • Probabilistic record linkage – linking the frameworks
• Risk-utility framework for microdata subject to misclassification
• Discussion

Introduction
• Disclosure risk scenario: ‘intruder’ attack on microdata through linking to available public data sources
  Linkage via identifying key variables common to both sources, e.g. sex, age, region, ethnicity
• Agencies limit risk of identification through statistical disclosure limitation (SDL) methods:
  • Non-perturbative – sub-sampling, recoding and collapsing categories of key variables, deleting variables
  • Perturbative – data swapping, additive noise, misclassification (PRAM) and synthetic data

Introduction
• Need to quantify the risk of identification to inform microdata release
• Probabilistic models for quantifying risk of identification based on population uniqueness on identifying key variables
• Population counts in contingency tables spanned by key variables unknown
• Distribution assumptions to draw inference from the sample for estimating population parameters
• Assumes that microdata has not been altered either through misclassification arising from data processes or purposely introduced for SDL

Introduction
• Expand probabilistic modelling for quantifying risk of identification under misclassification/perturbation
• For perturbative methods of SDL, risk assessment typically based on probabilistic record linkage
  • Conservative assessment of risk of identification
  • Assumes that intruder has access to original dataset and does not take into account protection afforded by sampling
• Fit probabilistic record linkage into the probabilistic modelling framework for categorical matching variables

Disclosure Risk Assessment
Probabilistic Modelling – no misclassification
• Let $f = \{f_k\}$ denote a q-way frequency table which is a sample from a population table $F = \{F_k\}$ where $k = (k_1, \ldots, k_q)$ indicates a cell, $F_k$ population count and $f_k$ sample count in cell $k$
• Disclosure risk measure:
  $\tau_1 = \sum_k I(f_k = 1, F_k = 1)$, $\tau_2 = \sum_k I(f_k = 1)\,\frac{1}{F_k}$
• For unknown population counts, estimate from the conditional distribution of $F_k \mid f_k$:
  $\hat\tau_1 = \sum_k I(f_k = 1)\,\hat P(F_k = 1 \mid f_k = 1)$, $\hat\tau_2 = \sum_k I(f_k = 1)\,\hat E\!\left(\frac{1}{F_k} \,\middle|\, f_k = 1\right)$
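
Disclosure Risk Assessment
• Natural assumption: $F_k \sim \mathrm{Poisson}(\lambda_k)$
  Bernoulli sampling: $f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$
  It follows that: $f_k \sim \mathrm{Poisson}(\pi_k \lambda_k)$ and $F_k \mid f_k \sim \mathrm{Poisson}(\lambda_k (1 - \pi_k))$
  where $F_k \mid f_k$ are conditionally independent
  $\pi_k$ is the sampling fraction in cell $k$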

Disclosure Risk Assessment
• Skinner and Holmes, 1998, Elamir and Skinner, 2006 use log linear models to estimate parameters $\{\lambda_k\}$
• Sample frequencies $\{f_k\}$ are independent Poisson distributed with a mean of $u_k = \pi_k \lambda_k$
• Log-linear model for estimating $\{u_k\}$ expressed as: $\log(u_k) = x_k'\beta$
  where $x$ design matrix of key variables and their interactions
• MLE’s calculated by solving score function: $\sum_k [f_k - \exp(x_k'\beta)]\, x_k = 0$
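
Disclosure Risk Assessment
• Fitted values calculated by: $\hat u_k = \exp(x_k'\hat\beta)$ and $\hat\lambda_k = \hat u_k / \pi_k$
• Individual risk measures estimated by:
  $\hat P(F_k = 1 \mid f_k = 1) = \exp(-\hat\lambda_k (1 - \pi_k))$
  $\hat E\!\left(\frac{1}{F_k} \,\middle|\, f_k = 1\right) = [1 - \exp(-\hat\lambda_k (1 - \pi_k))] / [\hat\lambda_k (1 - \pi_k)]$
• Skinner and Shlomo (2009) develop goodness of fit criteria which minimize the bias of disclosure risk estimates, for example, for $\hat\tau_1$:
  $\hat B_1 = \sum_k \hat\lambda_k \exp(-\hat u_k)(1 - \pi_k)\{(f_k - \hat u_k) + (1 - \pi_k)[(f_k - \hat u_k)^2 - f_k]/(2\pi_k)\}$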
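
The estimation chain of the last four slides, as an added example that is not part of the original slides: the main-effects (independence) log-linear model has a closed-form MLE, so it stands in here for the general score-equation solver. The 3 x 4 sample table and the equal sampling fraction 0.01 are constructed assumptions.

```python
import math

def fit_independence(table):
    """Fitted means u_hat_k under the main-effects log-linear model (closed-form MLE)."""
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return [[ri * cj / n for cj in cols] for ri in rows]

def per_record_risk(u_hat, pi):
    """Return (P(F_k=1 | f_k=1), E(1/F_k | f_k=1)) for one sample-unique cell."""
    lam = u_hat / pi                # lambda_hat_k = u_hat_k / pi_k
    rate = lam * (1.0 - pi)         # Poisson mean of the unsampled units in the cell
    if rate == 0.0:
        return 1.0, 1.0
    p_unique = math.exp(-rate)
    return p_unique, (1.0 - p_unique) / rate

def global_risk(table, pi):
    """tau1_hat and tau2_hat: sums of the per-record risks over cells with f_k == 1."""
    fitted = fit_independence(table)
    tau1 = tau2 = 0.0
    for f_row, u_row in zip(table, fitted):
        for f, u in zip(f_row, u_row):
            if f == 1:
                p_unique, e_inv = per_record_risk(u, pi)
                tau1 += p_unique
                tau2 += e_inv
    return tau1, tau2

# Constructed sample counts: rows = 3 age bands, columns = 4 regions.
SAMPLE = [[12, 9, 1, 0],
          [7, 1, 0, 1],
          [1, 0, 2, 1]]
PI = 0.01  # assumed equal sampling fraction in every cell

if __name__ == "__main__":
    print(global_risk(SAMPLE, PI))
```
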

Disclosure Risk Assessment
• Criteria related to tests for over and under-dispersion:
  • over-fitting - sample marginal counts produce too many random zeros, leading to expected cell counts too high for non-zero cells and under-estimation of risk
  • under-fitting - sample marginal counts don’t take into account structural zeros, leading to expected cell counts too low for non-zero cells and over-estimation of risk
• Criteria select the model using a forward search algorithm which minimizes $\hat B_i / \sqrt{\hat v_i}$ for $i = 1, 2$ where $\hat v_i$ is the variance of $\hat B_i$

Disclosure Risk Assessment
Example: Population of 944,793 from UK 2001 Census, SRS sample size 9,448
Key: Area (2), Sex (2), Age (101), Marital Status (6), Ethnicity (17), Economic Activity (10) - 412,080 cells
Model Selection:
• Starting solution: main-effects log-linear model which indicates under-fitting (minimum error statistics too large)
• Add in higher interaction terms until minimum error statistics indicate fit

Model Search Example (SRS n=9,448)
True values $\tau_1 = 159$, $\tau_2 = 355.9$

| Model | $\hat\tau_1$ | $\hat\tau_2$ | $\hat B_1/\sqrt{v_1}$ | $\hat B_2/\sqrt{v_2}$ |
|---|---|---|---|---|
| Independence - I | 386.6 | 701.2 | 48.54 | 114.19 |
| All 2 way - II | 104.9 | 280.1 | -1.57 | -2.65 |
| 1: I + {a*ec} | 243.4 | 494.3 | 54.75 | 59.22 |
| 2: 1 + {a*et} | 180.1 | 411.6 | 3.07 | 9.82 |
| 3: 2 + {a*m} | 152.3 | 343.3 | 0.88 | 1.73 |
| 4: 3 + {s*ec} | 149.2 | 337.5 | 0.26 | 0.92 |
| 5a: 4 + {ar*a} | 148.5 | 337.1 | -0.01 | 0.84 |
| 5b: 4 + {s*m} | 147.7 | 335.3 | 0.02 | 0.66 |
| 6b: 5b + {ar*a} | 147.0 | 335.0 | -0.24 | 0.56 |
| 6c: 5b + {ar*m} | 148.9 | 337.1 | -0.04 | 0.72 |
| 6d: 5b + {m*ec} | 146.3 | 331.4 | -0.24 | 0.03 |
| 7c: 6c + {m*ec} | 147.5 | 333.2 | -0.34 | 0.06 |
| 7d: 6d + {ar*a} | 145.6 | 331.0 | -0.44 | -0.03 |

Area–ar, Sex–s, Age–a, Marital Status–m, Ethnicity–et, and Economic Activity–ec

Model Search Example
Preferred Model: {a*ec}{a*et}{a*m}{s*ec}{ar*a}
True Global Risk: $\tau_1 = 159$, $\tau_2 = 355.9$
Estimated Global Risk: $\hat\tau_1 = 148.5$, $\hat\tau_2 = 337.1$
[Figure: estimated per-record risk measure versus true risk measure, log scale]

Disclosure Risk Assessment
• Skinner and Shlomo, 2009 address complex survey designs:
  • Sampling clusters introduce dependencies - key variables cut across clusters and assumption holds in practice
  • Stratification – include strata id in key variables to account for differential inclusion probabilities
  • Survey weights - Use pseudo maximum likelihood estimation where score function modified to: $\sum_k [\hat F_k - \exp(x_k'\beta)]\, x_k = 0$
    $\pi_k$ changed to: $\hat\pi_k = f_k / \hat F_k$ where $\hat F_k = \sum_{i \in k} w_i$
• Partition large tables and assess partitions separately (assumes that partitioning variable has an interaction with other key variables)

Disclosure Risk Assessment Under Misclassification
• Model assumes no misclassification errors either arising from data processes or purposely introduced for SDL
• Shlomo and Skinner, 2010 address misclassification errors
  Let: $M_{kj} = P(\tilde X = k \mid X = j)$
  where cross-classified key variables: $X$ in population fixed, $\tilde X$ in microdata subject to misclassification
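
Disclosure Risk Assessment Under Misclassification
• The per-record disclosure risk measure of a match of external unit B to a unique record in microdata A that has undergone misclassification:
  $P(A = B \mid \tilde f_k = 1) = \dfrac{M_{kk}/(1 - \pi M_{kk})}{\sum_j F_j M_{kj}/(1 - \pi M_{kj})}$   (1)
• For small misclassification and small sampling fractions: $\dfrac{M_{kk}}{\sum_j F_j M_{kj}}$ or $\dfrac{M_{kk}}{\tilde F_k}$   (2)
• Global measure: $\tau_2 = \sum_k I(\tilde f_k = 1)\,\dfrac{M_{kk}}{\tilde F_k}$ estimated by:
  $\hat\tau_2 = \sum_k I(\tilde f_k = 1)\, M_{kk}\, \hat E\!\left(\frac{1}{\tilde F_k} \,\middle|\, \tilde f_k\right)$   (3)
  where per-record risk: $M_{kk}\, \hat E\!\left(\frac{1}{\tilde F_k} \,\middle|\, \tilde f_k = 1\right)$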
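
Measure (1) beside the first approximation in (2) and the naive 1/F_k, as an added example that is not part of the original slides. The matrix follows the 20% PRAM design used in the misclassification example on this page (0.8 on the diagonal, 0.2/10 elsewhere); the population counts, sampling fraction and list of sample-unique cells are constructed assumptions.

```python
def pram_matrix(c, stay):
    """M[k][j] = P(X~ = k | X = j): keep the category with prob. `stay`, else move uniformly."""
    off = (1.0 - stay) / (c - 1)
    return [[stay if k == j else off for j in range(c)] for k in range(c)]

def risk_exact(M, F, pi, k):
    """Measure (1): probability that a unique perturbed record in cell k is a correct match."""
    num = M[k][k] / (1.0 - pi * M[k][k])
    den = sum(F[j] * M[k][j] / (1.0 - pi * M[k][j]) for j in range(len(F)))
    return num / den

def risk_diagonal(M, F, k):
    """Approximation for small misclassification and small pi: M_kk / sum_j F_j M_kj."""
    return M[k][k] / sum(F[j] * M[k][j] for j in range(len(F)))

def global_risk(M, F, pi, uniques):
    """Sum (1), the approximation and the naive 1/F_k over the sample-unique cells."""
    exact = sum(risk_exact(M, F, pi, k) for k in uniques)
    approx = sum(risk_diagonal(M, F, k) for k in uniques)
    naive = sum(1.0 / F[k] for k in uniques if F[k] > 0)
    return exact, approx, naive

# Constructed population counts for the 11 LAD categories within one
# combination of the other (unperturbed) key variables.
F_POP = [1, 3, 12, 2, 5, 1, 40, 4, 2, 1, 7]
PI = 0.01                  # assumed sampling fraction
UNIQUES = [0, 3, 5, 9]     # assumed cells holding exactly one perturbed sample record
M_PRAM = pram_matrix(11, 0.8)

if __name__ == "__main__":
    print(global_risk(M_PRAM, F_POP, PI, UNIQUES))
```

With these inputs the naive measure, which ignores the perturbation, is well above both versions of the measure under misclassification, and the diagonal approximation stays close to (1).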

Perturbation Methods of SDL
PRAM (Post-randomisation method)
• Probability transition matrix $M^c$ containing conditional probabilities $M^c_{kj} = P(\tilde X^c = k \mid X^c = j)$ for a category c
• Let T be a vector of frequencies
• On each record, category c changed or not changed according to $M^c$ and the result of a draw of a random variate u
• $T^*$ vector of perturbed frequencies
• Unbiased moment estimator of the original data: $\hat T = T^* (M^c)^{-1}$ assuming $M^c$ has an inverse (dominant on the diagonals)

Perturbation Methods of SDL
PRAM (Post-randomisation method) - cont.
• Invariant PRAM - Define: $T M^c = T$ (vector of the original frequencies eigenvector of $M^c$)
• To ensure correct non-perturbation probability on diagonal, define
  $R^c = M^c \hat Q^c$, $R^{*c} = \alpha R^c + (1 - \alpha) I$
  $Q^c_{jk} = P(X^c = j \mid \tilde X^c = k) = M^c_{kj} P(X^c = j) / [\sum_l M^c_{kl} P(X^c = l)]$
• Expected values of marginal distribution preserved
• Exact marginal distribution preserved using a without replacement selection strategy to select records for perturbation
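
The invariant-PRAM construction and the moment estimator from the two PRAM slides, as an added example that is not part of the original slides. It stores the transition matrix with rows as original categories, P[j][k] = P(perturbed = k | original = j), so that frequency vectors multiply from the left; the three-category counts and the starting matrix are constructed, and alpha = 0.55 is the parameter value quoted on this page.

```python
def bayes_reverse(P, t):
    """Q[k][j] = P(X = j | X~ = k), by Bayes' rule from P and the original counts t."""
    idx = range(len(t))
    col = [sum(P[l][k] * t[l] for l in idx) for k in idx]
    return [[P[j][k] * t[j] / col[k] for j in idx] for k in idx]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, c)) for c in zip(*B)] for row in A]

def invariant_pram(P, t, alpha):
    """R* = alpha * (P Q) + (1 - alpha) * I, which satisfies t R* = t."""
    R = matmul(P, bayes_reverse(P, t))
    n = len(t)
    return [[alpha * R[i][j] + (1 - alpha) * (i == j) for j in range(n)]
            for i in range(n)]

def expected_counts(t, P):
    """Expected perturbed frequencies t P."""
    return [sum(t[i] * P[i][j] for i in range(len(t))) for j in range(len(t))]

def moment_estimate(t_star, P):
    """T_hat = T* P^{-1}: solve T_hat P = T* by Gauss-Jordan elimination."""
    n = len(P)
    A = [[P[i][j] for i in range(n)] + [t_star[j]] for j in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for r in range(n):
            if r != c:
                A[r] = [vr - A[r][c] * vc for vr, vc in zip(A[r], A[c])]
    return [row[n] for row in A]

# Constructed original frequencies and a plain (non-invariant) PRAM matrix.
T = [600, 300, 100]
P = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]

if __name__ == "__main__":
    print(expected_counts(T, P))                           # margins shift under plain PRAM
    print(expected_counts(T, invariant_pram(P, T, 0.55)))  # margins preserved
    print(moment_estimate(expected_counts(T, P), P))       # original counts recovered
```
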

Perturbation Methods of SDL
Random Record Swapping
• Exchange values of a key variable between pairs of records
• Pairs selected within control strata to minimize bias
• Typically geographical variable is swapped within a large area:
  • Geography highly identifiable
  • Conditional independence assumption usually met (sensitive variables relatively independent of geography)
  • Does not produce inconsistent records
  • Marginal distributions preserved at higher geographies
  • Can be targeted to high risk records

Misclassification Example
• Population of individuals from 2001 United Kingdom (UK) Census N=1,468,255
• 1% srs sample n=14,683
• Six key variables: Local Authority (LAD) (11), sex (2), age groups (24), marital status (6), ethnicity (17), economic activity (10) K=538,560.

Misclassification Example
• Record Swapping: LAD swapped randomly, e.g. for a 20% swap:
  Diagonal: $M^c_{kk} = 0.8$
  Off diagonal: $M^c_{kj} = 0.2\, n_k / \sum_{l \neq k} n_l$ where $n_k$ is the number of records in the sample from LAD k
• Pram: LAD misclassified, e.g. for a 20% misclassification
  Diagonal: $M^c_{kk} = 0.8$
  Off diagonal: $M^c_{kj} = 0.02$ (0.2/10)
  Parameter: $\alpha = 0.55$

Misclassification Example
• Random 20% perturbation on LAD
• Global risk measures: Expected correct matches from SU’s

| Global Risk Measure | PRAM | Swapping |
|---|---|---|
| True risk measure in original sample $\tau_2$ | 358.1 | 362.4 |
| Estimated naïve risk measure ignoring misclassification | 349.5 | 358.6 |
| Risk measure on non-perturbed records | 292.2 | 292.8 |
| Risk measure under misclassification (1) | 299.7 | 298.9 |
| Sample uniques | 2,779 | 2,831 |
| Approximation based on diagonals $M^c_{kk}$ (2) | 299.8 | 298.9 |
| Estimated risk measure under misclassification (3) | 283.1 | 286.8 |

Expected correct match per sample unique: Pram: 10.8% Record swapping: 10.6%

Misclassification Example
• Estimating individual per-record risk measures for 20% random swap based on log linear modelling (log scale):
  [Figure: Estimated Risk Measure (3) versus Risk Measure (1)]
• From perspective of intruder, difficult to identify high risk (population unique) records

Information Loss Measures
• Utility measured by whether inference can be carried out on perturbed data similar to original data
• Use proxy information loss measures on distributions calculated from microdata:
• Distance Metrics: $AAD(D_{orig}, D_{pert}) = \sum_{c,r} |D_{pert}(c,r) - D_{orig}(c,r)| / RC$ where $RC$ number of cells in distribution
  Let $D_{avg} = \sum_{c,r} D_{orig}(c,r) / RC$
  $RAAD(D_{orig}, D_{pert}) = 100 \times (D_{avg} - AAD) / D_{avg}$ measure of average absolute perturbation compared to average cell size
  Also, can consider Kolmogorov-Smirnov statistic, Hellinger’s Distance and relative differences in means or variances

Risk-Utility Map
[Figure: risk-utility map for Pram and record swapping; random perturbation versus targeted perturbation on non-white ethnicities]

Probabilistic Record Linkage
• $\tilde X_a$ value of vector of cross-classified identifying key variables for unit $a$ in the microdata ($a \in s_1$)
• $X_b$ corresponding value for unit $b$ in the external database ($b \in s_2$) ($s_2 \subset P$)
• Misclassification mechanism via probability matrix: $P(\tilde X_a = k \mid X_a = j) = M_{kj}$
• Comparison vector $\gamma(\tilde X_a, X_b)$ for pairs of units $(a,b) \in s_1 \times s_2$
• For subset $\tilde s \subset s_1 \times s_2$ partition set of pairs in $\tilde s$: Matches (M), Non-matches (U) through likelihood ratio: $m(\gamma)/u(\gamma)$ where
  $m(\gamma) = P(\gamma(\tilde X_a, X_b) \mid (a,b) \in M)$
  $u(\gamma) = P(\gamma(\tilde X_a, X_b) \mid (a,b) \in U)$
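
Probabilistic Record Linkage
• $p = P((a,b) \in M)$ probability that pair is in M
• Probability of a correct match:
  $p_{M|\gamma} = P((a,b) \in M \mid \gamma(\tilde X_a, X_b)) = m(\gamma)\,p / [m(\gamma)\,p + u(\gamma)(1 - p)]$
• Estimate parameters using previous test data or EM algorithm and assuming conditional independence
  $m(\gamma) = P(\gamma(\tilde X_a, X_b) \mid (a,b) \in M) = P(\gamma_1(\tilde X_a, X_b) \mid (a,b) \in M)\, P(\gamma_2(\tilde X_a, X_b) \mid (a,b) \in M) \cdots P(\gamma_K(\tilde X_a, X_b) \mid (a,b) \in M)$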
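
Linking the Two Frameworks
• No misclassification

| | Non-match | Match | Total |
|---|---|---|---|
| Disagree | $n(N-1) - f_j(F_j - 1)$ | $n - f_j$ | $nN - f_j F_j$ |
| Agree | $f_j(F_j - 1)$ | $f_j$ | $f_j F_j$ |
| Total | $n(N-1)$ | $n$ | $nN$ |

$p_{M|\gamma} = \dfrac{(1/N)\, f_j/n}{(1/N)\, f_j/n + (1 - 1/N)\, f_j(F_j - 1)/(n(N-1))} = \dfrac{1}{F_j}$

$m(\gamma) = f_j/n$, $u(\gamma) = f_j(F_j - 1)/(n(N-1))$, $p = 1/N$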
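
Linking the Two Frameworks
• With misclassification, denote $\tilde f_j = M_{jj} f_j + \sum_{k \neq j} M_{jk} f_k$

| | Non-match | Match | Total |
|---|---|---|---|
| Disagree | $nN - n - \tilde f_j F_j + M_{jj} f_j$ | $n - M_{jj} f_j$ | $nN - \tilde f_j F_j$ |
| Agree | $\tilde f_j F_j - M_{jj} f_j$ | $M_{jj} f_j$ | $\tilde f_j F_j$ |
| Total | $nN - n$ | $n$ | $nN$ |

$p_{M|\gamma} = \dfrac{(1/N)\, M_{jj} f_j/n}{(1/N)\, M_{jj} f_j/n + (1 - 1/N)(\tilde f_j F_j - M_{jj} f_j)/(n(N-1))}$

$m(\gamma) = M_{jj} f_j/n$, $u(\gamma) = (\tilde f_j F_j - M_{jj} f_j)/(n(N-1))$, $p = 1/N$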

Linking the Two Frameworks
• Matching 2,853 sample uniques to the population and blocking on all key variables except LAD results in 1,534,293 possible pairs

| | Non-match | Match | Total |
|---|---|---|---|
| Disagree LAD | 1,388,069 | 619 | 1,388,688 |
| Agree LAD | 143,321 | 2,234 | 145,555 |
| Total | 1,531,390 | 2,853 | 1,534,293 |

$m(\gamma) = 0.78$, $u(\gamma) = 0.09$, $p = 0.002$
• On average across blocks, probability of a correct match given an agreement on LAD $p_{M|\gamma} = 0.015$
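
The link between the two frameworks on the last three slides, checked in exact arithmetic as an added example that is not part of the original slides. The m, u and p values are built from the cell entries of the 2x2 agreement tables; the per-cell inputs (f_j, F_j, M_jj and the perturbed count) in the usage lines are constructed, and the totals for the LAD table are recomputed from its four inner cells.

```python
from fractions import Fraction as Fr

def fs_posterior(m, u, p):
    """Probability of a correct match given the agreement pattern gamma."""
    return m * p / (m * p + u * (1 - p))

def from_counts(agree_match, agree_nonmatch, total_match, total_nonmatch):
    """m(gamma), u(gamma) and p from the counts of a 2x2 agreement table."""
    m = Fr(agree_match, total_match)
    u = Fr(agree_nonmatch, total_nonmatch)
    p = Fr(total_match, total_match + total_nonmatch)
    return m, u, p

def cell_posterior(f_j, F_j, n, N, M_jj=Fr(1), f_tilde_j=None):
    """p_{M|gamma} for agreement on cell j, using the table entries from the slides.

    With M_jj = 1 and f_tilde_j = f_j this is the no-misclassification table.
    """
    f_tilde_j = f_j if f_tilde_j is None else f_tilde_j
    agree_match = M_jj * f_j                        # M_jj f_j
    agree_nonmatch = f_tilde_j * F_j - agree_match  # f~_j F_j - M_jj f_j
    m = agree_match / n
    u = agree_nonmatch / (n * (N - 1))
    return fs_posterior(m, u, Fr(1, N))

if __name__ == "__main__":
    n, N = 14683, 1468255  # sample and population sizes from the example on this page
    # No misclassification: the record-linkage posterior equals 1 / F_j.
    print(cell_posterior(1, 7, n, N))                                  # 1/7
    # With misclassification it reduces to M_jj f_j / (f~_j F_j).
    print(cell_posterior(1, 7, n, N, M_jj=Fr(4, 5), f_tilde_j=2))      # 2/35
    # LAD table: agree/match, agree/non-match, match total, non-match total.
    m, u, p = from_counts(2234, 143321, 2853, 1531390)
    print(float(m), float(u), float(p), float(fs_posterior(m, u, p)))
```
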

Linking the Two Frameworks
• Probability of a correct match given an agreement $p_{M|\gamma}$ for each $\gamma(\tilde X_a, X_b) = j$
• Compare to risk measure $M_{jj} / \tilde F_j$
• Summing $p_{M|\gamma}$ over the global disclosure risk measure of 289.5.

Discussion
• Global disclosure risk measures accurately estimated for a risk-utility assessment assuming known non-misclassification probability
• Empirical evidence of connection between F&S record linkage and probabilistic modelling for estimating identification risk
• Estimation carried out through log linear modelling for the probabilistic modelling or the EM algorithm for the F&S record linkage
• Individual disclosure risk measures more difficult to estimate without knowing true population parameters in both frameworks
• From the perspective of the intruder, it is difficult to identify sample uniques that are population uniques

Discussion
• Statistical Agencies (MRP) need to:
  - Assess disclosure risk objectively
  - Set tolerable risk thresholds according to different access modes
  - Optimize and combine SDL techniques
  - Provide guidelines on how to analyze disclosure controlled datasets
• Future dissemination strategies present new challenges:
  - Synthetic data for web access prior to accessing real data
  - Online SDL techniques for flexible table generating software and remote access
  - Auditing query systems
• Bridge the Statistical and Computer Science literature on privacy preserving algorithms
Thank you for your attention