CS208: Applied Privacy for Data Science Membership & Other Attacks
James Honaker & Salil Vadhan School of Engineering & Applied Sciences
Harvard University
February 8, 2019
Motivation
• Last time: on a dataset with n individuals, releasing m = n counts with error E = o(√n) allows for reconstructing a 1 − o(1) fraction of sensitive attributes.
• Q: what happens if we allow error Ω(√n) ≤ E ≤ o(n)?
• A (today): if we release m = n² counts, can be vulnerable to “membership attacks”.
What is this √n threshold?
• If X = X_1 + ⋯ + X_n for independent random variables X_i, each with standard deviation σ, then the standard deviation of X is σ·√n.
• So the “sampling error” for a sum is typically Θ(√n).
• If the X_i’s are bounded (or “subgaussian”), then X will have Gaussian-like concentration around its mean μ:
  Pr[|X − μ| > t·√n] ≤ e^(−Ω(t²))   [Chernoff–Hoeffding bound]
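As a quick numerical sanity check (my sketch, not from the slides): simulating sums of n Bernoulli(1/2) variables shows both the σ·√n scaling of the standard deviation and the Gaussian-like tail from the Chernoff–Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 10_000, 2_000

# X = X_1 + ... + X_n with X_i ~ Bernoulli(1/2), so sigma = 1/2 per term
sums = rng.binomial(n, 0.5, size=trials)

emp_std = sums.std()             # should be close to sigma * sqrt(n)
pred_std = 0.5 * np.sqrt(n)      # = 50 for n = 10,000

# Hoeffding: Pr[|X - mu| > t*sqrt(n)] <= 2*exp(-2 t^2); at t = 2 this is ~6.7e-4
tail_frac = np.mean(np.abs(sums - n / 2) > 2 * np.sqrt(n))
print(emp_std, pred_std, tail_frac)
```

With 2,000 trials the empirical standard deviation lands within a fraction of a unit of the predicted 50, and essentially no samples fall outside the t = 2 band.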
Normalized Counts (i.e. Averages)
• If X = (X_1 + ⋯ + X_n)/n for independent random variables X_i, each with standard deviation σ, then the standard deviation of X is σ/√n.
• So the “sampling error” for an average is typically Θ(1/√n).
• If the X_i’s are bounded (or “subgaussian”), then X will have Gaussian-like concentration around its mean μ:
  Pr[|X − μ| > t/√n] ≤ e^(−Ω(t²))   [Chernoff–Hoeffding bound]
• This is why subsampling k out of n rows allows us to approximate m averages each to within ±O(√(log m / k)).
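The subsampling claim can be checked empirically. The following is an illustrative sketch (parameter values are mine): estimate m column means from k of n rows and compare the worst-case error against the √(log m / k) rate.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 100_000, 50, 1_000

# m binary attributes with assorted population frequencies
freqs = rng.uniform(0, 1, size=m)
X = rng.uniform(size=(n, m)) < freqs

full_means = X.mean(axis=0)

idx = rng.choice(n, size=k, replace=False)   # subsample k of the n rows
sub_means = X[idx].mean(axis=0)

max_err = np.abs(sub_means - full_means).max()
bound = np.sqrt(np.log(m) / k)               # the O(sqrt(log m / k)) rate, constant 1
print(max_err, bound)
```

The maximum error over all m columns stays within a small constant multiple of √(log m / k), even though each column only sees k of the n rows.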
Motivation
• Last time: on a dataset with n individuals, releasing m = n averages with error E = o(1/√n) allows for reconstructing a 1 − o(1) fraction of sensitive attributes.
• Q: what happens if we allow error Ω(1/√n) ≤ E ≤ o(1)?
• A (today): if we release m = n² counts, can be vulnerable to “membership attacks”.
PAUSE FOR INTRODUCTIONS
Membership Attacks: Setup

[Diagram: a Population; a data set X of n people (rows of binary attributes), with Alice’s data either IN or OUT of X; a Mechanism (stats, ML model, …) computes outputs; the Attacker, given the outputs plus auxiliary info (“aux”), outputs “In”/“Out”.]

Attacker gets:
• Access to mechanism outputs
• Alice’s data
• (Possibly) auxiliary info about population
Then decides: is Alice in the dataset X?

[slide based on one from Adam Smith]
Membership Attacks: Examples
• Genome-Wide Association Studies [Homer et al. ’08]
  – release frequencies of SNPs (individual positions)
  – determine whether Alice is in the “case group” [with a particular diagnosis]
• ML as a service [Shokri et al. ’17]
  – apply models trained on X to Alice’s data
Membership Attacks from Means

[Diagram: same setup with d attributes/predicates per person; here the mechanism releases the exact column means x̄ = (.5 .75 .5 .5 .75 .5 .25 .25 .5), and the attacker’s auxiliary info is aux = p.]

• Population = vector p = (p_1, …, p_d) of probabilities
  – j’th attribute = iid Bernoulli(p_j), independent across j
  – Adversary gets p (or a few random draws from the population)
  – Adversary gets a ≈ x̄ and p (or a few random draws)
  – Only assume that a = M(x) has |a_j − x̄_j| ≤ α w.h.p.
    (“Noise” need not be independent or unbiased.)
Membership Attacks from Noisy Means

[Diagram: same setup, but now a privacy mechanism releases noisy means a = (.54 .71 .49 .52 .80 .54 .20 .21 .45) ≈ x̄, and the attacker again has aux = p.]

• We are interested in α > 1/√n.
• In this regime, if p is known to the mechanism, the attack can be prevented. (Q: Why?)
• So we will assume random p_j’s (e.g. iid uniform in [0,1]).
Theorem [Dwork et al. ’15]: There is a constant c and an attacker A such that when d ≥ cn and α < min( √(d / O(n² log(1/δ))), 1/2 ), on input y (Alice’s data), a, and p:
• If Alice is IN, then Pr[A(y, a, p) = IN] ≥ Ω(1/(α²n)).   (true positive)
• If Alice is OUT, then Pr[A(y, a, p) = IN] ≤ δ.   (false positive)
Remarks:
• Only interesting when δ < Ω(1/(α²n)).
• On average, the attacker successfully traces Ω(1/α²) members of the dataset. This is the best possible. (Why?)
• Can safely release at most Õ(n²) means!
Membership Attacks from Noisy Means: The Attacker

  A(y, a, p) = IN   if ⟨y − p, a − p⟩ > T
               OUT  if ⟨y − p, a − p⟩ ≤ T

Note: given p and a, one can choose T = T_{p,a} = O(√(d log(1/δ))) to make the false positive probability exactly δ.
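To see the inner-product attack in action, here is a self-contained simulation (my sketch; the parameter values n = 100, d = 5,000, α = 0.05 are illustrative, and the threshold is set empirically rather than via the O(√(d log(1/δ))) formula):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha = 100, 5_000, 0.05

p = rng.uniform(0, 1, size=d)                      # random population frequencies
X = (rng.uniform(size=(n, d)) < p).astype(float)   # dataset: n iid rows ~ Bernoulli(p)
xbar = X.mean(axis=0)
a = xbar + rng.uniform(-alpha, alpha, size=d)      # noisy means: |a_j - xbar_j| <= alpha

def score(y):
    """Test statistic <y - p, a - p>; tends to be large when y was used to compute a."""
    return np.dot(y - p, a - p)

scores_in = np.array([score(row) for row in X])    # Alice IN: rows of X
fresh = (rng.uniform(size=(n, d)) < p).astype(float)
scores_out = np.array([score(row) for row in fresh])  # Alice OUT: fresh population draws

T = np.quantile(scores_out, 0.95)   # empirical threshold targeting ~5% false positives
tpr = np.mean(scores_in > T)
fpr = np.mean(scores_out > T)
print(f"TPR={tpr:.2f}  FPR={fpr:.2f}")
```

With these parameters the IN scores concentrate around d·E[p(1−p)]/n while the OUT scores concentrate around 0, so members are flagged far more often than non-members.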
Attacks on Aggregate Stats
• What error α makes sense?
  – Estimation error due to sampling ≈ 1/√n
  – Reconstruction attacks require α ≲ 1/√n, d ≥ n
  – Robust membership attacks: α ≲ √d/n
• Lessons
  – “Too many, too accurate” statistics reveal individual data
  – “Aggregate” is hard to pin down

[Diagram: distortion α on a number line — below ≈ 1/√n (the sampling error), reconstruction attacks apply; up to ≈ √d/n, membership attacks apply.]

[slide based on one from Adam Smith]
Membership Attacks on ML as a Service
[Shokri et al. 2017] Switch to slides from Reza Shokri’s talk
Another Attack on ML? [Fredrikson et al. ’14, cf. McSherry ’16]

[Diagram: data set X of n people; the attacker knows part of Alice’s data and uses the mechanism’s outputs to infer a sensitive attribute of hers.]

Attacker gets:
• Access to mechanism outputs
• Some of Alice’s data
• (Possibly) auxiliary info about population
Then computes: a sensitive attribute of Alice
Difference from reconstruction attacks:
• The above attack works even if Alice is not in the dataset; it is based on correlation between known & sensitive attributes.
• Reconstruction attacks work even when the sensitive bit is uncorrelated.
Goals of Differential Privacy
• Utility: enable “statistical analysis” of datasets
  – e.g. inference about population, ML training, useful descriptive statistics
• Privacy: protect individual-level data
  – against “all” attack strategies and auxiliary info

Q: Can it help with privacy in microtargeted advertising? [Korolova attacks]
  – inference from impressions?
  – inference from clicks?
  – displaying intrusive ads?
“Five Views” Responses to Membership Attacks on GWAS