Helping Kinsey Compute
Cynthia Dwork
Microsoft Research
The Problem
• Exploit Data, e.g., Medical Insurance Database
— Does smoking contribute to heart disease?
— Was there a rise in asthma emergency room cases this month?
— What fraction of the admissions during 2004 were men 25-35?
• …while preserving privacy of individuals
Holistic Statistics
• Is the dataset well clustered?
• What is the single best predictor for risk of stroke?
• How are attributes X and Y correlated; what is the cov(X,Y)?
• Are the data inherently low-dimensional?
Statistical Database
Query (f, S)
f: row → [0,1]
S ⊆ [n]
Exact Answer: Σ_{r ∈ S} f(row r)
Database: (D_1, …, D_n)
Response: exact answer + noise
Under control of interlocutor:
• Noise generation
• Number of queries T permitted
Why Bother With Noise?
Limiting interface to queries about large sets is insufficient:
A = {1, … , n} and B = {2, … , n}
Σ_{a ∈ A} f(row a) − Σ_{b ∈ B} f(row b) = f(row 1)
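The differencing attack above can be sketched in a few lines. This is a minimal illustration, assuming a toy database of binary rows; `exact_sum` is a hypothetical helper standing in for a noise-free query interface and is not part of the talk.

```python
def exact_sum(db, f, rows):
    """Answer a query (f, S) exactly: sum f over the rows indexed by S (0-based here)."""
    return sum(f(db[i]) for i in rows)

db = [1, 0, 1, 1, 0, 1, 0, 0]    # toy database, one binary attribute per row
f = lambda row: row              # identity query function

A = range(0, len(db))            # all rows: a "large set"
B = range(1, len(db))            # all rows except the first: also large
leaked = exact_sum(db, f, A) - exact_sum(db, f, B)
print(leaked)                    # the value of f on the first row
```

Both queries cover almost the whole database, yet their difference isolates a single row, which is why restricting queries to large sets does not help without noise.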
Previous (Modern) Work in this Model
• Dinur, Nissim [2003]
Single binary attribute (query function f = identity)
Non-privacy: whp adversary correctly guesses a 1 − ε fraction of the rows
— Theorem: Polytime non-privacy if whp |noise| is o(√n)
— Theorem: Privacy with o(√n) noise if #queries is << n
• Privacy “for free” !
Rows ~ samples from underlying distribution: Pr[row i = 1] = p
E[# 1’s] = pn, Var = Θ(n)
Actual # 1’s ~ pn ± Θ(√n)
|Privacy-preserving noise| is o(sampling error)
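To make the "for free" comparison concrete, here is a small worked calculation. The values of n and p and the choice of n^(1/4) as an example of an o(√n) noise level are assumptions for illustration only.

```python
import math

def sampling_std(n, p):
    """Standard deviation of the number of 1's among n independent Bernoulli(p) rows."""
    return math.sqrt(n * p * (1 - p))

n, p = 1_000_000, 0.1
sampling_error = sampling_std(n, p)   # grows like sqrt(n)
noise_std = n ** 0.25                 # assumed example of an o(sqrt(n)) noise level
print(sampling_error, noise_std, noise_std / sampling_error)
```

Under these assumed numbers the added noise is roughly a tenth of the sampling error, and the ratio shrinks further as n grows.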
Real Power in this Model
• Dwork, Nissim [2004]
Multiple binary attributes
q = (S, f), f: {0,1}^d → {0,1}
— Definition of privacy appropriate to enriched query set
— Theorem: Privacy with o(√n) noise if #queries is << n
— Coined term SuLQ
• Vertically Partitioned Databases
— Learn joint statistics from independently operated SuLQ databases:
• Given SuLQ_A, SuLQ_B learn if A implies B in probability
• E.g., heart disease risk increases with smoking
• Enables learning statistics for all Boolean fns of attributes
Still More Power [Blum, Dwork, McSherry, Nissim 05]
• Extend Privacy Proofs
— Real-valued functions f: [0,1]^d → [0,1]
— Per row analysis: drop dependence on n!
• How many queries has THIS row participated in?
• Our Data, Ourselves
• Holistic Statistics: A Calculus of Noisy Computation
— Beyond statistics:
• (not too) noisy versions of the k-means, perceptron, and ID3 algorithms
• (not too) noisy optimal projections SVD, PCA
• All of STAT learning
Towards Defining Privacy: “Facts of Life” vs Privacy Breach
• Diabetes is more likely in obese persons
— Does not imply THIS obese person has or will have diabetes
• Sneaker color preference is correlated with political party
— Does not imply THIS person in red sneakers is a Republican
• Half of all marriages result in divorce
— Does not imply Pr [ THIS marriage will fail ] = ½
(ε, T)-Privacy
Power of adversary:
• Phase 0: Specify a goal function g: row → {0,1}
(Actually, a polynomial number of functions.)
Adversary will try to learn this information about someone
• Phase 1: Adaptively make T queries
• Phase 2: Choose a row i to attack; get entire database except for row i
Privacy Breach: Occurs if adversary’s “confidence” in g( row i ) changes by more than ε
Notes:
• Adversary chooses goal
• My privacy is preserved even if everybody else tells their secrets to the adversary
Flavor of Privacy Proofs
• Define confidence in value of g( row i )
— c_0 = log [p_0/(1 − p_0)]
— 0 when p = ½, skyrockets as p moves toward 0 or 1
• Model evolution of confidence as a martingale
— Argue expected difference at each step is small
— Compute absolute upper bound on difference
— Plug these two parameters into Azuma’s inequality
Obtain probabilistic statement regarding change in confidence, equivalently, change from prior to posterior probabilities about value of g( row i )
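The log-odds notion of confidence used here is easy to tabulate. The sketch below only evaluates the formula c = log[p/(1 − p)] at a few probabilities; the function name `confidence` is chosen for illustration.

```python
import math

def confidence(p):
    """Log-odds of probability p, i.e. c = log[p / (1 - p)]."""
    return math.log(p / (1 - p))

for p in (0.5, 0.9, 0.99, 0.999999):
    print(p, round(confidence(p), 3))

# Change in confidence when a prior of 0.5 moves to a posterior of 0.6.
change = confidence(0.6) - confidence(0.5)
print(round(change, 3))
```

The confidence is 0 at p = ½ and grows without bound as p approaches 1 (symmetrically, it tends to minus infinity as p approaches 0), so bounding the change in confidence bounds how far the posterior can move from the prior.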
Remainder of This Talk
• Description of SuLQ Algorithm + Statement of Main Theorem
• Examples
— k means
— SVD, PCA
— Perceptron
— STAT learning
• Vertically Partitioned Data
— Determining if α ⇒ β in probability: Pr[β|α] ≥ Pr[β] + Δ when α and β are in different SuLQ databases
• Summary
The SuLQ Algorithm
• Algorithm:
— Input: query (S ⊆ [n], f: [0,1]^d → [0,1])
— Output: Σ_{i ∈ S} f( row i ) + N(0, R)
• Theorem: ∀ δ, with probability at least 1 − δ, choosing
R > 32 log(2/δ) log(T/δ) T/ε² ensures that for each (target, predicate) pair, after T queries the probability that the confidence has increased by more than ε is at most δ.
• R is independent of n. Bigger n means better stats.
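A minimal sketch of the SuLQ primitive as described above, assuming R is the variance of the Gaussian noise (so the standard deviation is √R). The function name `sulq`, its signature, and the toy data are illustrative choices, not an interface from the talk.

```python
import math
import random

def sulq(db, f, rows, R, rng=random):
    """Return the sum of f(db[i]) for i in rows, plus N(0, R) noise (R = variance)."""
    exact = sum(f(db[i]) for i in rows)
    return exact + rng.gauss(0.0, math.sqrt(R))

rng = random.Random(0)
db = [rng.random() for _ in range(10_000)]   # toy rows, each a value in [0, 1]
R = 100.0                                    # assumed noise variance
answer = sulq(db, lambda x: x, range(len(db)), R, rng)
print(answer, sum(db))
```

Because R does not depend on n, the same absolute noise is spread over more rows as the database grows, which is the sense in which bigger n means better statistics.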
k Means Clustering
physics, OR, machine learning, data mining, etc.
SuLQ k Means
• Estimate size of each cluster
• Estimate average of points in cluster
— Estimate their sum; and
— Divide estimated sum by estimated size
Side by Side: k Means and SuLQ k-Means
k Means basic step:
• Input: data points p_1, …, p_n and k ‘centers’ c_1, …, c_k in [0,1]^d
• S_j = points for which c_j is the closest center
• Output: c'_j = average of points in S_j, j = 1, …, k
SuLQ k Means basic step:
• Input: data points p_1, …, p_n and k ‘centers’ c_1, …, c_k in [0,1]^d
• s_j = SuLQ( f(d_i) := 1 if j = arg min_j' ||c_j' − d_i||, 0 otherwise )
• μ'_j = SuLQ( f(d_i) := d_i if j = arg min_j' ||c_j' − d_i||, 0 otherwise ) / s_j
k(1+d) queries total
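The SuLQ version of the basic step can be sketched as follows. This is an illustrative reading of the slide, assuming N(0, R) noise with R as the variance and one scalar query per coordinate; the helper names `sulq`, `nearest`, and `sulq_kmeans_step` are hypothetical.

```python
import math
import random

def sulq(db, f, R, rng):
    """Noisy sum of f over all rows of db; the noise is N(0, R), R = variance."""
    return sum(f(row) for row in db) + rng.gauss(0.0, math.sqrt(R))

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def sulq_kmeans_step(db, centers, R, rng):
    """One update of all k centers using k(1+d) noisy queries."""
    d = len(centers[0])
    new_centers = []
    for j in range(len(centers)):
        # 1 query: noisy size s_j of the cluster around center j.
        s_j = sulq(db, lambda p: 1.0 if nearest(p, centers) == j else 0.0, R, rng)
        # d queries: one noisy coordinate sum each, divided by s_j.
        # Only meaningful when the true cluster size is much larger than sqrt(R).
        new_centers.append([
            sulq(db, lambda p, c=c: p[c] if nearest(p, centers) == j else 0.0, R, rng) / s_j
            for c in range(d)
        ])
    return new_centers

rng = random.Random(0)
points = [(0.1, 0.1), (0.2, 0.2), (0.8, 0.8), (0.9, 0.9)]
exact = sulq_kmeans_step(points, [(0.0, 0.0), (1.0, 1.0)], R=0.0, rng=rng)
print(exact)  # with R = 0 this reduces to the ordinary k-means step
```

Setting R = 0 is only a sanity check that the noisy step collapses to the ordinary one; a private deployment would use R > 0 and clusters large enough that the noisy size s_j stays well away from zero.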
Small Error!
For each 1 ≤ j ≤ k, if |S_j| >> R^(1/2) then with high probability ||μ'_j − c'_j|| is O( (||μ_j|| + d^(1/2)) R^(1/2)/|S_j| ).
• Inaccuracies:
— Estimating |S_j|
— Summing points in S_j
• Even with just the first:
(1/s_j − 1/|S_j|) Σ_{i ∈ S_j} d_i
= (1/s_j − 1/|S_j|) (μ_j |S_j|)
= ((|S_j| − s_j)/s_j) μ_j ≈ (noise/size) μ_j
Reducing Dimensionality
• Reduce Dimensionality in a dataset while retaining those characteristics that contribute most to its variance
• Find Optimal Linear Projections
— Latent semantic indexing, spectral clustering, etc., employ best rank k approximations to A
• Singular Value Decomposition uses top k eigenvectors of A^T A
• Principal Component Analysis uses top k eigenvectors of cov(A)
• Approach
— Approximate A^T A and cov(A) using SuLQ, then compute eigenvectors
Optimal Projections
• A^T A = Σ_i d_i^T d_i
• μ = (Σ_i d_i)/n
• cov(A) = Σ_i (d_i − μ)^T (d_i − μ)
• SuLQ( f(i) = d_i^T d_i ) = A^T A + N(0,R)^(d × d)
• μ' = SuLQ( f(i) = d_i )/n
• SuLQ( f(i) = (d_i − μ')^T (d_i − μ') )
d² and d² + d queries, respectively
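The first of these approximations can be sketched entry by entry. The code below assumes rows in [0,1]^d, one scalar SuLQ query per matrix entry, and N(0, R) noise with R as the variance; `sulq` and `noisy_gram` are hypothetical names, and the eigenvector computation is left to whatever linear-algebra routine is available.

```python
import math
import random

def sulq(db, f, R, rng):
    """Noisy sum of f over all rows of db; the noise is N(0, R), R = variance."""
    return sum(f(row) for row in db) + rng.gauss(0.0, math.sqrt(R))

def noisy_gram(db, R, rng):
    """Approximate A^T A = sum_i d_i^T d_i using d*d scalar queries."""
    d = len(db[0])
    return [
        [sulq(db, lambda row, a=a, b=b: row[a] * row[b], R, rng) for b in range(d)]
        for a in range(d)
    ]

rng = random.Random(0)
rows = [(1.0, 0.5), (0.0, 1.0)]
print(noisy_gram(rows, R=0.0, rng=rng))   # exact A^T A when R = 0
print(noisy_gram(rows, R=0.01, rng=rng))  # each entry perturbed independently
```

Since each entry receives independent noise, the noisy matrix is generally not symmetric; symmetrizing it before an eigendecomposition is one possible design choice, not something the slides prescribe.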
Perceptron [Rosenblatt 57]
• Input: n points p_1, …, p_n in [-1,1]^d, and labels b_1, …, b_n in {-1,1}
— Assumed linearly separable, with a plane through the origin
• Initialize w randomly
• ⟨w, p⟩ b > 0 iff label b agrees with sign of ⟨w, p⟩
• While ∃ labeled point (p_i, b_i) s.t. ⟨w, p_i⟩ b_i ≤ 0, set w = w + p_i·b_i
• Output: w
SuLQ Perceptron
• Initialize w = 0^d and s = n.
Repeat while s >> R^(1/2)
• Count the misclassified rows (1 query):
s = SuLQ( f(d_i) := 1 if ⟨d_i, w⟩ b_i ≤ 0 and 0 otherwise )
• Synthesize a misclassified vector (d queries):
v = SuLQ( f(d_i) := b_i d_i if ⟨d_i, w⟩ · b_i ≤ 0 and 0 otherwise ) / s
• Update w:
Set w = w + v
Return the final value of w.
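The loop above can be sketched as follows, assuming N(0, R) noise with R as the variance. The names `sulq`, `dot`, and `sulq_perceptron` are illustrative, and two parameters are assumptions added for the sketch: `threshold_factor` stands in for the informal "s >> R^(1/2)" test, and `max_rounds` is a safety cap.

```python
import math
import random

def sulq(db, f, R, rng):
    """Noisy sum of f over all rows of db; the noise is N(0, R), R = variance."""
    return sum(f(row) for row in db) + rng.gauss(0.0, math.sqrt(R))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sulq_perceptron(db, R, rng, threshold_factor=10.0, max_rounds=1000):
    """db rows are pairs (d_i, b_i) with d_i in [-1,1]^d and b_i in {-1,+1}."""
    d = len(db[0][0])
    w = [0.0] * d
    for _ in range(max_rounds):                      # assumed safety cap
        mis = lambda row: dot(row[0], w) * row[1] <= 0
        # 1 query: noisy count of misclassified rows.
        s = sulq(db, lambda row: 1.0 if mis(row) else 0.0, R, rng)
        if s <= threshold_factor * math.sqrt(R):     # stand-in for "s >> R^(1/2)"
            break
        # d queries: noisy average of b_i * d_i over misclassified rows.
        v = [sulq(db, lambda row, c=c: row[1] * row[0][c] if mis(row) else 0.0, R, rng) / s
             for c in range(d)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w

data = [((1.0, 0.5), 1), ((0.8, -0.2), 1), ((-1.0, -0.3), -1), ((-0.6, 0.4), -1)]
w = sulq_perceptron(data, R=0.0, rng=random.Random(0))
print(w)  # noise-free run on a toy separable set
```

Unlike the classical perceptron, which updates on one misclassified point at a time, this version moves w by the (noisy) average of all currently misclassified points, since individual rows cannot be read from the database.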
How Many Rounds?
Theorem: If there exists a unit vector w' and scalar δ such that for all i ⟨w', d_i⟩ b_i ≥ δ and for all j, δ >> (dR)^(1/2)/|S_j| then with high probability the algorithm terminates in at most 32 max_i |d_i|²/δ² rounds.
|S_j| = number of misclassified vectors at iteration j
In each round j, ⟨w', w⟩ increases by more than |w| does. Since ⟨w', w⟩ ≤ |w'| · |w| = |w|, this must stop. Otherwise ⟨w', w⟩ would overtake |w|.
The Statistical Queries Learning Model
[Kearns93]
• Concept c: {0,1}^d → {0,1}
• Distribution D on {0,1}^d
• STAT(c,D) Oracle
— Query: (p, τ) where p: {0,1}^(d+1) → {0,1} and τ = 1/poly(d)
— Answer: Pr_{x~D}[p(x,c(x))] + ζ for |ζ| ≤ τ
Capturing STAT
Each row contains a labeled example (x, c(x))
Input: predicate p and accuracy τ
• Initialize tally = 0.
• Reduce variance:
Repeat t ≥ R/(τ² n²) times
tally = tally + SuLQ( f(d_i) := p(d_i) )
Output: tally / (tn)
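The simulation of a STAT query can be sketched as below. The repetition count is taken as t = ⌈R/(τn)²⌉, which makes the standard deviation of the output √(R/t)/n at most τ; this reading of the slide, the names `sulq` and `stat_query`, and the toy data are assumptions for illustration.

```python
import math
import random

def sulq(db, f, R, rng):
    """Noisy sum of f over all rows of db; the noise is N(0, R), R = variance."""
    return sum(f(row) for row in db) + rng.gauss(0.0, math.sqrt(R))

def stat_query(db, p, tau, R, rng):
    """Estimate the fraction of rows (x, c(x)) satisfying predicate p."""
    n = len(db)
    t = max(1, math.ceil(R / (tau * n) ** 2))   # repetitions to average the noise down
    tally = 0.0
    for _ in range(t):
        tally += sulq(db, lambda row: 1.0 if p(*row) else 0.0, R, rng)
    return tally / (t * n)

rng = random.Random(1)
# Toy labeled examples (x, c(x)) with c(x) = x[1]; predicate: first bit equals the label.
db = [((a, b), b) for a in (0, 1) for b in (0, 1)] * 250
agree = lambda x, label: x[0] == label
print(stat_query(db, agree, tau=0.05, R=10_000.0, rng=rng))
```

Each repetition costs one of the T permitted queries, so tighter accuracy τ trades directly against the query budget.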
Capturing STAT
Theorem: For any algorithm that ε-learns a class C using at most q statistical queries of accuracy {τ_1, …, τ_q}, the adapted algorithm can ε-learn C on a SuLQ database of n elements, provided that
n² ≥ ( R log(q/δ)/(T − q) ) × Σ_{j ≤ q} 1/τ_j²
Probabilistic Implication: Two SuLQ Databases
α implies β in probability: Pr[β|α] ≥ Pr[β] + Δ
• Construct a tester for distinguishing Δ < Δ_1 from Δ > Δ_2 (for constants Δ_1 < Δ_2)
— Estimate Δ by binary search
• In the analysis we consider deviations from an expected value, of magnitude Ω(√n)
— As perturbation << √n, it does not mask out these deviations
• Results generalize to functions α and β of attributes in two distinct SuLQ databases
Key Insight: Test for Pr[β|α] ≥ Pr[β] + Δ
Assume T chosen so that noise = o(√n).
1. Find a “heavy” set S for α: a subset of rows that have more than |S| a + [a(1−a) |S|]^(1/2) ones in the database. Here, a = Pr[α] and |S| = Θ(n).
Find S s.t. a_{S,α} > |S| a + √[ |S| a(1−a) ].
Let excess = a_{S,α} − |S| a. Note that excess is Ω(n^(1/2)).
2. Query the SuLQ database for β, on S.
If a_{S,β} ≥ |S| Pr[β] + excess (Δ/(1 − a)) then return 1 else return 0.
If Δ is constant then noise is too small to hide the correlation.
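The decision rule in step 2 can be written out directly. The sketch below covers only the arithmetic of the test, given counts already obtained from the two databases; the function names, parameter names, and the numbers in the example are assumptions for illustration.

```python
import math

def heavy_threshold(size, a):
    """Count of alpha-rows that a set of the given size must exceed to be 'heavy'."""
    return size * a + math.sqrt(size * a * (1 - a))

def implies_in_probability(count_alpha, count_beta, size, a, pr_beta, gap):
    """Return 1 if the beta count on S is high enough given the alpha excess, else 0.

    count_alpha, count_beta: (noisy) counts on S from the two SuLQ databases.
    a = Pr[alpha], pr_beta = Pr[beta], gap = the Delta being tested.
    """
    excess = count_alpha - size * a
    return 1 if count_beta >= size * pr_beta + excess * gap / (1 - a) else 0

size, a, pr_beta, gap = 10_000, 0.5, 0.5, 0.2
count_alpha = 5_100                        # exceeds heavy_threshold(size, a) = 5050
print(heavy_threshold(size, a))
print(implies_in_probability(count_alpha, 5_045, size, a, pr_beta, gap))
print(implies_in_probability(count_alpha, 5_010, size, a, pr_beta, gap))
```

The factor Δ/(1 − a) comes from Pr[β|α] − Pr[β] = (1 − a)(Pr[β|α] − Pr[β|¬α]): each excess α-row in S is expected to add at least Δ/(1 − a) to the β count when the implication holds.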
Summary
• SuLQ framework for privacy-preserving statistical databases
— real-valued query functions
— Variance for noise depends (roughly linearly) on number of queries, not size of database
• Examples of power of SuLQ calculus
• Vertically Partitioned Databases
Sources
• C. Dwork and K. Nissim,
Privacy-Preserving Datamining on Vertically Partitioned Databases
• A. Blum, C. Dwork, F. McSherry, and K. Nissim,
Practical Privacy: The SuLQ Framework
• See http://research.microsoft.com/research/sv/DabasePrivacy