Helping Kinsey Compute

Cynthia Dwork

Microsoft Research

The Problem

• Exploit Data, e.g., Medical Insurance Database

— Does smoking contribute to heart disease?

— Was there a rise in asthma emergency room cases this month?

— What fraction of the admissions during 2004 were men 25-35?

• …while preserving privacy of individuals

Holistic Statistics

• Is the dataset well clustered?

• What is the single best predictor for risk of stroke?

• How are attributes X and Y correlated; what is the cov(X,Y)?

• Are the data inherently low-dimensional?

Statistical Database

Query (f, S)

f: row → [0,1]

S ⊆ [n]

Exact Answer: Σ_{r∈S} f(row r)

Database (D_1, …, D_n)

Response: exact answer + noise

Statistical Database

Response: exact answer + noise

Under control of interlocutor:

• Noise generation

• Number of queries T permitted

Why Bother With Noise?

Limiting interface to queries about large sets is insufficient:

A = {1, … , n} and B = {2, … , n}

Σ_{a∈A} f(row a) − Σ_{b∈B} f(row b) = f(row 1)
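The differencing attack above can be reproduced in a few lines. This is a minimal sketch assuming a toy 0/1 database and an exact (noise-free) sum interface; `sum_query` and the data are hypothetical names made up for illustration.

```python
def sum_query(rows, f, S):
    """Exact answer to the query (f, S): sum of f(row) over the indices in S."""
    return sum(f(rows[i]) for i in S)

# Hypothetical database of one binary attribute per row.
# Index 0 plays the role of "row 1" on the slide.
rows = [1, 0, 1, 1, 0, 0, 1, 0]
n = len(rows)

def identity(r):
    return r

A = range(0, n)  # every row: a "large" set
B = range(1, n)  # every row except the first: also "large"

# Two legitimate-looking queries about large sets isolate a single row.
leaked = sum_query(rows, identity, A) - sum_query(rows, identity, B)
print(leaked == rows[0])
```

Both queries touch almost the whole database, yet their difference is exactly one person's value, which is why restricting set sizes alone does not help.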

Previous (Modern) Work in this Model

• Dinur, Nissim [2003]

Single binary attribute (query function f = identity)

Non-privacy: whp adversary guesses a 1−ε fraction of the rows

— Theorem: Polytime non-privacy if whp |noise| is o(√n)

— Theorem: Privacy with o(√n) noise if #queries is << n

• Privacy “for free” !

Rows ~ samples from underlying distribution: Pr[row i = 1] = p

E[# 1’s] = pn, Var = Θ(n)

Actual # 1’s ≈ pn ± Θ(√n)

|Privacy-preserving noise| is o(sampling error)
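A quick numeric illustration of the "for free" claim: the sampling error in the count of 1's has standard deviation √(np(1−p)), so any noise of magnitude o(√n) is eventually dwarfed by it. The values of n and p and the n^(1/4) noise scale below are assumptions chosen only for illustration.

```python
import math

def sampling_std(n, p):
    """Standard deviation of the number of 1's among n independent Bernoulli(p) rows."""
    return math.sqrt(n * p * (1 - p))

n, p = 1_000_000, 0.3
sampling_error = sampling_std(n, p)  # grows like sqrt(n)
example_noise = n ** 0.25            # assumed noise scale, picked because it is o(sqrt(n))

print(round(sampling_error, 1), round(example_noise, 1))
```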

Real Power in this Model

• Dwork, Nissim [2004]: Multiple binary attributes

q = (S, f), f: {0,1}^d → {0,1}

—Definition of privacy appropriate to enriched query set

—Theorem: Privacy with o(√n) noise if #queries is << n

—Coined term SuLQ

• Vertically Partitioned Databases

—Learn joint statistics from independently operated SuLQ databases:

• Given SuLQ_A, SuLQ_B learn if A implies B in probability

• E.g., heart disease risk increases with smoking

• Enables learning statistics for all Boolean fns of attributes

Still More Power [Blum, Dwork, McSherry, Nissim 05]

• Extend Privacy Proofs

— Real-valued functions f: [0,1]^d → [0,1]

— Per row analysis: drop dependence on n!

• How many queries has THIS row participated in?

• Our Data, Ourselves

• Holistic Statistics: A Calculus of Noisy Computation

— Beyond statistics:

• (not too) noisy versions of k-means, perceptron, ID3 algs

• (not too) noisy optimal projections SVD, PCA

• All of STAT learning

Towards Defining Privacy: “Facts of Life” vs Privacy Breach

• Diabetes is more likely in obese persons

— Does not imply THIS obese person has or will have diabetes

• Sneaker color preference is correlated with political party

— Does not imply THIS person in red sneakers is a Republican

• Half of all marriages result in divorce

— Does not imply Pr [ THIS marriage will fail ] = ½

(ε, T)-Privacy

Power of adversary:

• Phase 0: Specify a goal function g: row → {0,1} (actually, a polynomial number of functions); the adversary will try to learn this information about someone

• Phase 1: Adaptively make T queries

• Phase 2: Choose a row i to attack; get entire database except for row i

Privacy Breach: Occurs if adversary’s “confidence” in g( row i ) changes by more than ε

Notes:

• Adversary chooses goal

• My privacy is preserved even if everybody else tells their secrets to the adversary

Flavor of Privacy Proofs

• Define confidence in value of g( row i )

— c_0 = log [p_0/(1−p_0)]

— 0 when p = ½, skyrockets as p moves toward 0 or 1

• Model evolution of confidence as a martingale

— Argue expected difference at each step is small

— Compute absolute upper bound on difference

— Plug these two parameters into Azuma’s inequality

Obtain probabilistic statement regarding change in confidence, equivalently, change from prior to posterior probabilities about value of g( row i )
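The two ingredients above can be sketched numerically. The natural logarithm is an assumption (the slide does not fix the base), and `azuma_tail` is the textbook Azuma-Hoeffding bound for a martingale with increments bounded by c, a simplification of the argument outlined here, which also has to handle a small expected drift per step.

```python
import math

def confidence(p):
    """Log-odds of probability p (natural log assumed); 0 at p = 1/2."""
    return math.log(p / (1 - p))

def azuma_tail(t, c, steps):
    """Azuma-Hoeffding bound on Pr[|X_steps - X_0| >= t] for a martingale
    whose increments are bounded in absolute value by c."""
    return 2 * math.exp(-t * t / (2 * steps * c * c))

for p in (0.5, 0.9, 0.99, 0.999):
    print(p, round(confidence(p), 3))

# If each of 100 queries can move the confidence by at most 0.01,
# a total change of 1 is very unlikely under this bound.
print(azuma_tail(1.0, 0.01, 100))
```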

Remainder of This Talk

• Description of SuLQ Algorithm + Statement of Main Theorem

• Examples

— k means

— SVD, PCA

— Perceptron

— STAT learning

• Vertically Partitioned Data

— Determining if α ⇒ β in probability: Pr[β|α] ≥ Pr[β] + Δ when α and β are in different SuLQ databases

• Summary

The SuLQ Algorithm

• Algorithm:

— Input: query (S ⊆ [n], f: [0,1]^d → [0,1])

— Output: Σ_{i∈S} f( row i ) + N(0, R)

• Theorem: ∀ ε, δ, with probability at least 1−δ, choosing R > 32 log(2/δ) log(T/δ) T/ε² ensures that for each (target, predicate) pair, after T queries the probability that the confidence has increased by more than ε is at most δ.

• R is independent of n. Bigger n means better stats.
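The primitive is small enough to sketch directly. Assumptions: the symbols lost from the theorem are read as ε (confidence threshold) and δ (failure probability), the logarithms are natural, and N(0, R) denotes Gaussian noise with variance R; `sulq` and `noise_variance_bound` are hypothetical names.

```python
import math
import random

def noise_variance_bound(eps, delta, T):
    """Lower bound on R from the theorem, under the assumed reading
    R > 32 log(2/delta) log(T/delta) T / eps^2 (natural logs assumed)."""
    return 32 * math.log(2 / delta) * math.log(T / delta) * T / eps ** 2

def sulq(rows, f, S, R, rng):
    """Answer query (S, f): exact sum over S plus Gaussian noise of variance R."""
    exact = sum(f(rows[i]) for i in S)
    noise = rng.gauss(0.0, math.sqrt(R)) if R > 0 else 0.0
    return exact + noise

rng = random.Random(0)
rows = [(0.2,), (0.9,), (0.5,), (0.7,)]  # toy rows in [0,1]^1

def first_coordinate(row):
    return row[0]

R = noise_variance_bound(eps=1.0, delta=0.1, T=10)
print(R)  # depends on T, eps, delta only; n does not appear
print(sulq(rows, first_coordinate, [0, 2, 3], R, rng))
```

With only four rows the noise swamps the answer, which matches the closing remark on the slide: R does not shrink with n, so the statistics only become useful when n is large.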

k Means Clustering

physics, OR, machine learning, data mining, etc.

SuLQ k Means

• Estimate size of each cluster

• Estimate average of points in cluster

— Estimate their sum; and

— Divide estimated sum by estimated size

Side by Side: k Means and SuLQ k-Means

Basic step (k Means):

• Input: data points p_1, …, p_n and k ‘centers’ c_1, …, c_k in [0,1]^d

• S_j = points for which c_j is the closest center

• Output: c’_j = average of points in S_j, j = 1, …, k

Basic step (SuLQ k-Means):

• Input: data points p_1, …, p_n and k ‘centers’ c_1, …, c_k in [0,1]^d

• s_j = SuLQ( f(d_i) := 1 if j = arg min_ℓ ||c_ℓ − d_i||, 0 otherwise )

• μ’_j = SuLQ( f(d_i) := d_i if j = arg min_ℓ ||c_ℓ − d_i||, 0 otherwise ) / s_j

k(1+d) queries total
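One SuLQ k-means update can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: `noisy_sum` stands in for a SuLQ query over all rows, points are plain tuples, and no guard is included for clusters too small relative to √R (where s_j could be near zero or negative).

```python
import math
import random

def noisy_sum(values, R, rng):
    """Stand-in for one SuLQ query: exact sum plus N(0, R) noise."""
    noise = rng.gauss(0.0, math.sqrt(R)) if R > 0 else 0.0
    return sum(values) + noise

def closest(point, centers):
    """Index of the center nearest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def sulq_kmeans_step(points, centers, R, rng):
    d = len(points[0])
    assign = [closest(p, centers) for p in points]
    new_centers = []
    for j in range(len(centers)):
        # 1 query: noisy cluster size s_j.
        s_j = noisy_sum([1.0 if a == j else 0.0 for a in assign], R, rng)
        # d queries: noisy sum of each coordinate over the cluster.
        sums = [
            noisy_sum([p[c] if a == j else 0.0 for p, a in zip(points, assign)], R, rng)
            for c in range(d)
        ]
        # Estimated average = estimated sum / estimated size.
        new_centers.append([x / s_j for x in sums])
    return new_centers

points = [(0.0, 0.0), (0.0, 0.2), (1.0, 1.0), (1.0, 0.8)]
centers = [(0.1, 0.1), (0.9, 0.9)]
exact_step = sulq_kmeans_step(points, centers, 0.0, random.Random(0))
print(exact_step)
```

Setting R = 0 recovers the ordinary k-means step, which makes it easy to compare the two columns of the slide on the same data.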

Small Error!

For each 1 ≤ j ≤ k, if |S_j| >> R^{1/2} then with high probability ||μ’_j − c’_j|| is O( (||μ_j|| + d^{1/2}) R^{1/2}/|S_j| ).

• Inaccuracies:

— Estimating |Sj|

— Summing points in Sj

• Even with just the first:

(1/s_j − 1/|S_j|) Σ_{i∈S_j} d_i

= (1/s_j − 1/|S_j|) (μ_j |S_j|)

= ((|S_j| − s_j)/s_j) μ_j ≈ (noise/size) μ_j

Reducing Dimensionality

• Reduce Dimensionality in a dataset while retaining those characteristics that contribute most to its variance

• Find Optimal Linear Projections

— Latent semantic indexing, spectral clustering, etc., employ best rank k approximations to A

• Singular Value Decomposition uses top k eigenvectors of A^T A

• Principal Component Analysis uses top k eigenvectors of cov(A)

• Approach

— Approximate A^T A and cov(A) using SuLQ, then compute eigenvectors

Optimal Projections

• A^T A = Σ_i d_i^T d_i

• μ = (Σ_i d_i)/n

• cov(A) = Σ_i (d_i − μ)^T (d_i − μ)

• SuLQ( f(i) = d_i^T d_i ) = A^T A + N(0, R)^{d × d}

• μ’ = SuLQ( f(i) = d_i )/n

• SuLQ( f(i) = (d_i − μ’)^T (d_i − μ’) )

d² and d² + d queries, respectively
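The approach of approximating A^T A entry by entry and then extracting eigenvectors can be sketched in pure Python. Assumptions: each of the d² entries is one SuLQ-style query with independent N(0, R) noise, the noisy matrix is symmetrized before use (a choice made here, not taken from the slides), and plain power iteration is used to find only the top eigenvector.

```python
import math
import random

def noisy_gram(rows, R, rng):
    """Approximate A^T A with d*d noisy sum queries, one per entry."""
    d = len(rows[0])
    G = [[0.0] * d for _ in range(d)]
    for a in range(d):
        for b in range(d):
            exact = sum(r[a] * r[b] for r in rows)
            noise = rng.gauss(0.0, math.sqrt(R)) if R > 0 else 0.0
            G[a][b] = exact + noise
    return G

def top_eigenvector(M, iters=200):
    """Power iteration on the symmetrized matrix; returns a unit vector.
    Assumes the start vector is not orthogonal to the top eigenvector."""
    d = len(M)
    M = [[(M[a][b] + M[b][a]) / 2 for b in range(d)] for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(M[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

rows = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.5]]
G = noisy_gram(rows, 0.0, random.Random(0))  # R = 0 gives the exact A^T A
v = top_eigenvector(G)
print(G, v)
```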

Perceptron [Rosenblatt 57]

• Input: n points p_1, …, p_n in [−1,1]^d, and labels b_1, …, b_n in {−1,1}

—Assumed linearly separable, with a plane through the origin

• Initialize w randomly

• ⟨w, p⟩ b > 0 iff label b agrees with sign of ⟨w, p⟩

• While ∃ labeled point (p_i, b_i) s.t. ⟨w, p_i⟩ b_i ≤ 0, set w = w + p_i·b_i

• Output: w

SuLQ Perceptron

• Initialize w = 0^d and s = n.

Repeat while s >> R^{1/2}

• Count the misclassified rows (1 query):

s = SuLQ( f(d_i) := 1 if ⟨d_i, w⟩ b_i ≤ 0 and 0 otherwise )

• Synthesize a misclassified vector (d queries):

v = SuLQ( f(d_i) := b_i d_i if ⟨d_i, w⟩ b_i ≤ 0 and 0 otherwise ) / s

• Update w:

Set w = w + v

Return the final value of w.
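The loop above translates almost line for line into code. In this sketch `noisy_sum` stands in for a SuLQ query, and the stopping rule "s >> R^{1/2}" is read as s > 4·√R; the factor 4 and the round cap are assumptions for illustration only.

```python
import math
import random

def noisy_sum(values, R, rng):
    """Stand-in for one SuLQ query: exact sum plus N(0, R) noise."""
    noise = rng.gauss(0.0, math.sqrt(R)) if R > 0 else 0.0
    return sum(values) + noise

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def sulq_perceptron(points, labels, R, rng, max_rounds=1000):
    d = len(points[0])
    w = [0.0] * d
    threshold = 4 * math.sqrt(R)  # assumed reading of "s >> R^(1/2)"
    for _ in range(max_rounds):
        wrong = [dot(p, w) * b <= 0 for p, b in zip(points, labels)]
        # 1 query: noisy count of misclassified rows.
        s = noisy_sum([1.0 if m else 0.0 for m in wrong], R, rng)
        if s <= threshold:
            break
        # d queries: noisy average of b_i * d_i over misclassified rows.
        v = [
            noisy_sum([b * p[c] if m else 0.0
                       for p, b, m in zip(points, labels, wrong)], R, rng) / s
            for c in range(d)
        ]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w

points = [(1.0, 0.5), (0.5, 1.0), (-1.0, -0.5), (-0.5, -1.0)]
labels = [1, 1, -1, -1]
w = sulq_perceptron(points, labels, 0.0, random.Random(0))
print(w)
```

Unlike the classical perceptron, each update uses an averaged "synthetic" misclassified vector, so no single row is ever exposed directly.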

How Many Rounds?

Theorem: If there exists a unit vector w’ and scalar δ such that for all i, ⟨w’, d_i⟩ b_i ≥ δ and for all j, δ >> (dR)^{1/2}/|S_j|, then with high probability the algorithm terminates in at most 32 max_i |d_i|²/δ² rounds.

|S_j| = number of misclassified vectors at iteration j

In each round j, ⟨w’, w⟩ increases by more than |w| does. Since ⟨w’, w⟩ ≤ |w’|·|w| = |w|, this must stop. Otherwise ⟨w’, w⟩ would overtake |w|.

The Statistical Queries Learning Model

[Kearns93]

• Concept c: {0,1}^d → {0,1}

• Distribution D on {0,1}^d

• STAT(c, D) Oracle

—Query: (p, τ) where p: {0,1}^{d+1} → {0,1} and τ = 1/poly(d)

—Answer: Pr_{x∼D}[p(x, c(x))] + η for |η| ≤ τ

Capturing STAT

Each row contains a labeled example (x, c(x))

Input: predicate p and accuracy τ

• Initialize tally = 0.

• Reduce variance:

Repeat t ≥ R/(τ² n²) times

tally = tally + SuLQ( f(d_i) := p(d_i) )

Output: tally / (t n)
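The variance-reduction loop can be sketched as below. Each query adds N(0, R) noise, so the output tally/(tn) has noise variance R/(t n²); the helper `repetitions` picks t so that the corresponding standard deviation is at most the requested accuracy, here called `tau`. That reading of the repetition count, and all function names, are assumptions of this sketch.

```python
import math
import random

def sulq_count(rows, pred, R, rng):
    """One SuLQ-style query: number of labeled rows satisfying pred, plus N(0, R)."""
    exact = sum(1.0 for x, label in rows if pred(x, label))
    noise = rng.gauss(0.0, math.sqrt(R)) if R > 0 else 0.0
    return exact + noise

def repetitions(R, tau, n):
    """Smallest t with sqrt(R / (t n^2)) <= tau (assumed reading of the slide)."""
    return max(1, math.ceil(R / (tau ** 2 * n ** 2)))

def sulq_stat(rows, pred, t, R, rng):
    """Estimate the fraction of rows satisfying pred by averaging t noisy counts."""
    n = len(rows)
    tally = 0.0
    for _ in range(t):
        tally += sulq_count(rows, pred, R, rng)
    return tally / (t * n)

# Each row is a labeled example (x, c(x)).
rows = [((0, 1), 1), ((1, 1), 0), ((1, 0), 1), ((0, 0), 1)]

def label_is_one(x, label):
    return label == 1

estimate = sulq_stat(rows, label_is_one, 5, 0.0, random.Random(0))
print(estimate, repetitions(100.0, 0.5, 10))
```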

Capturing STAT

Theorem: For any algorithm that ε-learns a class C using at most q statistical queries of accuracy {τ_1, …, τ_q}, the adapted algorithm can ε-learn C on a SuLQ database of n elements, provided that

n² ≥ ( R log(q/δ) / (T − q) ) × Σ_{j ≤ q} 1/τ_j²

Probabilistic Implication: Two SuLQ Databases

α implies β in probability: Pr[β|α] ≥ Pr[β] + Δ

• Construct a tester for distinguishing Δ < Δ_1 from Δ > Δ_2 (for constants Δ_1 < Δ_2)

—Estimate Δ by binary search

• In the analysis we consider deviations from an expected value, of magnitude Ω(√n)

—As perturbation << √n, it does not mask out these deviations

• Results generalize to functions of attributes in two distinct SuLQ databases

Key Insight: Test for Pr[β|α] ≥ Pr[β] + Δ

Assume T chosen so that noise = o(√n).

1. Find a “heavy” set S for α: a subset of rows that has more than |S|·a + [a(1−a)|S|]^{1/2} ones for α in the database. Here, a = Pr[α] and |S| = Θ(n).

Find S s.t. a_{S,α} > |S|·a + √[|S| a(1−a)].

Let excess = a_{S,α} − |S|·a. Note that excess is Ω(n^{1/2}).

2. Query the SuLQ database for β, on S

If a_{S,β} ≥ |S|·Pr[β] + excess·(Δ/(1 − a)) then return 1 else return 0

If Δ is constant then noise is too small to hide the correlation.
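The factor Δ/(1 − a) in the threshold can be checked with a short calculation. The sketch below reads the symbols lost from the slide as α, β and Δ, takes Pr[β|α] = Pr[β] + Δ with equality, and assumes that within S the chance that a row satisfies β depends only on whether it satisfies α.

```latex
\begin{align*}
\Pr[\beta] &= a\,\Pr[\beta\mid\alpha] + (1-a)\,\Pr[\beta\mid\neg\alpha]
  \;\Longrightarrow\;
  \Pr[\beta\mid\neg\alpha] = \Pr[\beta] - \frac{a\,\Delta}{1-a},\\[4pt]
\mathbb{E}\bigl[a_{S,\beta}\bigr]
  &= a_{S,\alpha}\bigl(\Pr[\beta]+\Delta\bigr)
   + \bigl(|S|-a_{S,\alpha}\bigr)\Bigl(\Pr[\beta]-\frac{a\,\Delta}{1-a}\Bigr)\\
  &= |S|\Pr[\beta]
   + \Delta\,\frac{a_{S,\alpha}(1-a) - a\bigl(|S|-a_{S,\alpha}\bigr)}{1-a}\\
  &= |S|\Pr[\beta] + \bigl(a_{S,\alpha}-|S|\,a\bigr)\frac{\Delta}{1-a}
   = |S|\Pr[\beta] + \mathrm{excess}\cdot\frac{\Delta}{1-a}.
\end{align*}
```

Since excess is of order √n while the noise is o(√n), a constant Δ shifts the expected β count on S by more than the perturbation can hide.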

Summary

• SuLQ framework for privacy-preserving statistical databases

— real-valued query functions

— Variance for noise depends (roughly linearly) on number of queries, not size of database

• Examples of power of SuLQ calculus

• Vertically Partitioned Databases

Sources

• C. Dwork and K. Nissim, Privacy-Preserving Datamining on Vertically Partitioned Databases

• A. Blum, C. Dwork, F. McSherry, and K. Nissim, Practical Privacy: The SuLQ Framework

• See http://research.microsoft.com/research/sv/DabasePrivacy
