Statistical Decision Theory and Information Constraints
Michael I. Jordan University of California, Berkeley
November 20, 2014
What Is the Big Data Phenomenon?
• Science in confirmatory mode (e.g., particle physics)
• Science in exploratory mode (e.g., astronomy, genomics)
• Measurement of human activity, particularly online activity, is generating massive datasets that can be used (e.g.) for personalization and for creating markets
• Sensor networks are becoming pervasive
What Are the Conceptual/Mathematical Issues?
• The need to control statistical risk under constraints on algorithmic runtime
  – how do risk and runtime trade off as a function of the amount of data?
• Statistical inference with distributed and streaming data
  – how is inferential quality impacted by communication constraints?
• The tradeoff between statistical risk and privacy (and other externalities)
• Many other issues that require a blend of statistical thinking (e.g., a focus on sampling, confidence intervals, evaluation, diagnostics, causal inference) and computational thinking (e.g., scalability, abstraction)
Our Approach
• Take (classical) statistical decision theory as a mathematical point of departure
• Treat computation, communication, privacy, etc., as constraints on statistical risk
• This induces tradeoffs among these quantities and the number of data points
• Under the hood: geometry, information theory and optimization
Background
• In the 1930s, Wald laid the foundations of statistical decision theory
• Given a family of probability distributions $\mathcal{P}$, a parameter $\theta(P)$ for each $P \in \mathcal{P}$, an estimator $\hat{\theta}$, and a loss $l(\hat{\theta}, \theta(P))$, define the risk:
  $R_P(\hat{\theta}) := \mathbb{E}_P\big[l(\hat{\theta}, \theta(P))\big]$
• Minimax principle [Wald, '39, '43]: choose the estimator minimizing the worst-case risk:
  $\sup_{P \in \mathcal{P}} \mathbb{E}_P\big[l(\hat{\theta}, \theta(P))\big]$
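The worst-case risk above can be made concrete in a toy case. A minimal sketch (the Bernoulli family and squared loss are illustrative choices, not from the slide): the sample mean of $n$ Bernoulli($p$) observations has risk $p(1-p)/n$, and the minimax criterion takes the supremum over $p$.

```python
def worst_case_risk_sample_mean(n, grid=1001):
    """Toy instance of Wald's worst-case criterion: for X_1..X_n ~ Bernoulli(p)
    and the sample-mean estimator under squared loss, the risk is
    R_P = p(1 - p) / n; take the sup over p on a grid of [0, 1]."""
    return max(p * (1.0 - p) / n for p in (i / (grid - 1) for i in range(grid)))
```

The supremum is attained at $p = 1/2$, giving $1/(4n)$.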
Part I: Privacy and Minimax Risk
with John Duchi and Martin Wainwright University of California, Berkeley
Privacy and Risk
• Individuals are not generally willing to allow their personal data to be used without control on how it will be used and how much privacy loss they will incur
• We will quantify “privacy loss” via differential privacy
• We then treat differential privacy as a constraint on inference via statistical decision theory
• This yields (personal) tradeoffs between privacy loss and inferential gain
A model of privacy
Local privacy: providers do not trust the collector [Warner 65; Evfimievski et al. 03]
• Individuals with private data: $X_i \overset{iid}{\sim} P$, $i \in \{1, \ldots, n\}$
• Channel: each $X_i$ is released only through a privatized view $Z_i \sim Q(\cdot \mid X_i)$
• Estimator: $Z_1^n \mapsto \hat{\theta}(Z_1^n)$
Definitions of privacy
Definition: the channel $Q$ is $\alpha$-differentially private if
  $\sup_{S,\, x \in \mathcal{X},\, x' \in \mathcal{X}} \dfrac{Q(Z \in S \mid x)}{Q(Z \in S \mid x')} \le \exp(\alpha)$
i.e., $|\log Q(z \mid x) - \log Q(z \mid x')| \le \alpha$ for all $z, x, x'$
[Dwork, McSherry, Nissim, Smith 06]
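For intuition, the simplest $\alpha$-differentially-private channel is Warner-style randomized response on a single bit. This sketch (an illustration, not the slide's multivariate mechanism) exposes the channel probabilities so the definition can be checked directly.

```python
import math
import random

def randomized_response(x, alpha, rng=random):
    """Warner-style randomized response: an alpha-differentially-private
    channel for one private bit x in {0, 1}. Keeps x with probability
    e^alpha / (1 + e^alpha) and flips it otherwise."""
    p_keep = math.exp(alpha) / (1.0 + math.exp(alpha))
    return x if rng.random() < p_keep else 1 - x

def channel_prob(z, x, alpha):
    """Exact Q(Z = z | X = x) for the channel above, so the bound
    sup Q(z|x) / Q(z|x') <= e^alpha can be verified directly."""
    p_keep = math.exp(alpha) / (1.0 + math.exp(alpha))
    return p_keep if z == x else 1.0 - p_keep
```

The likelihood ratio between any two inputs is exactly $e^\alpha$ in the worst case, so the privacy constraint holds with equality.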
Private Minimax Risk
Central object of study: minimax risk
• Parameter of distribution: $\theta(P)$
• Family of distributions: $\mathcal{P}$
• Loss measuring error: $\ell$
• Family of private channels: $\mathcal{Q}_\alpha$

Minimax risk:
  $\mathfrak{M}_n(\theta(\mathcal{P}), \ell) := \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\big[\ell(\hat{\theta}(X_1^n), \theta(P))\big]$

$\alpha$-private minimax risk (minimax risk under the privacy constraint; the outer infimum picks the best $\alpha$-private channel):
  $\mathfrak{M}_n(\theta(\mathcal{P}), \ell, \alpha) := \inf_{Q \in \mathcal{Q}_\alpha} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P,Q}\big[\ell(\hat{\theta}(Z_1^n), \theta(P))\big]$
Vignette: private mean (location) estimation
Example: estimate reasons for hospital visits. Patients admitted to hospital for substance abuse; estimate the prevalence of different substances.
Each patient’s record is an indicator vector, e.g. $X = (1, 1, 0, 0, 0, 0)$ over (Alcohol, Cocaine, Heroin, Cannabis, LSD, Amphetamines); the population proportions are $\theta = (\theta_1, \ldots, \theta_6) = (.45, .32, .16, .20, .00, .02)$
Vignette: mean estimation
Consider estimation of the mean $\theta(P) := \mathbb{E}_P[X] \in \mathbb{R}^d$, with errors measured in the $\ell_\infty$-norm, i.e. $\mathbb{E}[\|\hat{\theta} - \theta\|_\infty]$, for
  $\mathcal{P}_d := \{$distributions $P$ supported on $[-1, 1]^d\}$

Minimax rate (achieved by the sample mean):
  $\mathfrak{M}_n(\mathcal{P}_d, \|\cdot\|_\infty) \asymp \min\Big\{1, \frac{\sqrt{\log d}}{\sqrt{n}}\Big\}$

Proposition: private minimax rate for $\alpha = O(1)$:
  $\mathfrak{M}_n(\mathcal{P}_d, \|\cdot\|_\infty, \alpha) \asymp \min\Big\{1, \frac{\sqrt{d \log d}}{\sqrt{n\alpha^2}}\Big\}$

Note: effective sample size $n \mapsto n\alpha^2/d$
Optimal mechanism?
Non-private observation: $X = (1, 0, 1, 0, 0)^\top$
Idea 1: add independent noise (e.g., the standard Laplace mechanism):
  $Z = X + W = (1 + W_1,\ 0 + W_2,\ 1 + W_3,\ 0 + W_4,\ 0 + W_5)^\top$
Problem: the noise magnitude is much too large (and this is unavoidable: the approach is provably sub-optimal)
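A minimal sketch of Idea 1, assuming the standard calibration of Laplace noise to $\ell_1$-sensitivity (here $2d$ for vectors in $[-1,1]^d$; this calibration is an assumption of the sketch, not a detail on the slide). The noise scale growing linearly in $d$ is exactly why the released vector is so noisy.

```python
import random

def sample_laplace(scale, rng):
    """Laplace(0, scale) variate, as the difference of two i.i.d.
    exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(x, alpha, rng=random):
    """Idea 1 from the slide: privatize x in [-1, 1]^d by adding i.i.d.
    Laplace noise. Calibrating to the l1-sensitivity of x (2d on this
    domain, the standard choice assumed here) forces the per-coordinate
    noise scale to grow with the dimension d."""
    scale = 2.0 * len(x) / alpha
    return [xi + sample_laplace(scale, rng) for xi in x]
```

Even for the 5-dimensional example above with $\alpha = 1$, each coordinate receives noise of scale 10 around a value in $\{0, 1\}$.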
Optimal mechanism
Non-private observation: $X = (1, 0, 1, 0, 0)^\top$
• Draw $v$ uniformly in $\{0, 1\}^d$, giving two candidate views: View 1, $v = (0, 1, 1, 0, 0)^\top$ (closer: 3 coordinates overlap with $X$), and View 2, $1 - v = (1, 0, 0, 1, 1)^\top$ (farther: 2 overlap)
• With probability $\frac{e^\alpha}{1 + e^\alpha}$, choose whichever of $v$ and $1 - v$ is closer to $X$; otherwise, choose the farther one
• At the end: compute the sample average of the released views and de-bias
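The sampling step above can be sketched as follows (a simplified reading of the slide, with the de-biasing step omitted and the helper names chosen here for illustration):

```python
import math
import random

def private_view(x, alpha, rng=random):
    """Sketch of the slide's sampling mechanism: draw v uniformly in
    {0,1}^d, and with probability e^alpha / (1 + e^alpha) release
    whichever of v and its complement 1 - v overlaps x in more
    coordinates; otherwise release the farther one."""
    d = len(x)
    v = [rng.randint(0, 1) for _ in range(d)]
    w = [1 - vi for vi in v]
    # overlap = number of coordinates where the candidate agrees with x
    overlap_v = sum(1 for a, b in zip(v, x) if a == b)
    closer, farther = (v, w) if 2 * overlap_v >= d else (w, v)
    p_closer = math.exp(alpha) / (1.0 + math.exp(alpha))
    return closer if rng.random() < p_closer else farther
```

Because $v$ and $1 - v$ split the $d$ coordinates between them, the released vector leaks only a biased coin's worth of information about which side $X$ falls on.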
Empirical evidence
Estimate the proportion of emergency room visits involving different substances. (Figure: estimation error against sample size $n$; data source: Drug Abuse Warning Network)
Sample size reductions
Given an $\alpha$-private channel $Q$, a pair $\{P_1, P_2\}$ induces the marginals ($X_i \sim P_j$, $Z_i \sim Q$):
  $M_j(S) := \int Q(S \mid x_1, \ldots, x_n)\, dP_j^n(x_1, \ldots, x_n)$

Question: how much “contraction” does privacy induce?

Theorem (data processing): for any $\alpha$-private channel $Q$ and an i.i.d. sample of size $n$,
  $D_{\mathrm{kl}}(M_1 \| M_2) + D_{\mathrm{kl}}(M_2 \| M_1) \le 4 n (e^\alpha - 1)^2 \|P_1 - P_2\|_{\mathrm{TV}}^2$

Note: for $\alpha \lesssim 1$, this gives the effective sample size $n \mapsto n\alpha^2$
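The theorem can be sanity-checked numerically in the simplest case: $n = 1$, Bernoulli $P_j$, and a randomized-response channel (the channel is chosen here for concreteness; the theorem covers any $\alpha$-private channel).

```python
import math

def bern_kl(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def check_data_processing(p1, p2, alpha):
    """Return (lhs, rhs) of the data-processing bound
    D(M1||M2) + D(M2||M1) <= 4 n (e^a - 1)^2 ||P1 - P2||_TV^2
    for n = 1, P_j = Bernoulli(p_j), and an alpha-DP randomized-response
    channel that keeps the bit with probability e^a / (1 + e^a)."""
    q = math.exp(alpha) / (1 + math.exp(alpha))
    m1 = p1 * q + (1 - p1) * (1 - q)  # marginal prob. that Z = 1 under P1
    m2 = p2 * q + (1 - p2) * (1 - q)
    lhs = bern_kl(m1, m2) + bern_kl(m2, m1)
    tv = abs(p1 - p2)  # total variation between two Bernoullis
    rhs = 4 * (math.exp(alpha) - 1) ** 2 * tv ** 2
    return lhs, rhs
```

The symmetric KL between the privatized marginals stays well below the bound, and both sides shrink like $\alpha^2$ as $\alpha \to 0$.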
Final remarks: privacy
Rough technique: reduction of estimation to testing, then apply information-theoretic testing lower bounds [Le Cam, Hasminskii, Ibragimov, Assouad, Birgé, Barron, Yu, ...]
Key: allows identification of new optimal mechanisms
Additional examples:
‣ Fixed-design regression
‣ Convex risk minimization
‣ Multinomial estimation
‣ Nonparametric density estimation
Almost always: effective sample size reduction $n \mapsto n\alpha^2$; in $d$-dimensional problems, $n \mapsto n\alpha^2/d$
Part II: Communication and Minimax Risk
with John Duchi, Martin Wainwright and Yuchen Zhang
University of California, Berkeley
Communication constraints
• Large data necessitates distributed storage
• Independent data collection (hospitals)
• Privacy?
Setting: each of $m$ agents has a sample $X^i = (X^i_1, X^i_2, \ldots, X^i_n)$ of size $n$ and sends a message $Z_i$ to a fusion center, which forms the estimate $\hat{\theta}$
Question: what are the tradeoffs between communication and statistical utility?
[Yao 79; Abelson 80; Tsitsiklis and Luo 87; Han & Amari 98; Tatikonda & Mitter 04; ...]
Minimax communication
Central object of study: minimax risk with $B$-bounded communication
• Parameter of distribution: $\theta(P)$
• Family of distributions: $\mathcal{P}$
• Loss: $\|\cdot\|_2^2$
  $\mathfrak{M}_n(\theta(\mathcal{P}), B) := \inf_{\pi \in \Pi_B} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\big[\|\hat{\theta}(Z_1^m) - \theta(P)\|_2^2\big]$
The outer infimum is over the best protocols $Z_i = \pi(X^i)$ in which each message $Z_i$ is constrained to be smaller than $B$ bits
Vignette: mean estimation
Consider estimation in the normal location family, $X^i_j \overset{iid}{\sim} N(\theta, \sigma^2 I_{d \times d})$, $\theta \in [-1, 1]^d$

Minimax rate (Theorem): when each agent has a sample of size $n$,
  $\mathbb{E}[\|\hat{\theta}(X^1, \ldots, X^m) - \theta\|_2^2] \asymp \frac{\sigma^2 d}{nm}$

Minimax rate with $B$-bounded communication (Theorem): when each agent has a sample of size $n$ and sends $B$ bits,
  $\frac{d}{B \wedge d} \frac{1}{\log m} \frac{\sigma^2 d}{nm} \lesssim \mathfrak{M}_n(\mathcal{N}_d, B) \lesssim \frac{d \log m}{B \wedge d} \frac{\sigma^2 d}{nm}$

Consequence: each agent must send $\approx d$ bits for optimal estimation
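For intuition on why roughly one bit per coordinate can suffice, here is a toy one-round protocol for a bounded mean (an illustration under the assumption of data in $[-1, 1]$; it is not the paper's protocol for the Gaussian family): each agent stochastically quantizes its local value to a single bit, and the fusion center averages and de-biases.

```python
import random

def encode_bit(x, rng):
    """One-bit stochastic quantization of x in [-1, 1]:
    send 1 with probability (1 + x) / 2, so that E[2Z - 1] = x."""
    return 1 if rng.random() < (1.0 + x) / 2.0 else 0

def fusion_estimate(bits):
    """Fusion-center estimate: average the received bits and de-bias."""
    return 2.0 * sum(bits) / len(bits) - 1.0
```

With $m$ agents the estimator is unbiased with variance $O(1/m)$ per coordinate, at a cost of one bit per coordinate per agent.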
Part III: Computation and Minimax Risk
with John Duchi and Brendan McMahan
Towards computation
Model: $Y \approx X\theta + \varepsilon$, with target $Y \in \mathbb{R}^n$, highly-varying or sparse design $X \in \mathbb{R}^{n \times d}$, parameter $\theta \in \mathbb{R}^d$, and noise $\varepsilon \in \mathbb{R}^n$
• Both large $n$ and large $d$
• Columns very different (ill-conditioning)
• Analogous models for classification (e.g., logistic)
[D., Hazan, Singer 11; D., Jordan, McMahan 13]
Sparse and heavy-tailed
(Figure: feature frequencies for 2000 webpages)
• Medical risk prediction: $n \approx 10^5$ patients and $d \approx 3 \cdot 10^4$ indicator variables; binary interactions yield $d \approx 4.5 \cdot 10^8$ (sparse) features
• Text data classification: binary word indicators for $d \approx 5 \cdot 10^4$ words; bigrams lead to $d \approx 3 \cdot 10^6$ variables, < 1% non-zero
[Manning & Schütze 99; Shah & Meinshausen 13; Li & König 11; Ma et al. 09; Crammer, Dredze, Kulesza 09; ...]
Convex risk minimization
Large $n$, large $d$: focus on prediction. Minimize
  $R_P(\theta) := \mathbb{E}_P[\ell(X; \theta)]$
where $\ell$ is convex in $\theta$ and $X \sim P$

Computation = optimization complexity [Nemirovski & Yudin 83]: one unit of computation takes $\{x, \theta\}$ and returns $\ell(x; \theta)$ and $\nabla_\theta \ell(x; \theta)$

Minimax risk with $n$ computations:
  $\mathfrak{M}_n(\Theta, \mathcal{P}, \ell) := \inf_{\hat{\theta} \in \mathcal{C}_n} \sup_{P \in \mathcal{P}} \Big\{\mathbb{E}_P[R_P(\hat{\theta})] - \min_{\theta \in \Theta} R_P(\theta)\Big\}$
Optimal convergence
Lower bounds: linear functionals on the hypercube, $\ell(x; \theta) = x^\top \theta$, $\Theta = [-1, 1]^d$, estimand $\mathbb{E}[X]$. If feature $j$ of $X \in \{-1, 0, 1\}^d$ appears with probability $p_j$, then
  $\mathfrak{M}_n(\Theta, \mathcal{P}, \ell) \ge \frac{1}{8} \frac{1}{\sqrt{n}} \sum_{j=1}^d \sqrt{p_j}$
Methods and challenges
Stochastic approximation. Iterate:
1. Draw a random $X_i$; compute $g_i = \nabla \ell(X_i; \theta_i)$
2. Update $\theta_{i+1} \leftarrow \theta_i - \frac{1}{\sqrt{i}} g_i$
Efficient? Sometimes [Robbins & Monro 51; Polyak & Juditsky 92; Nedić & Bertsekas 01; Lai 03; Nemirovski, Juditsky, Lan, Shapiro 09; Agarwal, Bartlett, Ravikumar, Wainwright 12; ...]
Problem: the method (and its variants) is sub-optimal for data with high-variance features; sometimes a factor of $d$ worse than necessary
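The iteration above, as a minimal sketch (the stochastic-gradient oracle interface `stoch_grad(theta, rng)` is a hypothetical of this sketch):

```python
import math
import random

def sgd(stoch_grad, theta0, steps, rng=None):
    """Stochastic approximation as on the slide: at iteration i, draw a
    random example and step by 1/sqrt(i) along the negative stochastic
    gradient g_i."""
    rng = rng or random.Random(0)
    theta = list(theta0)
    for i in range(1, steps + 1):
        g = stoch_grad(theta, rng)
        step = 1.0 / math.sqrt(i)
        theta = [t - step * gj for t, gj in zip(theta, g)]
    return theta
```

Note the single scalar step size: every coordinate is treated identically, which is exactly what hurts on data whose features have very different frequencies.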
Methods and challenges
(Figure: contours of $R(\theta)$, elongated along the uncommon-feature direction and narrow along the common-feature direction; the raw risk is hard, the rescaled risk is easier)
Can we do something a bit more second order? AdaGrad rescales the update:
  $A_i = \mathrm{diag}\Big(\sum_{j \le i} g_j g_j^\top\Big)^{1/2}, \qquad \theta_{i+1} \leftarrow \theta_i - A_i^{-1} g_i$
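Diagonal AdaGrad as written on the slide, in a minimal sketch (the small `eps` guard against division by zero is an addition, not on the slide):

```python
import math

def adagrad(grad, theta0, steps):
    """Diagonal AdaGrad: A_i = diag(sum_{j<=i} g_j g_j^T)^(1/2),
    update theta <- theta - A_i^{-1} g_i. Coordinates with rare or small
    gradients automatically get larger effective step sizes."""
    theta = list(theta0)
    sumsq = [0.0] * len(theta)  # per-coordinate running sum of g_j^2
    eps = 1e-12  # guard against division by zero
    for _ in range(steps):
        g = grad(theta)
        for j, gj in enumerate(g):
            sumsq[j] += gj * gj
            theta[j] -= gj / (math.sqrt(sumsq[j]) + eps)
    return theta
```

The rescaling is the diagonal "second-order-ish" correction asked for on the previous slide: it whitens the per-coordinate gradient scales without ever forming a Hessian.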
Optimal convergence
Iterate (AdaGrad):
1. Gradient: $g_i = \nabla \ell(X_i; \theta_i)$
2. Scale: $A_i = \mathrm{diag}\big(\sum_{j \le i} g_j g_j^\top\big)^{1/2}$
3. Update: $\theta_{i+1} \leftarrow \theta_i - A_i^{-1} g_i$

Upper bounds: for any risk functional (the minimizing $A$ acts as an oracle adaptive to this data),
  $\mathbb{E}[R(\hat{\theta})] - R(\theta^*) \le \frac{1}{\sqrt{2n}}\, \mathbb{E}\bigg[\min_{\text{diagonal } A \succeq 0} \Big\{\|\theta^*\|_\infty^2\, \mathrm{tr}(A) + \sum_{i=1}^n g_i^\top A^{-1} g_i\Big\}\bigg]$

Lower bounds: for linear functionals on the hypercube, $\ell(x; \theta) = x^\top \theta$, $\Theta = [-1, 1]^d$, with feature $j$ of $X$ appearing with probability $p_j$,
  $\mathfrak{M}_n(\Theta, \mathcal{P}, \ell) \ge \frac{1}{8} \frac{1}{\sqrt{n}} \sum_{j=1}^d \sqrt{p_j}$

For linear models on the hypercube the upper bound becomes
  $\mathbb{E}[R(\hat{\theta})] - R(\theta^*) \le \frac{\sqrt{2}}{\sqrt{n}} \sum_{j=1}^d \sqrt{p_j}$

Note: adaptivity costs less than a factor of 12 over the lower bound
Empirics: text classification
Reuters news feed document classification task: d = 2,000,000 features, 4,000 non-zero / document, n = 800,000 documents [Crammer et al. ’06]
(Figure: misclassification rate; 25-50% improvement)
Empirics: image ranking
Task: 15,000 different nouns; given a noun, rank images in order of relevance for the noun. Per noun: $n \approx 2 \cdot 10^6$ observations in $d \approx 10^4$ dimensions [Bengio, Weston, Grangier 10]
(Figure: proportion good in top $k$, plotted against $k$)
One more result
(Figure: average frame accuracy (%) on the test set against time in hours, for SGD, GPU, Downpour SGD, Downpour SGD w/ Adagrad (the adaptive method), and Sandblaster L-BFGS) [Dean et al. 12]
Order of magnitude improvement in time & cost; in many production systems at Google (only result I may show...)
Unifying picture
Constraints in minimax analysis ($Z$ is private, only a few bits, a function evaluation, ...):
• Reduction of estimation to testing: $V \to X \to Z$; find the initial $V$ using only $Z$
• Information-theoretic techniques (strong data processing): relate the information from $V$ to $Z$ via the constraint on the pair $X \to Z$, then apply
  ‣ Fano
  ‣ Assouad
  ‣ Le Cam
• Understanding these constraints leads to new, optimal schemes