
Statistical Decision Theory and Information Constraints

Michael I. Jordan, University of California, Berkeley

November 20, 2014

What Is the Big Data Phenomenon?

•  Science in confirmatory mode (e.g., particle physics)
•  Science in exploratory mode (e.g., astronomy, genomics)
•  Measurement of human activity, particularly online activity, is generating massive datasets that can be used (e.g.) for personalization and for creating markets
•  Sensor networks are becoming pervasive

What Are the Conceptual/Mathematical Issues?

•  The need to control statistical risk under constraints on algorithmic runtime
–  how do risk and runtime trade off as a function of the amount of data?
•  Statistical inference with distributed and streaming data
–  how is inferential quality impacted by communication constraints?
•  The tradeoff between statistical risk and privacy (and other externalities)
•  Many other issues that require a blend of statistical thinking (e.g., a focus on sampling, confidence intervals, evaluation, diagnostics, causal inference) and computational thinking (e.g., scalability, abstraction)

Our Approach

•  Take (classical) statistical decision theory as a mathematical point of departure
•  Treat computation, communication, privacy, etc. as constraints on statistical risk
•  This induces tradeoffs among these quantities and the number of data points
•  Under the hood: geometry, information theory and optimization

Background

•  In the 1930s, Wald laid the foundations of statistical decision theory
•  Given a family of probability distributions $\mathcal{P}$, a parameter $\theta(P)$ for each $P \in \mathcal{P}$, an estimator $\hat{\theta}$, and a loss $\ell(\hat{\theta}, \theta(P))$, define the risk:
$$R_P(\hat{\theta}) := \mathbb{E}_P\bigl[\ell(\hat{\theta}, \theta(P))\bigr]$$
•  Minimax principle [Wald '39, '43]: choose the estimator minimizing the worst-case risk:
$$\sup_{P \in \mathcal{P}} \mathbb{E}_P\bigl[\ell(\hat{\theta}, \theta(P))\bigr]$$

Part I: Privacy and Minimax Risk

with John Duchi and Martin Wainwright, University of California, Berkeley

Privacy and Risk

•  Individuals are not generally willing to allow their personal data to be used without control on how it will be used and how much privacy loss they will incur
•  We will quantify “privacy loss” via differential privacy
•  We then treat differential privacy as a constraint on inference via statistical decision theory
•  This yields (personal) tradeoffs between privacy loss and inferential gain

A model of privacy

Local privacy: providers do not trust the collector [Warner 65; Evfimievski et al. 03]

•  Individuals hold private data $X_i \overset{\text{iid}}{\sim} P$, $i \in \{1, \ldots, n\}$
•  Each $X_i$ passes through a private channel $Q(\cdot \mid X_i)$, which releases only the privatized view $Z_i$
•  The estimator sees only the privatized data: $Z_1^n \mapsto \hat{\theta}(Z_1^n)$

Definitions of privacy

Definition [Dwork, McSherry, Nissim, Smith 06]: the channel $Q$ is $\alpha$-differentially private if
$$\sup_{S,\, x \in \mathcal{X},\, x' \in \mathcal{X}} \frac{Q(Z \in S \mid x)}{Q(Z \in S \mid x')} \le \exp(\alpha)$$

In terms of densities: the log-likelihood ratio $\log Q(z \mid x) - \log Q(z \mid x')$ is bounded by $\alpha$ in absolute value, uniformly in $z$, $x$, $x'$.
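
As a concrete instance of an $\alpha$-private channel, here is a minimal sketch (illustrative code, not from the slides) of binary randomized response [Warner 65], together with the de-biasing step a collector would apply:

```python
import math
import random

def randomized_response(x: int, alpha: float) -> int:
    """Privatize one bit x in {0, 1}: report the truth with probability
    e^alpha / (1 + e^alpha), else flip. For any output z and inputs x, x',
    Q(z | x) / Q(z | x') <= e^alpha, so the channel is alpha-DP."""
    p_truth = math.exp(alpha) / (1.0 + math.exp(alpha))
    return x if random.random() < p_truth else 1 - x

def debiased_mean(zs: list, alpha: float) -> float:
    """Unbiased estimate of E[X] from the privatized bits: invert the
    affine map E[Z] = p * E[X] + (1 - p) * (1 - E[X])."""
    p = math.exp(alpha) / (1.0 + math.exp(alpha))
    z_bar = sum(zs) / len(zs)
    return (z_bar - (1 - p)) / (2 * p - 1)
```

The de-biasing inflates the variance by roughly $\bigl((e^\alpha + 1)/(e^\alpha - 1)\bigr)^2 \approx 4/\alpha^2$ for small $\alpha$, which is exactly the kind of effective-sample-size reduction $n \mapsto n\alpha^2$ quantified below.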

Private Minimax Risk

Central object of study: the minimax risk.
•  Parameter of distribution: $\theta(P)$
•  Family of distributions: $\mathcal{P}$
•  Loss measuring error: $\ell$

Minimax risk:
$$\mathfrak{M}_n(\theta(\mathcal{P}), \ell) := \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\Bigl[\ell\bigl(\hat{\theta}(X_1^n), \theta(P)\bigr)\Bigr]$$

$\alpha$-private minimax risk, where $\mathcal{Q}_\alpha$ is the family of $\alpha$-private channels; the outer infimum selects the best $\alpha$-private channel, so this is the minimax risk under the privacy constraint:
$$\mathfrak{M}_n(\theta(\mathcal{P}), \ell, \alpha) := \inf_{Q \in \mathcal{Q}_\alpha} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P,Q}\Bigl[\ell\bigl(\hat{\theta}(Z_1^n), \theta(P)\bigr)\Bigr]$$

Vignette: private mean (location) estimation

Example: estimate reasons for hospital visits. Patients are admitted to the hospital for substance abuse; estimate the prevalence of different substances. Each patient contributes an indicator vector (one example shown at left), and the estimands are the population proportions:

1  Alcohol         $\theta_1 = .45$
1  Cocaine         $\theta_2 = .32$
0  Heroin          $\theta_3 = .16$
0  Cannabis        $\theta_4 = .20$
0  LSD             $\theta_5 = .00$
0  Amphetamines    $\theta_6 = .02$

Vignette: mean estimation

Consider estimation of the mean $\theta(P) := \mathbb{E}_P[X] \in \mathbb{R}^d$, with errors measured in the $\ell_\infty$-norm, i.e. $\mathbb{E}[\|\hat{\theta} - \theta\|_\infty]$, for
$$\mathcal{P}_d := \bigl\{\text{distributions } P \text{ supported on } [-1, 1]^d\bigr\}$$

Proposition: the minimax rate (achieved by the sample mean) is
$$\mathfrak{M}_n(\mathcal{P}_d, \|\cdot\|_\infty) \asymp \min\Bigl\{1, \frac{\sqrt{\log d}}{\sqrt{n}}\Bigr\}$$
while the private minimax rate for $\alpha = O(1)$ is
$$\mathfrak{M}_n(\mathcal{P}_d, \|\cdot\|_\infty, \alpha) \asymp \min\Bigl\{1, \frac{\sqrt{d \log d}}{\sqrt{n \alpha^2}}\Bigr\}$$

Note: the effective sample size shrinks as $n \mapsto n\alpha^2/d$.

Optimal mechanism?

Non-private observation: $X = (1, 0, 1, 0, 0)^\top$.

Idea 1: add independent noise (e.g. the standard Laplace mechanism):
$$Z = X + W = (1 + W_1,\; 0 + W_2,\; 1 + W_3,\; 0 + W_4,\; 0 + W_5)^\top$$
Problem: the noise magnitude is much too large (and this is unavoidable: the approach is provably sub-optimal).

Optimal mechanism

•  Draw $v$ uniformly in $\{0, 1\}^d$, giving two candidate views of $X$: View 1 is $v = (0, 1, 1, 0, 0)^\top$ and View 2 is $1 - v = (1, 0, 0, 1, 1)^\top$ (here $v$ is the closer view, agreeing with $X$ in 3 coordinates; $1 - v$ is the farther, agreeing in 2)
•  With probability $\frac{e^{\alpha}}{1 + e^{\alpha}}$, release whichever of $v$ and $1 - v$ is closer to $X$; otherwise, release the farther one
•  At the end: compute the sample average and de-bias (a code sketch follows)
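
A minimal sketch of this mechanism for binary $X \in \{0,1\}^d$ (illustrative code, not from the slides; the de-biasing constant has no simple closed form, so it is calibrated by Monte Carlo here, an implementation choice of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def privatize(x: np.ndarray, alpha: float) -> np.ndarray:
    """Release v or 1 - v for binary x in {0,1}^d: the closer view
    with probability e^alpha / (1 + e^alpha), else the farther one."""
    d = x.size
    v = rng.integers(0, 2, size=d)
    closer, farther = (v, 1 - v) if np.sum(v == x) >= d / 2 else (1 - v, v)
    take_closer = rng.random() < np.exp(alpha) / (1 + np.exp(alpha))
    return closer if take_closer else farther

def bias_factor(d: int, alpha: float, trials: int = 20000) -> float:
    """Monte Carlo estimate of gamma in E[Z_j | X] = 1/2 + gamma (X_j - 1/2);
    by symmetry gamma depends only on d and alpha, so calibrate on X = 1."""
    zs = np.array([privatize(np.ones(d, dtype=int), alpha) for _ in range(trials)])
    return float(np.mean(2 * zs - 1))

def estimate_mean(xs: np.ndarray, alpha: float) -> np.ndarray:
    """Privatize each individual's vector (rows of xs), average, de-bias."""
    gamma = bias_factor(xs.shape[1], alpha)
    z_bar = np.mean([privatize(x, alpha) for x in xs], axis=0)
    return 0.5 + (z_bar - 0.5) / gamma
```

Unlike the Laplace mechanism, every released $Z_i$ lies in $\{0,1\}^d$, so the per-coordinate noise stays bounded; the price of privacy appears only through the shrinkage factor $\gamma$, which the final averaging step inverts.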

Empirical evidence

[Figure: estimated proportions of emergency room visits involving different substances, plotted against sample size n. Data source: Drug Abuse Warning Network.]

Sample size reductions

Given an $\alpha$-private channel $Q$ (mapping $X_i$ to $Z_i$), a pair of distributions $\{P_1, P_2\}$ induces the marginals
$$M_j(S) := \int Q(S \mid x_1, \ldots, x_n)\, dP_j^n(x_1, \ldots, x_n), \qquad j \in \{1, 2\}.$$

Question: how much “contraction” does privacy induce?

Theorem (data processing): for any $\alpha$-private channel $Q$ and an i.i.d. sample of size $n$,
$$D_{\mathrm{kl}}(M_1 \,\|\, M_2) + D_{\mathrm{kl}}(M_2 \,\|\, M_1) \le 4 n (e^{\alpha} - 1)^2 \|P_1 - P_2\|_{\mathrm{TV}}^2$$

Note: this gives the effective sample size $n \mapsto n\alpha^2$ for $\alpha \lesssim 1$.
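
The contraction is easy to check numerically. A small sketch (illustrative, not from the talk) for one-bit randomized response with $n = 1$ and Bernoulli $P_1, P_2$: the symmetrized KL divergence of the raw distributions far exceeds that of the privatized marginals, which sits below the theorem's bound.

```python
import math

def kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def rr_marginal(theta: float, alpha: float) -> float:
    """P(Z = 1) when X ~ Bernoulli(theta) passes through
    alpha-differentially-private randomized response."""
    p = math.exp(alpha) / (1 + math.exp(alpha))
    return p * theta + (1 - p) * (1 - theta)

theta1, theta2, alpha = 0.5, 0.9, 0.5
m1, m2 = rr_marginal(theta1, alpha), rr_marginal(theta2, alpha)
sym_kl_raw = kl(theta1, theta2) + kl(theta2, theta1)
sym_kl_private = kl(m1, m2) + kl(m2, m1)
# Theorem with n = 1; TV distance between the Bernoullis is |theta1 - theta2|.
bound = 4 * (math.exp(alpha) - 1) ** 2 * abs(theta1 - theta2) ** 2
print(f"raw: {sym_kl_raw:.3f}  private: {sym_kl_private:.3f}  bound: {bound:.3f}")
# raw: 0.879  private: 0.039  bound: 0.269
```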

Final remarks: privacy

Rough technique: reduction of estimation to testing, then apply information-theoretic testing lower bounds [Le Cam, Hasminskii, Ibragimov, Assouad, Birgé, Barron, Yu, ...]

Key: this allows identification of new optimal mechanisms.

Additional examples:
‣ Fixed-design regression
‣ Convex risk minimization
‣ Multinomial estimation
‣ Nonparametric density estimation

Almost always: effective sample size reduction $n \mapsto n\alpha^2$; in $d$-dimensional problems, $n \mapsto n\alpha^2/d$.

Part II: Communication and Minimax Risk

with John Duchi, Martin Wainwright and Yuchen Zhang, University of California, Berkeley

Communication constraints

•  Large data necessitates distributed storage
•  Independent data collection (hospitals)
•  Privacy?

Setting: each of $m$ agents holds a sample $X^i = (X^i_1, X^i_2, \ldots, X^i_n)$ of size $n$ and sends a message $Z_i$ to a fusion center, which computes the estimate $\hat{\theta}$.

Question: what are the tradeoffs between communication and statistical utility?

[Yao 79; Abelson 80; Tsitsiklis and Luo 87; Han & Amari 98; Tatikonda & Mitter 04; ...]

Minimax communication

Central object of study: the minimax risk with $B$-bounded communication.
•  Parameter of distribution: $\theta(P)$
•  Family of distributions: $\mathcal{P}$
•  Loss: $\|\cdot\|_2^2$

$$\mathfrak{M}_n(\theta(\mathcal{P}), B) := \inf_{\pi \in \Pi_B} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\Bigl[\bigl\|\hat{\theta}(Z_1^m) - \theta(P)\bigr\|_2^2\Bigr]$$

Here each message $Z_i = \pi(X^i)$ is constrained to be at most $B$ bits, and the outer infimum selects the best such protocol.

Vignette: mean estimation

Consider estimation in the normal location family:
$$X^i_j \overset{\text{iid}}{\sim} N(\theta, \sigma^2 I_{d \times d}), \qquad \theta \in [-1, 1]^d$$

Minimax rate (unconstrained): when each agent has a sample of size $n$,
$$\mathbb{E}\bigl[\|\hat{\theta}(X^1, \ldots, X^m) - \theta\|_2^2\bigr] \asymp \frac{\sigma^2 d}{nm}$$

Minimax rate with $B$-bounded communication. Theorem: when each agent has a sample of size $n$ and sends $B$ bits,
$$\frac{d}{B \wedge d} \cdot \frac{1}{\log m} \cdot \frac{\sigma^2 d}{nm} \;\lesssim\; \mathfrak{M}_n(N_d, B) \;\lesssim\; \frac{d \log m}{B \wedge d} \cdot \frac{\sigma^2 d}{nm}$$

Consequence: each agent must send $\approx d$ bits for optimal estimation.
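
A toy illustration of the tradeoff (a hypothetical protocol sketched for this writeup, not the one from the talk): each agent spends one bit per coordinate, stochastically rounding its clipped local sample mean, and the fusion center averages the unbiased messages.

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(local_sample: np.ndarray) -> np.ndarray:
    """Agent side, d bits total (one per coordinate): clip the local
    sample mean to [-1, 1], then round each coordinate to {-1, +1}
    randomly so that the transmitted bit is unbiased for the mean."""
    mean = np.clip(local_sample.mean(axis=0), -1.0, 1.0)
    p_plus = (mean + 1.0) / 2.0        # P(bit = +1) makes E[bit] = mean
    return np.where(rng.random(mean.size) < p_plus, 1.0, -1.0)

def fuse(messages: np.ndarray) -> np.ndarray:
    """Fusion center: average the unbiased one-bit-per-coordinate messages."""
    return messages.mean(axis=0)

# m agents, each with n observations of a d-dimensional Gaussian
m, n, d, sigma = 50, 100, 8, 1.0
theta = rng.uniform(-1, 1, size=d)
msgs = np.stack([encode(theta + sigma * rng.standard_normal((n, d)))
                 for _ in range(m)])
print("error:", np.linalg.norm(fuse(msgs) - theta))
```

With only one bit per coordinate ($B = d$), the quantization noise rather than the sampling noise dominates the error; the theorem above quantifies how the rate degrades and how many bits are needed before the unconstrained $\sigma^2 d / (nm)$ rate is recovered.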

Part III: Computation and Minimax Risk

with John Duchi and Brendan McMahan

Towards computation

Linear model: target $Y \in \mathbb{R}^n$, design matrix $X \in \mathbb{R}^{n \times d}$ with highly-varying or sparse entries, parameter $\theta \in \mathbb{R}^d$, noise $\varepsilon \in \mathbb{R}^n$:
$$Y \approx X\theta + \varepsilon$$

•  Both large $n$ and large $d$
•  Analogous models for classification (e.g. logistic)
•  Columns very different in scale (ill-conditioning)

[Duchi, Hazan, Singer 11; Duchi, Jordan, McMahan 13]

Sparse and heavy-tailed

[Figure: feature frequencies across 2000 webpages, exhibiting a heavy tail.]

Medical risk prediction: $n \approx 10^5$ patients and $d \approx 3 \cdot 10^4$ indicator variables; binary interactions yield $d \approx 4.5 \cdot 10^8$ (sparse) features.

Text data classification: binary word indicators for $d \approx 5 \cdot 10^4$ words; bigrams lead to $d \approx 3 \cdot 10^6$ variables, $< 1\%$ non-zero.

[Manning & Schütze 99; Shah & Meinshausen 13; Li & König 11; Ma et al. 09; Crammer, Dredze, Kulesza 09; ...]

Convex risk minimization

Large $n$, large $d$: focus on prediction. Minimize
$$R_P(\theta) := \mathbb{E}_P[\ell(X; \theta)]$$
where $\ell$ is convex in $\theta$ and $X \sim P$.

Computation = optimization complexity [Nemirovski & Yudin 83]: one unit of computation is a first-order oracle call $\{x, \theta\} \mapsto \bigl(\ell(x; \theta), \nabla_{\theta}\, \ell(x; \theta)\bigr)$.

Minimax risk with $n$ computations, where $\mathcal{C}_n$ denotes estimators computable with at most $n$ oracle calls:
$$\mathfrak{M}_n(\Theta, \mathcal{P}, \ell) := \inf_{\hat{\theta} \in \mathcal{C}_n} \sup_{P \in \mathcal{P}} \Bigl\{\mathbb{E}_P[R_P(\hat{\theta})] - \min_{\theta \in \Theta} R_P(\theta)\Bigr\}$$
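
In code, the unit of computation is just a function returning a loss value and a gradient. A sketch (the squared-error loss is a stand-in chosen for illustration, not fixed by the slides):

```python
import numpy as np

def first_order_oracle(x: np.ndarray, y: float, theta: np.ndarray):
    """One unit of computation in the Nemirovski-Yudin model: given a
    data point (x, y) and a query theta, return the loss l((x, y); theta)
    and its gradient in theta (squared error for a linear predictor)."""
    residual = x @ theta - y
    return 0.5 * residual ** 2, residual * x
```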

Optimal convergence

Lower bounds: for linear functionals on the hypercube, $\ell(x; \theta) = x^\top \theta$ with $\Theta = [-1, 1]^d$ and $X \in \{-1, 0, 1\}^d$, where feature $j$ of $X$ appears with probability $p_j$ (so the problem amounts to learning the signs of $\mathbb{E}[X]$):
$$\mathfrak{M}_n(\Theta, \mathcal{P}, \ell) \ge \frac{1}{8} \cdot \frac{1}{\sqrt{n}} \sum_{j=1}^{d} \sqrt{p_j}$$

Methods and challenges

Stochastic approximation. Iterate:
1. Draw random $X_i$ and compute the gradient $g_i = \nabla \ell(X_i; \theta_i)$
2. Update $\theta_{i+1} \leftarrow \theta_i - \frac{1}{\sqrt{i}}\, g_i$

Efficient? Sometimes. [Robbins & Monro 51; Polyak & Juditsky 92; Nedić & Bertsekas 01; Lai 03; Nemirovski, Juditsky, Lan, Shapiro 09; Agarwal, Bartlett, Ravikumar, Wainwright 12; ...]

Problem: the method (and its variants) is sub-optimal for data with high-variance features; sometimes a factor of $d$ worse than necessary. [Figure: contours of $R(\theta)$, stretched along directions corresponding to uncommon features and narrow along common ones.]

Can we do something a bit more second order? Rescaling the risk turns the hard, elongated contours into easier, rounder ones. AdaGrad does exactly this with a diagonal metric built from past gradients:
$$A_i = \operatorname{diag}\Bigl(\sum_{j \le i} g_j g_j^\top\Bigr)^{1/2}, \qquad \theta_{i+1} \leftarrow \theta_i - A_i^{-1} g_i$$
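
A compact implementation of this diagonal AdaGrad update (a sketch; the step size eta and the eps guard against division by zero are standard additions not shown on the slide):

```python
import numpy as np

def adagrad(grad, theta0: np.ndarray, n_steps: int,
            eta: float = 1.0, eps: float = 1e-8) -> np.ndarray:
    """Diagonal AdaGrad: theta_{i+1} = theta_i - eta * A_i^{-1} g_i, where
    A_i = diag(sum_{j<=i} g_j g_j^T)^{1/2} accumulates squared gradients.
    grad is a callable returning a stochastic gradient at theta."""
    theta = theta0.copy()
    sq_sum = np.zeros_like(theta)          # diagonal of sum_j g_j g_j^T
    for _ in range(n_steps):
        g = grad(theta)
        sq_sum += g * g
        theta -= eta * g / (np.sqrt(sq_sum) + eps)   # per-coordinate rescaling
    return theta

# Example: least squares with sparse features of very different frequencies.
rng = np.random.default_rng(0)
d, theta_star = 20, np.ones(20)
def lsq_grad(theta):
    x = rng.standard_normal(d) * (rng.random(d) < 0.1)   # each feature w.p. 0.1
    y = x @ theta_star + 0.1 * rng.standard_normal()
    return (x @ theta - y) * x

print(adagrad(lsq_grad, np.zeros(d), n_steps=5000)[:5])
```

Because each coordinate is divided by its own accumulated gradient magnitude, rarely-seen features automatically receive larger steps: exactly the second-order-like rescaling motivated above.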

Optimal convergence

Upper bounds: iterate 1. gradient $g_i = \nabla \ell(X_i; \theta_i)$; 2. scale $A_i = \operatorname{diag}\bigl(\sum_{j \le i} g_j g_j^\top\bigr)^{1/2}$; 3. update $\theta_{i+1} \leftarrow \theta_i - A_i^{-1} g_i$. For any risk functional,
$$\mathbb{E}[R(\hat{\theta})] - R(\theta^*) \le \frac{1}{\sqrt{2n}}\, \mathbb{E}\biggl[\min_{\text{diagonal } A \succeq 0} \Bigl\{\|\theta^*\|_\infty^2 \operatorname{tr}(A) + \sum_{i=1}^{n} g_i^\top A^{-1} g_i\Bigr\}\biggr]$$
The oracle inside the expectation is adaptive to this data: the bound competes with the best diagonal rescaling in hindsight.

Lower bounds: for linear functionals on the hypercube, $\ell(x; \theta) = x^\top \theta$, $\Theta = [-1, 1]^d$, with feature $j$ of $X$ appearing with probability $p_j$:
$$\mathfrak{M}_n(\Theta, \mathcal{P}, \ell) \ge \frac{1}{8} \cdot \frac{1}{\sqrt{n}} \sum_{j=1}^{d} \sqrt{p_j}$$

For linear models on the hypercube, the upper bound specializes to
$$\mathbb{E}[R(\hat{\theta})] - R(\theta^*) \le \frac{\sqrt{2}}{\sqrt{n}} \sum_{j=1}^{d} \sqrt{p_j}$$

Note: the price of adaptivity is less than a factor of 12 over the lower bound ($8\sqrt{2} \approx 11.3$).

Empirics: text classification

Reuters news feed document classification task: $d = 2{,}000{,}000$ features, 4000 non-zero per document, $n = 800{,}000$ documents [Crammer et al. '06].

[Figure: misclassification rate; the adaptive method gives a 25-50% improvement.]

Empirics: image ranking

Task: 15,000 different nouns; given a noun, rank images in order of relevance to the noun. Per noun: $n \approx 2 \cdot 10^6$ observations in $d \approx 10^4$ dimensions [Bengio, Weston, Grangier 10].

[Figure: proportion of good images in the top k, as a function of k.]

One more result

[Figure: accuracy on test set, average frame accuracy (%) versus time (hours), comparing SGD, GPU, Downpour SGD, Downpour SGD w/ Adagrad (the adaptive method), and Sandblaster L-BFGS.] [Dean et al. 12]

Order of magnitude improvement in time & cost. In many production systems at Google. (Only result I may show...)

Unifying picture

Constraints in minimax analysis:

Reduction of estimation to testing: a Markov chain $V \to X \to Z$, where $Z$ is the constrained object ($Z$ is private, only a few bits, a function evaluation, ...); the task is to find the initial $V$ using only $Z$.

Information-theoretic techniques: strong data processing. Relate the information $V$ carries to $Z$ via the constraint on the pair $X \to Z$, then apply
‣ Fano
‣ Assouad
‣ Le Cam

Understanding these constraints leads to new, optimal schemes.
