Statistical Decision Theory and Information Constraints
Michael I. Jordan University of California, Berkeley
November 20, 2014
What Is the Big Data Phenomenon?
• Science in confirmatory mode (e.g., particle physics)
• Science in exploratory mode (e.g., astronomy, genomics)
• Measurement of human activity, particularly online activity, is generating massive datasets that can be used (e.g.) for personalization and for creating markets
• Sensor networks are becoming pervasive
What Are the Conceptual/Mathematical Issues?
• The need to control statistical risk under constraints on algorithmic runtime
  – how do risk and runtime trade off as a function of the amount of data?
• Statistical inference with distributed and streaming data
  – how is inferential quality impacted by communication constraints?
• The tradeoff between statistical risk and privacy (and other externalities)
• Many other issues that require a blend of statistical thinking (e.g., a focus on sampling, confidence intervals, evaluation, diagnostics, causal inference) and computational thinking (e.g., scalability, abstraction)
Our Approach
• Take (classical) statistical decision theory as a mathematical point of departure
• Treat computation, communication, privacy, etc., as constraints on statistical risk
• This induces tradeoffs among these quantities and the number of data points
• Under the hood: geometry, information theory and optimization
Background
• In the 1930s, Wald laid the foundations of statistical decision theory
• Given a family of probability distributions $\mathcal{P}$, a parameter $\theta(P)$ for each $P \in \mathcal{P}$, an estimator $\hat{\theta}$, and a loss $l(\hat{\theta}, \theta(P))$, define the risk:
  $R_P(\hat{\theta}) := \mathbb{E}_P\big[l(\hat{\theta}, \theta(P))\big]$
• Minimax principle [Wald, '39, '43]: choose the estimator minimizing the worst-case risk:
  $\sup_{P \in \mathcal{P}} \mathbb{E}_P\big[l(\hat{\theta}, \theta(P))\big]$
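The worst-case risk above can be made concrete in a toy case. A minimal sketch (the Bernoulli family and squared loss are illustrative choices, not from the slide): the sample mean of $n$ Bernoulli($p$) observations has risk $p(1-p)/n$, and the minimax criterion takes the supremum over $p$.

```python
def worst_case_risk_sample_mean(n, grid=1001):
    """Toy instance of Wald's worst-case criterion: for X_1..X_n ~ Bernoulli(p)
    and the sample-mean estimator under squared loss, the risk is
    R_P = p(1 - p) / n; take the sup over p on a grid of [0, 1]."""
    return max(p * (1.0 - p) / n for p in (i / (grid - 1) for i in range(grid)))
```

The supremum is attained at $p = 1/2$, giving $1/(4n)$.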
Part I: Privacy and Minimax Risk
with John Duchi and Martin Wainwright University of California, Berkeley
Privacy and Risk
• Individuals are not generally willing to allow their personal data to be used without control on how it will be used and how much privacy loss they will incur
• We will quantify “privacy loss” via differential privacy
• We then treat differential privacy as a constraint on inference via statistical decision theory
• This yields (personal) tradeoffs between privacy loss and inferential gain
A model of privacy
Local privacy: providers do not trust the collector [Warner 65; Evfimievski et al. 03]
• Individuals with private data: $X_i \overset{iid}{\sim} P$, $i \in \{1, \ldots, n\}$
• Channel: each $X_i$ is released only through a privatized view $Z_i \sim Q(\cdot \mid X_i)$
• Estimator: $Z_1^n \mapsto \hat{\theta}(Z_1^n)$
Definitions of privacy
Definition: the channel $Q$ is $\alpha$-differentially private if
  $\sup_{S,\, x \in \mathcal{X},\, x' \in \mathcal{X}} \dfrac{Q(Z \in S \mid x)}{Q(Z \in S \mid x')} \le \exp(\alpha)$
i.e., $|\log Q(z \mid x) - \log Q(z \mid x')| \le \alpha$ for all $z, x, x'$
[Dwork, McSherry, Nissim, Smith 06]
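For intuition, the simplest $\alpha$-differentially-private channel is Warner-style randomized response on a single bit. This sketch (an illustration, not the slide's multivariate mechanism) exposes the channel probabilities so the definition can be checked directly.

```python
import math
import random

def randomized_response(x, alpha, rng=random):
    """Warner-style randomized response: an alpha-differentially-private
    channel for one private bit x in {0, 1}. Keeps x with probability
    e^alpha / (1 + e^alpha) and flips it otherwise."""
    p_keep = math.exp(alpha) / (1.0 + math.exp(alpha))
    return x if rng.random() < p_keep else 1 - x

def channel_prob(z, x, alpha):
    """Exact Q(Z = z | X = x) for the channel above, so the bound
    sup Q(z|x) / Q(z|x') <= e^alpha can be verified directly."""
    p_keep = math.exp(alpha) / (1.0 + math.exp(alpha))
    return p_keep if z == x else 1.0 - p_keep
```

The likelihood ratio between any two inputs is exactly $e^\alpha$ in the worst case, so the privacy constraint holds with equality.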
Private Minimax Risk
Central object of study: minimax risk
• Parameter of distribution: $\theta(P)$
• Family of distributions: $\mathcal{P}$
• Loss measuring error: $\ell$
• Family of private channels: $\mathcal{Q}_\alpha$

Minimax risk:
  $\mathfrak{M}_n(\theta(\mathcal{P}), \ell) := \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\big[\ell(\hat{\theta}(X_1^n), \theta(P))\big]$

$\alpha$-private minimax risk (minimax risk under the privacy constraint; the outer infimum picks the best $\alpha$-private channel):
  $\mathfrak{M}_n(\theta(\mathcal{P}), \ell, \alpha) := \inf_{Q \in \mathcal{Q}_\alpha} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P,Q}\big[\ell(\hat{\theta}(Z_1^n), \theta(P))\big]$
Vignette: private mean (location) estimation
Example: estimate reasons for hospital visits. Patients admitted to hospital for substance abuse; estimate the prevalence of different substances.
Each patient’s record is an indicator vector, e.g. $X = (1, 1, 0, 0, 0, 0)$ over (Alcohol, Cocaine, Heroin, Cannabis, LSD, Amphetamines); the population proportions are $\theta = (\theta_1, \ldots, \theta_6) = (.45, .32, .16, .20, .00, .02)$
Vignette: mean estimation
Consider estimation of the mean $\theta(P) := \mathbb{E}_P[X] \in \mathbb{R}^d$, with errors measured in the $\ell_\infty$-norm, i.e. $\mathbb{E}[\|\hat{\theta} - \theta\|_\infty]$, for
  $\mathcal{P}_d := \{$distributions $P$ supported on $[-1, 1]^d\}$

Minimax rate (achieved by the sample mean):
  $\mathfrak{M}_n(\mathcal{P}_d, \|\cdot\|_\infty) \asymp \min\Big\{1, \frac{\sqrt{\log d}}{\sqrt{n}}\Big\}$

Proposition: private minimax rate for $\alpha = O(1)$:
  $\mathfrak{M}_n(\mathcal{P}_d, \|\cdot\|_\infty, \alpha) \asymp \min\Big\{1, \frac{\sqrt{d \log d}}{\sqrt{n\alpha^2}}\Big\}$

Note: effective sample size $n \mapsto n\alpha^2/d$
Optimal mechanism?
Non-private observation: $X = (1, 0, 1, 0, 0)^\top$
Idea 1: add independent noise (e.g., the standard Laplace mechanism):
  $Z = X + W = (1 + W_1,\ 0 + W_2,\ 1 + W_3,\ 0 + W_4,\ 0 + W_5)^\top$
Problem: the noise magnitude is much too large (and this is unavoidable: the approach is provably sub-optimal)
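A minimal sketch of Idea 1, assuming the standard calibration of Laplace noise to $\ell_1$-sensitivity (here $2d$ for vectors in $[-1,1]^d$; this calibration is an assumption of the sketch, not a detail on the slide). The noise scale growing linearly in $d$ is exactly why the released vector is so noisy.

```python
import random

def sample_laplace(scale, rng):
    """Laplace(0, scale) variate, as the difference of two i.i.d.
    exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(x, alpha, rng=random):
    """Idea 1 from the slide: privatize x in [-1, 1]^d by adding i.i.d.
    Laplace noise. Calibrating to the l1-sensitivity of x (2d on this
    domain, the standard choice assumed here) forces the per-coordinate
    noise scale to grow with the dimension d."""
    scale = 2.0 * len(x) / alpha
    return [xi + sample_laplace(scale, rng) for xi in x]
```

Even for the 5-dimensional example above with $\alpha = 1$, each coordinate receives noise of scale 10 around a value in $\{0, 1\}$.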
Optimal mechanism
Non-private observation: $X = (1, 0, 1, 0, 0)^\top$
• Draw $v$ uniformly in $\{0, 1\}^d$, giving two candidate views: View 1, $v = (0, 1, 1, 0, 0)^\top$ (closer: 3 coordinates overlap with $X$), and View 2, $1 - v = (1, 0, 0, 1, 1)^\top$ (farther: 2 overlap)
• With probability $\frac{e^\alpha}{1 + e^\alpha}$, choose whichever of $v$ and $1 - v$ is closer to $X$; otherwise, choose the farther one
• At the end: compute the sample average of the released views and de-bias
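The sampling step above can be sketched as follows (a simplified reading of the slide, with the de-biasing step omitted and the helper names chosen here for illustration):

```python
import math
import random

def private_view(x, alpha, rng=random):
    """Sketch of the slide's sampling mechanism: draw v uniformly in
    {0,1}^d, and with probability e^alpha / (1 + e^alpha) release
    whichever of v and its complement 1 - v overlaps x in more
    coordinates; otherwise release the farther one."""
    d = len(x)
    v = [rng.randint(0, 1) for _ in range(d)]
    w = [1 - vi for vi in v]
    # overlap = number of coordinates where the candidate agrees with x
    overlap_v = sum(1 for a, b in zip(v, x) if a == b)
    closer, farther = (v, w) if 2 * overlap_v >= d else (w, v)
    p_closer = math.exp(alpha) / (1.0 + math.exp(alpha))
    return closer if rng.random() < p_closer else farther
```

Because $v$ and $1 - v$ split the $d$ coordinates between them, the released vector leaks only a biased coin's worth of information about which side $X$ falls on.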
Empirical evidence
Estimate the proportion of emergency room visits involving different substances. (Figure: estimation error against sample size $n$; data source: Drug Abuse Warning Network)
Sample size reductions
Given an $\alpha$-private channel $Q$, a pair $\{P_1, P_2\}$ induces the marginals ($X_i \sim P_j$, $Z_i \sim Q$):
  $M_j(S) := \int Q(S \mid x_1, \ldots, x_n)\, dP_j^n(x_1, \ldots, x_n)$

Question: how much “contraction” does privacy induce?

Theorem (data processing): for any $\alpha$-private channel $Q$ and an i.i.d. sample of size $n$,
  $D_{\mathrm{kl}}(M_1 \| M_2) + D_{\mathrm{kl}}(M_2 \| M_1) \le 4 n (e^\alpha - 1)^2 \|P_1 - P_2\|_{\mathrm{TV}}^2$

Note: for $\alpha \lesssim 1$, this gives the effective sample size $n \mapsto n\alpha^2$
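The theorem can be sanity-checked numerically in the simplest case: $n = 1$, Bernoulli $P_j$, and a randomized-response channel (the channel is chosen here for concreteness; the theorem covers any $\alpha$-private channel).

```python
import math

def bern_kl(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def check_data_processing(p1, p2, alpha):
    """Return (lhs, rhs) of the data-processing bound
    D(M1||M2) + D(M2||M1) <= 4 n (e^a - 1)^2 ||P1 - P2||_TV^2
    for n = 1, P_j = Bernoulli(p_j), and an alpha-DP randomized-response
    channel that keeps the bit with probability e^a / (1 + e^a)."""
    q = math.exp(alpha) / (1 + math.exp(alpha))
    m1 = p1 * q + (1 - p1) * (1 - q)  # marginal prob. that Z = 1 under P1
    m2 = p2 * q + (1 - p2) * (1 - q)
    lhs = bern_kl(m1, m2) + bern_kl(m2, m1)
    tv = abs(p1 - p2)  # total variation between two Bernoullis
    rhs = 4 * (math.exp(alpha) - 1) ** 2 * tv ** 2
    return lhs, rhs
```

The symmetric KL between the privatized marginals stays well below the bound, and both sides shrink like $\alpha^2$ as $\alpha \to 0$.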
Final remarks: privacy
Rough technique: reduction of estimation to testing, then apply information-theoretic testing lower bounds [Le Cam, Hasminskii, Ibragimov, Assouad, Birgé, Barron, Yu, ...]
Key: allows identification of new optimal mechanisms
Additional examples:
‣ Fixed-design regression
‣ Convex risk minimization
‣ Multinomial estimation
‣ Nonparametric density estimation
Almost always: effective sample size reduction $n \mapsto n\alpha^2$; in $d$-dimensional problems, $n \mapsto n\alpha^2/d$
Part II: Communication and Minimax Risk
with John Duchi, Martin Wainwright and Yuchen Zhang
University of California, Berkeley
Communication constraints
• Large data necessitates distributed storage
• Independent data collection (hospitals)
• Privacy?
Setting: each of $m$ agents has a sample $X^i = (X^i_1, X^i_2, \ldots, X^i_n)$ of size $n$ and sends a message $Z_i$ to a fusion center, which forms the estimate $\hat{\theta}$
Question: what are the tradeoffs between communication and statistical utility?
[Yao 79; Abelson 80; Tsitsiklis and Luo 87; Han & Amari 98; Tatikonda & Mitter 04; ...]
Minimax communication
Central object of study: minimax risk with $B$-bounded communication
• Parameter of distribution: $\theta(P)$
• Family of distributions: $\mathcal{P}$
• Loss: $\|\cdot\|_2^2$
  $\mathfrak{M}_n(\theta(\mathcal{P}), B) := \inf_{\pi \in \Pi_B} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\big[\|\hat{\theta}(Z_1^m) - \theta(P)\|_2^2\big]$
The outer infimum is over the best protocols $Z_i = \pi(X^i)$ in which each message $Z_i$ is constrained to be smaller than $B$ bits
Vignette: mean estimation
Consider estimation in the normal location family, $X^i_j \overset{iid}{\sim} N(\theta, \sigma^2 I_{d \times d})$, $\theta \in [-1, 1]^d$

Minimax rate (Theorem): when each agent has a sample of size $n$,
  $\mathbb{E}[\|\hat{\theta}(X^1, \ldots, X^m) - \theta\|_2^2] \asymp \frac{\sigma^2 d}{nm}$

Minimax rate with $B$-bounded communication (Theorem): when each agent has a sample of size $n$ and sends $B$ bits,
  $\frac{d}{B \wedge d} \frac{1}{\log m} \frac{\sigma^2 d}{nm} \lesssim \mathfrak{M}_n(\mathcal{N}_d, B) \lesssim \frac{d \log m}{B \wedge d} \frac{\sigma^2 d}{nm}$

Consequence: each agent must send $\approx d$ bits for optimal estimation
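For intuition on why roughly one bit per coordinate can suffice, here is a toy one-round protocol for a bounded mean (an illustration under the assumption of data in $[-1, 1]$; it is not the paper's protocol for the Gaussian family): each agent stochastically quantizes its local value to a single bit, and the fusion center averages and de-biases.

```python
import random

def encode_bit(x, rng):
    """One-bit stochastic quantization of x in [-1, 1]:
    send 1 with probability (1 + x) / 2, so that E[2Z - 1] = x."""
    return 1 if rng.random() < (1.0 + x) / 2.0 else 0

def fusion_estimate(bits):
    """Fusion-center estimate: average the received bits and de-bias."""
    return 2.0 * sum(bits) / len(bits) - 1.0
```

With $m$ agents the estimator is unbiased with variance $O(1/m)$ per coordinate, at a cost of one bit per coordinate per agent.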
Part III: Computation and Minimax Risk
with John Duchi and Brendan McMahan
Towards computation
Model: $Y \approx X\theta + \varepsilon$, with target $Y \in \mathbb{R}^n$, highly-varying or sparse design $X \in \mathbb{R}^{n \times d}$, parameter $\theta \in \mathbb{R}^d$, and noise $\varepsilon \in \mathbb{R}^n$
• Both large $n$ and large $d$
• Columns very different (ill-conditioning)
• Analogous models for classification (e.g., logistic)
[D., Hazan, Singer 11; D., Jordan, McMahan 13]
Sparse and heavy-tailed
(Figure: feature frequencies for 2000 webpages)
• Medical risk prediction: $n \approx 10^5$ patients and $d \approx 3 \cdot 10^4$ indicator variables; binary interactions yield $d \approx 4.5 \cdot 10^8$ (sparse) features
• Text data classification: binary word indicators for $d \approx 5 \cdot 10^4$ words; bigrams lead to $d \approx 3 \cdot 10^6$ variables, < 1% non-zero
[Manning & Schütze 99; Shah & Meinshausen 13; Li & König 11; Ma et al. 09; Crammer, Dredze, Kulesza 09; ...]
Convex risk minimization
Large $n$, large $d$: focus on prediction. Minimize
  $R_P(\theta) := \mathbb{E}_P[\ell(X; \theta)]$
where $\ell$ is convex in $\theta$ and $X \sim P$

Computation = optimization complexity [Nemirovski & Yudin 83]: one unit of computation takes $\{x, \theta\}$ and returns $\ell(x; \theta)$ and $\nabla_\theta \ell(x; \theta)$

Minimax risk with $n$ computations:
  $\mathfrak{M}_n(\Theta, \mathcal{P}, \ell) := \inf_{\hat{\theta} \in \mathcal{C}_n} \sup_{P \in \mathcal{P}} \Big\{\mathbb{E}_P[R_P(\hat{\theta})] - \min_{\theta \in \Theta} R_P(\theta)\Big\}$
Optimal convergence
Lower bounds: linear functionals on the hypercube, $\ell(x; \theta) = x^\top \theta$, $\Theta = [-1, 1]^d$, estimand $\mathbb{E}[X]$. If feature $j$ of $X \in \{-1, 0, 1\}^d$ appears with probability $p_j$, then
  $\mathfrak{M}_n(\Theta, \mathcal{P}, \ell) \ge \frac{1}{8} \frac{1}{\sqrt{n}} \sum_{j=1}^d \sqrt{p_j}$
Methods and challenges
Stochastic approximation. Iterate:
1. Draw a random $X_i$; compute $g_i = \nabla \ell(X_i; \theta_i)$
2. Update $\theta_{i+1} \leftarrow \theta_i - \frac{1}{\sqrt{i}} g_i$
Efficient? Sometimes [Robbins & Monro 51; Polyak & Juditsky 92; Nedić & Bertsekas 01; Lai 03; Nemirovski, Juditsky, Lan, Shapiro 09; Agarwal, Bartlett, Ravikumar, Wainwright 12; ...]
Problem: the method (and its variants) is sub-optimal for data with high-variance features; sometimes a factor of $d$ worse than necessary
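The iteration above, as a minimal sketch (the stochastic-gradient oracle interface `stoch_grad(theta, rng)` is a hypothetical of this sketch):

```python
import math
import random

def sgd(stoch_grad, theta0, steps, rng=None):
    """Stochastic approximation as on the slide: at iteration i, draw a
    random example and step by 1/sqrt(i) along the negative stochastic
    gradient g_i."""
    rng = rng or random.Random(0)
    theta = list(theta0)
    for i in range(1, steps + 1):
        g = stoch_grad(theta, rng)
        step = 1.0 / math.sqrt(i)
        theta = [t - step * gj for t, gj in zip(theta, g)]
    return theta
```

Note the single scalar step size: every coordinate is treated identically, which is exactly what hurts on data whose features have very different frequencies.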
Methods and challenges
(Figure: contours of $R(\theta)$, elongated along the uncommon-feature direction and narrow along the common-feature direction; the raw risk is hard, the rescaled risk is easier)
Can we do something a bit more second order? AdaGrad rescales the update:
  $A_i = \mathrm{diag}\Big(\sum_{j \le i} g_j g_j^\top\Big)^{1/2}, \qquad \theta_{i+1} \leftarrow \theta_i - A_i^{-1} g_i$
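Diagonal AdaGrad as written on the slide, in a minimal sketch (the small `eps` guard against division by zero is an addition, not on the slide):

```python
import math

def adagrad(grad, theta0, steps):
    """Diagonal AdaGrad: A_i = diag(sum_{j<=i} g_j g_j^T)^(1/2),
    update theta <- theta - A_i^{-1} g_i. Coordinates with rare or small
    gradients automatically get larger effective step sizes."""
    theta = list(theta0)
    sumsq = [0.0] * len(theta)  # per-coordinate running sum of g_j^2
    eps = 1e-12  # guard against division by zero
    for _ in range(steps):
        g = grad(theta)
        for j, gj in enumerate(g):
            sumsq[j] += gj * gj
            theta[j] -= gj / (math.sqrt(sumsq[j]) + eps)
    return theta
```

The rescaling is the diagonal "second-order-ish" correction asked for on the previous slide: it whitens the per-coordinate gradient scales without ever forming a Hessian.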
Optimal convergence
Iterate (AdaGrad):
1. Gradient: $g_i = \nabla \ell(X_i; \theta_i)$
2. Scale: $A_i = \mathrm{diag}\big(\sum_{j \le i} g_j g_j^\top\big)^{1/2}$
3. Update: $\theta_{i+1} \leftarrow \theta_i - A_i^{-1} g_i$

Upper bounds: for any risk functional (the minimizing $A$ acts as an oracle adaptive to this data),
  $\mathbb{E}[R(\hat{\theta})] - R(\theta^*) \le \frac{1}{\sqrt{2n}}\, \mathbb{E}\bigg[\min_{\text{diagonal } A \succeq 0} \Big\{\|\theta^*\|_\infty^2\, \mathrm{tr}(A) + \sum_{i=1}^n g_i^\top A^{-1} g_i\Big\}\bigg]$

Lower bounds: for linear functionals on the hypercube, $\ell(x; \theta) = x^\top \theta$, $\Theta = [-1, 1]^d$, with feature $j$ of $X$ appearing with probability $p_j$,
  $\mathfrak{M}_n(\Theta, \mathcal{P}, \ell) \ge \frac{1}{8} \frac{1}{\sqrt{n}} \sum_{j=1}^d \sqrt{p_j}$

For linear models on the hypercube the upper bound becomes
  $\mathbb{E}[R(\hat{\theta})] - R(\theta^*) \le \frac{\sqrt{2}}{\sqrt{n}} \sum_{j=1}^d \sqrt{p_j}$

Note: adaptivity costs less than a factor of 12 over the lower bound
Empirics: text classification
Reuters news feed document classification task: d = 2,000,000 features, 4,000 non-zero / document, n = 800,000 documents [Crammer et al. ’06]
(Figure: misclassification rate; 25-50% improvement)
Empirics: image ranking
Task: 15,000 different nouns; given a noun, rank images in order of relevance for the noun. Per noun: $n \approx 2 \cdot 10^6$ observations in $d \approx 10^4$ dimensions [Bengio, Weston, Grangier 10]
(Figure: proportion good in top $k$, plotted against $k$)
One more result
(Figure: average frame accuracy (%) on the test set against time in hours, for SGD, GPU, Downpour SGD, Downpour SGD w/ Adagrad (the adaptive method), and Sandblaster L-BFGS) [Dean et al. 12]
Order of magnitude improvement in time & cost; in many production systems at Google (only result I may show...)
Unifying picture
Constraints in minimax analysis ($Z$ is private, only a few bits, a function evaluation, ...):
• Reduction of estimation to testing: $V \to X \to Z$; find the initial $V$ using only $Z$
• Information-theoretic techniques (strong data processing): relate the information from $V$ to $Z$ via the constraint on the pair $X \to Z$, then apply
  ‣ Fano
  ‣ Assouad
  ‣ Le Cam
• Understanding these constraints leads to new, optimal schemes