Interplay between Statistics and Optimization Hui Zou University of Minnesota SAMSI August 29, 2016


Interplay between Statistics and Optimization

Hui Zou

University of Minnesota

SAMSI, August 29, 2016


Part I Part II: MM Part III: ADMM

[Figure: LASSO coefficient paths. x-axis: |beta|/max|beta| (0.0 to 1.0); y-axis: standardized coefficients.]


Microarray data in the early 2000s

Large-scale multiple testing: false discovery rate (FDR) control (Benjamini and Hochberg), local FDR (Efron), higher criticism (Donoho and Jin), SAM (Tibshirani)

Regression analysis: Least Angle Regression (Efron, Hastie, Johnstone and Tibshirani), Lasso

Various penalization techniques: SCAD, MCP, Elastic Net, Adaptive Lasso, fused Lasso, group Lasso, ...

More sophisticated models/problems: GLM, GAM, precision matrix estimation, covariance matrix estimation, ...

Compressed sensing, Matrix completion, Robust PCA

Tensor regression, Tensor completion, Tensor decomposition


My personal view

Optimization for Statistics: model fitting, model formulation, theoretical analysis

Statistics for Optimization: new research thrusts

Today’s talk:

Majorization-Minimization (MM)

Alternating Direction Method of Multipliers (ADMM)


Majorization-Minimization

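Solve $\arg\min_\theta C(\theta)$

Majorization step:

$$C(\theta) < D(\theta \mid \theta_k) \ \text{ for any } \theta \neq \theta_k, \qquad C(\theta_k) = D(\theta_k \mid \theta_k)$$

Minimization step:

$$\theta_k \leftarrow \theta_{k+1} = \arg\min_\theta D(\theta \mid \theta_k)$$

Lange, Hunter and Yang (2000) [optimization transfer]; Hunter & Lange (2000) [MM]; Wu and Lange (2010) [EM and MM]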
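As a small illustration of the two steps (not taken from the talk), consider minimizing $C(\theta) = \sum_i |y_i - \theta|$. A standard quadratic majorizer of $|r|$ at $r_k \neq 0$ is $r^2/(2|r_k|) + |r_k|/2$, so each minimization step becomes a weighted mean. The function name `mm_median` and the guard `eps` are assumptions made for this sketch.

```python
import numpy as np

def mm_median(y, n_iter=200, eps=1e-10):
    """MM iterations for min_theta sum_i |y_i - theta| (a sample median)."""
    y = np.asarray(y, dtype=float)
    theta = y.mean()  # any starting value works; the mean is convenient
    for _ in range(n_iter):
        # Majorization: |r| <= r^2 / (2|r_k|) + |r_k| / 2, tight at r = r_k.
        # eps guards against division by zero when theta hits a data point.
        w = 1.0 / np.maximum(np.abs(y - theta), eps)
        # Minimization: the quadratic surrogate is minimized by a weighted mean.
        theta = np.sum(w * y) / np.sum(w)
    return theta

estimate = mm_median([1.0, 2.0, 7.0, 9.0, 100.0])
```

With an odd number of distinct observations the minimizer is the unique sample median, so `estimate` should settle near the middle order statistic.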


LLA

Zou and Li (2008)

Fan, Xue and Zou (2014)


Nonconvex penalized regression

$$\min_\beta \ \ell_n(\beta) + \sum_j P_\lambda(|\beta_j|)$$

$\ell_n$ is convex and represents the statistical inference model
- least squares loss
- Huber's M loss or least absolute loss
- logistic regression: negative log-Bernoulli-likelihood
- quantile regression: check loss
- Ising model: composite conditional likelihood (Xue, Zou and Cai, 2012)

$P_\lambda(t)$ is a non-decreasing concave function for $t \in (0, \infty)$
- $L_q$ norm penalty ($0 < q < 1$)
- SCAD (Fan and Li, 2001)
- MCP (Zhang, 2010)


[Figure: penalty functions plotted over the range -10 to 10 (y-axis: penalty, 0 to 20). Legend: Lasso, lambda = 2; SCAD, lambda = 2, a = 3.7; MCP, lambda = 2, a = 2.]


LLA

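$$\arg\min_\beta \Big\{ \ell_n(\beta) + \sum_{j=1}^{p} P_\lambda(|\beta_j|) \Big\}$$

1 Start with some initial estimator $\beta^{(0)}$.

2 At step $k$, define

$$Q_\lambda(\beta_j) = P_\lambda(|\beta_j^{(k)}|) + P'_\lambda(|\beta_j^{(k)}|+)\,\big(|\beta_j| - |\beta_j^{(k)}|\big)$$

3 Solve $\beta^{(k+1)} = \arg\min_\beta \Big\{ \ell_n(\beta) + \sum_{j=1}^{p} Q_\lambda(\beta_j) \Big\}$.

Iterate between Steps 2 and 3.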
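The three steps can be sketched for one concrete case. The example below assumes a least squares loss $\ell_n(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2$ and the SCAD penalty with $a = 3.7$; the weighted lasso in Step 3 is solved by a plain coordinate descent loop. All function names, the simulated data and the tuning values are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty P_lambda(|t|) of Fan and Li (2001), elementwise."""
    t = np.abs(t)
    middle = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    flat = lam**2 * (a + 1) / 2
    return np.where(t <= lam, lam * t, np.where(t <= a * lam, middle, flat))

def scad_deriv(t, lam, a=3.7):
    """Right derivative P'_lambda(|t|+): lam on [0, lam], then linear decay to 0."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))

def weighted_lasso_cd(X, y, w, beta, n_sweeps=50):
    """Cyclic CD for (1/2n)||y - Xb||^2 + sum_j w_j |b_j|, warm-started at beta."""
    n, p = X.shape
    beta = beta.copy()
    r = y - X @ beta
    col_sq = (X**2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            z = X[:, j] @ r / n + col_sq[j] * beta[j]
            new = np.sign(z) * max(abs(z) - w[j], 0.0) / col_sq[j]
            r += X[:, j] * (beta[j] - new)  # keep the residual in sync
            beta[j] = new
    return beta

def objective(X, y, beta, lam):
    return 0.5 * np.mean((y - X @ beta) ** 2) + scad(beta, lam).sum()

def lla_scad(X, y, lam, n_steps=3):
    beta = np.zeros(X.shape[1])              # Step 1: zero initial estimator
    history = [objective(X, y, beta, lam)]
    for _ in range(n_steps):
        w = scad_deriv(beta, lam)            # Step 2: slopes of the linear majorizer
        beta = weighted_lasso_cd(X, y, w, beta)  # Step 3: weighted lasso
        history.append(objective(X, y, beta, lam))
    return beta, history

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=100)
beta_hat, history = lla_scad(X, y, lam=0.5)
```

Because each $Q_\lambda$ majorizes the concave penalty and the inner solver is warm-started, the SCAD objective recorded in `history` should not increase from one LLA step to the next. With a zero start, the first step is an ordinary lasso fit, since $P'_\lambda(0+) = \lambda$.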


LLA and EM

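Condition on $P_\lambda$: there is a positive function $H(t)$ such that

$$\exp(-nP_\lambda(|\beta|)) = \int_0^\infty H(t)\, e^{-t|\beta|}\, dt. \qquad (*)$$

Let $\pi(t) = \frac{2}{t} H\!\left(\frac{1}{t}\right)$ and $p(\beta_j \mid \tau_j) = \frac{1}{2\tau_j} e^{-|\beta_j|/\tau_j}$. Then $(*)$ yields

$$\exp(-nP_\lambda(|\beta_j|)) = \int_0^\infty p(\beta_j \mid \tau_j)\, \pi(\tau_j)\, d\tau_j. \qquad (**)$$

$(**)$ represents a hierarchical Bayesian model and suggests an EM algorithm for maximizing the penalized likelihood by treating the $\tau_j$'s as "missing values".

Under condition $(*)$, EM = LLA.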
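A short check of why the first identity gives the second, taking $\pi(t) = \frac{2}{t}H(\frac{1}{t})$ and the Laplace density $p(\beta_j \mid \tau_j) = \frac{1}{2\tau_j}e^{-|\beta_j|/\tau_j}$ as defined on this slide. The substitution $t = 1/\tau_j$ is the only step added here.

```latex
\begin{aligned}
\int_0^\infty p(\beta_j \mid \tau_j)\,\pi(\tau_j)\,d\tau_j
&= \int_0^\infty \frac{1}{2\tau_j}\,e^{-|\beta_j|/\tau_j}\cdot\frac{2}{\tau_j}\,H\!\left(\frac{1}{\tau_j}\right)d\tau_j \\
&= \int_0^\infty \frac{1}{\tau_j^{2}}\,H\!\left(\frac{1}{\tau_j}\right)e^{-|\beta_j|/\tau_j}\,d\tau_j \\
&= \int_0^\infty H(t)\,e^{-t|\beta_j|}\,dt
   \qquad \big(t = 1/\tau_j,\ dt = -\,d\tau_j/\tau_j^{2}\big) \\
&= \exp\{-nP_\lambda(|\beta_j|)\}.
\end{aligned}
```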


The issue of multiple local solutions

The folded concave penalization problem usually has multiple local solutions, but the theory (namely, the oracle property) is established only for one of the unknown local solutions (Fan and Li, 2001; Fan and Peng, 2004; Lv and Fan, 2008; Fan and Lv, 2011; ...).

For over a decade, a fundamental challenge has remained: it is not clear whether the local solution computed by a given optimization algorithm possesses those nice theoretical properties.


Numeric demonstration

Simulation model: $y \sim \mathrm{Bernoulli}\!\left(\frac{\exp(X\beta^\star)}{1+\exp(X\beta^\star)}\right)$, where $X \sim N_p(0, \Sigma)$ with $\Sigma_{ij} = 0.5^{|i-j|}$ and $\beta^\star = (3, 1.5, 0, 0, 2, 0_{p-5})$.

Sparse logistic regression, n = 200 & p = 1000

Method            l1 loss        l2 loss        # FP            # FN
Lasso             5.67 (0.05)    2.37 (0.02)    24.02 (0.44)    0.04 (0.01)
SCAD-CD           4.50 (0.06)    2.13 (0.02)    13.99 (0.31)    0.08 (0.01)
SCAD-LLA-zero     2.16 (0.11)    1.32 (0.06)    0.31 (0.05)     0.22 (0.02)
SCAD-LLA-Lasso    2.08 (0.10)    1.28 (0.06)    0.26 (0.04)     0.19 (0.02)


LLA closes the theoretical gap

In Fan, Xue and Zou (2014) it is shown that

Theorem

- If the initial estimator is Lasso, then the two-step LLA procedure finds the oracle solution with high probability.

- If the initial estimator is zero, then the three-step LLA procedure finds the oracle solution with high probability.

As illustration, the theory is verified for penalized least squares, penalized logistic regression, penalized quantile regression and penalized graphical model estimation.


The philosophical root of our theory

In the classical MLE theory, when the log-likelihood function is not concave, one of the local maximizers of the log-likelihood function is shown to be asymptotically efficient, but how to compute that estimator is very challenging and often unclear.

Le Cam (1956) (and later Bickel 1975) overcame this technical difficulty by focusing on a specially designed one-step Newton-Raphson estimator initialized by a root-n consistent estimator.

Le Cam did not try to get the global maximizer nor the theoretical local maximizer of the likelihood.
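To make the one-step idea concrete, here is a minimal sketch for a textbook case that is not discussed in the talk: the location parameter of a Cauchy sample, whose log-likelihood is not concave. The sample median serves as the root-n consistent starting point, and a single Newton-Raphson step on the log-likelihood is taken from there. The function name is hypothetical.

```python
import numpy as np

def one_step_cauchy_location(x):
    """One Newton-Raphson step on the Cauchy log-likelihood, started at the median."""
    x = np.asarray(x, dtype=float)
    theta0 = np.median(x)                     # root-n consistent initial estimator
    u = x - theta0
    score = np.sum(2 * u / (1 + u**2))        # first derivative of the log-likelihood
    hessian = np.sum(-2 * (1 - u**2) / (1 + u**2) ** 2)  # second derivative
    return theta0 - score / hessian           # single Newton-Raphson update

theta1 = one_step_cauchy_location([4.0, 5.0, 6.0])
```

For a sample that is symmetric about its median the score vanishes at the starting point, so the one-step estimate coincides with the median. In small samples the observed Hessian can be close to zero; replacing it by the expected information is a common safeguard, but that variant is not shown here.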


The search for the global minimizer

Mixed Integer Programming has been used to get the global minimizer of L0 penalized and SCAD penalized least squares.

Dimitris Bertsimas, Angela King and Rahul Mazumder (2016, AoS). Best Subset Selection via a Modern Optimization Lens.

Hongcheng Liu, Tao Yao, Runze Li (2016). Global solutions to folded concave penalized nonconvex learning. AoS, 44(2), 629-659.

Extension to more general models?


BMD

Yang and Zou (2013)


Coordinate descent for lasso

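$$\arg\min_{\beta_1, \ldots, \beta_p} \ f(\beta_1, \ldots, \beta_p) + \sum_{j=1}^{p} \lambda|\beta_j|$$

1 Initialization of $\beta$

2 Cyclic coordinate descent: for $j = 1, 2, \ldots, p, 1, 2, \ldots$, update $\beta_j$ by minimizing the objective function

$$\beta_j^{\text{update}} \leftarrow \arg\min_{\beta_j} \ f(\beta_1, \ldots, \beta_{j-1}, \beta_j, \beta_{j+1}, \ldots, \beta_p) + \lambda|\beta_j|$$

3 Repeat (2) till convergence.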


Lasso regression: $f(\beta_1, \ldots, \beta_p) = \|y - X\beta\|_2^2$

$$\beta_j^{\text{update}} \leftarrow \arg\min_{\beta_j} \ \Big\|y - \sum_{k \neq j} x_k\beta_k - x_j\beta_j\Big\|_2^2 + \lambda|\beta_j|$$

reduces to soft-thresholding.

Fu (1998) proposed the algorithm named "shooting".

Friedman, Hastie and Tibshirani (2008) glmnet: the same CD but with clever implementation tricks such as active set, warm start and, later, the strong rule.

For lasso logistic regression, Friedman, Hastie and Tibshirani (2008) did CD within a Newton-Raphson loop. Genkin, Lewis and Madigan (2007) did the standard CD by solving the one-dimensional optimization repeatedly.
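For the least squares case, the coordinate update has a closed form: minimizing $\|\rho - x_j b\|_2^2 + \lambda|b|$ over $b$, with $\rho$ the partial residual, gives $b = S(x_j^T\rho, \lambda/2)/\|x_j\|_2^2$. The sketch below is a bare cyclic loop under that objective; it has none of the active set, warm start or strong rule tricks mentioned above, and the function names are illustrative only.

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for ||y - X b||_2^2 + lam * sum_j |b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()          # residual y - X beta for beta = 0
    col_sq = (X**2).sum(axis=0)         # ||x_j||_2^2 for each column
    for _ in range(n_sweeps):
        for j in range(p):
            # x_j^T (partial residual), i.e. the residual with coordinate j added back
            z = X[:, j] @ r + col_sq[j] * beta[j]
            new = soft_threshold(z, lam / 2) / col_sq[j]
            r += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

beta_demo = lasso_cd(np.eye(3), np.array([3.0, -0.5, 1.0]), lam=2.0)
```

With an identity design each coordinate decouples, so the solution is simply the soft-thresholded response, which makes the update easy to check by hand.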


Group lasso regression

$$\min_{(\beta_0, \beta)} \ \frac{1}{2}\Big\|y - \beta_0 - \sum_{k} X^{(k)}\beta^{(k)}\Big\|_2^2 + \lambda\sum_{k=1}^{K} \sqrt{p_k}\,\|\beta^{(k)}\|_2.$$

Group Lasso penalty was introduced in Turlach, Venables and Wright (2004) and Yuan and Lin (2006).

A blockwise descent algorithm under a groupwise orthonormal condition: the columns of $X^{(k)}$ are orthonormal.

The orthonormal condition is incompatible with cross-validation, bootstrap, sub-sampling.


A general group lasso problem

$$\arg\min_\beta \ \frac{1}{n}\sum_{i=1}^{n} \tau_i\,\Phi(y_i, \beta^T x_i) + \lambda\sum_{k=1}^{K} w_k\|\beta^{(k)}\|_2$$

where $\tau_i \geq 0$ and $w_k \geq 0$ for all $i, k$.

The observation weights $\tau_i$ are introduced in order to cover methods such as weighted regression and weighted large margin classification (biased sampling, unequal cost classification).

The penalty weights $w_k$ make a more flexible model. The default choice for $w_k$ is $\sqrt{p_k}$. If we do not want to penalize a group of predictors, simply let the corresponding weight be zero.


Loss functions

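Least squares: $\Phi(y, f) = \frac{1}{2}(y - f)^2$

Logistic regression: $\Phi(y, f) = \log(1 + e^{-yf})$, $y = \pm 1$

Squared hinge loss: $\Phi(y, f) = [(1 - yf)_+]^2$, $y = \pm 1$

Huberized SVM loss: $\Phi(y, f) = \mathrm{hsvm}(yf)$, $y = \pm 1$, where

$$\mathrm{hsvm}(t) = \begin{cases} 0, & t > 1 \\ (1 - t)^2/(2\delta), & 1 - \delta < t \leq 1 \\ 1 - t - \delta/2, & t \leq 1 - \delta. \end{cases}$$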


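Let $D$ denote the data $\{y, X\}$ and define

$$L(\beta \mid D) = \frac{1}{n}\sum_{i=1}^{n} \tau_i\,\Phi(y_i, \beta^T x_i).$$

Definition

The loss function $\Phi$ is said to satisfy the QM condition, if

(i). $\nabla L(\beta \mid D)$ exists everywhere.

(ii). There exists a $p \times p$ matrix $H$, which may only depend on the data $D$, such that for all $\beta, \beta^*$,

$$L(\beta \mid D) \leq L(\beta^* \mid D) + (\beta - \beta^*)^T \nabla L(\beta^* \mid D) + \frac{1}{2}(\beta - \beta^*)^T H (\beta - \beta^*).$$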


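Loss | $-\nabla L(\beta \mid D)$ | $H$
Least squares | $\frac{1}{n}\sum_{i=1}^{n} \tau_i (y_i - x_i^T\beta)\, x_i$ | $X^T\Gamma X/n$
Logistic regression | $\frac{1}{n}\sum_{i=1}^{n} \tau_i y_i x_i \frac{1}{1 + \exp(y_i x_i^T\beta)}$ | $\frac{1}{4} X^T\Gamma X/n$
Squared hinge loss | $\frac{1}{n}\sum_{i=1}^{n} 2\tau_i y_i x_i (1 - y_i x_i^T\beta)_+$ | $4 X^T\Gamma X/n$
Huberized SVM loss | $\frac{1}{n}\sum_{i=1}^{n} \tau_i y_i x_i\, \mathrm{hsvm}'(y_i x_i^T\beta)$ | $\frac{2}{\delta} X^T\Gamma X/n$

$\Gamma = \mathrm{diag}(\tau_1, \ldots, \tau_n)$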
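The QM inequality can be checked numerically for one row of the table. The sketch below uses the logistic loss with $H = \frac{1}{4}X^T\Gamma X/n$ and evaluates the gap between the quadratic upper bound and the loss at random pairs $(\beta, \beta^*)$. The data are simulated and every name is hypothetical; this is a sanity check, not part of the talk.

```python
import numpy as np

def logistic_loss(beta, X, y, tau):
    """L(beta | D) = (1/n) sum_i tau_i log(1 + exp(-y_i x_i^T beta))."""
    return np.mean(tau * np.log1p(np.exp(-y * (X @ beta))))

def logistic_grad(beta, X, y, tau):
    """Gradient of L; its negative matches the logistic row of the table."""
    m = y * (X @ beta)
    return -(X.T @ (tau * y / (1 + np.exp(m)))) / len(y)

def qm_gap(beta, beta_star, X, y, tau):
    """Quadratic upper bound at beta_star minus the loss at beta (should be >= 0)."""
    H = 0.25 * X.T @ (tau[:, None] * X) / len(y)
    d = beta - beta_star
    upper = (logistic_loss(beta_star, X, y, tau)
             + d @ logistic_grad(beta_star, X, y, tau)
             + 0.5 * d @ H @ d)
    return upper - logistic_loss(beta, X, y, tau)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.choice([-1.0, 1.0], size=50)
tau = rng.uniform(0.5, 1.5, size=50)
gaps = [qm_gap(rng.normal(size=4), rng.normal(size=4), X, y, tau) for _ in range(20)]
```

The bound holds because the second derivative of $\log(1 + e^{-t})$ never exceeds $1/4$, so every entry of `gaps` should be nonnegative up to rounding.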


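Let $\tilde\beta$ denote the current solution. Write $\beta$ such that $\beta^{(k')} = \tilde\beta^{(k')}$ for $k' \neq k$.

Given $\beta^{(k')} = \tilde\beta^{(k')}$ for $k' \neq k$, the optimal $\beta^{(k)}$ is defined as

$$\arg\min_{\beta^{(k)}} \ L(\beta \mid D) + \lambda w_k\|\beta^{(k)}\|_2.$$

By QM condition,

$$L(\beta \mid D) \leq L(\tilde\beta \mid D) + (\beta - \tilde\beta)^T \nabla L(\tilde\beta \mid D) + \frac{1}{2}(\beta - \tilde\beta)^T H (\beta - \tilde\beta).$$

Write $U(\tilde\beta) = -\nabla L(\tilde\beta \mid D)$.

$$L(\beta \mid D) \leq L(\tilde\beta \mid D) - (\beta^{(k)} - \tilde\beta^{(k)})^T U^{(k)} + \frac{1}{2}(\beta^{(k)} - \tilde\beta^{(k)})^T H^{(k)} (\beta^{(k)} - \tilde\beta^{(k)}).$$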


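Let $\eta_k$ be the largest eigenvalue of $H^{(k)}$. We set $\gamma_k = (1 + 10^{-4})\eta_k$.

$$L(\beta \mid D) \leq L(\tilde\beta \mid D) - (\beta^{(k)} - \tilde\beta^{(k)})^T U^{(k)} + \frac{1}{2}\gamma_k\|\beta^{(k)} - \tilde\beta^{(k)}\|_2^2 \qquad (*)$$

"=" holds if and only if $\beta^{(k)} = \tilde\beta^{(k)}$.

The minimizer of the right hand side of $(*)$ plus the penalty term $\lambda w_k\|\beta^{(k)}\|_2$ is

$$\tilde\beta^{(k)}(\text{new}) = \frac{1}{\gamma_k}\big(U^{(k)} + \gamma_k\tilde\beta^{(k)}\big)\left(1 - \frac{\lambda w_k}{\|U^{(k)} + \gamma_k\tilde\beta^{(k)}\|_2}\right)_+.$$

The whole process drives the objective strictly downhill unless the optimal solution is reached (i.e., KKT conditions are satisfied).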


BMD for group lasso

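For $k = 1, \ldots, K$, compute $\gamma_k$, the largest eigenvalue of $H^{(k)}$.

Set $\gamma_k = (1 + 10^{-4})\gamma_k$ (for nontrivial groups with size $\geq 2$).

Initialize $\tilde\beta$.

Repeat the following cyclic blockwise updates until convergence:

- for $k = 1, \ldots, K$, do (1) to (3)
- (1) Compute $U(\tilde\beta) = -\nabla L(\tilde\beta \mid D)$.
- (2) Compute $\tilde\beta^{(k)}(\text{new}) = \frac{1}{\gamma_k}\big(U^{(k)} + \gamma_k\tilde\beta^{(k)}\big)\left(1 - \frac{\lambda w_k}{\|U^{(k)} + \gamma_k\tilde\beta^{(k)}\|_2}\right)_+$.
- (3) Set $\tilde\beta^{(k)} = \tilde\beta^{(k)}(\text{new})$.

gglasso package: also uses active set, strong rule and warm start.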
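The loop above can be sketched for the least squares loss with unit observation weights, where $H = X^TX/n$ and $U = X^T(y - X\tilde\beta)/n$, and with the default penalty weights $w_k = \sqrt{p_k}$. This is a bare illustration that assumes no intercept and no zero columns; it is not the gglasso code and has no active set, strong rule or warm start. All names are hypothetical.

```python
import numpy as np

def bmd_group_lasso(X, y, groups, lam, n_sweeps=100):
    """Blockwise majorization descent for
    (1/2n)||y - X b||^2 + lam * sum_k sqrt(p_k) ||b^(k)||_2."""
    n, p = X.shape
    H = X.T @ X / n
    gammas = []
    for idx in groups:
        eta = np.linalg.eigvalsh(H[np.ix_(idx, idx)])[-1]  # largest eigenvalue of H^(k)
        gammas.append((1 + 1e-4) * eta if len(idx) >= 2 else eta)
    beta = np.zeros(p)
    history = []
    for _ in range(n_sweeps):
        for idx, gamma in zip(groups, gammas):
            U = X[:, idx].T @ (y - X @ beta) / n           # (1) block of -grad L
            z = U + gamma * beta[idx]
            norm_z = np.linalg.norm(z)
            w_k = np.sqrt(len(idx))
            shrink = max(1 - lam * w_k / norm_z, 0.0) if norm_z > 0 else 0.0
            beta[idx] = shrink * z / gamma                  # (2) and (3)
        penalty = sum(np.sqrt(len(idx)) * np.linalg.norm(beta[idx]) for idx in groups)
        history.append(0.5 * np.mean((y - X @ beta) ** 2) + lam * penalty)
    return beta, history

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=40)
groups = [[0, 1], [2, 3], [4, 5]]
beta_hat, history = bmd_group_lasso(X, y, groups, lam=0.1)
```

Since each block update minimizes a majorizer of the objective, the values in `history` should be non-increasing. Recomputing the full residual for every block keeps the sketch short; a practical implementation would update it incrementally.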


Competitors

block coordinate gradient descent grplasso: Meier et al. (2008) for group-lasso logistic regression.

ISTA-BC algorithm: Qin et al. (2010), an extension of the ISTA/FISTA (Beck & Teboulle 2009) based on variable step-lengths.

SLEP implemented Nesterov's method: Liu et al. (2009)


Dataset         Type   n     q       p        Data Source
Autompg         R      392   7       31       (Quinlan 1993)
Bardet          R      120   200     1000     (Scheetz et al. 2006)
Cardiomypathy   R      30    6319    31595    (Segal et al. 2003)
Spectroscopy    R      103   100     500      (Sabo et al. 2008)
Breast          C      42    22283   111415   (Graham et al. 2010)
Colon           C      62    2000    10000    (Alon et al. 1999)
Prostate        C      102   6033    30165    (Singh et al. 2002)
Sonar           C      208   60      300      (Gorman et al. 1988)

Some real datasets. n is the number of instances. q is the number of original variables. p is the number of predictors after expansion. "R" means regression and "C" means classification.


Group-lasso GAM regression, timing performance

Dataset    Autompg   Bardet   Cardiomypathy   Spectroscopy
SLEP       3.14      9.96     78.23           9.37
ISTA-BC    5.66      1.55     2.43            1.31
gglasso    2.51      0.77     2.48            0.76

All experiments were carried out on an Intel Xeon X5560 (Quad-core 2.8 GHz) processor.


Group-lasso GAM classification, timing performance

Dataset            Colon    Prostate   Sonar    Breast
grplasso (Logit)   60.42    111.75     24.55    439.76
SLEP (Logit)       75.31    166.91     5.49     358.75
gglasso (Logit)    1.13     3.877      1.54     9.62
gglasso (HSVM)     1.15     3.53       0.66     9.15

All experiments were carried out on an Intel Xeon X5560 (Quad-core 2.8 GHz) processor.


A Small Trick

Yang and Zou (2012)


A counterintuitive phenomenon

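Consider the glmnet for fitting elastic net penalized regression. W.L.O.G. assume $\sum_{i=1}^{N} x_{ij} = 0$, $\frac{1}{N}\sum_{i=1}^{N} x_{ij}^2 = 1$, for $j = 1, \ldots, p$.

$$R(\beta_0, \beta) = \frac{1}{2N}\sum_{i=1}^{N}(y_i - \beta_0 - x_i^\top\beta)^2 + P_{\lambda,\alpha}(\beta),$$

where $P_{\lambda,\alpha}(\beta)$ is the elastic net penalty

$$P_{\lambda,\alpha}(\beta) = \lambda\sum_{j=1}^{p} p_\alpha(\beta_j) = \lambda\sum_{j=1}^{p}\Big[\frac{1}{2}(1 - \alpha)\beta_j^2 + \alpha|\beta_j|\Big].$$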


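glmnet implements the standard CD algorithm in which we iteratively solve a univariate elastic net problem

$$\hat\beta_j = \arg\min_{\beta_j} R(\beta_j \mid \tilde\beta_0, \tilde\beta),$$

where

$$R(\beta_j \mid \tilde\beta_0, \tilde\beta) = \frac{1}{2}(\beta_j - \tilde\beta_j)^2 - \frac{1}{N}\sum_{i=1}^{N} r_i x_{ij}(\beta_j - \tilde\beta_j) + \lambda p_\alpha(\beta_j),$$

with $r_i = y_i - \tilde\beta_0 - x_i^\top\tilde\beta$ the current residual. The solution is

$$\hat\beta_j = \frac{S\big(\frac{1}{N}\sum_{i=1}^{N} x_{ij} r_i + \tilde\beta_j,\ \lambda\alpha\big)}{1 + \lambda(1 - \alpha)},$$

where $S(z, t) = (|z| - t)_+\,\mathrm{sgn}(z)$.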


A tiny change to glmnet

We change the univariate update formula to

$$\hat\beta_j^{B} = \frac{S\big(\frac{1}{N}\sum_{i=1}^{N} x_{ij} r_i + f \cdot \tilde\beta_j,\ \lambda\alpha\big)}{f \cdot 1 + \lambda(1 - \alpha)} \qquad (f \geq 1)$$

Yang made a code error by using f = 2 in glmnet, but still got good/even better results.

As long as f ≥ 1 the iterative process converges to the desired solution.

A bigger f means a smaller step size along each coordinate direction. For an orthogonal design, f = 1 is the best choice.
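A minimal sketch of the modified update, assuming standardized columns, a centered response (so the intercept is dropped) and a fixed number of sweeps. Setting `f=1.0` gives the standard update and `f=2.0` the variant described above; the function names and simulated data are hypothetical and this is not the glmnet code.

```python
import numpy as np

def standardize(X):
    """Center columns and scale them so that (1/N) sum_i x_ij^2 = 1."""
    Xc = X - X.mean(axis=0)
    return Xc / np.sqrt((Xc**2).mean(axis=0))

def enet_cd(X, y, lam, alpha, f=1.0, n_sweeps=500):
    """Cyclic CD for the elastic net with the factor-f coordinate update."""
    N, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()               # residual for beta = 0
    for _ in range(n_sweeps):
        for j in range(p):
            z = X[:, j] @ r / N + f * beta[j]
            new = np.sign(z) * max(abs(z) - lam * alpha, 0.0) / (f + lam * (1 - alpha))
            r -= X[:, j] * (new - beta[j])   # keep the residual in sync
            beta[j] = new
    return beta

rng = np.random.default_rng(0)
X = standardize(rng.normal(size=(50, 5)))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(size=50)
y = y - y.mean()
beta_f1 = enet_cd(X, y, lam=0.1, alpha=0.5, f=1.0)
beta_f2 = enet_cd(X, y, lam=0.1, alpha=0.5, f=2.0)
```

For $\alpha < 1$ the objective is strictly convex, so both choices of f should arrive at the same coefficients once the sweeps have converged. They differ only in how quickly they get there.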


Simulation

FHT model:

We simulated data with N observations and p predictors, where each pair of predictors $X_j$ and $X_{j'}$ has the same population correlation ρ, with ρ ranging from zero to 0.95. The response variable was generated by

$$Y = \sum_{j=1}^{p} X_j\beta_j + k \cdot N(0, 1),$$

where $\beta_j = (-1)^j \exp(-(2j - 1)/20)$ and k is set to make the signal-to-noise ratio equal 3.

We compared f = 1 (glmnet) and f = 2 (glmnet2).


Correlation   0        0.1      0.2      0.5      0.8      0.95

α = 1, N = 100, p = 5000
glmnet        0.2222   0.2339   0.2979   0.4606   0.7919   1.9016
glmnet2       0.2533   0.2519   0.2886   0.3758   0.5450   1.0735

α = 0.5, N = 100, p = 5000
glmnet        0.2107   0.2189   0.2356   0.3669   0.7765   2.1528
glmnet2       0.2225   0.2285   0.2414   0.2861   0.4876   1.3335


A simple explanation

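$$\upsilon_j = \Big(0, \cdots, \frac{x_j^\top y}{fN}, \cdots, 0\Big), \qquad u_j = (u_{kj})_{p \times 1}, \quad u_{kj} = \begin{cases} -\frac{1}{f}, & k = j \\ -\frac{\rho}{f}, & k \neq j \end{cases}$$

$$W_j = I_{p \times p} + \big[\, 0_{p \times (j-1)} \ \ u_j \ \ 0_{p \times (p-j)} \,\big]$$

$$A = \prod_{j=1}^{p} W_j, \qquad \mu = \sum_{s=1}^{p-1}\Big(\upsilon_s \prod_{j=s+1}^{p} W_j\Big) + \upsilon_p$$

If we apply CD and CMD to the LS problem, after a complete cycle from j = 1 to j = p, we get

$$\beta^{(k)} = \beta^{(k-1)} A + \mu$$

The convergence rate is basically governed by the maximum eigenvalue of $(A^k)^\top A^k$, which is affected by both f and ρ.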
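The cycle matrix can be built directly from these definitions. The sketch below treats $\beta$ as a row vector (so a full cycle multiplies by $W_1 W_2 \cdots W_p$ on the right), places $u_j$ in column j of a $p \times p$ matrix, and returns the largest eigenvalue of $(A^k)^\top A^k$. Function names are illustrative assumptions.

```python
import numpy as np

def cycle_matrix(p, rho, f):
    """A = W_1 W_2 ... W_p, where W_j = I + [0 ... u_j ... 0] with u_j in column j."""
    A = np.eye(p)
    for j in range(p):
        u = np.full(p, -rho / f)   # entries -rho/f for k != j
        u[j] = -1.0 / f            # entry -1/f for k = j
        W = np.eye(p)
        W[:, j] += u
        A = A @ W
    return A

def eta_max(p, rho, f, k):
    """Largest eigenvalue of (A^k)^T A^k after k complete cycles."""
    Ak = np.linalg.matrix_power(cycle_matrix(p, rho, f), k)
    return np.linalg.eigvalsh(Ak.T @ Ak)[-1]

# Compare several values of f at one correlation level.
curve = {f: eta_max(10, 0.8, f, 5) for f in (1, 2, 3, 4, 5)}
```

In the uncorrelated case ρ = 0, choosing f = 1 makes every $W_j$ zero out its own column, so A vanishes after one cycle; with f = 2 the matrix A equals half the identity. This agrees with the earlier remark that f = 1 is best for an orthogonal design.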


[Figure: four panels, (a) ρ = 0.1, (b) ρ = 0.5, (c) ρ = 0.8, (d) ρ = 0.95. x-axis: log(Iteration k); y-axis: η_max((A^k)^T A^k); one curve for each of f = 1, 2, 3, 4, 5.]


             Colon    Prostate   WBCD       Ionosphere   Sonar
N            62       102        569        351          208
p            2000     6033       495 (30)   560 (32)     1890 (60)
α_CV         0.6      0.5        0.6        0.4          0.4
Test Error   8.3%     5%         1.77%      2.86%        24.39%
glmnet       0.1166   0.3283     9.4039     0.5158       2.0828
glmnet2      0.0910   0.2938     4.9593     0.3667       1.0945
Improv. %    +28%     +11.7%     +89.6%     +40.6%       +90.3%


ADMM

Douglas & Rachford (1956); Lions & Mercier (1979); Eckstein & Bertsekas (1992)

Goldstein & Osher (2009); Yin, Osher, Goldfarb, and Darbon (2008); Goldfarb & Ma (2012)

many applications in signal processing, statistics, machine learning


Improving MPT

Xue, Ma and Zou (2012)


An investor has p assets. Asset j makes up $\omega_j$ proportion of the investor's portfolio.

$$\omega_j \geq 0, \qquad \sum_{j=1}^{p} \omega_j = 1.$$

Asset j delivers return $R_j$ which has mean $\mu_j$ and variance $\sigma_j^2$.

The mean of the return of the entire portfolio is $\sum_{j=1}^{p} \omega_j\mu_j$ and the variance of the portfolio's return is

$$\sum_{i=1}^{p}\sum_{j=1}^{p} \omega_i\omega_j\sigma_i\sigma_j\rho_{ij}$$

where $\rho_{ij}$ is the correlation between $R_i$ and $R_j$.
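The two formulas translate into a few lines of code. In this sketch the covariance matrix is assembled as $\sigma_i\sigma_j\rho_{ij}$; the numbers in the example are made up for illustration and the function name is hypothetical.

```python
import numpy as np

def portfolio_mean_variance(w, mu, sigma, rho):
    """Mean sum_j w_j mu_j and variance sum_ij w_i w_j sigma_i sigma_j rho_ij."""
    w, mu, sigma = (np.asarray(v, dtype=float) for v in (w, mu, sigma))
    cov = np.outer(sigma, sigma) * np.asarray(rho, dtype=float)
    return w @ mu, w @ cov @ w

mean, var = portfolio_mean_variance(
    w=[0.5, 0.5],
    mu=[0.10, 0.20],
    sigma=[0.2, 0.3],
    rho=[[1.0, 0.5], [0.5, 1.0]],
)
# mean = 0.15 and var = 0.0475 for these inputs
```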


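$w = (\omega_1, \ldots, \omega_p)^T$, $\mu = (\mu_1, \ldots, \mu_p)^T$, $\Sigma$ is the covariance matrix of return vector $(R_1, \ldots, R_p)^T$.

MPT

$$\arg\min_w \ w^T\Sigma w \quad \text{s.t.} \quad w^T\mu = \mu_P, \quad w^T\vec{1} = 1, \quad \omega_j \geq 0, \ j = 1, \ldots, p.$$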

MPT (1952, J. of Finance) won 1990 Nobel Prize in Economics.


The usual implementation of MPT

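$\hat\mu = (\hat\mu_1, \ldots, \hat\mu_p)^T$ is the sample mean vector

$\Sigma_n$ is the sample covariance matrix

Empirical MPT

$$\hat{w} = \arg\min_w \ w^T\Sigma_n w \quad \text{s.t.} \quad w^T\hat\mu = \mu_P, \quad w^T\vec{1} = 1, \quad \omega_j \geq 0, \ j = 1, \ldots, p.$$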


When p is relatively large

The sample cov. matrix performs poorly (Johnstone, 2001). It leads to bias and undesirable risk issues in the empirical MPT (El Karoui, 2010; Brodie et al., 2009; DeMiguel et al., 2009; Fan et al., 2012).

Under some suitable "sparsity" assumption on Σ, an optimal estimator can be obtained by thresholding (Bickel and Levina, 2008a; El Karoui, 2008; Cai and Zhou, 2011)

Let $\hat\sigma_{ij}$ be the ij entry of the sample covariance matrix.

$$\hat\Sigma_{\text{thresholding}} = \{s_\lambda(\hat\sigma_{ij})\}_{1 \leq i,j \leq p}$$

The difficulty is how to preserve both positive definiteness (P.D.) and sparsity simultaneously.
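A small constructed example shows how thresholding can destroy positive definiteness. The matrix below is hypothetical (it is not a sample covariance from any data set): one variable is correlated 0.5 with 25 others, which are correlated 0.25 among themselves. Soft-thresholding the off-diagonal entries at 0.25 is used here as one possible choice of $s_\lambda$.

```python
import numpy as np

def soft_threshold_offdiag(S, lam):
    """Soft-threshold the off-diagonal entries of S and leave the diagonal untouched."""
    T = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
    np.fill_diagonal(T, np.diag(S))
    return T

p, b = 26, 0.5
S = np.full((p, p), b**2)   # 0.25 between the 25 ordinary variables
S[0, :] = b                 # 0.5 between variable 0 and everything else
S[:, 0] = b
np.fill_diagonal(S, 1.0)

T = soft_threshold_offdiag(S, lam=0.25)
min_eig_before = np.linalg.eigvalsh(S)[0]   # positive: S is positive definite
min_eig_after = np.linalg.eigvalsh(T)[0]    # negative: T is indefinite
```

After thresholding only the first row and column keep nonzero off-diagonal entries (0.25 each), and such an "arrow" matrix has smallest eigenvalue $1 - 0.25\sqrt{25} = -0.25$. The result is sparse but no longer a valid covariance matrix.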


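Notation: $|\Sigma|_1 = \sum_{i \neq j} |\sigma_{ij}|$, $\|\Sigma\|_F^2 = \sum_{i,j} \sigma_{ij}^2$.

The soft-thresholding estimator is the global solution of

$$\arg\min_\Sigma \ \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1.$$

PSD sparse covariance estimator

$$\hat\Sigma^{+} = \arg\min_{\Sigma \succeq \epsilon I} \ \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1.$$

$\epsilon = 10^{-6}$. $\epsilon$ can be another positive constant depending on the application.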


Algorithm

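The augmented Lagrangian function for some given parameter $\mu$,

$$L(\Theta, \Sigma; \Lambda) = \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1 - \langle\Lambda, \Theta - \Sigma\rangle + \frac{1}{2\mu}\|\Theta - \Sigma\|_F^2,$$

where $\Lambda$ is the Lagrange multiplier.

For $i = 0, 1, 2, \ldots$,

Θ step: $\Theta^{i+1} = \arg\min_{\Theta \succeq \epsilon I} L(\Theta, \Sigma^i; \Lambda^i)$

Σ step: $\Sigma^{i+1} = \arg\min_{\Sigma} L(\Theta^{i+1}, \Sigma; \Lambda^i)$

Λ step: $\Lambda^{i+1} = \Lambda^i - \frac{1}{\mu}(\Theta^{i+1} - \Sigma^{i+1})$.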


Θ step

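$$L(\Theta, \Sigma; \Lambda) = \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1 - \langle\Lambda, \Theta - \Sigma\rangle + \frac{1}{2\mu}\|\Theta - \Sigma\|_F^2$$

$$\begin{aligned}
\Theta^{i+1} &= \arg\min_{\Theta \succeq \epsilon I} L(\Theta, \Sigma^i; \Lambda^i) \\
&= \arg\min_{\Theta \succeq \epsilon I} \ -\langle\Lambda^i, \Theta\rangle + \frac{1}{2\mu}\|\Theta - \Sigma^i\|_F^2 \\
&= \arg\min_{\Theta \succeq \epsilon I} \ \|\Theta - (\Sigma^i + \mu\Lambda^i)\|_F^2 \\
&= (\Sigma^i + \mu\Lambda^i)_+.
\end{aligned}$$

Let Z's eigen-decomposition be $\sum_{j=1}^{p} \lambda_j v_j^T v_j$, then define

$$(Z)_+ = \sum_{j=1}^{p} \max(\lambda_j, \epsilon)\, v_j^T v_j.$$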


Σ step

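$$L(\Theta, \Sigma; \Lambda) = \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1 - \langle\Lambda, \Theta - \Sigma\rangle + \frac{1}{2\mu}\|\Theta - \Sigma\|_F^2$$

$$\begin{aligned}
\Sigma^{i+1} &= \arg\min_{\Sigma} L(\Theta^{i+1}, \Sigma; \Lambda^i) \\
&= \arg\min_{\Sigma} \ \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1 + \langle\Lambda^i, \Sigma\rangle + \frac{1}{2\mu}\|\Sigma - \Theta^{i+1}\|_F^2 \\
&= \arg\min_{\Sigma} \ \frac{1}{2}\Big\|\Sigma - \frac{\mu(\Sigma_n - \Lambda^i) + \Theta^{i+1}}{1 + \mu}\Big\|_F^2 + \frac{\lambda\mu}{1 + \mu}|\Sigma|_1 \\
&= \frac{1}{1 + \mu}\, S\big(\mu(\Sigma_n - \Lambda^i) + \Theta^{i+1},\ \lambda\mu\big).
\end{aligned}$$

Define $S(Z, \tau) = \{s(z_{j\ell}, \tau)\}_{1 \leq j,\ell \leq p}$ with

$$s(z_{j\ell}, \tau) = \mathrm{sign}(z_{j\ell})\max(|z_{j\ell}| - \tau, 0)\, I_{\{j \neq \ell\}} + z_{j\ell}\, I_{\{j = \ell\}}.$$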
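Putting the Θ, Σ and Λ steps together gives a short loop. The sketch below follows the closed forms on these slides; the starting values, the default `mu=2.0` and the fixed iteration count are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def project_eig_floor(Z, eps):
    """(Z)_+ : replace every eigenvalue of the symmetric matrix Z by max(eigenvalue, eps)."""
    Z = (Z + Z.T) / 2
    vals, vecs = np.linalg.eigh(Z)
    return (vecs * np.maximum(vals, eps)) @ vecs.T

def soft_offdiag(Z, tau):
    """S(Z, tau): soft-threshold off-diagonal entries, keep the diagonal."""
    S = np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)
    np.fill_diagonal(S, np.diag(Z))
    return S

def admm_pd_sparse_cov(Sn, lam, mu=2.0, eps=1e-6, n_iter=200):
    """Alternate the Theta, Sigma and Lambda steps for a fixed number of iterations."""
    Sigma = Sn.copy()
    Lam = np.zeros_like(Sn)
    for _ in range(n_iter):
        Theta = project_eig_floor(Sigma + mu * Lam, eps)                      # Theta step
        Sigma = soft_offdiag(mu * (Sn - Lam) + Theta, lam * mu) / (1 + mu)    # Sigma step
        Lam = Lam - (Theta - Sigma) / mu                                      # Lambda step
    return Theta, Sigma

Sn = np.array([[2.0, 0.5], [0.5, 1.0]])
Theta_hat, Sigma_hat = admm_pd_sparse_cov(Sn, lam=0.1)
```

The returned `Theta` always has eigenvalues of at least `eps` by construction, while `Sigma` carries the sparsity pattern; the two agree in the limit. A practical implementation would stop on a residual tolerance instead of a fixed count. With `lam=0` and a positive definite input, the iteration stays at `Sn`.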


Improved Empirical MPT

$$\hat{w} = \arg\min_w \ w^T\hat\Sigma^{+} w \quad \text{s.t.} \quad w^T\hat\mu = \mu_P, \quad w^T\vec{1} = 1, \quad \omega_j \geq 0, \ j = 1, \ldots, p.$$


[Figure: risk (x-axis, 0.000 to 0.010) versus return (y-axis, 0.00 to 0.08) for Portfolio (P1) and Portfolio (P2).]

Two MPT frontiers based on S&P 100 from Jan. 1990 to Jan. 1993. Red: new; blue: traditional.


Latent Variable glasso

Ma, Xue and Zou (2013)


Latent variable Gaussian graphical model

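observed X (p-dim.) and unobserved Y (q-dim.) are jointly Gaussian

$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim N_{p+q}\left(\begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_Y \end{bmatrix}\right)$$

sparsity: X|Y has a sparse Gaussian graphical model representation.

$$(\Sigma)^{-1} = \Theta = \begin{bmatrix} \Theta_X & \Theta_{XY} \\ \Theta_{YX} & \Theta_Y \end{bmatrix}$$

X|Y is normal with precision matrix $\Theta_X$.

How to estimate $\Theta_X$ just based on X?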


A convex formulation

Chandrasekaran, Parrilo & Willsky (2012)

A key observation: $\Sigma_X^{-1} = \Theta_X - \Theta_{XY}\Theta_Y^{-1}\Theta_{YX}$

$\Theta_X$ is sparse (assumption),

$\Theta_{XY}$ has rank at most q, so the rank of $\Theta_{XY}\Theta_Y^{-1}\Theta_{YX}$ is at most q.

If we assume q is small (very reasonable in applications), we have a "sparse" - "low-rank" decomposition of $\Sigma_X^{-1}$, the marginal precision matrix of X.
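The key observation is the Schur complement identity for a partitioned inverse, and it is easy to confirm numerically. The joint precision matrix below is randomly generated purely for illustration; the sizes p = 6 and q = 2 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 6, 2

# A random positive definite joint precision matrix Theta of size (p + q).
A = rng.normal(size=(p + q, p + q))
Theta = A @ A.T + (p + q) * np.eye(p + q)
Sigma = np.linalg.inv(Theta)

Theta_X = Theta[:p, :p]
Theta_XY = Theta[:p, p:]
Theta_Y = Theta[p:, p:]

# Low-rank part Theta_XY Theta_Y^{-1} Theta_YX (its rank is at most q).
low_rank = Theta_XY @ np.linalg.solve(Theta_Y, Theta_XY.T)

# Marginal precision of X: inverse of the X block of the joint covariance.
marginal_precision = np.linalg.inv(Sigma[:p, :p])

identity_holds = np.allclose(marginal_precision, Theta_X - low_rank)
```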


LVGM estimator

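Write $\Sigma_X^{-1} = S - L$,

where S is a sparse PD matrix and L is a low-rank positive semidefinite matrix.

$$\min_{(S, L)} \ \langle\hat\Sigma_X, S - L\rangle - \log\det(S - L) + \alpha\|S\|_1 + \beta\,\mathrm{Tr}(L) \quad \text{subject to } S - L \succ 0, \ L \succeq 0$$

Here $\hat\Sigma_X$ is the sample covariance matrix of X. $\|S\|_1$ is a convex relaxation of the sparsity of S. $\mathrm{Tr}(L)$ is a convex relaxation of the rank of L.

Chandrasekaran, Parrilo & Willsky (2012) viewed the above as a log-determinant semidefinite programming problem.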


Algorithm

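$R = S - L$

$$\min_{(R, S, L)} \ \langle\hat\Sigma_X, R\rangle - \log\det(R) + \alpha\|S\|_1 + \beta\,\mathrm{Tr}(L) \quad \text{subject to } R \succ 0, \ L \succeq 0$$

augmented Lagrangian

$$L(R, S, L; \Lambda) = \langle\hat\Sigma_X, R\rangle - \log\det(R) + \alpha\|S\|_1 + \beta\,\mathrm{Tr}(L) - \langle\Lambda, R - S + L\rangle + \frac{1}{2\mu}\|R - S + L\|_F^2.$$

alternating minimization

$$\begin{aligned}
R^{k+1} &= \arg\min_{R} L(R, S^k, L^k; \Lambda^k) \\
S^{k+1} &= \arg\min_{S} L(R^{k+1}, S, L^k; \Lambda^k) \\
L^{k+1} &= \arg\min_{L \succeq 0} L(R^{k+1}, S^{k+1}, L; \Lambda^k) \\
\Lambda^{k+1} &= \Lambda^k - \frac{1}{\mu}(R^{k+1} - S^{k+1} + L^{k+1})
\end{aligned}$$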


R step

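$$\arg\min_{R} \ \langle\hat\Sigma_X, R\rangle - \log\det(R) - \langle\Lambda^k, R - S^k + L^k\rangle + \frac{1}{2\mu}\|R - S^k + L^k\|_F^2$$

$$\arg\min_{R} \ -\log\det(R) + \frac{1}{2\mu}\|R - G\|_F^2$$

$$G = S^k - L^k - \mu(\hat\Sigma_X - \Lambda^k)$$

The optimality condition is

$$R - G - \mu R^{-1} = 0$$

Let $G = U^T\sigma U$ (eigen-decomposition of G), then

$$R = U^T\gamma U$$

with

$$\gamma_i = \frac{\sigma_i + \sqrt{\sigma_i^2 + 4\mu}}{2}$$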
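The closed form can be verified against the optimality condition $R - G - \mu R^{-1} = 0$. In this sketch NumPy's `eigh` returns eigenvectors as columns of `V`, so $G = V\,\mathrm{diag}(\sigma)\,V^T$; this is the transpose of the $U^T\sigma U$ convention used above. The matrix `G` is random and serves only as a test input.

```python
import numpy as np

def r_step(G, mu):
    """Solve R - G - mu * R^{-1} = 0 for a symmetric G via its eigen-decomposition."""
    G = (G + G.T) / 2
    sigma, V = np.linalg.eigh(G)
    gamma = (sigma + np.sqrt(sigma**2 + 4 * mu)) / 2   # positive root of g^2 - sigma*g - mu = 0
    return (V * gamma) @ V.T

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
G = (B + B.T) / 2            # symmetric, possibly indefinite
mu = 0.5
R = r_step(G, mu)
residual = R - G - mu * np.linalg.inv(R)
```

Every $\gamma_i$ is strictly positive even when $\sigma_i$ is negative, so the update keeps R positive definite without an explicit constraint.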


S step

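$$\arg\min_{S} \ \alpha\|S\|_1 - \langle\Lambda^k, R^{k+1} - S + L^k\rangle + \frac{1}{2\mu}\|R^{k+1} - S + L^k\|_F^2$$

$$S^{k+1} = \arg\min_{S} \ \mu\alpha\|S\|_1 + \frac{1}{2}\|Z - S\|_F^2$$

$$Z = R^{k+1} + L^k - \mu\Lambda^k, \qquad \tau = \mu\alpha$$

$$S^{k+1}_{ij} = [\mathrm{Shrink}(Z, \tau)]_{ij} := \begin{cases} Z_{ii}, & \text{if } i = j \\ Z_{ij} - \tau, & \text{if } i \neq j \text{ and } Z_{ij} > \tau \\ Z_{ij} + \tau, & \text{if } i \neq j \text{ and } Z_{ij} < -\tau \\ 0, & \text{if } i \neq j \text{ and } -\tau \leq Z_{ij} \leq \tau. \end{cases}$$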


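L step

$$\arg\min_{L \succeq 0} \ \beta\,\mathrm{Tr}(L) - \langle\Lambda^k, R^{k+1} - S^{k+1} + L\rangle + \frac{1}{2\mu}\|R^{k+1} - S^{k+1} + L\|_F^2$$

The above is equivalent to

$$L^{k+1} = \arg\min_{L \succeq 0} \ (\mu\beta)\,\mathrm{Tr}(L) + \frac{1}{2}\|M - L\|_F^2$$

where $M = S^{k+1} - R^{k+1} + \mu\Lambda^k$.

$M = U^T\sigma U$ (eigen-decomposition of M), then

$$L^{k+1} = \mathrm{SVT}(M, \mu\beta) = U^T\gamma U$$

with $\gamma_i = \max(\sigma_i - \mu\beta, 0)$.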
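The L step is an eigenvalue thresholding operation, which can be sketched as below. As with NumPy's `eigh` in general, eigenvectors are returned as columns, so the reconstruction reads $V\,\mathrm{diag}(\gamma)\,V^T$. The function name and the test matrix are illustrative assumptions.

```python
import numpy as np

def l_step(M, mu_beta):
    """SVT(M, mu*beta) for symmetric M: shift eigenvalues down by mu*beta, floor at 0."""
    M = (M + M.T) / 2
    sigma, V = np.linalg.eigh(M)
    gamma = np.maximum(sigma - mu_beta, 0.0)
    return (V * gamma) @ V.T

L_new = l_step(np.diag([3.0, 1.0, -2.0]), mu_beta=1.5)
# For this diagonal input the result is diag(1.5, 0, 0): only the eigenvalue 3 survives.
```

The output is positive semidefinite by construction, and any eigenvalue of M below $\mu\beta$ is removed entirely, which is what produces a low-rank L.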


Concluding remark

- Tailoring optimization algorithms to the specific statistics problem

- Efforts to polish the solver


Thank You
