
Interplay between Statistics and Optimization

Hui Zou

University of Minnesota

SAMSI, August 29, 2016

Part I Part II: MM Part III: ADMM

[Figure: LASSO coefficient paths. Horizontal axis: |beta|/max|beta| (0.0 to 1.0); vertical axis: standardized coefficients.]


Microarray data in early 2000s

- Large scale multiple testing: false discovery rate (FDR) control (Benjamini and Hochberg), local FDR (Efron), Higher criticism (Donoho and Jin), SAM (Tibshirani)
- Regression analysis: Least Angle Regression (Efron, Hastie, Johnstone and Tibshirani), Lasso

Various penalization techniques: SCAD, MCP, Elastic Net, Adaptive Lasso, fused Lasso, group Lasso, ...

More sophisticated models/problems: GLM, GAM, precision matrix estimation, covariance matrix estimation, ...

Compressed sensing, Matrix completion, Robust PCA

Tensor regression, Tensor completion, Tensor decomposition


My personal view

Optimization for Statistics: model fitting, model formulation, theoretical analysis

Statistics for Optimization: new research thrusts

Today’s talk:

Majorization-Minimization (MM)

Alternating Direction Method of Multipliers (ADMM)



Majorization-Minimization

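Solve $\arg\min_{\theta} C(\theta)$

Majorization step:

$$C(\theta) < D(\theta \mid \theta_k) \ \text{ for any } \theta \neq \theta_k, \qquad C(\theta_k) = D(\theta_k \mid \theta_k)$$

Minimization step:

$$\theta_k \leftarrow \theta_{k+1} = \arg\min_{\theta} D(\theta \mid \theta_k)$$

Lange, Hunter and Yang (2000) [optimization transfer]; Hunter & Lange (2000) [MM]; Wu and Lange (2010) [EM and MM]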
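The two MM steps can be illustrated on a toy problem. The sketch below is not from the talk: it assumes the objective $C(\theta) = \sum_i |x_i - \theta|$ (whose minimizer is the sample median) and the standard quadratic majorizer $|r| \le r^2/(2|r_k|) + |r_k|/2$, so the minimization step becomes a weighted average. The function name `mm_median` and the guard `eps` are illustrative choices.

```python
import numpy as np

def mm_median(x, n_iter=100, eps=1e-10):
    """MM iterations for C(theta) = sum_i |x_i - theta| (illustrative example)."""
    x = np.asarray(x, dtype=float)
    theta = x.mean()                          # any starting value works
    history = [np.abs(x - theta).sum()]       # C(theta_k) along the iterations
    for _ in range(n_iter):
        # Majorization: |r| <= r^2 / (2|r_k|) + |r_k| / 2, with equality at |r| = |r_k|.
        # eps guards against division by zero (an assumption of this sketch).
        w = 1.0 / np.maximum(np.abs(x - theta), eps)
        # Minimization: the surrogate is a weighted least squares problem in theta.
        theta = np.sum(w * x) / np.sum(w)
        history.append(np.abs(x - theta).sum())
    return theta, history

theta_hat, history = mm_median([1.0, 2.0, 3.0, 10.0, 50.0])
```

The recorded `history` should be non-increasing, which is the descent property that the majorization step is designed to deliver.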


LLA

Zou and Li (2008)

Fan, Xue and Zou (2014)


Nonconvex penalized regression

$$\min_{\beta} \ \ell_n(\beta) + \sum_j P_\lambda(|\beta_j|)$$

$\ell_n$ is convex and represents the statistical inference model

- least squares loss
- Huber's M loss or least absolute loss
- logistic regression: negative log-Bernoulli-likelihood
- quantile regression: check loss
- Ising model: composite conditional likelihood (Xue, Zou and Cai, 2012)

$P_\lambda(t)$ is a non-decreasing concave function for $t \in (0, \infty)$

- $L_q$ norm penalty ($0 < q < 1$)
- SCAD (Fan and Li, 2001)
- MCP (Zhang, 2010)
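For concreteness, here is a minimal sketch of the two folded concave penalties named above. The slides do not spell out the formulas; the code uses the standard definitions from Fan and Li (2001) for SCAD and Zhang (2010) for MCP. The function names are illustrative.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty P_lambda(|t|): linear, then quadratic, then constant."""
    t = np.abs(np.asarray(t, dtype=float))
    linear = lam * t
    quadratic = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    constant = lam ** 2 * (a + 1) / 2
    return np.where(t <= lam, linear, np.where(t <= a * lam, quadratic, constant))

def mcp_penalty(t, lam, a=2.0):
    """MCP penalty P_lambda(|t|): quadratic up to a*lambda, then constant."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= a * lam, lam * t - t ** 2 / (2 * a), a * lam ** 2 / 2)

grid = np.linspace(0.0, 10.0, 101)
scad_values = scad_penalty(grid, lam=2.0, a=3.7)
mcp_values = mcp_penalty(grid, lam=2.0, a=2.0)
```

Both functions are non-decreasing and concave on $(0, \infty)$ and flatten out for large $|t|$, unlike the Lasso penalty $\lambda|t|$.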


[Figure: penalty functions on the range −10 to 10. Curves: Lasso (lambda = 2), SCAD (lambda = 2, a = 3.7), MCP (lambda = 2, a = 2). Vertical axis: penalty.]


LLA

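$$\arg\min_{\beta} \Big\{ \ell_n(\beta) + \sum_{j=1}^{p} P_\lambda(|\beta_j|) \Big\}$$

1 Start with some initial estimator $\beta^{(0)}$.

2 At step $k$, define

$$Q_\lambda(\beta_j) = P_\lambda(|\beta_j^{(k)}|) + P'_\lambda(|\beta_j^{(k)}|+)\,\big(|\beta_j| - |\beta_j^{(k)}|\big)$$

3 Solve $\beta^{(k+1)} = \arg\min_{\beta} \big\{ \ell_n(\beta) + \sum_{j=1}^{p} Q_\lambda(\beta_j) \big\}$.

Iterate between Steps 2 and 3.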
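Because $Q_\lambda$ is linear in $|\beta_j|$, Step 3 is a weighted Lasso problem with weights $P'_\lambda(|\beta_j^{(k)}|+)$. The sketch below makes that concrete under several assumptions that are not in the slides: a least squares loss scaled as $\frac{1}{2n}\|y - X\beta\|_2^2$, the SCAD derivative, a zero initial estimator, and a plain coordinate descent inner solver. All function names are illustrative.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """P'_lambda(t+) for SCAD, evaluated at t = |beta_j|."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1))

def weighted_lasso_cd(X, y, weights, beta_init, n_sweeps=200):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + sum_j weights_j |b_j|."""
    n, p = X.shape
    beta = beta_init.astype(float).copy()
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            z = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(z) * max(abs(z) - weights[j], 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)   # keep the residual in sync
            beta[j] = new
    return beta

def lla_scad(X, y, lam, n_steps=3):
    """LLA: repeatedly linearize the penalty and solve a weighted Lasso."""
    beta = np.zeros(X.shape[1])                         # Step 1: zero initial estimator
    for _ in range(n_steps):
        weights = scad_derivative(beta, lam)            # Step 2: slopes of Q_lambda
        beta = weighted_lasso_cd(X, y, weights, beta)   # Step 3: weighted Lasso
    return beta

rng = np.random.default_rng(0)
n, p = 400, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[0, 1, 4]] = [3.0, 1.5, 2.0]
y = X @ beta_true + rng.standard_normal(n)
beta_hat = lla_scad(X, y, lam=0.3)
```

With the zero start, the first LLA step is an ordinary Lasso fit (all weights equal $\lambda$); later steps remove the penalty from coefficients that are already large.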


LLA and EM

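Condition on $P_\lambda$: there is a positive function $H(t)$ such that

$$\exp(-nP_\lambda(|\beta|)) = \int_0^\infty H(t)\, e^{-t|\beta|}\, dt. \qquad (*)$$

Let $\pi(t) = \frac{2}{t} H\!\left(\frac{1}{t}\right)$ and $p(\beta_j \mid \tau_j) = \frac{1}{2\tau_j} e^{-|\beta_j|/\tau_j}$. Then $(*)$ yields

$$\exp(-nP_\lambda(|\beta_j|)) = \int_0^\infty p(\beta_j \mid \tau_j)\, \pi(\tau_j)\, d\tau_j. \qquad (**)$$

$(**)$ represents a hierarchical Bayesian model and suggests an EM algorithm for maximizing the penalized likelihood by treating the $\tau_j$s as "missing values".

Under condition $(*)$, EM = LLA.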


The issue of multiple local solutions

The folded concave penalization problem usually has multiple local solutions, but the theory (namely, the oracle property) is established only for one of the unknown local solutions (Fan and Li, 2001; Fan and Peng, 2004; Lv and Fan, 2008; Fan and Lv, 2011; ...).

For over a decade, the challenging fundamental issue has remained: it is not clear whether the local optimal solution computed by a given optimization algorithm possesses those nice theoretical properties.


Numeric demonstration

Simulation model: $y \sim \mathrm{Bernoulli}\!\left(\frac{\exp(X\beta^\star)}{1+\exp(X\beta^\star)}\right)$, where $X \sim N_p(0, \Sigma)$ with $\Sigma_{ij} = 0.5^{|i-j|}$ and $\beta^\star = (3, 1.5, 0, 0, 2, 0_{p-5})$.

Sparse logistic regression, n = 200 & p = 1000

Method            l1 loss       l2 loss       # FP           # FN
Lasso             5.67 (0.05)   2.37 (0.02)   24.02 (0.44)   0.04 (0.01)
SCAD-CD           4.50 (0.06)   2.13 (0.02)   13.99 (0.31)   0.08 (0.01)
SCAD-LLA-zero     2.16 (0.11)   1.32 (0.06)   0.31 (0.05)    0.22 (0.02)
SCAD-LLA-Lasso    2.08 (0.10)   1.28 (0.06)   0.26 (0.04)    0.19 (0.02)


LLA closes the theoretical gap

In Fan, Xue and Zou (2014) it is shown that

Theorem

- If the initial estimator is Lasso, then the two-step LLA procedure finds the oracle solution with high probability.
- If the initial estimator is zero, then the three-step LLA procedure finds the oracle solution with high probability.

As illustration, the theory is verified for penalized least squares, penalized logistic regression, penalized quantile regression and penalized graphical model estimation.


The philosophical root of our theory

In the classical MLE theory, when the log-likelihood function is not concave, one of the local maximizers of the log-likelihood function is shown to be asymptotically efficient, but how to compute that estimator is very challenging and often unclear.

Le Cam (1956) (and later Bickel 1975) overcame this technical difficulty by focusing on a specially designed one-step Newton-Raphson estimator initialized by a root-n estimator.

Le Cam did not try to get the global maximizer nor the theoretical local maximizer of the likelihood.


The search for the global minimizer

Mixed Integer Programming has been used to get the global minimizer of L0 penalized and SCAD penalized least squares.

Dimitris Bertsimas, Angela King and Rahul Mazumder (2016, AoS). Best Subset Selection via a Modern Optimization Lens.

Hongcheng Liu, Tao Yao, Runze Li (2016). Global solutions to folded concave penalized nonconvex learning. AoS, 44(2), 629-659.

Extension to more general models?


BMD

Yang and Zou (2013)


Coordinate descent for lasso

$$\arg\min_{\beta_1,\dots,\beta_p} \ f(\beta_1,\dots,\beta_p) + \sum_{j=1}^{p} \lambda|\beta_j|$$

1 Initialization of $\beta$

2 Cyclic coordinate descent: for $j = 1, 2, \dots, p, 1, 2, \dots$, update $\beta_j$ by minimizing the objective function

$$\beta_j^{\text{update}} \leftarrow \arg\min_{\beta_j} \ f(\beta_1,\dots,\beta_{j-1},\beta_j,\beta_{j+1},\dots,\beta_p) + \lambda|\beta_j|$$

3 Repeat (2) till convergence.


Lasso regression: $f(\beta_1,\dots,\beta_p) = \|y - X\beta\|_2^2$

$$\beta_j^{\text{update}} \leftarrow \arg\min_{\beta_j} \ \Big\|y - \sum_{k \neq j} x_k\beta_k - x_j\beta_j\Big\|_2^2 + \lambda|\beta_j|$$

reduces to soft-thresholding.

Fu (1998) proposed the algorithm named "shooting".

Friedman, Hastie and Tibshirani (2008) glmnet: the same CD but with clever implementation tricks such as active set, warm start and later strong rule.

For lasso logistic regression, Friedman, Hastie and Tibshirani (2008) did CD within a Newton-Raphson loop. Genkin, Lewis and Madigan (2007) did the standard CD by solving the one-dimensional optimization repeatedly.
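The reduction to soft-thresholding follows from the subgradient condition of the univariate problem. As a worked sketch, write $r_j = y - \sum_{k \neq j} x_k\beta_k$ for the partial residual (a symbol introduced here for convenience) and $S(z,t) = (|z| - t)_+\,\mathrm{sgn}(z)$ for the soft-thresholding operator:

```latex
\begin{aligned}
0 &\in -2\,x_j^{T}\,(r_j - x_j\beta_j) + \lambda\,\partial|\beta_j| \\
\Longrightarrow\quad
\beta_j^{\text{update}} &= \frac{S\!\left(x_j^{T} r_j,\ \lambda/2\right)}{x_j^{T}x_j},
\qquad S(z,t) = (|z|-t)_+\,\mathrm{sgn}(z).
\end{aligned}
```

The threshold is $\lambda/2$ rather than $\lambda$ because the squared error here carries no factor $\frac{1}{2}$.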


Group lasso regression

$$\min_{(\beta_0,\beta)} \ \frac{1}{2}\Big\|y - \beta_0 - \sum_{k} X^{(k)}\beta^{(k)}\Big\|_2^2 + \lambda\sum_{k=1}^{K} \sqrt{p_k}\,\|\beta^{(k)}\|_2.$$

Group Lasso penalty was introduced in Turlach, Venables and Wright (2004) and Yuan and Lin (2006).

A blockwise descent algorithm under a groupwise orthonormal condition: the columns of $X^{(k)}$ are orthonormal.

Orthonormal condition is incompatible with cross-validation, bootstrap, sub-sampling.


A general group lasso problem

$$\arg\min_{\beta} \ \frac{1}{n}\sum_{i=1}^{n} \tau_i\,\Phi(y_i, \beta^{T}x_i) + \lambda\sum_{k=1}^{K} w_k\|\beta^{(k)}\|_2$$

where $\tau_i \ge 0$ and $w_k \ge 0$ for all $i, k$.

The observation weights $\tau_i$ are introduced in order to cover methods such as weighted regression and weighted large margin classification (biased sampling, unequal cost classification).

The penalty weights $w_k$ make a more flexible model. The default choice for $w_k$ is $\sqrt{p_k}$. If we do not want to penalize a group of predictors, simply let the corresponding weight be zero.


Loss functions

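Least squares: $\Phi(y, f) = \frac{1}{2}(y - f)^2$

Logistic regression: $\Phi(y, f) = \log(1 + e^{-yf})$, $y = \pm 1$

Squared hinge loss: $\Phi(y, f) = [(1 - yf)_+]^2$, $y = \pm 1$

Huberized SVM loss: $\Phi(y, f) = \mathrm{hsvm}(yf)$, $y = \pm 1$, where

$$\mathrm{hsvm}(t) = \begin{cases} 0, & t > 1 \\ (1-t)^2/(2\delta), & 1-\delta < t \le 1 \\ 1 - t - \delta/2, & t \le 1-\delta. \end{cases}$$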


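Let $D$ denote the data $\{y, X\}$ and define

$$L(\beta \mid D) = \frac{1}{n}\sum_{i=1}^{n} \tau_i\,\Phi(y_i, \beta^{T}x_i).$$

Definition

The loss function $\Phi$ is said to satisfy the QM condition, if

(i). $\nabla L(\beta \mid D)$ exists everywhere.

(ii). There exists a $p \times p$ matrix $H$, which may only depend on the data $D$, such that for all $\beta, \beta^*$,

$$L(\beta \mid D) \le L(\beta^* \mid D) + (\beta - \beta^*)^{T}\nabla L(\beta^* \mid D) + \frac{1}{2}(\beta - \beta^*)^{T}H(\beta - \beta^*).$$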


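Loss                  $-\nabla L(\beta \mid D)$                                                          $H$
Least squares         $\frac{1}{n}\sum_{i=1}^{n} \tau_i (y_i - x_i^{T}\beta)\,x_i$                       $X^{T}\Gamma X/n$
Logistic regression   $\frac{1}{n}\sum_{i=1}^{n} \tau_i y_i x_i \frac{1}{1+\exp(y_i x_i^{T}\beta)}$      $\frac{1}{4}X^{T}\Gamma X/n$
Squared hinge loss    $\frac{1}{n}\sum_{i=1}^{n} 2\tau_i y_i x_i (1 - y_i x_i^{T}\beta)_+$               $4X^{T}\Gamma X/n$
Huberized SVM loss    $\frac{1}{n}\sum_{i=1}^{n} \tau_i y_i x_i\,\mathrm{hsvm}'(y_i x_i^{T}\beta)$       $\frac{2}{\delta}X^{T}\Gamma X/n$

$\Gamma = \mathrm{diag}(\tau_1, \dots, \tau_n)$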
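The QM condition can be checked numerically for one row of the table. The sketch below takes the logistic regression row, with $H = \frac{1}{4}X^{T}\Gamma X/n$, and verifies the quadratic upper bound at random pairs $(\beta, \beta^*)$. The simulated data and the function names are illustrative assumptions.

```python
import numpy as np

def logistic_loss(beta, X, y, tau):
    """L(beta | D) for the logistic loss with labels y in {-1, +1}."""
    return np.mean(tau * np.log1p(np.exp(-y * (X @ beta))))

def neg_gradient(beta, X, y, tau):
    """U(beta) = -grad L(beta | D), the logistic regression row of the table."""
    margins = y * (X @ beta)
    return X.T @ (tau * y / (1.0 + np.exp(margins))) / len(y)

def qm_upper_bound(beta, beta_star, X, y, tau):
    """Right hand side of the QM inequality with H = (1/4) X^T Gamma X / n."""
    H = 0.25 * X.T @ (tau[:, None] * X) / len(y)
    d = beta - beta_star
    return (logistic_loss(beta_star, X, y, tau)
            - d @ neg_gradient(beta_star, X, y, tau)
            + 0.5 * d @ H @ d)

rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.standard_normal((n, p))
y = rng.choice([-1.0, 1.0], size=n)
tau = rng.uniform(0.5, 1.5, size=n)

gaps = []   # bound minus loss; should never be negative
for _ in range(100):
    b, b_star = rng.standard_normal(p), rng.standard_normal(p)
    gaps.append(qm_upper_bound(b, b_star, X, y, tau) - logistic_loss(b, X, y, tau))
```

The bound holds because the second derivative of $\log(1 + e^{-t})$ never exceeds $\frac{1}{4}$.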


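Let $\tilde\beta$ denote the current solution. Write $\beta$ such that $\beta^{(k')} = \tilde\beta^{(k')}$ for $k' \neq k$.

Given $\beta^{(k')} = \tilde\beta^{(k')}$ for $k' \neq k$, the optimal $\beta^{(k)}$ is defined as

$$\arg\min_{\beta^{(k)}} \ L(\beta \mid D) + \lambda w_k\|\beta^{(k)}\|_2.$$

By QM condition,

$$L(\beta \mid D) \le L(\tilde\beta \mid D) + (\beta - \tilde\beta)^{T}\nabla L(\tilde\beta \mid D) + \frac{1}{2}(\beta - \tilde\beta)^{T}H(\beta - \tilde\beta).$$

Write $U(\tilde\beta) = -\nabla L(\tilde\beta \mid D)$.

$$L(\beta \mid D) \le L(\tilde\beta \mid D) - (\beta^{(k)} - \tilde\beta^{(k)})^{T}U^{(k)} + \frac{1}{2}(\beta^{(k)} - \tilde\beta^{(k)})^{T}H^{(k)}(\beta^{(k)} - \tilde\beta^{(k)}).$$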


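Let $\eta_k$ be the largest eigenvalue of $H^{(k)}$. We set $\gamma_k = (1 + 10^{-4})\eta_k$.

$$L(\beta \mid D) \le L(\tilde\beta \mid D) - (\beta^{(k)} - \tilde\beta^{(k)})^{T}U^{(k)} + \frac{1}{2}\gamma_k\|\beta^{(k)} - \tilde\beta^{(k)}\|_2^2 \qquad (*)$$

"=" holds if and only if $\beta^{(k)} = \tilde\beta^{(k)}$.

The minimizer of the right hand side of $(*)$ plus the penalty $\lambda w_k\|\beta^{(k)}\|_2$ is

$$\tilde\beta^{(k)}(\text{new}) = \frac{1}{\gamma_k}\big(U^{(k)} + \gamma_k\tilde\beta^{(k)}\big)\left(1 - \frac{\lambda w_k}{\|U^{(k)} + \gamma_k\tilde\beta^{(k)}\|_2}\right)_+.$$

The whole process drives the objective strictly downhill unless the optimal solution is reached (i.e., KKT conditions are satisfied).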


BMD for group lasso

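For $k = 1, \dots, K$, compute $\gamma_k$, the largest eigenvalue of $H^{(k)}$.

Set $\gamma_k \leftarrow (1 + 10^{-4})\gamma_k$ (for nontrivial groups with size $\ge 2$).

Initialize $\tilde\beta$.

Repeat the following cyclic blockwise updates until convergence:

- for $k = 1, \dots, K$, do (1)-(3)
- (1) Compute $U(\tilde\beta) = -\nabla L(\tilde\beta \mid D)$.
- (2) Compute $\tilde\beta^{(k)}(\text{new}) = \frac{1}{\gamma_k}\big(U^{(k)} + \gamma_k\tilde\beta^{(k)}\big)\left(1 - \frac{\lambda w_k}{\|U^{(k)} + \gamma_k\tilde\beta^{(k)}\|_2}\right)_+$.
- (3) Set $\tilde\beta^{(k)} = \tilde\beta^{(k)}(\text{new})$.

gglasso package: also uses active set, strong rule and warm start.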
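The blockwise updates can be sketched for the least squares loss with $\tau_i = 1$, where $U = X^{T}(y - X\beta)/n$ and $H^{(k)} = X^{(k)T}X^{(k)}/n$. This is a minimal illustration, not the gglasso implementation: it has no intercept, no active set, no strong rule and no warm start, and the function name and simulated data are assumptions.

```python
import numpy as np

def bmd_group_lasso(X, y, groups, lam, n_sweeps=1000, tol=1e-12):
    """Blockwise majorization descent for (1/2n)||y - Xb||^2 + lam * sum_k w_k ||b_k||_2."""
    n, p = X.shape
    beta = np.zeros(p)
    gammas, weights = [], []
    for idx in groups:
        H_k = X[:, idx].T @ X[:, idx] / n
        gamma = np.linalg.eigvalsh(H_k)[-1]      # largest eigenvalue of H^(k)
        if len(idx) >= 2:
            gamma *= 1 + 1e-4                    # inflate for groups of size >= 2
        gammas.append(gamma)
        weights.append(np.sqrt(len(idx)))        # default w_k = sqrt(p_k)
    resid = y - X @ beta
    for _ in range(n_sweeps):
        max_change = 0.0
        for idx, gamma, w in zip(groups, gammas, weights):
            U_k = X[:, idx].T @ resid / n        # block k of -grad L at the current beta
            v = U_k + gamma * beta[idx]
            norm_v = np.linalg.norm(v)
            shrink = max(1.0 - lam * w / norm_v, 0.0) if norm_v > 0 else 0.0
            new = v * shrink / gamma             # group-wise soft-thresholding
            resid -= X[:, idx] @ (new - beta[idx])
            max_change = max(max_change, np.max(np.abs(new - beta[idx])))
            beta[idx] = new
        if max_change < tol:
            break
    return beta

rng = np.random.default_rng(2)
n = 100
X = rng.standard_normal((n, 6))
groups = [[0, 1], [2, 3], [4, 5]]
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.0, 0.0]) + 0.5 * rng.standard_normal(n)
lam = 0.3
beta_hat = bmd_group_lasso(X, y, groups, lam)
```

At convergence the KKT conditions hold group by group: $U^{(k)} = \lambda w_k \beta^{(k)}/\|\beta^{(k)}\|_2$ for nonzero groups and $\|U^{(k)}\|_2 \le \lambda w_k$ for zero groups.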


Competitors

Block coordinate gradient descent, grplasso: Meier et al. (2008), for group-lasso logistic regression.

ISTA-BC algorithm: Qin et al. (2010), an extension of ISTA/FISTA (Beck & Teboulle 2009) based on variable step-lengths.

SLEP, which implements Nesterov's method: Liu et al. (2009)


Dataset         Type   n     q       p        Data Source
Autompg         R      392   7       31       (Quinlan 1993)
Bardet          R      120   200     1000     (Scheetz et al. 2006)
Cardiomypathy   R      30    6319    31595    (Segal et al. 2003)
Spectroscopy    R      103   100     500      (Sabo et al. 2008)
Breast          C      42    22283   111415   (Graham et al. 2010)
Colon           C      62    2000    10000    (Alon et al. 1999)
Prostate        C      102   6033    30165    (Singh et al. 2002)
Sonar           C      208   60      300      (Gorman et al. 1988)

Some real datasets. n is the number of instances. q is the number of original variables. p is the number of predictors after expansion. "R" means regression and "C" means classification.


Group-lasso GAM regression, timing performance

Dataset    Autompg   Bardet   Cardiomypathy   Spectroscopy
SLEP       3.14      9.96     78.23           9.37
ISTA-BC    5.66      1.55     2.43            1.31
gglasso    2.51      0.77     2.48            0.76

All experiments were carried out on an Intel Xeon X5560 (Quad-core 2.8 GHz) processor.


Group-lasso GAM classification, timing performance

Dataset             Colon   Prostate   Sonar   Breast
grplasso (Logit)    60.42   111.75     24.55   439.76
SLEP (Logit)        75.31   166.91     5.49    358.75
gglasso (Logit)     1.13    3.877      1.54    9.62
gglasso (HSVM)      1.15    3.53       0.66    9.15

All experiments were carried out on an Intel Xeon X5560 (Quad-core 2.8 GHz) processor.


A Small Trick

Yang and Zou (2012)


A counterintuitive phenomenon

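Consider glmnet for fitting elastic net penalized regression. W.L.O.G. assume $\sum_{i=1}^{N} x_{ij} = 0$, $\frac{1}{N}\sum_{i=1}^{N} x_{ij}^2 = 1$, for $j = 1, \dots, p$.

$$R(\beta_0, \beta) = \frac{1}{2N}\sum_{i=1}^{N} (y_i - \beta_0 - x_i^{\top}\beta)^2 + P_{\lambda,\alpha}(\beta),$$

where $P_{\lambda,\alpha}(\beta)$ is the elastic net penalty

$$P_{\lambda,\alpha}(\beta) = \lambda\sum_{j=1}^{p} p_\alpha(\beta_j) = \lambda\sum_{j=1}^{p}\Big[\frac{1}{2}(1-\alpha)\beta_j^2 + \alpha|\beta_j|\Big].$$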


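glmnet implements the standard CD algorithm in which we iteratively solve a univariate elastic net problem

$$\hat\beta_j = \arg\min_{\beta_j} R(\beta_j \mid \beta_0, \tilde\beta),$$

where $\tilde\beta$ is the current solution and

$$R(\beta_j \mid \beta_0, \tilde\beta) = \frac{1}{2}(\beta_j - \tilde\beta_j)^2 - \frac{1}{N}\sum_{i=1}^{N} r_i x_{ij}(\beta_j - \tilde\beta_j) + \lambda p_\alpha(\beta_j).$$

$$\hat\beta_j = \frac{S\big(\frac{1}{N}\sum_{i=1}^{N} x_{ij}r_i + \tilde\beta_j,\ \lambda\alpha\big)}{1 + \lambda(1-\alpha)},$$

where $S(z, t) = (|z| - t)_+\,\mathrm{sgn}(z)$.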


A tiny change to glmnet

We change the univariate update formula to

$$\hat\beta_j^{B} = \frac{S\big(\frac{1}{N}\sum_{i=1}^{N} x_{ij}r_i + f\cdot\tilde\beta_j,\ \lambda\alpha\big)}{f\cdot 1 + \lambda(1-\alpha)} \qquad (f \ge 1)$$

Yang made a code error by using f = 2 in glmnet, but still got good/even better results.

As long as f ≥ 1 the iterative process converges to the desired solution.

A bigger f means a smaller step size along each coordinate direction. For an orthogonal design, f = 1 is the best choice.
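The modified update is easy to try out. The sketch below is a bare coordinate descent loop with the extra factor f. It is not the glmnet code: there is no active set, warm start or strong rule, and the simulated equicorrelated design and function names are assumptions. Here `resid` plays the role of $r_i$, taken to be the current residual $y_i - \beta_0 - x_i^{\top}\tilde\beta$ (the slides use $r_i$ without defining it).

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = (|z| - t)_+ sgn(z)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def enet_cd(X, y, lam, alpha, f=1.0, n_sweeps=5000, tol=1e-13):
    """Elastic net coordinate descent with the modified update; f = 1 is the standard one."""
    N, p = X.shape
    beta = np.zeros(p)
    resid = y - y.mean()                     # intercept handled by centering y
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            z = X[:, j] @ resid / N + f * beta[j]
            new = soft_threshold(z, lam * alpha) / (f + lam * (1 - alpha))
            resid -= X[:, j] * (new - beta[j])
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta

rng = np.random.default_rng(3)
N, p, rho = 60, 8, 0.5
X = np.sqrt(1 - rho) * rng.standard_normal((N, p)) + np.sqrt(rho) * rng.standard_normal((N, 1))
X = X - X.mean(axis=0)                       # columns sum to zero
X = X / np.sqrt((X ** 2).mean(axis=0))       # columns have mean square one
j = np.arange(1, p + 1)
y = X @ ((-1.0) ** j * np.exp(-(2 * j - 1) / 20)) + rng.standard_normal(N)

beta_f1 = enet_cd(X, y, lam=0.1, alpha=0.5, f=1.0)
beta_f2 = enet_cd(X, y, lam=0.1, alpha=0.5, f=2.0)
```

Both runs should reach the same minimizer, since the fixed point of the update does not depend on f; only the path taken to get there changes.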


Simulation

FHT model: We simulated data with N observations and p predictors where each pair of predictors $X_j$ and $X_{j'}$ has the same population correlation $\rho$, with $\rho$ ranging from zero to 0.95. The response variable was generated by

$$Y = \sum_{j=1}^{p} X_j\beta_j + k \cdot N(0, 1),$$

where $\beta_j = (-1)^j \exp(-(2j-1)/20)$ and k is set to make the signal-to-noise ratio equal 3.

We compared f = 1 (glmnet) and f = 2 (glmnet2).


Correlation    0        0.1      0.2      0.5      0.8      0.95

α = 1, N = 100, p = 5000
glmnet         0.2222   0.2339   0.2979   0.4606   0.7919   1.9016
glmnet2        0.2533   0.2519   0.2886   0.3758   0.5450   1.0735

α = 0.5, N = 100, p = 5000
glmnet         0.2107   0.2189   0.2356   0.3669   0.7765   2.1528
glmnet2        0.2225   0.2285   0.2414   0.2861   0.4876   1.3335


A simple explanation

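$$\upsilon_j = \Big(0, \cdots, \frac{x_j^{\top}y}{fN}, \cdots, 0\Big), \qquad u_j = (u_{kj})_{p\times 1}, \quad u_{kj} = \begin{cases} -\frac{1}{f}, & k = j \\ -\frac{\rho}{f}, & k \neq j \end{cases}$$

$$W_j = I_{p\times p} + \big[\,0_{p\times(j-1)} \ \ u_j \ \ 0_{p\times(p-j)}\,\big]$$

$$A = \prod_{j=1}^{p} W_j, \qquad \mu = \sum_{s=1}^{p-1}\Big(\upsilon_s\prod_{j=s+1}^{p} W_j\Big) + \upsilon_p$$

If we apply the CD and CMD to the LS problem, after a complete cycle from j = 1 to j = p, we get

$$\beta^{(k)} = \beta^{(k-1)}A + \mu$$

The convergence rate is basically the maximum eigenvalue of $(A^k)^{\top}A^k$, which is affected by both f and ρ.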
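The quantity $\eta_{\max}((A^k)^{\top}A^k)$ can be computed directly from these definitions. The sketch below builds $A$ for an equicorrelated, standardized design, treating $\beta$ as a row vector as in $\beta^{(k)} = \beta^{(k-1)}A + \mu$. The values p = 5 and ρ = 0.5 and the function names are illustrative choices.

```python
import numpy as np

def cycle_matrix(p, rho, f):
    """A = W_1 W_2 ... W_p, where W_j = I + [0 ... u_j ... 0] modifies column j only."""
    A = np.eye(p)
    for j in range(p):
        u_j = np.full(p, -rho / f)
        u_j[j] = -1.0 / f
        W_j = np.eye(p)
        W_j[:, j] += u_j
        A = A @ W_j
    return A

def eta_max(A, k):
    """Largest eigenvalue of (A^k)^T A^k."""
    A_k = np.linalg.matrix_power(A, k)
    return np.linalg.eigvalsh(A_k.T @ A_k)[-1]

cycles = (1, 5, 50, 300)
rates = {f: [eta_max(cycle_matrix(5, 0.5, f), k) for k in cycles] for f in (1, 2, 3)}
```

Comparing the entries of `rates` across f for a given ρ shows how the factor f changes the speed at which the error contracts.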


[Figure: $\eta_{\max}((A^k)^{T}A^k)$ versus log(Iteration k) for f = 1, 2, 3, 4, 5. Panels: (a) ρ = 0.1, (b) ρ = 0.5, (c) ρ = 0.8, (d) ρ = 0.95.]


              Colon    Prostate   WBCD       Ionosphere   Sonar
N             62       102        569        351          208
p             2000     6033       495 (30)   560 (32)     1890 (60)
α_CV          0.6      0.5        0.6        0.4          0.4
Test Error    8.3%     5%         1.77%      2.86%        24.39%
glmnet        0.1166   0.3283     9.4039     0.5158       2.0828
glmnet2       0.0910   0.2938     4.9593     0.3667       1.0945
Improv. %     +28%     +11.7%     +89.6%     +40.6%       +90.3%


ADMM

Douglas & Rachford (1956); Lions & Mercier (1979); Eckstein & Bertsekas (1992)

Goldstein & Osher (2009); Yin, Osher, Goldfarb, and Darbon (2008); Goldfarb & Ma (2012)

Many applications in signal processing, statistics, machine learning


Improving MPT

Xue, Ma and Zou (2012)


An investor has p assets. Asset j makes up a proportion $\omega_j$ of the investor's portfolio.

$$\omega_j \ge 0, \qquad \sum_{j=1}^{p} \omega_j = 1.$$

Asset j delivers return $R_j$, which has mean $\mu_j$ and variance $\sigma_j^2$.

The mean of the return of the entire portfolio is $\sum_{j=1}^{p} \omega_j\mu_j$ and the variance of the portfolio's return is

$$\sum_{i}^{p}\sum_{j}^{p} \omega_i\omega_j\sigma_i\sigma_j\rho_{ij}$$

where $\rho_{ij}$ is the correlation between $R_i$ and $R_j$.
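As a small worked example of the two formulas, the sketch below computes the portfolio mean and variance for two uncorrelated assets. The numbers and the function name are made up for illustration.

```python
import numpy as np

def portfolio_mean_variance(w, mu, sigma, rho):
    """Mean sum_j w_j mu_j and variance sum_ij w_i w_j sigma_i sigma_j rho_ij."""
    w, mu, sigma, rho = (np.asarray(a, dtype=float) for a in (w, mu, sigma, rho))
    cov = np.outer(sigma, sigma) * rho       # covariance matrix from sigmas and correlations
    return w @ mu, w @ cov @ w

mean_p, var_p = portfolio_mean_variance(
    w=[0.5, 0.5], mu=[0.05, 0.10], sigma=[0.1, 0.2], rho=np.eye(2))
```

With equal weights and zero correlation, the variance is $0.25 \cdot 0.01 + 0.25 \cdot 0.04 = 0.0125$, lower than that of either asset held alone at the same weight scale, which is the diversification effect MPT builds on.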


$w = (\omega_1, \dots, \omega_p)^{T}$, $\mu = (\mu_1, \dots, \mu_p)^{T}$, $\Sigma$ is the covariance matrix of the return vector $(R_1, \dots, R_p)^{T}$.

MPT

$$\arg\min_{w} \ w^{T}\Sigma w \quad \text{s.t.} \quad w^{T}\mu = \mu_P, \ \ w^{T}\vec{1} = 1, \ \ \omega_j \ge 0, \ j = 1, \dots, p.$$

MPT (1952, J. of Finance) won the 1990 Nobel Prize in Economics.


The usual implementation of MPT

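$\hat\mu = (\hat\mu_1, \dots, \hat\mu_p)^{T}$ is the sample mean vector; $\Sigma_n$ is the sample covariance matrix.

Empirical MPT

$$\hat{w} = \arg\min_{w} \ w^{T}\Sigma_n w \quad \text{s.t.} \quad w^{T}\hat\mu = \mu_P, \ \ w^{T}\vec{1} = 1, \ \ \omega_j \ge 0, \ j = 1, \dots, p.$$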


When p is relatively large

The sample cov. matrix performs poorly (Johnstone, 2001). It leads to bias and undesirable risk issues in the empirical MPT (El Karoui, 2010; Brodie et al., 2009; DeMiguel et al., 2009; Fan et al., 2012).

Under some suitable "sparsity" assumption on Σ, an optimal estimator can be obtained by thresholding (Bickel and Levina, 2008a; El Karoui, 2008; Cai and Zhou, 2011).

Let $\hat\sigma_{ij}$ be the ij entry of the sample covariance matrix.

$$\hat\Sigma_{\text{thresholding}} = \{s_\lambda(\hat\sigma_{ij})\}_{1 \le i,j \le p}$$

The difficulty is how to preserve both P.D. and sparsity simultaneously.


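Notation: $|\Sigma|_1 = \sum_{i \neq j} |\sigma_{ij}|$, $\|\Sigma\|_F^2 = \sum_{i,j} \sigma_{ij}^2$.

The soft-thresholding estimator is the global solution of

$$\arg\min_{\Sigma} \ \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1.$$

PSD sparse covariance estimator

$$\hat\Sigma^{+} = \arg\min_{\Sigma \succeq \varepsilon I} \ \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1.$$

$\varepsilon = 10^{-6}$. $\varepsilon$ can be another positive constant depending on the application.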


Algorithm

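The augmented Lagrangian function for some given parameter $\mu$,

$$L(\Theta, \Sigma; \Lambda) = \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1 - \langle\Lambda, \Theta - \Sigma\rangle + \frac{1}{2\mu}\|\Theta - \Sigma\|_F^2,$$

where $\Lambda$ is the Lagrange multiplier. For $i = 0, 1, 2, \dots$,

Θ step: $\Theta^{i+1} = \arg\min_{\Theta \succeq \varepsilon I} L(\Theta, \Sigma^{i}; \Lambda^{i})$

Σ step: $\Sigma^{i+1} = \arg\min_{\Sigma} L(\Theta^{i+1}, \Sigma; \Lambda^{i})$

Λ step: $\Lambda^{i+1} = \Lambda^{i} - \frac{1}{\mu}(\Theta^{i+1} - \Sigma^{i+1})$.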


Θ step

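$$L(\Theta, \Sigma; \Lambda) = \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1 - \langle\Lambda, \Theta - \Sigma\rangle + \frac{1}{2\mu}\|\Theta - \Sigma\|_F^2$$

$$\begin{aligned}
\Theta^{i+1} &= \arg\min_{\Theta \succeq \varepsilon I} L(\Theta, \Sigma^{i}; \Lambda^{i}) \\
&= \arg\min_{\Theta \succeq \varepsilon I} -\langle\Lambda^{i}, \Theta\rangle + \frac{1}{2\mu}\|\Theta - \Sigma^{i}\|_F^2 \\
&= \arg\min_{\Theta \succeq \varepsilon I} \|\Theta - (\Sigma^{i} + \mu\Lambda^{i})\|_F^2 \\
&= (\Sigma^{i} + \mu\Lambda^{i})_+.
\end{aligned}$$

Let Z's eigen-decomposition be $\sum_{j=1}^{p} \lambda_j v_j^{T}v_j$, then define $(Z)_+ = \sum_{j=1}^{p} \max(\lambda_j, \varepsilon)\, v_j^{T}v_j$.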


Σ step

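$$L(\Theta, \Sigma; \Lambda) = \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1 - \langle\Lambda, \Theta - \Sigma\rangle + \frac{1}{2\mu}\|\Theta - \Sigma\|_F^2$$

$$\begin{aligned}
\Sigma^{i+1} &= \arg\min_{\Sigma} L(\Theta^{i+1}, \Sigma; \Lambda^{i}) \\
&= \arg\min_{\Sigma} \ \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda|\Sigma|_1 + \langle\Lambda^{i}, \Sigma\rangle + \frac{1}{2\mu}\|\Sigma - \Theta^{i+1}\|_F^2 \\
&= \arg\min_{\Sigma} \ \frac{1}{2}\Big\|\Sigma - \frac{\mu(\Sigma_n - \Lambda^{i}) + \Theta^{i+1}}{1+\mu}\Big\|_F^2 + \frac{\lambda\mu}{1+\mu}|\Sigma|_1 \\
&= \frac{1}{1+\mu}\, S\big(\mu(\Sigma_n - \Lambda^{i}) + \Theta^{i+1},\ \lambda\mu\big).
\end{aligned}$$

Define $S(Z, \tau) = \{s(z_{j\ell}, \tau)\}_{1 \le j,\ell \le p}$ with $s(z_{j\ell}, \tau) = \mathrm{sign}(z_{j\ell})\max(|z_{j\ell}| - \tau, 0)\,I_{\{j \neq \ell\}} + z_{j\ell}\,I_{\{j = \ell\}}$.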
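The three steps fit into a short loop. The sketch below follows the Θ, Σ and Λ updates as written, using an eigenvalue floor for $(Z)_+$ and off-diagonal soft-thresholding for $S(Z, \tau)$. The starting values, the choice $\mu = 2$, the fixed iteration count and the simulated data are assumptions of this illustration.

```python
import numpy as np

def eig_floor(Z, eps):
    """(Z)_+ : replace each eigenvalue lambda_j of Z by max(lambda_j, eps)."""
    vals, vecs = np.linalg.eigh((Z + Z.T) / 2)
    return (vecs * np.maximum(vals, eps)) @ vecs.T

def soft_threshold_offdiag(Z, tau):
    """S(Z, tau): soft-threshold off-diagonal entries, keep the diagonal."""
    out = np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)
    np.fill_diagonal(out, np.diag(Z))
    return out

def admm_sparse_cov(S_n, lam, mu=2.0, eps=1e-6, n_iter=1000):
    Sigma = soft_threshold_offdiag(S_n, lam)   # start from the plain thresholding estimator
    Lam = np.zeros_like(S_n)
    for _ in range(n_iter):
        Theta = eig_floor(Sigma + mu * Lam, eps)                          # Theta step
        Sigma = soft_threshold_offdiag(mu * (S_n - Lam) + Theta,
                                       lam * mu) / (1 + mu)               # Sigma step
        Lam = Lam - (Theta - Sigma) / mu                                  # Lambda step
    return Sigma, Theta

rng = np.random.default_rng(4)
data = rng.standard_normal((8, 10))            # fewer samples than variables
S_n = np.cov(data, rowvar=False)
Sigma_hat, Theta_hat = admm_sparse_cov(S_n, lam=0.2)
```

`Theta_hat` satisfies the eigenvalue constraint by construction, and at convergence it coincides with the sparse iterate `Sigma_hat`.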


Improved Empirical MPT

$$\hat{w} = \arg\min_{w} \ w^{T}\hat\Sigma^{+}w \quad \text{s.t.} \quad w^{T}\hat\mu = \mu_P, \ \ w^{T}\vec{1} = 1, \ \ \omega_j \ge 0, \ j = 1, \dots, p.$$
[Figure: return versus risk for Portfolio (P1) and Portfolio (P2).]

Two MPT frontiers based on S&P 100 from Jan. 1990 to Jan. 1993. Red: new; blue: traditional.


Latent Variable glasso

Ma, Xue and Zou (2013)


Latent variable Gaussian graphical model

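Observed X (p-dim.) and unobserved Y (q-dim.) are jointly Gaussian

$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim N_{p+q}\left(\begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_Y \end{bmatrix}\right)$$

Sparsity: X|Y has a sparse Gaussian graphical model representation.

$$\Sigma^{-1} = \Theta = \begin{bmatrix} \Theta_X & \Theta_{XY} \\ \Theta_{YX} & \Theta_Y \end{bmatrix}$$

X|Y is normal with precision matrix $\Theta_X$.

How to estimate $\Theta_X$ just based on X?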


A convex formulation

Chandrasekaran, Parrilo & Willsky (2012)

A key observation: $\Sigma_X^{-1} = \Theta_X - \Theta_{XY}\Theta_Y^{-1}\Theta_{YX}$

$\Theta_X$ is sparse (assumption),

$\Theta_{XY}$ has rank at most q, so the rank of $\Theta_{XY}\Theta_Y^{-1}\Theta_{YX}$ is at most q.

If we assume q is small (very reasonable in applications), we have a "sparse" - "low-rank" decomposition of $\Sigma_X^{-1}$, the marginal precision matrix of X.
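The key observation is the Schur complement identity for block matrix inversion, and it is easy to confirm numerically. The sketch below uses a randomly generated positive definite precision matrix with p = 4 and q = 2; the construction and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
p, q = 4, 2
B = rng.standard_normal((p + q, p + q))
Theta = B @ B.T + (p + q) * np.eye(p + q)      # a positive definite joint precision matrix
Sigma = np.linalg.inv(Theta)                   # joint covariance of (X, Y)

Theta_X, Theta_XY, Theta_Y = Theta[:p, :p], Theta[:p, p:], Theta[p:, p:]
low_rank = Theta_XY @ np.linalg.solve(Theta_Y, Theta_XY.T)   # Theta_XY Theta_Y^{-1} Theta_YX
marginal_precision = np.linalg.inv(Sigma[:p, :p])            # Sigma_X^{-1}
```

Here `marginal_precision` agrees with `Theta_X - low_rank`, and `low_rank` has rank at most q.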


LVGM estimator

Write $\Sigma_X^{-1} = S - L$, where S is a sparse PD matrix and L is a low rank SPD matrix.

$$\min_{(S,L)} \ \langle\Sigma_X, S - L\rangle - \log\det(S - L) + \alpha\|S\|_1 + \beta\,\mathrm{Tr}(L) \quad \text{subject to} \quad S - L \succ 0, \ L \succeq 0$$

$\|S\|_1$ is a convex relaxation of the sparsity of S. $\mathrm{Tr}(L)$ is a convex relaxation of the rank of L.

Chandrasekaran, Parrilo & Willsky (2012) viewed the above as a log-determinant semidefinite programming problem.


Algorithm

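$R = S - L$

$$\min_{(R,S,L)} \ \langle\Sigma_X, R\rangle - \log\det(R) + \alpha\|S\|_1 + \beta\,\mathrm{Tr}(L) \quad \text{subject to} \quad R \succ 0, \ L \succeq 0$$

augmented Lagrangian

$$L(R, S, L; \Lambda) = \langle\Sigma_X, R\rangle - \log\det(R) + \alpha\|S\|_1 + \beta\,\mathrm{Tr}(L) - \langle\Lambda, R - S + L\rangle + \frac{1}{2\mu}\|R - S + L\|_F^2.$$

alternating minimization

$R^{k+1} = \arg\min_{R} L(R, S^{k}, L^{k}; \Lambda^{k})$

$S^{k+1} = \arg\min_{S} L(R^{k+1}, S, L^{k}; \Lambda^{k})$

$L^{k+1} = \arg\min_{L \succeq 0} L(R^{k+1}, S^{k+1}, L; \Lambda^{k})$

$\Lambda^{k+1} = \Lambda^{k} - \frac{1}{\mu}(R^{k+1} - S^{k+1} + L^{k+1})$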


R step

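$$\arg\min_{R} \ \langle\Sigma_X, R\rangle - \log\det(R) - \langle\Lambda^{k}, R - S^{k} + L^{k}\rangle + \frac{1}{2\mu}\|R - S^{k} + L^{k}\|_F^2$$

$$\arg\min_{R} \ -\log\det(R) + \frac{1}{2\mu}\|R - G\|_F^2$$

$G = S^{k} - L^{k} - \mu(\Sigma_X - \Lambda^{k})$

$R - G - \mu R^{-1} = 0$

Let $G = U^{T}\sigma U$ (eigen-decomposition of G)

$R = U^{T}\gamma U$

with

$$\gamma_i = \frac{\sigma_i + \sqrt{\sigma_i^2 + 4\mu}}{2}$$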


S step

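$$\arg\min_{S} \ \alpha\|S\|_1 - \langle\Lambda^{k}, R^{k+1} - S + L^{k}\rangle + \frac{1}{2\mu}\|R^{k+1} - S + L^{k}\|_F^2$$

$$S^{k+1} = \arg\min_{S} \ \mu\alpha\|S\|_1 + \frac{1}{2}\|Z - S\|_F^2$$

$Z = R^{k+1} + L^{k} - \mu\Lambda^{k}$

$\tau = \mu\alpha$

$$S^{k+1}_{ij} = [\mathrm{Shrink}(Z, \tau)]_{ij} := \begin{cases} Z_{ii}, & \text{if } i = j \\ Z_{ij} - \tau, & \text{if } i \neq j \text{ and } Z_{ij} > \tau \\ Z_{ij} + \tau, & \text{if } i \neq j \text{ and } Z_{ij} < -\tau \\ 0, & \text{if } i \neq j \text{ and } -\tau \le Z_{ij} \le \tau. \end{cases}$$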


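L step

$$\arg\min_{L \succeq 0} \ \beta\,\mathrm{Tr}(L) - \langle\Lambda^{k}, R^{k+1} - S^{k+1} + L\rangle + \frac{1}{2\mu}\|R^{k+1} - S^{k+1} + L\|_F^2$$

The above is equivalent to

$$L^{k+1} = \arg\min_{L \succeq 0} \ (\mu\beta)\,\mathrm{Tr}(L) + \frac{1}{2}\|M - L\|_F^2$$

where $M = S^{k+1} - R^{k+1} + \mu\Lambda^{k}$.

Let $M = U^{T}\sigma U$ (eigen-decomposition of M), then

$L^{k+1} = \mathrm{SVT}(M, \mu\beta) = U^{T}\gamma U$

with $\gamma_i = \max(\sigma_i - \mu\beta, 0)$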
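The R, S, L and Λ steps all have closed forms, so the alternating scheme is compact in code. The sketch below simply chains the four updates in the order listed. It is an illustration only: the starting values, the tuning constants, the fixed iteration count and the simulated data are assumptions, and no claim is made about the convergence behaviour of this naive loop. NumPy's `eigh` returns eigenvectors as columns, so $U^{T}\gamma U$ in the slides corresponds to `(U * gamma) @ U.T` here.

```python
import numpy as np

def r_step(G, mu):
    """Solve R - G - mu * R^{-1} = 0 through the eigenvalues of G."""
    sigma, U = np.linalg.eigh((G + G.T) / 2)
    gamma = (sigma + np.sqrt(sigma ** 2 + 4 * mu)) / 2
    return (U * gamma) @ U.T

def shrink_offdiag(Z, tau):
    """Shrink(Z, tau): soft-threshold off-diagonal entries, keep the diagonal."""
    out = np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)
    np.fill_diagonal(out, np.diag(Z))
    return out

def svt_psd(M, tau):
    """SVT(M, tau): shift eigenvalues down by tau and floor them at zero."""
    sigma, U = np.linalg.eigh((M + M.T) / 2)
    return (U * np.maximum(sigma - tau, 0.0)) @ U.T

def lv_glasso_admm(Sigma_X, alpha, beta, mu=1.0, n_iter=200):
    p = Sigma_X.shape[0]
    S, L, Lam = np.eye(p), np.zeros((p, p)), np.zeros((p, p))
    for _ in range(n_iter):
        R = r_step(S - L - mu * (Sigma_X - Lam), mu)        # R step
        S = shrink_offdiag(R + L - mu * Lam, mu * alpha)    # S step
        L = svt_psd(S - R + mu * Lam, mu * beta)            # L step
        Lam = Lam - (R - S + L) / mu                        # Lambda step
    return R, S, L

rng = np.random.default_rng(6)
Sigma_X = np.cov(rng.standard_normal((200, 6)), rowvar=False)
R_hat, S_hat, L_hat = lv_glasso_admm(Sigma_X, alpha=0.1, beta=0.5)
```

Every R iterate is positive definite, since each $\gamma_i > 0$, and every L iterate is positive semidefinite, so the constraints hold throughout the iterations.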


Concluding remark

- Tailoring optimization algorithms to the specific statistics problem
- Efforts to polish the solver


Thank You
