Sparse Inverse Covariance Estimation for a Million Variables

Cho-Jui Hsieh, The University of Texas at Austin

ICML Workshop on Covariance Selection, Beijing, China, June 26, 2014
Joint work with M. Sustik, I. Dhillon, P. Ravikumar, R. Poldrack, A. Banerjee,S. Becker, and P. Olsen
Cho-Jui Hsieh The University of Texas at Austin Sparse Inverse Covariance Estimation
FMRI Brain Analysis
Goal: reveal functional connections between regions of the brain. (Sun et al., 2009; Smith et al., 2011; Varoquaux et al., 2010; Ng et al., 2011)
p = 228,483 voxels.
Figure from (Varoquaux et al., 2010)
Other Applications
Gene regulatory network discovery:
(Schafer & Strimmer, 2005; Andrei & Kendziorski, 2009; Menendez et al., 2010; Yin & Li, 2011)
Financial Data Analysis:
Model dependencies in multivariate time series (Xuan & Murphy, 2007).
Sparse high dimensional models in economics (Fan et al., 2011).
Social Network Analysis / Web data:
Model co-authorship networks (Goldenberg & Moore, 2005).
Model item-item similarity for recommender systems (Agarwal et al., 2011).
Climate Data Analysis (Chen et al., 2010).
Signal Processing (Zhang & Fung, 2013).
Anomaly Detection (Ide et al, 2009).
Time series prediction and classification (Wang et al., 2014).
L1-regularized inverse covariance selection
Goal: estimate the inverse covariance matrix in the high dimensional setting: p (# variables) ≫ n (# samples).
Add ℓ1 regularization – a sparse inverse covariance matrix is preferred.
The ℓ1-regularized maximum likelihood estimator:

Σ̂^{-1} = arg min_{X ≻ 0} { −log det X + tr(SX) + λ‖X‖_1 } = arg min_{X ≻ 0} f(X),

where −log det X + tr(SX) is the negative log likelihood and ‖X‖_1 = Σ_{i,j=1}^p |X_ij|.

High dimensional problems: the number of parameters scales quadratically with the number of variables.
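For concreteness, the objective f(X) can be evaluated directly; a minimal NumPy sketch (the toy matrices S and X below are illustrative, not from the talk):

```python
import numpy as np

def graphical_lasso_objective(X, S, lam):
    """f(X) = -log det(X) + tr(S X) + lam * sum_ij |X_ij|, for X positive definite."""
    sign, logdet = np.linalg.slogdet(X)  # numerically stable log-determinant
    if sign <= 0:
        return np.inf                    # outside the feasible cone X > 0
    return -logdet + np.trace(S @ X) + lam * np.abs(X).sum()

# Toy 3-variable empirical covariance and the identity as a candidate estimate.
S = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]])
X = np.eye(3)
f0 = graphical_lasso_objective(X, S, lam=0.1)  # -0 + 3.0 + 0.1*3 = 3.3
```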
Proximal Newton method
Split smooth and non-smooth terms: f (X ) = g(X ) + h(X ), where
g(X ) = − log detX + tr(SX ) and h(X ) = λ‖X‖1.
Form a quadratic approximation of g(X_t + Δ):

ḡ_{X_t}(Δ) = tr((S − W_t)Δ) + (1/2) vec(Δ)ᵀ (W_t ⊗ W_t) vec(Δ) − log det X_t + tr(S X_t),

where W_t = X_t^{-1} = ∂/∂X log det X |_{X = X_t}.
Define the generalized Newton direction:

D_t = arg min_Δ ḡ_{X_t}(Δ) + λ‖X_t + Δ‖_1.
Algorithm – Proximal Newton method

Proximal Newton method for sparse inverse covariance estimation
Input: empirical covariance matrix S, scalar λ, initial X_0.
For t = 0, 1, 2, …
1. Compute the Newton direction: D_t = arg min_Δ f̄_{X_t}(X_t + Δ) (a Lasso problem).
2. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασΔ_t (positive definiteness checked via Cholesky factorization of X_t + αD_t).
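The line-search step can be sketched as follows. This is a minimal illustration, not the talk's implementation; in particular the decrease measure Δ_t is not defined on the slide, so the sketch assumes the common choice Δ_t = ⟨∇g(X), D⟩ + λ(‖X + D‖_1 − ‖X‖_1):

```python
import numpy as np

def f_obj(X, S, lam):
    """f(X) = -log det(X) + tr(SX) + lam*||X||_1; +inf outside the PD cone."""
    sign, logdet = np.linalg.slogdet(X)
    if sign <= 0:
        return np.inf
    return -logdet + np.trace(S @ X) + lam * np.abs(X).sum()

def armijo_step(X, D, S, lam, sigma=1e-4, beta=0.5, max_halvings=30):
    """Backtracking line search: halve alpha until X + alpha*D is positive
    definite (checked by Cholesky, as on the slide) and the sufficient
    decrease condition f(X + alpha*D) <= f(X) + alpha*sigma*Delta holds.
    Delta = <grad g(X), D> + lam*(||X+D||_1 - ||X||_1) is an assumed choice."""
    G = S - np.linalg.inv(X)                       # grad g(X)
    delta = np.sum(G * D) + lam * (np.abs(X + D).sum() - np.abs(X).sum())
    fX, alpha = f_obj(X, S, lam), 1.0
    for _ in range(max_halvings):
        Xn = X + alpha * D
        try:
            np.linalg.cholesky(Xn)                 # positive definiteness test
            if f_obj(Xn, S, lam) <= fX + alpha * sigma * delta:
                return alpha, Xn
        except np.linalg.LinAlgError:
            pass
        alpha *= beta
    return alpha, X + alpha * D

# With S = I and lam = 0 the optimum is X* = I; from X = 2I the full
# step D = I - X is accepted with alpha = 1.
X, S = 2.0 * np.eye(3), np.eye(3)
alpha, Xn = armijo_step(X, np.eye(3) - X, S, lam=0.0)
```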
(Hsieh et al., 2011; Olsen et al., 2012; Lee et al., 2012; Dinh et al., 2013; Scheinberg et al., 2014)
Scalability
How to apply proximal Newton method to high dimensional problems?
Number of parameters: too large (p²).
Hessian: cannot be explicitly formed (p² × p²).
Limited memory…
BigQUIC (2013):

p = 1,000,000 (1 trillion parameters) in 22.9 hours with 32 GB memory (using a single machine with 32 cores).

Key ingredients:
Active variable selection ⇒ the complexity scales with the intrinsic dimensionality.
Block coordinate descent ⇒ handles problems where # parameters > memory size.
(Big & Quic.)
Other techniques: divide-and-conquer, efficient line search…
Extension to general dirty models.
Active Variable Selection
Motivation
We can directly implement a proximal Newton method.
It takes 28 seconds on the ER dataset.
Running Glasso takes only 25 seconds.
So why do people like to use the proximal Newton method?
Newton Iteration 1
ER dataset, p = 692, Newton iteration 1.
Newton Iteration 3
ER dataset, p = 692, Newton iteration 3.
Newton Iteration 5
ER dataset, p = 692, Newton iteration 5.
Newton Iteration 7
ER dataset, p = 692, Newton iteration 7.
Only 2.7% of the variables have been changed to nonzero.
Active Variable Selection
Our goal: reduce the number of variables from O(p²) to O(‖X*‖_0).

The solution usually has a small number of nonzeros: ‖X*‖_0 ≪ p².

Strategy: before solving for the Newton direction, "guess" the variables that should be updated.

Fixed set: S_fixed = {(i, j) | X_ij = 0 and |∇_ij g(X)| ≤ λ}. The remaining variables constitute the free set.

Solve the smaller subproblem:

D_t = arg min_{Δ : Δ_{S_fixed} = 0} ḡ_{X_t}(Δ) + λ‖X_t + Δ‖_1.
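The fixed/free split can be computed directly from the gradient ∇g(X) = S − X^{-1}; a minimal sketch (the toy matrices are illustrative):

```python
import numpy as np

def free_set_mask(X, S, lam):
    """Boolean mask of FREE variables. (i, j) is FIXED iff X_ij = 0 and
    |grad_ij g(X)| <= lam, where grad g(X) = S - X^{-1}; everything else
    (including all current nonzeros) is free and gets updated."""
    G = S - np.linalg.inv(X)
    fixed = (X == 0) & (np.abs(G) <= lam)
    return ~fixed

# Toy example: starting from X = I, only zero entries with a large enough
# gradient (here |S_ij| > lam off the diagonal) join the free set.
S = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]])
free = free_set_mask(np.eye(3), S, lam=0.25)
```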
Active Variable Selection
Is this just a heuristic? No.
The process is equivalent to a block coordinate descent algorithm with two blocks:
1. Update the fixed set – no change (by the definition of S_fixed).
2. Update the free set – solve the subproblem.
Convergence is guaranteed.
Size of free set
In practice, the size of the free set is small.

Take the hereditarybc dataset as an example:

p = 1,869; number of variables = p² ≈ 3.49 million. The size of the free set drops to 20,000 at the end.
Real datasets
(a) Time for Estrogen, p = 692. (b) Time for hereditarybc, p = 1,869.

Figure: Comparison of algorithms on real datasets. The results show QUIC converges faster than other methods.
Algorithm
QUIC: QUadratic approximation for sparse Inverse Covariance estimation
Input: empirical covariance matrix S, scalar λ, initial X_0.
For t = 0, 1, 2, …
1. Variable selection: select a free set of m ≪ p² variables.
2. Use coordinate descent to find the descent direction: D_t = arg min_Δ f̄_{X_t}(X_t + Δ) over the set of free variables (a Lasso problem).
3. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασΔ_t (positive definiteness checked via Cholesky factorization of X_t + αD_t).
Extensions
Also used in LIBLINEAR (Yuan et al., 2011) for ℓ1-regularized logistic regression.
Can be generalized to active subspace selection for nuclear norm(Hsieh et al., 2014).
Can be generalized to the general decomposable norm.
Efficient (Block) Coordinate Descent with Limited Memory
Coordinate Descent Updates
Use coordinate descent to solve: arg min_D { ḡ_X(D) + λ‖X + D‖_1 }.

Current gradient: ∇ḡ_X(D) = (W ⊗ W) vec(D) + S − W, i.e. WDW + S − W, where W = X^{-1}.

Closed form solution for each coordinate descent update:

D_ij ← −c + S(c − b/a, λ/a),

where S(z, r) = sign(z) max{|z| − r, 0} is the soft-thresholding function, a = W_ij² + W_ii W_jj, b = S_ij − W_ij + w_iᵀ D w_j, and c = X_ij + D_ij.

The main cost is in computing w_iᵀ D w_j. Instead, we maintain U = DW after each coordinate update, and then compute w_iᵀ u_j — only O(p) flops per update.
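A single coordinate update in this form can be sketched as follows. This computes w_iᵀDw_j directly for clarity; a real implementation would maintain U = DW as described above, and would also update the symmetric entry D_ji. The toy data is illustrative:

```python
import numpy as np

def soft_threshold(z, r):
    """S(z, r) = sign(z) * max(|z| - r, 0)."""
    return np.sign(z) * max(abs(z) - r, 0.0)

def cd_update(i, j, D, X, S, W, lam):
    """One coordinate-descent update of D_ij for the Lasso subproblem
    min_D g_X(D) + lam*||X + D||_1, using the closed form from the slide."""
    a = W[i, j] ** 2 + W[i, i] * W[j, j]
    b = S[i, j] - W[i, j] + W[i] @ D @ W[:, j]  # w_i^T D w_j, computed directly
    c = X[i, j] + D[i, j]
    D[i, j] = -c + soft_threshold(c - b / a, lam / a)
    return D

# Toy step: X = I (so W = I), D = 0; updating (0, 1) gives
# a = 1, b = S_01 = 0.3, c = 0, hence D_01 = S(-0.3, 0.1) = -0.2.
S = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]])
X = np.eye(3)
D = cd_update(0, 1, np.zeros((3, 3)), X, S, np.linalg.inv(X), lam=0.1)
```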
Difficulties in Scaling QUIC
Consider the case where p ≈ 1 million and m = ‖X_t‖_0 ≈ 50 million.

Coordinate descent requires X_t and W = X_t^{-1}:
needs O(p²) storage;
needs O(mp) computation per sweep, where m = ‖X_t‖_0.
Coordinate Updates with Memory Cache
Assume we can store M columns of W in memory.

Coordinate descent update (i, j): compute w_iᵀ D w_j.

If w_i, w_j are not in memory, recompute them by CG: solve X w_i = e_i in O(T_CG) time.
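Recomputing a missing column of W = X^{-1} needs only matrix–vector products with X, so X never has to be inverted or densified; a minimal CG sketch (the toy X is illustrative):

```python
import numpy as np

def cg_solve(X, b, tol=1e-10, max_iter=1000):
    """Conjugate gradient for X w = b with symmetric positive definite X;
    each iteration costs one matrix-vector product with X."""
    w = np.zeros_like(b)
    r = b - X @ w
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Xp = X @ p
        alpha = rs / (p @ Xp)
        w = w + alpha * p
        r = r - alpha * Xp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

# Column i of W on demand: solve X w_i = e_i.
X = np.array([[2.0, 0.5],
              [0.5, 1.0]])
w0 = cg_solve(X, np.array([1.0, 0.0]))   # first column of X^{-1}
```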
Coordinate Updates with Memory Cache
w_1, w_2, w_3, w_4 stored in memory.
Coordinate Updates with Memory Cache
Cache hit: no need to recompute w_i, w_j.
Coordinate Updates with Memory Cache
Cache miss: recompute w_i, w_j.
Coordinate Updates – ideal case
We want an update sequence that minimizes the number of cache misses: probably NP-hard.

Our strategy: update variables block by block.

The ideal case: there exists a partition {S_1, …, S_k} such that all free-set entries lie in diagonal blocks. Then only p column evaluations are required.
General case: block diagonal + sparse
If the block partition is not perfect, the extra column computations can be characterized by boundary nodes.

Given a partition {S_1, …, S_k}, we define the boundary nodes of block S_q as

B(S_q) ≡ {j | j ∈ S_q and ∃ i ∈ S_z, z ≠ q, such that F_ij = 1},

where F is the adjacency matrix of the free set.
Graph Clustering Algorithm
The number of columns to be computed in one sweep is

p + Σ_q |B(S_q)|.

This can be upper bounded by

p + Σ_q |B(S_q)| ≤ p + Σ_{z ≠ q} Σ_{i ∈ S_z, j ∈ S_q} F_ij.

Use graph clustering (METIS or Graclus) to find the partition.

Example: on the fMRI dataset (p = 0.228 million) with 20 blocks:
random partition: needs 1.6 million column computations;
graph clustering: needs 0.237 million column computations.
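The column count p + Σ_q |B(S_q)| can be evaluated for any candidate partition; a minimal sketch on a tiny free-set graph (the 4-node graph and both partitions are illustrative):

```python
import numpy as np

def columns_per_sweep(F, partition):
    """Columns of W computed in one block-CD sweep: p plus, for each block
    S_q, its boundary nodes B(S_q) = {j in S_q : some i outside S_q has
    F_ij = 1}, where F is the adjacency matrix of the free set."""
    p = F.shape[0]
    total = p
    for block in partition:
        outside = np.ones(p, dtype=bool)
        outside[list(block)] = False
        total += sum(1 for j in block if F[outside, j].any())
    return total

# 4-node free-set graph: edges (0,1) and (2,3) inside blocks, one cross edge (1,2).
F = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
good = columns_per_sweep(F, [[0, 1], [2, 3]])   # boundary nodes 1 and 2: 4 + 2 = 6
bad = columns_per_sweep(F, [[0, 2], [1, 3]])    # a poor partition costs more
```

A good clustering keeps edges inside blocks, which is exactly why METIS/Graclus help here.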
BigQUIC
Block coordinate descent with clustering:
storage: O(p²) → O(m + p²/k);
computation per sweep: O(mp) → O(mp), where m = ‖X_t‖_0.
Algorithm
BigQUIC
Input: samples Y, scalar λ, initial X_0.
For t = 0, 1, 2, …
1. Variable selection: select a free set of m ≪ p² variables.
2. Construct a partition by clustering.
3. Run block coordinate descent to find the descent direction: D_t = arg min_Δ f̄_{X_t}(X_t + Δ) over the set of free variables.
4. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασΔ_t.
(Positive definiteness checked via the Schur complement with the conjugate gradient method.)
Convergence Analysis
Convergence Guarantees for QUIC
Theorem (Hsieh, Sustik, Dhillon & Ravikumar, NIPS, 2011)
QUIC converges quadratically to the global optimum; that is, for some constant 0 < κ < 1:

lim_{t→∞} ‖X_{t+1} − X*‖_F / ‖X_t − X*‖_F² = κ.
Other convergence proofs for the proximal Newton method:

(Lee et al., 2012) discusses convergence for general loss functions and regularizers.
(Katyas et al., 2014) discusses the global convergence rate.
BigQUIC Convergence Analysis
Recall W = X^{-1}. When each w_i is computed by CG (X w_i = e_i):

The gradient ∇_ij g(X) = S_ij − W_ij on the free set can be computed once and stored in memory.
The Hessian terms (w_iᵀ D w_j in the coordinate updates) need to be repeatedly computed.

To reduce the time overhead, the Hessian should be computed approximately.

Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013)

If ∇g(X) is computed exactly and η I ⪯ H_t ⪯ η̄ I for some constants η̄ ≥ η > 0 at every Newton iteration, then BigQUIC converges to the global optimum.
BigQUIC Convergence Analysis
To obtain a super-linear convergence rate, the Hessian computation must become more and more accurate as X_t approaches X*.

Measure the accuracy of the Hessian computation by the residual matrix R = [r_1, …, r_p], where r_i = X w_i − e_i.

The optimality of X_t can be measured by the minimum-norm subgradient:

∇^S_ij f(X) = ∇_ij g(X) + sign(X_ij) λ, if X_ij ≠ 0;
∇^S_ij f(X) = sign(∇_ij g(X)) max(|∇_ij g(X)| − λ, 0), if X_ij = 0.

Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013)

In BigQUIC, if the residual matrix satisfies ‖R_t‖ = O(‖∇^S f(X_t)‖^ℓ) for some 0 < ℓ ≤ 1 as t → ∞, then ‖X_{t+1} − X*‖ = O(‖X_t − X*‖^{1+ℓ}) as t → ∞.

When ‖R_t‖ = O(‖∇^S f(X_t)‖), BigQUIC has an asymptotic quadratic convergence rate.
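The optimality measure ∇^S f(X) is computed entrywise from ∇g(X); a minimal sketch (the toy inputs are illustrative):

```python
import numpy as np

def min_norm_subgrad(X, G, lam):
    """Entrywise minimum-norm subgradient of f = g + lam*||.||_1, given
    G = grad g(X): G_ij + sign(X_ij)*lam where X_ij != 0, and the
    soft-thresholded gradient sign(G_ij)*max(|G_ij| - lam, 0) where X_ij = 0."""
    return np.where(X != 0,
                    G + np.sign(X) * lam,
                    np.sign(G) * np.maximum(np.abs(G) - lam, 0.0))

# Toy check: zero entries with |G_ij| <= lam contribute nothing,
# so the measure vanishes exactly at a stationary point.
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
G = np.array([[0.2, 0.3],
              [0.3, -0.1]])
grad_s = min_norm_subgrad(X, G, lam=0.25)
```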
Experimental results (scalability)
(a) Scalability on random graph. (b) Scalability on chain graph.

Figure: BigQUIC can solve one million dimensional problems.
Experimental results
BigQUIC is faster even for medium size problems.
Figure: Comparison on fMRI data with a p = 20,000 subset (the maximum dimension that previous methods can handle).
Results on FMRI dataset
228,483 voxels, 518 time points.
λ = 0.6 =⇒ average degree 8, BigQUIC took 5 hours.
λ = 0.5 =⇒ average degree 38, BigQUIC took 21 hours.
Findings:
Voxels with large degree were generally found in the gray matter.
Meaningful brain modules can be detected by modularity clustering.
Quic & Dirty
A Quadratic Approximation Approach for Dirty Statistical Models
Dirty Statistical Models
Superposition-structured models or dirty models (Yang et al., 2013):
min_{ {θ^(r)}_{r=1}^k } L( Σ_r θ^(r) ) + Σ_r λ_r ‖θ^(r)‖_{A_r} := F(θ).

An example: GMRF with latent features (Chandrasekaran et al., 2012):

min_{S, L : L ⪰ 0, S − L ≻ 0} −log det(S − L) + ⟨S − L, Σ⟩ + λ_S ‖S‖_1 + λ_L ‖L‖_*.

When there are two groups of random variables X, Y, write Θ = [Θ_XX, Θ_XY; Θ_YX, Θ_YY]. If we only observe X, and Y is the set of hidden variables, the marginal precision matrix is Θ_XX − Θ_XY Θ_YY^{-1} Θ_YX, and therefore has a low rank + sparse structure.
Error bound: (Meng et al., 2014).
Other applications: robust PCA (low rank + sparse), multitask learning (sparse + group sparse), …
Quadratic Approximation
The Newton direction d:

[d^(1), …, d^(k)] = arg min_{Δ^(1), …, Δ^(k)} ḡ(θ + Δ) + Σ_{r=1}^k λ_r ‖θ^(r) + Δ^(r)‖_{A_r},

where

ḡ(θ + Δ) = g(θ) + Σ_{r=1}^k ⟨Δ^(r), G⟩ + (1/2) Δᵀ H Δ,

and H := ∇²g(θ) is the k × k block matrix with every block equal to H̄ := ∇²L(θ).
Quadratic Approximation
A general active subspace selection for decomposable norms:

ℓ1 norm: variable selection.
Nuclear norm: subspace selection.
Group sparse norm: group selection.

Method for solving the quadratic approximation subproblem: alternating minimization.

Convergence analysis: even though the Hessian is not positive definite, we still obtain quadratic convergence.
Experimental Comparisons
(a) ER dataset (b) Leukemia dataset
Figure: Comparison on gene expression data (GMRF with latent variables).
Conclusions
How to scale the proximal Newton method to high dimensional problems:

Active variable selection ⇒ the complexity scales with the intrinsic dimensionality.
Block coordinate descent ⇒ handles problems where # parameters > memory size.
(Big & Quic.)
Other techniques: divide-and-conquer, efficient line search…

BigQUIC: a memory-efficient quadratic approximation method for sparse inverse covariance estimation (scales to 1 million by 1 million problems).

With careful design, the proximal Newton method can also solve models with superposition-structured regularization (sparse + low rank; sparse + group sparse; low rank + group sparse; …).
References
[1] C.-J. Hsieh, M. Sustik, I. S. Dhillon, P. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables. NIPS (oral presentation), 2013.
[2] C.-J. Hsieh, M. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation. NIPS, 2011.
[3] C.-J. Hsieh, I. S. Dhillon, P. Ravikumar, and A. Banerjee. A Divide-and-Conquer Procedure for Sparse Inverse Covariance Estimation. NIPS, 2012.
[4] P. A. Olsen, F. Oztoprak, J. Nocedal, and S. J. Rennie. Newton-Like Methods for Sparse Inverse Covariance Estimation. Optimization Online, 2012.
[5] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. JMLR, 2008.
[6] J. Friedman, T. Hastie, and R. Tibshirani. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics, 2008.
[7] L. Li and K.-C. Toh. An Inexact Interior Point Method for L1-Regularized Sparse Covariance Selection. Mathematical Programming Computation, 2010.
[8] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse Inverse Covariance Selection via Alternating Linearization Methods. NIPS, 2010.
[9] K. Scheinberg and I. Rish. Learning Sparse Gaussian Markov Networks Using a Greedy Coordinate Ascent Approach. Machine Learning and Knowledge Discovery in Databases, 2010.