Page 1:

Sparse Inverse Covariance Estimation for a Million Variables

Cho-Jui Hsieh, The University of Texas at Austin

ICML Workshop on Covariance Selection, Beijing, China, June 26, 2014

Joint work with M. Sustik, I. Dhillon, P. Ravikumar, R. Poldrack, A. Banerjee, S. Becker, and P. Olsen


Page 2:

FMRI Brain Analysis

Goal: Reveal functional connections between regions of the brain. (Sun et al., 2009; Smith et al., 2011; Varoquaux et al., 2010; Ng et al., 2011)

p = 228,483 voxels.

Figure from (Varoquaux et al., 2010).

Page 3:

Other Applications

Gene regulatory network discovery: (Schafer & Strimmer, 2005; Andrei & Kendziorski, 2009; Menendez et al., 2010; Yin & Li, 2011).

Financial Data Analysis: model dependencies in multivariate time series (Xuan & Murphy, 2007); sparse high-dimensional models in economics (Fan et al., 2011).

Social Network Analysis / Web data: model co-authorship networks (Goldenberg & Moore, 2005); model item-item similarity for recommender systems (Agarwal et al., 2011).

Climate Data Analysis (Chen et al., 2010).

Signal Processing (Zhang & Fung, 2013).

Anomaly Detection (Ide et al., 2009).

Time series prediction and classification (Wang et al., 2014).


Page 4:

L1-regularized inverse covariance selection

Goal: Estimate the inverse covariance matrix in the high-dimensional setting: p (# variables) ≫ n (# samples).

Add ℓ1 regularization: a sparse inverse covariance matrix is preferred.

The ℓ1-regularized maximum likelihood estimator:

$$\Sigma^{-1} = \arg\min_{X \succ 0} \Big\{ \underbrace{-\log\det X + \mathrm{tr}(SX)}_{\text{negative log likelihood}} + \lambda \|X\|_1 \Big\} = \arg\min_{X \succ 0} f(X),$$

where $\|X\|_1 = \sum_{i,j=1}^{p} |X_{ij}|$.

High-dimensional problem: the number of parameters scales quadratically with the number of variables.
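For concreteness, here is a minimal numpy sketch of this objective. The helper name objective and the dense computation are my own illustration, not the QUIC/BigQUIC implementation.

```python
import numpy as np

def objective(X, S, lam):
    """Sketch of f(X) = -log det X + tr(SX) + lam * ||X||_1 (illustration only)."""
    sign, logdet = np.linalg.slogdet(X)
    if sign <= 0:                      # outside the positive definite cone
        return np.inf
    return -logdet + np.trace(S @ X) + lam * np.abs(X).sum()
```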


Page 5:

Proximal Newton method

Split smooth and non-smooth terms: f(X) = g(X) + h(X), where

g(X) = −log det X + tr(SX) and h(X) = λ‖X‖₁.

Form the quadratic approximation of g(X_t + Δ):

$$g_{X_t}(\Delta) = \mathrm{tr}\big((S - W_t)\Delta\big) + \tfrac{1}{2}\,\mathrm{vec}(\Delta)^T (W_t \otimes W_t)\,\mathrm{vec}(\Delta) - \log\det X_t + \mathrm{tr}(S X_t),$$

where $W_t = X_t^{-1} = \frac{\partial}{\partial X} \log\det X \,\big|_{X = X_t}$.

Define the generalized Newton direction:

$$D_t = \arg\min_{\Delta}\; g_{X_t}(\Delta) + \lambda \|X_t + \Delta\|_1.$$
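A small sketch of evaluating this quadratic model with dense numpy, using the identity vec(Δ)ᵀ(W⊗W)vec(Δ) = tr(WΔWΔ) for symmetric Δ and W. The function name is hypothetical.

```python
import numpy as np

def quad_model(Delta, X_t, S):
    """Sketch of the second-order model g_{X_t}(Delta) defined above."""
    W = np.linalg.inv(X_t)                          # W_t = X_t^{-1}
    sign, logdet = np.linalg.slogdet(X_t)
    return (np.trace((S - W) @ Delta)
            + 0.5 * np.trace(W @ Delta @ W @ Delta)  # vec(Δ)^T (W⊗W) vec(Δ)
            - logdet + np.trace(S @ X_t))
```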


Page 6:

Algorithm – Proximal Newton method

Proximal Newton method for sparse Inverse Covariance estimation

Input: Empirical covariance matrix S, scalar λ, initial X₀.

For t = 0, 1, 2, ...

1. Compute the Newton direction: D_t = arg min_Δ f_{X_t}(X_t + Δ) (a Lasso problem).
2. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασΔ_t. (Positive definiteness is checked via a Cholesky factorization of X_t + αD_t.)

(Hsieh et al., 2011; Olsen et al., 2012; Lee et al., 2012; Dinh et al., 2013; Scheinberg et al., 2014)
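A sketch of the line-search step (step 2): halve α until the Cholesky factorization succeeds and the Armijo condition holds. The exact form of the predicted decrease Δ_t and the parameter values are assumptions for illustration, not the authors' code.

```python
import numpy as np

def line_search(X_t, D_t, S, lam, sigma=1e-4, beta=0.5, max_tries=30):
    """Armijo-style backtracking with a Cholesky positive-definiteness check (sketch)."""
    def f(X):
        sign, logdet = np.linalg.slogdet(X)
        if sign <= 0:
            return np.inf
        return -logdet + np.trace(S @ X) + lam * np.abs(X).sum()

    W = np.linalg.inv(X_t)
    # Assumed predicted decrease: linearized smooth part plus the change of the l1 term.
    Delta_t = (np.trace((S - W) @ D_t)
               + lam * (np.abs(X_t + D_t).sum() - np.abs(X_t).sum()))
    f_old, alpha = f(X_t), 1.0
    for _ in range(max_tries):
        X_new = X_t + alpha * D_t
        try:
            np.linalg.cholesky(X_new)               # positive definiteness check
        except np.linalg.LinAlgError:
            alpha *= beta
            continue
        if f(X_new) <= f_old + sigma * alpha * Delta_t:
            return alpha, X_new
        alpha *= beta
    return alpha, X_t + alpha * D_t
```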


Page 7:

Scalability

How do we apply the proximal Newton method to high-dimensional problems?

Number of parameters: too large (p²).
Hessian: cannot be explicitly formed (p² × p²).
Limited memory...

BigQUIC (2013):

p = 1,000,000 (1 trillion parameters) in 22.9 hours with 32 GB of memory (using a single machine with 32 cores).

Key ingredients:

Active variable selection ⇒ the complexity scales with the intrinsic dimensionality.
Block coordinate descent ⇒ handles problems where # parameters > memory size.
(Big & Quic.)
Other techniques: divide-and-conquer, efficient line search...
Extension to general dirty models.


Page 8:

Active Variable Selection


Page 9:

Motivation

We can directly implement a proximal Newton method.

It takes 28 seconds on the ER dataset.

Running Glasso takes only 25 seconds.

So why do people like to use the proximal Newton method?


Page 10:

Newton Iteration 1

ER dataset, p = 692, Newton iteration 1.


Page 11:

Newton Iteration 3

ER dataset, p = 692, Newton iteration 3.


Page 12:

Newton Iteration 5

ER dataset, p = 692, Newton iteration 5.


Page 13:

Newton Iteration 7

ER dataset, p = 692, Newton iteration 7.

Only 2.7% of the variables have been changed to nonzero.


Page 14:

Active Variable Selection

Our goal: Reduce the number of variables from O(p²) to O(‖X*‖₀).

The solution usually has a small number of nonzeros: ‖X*‖₀ ≪ p².

Strategy: before solving for the Newton direction, "guess" the variables that should be updated.

Fixed set: $S_{\mathrm{fixed}} = \{(i,j) \mid X_{ij} = 0 \text{ and } |\nabla_{ij} g(X)| \le \lambda\}$. The remaining variables constitute the free set.

Solve the smaller subproblem:

$$D_t = \arg\min_{\Delta:\, \Delta_{S_{\mathrm{fixed}}} = 0}\; g_{X_t}(\Delta) + \lambda \|X_t + \Delta\|_1.$$
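A dense numpy sketch of this selection rule; free_set_mask is a hypothetical helper name used only to illustrate the test above.

```python
import numpy as np

def free_set_mask(X, S, lam):
    """Entry (i, j) is fixed when X_ij = 0 and |grad_ij g(X)| <= lambda;
    all remaining entries are free (illustration only)."""
    grad = S - np.linalg.inv(X)            # grad g(X) = S - X^{-1}
    fixed = (X == 0) & (np.abs(grad) <= lam)
    return ~fixed                          # True where the variable is free
```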


Page 15:

Active Variable Selection

Is this just a heuristic? No.

The process is equivalent to a block coordinate descent algorithm with two blocks:

1. Update the fixed set: no change (by the definition of S_fixed).
2. Update the free set: solve the subproblem.

Convergence is guaranteed.


Page 16:

Size of free set

In practice, the size of the free set is small.

Take the hereditarybc dataset as an example:

p = 1,869, so the number of variables is p² ≈ 3.49 million. The size of the free set drops to about 20,000 at the end.


Page 17:

Real datasets

(a) Time for Estrogen, p = 692. (b) Time for hereditarybc, p = 1,869.

Figure: Comparison of algorithms on real datasets. The results show that QUIC converges faster than the other methods.


Page 18:

Algorithm

QUIC: QUadratic approximation for sparse Inverse Covariance estimation

Input: Empirical covariance matrix S, scalar λ, initial X₀.

For t = 0, 1, 2, ...

1. Variable selection: select a free set of m ≪ p² variables.
2. Use coordinate descent to find the descent direction: D_t = arg min_Δ f_{X_t}(X_t + Δ) over the set of free variables (a Lasso problem).
3. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασΔ_t. (Positive definiteness is checked via a Cholesky factorization of X_t + αD_t.)


Page 19:

Extensions

Also used in LIBLINEAR (Yuan et al., 2011) for ℓ1-regularized logistic regression.

Can be generalized to active subspace selection for nuclear norm(Hsieh et al., 2014).

Can be generalized to the general decomposable norm.


Page 20:

Efficient (block) Coordinate Descent with Limited Memory


Page 21:

Coordinate Descent Updates

Use coordinate descent to solve: $\arg\min_{D}\, \{ g_X(D) + \lambda \|X + D\|_1 \}$.

Current gradient: $\nabla g_X(D) = (W \otimes W)\,\mathrm{vec}(D) + S - W = WDW + S - W$, where $W = X^{-1}$.

Closed-form solution for each coordinate descent update:

$$D_{ij} \leftarrow -c + \mathcal{S}(c - b/a,\; \lambda/a),$$

where $\mathcal{S}(z, r) = \mathrm{sign}(z)\max\{|z| - r,\, 0\}$ is the soft-thresholding function, $a = W_{ij}^2 + W_{ii} W_{jj}$, $b = S_{ij} - W_{ij} + w_i^T D w_j$, and $c = X_{ij} + D_{ij}$.

The main cost is in computing $w_i^T D w_j$.

Instead, we maintain $U = DW$ after each coordinate update and compute $w_i^T u_j$: only O(p) flops per update.
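A sketch of a single coordinate update using the cached U = DW. Following the QUIC paper by the same authors, the soft-thresholded quantity is read here as the increment μ added to D_ij; the symmetric update of (j, i) and the function names are simplifications for illustration.

```python
import numpy as np

def soft_threshold(z, r):
    return np.sign(z) * max(abs(z) - r, 0.0)

def coord_update(i, j, D, U, X, W, S, lam):
    """One coordinate-descent step on entry (i, j) of the Newton direction (sketch)."""
    a = W[i, j] ** 2 + W[i, i] * W[j, j]
    b = S[i, j] - W[i, j] + W[:, i] @ U[:, j]       # w_i^T D w_j via the cache, O(p)
    c = X[i, j] + D[i, j]
    mu = -c + soft_threshold(c - b / a, lam / a)
    if mu != 0.0:
        D[i, j] += mu
        U[i, :] += mu * W[j, :]                     # keep U = D @ W consistent
    return D, U
```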


Page 22:

Difficulties in Scaling QUIC

Consider the case where p ≈ 1 million and m = ‖X_t‖₀ ≈ 50 million.

Coordinate descent requires X_t and W = X_t^{-1}:

needs O(p²) storage;
needs O(mp) computation per sweep, where m = ‖X_t‖₀.


Page 23:

Coordinate Updates with Memory Cache

Assume we can store M columns of W in memory.

Coordinate descent update (i, j): compute $w_i^T D w_j$.

If $w_i, w_j$ are not in memory, recompute them by conjugate gradient (CG): solve $X w_i = e_i$, which takes $O(T_{CG})$ time.
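A small sketch of this recomputation on a cache miss, using scipy's conjugate gradient solver; column_of_W is a hypothetical helper name.

```python
import numpy as np
from scipy.sparse.linalg import cg

def column_of_W(X, i):
    """Recompute w_i = X^{-1} e_i by conjugate gradient (sketch).
    X may be a dense array or a scipy sparse matrix."""
    p = X.shape[0]
    e_i = np.zeros(p)
    e_i[i] = 1.0
    w_i, info = cg(X, e_i)        # info == 0 on successful convergence
    return w_i
```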


Page 24:

Coordinate Updates with Memory Cache

w₁, w₂, w₃, w₄ are stored in memory.


Page 25:

Coordinate Updates with Memory Cache

Cache hit: no need to recompute w_i, w_j.


Page 26:

Coordinate Updates with Memory Cache

Cache miss: recompute w_i, w_j.


Page 27:

Coordinate Updates – ideal case

We want to find an update sequence that minimizes the number of cache misses: probably NP-hard.

Our strategy: update variables block by block.

The ideal case: there exists a partition {S₁, ..., S_k} such that all free sets lie in the diagonal blocks. Then only p column evaluations are required.


Page 28:

General case: block diagonal + sparse

If the block partition is not perfect:

the extra column computations can be characterized by boundary nodes.

Given a partition {S₁, ..., S_k}, we define the boundary nodes of S_q as

$$B(S_q) \equiv \{\, j \mid j \in S_q \text{ and } \exists\, i \in S_z,\ z \ne q,\ \text{s.t. } F_{ij} = 1 \,\},$$

where F is the adjacency matrix of the free set.


Page 29:

Graph Clustering Algorithm

The number of columns to be computed in one sweep is

$$p + \sum_{q} |B(S_q)|,$$

which can be upper bounded by

$$p + \sum_{q} |B(S_q)| \;\le\; p + \sum_{z \ne q} \sum_{i \in S_z,\, j \in S_q} F_{ij}.$$

Use Graph Clustering (METIS or Graclus) to find the partition.

Example: on the fMRI dataset (p = 0.228 million) with 20 blocks,

a random partition needs 1.6 million column computations;

graph clustering needs 0.237 million column computations.
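A sketch of counting boundary nodes for a given partition, which gives the column count above; boundary_counts is a hypothetical helper name, and F is assumed to be a dense 0/1 adjacency matrix of the free set.

```python
import numpy as np

def boundary_counts(F, labels):
    """Node j in cluster S_q is a boundary node if it has a free-set edge
    leaving S_q (sketch); labels[j] is the cluster index of node j."""
    p = F.shape[0]
    counts = {}
    for j in range(p):
        q = labels[j]
        neighbors = np.nonzero(F[j])[0]
        if any(labels[i] != q for i in neighbors):
            counts[q] = counts.get(q, 0) + 1
    return counts     # columns per sweep is roughly p + sum(counts.values())
```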


Page 30:

BigQUIC

Block coordinate descent with clustering:

storage: O(p²) → O(m + p²/k);
computation per sweep: O(mp) → O(mp), where m = ‖X_t‖₀.


Page 31:

Algorithm

BigQUIC

Input: Samples Y, scalar λ, initial X₀.

For t = 0, 1, 2, ...

1. Variable selection: select a free set of m ≪ p² variables.
2. Construct a partition by clustering.
3. Run block coordinate descent to find the descent direction: D_t = arg min_Δ f_{X_t}(X_t + Δ) over the set of free variables.
4. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασΔ_t. (Positive definiteness is checked via the Schur complement with the conjugate gradient method.)


Page 32:

Convergence Analysis


Page 33:

Convergence Guarantees for QUIC

Theorem (Hsieh, Sustik, Dhillon & Ravikumar, NIPS 2011)

QUIC converges quadratically to the global optimum, that is, for some constant 0 < κ < 1:

$$\lim_{t \to \infty} \frac{\|X_{t+1} - X^*\|_F}{\|X_t - X^*\|_F^2} = \kappa.$$

Other convergence proofs for the proximal Newton method:

(Lee et al., 2012) discuss convergence for general loss functions and regularizers.

(Katyas et al., 2014) discuss the global convergence rate.


Page 34:

BigQUIC Convergence Analysis

Recall W = X^{-1}.

When each w_i is computed by CG (solving X w_i = e_i):

The gradient ∇_{ij} g(X) = S_{ij} − W_{ij} on the free set can be computed once and stored in memory.
The Hessian terms (w_i^T D w_j in the coordinate updates) need to be computed repeatedly.

To reduce the time overhead, the Hessian should be computed approximately.

Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013)

If ∇g(X) is computed exactly and $\underline{\eta} I \preceq H_t \preceq \bar{\eta} I$ for some constants $\bar{\eta}, \underline{\eta} > 0$ at every Newton iteration, then BigQUIC converges to the global optimum.



Page 36:

BigQUIC Convergence Analysis

To have a super-linear convergence rate, we require the Hessian computation to be more and more accurate as X_t approaches X*.

Measure the accuracy of the Hessian computation by R = [r₁, ..., r_p], where r_i = X w_i − e_i.

The optimality of X_t can be measured by

$$\nabla^S_{ij} f(X) = \begin{cases} \nabla_{ij} g(X) + \mathrm{sign}(X_{ij})\,\lambda & \text{if } X_{ij} \ne 0, \\ \mathrm{sign}(\nabla_{ij} g(X))\,\max(|\nabla_{ij} g(X)| - \lambda,\, 0) & \text{if } X_{ij} = 0. \end{cases}$$

Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013)

In BigQUIC, if the residual matrix satisfies ‖R_t‖ = O(‖∇^S f(X_t)‖^ℓ) for some 0 < ℓ ≤ 1 as t → ∞, then ‖X_{t+1} − X*‖ = O(‖X_t − X*‖^{1+ℓ}) as t → ∞.

When ‖R_t‖ = O(‖∇^S f(X_t)‖), BigQUIC has an asymptotic quadratic convergence rate.
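A small numpy sketch of the minimum-norm subgradient defined above, computed entrywise from the smooth gradient; the helper name is hypothetical.

```python
import numpy as np

def min_norm_subgrad(X, grad_g, lam):
    """Optimality measure grad^S f(X) (sketch); grad_g = S - X^{-1}."""
    return np.where(X != 0,
                    grad_g + np.sign(X) * lam,
                    np.sign(grad_g) * np.maximum(np.abs(grad_g) - lam, 0.0))
```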



Page 38:

Experimental results (scalability)

(a) Scalability on random graph (b) Scalability on chain graph

Figure: BigQUIC can solve one-million-dimensional problems.


Page 39:

Experimental results

BigQUIC is faster even for medium-sized problems.

Figure: Comparison on fMRI data with a p = 20,000 subset (the maximum dimension that previous methods can handle).


Page 40:

Results on FMRI dataset

228,483 voxels, 518 time points.

λ = 0.6 =⇒ average degree 8, BigQUIC took 5 hours.

λ = 0.5 =⇒ average degree 38, BigQUIC took 21 hours.

Findings:

Voxels with large degree were generally found in the gray matter.
Meaningful brain modules can be detected by modularity clustering.


Page 41:

Quic & Dirty

A Quadratic Approximation Approach for Dirty Statistical Models


Page 42:

Dirty Statistical Models

Superposition-structured models or dirty models (Yang et al., 2013):

$$\min_{\{\theta^{(r)}\}_{r=1}^{k}} \; \mathcal{L}\Big(\sum_{r} \theta^{(r)}\Big) + \sum_{r} \lambda_r \|\theta^{(r)}\|_{\mathcal{A}_r} =: F(\theta).$$

An example: GMRF with latent features (Chandrasekaran et al., 2012):

$$\min_{S, L:\; L \succeq 0,\; S - L \succ 0} \; -\log\det(S - L) + \langle S - L, \Sigma\rangle + \lambda_S \|S\|_1 + \lambda_L \|L\|_*.$$

When there are two groups of random variables X and Y, write Θ = [Θ_XX, Θ_XY; Θ_YX, Θ_YY]. If we only observe X and Y is the set of hidden variables, the marginal precision matrix is Θ_XX − Θ_XY Θ_YY^{-1} Θ_YX, and therefore has a low-rank + sparse structure.

Error bound: (Meng et al., 2014).

Other applications: Robust PCA (low rank + sparse), multitask learning (sparse + group sparse), ...
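A dense numpy sketch of the latent-variable GMRF objective above; the helper name and the eigenvalue-based nuclear norm are my choices for illustration.

```python
import numpy as np

def latent_gmrf_objective(S, L, Sigma, lam_S, lam_L):
    """Objective for the sparse-minus-low-rank GMRF model (sketch)."""
    M = S - L
    sign, logdet = np.linalg.slogdet(M)
    if sign <= 0:                                  # require S - L to be positive definite
        return np.inf
    nuclear = np.abs(np.linalg.eigvalsh(L)).sum()  # ||L||_* for symmetric L
    return (-logdet + np.trace(M @ Sigma)
            + lam_S * np.abs(S).sum() + lam_L * nuclear)
```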


Page 43:

Quadratic Approximation

The Newton direction d:

$$[d^{(1)}, \ldots, d^{(k)}] = \arg\min_{\Delta^{(1)}, \ldots, \Delta^{(k)}} \; g(\theta + \Delta) + \sum_{r=1}^{k} \lambda_r \|\theta^{(r)} + \Delta^{(r)}\|_{\mathcal{A}_r},$$

where

$$g(\theta + \Delta) = g(\theta) + \sum_{r=1}^{k} \langle \Delta^{(r)}, G\rangle + \frac{1}{2}\Delta^T \mathcal{H} \Delta,$$

and

$$\mathcal{H} := \nabla^2 g(\theta) = \begin{bmatrix} H & \cdots & H \\ \vdots & \ddots & \vdots \\ H & \cdots & H \end{bmatrix}, \qquad H := \nabla^2 \mathcal{L}(\theta).$$
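Because g(θ) = L(Σ_r θ^(r)), every block of the full Hessian is the same matrix H, so applying it only requires one Hessian-vector product of L. A sketch, with H_apply a hypothetical function computing v ↦ Hv:

```python
def block_hessian_apply(H_apply, deltas):
    """Apply the replicated Hessian above to blocks [Delta^(1), ..., Delta^(k)] (sketch):
    the result is H(sum_r Delta^(r)) repeated in every block row."""
    total = sum(deltas)              # sum_r Delta^(r)
    Hd = H_apply(total)              # H times the summed direction
    return [Hd for _ in deltas]      # identical block for every r
```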


Page 44:

Quadratic Approximation

A general active subspace selection for decomposable norms:

ℓ1 norm: variable selection;
nuclear norm: subspace selection;
group sparse norm: group selection.

Method for solving the quadratic approximation subproblem: alternating minimization.

Convergence analysis: the Hessian is not positive definite, yet we still obtain quadratic convergence.


Page 45:

Experimental Comparisons

(a) ER dataset (b) Leukemia dataset

Figure: Comparison on gene expression data (GMRF with latent variables).


Page 46:

Conclusions

How to scale the proximal Newton method to high-dimensional problems:

Active variable selection ⇒ the complexity scales with the intrinsic dimensionality.

Block coordinate descent ⇒ handles problems where # parameters > memory size.

(Big & Quic.)
Other techniques: divide-and-conquer, efficient line search...

BigQUIC: a memory-efficient quadratic approximation method for sparse inverse covariance estimation (scales to 1 million by 1 million problems).

With a careful design, we can apply the proximal Newton method to models with superposition-structured regularization (sparse + low rank; sparse + group sparse; low rank + group sparse; ...).


Page 47:

References

[1] C. J. Hsieh, M. Sustik, I. S. Dhillon, P. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. NIPS (oral presentation), 2013.
[2] C. J. Hsieh, M. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. NIPS, 2011.
[3] C. J. Hsieh, I. S. Dhillon, P. Ravikumar, and A. Banerjee. A divide-and-conquer procedure for sparse inverse covariance estimation. NIPS, 2012.
[4] P. A. Olsen, F. Oztoprak, J. Nocedal, and S. J. Rennie. Newton-like methods for sparse inverse covariance estimation. Optimization Online, 2012.
[5] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. JMLR, 2008.
[6] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2008.
[7] L. Li and K.-C. Toh. An inexact interior point method for L1-regularized sparse covariance selection. Mathematical Programming Computation, 2010.
[8] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. NIPS, 2010.
[9] K. Scheinberg and I. Rish. Learning sparse Gaussian Markov networks using a greedy coordinate ascent approach. Machine Learning and Knowledge Discovery in Databases, 2010.
