Sparse Inverse Covariance Estimation for a Million Variables

Cho-Jui Hsieh, The University of Texas at Austin

ICML Workshop on Covariance Selection, Beijing, China, June 26, 2014
Joint work with M. Sustik, I. Dhillon, P. Ravikumar, R. Poldrack, A. Banerjee,S. Becker, and P. Olsen
Cho-Jui Hsieh The University of Texas at Austin Sparse Inverse Covariance Estimation
FMRI Brain Analysis
Goal: reveal functional connections between regions of the brain. (Sun et al., 2009; Smith et al., 2011; Varoquaux et al., 2010; Ng et al., 2011)
p = 228,483 voxels.
Figure from (Varoquaux et al., 2010)
Other Applications
Gene regulatory network discovery:
(Schafer & Strimmer, 2005; Andrei & Kendziorski, 2009; Menendez et al., 2010; Yin & Li, 2011)
Financial Data Analysis:
Model dependencies in multivariate time series (Xuan & Murphy, 2007).
Sparse high dimensional models in economics (Fan et al., 2011).
Social Network Analysis / Web data:
Model co-authorship networks (Goldenberg & Moore, 2005).
Model item-item similarity for recommender systems (Agarwal et al., 2011).
Climate Data Analysis (Chen et al., 2010).
Signal Processing (Zhang & Fung, 2013).
Anomaly Detection (Ide et al, 2009).
Time series prediction and classification (Wang et al., 2014).
L1-regularized inverse covariance selection
Goal: estimate the inverse covariance matrix in the high dimensional setting: p (# variables) ≫ n (# samples).
Add ℓ1 regularization – a sparse inverse covariance matrix is preferred.
The ℓ1-regularized maximum likelihood estimator:

Σ̂^{-1} = arg min_{X ≻ 0} { −log det X + tr(SX) + λ‖X‖_1 } = arg min_{X ≻ 0} f(X),

where −log det X + tr(SX) is the negative log likelihood and ‖X‖_1 = Σ_{i,j=1}^p |X_ij|.

High dimensional problems: the number of parameters scales quadratically with the number of variables.
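For concreteness, the objective f(X) can be evaluated directly; a minimal NumPy sketch (the toy matrices S and X below are illustrative, not from the talk):

```python
import numpy as np

def graphical_lasso_objective(X, S, lam):
    """f(X) = -log det(X) + tr(S X) + lam * sum_ij |X_ij|, for X positive definite."""
    sign, logdet = np.linalg.slogdet(X)  # numerically stable log-determinant
    if sign <= 0:
        return np.inf                    # outside the feasible cone X > 0
    return -logdet + np.trace(S @ X) + lam * np.abs(X).sum()

# Toy 3-variable empirical covariance and the identity as a candidate estimate.
S = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]])
X = np.eye(3)
f0 = graphical_lasso_objective(X, S, lam=0.1)  # -0 + 3.0 + 0.1*3 = 3.3
```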
Proximal Newton method
Split smooth and non-smooth terms: f (X ) = g(X ) + h(X ), where
g(X ) = − log detX + tr(SX ) and h(X ) = λ‖X‖1.
Form a quadratic approximation of g(X_t + Δ):

ḡ_{X_t}(Δ) = tr((S − W_t)Δ) + (1/2) vec(Δ)ᵀ (W_t ⊗ W_t) vec(Δ) − log det X_t + tr(S X_t),

where W_t = X_t^{-1} = ∂/∂X log det X |_{X = X_t}.
Define the generalized Newton direction:

D_t = arg min_Δ ḡ_{X_t}(Δ) + λ‖X_t + Δ‖_1.
Algorithm – Proximal Newton method

Proximal Newton method for sparse inverse covariance estimation
Input: empirical covariance matrix S, scalar λ, initial X_0.
For t = 0, 1, 2, …
1. Compute the Newton direction: D_t = arg min_Δ f̄_{X_t}(X_t + Δ) (a Lasso problem).
2. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασΔ_t (positive definiteness checked via Cholesky factorization of X_t + αD_t).
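The line-search step can be sketched as follows. This is a minimal illustration, not the talk's implementation; in particular the decrease measure Δ_t is not defined on the slide, so the sketch assumes the common choice Δ_t = ⟨∇g(X), D⟩ + λ(‖X + D‖_1 − ‖X‖_1):

```python
import numpy as np

def f_obj(X, S, lam):
    """f(X) = -log det(X) + tr(SX) + lam*||X||_1; +inf outside the PD cone."""
    sign, logdet = np.linalg.slogdet(X)
    if sign <= 0:
        return np.inf
    return -logdet + np.trace(S @ X) + lam * np.abs(X).sum()

def armijo_step(X, D, S, lam, sigma=1e-4, beta=0.5, max_halvings=30):
    """Backtracking line search: halve alpha until X + alpha*D is positive
    definite (checked by Cholesky, as on the slide) and the sufficient
    decrease condition f(X + alpha*D) <= f(X) + alpha*sigma*Delta holds.
    Delta = <grad g(X), D> + lam*(||X+D||_1 - ||X||_1) is an assumed choice."""
    G = S - np.linalg.inv(X)                       # grad g(X)
    delta = np.sum(G * D) + lam * (np.abs(X + D).sum() - np.abs(X).sum())
    fX, alpha = f_obj(X, S, lam), 1.0
    for _ in range(max_halvings):
        Xn = X + alpha * D
        try:
            np.linalg.cholesky(Xn)                 # positive definiteness test
            if f_obj(Xn, S, lam) <= fX + alpha * sigma * delta:
                return alpha, Xn
        except np.linalg.LinAlgError:
            pass
        alpha *= beta
    return alpha, X + alpha * D

# With S = I and lam = 0 the optimum is X* = I; from X = 2I the full
# step D = I - X is accepted with alpha = 1.
X, S = 2.0 * np.eye(3), np.eye(3)
alpha, Xn = armijo_step(X, np.eye(3) - X, S, lam=0.0)
```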
(Hsieh et al., 2011; Olsen et al., 2012; Lee et al., 2012; Dinh et al., 2013; Scheinberg et al., 2014)
Scalability
How to apply proximal Newton method to high dimensional problems?
Number of parameters: too large (p²).
Hessian: cannot be explicitly formed (p² × p²).
Limited memory…
BigQUIC (2013):

p = 1,000,000 (1 trillion parameters) in 22.9 hours with 32 GB memory (using a single machine with 32 cores).

Key ingredients:
Active variable selection ⇒ the complexity scales with the intrinsic dimensionality.
Block coordinate descent ⇒ handles problems where # parameters > memory size.
(Big & Quic.)
Other techniques: divide-and-conquer, efficient line search…
Extension to general dirty models.
Active Variable Selection
Motivation
We can directly implement a proximal Newton method.
It takes 28 seconds on the ER dataset.
Running Glasso takes only 25 seconds.
So why do people like to use the proximal Newton method?
Newton Iteration 1
ER dataset, p = 692, Newton iteration 1.
Newton Iteration 3
ER dataset, p = 692, Newton iteration 3.
Newton Iteration 5
ER dataset, p = 692, Newton iteration 5.
Newton Iteration 7
ER dataset, p = 692, Newton iteration 7.
Only 2.7% of the variables have been changed to nonzero.
Active Variable Selection
Our goal: reduce the number of variables from O(p²) to O(‖X*‖_0).

The solution usually has a small number of nonzeros: ‖X*‖_0 ≪ p².

Strategy: before solving for the Newton direction, "guess" the variables that should be updated.

Fixed set: S_fixed = {(i, j) | X_ij = 0 and |∇_ij g(X)| ≤ λ}. The remaining variables constitute the free set.

Solve the smaller subproblem:

D_t = arg min_{Δ : Δ_{S_fixed} = 0} ḡ_{X_t}(Δ) + λ‖X_t + Δ‖_1.
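The fixed/free split can be computed directly from the gradient ∇g(X) = S − X^{-1}; a minimal sketch (the toy matrices are illustrative):

```python
import numpy as np

def free_set_mask(X, S, lam):
    """Boolean mask of FREE variables. (i, j) is FIXED iff X_ij = 0 and
    |grad_ij g(X)| <= lam, where grad g(X) = S - X^{-1}; everything else
    (including all current nonzeros) is free and gets updated."""
    G = S - np.linalg.inv(X)
    fixed = (X == 0) & (np.abs(G) <= lam)
    return ~fixed

# Toy example: starting from X = I, only zero entries with a large enough
# gradient (here |S_ij| > lam off the diagonal) join the free set.
S = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]])
free = free_set_mask(np.eye(3), S, lam=0.25)
```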
Active Variable Selection
Is this just a heuristic? No.
The process is equivalent to a block coordinate descent algorithm with two blocks:
1. Update the fixed set – no change (by the definition of S_fixed).
2. Update the free set – solve the subproblem.
Convergence is guaranteed.
Size of free set
In practice, the size of the free set is small.

Take the hereditarybc dataset as an example:

p = 1,869; number of variables = p² ≈ 3.49 million. The size of the free set drops to 20,000 at the end.
Real datasets
(a) Time for Estrogen, p = 692. (b) Time for hereditarybc, p = 1,869.

Figure: Comparison of algorithms on real datasets. The results show QUIC converges faster than other methods.
Algorithm
QUIC: QUadratic approximation for sparse Inverse Covariance estimation
Input: empirical covariance matrix S, scalar λ, initial X_0.
For t = 0, 1, 2, …
1. Variable selection: select a free set of m ≪ p² variables.
2. Use coordinate descent to find the descent direction: D_t = arg min_Δ f̄_{X_t}(X_t + Δ) over the set of free variables (a Lasso problem).
3. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασΔ_t (positive definiteness checked via Cholesky factorization of X_t + αD_t).
Extensions
Also used in LIBLINEAR (Yuan et al., 2011) for ℓ1-regularized logistic regression.
Can be generalized to active subspace selection for nuclear norm(Hsieh et al., 2014).
Can be generalized to the general decomposable norm.
Efficient (Block) Coordinate Descent with Limited Memory
Coordinate Descent Updates
Use coordinate descent to solve: arg min_D { ḡ_X(D) + λ‖X + D‖_1 }.

Current gradient: ∇ḡ_X(D) = (W ⊗ W) vec(D) + S − W, i.e. WDW + S − W, where W = X^{-1}.

Closed form solution for each coordinate descent update:

D_ij ← −c + S(c − b/a, λ/a),

where S(z, r) = sign(z) max{|z| − r, 0} is the soft-thresholding function, a = W_ij² + W_ii W_jj, b = S_ij − W_ij + w_iᵀ D w_j, and c = X_ij + D_ij.

The main cost is in computing w_iᵀ D w_j. Instead, we maintain U = DW after each coordinate update, and then compute w_iᵀ u_j — only O(p) flops per update.
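A single coordinate update in this form can be sketched as follows. This computes w_iᵀDw_j directly for clarity; a real implementation would maintain U = DW as described above, and would also update the symmetric entry D_ji. The toy data is illustrative:

```python
import numpy as np

def soft_threshold(z, r):
    """S(z, r) = sign(z) * max(|z| - r, 0)."""
    return np.sign(z) * max(abs(z) - r, 0.0)

def cd_update(i, j, D, X, S, W, lam):
    """One coordinate-descent update of D_ij for the Lasso subproblem
    min_D g_X(D) + lam*||X + D||_1, using the closed form from the slide."""
    a = W[i, j] ** 2 + W[i, i] * W[j, j]
    b = S[i, j] - W[i, j] + W[i] @ D @ W[:, j]  # w_i^T D w_j, computed directly
    c = X[i, j] + D[i, j]
    D[i, j] = -c + soft_threshold(c - b / a, lam / a)
    return D

# Toy step: X = I (so W = I), D = 0; updating (0, 1) gives
# a = 1, b = S_01 = 0.3, c = 0, hence D_01 = S(-0.3, 0.1) = -0.2.
S = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]])
X = np.eye(3)
D = cd_update(0, 1, np.zeros((3, 3)), X, S, np.linalg.inv(X), lam=0.1)
```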
Difficulties in Scaling QUIC
Consider the case where p ≈ 1 million and m = ‖X_t‖_0 ≈ 50 million.

Coordinate descent requires X_t and W = X_t^{-1}:
needs O(p²) storage;
needs O(mp) computation per sweep, where m = ‖X_t‖_0.
Coordinate Updates with Memory Cache
Assume we can store M columns of W in memory.

Coordinate descent update (i, j): compute w_iᵀ D w_j.

If w_i, w_j are not in memory, recompute them by CG: solve X w_i = e_i in O(T_CG) time.
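Recomputing a missing column of W = X^{-1} needs only matrix–vector products with X, so X never has to be inverted or densified; a minimal CG sketch (the toy X is illustrative):

```python
import numpy as np

def cg_solve(X, b, tol=1e-10, max_iter=1000):
    """Conjugate gradient for X w = b with symmetric positive definite X;
    each iteration costs one matrix-vector product with X."""
    w = np.zeros_like(b)
    r = b - X @ w
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Xp = X @ p
        alpha = rs / (p @ Xp)
        w = w + alpha * p
        r = r - alpha * Xp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

# Column i of W on demand: solve X w_i = e_i.
X = np.array([[2.0, 0.5],
              [0.5, 1.0]])
w0 = cg_solve(X, np.array([1.0, 0.0]))   # first column of X^{-1}
```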
Coordinate Updates with Memory Cache
w_1, w_2, w_3, w_4 stored in memory.
Coordinate Updates with Memory Cache
Cache hit: no need to recompute w_i, w_j.
Coordinate Updates with Memory Cache
Cache miss: recompute w_i, w_j.
Coordinate Updates – ideal case
We want an update sequence that minimizes the number of cache misses: probably NP-hard.

Our strategy: update variables block by block.

The ideal case: there exists a partition {S_1, …, S_k} such that all free-set entries lie in diagonal blocks. Then only p column evaluations are required.
General case: block diagonal + sparse
If the block partition is not perfect, the extra column computations can be characterized by boundary nodes.

Given a partition {S_1, …, S_k}, we define the boundary nodes of block S_q as

B(S_q) ≡ {j | j ∈ S_q and ∃ i ∈ S_z, z ≠ q, such that F_ij = 1},

where F is the adjacency matrix of the free set.
Graph Clustering Algorithm
The number of columns to be computed in one sweep is

p + Σ_q |B(S_q)|.

This can be upper bounded by

p + Σ_q |B(S_q)| ≤ p + Σ_{z ≠ q} Σ_{i ∈ S_z, j ∈ S_q} F_ij.

Use graph clustering (METIS or Graclus) to find the partition.

Example: on the fMRI dataset (p = 0.228 million) with 20 blocks:
random partition: needs 1.6 million column computations;
graph clustering: needs 0.237 million column computations.
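The column count p + Σ_q |B(S_q)| can be evaluated for any candidate partition; a minimal sketch on a tiny free-set graph (the 4-node graph and both partitions are illustrative):

```python
import numpy as np

def columns_per_sweep(F, partition):
    """Columns of W computed in one block-CD sweep: p plus, for each block
    S_q, its boundary nodes B(S_q) = {j in S_q : some i outside S_q has
    F_ij = 1}, where F is the adjacency matrix of the free set."""
    p = F.shape[0]
    total = p
    for block in partition:
        outside = np.ones(p, dtype=bool)
        outside[list(block)] = False
        total += sum(1 for j in block if F[outside, j].any())
    return total

# 4-node free-set graph: edges (0,1) and (2,3) inside blocks, one cross edge (1,2).
F = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
good = columns_per_sweep(F, [[0, 1], [2, 3]])   # boundary nodes 1 and 2: 4 + 2 = 6
bad = columns_per_sweep(F, [[0, 2], [1, 3]])    # a poor partition costs more
```

A good clustering keeps edges inside blocks, which is exactly why METIS/Graclus help here.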
BigQUIC
Block coordinate descent with clustering:
storage: O(p²) → O(m + p²/k);
computation per sweep: O(mp) → O(mp), where m = ‖X_t‖_0.
Algorithm
BigQUIC
Input: samples Y, scalar λ, initial X_0.
For t = 0, 1, 2, …
1. Variable selection: select a free set of m ≪ p² variables.
2. Construct a partition by clustering.
3. Run block coordinate descent to find the descent direction: D_t = arg min_Δ f̄_{X_t}(X_t + Δ) over the set of free variables.
4. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασΔ_t.
(Positive definiteness checked via the Schur complement with the conjugate gradient method.)
Convergence Analysis
Convergence Guarantees for QUIC
Theorem (Hsieh, Sustik, Dhillon & Ravikumar, NIPS, 2011)
QUIC converges quadratically to the global optimum; that is, for some constant 0 < κ < 1:

lim_{t→∞} ‖X_{t+1} − X*‖_F / ‖X_t − X*‖_F² = κ.
Other convergence proofs for the proximal Newton method:

(Lee et al., 2012) discusses convergence for general loss functions and regularizers.
(Katyas et al., 2014) discusses the global convergence rate.
BigQUIC Convergence Analysis
Recall W = X^{-1}. When each w_i is computed by CG (X w_i = e_i):

The gradient ∇_ij g(X) = S_ij − W_ij on the free set can be computed once and stored in memory.
The Hessian terms (w_iᵀ D w_j in the coordinate updates) need to be repeatedly computed.

To reduce the time overhead, the Hessian should be computed approximately.

Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013)

If ∇g(X) is computed exactly and η I ⪯ H_t ⪯ η̄ I for some constants η̄ ≥ η > 0 at every Newton iteration, then BigQUIC converges to the global optimum.
BigQUIC Convergence Analysis
To obtain a super-linear convergence rate, the Hessian computation must become more and more accurate as X_t approaches X*.

Measure the accuracy of the Hessian computation by the residual matrix R = [r_1, …, r_p], where r_i = X w_i − e_i.

The optimality of X_t can be measured by the minimum-norm subgradient:

∇^S_ij f(X) = ∇_ij g(X) + sign(X_ij) λ, if X_ij ≠ 0;
∇^S_ij f(X) = sign(∇_ij g(X)) max(|∇_ij g(X)| − λ, 0), if X_ij = 0.

Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013)

In BigQUIC, if the residual matrix satisfies ‖R_t‖ = O(‖∇^S f(X_t)‖^ℓ) for some 0 < ℓ ≤ 1 as t → ∞, then ‖X_{t+1} − X*‖ = O(‖X_t − X*‖^{1+ℓ}) as t → ∞.

When ‖R_t‖ = O(‖∇^S f(X_t)‖), BigQUIC has an asymptotic quadratic convergence rate.
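The optimality measure ∇^S f(X) is computed entrywise from ∇g(X); a minimal sketch (the toy inputs are illustrative):

```python
import numpy as np

def min_norm_subgrad(X, G, lam):
    """Entrywise minimum-norm subgradient of f = g + lam*||.||_1, given
    G = grad g(X): G_ij + sign(X_ij)*lam where X_ij != 0, and the
    soft-thresholded gradient sign(G_ij)*max(|G_ij| - lam, 0) where X_ij = 0."""
    return np.where(X != 0,
                    G + np.sign(X) * lam,
                    np.sign(G) * np.maximum(np.abs(G) - lam, 0.0))

# Toy check: zero entries with |G_ij| <= lam contribute nothing,
# so the measure vanishes exactly at a stationary point.
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
G = np.array([[0.2, 0.3],
              [0.3, -0.1]])
grad_s = min_norm_subgrad(X, G, lam=0.25)
```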
Experimental results (scalability)
(a) Scalability on random graph. (b) Scalability on chain graph.

Figure: BigQUIC can solve one million dimensional problems.
Experimental results
BigQUIC is faster even for medium size problems.
Figure: Comparison on fMRI data with a p = 20,000 subset (the maximum dimension that previous methods can handle).
Results on FMRI dataset
228,483 voxels, 518 time points.
λ = 0.6 =⇒ average degree 8, BigQUIC took 5 hours.
λ = 0.5 =⇒ average degree 38, BigQUIC took 21 hours.
Findings:
Voxels with large degree were generally found in the gray matter.
Meaningful brain modules can be detected by modularity clustering.
Quic & Dirty
A Quadratic Approximation Approach for Dirty Statistical Models
Dirty Statistical Models
Superposition-structured models or dirty models (Yang et al., 2013):
min_{ {θ^(r)}_{r=1}^k } L( Σ_r θ^(r) ) + Σ_r λ_r ‖θ^(r)‖_{A_r} := F(θ).

An example: GMRF with latent features (Chandrasekaran et al., 2012):

min_{S, L : L ⪰ 0, S − L ≻ 0} −log det(S − L) + ⟨S − L, Σ⟩ + λ_S ‖S‖_1 + λ_L ‖L‖_*.

When there are two groups of random variables X, Y, write Θ = [Θ_XX, Θ_XY; Θ_YX, Θ_YY]. If we only observe X, and Y is the set of hidden variables, the marginal precision matrix is Θ_XX − Θ_XY Θ_YY^{-1} Θ_YX, and therefore has a low rank + sparse structure.
Error bound: (Meng et al., 2014).
Other applications: robust PCA (low rank + sparse), multitask learning (sparse + group sparse), …
Quadratic Approximation
The Newton direction d:

[d^(1), …, d^(k)] = arg min_{Δ^(1), …, Δ^(k)} ḡ(θ + Δ) + Σ_{r=1}^k λ_r ‖θ^(r) + Δ^(r)‖_{A_r},

where

ḡ(θ + Δ) = g(θ) + Σ_{r=1}^k ⟨Δ^(r), G⟩ + (1/2) Δᵀ H Δ,

and H := ∇²g(θ) is the k × k block matrix with every block equal to H̄ := ∇²L(θ).
Quadratic Approximation
A general active subspace selection for decomposable norms:

ℓ1 norm: variable selection.
Nuclear norm: subspace selection.
Group sparse norm: group selection.

Method for solving the quadratic approximation subproblem: alternating minimization.

Convergence analysis: even though the Hessian is not positive definite, we still obtain quadratic convergence.
Experimental Comparisons
(a) ER dataset (b) Leukemia dataset
Figure: Comparison on gene expression data (GMRF with latent variables).
Conclusions
How to scale the proximal Newton method to high dimensional problems:

Active variable selection ⇒ the complexity scales with the intrinsic dimensionality.
Block coordinate descent ⇒ handles problems where # parameters > memory size.
(Big & Quic.)
Other techniques: divide-and-conquer, efficient line search…

BigQUIC: a memory-efficient quadratic approximation method for sparse inverse covariance estimation (scales to 1 million by 1 million problems).

With careful design, the proximal Newton method can also solve models with superposition-structured regularization (sparse + low rank; sparse + group sparse; low rank + group sparse; …).
References
[1] C.-J. Hsieh, M. Sustik, I. S. Dhillon, P. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables. NIPS (oral presentation), 2013.
[2] C.-J. Hsieh, M. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation. NIPS, 2011.
[3] C.-J. Hsieh, I. S. Dhillon, P. Ravikumar, and A. Banerjee. A Divide-and-Conquer Procedure for Sparse Inverse Covariance Estimation. NIPS, 2012.
[4] P. A. Olsen, F. Oztoprak, J. Nocedal, and S. J. Rennie. Newton-Like Methods for Sparse Inverse Covariance Estimation. Optimization Online, 2012.
[5] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. JMLR, 2008.
[6] J. Friedman, T. Hastie, and R. Tibshirani. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics, 2008.
[7] L. Li and K.-C. Toh. An Inexact Interior Point Method for L1-Regularized Sparse Covariance Selection. Mathematical Programming Computation, 2010.
[8] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse Inverse Covariance Selection via Alternating Linearization Methods. NIPS, 2010.
[9] K. Scheinberg and I. Rish. Learning Sparse Gaussian Markov Networks Using a Greedy Coordinate Ascent Approach. Machine Learning and Knowledge Discovery in Databases, 2010.