
A Joint Convex Penalty for Inverse Covariance Matrix Estimation

Ashwini Maurya

Department of Statistics and Probability, Michigan State University

March 27, 2014


Outline

1 Introduction and Scope

2 Joint Convex Penalty

3 Solving the optimization problem

4 Simulation Study

5 Conclusion

6 References


Covariance Matrix

Let X = (X_1, · · · , X_n)^T be an observation matrix from a p-dimensional multivariate distribution with mean vector µ = (µ_1, · · · , µ_p)^T and covariance matrix Σ. The sample covariance matrix is defined by

$$S_{i,j} = \sum_{m=1}^{n} (X_{m,i} - \bar{X}_i)(X_{m,j} - \bar{X}_j), \qquad i, j = 1, 2, \ldots, p.$$

When p > n, the sample covariance matrix is singular, and therefore the estimation of Σ, and hence of Σ⁻¹, is ill-posed. We assume that Σ⁻¹ is sparse, i.e.,

$$\#\{(i, j) : \Sigma^{-1}_{i,j} \neq 0\} =: g \ll p(p + 1)/2.$$

How can we estimate the non-zero entries of the inverse covariance matrix?
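As a quick illustration of this singularity (a minimal sketch of our own, not from the slides), the following NumPy code forms the entrywise sample covariance matrix defined above and checks its rank when p > n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                          # p > n: high-dimensional regime
X = rng.standard_normal((n, p))         # rows are the observations X_1, ..., X_n

# S_{i,j} = sum_m (X_{m,i} - Xbar_i)(X_{m,j} - Xbar_j), as on the slide
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc

# Because of centering, rank(S) is at most n - 1 = 49 < p, so S is singular.
print(np.linalg.matrix_rank(S))
```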


Scope of Inverse Covariance Matrix

Estimation of the inverse covariance matrix is important in a number of statistical analyses, including:

Gaussian Graphical Modeling: In Gaussian graphical modeling, a zero entry in the inverse covariance matrix corresponds to conditional independence between the corresponding pair of variables.

Linear or Quadratic Discriminant Analysis: When the features are assumed to have a multivariate Gaussian density, the resulting discriminant rule requires an estimate of the inverse covariance matrix.

Principal Component Analysis (PCA): In multivariate high-dimensional data it is often desirable to transform the high-dimensional feature space to a lower dimension without losing much information. PCA is typically carried out via the eigen-decomposition of an estimated covariance matrix.


Penalization methods

It is well known that ℓ1 minimization leads to sparse solutions. Some of the popular methods for estimating a sparse inverse covariance matrix based on ℓ1 minimization include:

Graphical Lasso (Friedman et al., 2007)

Sparse Permutation Invariant Covariance Matrix Estimation (Rothman et al., 2008)

High Dimensional Inverse Covariance Matrix Estimation (Ming Yuan, 2010)

Sparse Inverse Covariance Matrix Estimation using Quadratic Approximation (C.-J. Hsieh et al., 2011)


Beyond ℓ1 Penalization

It is well known that the eigenvalues of the sample covariance matrix are over-dispersed [Marčenko-Pastur 1967, Johnstone 2001].

In the high-dimensional setting (where n ≪ p), an estimate of the covariance matrix based on the sample covariance matrix will be singular; in other words, most of its eigenvalues will be zero or close to zero.

Consequently, the eigenvalues of the inverse covariance matrix will be very large.
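A small self-contained simulation (our own sketch; the true covariance is the identity, so every true eigenvalue equals 1, and S is normalized by 1/n here) illustrates this over-dispersion, mirroring the histograms on the next slide:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 200
for n in (500, 100):
    X = rng.standard_normal((n, p))          # true covariance Sigma = I_p
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                        # normalized sample covariance
    eig = np.linalg.eigvalsh(S)
    # All true eigenvalues are 1; the sample eigenvalues spread far around 1,
    # and for n = 100 < p = 200 roughly half of them are (numerically) zero.
    print(f"n={n}: min={eig.min():.3f}, max={eig.max():.3f}")
```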


Beyond ℓ1 Penalization

[Figure: Histograms of eigenvalues (y-axis: frequency). Panels: true eigenvalues; sample eigenvalues, n = 500, p = 200; sample eigenvalues, n = 100, p = 200.]


Beyond ℓ1 Penalization and A Joint Convex Penalty

Motivated by this observation, we include an additional penalty which reduces the eigen-spectrum of the inverse covariance matrix.

Let X = (X_1, · · · , X_n)^T be an observation matrix from a p-dimensional multivariate normal distribution with mean vector µ = (µ_1, · · · , µ_p)^T and covariance matrix Σ.

The constrained maximum likelihood estimate of the inverse covariance matrix is given by:

$$\underset{W \succ 0}{\operatorname{arg\,min}}\; F(W) = -\log(\det(W)) + \operatorname{tr}(SW) + \lambda \|W\|_1 + \tau \|W\|_* \qquad (2.1)$$

where W is the inverse covariance matrix, S is the sample covariance matrix, and λ and τ are non-negative real constants.
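For concreteness, here is a minimal sketch (our own helper name, not from the paper) that evaluates the objective F(W) of (2.1) for a candidate positive definite W; since W ≻ 0, its trace norm equals the sum of its eigenvalues:

```python
import numpy as np

def jcp_objective(W, S, lam, tau):
    """F(W) = -log det(W) + tr(SW) + lam*||W||_1 + tau*||W||_*, for W > 0."""
    sign, logdet = np.linalg.slogdet(W)
    if sign <= 0:
        return np.inf                           # outside the feasible region W > 0
    l1 = np.abs(W).sum()                        # ||W||_1: sum of absolute entries
    trace_norm = np.linalg.eigvalsh(W).sum()    # ||W||_*: sum of eigenvalues (W > 0)
    return -logdet + np.trace(S @ W) + lam * l1 + tau * trace_norm
```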


Beyond ℓ1 Penalization and A Joint Convex Penalty

The first penalty term ‖W‖₁ in equation (2.1) is defined as the sum of the absolute values of the entries of the inverse covariance matrix W.

The second penalty term ‖W‖∗ is the trace norm, defined as the sum of the singular values of W (for positive definite matrices, eigenvalues and singular values coincide).

The ℓ1 norm is convex and smooth except at the origin, whereas the trace norm is a non-smooth convex function. Consequently, the optimization problem is a convex optimization problem with non-smooth penalty terms.

We implement a proximal gradient method to solve this optimization problem; the shrinkage operators associated with the two penalties are sketched below.
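Both non-smooth terms admit simple closed-form shrinkage (proximal) operators, which is what makes a proximal gradient approach attractive; a minimal sketch of the two operators (our own function names; the trace-norm operator below is the proximal map restricted to the positive semidefinite cone, where the trace norm is just the trace):

```python
import numpy as np

def soft_threshold(A, t):
    """Prox of t*||.||_1: shrink every entry of A toward zero by t."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def eigen_soft_threshold(A, t):
    """Prox of t*||.||_* over the PSD cone, for symmetric A: soft-threshold its eigenvalues.
    Negative eigenvalues are clipped to zero, so the result is positive semidefinite."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.maximum(vals - t, 0.0)) @ vecs.T
```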


Proximal Gradient Algorithm

Let h(x) be a smooth convex function bounded below by a finite constant. The proximal gradient method generates a sequence of solutions by solving

$$x^k = \underset{x}{\operatorname{arg\,min}}\; h(x) + \|x - x^{k-1}\|^2, \qquad k = 1, 2, \ldots \qquad (3.1)$$

This sequence converges weakly to arg min_x h(x) [Rockafellar, 1976].
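Specialized to objective (2.1), one possible proximal-gradient-style iteration is sketched below. This is our own rough sketch, not the paper's exact algorithm: the fixed step size is a placeholder, the two shrinkage operators (repeated here so the sketch is self-contained) are applied one after another, which only approximates the joint proximal map, and eigenvalues are floored at a small positive value so that W stays invertible:

```python
import numpy as np

def soft_threshold(A, t):                        # entrywise l1 shrinkage
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def eigen_soft_threshold(A, t, floor=1e-8):      # eigenvalue shrinkage, kept strictly positive
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.maximum(vals - t, floor)) @ vecs.T

def jcp_prox_gradient(S, lam=0.1, tau=0.1, step=0.1, n_iter=200):
    """Heuristic solver for -log det(W) + tr(SW) + lam*||W||_1 + tau*||W||_*.
    S should be a normalized sample covariance; step must be small enough for stability."""
    W = np.eye(S.shape[0])                       # start from the identity
    for _ in range(n_iter):
        grad = S - np.linalg.inv(W)              # gradient of the smooth part
        W = W - step * grad                      # gradient step
        W = soft_threshold(W, step * lam)        # shrink entries (l1 penalty)
        W = eigen_soft_threshold(W, step * tau)  # shrink eigenvalues (trace-norm penalty)
        W = (W + W.T) / 2                        # guard against asymmetry from rounding
    return W
```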


Convergence Analysis

Theorem

Let {W_n, n = 1, 2, ...} be the sequence generated by the proximal gradient algorithm, and let M < ∞ be a constant such that

$$\sum_{n=1}^{\infty} \left| \langle W^* - W_n, \nabla \|W_n\|_1 \rangle \right| < M.$$

We have

$$F(W_k) - F(W^*) \le \frac{\gamma L \|W^0 - W^*\|_F^2 + 18 c^2 + M}{4k},$$

where L > 0 is the least upper Lipschitz constant of the gradient of the negative log-likelihood function, and γ > 1 and c > 0 are constants.


Simulation

Extensive simulations were carried out; here we report results for two types of inverse covariance matrices (a rough generation sketch follows the two descriptions):

(i) Hub Graph: The rows/columns are partitioned into J equally-sized disjoint groups {V_1 ∪ V_2 ∪ · · · ∪ V_J} = {1, 2, ..., p}, and each group is associated with a pivotal row k. Let |V_1| = s. We set w_{i,j} = w_{j,i} = ρ for i ∈ V_k and w_{i,j} = w_{j,i} = 0 otherwise. In our experiment, J = [p/s], k = 1, s + 1, 2s + 1, ..., and we always set ρ = 1/(s + 1) with J = 20.

(ii) Neighborhood Graph: We first uniformly sample (y_1, y_2, ..., y_n) from a unit square. We then set w_{i,j} = w_{j,i} = ρ with probability (√(2π))⁻¹ exp(−4‖y_i − y_j‖²). The remaining entries of W are set to zero. The number of nonzero off-diagonal elements in each row or column is restricted to be smaller than [1/ρ], and ρ is set to 0.245.
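A rough generation sketch for the two designs, based on our reading of the description above (the unit diagonal, the use of p sampled points for the neighborhood graph, and the degree-capping details are assumptions; the paper's exact recipe may differ):

```python
import numpy as np

def hub_precision(p, J=20):
    """Hub-type W: J equal groups, each member tied to its group's pivotal row."""
    s = p // J                                    # group size, so J = [p / s]
    rho = 1.0 / (s + 1)
    W = np.eye(p)                                 # unit diagonal (assumption)
    for g in range(J):
        k = g * s                                 # pivotal rows 1, s+1, 2s+1, ... (0-based here)
        for i in range(g * s, (g + 1) * s):
            if i != k:
                W[i, k] = W[k, i] = rho
    return W

def neighborhood_precision(p, rho=0.245, seed=0):
    """Neighborhood-type W: points in the unit square, near pairs linked with prob ~ exp(-4 d^2)."""
    rng = np.random.default_rng(seed)
    y = rng.uniform(size=(p, 2))                  # one point per row/column (assumption)
    W = np.eye(p)
    max_deg = int(1 / rho)                        # cap on nonzero off-diagonals per row
    for i in range(p):
        for j in range(i + 1, p):
            prob = np.exp(-4.0 * np.sum((y[i] - y[j]) ** 2)) / np.sqrt(2.0 * np.pi)
            deg_ok = (np.count_nonzero(W[i]) - 1 < max_deg) and (np.count_nonzero(W[j]) - 1 < max_deg)
            if deg_ok and rng.uniform() < prob:
                W[i, j] = W[j, i] = rho
    return W
```

With a unit diagonal, both constructions are strictly diagonally dominant (each row's off-diagonal entries sum to less than one), so the sketched W is positive definite.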


Simulation

We compare the proposed method (JCP) to the graphical lasso and SPICE in terms of the Kullback-Leibler loss (KL-loss; see the sketch after this list) for varying values of the sample size (n) and the number of covariates (p). We choose n = 50, 100 and p = 50, 100, 200.

For both the Hub and Neighborhood type inverse covariance matrices, the JCP method performs better than SPICE in terms of KL-loss for all n and p.

The performance of JCP is better than that of the graphical lasso for p = 50 and p = 100, whereas JCP performs similarly to the graphical lasso for p = 200.

The extensive simulation analysis (refer to the paper) also shows that the loss of the estimated inverse covariance matrix depends on the underlying structure of the matrix.
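For reference, one common convention for the KL loss of a precision-matrix estimate Ŵ against the true covariance Σ (the paper's exact convention may differ) is tr(ŴΣ) − log det(ŴΣ) − p, which is zero exactly when Ŵ = Σ⁻¹; a sketch:

```python
import numpy as np

def kl_loss(W_hat, Sigma_true):
    """KL loss tr(W_hat @ Sigma) - log det(W_hat @ Sigma) - p; zero iff W_hat = inv(Sigma)."""
    p = Sigma_true.shape[0]
    M = W_hat @ Sigma_true
    sign, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - p
```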


Simulation (Hub-type Inverse Covariance Matrix)

Left graph n = 50, right graph n = 100 for varying p.

[Figure: KL-loss versus number of variables p ∈ {50, 100, 200} for JCP, Graphical Lasso, and SPICE; two panels (n = 50 and n = 100).]


Simulation (Neighborhood-type Inverse Covariance Matrix)

Left graph n = 50, right graph n = 100 for varying p.

[Figure: KL-loss versus number of variables p ∈ {50, 100, 200} for JCP, Graphical Lasso, and SPICE; two panels (n = 50 and n = 100).]


Conclusion

We impose a joint convex penalty which, in the simulation study, has shown better performance than the competing methods. In practice the underlying inverse covariance matrix may have additional structure beyond sparsity; in that case a suitable penalty can be used to estimate the inherent structure of the matrix.

The proposed joint convex penalty is more flexible in that it allows entries of the inverse covariance matrix to be penalized differently, rather than penalizing all entries by the same regularization parameter as in the graphical lasso and SPICE.

Under mild conditions, the proposed proximal gradient algorithm is shown to achieve a sub-linear rate of convergence for this problem, which can be further improved to a linear rate, making it a suitable choice for large-scale optimization problems.


Selected References

Maurya, A. (2014). A joint convex penalty for inverse covariance matrix estimation. Computational Statistics and Data Analysis.

Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics.

Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2, 494-515.


Thank You!
