
Random Matrix Theory: Selected Applications from

Statistical Signal Processing and Machine Learning

Ph.D. Thesis Defense

Khalil Elkhalil

Committee Chairperson: Dr. Tareq Y. Al-Naffouri

Committee Co-Chair: Dr. Mohamed-Slim Alouini

Committee Members: Dr. Xiangliang Zhang and Dr. Abla Kammoun

External Examiner: Dr. Alfred Hero

June 24, 2019

King Abdullah University of Science and Technology

Computer, Electrical and Mathematical Sciences and Engineering


Table of contents

1. Introduction

2. Moments of Correlated Gram matrices

3. Regularized discriminant analysis with large dimensional data

4. Centered Kernel Ridge Regression (CKRR)

5. Conclusion

6. Future research directions


Introduction


Moore’s law

• The number of transistors that fit on a piece of silicon doubles roughly every two years.¹

¹C. M. Bishop, Microsoft Research, Cambridge.


Big data

Big data: large sample size, high dimensional data, high variability → new processing techniques (optimization, statistics).

We need a tool that embraces these challenges.


Random matrix theory (RMT)

Study of the behavior of large random matrices.

• Allows predicting the behavior of random quantities that depend on large random matrices.

• Key to success: randomness + high dimensionality.


RMT: Applications

Statistical Signal Processing

• Large number of antenna arrays vs large number of observations

→ Improved signal processing techniques

Wireless Communications

• Large # of antennas, Large # of users

→ Improved transmission and detection strategies

→ Low complexity design

Machine learning

• Supervised²/semi-supervised³/unsupervised⁴ learning.

→ A better fundamental understanding

→ Improved classification performance

²Z. Liao, R. Couillet, "A Large Dimensional Analysis of Least Squares Support Vector Machines", submitted.
³X. Mai, R. Couillet, "A random matrix analysis and improvement of semi-supervised learning for large dimensional data", submitted.
⁴R. Couillet, F. Benaych-Georges, "Kernel Spectral Clustering of Large Dimensional Data", Electronic Journal of Statistics, vol. 10, no. 1, pp. 1393-1454, 2016.


RMT: Example

How does this work ?

• Self-averaging mechanism similar to the one behind the law of large numbers.

• h_1, ..., h_n ∈ C^p with i.i.d. entries of zero mean and variance 1/n.

• HH^H is an estimator of the covariance matrix, with H = [h_1, ..., h_n].

[Figure 1.1: Histogram of the eigenvalues of HH^H for p fixed and n → ∞; the histogram concentrates around the limiting law.]

[Figure 1.2: Histogram of the eigenvalues of HH^H for p, n → ∞ with p/n → c; the eigenvalues follow the Marcenko-Pastur law, with the bulk supported on [(1 − √c)², (1 + √c)²].]
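The Marcenko-Pastur prediction is easy to reproduce numerically: the eigenvalues of HH^H concentrate on [(1 − √c)², (1 + √c)²] as p, n → ∞ with p/n → c. A minimal sketch (real entries and these specific dimensions are arbitrary simplifications, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 400                                 # aspect ratio c = p/n = 0.5
c = p / n
H = rng.standard_normal((p, n)) / np.sqrt(n)    # i.i.d. entries, mean 0, variance 1/n
eigs = np.linalg.eigvalsh(H @ H.T)

# Marcenko-Pastur bulk edges
edge_lo, edge_hi = (1 - np.sqrt(c))**2, (1 + np.sqrt(c))**2
print(eigs.min(), eigs.max())   # close to the bulk edges (1-√c)² ≈ 0.086 and (1+√c)² ≈ 2.914
```

Even at these moderate dimensions the extreme eigenvalues already sit very near the asymptotic bulk edges, which is the self-averaging effect the slide refers to.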


Why is this useful?

The same result extends to the correlated case⁵. Certain functionals f(HH^H) can be evaluated when p, n → ∞ with p/n → c:

• (1/n) tr(HH^H)

• (1/n) tr(HH^H)^{−k}: performance of linear estimation techniques

• (1/n) log det(HH^H): MIMO systems, linear estimation (LCE)

• λ_min(HH^H), λ_max(HH^H), ...: WEV in linear estimation

What happens to the moments E_{H∼D} f(HH^H) in the finite regime?

⁵J. W. Silverstein and Z. D. Bai, On the Empirical Distribution of Eigenvalues of a Class of Large Dimensional Random Matrices, Journal of Multivariate Analysis, vol. 54, pp. 175-192, May 2002.


Moments of Correlated Gram matrices


Gram matrices

Linear estimation

Let m < n and H ∈ C^{n×m} with i.i.d. zero-mean, unit-variance Gaussian entries, and let Λ be a positive definite matrix with distinct eigenvalues θ_1, θ_2, ..., θ_n.

y_{n×1} = H_{n×m} v_{m×1} + Λ^{−1/2} z_{n×1}. (1)

Define the correlated Gram matrix

G = H*ΛH. (2)

        LS             LMMSE
MSE     E tr G^{−1}    E tr (G + R_v^{−1})^{−1}

Sample covariance matrix (SCM): u(k) = R^{1/2} h(k),

R̂(n) = (1 − λ) Σ_{k=1}^{n} λ^{n−k} u(k) u*(k) = R^{1/2} H Λ(n) H* R^{1/2}, (3)

Loss(n) ≜ E ‖R^{1/2} R̂^{−1}(n) R^{1/2} − I_m‖_F² = m + E tr G_n^{−2} − 2 E tr G_n^{−1}.


Negative moments of correlated Gram matrices

Define the negative moments of G as

µ_Λ(−k) ≜ E tr(G^{−k}), k ∈ N.

Then,

        LS (exact)    LMMSE (R_x = σ_x² I, σ_x² ≫ 1)
MSE     µ_Λ(−1)       Σ_{k=0}^{l} ((−1)^k / σ_x^{2k}) µ_Λ(−k−1) + o(σ_x^{−2l})

Loss(n) = m + µ_{Λ(n)}(−2) − 2 µ_{Λ(n)}(−1).


Negative moments of correlated Gram matrices

Theorem (Negative moments)ᵃ. Let p = min(m, n − m). Then, for 1 ≤ k ≤ p,

µ_Λ(−k) = L Σ_{j=1}^{k} Σ_{i=1}^{m} D(i, j) ((−1)^{k−j} / (k − j)!) b_i^T Ψ^{−1} D_i a_{j,k},

where Ψ is the Vandermonde matrix

Ψ = [ 1  θ_1      ...  θ_1^{n−m−1}
      ⋮  ⋮        ⋱    ⋮
      1  θ_{n−m}  ...  θ_{n−m}^{n−m−1} ].

ᵃK. Elkhalil, A. Kammoun, T. Al-Naffouri and M.-S. Alouini. Analytical Derivation of the Inverse Moments of One-Sided Correlated Gram Matrices with Applications. IEEE Trans. Signal Processing, 2016.


Limitations

µ_Λ(−k) = L Σ_{j=1}^{k} Σ_{i=1}^{m} D(i, j) ((−1)^{k−j} / (k − j)!) b_i^T Ψ^{−1} D_i a_{j,k},

with Ψ the Vandermonde matrix of θ_1, ..., θ_{n−m}.

• The formula is complicated.

• It is not useful when the eigenvalues of Λ are close to each other (we treat this issue for the positive moments)⁶.

• It is not numerically stable when the dimensions are large.

• It is not insightful!

• It is not universal: the result changes if the distribution is no longer Gaussian.

⁶K. Elkhalil, A. Kammoun, T. Y. Al-Naffouri and M.-S. Alouini: Numerically Stable Evaluation of Moments of Random Gram Matrices With Applications. IEEE Signal Process. Lett. 24(9): 1353-1357 (2017).


Asymptotic moments

Theorem (Silverstein and Bai⁷)

Consider the Gram matrix G = H*ΛH under the following assumptions:

• m, n → ∞ with m/n → c ∈ (0, ∞);

• ‖Λ‖ = O(1) with rank(Λ) = O(m).

Then

(1/m) tr G^{−1} − δ →_{a.s.} 0,  where δ = [(1/m) tr Λ (I_n + δΛ)^{−1}]^{−1}.

Higher inverse moments can be computed using an iterative processᵃ.

ᵃK. Elkhalil, A. Kammoun, T. Y. Al-Naffouri, M.-S. Alouini: Analytical Derivation of the Inverse Moments of One-Sided Correlated Gram Matrices With Applications. IEEE Trans. Signal Processing 64(10): 2624-2635 (2016).
⁷J. W. Silverstein and Z. D. Bai, On the Empirical Distribution of Eigenvalues of a Class of Large Dimensional Random Matrices, Journal of Multivariate Analysis, vol. 54, pp. 175-192, May 2002.
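The deterministic equivalent δ is defined only through the fixed-point equation above, so it can be computed by straightforward iteration. A minimal sketch (the eigenvalue profile of Λ and the dimensions below are illustrative choices, not from the thesis):

```python
import numpy as np

def delta_fixed_point(lam_eigs, m, tol=1e-12, max_iter=1000):
    """Iterate delta = [ (1/m) * sum_i lam_i / (1 + delta * lam_i) ]^{-1}."""
    delta = 1.0
    for _ in range(max_iter):
        new = 1.0 / (np.sum(lam_eigs / (1.0 + delta * lam_eigs)) / m)
        if abs(new - delta) < tol:
            return new
        delta = new
    return delta

# Sanity check: for Lambda = I_n the equation reduces to
# delta = m (1 + delta) / n, whose solution is delta = m / (n - m).
m, n = 50, 150
delta_id = delta_fixed_point(np.ones(n), m)
print(delta_id)  # ≈ m / (n - m) = 0.5
```

The map is a contraction in this regime, so a handful of iterations suffice; the same scheme extends to the higher inverse moments mentioned in the theorem.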


Validation of the inverse moments

[Figure: four panels of µ_Λ(r) in dB versus n (7 to 12), for m = 3 and r = −1, −2, −3, −4; each panel compares the exact theoretical formula, Monte Carlo simulation, and the normalized asymptotic moments.]


Optimal λ for SCM estimation

[Figure 2.1: The estimation loss Loss(λ) = m + µ_Λ(−2) − 2µ_Λ(−1) in dB as a function of λ (exact formula), for (m = 3, n = 5), (m = 3, n = 6) and (m = 3, n = 7); the optimal forgetting factor λ* sits at the minimum of each curve.]


Regularized discriminant analysis with large dimensional data


Supervised learning

• We are provided with labeled data (features_i, response/label_i), 1 ≤ i ≤ n.

• Fit a model to the data.

[Illustration from https://towardsdatascience.com]


Classification

• Principle: build a classification rule that assigns an unseen observation to its corresponding class.

[Scatter plot: two classes of labeled points separated by a classifier boundary.]

Let x be the input data and f be the classification rule.

Classifier ≜ { assign class 1 if f(x) > 0; assign class 2 if f(x) ≤ 0. }


Model based classification

• The data is assumed to be sampled from a certain distribution.

• The decision rule is constructed based on that assumption.

• The MAP rule is considered in the design:

k̂ = arg max_{k: classes} P[C_k | x].

The classifier is designed to satisfy this rule.


Gaussian discriminant analysis

Gaussian mixture model for binary classification (2 classes)

• x_1, ..., x_n ∈ R^p

• Class k is formed by x ∼ N(µ_k, Σ_k), k = 0, 1.

Linear discriminant analysis (LDA): Σ_0 = Σ_1 = Σ

W_LDA(x) = (x − (µ_0 + µ_1)/2)^T Σ^{−1} (µ_0 − µ_1) − log(π_1/π_0) ≷ 0,

deciding C_0 if W_LDA(x) > 0 and C_1 otherwise.

→ The decision rule is linear in x.

Quadratic discriminant analysis (QDA): Σ_0 ≠ Σ_1

W_QDA(x) = −(1/2) log(|Σ_0|/|Σ_1|) − (1/2)(x − µ_0)^T Σ_0^{−1} (x − µ_0) + (1/2)(x − µ_1)^T Σ_1^{−1} (x − µ_1) ≷ 0,

with the same decision convention.

→ The decision rule is quadratic in x.

How does this perform?
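As a concrete illustration (the dimensions, means, and covariances below are arbitrary choices, not taken from the slides), the two discriminants can be written directly from the formulas above; under this sign convention positive values vote for class C_0 and negative values for class C_1:

```python
import numpy as np

def w_lda(x, mu0, mu1, sigma_inv, pi0=0.5, pi1=0.5):
    # (x - (mu0+mu1)/2)^T Sigma^{-1} (mu0 - mu1) - log(pi1/pi0)
    return (x - (mu0 + mu1) / 2) @ sigma_inv @ (mu0 - mu1) - np.log(pi1 / pi0)

def w_qda(x, mu0, mu1, s0_inv, s1_inv, logdet0, logdet1):
    # -1/2 log(|S0|/|S1|) - 1/2 (x-mu0)^T S0^{-1} (x-mu0) + 1/2 (x-mu1)^T S1^{-1} (x-mu1)
    return (-0.5 * (logdet0 - logdet1)
            - 0.5 * (x - mu0) @ s0_inv @ (x - mu0)
            + 0.5 * (x - mu1) @ s1_inv @ (x - mu1))

mu0, mu1 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
I = np.eye(2)
print(w_lda(mu0, mu0, mu1, I))   # positive: mu0 itself is classified as C0
print(w_lda(mu1, mu0, mu1, I))   # negative: mu1 itself is classified as C1
```

With equal priors the log-prior terms vanish and the rule reduces to comparing (Mahalanobis) distances to the two class means.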


LDA

• Assume Σ, µ_0 and µ_1 are known.

• Equal priors: π_0 = π_1 = 0.5.

• No asymptotic regime; p is fixed.

The total misclassification rate is given by⁸

ε = Φ(−∆/2),  ∆ = ‖µ_0 − µ_1‖_{Σ^{−1}}.
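This closed form is easy to check by simulation: with known statistics and equal priors, the scalar discriminant (x − (µ_0 + µ_1)/2)^T Σ^{−1}(µ_0 − µ_1) is Gaussian under each class, which yields ε = Φ(−∆/2). A small Monte Carlo sketch (all parameter values are arbitrary):

```python
import numpy as np
from math import erf, sqrt

Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))   # standard normal CDF

rng = np.random.default_rng(0)
p = 4
mu0, mu1 = np.zeros(p), np.zeros(p)
mu1[0] = 2.0                  # Delta = ||mu0 - mu1|| = 2 since Sigma = I
Sigma_inv = np.eye(p)
delta = 2.0

# Draw from class 0 and count discriminant errors (W < 0 decides class 1).
X = rng.standard_normal((200_000, p)) + mu0
W = (X - (mu0 + mu1) / 2) @ Sigma_inv @ (mu0 - mu1)
err = np.mean(W < 0)
print(err, Phi(-delta / 2))   # both ≈ 0.1587
```

The empirical error matches Φ(−∆/2) = Φ(−1) up to Monte Carlo noise; by symmetry the class-1 error is the same, so the total rate coincides with the formula.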

What happens when the statistics are not known ?

8Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning.

Springer, 2009.


LDA: Asymptotic regime (equal covariances)

Asymptotic growth regime

Let n = n_0 + n_1.

• n_0, n_1, p → ∞ such that p/n → c < 1.

• µ_0 and µ_1 are known.

• Σ is replaced by its sample estimate Σ̂ = (1/(n−2)) Σ_{k=1}^{2} Σ_{i=1}^{n_k} (x_{k,i} − x̄_k)(x_{k,i} − x̄_k)^T.

Wang et al. 2018ᵃ

ε_LDA − Φ[−(∆/2)√(1 − c)] →_{prob.} 0

ᵃCheng Wang and Binyan Jiang. On the dimension effect of regularized linear discriminant analysis, arXiv:1710.03136v1.

→ When c → 1, the misclassification rate tends to 0.5.

→ For LDA to achieve acceptable performance, we need c close to 0.

→ Because it uses the inverse of the pooled covariance matrix, LDA applies only when c < 1.

What happens if p > n?


LDA: High dimensionality

Regularization

H = (I_p + γΣ̂)^{−1}.

Optimal γ?

Dimensionality reduction

data^{(d)} = W_{d×p} × data^{(p)}

Best d?


LDA with random projections

Random projections

R^p → R^d, x ↦ Wx

Projection matrix

We assume that the projection matrix W writes as W = (1/√p) Z, where the entries Z_{i,j} (1 ≤ i ≤ d, 1 ≤ j ≤ p) of Z are independent identically distributed random variables, centered with unit variance, satisfying the following moment assumption: there exists ε > 0 such that E|Z_{i,j}|^{4+ε} < ∞.

Johnson-Lindenstrauss lemma

For n given data points x_1, ..., x_n in R^p, ε ∈ (0, 1) and d > 8 log n / ε², there exists a linear map f : R^p → R^d such that

(1 − ε)‖x_i − x_j‖² ≤ ‖Wx_i − Wx_j‖² ≤ (1 + ε)‖x_i − x_j‖². (4)
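The lemma can be checked numerically. Note that the slide's scaling W = (1/√p)Z shrinks squared norms by a factor d/p on average; the classical Johnson-Lindenstrauss scaling uses W = (1/√d)Z, which is what this sketch (with arbitrary dimensions) uses so that the distance preservation is visible directly:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, n = 1000, 300, 20
X = rng.standard_normal((n, p))                  # n points in R^p
W = rng.standard_normal((d, p)) / np.sqrt(d)     # classical JL scaling
Y = X @ W.T

# Worst relative distortion of squared pairwise distances.
dist = []
for i in range(n):
    for j in range(i + 1, n):
        orig = np.sum((X[i] - X[j])**2)
        proj = np.sum((Y[i] - Y[j])**2)
        dist.append(abs(proj / orig - 1.0))
print(max(dist))   # comparable to eps = sqrt(8 log n / d) ≈ 0.28
```

All C(20, 2) pairwise distances are preserved up to a small relative distortion, as the lemma promises for d of order log n / ε².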

Conditional risk after projection

ε_i^{P-LDA} = Φ[ (−(1/2) µ^T W^T (WΣW^T)^{−1} Wµ + (−1)^{i+1} log(π_0/π_1)) / √(µ^T W^T (WΣW^T)^{−1} Wµ) ].

What is the behavior at high dimensions?


Performance of LDA with random projections

Asymptotic performanceᵃ

ε_i^{P-LDA} − Φ[ (−(1/2) µ^T (Σ + δ_d I_p)^{−1} µ + (−1)^{i+1} log(π_0/π_1)) / √(µ^T (Σ + δ_d I_p)^{−1} µ) ] →_{prob.} 0, (5)

where δ_d solves

δ_d tr (Σ + δ_d I_p)^{−1} = p − d. (6)

δd can be seen as a penalty on projection.
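Equation (6) defines δ_d only implicitly, but since δ ↦ δ tr(Σ + δI_p)^{−1} increases from 0 to p, a simple bisection recovers it. A sketch (the covariance models below are arbitrary examples, not from the slides):

```python
import numpy as np

def solve_delta(eigs, d):
    """Solve delta * sum_i 1/(lam_i + delta) = p - d by bisection."""
    p = len(eigs)
    f = lambda delta: delta * np.sum(1.0 / (eigs + delta)) - (p - d)
    lo, hi = 1e-12, 1.0
    while f(hi) < 0:           # grow the bracket until the root is inside
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

# For Sigma = I_p the equation reduces to delta * p/(1 + delta) = p - d,
# i.e. delta = (p - d)/d.
p, d = 800, 200
delta = solve_delta(np.ones(p), d)
print(delta)   # ≈ (800 - 200)/200 = 3.0
```

Consistent with the "penalty on projection" reading, δ_d grows as d shrinks (d → p gives δ_d → 0, recovering plain LDA in the formulas above).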

ᵃK. Elkhalil, A. Kammoun, R. Calderbank, T. Al-Naffouri and M.-S. Alouini. Asymptotic Performance of Linear Discriminant Analysis with Random Projections. ICASSP 2019.

               LDA                          P-LDA
equal priors   Φ[−(1/2)√(µ^T Σ^{−1} µ)]    Φ[−(1/2)√(µ^T (Σ + δ_d I_p)^{−1} µ)]
Σ = I_p        Φ[−(1/2)‖µ‖]                Φ[−(1/2)√(d/p) ‖µ‖]


P-LDA: Experiments

• p = 800.

• µ_0 = 0_p and µ_1 = (3/√p) 1_p.

• Σ = {0.4^{|i−j|}}_{i,j}.

[Figure: four panels showing the misclassification rate of P-LDA under the setting above.]


R-LDA: Asymptotic regime (equal covariances)

Asymptotic growth regime

• n_0, n_1, p → ∞ such that p/n → c ∈ (0, ∞).

• µ_k are replaced by x̄_k = (1/n_k) Σ_{x_i ∈ C_k} x_i.

• Σ^{−1} is replaced by its ridge estimate H = (I_p + γΣ̂)^{−1}.

Hachem et al. 2008ᵃ

H ∼ T = (I_p + ρΣ)^{−1},

in the sense that a^T (H − T) b →_{prob.} 0 and (1/n) tr A(H − T) →_{prob.} 0.

ᵃW. Hachem, O. Khorunzhiy, P. Loubaton, J. Najim, L. Pastur: A New Approach for Mutual Information Analysis of Large Dimensional Multi-Antenna Channels. IEEE Trans. Information Theory 54(9): 3987-4004 (2008).

Zollanvari and Dougherty 2015ᵃ

ε_{R−LDA}^{equal} − Φ[ −µ^T (I_p + ρΣ)^{−1} µ / √D ] →_{prob.} 0

ᵃAmin Zollanvari and Edward R. Dougherty: Generalized Consistent Error Estimator of Linear Discriminant Analysis. IEEE Trans. Signal Processing 63(11): 2804-2814 (2015).

What if the covariances are different?


R-LDA: Asymptotic regime (dist. covariances)

Asymptotic growth regime

• n_0, n_1, p → ∞ such that p/n → c ∈ (0, ∞).

• µ_k are replaced by x̄_k = (1/n_k) Σ_{x_i ∈ C_k} x_i.

• Σ^{−1} is replaced by its ridge estimate H = (I_p + γΣ̂)^{−1}.

Benaych and Couillet 2016ᵃ

H ∼ T_{0,1} ∝ (I_p + ρ_0 Σ_0 + ρ_1 Σ_1)^{−1}

ᵃF. Benaych-Georges and R. Couillet, Spectral Analysis of the Gram Matrix of Mixture Models, ESAIM: Probability and Statistics, vol. 20, pp. 217-237, 2016.

Elkhalil et al. 2018ᵃ

ε_{R−LDA}^{dist.} − { π_0 Φ[(−µ^T T_{0,1} µ + β)/√D_0] + π_1 Φ[(−µ^T T_{0,1} µ − β)/√D_1] } →_{prob.} 0

ᵃK. Elkhalil, A. Kammoun, R. Couillet, T. Al-Naffouri and M.-S. Alouini. A Large Dimensional Study of Regularized Discriminant Analysis Classifiers. Under review in IEEE Trans. Information Theory.

How is this different from the case of equal covariances?


R-LDA: Asymptotic regime (dist. covariances)

ε_{R−LDA}^{dist.} − { π_0 Φ[(−µ^T T_{0,1} µ + β)/√D_0] + π_1 Φ[(−µ^T T_{0,1} µ − β)/√D_1] } →_{prob.} 0

Some insights

• If ‖Σ_0 − Σ_1‖ = o(1), then ε_{R−LDA}^{dist.} = ε_{R−LDA}^{equal} + o(1).

→ R-LDA is robust against small perturbations.

• The misclassification rates differ across classes.

• An improvement in the misclassification rate of one class is likely to be lost by the other class.

• R-LDA does not leverage well the information about the covariance differences.

What about R-QDA?


R-QDA: Asymptotic regime

R-LDA: n_0, n_1 samples; pooled estimate Σ̂ = (1/(n−2)) Σ_{k=1}^{2} Σ_{i=1}^{n_k} (x_{k,i} − x̄_k)(x_{k,i} − x̄_k)^T.

R-QDA: n_0, n_1 samples; per-class estimates Σ̂_0 = (1/(n_0−1)) Σ_{i=1}^{n_0} (x_{0,i} − x̄_0)(x_{0,i} − x̄_0)^T and Σ̂_1 = (1/(n_1−1)) Σ_{i=1}^{n_1} (x_{1,i} − x̄_1)(x_{1,i} − x̄_1)^T.

ε_i = P[ω^T B_i ω + 2ω^T y_i < ξ_i], where ω ∼ N(0_p, I_p).

Asymptotic growth regime

1. Data scaling: n_i/p → c ∈ (0, ∞), with |n_0 − n_1| = o(p).

2. Mean scaling: ‖µ_0 − µ_1‖² = O(√p).

3. Covariance scaling: ‖Σ_i‖ = O(1).

4. Σ_0 − Σ_1 has exactly O(√p) eigenvalues of O(1).

CLT (Lyapunov)

ε_i^{R−QDA} − Φ[ (−1)^i ( (1/√p) ξ_i − (1/√p) tr B_i ) / √( (2/p) tr B_i² + (4/p) y_i^T y_i ) ] →_{prob.} 0.
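The CLT above rests on the fact that, for ω ∼ N(0_p, I_p), the quadratic form ω^T B ω + 2ω^T y has mean tr B and variance 2 tr B² + 4 y^T y, and is asymptotically normal when the spectrum of B is well spread. A quick Monte Carlo sketch (B, y and the dimensions are arbitrary illustrative choices):

```python
import numpy as np
from math import erf, sqrt

Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))

rng = np.random.default_rng(0)
p = 200
b = rng.uniform(0.5, 1.5, p)     # B = diag(b): tr B = b.sum(), tr B^2 = (b**2).sum()
y = 0.1 * rng.standard_normal(p)

mean = b.sum()
std = np.sqrt(2 * (b**2).sum() + 4 * y @ y)
xi = mean + 0.5 * std            # evaluate the CDF half a std above the mean

# Monte Carlo estimate of P[w^T B w + 2 w^T y < xi] for w ~ N(0, I_p).
hits, trials = 0, 0
for _ in range(10):
    W = rng.standard_normal((5000, p))
    q = (W**2) @ b + 2 * W @ y
    hits += np.sum(q < xi)
    trials += len(q)
print(hits / trials, Phi(0.5))
```

Already at p = 200 the empirical probability is within a few percent of the Gaussian value; the residual gap is the skewness of the chi-square-type sum, which vanishes as p grows.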


R-QDA: Asymptotic regime

Elkhalil et al. 2017/2018ᵃᵇ

ε^{R−QDA} − { π_0 Φ[(ξ_0 − b_0)/√(2B_0)] + π_1 Φ[(−ξ_1 + b_1)/√(2B_1)] } →_{prob.} 0.

ξ_i, b_i and B_i depend on the classes' statistics.

ᵃK. Elkhalil, A. Kammoun, R. Couillet, T. Y. Al-Naffouri, and M.-S. Alouini. Asymptotic Performance of Regularized Quadratic Discriminant Analysis-Based Classifiers. IEEE MLSP, Roppongi, Japan, Sept 2017.
ᵇK. Elkhalil, A. Kammoun, R. Couillet, T. Al-Naffouri and M.-S. Alouini. A Large Dimensional Study of Regularized Discriminant Analysis Classifiers. Under review in IEEE Trans. Information Theory.


Discussion

‖µ_0 − µ_1‖ = O(1)

[Diagram: two classes with means at O(1) distance and different covariances Σ_0, Σ_1.]

Recall that R-QDA needs ‖µ_0 − µ_1‖² = O(√p).

The information on the distance between the means is asymptotically useless!


Discussion

‖µ_0 − µ_1‖² = O(√p), and Σ_0 − Σ_1 has more than O(√p) eigenvalues of O(1).

R-QDA achieves asymptotically perfect classification.


Discussion

‖µ_0 − µ_1‖ = O(1) and ‖Σ_0 − Σ_1‖ = o(1).

Classification is asymptotically impossible.


Discussion

• Unbalanced training: n_0 − n_1 = O(p).

R-QDA is equivalent to the naive classifier:

ε → π_0 Φ(∞) + π_1 Φ(−∞) = π_0.


Parameter tuning

• Prone to estimation errors due to an insufficient number of observations.

• The tuning of the regularization parameter is very important.

Training data → derive the classifier; testing data → assess the performance.

Model selection: given a set of candidate regularization factors,

• evaluate the performance on the test data for each regularization value⁹;

• select the value with the lowest misclassification rate.

⁹J. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84:165-175, 1989.


Consistent estimator of the classification error

Exploiting the asymptotic equivalents

• If p = O(1) and n → ∞, then ‖Σ̂_i − Σ_i‖ = o_p(1).

• When p, n → ∞, then ‖Σ̂_i − Σ_i‖ ≠ o_p(1).

R-LDA (GE)

ε_i^{R−LDA} − Φ[ (−1)^i G(µ̂_i, µ̂_0, µ̂_1, H) / √(D(µ̂_0, µ̂_1, H, Σ̂_i)) ] →_{p.} 0

R-QDA (GE)

ε_i^{R−QDA} − Φ[ (−1)^i (ξ̂_i − b̂_i) / √(2B̂_i) ] →_{p.} 0

Optimal regularizer

γ* = arg min_{γ>0} ε̂(γ).

• These results provide a glimpse of the region where the optimal γ is likely to lie.

• Perform cross-validation or testing in that region.

How well does this perform?


Performance

Benchmark estimation techniques:

• 5-fold cross-validation with 5 repetitions (5-CV).

• 0.632 bootstrap (B632).

• 0.632+ bootstrap (B632+).

• Plugin estimator: replace the statistics in the DEs by their sample estimates.

Synthetic data

• [Σ_0]_{i,j} = 0.6^{|i−j|}.

• Σ_1 = Σ_0 + 3 [ I_{⌈√p⌉}  0_{k×(p−k)} ; 0_{(p−k)×k}  0_{(p−k)×(p−k)} ].

• µ_0 = [1, 0_{1×(p−1)}]^T.

• µ_1 = µ_0 + (0.8/√p) 1_{p×1}.

Real data

• USPS dataset.

• p = 256 features (16 × 16 grayscale images).

• n = 7291 training examples.

• n_test = 2007 testing examples.


Performance: Synthetic data

[Figure 3.1: Misclassification rate and RMS of the error estimates versus the number of features p (100 to 500) for R-LDA and R-QDA, comparing the G-estimator, 5-CV, B632, B632+ and plugin benchmarks, for n_i = p/2 and n_i = p; n_0 = n_1 and γ = 1.]


Performance: Synthetic data

[Figure 3.2: Average misclassification rate versus γ for R-LDA (empirical error, G-estimator, 5-CV, B632, B632+, Zollanvari, plugin) and R-QDA (same benchmarks except Zollanvari); p = 100 features with equal training size (n_0 = n_1 = p).]


Performance: USPS dataset

[Figure 3.3: Error estimates versus the training size per class n_0 (200 to 500) for R-LDA and R-QDA with all benchmarks; n_0 = n_1 and γ = 1. The first row gives the performance for the USPS data with digits (5, 2), whereas the second row considers the digits (5, 6).]


Performance: USPS dataset

[Figure 3.4: Misclassification rate versus γ for R-LDA and R-QDA (empirical error versus the G-estimator) on USPS digits (5, 2) and (5, 6), for n_0 = 100 and n_0 = 400; equal training size (n_0 = n_1).]


Centered Kernel Ridge Regression (CKRR)


KRR: Kernel trick

y = w^T x  →  y = w^T φ(x)


KRR: Kernel trick

• {x_i, y_i}_{i=1}^{n} in X × Y s.t. y_i = f(x_i) + σε_i with ε_i ∼_{i.i.d.} N(0, 1).

• Feature map: φ : X → H, where H is a RKHS.

• Learning problem:

min_α (1/2)‖y − Φα‖² + (λ/2)‖α‖²,  Φ = [φ(x_1), ..., φ(x_n)]^T ∈ R^{n×|H|}

α* = (Φ^TΦ + λI_{|H|})^{−1} Φ^T y ∈ R^{|H|}

f*(x) = φ(x)^T (Φ^TΦ + λI_{|H|})^{−1} Φ^T y
      = φ(x)^T Φ^T (ΦΦ^T + λI_n)^{−1} y   (Woodbury)

With {ΦΦ^T}_{i,j} = φ(x_i)^T φ(x_j), this gives

f*(x) = κ(x)^T (K + λI_n)^{−1} y,

with κ(x)_i = φ(x)^T φ(x_i) and K_{i,j} = φ(x_i)^T φ(x_j).
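The Woodbury step above is what makes KRR practical: only the n×n kernel matrix is needed, never the feature map itself. For a linear kernel the dual form must agree exactly with ordinary ridge regression, which gives a quick check (the random data below is for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 10, 0.3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
x_new = rng.standard_normal(p)

# Primal ridge: f(x) = x^T (X^T X + lam I_p)^{-1} X^T y
primal = x_new @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Dual / kernel form with the linear kernel K = X X^T, kappa(x)_i = x^T x_i
K = X @ X.T
kappa = X @ x_new
dual = kappa @ np.linalg.solve(K + lam * np.eye(n), y)

print(np.isclose(primal, dual))  # True: both sides of the Woodbury identity agree
```

The same dual expression then works unchanged for any kernel k(x, x′) = φ(x)^T φ(x′), even when |H| is infinite.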


Centered KRR: Motivation

Inner-product kernels

k(x, x′) = φ(x)^T φ(x′) = g(x^T x′/p), x and x′ ∈ X.

Asymptotic growth regime

Assumption 1.

• p/n → c ∈ (0, ∞).

• E x_i = 0 and cov x_i = Σ uniformly bounded in p (e.g. x_i ∼ N(0, Σ)).

El Karoui 2010ᵃ

‖K − K_∞‖ →_{a.s.} 0,

with K_∞ = g(0) 11^T (norm O(p)) + g′(0) XX^T/p + constant(g, Σ) (norm O(1)).

ᵃN. El-Karoui, The Spectrum of Kernel Random Matrices, The Annals of Statistics, vol. 38, no. 1, pp. 1-50, 2010.


Centered KRR: Motivation

Centering with P = I_n − 11^T/n:

K_c = PKP.

In the expansion K_∞ = g(0) 11^T + g′(0) XX^T/p + constant(g, Σ), centering cancels the rank-one term g(0) 11^T of norm O(p), leaving only the informative O(1)-norm terms.


Centered KRR

P = I_n − 11^T/n.

Learning problem

min_{α_0, α} (1/2)‖y − Φα − α_0 1_n‖² + (λ/2)‖α‖²  ⇔  min_α (1/2)‖P(y − Φα)‖² + (λ/2)‖α‖²

α* = Φ^T P (PKP + λI_n)^{−1} (y − ȳ1_n),  with K_c ≜ PKP

f_c*(x) = κ_c(x)^T (K_c + λI_n)^{−1} Py + ȳ,

κ_c(x) = Pκ(x) − (1/n) PK1_n,  φ_c(x) = φ(x) − (1/n) Σ_{i=1}^{n} φ(x_i).

Centered KRR ∼ KRR with centered kernels
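The prediction formula above only needs kernel evaluations, so centered KRR is a few lines on top of ordinary KRR. A sketch (the linear kernel and random data are illustrative choices); one defining property, used as a check below, is that adding a constant to all training targets shifts every prediction by exactly that constant, since the intercept ȳ absorbs it:

```python
import numpy as np

def ckrr_predict(K, kappa, y, lam):
    """Centered KRR: f_c(x) = kappa_c(x)^T (K_c + lam I)^{-1} P y + mean(y)."""
    n = K.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n        # centering projector
    Kc = P @ K @ P
    kappa_c = P @ kappa - P @ K @ np.ones(n) / n
    return kappa_c @ np.linalg.solve(Kc + lam * np.eye(n), P @ y) + y.mean()

rng = np.random.default_rng(0)
n, p = 40, 5
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.1 * rng.standard_normal(n)
x_new = rng.standard_normal(p)

K = X @ X.T                # linear kernel
kappa = X @ x_new
f0 = ckrr_predict(K, kappa, y, lam=1.0)
f_shift = ckrr_predict(K, kappa, y + 5.0, lam=1.0)
print(np.isclose(f_shift - f0, 5.0))  # True: the intercept absorbs constant shifts
```

Plain KRR without centering does not have this exact invariance, which is one practical motivation for the centered formulation.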

What about the performance ?


CKRR: Performance

Performance metrics

R_train = (1/n) E_ε ‖f̂_c(X) − f(X)‖²₂,  R_test = E_{s∼D,ε} |f̂_c(s) − f(s)|².

Assumption 1 (growth rate). As p, n → ∞:

• Data scaling: p/n → c ∈ (0, ∞).

• Covariance scaling: lim sup_p ‖Σ‖ < ∞.

Assumption 2 (kernel function). E |g^{(3)}(x_i^T x_j/p)|^k < ∞.

Assumption 3 (data generating function).

E_{x∼N(0,Σ)} |f(x)|^k < ∞,  E_{x∼N(0,Σ)} ‖∇f(x)‖₂^k < ∞,  where ∇f(x) = {∂f(x)/∂x_l}_{l=1}^{p}.


CKRR: Limiting risk

Limiting riskᵃ. Let z = −(λ + g(τ) − g(0) − τg′(0))/g′(0) with τ = (1/p) tr Σ. Then

R_train − R∞_train →_{prob.} 0,  R_test − R∞_test →_{prob.} 0,

R∞_train = (cλm_z/g′(0))² · [n(1 + m_z)²(σ² + var f) − n m_z(2 + m_z)‖E[∇f]‖²] / [n(1 + m_z)² − p m_z²] + σ² − 2σ² cλm_z/g′(0),

R∞_test = [n(1 + m_z)²(σ² + var f) − n m_z(2 + m_z)‖E[∇f]‖²] / [n(1 + m_z)² − p m_z²] − σ².

ᵃK. Elkhalil, A. Kammoun, X. Zhang, M.-S. Alouini and T. Al-Naffouri. Risk Convergence of Centered Kernel Ridge Regression with Large Dimensional Data. Submitted to IEEE Trans. Signal Processing.

Bad news ☹: the minimum prediction risk is achieved by all kernels — it is equivalent to a linear kernel!

Good news ☺: the kernel and the regularizer can be jointly optimized!

46
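A small numerical sketch of the quantities entering these expressions, under illustrative assumptions (exponential kernel g(u) = exp(u), Σ = I_p): the effective regularization z is negative, so the resolvent defining m_z is well defined, and m_z is positive.

```python
import numpy as np

# Sketch: effective regularization z and the empirical Stieltjes-type quantity
# m_z = (1/p) tr (XX^T/p - z I_n)^{-1}, for g(u) = exp(u) and Sigma = I_p
# (illustrative assumptions).
rng = np.random.default_rng(2)
p, n, lam = 100, 200, 1.0
Sigma = np.eye(p)
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T

tau = np.trace(Sigma) / p
# g(u) = exp(u): g(0) = g'(0) = 1, so z = -(lam + e^tau - 1 - tau)
z = -(lam + np.exp(tau) - 1 - tau)

m_z = np.trace(np.linalg.inv(X @ X.T / p - z * np.eye(n))) / p
print(f"z = {z:.4f}, m_z = {m_z:.4f}")
```

For a linear kernel g(u) = αu + β, the correction g(τ) − g(0) − τg'(0) vanishes and z = −λ/α, consistent with the remark that all kernels behave like the linear one at the risk-minimizing point.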


CKRR: Consistent estimator of the prediction risk

Interesting relation between R∞_train and R∞_test:

R∞_test = (cλ m_z / g'(0))^{-2} R∞_train − σ² (g'(0) / (cλ m_z) − 1)².

Consistent estimator of R_test:

R̂_test = (cλ m_z / g'(0))^{-2} R_train − σ² (g'(0) / (cλ m_z) − 1)²,

m_z = (1/p) tr (XX^T/p − z I_n)^{-1}.

Issues with λ small §

47


CKRR: Consistent estimator of the prediction risk

Consistent estimator of R_test:

R̂_test = 1/(c z m_z)² [ (1/(np)) y^T P X (z Q_z² − Q_z) X^T P y + var(y) ] − σ²,

Q_z = (X^T P X / p − z I_p)^{-1}.

More stable with respect to λ ©

R*_test = min_{z ∉ Supp{XX^T/p}} R̂_test(z),   z* = −(λ* + g*(τ) − g*(0) − τ g'*(0)) / g'*(0).

47
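A sketch of how this estimator can be evaluated and minimized on a grid of z < 0 (outside the support of XX^T/p, which is positive semi-definite). The data-generating setup is the synthetic one from the next slide; the grid bounds are an assumption for illustration.

```python
import numpy as np

# Sketch: evaluate R_hat(z) on a grid of negative z and pick the minimizer.
rng = np.random.default_rng(3)
p, n, sigma2 = 100, 200, 0.05
c = p / n
X = rng.standard_normal((n, p))
y = np.sin(X.sum(axis=1) / np.sqrt(p)) + np.sqrt(sigma2) * rng.standard_normal(n)
P = np.eye(n) - np.ones((n, n)) / n

def risk_hat(z):
    # Q_z = (X^T P X / p - z I_p)^{-1}; invertible for z < 0 since X^T P X is PSD
    Qz = np.linalg.inv(X.T @ P @ X / p - z * np.eye(p))
    m_z = np.trace(np.linalg.inv(X @ X.T / p - z * np.eye(n))) / p
    quad = y @ P @ X @ (z * Qz @ Qz - Qz) @ X.T @ P @ y / (n * p)
    return (quad + y.var()) / (c * z * m_z) ** 2 - sigma2

z_grid = np.linspace(-5.0, -0.5, 20)     # assumed grid, all z < 0
risks = [risk_hat(z) for z in z_grid]
z_star = z_grid[int(np.argmin(risks))]
print(f"estimated optimal z: {z_star:.3f}")
```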


CKRR: Experiments

Kernels

• Linear kernel: k(x, x') = α x^T x'/p + β.

• Polynomial kernel: k(x, x') = (α x^T x'/p + β)^d.

• Sigmoid kernel: k(x, x') = tanh(α x^T x'/p + β).

• Exponential kernel: k(x, x') = exp(α x^T x'/p + β).

Synthetic data

• x = Σ^{1/2} z with z = {z_i}_{i=1}^p, E z_i = 0, var z_i = 1 and E z_i^k = O(1).

• Generating function: f(x) = sin(1^T x / √p).

Real data

• Communities and Crime dataset.

• p = 122, n_train = 73 and n_test = 50.

• Prediction risk is computed by averaging over 500 data shufflings.

48
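The four kernel families above can be written down directly; the parameter defaults below match the settings used in the experiment figures (α = 1 throughout, β and d per panel).

```python
import numpy as np

# Sketch: the four inner-product kernel families from the experiments.
def linear(x, xp, alpha=1.0, beta=1.0):
    return alpha * x @ xp / len(x) + beta

def polynomial(x, xp, alpha=1.0, beta=1.0, d=2):
    return (alpha * x @ xp / len(x) + beta) ** d

def sigmoid(x, xp, alpha=1.0, beta=-1.0):
    return np.tanh(alpha * x @ xp / len(x) + beta)

def exponential(x, xp, alpha=1.0, beta=0.0):
    return np.exp(alpha * x @ xp / len(x) + beta)

x = np.ones(4)
# For x = 1_p: x^T x / p = 1, so the linear kernel with beta = 0 returns 1.
assert np.isclose(linear(x, x, alpha=1.0, beta=0.0), 1.0)
```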


CKRR: Synthetic data

[Figure: CKRR risk versus the regularization parameter λ ∈ [0, 20] for the linear (α = 1, β = 1), polynomial (α = 1, β = 1, d = 2), sigmoid (α = 1, β = −1) and exponential (α = 1, β = 0) kernels, on Gaussian and Bernoulli data with (p, n) = (100, 200) and (p, n) = (200, 100). Curves: Rtest, R∞test (Theorem 1), Rtest (Theorem 2), Rtest (Lemma 1), Rtrain and R∞train (Theorem 1).]

49


CKRR: Synthetic data

[Figure 4.1: CKRR testing and training risk with respect to the regularization parameter λ on Gaussian data (x ∼ N(0_p, {0.4^{|i−j|}}_{i,j})), n = 200 training samples and p = 100 predictors.]

50


CKRR: Real data

[Figure 4.2: CKRR risk with respect to λ on the Communities and Crime dataset, where independent zero-mean Gaussian noise samples with variance σ² = 0.05 are added to the true response.]

51


Conclusion

52


Conclusion

• Random matrix theory is a powerful tool that has been applied with success to the fields of wireless communications and signal processing, providing solutions to very challenging problems.

• High dimensionality along with stochasticity are the sole prerequisites of this tool.

• Successful application of this tool has been demonstrated in the context of RDA.

• Fundamental limits of centered kernel ridge regression have been characterized.

53


Future research directions

54


Future research directions

• We can also consider the performance analysis of kernel LDA/QDA.

• Extend the analysis to homogeneous kernels.

55


Important results on homogeneous kernel matrices

• φ(x) is a fixed nonlinear feature-space mapping. The kernel function is given by k(x, x') = φ(x)^T φ(x').

• Homogeneous kernels: k(x, x') = f(‖x − x'‖²/p).

• {K}_{i,j} = k(x_i, x_j).

Theorem (Spectrum of kernel random matrices, El Karoui 2010)^10 [Informal statement]

K̃ = f(τ) 11^T + f'(τ) W + f''(τ) Q,   ‖x − x'‖²/p →a.s. τ,

‖K − K̃‖ →p 0. (7)

This might help to analyze the performance of some kernel methods in regression or classification.

^10 N. El Karoui, The spectrum of kernel random matrices, The Annals of Statistics, Volume 38, Number 1 (2010), pp. 1-50.

56
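The theorem rests on the concentration of the pairwise distances ‖x_i − x_j‖²/p around the constant τ, which is what licenses the Taylor expansion of f around τ. A minimal sketch of that concentration, under the illustrative assumption x_i ∼ N(0, I_p) independent (so τ = 2 tr(Σ)/p = 2):

```python
import numpy as np

# Sketch: pairwise squared distances ||x_i - x_j||^2 / p concentrate around
# tau = 2 for i != j when x_i ~ N(0, I_p) (assumed setup for illustration).
rng = np.random.default_rng(4)
p, n = 2000, 50
X = rng.standard_normal((n, p))          # Sigma = I_p, so tau = 2

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) / p
off_diag = sq_dists[~np.eye(n, dtype=bool)]
max_dev = np.abs(off_diag - 2.0).max()   # O(p^{-1/2}) fluctuations
print(f"max deviation from tau = 2: {max_dev:.3f}")
```

Because every off-diagonal entry sits within O(p^{-1/2}) of τ, f can be expanded entrywise around τ, giving the K̃ approximation in the theorem.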
