Development of robust scatter estimators under independent contamination model

C. Agostinelli¹, A. Leung², V.J. Yohai³ and R.H. Zamar²

¹ Università Ca' Foscari di Venezia, ² University of British Columbia, and ³ Universidad de Buenos Aires and CONICET

Mar 16, 2013

C. Agostinelli1, A. Leung2,, V.J. Yohai3 and R.H. Zamar2 Development of robust scatter estimators under independent contamination model

Some declarations

- To math geeks: I am sorry, but I will keep math equations and theorems to a minimum in my talk today (come on, it is 9 am!)


Objective of the day

Objective: robust estimation of the (location and) scatter matrix for a data set with n observations and p continuous variables.


What is contamination?

Perhaps the most classical contamination model is the Huber-Tukey contamination model (HTCM) (Tukey in 1960, Huber in 1964), which was originally for 1-D data...

Contamination is row-wise, e.g.

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]  0.9 -2.8 -2.1 -0.8 -2.4  1.3  2.7  3.4  0.9  -0.1
[2,] -2.4  2.3 -1.8 -3.0  1.9  1.0 -0.5  0.4 -2.8  -1.5
[3,]  0.7 -2.3 -0.6  2.9 -1.5 -0.8  2.9  0.0 -2.6   1.8
[4,]  1.0  1.9  1.6  1.1  0.0 -2.2  1.0 -4.1  2.2  -0.9
[5,]  0.1 -1.0  1.8  2.2 -0.1  2.1 -1.3  3.1  1.2   1.0
[6,]  1.7  3.0  0.6  0.9 -1.4  1.9 -0.3 -0.4 -0.4   1.7
[7,] -0.8  1.0  2.5  3.9 -2.8  2.5 -0.3 -0.9  2.6   2.4


What is contamination?

HTCM in math notation:

x* = (1 − u) x + u c

where
- x = (x_1, ..., x_p) ∼ N(µ, Σ)
- c ∼ “something”
- u ∼ Bin(1, ε), 0 ≤ ε < 1/2
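To make the row-wise mechanism concrete, here is a minimal Python sketch of the HTCM above. It is only an illustration under stated assumptions: the clean data are taken as standard normal with Σ = I, the "something" distribution of c is replaced by a hypothetical point mass at a fixed value, and the function name `contaminate_rows` and its parameters are not from the talk.

```python
import random


def contaminate_rows(n, p, eps, outlier_value=10.0, seed=0):
    """Simulate x* = (1 - u) x + u c with one Bernoulli(eps) draw per row.

    Assumptions (not from the talk): x ~ N(0, I) and c is a point mass
    at `outlier_value` in every coordinate.
    """
    rng = random.Random(seed)
    data, flags = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]  # clean observation
        u = 1 if rng.random() < eps else 0           # u ~ Bin(1, eps)
        c = [outlier_value] * p                      # contaminating row
        # When u = 1 the whole row is replaced, otherwise it is kept intact.
        data.append([(1 - u) * xj + u * cj for xj, cj in zip(x, c)])
        flags.append(u)
    return data, flags


data, flags = contaminate_rows(n=30, p=3, eps=0.20)
print(f"{sum(flags)} of {len(flags)} rows are contaminated")
```

A row is either entirely clean or entirely replaced, which is exactly what "row-wise" means here.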


New contamination model

HTCM may not be realistic...
- outliers are more likely to happen in certain variables, independent of others
- what if p is large but n is of moderate to small size?
- what if every single observation has one contaminated component?

Alqallaf, Van Aelst, Yohai and Zamar (2006) proposed a new contamination model...

Cell-wise contamination model


New contamination model

Contamination is cell-wise, e.g.

      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
[1,]  2.69  2.10  4.59  2.13 -1.09  2.72 -0.72  0.47 -1.42 -1.90
[2,]  2.92  2.20 -1.70 -1.83 -1.05  4.89  0.32 -1.93 -2.59 -2.48
[3,] -0.75  0.53 -3.22  3.07  4.04 -1.39 -0.26  0.44  0.05  2.14
[4,] -2.35  4.46 -0.99 -0.41  0.68 -2.79  1.37  1.74  1.35  1.78
[5,] -1.09 -2.77  4.59 -2.78 -0.97  1.35  4.10 -0.56  3.79 -0.11
[6,] -1.94 -0.33 -0.40 -3.22  1.32  0.24 -1.89  1.02  2.60  4.54

In math notation, the model is

x* = (I − U) x + U c

where x = (x_1, ..., x_p) and c are the same as before, I is the p × p identity matrix, and

U = diag(u_1, ..., u_p), where the u_i ∼ Bin(1, ε) are independent, 0 ≤ ε < 1/2
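The cell-wise model differs from HTCM only in that each cell gets its own Bernoulli draw. The sketch below illustrates this under assumptions that are not from the talk: clean data are N(0, I), c is a hypothetical point mass at a fixed value, and `contaminate_cells` is an illustrative name.

```python
import random


def contaminate_cells(n, p, eps, outlier_value=10.0, seed=0):
    """Simulate x* = (I - U) x + U c with U = diag(u_1, ..., u_p).

    Assumptions (not from the talk): x ~ N(0, I) and c is a point mass
    at `outlier_value`; each u_j is an independent Bernoulli(eps) draw.
    """
    rng = random.Random(seed)
    data, masks = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        u = [1 if rng.random() < eps else 0 for _ in range(p)]  # diagonal of U
        data.append([(1 - uj) * xj + uj * outlier_value for xj, uj in zip(x, u)])
        masks.append(u)
    return data, masks


data, masks = contaminate_cells(n=1000, p=10, eps=0.05)
cell_fraction = sum(map(sum, masks)) / (1000 * 10)
row_fraction = sum(1 for u in masks if any(u)) / 1000
print(f"contaminated cells: {cell_fraction:.3f}, rows with >= 1 bad cell: {row_fraction:.3f}")
```

Comparing the two printed fractions shows how a small share of bad cells spreads over a much larger share of rows.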


Existing robust scatter estimators

Under HTCM, we have...
- Minimum Volume Ellipsoid (MVE) (Rousseeuw, 1985)
- Minimum Covariance Determinant (MCD) (Rousseeuw, 1985)
- S-estimator (Davies, 1987)
- MM-estimator (Yohai, 1987; Tatsuoka and Tyler, 2000)
- modified GK estimator (Maronna and Zamar, 2002)
- ...

Let’s look at how these existing robust scatter estimators (e.g. MVE, S-est., MM-est.) perform under HTCM and cell-wise contamination.


HTCM

Let’s first illustrate through mini examples and diagrams:
- p = 3, n = 30, ε = 0.20, random covariance matrix, origin center, normal
- 95% conf. ellipsoids: MLE-clean (blue), MLE (yellow), MVE (green), S-est. (red), MM-est. (gray)


Davies’ S-estimator

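Definition (Davies, 1987): For µ ∈ R^p and positive definite Σ, the S-estimator is

\[ (\hat{\mu}, \tilde{\Sigma}) = \arg\min s(\mu, \Sigma), \qquad \hat{\Sigma} = s^{*} \tilde{\Sigma}, \]

where s(µ, Σ) is the solution s to

\[ \frac{1}{n} \sum_{i=1}^{n} \rho\!\left( \frac{(x_i - \mu)^T \Sigma^{-1} (x_i - \mu)\, |\Sigma|^{1/p}}{s} \right) = \frac{1}{2}, \]

with ρ(·) some bounded monotone loss function that must satisfy

\[ E\!\left[ \rho\!\left( \frac{\|X\|^2}{c} \right) \right] = \frac{1}{2}. \]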
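For a fixed (µ, Σ), the scale s(µ, Σ) in this definition is just the root of a one-dimensional equation. The following Python sketch solves it by bisection. The bisquare-type ρ, the default constant c = 1 and the names `rho_bisquare` and `m_scale` are illustrative assumptions, not choices stated in the talk, and the input is assumed to be a list of already computed squared distances.

```python
def rho_bisquare(t):
    """Bounded, monotone loss on t >= 0: 1 - (1 - t)^3 for t <= 1, else 1."""
    return 1.0 if t >= 1.0 else 1.0 - (1.0 - t) ** 3


def m_scale(distances, c=1.0, b=0.5, tol=1e-10):
    """Solve mean(rho(d_i / (s * c))) = b for s by bisection.

    `distances` are non-negative squared Mahalanobis distances for a fixed
    (mu, Sigma). The bisquare rho and the default c = 1 are illustrative
    assumptions, not choices stated in the talk.
    """
    def mean_rho(s):
        return sum(rho_bisquare(d / (s * c)) for d in distances) / len(distances)

    # mean_rho is non-increasing in s, so bracket the root and bisect.
    lo, hi = 1e-12, max(distances) / c + 1.0
    while mean_rho(hi) > b:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if mean_rho(mid) > b:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


d = [0.3, 0.8, 1.1, 1.9, 2.4, 3.0, 50.0]  # last value mimics an outlier
s = m_scale(d)
print(s, m_scale(d[:-1] + [5000.0]))
```

Because ρ is bounded, pushing the outlying distance from 50 to 5000 leaves the scale essentially unchanged, which is the source of the estimator's robustness to row-wise outliers.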


MM-estimator (a two-stage estimator)

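Definition: For µ ∈ R^p and positive definite Σ, the MM-estimator is

\[ (\hat{\mu}, \hat{\Sigma}) = \arg\min J(\mu, \Sigma) \]

where

\[ J(\mu, \Sigma) = \frac{1}{n} \sum_{i=1}^{n} \rho_2\!\left( \frac{(x_i - \mu)^T \Sigma^{-1} (x_i - \mu)\, |\Sigma|^{1/p}}{s_n} \right), \]

with ρ2(·) being a different loss function, i.e. ρ2(·) ≤ ρ1(·), and s_n being the scale from the S-estimate.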


Cell-wise contamination

- p = 3, n = 30, ε = 0.20, random covariance matrix, origin center, normal
- 95% conf. ellipsoids: MLE-clean (blue), MLE (yellow), MVE (green), S-est. (red), MM-est. (gray)


Composite S-estimator

MVE, S- and MM-estimators perform very badly under cell-wise contamination...

Note that in our cell-wise contamination example, P(≥ 1 variable is contaminated) = 1 − (1 − ε)^p = 0.488.

In fact, all affine equivariant estimators for covariance collapse under cell-wise contamination (Alqallaf et al., 2009)!

We need to develop a new estimator...

Composite-S estimator (CSE)

...but this estimator is not affine equivariant, and affine equivariance is what saves an estimator from failing under HTCM!
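The probability quoted on this slide is simple arithmetic: with p = 3 and ε = 0.20, 1 − 0.8³ = 1 − 0.512 = 0.488. The short Python sketch below reproduces it and shows how fast it grows with p; the function name is illustrative, and the values of p other than 3 are hypothetical.

```python
def prob_row_contaminated(eps, p):
    """P(at least one of p cells is contaminated) = 1 - (1 - eps)^p,
    assuming cells are contaminated independently with probability eps."""
    return 1.0 - (1.0 - eps) ** p


# The talk's mini example: eps = 0.20, p = 3.
print(round(prob_row_contaminated(0.20, 3), 3))  # 0.488

# The same cell-level eps quickly affects most rows as p grows.
for p in (3, 5, 10, 20):
    print(p, round(prob_row_contaminated(0.20, p), 3))
```

Once this probability passes 1/2, more than half of the rows contain at least one bad cell, which is beyond what row-wise estimators are designed to tolerate.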


Composite S-estimator

In short, CSE attempts to minimize the size of the covariance (e.g. “ellipses”) for each pair of variables simultaneously, instead of for all variables jointly.

It tries to downweight bivariate Mahalanobis distances, instead of the full p-variate ones, when constructing the covariance matrix.

Now let’s look at an example; we will get back to its definition later...


Composite S-estimator

Example: p = 5, n = 100, ε = 0.10, random covariance matrix, origin center, normal, cell-wise contam.

95% confidence region based on Davies’ S-estimator vs true covariance:

[Figure: scatter plot matrix of V1 to V5. Legend: true, S-est]


Composite S-estimator

Example: p = 5, n = 100, ε = 0.10, random covariance matrix, origin center, normal, cell-wise contam.

95% confidence region based on CSE:

[Figure: scatter plot matrix of V1 to V5. Legend: true, CSE]


Composite S-estimator

Example: p = 5, n = 100, ε = 0.10, random covariance matrix, origin center, normal, cell-wise contam.

95% confidence region based on CSE versus S-est. based on each pair:

[Figure: scatter plot matrix of V1 to V5. Legend: true, CSE, Pairwise-S]


Composite S-estimator

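Definition (CSE): For a given robust initial estimator Ω_0,

\[ (\hat{\mu}, \tilde{\Sigma}) = \arg\min s(\mu, \Sigma, \Omega_0), \qquad \hat{\Sigma} = s^{*} \tilde{\Sigma}, \]

where s(µ, Σ, Ω_0) is the solution s to

\[ \frac{2}{p(p-1)n} \sum_{i=1}^{n} \sum_{k=1}^{p-1} \sum_{j=k+1}^{p} \rho\!\left( \frac{d_i^{jk}(\mu, \Sigma)}{s\, c_0} \cdot \frac{|\Sigma^{jk}|^{1/2}}{|\Omega_0^{jk}|^{1/2}} \right) = \frac{1}{2}, \]

where

\[ d_i^{jk}(\mu, \Sigma) = (x_i^{jk} - \mu^{jk})^T (\Sigma^{jk})^{-1} (x_i^{jk} - \mu^{jk}) \]

is the bivariate Mahalanobis distance, and c_0 must satisfy the same criterion as in Davies’ S-estimator, but in the bivariate case.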
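The building block of this definition is the bivariate Mahalanobis distance, computed for every pair of variables. Below is a minimal Python sketch of that step only (not of the full CSE fit). The function name, the 0-based pair indices and the example observation with a single outlying cell are illustrative assumptions.

```python
from itertools import combinations


def bivariate_mahalanobis(x, mu, sigma):
    """Squared Mahalanobis distance of every pair of coordinates (j, k), j < k.

    For each pair, only the 2x2 submatrix of `sigma` and the two matching
    entries of `x` and `mu` are used. Returns {(j, k): distance}.
    """
    out = {}
    for j, k in combinations(range(len(x)), 2):
        a, b = x[j] - mu[j], x[k] - mu[k]
        sjj, skk, sjk = sigma[j][j], sigma[k][k], sigma[j][k]
        det = sjj * skk - sjk * sjk  # |Sigma^{jk}|, must be positive
        # Closed form of (a, b) Sigma^{jk}^{-1} (a, b)^T for a 2x2 matrix.
        out[(j, k)] = (a * a * skk - 2.0 * a * b * sjk + b * b * sjj) / det
    return out


# Hypothetical observation with one outlying cell (third coordinate).
x = [0.2, -0.5, 9.0, 0.1]
mu = [0.0, 0.0, 0.0, 0.0]
sigma = [[1.0 if i == j else 0.5 for j in range(4)] for i in range(4)]
for pair, d in bivariate_mahalanobis(x, mu, sigma).items():
    print(pair, round(d, 2))
```

Only the pairs that contain the bad cell get large distances, so a bounded ρ can downweight those pairs while the clean pairs of the same row still contribute.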


Composite MM-estimator

CSE in general is robust under cell-wise contamination but not efficient.

Efficiency measures the variability of the estimate relative to some gold standard, such as the MLE, under no contamination.

We use the corresponding MM-version (Tatsuoka and Tyler, 2000) of CSE to achieve efficiency.


Composite S- and MM-estimator

Both have a very nice but complex estimation procedure that is closely linked with the S-estimator for missing data (Danilov et al., 2012), but we will not describe it here.


Some results shown in ICORS 2012

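We performed a Monte Carlo study to assess the behavior of the proposed estimators.

Simulation setting:
- x ∼ N(0, Σ0), some n and p
- Σ0 is an exchangeable correlation matrix, i.e.

\[ \Sigma_0 = \begin{pmatrix} 1 & r & \cdots & r \\ r & 1 & \cdots & r \\ \vdots & \vdots & \ddots & \vdots \\ r & \cdots & 1 & r \\ r & \cdots & r & 1 \end{pmatrix} \]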


Some results shown in ICORS 2012

Here we show some results for

- Correlations: r = 0.5 and r = 0.9
- p = 10 and n = 100.
- p = 20 and n = 200.


Some results shown in ICORS 2012

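Performance criteria:

1. Likelihood ratio test distance (LRT) for robustness evaluation

\[ \bar{D}(\hat{\Sigma}, \Sigma_0) = \frac{1}{N} \sum_{i=1}^{N} D(\hat{\Sigma}_i, \Sigma_0) \]

where

\[ D(\Sigma, \Sigma_0) = \operatorname{trace}(\Sigma_0^{-1} \Sigma) - \log(\det(\Sigma_0^{-1} \Sigma)) - p \]

2. Relative efficiency based on LRT values for efficiency evaluation

\[ \bar{D}(\hat{\Sigma}_{MLE}, \Sigma_0) \,/\, \bar{D}(\hat{\Sigma}, \Sigma_0) \]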
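The two criteria are easy to compute once an estimate is available. Here is a hedged NumPy sketch of the LRT distance and its Monte Carlo average on clean Gaussian data with an exchangeable Σ0. The function names, the number of replications and the use of the sample covariance as the "MLE" are assumptions made for illustration, not the study's actual code.

```python
import numpy as np


def exchangeable_corr(p, r):
    """p x p matrix with ones on the diagonal and r everywhere else."""
    return (1.0 - r) * np.eye(p) + r * np.ones((p, p))


def lrt_distance(sigma_hat, sigma0):
    """D(Sigma, Sigma0) = trace(Sigma0^{-1} Sigma) - log det(Sigma0^{-1} Sigma) - p."""
    p = sigma0.shape[0]
    m = np.linalg.solve(sigma0, sigma_hat)  # Sigma0^{-1} Sigma without an explicit inverse
    sign, logdet = np.linalg.slogdet(m)
    return float(np.trace(m) - logdet - p)


def average_lrt(estimator, sigma0, n, n_rep=100, seed=0):
    """Monte Carlo average of D over n_rep clean Gaussian samples of size n."""
    rng = np.random.default_rng(seed)
    p = sigma0.shape[0]
    total = 0.0
    for _ in range(n_rep):
        x = rng.multivariate_normal(np.zeros(p), sigma0, size=n)
        total += lrt_distance(estimator(x), sigma0)
    return total / n_rep


sigma0 = exchangeable_corr(p=10, r=0.5)
mle = lambda x: np.cov(x, rowvar=False, bias=True)  # Gaussian MLE of scatter
print(average_lrt(mle, sigma0, n=100))
```

The relative efficiency of a robust estimator would then be `average_lrt(mle, ...) / average_lrt(robust, ...)`, with `robust` any function mapping a data matrix to a scatter estimate.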


Monte Carlo results

Gaussian Efficiency Without Outliers

                 p = 10, n = 100      p = 20, n = 200
ESTIMATES        r = 0.5   r = 0.9    r = 0.5   r = 0.9
S-est              0.91      0.90       0.96      0.96
Pairwise-S         0.25      0.45       0.36      0.37
CSE                0.70      0.50       0.74      0.44
CMME               0.74      0.78       0.81      0.60


Monte Carlo results

n = 100, p = 10, ε = 10%

[Figure: average LRT distance versus outlier size (5 to 20) under 10% contamination (n = 100, p = 10). Panels: Corr. = 0.5 ICM, Corr. = 0.9 ICM, Corr. = 0.5 THCM, Corr. = 0.9 THCM. Legend: Pairwise-S, CS (QC), Classical-S, CMM (QC)]


Remarks and conclusion

- In general, CSE (and CMME) are very robust under cell-wise contamination.
- We have seen that CSE (and CMME) do not perform very well under HTCM.
- Our goal is to have an estimator highly robust under both HTCM and cell-wise contamination (we are ambitious!)
- ...while efficiency is our second priority

To be continued....


Acknowledgement

Special thanks to Professor R. Zamar and Professor V. Yohai!

Prof. Zamar Prof. Yohai

...AND THANK YOU FOR LISTENING!
