Robust estimation via (non)convex M-estimation
Po-Ling Loh
University of Wisconsin–Madison, Departments of ECE & Statistics

Workshop on computational efficiency and high-dimensional robust statistics
TTI Chicago
August 15, 2018
Outline

1. Regularized M-estimators
   - Statistical M-estimation
   - Nonconvexity
   - Consistency of local optima
2. High-dimensional robust regression
   - Statistical consistency
   - Asymptotic normality
   - Two-step M-estimators
Regularized M-estimators

Prediction/regression problem: Observe {(x_i, y_i)}_{i=1}^n and estimate

    β∗ = arg min_{β ∈ R^p} E[ℓ(β; x_i, y_i)],    x_i ∈ R^p, y_i ∈ R

Statistical M-estimator:

    β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n ℓ(β; x_i, y_i) }

In high dimensions, this may be ill-conditioned, with a large solution space.

Regularized M-estimator:

    β̂ ∈ arg min_β { Ln(β) + ρλ(β) },    where Ln(β) := (1/n) ∑_{i=1}^n ℓ(β; x_i, y_i)
Example: ℓ1-regularized OLS regression

Linear model: y_i = x_iᵀβ∗ + ε_i,  ‖β∗‖_0 ≤ k

Low-dimensional M-estimator:

    β̂_OLS ∈ arg min_β { (1/n) ∑_{i=1}^n (y_i − x_iᵀβ)² } = ( (1/n) ∑_{i=1}^n x_i x_iᵀ )^{−1} ( (1/n) ∑_{i=1}^n y_i x_i )

High-dimensional regularized M-estimator:

    β̂_Lasso ∈ arg min_β { (1/n) ∑_{i=1}^n (y_i − x_iᵀβ)² + λ‖β‖_1 }
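As a concrete reference point, here is a minimal proximal-gradient (ISTA) sketch of the Lasso objective above. The step-size rule, iteration count, and simulated data are illustrative assumptions, not part of the talk.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding: the prox of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/n)||y - X b||_2^2 + lam * ||b||_1 by proximal gradient."""
    n, p = X.shape
    # Step size 1/L, where L = 2 * lambda_max(X^T X / n) bounds the smooth part.
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 / n)
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Toy example: sparse beta*, Gaussian design (illustrative sizes).
rng = np.random.default_rng(0)
n, p, k = 200, 500, 5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:k] = 1.0
y = X @ beta_star + 0.5 * rng.standard_normal(n)
beta_hat = lasso_ista(X, y, lam=0.1)
print(np.linalg.norm(beta_hat - beta_star))
```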
Sources of nonconvexity

May arise in the loss or the regularizer:
- Nonconvex loss: used to correct bias, increase efficiency
- Nonconvex regularizer: used to reduce bias, achieve oracle results
Example: Errors-in-variables regression

Model:

    y_i = x_iᵀβ∗ + ε_i

Observe {(z_i, y_i)}_{i=1}^n, where the z_i's are corrupted versions of the covariates x_i; infer β∗.

The naive ℓ1-penalized least-squares estimator

    β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n (z_iᵀβ − y_i)² + λ‖β‖_1 }

is statistically inconsistent.
Corrected losses

L. & Wainwright ’12 propose a natural method for correcting the loss for linear regression:

    β̂_OLS ∈ arg min_β { (1/2) βᵀ (XᵀX/n) β − (yᵀX/n) β + ρλ(β) }

    β̂_corr ∈ arg min_β { (1/2) βᵀ Γ̂ β − γ̂ᵀ β + ρλ(β) }

where (Γ̂, γ̂) are estimators for (Cov(x_i), Cov(x_i, y_i)) based on {(z_i, y_i)}_{i=1}^n.
Example: Additive noise

Additive noise: Z = X + W; use

    Γ̂ = ZᵀZ/n − Σ_w,    γ̂ = Zᵀy/n

However, the corrected objective is nonconvex:

    β̂_corr ∈ arg min_β { (1/2) βᵀ (ZᵀZ/n − Σ_w) β − (yᵀZ/n) β + ρλ(β) }

Fortunately, local optima have good properties.
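A minimal sketch of this corrected estimator with an ℓ1 penalty, solved by composite gradient descent (gradient step plus soft-thresholding). The simulated data, fixed step size, and the omission of the side constraint ‖β‖_1 ≤ R used in the theory are simplifying assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def corrected_lasso(Z, y, Sigma_w, lam, step=1e-2, n_iter=2000):
    """Composite gradient descent on
       (1/2) b^T Gamma b - gamma^T b + lam * ||b||_1,
    with Gamma = Z^T Z / n - Sigma_w (possibly indefinite)."""
    n, p = Z.shape
    Gamma = Z.T @ Z / n - Sigma_w
    gamma = Z.T @ y / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = Gamma @ beta - gamma   # gradient of the quadratic part
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Toy example with additive measurement noise W (illustrative parameters).
rng = np.random.default_rng(1)
n, p, k, sigma_w = 300, 100, 5, 0.3
X = rng.standard_normal((n, p))
Z = X + sigma_w * rng.standard_normal((n, p))   # observed, corrupted covariates
beta_star = np.zeros(p); beta_star[:k] = 1.0
y = X @ beta_star + 0.1 * rng.standard_normal(n)
Sigma_w = sigma_w**2 * np.eye(p)
print(np.linalg.norm(corrected_lasso(Z, y, Sigma_w, lam=0.1) - beta_star))
```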
Nonconvex regularizers

ℓ1 is a "convexified" version of ℓ0:

[Figure: the ℓ0 penalty (0 at the origin, 1 elsewhere) vs. the ℓ1 penalty (absolute value)]

But ℓ1 penalizes larger coefficients more, causing solution bias.
Alternative regularizers

Various nonconvex regularizers appear in the literature (Fan & Li ’01, Zhang ’10, etc.)

[Plot: SCAD, MCP, ℓq, capped-ℓ1, and ℓ1 penalty functions]
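For reference, a sketch evaluating two of these penalties. The closed forms below are the standard SCAD (Fan & Li ’01) and MCP (Zhang ’10) formulas; the parameter defaults are illustrative.

```python
import numpy as np

def scad(t, lam, gamma=3.7):
    """SCAD penalty of Fan & Li '01 (requires gamma > 2)."""
    a = np.abs(t)
    small = a <= lam
    mid = (a > lam) & (a <= gamma * lam)
    return np.where(small, lam * a,
           np.where(mid, (2 * gamma * lam * a - a**2 - lam**2) / (2 * (gamma - 1)),
                    lam**2 * (gamma + 1) / 2))

def mcp(t, lam, gamma=3.0):
    """MCP penalty of Zhang '10 (requires gamma > 1)."""
    a = np.abs(t)
    return np.where(a <= gamma * lam, lam * a - a**2 / (2 * gamma),
                    gamma * lam**2 / 2)

ts = np.linspace(-4, 4, 9)
print(scad(ts, lam=1.0))
print(mcp(ts, lam=1.0))
```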
Empirical benefits

Nonconvex regularizers show significant empirical improvement (Breheny & Huang ’11).
Local vs. global optima

- Optimization algorithms are only guaranteed to find local optima (stationary points)
- Statistical theory only guarantees consistency of global optima

[Figure: global optimum β̂ and stationary point β̃ both within O(√(k log p / n)) of β∗]

L. & Wainwright ’13: All stationary points of Ln(β) + ρλ(β) are close when the nonconvexity is smaller than the curvature.
Measures of closeness

Various measures of statistical consistency for

    β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n ℓ(β; x_i, y_i) + ρλ(β) }

- Estimation: ‖β̂ − β∗‖ → 0
- Prediction: (1/n) ∑_{i=1}^n ℓ(β̂; x_i, y_i) → 0
- Variable selection: supp(β̂) → supp(β∗)

Interested in cases where ℓ and ρλ are possibly nonconvex.
Estimation/prediction consistency

Composite objective function

    β̂ ∈ arg min_{‖β‖_1 ≤ R} { Ln(β) + ∑_{j=1}^p ρλ(β_j) }

- Ln satisfies restricted strong convexity with curvature α
- ρλ has a bounded subgradient at 0, and ρλ(t) + µt² is convex

L. & Wainwright ’13: All stationary points of Ln(β) + ρλ(β) are close when α > µ.
Geometric intuition

Population-level convexity, finite-sample nonconvexity.

[Figure: function view (Ln near β∗) and gradient view (∇Ln near β∗)]

- Population-level objective L is strongly convex, α > µ
- RSC quantifies the convergence rate of ∇Ln → ∇L
More formally

[Figure: global optimum β̂ and stationary point β̃ both within O(√(k log p / n)) of β∗]

Stationary points are statistically indistinguishable from global optima:

    ⟨∇Ln(β̃) + ∇ρλ(β̃), β − β̃⟩ ≥ 0,    for all feasible β

Nonasymptotic rates: For λ ≍ √(log p / n) and R ≍ 1/λ,

    ‖β̃ − β∗‖_2 ≤ c √(k log p / n) ≈ statistical error
Technical conditions

Requirements on the loss and regularizer to ensure consistency of stationary points:
- Restricted strong convexity of Ln
- Bound on the nonconvexity of ρλ

"Oracle" result under an additional condition on ρλ.
Conditions on Ln

Restricted strong convexity (Negahban et al. ’12):

    ⟨∇Ln(β∗ + Δ) − ∇Ln(β∗), Δ⟩ ≥
        α‖Δ‖_2² − τ (log p / n) ‖Δ‖_1²,    for all ‖Δ‖_2 ≤ r,
        α‖Δ‖_2 − τ √(log p / n) ‖Δ‖_1,     otherwise.

How is such a result possible?

[Surface plot: a loss with directions of both positive and negative curvature]

The loss function has directions of both positive and negative curvature; the negative directions are forbidden by the regularizer.

Holds for various convex/nonconvex losses:
- OLS & corrected OLS for linear regression; log likelihood for GLMs
- Huber loss for robust regression
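A quick numerical sketch of the first RSC inequality for the OLS loss, where the left-hand side equals 2Δᵀ(XᵀX/n)Δ. The constants α and τ and the sparse test directions below are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 1000
X = rng.standard_normal((n, p))
alpha, tau = 0.5, 2.0                  # illustrative RSC constants

# For Ln(b) = (1/n) ||y - X b||^2, the RSC left-hand side is
# <grad Ln(b* + D) - grad Ln(b*), D> = 2 D^T (X^T X / n) D.
violations = 0
for _ in range(1000):
    D = np.zeros(p)
    support = rng.choice(p, size=10, replace=False)   # sparse test directions
    D[support] = rng.standard_normal(10)
    lhs = 2 * D @ (X.T @ (X @ D)) / n
    rhs = alpha * D @ D - tau * (np.log(p) / n) * np.sum(np.abs(D))**2
    violations += lhs < rhs
print("violations out of 1000:", violations)
```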
Conditions on ρλ

Focus on amenable regularizers ρλ(β) = ∑_{j=1}^p ρλ(β_j) satisfying:
- ρλ(0) = 0, symmetric around 0
- Nondecreasing on R+
- t ↦ ρλ(t)/t nonincreasing on R+
- qλ(t) := λ|t| − ρλ(t) differentiable everywhere
- ρλ(t) + µt² convex for some µ > 0
- ρ′λ(t) = 0 for t ≥ γλ, for some γ > 0

Examples:
- µ-amenable: ℓ1, SCAD, MCP, LSP
- (µ, γ)-amenable: SCAD, MCP
- Neither: capped-ℓ1, bridge penalty (ℓq, 0 < q < 1)
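As a sanity check on the convexity condition above, a sketch that numerically verifies ρλ(t) + µt² is convex for MCP. Under the convention written above (penalty plus µt²), µ = 1/(2γ) suffices for MCP; treat that constant, and the grid, as assumptions of the sketch.

```python
import numpy as np

def mcp(t, lam=1.0, gamma=3.0):
    """MCP penalty of Zhang '10."""
    a = np.abs(t)
    return np.where(a <= gamma * lam, lam * a - a**2 / (2 * gamma),
                    gamma * lam**2 / 2)

gamma = 3.0
mu = 1.0 / (2 * gamma)        # assumed sufficient constant under this convention
ts = np.linspace(-6.0, 6.0, 4001)
f = mcp(ts, gamma=gamma) + mu * ts**2
# Discrete convexity check: second differences must be nonnegative.
second_diff = f[2:] - 2 * f[1:-1] + f[:-2]
print("convex:", bool(np.all(second_diff >= -1e-10)))
```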
Statistical consistency

Regularized M-estimator

    β̂ ∈ arg min_{‖β‖_1 ≤ R} { Ln(β) + ρλ(β) },

where the loss function satisfies (α, τ)-RSC and the regularizer is µ-amenable.

Theorem (L. & Wainwright ’13). Suppose R is chosen s.t. β∗ is feasible, and λ satisfies

    max{ ‖∇Ln(β∗)‖_∞, α √(log p / n) } ≲ λ ≲ α/R.

For n ≥ (Cτ²/α²) R² log p, any stationary point β̃ satisfies

    ‖β̃ − β∗‖_2 ≲ λ√k / (α − µ),    where k = ‖β∗‖_0.
Outline

1. Regularized M-estimators
   - Statistical M-estimation
   - Nonconvexity
   - Consistency of local optima
2. High-dimensional robust regression
   - Statistical consistency
   - Asymptotic normality
   - Two-step M-estimators
Robust regression functions

Linear model:

    y_i = x_iᵀβ∗ + ε_i

Heavy-tailed distribution on ε_i and/or outlier contamination.

Use the M-estimator

    β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n ℓ(x_iᵀβ − y_i) }
Classes of loss functions

A bounded ℓ′ limits the influence of outliers:

    IF((x, y); T, F) = lim_{t→0+} [T((1 − t)F + t δ_{(x,y)}) − T(F)] / t ∝ ℓ′(xᵀβ − y) x,

where F ∼ F_β and T is the functional defined by the M-estimator.

Redescending M-estimators have a finite rejection point:

    ℓ′(u) = 0,    for |u| ≥ c

[Plot: least squares, absolute value, Huber, and Tukey losses (P. Breheny, BST 764: Applied Statistical Modeling)]

But bad for optimization!
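A sketch of two standard robust losses and their derivatives ψ = ℓ′: Huber (bounded ℓ′) and the Tukey biweight (redescending, with rejection point c). The closed forms are the classical definitions; the cutoff values are conventional defaults, used here for illustration.

```python
import numpy as np

def huber(u, c=1.345):
    """Huber loss: quadratic near 0, linear in the tails (bounded derivative)."""
    a = np.abs(u)
    return np.where(a <= c, 0.5 * u**2, c * a - 0.5 * c**2)

def huber_psi(u, c=1.345):
    return np.clip(u, -c, c)

def tukey(u, c=4.685):
    """Tukey biweight: constant for |u| > c (redescending, finite rejection point)."""
    a = np.minimum(np.abs(u) / c, 1.0)
    return (c**2 / 6) * (1 - (1 - a**2)**3)

def tukey_psi(u, c=4.685):
    return np.where(np.abs(u) <= c, u * (1 - (u / c)**2)**2, 0.0)

u = np.linspace(-6, 6, 7)
print(huber(u), tukey(u))
print(huber_psi(u), tukey_psi(u))
```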
High-dimensional M-estimators

Natural idea: For p > n, use the regularized version:

    β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n ℓ(x_iᵀβ − y_i) + λ‖β‖_1 }

Complications:
- Optimization for nonconvex ℓ?
- Statistical theory? Are certain losses provably better than others?
Overview of results

When ‖ℓ′‖_∞ < C, global optima of the high-dimensional M-estimator satisfy

    ‖β̂ − β∗‖_2 ≤ C √(k log p / n),

regardless of the distribution of ε_i.

Compare to Lasso theory, which requires sub-Gaussian ε_i's.

If ℓ(u) is locally convex/smooth for |u| ≤ r, any local optima within radius cr of β∗ satisfy

    ‖β̃ − β∗‖_2 ≤ C′ √(k log p / n).

Local optima may be obtained via a two-step algorithm.
Theoretical insight

Lasso analysis (e.g., van de Geer ’07, Bickel et al. ’08):

    β̂ ∈ arg min_β { (1/n)‖y − Xβ‖_2² + λ‖β‖_1 } =: arg min_β Ln(β)

Rearranging the basic inequality Ln(β̂) ≤ Ln(β∗) and assuming λ ≥ 2‖Xᵀε/n‖_∞, obtain

    ‖β̂ − β∗‖_2 ≤ cλ√k.

Sub-Gaussian assumptions on the x_i's and ε_i's provide O(√(k log p / n)) bounds, which are minimax optimal.
Theoretical insight

Key observation: For a general loss function, if λ ≥ 2‖Xᵀℓ′(ε)/n‖_∞, obtain

    ‖β̂ − β∗‖_2 ≤ cλ√k.

ℓ′(ε) is sub-Gaussian whenever ℓ′ is bounded
⟹ can achieve estimation error

    ‖β̂ − β∗‖_2 ≤ c √(k log p / n),

without assuming ε_i is sub-Gaussian.
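A small simulation sketch of this observation: with Cauchy errors, the raw score ‖Xᵀε/n‖_∞ that calibrates λ for the Lasso fluctuates wildly across replications, while the bounded-derivative version ‖Xᵀℓ′(ε)/n‖_∞ (here with a Huber ψ, an illustrative choice) stays controlled. Dimensions and the number of replications are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, c = 500, 1000, 1.345
psi = lambda u: np.clip(u, -c, c)        # bounded Huber derivative

raw, robust = [], []
for _ in range(20):
    X = rng.standard_normal((n, p))
    eps = rng.standard_cauchy(n)          # heavy-tailed errors, no moments
    raw.append(np.max(np.abs(X.T @ eps)) / n)
    robust.append(np.max(np.abs(X.T @ psi(eps))) / n)
print("raw score (min, max):   ", np.round([np.min(raw), np.max(raw)], 3))
print("robust score (min, max):", np.round([np.min(robust), np.max(robust)], 3))
```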
Local statistical consistency

Local RSC condition: For Δ := β_1 − β_2,

    ⟨∇Ln(β_1) − ∇Ln(β_2), Δ⟩ ≥ α‖Δ‖_2² − τ (log p / n) ‖Δ‖_1²,    for all ‖β_j − β∗‖_2 ≤ r.

How is such a result possible?

[Surface plot: a loss with directions of both positive and negative curvature]

The loss function has directions of both positive and negative curvature; the negative directions are forbidden by the regularizer. Only requires restricted curvature within a constant-radius region around β∗.
Consistency of local stationary points

[Figure: stationary points within radius r of β∗ lie within O(√(k log p / n)) of β∗]

Theorem (L. ’15). Suppose Ln satisfies the α-local RSC condition and ρλ is µ-amenable, with α > µ. Suppose (λ, R) are chosen appropriately. For n ≳ (τ/(α − µ)) k log p, any stationary point β̃ s.t. ‖β̃ − β∗‖_2 ≤ r satisfies

    ‖β̃ − β∗‖_2 ≲ λ√k / (α − µ).
Second-order considerations

Question: If any bounded-derivative ℓ works for heavy-tailed distributions, why not always use Huber? LAD?

Answer: It has to do with (asymptotic) efficiency.

- In low-dimensional settings, the MLE is maximally efficient with respect to variance.
- Although the MLE may behave erratically when p/n converges to a constant in (0, 1], one can achieve simple asymptotic normality results under a sparsity assumption, via the oracle property.
Asymptotic efficiency

[Plots: ℓ2-error and empirical variance of the first component vs. n/(k log p), for Huber and Cauchy losses, p = 128, 256, 512]

ℓ2-error and empirical variance of M-estimators when the errors follow a Cauchy distribution (SCAD regularizer).
How to obtain local efficient solutions?

Goal: Nonconvex regularized M-estimator such that ℓ satisfies the α-local RSC condition.

Target: Locate a stationary point within radius r of β∗.

    "Descending ψ-functions are tricky, especially when the starting values for the iterations are non-robust. . . . It is therefore preferable to start with a monotone ψ, iterate to death, and then append a few (1 or 2) iterations with the nonmonotone ψ." — Huber 1981, pp. 191–192
Two-step algorithm (L. ’15)

Use composite gradient descent starting from a close initialization.

Two-step M-estimator: Finds local stationary points of a nonconvex, robust loss + (µ, γ)-amenable penalty.

Algorithm (a code sketch follows below):
1. Run composite gradient descent on a convex, robust loss + ℓ1 penalty until convergence; output β̂_H.
2. Run composite gradient descent on the nonconvex, robust loss + (µ, γ)-amenable penalty, with input β⁰ = β̂_H.

Theoretical guarantees on the (rate of) convergence to the optimal point.
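A minimal sketch of this two-step scheme, using the Huber loss + ℓ1 in step one and the Tukey loss + MCP in step two. The specific loss/penalty pair, the MCP proximal formula, step size, and simulated data are illustrative choices, not the talk's prescribed implementation.

```python
import numpy as np

def huber_psi(u, c=1.345):
    return np.clip(u, -c, c)                      # bounded, monotone psi

def tukey_psi(u, c=4.685):
    return np.where(np.abs(u) <= c, u * (1 - (u / c)**2)**2, 0.0)  # redescending

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def mcp_prox(v, lam, eta, gamma=3.0):
    """Proximal map of eta * MCP(lam, gamma); requires gamma > eta."""
    inner = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0) / (1 - eta / gamma)
    return np.where(np.abs(v) <= gamma * lam, inner, v)

def composite_gd(psi, prox, X, y, beta0, eta=0.1, n_iter=2000):
    """Composite gradient descent on (1/n) sum l(x_i^T b - y_i) + penalty,
    where l' = psi and prox is the penalty's proximal map with step eta."""
    n = len(y)
    beta = beta0.copy()
    for _ in range(n_iter):
        grad = X.T @ psi(X @ beta - y) / n
        beta = prox(beta - eta * grad)
    return beta

rng = np.random.default_rng(4)
n, p, k = 300, 600, 5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:k] = 1.0
y = X @ beta_star + rng.standard_cauchy(n)        # heavy-tailed errors

lam, eta = 0.2, 0.1
# Step 1: convex robust loss (Huber) + l1 penalty, started from zero.
beta_h = composite_gd(huber_psi, lambda v: soft_threshold(v, eta * lam),
                      X, y, np.zeros(p), eta)
# Step 2: nonconvex robust loss (Tukey) + MCP penalty, warm-started at beta_h.
beta_t = composite_gd(tukey_psi, lambda v: mcp_prox(v, lam, eta),
                      X, y, beta_h, eta, n_iter=500)
print(np.linalg.norm(beta_h - beta_star), np.linalg.norm(beta_t - beta_star))
```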
Summary of the two-step M-estimator

Output is computationally and statistically efficient:
- Computational guarantee on the rate of convergence in each step of the M-estimation algorithm
- Statistical guarantee on the asymptotic efficiency of the estimator (assuming a β-min condition and (µ, γ)-amenability)
Summary

Theory for nonconvex regularized M-estimators:
- Global RSC condition ⟹ all stationary points within statistical error of β∗
- Local RSC condition ⟹ stationary points in a local region within statistical error of β∗

Consequences for high-dimensional robust regression estimators:
- Consistency under relaxed distributional assumptions when ℓ′ is bounded
- Oracle estimator with a (µ, γ)-amenable regularizer ⟹ asymptotic efficiency
- Two-step M-estimator produces local oracle solutions
References

P. Loh and M. J. Wainwright (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research.

P. Loh (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Annals of Statistics.

Thank you!