Robust high-dimensional linear regression: A statistical perspective

Po-Ling Loh
University of Wisconsin–Madison, Departments of ECE & Statistics

STOC workshop on robustness and nonconvexity, Montreal, Canada
June 23, 2017
Introduction: Robust regression

Robust statistics introduced in the 1960s (Huber, Tukey, Hampel, et al.)

Goals:
1. Develop estimators T(·) that are reliable under deviations from model assumptions
2. Quantify performance with respect to deviations

Local stability captured by the influence function:

    IF(x; T, F) = \lim_{t \to 0} \frac{T((1-t)F + t\delta_x) - T(F)}{t}

Global stability captured by the breakdown point:

    \epsilon^*(T; X_1, \dots, X_n) = \min \Big\{ \frac{m}{n} : \sup_{X^m} \| T(X^m) - T(X) \| = \infty \Big\},

where X^m ranges over samples obtained by replacing m of the n points arbitrarily
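The distinction is easy to see numerically. Below is a minimal sketch (not from the talk) of the finite-sample sensitivity curve, the empirical analogue of the influence function, contrasting the unbounded influence of the mean with the bounded influence of the median; all names and constants are illustrative.

```python
import numpy as np

# Finite-sample analogue of the influence function: the sensitivity
# curve n * (T(sample + {x}) - T(sample)) as the added point x varies.
def sensitivity_curve(T, sample, xs):
    n = len(sample)
    base = T(sample)
    return np.array([n * (T(np.append(sample, x)) - base) for x in xs])

rng = np.random.default_rng(0)
sample = rng.normal(size=100)
xs = np.linspace(-50.0, 50.0, 11)

# The mean's sensitivity grows linearly in x (unbounded influence),
# while the median's stays bounded however extreme x becomes.
print(sensitivity_curve(np.mean, sample, xs).round(2))
print(sensitivity_curve(np.median, sample, xs).round(2))
```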
High-dimensional linear models

[Figure: y = Xβ* + ε, with y of dimension n × 1, X of dimension n × p, and β* of dimension p × 1]

Linear model:

    y_i = x_i^T \beta^* + \varepsilon_i, \quad i = 1, \dots, n

When p ≫ n, assume sparsity: ‖β*‖_0 ≤ k
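As a concrete instance, here is a small sketch generating data from this model in the p ≫ n, heavy-tailed regime the talk targets; the dimensions and the Cauchy noise law are illustrative choices, not the talk's.

```python
import numpy as np

# Hypothetical instance of the sparse linear model y = X beta* + eps,
# with p >> n and ||beta*||_0 = k.
rng = np.random.default_rng(1)
n, p, k = 100, 500, 5
X = rng.normal(size=(n, p))
beta_star = np.zeros(p)
beta_star[:k] = 1.0
eps = rng.standard_cauchy(n)      # heavy-tailed errors
y = X @ beta_star + eps
```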
Robust M-estimators

Generalization of OLS appropriate for robust statistics:

    \hat{\beta} \in \arg\min_{\beta} \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i)

Extensive theory for p fixed, n → ∞

[Figure: loss ℓ(u) vs. residual u for least squares, absolute value, Huber, and Tukey losses (Patrick Breheny, BST 764: Applied Statistical Modeling)]

[Figure: Belgian phone calls data, millions of calls per year, 1950–1970; least squares vs. Huber and Tukey regression fits (Patrick Breheny, BST 764: Applied Statistical Modeling)]
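For reference, minimal sketches of the two robust losses in the figure. The tuning constants 1.345 (Huber) and 4.685 (Tukey) are the classical 95%-efficiency defaults at the normal model, used here purely for illustration.

```python
import numpy as np

def huber_loss(u, delta=1.345):
    # quadratic near zero, linear in the tails
    return np.where(np.abs(u) <= delta,
                    0.5 * u**2,
                    delta * np.abs(u) - 0.5 * delta**2)

def tukey_loss(u, c=4.685):
    # Tukey's biweight: bounded, and constant for |u| >= c
    return (c**2 / 6) * np.where(np.abs(u) <= c,
                                 1 - (1 - (u / c)**2)**3,
                                 1.0)
```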
Classes of loss functions

Bounded ℓ' limits the influence of outliers:

    IF((x, y); T, F) = \lim_{t \to 0^+} \frac{T((1-t)F + t\delta_{(x,y)}) - T(F)}{t} \propto \ell'(x^T \beta - y) \, x,

where F = F_β and T is the corresponding M-estimation functional

Redescending M-estimators have a finite rejection point:

    \ell'(u) = 0 \quad \text{for } |u| \ge c

[Figure: loss functions vs. residual, as before (Patrick Breheny, BST 764: Applied Statistical Modeling)]

But bad for optimization!!
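The corresponding ψ-functions (ℓ') make the two notions concrete; a sketch with the same illustrative constants as above. Huber's ψ is bounded but monotone, while Tukey's redescends to exactly zero beyond the rejection point c.

```python
import numpy as np

def huber_psi(u, delta=1.345):
    # bounded influence: |psi| <= delta everywhere
    return np.clip(u, -delta, delta)

def tukey_psi(u, c=4.685):
    # redescending: vanishes identically for |u| >= c
    return np.where(np.abs(u) <= c, u * (1 - (u / c)**2)**2, 0.0)
```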
High-dimensional M-estimators

Natural idea: For p > n, use a regularized version:

    \hat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \lambda \|\beta\|_1 \Big\}

Complications:

- Optimization for nonconvex ℓ?
- Statistical theory? Are certain losses provably better than others?
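To fix notation, a minimal sketch of this objective with the Huber loss; lam is a hypothetical tuning parameter of order \sqrt{\log p / n}.

```python
import numpy as np

def huber_loss(u, delta=1.345):
    return np.where(np.abs(u) <= delta, 0.5 * u**2,
                    delta * np.abs(u) - 0.5 * delta**2)

def penalized_objective(beta, X, y, lam):
    # robust empirical risk plus l1 penalty
    return np.mean(huber_loss(X @ beta - y)) + lam * np.sum(np.abs(beta))
```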
Overview of results

When ‖ℓ'‖_∞ ≤ C, global optima of the high-dimensional M-estimator satisfy

    \|\hat{\beta} - \beta^*\|_2 \le C \sqrt{\frac{k \log p}{n}},

regardless of the distribution of ε_i

Compare to Lasso theory, which requires sub-Gaussian ε_i's

If ℓ(u) is locally convex/smooth for |u| ≤ r, any local optima within radius cr of β* satisfy

    \|\tilde{\beta} - \beta^*\|_2 \le C' \sqrt{\frac{k \log p}{n}}

(* to verify the RE condition w.h.p., also need Var(ε_i) ≤ c r²)

Local optima may be obtained via a two-step algorithm
Theoretical insight

Lasso analysis (e.g., van de Geer '07, Bickel et al. '08):

    \hat{\beta} \in \arg\min_{\beta} \underbrace{ \frac{1}{n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 }_{L_n(\beta)}

Rearranging the basic inequality L_n(\hat{\beta}) \le L_n(\beta^*) and assuming \lambda \ge 2 \big\| \frac{X^T \varepsilon}{n} \big\|_\infty, obtain

    \|\hat{\beta} - \beta^*\|_2 \le c \lambda \sqrt{k}

Sub-Gaussian assumptions on the x_i's and ε_i's yield O\big(\sqrt{k \log p / n}\big) bounds, which are minimax optimal
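For completeness, a sketch of the standard argument, with constants chosen for convenience (e.g. λ ≥ 4‖X^Tε/n‖_∞ rather than the factor 2 above); here S = supp(β*) with |S| = k and Δ = β̂ − β*.

```latex
% Basic inequality L_n(\hat\beta) \le L_n(\beta^*), expanded with
% y = X\beta^* + \varepsilon and \Delta = \hat\beta - \beta^*:
\begin{align*}
\tfrac{1}{n}\|X\Delta\|_2^2
 &\le \tfrac{2}{n}\,\varepsilon^T X\Delta
   + \lambda\bigl(\|\beta^*\|_1 - \|\hat\beta\|_1\bigr) \\
 &\le \tfrac{\lambda}{2}\|\Delta\|_1
   + \lambda\bigl(\|\Delta_S\|_1 - \|\Delta_{S^c}\|_1\bigr) \\
 &\le \tfrac{3\lambda}{2}\|\Delta_S\|_1 - \tfrac{\lambda}{2}\|\Delta_{S^c}\|_1
 \;\le\; \tfrac{3\lambda}{2}\sqrt{k}\,\|\Delta\|_2.
\end{align*}
% Nonnegativity of the left side also forces the cone condition
% \|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1, on which a restricted
% eigenvalue bound \tfrac{1}{n}\|X\Delta\|_2^2 \ge \alpha\|\Delta\|_2^2
% yields \|\Delta\|_2 \le \tfrac{3\lambda\sqrt{k}}{2\alpha}.
```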
Theoretical insight

Key observation: For a general loss function, if \lambda \ge 2 \big\| \frac{X^T \ell'(\varepsilon)}{n} \big\|_\infty, obtain

    \|\hat{\beta} - \beta^*\|_2 \le c \lambda \sqrt{k}

ℓ'(ε) is sub-Gaussian whenever ℓ' is bounded (any bounded random variable is sub-Gaussian)
⟹ can achieve estimation error

    \|\hat{\beta} - \beta^*\|_2 \le c \sqrt{\frac{k \log p}{n}},

without assuming ε_i is sub-Gaussian
Technical challenges

Lasso analysis also requires verifying a restricted eigenvalue (RE) condition on the design matrix, which is more complicated for general ℓ

When ℓ is nonconvex, local optima β̃ may exist that are not global optima

Want error bounds on ‖β̃ − β*‖_2 as well, or algorithms to find β̂ efficiently
Related work: Nonconvex regularized M-estimators

Composite objective function:

    \hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \Big\{ L_n(\beta) + \sum_{j=1}^p \rho_\lambda(\beta_j) \Big\}

Assumptions:

- L_n satisfies restricted strong convexity with curvature α (Negahban et al. '12)
- ρ_λ has bounded subgradient at 0, and ρ_λ(t) + µt² is convex
- α > µ
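One standard penalty satisfying these conditions is SCAD (Fan & Li '01): it has bounded subgradient λ at 0, and ρ_λ(t) + µt² is convex for suitable µ depending on the shape parameter a. A sketch, with the conventional a = 3.7 as an illustrative default:

```python
import numpy as np

# SCAD penalty: linear near zero, quadratic transition, then constant.
def scad(t, lam, a=3.7):
    t = np.abs(t)
    return np.where(
        t <= lam,
        lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                 lam**2 * (a + 1) / 2))
```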
Stationary points (L. & Wainwright ’15)
b e
O
rk log p
n
!
Stationary points statistically indistinguishable from global optima
〈∇Ln(β) +∇ρλ(β), β − β〉 ≥ 0, ∀β feasible
Under suitable distributional assumptions, for λ √
log pn and R 1
λ ,
‖β − β∗‖2 ≤ c
√k log p
n≈ statistical error
Mathematical statement

Theorem (L. & Wainwright '15)

Suppose R is chosen s.t. β* is feasible, and λ satisfies

    \max\Big\{ \|\nabla L_n(\beta^*)\|_\infty, \; \alpha \sqrt{\frac{\log p}{n}} \Big\} \;\lesssim\; \lambda \;\lesssim\; \frac{\alpha}{R}.

For n \ge \frac{C \tau^2}{\alpha^2} R^2 \log p, any stationary point β̃ satisfies

    \|\tilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda \sqrt{k}}{\alpha - \mu}, \quad \text{where } k = \|\beta^*\|_0.

New ingredient for the robust setting: ℓ is convex only in a local region
⟹ need for local consistency results
Local statistical consistency

[Figures: loss functions and Belgian phone calls fits, as before (Patrick Breheny, BST 764: Applied Statistical Modeling)]

Challenge in robust statistics: population-level nonconvexity of the loss
⟹ need for local optimization theory
Local RSC condition

Local RSC condition: For Δ := β₁ − β₂,

    \langle \nabla L_n(\beta_1) - \nabla L_n(\beta_2), \, \Delta \rangle \ge \alpha \|\Delta\|_2^2 - \tau \frac{\log p}{n} \|\Delta\|_1^2, \quad \forall \, \|\beta_j - \beta^*\|_2 \le r

How is such a result possible?

[Figure: saddle-shaped surface with directions of both positive and negative curvature]

- Loss function has directions of both positive and negative curvature.
- Negative directions are forbidden by the regularizer.
- Only requires restricted curvature within a constant-radius region around β*
Consistency of local stationary points

[Figure: stationary points within radius r of β* lie in a ball of radius O(\sqrt{k \log p / n})]

Theorem (L. '17)

Suppose L_n satisfies α-local RSC and ρ_λ is µ-amenable, with α > µ.
Suppose ‖ℓ'‖_∞ ≤ C and \lambda \asymp \sqrt{\frac{\log p}{n}}. For n \gtrsim \frac{\tau}{\alpha - \mu} k \log p, any stationary point β̃ s.t. ‖β̃ − β*‖_2 ≤ r satisfies

    \|\tilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda \sqrt{k}}{\alpha - \mu}.
Optimization theory

Question: How to obtain sufficiently close local solutions?

Goal: For the regularized M-estimator

    \hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \rho_\lambda(\beta) \Big\},

where ℓ satisfies α-local RSC, find a stationary point β̃ such that ‖β̃ − β*‖_2 ≤ r
Wisdom from Huber
Descending ψ-functions are tricky, especially when the starting values for the iterations are non-robust. ... It is therefore preferable to start with a monotone ψ, iterate to death, and then append a few (1 or 2) iterations with the nonmonotone ψ. — Huber 1981, pp. 191–192
Two-step algorithm (L. ’17)
Use composite gradient descent (Nesterov ’07):Iterative method to solve
β ∈ arg minβ∈ΩLn(β) + ρλ(β),
Ln differentiable, ρλ convex & subdifferentiable
Ln
Ln(t) + hrLn(t), ti +L
2k tk2
2
b tt+1
Updates:
βt+1 ∈ arg minβ∈Ω
Ln(βt) + 〈∇Ln(βt), β − βt〉+
L
2‖β − βt‖2
2 + ρλ(β)
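When ρ_λ is the ℓ₁ penalty, each update is a gradient step followed by soft-thresholding. A minimal sketch, assuming the Huber loss and ignoring the side constraint ‖β‖₁ ≤ R; the step size uses a crude Lipschitz bound, and all constants are illustrative.

```python
import numpy as np

def huber_psi(u, delta=1.345):
    return np.clip(u, -delta, delta)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def composite_gradient_descent(X, y, lam, psi=huber_psi, n_iter=500):
    n, p = X.shape
    L = np.linalg.norm(X, 2)**2 / n       # Lipschitz bound (|psi'| <= 1)
    eta = 1.0 / L
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ psi(X @ beta - y) / n   # gradient of smooth part
        beta = soft_threshold(beta - eta * grad, eta * lam)
    return beta
```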
Two-step algorithm (L. ’17)
Two-step M-estimator: Finds local stationary points of nonconvex,robust loss + µ-amenable penalty
β ∈ arg min‖β‖1≤R
1
n
n∑
i=1
`(xTi β − yi ) + ρλ(β)
Algorithm
1 Run composite gradient descent on convex, robust loss + `1-penaltyuntil convergence, output βH
2 Run composite gradient descent on nonconvex, robust loss +µ-amenable penalty, input β0 = βH
Important: We want to optimize original nonconvex objective, sinceit leads to more efficient (lower-variance) estimators
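A hedged end-to-end sketch of the two steps (not the paper's implementation): step 1 solves the convex Huber + ℓ₁ problem from a zero start; step 2 warm-starts the nonconvex Tukey problem at the Huber solution. For brevity both steps keep the ℓ₁ penalty, whereas the second step in the talk uses a µ-amenable penalty such as SCAD, and the side constraint is again omitted.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad(X, y, lam, psi, beta0, n_iter=500):
    n = X.shape[0]
    eta = n / np.linalg.norm(X, 2)**2
    beta = beta0.copy()
    for _ in range(n_iter):
        beta = soft_threshold(beta - eta * X.T @ psi(X @ beta - y) / n,
                              eta * lam)
    return beta

huber_psi = lambda u, d=1.345: np.clip(u, -d, d)
tukey_psi = lambda u, c=4.685: np.where(np.abs(u) <= c,
                                        u * (1 - (u / c)**2)**2, 0.0)

def two_step(X, y, lam):
    # step 1: convex Huber problem; step 2: warm-started Tukey problem
    beta_h = prox_grad(X, y, lam, huber_psi, np.zeros(X.shape[1]))
    return prox_grad(X, y, lam, tukey_psi, beta_h)
```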
Simulation

[Figure: two panels vs. rescaled sample size n/(k log p), for p = 128, 256, 512, comparing Huber and Cauchy losses. Left: ℓ₂-error ‖β̂ − β*‖₂ for robust regression losses. Right: empirical variance of the first component.]

ℓ₂-error and empirical variance of M-estimators when errors follow a Cauchy distribution (SCAD regularizer)

Can prove geometric convergence of the two-step algorithm to desirable local optima (L. '17)
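A toy driver in the spirit of this experiment (illustrative sizes and λ, not the talk's settings), reusing the two_step routine from the sketch above:

```python
import numpy as np

# Sparse beta*, Gaussian design, Cauchy errors; assumes two_step is
# defined as in the previous sketch.
rng = np.random.default_rng(2)
n, p, k = 400, 128, 5
X = rng.normal(size=(n, p))
beta_star = np.concatenate([np.ones(k), np.zeros(p - k)])
y = X @ beta_star + rng.standard_cauchy(n)

lam = 2 * np.sqrt(np.log(p) / n)      # lambda of order sqrt(log p / n)
beta_hat = two_step(X, y, lam)
print(np.linalg.norm(beta_hat - beta_star))
```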
Summary

Loss functions with desirable robustness properties in low-dimensional regression are also good for high dimensions:

    bounded influence ⟺ ‖ℓ'‖_∞ ≤ C ⟺ O\Big(\sqrt{\frac{k \log p}{n}}\Big) consistency

Two-step optimization procedure: first step for consistency, second step for efficiency

Loh (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Annals of Statistics.
Trailer

Problem: The loss function ℓ is in some sense calibrated to the scale of ε_i

Better objective (joint location/scale estimator):

    (\hat{\beta}, \hat{\sigma}) \in \arg\min_{\beta, \sigma} \underbrace{ \frac{1}{n} \sum_{i=1}^n \ell\Big( \frac{y_i - x_i^T \beta}{\sigma} \Big) \sigma + a\sigma }_{L_n(\beta, \sigma)} + \lambda \|\beta\|_1

However, location/scale estimation is notoriously difficult even in low dimensions
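A minimal sketch of evaluating L_n(β, σ) plus penalty, assuming the Huber loss; the constant a is a calibration constant (in practice chosen so that σ̂ is consistent at the normal model), and the value used here is purely illustrative.

```python
import numpy as np

def huber_loss(u, delta=1.345):
    return np.where(np.abs(u) <= delta, 0.5 * u**2,
                    delta * np.abs(u) - 0.5 * delta**2)

def joint_objective(beta, sigma, X, y, lam, a=0.5):
    # scaled residuals enter the loss; the a*sigma term keeps the
    # objective from collapsing as sigma -> 0
    r = (y - X @ beta) / sigma
    return (np.mean(huber_loss(r)) * sigma + a * sigma
            + lam * np.sum(np.abs(beta)))
```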
Trailer

Another idea: MM-estimator

    \hat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell\Big( \frac{y_i - x_i^T \beta}{\hat{\sigma}_0} \Big) + \lambda \|\beta\|_1 \Big\},

using a robust estimate of scale σ̂₀ based on a preliminary estimate β̂₀

How to obtain (β̂₀, σ̂₀)?

S-estimators/LMS:

    \hat{\beta}_0 \in \arg\min_{\beta} \hat{\sigma}(r(\beta)),

where \hat{\sigma}(r) = |r|_{(n - \lfloor n\delta \rfloor)} is an order statistic of the residuals

LTS:

    \hat{\beta}_0 \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^{n - \lfloor n\alpha \rfloor} \big( y_i - x_i^T \beta \big)^2_{(i)} + \lambda \|\beta\|_1 \Big\}
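A sketch of the penalized least-trimmed-squares objective: only the smallest n − ⌊nα⌋ squared residuals enter the sum, so the largest α-fraction of points cannot influence the fit. The trimming fraction alpha here is an illustrative default.

```python
import numpy as np

def lts_objective(beta, X, y, lam, alpha=0.25):
    r2 = np.sort((y - X @ beta)**2)              # ordered squared residuals
    h = len(y) - int(np.floor(alpha * len(y)))   # number kept after trimming
    return np.sum(r2[:h]) / len(y) + lam * np.sum(np.abs(beta))
```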
Trailer
Maybe an entirely different approach is necessary . . .
Loh (2017). Scale estimation for high-dimensional robust regression.
Coming soon?
Thank you!