Robust high-dimensional linear regression: A statistical perspective

Po-Ling Loh
University of Wisconsin–Madison, Departments of ECE & Statistics

STOC workshop on robustness and nonconvexity, Montreal, Canada
June 23, 2017
Introduction: Robust regression

Robust statistics introduced in the 1960s (Huber, Tukey, Hampel, et al.)

Goals:
1. Develop estimators T(·) that are reliable under deviations from model assumptions
2. Quantify performance with respect to deviations

Local stability captured by the influence function:

    IF(x; T, F) = \lim_{t \to 0} \frac{T((1-t)F + t\delta_x) - T(F)}{t}

Global stability captured by the breakdown point:

    \epsilon^*(T; X_1, \dots, X_n) = \min \Big\{ \frac{m}{n} : \sup_{X^m} \| T(X^m) - T(X) \| = \infty \Big\},

where X^m ranges over samples obtained by replacing m of the n points arbitrarily
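The distinction is easy to see numerically. Below is a minimal sketch (not from the talk) of the finite-sample sensitivity curve, the empirical analogue of the influence function, contrasting the unbounded influence of the mean with the bounded influence of the median; all names and constants are illustrative.

```python
import numpy as np

# Finite-sample analogue of the influence function: the sensitivity
# curve n * (T(sample + {x}) - T(sample)) as the added point x varies.
def sensitivity_curve(T, sample, xs):
    n = len(sample)
    base = T(sample)
    return np.array([n * (T(np.append(sample, x)) - base) for x in xs])

rng = np.random.default_rng(0)
sample = rng.normal(size=100)
xs = np.linspace(-50.0, 50.0, 11)

# The mean's sensitivity grows linearly in x (unbounded influence),
# while the median's stays bounded however extreme x becomes.
print(sensitivity_curve(np.mean, sample, xs).round(2))
print(sensitivity_curve(np.median, sample, xs).round(2))
```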
High-dimensional linear models

[Figure: y = Xβ* + ε, with y of dimension n × 1, X of dimension n × p, and β* of dimension p × 1]

Linear model:

    y_i = x_i^T \beta^* + \varepsilon_i, \quad i = 1, \dots, n

When p ≫ n, assume sparsity: ‖β*‖_0 ≤ k
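As a concrete instance, here is a small sketch generating data from this model in the p ≫ n, heavy-tailed regime the talk targets; the dimensions and the Cauchy noise law are illustrative choices, not the talk's.

```python
import numpy as np

# Hypothetical instance of the sparse linear model y = X beta* + eps,
# with p >> n and ||beta*||_0 = k.
rng = np.random.default_rng(1)
n, p, k = 100, 500, 5
X = rng.normal(size=(n, p))
beta_star = np.zeros(p)
beta_star[:k] = 1.0
eps = rng.standard_cauchy(n)      # heavy-tailed errors
y = X @ beta_star + eps
```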
Robust M-estimators

Generalization of OLS appropriate for robust statistics:

    \hat{\beta} \in \arg\min_{\beta} \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i)

Extensive theory for p fixed, n → ∞

[Figure: loss ℓ(u) vs. residual u for least squares, absolute value, Huber, and Tukey losses (Patrick Breheny, BST 764: Applied Statistical Modeling)]

[Figure: Belgian phone calls data, millions of calls per year, 1950–1970; least squares vs. Huber and Tukey regression fits (Patrick Breheny, BST 764: Applied Statistical Modeling)]
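For reference, minimal sketches of the two robust losses in the figure. The tuning constants 1.345 (Huber) and 4.685 (Tukey) are the classical 95%-efficiency defaults at the normal model, used here purely for illustration.

```python
import numpy as np

def huber_loss(u, delta=1.345):
    # quadratic near zero, linear in the tails
    return np.where(np.abs(u) <= delta,
                    0.5 * u**2,
                    delta * np.abs(u) - 0.5 * delta**2)

def tukey_loss(u, c=4.685):
    # Tukey's biweight: bounded, and constant for |u| >= c
    return (c**2 / 6) * np.where(np.abs(u) <= c,
                                 1 - (1 - (u / c)**2)**3,
                                 1.0)
```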
Classes of loss functions

Bounded ℓ' limits the influence of outliers:

    IF((x, y); T, F) = \lim_{t \to 0^+} \frac{T((1-t)F + t\delta_{(x,y)}) - T(F)}{t} \propto \ell'(x^T \beta - y) \, x,

where F = F_β and T is the corresponding M-estimation functional

Redescending M-estimators have a finite rejection point:

    \ell'(u) = 0 \quad \text{for } |u| \ge c

[Figure: loss functions vs. residual, as before (Patrick Breheny, BST 764: Applied Statistical Modeling)]

But bad for optimization!!
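The corresponding ψ-functions (ℓ') make the two notions concrete; a sketch with the same illustrative constants as above. Huber's ψ is bounded but monotone, while Tukey's redescends to exactly zero beyond the rejection point c.

```python
import numpy as np

def huber_psi(u, delta=1.345):
    # bounded influence: |psi| <= delta everywhere
    return np.clip(u, -delta, delta)

def tukey_psi(u, c=4.685):
    # redescending: vanishes identically for |u| >= c
    return np.where(np.abs(u) <= c, u * (1 - (u / c)**2)**2, 0.0)
```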
High-dimensional M-estimators

Natural idea: For p > n, use a regularized version:

    \hat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \lambda \|\beta\|_1 \Big\}

Complications:

- Optimization for nonconvex ℓ?
- Statistical theory? Are certain losses provably better than others?
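To fix notation, a minimal sketch of this objective with the Huber loss; lam is a hypothetical tuning parameter of order \sqrt{\log p / n}.

```python
import numpy as np

def huber_loss(u, delta=1.345):
    return np.where(np.abs(u) <= delta, 0.5 * u**2,
                    delta * np.abs(u) - 0.5 * delta**2)

def penalized_objective(beta, X, y, lam):
    # robust empirical risk plus l1 penalty
    return np.mean(huber_loss(X @ beta - y)) + lam * np.sum(np.abs(beta))
```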
Overview of results

When ‖ℓ'‖_∞ ≤ C, global optima of the high-dimensional M-estimator satisfy

    \|\hat{\beta} - \beta^*\|_2 \le C \sqrt{\frac{k \log p}{n}},

regardless of the distribution of ε_i

Compare to Lasso theory, which requires sub-Gaussian ε_i's

If ℓ(u) is locally convex/smooth for |u| ≤ r, any local optima within radius cr of β* satisfy

    \|\tilde{\beta} - \beta^*\|_2 \le C' \sqrt{\frac{k \log p}{n}}

(* to verify the RE condition w.h.p., also need Var(ε_i) ≤ c r²)

Local optima may be obtained via a two-step algorithm
Theoretical insight

Lasso analysis (e.g., van de Geer '07, Bickel et al. '08):

    \hat{\beta} \in \arg\min_{\beta} \underbrace{ \frac{1}{n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 }_{L_n(\beta)}

Rearranging the basic inequality L_n(\hat{\beta}) \le L_n(\beta^*) and assuming \lambda \ge 2 \big\| \frac{X^T \varepsilon}{n} \big\|_\infty, obtain

    \|\hat{\beta} - \beta^*\|_2 \le c \lambda \sqrt{k}

Sub-Gaussian assumptions on the x_i's and ε_i's yield O\big(\sqrt{k \log p / n}\big) bounds, which are minimax optimal
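For completeness, a sketch of the standard argument, with constants chosen for convenience (e.g. λ ≥ 4‖X^Tε/n‖_∞ rather than the factor 2 above); here S = supp(β*) with |S| = k and Δ = β̂ − β*.

```latex
% Basic inequality L_n(\hat\beta) \le L_n(\beta^*), expanded with
% y = X\beta^* + \varepsilon and \Delta = \hat\beta - \beta^*:
\begin{align*}
\tfrac{1}{n}\|X\Delta\|_2^2
 &\le \tfrac{2}{n}\,\varepsilon^T X\Delta
   + \lambda\bigl(\|\beta^*\|_1 - \|\hat\beta\|_1\bigr) \\
 &\le \tfrac{\lambda}{2}\|\Delta\|_1
   + \lambda\bigl(\|\Delta_S\|_1 - \|\Delta_{S^c}\|_1\bigr) \\
 &\le \tfrac{3\lambda}{2}\|\Delta_S\|_1 - \tfrac{\lambda}{2}\|\Delta_{S^c}\|_1
 \;\le\; \tfrac{3\lambda}{2}\sqrt{k}\,\|\Delta\|_2.
\end{align*}
% Nonnegativity of the left side also forces the cone condition
% \|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1, on which a restricted
% eigenvalue bound \tfrac{1}{n}\|X\Delta\|_2^2 \ge \alpha\|\Delta\|_2^2
% yields \|\Delta\|_2 \le \tfrac{3\lambda\sqrt{k}}{2\alpha}.
```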
Theoretical insight

Key observation: For a general loss function, if \lambda \ge 2 \big\| \frac{X^T \ell'(\varepsilon)}{n} \big\|_\infty, obtain

    \|\hat{\beta} - \beta^*\|_2 \le c \lambda \sqrt{k}

ℓ'(ε) is sub-Gaussian whenever ℓ' is bounded (any bounded random variable is sub-Gaussian)
⟹ can achieve estimation error

    \|\hat{\beta} - \beta^*\|_2 \le c \sqrt{\frac{k \log p}{n}},

without assuming ε_i is sub-Gaussian
Technical challenges

Lasso analysis also requires verifying a restricted eigenvalue (RE) condition on the design matrix, which is more complicated for general ℓ

When ℓ is nonconvex, local optima β̃ may exist that are not global optima

Want error bounds on ‖β̃ − β*‖_2 as well, or algorithms to find β̂ efficiently
Related work: Nonconvex regularized M-estimators

Composite objective function:

    \hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \Big\{ L_n(\beta) + \sum_{j=1}^p \rho_\lambda(\beta_j) \Big\}

Assumptions:

- L_n satisfies restricted strong convexity with curvature α (Negahban et al. '12)
- ρ_λ has bounded subgradient at 0, and ρ_λ(t) + µt² is convex
- α > µ
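One standard penalty satisfying these conditions is SCAD (Fan & Li '01): it has bounded subgradient λ at 0, and ρ_λ(t) + µt² is convex for suitable µ depending on the shape parameter a. A sketch, with the conventional a = 3.7 as an illustrative default:

```python
import numpy as np

# SCAD penalty: linear near zero, quadratic transition, then constant.
def scad(t, lam, a=3.7):
    t = np.abs(t)
    return np.where(
        t <= lam,
        lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                 lam**2 * (a + 1) / 2))
```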
Stationary points (L. & Wainwright ’15)
b e
O
rk log p
n
!
Stationary points statistically indistinguishable from global optima
〈∇Ln(β) +∇ρλ(β), β − β〉 ≥ 0, ∀β feasible
Under suitable distributional assumptions, for λ √
log pn and R 1
λ ,
‖β − β∗‖2 ≤ c
√k log p
n≈ statistical error
Mathematical statement

Theorem (L. & Wainwright '15)

Suppose R is chosen s.t. β* is feasible, and λ satisfies

    \max\Big\{ \|\nabla L_n(\beta^*)\|_\infty, \; \alpha \sqrt{\frac{\log p}{n}} \Big\} \;\lesssim\; \lambda \;\lesssim\; \frac{\alpha}{R}.

For n \ge \frac{C \tau^2}{\alpha^2} R^2 \log p, any stationary point β̃ satisfies

    \|\tilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda \sqrt{k}}{\alpha - \mu}, \quad \text{where } k = \|\beta^*\|_0.

New ingredient for the robust setting: ℓ is convex only in a local region
⟹ need for local consistency results
Local statistical consistency

[Figures: loss functions and Belgian phone calls fits, as before (Patrick Breheny, BST 764: Applied Statistical Modeling)]

Challenge in robust statistics: population-level nonconvexity of the loss
⟹ need for local optimization theory
Local RSC condition

Local RSC condition: For Δ := β₁ − β₂,

    \langle \nabla L_n(\beta_1) - \nabla L_n(\beta_2), \, \Delta \rangle \ge \alpha \|\Delta\|_2^2 - \tau \frac{\log p}{n} \|\Delta\|_1^2, \quad \forall \, \|\beta_j - \beta^*\|_2 \le r

How is such a result possible?

[Figure: saddle-shaped surface with directions of both positive and negative curvature]

- Loss function has directions of both positive and negative curvature.
- Negative directions are forbidden by the regularizer.
- Only requires restricted curvature within a constant-radius region around β*
Consistency of local stationary points

[Figure: stationary points within radius r of β* lie in a ball of radius O(\sqrt{k \log p / n})]

Theorem (L. '17)

Suppose L_n satisfies α-local RSC and ρ_λ is µ-amenable, with α > µ.
Suppose ‖ℓ'‖_∞ ≤ C and \lambda \asymp \sqrt{\frac{\log p}{n}}. For n \gtrsim \frac{\tau}{\alpha - \mu} k \log p, any stationary point β̃ s.t. ‖β̃ − β*‖_2 ≤ r satisfies

    \|\tilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda \sqrt{k}}{\alpha - \mu}.
Optimization theory

Question: How to obtain sufficiently close local solutions?

Goal: For the regularized M-estimator

    \hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \rho_\lambda(\beta) \Big\},

where ℓ satisfies α-local RSC, find a stationary point β̃ such that ‖β̃ − β*‖_2 ≤ r
Wisdom from Huber
Descending ψ-functions are tricky, especially when the starting values for the iterations are non-robust. ... It is therefore preferable to start with a monotone ψ, iterate to death, and then append a few (1 or 2) iterations with the nonmonotone ψ. — Huber 1981, pp. 191–192
Two-step algorithm (L. ’17)
Use composite gradient descent (Nesterov ’07):Iterative method to solve
β ∈ arg minβ∈ΩLn(β) + ρλ(β),
Ln differentiable, ρλ convex & subdifferentiable
Ln
Ln(t) + hrLn(t), ti +L
2k tk2
2
b tt+1
Updates:
βt+1 ∈ arg minβ∈Ω
Ln(βt) + 〈∇Ln(βt), β − βt〉+
L
2‖β − βt‖2
2 + ρλ(β)
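When ρ_λ is the ℓ₁ penalty, each update is a gradient step followed by soft-thresholding. A minimal sketch, assuming the Huber loss and ignoring the side constraint ‖β‖₁ ≤ R; the step size uses a crude Lipschitz bound, and all constants are illustrative.

```python
import numpy as np

def huber_psi(u, delta=1.345):
    return np.clip(u, -delta, delta)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def composite_gradient_descent(X, y, lam, psi=huber_psi, n_iter=500):
    n, p = X.shape
    L = np.linalg.norm(X, 2)**2 / n       # Lipschitz bound (|psi'| <= 1)
    eta = 1.0 / L
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ psi(X @ beta - y) / n   # gradient of smooth part
        beta = soft_threshold(beta - eta * grad, eta * lam)
    return beta
```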
Two-step algorithm (L. ’17)
Two-step M-estimator: Finds local stationary points of nonconvex,robust loss + µ-amenable penalty
β ∈ arg min‖β‖1≤R
1
n
n∑
i=1
`(xTi β − yi ) + ρλ(β)
Algorithm
1 Run composite gradient descent on convex, robust loss + `1-penaltyuntil convergence, output βH
2 Run composite gradient descent on nonconvex, robust loss +µ-amenable penalty, input β0 = βH
Important: We want to optimize original nonconvex objective, sinceit leads to more efficient (lower-variance) estimators
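A hedged end-to-end sketch of the two steps (not the paper's implementation): step 1 solves the convex Huber + ℓ₁ problem from a zero start; step 2 warm-starts the nonconvex Tukey problem at the Huber solution. For brevity both steps keep the ℓ₁ penalty, whereas the second step in the talk uses a µ-amenable penalty such as SCAD, and the side constraint is again omitted.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad(X, y, lam, psi, beta0, n_iter=500):
    n = X.shape[0]
    eta = n / np.linalg.norm(X, 2)**2
    beta = beta0.copy()
    for _ in range(n_iter):
        beta = soft_threshold(beta - eta * X.T @ psi(X @ beta - y) / n,
                              eta * lam)
    return beta

huber_psi = lambda u, d=1.345: np.clip(u, -d, d)
tukey_psi = lambda u, c=4.685: np.where(np.abs(u) <= c,
                                        u * (1 - (u / c)**2)**2, 0.0)

def two_step(X, y, lam):
    # step 1: convex Huber problem; step 2: warm-started Tukey problem
    beta_h = prox_grad(X, y, lam, huber_psi, np.zeros(X.shape[1]))
    return prox_grad(X, y, lam, tukey_psi, beta_h)
```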
Simulation

[Figure: two panels vs. rescaled sample size n/(k log p), for p = 128, 256, 512, comparing Huber and Cauchy losses. Left: ℓ₂-error ‖β̂ − β*‖₂ for robust regression losses. Right: empirical variance of the first component.]

ℓ₂-error and empirical variance of M-estimators when errors follow a Cauchy distribution (SCAD regularizer)

Can prove geometric convergence of the two-step algorithm to desirable local optima (L. '17)
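A toy driver in the spirit of this experiment (illustrative sizes and λ, not the talk's settings), reusing the two_step routine from the sketch above:

```python
import numpy as np

# Sparse beta*, Gaussian design, Cauchy errors; assumes two_step is
# defined as in the previous sketch.
rng = np.random.default_rng(2)
n, p, k = 400, 128, 5
X = rng.normal(size=(n, p))
beta_star = np.concatenate([np.ones(k), np.zeros(p - k)])
y = X @ beta_star + rng.standard_cauchy(n)

lam = 2 * np.sqrt(np.log(p) / n)      # lambda of order sqrt(log p / n)
beta_hat = two_step(X, y, lam)
print(np.linalg.norm(beta_hat - beta_star))
```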
Summary

Loss functions with desirable robustness properties in low-dimensional regression are also good for high dimensions:

    bounded influence ⟺ ‖ℓ'‖_∞ ≤ C ⟺ O\Big(\sqrt{\frac{k \log p}{n}}\Big) consistency

Two-step optimization procedure: first step for consistency, second step for efficiency

Loh (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Annals of Statistics.
Trailer

Problem: The loss function ℓ is in some sense calibrated to the scale of ε_i

Better objective (joint location/scale estimator):

    (\hat{\beta}, \hat{\sigma}) \in \arg\min_{\beta, \sigma} \underbrace{ \frac{1}{n} \sum_{i=1}^n \ell\Big( \frac{y_i - x_i^T \beta}{\sigma} \Big) \sigma + a\sigma }_{L_n(\beta, \sigma)} + \lambda \|\beta\|_1

However, location/scale estimation is notoriously difficult even in low dimensions
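A minimal sketch of evaluating L_n(β, σ) plus penalty, assuming the Huber loss; the constant a is a calibration constant (in practice chosen so that σ̂ is consistent at the normal model), and the value used here is purely illustrative.

```python
import numpy as np

def huber_loss(u, delta=1.345):
    return np.where(np.abs(u) <= delta, 0.5 * u**2,
                    delta * np.abs(u) - 0.5 * delta**2)

def joint_objective(beta, sigma, X, y, lam, a=0.5):
    # scaled residuals enter the loss; the a*sigma term keeps the
    # objective from collapsing as sigma -> 0
    r = (y - X @ beta) / sigma
    return (np.mean(huber_loss(r)) * sigma + a * sigma
            + lam * np.sum(np.abs(beta)))
```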
Trailer

Another idea: MM-estimator

    \hat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell\Big( \frac{y_i - x_i^T \beta}{\hat{\sigma}_0} \Big) + \lambda \|\beta\|_1 \Big\},

using a robust estimate of scale σ̂₀ based on a preliminary estimate β̂₀

How to obtain (β̂₀, σ̂₀)?

S-estimators/LMS:

    \hat{\beta}_0 \in \arg\min_{\beta} \hat{\sigma}(r(\beta)),

where \hat{\sigma}(r) = |r|_{(n - \lfloor n\delta \rfloor)} is an order statistic of the residuals

LTS:

    \hat{\beta}_0 \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^{n - \lfloor n\alpha \rfloor} \big( y_i - x_i^T \beta \big)^2_{(i)} + \lambda \|\beta\|_1 \Big\}
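A sketch of the penalized least-trimmed-squares objective: only the smallest n − ⌊nα⌋ squared residuals enter the sum, so the largest α-fraction of points cannot influence the fit. The trimming fraction alpha here is an illustrative default.

```python
import numpy as np

def lts_objective(beta, X, y, lam, alpha=0.25):
    r2 = np.sort((y - X @ beta)**2)              # ordered squared residuals
    h = len(y) - int(np.floor(alpha * len(y)))   # number kept after trimming
    return np.sum(r2[:h]) / len(y) + lam * np.sum(np.abs(beta))
```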
Trailer
Maybe an entirely different approach is necessary . . .
Loh (2017). Scale estimation for high-dimensional robust regression.
Coming soon?
Thank you!