Robust estimation via (non)convex M-estimation
Po-Ling Loh
University of Wisconsin–Madison, Departments of ECE & Statistics

Workshop on computational efficiency and high-dimensional robust statistics
TTI Chicago
August 15, 2018
Outline

1. Regularized M-estimators
   - Statistical M-estimation
   - Nonconvexity
   - Consistency of local optima
2. High-dimensional robust regression
   - Statistical consistency
   - Asymptotic normality
   - Two-step M-estimators
Regularized M-estimators

Prediction/regression problem: Observe {(x_i, y_i)}_{i=1}^n and estimate

    β∗ = arg min_{β ∈ R^p} E[ℓ(β; x_i, y_i)],    x_i ∈ R^p, y_i ∈ R

Statistical M-estimator:

    β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n ℓ(β; x_i, y_i) }

In high dimensions, this may be ill-conditioned, with a large solution space.

Regularized M-estimator:

    β̂ ∈ arg min_β { Ln(β) + ρλ(β) },    where Ln(β) := (1/n) ∑_{i=1}^n ℓ(β; x_i, y_i)
Example: ℓ1-regularized OLS regression

Linear model: y_i = x_iᵀβ∗ + ε_i,  ‖β∗‖_0 ≤ k

Low-dimensional M-estimator:

    β̂_OLS ∈ arg min_β { (1/n) ∑_{i=1}^n (y_i − x_iᵀβ)² } = ( (1/n) ∑_{i=1}^n x_i x_iᵀ )^{−1} ( (1/n) ∑_{i=1}^n y_i x_i )

High-dimensional regularized M-estimator:

    β̂_Lasso ∈ arg min_β { (1/n) ∑_{i=1}^n (y_i − x_iᵀβ)² + λ‖β‖_1 }
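As a concrete reference point, here is a minimal proximal-gradient (ISTA) sketch of the Lasso objective above. The step-size rule, iteration count, and simulated data are illustrative assumptions, not part of the talk.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding: the prox of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/n)||y - X b||_2^2 + lam * ||b||_1 by proximal gradient."""
    n, p = X.shape
    # Step size 1/L, where L = 2 * lambda_max(X^T X / n) bounds the smooth part.
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 / n)
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Toy example: sparse beta*, Gaussian design (illustrative sizes).
rng = np.random.default_rng(0)
n, p, k = 200, 500, 5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:k] = 1.0
y = X @ beta_star + 0.5 * rng.standard_normal(n)
beta_hat = lasso_ista(X, y, lam=0.1)
print(np.linalg.norm(beta_hat - beta_star))
```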
Sources of nonconvexity

May arise in the loss or the regularizer:
- Nonconvex loss: used to correct bias, increase efficiency
- Nonconvex regularizer: used to reduce bias, achieve oracle results
Example: Errors-in-variables regression

Model:

    y_i = x_iᵀβ∗ + ε_i

Observe {(z_i, y_i)}_{i=1}^n, where the z_i's are corrupted versions of the covariates x_i; infer β∗.

The naive ℓ1-penalized least-squares estimator

    β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n (z_iᵀβ − y_i)² + λ‖β‖_1 }

is statistically inconsistent.
Corrected losses

L. & Wainwright ’12 propose a natural method for correcting the loss for linear regression:

    β̂_OLS ∈ arg min_β { (1/2) βᵀ (XᵀX/n) β − (yᵀX/n) β + ρλ(β) }

    β̂_corr ∈ arg min_β { (1/2) βᵀ Γ̂ β − γ̂ᵀ β + ρλ(β) }

where (Γ̂, γ̂) are estimators for (Cov(x_i), Cov(x_i, y_i)) based on {(z_i, y_i)}_{i=1}^n.
Example: Additive noise

Additive noise: Z = X + W; use

    Γ̂ = ZᵀZ/n − Σ_w,    γ̂ = Zᵀy/n

However, the corrected objective is nonconvex:

    β̂_corr ∈ arg min_β { (1/2) βᵀ (ZᵀZ/n − Σ_w) β − (yᵀZ/n) β + ρλ(β) }

Fortunately, local optima have good properties.
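A minimal sketch of this corrected estimator with an ℓ1 penalty, solved by composite gradient descent (gradient step plus soft-thresholding). The simulated data, fixed step size, and the omission of the side constraint ‖β‖_1 ≤ R used in the theory are simplifying assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def corrected_lasso(Z, y, Sigma_w, lam, step=1e-2, n_iter=2000):
    """Composite gradient descent on
       (1/2) b^T Gamma b - gamma^T b + lam * ||b||_1,
    with Gamma = Z^T Z / n - Sigma_w (possibly indefinite)."""
    n, p = Z.shape
    Gamma = Z.T @ Z / n - Sigma_w
    gamma = Z.T @ y / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = Gamma @ beta - gamma   # gradient of the quadratic part
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Toy example with additive measurement noise W (illustrative parameters).
rng = np.random.default_rng(1)
n, p, k, sigma_w = 300, 100, 5, 0.3
X = rng.standard_normal((n, p))
Z = X + sigma_w * rng.standard_normal((n, p))   # observed, corrupted covariates
beta_star = np.zeros(p); beta_star[:k] = 1.0
y = X @ beta_star + 0.1 * rng.standard_normal(n)
Sigma_w = sigma_w**2 * np.eye(p)
print(np.linalg.norm(corrected_lasso(Z, y, Sigma_w, lam=0.1) - beta_star))
```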
Nonconvex regularizers

ℓ1 is a "convexified" version of ℓ0:

[Figure: the ℓ0 penalty (0 at the origin, 1 elsewhere) vs. the ℓ1 penalty (absolute value)]

But ℓ1 penalizes larger coefficients more, causing solution bias.
Alternative regularizers

Various nonconvex regularizers appear in the literature (Fan & Li ’01, Zhang ’10, etc.)

[Plot: SCAD, MCP, ℓq, capped-ℓ1, and ℓ1 penalty functions]
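For reference, a sketch evaluating two of these penalties. The closed forms below are the standard SCAD (Fan & Li ’01) and MCP (Zhang ’10) formulas; the parameter defaults are illustrative.

```python
import numpy as np

def scad(t, lam, gamma=3.7):
    """SCAD penalty of Fan & Li '01 (requires gamma > 2)."""
    a = np.abs(t)
    small = a <= lam
    mid = (a > lam) & (a <= gamma * lam)
    return np.where(small, lam * a,
           np.where(mid, (2 * gamma * lam * a - a**2 - lam**2) / (2 * (gamma - 1)),
                    lam**2 * (gamma + 1) / 2))

def mcp(t, lam, gamma=3.0):
    """MCP penalty of Zhang '10 (requires gamma > 1)."""
    a = np.abs(t)
    return np.where(a <= gamma * lam, lam * a - a**2 / (2 * gamma),
                    gamma * lam**2 / 2)

ts = np.linspace(-4, 4, 9)
print(scad(ts, lam=1.0))
print(mcp(ts, lam=1.0))
```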
Empirical benefits

Nonconvex regularizers show significant empirical improvement (Breheny & Huang ’11).
Local vs. global optima

- Optimization algorithms are only guaranteed to find local optima (stationary points)
- Statistical theory only guarantees consistency of global optima

[Figure: global optimum β̂ and stationary point β̃ both within O(√(k log p / n)) of β∗]

L. & Wainwright ’13: All stationary points of Ln(β) + ρλ(β) are close when the nonconvexity is smaller than the curvature.
Measures of closeness

Various measures of statistical consistency for

    β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n ℓ(β; x_i, y_i) + ρλ(β) }

- Estimation: ‖β̂ − β∗‖ → 0
- Prediction: (1/n) ∑_{i=1}^n ℓ(β̂; x_i, y_i) → 0
- Variable selection: supp(β̂) → supp(β∗)

Interested in cases where ℓ and ρλ are possibly nonconvex.
Estimation/prediction consistency

Composite objective function

    β̂ ∈ arg min_{‖β‖_1 ≤ R} { Ln(β) + ∑_{j=1}^p ρλ(β_j) }

- Ln satisfies restricted strong convexity with curvature α
- ρλ has a bounded subgradient at 0, and ρλ(t) + µt² is convex

L. & Wainwright ’13: All stationary points of Ln(β) + ρλ(β) are close when α > µ.
Geometric intuition

Population-level convexity, finite-sample nonconvexity.

[Figure: function view (Ln near β∗) and gradient view (∇Ln near β∗)]

- Population-level objective L is strongly convex, α > µ
- RSC quantifies the convergence rate of ∇Ln → ∇L
More formally

[Figure: global optimum β̂ and stationary point β̃ both within O(√(k log p / n)) of β∗]

Stationary points are statistically indistinguishable from global optima:

    ⟨∇Ln(β̃) + ∇ρλ(β̃), β − β̃⟩ ≥ 0,    for all feasible β

Nonasymptotic rates: For λ ≍ √(log p / n) and R ≍ 1/λ,

    ‖β̃ − β∗‖_2 ≤ c √(k log p / n) ≈ statistical error
Technical conditions

Requirements on the loss and regularizer to ensure consistency of stationary points:
- Restricted strong convexity of Ln
- Bound on the nonconvexity of ρλ

"Oracle" result under an additional condition on ρλ.
Conditions on Ln

Restricted strong convexity (Negahban et al. ’12):

    ⟨∇Ln(β∗ + Δ) − ∇Ln(β∗), Δ⟩ ≥
        α‖Δ‖_2² − τ (log p / n) ‖Δ‖_1²,    for all ‖Δ‖_2 ≤ r,
        α‖Δ‖_2 − τ √(log p / n) ‖Δ‖_1,     otherwise.

How is such a result possible?

[Surface plot: a loss with directions of both positive and negative curvature]

The loss function has directions of both positive and negative curvature; the negative directions are forbidden by the regularizer.

Holds for various convex/nonconvex losses:
- OLS & corrected OLS for linear regression; log likelihood for GLMs
- Huber loss for robust regression
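A quick numerical sketch of the first RSC inequality for the OLS loss, where the left-hand side equals 2Δᵀ(XᵀX/n)Δ. The constants α and τ and the sparse test directions below are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 1000
X = rng.standard_normal((n, p))
alpha, tau = 0.5, 2.0                  # illustrative RSC constants

# For Ln(b) = (1/n) ||y - X b||^2, the RSC left-hand side is
# <grad Ln(b* + D) - grad Ln(b*), D> = 2 D^T (X^T X / n) D.
violations = 0
for _ in range(1000):
    D = np.zeros(p)
    support = rng.choice(p, size=10, replace=False)   # sparse test directions
    D[support] = rng.standard_normal(10)
    lhs = 2 * D @ (X.T @ (X @ D)) / n
    rhs = alpha * D @ D - tau * (np.log(p) / n) * np.sum(np.abs(D))**2
    violations += lhs < rhs
print("violations out of 1000:", violations)
```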
Conditions on ρλ

Focus on amenable regularizers ρλ(β) = ∑_{j=1}^p ρλ(β_j) satisfying:
- ρλ(0) = 0, symmetric around 0
- Nondecreasing on R+
- t ↦ ρλ(t)/t nonincreasing on R+
- qλ(t) := λ|t| − ρλ(t) differentiable everywhere
- ρλ(t) + µt² convex for some µ > 0
- ρ′λ(t) = 0 for t ≥ γλ, for some γ > 0

Examples:
- µ-amenable: ℓ1, SCAD, MCP, LSP
- (µ, γ)-amenable: SCAD, MCP
- Neither: capped-ℓ1, bridge penalty (ℓq, 0 < q < 1)
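As a sanity check on the convexity condition above, a sketch that numerically verifies ρλ(t) + µt² is convex for MCP. Under the convention written above (penalty plus µt²), µ = 1/(2γ) suffices for MCP; treat that constant, and the grid, as assumptions of the sketch.

```python
import numpy as np

def mcp(t, lam=1.0, gamma=3.0):
    """MCP penalty of Zhang '10."""
    a = np.abs(t)
    return np.where(a <= gamma * lam, lam * a - a**2 / (2 * gamma),
                    gamma * lam**2 / 2)

gamma = 3.0
mu = 1.0 / (2 * gamma)        # assumed sufficient constant under this convention
ts = np.linspace(-6.0, 6.0, 4001)
f = mcp(ts, gamma=gamma) + mu * ts**2
# Discrete convexity check: second differences must be nonnegative.
second_diff = f[2:] - 2 * f[1:-1] + f[:-2]
print("convex:", bool(np.all(second_diff >= -1e-10)))
```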
Statistical consistency

Regularized M-estimator

    β̂ ∈ arg min_{‖β‖_1 ≤ R} { Ln(β) + ρλ(β) },

where the loss function satisfies (α, τ)-RSC and the regularizer is µ-amenable.

Theorem (L. & Wainwright ’13). Suppose R is chosen s.t. β∗ is feasible, and λ satisfies

    max{ ‖∇Ln(β∗)‖_∞, α √(log p / n) } ≲ λ ≲ α/R.

For n ≥ (Cτ²/α²) R² log p, any stationary point β̃ satisfies

    ‖β̃ − β∗‖_2 ≲ λ√k / (α − µ),    where k = ‖β∗‖_0.
Outline

1. Regularized M-estimators
   - Statistical M-estimation
   - Nonconvexity
   - Consistency of local optima
2. High-dimensional robust regression
   - Statistical consistency
   - Asymptotic normality
   - Two-step M-estimators
Robust regression functions

Linear model:

    y_i = x_iᵀβ∗ + ε_i

Heavy-tailed distribution on ε_i and/or outlier contamination.

Use the M-estimator

    β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n ℓ(x_iᵀβ − y_i) }
Classes of loss functions

A bounded ℓ′ limits the influence of outliers:

    IF((x, y); T, F) = lim_{t→0+} [T((1 − t)F + t δ_{(x,y)}) − T(F)] / t ∝ ℓ′(xᵀβ − y) x,

where F ∼ F_β and T is the functional defined by the M-estimator.

Redescending M-estimators have a finite rejection point:

    ℓ′(u) = 0,    for |u| ≥ c

[Plot: least squares, absolute value, Huber, and Tukey losses (P. Breheny, BST 764: Applied Statistical Modeling)]

But bad for optimization!
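A sketch of two standard robust losses and their derivatives ψ = ℓ′: Huber (bounded ℓ′) and the Tukey biweight (redescending, with rejection point c). The closed forms are the classical definitions; the cutoff values are conventional defaults, used here for illustration.

```python
import numpy as np

def huber(u, c=1.345):
    """Huber loss: quadratic near 0, linear in the tails (bounded derivative)."""
    a = np.abs(u)
    return np.where(a <= c, 0.5 * u**2, c * a - 0.5 * c**2)

def huber_psi(u, c=1.345):
    return np.clip(u, -c, c)

def tukey(u, c=4.685):
    """Tukey biweight: constant for |u| > c (redescending, finite rejection point)."""
    a = np.minimum(np.abs(u) / c, 1.0)
    return (c**2 / 6) * (1 - (1 - a**2)**3)

def tukey_psi(u, c=4.685):
    return np.where(np.abs(u) <= c, u * (1 - (u / c)**2)**2, 0.0)

u = np.linspace(-6, 6, 7)
print(huber(u), tukey(u))
print(huber_psi(u), tukey_psi(u))
```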
High-dimensional M-estimators

Natural idea: For p > n, use the regularized version:

    β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n ℓ(x_iᵀβ − y_i) + λ‖β‖_1 }

Complications:
- Optimization for nonconvex ℓ?
- Statistical theory? Are certain losses provably better than others?
Overview of results

When ‖ℓ′‖_∞ < C, global optima of the high-dimensional M-estimator satisfy

    ‖β̂ − β∗‖_2 ≤ C √(k log p / n),

regardless of the distribution of ε_i.

Compare to Lasso theory, which requires sub-Gaussian ε_i's.

If ℓ(u) is locally convex/smooth for |u| ≤ r, any local optima within radius cr of β∗ satisfy

    ‖β̃ − β∗‖_2 ≤ C′ √(k log p / n).

Local optima may be obtained via a two-step algorithm.
Theoretical insight

Lasso analysis (e.g., van de Geer ’07, Bickel et al. ’08):

    β̂ ∈ arg min_β { (1/n)‖y − Xβ‖_2² + λ‖β‖_1 } =: arg min_β Ln(β)

Rearranging the basic inequality Ln(β̂) ≤ Ln(β∗) and assuming λ ≥ 2‖Xᵀε/n‖_∞, obtain

    ‖β̂ − β∗‖_2 ≤ cλ√k.

Sub-Gaussian assumptions on the x_i's and ε_i's provide O(√(k log p / n)) bounds, which are minimax optimal.
Theoretical insight

Key observation: For a general loss function, if λ ≥ 2‖Xᵀℓ′(ε)/n‖_∞, obtain

    ‖β̂ − β∗‖_2 ≤ cλ√k.

ℓ′(ε) is sub-Gaussian whenever ℓ′ is bounded
⟹ can achieve estimation error

    ‖β̂ − β∗‖_2 ≤ c √(k log p / n),

without assuming ε_i is sub-Gaussian.
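A small simulation sketch of this observation: with Cauchy errors, the raw score ‖Xᵀε/n‖_∞ that calibrates λ for the Lasso fluctuates wildly across replications, while the bounded-derivative version ‖Xᵀℓ′(ε)/n‖_∞ (here with a Huber ψ, an illustrative choice) stays controlled. Dimensions and the number of replications are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, c = 500, 1000, 1.345
psi = lambda u: np.clip(u, -c, c)        # bounded Huber derivative

raw, robust = [], []
for _ in range(20):
    X = rng.standard_normal((n, p))
    eps = rng.standard_cauchy(n)          # heavy-tailed errors, no moments
    raw.append(np.max(np.abs(X.T @ eps)) / n)
    robust.append(np.max(np.abs(X.T @ psi(eps))) / n)
print("raw score (min, max):   ", np.round([np.min(raw), np.max(raw)], 3))
print("robust score (min, max):", np.round([np.min(robust), np.max(robust)], 3))
```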
Local statistical consistency

Local RSC condition: For Δ := β_1 − β_2,

    ⟨∇Ln(β_1) − ∇Ln(β_2), Δ⟩ ≥ α‖Δ‖_2² − τ (log p / n) ‖Δ‖_1²,    for all ‖β_j − β∗‖_2 ≤ r.

How is such a result possible?

[Surface plot: a loss with directions of both positive and negative curvature]

The loss function has directions of both positive and negative curvature; the negative directions are forbidden by the regularizer. Only requires restricted curvature within a constant-radius region around β∗.
Consistency of local stationary points

[Figure: stationary points within radius r of β∗ lie within O(√(k log p / n)) of β∗]

Theorem (L. ’15). Suppose Ln satisfies the α-local RSC condition and ρλ is µ-amenable, with α > µ. Suppose (λ, R) are chosen appropriately. For n ≳ (τ/(α − µ)) k log p, any stationary point β̃ s.t. ‖β̃ − β∗‖_2 ≤ r satisfies

    ‖β̃ − β∗‖_2 ≲ λ√k / (α − µ).
Second-order considerations

Question: If any bounded-derivative ℓ works for heavy-tailed distributions, why not always use Huber? LAD?

Answer: It has to do with (asymptotic) efficiency.

- In low-dimensional settings, the MLE is maximally efficient with respect to variance.
- Although the MLE may behave erratically when p/n converges to a constant in (0, 1], one can achieve simple asymptotic normality results under a sparsity assumption, via the oracle property.
Asymptotic efficiency

[Plots: ℓ2-error and empirical variance of the first component vs. n/(k log p), for Huber and Cauchy losses, p = 128, 256, 512]

ℓ2-error and empirical variance of M-estimators when the errors follow a Cauchy distribution (SCAD regularizer).
How to obtain local efficient solutions?

Goal: Nonconvex regularized M-estimator such that ℓ satisfies the α-local RSC condition.

Target: Locate a stationary point within radius r of β∗.

    "Descending ψ-functions are tricky, especially when the starting values for the iterations are non-robust. . . . It is therefore preferable to start with a monotone ψ, iterate to death, and then append a few (1 or 2) iterations with the nonmonotone ψ." — Huber 1981, pp. 191–192
Two-step algorithm (L. ’15)

Use composite gradient descent starting from a close initialization.

Two-step M-estimator: Finds local stationary points of a nonconvex, robust loss + (µ, γ)-amenable penalty.

Algorithm (a code sketch follows below):
1. Run composite gradient descent on a convex, robust loss + ℓ1 penalty until convergence; output β̂_H.
2. Run composite gradient descent on the nonconvex, robust loss + (µ, γ)-amenable penalty, with input β⁰ = β̂_H.

Theoretical guarantees on the (rate of) convergence to the optimal point.
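A minimal sketch of this two-step scheme, using the Huber loss + ℓ1 in step one and the Tukey loss + MCP in step two. The specific loss/penalty pair, the MCP proximal formula, step size, and simulated data are illustrative choices, not the talk's prescribed implementation.

```python
import numpy as np

def huber_psi(u, c=1.345):
    return np.clip(u, -c, c)                      # bounded, monotone psi

def tukey_psi(u, c=4.685):
    return np.where(np.abs(u) <= c, u * (1 - (u / c)**2)**2, 0.0)  # redescending

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def mcp_prox(v, lam, eta, gamma=3.0):
    """Proximal map of eta * MCP(lam, gamma); requires gamma > eta."""
    inner = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0) / (1 - eta / gamma)
    return np.where(np.abs(v) <= gamma * lam, inner, v)

def composite_gd(psi, prox, X, y, beta0, eta=0.1, n_iter=2000):
    """Composite gradient descent on (1/n) sum l(x_i^T b - y_i) + penalty,
    where l' = psi and prox is the penalty's proximal map with step eta."""
    n = len(y)
    beta = beta0.copy()
    for _ in range(n_iter):
        grad = X.T @ psi(X @ beta - y) / n
        beta = prox(beta - eta * grad)
    return beta

rng = np.random.default_rng(4)
n, p, k = 300, 600, 5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:k] = 1.0
y = X @ beta_star + rng.standard_cauchy(n)        # heavy-tailed errors

lam, eta = 0.2, 0.1
# Step 1: convex robust loss (Huber) + l1 penalty, started from zero.
beta_h = composite_gd(huber_psi, lambda v: soft_threshold(v, eta * lam),
                      X, y, np.zeros(p), eta)
# Step 2: nonconvex robust loss (Tukey) + MCP penalty, warm-started at beta_h.
beta_t = composite_gd(tukey_psi, lambda v: mcp_prox(v, lam, eta),
                      X, y, beta_h, eta, n_iter=500)
print(np.linalg.norm(beta_h - beta_star), np.linalg.norm(beta_t - beta_star))
```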
Summary of the two-step M-estimator

Output is computationally and statistically efficient:
- Computational guarantee on the rate of convergence in each step of the M-estimation algorithm
- Statistical guarantee on the asymptotic efficiency of the estimator (assuming a β-min condition and (µ, γ)-amenability)
Summary

Theory for nonconvex regularized M-estimators:
- Global RSC condition ⟹ all stationary points within statistical error of β∗
- Local RSC condition ⟹ stationary points in a local region within statistical error of β∗

Consequences for high-dimensional robust regression estimators:
- Consistency under relaxed distributional assumptions when ℓ′ is bounded
- Oracle estimator with a (µ, γ)-amenable regularizer ⟹ asymptotic efficiency
- Two-step M-estimator produces local oracle solutions
References

P. Loh and M. J. Wainwright (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research.

P. Loh (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Annals of Statistics.

Thank you!