Upload
elwin-gyles-carson
View
221
Download
2
Embed Size (px)
Citation preview
Sketching for M-Estimators: A Unified Approach to Robust
Regression
Kenneth Clarkson David Woodruff
IBM Almaden
RegressionLinear Regression• Statistical method to study linear dependencies between
variables in the presence of noise.
Example• Ohm's law V = R ∙ I
• Find linear function that
best fits the data
Regression
Standard Setting• One measured variable b• A set of predictor variables a ,…, a• Assumption:
b = x + a x + … + a x + is assumed to be noise and the xi are model
parameters we want to learn
• Can assume x0 = 0
• Now consider n observations of b
1 d
1
1 d
d
0
Regression
Matrix form
Input: nd-matrix A and a vector b=(b1,…, bn)n is the number of observations; d is the number of predictor variables
Output: x* so that Ax* and b are close
• Consider the over-constrained case, when n À d
Fitness MeasuresLeast Squares Method
• Find x* that minimizes |Ax-b|22
• Ax* is the projection of b onto the column span of A• Certain desirable statistical properties• Closed form solution: x* = (ATA)-1 AT b
Method of least absolute deviation (l1 -regression)
• Find x* that minimizes |Ax-b|1 = |bi – <Ai*, x>|
• Cost is less sensitive to outliers than least squares• Can solve via linear programming
What about the many other fitness measures used in practice?
M-Estimators• Measure function
– G: R -> R¸ 0 – G(x) = G(-x), G(0) = 0– G is non-decreasing in |x|
• |y|M = Σi=1n G(yi)
• Solve minx |Ax-b|M
• Least squares and L1-regression are special cases
Huber Loss FunctionG(x) = x2/(2c) for |x| · c
G(x) = |x|-c/2 for |x| > c
Enjoys smoothness properties of l22 androbustness properties of l1
Other Examples• L1-L2
G(x) = 2((1+x2/2)1/2 – 1)
• Fair estimator
G(x) = c2 [ |x|/c - log(1+|x|/c) ]
• Tukey estimator
G(x) = c2/6 (1-[1-(x/c)2]3) if |x| · c = c2/6 if |x| > c
Nice M-Estimators• An M-Estimator is nice if it has at least linear growth and at
most quadratic growth
• There is CG > 0 so that for all a, a’ with |a| ¸ |a’| > 0,
|a/a’|2 ¸ G(a)/G(a’) ¸ CG |a/a’|
• Any convex G satisfies the linear lower bound
• Any sketchable G satisfies the quadratic upper bound– sketchable => there is a distribution on t x n matrices S for which |
Sx|M = £(|x|M) with probability 2/3 and t is slow-growing function of n
Our ResultsLet nnz(A) denote # of non-zero entries of an n x d matrix A
1. [Huber] O(nnz(A) log n) + poly(d log n / ε) time algorithm to output x’ so that w.h.p.
|Ax’-b|H · (1+ε) minx |Ax-b|H
2. [Nice M-Estimators] O(nnz(A)) + poly(d log n) time algorithm to output x’ so that for any constant C > 1, w.h.p.
|Ax’-b|M · C*minx |Ax-b|M
Remarks:
- For convex nice M-estimators can solve with convex programming, but slow – poly(nd) time
- Our algorithm for nice M-estimators is universal
Talk Outline
• Huber result
• Nice M-Estimators result
Naive Sampling Algorithm
A x - bminx
M
x’ = argminxS¢A x S¢b-
MS uniformly samples poly(d/ε) rows – this is a terrible algorithm
Leverage Score Sampling
• For lp-norms, there are probabilities q1, …, qn with Σi qi = poly(d/ε) so that sampling works
A x - bminx M
x’ = argminxS¢A S¢b-
M• All qi can be found in O(nnz(A)log n) + poly(d) time
• S is diagonal. Si,i = 1/qi if row i is sampled, 0 otherwise
x
- For l2, the qi are the squared row norms in an orthonormal basis of A
- For lp, the qi are p-th powers of the p-norms of rows in a
“well conditioned basis”[Dasgupta et al.]
- For l2, the qi are the squared row norms in an orthonormal basis of A
- For lp, the qi are p-th powers of the p-norms of rows in a
“well conditioned basis”[Dasgupta et al.]
Huber Regression Algorithm
• [Huber inequality]: For z 2 Rn,
£(n-1/2) min(|z|1, |z|22/(2c)) · |z|H · |z|1
• Proof by case analysis
• Sample from a mixture of l1-leverage scores and l2-leverage scores– pi = n1/2¢(qi
(1) + qi(2))
• Our nnz(A)log n + poly(d/ε) algorithm– After one step, number of rows < n1/2 poly(d/ε)– Recursively solve a weighted Huber– Weights do not grow quickly– Once size is < n.01 poly(d/ε), solve by convex programming
Talk Outline
• Huber result
• Nice M-Estimators result
CountSketch
• For l2 regression, CountSketch with poly(d) rows works [Clarkson, W]:
• Compute S*A in nnz(A) time
• Compute x’ = argminx |SAx-Sb|2 in poly(d) time
[ [0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 00 0 0 -1 1 0 -1 00-1 0 0 0 0 0 1
S =
M-Sketch
[[S0 ¢ D0
S1 ¢ D1
S2 ¢ D2
…Slog n ¢ Dlog n
• Si are independent CountSketch matrices with poly(d) rows
• Di is n x n diagonal and uniformly samples a 1/(d log n)i fraction of the n rows
-The same M-Sketch works for all nice M-estimators!
x’ = argminx |TAx-Tb|M
-The same M-Sketch works for all nice M-estimators!
x’ = argminx |TAx-Tb|M
- Sketch used for estimating frequency
moments [Indyk, W] and earthmover distance
[Verbin, Zhang]
- Sketch used for estimating frequency
moments [Indyk, W] and earthmover distance
[Verbin, Zhang]
T =
- many analyses of this data structure don’t work since they reduce the problem to a non-convex problem
- we show it works for “lopsided” subspace embeddings
- many analyses of this data structure don’t work since they reduce the problem to a non-convex problem
- we show it works for “lopsided” subspace embeddings
M-Sketch Intuition• Consider a fixed y = Ax-b
• For M-Sketch T, output |Ty|w, M = Σi wi G((Ty)i)
• [Contraction] |Ty|w,M ¸ ½ |y|M w.pr. 1-exp(-d log n)
• [Dilation] |Ty|w,M · 2 |y|M w.pr. 9/10
• Contraction allows for a net argument (no scale-invariance!)
• Dilation implies the optimal y* does not dilate much
• Analysis uses “bucket crowding”, “level sets”, and “Ky-Fan Norms”
ConclusionsSummary:
1. [Huber] O(nnz(A) log n) + poly(d log n / ε) time algorithm
2. [Nice M-Estimators] O(nnz(A)) + poly(d) time algorithm
Followup Work / Questions:
1. Results for low rank approximation [Clarkson,W15]
2. (Meta-question) Apply streaming techniques to linear algebra
- countsketch –> l2-regression
- p-stable random variables -> lp regression for p in [1,2]- countsketch + heavy hitters -> nice M-estimators- Pagh’s tensorsketch -> polynomial kernel regression
…