Sketching for M-Estimators: A Unified Approach to Robust Regression

Kenneth Clarkson, David Woodruff

IBM Almaden

Regression

Linear Regression

• Statistical method to study linear dependencies between variables in the presence of noise.

Example

• Ohm's law: V = R ∙ I

• Find the linear function that best fits the data

Regression

Standard Setting

• One measured variable b

• A set of predictor variables $a_1, \ldots, a_d$

• Assumption: $b = x_0 + a_1 x_1 + \cdots + a_d x_d + \varepsilon$, where $\varepsilon$ is assumed to be noise and the $x_i$ are model parameters we want to learn

• Can assume $x_0 = 0$

• Now consider n observations of b

Regression

Matrix form

Input: an $n \times d$ matrix A and a vector $b = (b_1, \ldots, b_n)$, where n is the number of observations and d is the number of predictor variables

Output: x* so that Ax* and b are close

• Consider the over-constrained case, when $n \gg d$

Fitness Measures

Least Squares Method

• Find x* that minimizes $\|Ax-b\|_2^2$

• Ax* is the projection of b onto the column span of A

• Certain desirable statistical properties

• Closed form solution: $x^* = (A^T A)^{-1} A^T b$
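For a single predictor with no intercept, as in the Ohm's law example, the closed form collapses to $x^* = \sum_i a_i b_i / \sum_i a_i^2$. The following is a minimal sketch of that special case; the helper name `least_squares_1d` and the measurements are made up for illustration.

```python
def least_squares_1d(a, b):
    """Closed-form least squares for b ~ a * x with one predictor, no intercept.

    This is x* = (A^T A)^{-1} A^T b specialised to d = 1.
    """
    ata = sum(ai * ai for ai in a)              # A^T A (a scalar when d = 1)
    atb = sum(ai * bi for ai, bi in zip(a, b))  # A^T b
    return atb / ata


# Illustrative (made-up) data: currents in amperes, voltages in volts.
current = [1.0, 2.0, 3.0, 4.0]
voltage = [2.0, 4.0, 6.0, 8.0]   # exactly V = 2 * I
resistance = least_squares_1d(current, voltage)
```

Because the made-up data lie exactly on a line through the origin, the fitted resistance equals the true slope.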

Method of least absolute deviation ($\ell_1$-regression)

• Find x* that minimizes $\|Ax-b\|_1 = \sum_i |b_i - \langle A_{i*}, x \rangle|$

• Cost is less sensitive to outliers than least squares

• Can solve via linear programming
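The outlier claim can be seen in the one-predictor, no-intercept case, where no linear program is needed: since $|b_i - a_i x| = |a_i| \cdot |b_i/a_i - x|$, a minimizer is a weighted median of the ratios $b_i/a_i$. The sketch below uses that shortcut rather than linear programming; the function names and data are illustrative assumptions.

```python
def l1_regression_1d(a, b):
    """Minimise sum_i |b_i - a_i * x| over x (one predictor, no intercept).

    A minimiser is a weighted median of the ratios b_i / a_i with weights
    |a_i|. Rows with a_i == 0 do not depend on x and are skipped.
    """
    pairs = sorted((bi / ai, abs(ai)) for ai, bi in zip(a, b) if ai != 0)
    half = sum(w for _, w in pairs) / 2.0
    running = 0.0
    for ratio, w in pairs:
        running += w
        if running >= half:
            return ratio
    raise ValueError("need at least one row with a_i != 0")


def l2_regression_1d(a, b):
    """Least squares fit for the same model, for comparison."""
    return sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)


# Made-up data following b = 2 * a, except the last observation is an outlier.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 100.0]
x_l1 = l1_regression_1d(a, b)   # stays at the slope of the clean points
x_l2 = l2_regression_1d(a, b)   # pulled towards the outlier
```

On this data the $\ell_1$ fit keeps the slope of the four clean points, while the least squares fit is dragged far above it by the single corrupted observation.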

What about the many other fitness measures used in practice?

M-Estimators

• Measure function
  – $G: \mathbb{R} \to \mathbb{R}_{\ge 0}$
  – $G(x) = G(-x)$, $G(0) = 0$
  – G is non-decreasing in |x|

• $\|y\|_M = \sum_{i=1}^n G(y_i)$

• Solve $\min_x \|Ax-b\|_M$

• Least squares and L1-regression are special cases
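As a small illustration of the definition, the cost can be written once for an arbitrary measure function G and then instantiated with $G(t) = t^2$ (least squares) or $G(t) = |t|$ ($\ell_1$-regression). The helper names `m_cost` and `residual` and the tiny instance are assumptions made for this sketch.

```python
def m_cost(G, y):
    """Return |y|_M = sum_i G(y_i) for a measure function G."""
    return sum(G(yi) for yi in y)


def residual(A, x, b):
    """Return the vector Ax - b for a matrix A given as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) - bi
            for row, bi in zip(A, b)]


def square(t):
    return t * t        # G for least squares


absolute = abs          # G for l1-regression

# Made-up 3 x 2 instance.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 5.0]
x = [1.0, 2.0]
y = residual(A, x, b)             # only the third row has a nonzero residual
cost_l2 = m_cost(square, y)       # squared 2-norm of the residual
cost_l1 = m_cost(absolute, y)     # 1-norm of the residual
```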

Huber Loss Function

$G(x) = x^2/(2c)$ for $|x| \le c$

$G(x) = |x| - c/2$ for $|x| > c$

Enjoys smoothness properties of $\ell_2^2$ and robustness properties of $\ell_1$
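A direct transcription of the two cases, as a minimal sketch: the default threshold `c=1.0` and the helper `huber_cost` are assumptions for illustration. The two branches agree at $|x| = c$ (both give $c/2$), which is where the function switches from quadratic to linear.

```python
def huber(x, c=1.0):
    """Huber measure function with threshold c > 0."""
    ax = abs(x)
    if ax <= c:
        return ax * ax / (2.0 * c)   # quadratic near zero
    return ax - c / 2.0              # linear in the tails


def huber_cost(y, c=1.0):
    """Return |y|_H = sum_i G(y_i) for the Huber measure function."""
    return sum(huber(yi, c) for yi in y)
```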

Other Examples

• L1-L2

$G(x) = 2\left((1 + x^2/2)^{1/2} - 1\right)$

• Fair estimator

$G(x) = c^2 \left[\, |x|/c - \log(1 + |x|/c) \,\right]$

• Tukey estimator

$G(x) = \frac{c^2}{6}\left(1 - [1 - (x/c)^2]^3\right)$ if $|x| \le c$, and $G(x) = c^2/6$ if $|x| > c$
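The three measure functions can be transcribed as follows. This is a sketch only: the function names and the default `c=1.0` are assumptions, and the logarithm is taken to be the natural logarithm.

```python
import math


def l1_l2(x):
    """L1-L2 measure function."""
    return 2.0 * (math.sqrt(1.0 + x * x / 2.0) - 1.0)


def fair(x, c=1.0):
    """Fair measure function with scale c > 0."""
    r = abs(x) / c
    return c * c * (r - math.log1p(r))   # log1p(r) = log(1 + r)


def tukey(x, c=1.0):
    """Tukey measure function; constant once |x| exceeds c."""
    if abs(x) <= c:
        return c * c / 6.0 * (1.0 - (1.0 - (x / c) ** 2) ** 3)
    return c * c / 6.0
```

Note that the Tukey function is bounded, so very large residuals all contribute the same cost $c^2/6$.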

Nice M-Estimators

• An M-Estimator is nice if it has at least linear growth and at most quadratic growth

• There is $C_G > 0$ so that for all $a, a'$ with $|a| \ge |a'| > 0$,

$|a/a'|^2 \ge G(a)/G(a') \ge C_G\, |a/a'|$

• Any convex G satisfies the linear lower bound

• Any sketchable G satisfies the quadratic upper bound
  – sketchable => there is a distribution on $t \times n$ matrices S for which $\|Sx\|_M = \Theta(\|x\|_M)$ with probability 2/3 and t is a slow-growing function of n
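The growth condition can be spot-checked numerically on a grid of points. The sketch below does this for the Huber function with $C_G = 1$ (the constant that convexity together with $G(0) = 0$ provides); the helper `growth_bounds_hold`, the grid, and the tolerance are assumptions, and a grid check is evidence rather than a proof.

```python
def huber(x, c=1.0):
    """Huber measure function with threshold c > 0."""
    ax = abs(x)
    return ax * ax / (2.0 * c) if ax <= c else ax - c / 2.0


def growth_bounds_hold(G, points, C_G=1.0, tol=1e-12):
    """Check |a/a'|^2 >= G(a)/G(a') >= C_G * |a/a'| on all grid pairs
    with |a| >= |a'| > 0, up to a small relative tolerance."""
    for a in points:
        for a2 in points:
            if abs(a) >= abs(a2) > 0:
                ratio = G(a) / G(a2)
                r = abs(a / a2)
                if ratio > r * r * (1 + tol) or ratio < C_G * r * (1 - tol):
                    return False
    return True


grid = [0.05 * k for k in range(1, 201)]   # 0.05, 0.10, ..., 10.0
huber_ok = growth_bounds_hold(huber, grid)
```

A bounded function such as $\min(|x|, 1)$ fails the linear lower bound on the same grid, and $x^4$ fails the quadratic upper bound.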

Our Results

Let nnz(A) denote the number of non-zero entries of an $n \times d$ matrix A

1. [Huber] $O(\mathrm{nnz}(A) \log n) + \mathrm{poly}(d \log n / \varepsilon)$ time algorithm to output x' so that w.h.p.

$\|Ax'-b\|_H \le (1+\varepsilon) \min_x \|Ax-b\|_H$

2. [Nice M-Estimators] $O(\mathrm{nnz}(A)) + \mathrm{poly}(d \log n)$ time algorithm to output x' so that for any constant C > 1, w.h.p.

$\|Ax'-b\|_M \le C \cdot \min_x \|Ax-b\|_M$

Remarks:

- For convex nice M-estimators can solve with convex programming, but slow – poly(nd) time

- Our algorithm for nice M-estimators is universal

Talk Outline

• Huber result

• Nice M-Estimators result

Naive Sampling Algorithm

$\min_x \|Ax - b\|_M$

$x' = \mathrm{argmin}_x \|S \cdot A x - S \cdot b\|_M$

S uniformly samples poly(d/ε) rows – this is a terrible algorithm
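One standard way to see why uniform sampling fails (an illustration added here, not taken from the slides): if a single row carries all the information about x, a uniform sample of s out of n rows misses it with probability $1 - s/n$. The sketch below assumes sampling without replacement; the function name and the instance are hypothetical.

```python
from fractions import Fraction
from math import comb


def miss_probability(n, s):
    """Probability that a uniform sample of s out of n rows (without
    replacement) misses one fixed row: C(n-1, s) / C(n, s) = 1 - s/n."""
    return Fraction(comb(n - 1, s), comb(n, s))


# Hypothetical instance: A has a single nonzero row (A[0] = 1, b[0] = 1) and
# every other row of A and b is zero. If row 0 is not sampled, the sampled
# problem carries no information about x at all.
n, s = 10_000, 100
p_miss = miss_probability(n, s)
```

With these numbers the important row is missed 99% of the time, which is why the sampling probabilities need to depend on the rows themselves.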

Leverage Score Sampling

• For $\ell_p$-norms, there are probabilities $q_1, \ldots, q_n$ with $\sum_i q_i = \mathrm{poly}(d/\varepsilon)$ so that sampling works

$\min_x \|Ax - b\|_M$

$x' = \mathrm{argmin}_x \|S \cdot A x - S \cdot b\|_M$

• All $q_i$ can be found in $O(\mathrm{nnz}(A) \log n) + \mathrm{poly}(d)$ time

• S is diagonal. $S_{i,i} = 1/q_i$ if row i is sampled, 0 otherwise

- For $\ell_2$, the $q_i$ are the squared row norms in an orthonormal basis of A

- For $\ell_p$, the $q_i$ are p-th powers of the p-norms of rows in a “well conditioned basis” [Dasgupta et al.]

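For the $\ell_2$ case, the squared row norms of an orthonormal basis can be computed directly on a small dense example. The sketch below uses Gram-Schmidt and assumes A has full column rank; it does not attempt the fast $O(\mathrm{nnz}(A) \log n)$ computation mentioned above, and the function names are illustrative. The raw scores sum to d, so sampling probabilities that sum to poly(d/ε) would be a scaled-up version of them.

```python
import math


def orthonormal_basis(A):
    """Gram-Schmidt on the columns of A (given as a list of rows).

    Returns U with the same shape as A whose columns are orthonormal and
    span the column space of A. Assumes A has full column rank.
    """
    n, d = len(A), len(A[0])
    cols = [[A[i][j] for i in range(n)] for j in range(d)]
    basis = []
    for v in cols:
        w = v[:]
        for u in basis:
            proj = sum(ui * wi for ui, wi in zip(u, w))
            w = [wi - proj * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        basis.append([wi / norm for wi in w])
    return [[basis[j][i] for j in range(d)] for i in range(n)]


def l2_leverage_scores(A):
    """Squared row norms of an orthonormal basis for the column space of A."""
    U = orthonormal_basis(A)
    return [sum(u * u for u in row) for row in U]


# Made-up 4 x 2 example: the third row is the only one touching column 2.
A = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
scores = l2_leverage_scores(A)
```

In this example the third row gets the largest possible score (it alone determines the second coordinate), the two identical rows share their importance, and the zero row gets score zero.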

Huber Regression Algorithm

• [Huber inequality]: For $z \in \mathbb{R}^n$,

$\Theta(n^{-1/2}) \min\left(\|z\|_1, \|z\|_2^2/(2c)\right) \le \|z\|_H \le \|z\|_1$

• Proof by case analysis

• Sample from a mixture of $\ell_1$-leverage scores and $\ell_2$-leverage scores
  – $p_i = n^{1/2} \cdot (q_i^{(1)} + q_i^{(2)})$

• Our nnz(A) log n + poly(d/ε) algorithm
  – After one step, number of rows < $n^{1/2}\,\mathrm{poly}(d/\varepsilon)$
  – Recursively solve a weighted Huber
  – Weights do not grow quickly
  – Once size is < $n^{0.01}\,\mathrm{poly}(d/\varepsilon)$, solve by convex programming
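The upper bound in the Huber inequality can be verified coordinate by coordinate from the definition of the Huber function ($G(x) = x^2/(2c)$ for $|x| \le c$, and $|x| - c/2$ otherwise). The short derivation below covers only that easy direction; the lower bound needs the longer case analysis and is not reproduced here.

```latex
\begin{align*}
|z_i| \le c &: \quad G(z_i) = \frac{z_i^2}{2c}
    = |z_i| \cdot \frac{|z_i|}{2c} \le \frac{|z_i|}{2} \le |z_i|, \\
|z_i| > c &: \quad G(z_i) = |z_i| - \frac{c}{2} \le |z_i|, \\
\text{hence} \quad \|z\|_H &= \sum_{i=1}^n G(z_i) \le \sum_{i=1}^n |z_i| = \|z\|_1 .
\end{align*}
```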

Talk Outline

• Huber result

• Nice M-Estimators result

CountSketch

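• For $\ell_2$ regression, CountSketch with poly(d) rows works [Clarkson, W]:

• Compute $S \cdot A$ in nnz(A) time

• Compute $x' = \mathrm{argmin}_x \|SAx - Sb\|_2$ in poly(d) time

$$
S = \begin{bmatrix}
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 1 & 0 & -1 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
$$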
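Since every column of a CountSketch matrix has exactly one nonzero entry (a random sign in a random row), S never needs to be stored densely: a bucket and a sign per column suffice, and each nonzero of A is touched once. The following is a minimal sketch under those assumptions; the function names and the sparse row format (`{column: value}` dictionaries) are choices made for illustration.

```python
import random


def make_countsketch(n, t, seed=0):
    """Draw a t x n CountSketch: each of the n columns of S gets one random
    bucket (row index) in range(t) and one random sign in {-1, +1}."""
    rng = random.Random(seed)
    buckets = [rng.randrange(t) for _ in range(n)]
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    return buckets, signs


def apply_countsketch(buckets, signs, t, rows, d):
    """Compute S*A where A is given sparsely as rows[i] = {column: value}.

    Each nonzero of A is touched exactly once, so the main loop takes
    O(nnz(A)) time (plus O(t * d) to allocate the output).
    """
    SA = [[0.0] * d for _ in range(t)]
    for i, row in enumerate(rows):
        target, sign = SA[buckets[i]], signs[i]
        for j, value in row.items():
            target[j] += sign * value
    return SA


# A fixed 4 x 8 CountSketch written as (bucket, sign) per column: column 0
# has +1 in row 1, column 1 has -1 in row 3, and so on.
buckets = [1, 3, 0, 2, 2, 0, 2, 3]
signs = [1, -1, 1, -1, 1, 1, -1, 1]
# Sparse 8 x 2 example matrix A (made-up values).
rows = [{0: 1.0}, {1: 2.0}, {0: 3.0}, {}, {1: 4.0},
        {0: 5.0, 1: 6.0}, {}, {1: 7.0}]
SA = apply_countsketch(buckets, signs, 4, rows, 2)
```

The sketched 4 x 2 matrix `SA` (and the same sketch applied to b) can then be handed to any small least squares solver.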

M-Sketch

$$
T = \begin{bmatrix} S_0 \cdot D_0 \\ S_1 \cdot D_1 \\ S_2 \cdot D_2 \\ \vdots \\ S_{\log n} \cdot D_{\log n} \end{bmatrix}
$$

• $S_i$ are independent CountSketch matrices with poly(d) rows

• $D_i$ is $n \times n$ diagonal and uniformly samples a $1/(d \log n)^i$ fraction of the n rows

- The same M-Sketch works for all nice M-estimators!

$x' = \mathrm{argmin}_x \|TAx - Tb\|_M$

- Sketch used for estimating frequency moments [Indyk, W] and earthmover distance [Verbin, Zhang]

- many analyses of this data structure don’t work since they reduce the problem to a non-convex problem

- we show it works for “lopsided” subspace embeddings
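The stacked structure of T can be sketched as follows: level i keeps each coordinate with probability $1/(d \log n)^i$ and hashes the survivors into t signed buckets. This is a simplified illustration, not the construction from the talk: independent sampling per level, base-2 logarithms, the guard on the sampling base, and the function names are all assumptions, and the level weights used in the cost are not modeled.

```python
import math
import random


def make_m_sketch(n, d, t, seed=0):
    """Draw a simplified M-Sketch for vectors of length n >= 2.

    Returns one list per level i = 0, ..., ceil(log2 n); each list holds
    (coordinate, bucket, sign) triples for the coordinates kept by D_i,
    with the bucket and sign playing the role of the CountSketch S_i.
    """
    rng = random.Random(seed)
    base = max(2.0, d * math.log2(n))        # sampling base d log n (guarded)
    levels = []
    for i in range(math.ceil(math.log2(n)) + 1):
        keep_prob = base ** (-i)             # level 0 keeps every coordinate
        entries = []
        for j in range(n):
            if rng.random() < keep_prob:
                entries.append((j, rng.randrange(t), rng.choice((-1, 1))))
        levels.append(entries)
    return levels


def apply_m_sketch(levels, t, y):
    """Return T*y as a list of blocks, one block of t bucket values per level."""
    out = []
    for entries in levels:
        block = [0.0] * t
        for j, bucket, sign in entries:
            block[bucket] += sign * y[j]
        out.append(block)
    return out
```

Because the random choices are drawn once in `make_m_sketch` and do not depend on the input vector, the same T can be applied to every column of A and to b, and the map is linear.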

M-Sketch Intuition

• Consider a fixed y = Ax-b

• For M-Sketch T, output $\|Ty\|_{w,M} = \sum_i w_i\, G((Ty)_i)$

• [Contraction] $\|Ty\|_{w,M} \ge \tfrac{1}{2} \|y\|_M$ w.pr. $1 - \exp(-d \log n)$

• [Dilation] $\|Ty\|_{w,M} \le 2 \|y\|_M$ w.pr. 9/10

• Contraction allows for a net argument (no scale-invariance!)

• Dilation implies the optimal y* does not dilate much

• Analysis uses “bucket crowding”, “level sets”, and “Ky-Fan Norms”

Conclusions

Summary:

1. [Huber] O(nnz(A) log n) + poly(d log n / ε) time algorithm

2. [Nice M-Estimators] O(nnz(A)) + poly(d) time algorithm

Followup Work / Questions:

1. Results for low rank approximation [Clarkson,W15]

2. (Meta-question) Apply streaming techniques to linear algebra

- CountSketch -> l2-regression

- p-stable random variables -> lp regression for p in [1,2]

- CountSketch + heavy hitters -> nice M-estimators

- Pagh’s TensorSketch -> polynomial kernel regression

…
