
Page 1: Accelerated, Parallel and PROXimal coordinate descent

APPROX

Peter Richtárik
(Joint work with Olivier Fercoq - arXiv:1312.5799)

Moscow, February 2014

Page 2: Optimization Problem

Page 3: Problem

Minimize F(x) = f(x) + ψ(x)

• Loss f: convex (smooth or nonsmooth)
• Regularizer ψ: convex (smooth or nonsmooth), separable; allowed to take the value +∞

Page 4: Regularizer: examples

• No regularizer
• Weighted L1 norm (e.g., LASSO)
• Weighted L2 norm
• Box constraints (e.g., SVM dual)
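Each of these regularizers is proximable in closed form, coordinate by coordinate, which is what lets the method keep ψ intact in the subproblems. A minimal sketch (function names and signatures are mine, not from the talk):

```python
def prox_weighted_l1(x, step, w):
    """Prox of psi(x) = sum_i w[i] * |x[i]|: coordinate-wise soft-thresholding."""
    return [max(abs(xi) - step * wi, 0.0) * (1.0 if xi >= 0 else -1.0)
            for xi, wi in zip(x, w)]

def prox_box(x, lo, hi):
    """Prox of the indicator of the box [lo, hi]^n: coordinate-wise projection."""
    return [min(max(xi, lo), hi) for xi in x]
```

Both operators are separable, so each coordinate's prox can be evaluated independently and in parallel.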

Page 5: Loss: examples

• Quadratic loss
• Logistic loss
• Square hinge loss
• L-infinity regression
• L1 regression
• Exponential loss

References: [BKBG'11], [RT'11b], [TBRS'13], [RT'13a], [FR'13]

Page 6: RANDOMIZED COORDINATE DESCENT IN 2D

Page 7: 2D Optimization

Contours of a function of two variables. Goal: find the minimizer.

Pages 8-15: Randomized Coordinate Descent in 2D

(Animation over the contour plot: starting from an initial point, each iteration minimizes the function along one randomly chosen axis-parallel direction — the compass marks N/S/E/W — producing points 1 through 7, after which the problem is SOLVED.)
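A minimal sketch of the procedure the animation illustrates, on a hypothetical 2D quadratic (my example, not from the talk), using the standard stepsize 1/L_i with L_i the coordinate-wise Lipschitz constant:

```python
import random

def rcd_2d(grad, lipschitz, x, iters, seed=0):
    """Randomized coordinate descent: at each step pick a coordinate i
    uniformly at random and move opposite its partial derivative with
    stepsize 1/L_i. (Calling the full gradient here is for clarity only;
    a real implementation evaluates just the i-th partial derivative.)"""
    rng = random.Random(seed)
    x = list(x)
    for _ in range(iters):
        i = rng.randrange(len(x))
        x[i] -= grad(x)[i] / lipschitz[i]
    return x

# Example: f(x, y) = 2x^2 + y^2, minimizer at the origin.
g = lambda x: [4 * x[0], 2 * x[1]]
x = rcd_2d(g, lipschitz=[4.0, 2.0], x=[3.0, -2.0], iters=50)
```

On this separable example each coordinate is zeroed the first time it is sampled; on correlated quadratics like the one in the slides, the method instead zig-zags toward the minimizer.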

Page 16: CONTRIBUTIONS

Page 17: Variants of Randomized Coordinate Descent Methods

• Block: can operate on "blocks" of coordinates, as opposed to just on individual coordinates
• General: applies to "general" (= smooth convex) functions, as opposed to special ones such as quadratics
• Proximal: admits a nonsmooth regularizer that is kept intact in solving subproblems; the regularizer is not smoothed or approximated
• Parallel: operates on multiple blocks/coordinates in parallel, as opposed to just one block/coordinate at a time
• Accelerated: achieves an O(1/k^2) convergence rate for convex functions, as opposed to O(1/k)
• Efficient: the complexity of one iteration is O(1) per processor on sparse problems, as opposed to O(# coordinates); avoids adding two full vectors

Page 18: Brief History of Randomized Coordinate Descent Methods

(Overview table of prior methods; this work adds new long stepsizes.)

Page 19: APPROX

Page 20: APPROX = "Accelerated", "Parallel", "PROXimal"

Page 21

PCDM (R. & Takáč, 2012) = APPROX if we force

Page 22: APPROX: Smooth Case

Update for coordinate i, driven by the partial derivative of f, with a stepsize we want to be as large as possible.

Page 23: CONVERGENCE RATE

Page 24: Convergence Rate

Theorem [FR'13b], under a key assumption (the ESO, introduced later): with n = # coordinates and τ = average # coordinates updated per iteration, running the stated # of iterations implies the target accuracy.
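The formula these labels annotate did not survive the transcript. From the APPROX paper (arXiv:1312.5799) the guarantee has the accelerated O(1/k²) shape below; this is my reconstruction, so treat the exact constants as indicative only:

```latex
k \;\ge\; \frac{2n}{\tau}\sqrt{\frac{C}{\epsilon}}
\quad\Longrightarrow\quad
\mathbb{E}\big[F(x_k) - F^\star\big] \;\le\; \epsilon,
\qquad
C \;=\; \Big(1-\frac{\tau}{n}\Big)\big(F(x_0)-F^\star\big)
\;+\; \tfrac{1}{2}\,\|x_0 - x^\star\|_v^2 .
```

Setting τ = n recovers the fully parallel special case of the next slide.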

Page 25: Special Case: Fully Parallel Variant

All coordinates are updated in each iteration (τ = n). With normalized weights summing to n, running the stated # of iterations implies the target accuracy.

Page 26: Special Case: Effect of New Stepsizes

With the new stepsizes (discussed later), the bound involves the average degree of separability and an "average" of the Lipschitz constants.

Page 27: “EFFICIENCY” OF APPROX

Page 28: Cost of 1 Iteration of APPROX

Assume N = n (all blocks are of size 1), that the loss is built from scalar functions whose derivatives cost O(1), and that the data matrix A is sparse. Then the average cost of one iteration of APPROX is O(average # of nonzeros in a column of A) arithmetic operations.

Page 29: Bottleneck: Computation of Partial Derivatives

The partial derivatives are computed cheaply from residuals that are maintained across iterations.
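A sketch of the residual-maintenance trick for the concrete case f(x) = ½‖Ax − b‖² (my example; the talk keeps the structure of f abstract): maintain r = Ax − b, so that reading ∂f/∂x_i and updating coordinate i each cost only O(nnz of column i).

```python
# Sparse columns: cols[i] = list of (row, value) pairs of column i of A.
cols = [[(0, 1.0), (2, 2.0)], [(1, 3.0)]]   # a tiny 3x2 example matrix
b = [1.0, 3.0, 2.0]
x = [0.0, 0.0]
r = [-bi for bi in b]                        # residual r = A x - b

def partial_derivative(i):
    """d f / d x_i = (A^T r)_i, touching only the nonzeros of column i."""
    return sum(v * r[row] for row, v in cols[i])

def update_coordinate(i, delta):
    """x_i += delta; keep r = A x - b in sync at cost O(nnz(column i))."""
    x[i] += delta
    for row, v in cols[i]:
        r[row] += delta * v
```

Without the maintained residual, each partial derivative would require a full pass over the data.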

Page 30: PRELIMINARY EXPERIMENTS

Page 31: L1 Regularized L1 Regression

Dorothea dataset. Methods compared: Gradient Method, Nesterov's Accelerated Gradient Method, SPCDM, APPROX.

Page 32: L1 Regularized L1 Regression

Page 33: L1 Regularized Least Squares (LASSO)

KDDB dataset. Methods compared: PCDM, APPROX.

Page 34: Training Linear SVMs

Malicious URL dataset.

Page 35: Choice of Stepsizes: How (Not) to Parallelize Coordinate Descent

Page 36: Convergence of Randomized Coordinate Descent

• Strongly convex F (simple method)
• Smooth or 'simple' nonsmooth F (accelerated method)
• 'Difficult' nonsmooth F (accelerated method), or smooth F (simple method)
• 'Difficult' nonsmooth F (simple method)

Focus on n (big data = big n).

Page 37: Parallelization Dream

Serial vs. parallel: we WANT the individual updates to simply add up. To what extent they can depends on the properties of F and on the way coordinates are chosen at each iteration. What do we actually get?

Page 38: "Naive" Parallelization

Do the same thing as before, but for MORE (or ALL) coordinates, and ADD UP the updates.

Pages 39-43: Failure of Naive Parallelization

(Animation: from point 0, the two individually computed coordinate updates 1a and 1b are summed, overshooting to point 1; the next pair 2a and 2b overshoots back to point 2, and the iterates oscillate without converging. OOPS!)

Page 44: Idea: Averaging Updates May Help

(Animation: from point 0, averaging the updates 1a and 1b instead of summing them lands at the minimizer. SOLVED!)

Pages 45-46: Averaging Can Be Too Conservative

(Animation: on a different problem, the averaged steps 1, 2, ... creep toward the solution, and so on; the step we WANTED is the full summed one. BAD!)

Page 47: What to Do?

With h_i the update to coordinate i and e_i the i-th unit coordinate vector:

• Averaging: x_new = x + (1/τ) Σ_{i∈S} h_i e_i
• Summation: x_new = x + Σ_{i∈S} h_i e_i

Figure out when one can safely use summation.
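A toy numerical illustration (my own example, not from the talk): on the non-separable quadratic f(x) = (x1 + x2)², summing the two exact coordinate updates overshoots and oscillates, while on the separable f(x) = x1² + x2² summation solves the problem in one step and averaging only goes halfway.

```python
def coord_updates(x, grad, lip):
    """Exact coordinate-wise steps h_i = -grad_i / L_i, all computed at the same point x."""
    return [-g / L for g, L in zip(grad(x), lip)]

# Non-separable: f(x) = (x1 + x2)^2, grad = (2(x1+x2), 2(x1+x2)), L_i = 2.
g_ns = lambda x: [2 * (x[0] + x[1])] * 2
h = coord_updates([1.0, 1.0], g_ns, [2.0, 2.0])     # each step alone is fine
summed = [1.0 + h[0], 1.0 + h[1]]                   # overshoots: f is unchanged
averaged = [1.0 + h[0] / 2, 1.0 + h[1] / 2]         # lands at the minimizer

# Separable: f(x) = x1^2 + x2^2, L_i = 2; here summation is exactly right
# and averaging is needlessly conservative (it only goes halfway).
g_s = lambda x: [2 * x[0], 2 * x[1]]
h2 = coord_updates([1.0, 1.0], g_s, [2.0, 2.0])
```

The safe choice thus depends on how strongly the coordinates of f are coupled, which is exactly what the ESO machinery of the next slides quantifies.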

Page 48: ESO: Expected Separable Overapproximation

Page 49: 5 Models for f Admitting Small ESO Parameters

1. Smooth partially separable f [RT'11b]
2. Nonsmooth max-type f [FR'13]
3. f with 'bounded Hessian' [BKBG'11, RT'13a]

Page 50: 5 Models for f Admitting Small ESO Parameters (continued)

4. Partially separable f with smooth components [NC'13]
5. Partially separable f with block smooth components [FR'13b]

Page 51: Randomized Parallel Coordinate Descent Method

At each iteration: draw a random set of coordinates S (a "sampling"); for each i in S, compute the update to the i-th coordinate; the new iterate is the current iterate plus the updates applied along the i-th unit coordinate vectors.

Page 52: ESO: Expected Separable Overapproximation

Definition [RT'11b]: an expected upper bound on f, minimized in h, which
1. is separable in h,
2. can be minimized in parallel,
3. requires computing updates only for the sampled coordinates.

Shorthand:
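The defining inequality behind this slide, reconstructed from the ESO(v) notation of [FR'13b] (the exact parameterization varies across the cited papers, so treat this as indicative):

```latex
\mathbb{E}\Big[ f\big(x + h_{[\hat S]}\big) \Big]
\;\le\;
f(x) + \frac{\mathbb{E}|\hat S|}{n}
\Big( \langle \nabla f(x), h \rangle + \tfrac{1}{2}\,\|h\|_v^2 \Big),
\qquad
\|h\|_v^2 := \sum_{i=1}^n v_i h_i^2,
```

where h_{[Ŝ]} zeroes out the coordinates of h outside the sampled set Ŝ. The right-hand side is separable in the coordinates of h, which is what makes the parallel minimization in step 2 possible.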

Page 53: PART II. ADDITIONAL TOPICS

Page 54: Partial Separability and Doubly Uniform Samplings

Page 55: Serial Uniform Sampling

Probability law: P(S = {i}) = 1/n for each coordinate i.

Page 56: τ-nice Sampling

Probability law: S is a subset of size τ chosen uniformly at random. Good for shared-memory systems.

Page 57: Doubly Uniform Sampling

Probability law: any two sets of the same cardinality are chosen with the same probability. Can model unreliable processors / machines.

Page 58: ESO for Partially Separable Functions and Doubly Uniform Samplings

Theorem [RT'11b], for model 1: smooth partially separable f.

Page 59: PCDM: Theoretical Speedup

With n = # coordinates, τ = # coordinate updates per iteration, and ω = degree of partial separability:

• Nearly separable (sparse) problems: linear or good speedup. Much of big data is here!
• Non-separable (dense) problems: weak or no speedup.
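The boundary in this picture can be made quantitative. A sketch of the speedup factor implied by the τ-nice ESO of [RT'12] (my reconstruction; treat the exact constant as indicative):

```python
def pcdm_speedup(n, tau, omega):
    """Indicative PCDM speedup under a tau-nice sampling:
    tau = coordinate updates per iteration,
    omega = degree of partial separability
    (1 = fully separable problem, n = fully coupled problem)."""
    beta = 1.0 + (tau - 1.0) * (omega - 1.0) / (n - 1.0)
    return tau / beta

# Sparse problem: nearly linear speedup in tau.
s_sparse = pcdm_speedup(n=10**6, tau=100, omega=2)
# Dense problem: essentially no speedup.
s_dense = pcdm_speedup(n=10**6, tau=100, omega=10**6)
```

With ω close to 1 the speedup is nearly linear in τ; with ω close to n it disappears, matching the two regimes on the slide.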

Pages 60-62: Theory vs. Practice

(Speedup plots with n = 1000 coordinates: theoretical speedup, then the speedup observed in practice.)

Page 63: PCDM: Experiment with a 1 Billion-by-2 Billion LASSO Problem

Page 64: Optimization with Big Data = Extreme* Mountain Climbing

* in a billion dimensional space on a foggy day

Pages 65-67: Coordinate Updates / Iterations / Wall Time

(Progress plots for the experiment, measured in coordinate updates, in iterations, and in wall time.)

Page 68: Distributed-Memory Coordinate Descent

Page 69: Distributed τ-nice Sampling

Probability law: the coordinates are partitioned across machines (Machine 1, Machine 2, Machine 3 in the figure), and each machine samples τ of its own coordinates uniformly at random. Good for a distributed version of coordinate descent.

Page 70: ESO: Distributed Setting

Theorem [RT'13b], for model 3: f with 'bounded Hessian' [BKBG'11, RT'13a]. The ESO parameters involve the spectral norm of the data.

Page 71: Bad Partitioning At Most Doubles the # of Iterations

Theorem [RT'13b]: the iteration bound involves the spectral norm of the partitioning, the # of nodes, and the # of updates per node; running the stated # of iterations implies the target accuracy.

Pages 72-73: LASSO with a 3TB Data Matrix

128 Cray XE6 nodes with 4 MPI processes each (c = 512). Each node: 2 x 16 cores with 32GB RAM. (Plot: n = # coordinates.)

Page 74: A ccelerated,  P arallel and  PROX imal  coordinate descent

• Shai Shalev-Shwartz and Ambuj Tewari, Stochastic methods for L1-regularized loss minimization. JMLR 2011.

• Yurii Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.

• [RT’11b] P.R. and Martin Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Prog., 2012.

• Rachael Tappenden, P.R. and Jacek Gondzio, Inexact coordinate descent: complexity and preconditioning, arXiv: 1304.5530, 2013.

• Ion Necoara, Yurii Nesterov, and Francois Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 2012.

• Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Technical report, Microsoft Research, 2013.

References: serial coordinate descent

Page 75: A ccelerated,  P arallel and  PROX imal  coordinate descent

• [BKBG’11] Joseph Bradley, Aapo Kyrola, Danny Bickson and Carlos Guestrin, Parallel Coordinate Descent for L1-Regularized Loss Minimization. ICML 2011

• [RT’12] P.R. and Martin Takáč, Parallel coordinate descen methods for big data optimization. arXiv:1212.0873, 2012

• Martin Takáč, Avleen Bijral, P.R., and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML 2013

• [FR’13a] Olivier Fercoq and P.R., Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv:1309.5885, 2013

• [RT’13a] P.R. and Martin Takáč, Distributed coordinate descent method for big data learning. arXiv:1310.2059, 2013

• [RT’13b] P.R. and Martin Takáč, On optimal probabilities in stochastic coordinate descent methods. arXiv:1310.3438, 2013

References: parallel coordinate descent

Good entry point to the topic (4p paper)

Page 76: References: Parallel Coordinate Descent (continued)

• P.R. and Martin Takáč. Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings, 2012.
• Rachael Tappenden, P.R. and Burak Buke. Separable approximations and decomposition methods for the augmented Lagrangian. arXiv:1308.6774, 2013.
• Indranil Palit and Chandan K. Reddy. Scalable and parallel boosting with MapReduce. IEEE Transactions on Knowledge and Data Engineering, 24(10):1904-1916, 2012.
• [FR'13b] Olivier Fercoq and P.R. Accelerated, parallel and proximal coordinate descent. arXiv:1312.5799, 2013.