
Page 1: Asynchronous Parallel Optimization

Asynchronous Parallel Optimization

Ji Liu

University of Wisconsin-Madison

[email protected]

January 29, 2014


Page 2: Asynchronous Parallel Optimization

Overview

1 Overview of All Projects

2 Tensor Completion / Recovery

3 Asynchronous Parallel Optimization
  Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
  Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)


Page 3: Asynchronous Parallel Optimization

Overview

1 Overview of All Projects

2 Tensor Completion / Recovery

3 Asynchronous Parallel Optimization
  Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
  Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)


Page 4: Asynchronous Parallel Optimization

Overview of All Projects

A Optimization Algorithm and Theory

B Sparse Learning and Compressive Sensing Theory

C Applications in Computer Vision, Multimedia, Video Surveillance, Medical Image Analysis, Biological Data Analysis


Page 5: Asynchronous Parallel Optimization

Optimization Algorithm and Theory

Asynchronous Parallel Optimization [current project, NIPS13 and 3 under review]

Accelerated Randomized Kaczmarz Algorithm [under review]:
    Ax = b

Reinforcement Learning [NIPS12 spotlight]:
    E(A)x = E(b), find a sparse solution

Tensor Completion / Recovery [ICCV09, TPAMI13, cited by 90+]


Page 6: Asynchronous Parallel Optimization

Sparse Learning and Compressive Sensing Theory

Feature selection and cardinality optimization [ICML14]:

    min_x f(x)   s.t.   ‖x‖₀ ≤ s

Dictionary LASSO [ICML13]:

    min_x (1/2)‖Ax − b‖² + λ‖Dx‖₁

Multi-stage LASSO and Dantzig selector [NIPS10, JMLR12]

Robust dequantized compressive sensing [ACHA13]


Page 7: Asynchronous Parallel Optimization

Applications in Computer Vision, Multimedia, Video Surveillance, Medical Image Analysis, Biological Data Analysis

Abnormal event detection and dictionary learning [CVPR11, PR12]

Multi-task learning [SIGKDD10 best paper finalist, TKDD12]

Large scale spectral clustering [under review]

Online scene classification [TIP13]


Page 8: Asynchronous Parallel Optimization

Applications in Computer Vision, Multimedia, Video Surveillance, Medical Image Analysis, Biological Data Analysis

Sparse tensor decomposition for Drosophila data analysis [PR12]


Page 9: Asynchronous Parallel Optimization

Main Collaborators

Stephen Wright (UW-Madison, Ph.D. advisor)

Jieping Ye, Peter Wonka (ASU, Master’s advisors)

Christopher Re [Stanford]

Vikas Singh, Victor Bittorf, Srikrishna Sridhar [UW-Madison]

Bo Liu, Sridhar Mahadevan [UMass]

Yang Cong [Chinese Academy of Sciences]

Junsong Yuan [NUS]

Fujimaki Ryohei [NEC lab]

Jiebo Luo [Rochester]

Jianhui Chen [GE lab]

Jun Liu [SAS]

Lei Yuan [DOW lab]

Przemyslaw Musialski [Vienna University of Technology]


Page 10: Asynchronous Parallel Optimization

Overview

1 Overview of All Projects

2 Tensor Completion / Recovery

3 Asynchronous Parallel Optimization
  Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
  Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)


Page 11: Asynchronous Parallel Optimization

Tensor Completion / Recovery

Problem: how can we estimate the missing values in low-rank tensor data?

Examples: Netflix challenge, recommendation problems, image / video in-painting.

Figure: The left image contains 80% missing entries, shown as white pixels; the right image shows its reconstruction using the low-rank approximation.


Page 12: Asynchronous Parallel Optimization

Our Formulation

We propose the following formulation:

    min_X ‖X‖*   s.t.   X_Ω = T_Ω

where X and T are n-dimensional tensors, Ω denotes the set of observed entries, and the tensor nuclear norm is defined by

    ‖X‖* := ∑_{i=1}^n α_i ‖unfold_(i)(X)‖*,   where ∑_i α_i = 1, α_i ≥ 0.

The tensor nuclear norm is used to capture the low-rank structure in a tensor.
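As a concrete illustration (not from the original implementation), the weighted tensor nuclear norm above can be evaluated with a few lines of numpy; here unfold_(i) is taken as the standard mode-i matricization and the weights α_i are assumed equal:

```python
import numpy as np

def unfold(X, mode):
    """Mode-`mode` matricization: move that axis to the front and flatten the rest."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def tensor_nuclear_norm(X, alphas=None):
    """Weighted sum of nuclear norms of the mode-i unfoldings (weights sum to 1)."""
    n_modes = X.ndim
    if alphas is None:
        alphas = np.full(n_modes, 1.0 / n_modes)   # assumption: equal weights alpha_i = 1/n
    return sum(a * np.linalg.norm(unfold(X, i), ord='nuc')
               for i, a in enumerate(alphas))

# A rank-1 tensor typically gives a smaller value than a random tensor of similar scale.
rng = np.random.default_rng(0)
T = np.einsum('i,j,k->ijk', rng.standard_normal(5), rng.standard_normal(6), rng.standard_normal(7))
print(tensor_nuclear_norm(T), tensor_nuclear_norm(rng.standard_normal((5, 6, 7))))
```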


Page 13: Asynchronous Parallel Optimization

Algorithms

Main advantages of this model:

Convex ⇒ robust.

Potential strong theoretical guarantees from the study of the matrix case ⇒ very limited observations can recover the whole tensor data.

Main difficulty in optimization:

Multiple nonsmooth terms ⇒ general gradient methods do not work.

Two algorithms are proposed:

FaLRTC: smoothing scheme + Nesterov's accelerated scheme (convergence rate: O(1/K))

HaLRTC: Alternating Direction Method of Multipliers (ADMM)


Page 14: Asynchronous Parallel Optimization

Contributions and Video Show

Contributions

This is the first work to define a convex regularization that captures the low-rank structure of tensors. It is a nontrivial extension from matrices to tensors;

This pioneering work has become the benchmark in tensor completion / recovery.

Video: http://www.youtube.com/watch?v=kbnmXM3uZFA


Page 15: Asynchronous Parallel Optimization

Overview

1 Overview of All Projects

2 Tensor Completion / Recovery

3 Asynchronous Parallel Optimization
  Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
  Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)


Page 16: Asynchronous Parallel Optimization

Trend of Optimization Algorithms in Machine Learning

                      before 2000         2000 ∼ 2010          2010 ∼ now
methodologies         2nd order meth.     1st order meth.      1/2 order meth.
representative        Newton's method,    gradient descent,    stochastic gradient,
algorithms            SDP                 Nesterov's meth.     coordinate descent
convergence speed     fast                medium               slow / medium
memory cost           high                medium               low
computation cost      high (∇²f, ∇f)      medium (∇f)          low (∇f)
problem size          small               medium               big


Page 17: Asynchronous Parallel Optimization

Asynchronous Parallel Optimization

Figure: How does the asynchronous parallel procedure work? (Several cores, each behind its own cache, share the variable "X" in RAM; each core repeatedly reads "X", computes a gradient at "X", and updates "X" in RAM.)

All processors / cores share the same memory, which stores the current variable x;

All processors / cores run the same optimization algorithm independently;

All processors / cores update the coordinates of x concurrently, without any software locking.


Page 18: Asynchronous Parallel Optimization

Contributions

Application

This procedure totally avoids synchronization cost and even allows the data to be distributed across different locations.
Near-linear speedup over a single processor / core is achieved in many practical situations.
Successes have been demonstrated in solving sparse linear equations and in our LP solver.

Theory: Build up the theoretical foundations by providing

Convergence rates (consistent with the single-core algorithm);
Upper bounds on the number of processors / cores for which near-linear speedup can be guaranteed;
An intuitive measure of the parallelizability of optimization problems.


Page 19: Asynchronous Parallel Optimization

Overview

1 Overview of All Projects

2 Tensor Completion / Recovery

3 Asynchronous Parallel Optimization
  Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
  Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)


Page 20: Asynchronous Parallel Optimization

Asynchronous Parallel Stochastic Proximal CoordinateDescent Algorithm (AsySCD)

    min_x F(x) := f(x) + g(x)                                   (1)

where f(·): Rⁿ → R is convex and differentiable and g(·): Rⁿ → R ∪ {+∞} is a proper closed convex extended-real-valued function. We assume that g(x) is separable, that is, g(x) = ∑_{i=1}^n g_i((x)_i) where g_i(·): R → R ∪ {+∞}. Several examples for g(x) (minimal prox sketches follow the list):

Unconstrained case: g(x) = constant.

Box constraint case: g(x) = ∑_{i=1}^n 1_{[a_i, b_i]}((x)_i), where 1_{[a_i, b_i]}(·) is an indicator function.

ℓ_p norm regularization: g(x) = ‖x‖_p^p, where p ≥ 1.
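For illustration (a sketch; the steplength α and regularization weight λ are called alpha and lam below), the proximal operators that these three choices of g induce in the coordinate update are the identity, a projection onto the box (clipping), and, for p = 1, soft-thresholding:

```python
import numpy as np

def prox_constant(z, alpha):
    """g = constant: the prox step is the identity (a plain gradient step)."""
    return z

def prox_box(z, alpha, lo, hi):
    """g = indicator of [lo, hi] per coordinate: the prox is projection, i.e. clipping."""
    return np.clip(z, lo, hi)

def prox_l1(z, alpha, lam):
    """g = lam * |x| per coordinate (the p = 1 case): the prox is soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)
```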


Page 21: Asynchronous Parallel Optimization

Examples

least squares:  min_x (1/2)‖Ax − b‖²;

LASSO:  min_x (1/2)‖Ax − b‖² + λ‖x‖₁;

support vector machine (SVM) with squared hinge loss:

    min_{w,b}  C ∑_i max{1 − y_i(x_iᵀ w − b), 0}² + (1/2)‖w‖²

support vector machine (dual form with bias term):

    min_α  (1/2) ∑_{i,j} α_i α_j y_i y_j K(x_i, x_j) − ∑_i α_i
    s.t.   0 ≤ α ≤ C


Page 22: Asynchronous Parallel Optimization

Examples (continued)

logistic regression with ℓ_p norm regularization (p = 1, 2):

    min_w  (1/n) ∑_i log(1 + exp(−y_i x_iᵀ w)) + λ‖w‖_p^p

semi-supervised learning (Tikhonov regularization):

    min_f  ∑_{i ∈ labeled data} (f_i − y_i)² + λ fᵀ L f

where L is the Laplacian matrix.

relaxed LP problem:

    min_{x≥0} cᵀx  s.t.  Ax = b    ⇒    min_{x≥0} cᵀx + λ‖Ax − b‖²


Page 23: Asynchronous Parallel Optimization

Stochastic Proximal Coordinate Descent (SCD)

Stochastic Coordinate Descent Algorithm (SCD): choose an index i ∈ {1, 2, ..., n} uniformly at random at iteration j, then update

    x_{j+1} = P_{αg_i(·)}( x_j − α e_i ∇_i f(x_j) )
            := arg min_{x∈Rⁿ}  (1/2)‖x − (x_j − α e_i ∇_i f(x_j))‖² + α g_i((x)_i)
             = arg min_{x∈Rⁿ}  f(x_j) + ⟨∇_i f(x_j), (x − x_j)_i⟩ + (1/(2α))‖x − x_j‖² + g_i((x)_i)

Note that only a single element of x_{j+1} is updated:

    (x_{j+1})_l = arg min_{y∈R} (1/2)‖y − ((x_j)_l − α ∇_l f(x_j))‖² + α g_l(y),   if l = i,
    (x_{j+1})_l = (x_j)_l,   otherwise.

Our Target: Asynchronously parallelize SCD.
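Before parallelizing, here is a minimal serial SCD sketch for the LASSO example above, assuming f(x) = (1/2)‖Ax − b‖² and g(x) = λ‖x‖₁, so the prox step is soft-thresholding; the names (scd_lasso, alpha for the steplength α) are illustrative:

```python
import numpy as np

def scd_lasso(A, b, lam, alpha, n_iters, seed=0):
    """Serial stochastic proximal coordinate descent for (1/2)||Ax - b||^2 + lam*||x||_1."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    r = A @ x - b                       # residual Ax - b, maintained incrementally
    for _ in range(n_iters):
        i = rng.integers(n)             # coordinate chosen uniformly at random
        g_i = A[:, i] @ r               # i-th partial derivative of f
        z = x[i] - alpha * g_i
        new_xi = np.sign(z) * max(abs(z) - alpha * lam, 0.0)   # prox of alpha*lam*|.|
        r += A[:, i] * (new_xi - x[i])  # only one coordinate of x changes
        x[i] = new_xi
    return x
```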


Page 24: Asynchronous Parallel Optimization

Notation

L_max: component Lipschitz constant ("max diagonal of the Hessian"):

    ‖∇f(x + t e_i) − ∇f(x)‖_∞ ≤ L_max |t|    ∀ x ∈ Rⁿ, t ∈ R, i;

L_res: restricted Lipschitz constant ("max row norm of the Hessian"):

    ‖∇f(x + t e_i) − ∇f(x)‖ ≤ L_res |t|    ∀ x ∈ Rⁿ, t ∈ R, i;

Λ := L_res / L_max;

S: the solution set of (1).
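For the quadratic objectives used in the experiments later, both constants can be read off the Hessian directly; a minimal sketch assuming f(x) = (1/2)‖Ax − b‖², so that ∇²f = AᵀA:

```python
import numpy as np

def lipschitz_constants(A):
    """L_max and L_res for f(x) = 0.5*||Ax - b||^2, whose Hessian is H = A^T A."""
    H = A.T @ A
    L_max = np.max(np.diag(H))                  # "max diagonal of Hessian"; for a PSD matrix the
                                                # largest-magnitude entry lies on the diagonal
    L_res = np.max(np.linalg.norm(H, axis=0))   # largest column (= row) 2-norm of the Hessian
    return L_max, L_res

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 500))
A /= np.linalg.norm(A, axis=0)                  # normalize columns, as in the experiments
L_max, L_res = lipschitz_constants(A)
print(L_max, L_res, L_res / L_max)              # the last value is the ratio Lambda
```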


Page 25: Asynchronous Parallel Optimization

Notation: Optimally Strongly Convex (OSC)

Optimal strong convexity with parameter µ > 0:

    F(x) − F(P_S(x)) ≥ (µ/2) ‖x − P_S(x)‖²

for all x ∈ dom F. This is weaker than the usual strong convexity.


Page 26: Asynchronous Parallel Optimization

Examples of OSC (but not strongly convex) functions

F(x) = (1/2)‖Ax − b‖², where A can be any matrix;

F(x) = (1/2)‖Ax − b‖² s.t. l ≤ x ≤ h, where A can be any matrix;

Squared hinge loss: F(x) = ∑_k max{a_kᵀ x − b_k, 0}²;

More generally,

    min_x f(Ax)   s.t.   x ∈ Ω,

where f(·) is strongly convex and Ω defines a polyhedron.


Page 27: Asynchronous Parallel Optimization

Asynchronous SPCD Algorithm (AsySCD) withInconsistent Read

Asynchronous parallelization in Hogwild! [Niu, Recht, Re, and Wright, 2011].

All processors share the same memory storing the current x ;

All processors run the same algorithm (the SCD algorithm, in the case of AsySCD);

All processors update the values of x simultaneously, with no software locking.

We use the same setup for AsySCD.


Page 28: Asynchronous Parallel Optimization

Asynchronous SPCD Algorithm (AsySCD) withInconsistent Read

At iteration j:

Choose i with equal probability from {1, 2, ..., n};

Update the i-th component:

    x_{j+1} = P_{(γ/L_max) g_i(·)}( x_j − (γ/L_max) ∇_i f(x̂_j) e_i ),

where x̂_j (defined on a later slide via the index set K(j)) is the state of x that the updating core read, and γ is a constant steplength.

Each core runs this process concurrently and asynchronously.

x̂_j may not be any real state of x in the shared memory; this is called "inconsistent read". Hogwild! [Niu et al., 2011] assumes that x̂_j is some earlier iterate (a real state) of x, for simpler analysis; this is called "consistent read". (The "consistent read" model is just a special case of the "inconsistent read" model.)
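A minimal lock-free sketch of this loop with Python threads on a shared array, for the unconstrained case g ≡ 0 and f(x) = (1/2)‖Ax − b‖² (the names asyscd_quadratic, gamma, and iters_per_thread are illustrative; because of CPython's GIL this shows the access pattern rather than real speedup):

```python
import threading
import numpy as np

def asyscd_quadratic(A, b, gamma, iters_per_thread, n_threads=4):
    """Lock-free asynchronous coordinate descent sketch for f(x) = 0.5*||Ax - b||^2, g = 0.

    All threads share x and write single coordinates without locking, so a thread
    may compute its gradient from an inconsistently read x.
    """
    H = A.T @ A
    Atb = A.T @ b
    L_max = np.max(np.diag(H))
    n = A.shape[1]
    x = np.zeros(n)                               # the shared iterate

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(iters_per_thread):
            i = rng.integers(n)
            grad_i = H[i] @ x - Atb[i]            # read of shared x (possibly stale)
            x[i] -= (gamma / L_max) * grad_i      # single-coordinate write, no lock

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x

# Example usage on a small random least-squares problem.
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 500))
b = rng.standard_normal(2000)
x = asyscd_quadratic(A, b, gamma=0.5, iters_per_thread=20000)
print(np.linalg.norm(A.T @ (A @ x - b)))          # gradient norm, should be small
```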


Page 29: Asynchronous Parallel Optimization

When does Inconsistent Read happen?


Page 30: Asynchronous Parallel Optimization

Inconsistent Read

Mathematically, the relationship between x̂_j and x_j can be expressed as

    x_j = x̂_j + ∑_{d ∈ K(j)} (x_{d+1} − x_d),

where K(j) is a set of iterate indices. Intuitively, x̂_j has missed a few of the updates that are already contained in x_j.

Here we take τ to be an upper bound on the ages of all elements of K(j), for all j: τ ≥ j − min{d | d ∈ K(j)}.


Page 31: Asynchronous Parallel Optimization

Key to Analysis

    x_{j+1} = P_{(γ/L_max) g_i(·)}( x_j − (γ/L_max) ∇_i f(x̂_j) e_i )

Choose some ρ > 1 and pick γ small enough to ensure that

    E(‖x_j − x_{j−1}‖²) ≤ ρ E(‖x_{j+1} − x_j‖²)    ("ρ-condition").

Not too much change in the gradient over each iteration, so not too much price to pay for using inexact information in the asynchronous setting.

We can choose γ small enough to satisfy this property but large enough to get a better convergence rate.


Page 32: Asynchronous Parallel Optimization

OSC: Linear Rate

Theorem

For any ρ > 1 + 4/√n, define

    θ := (ρ^{(τ+1)/2} − ρ^{1/2}) / (ρ^{1/2} − 1),    θ′ := (ρ^{τ+1} − ρ) / (ρ − 1),

    ψ := 1 + τθ′/n + Λθ/√n,

and choose

    γ ≤ min{ 1/ψ,  (√n(1 − ρ^{−1}) − 4) / (4(1 + θ)Λ) }.

Then the "ρ-condition" is satisfied and, for any j ≥ 0,

    E‖x_j − P_S(x_j)‖² + (2γ/L_max)(E F(x_j) − F*)
        ≤ ( 1 − µ / (n(µ + γ^{−1} L_max)) )^j ( ‖x_0 − P_S(x_0)‖² + (2γ/L_max)(F(x_0) − F*) ).


Page 33: Asynchronous Parallel Optimization

A Particular Choice

Corollary

Consider the regime in which

    4eΛ(τ + 1)² ≤ √n,

and define ρ = (1 + 4eΛ(τ + 1)/√n)². Then we can choose γ = 1/2, and the rate simplifies to

    E(F(x_j) − F*) ≤ ( 1 − µ / (n(µ + 2L_max)) )^j ( L_max‖x_0 − P_S(x_0)‖² + F(x_0) − F* ).


Page 34: Asynchronous Parallel Optimization

Comparison

Convergence rate for AsySCD, choosing the number of processors on the order of n^{1/4} L_max/L_res:

    E(F(x_j) − F*) ≤ ( 1 − c µ n^{1/4} L_max / (n(µ + 2L_max) L_res) )^j (F(x_0) − F*)
                   ≈ ( 1 − c µ / (n^{3/4} L_res) )^j (F(x_0) − F*)

where c is a constant. Recall the convergence rate of proximal gradient descent¹:

    F(x_j) − F* ≤ ( 1 − µ/(nL) )^j (F(x_0) − F*)

where L is the Lipschitz constant of ∇f(x).

    L/√n ≤ L_res ≤ L    ⇒    n^{3/4} L_res ≤ nL.

¹ Compensated by the complexity factor n.

Page 35: Asynchronous Parallel Optimization

Weakly Convex: Sublinear Rate

Define ψ and γ as above; then

    E(F(x_j) − F*) ≤ n( L_max‖x_0 − P_S(x_0)‖² + 2γ(F(x_0) − F*) ) / ( 2γ(j + n) ).

This is roughly "1/j" behavior.

Assuming 4eΛ(τ + 1)² ≤ √n and setting ρ and γ as above, we have

    E(F(x_j) − F*) ≤ n( L_max‖x_0 − P_S(x_0)‖² + F(x_0) − F* ) / ( j + n ).


Page 36: Asynchronous Parallel Optimization

Diagonalicity of Hessian

    τ + 1 ≤ n^{1/4} / √(4eΛ)

A smaller Λ = L_res/L_max is beneficial to parallelization.

The ratio L_res/L_max is particularly important: it measures the degree of diagonal dominance in the Hessian ∇²f(x) (Diagonalicity).

By convexity, we have

    1 ≤ L_res/L_max ≤ √n.

The ratio is closer to 1 if the Hessian is nearly diagonally dominant (eigenvectors close to the principal coordinate axes), and closer to √n otherwise.

If A is an m × n Gaussian random matrix and f(x) = (1/2)‖Ax − b‖², the ratio is bounded by 1 + O(√(n/m)) with high probability.

This allows τ ≈ O(n^{1/4}).


Page 37: Asynchronous Parallel Optimization

Constrained: 4-socket, 40-core Intel Xeon

    min_{x≥0} (x − z)ᵀ(AᵀA + 0.5 I)(x − z),

where A ∈ R^{m×n} is a Gaussian random matrix (m = 6000, n = 20000, columns normalized to 1) and z is a Gaussian random vector. L_res/L_max ≈ 2.2. Choose γ = 1.

Figure: Synthetic constrained QP (n = 20000, p = 10): residual vs. number of epochs for 1, 10, 20, 30, and 40 threads (left), and speedup vs. number of threads for AsySCD-DW and global locking against the ideal speedup (right).


Page 38: Asynchronous Parallel Optimization

Unconstrained: 4-socket, 40-core Intel Xeon

    min_x ‖Ax − b‖² + 0.5‖x‖²

where A ∈ R^{m×n} is a Gaussian random matrix (m = 6000, n = 20000, data size ≈ 3GB, columns normalized to 1). L_res/L_max ≈ 2.2. Choose γ = 1. 3-4 seconds to achieve the accuracy 10⁻⁵ on 40 cores.

Figure: Synthetic unconstrained QP (n = 20000, p = 10): residual vs. number of epochs for 1, 10, 20, 30, and 40 threads (left), and speedup vs. number of threads for AsySCD-DW and global locking against the ideal speedup (right).


Page 39: Asynchronous Parallel Optimization

Experiments: 1-socket, 10-core Intel Xeon

    min_x (1/2)‖Ax − b‖² + λ‖x‖₁

where A ∈ R^{m×n} is a Gaussian random matrix (m = 6000, n = 10000, data size ≈ 750MB), b = A ∗ sprandn(n, 1, 10) + 0.01 ∗ randn(m, 1), and λ = 0.2√(m log(n)). L_res/L_max ≈ 2.2. Choose γ = 1.

Figure: m = 6000, n = 10000, sparsity = 10, σ = 0.01: objective vs. number of epochs for 1, 2, 4, 8, and 10 threads (left), and speedup vs. number of threads for AsySPCD against the ideal speedup (right).


Page 40: Asynchronous Parallel Optimization

Experiments: 1-socket, 10-core Intel Xeon

    min_x (1/2)‖Ax − b‖² + λ‖x‖₁,

where A ∈ R^{m×n} is a Gaussian random matrix (m = 12000, n = 20000, data size ≈ 3GB), b = A ∗ sprandn(n, 1, 20) + 0.01 ∗ randn(m, 1), and λ = 0.2√(m log(n)). L_res/L_max ≈ 2.2. Choose γ = 1.

Figure: m = 12000, n = 20000, sparsity = 20, σ = 0.01: objective vs. number of epochs for 1, 2, 4, 8, and 10 threads (left), and speedup vs. number of threads for AsySPCD against the ideal speedup (right).


Page 41: Asynchronous Parallel Optimization

AsySCD vs. SynGD

#cores    Time (sec), SynGD / AsySCD    Speedup, SynGD / AsySCD
1         121.  / 27.1                  0.22 / 1.00
10        11.4  / 2.57                  2.38 / 10.5
20        6.00  / 1.36                  4.51 / 19.9
30        4.44  / 1.01                  6.10 / 26.8
40        3.91  / 0.88                  6.93 / 30.8

Table: Efficiency comparison between SynGD and AsySCD for the QP problem. The running time and speedup are calculated based on reaching a residual of 10⁻⁵.


Page 42: Asynchronous Parallel Optimization

Vertex Cover Problem

The vertex cover problem for an undirected graph with edge set E and vertex set V can be written as a binary linear program:

    min_{y∈{0,1}^{|V|}} ∑_{v∈V} y_v    s.t.    y_u + y_v ≥ 1,  ∀ (u, v) ∈ E.

By relaxing each binary constraint to the interval [0, 1] and introducing slack variables for the cover inequalities, we obtain a problem of the form

    min_{y_v∈[0,1], s_uv∈[0,1]} ∑_{v∈V} y_v    s.t.    y_u + y_v − s_uv = 1,  ∀ (u, v) ∈ E.

This has the form

    min_{x∈[0,1]^n} cᵀx  s.t.  Ax = b    ⇒    min_{x∈[0,1]^n} cᵀx + (β/2)‖Ax − b‖²

for n = |V| + |E|. The test problem is a regularized quadratic penalty reformulation of this linear program.
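To make the reduction concrete, a small sketch (the function name and the tiny example are illustrative, not from the paper) that builds c, A, and b of the penalized problem from an edge list, with β left as a free penalty weight:

```python
import numpy as np
import scipy.sparse as sp

def vertex_cover_qp_data(edges, n_vertices):
    """Build c, A, b for min_{x in [0,1]^n} c^T x + (beta/2)*||Ax - b||^2, n = |V| + |E|.

    Variables are x = (y_1, ..., y_|V|, s_1, ..., s_|E|); each edge (u, v) gives one
    row of A encoding y_u + y_v - s_uv = 1.
    """
    n_edges = len(edges)
    n = n_vertices + n_edges
    c = np.concatenate([np.ones(n_vertices), np.zeros(n_edges)])   # objective: sum of the y_v
    rows, cols, vals = [], [], []
    for k, (u, v) in enumerate(edges):
        rows += [k, k, k]
        cols += [u, v, n_vertices + k]
        vals += [1.0, 1.0, -1.0]
    A = sp.csr_matrix((vals, (rows, cols)), shape=(n_edges, n))
    b = np.ones(n_edges)
    return c, A, b

# Tiny example: a triangle graph (any vertex cover needs at least two vertices).
c, A, b = vertex_cover_qp_data([(0, 1), (1, 2), (0, 2)], n_vertices=3)
```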


Page 43: Asynchronous Parallel Optimization

Vertex Cover (Amazon): 4-socket, 40-core Intel Xeon

Figure: Amazon graph (n = 561050, p = 10): residual vs. number of epochs for 1, 10, 20, 30, and 40 threads (left), and speedup vs. number of threads for AsySCD-DW and global locking against the ideal speedup (right).


Page 44: Asynchronous Parallel Optimization

Vertex Cover (DBLP): 4-socket, 40-core Intel Xeon

Figure: DBLP graph (n = 520891, p = 10): residual vs. number of epochs for 1, 10, 20, 30, and 40 threads (left), and speedup vs. number of threads for AsySCD-DW and global locking against the ideal speedup (right).


Page 45: Asynchronous Parallel Optimization

Running Time

Problem    1 core    40 cores
QP         98.4      3.03
QPc        59.7      1.82
Amazon     17.1      1.25
DBLP       11.5      0.91

Table: Runtimes (s) for the four test problems on 1 and 40 cores.


Page 46: Asynchronous Parallel Optimization

Overview

1 Overview of All Projects

2 Tensor Completion / Recovery

3 Asynchronous Parallel Optimization
  Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
  Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)


Page 47: Asynchronous Parallel Optimization

Asynchronous Parallel Randomized Kaczmarz Algorithm(AsyRK)

Consider the linear equations Ax = b, where the system is feasible and the matrix A ∈ R^{m×n} is not necessarily square or full rank. Write

    A = [a_1ᵀ; a_2ᵀ; ...; a_mᵀ],    where ‖a_i‖₂ = 1, i = 1, 2, ..., m.

For an infeasible system, one can minimize ‖Ax − b‖², which is equivalent to solving the feasible linear system AᵀAx = Aᵀb, or an extended version

    Ax = y,   Aᵀy = Aᵀb.

Randomized Kaczmarz (RK) Algorithm:

Select a row index i ∈ {1, 2, ..., m} uniformly at random;

Update

    x_{j+1} = x_j − (a_iᵀ x_j − b_i) a_i.
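A minimal numpy sketch of the serial RK iteration above, assuming a consistent system with unit-norm rows and uniform row sampling:

```python
import numpy as np

def randomized_kaczmarz(A, b, n_iters, seed=0):
    """Randomized Kaczmarz for a feasible system Ax = b with unit-norm rows."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(n_iters):
        i = rng.integers(m)                  # row chosen uniformly at random
        x -= (A[i] @ x - b[i]) * A[i]        # project x onto the hyperplane a_i^T x = b_i
    return x

# Consistent test system with rows normalized to unit norm.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 50))
A /= np.linalg.norm(A, axis=1, keepdims=True)
x_true = rng.standard_normal(50)
b = A @ x_true
print(np.linalg.norm(randomized_kaczmarz(A, b, 20000) - x_true))   # close to zero
```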


Page 48: Asynchronous Parallel Optimization

AsyRK Algorithm

At each iteration j:

Select i from {1, 2, ..., m} with equal probability;

Select t from the support of a_i with equal probability;

Update:

    x_{j+1} = x_j − γ ‖a_i‖₀ (a_iᵀ x_{k(j)} − b_i) (a_i)_t e_t

where

k(j) is some iterate prior to j, but no more than τ cycles old: j − k(j) ≤ τ;

γ is a constant steplength.

As in Hogwild!, different cores run this process concurrently, updating an x accessible to all.
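The coordinate update is easy to simulate; the sketch below runs it serially with k(j) = j (no staleness), whereas in the actual algorithm several cores would execute this loop concurrently on a shared x, as in the AsySCD sketch earlier. The name asyrk_serial_sketch is illustrative:

```python
import numpy as np

def asyrk_serial_sketch(A, b, gamma, n_iters, seed=0):
    """Serial simulation of the AsyRK update (delay-free case, k(j) = j)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    supports = [np.flatnonzero(A[i]) for i in range(m)]    # support of each row a_i
    for _ in range(n_iters):
        i = rng.integers(m)                                # row index, uniform
        supp = supports[i]
        t = supp[rng.integers(len(supp))]                  # coordinate from supp(a_i), uniform
        resid_i = A[i] @ x - b[i]                          # a_i^T x - b_i at the current x
        x[t] -= gamma * len(supp) * resid_i * A[i, t]      # scaled single-coordinate correction
    return x
```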


Page 49: Asynchronous Parallel Optimization

AsyRK Analysis: Linear Convergence

Theorem

Choose any ρ > 1 and define γ via the following:

    ψ = χ + 2λ_max τ ρ^τ / m,

    γ ≤ min{ 1/ψ,  m(ρ − 1) / (2λ_max ρ^{τ+1}),  m√(ρ − 1) / ( ρ^τ (mα² + λ_max² τ ρ^τ) ) }.

Then we have a certain "ρ-condition" and a linear convergence rate:

    E(‖x_j − P(x_j)‖²) ≤ ( 1 − (λ_min γ / (mχ)) (2 − γψ) )^j ‖x_0 − P(x_0)‖²,

where χ := max_{i=1,...,m} ‖a_i‖₀.

A particular choice of ρ leads to simplified results, in a reasonable regime.


Page 50: Asynchronous Parallel Optimization

A Particular Choice

Corollary

Assume

    τ + 1 ≤ m / (2eλ_max)

and set ρ = 1 + 2eλ_max/m. One can show that γ = 1/ψ in this case, so the expected convergence is

    E(‖x_j − P(x_j)‖²) ≤ ( 1 − λ_min / (m(χ + 1)) )^j ‖x_0 − P(x_0)‖².

The method converges to precision ε with probability at least 1 − η in

    K ≥ (m(χ + 1)/λ_min) | log( ‖x_0 − P(x_0)‖² / (ηε) ) |   iterations.

In the regime 2eλ_max(τ + 1) ≤ m considered here, the delay τ does not really interfere with the convergence rate.


Page 51: Asynchronous Parallel Optimization

Discussion

For random matrices A with unit rows, we have λ_max ≈ 1 + O(m/n) with high probability.

The conditions on τ are less strict than for the SCD algorithms. For random matrices A with m and n of the same order, we have τ = O(m) = O(n).

(Recall τ = O(n^{1/4}) for AsySCD in the constrained case and τ = O(n^{1/2}) for AsySCD in the unconstrained case.)


Page 52: Asynchronous Parallel Optimization

Comparison

algorithms              RK                    AsySCD                                      AsyRK
# oper. per iter.       O(δn)                 min{O(δ²mn), O(n)}                          O(δn)
rate (iteration)        1 − λ_min/m           1 − λ_min/(2nL_max)                         1 − λ_min/(m(χ+1))
# processors            1                     O(√n L_max/L_res)                           O(m/L_max)
rate (running time)     1 − O(λ_min/(δmn))    1 − O(λ_min/(n^{1.5} L_res min{δ²m, 1}))    1 − O(λ_min/(δ²n²L_max))

AsyRK favors sparse data A (here δ denotes the fraction of nonzero entries in A).


Page 53: Asynchronous Parallel Optimization

Experiment

[Contributed by my wife]

Sparse Gaussian random matrix A ∈ R^{m×n} with m = 100000 and n = 80000; the sparsity ratio is 0.1%.
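For reference, a small sketch (an assumed construction, not the authors' script) that generates a sparse Gaussian test system of this kind, with rows normalized to unit norm as the algorithm requires:

```python
import numpy as np
import scipy.sparse as sp

def sparse_kaczmarz_test_problem(m, n, density, seed=0):
    """Sparse Gaussian test system Ax = b with unit-norm rows."""
    rng = np.random.default_rng(seed)
    A = sp.random(m, n, density=density, format='csr',
                  data_rvs=rng.standard_normal, random_state=seed)
    row_norms = np.sqrt(np.asarray(A.multiply(A).sum(axis=1)).ravel())
    row_norms[row_norms == 0] = 1.0             # guard against all-zero rows
    A = sp.diags(1.0 / row_norms) @ A           # normalize each row to unit 2-norm
    x_true = rng.standard_normal(n)
    b = A @ x_true                              # consistent right-hand side
    return A, b, x_true

# For example: A, b, x_true = sparse_kaczmarz_test_problem(100000, 80000, 0.001)
```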

Figure: m = 80000, n = 100000, sparsity = 0.001: residual vs. number of epochs for 1, 2, 4, 8, and 10 threads (left), and speedup vs. number of threads for AsyRK against the ideal speedup (right).


Page 54: Asynchronous Parallel Optimization

Experiment

Sparse Gaussian random matrix A ∈ R^{m×n} with m = 100000 and n = 80000; the sparsity ratio is 0.3%.

Figure: m = 80000, n = 100000, sparsity = 0.003: residual vs. number of epochs for 1, 2, 4, 8, and 10 threads (left), and speedup vs. number of threads for AsyRK against the ideal speedup (right).


Page 55: Asynchronous Parallel Optimization

Table: Comparison of running time and number of epochs between AsySCD and AsyRK on 10 cores. We report the running time and the number of epochs required to attain a residual of 10⁻⁵.

m       n       δ         data size (MB)    time AsySCD (sec)    time AsyRK (sec)    epochs AsySCD    epochs AsyRK
80k     100k    0.0005    43                39.                  3.6                 199              195
80k     100k    0.001     84                170.                 7.6                 267              284
80k     100k    0.003     244               1279.                18.4                275              232
500k    1000k   0.00005   282               54.                  5.8                 19               19
500k    1000k   0.0001    550               198.                 10.4                24               30
500k    1000k   0.0002    1086              734.                 15.0                29               31


Page 56: Asynchronous Parallel Optimization

The End
