Asynchronous Parallel Optimization
Ji Liu
University of Wisconsin-Madison
January 29, 2014
Overview

1 Overview of All Projects

2 Tensor Completion / Recovery

3 Asynchronous Parallel Optimization
    Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
    Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)
Overview of All Projects
A Optimization Algorithm and Theory
B Sparse Learning and Compressive Sensing Theory
C Applications in Computer Vision, Multimedia, Video Surveillance, Medical Image Analysis, Biological Data Analysis
Optimization Algorithm and Theory
Asynchronous Parallel Optimization [current project; NIPS13 and 3 under review]
Accelerated Randomized Kaczmarz Algorithm [under review]
  Ax = b

Reinforcement Learning [NIPS12 spotlight]

  E(A)x = E(b), find a sparse solution

Tensor Completion / Recovery [ICCV09, TPAMI13, cited by 90+]
Sparse Learning and Compressive Sensing Theory
Feature selection and cardinality optimization [ICML14]

  min_x f(x)  s.t.  ‖x‖_0 ≤ s

Dictionary LASSO [ICML13]

  min_x (1/2)‖Ax − b‖² + λ‖Dx‖_1

Multi-stage LASSO and Dantzig selector [NIPS10, JMLR12]

Robust dequantized compressive sensing [ACHA13]
Applications in Computer Vision, Multimedia, Video Surveillance, Medical Image Analysis, Biological Data Analysis
Abnormal event detection and dictionary learning [CVPR11, PR12]
Multi-task learning [SIGKDD10 best paper finalist, TKDD12]
Large-scale spectral clustering [under review]
Online scene classification [TIP13]
Applications in Computer Vision, Multimedia, Video Surveillance, Medical Image Analysis, Biological Data Analysis
Sparse tensor decomposition for Drosophila data analysis [PR12]
Main Collaborators
Stephen Wright (UW-Madison, Ph.D. advisor)
Jieping Ye, Peter Wonka (ASU, Master's advisors)
Christopher Re (Stanford)
Vikas Singh, Victor Bittorf, Srikrishna Sridhar (UW-Madison)
Bo Liu, Sridhar Mahadevan (UMass)
Yang Cong (Chinese Academy of Sciences)
Junsong Yuan (NUS)
Fujimaki Ryohei (NEC Lab)
Jiebo Luo (Rochester)
Jianhui Chen (GE Lab)
Jun Liu (SAS)
Lei Yuan (Dow Lab)
Przemyslaw Musialski (Vienna University of Technology)
Overview
1 Overview of All Projects
2 Tensor Completion / Recovery
3 Asynchronous Parallel Optimization
    Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
    Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)
Tensor Completion / Recovery
Problem: How can we estimate the missing values in low-rank tensor data?

Examples: the Netflix challenge, recommendation problems, image/video in-painting.

Figure: The left image is missing 80% of its entries (shown as white pixels); the right image shows its reconstruction using the low-rank approximation.
Our Formulation
We propose the following formulation:
  min_X ‖X‖_*  s.t.  X_Ω = T_Ω,

where X and T are n-dimensional tensors, Ω denotes the set of observed entries, and the tensor nuclear norm is defined by

  ‖X‖_* := Σ_{i=1}^n α_i ‖unfold_(i)(X)‖_*,  with Σ_i α_i = 1, α_i ≥ 0.

The tensor nuclear norm captures the low-rank structure of a tensor.
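To make the definition concrete, here is a minimal NumPy sketch that evaluates this weighted tensor nuclear norm via mode unfoldings, with uniform weights α_i = 1/n (the helper names are ours, not from the paper):

```python
import numpy as np

def unfold(X, mode):
    # Mode-i unfolding: bring axis `mode` to the front and flatten the rest.
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def tensor_nuclear_norm(X, alphas=None):
    # ||X||_* = sum_i alpha_i * ||unfold_(i)(X)||_*  (sum of singular values).
    n = X.ndim
    alphas = np.ones(n) / n if alphas is None else alphas
    return sum(a * np.linalg.norm(unfold(X, i), ord='nuc')
               for i, a in enumerate(alphas))

# A rank-1 three-way tensor: every unfolding is a rank-1 matrix.
T = np.einsum('i,j,k->ijk', *(np.random.randn(d) for d in (5, 6, 7)))
print(tensor_nuclear_norm(T))
```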
Algorithms
Main advantages of this model
Convex =⇒ Robust

Potentially strong theoretical guarantees carry over from the matrix case =⇒ very limited observations can recover the whole tensor data

Main difficulty in optimization

Multiple nonsmooth terms =⇒ general gradient methods do not work.
Two algorithms are proposed
FaLRTC: Smoothing scheme + Nesterov’s accelerated scheme(convergence rate: O(1/K ))
HaLRTC: Alternating Direction Method of Multipliers (ADMM)
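The core primitive behind both solvers is singular-value thresholding, the proximal operator of the matrix nuclear norm applied to each unfolding (e.g., in HaLRTC's ADMM subproblems). A minimal sketch; the usage line is our own illustration:

```python
import numpy as np

def svt(M, tau):
    # prox of tau*||.||_*: shrink each singular value of M by tau.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

M = np.random.randn(8, 5)
# Singular values of the result equal those of M, each reduced by 1 (floored at 0).
print(np.linalg.svd(svt(M, 1.0), compute_uv=False))
```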
Contributions and Video Show
Contributions
This is the first work to define a convex regularization that captures the low-rank structure of tensors; the extension from matrices to tensors is nontrivial.

This pioneering work has become the benchmark in tensor completion / recovery.
Video: http://www.youtube.com/watch?v=kbnmXM3uZFA
Overview
1 Overview of All Projects
2 Tensor Completion / Recovery
3 Asynchronous Parallel Optimization
    Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
    Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)
Trend of Optimization Algorithms in Machine Learning
                      before 2000         2000 ∼ 2010          2010 ∼ now
methodologies         2nd-order methods   1st-order methods    1/2-order methods
representative        Newton's method,    gradient descent,    stochastic gradient,
algorithms            SDP                 Nesterov's method    coordinate descent
convergence speed     fast                medium               slow / medium
memory cost           high                medium               low
computation cost      high (∇²f, ∇f)      medium (∇f)          low (∇_i f)
problem size          small               medium               big
Asynchronous Parallel Optimization
Figure: How does the asynchronous parallel procedure work? Several cores (each behind a cache) share RAM holding "X"; each core repeatedly: reads "X", computes a gradient at "X", and updates "X" in RAM.

All processors / cores share the same memory, which stores the current variable x;

All processors / cores run the same optimization algorithm independently;

All processors / cores update the coordinates of x concurrently, without any software locking (a code sketch follows).
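A minimal Python sketch of this shared-memory pattern (our illustration only: a toy least-squares objective, lock-free coordinate writes, and Python threads standing in for cores):

```python
import numpy as np
from threading import Thread

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 50))
b = A @ rng.standard_normal(50)
x = np.zeros(50)                        # the shared variable "X"; no lock protects it

def worker(steps=100000, gamma=1e-4):
    local = np.random.default_rng()     # per-thread sampler
    for _ in range(steps):
        i = local.integers(A.shape[0])  # read x, compute a sampled gradient at it
        j = local.integers(x.size)      # pick one coordinate of x
        x[j] -= gamma * (A[i] @ x - b[i]) * A[i, j]   # racy update of "X" in RAM

threads = [Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("residual:", np.linalg.norm(A @ x - b))
```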
Contributions
Application
This procedure totally avoids synchronization cost and even allows the data to be distributed across different locations.
Near-linear speedup over a single processor / core is achieved in many practical situations.
Successes have been demonstrated in solving sparse linear equations and in our LP solver.

Theory: Build up the theoretical foundations by providing

Convergence rates (consistent with the single-core algorithm);
Upper bounds on the number of processors / cores for which near-linear speedup can be guaranteed;
An intuitive measure of the parallelizability of optimization problems.
Overview
1 Overview of All Projects
2 Tensor Completion / Recovery
3 Asynchronous Parallel Optimization
    Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
    Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)
Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)

  min_x F(x) := f(x) + g(x)    (1)

where f(·): R^n → R is convex and differentiable, and g(·): R^n → R ∪ {+∞} is a proper, closed, convex extended-real-valued function. We assume that g(x) is separable, that is, g(x) = Σ_{i=1}^n g_i((x)_i), where g_i(·): R → R ∪ {+∞}. Several examples for g(x) (a prox sketch follows the list):

Unconstrained case: g(x) = constant.

Box-constraint case: g(x) = Σ_{i=1}^n 1_{[a_i, b_i]}((x)_i), where 1_{[a_i, b_i]}(·) is an indicator function.

ℓ_p-norm regularization: g(x) = ‖x‖_p^p with p ≥ 1.
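Because g is separable, the proximal step decouples into n independent scalar problems. A short sketch of the resulting prox maps for the box and ℓ_1 examples above (helper names are ours):

```python
import numpy as np

def prox_box(v, a, b):
    # prox of the indicator of [a, b]: projection onto the box.
    return np.clip(v, a, b)

def prox_l1(v, lam):
    # prox of lam*|.| (the p = 1 case): componentwise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([-2.0, 0.5, 3.0])
print(prox_box(v, 0.0, 1.0))   # [0.  0.5 1. ]
print(prox_l1(v, 1.0))         # [-1.  0.  2.]
```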
Examples
least squares: min_x (1/2)‖Ax − b‖²;

LASSO: min_x (1/2)‖Ax − b‖² + λ‖x‖_1;

support vector machine (SVM) with squared hinge loss:

  min_w C Σ_i max{1 − y_i(x_i^T w − b), 0}² + (1/2)‖w‖²;

support vector machine (dual form with bias term):

  min_α (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j) − Σ_i α_i
  s.t. 0 ≤ α ≤ C.
Examples (continued)
logistic regression with ℓ_p-norm regularization (p = 1, 2):

  min_w (1/n) Σ_i log(1 + exp(−y_i x_i^T w)) + λ‖w‖_p^p

semi-supervised learning (Tikhonov regularization):

  min_f Σ_{i ∈ labeled data} (f_i − y_i)² + λ f^T L f,

where L is the Laplacian matrix.

relaxed LP problem:

  min_{x≥0} c^T x s.t. Ax = b  ⇒  min_{x≥0} c^T x + λ‖Ax − b‖²
Stochastic Proximal Coordinate Descent (SCD)
Stochastic Coordinate Descent Algorithm (SCD): choose an index i ∈ {1, 2, ..., n} uniformly at random at iteration j, and update

  x_{j+1} = P_{αg_i(·)}(x_j − α e_i ∇_i f(x_j))
         := argmin_{x∈R^n} (1/2)‖x − (x_j − α e_i ∇_i f(x_j))‖² + α g_i((x)_i)
          = argmin_{x∈R^n} f(x_j) + ⟨∇_i f(x_j), (x − x_j)_i⟩ + (1/(2α))‖x − x_j‖² + g_i((x)_i).

Note that only a single element of x_{j+1} is updated:

  (x_{j+1})_l = argmin_{y∈R} (1/2)(y − ((x_j)_l − α∇_l f(x_j)))² + α g_l(y),  if l = i;
  (x_{j+1})_l = (x_j)_l,  otherwise.
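For concreteness, a serial SCD sketch for the LASSO example, with α = 1/L_max as a simple steplength choice (our illustration, not the paper's code):

```python
import numpy as np

def scd_lasso(A, b, lam, iters=100000, seed=0):
    # Serial SCD for min_x 0.5*||Ax - b||^2 + lam*||x||_1.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    r = A @ x - b                              # residual, maintained incrementally
    alpha = 1.0 / (A * A).sum(axis=0).max()    # 1 / Lmax (max diagonal of A^T A)
    for _ in range(iters):
        i = rng.integers(n)
        v = x[i] - alpha * (A[:, i] @ r)       # coordinate gradient step
        new = np.sign(v) * max(abs(v) - alpha * lam, 0.0)  # scalar prox of lam*|.|
        r += (new - x[i]) * A[:, i]
        x[i] = new
    return x
```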
Our Target: Asynchronously parallelize SCD.
Notation
L_max: component Lipschitz constant ("max diagonal of the Hessian"):

  ‖∇f(x + t e_i) − ∇f(x)‖_∞ ≤ L_max |t|  ∀x ∈ R^n, t ∈ R, i;

L_res: restricted Lipschitz constant ("max row norm of the Hessian"):

  ‖∇f(x + t e_i) − ∇f(x)‖ ≤ L_res |t|  ∀x ∈ R^n, t ∈ R, i;

Λ := L_res / L_max;

S: the solution set of (1).
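For a quadratic f(x) = (1/2)‖Ax − b‖², whose Hessian is A^T A, both constants are easy to compute directly; a small sketch of that computation:

```python
import numpy as np

A = np.random.default_rng(1).standard_normal((200, 50))
H = A.T @ A                              # Hessian of f(x) = 0.5*||Ax - b||^2

Lmax = H.diagonal().max()                # component Lipschitz constant
Lres = np.linalg.norm(H, axis=0).max()   # max column (= row, by symmetry) 2-norm
print("Lmax =", Lmax, "Lres =", Lres, "Lambda =", Lres / Lmax)
```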
Notation: Optimally Strongly Convex (OSC)
Optimal strong convexity with parameter μ > 0:

  F(x) − F(P_S(x)) ≥ (μ/2)‖x − P_S(x)‖²

for all x ∈ dom F. This is weaker than the usual strong convexity.
Examples of OSC (but not strongly convex) functions
F(x) = (1/2)‖Ax − b‖², where A can be any matrix;

F(x) = (1/2)‖Ax − b‖² s.t. l ≤ x ≤ h, where A can be any matrix;

squared hinge loss: F(x) = Σ_k max(a_k^T x − b_k, 0)²;

more generally,

  min_x f(Ax)  s.t.  x ∈ Ω,

where f(·) is strongly convex and Ω defines a polyhedron.
Asynchronous SPCD Algorithm (AsySCD) with Inconsistent Read

Asynchronous parallelization as in Hogwild! [Niu, Recht, Re, and Wright, 2011]:

All processors share the same memory, which stores the current x;

All processors run the same algorithm (the SCD algorithm, in AsySCD);

All processors update the values of x simultaneously, with no software locking.

We use the same setup for AsySCD.
Asynchronous SPCD Algorithm (AsySCD) with Inconsistent Read

At iteration j:

Choose i with equal probability from {1, 2, ..., n};
Update the i-th component:

  x_{j+1} = P_{(γ/L_max) g_i(·)} ( x_j − (γ/L_max) ∇_i f(x̂_j) e_i ),

where x̂_j is the state of x read from shared memory (made precise via the index set K(j) below), and γ is a constant steplength.

Each core runs this process concurrently and asynchronously.

x̂_j may not be any real state of x in the shared memory; this is called "inconsistent read". Hogwild! [Niu et al., 2011] assumes that x̂_j is some earlier iterate (a real state) of x, for simpler analysis; this is called "consistent read". (The "consistent read" model is just a special case of the "inconsistent read" model.)
When does Inconsistent Read happen?
Inconsistent Read
Mathematically, the relationship between x̂_j and x_j can be expressed as

  x_j = x̂_j + Σ_{d ∈ K(j)} (x_{d+1} − x_d),

where K(j) is a set of iteration indices. Intuitively, x̂_j has missed a few of the updates that produced x_j.

Here we assume τ to be an upper bound on the ages of all elements in K(j), for all j: τ ≥ j − min{d | d ∈ K(j)}.
Key to Analysis
  x_{j+1} = P_{(γ/L_max) g_i(·)} ( x_j − (γ/L_max) ∇_i f(x̂_j) e_i )

Choose some ρ > 1 and pick γ small enough to ensure that

  E(‖x_j − x_{j−1}‖²) ≤ ρ E(‖x_{j+1} − x_j‖²)    ("ρ-condition").

There is not too much change in the gradient over each iteration, so not too much of a price is paid for using inexact information in the asynchronous setting.

We can choose γ small enough to satisfy this property, but large enough to get a good convergence rate.
OSC: Linear Rate
Theorem
For any ρ > 1 + 4/√n, define

  θ := (ρ^{(τ+1)/2} − ρ^{1/2}) / (ρ^{1/2} − 1),   θ′ := (ρ^{τ+1} − ρ) / (ρ − 1),   ψ := 1 + τθ′/n + Λθ/√n,

and choose

  γ ≤ min{ 1/ψ,  (√n(1 − ρ^{−1}) − 4) / (4(1 + θ)Λ) }.

Then the "ρ-condition" is satisfied and, for any j ≥ 0,

  E‖x_j − P_S(x_j)‖² + (2γ/L_max)(E F(x_j) − F*)
    ≤ (1 − μ/(n(μ + γ^{−1} L_max)))^j (‖x_0 − P_S(x_0)‖² + (2γ/L_max)(F(x_0) − F*)).
A Particular Choice
Corollary
Consider the regime in which

  4eΛ(τ + 1)² ≤ √n,

and define ρ = (1 + 4eΛ(τ+1)/√n)². Then we can choose γ = 1/2, and the rate simplifies to

  E(F(x_j) − F*) ≤ (1 − μ/(n(μ + 2L_max)))^j (L_max‖x_0 − P_S(x_0)‖² + F(x_0) − F*).
Comparison
Convergence rate for AsySCD, choosing the number of processors on the order of n^{1/4} L_max/L_res:

  E(F(x_j) − F*) ≤ (1 − c μ n^{1/4} L_max / (n(μ + 2L_max) L_res))^j (F(x_0) − F*)
               ≈ (1 − c μ / (n^{3/4} L_res))^j (F(x_0) − F*),

where c is a constant. Recall the convergence rate of Proximal Gradient Descent¹:

  F(x_j) − F* ≤ (1 − μ/(nL))^j (F(x_0) − F*),

where L is the Lipschitz constant of ∇f(x). Note that

  L/√n ≤ L_res ≤ L  ⇒  n^{3/4} L_res ≤ nL.

¹Compensated by the complexity factor n.
Weakly Convex: Sublinear Rate
Defining ψ and γ as above, we have

  E(F(x_j) − F*) ≤ n(L_max‖x_0 − P_S(x_0)‖² + 2γ(F(x_0) − F*)) / (2γ(j + n)).

Roughly "1/j" behavior.

Assuming 4eΛ(τ + 1)² ≤ √n and setting ρ and γ as above, we have

  E(F(x_j) − F*) ≤ n(L_max‖x_0 − P_S(x_0)‖² + F(x_0) − F*) / (j + n).
Diagonalicity of Hessian
  τ + 1 ≤ n^{1/4} / √(4eΛ)

A smaller Λ = L_res/L_max is beneficial to parallelization.

The ratio L_res/L_max is particularly important: it measures the degree of diagonal dominance in the Hessian ∇²f(x) ("diagonalicity"). By convexity, we have

  1 ≤ L_res/L_max ≤ √n.

The ratio is closer to 1 if the Hessian is nearly diagonally dominant (eigenvectors close to the principal coordinate axes), and closer to √n otherwise.

If A is an m × n Gaussian random matrix and f(x) = (1/2)‖Ax − b‖², the ratio is bounded by 1 + O(√(n/m)) with high probability. This allows τ ≈ O(n^{1/4}).
Constrained: 4-socket, 40-core Intel Xeon
  min_{x≥0} (x − z)^T (A^T A + 0.5 I)(x − z),

where A ∈ R^{m×n} is a Gaussian random matrix (m = 6000, n = 20000, columns normalized to 1) and z is a Gaussian random vector. L_res/L_max ≈ 2.2. Choose γ = 1.

Figure: Synthetic constrained QP (n = 20000, p = 10): residual versus epochs for 1-40 threads (left); speedup versus number of threads for AsySCD-DW, against the ideal and global-locking baselines (right).
Unconstrained: 4-socket, 40-core Intel Xeon
  min_x ‖Ax − b‖² + 0.5‖x‖²,

where A ∈ R^{m×n} is a Gaussian random matrix (m = 6000, n = 20000, data size ≈ 3GB, columns normalized to 1). L_res/L_max ≈ 2.2. Choose γ = 1. It takes 3-4 seconds to achieve accuracy 10⁻⁵ on 40 cores.

Figure: Synthetic unconstrained QP (n = 20000, p = 10): residual versus epochs for 1-40 threads (left); speedup versus number of threads for AsySCD-DW, against the ideal and global-locking baselines (right).
Experiments: 1-socket, 10-core Intel Xeon
  min_x (1/2)‖Ax − b‖² + λ‖x‖_1,

where A ∈ R^{m×n} is a Gaussian random matrix (m = 6000, n = 10000, data size ≈ 750MB), b = A ∗ sprandn(n, 1, 10) + 0.01 ∗ randn(n, 1), and λ = 0.2√(m log(n)). L_res/L_max ≈ 2.2. Choose γ = 1.

Figure: LASSO (m = 6000, n = 10000, sparsity = 10, σ = 0.01): objective versus epochs for 1-10 threads (left); speedup versus number of threads for AsySPCD, against the ideal (right).
Experiments: 1-socket, 10-core Intel Xeon
  min_x (1/2)‖Ax − b‖² + λ‖x‖_1,

where A ∈ R^{m×n} is a Gaussian random matrix (m = 12000, n = 20000, data size ≈ 3GB), b = A ∗ sprandn(n, 1, 20) + 0.01 ∗ randn(n, 1), and λ = 0.2√(m log(n)). L_res/L_max ≈ 2.2. Choose γ = 1.

Figure: LASSO (m = 12000, n = 20000, sparsity = 20, σ = 0.01): objective versus epochs for 1-10 threads (left); speedup versus number of threads for AsySPCD, against the ideal (right).
AsySCD vs. SynGD
          Time (sec)         Speedup
#cores    SynGD / AsySCD     SynGD / AsySCD
 1        121.  / 27.1       0.22 / 1.00
10        11.4  / 2.57       2.38 / 10.5
20        6.00  / 1.36       4.51 / 19.9
30        4.44  / 1.01       6.10 / 26.8
40        3.91  / 0.88       6.93 / 30.8

Table: Efficiency comparison between SynGD and AsySCD for the QP problem. The running time and speedup are calculated based on the residual 10⁻⁵.
Vertex Cover Problem

The vertex cover problem for an undirected graph with edge set E and vertex set V can be written as a binary linear program:

  min_{y ∈ {0,1}^{|V|}} Σ_{v∈V} y_v  s.t.  y_u + y_v ≥ 1, ∀ (u, v) ∈ E.

By relaxing each binary constraint to the interval [0, 1] and introducing slack variables for the cover inequalities, we obtain a problem of the form

  min_{y_v ∈ [0,1], s_uv ∈ [0,1]} Σ_{v∈V} y_v  s.t.  y_u + y_v − s_uv = 1, ∀ (u, v) ∈ E.

This has the form

  min_{x ∈ [0,1]^n} c^T x s.t. Ax = b  ⇒  min_{x ∈ [0,1]^n} c^T x + (β/2)‖Ax − b‖²

for n = |V| + |E|. The test problem is a regularized quadratic-penalty reformulation of this linear program.
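A small sketch of assembling such a penalized instance from an edge list (our illustration; helper names and the SciPy construction are assumptions, not the authors' code):

```python
import numpy as np
from scipy.sparse import lil_matrix

def vertex_cover_qp(edges, num_vertices):
    # Data for min_{x in [0,1]^n} c^T x + (beta/2)*||Ax - b||^2 with
    # x = (y_1..y_|V|, s_1..s_|E|) and one row y_u + y_v - s_e = 1 per edge.
    n = num_vertices + len(edges)
    A = lil_matrix((len(edges), n))
    for e, (u, v) in enumerate(edges):
        A[e, u] = 1.0
        A[e, v] = 1.0
        A[e, num_vertices + e] = -1.0    # slack variable for this edge
    c = np.concatenate([np.ones(num_vertices), np.zeros(len(edges))])
    return c, A.tocsr(), np.ones(len(edges))

c, A, b = vertex_cover_qp([(0, 1), (1, 2), (2, 0)], 3)   # a triangle graph
```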
Vertex Cover (Amazon): 4-socket, 40-core Intel Xeon
Figure: Vertex cover on the Amazon graph (n = 561050, p = 10): residual versus epochs for 1-40 threads (left); speedup versus number of threads for AsySCD-DW, against the ideal and global-locking baselines (right).
Vertex Cover (DBLP): 4-socket, 40-core Intel Xeon
Figure: Vertex cover on the DBLP graph (n = 520891, p = 10): residual versus epochs for 1-40 threads (left); speedup versus number of threads for AsySCD-DW, against the ideal and global-locking baselines (right).
Running Time
Problem    1 core    40 cores
QP         98.4      3.03
QPc        59.7      1.82
Amazon     17.1      1.25
DBLP       11.5      0.91

Table: Runtimes (s) for the four test problems on 1 and 40 cores.
Overview
1 Overview of All Projects
2 Tensor Completion / Recovery
3 Asynchronous Parallel Optimization
    Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
    Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)
Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)

Consider the linear system Ax = b, where the equations are feasible and the matrix A ∈ R^{m×n} is not necessarily square or full rank. Write

  A = [a_1, a_2, ..., a_m]^T,  where ‖a_i‖ = 1, i = 1, 2, ..., m.

For an infeasible system, one can minimize ‖Ax − b‖², which is equivalent to solving the feasible linear system A^T A x = A^T b, or the extended version

  Ax = y,  A^T y = A^T b.

Randomized Kaczmarz (RK) Algorithm

Select a row index i ∈ {1, 2, ..., m} randomly with equal probability;
Update

  x_{j+1} = x_j − (a_i^T x_j − b_i) a_i.
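A compact NumPy sketch of serial RK under the unit-row normalization above (a minimal illustration):

```python
import numpy as np

def randomized_kaczmarz(A, b, iters=50000, seed=0):
    # RK for a feasible system Ax = b whose rows have unit length.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.integers(m)
        x -= (A[i] @ x - b[i]) * A[i]    # project onto the hyperplane a_i^T x = b_i
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((300, 100))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # normalize rows: ||a_i|| = 1
b = A @ rng.standard_normal(100)                 # feasible by construction
print(np.linalg.norm(A @ randomized_kaczmarz(A, b) - b))
```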
AsyRK Algorithm
At each iteration j:

Select i from {1, 2, ..., m} with equal probability;
Select t from the support of a_i with equal probability;
Update:

  x_{j+1} = x_j − γ ‖a_i‖_0 (a_i^T x_{k(j)} − b_i)(a_i)_t e_t,

where
  k(j) is some iterate prior to j, but no more than τ cycles old: j − k(j) ≤ τ;
  γ is a constant steplength.

As in Hogwild!, different cores run this process concurrently, updating an x accessible to all.
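One AsyRK update, written out as a function with the stale read passed in explicitly (a serial sketch that only mimics the delay; names are ours):

```python
import numpy as np

def asyrk_step(x, a_i, b_i, x_stale, gamma, rng):
    # Pick t uniformly from supp(a_i); use the stale residual a_i^T x_stale - b_i;
    # update only coordinate t of the shared iterate x.
    support = np.flatnonzero(a_i)
    t = rng.choice(support)
    x[t] -= gamma * len(support) * (a_i @ x_stale - b_i) * a_i[t]
    return x
```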
AsyRK Analysis: Linear Convergence
Theorem
Choose any ρ > 1 and define γ via the following:

  ψ := χ + (2λ_max τ ρ^τ)/m,

  γ ≤ min{ 1/ψ,  m(ρ − 1)/(2λ_max ρ^{τ+1}),  m√(ρ − 1) / √(ρ^τ(mα² + λ_max² τ ρ^τ)) }.

Then a certain "ρ-condition" holds, together with the linear convergence rate

  E(‖x_j − P(x_j)‖²) ≤ (1 − λ_min γ(2 − γψ)/m)^j ‖x_0 − P(x_0)‖²,

where χ := max_{i=1,...,m} ‖a_i‖_0.

A particular choice of ρ leads to simplified results in a reasonable regime.
A Particular Choice
Corollary
Assume

  τ + 1 ≤ m/(2eλ_max)

and set ρ = 1 + 2eλ_max/m. One can show that γ = 1/ψ in this case, so the expected convergence is

  E(‖x_j − P(x_j)‖²) ≤ (1 − λ_min/(m(χ + 1)))^j ‖x_0 − P(x_0)‖².

The algorithm converges to precision ε with probability at least 1 − η in

  K ≥ (m(χ + 1)/λ_min) |log(‖x_0 − P(x_0)‖²/(ηε))|

iterations.

In the regime 2eλ_max(τ + 1) ≤ m considered here, the delay τ does not really interfere with the convergence rate.
Discussion
For random matrices A with unit rows, we have λ_max ≈ 1 + O(m/n) with high probability.

The conditions on τ are less strict than for the AsySCD algorithms: for random matrices A, with m and n of the same order, we can have τ = O(m) = O(n).

(Recall τ = O(n^{1/4}) for AsySCD in the constrained case and τ = O(n^{1/2}) for AsySCD in the unconstrained case.)
Comparison
algorithms             RK                   AsySCD                                     AsyRK
# oper. per iter.      O(δn)                min{O(δ²mn), O(n)}                         O(δn)
rate (iteration)       1 − λ_min/m          1 − λ_min/(2nL_max)                        1 − λ_min/(m(χ+1))
# processors           1                    O(√n L_max/L_res)                          O(m/λ_max)
rate (running time)    1 − O(λ_min/(δmn))   1 − O(λ_min/(n^{1.5} L_res min{δ²m, 1}))   1 − O(λ_min/(δ²n² λ_max))

Here δ is the sparsity ratio (fraction of nonzeros) of A, so χ is roughly δn. AsyRK favors sparse data matrices A.
Experiment
[Contributed by my wife]
Sparse Gaussian random matrix A ∈ R^{m×n} with m = 100000 and n = 80000; sparsity ratio 0.1%.

Figure: AsyRK (m = 80000, n = 100000, sparsity = 0.001): residual versus epochs for 1-10 threads (left); speedup versus number of threads, against the ideal (right).
Experiment
Sparse Gaussian random matrix A ∈ R^{m×n} with m = 100000 and n = 80000; sparsity ratio 0.3%.

Figure: AsyRK (m = 80000, n = 100000, sparsity = 0.003): residual versus epochs for 1-10 threads (left); speedup versus number of threads, against the ideal (right).
Table: Comparison of running time and epochs between AsySCD and AsyRK on 10 cores. We report the running time and number of epochs required to attain a residual of 10⁻⁵.

synthetic data           size (MB)   running time (sec)    epochs
m      n      δ                      AsySCD    AsyRK       AsySCD   AsyRK
80k    100k   0.0005     43          39.       3.6         199      195
80k    100k   0.001      84          170.      7.6         267      284
80k    100k   0.003      244         1279.     18.4        275      232
500k   1000k  0.00005    282         54.       5.8         19       19
500k   1000k  0.0001     550         198.      10.4        24       30
500k   1000k  0.0002     1086        734.      15.0        29       31
The End