Asynchronous Parallel Optimization
Ji Liu
University of Wisconsin-Madison
January 29, 2014
Overview

1 Overview of All Projects

2 Tensor Completion / Recovery

3 Asynchronous Parallel Optimization
    Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
    Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)
Overview of All Projects
A Optimization Algorithm and Theory
B Sparse Learning and Compressive Sensing Theory
C Applications in Computer Vision, Multimedia, Video Surveillance, Medical Image Analysis, Biological Data Analysis
Optimization Algorithm and Theory
Asynchronous Parallel Optimization [current project; NIPS13 and 3 under review]
Accelerated Randomized Kaczmarz Algorithm [under review]
  Ax = b

Reinforcement Learning [NIPS12 spotlight]

  E(A)x = E(b), find a sparse solution

Tensor Completion / Recovery [ICCV09, TPAMI13, cited by 90+]
Sparse Learning and Compressive Sensing Theory
Feature selection and cardinality optimization [ICML14]

  min_x f(x)  s.t.  ‖x‖_0 ≤ s

Dictionary LASSO [ICML13]

  min_x (1/2)‖Ax − b‖² + λ‖Dx‖_1

Multi-stage LASSO and Dantzig selector [NIPS10, JMLR12]

Robust dequantized compressive sensing [ACHA13]
Applications in Computer Vision, Multimedia, Video Surveillance, Medical Image Analysis, Biological Data Analysis
Abnormal event detection and dictionary learning [CVPR11, PR12]
Multi-task learning [SIGKDD10 best paper finalist, TKDD12]
Large-scale spectral clustering [under review]
Online scene classification [TIP13]
Applications in Computer Vision, Multimedia, Video Surveillance, Medical Image Analysis, Biological Data Analysis
Sparse tensor decomposition for Drosophila data analysis [PR12]
Main Collaborators
Stephen Wright (UW-Madison, Ph.D. advisor)
Jieping Ye, Peter Wonka (ASU, Master's advisors)
Christopher Re (Stanford)
Vikas Singh, Victor Bittorf, Srikrishna Sridhar (UW-Madison)
Bo Liu, Sridhar Mahadevan (UMass)
Yang Cong (Chinese Academy of Sciences)
Junsong Yuan (NUS)
Fujimaki Ryohei (NEC Lab)
Jiebo Luo (Rochester)
Jianhui Chen (GE Lab)
Jun Liu (SAS)
Lei Yuan (Dow Lab)
Przemyslaw Musialski (Vienna University of Technology)
Overview
1 Overview of All Projects
2 Tensor Completion / Recovery
3 Asynchronous Parallel Optimization
    Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
    Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)
Tensor Completion / Recovery
Problem: How can we estimate the missing values in low-rank tensor data?

Examples: the Netflix challenge, recommendation problems, image/video in-painting.

Figure: The left image is missing 80% of its entries (shown as white pixels); the right image shows its reconstruction using the low-rank approximation.
Our Formulation
We propose the following formulation:
  min_X ‖X‖_*  s.t.  X_Ω = T_Ω,

where X and T are n-dimensional tensors, Ω denotes the set of observed entries, and the tensor nuclear norm is defined by

  ‖X‖_* := Σ_{i=1}^n α_i ‖unfold_(i)(X)‖_*,  with Σ_i α_i = 1, α_i ≥ 0.

The tensor nuclear norm captures the low-rank structure of a tensor.
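To make the definition concrete, here is a minimal NumPy sketch that evaluates this weighted tensor nuclear norm via mode unfoldings, with uniform weights α_i = 1/n (the helper names are ours, not from the paper):

```python
import numpy as np

def unfold(X, mode):
    # Mode-i unfolding: bring axis `mode` to the front and flatten the rest.
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def tensor_nuclear_norm(X, alphas=None):
    # ||X||_* = sum_i alpha_i * ||unfold_(i)(X)||_*  (sum of singular values).
    n = X.ndim
    alphas = np.ones(n) / n if alphas is None else alphas
    return sum(a * np.linalg.norm(unfold(X, i), ord='nuc')
               for i, a in enumerate(alphas))

# A rank-1 three-way tensor: every unfolding is a rank-1 matrix.
T = np.einsum('i,j,k->ijk', *(np.random.randn(d) for d in (5, 6, 7)))
print(tensor_nuclear_norm(T))
```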
Algorithms
Main advantages of this model
Convex =⇒ Robust

Potentially strong theoretical guarantees carry over from the matrix case =⇒ very limited observations can recover the whole tensor data

Main difficulty in optimization

Multiple nonsmooth terms =⇒ general gradient methods do not work.
Two algorithms are proposed
FaLRTC: Smoothing scheme + Nesterov’s accelerated scheme(convergence rate: O(1/K ))
HaLRTC: Alternating Direction Method of Multipliers (ADMM)
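The core primitive behind both solvers is singular-value thresholding, the proximal operator of the matrix nuclear norm applied to each unfolding (e.g., in HaLRTC's ADMM subproblems). A minimal sketch; the usage line is our own illustration:

```python
import numpy as np

def svt(M, tau):
    # prox of tau*||.||_*: shrink each singular value of M by tau.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

M = np.random.randn(8, 5)
# Singular values of the result equal those of M, each reduced by 1 (floored at 0).
print(np.linalg.svd(svt(M, 1.0), compute_uv=False))
```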
Contributions and Video Show
Contributions
This is the first work to define a convex regularization that captures the low-rank structure of tensors; the extension from matrices to tensors is nontrivial.

This pioneering work has become the benchmark in tensor completion / recovery.
Video: http://www.youtube.com/watch?v=kbnmXM3uZFA
Overview
1 Overview of All Projects
2 Tensor Completion / Recovery
3 Asynchronous Parallel Optimization
    Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
    Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)
Trend of Optimization Algorithms in Machine Learning
                      before 2000         2000 ∼ 2010          2010 ∼ now
methodologies         2nd-order methods   1st-order methods    1/2-order methods
representative        Newton's method,    gradient descent,    stochastic gradient,
algorithms            SDP                 Nesterov's method    coordinate descent
convergence speed     fast                medium               slow / medium
memory cost           high                medium               low
computation cost      high (∇²f, ∇f)      medium (∇f)          low (∇_i f)
problem size          small               medium               big
Asynchronous Parallel Optimization
Figure: How does the asynchronous parallel procedure work? Several cores (each behind a cache) share RAM holding "X"; each core repeatedly: reads "X", computes a gradient at "X", and updates "X" in RAM.

All processors / cores share the same memory, which stores the current variable x;

All processors / cores run the same optimization algorithm independently;

All processors / cores update the coordinates of x concurrently, without any software locking (a code sketch follows).
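A minimal Python sketch of this shared-memory pattern (our illustration only: a toy least-squares objective, lock-free coordinate writes, and Python threads standing in for cores):

```python
import numpy as np
from threading import Thread

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 50))
b = A @ rng.standard_normal(50)
x = np.zeros(50)                        # the shared variable "X"; no lock protects it

def worker(steps=100000, gamma=1e-4):
    local = np.random.default_rng()     # per-thread sampler
    for _ in range(steps):
        i = local.integers(A.shape[0])  # read x, compute a sampled gradient at it
        j = local.integers(x.size)      # pick one coordinate of x
        x[j] -= gamma * (A[i] @ x - b[i]) * A[i, j]   # racy update of "X" in RAM

threads = [Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("residual:", np.linalg.norm(A @ x - b))
```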
Contributions
Application
This procedure totally avoids synchronization cost and even allows the data to be distributed across different locations.
Near-linear speedup over a single processor / core is achieved in many practical situations.
Successes have been demonstrated in solving sparse linear equations and in our LP solver.

Theory: Build up the theoretical foundations by providing

Convergence rates (consistent with the single-core algorithm);
Upper bounds on the number of processors / cores for which near-linear speedup can be guaranteed;
An intuitive measure of the parallelizability of optimization problems.
Overview
1 Overview of All Projects
2 Tensor Completion / Recovery
3 Asynchronous Parallel Optimization
    Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
    Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)
Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)

  min_x F(x) := f(x) + g(x)    (1)

where f(·): R^n → R is convex and differentiable, and g(·): R^n → R ∪ {+∞} is a proper, closed, convex extended-real-valued function. We assume that g(x) is separable, that is, g(x) = Σ_{i=1}^n g_i((x)_i), where g_i(·): R → R ∪ {+∞}. Several examples for g(x) (a prox sketch follows the list):

Unconstrained case: g(x) = constant.

Box-constraint case: g(x) = Σ_{i=1}^n 1_{[a_i, b_i]}((x)_i), where 1_{[a_i, b_i]}(·) is an indicator function.

ℓ_p-norm regularization: g(x) = ‖x‖_p^p with p ≥ 1.
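Because g is separable, the proximal step decouples into n independent scalar problems. A short sketch of the resulting prox maps for the box and ℓ_1 examples above (helper names are ours):

```python
import numpy as np

def prox_box(v, a, b):
    # prox of the indicator of [a, b]: projection onto the box.
    return np.clip(v, a, b)

def prox_l1(v, lam):
    # prox of lam*|.| (the p = 1 case): componentwise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([-2.0, 0.5, 3.0])
print(prox_box(v, 0.0, 1.0))   # [0.  0.5 1. ]
print(prox_l1(v, 1.0))         # [-1.  0.  2.]
```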
Examples
least squares: min_x (1/2)‖Ax − b‖²;

LASSO: min_x (1/2)‖Ax − b‖² + λ‖x‖_1;

support vector machine (SVM) with squared hinge loss:

  min_w C Σ_i max{1 − y_i(x_i^T w − b), 0}² + (1/2)‖w‖²;

support vector machine (dual form with bias term):

  min_α (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j) − Σ_i α_i
  s.t. 0 ≤ α ≤ C.
Examples (continued)
logistic regression with ℓ_p-norm regularization (p = 1, 2):

  min_w (1/n) Σ_i log(1 + exp(−y_i x_i^T w)) + λ‖w‖_p^p

semi-supervised learning (Tikhonov regularization):

  min_f Σ_{i ∈ labeled data} (f_i − y_i)² + λ f^T L f,

where L is the Laplacian matrix.

relaxed LP problem:

  min_{x≥0} c^T x s.t. Ax = b  ⇒  min_{x≥0} c^T x + λ‖Ax − b‖²
Stochastic Proximal Coordinate Descent (SCD)
Stochastic Coordinate Descent Algorithm (SCD): choose an index i ∈ {1, 2, ..., n} uniformly at random at iteration j, and update

  x_{j+1} = P_{αg_i(·)}(x_j − α e_i ∇_i f(x_j))
         := argmin_{x∈R^n} (1/2)‖x − (x_j − α e_i ∇_i f(x_j))‖² + α g_i((x)_i)
          = argmin_{x∈R^n} f(x_j) + ⟨∇_i f(x_j), (x − x_j)_i⟩ + (1/(2α))‖x − x_j‖² + g_i((x)_i).

Note that only a single element of x_{j+1} is updated:

  (x_{j+1})_l = argmin_{y∈R} (1/2)(y − ((x_j)_l − α∇_l f(x_j)))² + α g_l(y),  if l = i;
  (x_{j+1})_l = (x_j)_l,  otherwise.
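For concreteness, a serial SCD sketch for the LASSO example, with α = 1/L_max as a simple steplength choice (our illustration, not the paper's code):

```python
import numpy as np

def scd_lasso(A, b, lam, iters=100000, seed=0):
    # Serial SCD for min_x 0.5*||Ax - b||^2 + lam*||x||_1.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    r = A @ x - b                              # residual, maintained incrementally
    alpha = 1.0 / (A * A).sum(axis=0).max()    # 1 / Lmax (max diagonal of A^T A)
    for _ in range(iters):
        i = rng.integers(n)
        v = x[i] - alpha * (A[:, i] @ r)       # coordinate gradient step
        new = np.sign(v) * max(abs(v) - alpha * lam, 0.0)  # scalar prox of lam*|.|
        r += (new - x[i]) * A[:, i]
        x[i] = new
    return x
```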
Our Target: Asynchronously parallelize SCD.
Notation
L_max: component Lipschitz constant ("max diagonal of the Hessian"):

  ‖∇f(x + t e_i) − ∇f(x)‖_∞ ≤ L_max |t|  ∀x ∈ R^n, t ∈ R, i;

L_res: restricted Lipschitz constant ("max row norm of the Hessian"):

  ‖∇f(x + t e_i) − ∇f(x)‖ ≤ L_res |t|  ∀x ∈ R^n, t ∈ R, i;

Λ := L_res / L_max;

S: the solution set of (1).
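For a quadratic f(x) = (1/2)‖Ax − b‖², whose Hessian is A^T A, both constants are easy to compute directly; a small sketch of that computation:

```python
import numpy as np

A = np.random.default_rng(1).standard_normal((200, 50))
H = A.T @ A                              # Hessian of f(x) = 0.5*||Ax - b||^2

Lmax = H.diagonal().max()                # component Lipschitz constant
Lres = np.linalg.norm(H, axis=0).max()   # max column (= row, by symmetry) 2-norm
print("Lmax =", Lmax, "Lres =", Lres, "Lambda =", Lres / Lmax)
```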
Notation: Optimally Strongly Convex (OSC)
Optimal strong convexity with parameter μ > 0:

  F(x) − F(P_S(x)) ≥ (μ/2)‖x − P_S(x)‖²

for all x ∈ dom F. This is weaker than the usual strong convexity.
Examples of OSC (but not strongly convex) functions
F(x) = (1/2)‖Ax − b‖², where A can be any matrix;

F(x) = (1/2)‖Ax − b‖² s.t. l ≤ x ≤ h, where A can be any matrix;

squared hinge loss: F(x) = Σ_k max(a_k^T x − b_k, 0)²;

more generally,

  min_x f(Ax)  s.t.  x ∈ Ω,

where f(·) is strongly convex and Ω defines a polyhedron.
Asynchronous SPCD Algorithm (AsySCD) with Inconsistent Read

Asynchronous parallelization as in Hogwild! [Niu, Recht, Re, and Wright, 2011]:

All processors share the same memory, which stores the current x;

All processors run the same algorithm (the SCD algorithm, in AsySCD);

All processors update the values of x simultaneously, with no software locking.

We use the same setup for AsySCD.
Asynchronous SPCD Algorithm (AsySCD) with Inconsistent Read

At iteration j:

Choose i with equal probability from {1, 2, ..., n};
Update the i-th component:

  x_{j+1} = P_{(γ/L_max) g_i(·)} ( x_j − (γ/L_max) ∇_i f(x̂_j) e_i ),

where x̂_j is the state of x read from shared memory (made precise via the index set K(j) below), and γ is a constant steplength.

Each core runs this process concurrently and asynchronously.

x̂_j may not be any real state of x in the shared memory; this is called "inconsistent read". Hogwild! [Niu et al., 2011] assumes that x̂_j is some earlier iterate (a real state) of x, for simpler analysis; this is called "consistent read". (The "consistent read" model is just a special case of the "inconsistent read" model.)
When does Inconsistent Read happen?
Inconsistent Read
Mathematically, the relationship between x̂_j and x_j can be expressed as

  x_j = x̂_j + Σ_{d ∈ K(j)} (x_{d+1} − x_d),

where K(j) is a set of iteration indices. Intuitively, x̂_j has missed a few of the updates that produced x_j.

Here we assume τ to be an upper bound on the ages of all elements in K(j), for all j: τ ≥ j − min{d | d ∈ K(j)}.
Key to Analysis
  x_{j+1} = P_{(γ/L_max) g_i(·)} ( x_j − (γ/L_max) ∇_i f(x̂_j) e_i )

Choose some ρ > 1 and pick γ small enough to ensure that

  E(‖x_j − x_{j−1}‖²) ≤ ρ E(‖x_{j+1} − x_j‖²)    ("ρ-condition").

There is not too much change in the gradient over each iteration, so not too much of a price is paid for using inexact information in the asynchronous setting.

We can choose γ small enough to satisfy this property, but large enough to get a good convergence rate.
OSC: Linear Rate
Theorem
For any ρ > 1 + 4/√n, define

  θ := (ρ^{(τ+1)/2} − ρ^{1/2}) / (ρ^{1/2} − 1),   θ′ := (ρ^{τ+1} − ρ) / (ρ − 1),   ψ := 1 + τθ′/n + Λθ/√n,

and choose

  γ ≤ min{ 1/ψ,  (√n(1 − ρ^{−1}) − 4) / (4(1 + θ)Λ) }.

Then the "ρ-condition" is satisfied and, for any j ≥ 0,

  E‖x_j − P_S(x_j)‖² + (2γ/L_max)(E F(x_j) − F*)
    ≤ (1 − μ/(n(μ + γ^{−1} L_max)))^j (‖x_0 − P_S(x_0)‖² + (2γ/L_max)(F(x_0) − F*)).
A Particular Choice
Corollary
Consider the regime in which

  4eΛ(τ + 1)² ≤ √n,

and define ρ = (1 + 4eΛ(τ+1)/√n)². Then we can choose γ = 1/2, and the rate simplifies to

  E(F(x_j) − F*) ≤ (1 − μ/(n(μ + 2L_max)))^j (L_max‖x_0 − P_S(x_0)‖² + F(x_0) − F*).
Comparison
Convergence rate for AsySCD, choosing the number of processors on the order of n^{1/4} L_max/L_res:

  E(F(x_j) − F*) ≤ (1 − c μ n^{1/4} L_max / (n(μ + 2L_max) L_res))^j (F(x_0) − F*)
               ≈ (1 − c μ / (n^{3/4} L_res))^j (F(x_0) − F*),

where c is a constant. Recall the convergence rate of Proximal Gradient Descent¹:

  F(x_j) − F* ≤ (1 − μ/(nL))^j (F(x_0) − F*),

where L is the Lipschitz constant of ∇f(x). Note that

  L/√n ≤ L_res ≤ L  ⇒  n^{3/4} L_res ≤ nL.

¹Compensated by the complexity factor n.
Weakly Convex: Sublinear Rate
Defining ψ and γ as above, we have

  E(F(x_j) − F*) ≤ n(L_max‖x_0 − P_S(x_0)‖² + 2γ(F(x_0) − F*)) / (2γ(j + n)).

Roughly "1/j" behavior.

Assuming 4eΛ(τ + 1)² ≤ √n and setting ρ and γ as above, we have

  E(F(x_j) − F*) ≤ n(L_max‖x_0 − P_S(x_0)‖² + F(x_0) − F*) / (j + n).
Diagonalicity of Hessian
  τ + 1 ≤ n^{1/4} / √(4eΛ)

A smaller Λ = L_res/L_max is beneficial to parallelization.

The ratio L_res/L_max is particularly important: it measures the degree of diagonal dominance in the Hessian ∇²f(x) ("diagonalicity"). By convexity, we have

  1 ≤ L_res/L_max ≤ √n.

The ratio is closer to 1 if the Hessian is nearly diagonally dominant (eigenvectors close to the principal coordinate axes), and closer to √n otherwise.

If A is an m × n Gaussian random matrix and f(x) = (1/2)‖Ax − b‖², the ratio is bounded by 1 + O(√(n/m)) with high probability. This allows τ ≈ O(n^{1/4}).
Constrained: 4-socket, 40-core Intel Xeon
  min_{x≥0} (x − z)^T (A^T A + 0.5 I)(x − z),

where A ∈ R^{m×n} is a Gaussian random matrix (m = 6000, n = 20000, columns normalized to 1) and z is a Gaussian random vector. L_res/L_max ≈ 2.2. Choose γ = 1.

Figure: Synthetic constrained QP (n = 20000, p = 10): residual versus epochs for 1-40 threads (left); speedup versus number of threads for AsySCD-DW, against the ideal and global-locking baselines (right).
Unconstrained: 4-socket, 40-core Intel Xeon
  min_x ‖Ax − b‖² + 0.5‖x‖²,

where A ∈ R^{m×n} is a Gaussian random matrix (m = 6000, n = 20000, data size ≈ 3GB, columns normalized to 1). L_res/L_max ≈ 2.2. Choose γ = 1. It takes 3-4 seconds to achieve accuracy 10⁻⁵ on 40 cores.

Figure: Synthetic unconstrained QP (n = 20000, p = 10): residual versus epochs for 1-40 threads (left); speedup versus number of threads for AsySCD-DW, against the ideal and global-locking baselines (right).
Experiments: 1-socket, 10-core Intel Xeon
  min_x (1/2)‖Ax − b‖² + λ‖x‖_1,

where A ∈ R^{m×n} is a Gaussian random matrix (m = 6000, n = 10000, data size ≈ 750MB), b = A ∗ sprandn(n, 1, 10) + 0.01 ∗ randn(n, 1), and λ = 0.2√(m log(n)). L_res/L_max ≈ 2.2. Choose γ = 1.

Figure: LASSO (m = 6000, n = 10000, sparsity = 10, σ = 0.01): objective versus epochs for 1-10 threads (left); speedup versus number of threads for AsySPCD, against the ideal (right).
Experiments: 1-socket, 10-core Intel Xeon
  min_x (1/2)‖Ax − b‖² + λ‖x‖_1,

where A ∈ R^{m×n} is a Gaussian random matrix (m = 12000, n = 20000, data size ≈ 3GB), b = A ∗ sprandn(n, 1, 20) + 0.01 ∗ randn(n, 1), and λ = 0.2√(m log(n)). L_res/L_max ≈ 2.2. Choose γ = 1.

Figure: LASSO (m = 12000, n = 20000, sparsity = 20, σ = 0.01): objective versus epochs for 1-10 threads (left); speedup versus number of threads for AsySPCD, against the ideal (right).
AsySCD vs. SynGD
          Time (sec)         Speedup
#cores    SynGD / AsySCD     SynGD / AsySCD
 1        121.  / 27.1       0.22 / 1.00
10        11.4  / 2.57       2.38 / 10.5
20        6.00  / 1.36       4.51 / 19.9
30        4.44  / 1.01       6.10 / 26.8
40        3.91  / 0.88       6.93 / 30.8

Table: Efficiency comparison between SynGD and AsySCD for the QP problem. The running time and speedup are calculated based on the residual 10⁻⁵.
Vertex Cover Problem

The vertex cover problem for an undirected graph with edge set E and vertex set V can be written as a binary linear program:

  min_{y ∈ {0,1}^{|V|}} Σ_{v∈V} y_v  s.t.  y_u + y_v ≥ 1, ∀ (u, v) ∈ E.

By relaxing each binary constraint to the interval [0, 1] and introducing slack variables for the cover inequalities, we obtain a problem of the form

  min_{y_v ∈ [0,1], s_uv ∈ [0,1]} Σ_{v∈V} y_v  s.t.  y_u + y_v − s_uv = 1, ∀ (u, v) ∈ E.

This has the form

  min_{x ∈ [0,1]^n} c^T x s.t. Ax = b  ⇒  min_{x ∈ [0,1]^n} c^T x + (β/2)‖Ax − b‖²

for n = |V| + |E|. The test problem is a regularized quadratic-penalty reformulation of this linear program.
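A small sketch of assembling such a penalized instance from an edge list (our illustration; helper names and the SciPy construction are assumptions, not the authors' code):

```python
import numpy as np
from scipy.sparse import lil_matrix

def vertex_cover_qp(edges, num_vertices):
    # Data for min_{x in [0,1]^n} c^T x + (beta/2)*||Ax - b||^2 with
    # x = (y_1..y_|V|, s_1..s_|E|) and one row y_u + y_v - s_e = 1 per edge.
    n = num_vertices + len(edges)
    A = lil_matrix((len(edges), n))
    for e, (u, v) in enumerate(edges):
        A[e, u] = 1.0
        A[e, v] = 1.0
        A[e, num_vertices + e] = -1.0    # slack variable for this edge
    c = np.concatenate([np.ones(num_vertices), np.zeros(len(edges))])
    return c, A.tocsr(), np.ones(len(edges))

c, A, b = vertex_cover_qp([(0, 1), (1, 2), (2, 0)], 3)   # a triangle graph
```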
Vertex Cover (Amazon): 4-socket, 40-core Intel Xeon
Figure: Vertex cover on the Amazon graph (n = 561050, p = 10): residual versus epochs for 1-40 threads (left); speedup versus number of threads for AsySCD-DW, against the ideal and global-locking baselines (right).
Vertex Cover (DBLP): 4-socket, 40-core Intel Xeon
Figure: Vertex cover on the DBLP graph (n = 520891, p = 10): residual versus epochs for 1-40 threads (left); speedup versus number of threads for AsySCD-DW, against the ideal and global-locking baselines (right).
Running Time
Problem    1 core    40 cores
QP         98.4      3.03
QPc        59.7      1.82
Amazon     17.1      1.25
DBLP       11.5      0.91

Table: Runtimes (s) for the four test problems on 1 and 40 cores.
Overview
1 Overview of All Projects
2 Tensor Completion / Recovery
3 Asynchronous Parallel Optimization
    Asynchronous Parallel Stochastic Proximal Coordinate Descent Algorithm (AsySCD)
    Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)
Asynchronous Parallel Randomized Kaczmarz Algorithm (AsyRK)

Consider the linear system Ax = b, where the equations are feasible and the matrix A ∈ R^{m×n} is not necessarily square or full rank. Write

  A = [a_1, a_2, ..., a_m]^T,  where ‖a_i‖ = 1, i = 1, 2, ..., m.

For an infeasible system, one can minimize ‖Ax − b‖², which is equivalent to solving the feasible linear system A^T A x = A^T b, or the extended version

  Ax = y,  A^T y = A^T b.

Randomized Kaczmarz (RK) Algorithm

Select a row index i ∈ {1, 2, ..., m} randomly with equal probability;
Update

  x_{j+1} = x_j − (a_i^T x_j − b_i) a_i.
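A compact NumPy sketch of serial RK under the unit-row normalization above (a minimal illustration):

```python
import numpy as np

def randomized_kaczmarz(A, b, iters=50000, seed=0):
    # RK for a feasible system Ax = b whose rows have unit length.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.integers(m)
        x -= (A[i] @ x - b[i]) * A[i]    # project onto the hyperplane a_i^T x = b_i
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((300, 100))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # normalize rows: ||a_i|| = 1
b = A @ rng.standard_normal(100)                 # feasible by construction
print(np.linalg.norm(A @ randomized_kaczmarz(A, b) - b))
```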
AsyRK Algorithm
At each iteration j:

Select i from {1, 2, ..., m} with equal probability;
Select t from the support of a_i with equal probability;
Update:

  x_{j+1} = x_j − γ ‖a_i‖_0 (a_i^T x_{k(j)} − b_i)(a_i)_t e_t,

where
  k(j) is some iterate prior to j, but no more than τ cycles old: j − k(j) ≤ τ;
  γ is a constant steplength.

As in Hogwild!, different cores run this process concurrently, updating an x accessible to all.
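One AsyRK update, written out as a function with the stale read passed in explicitly (a serial sketch that only mimics the delay; names are ours):

```python
import numpy as np

def asyrk_step(x, a_i, b_i, x_stale, gamma, rng):
    # Pick t uniformly from supp(a_i); use the stale residual a_i^T x_stale - b_i;
    # update only coordinate t of the shared iterate x.
    support = np.flatnonzero(a_i)
    t = rng.choice(support)
    x[t] -= gamma * len(support) * (a_i @ x_stale - b_i) * a_i[t]
    return x
```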
AsyRK Analysis: Linear Convergence
Theorem
Choose any ρ > 1 and define γ via the following:

  ψ := χ + (2λ_max τ ρ^τ)/m,

  γ ≤ min{ 1/ψ,  m(ρ − 1)/(2λ_max ρ^{τ+1}),  m√(ρ − 1) / √(ρ^τ(mα² + λ_max² τ ρ^τ)) }.

Then a certain "ρ-condition" holds, together with the linear convergence rate

  E(‖x_j − P(x_j)‖²) ≤ (1 − λ_min γ(2 − γψ)/m)^j ‖x_0 − P(x_0)‖²,

where χ := max_{i=1,...,m} ‖a_i‖_0.

A particular choice of ρ leads to simplified results in a reasonable regime.
A Particular Choice
Corollary
Assume

  τ + 1 ≤ m/(2eλ_max)

and set ρ = 1 + 2eλ_max/m. One can show that γ = 1/ψ in this case, so the expected convergence is

  E(‖x_j − P(x_j)‖²) ≤ (1 − λ_min/(m(χ + 1)))^j ‖x_0 − P(x_0)‖².

The algorithm converges to precision ε with probability at least 1 − η in

  K ≥ (m(χ + 1)/λ_min) |log(‖x_0 − P(x_0)‖²/(ηε))|

iterations.

In the regime 2eλ_max(τ + 1) ≤ m considered here, the delay τ does not really interfere with the convergence rate.
Discussion
For random matrices A with unit rows, we have λ_max ≈ 1 + O(m/n) with high probability.

The conditions on τ are less strict than for the AsySCD algorithms: for random matrices A, with m and n of the same order, we can have τ = O(m) = O(n).

(Recall τ = O(n^{1/4}) for AsySCD in the constrained case and τ = O(n^{1/2}) for AsySCD in the unconstrained case.)
Comparison
algorithms             RK                   AsySCD                                     AsyRK
# oper. per iter.      O(δn)                min{O(δ²mn), O(n)}                         O(δn)
rate (iteration)       1 − λ_min/m          1 − λ_min/(2nL_max)                        1 − λ_min/(m(χ+1))
# processors           1                    O(√n L_max/L_res)                          O(m/λ_max)
rate (running time)    1 − O(λ_min/(δmn))   1 − O(λ_min/(n^{1.5} L_res min{δ²m, 1}))   1 − O(λ_min/(δ²n² λ_max))

Here δ is the sparsity ratio (fraction of nonzeros) of A, so χ is roughly δn. AsyRK favors sparse data matrices A.
Experiment
[Contributed by my wife]
Sparse Gaussian random matrix A ∈ R^{m×n} with m = 100000 and n = 80000; sparsity ratio 0.1%.

Figure: AsyRK (m = 80000, n = 100000, sparsity = 0.001): residual versus epochs for 1-10 threads (left); speedup versus number of threads, against the ideal (right).
Experiment
Sparse Gaussian random matrix A ∈ R^{m×n} with m = 100000 and n = 80000; sparsity ratio 0.3%.

Figure: AsyRK (m = 80000, n = 100000, sparsity = 0.003): residual versus epochs for 1-10 threads (left); speedup versus number of threads, against the ideal (right).
Table: Comparison of running time and epochs between AsySCD and AsyRK on 10 cores. We report the running time and number of epochs required to attain a residual of 10⁻⁵.

synthetic data           size (MB)   running time (sec)    epochs
m      n      δ                      AsySCD    AsyRK       AsySCD   AsyRK
80k    100k   0.0005     43          39.       3.6         199      195
80k    100k   0.001      84          170.      7.6         267      284
80k    100k   0.003      244         1279.     18.4        275      232
500k   1000k  0.00005    282         54.       5.8         19       19
500k   1000k  0.0001     550         198.      10.4        24       30
500k   1000k  0.0002     1086        734.      15.0        29       31
The End