On the Linear Convergence of Distributed
Optimization over Directed Graphs
Chenguang Xi, and Usman A. Khan†
Abstract
This paper develops a fast distributed algorithm, termed DEXTRA, to solve the optimization problem in which N agents reach agreement and collaboratively minimize the sum of their local objective functions over the network, where the communication between agents is described by a uniformly strongly-connected directed graph. Existing algorithms, including Gradient-Push (GP) and Directed-Distributed Gradient Descent (D-DGD), solve the same problem restricted to directed graphs with a convergence rate of O(ln k/√k), where k is the number of iterations. Our analysis shows that DEXTRA converges at a linear rate O(ε^k) for some constant ε < 1, under the assumption that the objective functions are strongly convex. The implementation of DEXTRA requires each agent to know only its local out-degree. Simulation examples illustrate our findings.
Index Terms
Distributed optimization; multi-agent networks; directed graphs.
I. INTRODUCTION
Distributed computation and optimization have gained great interest due to their wide applications in, e.g., large-scale machine learning, [1, 2], model predictive control, [3], cognitive networks, [4, 5], source localization, [6, 7], resource scheduling, [8], and message routing, [9]. All of the above applications can be reduced to variations of distributed optimization problems solved by a network of nodes when knowledge of the objective function is scattered over the network. In particular, we consider the problem of minimizing a sum of objectives, Σ_{i=1}^n f_i(x), where f_i : R^p → R is a private objective function at the ith agent of the network.
†C. Xi and U. A. Khan are with the Department of Electrical and Computer Engineering, Tufts University, 161 College Ave,
Medford, MA 02155; chenguang.xi@tufts.edu, khan@ece.tufts.edu. This work has been partially supported by
an NSF Career Award # CCF-1350264.
October 23, 2021 DRAFT
arXiv:1510.02149v1 [math.OC] 7 Oct 2015
There have been many types of algorithms to solve this problem in a distributed manner. The most common alternatives are Distributed Gradient Descent (DGD), [10, 11], Distributed Dual Averaging (DDA), [12], and the distributed implementations of the Alternating Direction Method of Multipliers (ADMM), [13–15]. DGD and DDA can be viewed as gradient-based methods, where at each iteration a gradient-related step is computed, followed by averaging with neighbors in the network. The main advantage of these methods is computational simplicity. However, they suffer from a slow convergence rate because a slowly diminishing stepsize is necessary to achieve exact convergence. The convergence rate of DGD and DDA with a diminishing stepsize is shown to be O(ln k/√k), [10]. Given a constant stepsize, the algorithms accelerate to O(1/k), at the cost of converging inexactly to a neighborhood of the optimal solution, [11]. To overcome these difficulties, some alternatives include Nesterov-based methods, e.g., the Distributed Nesterov Gradient (DNG), [16], whose convergence rate is O(log k/k) with a diminishing stepsize, and Distributed Nesterov gradient with Consensus iterations (DNC), [16], with a faster convergence rate of O(1/k²) at the cost of performing k consensus iterations at the kth iteration. Some methods use second-order information, such as the Network Newton (NN) method, [17], with a convergence rate that is at least linear. Both DNC and NN need multiple communication steps in one iteration.
The distributed implementations of ADMM are based on augmented Lagrangians, where at each iteration the primal and dual variables are updated to minimize a Lagrangian-related function. This type of method converges exactly to the optimal point with a faster rate of O(1/k) using a constant stepsize, and further achieves a linear convergence rate under the assumption that the objective functions are strongly convex. However, the disadvantage is the high computational burden: each agent needs to solve an optimization problem at each iteration. To resolve this issue, the Decentralized Linearized ADMM (DLM), [18], and EXTRA, [19], have been proposed, which can be considered as first-order approximations of decentralized ADMM. DLM and EXTRA have been proven to converge at a linear rate if the local objective functions are strongly convex. All these distributed algorithms, [10–19], assume the multi-agent network to be an undirected graph. In contrast, research on solving the problem restricted to directed graphs is limited.
We now survey the papers considering directed graphs. There are two types of alternatives, both gradient-based algorithms, solving the problem restricted to directed graphs. The first type, the Gradient-Push (GP) method, [20, 21], combines gradient
descent and push-sum consensus. The push-sum algorithm, [22, 23], was first proposed for consensus problems¹ to achieve average-consensus given a column-stochastic matrix. The idea is based on computing the stationary distribution of the Markov chain characterized by the multi-agent network and canceling the imbalance by dividing with the left-eigenvector. GP follows a similar spirit and can be regarded as a perturbed push-sum protocol converging to the optimal solution. Directed-Distributed Gradient Descent (D-DGD), [28, 29], is the other gradient-based alternative to solve the problem. The main idea follows Cai's paper, [30], where a new non-doubly-stochastic matrix is constructed to reach average-consensus. In each iteration, D-DGD simultaneously constructs a row-stochastic matrix and a column-stochastic matrix instead of a single doubly-stochastic matrix. The convergence of the new weight matrix, depending on the row-stochastic and column-stochastic matrices, ensures that agents reach both consensus and optimality. Since both GP and D-DGD are gradient-based algorithms, they suffer from a slow convergence rate of O(ln k/√k) due to the diminishing stepsize. We summarize the existing first-order distributed algorithms for minimizing the sum of the agents' local objective functions, and compare them in terms of speed, in Table I, covering both undirected and directed graphs. In Table I, 'I' indicates that DGD with a constant stepsize is an Inexact method; 'M' indicates that DNC must perform Multiple (k) consensus steps at the kth iteration, which is impractical as k goes to infinity; and 'C' indicates that DADMM has a much higher Computation burden compared to other first-order methods.
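To make the push-sum idea above concrete, the following is a minimal runnable sketch (our own toy example, not from the paper): a column-stochastic matrix built from out-degrees on a small strongly connected digraph, where dividing x^k by the push-sum weight y^k cancels the imbalance caused by the matrix not being doubly stochastic.

```python
import numpy as np

# Toy strongly connected digraph 0 -> 1 -> 2 -> 0 with self-loops.
n = 3
out_neighbors = {0: [0, 1], 1: [1, 2], 2: [2, 0]}  # j -> its out-neighbors
A = np.zeros((n, n))
for j, outs in out_neighbors.items():
    for i in outs:
        A[i, j] = 1.0 / len(outs)      # a_ij = 1/|N_j^out|; each column sums to 1

x = np.array([1.0, 5.0, 12.0])         # private initial values
y = np.ones(n)                         # push-sum weights, y^0 = 1_n
for _ in range(200):
    x, y = A @ x, A @ y                # x^{k+1} = A x^k, y^{k+1} = A y^k
z = x / y                              # the ratio recovers the average

assert np.allclose(z, np.mean([1.0, 5.0, 12.0]))
```

Both x^k and y^k individually converge to multiples of the right eigenvector π of A, so the ratio converges to the average, as described above.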
In this paper, we propose a fast distributed algorithm, termed DEXTRA, to solve the problem over directed graphs by combining the push-sum protocol and EXTRA. The push-sum protocol has proven to be a good technique for dealing with optimization over digraphs, [20, 21]. EXTRA works well for optimization problems over undirected graphs, with a fast convergence rate and a low computational complexity. By applying the push-sum technique to EXTRA, we show that DEXTRA converges exactly to the optimal solution at a linear rate, O(ε^k), even when the underlying network is directed. The fast convergence rate is guaranteed because DEXTRA uses a constant stepsize, in contrast to the diminishing stepsize used in GP or D-DGD. Currently, our formulation is restricted to strongly convex functions. We emphasize that, although DEXTRA is a combination of push-sum and EXTRA, the proof is not trivial by simply following
¹See [24–27] for additional information on average consensus problems.
Algorithms      General Convex    Strongly Convex
DGD (α_k)       O(ln k/√k)        O(ln k/k)
DDA (α_k)       O(ln k/√k)        O(ln k/k)
DGD (α)         O(1/k) (I)
DNG (α_k)       O(ln k/k)
DNC (α)         O(1/k²) (M)
DADMM (α)                         O(ε^k) (C)
DLM (α)                           O(ε^k)
EXTRA (α)       O(1/k)            O(ε^k)
GP (α_k)        O(ln k/√k)        O(ln k/k)
D-DGD (α_k)     O(ln k/√k)        O(ln k/k)
DEXTRA (α)                        O(ε^k)

TABLE I: Convergence rates of first-order distributed optimization algorithms over undirected and directed graphs.
the proof of either algorithm. As the communication links are more restricted than in the undirected case, many properties used in the proof of EXTRA no longer hold for DEXTRA; e.g., the proof of EXTRA requires the weighting matrix to be symmetric, which never happens for directed graphs.
The remainder of the paper is organized as follows. Section II describes, develops, and interprets the DEXTRA algorithm. Section III presents the appropriate assumptions and states the main convergence results. In Section IV, we prove some key lemmas, which are used in the subsequent proofs of the convergence rate of DEXTRA. The main proof of the convergence rate of DEXTRA is provided in Section V. We show numerical results in Section VI, and Section VII contains the concluding remarks.
Notation: We use lowercase bold letters to denote vectors and uppercase italic letters to denote matrices. We denote by [x]_i the ith component of a vector x. For a matrix A, we denote by [A]_i the ith row and by [A]_ij the (i, j)th element. The matrix I_n represents the n × n identity, and 1_n (0_n) is the n-dimensional vector of all 1's (0's). The inner product of two vectors x and y is 〈x, y〉. We define the A-matrix norm, ‖x‖²_A, of any vector x as ‖x‖²_A ≜ 〈x, Ax〉 = 〈x, Aᵀx〉 = 〈x, ((A + Aᵀ)/2)x〉 for A + Aᵀ positive semi-definite. If a symmetric matrix A is positive semi-definite, we write A ⪰ 0, while A ≻ 0 means A is positive definite. The largest and smallest eigenvalues of a matrix A are denoted λ_max(A) and λ_min(A), and the smallest nonzero eigenvalue is denoted λ̃_min(A). For any objective function f(x), we denote by ∇f(x) the gradient of f at x.
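The chain of equalities defining the A-matrix norm holds because a scalar equals its own transpose, so xᵀAx = xᵀAᵀx; a quick numerical check (our own, with arbitrary random data):

```python
import numpy as np

# Verify <x, Ax> = <x, A^T x> = <x, ((A + A^T)/2) x> for a generic A and x.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

v1 = x @ A @ x
v2 = x @ A.T @ x
v3 = x @ ((A + A.T) / 2) @ x
assert np.isclose(v1, v2) and np.isclose(v1, v3)
```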
II. DEXTRA DEVELOPMENT
In this section, we formulate the optimization problem and describe the DEXTRA algorithm. We give an informal but intuitive argument showing that the algorithm pushes agents to achieve consensus and reach the optimal solution. The EXTRA algorithm, [19], is briefly recapitulated in this section. We derive DEXTRA in a form similar to EXTRA, so that our algorithm can be viewed as an improvement of EXTRA suited to the case of directed graphs. This reveals the meaning of DEXTRA: Directed EXTRA. Formal convergence results and proofs are left to Sections III, IV, and V.
Consider a strongly-connected network of n agents communicating over a directed graph, G = (V, E), where V is the set of agents, and E is the collection of ordered pairs, (i, j), i, j ∈ V, such that agent j can send information to agent i. Define N_i^in to be the collection of in-neighbors, i.e., the set of agents that can send information to agent i. Similarly, N_i^out is defined as the out-neighborhood of agent i, i.e., the set of agents that can receive information from agent i. We allow both N_i^in and N_i^out to include the node i itself. Note that in a directed graph, when (i, j) ∈ E, it is not necessary that (j, i) ∈ E. Consequently, N_i^in ≠ N_i^out, in general. We focus on solving a convex optimization problem that is distributed over the above multi-agent network. In particular, the network of agents cooperatively solves the following optimization problem:

P1 :  min f(x) = Σ_{i=1}^n f_i(x),

where each f_i : R^p → R is convex, not necessarily differentiable, representing the local objective function at agent i.
A. EXTRA for undirected graphs
EXTRA is a fast exact first-order algorithm that solves Problem P1 when the communication network is described by an undirected graph. At the kth iteration, each agent i performs the following update:

x_i^{k+1} = x_i^k + Σ_{j∈N_i} w_{ij} x_j^k − Σ_{j∈N_i} w̃_{ij} x_j^{k−1} − α[∇f_i(x_i^k) − ∇f_i(x_i^{k−1})],  (1)

where the weights, w_{ij}, form a weighting matrix, W = {w_{ij}}, that is symmetric and doubly stochastic, and the collection W̃ = {w̃_{ij}} satisfies W̃ = θI_n + (1 − θ)W for some θ ∈ (0, 1/2]. The update in Eq. (1) ensures that x_i^k converges to the optimal solution of the problem for all i with convergence rate O(1/k), and converges linearly when the objective function is further strongly convex. To better represent EXTRA and later compare it with the DEXTRA algorithm, we write Eq. (1) in matrix form. Let x^k, ∇f(x^k) ∈ R^{np} be the collection of all agent states and gradients at time k, i.e., x^k ≜ [x_1^k; · · · ; x_n^k], ∇f(x^k) ≜ [∇f_1(x_1^k); · · · ; ∇f_n(x_n^k)], and let W, W̃ ∈ R^{n×n} be the weighting matrices collecting the weights w_{ij}, w̃_{ij}, respectively. Then Eq. (1) can be written in matrix form as:

x^{k+1} = [(I_n + W) ⊗ I_p]x^k − (W̃ ⊗ I_p)x^{k−1} − α[∇f(x^k) − ∇f(x^{k−1})],  (2)

where the symbol ⊗ is the Kronecker product. We now state DEXTRA and derive it in a similar form as EXTRA.
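To make the update concrete, here is a runnable toy sketch of Eq. (2) (our own example, not from the paper): three agents on an undirected path graph, each with a scalar quadratic f_i(x) = (x − b_i)²/2, so the global minimizer of Σ_i f_i is mean(b). The Metropolis weights, θ = 1/2, and stepsize below are illustrative choices.

```python
import numpy as np

n = 3
b = np.array([1.0, 2.0, 6.0])
grad = lambda x: x - b                    # stacked local gradients, p = 1

# Symmetric, doubly stochastic W (Metropolis weights on the path 0-1-2)
W = np.array([[2/3, 1/3, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 1/3, 2/3]])
W_t = 0.5 * (np.eye(n) + W)               # W~ = theta*I + (1-theta)*W, theta = 1/2
alpha = 0.5                               # constant stepsize

x_prev = np.zeros(n)
x = W @ x_prev - alpha * grad(x_prev)     # first iteration, k = 0
for _ in range(2000):
    # Eq. (2): x^{k+1} = (I + W) x^k - W~ x^{k-1} - alpha*(grad(x^k) - grad(x^{k-1}))
    x, x_prev = (np.eye(n) + W) @ x - W_t @ x_prev \
                - alpha * (grad(x) - grad(x_prev)), x

assert np.allclose(x, b.mean(), atol=1e-6)   # all agents reach the optimum
```

Note the fixed-point structure of Eq. (2): at convergence, (W − W̃)x* = 0 forces consensus, while the accumulated corrections enforce optimality.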
B. DEXTRA Algorithm
To solve Problem P1 over directed graphs, we propose the DEXTRA algorithm, implemented in a distributed manner as follows. Each agent, j ∈ V, maintains two vector variables, x_j^k, z_j^k ∈ R^p, as well as a scalar variable, y_j^k ∈ R, where k is the discrete-time index. At the kth iteration, agent j weights its states, a_{ij}x_j^k, a_{ij}y_j^k, and ã_{ij}x_j^{k−1}, and sends them to each of its out-neighbors, i ∈ N_j^out, where the weights, a_{ij} and ã_{ij}, are such that:

a_{ij} = { > 0, i ∈ N_j^out;  0, otherwise },  with Σ_{i=1}^n a_{ij} = 1, ∀j,

ã_{ij} = { θ + (1 − θ)a_{ij}, i = j;  (1 − θ)a_{ij}, i ≠ j },  ∀j,
where θ ∈ (0, 1/2]. With agent i receiving the information from its in-neighbors, j ∈ N_i^in, it calculates the state, z_i^k, by dividing x_i^k by y_i^k, and updates the variables x_i^{k+1} and y_i^{k+1} as follows:

z_i^k = x_i^k / y_i^k,  (3a)

x_i^{k+1} = x_i^k + Σ_{j∈N_i^in} a_{ij} x_j^k − Σ_{j∈N_i^in} ã_{ij} x_j^{k−1} − α[∇f_i(z_i^k) − ∇f_i(z_i^{k−1})],  (3b)

y_i^{k+1} = Σ_{j∈N_i^in} a_{ij} y_j^k,  (3c)

where ∇f_i(z_i^k) and ∇f_i(z_i^{k−1}) are the gradients of f_i(z) at z = z_i^k and z = z_i^{k−1}, respectively. The method is initiated with an arbitrary vector, x_i^0, and with y_i^0 = 1 for every agent i. The stepsize, α, is a positive number within a certain interval; we explicitly discuss the range of α in Section III. At the first iteration, i.e., k = 0, we use the following iteration instead of Eq. (3):
z_i^0 = x_i^0 / y_i^0,

x_i^1 = Σ_{j∈N_i^in} a_{ij} x_j^0 − α∇f_i(z_i^0),

y_i^1 = Σ_{j∈N_i^in} a_{ij} y_j^0.
We note that the implementation of Eq. (3) requires each agent to know its out-neighbors (so that it can design the weights). In a more restricted setting, e.g., a broadcast application where it may not be possible to know the out-neighbors, we may use a_{ij} = |N_j^out|^{−1}; the implementation then only requires each agent to know its out-degree; see, e.g., [20, 28] for similar assumptions.
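The out-degree weight construction just described can be sketched as follows (our own toy digraph, not from the paper); the checks confirm the properties DEXTRA requires of A and Ã:

```python
import numpy as np

# Toy digraph with self-loops; agent j assigns a_ij = 1/|N_j^out| to each
# of its out-neighbors i, using only its own out-degree.
n = 4
out_neighbors = {0: [0, 1], 1: [1, 2, 3], 2: [2, 0], 3: [3, 0]}
A = np.zeros((n, n))
for j, outs in out_neighbors.items():
    for i in outs:
        A[i, j] = 1.0 / len(outs)

theta = 0.5
A_t = theta * np.eye(n) + (1 - theta) * A     # A~ = theta*I + (1-theta)*A

# Both weighting matrices are column stochastic, as DEXTRA requires.
assert np.allclose(A.sum(axis=0), 1.0)
assert np.allclose(A_t.sum(axis=0), 1.0)
# Diagonal entries satisfy a~_jj = theta + (1-theta) a_jj.
assert np.allclose(np.diag(A_t), theta + (1 - theta) * np.diag(A))
```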
To simplify the proof, we write DEXTRA, Eq. (3), in matrix form. Let A = {a_{ij}} ∈ R^{n×n} and Ã = {ã_{ij}} ∈ R^{n×n} collect the weights a_{ij} and ã_{ij}, respectively; let x^k, z^k, ∇f(x^k) ∈ R^{np} collect all agent states and gradients at time k, i.e., x^k ≜ [x_1^k; · · · ; x_n^k], z^k ≜ [z_1^k; · · · ; z_n^k], ∇f(x^k) ≜ [∇f_1(x_1^k); · · · ; ∇f_n(x_n^k)]; and let y^k ∈ R^n collect the agent states y_i^k, i.e., y^k ≜ [y_1^k; · · · ; y_n^k]. Note that at time k, y^k can be represented in terms of the initial value, y^0:

y^k = Ay^{k−1} = A^k y^0 = A^k · 1_n.  (5)
Define the diagonal matrix, D^k ∈ R^{n×n}, for each k, such that the ith diagonal element of D^k is y_i^k, i.e.,

D^k = diag(A^k · 1_n).  (6)

Then Eq. (3) can be written equivalently in matrix form as follows:

z^k = ([D^k]^{−1} ⊗ I_p)x^k,  (7a)

x^{k+1} = x^k + (A ⊗ I_p)x^k − (Ã ⊗ I_p)x^{k−1} − α[∇f(z^k) − ∇f(z^{k−1})],  (7b)

y^{k+1} = Ay^k,  (7c)

where both weight matrices, A and Ã, are column stochastic and satisfy the relationship Ã = θI_n + (1 − θ)A for some θ ∈ (0, 1/2]; when k = 0, Eq. (7b) becomes x^1 = (A ⊗ I_p)x^0 − α∇f(z^0). From Eq. (7a), we obtain

x^k = (D^k ⊗ I_p)z^k, ∀k.  (8)

Therefore, DEXTRA can be represented by one single equation:

(D^{k+1} ⊗ I_p)z^{k+1} = [(I_n + A)D^k ⊗ I_p]z^k − (ÃD^{k−1} ⊗ I_p)z^{k−1} − α[∇f(z^k) − ∇f(z^{k−1})].  (9)

We refer to our algorithm as DEXTRA, since Eq. (9) has a form similar to the EXTRA algorithm, Eq. (2), and can be used to solve Problem P1 in the case of directed graphs. We state our main result in Section III, showing that as time goes to infinity, the iteration in Eq. (9) pushes z^k to achieve consensus and reach the optimal solution. Our proof in this paper will be based on the form of DEXTRA in Eq. (9).
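The equivalence between the component form, Eq. (7), and the single-equation form, Eq. (9), can be checked numerically; the sketch below (our own toy example with placeholder quadratic gradients, not from the paper) runs Eq. (7) for a few steps and verifies Eq. (9) with D^k = diag(y^k):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = np.array([[0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])          # column stochastic (digraph 0->1->2->0)
theta = 0.5
A_t = theta * np.eye(n) + (1 - theta) * A

b = rng.standard_normal(n)
grad = lambda z: z - b                   # toy gradients, f_i(z) = (z - b_i)^2 / 2
alpha = 0.1

# Run the component form, Eq. (7), storing x^k, y^k, z^k = x^k / y^k.
xs, ys = [rng.standard_normal(n)], [np.ones(n)]
zs = [xs[0] / ys[0]]
xs.append(A @ xs[0] - alpha * grad(zs[0]))   # k = 0 step
ys.append(A @ ys[0])
zs.append(xs[1] / ys[1])
for k in range(1, 6):
    xs.append(xs[k] + A @ xs[k] - A_t @ xs[k-1]
              - alpha * (grad(zs[k]) - grad(zs[k-1])))
    ys.append(A @ ys[k])
    zs.append(xs[k+1] / ys[k+1])

# Eq. (9): D^{k+1} z^{k+1} = (I + A) D^k z^k - A~ D^{k-1} z^{k-1}
#          - alpha*(grad(z^k) - grad(z^{k-1})), with D^k = diag(y^k).
for k in range(1, 6):
    lhs = ys[k+1] * zs[k+1]
    rhs = (np.eye(n) + A) @ (ys[k] * zs[k]) - A_t @ (ys[k-1] * zs[k-1]) \
          - alpha * (grad(zs[k]) - grad(zs[k-1]))
    assert np.allclose(lhs, rhs)
```

The check is just the substitution x^k = D^k z^k of Eq. (8), but it makes explicit how the single-equation form hides the push-sum rescaling inside D^k.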
C. Interpretation of DEXTRA
In this section, we give an intuitive interpretation of DEXTRA's convergence to the optimal point before the formal proof, which appears in Sections IV and V. Note that the analysis given here is not a formal proof of DEXTRA; it only gives a snapshot of how DEXTRA works.

Since A is a column-stochastic matrix, the sequence of iterates, {y^k}, generated by Eq. (7c) converges to the span of A's right eigenvector corresponding to eigenvalue 1, i.e., y^∞ = u · π for some number u, where π is the right eigenvector of A corresponding to eigenvalue 1. We assume that the sequences of iterates, {z^k} and {x^k}, generated by DEXTRA, Eq. (7) or (9), converge to limit points, z^∞ and x^∞, respectively (which might not be true). According to the updating rule in Eq. (7b), the limit point x^∞ satisfies

x^∞ = x^∞ + (A ⊗ I_p)x^∞ − (Ã ⊗ I_p)x^∞ − α[∇f(z^∞) − ∇f(z^∞)],  (10)

which implies that [(A − Ã) ⊗ I_p]x^∞ = 0_{np}, or x^∞ = π ⊗ v for some vector v ∈ R^p. Therefore, it follows from Eq. (7a) that

z^∞ = ([D^∞]^{−1} ⊗ I_p)(π ⊗ v) ∈ {1_n ⊗ v},  (11)

i.e., consensus is reached. The above analysis reveals how DEXTRA overcomes the imbalance of agent states that occurs when the graph is directed: both x^∞ and y^∞ lie in the span of π, and dividing x^∞ by y^∞ cancels the imbalance.
By summing the update in Eq. (7b) over k, we obtain

x^{k+1} = (A ⊗ I_p)x^k − α∇f(z^k) − Σ_{r=0}^{k−1} [(Ã − A) ⊗ I_p]x^r.

Considering that x^∞ = π ⊗ v and the preceding relation, it follows that the limit point z^∞ satisfies

α∇f(z^∞) = −Σ_{r=0}^∞ [(Ã − A) ⊗ I_p]x^r.  (12)

Therefore, we obtain

α(1_n ⊗ I_p)ᵀ∇f(z^∞) = −[1_nᵀ(Ã − A) ⊗ I_p] Σ_{r=0}^∞ x^r = 0_p,

which is the optimality condition of Problem P1. Therefore, given the assumption that the sequences of DEXTRA iterates, {z^k} and {x^k}, have limit points, z^∞ and x^∞, the limit point z^∞ achieves consensus and reaches the optimal solution of Problem P1. In the next section, we state the main result of this paper; the formal proof is left to Sections IV and V.
III. ASSUMPTIONS AND MAIN RESULTS
With appropriate assumptions, our main result states that DEXTRA converges linearly to the optimal solution of Problem P1. In this paper, we assume that the agent graph, G, is strongly-connected; each local function, f_i : R^p → R, is convex and differentiable, ∀i ∈ V; and the optimal solution of Problem P1 and the corresponding optimal value exist. Formally, we denote the optimal solution by φ ∈ R^p and the optimal value by f*, i.e.,

f(φ) = f* = min_{x∈R^p} f(x).  (13)
Besides the above assumptions, we emphasize some other assumptions regarding the objective functions and the weighting matrices, which are formally presented as follows.

Assumption A1 (Functions and Gradients). For every agent i, each function, f_i, is differentiable and satisfies the following assumptions.

(a) The function, f_i, has Lipschitz gradient with constant L_{f_i}, i.e., ‖∇f_i(x) − ∇f_i(y)‖ ≤ L_{f_i}‖x − y‖, ∀x, y ∈ R^p.

(b) The gradient, ∇f_i(x), is bounded, i.e., ‖∇f_i(x)‖ ≤ B_{f_i}, ∀x ∈ R^p.

(c) The function, f_i, is strongly convex with constant S_{f_i}, i.e., S_{f_i}‖x − y‖² ≤ 〈∇f_i(x) − ∇f_i(y), x − y〉, ∀x, y ∈ R^p.

Following Assumption A1, we have, for any x, y ∈ R^{np},

‖∇f(x) − ∇f(y)‖ ≤ L_f ‖x − y‖,  (14a)

‖∇f(x)‖ ≤ B_f,  (14b)

S_f ‖x − y‖² ≤ 〈∇f(x) − ∇f(y), x − y〉,  (14c)

where the constants are B_f = max_i{B_{f_i}}, L_f = max_i{L_{f_i}}, and S_f = min_i{S_{f_i}}. By combining Eqs. (14b) and (14c), we obtain

‖x − y‖ ≤ (1/S_f)‖∇f(x) − ∇f(y)‖ ≤ 2B_f/S_f,

which gives

‖x‖ ≤ 2B_f/S_f,  (15)

by letting y = 0_{np}. Eq. (15) reveals that the sequences updated by DEXTRA are bounded.
Recalling the definition of D^k in Eq. (6), we denote the limit of D^k by D^∞, i.e.,

D^∞ = lim_{k→∞} D^k = diag(A^∞ · 1_n) = n · diag(π),  (16)

where π is the normalized right eigenvector of A corresponding to eigenvalue 1. The next assumption concerns the weighting matrices, A and Ã, and D^∞.

Assumption A2 (Weighting matrices). The weighting matrices, A and Ã, used in DEXTRA, Eq. (7) or (9), satisfy the following.

(a) A is a column-stochastic matrix.

(b) Ã is a column-stochastic matrix and satisfies Ã = θI_n + (1 − θ)A, for some θ ∈ (0, 1/2].

(c) (D^∞)^{−1}Ã + Ãᵀ(D^∞)^{−1} ≻ 0.

To ensure (D^∞)^{−1}Ã + Ãᵀ(D^∞)^{−1} ≻ 0, agent i can assign a large value to its own weight ã_{ii} in the implementation of DEXTRA (so that the weighting matrices A and Ã are closer to diagonally-dominant matrices). Then all eigenvalues of (D^∞)^{−1}Ã + Ãᵀ(D^∞)^{−1} can be positive.
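The limit in Eq. (16) can be checked numerically; the sketch below (our own toy matrix, not from the paper) computes π by eigendecomposition and confirms D^k = diag(A^k · 1_n) → n · diag(π):

```python
import numpy as np

# Column-stochastic A from out-degree weights on a toy strongly connected
# digraph with self-loops (not doubly stochastic, so D_inf != I).
A = np.array([[0.5, 0.0, 0.5, 0.5],
              [0.5, 1/3, 0.0, 0.0],
              [0.0, 1/3, 0.5, 0.0],
              [0.0, 1/3, 0.0, 0.5]])
n = A.shape[0]

vals, vecs = np.linalg.eig(A)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()                        # normalized right eigenvector, sum = 1

Dk = np.diag(np.linalg.matrix_power(A, 100) @ np.ones(n))
assert np.allclose(Dk, n * np.diag(pi))   # Eq. (16): D^k -> n * diag(pi)
```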
Before we present the main result of this paper, we define some auxiliary sequences. Let z* ∈ R^{np} be the collection of the optimal solution, φ, of Problem P1, i.e.,

z* = 1_n ⊗ φ,  (17)

and let q* ∈ R^{np} be some vector satisfying

[(Ã − A) ⊗ I_p]q* + α∇f(z*) = 0_{np}.  (18)

Introduce the auxiliary sequence

q^k = Σ_{r=0}^k x^r,  (19)

and, for each k,

t^k = [ (D^k ⊗ I_p)z^k ; q^k ],   t* = [ (D^∞ ⊗ I_p)z* ; q* ],

G = [ [Ãᵀ(D^∞)^{−1}] ⊗ I_p , 0 ; 0 , [(D^∞)^{−1}(Ã − A)] ⊗ I_p ].  (20)
We are ready to state the main result, Theorem 1, which makes explicit the rate at which the objective function converges to its optimal value.

Theorem 1. Let Assumptions A1 and A2 hold. Denote P = I_n − A, L = Ã − A, N = (D^∞)^{−1}(Ã − A), M = (D^∞)^{−1}Ã, d_max = max_k{‖D^k‖}, d⁻_max = max_k{‖(D^k)^{−1}‖}, and d⁻_∞ = ‖(D^∞)^{−1}‖. Define

ρ₁ = 4(1 − 2θ)²λ_max(PᵀP) + 8(ᾱL_f d⁻_max)²,   ρ₂ = 4λ_max(ÃᵀÃ) + 8(ᾱL_f d⁻_max)²,

where ᾱ is some constant regarded as a ceiling of α. Then, with a proper stepsize α ∈ [α_min, α_max], there exist δ > 0, Γ < ∞, and γ ∈ (0, 1), such that the sequence {t^k} generated by DEXTRA satisfies

‖t^k − t*‖²_G ≥ (1 + δ)‖t^{k+1} − t*‖²_G − Γγ^k.  (21)

The lower bound, α_min, of α is

α_min = [ 1/σ + σλ_max(NNᵀ)ρ₁ / (2λ̃_min(LᵀL)) ] / [ S_f/(2d_max²) − η/2 ],

and the upper bound, α_max, of α is

α_max = [ λ_min((M + Mᵀ)/2) − σλ_max(MMᵀ/2) − σλ_max(NNᵀ)ρ₂ / (2λ̃_min(LᵀL)) ] / [ (L_f²/η)(d⁻_max d⁻_∞)² ],

where η and σ are arbitrary constants chosen to ensure α_min > 0 and α_max > 0, i.e.,

η < S_f/d_max²,   σ < λ_min((M + Mᵀ)/2) / [ λ_max(MMᵀ/2) + λ_max(NNᵀ)ρ₂ / (2λ̃_min(LᵀL)) ].
We will show the proof of Theorem 1 in Section V, with the help of Lemmas 1, 3, and 4 in Section IV. Note that the result of Theorem 1, Eq. (21), does not by itself imply the convergence of DEXTRA to the optimal solution. To prove t^k → t*, we further need to show that the matrix G satisfies ‖a‖²_G ≥ 0, ∀a ∈ R^{2np}. With Assumption A2(c), we already have ‖a‖²_{Ãᵀ(D^∞)^{−1}} ≥ 0, ∀a ∈ R^n. It remains to show that ‖a‖²_{(D^∞)^{−1}(Ã−A)} ≥ 0, ∀a ∈ R^n, which will be presented in Section IV. Assuming this holds, since γ ∈ (0, 1), we can say that ‖t^k − t*‖²_G converges to 0 at the R-linear rate O((1 + δ̃)^{−k}) for some δ̃ ∈ (0, δ). Consequently, ‖z^k − z*‖²_{Ãᵀ(D^∞)^{−1}} converges to 0 at the R-linear rate O((1 + δ̃)^{−k}) for some δ̃ ∈ (0, δ). Again with Assumption A2(c), it follows that z^k → z* at an R-linear rate.
In Theorem 1, we have given the specific values of α_min and α_max. For simplicity, we write α_min = α_min^t / α_min^d and α_max = α_max^t / α_max^d, splitting each bound into its numerator and denominator. To ensure α_min < α_max, the objective functions need to have a strong convexity constant S_f satisfying

S_f ≥ 2d_max² ( α_min^t α_max^d / α_max^t + η/2 ).  (22)

It follows from Eq. (22) that, in order to achieve a linear convergence rate for the DEXTRA algorithm, the objective functions should not only be strongly convex, but their constant S_f should also be above some threshold. It is beyond the scope of this paper to design the weighting matrix, A, which influences the eigenvalues in the representations of α_min and α_max, such that the threshold on S_f is close to zero. In practice, numerical experiments illustrate that S_f can be quite small, i.e., DEXTRA has all agents converge to the optimal solution at an R-linear rate for almost all strongly convex objective functions.
Note that the values of α_min and α_max require knowledge of the network topology, which is not available in a distributed manner. It is an open question how to avoid this global knowledge when choosing the value of α. In numerical experiments, however, the range of admissible α is much wider than the theoretical interval.
IV. BASIC RELATIONS
We will present the complete proof of Theorem 1 in Section V; it relies on three lemmas stated in this section. From now on, we assume that the sequences updated by DEXTRA have only one dimension, i.e., z_i^k, x_i^k ∈ R, ∀i, k. For x_i^k, z_i^k ∈ R^p being p-dimensional vectors, the proof is the same for every dimension by applying the results to each coordinate; thus, assuming p = 1 is without loss of generality. The advantage is that with p = 1 we dispense with the Kronecker products in the equations, which makes the argument easier to follow. Letting p = 1, we rewrite DEXTRA, Eq. (9), as

D^{k+1}z^{k+1} = (I_n + A)D^k z^k − ÃD^{k−1}z^{k−1} − α[∇f(z^k) − ∇f(z^{k−1})].  (23)
We now introduce the three main lemmas of this section. Recall the definitions of q^k, q* in Eqs. (19) and (18), and of D^k, D^∞ in Eqs. (6) and (16). Lemma 1 establishes the relations among D^k z^k, q^k, D^∞z*, and q*.

Lemma 1. Let Assumptions A1 and A2 hold. In DEXTRA, the quadruple sequence {D^k z^k, q^k, D^∞z*, q*} obeys

(I_n + A − 2Ã)(D^{k+1}z^{k+1} − D^∞z*) + Ã(D^{k+1}z^{k+1} − D^k z^k) = −(Ã − A)(q^{k+1} − q*) − α[∇f(z^k) − ∇f(z*)],  (24)

for any k = 0, 1, · · · .
Proof. Summing DEXTRA, Eq. (23), over time, we obtain

D^{k+1}z^{k+1} = AD^k z^k − α∇f(z^k) − Σ_{r=0}^{k−1} (Ã − A)D^r z^r.

Noting that Σ_{r=0}^{k−1}(Ã − A)D^r z^r = (Ã − A)q^{k+1} − (Ã − A)(D^k z^k + D^{k+1}z^{k+1}), and rearranging the terms, it follows that

(I_n + A − 2Ã)D^{k+1}z^{k+1} + Ã(D^{k+1}z^{k+1} − D^k z^k) = −(Ã − A)q^{k+1} − α∇f(z^k).  (25)

Since (I_n + A − 2Ã)π = 0_n, where π is the right eigenvector of A corresponding to eigenvalue 1, we have

(I_n + A − 2Ã)D^∞z* = 0_n.  (26)

By combining Eqs. (25) and (26), and noting that (Ã − A)q* + α∇f(z*) = 0_n by Eq. (18), we obtain the desired result.
Before presenting Lemma 3, we recall an existing result on the structure of column-stochastic matrices, which will be used repeatedly in our proofs. Specifically, we have the following property of the weighting matrices, A and Ã, used in DEXTRA.

Lemma 2. (Nedic et al. [20]) For any column-stochastic matrix A ∈ R^{n×n}, we have:

(a) The limit lim_{k→∞}[A^k] exists, and lim_{k→∞}[A^k]_{ij} = π_i, where π is the right eigenvector of A corresponding to eigenvalue 1.

(b) For all i ∈ {1, · · · , n}, the entries [A^k]_{ij} and π_i satisfy

|[A^k]_{ij} − π_i| < Cλ^k, ∀j,

where we can take C = 4 and λ = 1 − 1/n^n.
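Lemma 2(b) can be checked empirically; the sketch below (our own toy column-stochastic matrix, not from the paper) verifies the geometric bound with C = 4 and λ = 1 − 1/nⁿ over a range of k:

```python
import numpy as np

A = np.array([[0.5, 0.0, 0.5, 0.5],
              [0.5, 1/3, 0.0, 0.0],
              [0.0, 1/3, 0.5, 0.0],
              [0.0, 1/3, 0.0, 0.5]])     # column stochastic, strongly connected
n = A.shape[0]
vals, vecs = np.linalg.eig(A)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()

C, lam = 4.0, 1.0 - 1.0 / n**n           # the (loose) constants of Lemma 2(b)
Ak = np.eye(n)
for k in range(60):
    # every column of A^k is within C*lam^k of pi, entrywise
    assert np.abs(Ak - pi[:, None]).max() < C * lam**k
    Ak = Ak @ A
```

In practice the contraction is governed by the second-largest eigenvalue modulus of A, which is typically far smaller than the conservative λ above.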
The convergence properties of the weighting matrix, A, help to relate ‖D^{k+1}z^{k+1} − D^k z^k‖ to ‖z^{k+1} − z^k‖. Similar relations hold between ‖D^{k+1}z^{k+1} − D^∞z*‖ and ‖z^{k+1} − z*‖.

Lemma 3. Let Assumptions A1 and A2 hold, and denote d_max = max_k{‖D^k‖} and d⁻_max = max_k{‖(D^k)^{−1}‖}. The sequences ‖D^{k+1}z^{k+1} − D^∞z*‖, ‖z^{k+1} − z*‖, ‖D^{k+1}z^{k+1} − D^k z^k‖, and ‖z^{k+1} − z^k‖, generated by DEXTRA, satisfy the following relations:

(a) ‖z^{k+1} − z^k‖ ≤ d⁻_max ‖D^{k+1}z^{k+1} − D^k z^k‖ + (4d⁻_max nCB_f/S_f)λ^k;

(b) ‖z^{k+1} − z*‖ ≤ d⁻_max ‖D^{k+1}z^{k+1} − D^∞z*‖ + (2d⁻_max nCB_f/S_f)λ^k;

(c) ‖D^{k+1}z^{k+1} − D^∞z*‖ ≤ d_max ‖z^{k+1} − z*‖ + (2nCB_f/S_f)λ^k.
Proof. (a)

‖z^{k+1} − z^k‖ = ‖(D^{k+1})^{−1}(D^{k+1})(z^{k+1} − z^k)‖

≤ ‖(D^{k+1})^{−1}‖ ‖D^{k+1}z^{k+1} − D^k z^k + D^k z^k − D^{k+1}z^k‖

≤ d⁻_max ‖D^{k+1}z^{k+1} − D^k z^k‖ + d⁻_max ‖D^k − D^{k+1}‖ ‖z^k‖

≤ d⁻_max ‖D^{k+1}z^{k+1} − D^k z^k‖ + (4d⁻_max nCB_f/S_f)λ^k,

where the last inequality follows by applying Lemma 2 and the boundedness of the sequences updated by DEXTRA, Eq. (15). The result of part (b) is proved similarly. The proof of part (c) follows:

‖D^{k+1}z^{k+1} − D^∞z*‖ = ‖D^{k+1}z^{k+1} − D^{k+1}z* + D^{k+1}z* − D^∞z*‖

≤ d_max ‖z^{k+1} − z*‖ + ‖D^{k+1} − D^∞‖ ‖z*‖

≤ d_max ‖z^{k+1} − z*‖ + (2nCB_f/S_f)λ^k.
Lemmas 1 and 3 will be used in the proof of Theorem 1 in Section V. We now analyze the matrix (D^∞)^{−1}(Ã − A). As stated in Section III, the result of Theorem 1 itself does not show that the sequence {z^k} generated by DEXTRA converges to the optimal solution, z*, i.e., z^k → z*. To prove this, we still need to show that ‖a‖²_{(D^∞)^{−1}(Ã−A)} ≥ 0, ∀a ∈ R^n. This property follows directly from the next lemma.

Lemma 4. (Chung [31]) The Laplacian, L, of a directed graph, G, is defined by

L(G) = I_n − (S^{1/2}RS^{−1/2} + S^{−1/2}RᵀS^{1/2})/2,  (27)

where R is the transition probability matrix and S = diag(s), with s the left eigenvector of R corresponding to eigenvalue 1. If G is strongly connected, then the eigenvalues of L(G) satisfy 0 = λ₀ < λ₁ ≤ · · · ≤ λ_{n−1}.

Since (in our setting, with R = Aᵀ and S proportional to diag(π))

(D^∞)^{−1}(I_n − A) + (I_n − A)ᵀ(D^∞)^{−1} = 2(D^∞)^{−1/2} L(G) (D^∞)^{−1/2},

it follows from Lemma 4 that for any a ∈ R^n,

‖a‖²_{2(D^∞)^{−1}(I_n−A)} = ‖a‖²_{(D^∞)^{−1}(I_n−A)+(I_n−A)ᵀ(D^∞)^{−1}} ≥ 0.

Noting that Ã − A = θ(I_n − A) and I_n + A − 2Ã = (1 − 2θ)(I_n − A), with θ ∈ (0, 1/2], we also have the following relations:

‖a‖²_{(D^∞)^{−1}(Ã−A)} ≥ 0, ∀a ∈ R^n,  (28)

‖a‖²_{(D^∞)^{−1}(I_n+A−2Ã)} ≥ 0, ∀a ∈ R^n,  (29)

and

λ_min((D^∞)^{−1}(Ã − A) + (Ã − A)ᵀ(D^∞)^{−1}) = 0,

λ_min((D^∞)^{−1}(I_n + A − 2Ã) + (I_n + A − 2Ã)ᵀ(D^∞)^{−1}) = 0.

Therefore, Eq. (28), together with the result of Theorem 1, Eq. (21), proves the convergence of DEXTRA to the optimal solution of Problem P1. What is left to prove is Theorem 1, which is presented in the next section.
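The semi-definiteness claims of Eq. (28) and the zero smallest eigenvalue can be verified numerically; the sketch below (our own toy matrix, not from the paper) checks that the symmetric part of (D^∞)^{−1}(Ã − A) is positive semi-definite with smallest eigenvalue 0:

```python
import numpy as np

A = np.array([[0.5, 0.0, 0.5, 0.5],
              [0.5, 1/3, 0.0, 0.0],
              [0.0, 1/3, 0.5, 0.0],
              [0.0, 1/3, 0.0, 0.5]])     # column stochastic, strongly connected
n = A.shape[0]
theta = 0.5
A_t = theta * np.eye(n) + (1 - theta) * A

vals, vecs = np.linalg.eig(A)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()
D_inf_inv = np.diag(1.0 / (n * pi))      # (D_inf)^{-1}, with D_inf = n*diag(pi)

S = D_inf_inv @ (A_t - A)                # N = (D_inf)^{-1} (A~ - A)
eigs = np.linalg.eigvalsh(S + S.T)       # eigenvalues of the symmetric part
assert eigs.min() > -1e-10               # positive semi-definite, Eq. (28)
assert np.isclose(eigs.min(), 0.0, atol=1e-8)   # smallest eigenvalue is 0
```

The zero eigenvalue corresponds to the eigenvector π, since (Ã − A)π = θ(I − A)π = 0.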
V. PROOF OF THEOREM 1
With all the pieces in place, we are finally ready to prove Theorem 1. By the strong convexity assumption, A1(c), we have

2αS_f ‖z^{k+1} − z*‖² ≤ 2α〈D^∞z^{k+1} − D^∞z*, (D^∞)^{−1}[∇f(z^{k+1}) − ∇f(z*)]〉

= 2α〈D^∞z^{k+1} − D^{k+1}z^{k+1}, (D^∞)^{−1}[∇f(z^{k+1}) − ∇f(z*)]〉

+ 2α〈D^{k+1}z^{k+1} − D^∞z*, (D^∞)^{−1}[∇f(z^{k+1}) − ∇f(z^k)]〉

+ 2〈D^{k+1}z^{k+1} − D^∞z*, (D^∞)^{−1}α[∇f(z^k) − ∇f(z*)]〉  (30)

=: s1 + s2 + s3,
where s1, s2, and s3 denote the three RHS terms in Eq. (30). We bound each term in sequence. Denote d_max = max_k{‖D^k‖}, d⁻_max = max_k{‖(D^k)^{−1}‖}, and d⁻_∞ = ‖(D^∞)^{−1}‖, which are all bounded due to the convergence of the weighting matrix, A. For the first two terms, we have

s1 ≤ 2α‖D^∞ − D^{k+1}‖ ‖z^{k+1}‖ ‖(D^∞)^{−1}‖ ‖∇f(z^{k+1}) − ∇f(z*)‖ ≤ (8αnCB_f² d⁻_∞/S_f)λ^k;  (31)

s2 ≤ αη‖D^{k+1}z^{k+1} − D^∞z*‖² + (αL_f²/η)‖(D^∞)^{−1}‖² ‖z^{k+1} − z^k‖²

≤ αη‖D^{k+1}z^{k+1} − D^∞z*‖² + (2αL_f²(d⁻_∞ d⁻_max)²/η)‖D^{k+1}z^{k+1} − D^k z^k‖²

+ (32α(nCL_f B_f d⁻_∞ d⁻_max)²/(ηS_f²))λ^k.  (32)
Recalling Lemma 1, Eq. (24), we can write s3 as

s3 = ‖D^{k+1}z^{k+1} − D^∞z*‖²_{−2(D^∞)^{−1}(I_n+A−2Ã)}

+ 2〈D^{k+1}z^{k+1} − D^∞z*, (D^∞)^{−1}Ã(D^k z^k − D^{k+1}z^{k+1})〉

+ 2〈D^{k+1}z^{k+1} − D^∞z*, (D^∞)^{−1}(Ã − A)(q* − q^{k+1})〉

=: s3a + s3b + s3c,  (33)

where

s3b = 2〈D^{k+1}z^{k+1} − D^k z^k, Ãᵀ(D^∞)^{−1}(D^∞z* − D^{k+1}z^{k+1})〉  (34)

and

s3c = 2〈D^{k+1}z^{k+1}, (D^∞)^{−1}(Ã − A)(q* − q^{k+1})〉 = 2〈q^{k+1} − q^k, (D^∞)^{−1}(Ã − A)(q* − q^{k+1})〉.  (35)
The first equality in Eq. (35) holds because (Ã − A)ᵀ(D^∞)^{−1}D^∞z* = 0_n. By substituting Eqs. (34) and (35) into (33), and recalling the definitions of t^k, t*, and G in Eq. (20), we obtain

s3 = ‖D^{k+1}z^{k+1} − D^∞z*‖²_{−2(D^∞)^{−1}(I_n+A−2Ã)} + 2〈t^{k+1} − t^k, G(t* − t^{k+1})〉.  (36)

By applying the basic identity

〈t^{k+1} − t^k, G(t* − t^{k+1})〉 = ‖t^k − t*‖²_G − ‖t^{k+1} − t*‖²_G − ‖t^{k+1} − t^k‖²_G − 〈G(t^{k+1} − t^k), t* − t^{k+1}〉,  (37)

and noting that

−‖t^{k+1} − t^k‖²_G = −‖D^{k+1}z^{k+1} − D^k z^k‖²_{Ãᵀ(D^∞)^{−1}} − ‖q^{k+1} − q^k‖²_{(D^∞)^{−1}(Ã−A)}

≤ −‖D^{k+1}z^{k+1} − D^k z^k‖²_{Ãᵀ(D^∞)^{−1}},  (38)
where the inequality holds due to Eq. (28), and

−〈G(t^{k+1} − t^k), t* − t^{k+1}〉 = −〈Ãᵀ(D^∞)^{−1}(D^{k+1}z^{k+1} − D^k z^k), D^∞z* − D^{k+1}z^{k+1}〉

− 〈D^{k+1}z^{k+1} − D^∞z*, (Ã − A)ᵀ(D^∞)^{−1}(q* − q^{k+1})〉

≤ (σ/2)‖D^{k+1}z^{k+1} − D^k z^k‖²_{(D^∞)^{−1}ÃÃᵀ(D^∞)^{−1}} + (1/σ)‖D^{k+1}z^{k+1} − D^∞z*‖²

+ (σ/2)‖q* − q^{k+1}‖²_{(D^∞)^{−1}(Ã−A)(Ã−A)ᵀ(D^∞)^{−1}},  (39)

we are able to bound s3 by matrix-norm terms, combining Eqs. (36), (37), (38), and (39):

s3 ≤ ‖D^{k+1}z^{k+1} − D^∞z*‖²_{−2(D^∞)^{−1}(I_n+A−2Ã)} + 2‖t^k − t*‖²_G − 2‖t^{k+1} − t*‖²_G

+ ‖D^{k+1}z^{k+1} − D^k z^k‖²_{−2Ãᵀ(D^∞)^{−1}+σ(D^∞)^{−1}ÃÃᵀ(D^∞)^{−1}} + (2/σ)‖D^{k+1}z^{k+1} − D^∞z*‖²

+ σ‖q* − q^{k+1}‖²_{(D^∞)^{−1}(Ã−A)(Ã−A)ᵀ(D^∞)^{−1}}.  (40)
It follows from Lemma 3(c) that

‖D^{k+1}z^{k+1} − D^∞z*‖² ≤ 2d_max² ‖z^{k+1} − z*‖² + (8(nCB_f)²/S_f²)λ^k.  (41)

By combining Eqs. (41) and (30), we obtain

(αS_f/d_max²)‖D^{k+1}z^{k+1} − D^∞z*‖² ≤ s1 + s2 + s3 + (8α(nCB_f)²/(S_f d_max²))λ^k.  (42)
Finally, by plugging the bounds on $s_1$, Eq. (31), $s_2$, Eq. (32), and $s_3$, Eq. (40), into Eq. (42), and rearranging terms, it follows that
$$
\begin{aligned}
\left\|t^k-t^*\right\|^2_G - \left\|t^{k+1}-t^*\right\|^2_G \ge{}& \left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{\frac{\alpha}{2}\left(\frac{S_f}{d_{\max}^2}-\eta\right)I_n-\frac{1}{\sigma}I_n+(D^\infty)^{-1}(I_n+A-2\tilde A)} \\
&+ \left\|D^{k+1}z^{k+1}-D^k z^k\right\|^2_{M^\top-\frac{\sigma}{2}MM^\top-\frac{\alpha L_f^2(d_\infty^- d_{\max}^-)^2}{\eta}I_n} \\
&- \left[\frac{4\alpha(nCB_f)^2}{S_f d_{\max}^2}+\frac{8\alpha n c B_f^2 d_\infty^-}{S_f}+\frac{16\alpha(n c L_f B_f d_\infty^- d_{\max}^-)^2}{\eta S_f^2}\right]\lambda^k \\
&- \frac{\sigma}{2}\left\|q^*-q^{k+1}\right\|^2_{NN^\top}, \qquad (43)
\end{aligned}
$$
where we denote $M = (D^\infty)^{-1}\tilde A$ and $N = (D^\infty)^{-1}(\tilde A-A)$ for simplicity of representation.
In order to establish the result of Theorem 1, Eq. (21), i.e.,
$$
\left\|t^k-t^*\right\|^2_G \ge (1+\delta)\left\|t^{k+1}-t^*\right\|^2_G - \Gamma\gamma^k,
$$
it remains to show that the RHS of Eq. (43) is no less than $\delta\left\|t^{k+1}-t^*\right\|^2_G - \Gamma\gamma^k$. This is equivalent to showing that
$$
\begin{aligned}
&\left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{\frac{\alpha}{2}\left(\frac{S_f}{d_{\max}^2}-\eta\right)I_n-\frac{1}{\sigma}I_n+(D^\infty)^{-1}(I_n+A-2\tilde A)-\delta M^\top} \\
&+ \left\|D^{k+1}z^{k+1}-D^k z^k\right\|^2_{M^\top-\frac{\sigma}{2}MM^\top-\frac{\alpha L_f^2(d_\infty^- d_{\max}^-)^2}{\eta}I_n} \\
&- \left[\frac{4\alpha(nCB_f)^2}{S_f d_{\max}^2}+\frac{8\alpha n c B_f^2 d_\infty^-}{S_f}+\frac{16\alpha(n c L_f B_f d_\infty^- d_{\max}^-)^2}{\eta S_f^2}\right]\lambda^k \\
&\ge \left\|q^*-q^{k+1}\right\|^2_{\frac{\sigma}{2}NN^\top+\delta N} - \Gamma\lambda^k. \qquad (44)
\end{aligned}
$$
From Lemma 1, we have
$$
\begin{aligned}
\left\|q^*-q^{k+1}\right\|^2_{(\tilde A-A)^\top(\tilde A-A)} ={}& \left\|\left(\tilde A-A\right)\left(q^*-q^{k+1}\right)\right\|^2 \\
={}& \Big\|(I_n+A-2\tilde A)\left(D^{k+1}z^{k+1}-D^\infty z^*\right) + \alpha\left[\nabla f(z^{k+1})-\nabla f(z^*)\right] \\
&+ \tilde A\left(D^{k+1}z^{k+1}-D^k z^k\right) + \alpha\left[\nabla f(z^k)-\nabla f(z^{k+1})\right]\Big\|^2 \\
\le{}& 4\left(\left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{(1-2\theta)^2 P^\top P} + \alpha^2 L_f^2\left\|z^{k+1}-z^*\right\|^2\right) \\
&+ 4\left(\left\|D^{k+1}z^{k+1}-D^k z^k\right\|^2_{\tilde A^\top\tilde A} + \alpha^2 L_f^2\left\|z^{k+1}-z^k\right\|^2\right) \\
\le{}& \left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{4(1-2\theta)^2 P^\top P+8(\alpha L_f d_{\max}^-)^2 I_n} \\
&+ \left\|D^{k+1}z^{k+1}-D^k z^k\right\|^2_{4\tilde A^\top\tilde A+8(\alpha L_f d_{\max}^-)^2 I_n} + 160\left(\frac{\alpha L_f d_{\max}^- n C B_f}{S_f}\right)^2\lambda^k, \qquad (45)
\end{aligned}
$$
where we denote $P = I_n - A$ for simplicity of representation. Considering the analysis in Eqs. (28) and (29), we get $\lambda\left(\frac{N+N^\top}{2}\right) \ge 0$. It is also true that $\lambda\left(NN^\top\right) \ge 0$ and $\lambda\left((\tilde A-A)^\top(\tilde A-A)\right) \ge 0$. Therefore, we have the relation
$$
\frac{1}{\frac{\sigma}{2}\lambda_{\max}\left(NN^\top\right)+\delta\lambda_{\max}\left(\frac{N+N^\top}{2}\right)}\left\|q^*-q^{k+1}\right\|^2_{\frac{\sigma}{2}NN^\top+\delta N} \le \left\|q^*-q^{k+1}\right\|^2 \le \frac{1}{\lambda_{\min}\left((\tilde A-A)^\top(\tilde A-A)\right)}\left\|q^*-q^{k+1}\right\|^2_{(\tilde A-A)^\top(\tilde A-A)}, \qquad (46)
$$
where $\lambda_{\min}(\cdot)$ denotes the smallest nonzero eigenvalue. We denote $L = \tilde A - A$. By considering Eqs. (44), (45), and (46), and letting $\Gamma$ in Eq. (44) be
$$
\Gamma = \left[\frac{4\alpha(nCB_f)^2}{S_f d_{\max}^2}+\frac{8\alpha n c B_f^2 d_\infty^-}{S_f}+\frac{16\alpha(n c L_f B_f d_\infty^- d_{\max}^-)^2}{\eta S_f^2}\right] + \frac{\frac{\sigma}{2}\lambda_{\max}\left(NN^\top\right)+\delta\lambda_{\max}\left(\frac{N+N^\top}{2}\right)}{\lambda_{\min}\left(L^\top L\right)}\cdot 160\left(\frac{\alpha L_f d_{\max}^- n C B_f}{S_f}\right)^2,
$$
it follows that, in order for Eq. (44) to hold, it suffices to have the following relation be valid:
$$
\begin{aligned}
&\left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{\frac{\alpha}{2}\left(\frac{S_f}{d_{\max}^2}-\eta\right)I_n-\frac{1}{\sigma}I_n+(D^\infty)^{-1}(I_n+A-2\tilde A)-\delta M^\top} + \left\|D^{k+1}z^{k+1}-D^k z^k\right\|^2_{M^\top-\frac{\sigma}{2}MM^\top-\frac{\alpha L_f^2(d_\infty^- d_{\max}^-)^2}{\eta}I_n} \\
&\ge \frac{\frac{\sigma}{2}\lambda_{\max}\left(NN^\top\right)+\delta\lambda_{\max}\left(\frac{N+N^\top}{2}\right)}{\lambda_{\min}\left(L^\top L\right)}\cdot\left(\left\|D^{k+1}z^{k+1}-D^\infty z^*\right\|^2_{4(1-2\theta)^2 P^\top P+8(\alpha L_f d_{\max}^-)^2 I_n} + \left\|D^{k+1}z^{k+1}-D^k z^k\right\|^2_{4\tilde A^\top\tilde A+8(\alpha L_f d_{\max}^-)^2 I_n}\right),
\end{aligned}
$$
which can be ensured by letting
$$
\frac{\alpha}{2}\left(\frac{S_f}{d_{\max}^2}-\eta\right)I_n - \frac{1}{\sigma}I_n - \delta\left(\frac{M+M^\top}{2}\right) \succeq \frac{\frac{\sigma}{2}\lambda_{\max}\left(NN^\top\right)+\delta\lambda_{\max}\left(\frac{N+N^\top}{2}\right)}{\lambda_{\min}\left(L^\top L\right)}\cdot\left(4(1-2\theta)^2 P^\top P + 8(\bar\alpha L_f d_{\max}^-)^2 I_n\right),
$$
$$
\frac{M+M^\top}{2} - \frac{\sigma}{2}MM^\top - \frac{\alpha L_f^2(d_\infty^- d_{\max}^-)^2}{\eta}I_n \succeq \frac{\frac{\sigma}{2}\lambda_{\max}\left(NN^\top\right)+\delta\lambda_{\max}\left(\frac{N+N^\top}{2}\right)}{\lambda_{\min}\left(L^\top L\right)}\cdot\left(4\tilde A^\top\tilde A + 8(\bar\alpha L_f d_{\max}^-)^2 I_n\right).
$$
Denote $\rho_1 = 4(1-2\theta)^2\lambda_{\max}\left(P^\top P\right) + 8(\bar\alpha L_f d_{\max}^-)^2$ and $\rho_2 = 4\lambda_{\max}\left(\tilde A^\top\tilde A\right) + 8(\bar\alpha L_f d_{\max}^-)^2$, where $\bar\alpha$ is a constant upper bound (ceiling) on $\alpha$. Therefore, we have
$$
\delta \le \min\left\{\frac{\frac{\alpha}{2}\left(\frac{S_f}{d_{\max}^2}-\eta\right)-\frac{1}{\sigma}-\frac{\sigma\lambda_{\max}\left(NN^\top\right)\rho_1}{2\lambda_{\min}\left(L^\top L\right)}}{\lambda_{\max}\left(\frac{M+M^\top}{2}\right)+\frac{\lambda_{\max}\left(\frac{N+N^\top}{2}\right)\rho_1}{\lambda_{\min}\left(L^\top L\right)}},\ \frac{\lambda_{\min}\left(\frac{M+M^\top}{2}\right)-\frac{\sigma}{2}\lambda_{\max}\left(MM^\top\right)-\frac{\alpha L_f^2(d_\infty^- d_{\max}^-)^2}{\eta}}{\frac{\lambda_{\max}\left(\frac{N+N^\top}{2}\right)\rho_2}{\lambda_{\min}\left(L^\top L\right)}} - \frac{\sigma\lambda_{\max}\left(NN^\top\right)}{2\lambda_{\max}\left(\frac{N+N^\top}{2}\right)}\right\}.
$$
To ensure $\delta > 0$, we set the parameters $\eta$ and $\sigma$ such that
$$
\eta < \frac{S_f}{d_{\max}^2}, \qquad \sigma < \frac{\lambda_{\min}\left(M+M^\top\right)}{\lambda_{\max}\left(MM^\top\right)+\frac{\lambda_{\max}\left(NN^\top\right)\rho_2}{\lambda_{\min}\left(L^\top L\right)}},
$$
and take $\alpha \in (\alpha_{\min}, \alpha_{\max})$, where $\alpha_{\min}$ and $\alpha_{\max}$ are the values given in Theorem 1. Therefore, we obtain the desired result.
VI. NUMERICAL EXPERIMENTS
This section provides numerical experiments to study the convergence rate of DEXTRA for a least squares problem over a directed graph. The local objective functions in the least squares problem are strongly convex. We compare the performance of DEXTRA with other algorithms suited to directed graphs: GP, as defined in [20, 21], and D-DGD, as defined in [32]. Our second experiment verifies the existence of $\alpha_{\min}$ and $\alpha_{\max}$, such that a proper stepsize satisfies $\alpha \in [\alpha_{\min}, \alpha_{\max}]$. We also consider various network topologies to see how they affect the stepsize interval. Convergence is studied in terms of the residual
$$
r = \frac{1}{n}\sum_{i=1}^{n}\left\|z_i^k - \phi\right\|,
$$
where $\phi$ is the optimal solution. The distributed least squares problem is described as follows.
Each agent owns a private measurement model, $s_i = R_i x + n_i$, where $s_i \in \mathbb{R}^{m_i}$ and $R_i \in \mathbb{R}^{m_i \times p}$ are measured data, $x \in \mathbb{R}^p$ is the unknown state, and $n_i \in \mathbb{R}^{m_i}$ is random unknown noise. The goal is to estimate $x$. This problem can be formulated as a distributed optimization problem:
$$
\min f(x) = \frac{1}{n}\sum_{i=1}^{n}\left\|R_i x - s_i\right\|.
$$
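As a minimal sketch of this experimental setup, the snippet below generates the local data $(R_i, s_i)$ and evaluates the residual $r$ against a centralized solution. The dimensions are hypothetical, and we use the squared loss $\frac{1}{n}\sum_i\|R_i x - s_i\|^2$ (an assumption, chosen so that $\phi$ has a closed form via the normal equations):

```python
import numpy as np

# Illustrative setup for the distributed least squares experiment.
# Assumed (not from the paper): n = 10 agents, m_i = 3 measurements each,
# p = 5 unknowns, and the squared loss so phi solves the normal equations.
rng = np.random.default_rng(0)
n, p, m_i = 10, 5, 3

R = [rng.standard_normal((m_i, p)) for _ in range(n)]      # local sensing matrices R_i
x_true = rng.standard_normal(p)                            # unknown state x
s = [R[i] @ x_true + 0.01 * rng.standard_normal(m_i)       # s_i = R_i x + n_i
     for i in range(n)]

# Centralized optimum phi of (1/n) * sum_i ||R_i x - s_i||^2 (normal equations).
G = sum(Ri.T @ Ri for Ri in R)
b = sum(R[i].T @ s[i] for i in range(n))
phi = np.linalg.solve(G, b)

def residual(z):
    """Residual r = (1/n) * sum_i ||z_i^k - phi|| used to measure convergence."""
    return np.mean([np.linalg.norm(z_i - phi) for z_i in z])

# Each agent starting from zero gives residual ||phi||.
z0 = [np.zeros(p) for _ in range(n)]
print(residual(z0))
```

Any distributed iterates $z_i^k$ can then be scored with `residual` to reproduce the convergence curves reported below.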
We consider the network topologies given by the three digraphs shown in Fig. 1.
[Fig. 1 graphics: three digraphs, Ga, Gb, and Gc, each on the ten nodes 1–10.]
Fig. 1: Three examples of strongly-connected but non-balanced digraphs.
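Since the algorithms compared here require each agent to know only its local out-degree, the weighting matrices over such digraphs are built to be column-stochastic. Below is a sketch of this standard construction on a hypothetical 5-node strongly-connected digraph (not one of Ga, Gb, Gc):

```python
import numpy as np

# Column-stochastic weights from local out-degrees: each sender j splits
# unit weight evenly over its out-neighbors (including a self-loop).
# The edge set is illustrative only.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3), (2, 0)]  # j -> i
n = 5

out_deg = np.zeros(n, dtype=int)
for j, _ in edges:
    out_deg[j] += 1
out_deg += 1  # count the self-loop j -> j

A = np.zeros((n, n))
for j, i in edges:
    A[i, j] = 1.0 / out_deg[j]   # a_ij = 1 / (out-degree of sender j)
for j in range(n):
    A[j, j] = 1.0 / out_deg[j]

print(A.sum(axis=0))  # every column sums to one
```

Row-stochastic weights (used for the DGD baseline below) can be built the same way from in-degrees; a doubly-stochastic matrix, however, is generally unavailable on a digraph.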
Our first experiment compares several algorithms on the network Ga illustrated in Fig. 1. The comparison of DEXTRA, GP, D-DGD, and DGD with a row-stochastic weighting matrix is shown in Fig. 2. In this experiment, we set $\alpha = 0.1$. The convergence rate of DEXTRA is linear, as stated in Section III. GP and D-DGD apply the same diminishing stepsize, $\alpha_k = \alpha/\sqrt{k}$; as a result, the convergence rate of both is sublinear. We also consider the DGD algorithm, but with the weighting matrix being row-stochastic, since it is not possible in general to construct a doubly-stochastic matrix for a directed graph. As expected, DGD with a row-stochastic matrix does not converge to the exact optimal solution, while the other three algorithms are suited to directed graphs.
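To see the gap between the two guarantees, one can compare the model curves $\varepsilon^k$ (linear rate) and $\ln k/\sqrt{k}$ (the GP and D-DGD rate); the constant $\varepsilon = 0.99$ below is illustrative, not fitted to the experiment:

```python
import math

# Model decay curves only: geometric eps^k (linear convergence) versus
# (ln k)/sqrt(k) (sublinear convergence). eps = 0.99 is an arbitrary choice.
eps = 0.99
for k in (10, 100, 1000):
    print(k, eps ** k, math.log(k) / math.sqrt(k))
```

Even for $\varepsilon$ close to one, the geometric curve eventually drops far below the sublinear one, matching the qualitative gap seen in Fig. 2.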
As stated in Theorem 1, DEXTRA converges to the optimal solution when the stepsize $\alpha$ lies in a proper interval. Fig. 3 illustrates this fact. The network topology is chosen to be Ga in Fig. 1. It is shown that when $\alpha = 0.01$ or $\alpha = 1$, DEXTRA does not converge. DEXTRA has a more restricted stepsize range than EXTRA, see [19], where the stepsize can be arbitrarily close to zero, i.e., $\alpha_{\min} = 0$, when a symmetric and doubly-stochastic weighting matrix is applied. The relatively smaller interval is due to the fact that the weighting matrix applied in DEXTRA cannot be symmetric.
[Fig. 2 plot: residual r versus iteration k on a semilog scale, with curves for DGD with row-stochastic matrix, GP, D-DGD, and DEXTRA.]
Fig. 2: Comparison of different distributed optimization algorithms in a least squares problem.
GP, D-DGD, and DEXTRA are proved to work when the network topology is described by
digraphs. Moreover, DEXTRA has a much faster convergence rate compared with GP and D-
DGD.
In the third experiment, we explore the relation between the range of stepsizes and the network topology. Fig. 4 shows the interval of stepsizes for which DEXTRA converges to the optimal solution when the network topology is chosen to be Ga, Gb, or Gc in Fig. 1. The interval of proper stepsizes is related to the network topology, as stated in Theorem 1 in Section III. Comparing Fig. 4(b) and (c) with (a), a faster information diffusion over the network increases the range of the stepsize interval. Given a fixed network topology, e.g., Fig. 4(a) alone, a larger stepsize increases the convergence speed of the algorithm. Comparing Fig. 4(b) with (a), it is interesting to see that $\alpha = 0.3$ takes DEXTRA about 190 iterations to converge with the
[Fig. 3 plot: residual versus iteration k on a semilog scale for DEXTRA with α = 0.01, α = 0.1, and α = 1.]
Fig. 3: There exists an interval of stepsizes for which DEXTRA converges to the optimal solution.
choice of Ga, while the same stepsize takes DEXTRA about 230 iterations to converge when the topology changes to Gb. We conclude that a faster information diffusion over the network speeds up convergence.
VII. CONCLUSIONS
We introduce DEXTRA, a distributed algorithm to solve multi-agent optimization problems over directed graphs. We have shown that DEXTRA succeeds in driving all agents to the same point, which is the optimal solution of the problem, given that the communication graph is strongly-connected and the objective functions are strongly convex. Moreover, the algorithm converges at a linear rate $O(\varepsilon^k)$ for some constant $\varepsilon < 1$; the constant depends on the network topology $\mathcal{G}$, as well as the Lipschitz constant, strong convexity constant, and gradient norm of the objective functions. Numerical experiments are conducted for a least squares problem, where we show that DEXTRA is the fastest among the distributed algorithms suited to directed graphs, namely GP and D-DGD.
Our analysis also gives an explicit interval of stepsizes for which DEXTRA converges to the optimal solution; computing this interval requires knowledge of the network topology. Although numerical experiments show that the practical interval is much larger, it is still an open question
[Fig. 4 plots: residual versus iteration k on a semilog scale for (a) Ga, α ∈ [0.03, 0.3]; (b) Gb, α ∈ [10⁻¹⁰, 0.3]; (c) Gc, α ∈ [10⁻¹⁰, 0.4].]
Fig. 4: There exists an interval of stepsizes for which DEXTRA converges to the optimal solution. The upper and lower bounds of the interval depend on the network topology, the Lipschitz constant, the strong convexity constant, and the gradient norm of the objective functions. (a) Ga; (b) Gb; (c) Gc. Given a fixed topology, the convergence speed increases as α increases.
on how to remove the need for knowledge of the network topology, which is global information in a distributed setting.
REFERENCES
[1] V. Cevher, S. Becker, and M. Schmidt, “Convex optimization for big data: Scalable,
randomized, and parallel algorithms for big data analytics,” IEEE Signal Processing
Magazine, vol. 31, no. 5, pp. 32–43, 2014.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and
statistical learning via the alternating direction method of multipliers,” Foundations and
Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, Jan. 2011.
[3] I. Necoara and J. A. K. Suykens, “Application of a smoothing technique to decomposition
in convex optimization,” IEEE Transactions on Automatic Control, vol. 53, no. 11, pp.
2674–2679, Dec. 2008.
[4] G. Mateos, J. A. Bazerque, and G. B. Giannakis, “Distributed sparse linear regression,”
IEEE Transactions on Signal Processing, vol. 58, no. 10, pp. 5262–5276, Oct. 2010.
[5] J. A. Bazerque and G. B. Giannakis, “Distributed spectrum sensing for cognitive radio
networks by exploiting sparsity,” IEEE Transactions on Signal Processing, vol. 58, no. 3,
pp. 1847–1862, March 2010.
[6] M. Rabbat and R. Nowak, “Distributed optimization in sensor networks,” in 3rd
International Symposium on Information Processing in Sensor Networks, Berkeley, CA,
Apr. 2004, pp. 20–27.
[7] U. A. Khan, S. Kar, and J. M. F. Moura, “DILAND: An algorithm for distributed sensor
localization with noisy distance measurements,” IEEE Transactions on Signal Processing,
vol. 58, no. 3, pp. 1940–1947, Mar. 2010.
[8] C. L and L. Li, “A distributed multiple dimensional qos constrained resource scheduling
optimization policy in computational grid,” Journal of Computer and System Sciences, vol.
72, no. 4, pp. 706 – 726, 2006.
[9] G. Neglia, G. Reina, and S. Alouf, “Distributed gradient optimization for epidemic routing:
A preliminary evaluation,” in 2nd IFIP in IEEE Wireless Days, Paris, Dec. 2009, pp. 1–6.
[10] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,”
IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.
[11] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,”
arXiv preprint arXiv:1310.7063, 2013.
[12] J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual averaging for distributed
optimization: Convergence analysis and network scaling,” IEEE Transactions on Automatic
Control, vol. 57, no. 3, pp. 592–606, Mar. 2012.
[13] J. F. C. Mota, J. M. F. Xavier, P. M. Q. Aguiar, and M. Puschel, “D-ADMM: A
communication-efficient distributed algorithm for separable optimization,” IEEE Trans-
actions on Signal Processing, vol. 61, no. 10, pp. 2718–2723, May 2013.
[14] E. Wei and A. Ozdaglar, “Distributed alternating direction method of multipliers,” in 51st
IEEE Annual Conference on Decision and Control, Dec. 2012, pp. 5445–5450.
[15] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in
decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62,
no. 7, pp. 1750–1761, April 2014.
[16] D. Jakovetic, J. Xavier, and J. M. F. Moura, “Fast distributed gradient methods,” IEEE
Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014.
[17] A. Mokhtari, Q. Ling, and A. Ribeiro, “Network Newton,” http://www.seas.upenn.edu/
∼aryanm/wiki/NN.pdf, 2014.
[18] Q. Ling and A. Ribeiro, “Decentralized linearized alternating direction method of
multipliers,” in IEEE International Conference on Acoustics, Speech and Signal Processing.
IEEE, 2014, pp. 5447–5451.
[19] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized
consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
[20] A. Nedic and A. Olshevsky, “Distributed optimization over time-varying directed graphs,”
IEEE Transactions on Automatic Control, vol. PP, no. 99, pp. 1–1, 2014.
[21] K. I. Tsianos, The role of the Network in Distributed Optimization Algorithms: Convergence
Rates, Scalability, Communication/Computation Tradeoffs and Communication Delays,
Ph.D. thesis, Dept. Elect. Comp. Eng. McGill University, 2013.
[22] D. Kempe, A. Dobra, and J. Gehrke, “Gossip-based computation of aggregate information,”
in 44th Annual IEEE Symposium on Foundations of Computer Science, Oct. 2003, pp. 482–
491.
[23] F. Benezit, V. Blondel, P. Thiran, J. Tsitsiklis, and M. Vetterli, “Weighted gossip: Distributed
averaging using non-doubly stochastic matrices,” in IEEE International Symposium on
Information Theory, Jun. 2010, pp. 1753–1757.
[24] A. Jadbabaie, J. Lim, and A. S. Morse, “Coordination of groups of mobile autonomous
agents using nearest neighbor rules,” IEEE Transactions on Automatic Control, vol. 48,
no. 6, pp. 988–1001, Jun. 2003.
[25] R. Olfati-Saber and R. M. Murray, “Consensus problems in networks of agents with
switching topology and time-delays,” IEEE Transactions on Automatic Control, vol. 49,
no. 9, pp. 1520–1533, Sep. 2004.
[26] R. Olfati-Saber and R. M. Murray, “Consensus protocols for networks of dynamic agents,”
in IEEE American Control Conference, Denver, Colorado, Jun. 2003, vol. 2, pp. 951–956.
[27] L. Xiao, S. Boyd, and S. J. Kim, “Distributed average consensus with least-mean-square
deviation,” Journal of Parallel and Distributed Computing, vol. 67, no. 1, pp. 33 – 46,
2007.
[28] C. Xi, Q. Wu, and U. A. Khan, “Distributed gradient descent over directed graphs,”
http://www.eecs.tufts.edu/∼cxi01/paper/2015D-DGD.pdf, 2015.
[29] C. Xi and U. A. Khan, “Directed-distributed gradient descent,” in 53rd Annual Allerton
Conference on Communication, Control, and Computing, Monticello, IL, USA, Oct. 2015.
[30] K. Cai and H. Ishii, “Average consensus on general strongly connected digraphs,”
Automatica, vol. 48, no. 11, pp. 2750 – 2761, 2012.
[31] F. Chung, “Laplacians and the Cheeger inequality for directed graphs,” Annals of
Combinatorics, vol. 9, no. 1, pp. 1–19, 2005.
[32] C. Xi, Q. Wu, and U. A. Khan, “Distributed gradient descent over directed graphs,”
http://www.eecs.tufts.edu/∼cxi01/paper/2015D-DGD.pdf, 2015.