
FedPAQ: A Communication-Efficient Federated Learning Method …proceedings.mlr.press/v108/reisizadeh20a/reisizadeh20a.pdf · 2020-07-22 · Reisizadeh, Mokhtari, Hassani, Jadbabaie,



FedPAQ: A Communication-Efficient Federated Learning Method with Periodic Averaging and Quantization

Amirhossein Reisizadeh Aryan Mokhtari Hamed Hassani

UC Santa Barbara UT Austin UPenn

Ali Jadbabaie Ramtin Pedarsani

MIT UC Santa Barbara

Abstract

Federated learning is a distributed framework according to which a model is trained over a set of devices, while keeping data localized. This framework faces several systems-oriented challenges which include (i) communication bottleneck since a large number of devices upload their local updates to a parameter server, and (ii) scalability as the federated network consists of millions of devices. Due to these systems challenges as well as issues related to statistical heterogeneity of data and privacy concerns, designing a provably efficient federated learning method is of significant importance yet it remains challenging. In this paper, we present FedPAQ, a communication-efficient Federated Learning method with Periodic Averaging and Quantization. FedPAQ relies on three key features: (1) periodic averaging where models are updated locally at devices and only periodically averaged at the server; (2) partial device participation where only a fraction of devices participate in each round of the training; and (3) quantized message-passing where the edge nodes quantize their updates before uploading to the parameter server. These features address the communications and scalability challenges in federated learning. We also show that FedPAQ achieves near-optimal theoretical guarantees for strongly convex and non-convex loss functions and empirically demonstrate the communication-computation tradeoff provided by our method.

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

1 Introduction

In many large-scale machine learning applications, data is acquired and processed at the edge nodes of the network such as mobile devices, users’ devices, and IoT sensors. Federated Learning is a novel paradigm that aims to train a statistical model at the “edge” nodes as opposed to the traditional distributed computing systems such as data centers [Konečný et al., 2016, Li et al., 2019a]. The main objective of federated learning is to fit a model to data generated from network devices without continuous transfer of the massive amount of collected data from the edge of the network to back-end servers for processing.

Federated learning has been deployed by major technology companies with the goal of providing privacy-preserving services using users’ data [Bonawitz et al., 2019]. Examples of such applications are learning from wearable devices [Huang et al., 2018], learning sentiment [Smith et al., 2017], and location-based services [Samarakoon et al., 2018]. While federated learning is a promising paradigm for such applications, there are several challenges that remain to be resolved. In this paper, we focus on two significant challenges of federated learning, and propose a novel federated learning algorithm that addresses the following two challenges:

(1) Communication bottleneck. Communication bandwidth is a major bottleneck in federated learning as a large number of devices attempt to communicate their local updates to a central parameter server. Thus, for a communication-efficient federated learning algorithm, it is crucial that such updates are sent in a compressed manner and infrequently.

(2) Scale. A federated network typically consists of thousands to millions of devices that may be active, slow, or completely inactive during the training procedure. Thus, a proposed federated learning algorithm should be able to operate efficiently with partial device participation or random sampling of devices.



The goal of this paper is to develop a provably efficient federated learning algorithm that addresses the above-mentioned systems challenges. More precisely, we consider the task of training a model in a federated learning setup where we aim to find an accurate model over a collection of $n$ distributed nodes. In this setting, each node contains $m$ independent and identically distributed samples from an unknown probability distribution, and a parameter server helps coordination between the nodes. We focus on solving the empirical risk minimization problem for a federated architecture while addressing the challenges mentioned above. In particular, we consider both strongly convex and non-convex settings and provide sharp guarantees on the performance of our proposed algorithm.

Contributions. In this work, we propose FedPAQ, a communication-efficient Federated learning algorithm with Periodic Averaging and Quantization, which addresses federated learning systems’ bottlenecks. In particular, FedPAQ has three key features that enable an efficient federated learning implementation:

(1) FedPAQ allows the nodes (users) of the network to run local training before synchronizing with the parameter server. In particular, each node iteratively updates its local model for a period of iterations using the stochastic gradient descent (SGD) method and then uploads its model to the parameter server, where all the received models are averaged periodically. By tuning the parameter which corresponds to the number of local iterations before communicating to the server, periodic averaging results in slashing the number of communication rounds and hence the total communication cost of the training process.

(2) FedPAQ captures the constraint on the availability of active edge nodes by allowing partial node participation. That is, in each round of the method, only a fraction of the total devices, namely the active ones, contribute to training the model. This procedure not only addresses the scalability challenge, but also leads to a smaller communication load compared to the case that all nodes participate in training the learning model.

(3) In FedPAQ, nodes only send a quantized version of their local information to the server at each round of communication. As the training models are of large sizes, quantization significantly helps reduce the communication overhead on the network.

While these features have been proposed in the literature, to the best of our knowledge, FedPAQ is the first federated learning algorithm that simultaneously incorporates these features and provides near-optimal theoretical guarantees on its statistical accuracy, while being communication-efficient via periodic averaging, partial node participation, and quantization.

In particular, we analyze our proposed FedPAQ method for two general classes of loss functions: strongly convex and non-convex. For the strongly convex setting, we show that after $T$ iterations the expected squared distance between the solution of our method and the optimal solution is of order $\mathcal{O}(1/T)$. We also show that FedPAQ approaches a first-order stationary point for non-convex losses at a rate of $\mathcal{O}(1/\sqrt{T})$. This demonstrates that our method significantly improves the communication-efficiency of federated learning while preserving the optimality and convergence guarantees of the baseline methods. In addition, we would like to highlight that our theoretical analysis is based on a few relaxed and customary assumptions which yield more technical challenges compared to the existing works with stronger assumptions, and hence requires novel analytical techniques. More explanations will be provided in Section 4.

Related Work. The main premise of federated learning has been collective learning using a network of common devices such as phones and tablets. This framework potentially allows for smarter models, lower latency, and less power consumption, all while ensuring privacy. Successfully achieving these goals in practice requires addressing key challenges of federated learning such as communication complexity, systems heterogeneity, privacy, robustness, and heterogeneity of the users. Recently, many federated methods have been considered in the literature which mostly aim at reducing the communication cost. McMahan et al. [2016] proposed the FedAvg algorithm, where the global model is updated by averaging local SGD updates. Guha et al. [2019] proposed one-shot federated learning in which the master node learns the model after a single round of communication.

Optimization methods for federated learning are naturally tied with tools from stochastic and distributed optimization. Minibatch stochastic gradient descent distributed optimization methods have been largely studied in the literature without considering the communication bottleneck. Addressing the communication bottleneck via quantization and compression in distributed learning has recently gained considerable attention for both master-worker [Alistarh et al., 2017, Bernstein et al., 2018, Seide et al., 2014, Smith et al., 2016] and masterless topologies [Koloskova et al., 2019, Reisizadeh et al., 2019a, Wang et al., 2019, Zhang et al., 2018]. Moreover, Wang et al. [2019] reduce the communication delay by decomposing the graph.

Local updates, as another approach to reduce the communication load in distributed learning, have been studied in the literature, where each learning node carries out multiple local updates before sharing with the master or its neighboring nodes. Stich [2018] considered a master-worker topology and provides a theoretical analysis for the convergence of the local-SGD method. Lin et al. [2018] introduced a variant of local-SGD, namely post-local-SGD, which demonstrates empirical improvements over local-SGD. Wang and Joshi [2018] provided a general analysis of such cooperative methods for decentralized settings as well.

Statistical heterogeneity of users’ data points is another major challenge in federated learning. To address this heterogeneity, other methods such as multitask learning and meta learning have been proposed to train multiple local models [Li et al., 2019b, Nichol et al., 2018, Smith et al., 2017]. Many methods have been proposed to address systems heterogeneity, and in particular stragglers, in distributed learning using coding theory, e.g., [Dutta et al., 2016, Lee et al., 2018, Reisizadeh et al., 2019b, Tandon et al., 2016, Yu et al., 2017]. Another important challenge in federated learning is to preserve privacy in learning [Duchi et al., 2014]. Agarwal et al. [2018] and McMahan et al. [2017] proposed privacy-preserving methods for distributed and federated learning using differential privacy techniques. Federated heavy hitters discovery with differential privacy was proposed in [Zhu et al., 2019].

Robustness against adversarial devices is another challenge in federated learning and distributed learning that has been studied in [Chen et al., 2017, Ghosh et al., 2019, Yin et al., 2018]. Finally, several works have considered communication-efficient collaborative learning where there is no master node, and the computing nodes learn a model collaboratively in a decentralized manner [Doan et al., 2018, Koloskova et al., 2019, Lalitha et al., 2019, Reisizadeh et al., 2019a, Zhang et al., 2018]. While such techniques are related to federated learning, the network topology in masterless collaborative learning is fundamentally different.

2 Federated Learning Setup

In this paper, we focus on a federated architecture where a parameter server (or server) aims at finding a model that performs well with respect to the data points that are available at different nodes (users) of the network, while nodes exchange their local information with the server. We further assume that the data points for all nodes in the network are generated from a common probability distribution. In particular, we consider the following stochastic learning problem

$$\min_{x} f(x) := \min_{x} \frac{1}{n}\sum_{i=1}^{n} f_i(x), \tag{1}$$

where the local objective function of each node $i$ is defined as the expected loss over its local sample distribution

$$f_i(x) := \mathbb{E}_{\xi \sim \mathcal{P}_i}\left[\ell(x, \xi)\right]. \tag{2}$$

Here $\ell : \mathbb{R}^p \times \mathbb{R}^u \to \mathbb{R}$ is a stochastic loss function, $x \in \mathbb{R}^p$ is the model vector, and $\xi \in \mathbb{R}^u$ is a random variable with unknown probability distribution $\mathcal{P}_i$. Moreover, $f : \mathbb{R}^p \to \mathbb{R}$ denotes the expected loss function, also called the population risk. In our considered federated setting, each of the $n$ distributed nodes generates a local loss function according to a distribution $\mathcal{P}_i$, resulting in a local stochastic function $f_i(x) := \mathbb{E}_{\xi \sim \mathcal{P}_i}[\ell(x, \xi)]$. A special case of this formulation is when each node $i$ maintains a collection of $m$ samples from distribution $\mathcal{P}_i$, which we denote by $\mathcal{D}_i = \{\xi^i_1, \cdots, \xi^i_m\}$ for $i \in [n]$. This results in the following empirical risk minimization problem over the collection of $nm$ samples in $\mathcal{D} := \mathcal{D}_1 \cup \cdots \cup \mathcal{D}_n$:

$$\min_{x} L(x) := \min_{x} \frac{1}{nm}\sum_{\xi \in \mathcal{D}} \ell(x, \xi). \tag{3}$$

We denote by $x^*$ the optimal model, i.e., the solution to the expected risk minimization problem in (1), and by $f^* := \min_x f(x) = f(x^*)$ the optimal objective function value of the expected risk minimization problem in (1). In this work, we focus on the case that the data over the $n$ nodes is independent and identically distributed (i.i.d.), which implies that the local distributions are common.

As stated above, our goal is to minimize the expected loss $f(x)$. However, due to the fact that we do not have access to the underlying distribution $\mathcal{P}$, there have been prior works that focus on minimizing the empirical risk $L(x)$, which can be viewed as an approximation of the expected loss $f(x)$. The accuracy of this approximation is determined by the number of samples $N = nm$. It has been shown that for convex losses $\ell$, the population risk $f$ is at most $\mathcal{O}(1/\sqrt{nm})$ distant from the empirical risk $L$, uniformly and with high probability [Bottou and Bousquet, 2008]. That is, $\sup_x |f(x) - L(x)| \leq \mathcal{O}(1/\sqrt{nm})$ with high probability. This result implies that if each of the $n$ nodes separately minimizes its local empirical loss function, the expected deviation between the local solution and the solution to the population risk minimization problem is of $\mathcal{O}(1/\sqrt{m})$ (note that each node has access to $m$ data samples). However, if the nodes manage to somehow share or synchronize their solutions, then a more accurate solution can be achieved, that is, a solution with accuracy of order $\mathcal{O}(1/\sqrt{nm})$. Therefore, when all the $nm$ available samples are leveraged, one can obtain a solution $x$ that satisfies $\mathbb{E}[L(x) - L(x^*)] \leq \mathcal{O}(1/\sqrt{nm})$. This also implies $\mathbb{E}[f(x) - \min_x f(x)] \leq \mathcal{O}(1/\sqrt{nm})$.


For the case of a non-convex loss function $\ell$, however, finding the solution to the expected risk minimization problem in (1) is hard. Even further, finding (or testing) a local optimum is NP-hard in many cases [Murty and Kabadi, 1987]. Therefore, for non-convex losses we relax our main goal and instead look for first-order optimal solutions (or stationary points) of (1). That is, we aim to find a model $x$ that satisfies $\|\nabla f(x)\| \leq \epsilon$ for an arbitrarily small approximation error $\epsilon$. Mei et al. [2018] characterized the gap between the gradients of the expected risk and the empirical risk. That is, if the gradient of the loss is sub-Gaussian, then with high probability $\sup_x \|\nabla L(x) - \nabla f(x)\| \leq \mathcal{O}(1/\sqrt{nm})$. This result further implies that having all the nodes contribute to minimizing the empirical risk results in a better approximation of a first-order stationary point of the expected risk $f$. In summary, our goal in the non-convex setting is to find $x$ that satisfies $\|\nabla f(x)\| \leq \mathcal{O}(1/\sqrt{nm})$, which also implies $\|\nabla L(x)\| \leq \mathcal{O}(1/\sqrt{nm})$.

3 Proposed FedPAQ Method

In this section, we present our proposed communication-efficient federated learning method called FedPAQ, which consists of three main modules: (1) periodic averaging, (2) partial node participation, and (3) quantized message-passing.

3.1 Periodic averaging

As explained in Section 2, to leverage all the available data samples on the nodes, any training method should incorporate synchronization of the intermediate models obtained at the local devices. One approach is to let the participating nodes synchronize their models through the parameter server in each iteration of the training. This, however, implies many rounds of communication between the federated nodes and the parameter server, which results in communication contention over the network. Instead, we let the participating nodes conduct a number of local updates and synchronize through the parameter server periodically. To be more specific, once nodes pull an updated model from the server, they update the model locally by running $\tau$ iterations of the SGD method and then send proper information to the server for updating the aggregate model. Indeed, this periodic averaging scheme reduces the rounds of communication between the server and the nodes, and consequently the overall communication cost of training the model. In particular, for the case that we plan to run $T$ iterations of SGD at each node, the nodes need to communicate with the server $K = T/\tau$ rounds, hence reducing the total communication cost by a factor of $1/\tau$.

Choosing a larger value of $\tau$ indeed reduces the rounds of communication for a fixed number of iterations $T$. However, if our goal is to obtain a specific accuracy $\varepsilon$, choosing a very large value for $\tau$ is not necessarily optimal, as by increasing $\tau$ the noise of the system increases and the local models approach the local optimal solutions instead of the global optimal solution. Hence, we might end up running more iterations $T$ to achieve a specific accuracy $\varepsilon$ compared to a case where $\tau$ is small. Indeed, a crucial question that we need to address is finding the optimal choice of $\tau$ for minimizing the overall communication cost of the process.
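To make the communication saving concrete, consider a hypothetical run with $T = 10{,}000$ local SGD iterations per node and period length $\tau = 20$:

```latex
K \;=\; \frac{T}{\tau} \;=\; \frac{10{,}000}{20} \;=\; 500 \ \text{communication rounds},
```

i.e., a $20\times$ reduction in rounds relative to synchronizing after every iteration ($\tau = 1$), at the price of the additional local-model drift discussed above.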

3.2 Partial node participation

In a federated network, there is often a large number of devices such as smartphones communicating through a base station. On one hand, base stations have limited download bandwidth and hence only a few devices are able to simultaneously upload their messages to the base station. Due to this limitation, the messages sent from the devices will be pipelined at the base station, which results in dramatically slow training. On the other hand, having all of the devices participate throughout the whole training process induces a large communication overhead on the network, which is often costly. Moreover, in practice not all the devices contribute in each round of the training. Indeed, there are multiple factors that determine whether a device can participate in the training [McMahan and Ramage, 2017]: a device should be within the reachable range of the base station; a device should be idle, plugged in, and connected to a free wireless network during the training; etc.

Our proposed FedPAQ method captures the restrictions mentioned above. In particular, we assume that among the total of $n$ devices, only $r$ nodes ($r \leq n$) are available in each round of the training. We can also assume that, due to the availability criterion described before, such available devices are randomly and uniformly distributed over the network [Sahu et al., 2018]. In summary, in each period $k = 0, 1, \cdots, K-1$ of the training algorithm, the parameter server sends its current model $x_k$ to all the $r$ nodes in a subset $\mathcal{S}_k$, chosen uniformly at random among the total $n$ nodes, i.e., $\Pr[\mathcal{S}_k] = 1/\binom{n}{r}$.

3.3 Quantized message-passing

Another aspect of the communication bottleneck in federated learning is the limited uplink bandwidth at the devices, which makes the communication from the devices to the parameter server slow and expensive. Hence, it is critical to reduce the size of the uploaded messages from the federated devices [Li et al., 2019a]. Our proposal is to employ quantization operators on the transmitted messages. Depending on the accuracy of the quantizer, the network communication overhead is reduced by exchanging the quantized updates.

In the proposed FedPAQ, each node $i \in \mathcal{S}_k$ obtains the model $x^{(i)}_{k,\tau}$ after running $\tau$ local iterations of an optimization method (possibly SGD) on the most recent model $x_k$ that it has received from the server. Then each node $i$ applies a quantizer operator $Q(\cdot)$ on the difference between the received model and its updated model, i.e., $x^{(i)}_{k,\tau} - x_k$, and uploads the quantized vector $Q(x^{(i)}_{k,\tau} - x_k)$ to the parameter server. Once these quantized vectors are received at the server, it decodes the quantized signals and combines them to come up with a new model $x_{k+1}$.

Next, we describe a widely-used random quantizer.

Example 1 (Low-precision quantizer [Alistarh et al., 2017]). For any variable $x \in \mathbb{R}^p$, the low-precision quantizer $Q^{\mathrm{LP}} : \mathbb{R}^p \to \mathbb{R}^p$ is defined as

$$Q^{\mathrm{LP}}_i(x) = \|x\| \cdot \mathrm{sign}(x_i) \cdot \xi_i(x, s), \quad i \in [p], \tag{4}$$

where $\xi_i(x, s)$ is a random variable taking the value $(l+1)/s$ with probability $\frac{|x_i|}{\|x\|}s - l$, and $l/s$ otherwise. Here, the tuning parameter $s$ corresponds to the number of quantization levels, and $l \in [0, s)$ is an integer such that $|x_i|/\|x\| \in [l/s, (l+1)/s)$.

3.4 Algorithm update

Now we use the building blocks developed in Sections 3.1-3.3 to precisely present FedPAQ. Our proposed method consists of $K$ periods, and during each period, every node performs $\tau$ local updates, which results in a total number of $T = K\tau$ iterations. In each period $k = 0, \cdots, K-1$ of the algorithm, the parameter server picks $r \leq n$ nodes uniformly at random, which we denote by $\mathcal{S}_k$. The parameter server then broadcasts its current model $x_k$ to all the nodes in $\mathcal{S}_k$, and each node $i \in \mathcal{S}_k$ performs $\tau$ local SGD updates using its local dataset. To be more specific, let $x^{(i)}_{k,t}$ denote the model at node $i$ at the $t$-th iteration of the $k$-th period. At each local iteration $t = 0, \cdots, \tau-1$, node $i$ updates its local model according to the following rule:

$$x^{(i)}_{k,t+1} = x^{(i)}_{k,t} - \eta_{k,t}\, \widetilde{\nabla} f_i\big(x^{(i)}_{k,t}\big), \tag{5}$$

where the stochastic gradient $\widetilde{\nabla} f_i$ is computed using a random sample¹ picked from the local dataset $\mathcal{D}_i$. Note that all the nodes begin with a common initialization $x^{(i)}_{k,0} = x_k$. After $\tau$ local updates, each node computes the overall update in that period, that is $x^{(i)}_{k,\tau} - x_k$, and uploads the quantized update $Q(x^{(i)}_{k,\tau} - x_k)$ to the parameter server. The parameter server then

¹The method can be easily made compatible with using a mini-batch during each iteration.

Algorithm 1 FedPAQ
1: for k = 0, 1, · · · , K − 1 do
2:   server picks r nodes S_k uniformly at random
3:   server sends x_k to the nodes in S_k
4:   for node i ∈ S_k do
5:     x^(i)_{k,0} ← x_k
6:     for t = 0, 1, · · · , τ − 1 do
7:       compute the stochastic gradient
8:       ∇̃f_i(x) = ∇ℓ(x, ξ) for a ξ ∼ P_i
9:       set x^(i)_{k,t+1} ← x^(i)_{k,t} − η_{k,t} ∇̃f_i(x^(i)_{k,t})
10:    end for
11:    send Q(x^(i)_{k,τ} − x_k) to the server
12:  end for
13:  server finds x_{k+1} ← x_k + (1/r) Σ_{i∈S_k} Q(x^(i)_{k,τ} − x_k)
14: end for

aggregates the $r$ received quantized local updates and computes the next model according to

$$x_{k+1} = x_k + \frac{1}{r}\sum_{i \in \mathcal{S}_k} Q\big(x^{(i)}_{k,\tau} - x_k\big), \tag{6}$$

and the procedure is repeated for $K$ periods. The proposed method is formally summarized in Algorithm 1.
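The full loop of Algorithm 1 can be illustrated end to end on a toy problem. The sketch below, with assumed quadratic local losses $f_i(x) = \frac{1}{2}\|x - b_i\|^2$ (whose global minimizer is the mean of the $b_i$), a synthetic noise level, and an illustrative decaying stepsize, is our own simplification and not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize(x, s=8):
    """Unbiased low-precision quantizer in the spirit of Example 1."""
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    ratio = np.abs(x) / norm
    l = np.floor(ratio * s)
    xi = (l + (rng.random(x.shape) < ratio * s - l)) / s
    return norm * np.sign(x) * xi

# Toy problem: node i holds f_i(x) = 0.5 * ||x - b_i||^2, so the global
# minimizer x* is the average of the b_i.
n, r, p = 20, 10, 4            # total nodes, active nodes per round, dimension
K, tau, sigma = 200, 10, 0.1   # rounds, local steps per round, gradient noise
b = 3.0 + 0.5 * rng.normal(size=(n, p))
x_star = b.mean(axis=0)

def stoch_grad(i, x):
    # unbiased stochastic gradient of f_i (Assumption 3)
    return (x - b[i]) + sigma * rng.normal(size=p)

x = np.zeros(p)                                  # server model
for k in range(K):
    eta = 1.0 / (k * tau + 10)                   # illustrative decaying stepsize
    S_k = rng.choice(n, size=r, replace=False)   # partial participation
    updates = []
    for i in S_k:
        x_local = x.copy()
        for _ in range(tau):                     # tau local SGD steps, eq. (5)
            x_local -= eta * stoch_grad(i, x_local)
        updates.append(quantize(x_local - x))    # quantized model difference
    x = x + np.mean(updates, axis=0)             # server update, eq. (6)

print(np.linalg.norm(x - x_star))  # distance to x*, far below the initial ~6
```

Only the quantized differences $Q(x^{(i)}_{k,\tau} - x_k)$ cross the network, once per period rather than once per iteration, which is exactly the communication saving the three modules are designed to provide.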

4 Convergence Analysis

In this section, we present our theoretical results on the guarantees of the FedPAQ method. We first consider the strongly convex setting and state the convergence guarantee of FedPAQ for such losses in Theorem 1. Then, in Theorem 2, we present the overall complexity of our method for finding a first-order stationary point of the aggregate objective function $f$ when the loss function $\ell$ is non-convex (all proofs are provided in the supplementary material). Before that, we first state three customary assumptions required for both the convex and non-convex settings.

Assumption 1. The random quantizer $Q(\cdot)$ is unbiased and its variance grows with the square of the $\ell_2$-norm of its argument, i.e.,

$$\mathbb{E}\big[Q(x)\,|\,x\big] = x, \qquad \mathbb{E}\big[\|Q(x) - x\|^2\,|\,x\big] \leq q\|x\|^2, \tag{7}$$

for some positive real constant $q$ and any $x \in \mathbb{R}^p$.

Assumption 2. The loss functions $f_i$ are $L$-smooth with respect to $x$, i.e., for any $x, \hat{x} \in \mathbb{R}^p$, we have $\|\nabla f_i(x) - \nabla f_i(\hat{x})\| \leq L\|x - \hat{x}\|$.

Assumption 3. The stochastic gradients $\widetilde{\nabla} f_i(x)$ are unbiased and variance-bounded, i.e., $\mathbb{E}_\xi[\widetilde{\nabla} f_i(x)] = \nabla f_i(x)$ and $\mathbb{E}_\xi[\|\widetilde{\nabla} f_i(x) - \nabla f_i(x)\|^2] \leq \sigma^2$.

The conditions in Assumption 1 ensure that the output of the quantization is an unbiased estimator of its input, with a variance that is proportional to the norm-squared of the input. This condition is satisfied by most common quantization schemes, including the low-precision quantizer introduced in Example 1. Assumption 2 implies that the gradients of the local functions $\nabla f_i$ and of the aggregate objective function $\nabla f$ are also $L$-Lipschitz continuous. The conditions in Assumption 3 on the bias and variance of the stochastic gradients are also customary. Note that this is a much weaker assumption compared to one that uniformly bounds the expected norm of the stochastic gradient.

Challenges in analyzing the FedPAQ method. Here, we highlight the main theoretical challenges in proving our main results. As outlined in the description of the proposed method, in the $k$-th round of FedPAQ, each participating node $i$ updates its local model for $\tau$ iterations via the SGD method in (5). Let us focus on the case of a constant stepsize for the purpose of this discussion. First consider the naive parallel SGD case, which corresponds to $\tau = 1$. The updated local model after $\tau = 1$ local update is

$$x^{(i)}_{k,\tau} = x^{(i)}_{k,0} - \eta\, \widetilde{\nabla} f_i\big(x^{(i)}_{k,0}\big). \tag{8}$$

Note that $x^{(i)}_{k,0} = x_k$ is the parameter server's model sent to the nodes. Since we assume the stochastic gradients are unbiased estimators of the gradient, it follows that the local update $x^{(i)}_{k,\tau} - x_k$ is an unbiased estimator of $-\eta \nabla f(x_k)$ for every participating node. Hence, the aggregated updates at the server and the updated model $x_{k+1}$ can simply be related to the current model $x_k$ as one step of parallel SGD. However, this is not the case when the period length $\tau$ is larger than 1. For instance, in the case that $\tau = 2$, the local updated model after $\tau = 2$ iterations is

$$x^{(i)}_{k,\tau} = x_k - \eta\, \widetilde{\nabla} f_i(x_k) - \eta\, \widetilde{\nabla} f_i\big(x_k - \eta\, \widetilde{\nabla} f_i(x_k)\big). \tag{9}$$

Clearly, $x^{(i)}_{k,\tau} - x_k$ is not an unbiased estimator of $-\eta \nabla f(x_k)$ or of $-\eta \nabla f(x_k - \eta \nabla f(x_k))$. This demonstrates that the aggregated model at the server cannot be treated as $\tau$ iterations of parallel SGD, since each local update contains a bias. Indeed, this bias gets propagated as $\tau$ gets larger. For our running example $\tau = 2$, the variance of the biased term, i.e., $\mathbb{E}\|\eta\, \widetilde{\nabla} f_i(x_k - \eta\, \widetilde{\nabla} f_i(x_k))\|^2$, is not uniformly bounded either (Assumption 3), which makes the analysis even more challenging compared to works with a bounded gradient assumption (e.g., [Stich, 2018, Yu et al., 2019]).
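This bias is easy to observe numerically. The following sketch, using an assumed one-dimensional loss with gradient $x^3$ and Gaussian gradient noise (our own choice, purely for illustration), compares the expected two-step local update against two steps of exact gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar loss f(x) = x^4 / 4, so grad f(x) = x^3, perturbed by
# additive zero-mean gradient noise as in Assumption 3.
def stoch_grad(x, xi):
    return x**3 + xi

x0, eta, N = 1.0, 0.1, 1_000_000

# Two local SGD steps from x0 (cf. eq. (9)), simulated N times.
xi1, xi2 = rng.standard_normal(N), rng.standard_normal(N)
x1 = x0 - eta * stoch_grad(x0, xi1)
x2 = x1 - eta * stoch_grad(x1, xi2)
empirical = np.mean(x2 - x0)          # E[x_{k,2} - x_k]

# Two steps using exact gradients from the same starting point.
x1d = x0 - eta * x0**3
deterministic = (x1d - eta * x1d**3) - x0

print(empirical, deterministic)       # the two values disagree
```

With exact gradients the two quantities would match; because the gradient is nonlinear, the noise injected at the first step shifts the mean of the second step, producing a gap of order $\eta^3\sigma^2$ here. This is precisely the bias that accumulates as $\tau$ grows.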

4.1 Strongly convex setting

Now we proceed to establish the convergence rate ofthe proposed FedPAQ method for a federated settingwith strongly convex and smooth loss function `. Wefirst formally state the strong convexity assumption.

Assumption 4. The loss functions f_i are µ-strongly convex, i.e., for any x, x̂ ∈ ℝ^p we have ⟨∇f_i(x) − ∇f_i(x̂), x − x̂⟩ ≥ µ‖x − x̂‖².

Theorem 1 (Strongly convex loss). Consider the sequence of iterates x_k at the parameter server generated according to the FedPAQ method outlined in Algorithm 1. Suppose the conditions in Assumptions 1–4 are satisfied. Further, define the constant B₁ as

B₁ = 2L² ( q/n + 4(1+q)(n−r)/(r(n−1)) ),   (10)

where q is the quantization variance parameter defined in (7) and r is the number of active nodes in each round of communication. If we set the stepsize in FedPAQ as η_{k,t} = η_k = 4/(µ(kτ+1)), then for any k ≥ k₀, where k₀ is the smallest integer satisfying

k₀ ≥ 4 max{ L/µ, 4(B₁/µ² + 1), 1/τ, 4n/(µ²τ) },   (11)

the expected error E‖x_k − x*‖² is bounded above by

E‖x_k − x*‖² ≤ ((k₀τ+1)²/(kτ+1)²) ‖x_{k₀} − x*‖² + C₁ τ/(kτ+1) + C₂ (τ−1)²/(kτ+1) + C₃ (τ−1)/(kτ+1)²,   (12)

where the constants in (12) are defined as

C₁ = (16σ²/(µ²n)) ( 1 + 2q + 8(1+q) n(n−r)/(r(n−1)) ),   C₂ = 16eL²σ²/(µ²n),

C₃ = (256eL²σ²/(µ⁴n)) ( n + 2q + 8(1+q) n(n−r)/(r(n−1)) ).   (13)

Remark 1. Under the same conditions as in Theorem 1 and for a total number of iterations T = Kτ ≥ k₀τ, we have the following convergence rate:

E‖x_K − x*‖² ≤ O(τ/T) + O(τ²/T²) + O((τ−1)²/T) + O((τ−1)/T²).   (14)

As expected, the fastest convergence rate is attained when the contributing nodes synchronize with the parameter server in every iteration, i.e., when τ = 1. Theorem 1, however, characterizes how large the period length τ can be picked. In particular, any choice of τ = o(√T) ensures convergence of FedPAQ to the global optimum for strongly convex losses.

Remark 2. By setting τ = 1, q = 0 and r = n, Theorem 1 recovers the convergence rate of vanilla parallel SGD, i.e., O(1/T) for strongly convex losses. Our result is, however, more general, since we remove the uniformly-bounded assumption on the norm of the stochastic gradient. For τ > 1, Theorem 1 does not recover the result in [Stich, 2018] due to our weaker condition in Assumption 3. Nevertheless, the same O(1/T) rate is guaranteed by FedPAQ for constant values of τ.
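To see the trade-off in (14) numerically, one can tabulate its four order-wise terms for a few period lengths; this is a back-of-the-envelope sketch with all constants dropped, not the exact bound:

```python
def rate_terms(tau, T):
    """Order-wise terms of the strongly convex rate (14), constants dropped."""
    return tau / T + tau ** 2 / T ** 2 + (tau - 1) ** 2 / T + (tau - 1) / T ** 2

T = 10_000
for tau in (1, 10, int(T ** 0.4), int(T ** 0.5)):
    print(tau, rate_terms(tau, T))
# tau = 1 leaves only the O(1/T) term; any tau = o(sqrt(T)) still
# vanishes as T grows, while tau proportional to sqrt(T) makes the
# (tau - 1)^2 / T term Theta(1), so the bound no longer goes to zero.
```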


4.2 Non-convex setting

We now present the convergence result of FedPAQ for smooth non-convex loss functions.

Theorem 2 (Non-convex loss). Consider the sequence of iterates x_k at the parameter server generated according to the FedPAQ method outlined in Algorithm 1. Suppose the conditions in Assumptions 1–3 are satisfied. Further, define the constant B₂ as

B₂ := q/n + 4(1+q)(n−r)/(r(n−1)),   (15)

where q is the quantization variance parameter defined in (7) and r is the number of active nodes in each round. If the total number of iterations T and the period length τ satisfy the conditions

T ≥ 2,   τ ≤ ((√(B₂² + 0.8) − B₂)/8) √T,   (16)

and we set the stepsize as η_{k,t} = 1/(L√T), then the following first-order stationarity condition holds:

(1/T) Σ_{k=0}^{K−1} Σ_{t=0}^{τ−1} E‖∇f(x_{k,t})‖² ≤ 2L(f(x₀) − f*)/√T + N₁/√T + N₂ (τ−1)/T,   (17)

where the constants in (17) are defined as

N₁ := ((1+q)σ²/n) ( 1 + n(n−r)/(r(n−1)) ),   N₂ := σ²(n+1)/n.

Remark 3. The result in Theorem 2 implies the following order-wise rate:

(1/T) Σ_{k=0}^{K−1} Σ_{t=0}^{τ−1} E‖∇f(x_{k,t})‖² ≤ O(1/√T) + O((τ−1)/T).

Clearly, the fastest convergence rate is achieved for the smallest possible period length, i.e., τ = 1. This, however, implies that the edge nodes communicate with the parameter server in every iteration, i.e., T rounds of communication, which is costly. On the other hand, the conditions (16) in Theorem 2 allow the period length τ to grow up to O(√T), which results in an overall convergence rate of O(1/√T) in reaching a stationary point. This result shows that with only O(√T) rounds of communication, FedPAQ still ensures a convergence rate of O(1/√T) for non-convex losses.
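Under our reading of condition (16), the largest admissible period length is a B₂-dependent constant times √T; the closed form below is our transcription of (16) into code, not an additional result:

```python
from math import sqrt

def max_period(T, q, n, r):
    """Largest period length tau allowed by condition (16), as read here:
    tau <= (sqrt(B2**2 + 0.8) - B2) / 8 * sqrt(T), with B2 from (15)."""
    B2 = q / n + 4 * (n - r) / (r * (n - 1)) * (1 + q)
    return (sqrt(B2 ** 2 + 0.8) - B2) / 8 * sqrt(T)

# With full participation and no quantization, B2 = 0 and the cap is
# sqrt(0.8)/8 * sqrt(T): tau may grow on the order of sqrt(T).
print(max_period(T=10_000, q=0.0, n=50, r=50))
# Quantization noise and partial participation shrink the admissible tau.
print(max_period(T=10_000, q=1.0, n=50, r=25))
```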

Remark 4. Theorem 2 recovers the convergence rate of vanilla parallel SGD [Yu et al., 2019] for non-convex losses as the special case τ = 1, q = 0 and r = n. Nevertheless, we remove the uniformly-bounded assumption on the norm of the stochastic gradient in our theoretical analysis. We also recover the result in [Wang and Joshi, 2018] when there is no quantization (q = 0) and we have full device participation (r = n).

It is worth mentioning that, for Theorems 1 and 2, one can use a batch of size m for each local SGD update, and the same results hold after replacing σ²/n with σ²/(mn).

5 Numerical Results and Discussions

The proposed FedPAQ method reduces the communication load by employing three modules: periodic averaging, partial node participation, and quantization. This communication reduction, however, comes at the cost of reduced convergence accuracy and hence more training iterations, as characterized in Theorems 1 and 2. In this section, we empirically study this communication-computation trade-off and evaluate FedPAQ against other benchmarks. To evaluate the total cost of a method, we first need to model that cost precisely. We take the total training time as the cost objective, consisting of communication and computation time [Berahas et al., 2018, Reisizadeh et al., 2019c]. Consider T iterations of training with FedPAQ, consisting of K = T/τ rounds of communication. In each round, r workers compute τ iterations of SGD with batchsize B and send a quantized vector of size p to the server.

Communication time. We fix a bandwidth BW and define the communication time in each round as the total number of uploaded bits divided by BW. The total number of bits in each round is r · |Q(p, s)|, where |Q(p, s)| denotes the number of bits required to encode a quantized vector of dimension p according to a specific quantizer with s levels. In our simulations, we use the low-precision quantizer described in Example 1 and assume it takes pF bits to represent an unquantized vector of length p, where F is typically 32 bits.

Computation time. We consider the well-known shifted-exponential model for gradient computation time [Lee et al., 2017]. In particular, we assume that for any node, computing the gradients in a period with τ iterations and batchsize B takes a deterministic shift τ · B · shift plus a random exponential time with mean value τ · B · scale⁻¹, where shift and scale are respectively the shift and scale parameters of the shifted-exponential distribution. The total computation time of each round is then the largest local computation time among the r contributing nodes. We also define a communication-computation ratio

C_comm / C_comp = (pF/BW) / (shift + 1/scale)

as the communication time for a length-p vector over the average computation time for one gradient vector. This ratio captures the relative cost of communication and computation, and since communication is a major bottleneck, we have C_comm/C_comp ≥ 1. In all of our experiments, we use batchsize B = 10 and finely tune the stepsize's coefficient.

Figure 1: Training Loss vs. Training Time: Logistic Regression on MNIST (top); Neural Network on CIFAR-10 (bottom).
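The per-round cost model above can be turned into a quick simulation. Every numeric value here (bandwidth, shift/scale parameters, model dimension, and the simplified encoding length |Q(p, s)| ≈ 32 + p(1 + log₂(s+1)) bits) is a hypothetical choice for illustration, not the paper's exact configuration:

```python
import numpy as np

def training_time(T, tau, r, p, s, B, bw, shift, scale, rng):
    """Simulated wall-clock time of T iterations with period tau:
    K = T // tau rounds; each round pays the upload time of r quantized
    length-p vectors over bandwidth bw, plus the slowest of the r
    workers' shifted-exponential compute times (the straggler)."""
    K = T // tau
    # one 32-bit norm plus sign + level bits per coordinate (a common
    # convention for this style of quantizer; an assumption here)
    bits_per_vector = 32 + p * (1 + np.log2(s + 1))
    comm_per_round = r * bits_per_vector / bw
    # shifted-exponential compute time, per worker and per round
    comp = tau * B * shift + rng.exponential(tau * B / scale, size=(K, r))
    return K * comm_per_round + comp.max(axis=1).sum()

rng = np.random.default_rng(0)
common = dict(T=100, r=25, p=7850, s=1, B=10, bw=1.0e4, shift=1e-3, scale=1.0)
fast = training_time(tau=10, rng=rng, **common)   # 10 communication rounds
slow = training_time(tau=1, rng=rng, **common)    # 100 communication rounds
print(fast, slow)
# Larger tau -> fewer rounds -> deterministically less communication time,
# at the accuracy cost characterized in Theorems 1 and 2.
```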

5.1 Logistic Regression on MNIST

In Figure 1, the top four plots show the training time for a regularized logistic regression problem on the MNIST dataset (digits '0' and '8') over T = 100 iterations. The network has n = 50 nodes, each loaded with 200 samples. We set C_comm/C_comp = 100/1 to capture the communication bottleneck. Among the three parameters, quantization level s, number of active nodes per round r, and period length τ, we fix two and vary the third. The first plot shows the relative training loss for different quantization levels s ∈ {1, 5, 10} and for the case with no quantization, which corresponds to the FedAvg method [McMahan et al., 2016]. The other two parameters are fixed to (τ, r) = (5, 25). Each curve shows the training time versus the achieved training loss for the aggregated model at the server at each round k = 1, …, T/τ. In the second plot, (s, τ) = (1, 5) are fixed. The third plot shows the effect of the period length τ on the communication-computation trade-off. As demonstrated, after T/τ rounds, smaller choices of τ (e.g., τ = 1, 2) result in slower convergence, while larger ones (e.g., τ = 50) run faster but yield less accurate models. Here τ = 10 is the optimal choice. The last plot compares the training time of FedPAQ with two other benchmarks, FedAvg and QSGD. Both FedPAQ and FedAvg use τ = 2, while FedPAQ and QSGD use quantization with s = 1 level. All three methods use r = n = 50 nodes in each round.

5.2 Neural Network training over CIFAR-10

We conduct another set of numerical experiments to evaluate the performance of FedPAQ on non-convex and smooth objectives. Here we train a neural network with four hidden layers and more than 92K parameters over a federated network of n = 50 nodes, using 10K samples from the CIFAR-10 dataset with 10 labels. Since the models are much larger than in the previous setup, we increase the communication-computation ratio to C_comm/C_comp = 1000/1 to better capture the communication bottleneck for large models. The bottom four plots in Figure 1 show the training loss over time for T = 100 iterations. In the first plot, (τ, r) = (2, 25) are fixed and we vary the quantization level. The second plot shows the effect of r while (s, τ) = (1, 2). The communication-computation trade-off in terms of the period length τ is shown in the third plot, where picking τ = 10 attains the fastest convergence. Lastly, we compare FedPAQ with the other benchmarks in the fourth plot. Here, we set (s, r, τ) = (1, 20, 10) in FedPAQ, (r, τ) = (20, 10) in FedAvg, and (s, r, τ) = (1, 50, 1) for QSGD.

6 Conclusion

In this paper, we addressed some of the communication and scalability challenges of federated learning and proposed FedPAQ, a communication-efficient federated learning method with provable performance guarantees. FedPAQ is based on three modules: (1) periodic averaging, in which each edge node performs local iterative updates; (2) partial node participation, which captures the random availability of the edge nodes; and (3) quantization, in which each model is quantized before being uploaded to the server. We provided a rigorous analysis of our proposed method for two general classes of strongly convex and non-convex losses. We further provided numerical results evaluating the performance of FedPAQ and discussing the trade-off between communication and computation.


References

Naman Agarwal, Ananda Theertha Suresh, Felix Xinnan X Yu, Sanjiv Kumar, and Brendan McMahan. cpSGD: Communication-efficient and differentially-private distributed SGD. In Advances in Neural Information Processing Systems, pages 7564–7575, 2018.

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.

Albert Berahas, Raghu Bollapragada, Nitish Shirish Keskar, and Ermin Wei. Balancing communication and computation in distributed optimization. IEEE Transactions on Automatic Control, 2018.

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.

Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konecny, Stefano Mazzocchi, H Brendan McMahan, et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.

Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.

Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.

Thinh T Doan, Siva Theja Maguluri, and Justin Romberg. Accelerating the convergence rates of distributed subgradient methods with adaptive quantization. arXiv preprint arXiv:1810.13245, 2018.

John C Duchi, Michael I Jordan, and Martin J Wainwright. Privacy aware learning. Journal of the ACM (JACM), 61(6):38, 2014.

Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. Short-Dot: Computing large linear transforms distributedly using coded short dot products. In Advances in Neural Information Processing Systems, pages 2092–2100, 2016.

Avishek Ghosh, Justin Hong, Dong Yin, and Kannan Ramchandran. Robust federated learning in a heterogeneous environment. arXiv preprint arXiv:1906.06629, 2019.

Neel Guha, Ameet Talwlkar, and Virginia Smith. One-shot federated learning. arXiv preprint arXiv:1902.11175, 2019.

Li Huang, Yifeng Yin, Zeng Fu, Shifa Zhang, Hao Deng, and Dianbo Liu. LoAdaBoost: Loss-based AdaBoost federated machine learning on medical data. arXiv preprint arXiv:1811.12629, 2018.

Anastasia Koloskova, Sebastian U Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. arXiv preprint arXiv:1902.00340, 2019.

Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

Anusha Lalitha, Osman Cihan Kilinc, Tara Javidi, and Farinaz Koushanfar. Peer-to-peer federated learning on graphs. arXiv preprint arXiv:1901.11173, 2019.

Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 64(3):1514–1529, 2017.


Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. arXiv preprint arXiv:1908.07873, 2019a.

Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of FedAvg on non-iid data. arXiv preprint arXiv:1907.02189, 2019b.

Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217, 2018.

Brendan McMahan and Daniel Ramage. Federated learning: Collaborative machine learning without centralized training data. https://ai.googleblog.com/2017/04/federated-learning-collaborative.html, 2017. Accessed: 2019-09-13.

H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.

H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017.


Song Mei, Yu Bai, Andrea Montanari, et al. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747–2774, 2018.

Katta G Murty and Santosh N Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.

Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. An exact quantized decentralized gradient descent algorithm. IEEE Transactions on Signal Processing, 67(19):4934–4947, 2019a.

Amirhossein Reisizadeh, Saurav Prakash, Ramtin Pedarsani, and Amir Salman Avestimehr. CodedReduce: A fast and robust framework for gradient aggregation in distributed learning. arXiv preprint arXiv:1902.01981, 2019b.

Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. Robust and communication-efficient collaborative learning. arXiv preprint arXiv:1907.10595, 2019c.

Anit Kumar Sahu, Tian Li, Maziar Sanjabi, Manzil Zaheer, Ameet Talwalkar, and Virginia Smith. On the convergence of federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.

Sumudu Samarakoon, Mehdi Bennis, Walid Saady, and Merouane Debbah. Distributed federated learning for ultra-reliable low-latency vehicular communications. arXiv preprint arXiv:1807.08127, 2018.

Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

Virginia Smith, Simone Forte, Chenxin Ma, Martin Takac, Michael I Jordan, and Martin Jaggi. CoCoA: A general framework for communication-efficient distributed optimization. arXiv preprint arXiv:1611.02189, 2016.

Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, pages 4424–4434, 2017.

Sebastian U Stich. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.

Rashish Tandon, Qi Lei, Alexandros G Dimakis, and Nikos Karampatziakis. Gradient coding. arXiv preprint arXiv:1612.03301, 2016.

Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.

Jianyu Wang, Anit Kumar Sahu, Zhouyi Yang, Gauri Joshi, and Soummya Kar. MATCHA: Speeding up decentralized SGD via matching decomposition sampling. arXiv preprint arXiv:1905.09435, 2019.

Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498, 2018.

Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5693–5700, 2019.

Qian Yu, Mohammad Ali Maddah-Ali, and A Salman Avestimehr. Polynomial codes: An optimal design for high-dimensional coded matrix multiplication. arXiv preprint arXiv:1705.10464, 2017.

Xin Zhang, Jia Liu, Zhengyuan Zhu, and Elizabeth S Bentley. Compressed distributed gradient descent: Communication-efficient consensus over networks. arXiv preprint arXiv:1812.04048, 2018.

Wennan Zhu, Peter Kairouz, Haicheng Sun, Brendan McMahan, and Wei Li. Federated heavy hitters discovery with differential privacy. arXiv preprint arXiv:1902.08534, 2019.